US-12621344-B2 - Bot detection and mitigation using dynamic web flows built via machine learning
Abstract
An overlay network bot detection service is augmented to include a content generation service that dynamically generates dummy web pages that are served (along with real site content) to a requesting user. This content is built using machine learning models trained on a target website's content, or that otherwise leverage generative AI to create site content that mimics the site's real content. The generated content is preferably built dynamically during an actual interaction session with the requesting user, is designed to “look” and “feel” like actual content of the website, and inclusion of the content acts to trap a requesting user's browser in one or more non-productive (fake) navigation loops within the site. This facilitates the overall bot detection because such content and such loops are not actually part of the real site, and thus the navigation of these unproductive pages is highly indicative of bot activity.
Inventors
- Venkata Sai Kishore Modalavalasa
Assignees
- AKAMAI TECHNOLOGIES, INC.
Dates
- Publication Date: 20260505
- Application Date: 20240530
Claims (20)
- 1. A method of protecting a website, comprising: during an interaction with the website initiated by a requesting client: serving a set of one or more first pages that constitute actual content of the website; selectively serving a set of one or more second pages that differ from the set of one or more first pages in that the one or more second pages do not constitute the actual content of the website but are configured to appear to the requesting client as though the one or more second pages do constitute the actual content of the website, the one or more second pages having been built dynamically during the interaction and based at least in part on one or more machine learning content models associated with the website; and based at least in part on receiving telemetry indicating navigation by the requesting client through the set of one or more second pages, characterizing the requesting client as a bot; and taking a given mitigation action against the requesting client to protect the website.
- 2. The method as described in claim 1 wherein the one or more machine learning content models comprises a set of models, wherein each of the set of models is trained on a category of content.
- 3. The method as described in claim 2, wherein the category of content is one of: text, images, page templates, and site metadata associated with site navigation.
- 4. The method as described in claim 1, wherein at least one second page of the set of one or more second pages has a page design that is generated based at least in part on a measure of visual similarity to at least a portion of a first page of the set of one or more first pages.
- 5. The method as described in claim 1, further including the one or more machine learning content models are trained out-of-band relative to the interaction.
- 6. The method as described in claim 1, wherein at least one machine learning content model of the one or more machine learning content models is built using generative-AI.
- 7. The method as described in claim 1, wherein at least one machine learning content model of the one or more machine learning content models is a neural network.
- 8. The method as described in claim 7, wherein the neural network is a graph neural network (GNN).
- 9. The method as described in claim 1, wherein at least one machine learning content model of the one or more machine learning content models is a model that has been updated out-of-band relative to the interaction.
- 10. The method as described in claim 1, wherein at least one machine learning content model of the one or more machine learning content models is trained on content of the website.
- 11. The method as described in claim 1, wherein the one or more first pages and the one or more second pages are served by an edge server of an overlay network.
- 12. The method as described in claim 11, wherein the given mitigation action is determined by a bot detection service associated with the overlay network.
- 13. The method as described in claim 1, wherein the requesting client is a page scraper.
- 14. The method as described in claim 1, further including the website is configured for use without a captcha.
- 15. The method as described in claim 1, wherein the given mitigation action is one of: blocking an action requested by the requesting client, logging the interaction, issuing a notification, and tar-pitting or sand-boxing the requesting client.
- 16. An apparatus configured to protect a website, comprising: one or more hardware processors; computer memory holding computer program code executed by the one or more hardware processors, the computer program code configured during an interaction with the website initiated by a requesting client to: serve a set of one or more first pages that constitute actual content of the website; selectively serve a set of one or more second pages that differ from the set of one or more first pages in that the one or more second pages do not constitute the actual content of the website but are configured to appear to the requesting client as though the one or more second pages do constitute the actual content of the website, the one or more second pages having been built dynamically during the interaction and based at least in part on one or more machine learning content models associated with the website; based at least in part on receiving telemetry indicating navigation by the requesting client through the set of one or more second pages, characterize the requesting client as a bot; and take a given mitigation action against the requesting client to protect the website.
- 17. The apparatus as described in claim 16, wherein the given mitigation action is one of: blocking an action requested by the requesting client, logging the interaction, issuing a notification, and tar-pitting or sand-boxing the requesting client.
- 18. A computer program product in a non-transitory computer-readable medium, the computer program product comprising computer program code executable by one or more hardware processors to protect a website, the computer program code configured during an interaction with the website initiated by a requesting client to: serve a set of one or more first pages that constitute actual content of the website; selectively serve a set of one or more second pages that differ from the set of one or more first pages in that the one or more second pages do not constitute the actual content of the website but are configured to appear to the requesting client as though the one or more second pages do constitute the actual content of the website, the one or more second pages having been built dynamically during the interaction and based at least in part on one or more machine learning content models associated with the website; based at least in part on receiving telemetry indicating navigation by the requesting client through the set of one or more second pages, characterize the requesting client as a bot; and take a given mitigation action against the requesting client to protect the website.
- 19. The computer program product as described in claim 18, wherein the given mitigation action is one of: blocking an action requested by the requesting client, logging the interaction, issuing a notification, and tar-pitting or sand-boxing the requesting client.
- 20. The computer program product as described in claim 18, wherein the given mitigation action is taken at an edge server of an overlay network.
Description
BACKGROUND OF THE INVENTION
Distributed computer systems are well-known in the prior art. One such distributed computer system is a “content delivery network” (CDN) or “overlay network” that is operated and managed by a service provider. The service provider typically provides the content delivery service on behalf of third parties (customers) who use the service provider's shared infrastructure. A distributed system of this type typically refers to a collection of autonomous computers linked by a network or networks, together with the software, systems, protocols and techniques designed to facilitate various services, such as content delivery, web application acceleration, or other support of outsourced origin site infrastructure. A CDN service provider typically provides service delivery through digital properties (such as a website), which are provisioned in a customer portal and then deployed to the network.
Technologies that detect malicious bot transactions on web and mobile applications are also well-known. These technologies typically work by analyzing attributes received from client devices, e.g., with data being collected on the client using a JavaScript-based approach to fingerprint clients and collect telemetry to evaluate user behavior and differentiate bots from humans. Typical attributes include client device network, hardware, browser and software properties. Additionally, these techniques may also analyze human interaction events (e.g., mouse, keystroke timings, accelerometer and gyroscope data, touch activity, etc.) to check for human versus bot activity. Bot detection systems that leverage these types of technologies can operate on a standalone basis at a website or in association with an edge network of a CDN. Although bot detection and related mitigation technologies provide significant advantages, bot script writers continuously adapt and improve their attack scripts as they attempt to avoid detection.
For example, consider a bad actor that uses scripts to scrape and extract prices from a website and/or to perform fraudulent transactions on stolen credit cards. A common attack strategy in this scenario involves the actor deploying low volume botnets using residential IP addresses to circumvent edge network rate limiting and to otherwise fly under the site's detection radar. The actor's script visits the target website and is presented with content-based workflows (e.g., sets of product pages) that are readily mined and potentially exploited.
SUMMARY OF THE INVENTION
An overlay network (such as a CDN) that includes a bot detection service is augmented to include a content generation service that dynamically generates dummy web pages or snippets that are served (along with real site content) to a requesting user that may or may not be a bot. This content is built using one or more machine learning models that are trained on a target website's content, or that otherwise leverage generative AI or graph-based neural networks to create site content that mimics the site's real content. The generated content, which is preferably built dynamically (on-the-fly) during an actual interaction session with the requesting user, is designed to “look” and “feel” like actual content of the website, and inclusion of the content (and, in particular, the browser's following of links within those pages or snippets) acts to trap a requesting user's browser in one or more non-productive (indeed, fake) navigation loops within the site. This facilitates the overall bot detection because such content and such loops are not actually part of the real site, and thus the fruitless navigation of these unproductive pages increases the cost to the attacker of the attack, slows down the volume of the attack, and itself is highly indicative of bot activity. The trapping of the bot in this manner facilitates faster or more consistent and accurate bot detection.
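The trapping idea summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `DecoyTracker`, `Session`, `Mitigation`, and `bot_threshold`, the `/products/<hex>` URL pattern, and the threshold values are all hypothetical. A random path stands in for content that the disclosure would generate with machine learning models. The sketch only shows the bookkeeping: decoy links are issued per session, requests for them are counted as telemetry, and a mitigation action is selected once the count suggests bot activity.

```python
import secrets
from dataclasses import dataclass, field
from enum import Enum


class Mitigation(Enum):
    """Illustrative subset of the mitigation actions named in the claims."""
    NONE = "none"
    LOG = "log"
    TARPIT = "tarpit"


@dataclass
class Session:
    decoy_paths: set[str] = field(default_factory=set)  # decoy URLs issued to this client
    decoy_hits: int = 0                                  # how many of them were requested


class DecoyTracker:
    """Tracks navigation through decoy (non-real) pages, per session."""

    def __init__(self, bot_threshold: int = 3) -> None:
        # Assumed policy: this many decoy hits characterizes the client as a bot.
        self.bot_threshold = bot_threshold
        self.sessions: dict[str, Session] = {}

    def issue_decoy_links(self, session_id: str, count: int = 2) -> list[str]:
        """Create decoy URLs to embed in a served page (placeholder for ML-built content)."""
        session = self.sessions.setdefault(session_id, Session())
        links = [f"/products/{secrets.token_hex(4)}" for _ in range(count)]
        session.decoy_paths.update(links)
        return links

    def record_request(self, session_id: str, path: str) -> Mitigation:
        """Record one request as telemetry and return the action to take."""
        session = self.sessions.setdefault(session_id, Session())
        if path in session.decoy_paths:
            session.decoy_hits += 1
        if session.decoy_hits >= self.bot_threshold:
            return Mitigation.TARPIT
        if session.decoy_hits > 0:
            return Mitigation.LOG
        return Mitigation.NONE


if __name__ == "__main__":
    tracker = DecoyTracker(bot_threshold=3)
    path = tracker.issue_decoy_links("bot-session", count=1)[0]
    for _ in range(3):
        action = tracker.record_request("bot-session", path)
        # Each decoy page links to a further decoy page, forming the loop.
        path = tracker.issue_decoy_links("bot-session", count=1)[0]
        print(action.name)
```

In this sketch a client that never follows a decoy link is never flagged, while a script that keeps following freshly issued decoy links accumulates hits until the assumed threshold is reached. A real deployment would presumably combine this signal with the other telemetry described in the background rather than rely on a fixed count.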
BRIEF DESCRIPTION OF DRAWINGS
For a more complete understanding of the subject matter and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 depicts a known overlay network configured as a Content Delivery Network (CDN);
FIG. 2 depicts a typical edge machine configuration in the CDN;
FIG. 3 depicts a typical end user interaction with a CDN that has been configured with a bot detection service;
FIG. 4 depicts the system of FIG. 3 that has been augmented to include a content generation server/service that is used to generate simulated or dummy pages that are useful to ensnare a bot or automated script into an endless navigation loop within a target site according to the techniques of this disclosure;
FIG. 5 depicts a set of site components and their associated ML models according to a modeling paradigm of this disclosure;
FIG. 6 depicts a training method and system for generating one or more Machine Learning (ML) models for use in the content generation server/service of FIG. 4