EP-4742061-A2 - WEB SCRAPING THROUGH USE OF PROXIES, AND APPLICATIONS THEREOF

Abstract

The invention relates to a computer-implemented method for processing web scraping jobs, using a plurality of database servers (404A-404N) operating independently of one another and each being configured to manage data storage to at least a portion of a job database (314) that stores status of web scraping jobs while the web scraping jobs are being executed, the method comprising: - receiving a web scraping request from a client computing device (102); - when the web scraping request is received, selecting one of the plurality of database servers (404A-404N) that is identified as enabled in a table (1008); and sending a job description specified by the web scraping request to the selected database server (404A-404N) for storage in the job database (314) as a pending web scraping job; - repeatedly checking health of each of the plurality of database servers (404A-404N); and - based on the health checks, determining whether each of the plurality of database servers (404A-404N) is to be enabled or disabled in the table (1008).
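The health-check and selection steps summarized above can be illustrated with a minimal sketch. It is not the applicant's implementation: the names `BrokerStatus`, `run_health_checks`, and `select_server`, the probe callables, the `max_queued` threshold, and the random choice among enabled servers are all assumptions made for illustration. The source only states that a server identified as enabled in the table is selected, without specifying how.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BrokerStatus:
    """Result of probing one database server's message broker (hypothetical type)."""
    reachable: bool        # did the connection check to the broker succeed?
    queued_messages: int   # job descriptions currently waiting in the broker


def run_health_checks(
    probes: Dict[str, Callable[[], BrokerStatus]],
    max_queued: int,
) -> Dict[str, bool]:
    """Build the enabled/disabled table: server id -> True if enabled.

    A server is disabled when its probe raises, reports an unreachable
    broker, or reports more queued messages than max_queued (assumed threshold).
    """
    table: Dict[str, bool] = {}
    for server_id, probe in probes.items():
        try:
            status = probe()
            table[server_id] = (
                status.reachable and status.queued_messages <= max_queued
            )
        except Exception:
            # A nonresponsive or erroring broker counts as unhealthy.
            table[server_id] = False
    return table


def select_server(table: Dict[str, bool]) -> str:
    """Pick one enabled server. Random choice is an assumed policy."""
    enabled = sorted(sid for sid, ok in table.items() if ok)
    if not enabled:
        raise RuntimeError("no database server is currently enabled")
    return random.choice(enabled)


if __name__ == "__main__":
    # Stand-in probes; a real probe would contact the broker over the network.
    probes = {
        "404A": lambda: BrokerStatus(reachable=True, queued_messages=3),
        "404B": lambda: BrokerStatus(reachable=False, queued_messages=0),
    }
    table = run_health_checks(probes, max_queued=100)
    print(table, select_server(table))
```

In a running system the checks would be repeated on a schedule so that the table tracks broker availability over time, and a newly received job description would be sent only to the server returned by the selection step.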

Inventors

  • VILCINSKAS, EIVYDAS
  • PETRUSKEVICIUS, ARNAS
  • STALIORAITIS, GIEDRIUS
  • JURAVICIUS, MARTYNAS
  • STANKEVICIUS, RIMANTAS

Assignees

  • Oxylabs, UAB

Dates

Publication Date
20260513
Application Date
20220624

Claims (15)

  1. A computer-implemented method for processing web scraping jobs, using a plurality of database servers (404A-404N) operating independently of one another and each being configured to manage data storage to at least a portion of a job database (314) that stores status of web scraping jobs while the web scraping jobs are being executed, the method comprising: - receiving a web scraping request from a client computing device (102); - when the web scraping request is received, selecting one of the plurality of database servers (404A-404N) that is identified as enabled in a table (1008); and sending a job description specified by the web scraping request to the selected database server (404A-404N) for storage in the job database (314) as a pending web scraping job; - repeatedly checking health of each of the plurality of database servers (404A-404N); and - based on the health checks, determine whether each of the plurality of database servers (404A-404N) are to be enabled or disabled in the table (1008); characterized in that each of the plurality of database servers (404A-404N) comprises a message broker (454) that queues job descriptions to be stored in the jobs database (314), and each of the repeatedly checking comprises, for each of the plurality of database servers (404A-404N), - checking a connection between a server (302) that receives web scraping requests from client computing devices (102) and the respective database server's message broker (454); and/or - checking a number of messages queued within the respective database server's message broker (454).
  2. The method of claim 1, wherein each of the repeated checking comprises, for each of the plurality of database servers (404A-404N), connecting to the portion of the job database (314) for the respective database server (404A-404N).
  3. The method of any one of the preceding claims, wherein by checking health, it is determined whether the respective database server (404A-404N) is available to accept new web scraping jobs.
  4. The method of any one of the preceding claims, wherein each of the plurality of database servers (404A-404N) is marked as disabled in the table (1008) if the connection between the server (302) and the respective database server's message broker (454) is nonresponsive or returns an error message.
  5. The method of any one of the preceding claims, wherein each of the plurality of database servers (404A-404N) is marked as disabled in the table (1008) if the number of messages exceeds a threshold.
  6. The method of any one of the preceding claims, wherein, each of the plurality of database servers (404A-404N) that is, based on the health checking, determined to be overworked or hung-up, is marked for read-only access in the table (1008).
  7. The method of any one of the preceding claims, further comprising - consuming, by a database microservice (456) of the selected database server (404A-404N), the job description specified by the web scraping request; and - initiating processing of the web scraping job.
  8. The method according to any of the preceding claims, wherein each of the plurality of database servers (404A-404N) is a shard managing storage in a horizontal partition of the jobs database (314).
  9. The method according to any of the preceding claims, wherein each of the plurality of database servers (404A-404N) do not synchronize states to one another.
  10. The method according to any of the preceding claims, wherein the plurality of database servers (404A-404N) are executed by a plurality of different computing devices.
  11. The method according to any of the preceding claims, further comprising: determining whether a number of database servers that are disabled in the plurality of database servers exceeds a threshold; and when the number of database servers that are disabled exceeds the threshold, alerting an administrator.
  12. The method according to any one of the preceding claims, further comprising: - retrieving a next value from a counter maintained by the selected database server (404A-404N); - concatenating an identification of the selected database server (404A-404N) with a timestamp and the next value to generate a job identifier associated with the web scraping job.
  13. A non-transitory computer-readable device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform operations, comprising the steps of any one of the preceding claims.
  14. A system for processing web scraping jobs, comprising: a processor; a job database (314) that stores the status of web scraping jobs while the web scraping jobs are being executed; a memory that stores the job database (314); a plurality of database servers (404A-404N) operating independently of one another, each database server (404A-404N) configured to manage data storage to at least a portion of the job database (314); a database monitor configured to repeatedly check health of each of the plurality of database servers (404A-404N) and, based on the results of the health checks, determine whether each of the plurality of database servers (404A-404N) is to be enabled or disabled in a table (1008); a database server selector configured to, when a web scraping request is received from a client computing device (102), select one of the database servers (404A-404N) identified as enabled in the table; and a request intake manager (302) configured to send a job description specified by the web scraping request to the selected database server (404A-404N) for storage in the job database (314) as a pending web scraping job, characterized in that each of the plurality of database servers (404A-404N) comprises a message broker (454) that queues job descriptions to be stored in the jobs database (314), and the database monitor is configured to, for each of the plurality of database servers (404A-404N), - check a connection between a server (302) that receives web scraping requests from client computing devices (102) and the respective database server's message broker (454); and/or - check a number of messages queued within the respective database server's message broker (454).
  15. The system of claim 14, wherein the database monitor is configured to, for each of the plurality of database servers (404A-404N), check a connection between the request intake manager (302) and the jobs database (314).
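The job identifier scheme recited in claim 12 (server identification concatenated with a timestamp and the next counter value) can be sketched as follows. This is only an illustration under stated assumptions: the class name `ShardIdGenerator`, the `-` separator, the whole-second Unix timestamp, and a counter starting at 1 are hypothetical choices that the claims do not specify.

```python
import itertools
import time
from typing import Callable


class ShardIdGenerator:
    """Per-server job identifier generator (hypothetical helper).

    Each database server keeps its own counter, so identifiers can be
    produced without coordinating with the other servers.
    """

    def __init__(self, server_id: str, clock: Callable[[], float] = time.time):
        self.server_id = server_id
        self._counter = itertools.count(1)  # assumed starting value
        self._clock = clock                 # injectable for testing

    def next_job_id(self) -> str:
        timestamp = int(self._clock())      # assumed: whole seconds since epoch
        value = next(self._counter)         # "next value" from the counter
        # Concatenate server identification, timestamp, and counter value.
        return f"{self.server_id}-{timestamp}-{value}"


if __name__ == "__main__":
    generator = ShardIdGenerator("404A")
    print(generator.next_job_id())
    print(generator.next_job_id())
```

Because the server identification is part of the identifier, two servers that do not synchronize state can still issue identifiers that never collide, and a component that later looks up a job can tell from the identifier which server holds it.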

Description

BACKGROUND

Field

This field is generally related to web scraping.

Related Art

Web scraping (also known as screen scraping, data mining, or web harvesting) is the automated gathering of data from the Internet. It is the practice of gathering data from the Internet through any means other than a human using a web browser. Web scraping is usually accomplished by executing a program that queries a web server and requests data automatically, then parses the data to extract the requested information.

To conduct web scraping, a program known as a web crawler may be used. A web crawler, sometimes called a web spider, is a program or an automated script which performs the first task, i.e., it navigates the web in an automated manner to retrieve data, such as Hypertext Markup Language (HTML) data, JSON, XML, and binary files, from the accessed websites.

Web scraping is useful for a variety of applications. In a first example, web scraping may be used for search engine optimization. Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. A web search engine, such as the Google search engine available from Google Inc. of Mountain View, California, has a particular way of ranking its results, including those that are unpaid. To raise the position of a website in search results, SEO may, for example, involve cross-linking between pages, adjusting the content of the website to include a particular keyword phrase, or updating content of the website more frequently. An automated SEO process may need to scrape search results from a search engine to determine how a website is ranked among search results.

In a second example, web scraping may be used to identify possible copyright infringement. In that example, the scraped web content may be compared to copyrighted material to automatically flag whether the web content may be infringing a copyright holder's rights.
In one operation to detect possible copyright infringement, a request may be sent to a search engine, which has already gathered a great deal of content on the Internet. The scraped search results may then be compared to a copyrighted work.

In a third example, web scraping may be useful to check placement of paid advertisements on a webpage. For example, many search engines sell keywords, and when a search request includes the sold keyword, they place paid advertisements above unpaid search results on the returned page. Search engines may sell the same keyword to various companies, charging more for preferred placement. In addition, search engines may segment ad sales by geographic area. Automated web scraping may be used to determine ad placement for a particular keyword or in a particular geographic area.

In a fourth example, web scraping may be useful to check prices or products listed on e-commerce websites. For example, a company may want to monitor a competitor's prices to ensure that its own prices remain competitive.

To conduct web scraping, the web request may be sent from a proxy server. The proxy server then makes the request on the web scraper's behalf, collects the response from the web server, and forwards the web page data so that the scraper can parse and interpret the page. When the proxy server forwards requests, it generally does not alter the underlying content, but merely forwards the response back to the web scraper. A proxy server changes the request's source IP address, so the web server is not provided with the geographical location of the scraper. Using the proxy server in this way can make the request appear more organic and thus help ensure that the results from web scraping represent what would actually be presented were a human to make the request from that geographical location.

Proxy servers fall into various types depending on the IP address used to address a web server.
A residential IP address is an address from a range specifically designated by the owning party, usually an Internet service provider (ISP), as assigned to private customers. Usually a residential proxy is an IP address linked to a physical device, for example, a mobile phone or desktop computer. Commercially, however, blocks of residential IP addresses may be bought from the owning proxy service provider by another company directly, in bulk.

Mobile IP proxies are a subset of the residential proxy category. A mobile IP proxy is one with an IP address that is obtained from mobile operators. Mobile IP proxies use mobile data, as opposed to a residential proxy that uses broadband ISPs or home Wi-Fi.

A datacenter IP proxy is a proxy server assigned a datacenter IP. Datacenter IPs are IPs owned by companies, not by individuals. Datacenter proxies are typically IP addresses that are not located in a natural person's home.

Exit node proxies, or simply exit nodes, are gateways where the traffic hits the Internet. There can be several proxies used to perform a user's request, but the exit node proxy is the