US-20260127168-A1 - GENERIC SCHEDULING

US 20260127168 A1

Abstract

A system and method for customized scheduling of sources, including breaking down a source of content into at least two categories, including posts and engagements, and gathering content related to a specific source. A scheduler handles scheduling of posts and engagement for a single source and entities that are due to be crawled are sent to a scheduling queue, in which each content type for a source can have its own queue. A process points to the correct scheduler queue in order to request content to be crawled, attaches to the proper queue, processes requests, queries the social network for content, parses the response and sends any new data to be saved to the system.

Inventors

  • Stuart Douglas McClune
  • Michael Gordon LUFF

Assignees

  • SALESFORCE, INC.

Dates

Publication Date
20260507
Application Date
20230421

Claims (9)

  1-20. (canceled)
  21. A method for setting a crawler scheduling algorithm by identifying patterns to reduce crawl frequency of a content from a web page, the method comprising: parsing, by a processor, the content into a first portion and a second portion, the first portion being associated with a new content to the web page, the second portion being associated with updated content related to the first portion; identifying, by the processor, whether the web page comprises a pattern of content being a first unedited type or a second edited type, wherein the pattern of content applies to a plurality of web pages; in response to a determination that the web page is the pattern of content of the first unedited type, causing the processor to refrain from fetching the content until the processor determines the content of the web page has been modified; and setting, by the processor, the scheduling algorithm according to a type of the web page, the type being the first unedited type or the second edited type.
  22. The method of claim 21, further comprising rescheduling, by the processor and in response to a determination that the web page is the second edited type, the crawl of the content of the web page.
  23. The method of claim 21, further comprising sending, by the processor to a scheduling queue, the scheduling algorithm of the crawl of the content from the web page.
  24. The method of claim 23, wherein the scheduling queue comprises: a first queue for the first unedited type; and a second queue for the second edited type.
  25. The method of claim 23, wherein the scheduling queue comprises: a first queue for the new content in the web page; and a second queue for the updated content related to the new content in the web page.
  26. The method of claim 21, further comprising querying, by the processor, a dynamics ADS active table in conjunction with rescheduling the crawl of the updated content.
  27. A non-transitory computer-readable medium storing computer code for setting a crawler scheduling algorithm by identifying patterns to reduce crawl frequency of a content from a web page, the computer code including instructions to cause a processor to: parse the content into a first portion and a second portion, the first portion being associated with a new content to the web page, the second portion being associated with updated content related to the first portion; identify whether the web page comprises a pattern of content being a first unedited type or a second edited type, wherein the pattern of content applies to a plurality of web pages; cause, in response to a determination that the web page is the pattern of content of the first unedited type, the processor to refrain from fetching the unedited content until the processor determines that a new content on the web page is available; and set the scheduling algorithm according to a type of the web page, the type being the first unedited type or the second edited type.
  28. A system for setting a crawler scheduling algorithm by identifying patterns to reduce crawl frequency of a content from a web page, comprising: a memory configured to store the content; and a processor configured to: parse the content into a first portion and a second portion, the first portion being associated with a new content to the web page, the second portion being associated with updated content related to the first portion; identify whether the web page comprises a pattern of content being a first unedited type or a second edited type, wherein the pattern of content applies to a plurality of web pages; cause, in response to a determination that the web page is the pattern of content of the first unedited type, the processor to refrain from fetching the unedited content until the processor determines that a new content on the web page is available; and set the scheduling algorithm according to a type of the web page, the type being the first unedited type or the second edited type.
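The scheduling decision recited in the claims above can be sketched as follows. This is a minimal illustration only, not the claimed implementation; the `PageType` enum, function name, and callback parameters are all assumptions introduced for clarity. Pages classified as the unedited type are not fetched until their content has been modified, while pages of the edited type are rescheduled for a re-crawl.

```python
from enum import Enum


class PageType(Enum):
    UNEDITED = "unedited"  # content never changes after posting
    EDITED = "edited"      # content may be updated after posting


def schedule_crawl(page_type, content_modified, fetch, reschedule):
    """Illustrative dispatch: refrain from fetching an unedited-type
    page until it is determined to have been modified; for an
    edited-type page, reschedule the crawl."""
    if page_type is PageType.UNEDITED:
        if content_modified:
            fetch()  # modification detected: fetch now
        # otherwise refrain from fetching entirely
    else:
        reschedule()  # edited type: re-crawl on a schedule
```

Keeping unedited-type pages out of the fetch path until a modification is detected is what reduces the overall crawl frequency.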

Description

BACKGROUND

A search engine is a tool that identifies documents, typically stored on hosts distributed over a network, that satisfy search queries specified by users. Web-type search engines work by storing information about a large number of web pages or documents. These documents are retrieved by a web crawler, which then follows links found in crawled documents so as to discover additional documents to download. The contents of the downloaded documents are indexed, mapping the terms in the documents to identifiers of the documents, and the resulting index is configured to enable a search to identify documents matching the terms in search queries. Some search engines also store all or part of the document itself, in addition to the index entries. In such web-type search engines, web pages can be manually selected for crawling, or automated selection mechanisms can be used to determine which web pages to crawl and which web pages to avoid.

A search engine crawler typically includes a set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., the WWW). Each scheduler handles the scheduling of document identifiers for crawling for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled or scheduled for crawling during the most recently completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in one or more previous crawls. Other filtering and scheduling mechanisms may also be used to filter out some of the document identifiers in the starting set and to schedule the appropriate times for crawling others. As such, any number of factors may play a role in filtering and scheduling mechanisms.
Accordingly, a need exists for a generic scheduling process that addresses these variables and allows for customized scheduling of such sources, including gathering content related to a specific source.

BRIEF SUMMARY

According to implementations of the disclosed subject matter, a system and method is provided for a generic scheduling process for use in computer network systems. According to one implementation of the disclosed subject matter, a system and method is provided that allows for customized scheduling of sources, hereinafter referred to as managed account-type sources, including gathering content related to a specific source. To do so, an implementation of the disclosed subject matter is provided to break down a source of content from a social network into at least two categories, including posts, which represent top level content, and engagements, which represent content driven from top level content ingested into the system and which has an associated ID (e.g., comments, replies, and so forth).

An implementation of the disclosed subject matter is also provided to control a scheduler, hereinafter referred to as a managed account scheduler, to handle scheduling of posts and engagements for a single managed account-type source (e.g., Google+®, LinkedIn®, and the like). An implementation of the disclosed subject matter is also provided to send entities that are due to be crawled to a scheduling queue, such as a Redis Queue, in which each content type (e.g., posts and engagements) for a managed account may have its own queue within the scheduling queue that the scheduler will send entities to, based on the type of entity being scheduled. Herein, an entity may be any source of content from a social network, but is not limited thereto.
An implementation of the disclosed subject matter is also provided to control a process, hereinafter referred to as a managed account worker process, to point to a queue within the scheduling queue in order to request content of the scheduler queue to be crawled. For each managed account, there may be two managed account worker process instances running, one for each content type within the managed account. An implementation of the disclosed subject matter is also provided to control a managed account worker process to attach to the proper scheduling queue, process the request, query the social network for content, parse the response, and send any new data to another process, hereinafter referred to as a batch insert process, to be saved to the system. Any associated dynamics may also be updated if the managed account worker process is processing engagements-type posts.

Accordingly, implementations of the disclosed subject matter provide a generic scheduling process that manages when a particular external entity is due to be crawled. An external entity may be any source of content from a social network and is broken down into two categories, including posts and engagements. Each managed account scheduler may handle scheduling of posts and engagements for a single managed account-type source, and entities that are due to be crawled may be sent to a scheduling queue in a format, and e
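The worker loop described above (attach to the proper queue, process each request, query the social network, parse the response, and hand new data to the batch insert process) can be sketched as follows. All four callables are assumptions standing in for the network client, parser, and batch insert process; none are APIs from the disclosure.

```python
def run_worker(queue, query_network, parse, batch_insert):
    """Sketch of a managed account worker process for one content
    type: drain the attached queue, query the social network for
    each entity, parse the response, and send any new data to the
    batch insert process to be saved."""
    while queue:
        entity = queue.popleft()         # process the next request
        response = query_network(entity)  # query the social network
        new_items = parse(response)       # parse the response
        if new_items:
            batch_insert(new_items)       # hand off new data to save
```

Running one such instance per content type matches the summary's note that each managed account may have two worker process instances, one for posts and one for engagements.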