CN-121996828-A - Distributed webpage information automatic acquisition scheduling method and device based on multi-source hierarchical analysis, processor and storage medium thereof
Abstract
The invention relates to a distributed webpage information automatic acquisition and scheduling method based on multi-source hierarchical analysis. A link analysis layer receives the data to be processed and classifies it along different dimensions. A page grabbing layer executes hierarchical grabbing tasks according to the results sent by the link analysis layer, collecting data by link type and scheduling by task priority. An element extraction layer extracts key content, secondary elements and related elements from the grabbing results and filters out the irrelevant elements they contain. A quality assessment layer judges the crawling quality of the structured data, expands valuable links according to search engine rules, and finally stores the assessment results and submits them to a decision optimization layer. The decision optimization layer automatically adjusts the grabbing strategy and the extraction algorithm parameters, constructing an adaptive optimization mechanism that enables continuous improvement of the system in large-scale, diversified scenarios.
Inventors
- ZHANG YONGJUN
- CHEN YUANZHI
- YU YIMING
- LIN JIUCHUAN
- FAN LANG
- XU JIE
- CHEN WENXUAN
Assignees
- Third Research Institute of the Ministry of Public Security (公安部第三研究所)
Dates
- Publication Date
- 20260508
- Application Date
- 20260129
Claims (12)
- 1. A distributed webpage information automatic acquisition and scheduling method based on multi-source hierarchical analysis, characterized by comprising the following steps: (1) receiving data to be processed through a link analysis layer, classifying the acquired data according to different dimensions, and completing the link analysis; (2) the page grabbing layer executes hierarchical grabbing tasks according to the input results sent by the link analysis layer, collects data separately according to link type, and schedules in combination with task priorities to achieve efficient coverage; (3) the element extraction layer extracts elements from the grabbing results of the page grabbing layer according to key content, secondary elements and related elements, filters out the irrelevant elements contained therein, saves all extraction results, generates structured data and transmits it for quality evaluation; (4) the quality assessment layer performs crawling-quality judgment on the structured data, identifies the links, articles, advertisement content and risk-control prompt information it contains, expands valuable links according to search engine rules, and finally saves the assessment results and submits them to the decision optimization layer; (5) the decision optimization layer receives the judgment and extraction results, automatically adjusts the grabbing strategy and extraction algorithm parameters, and constructs an adaptive optimization mechanism to achieve continuous improvement of the system in large-scale, diversified scenarios.
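The five-layer flow of claim 1 can be sketched as a minimal pipeline. All function names, classification dimensions and scoring rules below are illustrative assumptions for exposition, not details taken from the claimed system.

```python
# Illustrative sketch of the five-layer pipeline of claim 1.
# Every name and heuristic here is hypothetical; each layer is a pure function.

def link_analysis(urls):
    """Layer 1: classify incoming links by a toy dimension (URL scheme)."""
    return {"http": [u for u in urls if u.startswith("http://")],
            "https": [u for u in urls if u.startswith("https://")]}

def page_grabbing(classified):
    """Layer 2: pretend-fetch each link, with a toy priority (https first)."""
    ordered = classified["https"] + classified["http"]
    return [{"url": u, "html": "<html>...</html>"} for u in ordered]

def element_extraction(pages):
    """Layer 3: keep only the fields the downstream layers need."""
    return [{"url": p["url"], "text": p["html"]} for p in pages]

def quality_assessment(records):
    """Layer 4: score each record (here: non-empty text passes)."""
    return [dict(r, ok=bool(r["text"])) for r in records]

def decision_optimization(assessed):
    """Layer 5: derive a feedback signal, e.g. the failure ratio."""
    failures = sum(1 for r in assessed if not r["ok"])
    return {"failure_ratio": failures / max(len(assessed), 1)}

result = decision_optimization(
    quality_assessment(element_extraction(page_grabbing(
        link_analysis(["https://a.example/", "http://b.example/"])))))
```

In a real deployment each layer would be a distributed service exchanging task queues rather than a function call, but the data flow is the same.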
- 2. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 1, wherein step (1) is specifically as follows: the user imports target task links into the system, and the link analysis layer performs hierarchical analysis processing consisting of domain name resolution, extraction and classification, thereby completing the classification analysis of the data to be processed.
- 3. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 2, wherein step (2) is specifically as follows: after the link analysis processing is completed, the page grabbing layer initiates grabbing tasks at three levels, namely the domain name, the navigation bar and the article detail page; if grabbing fails, the system automatically triggers a retry mechanism and reports the grabbing results to the decision optimization layer according to the grabbing conditions, wherein the website's robots.txt file is parsed according to the Robots Exclusion Protocol, the Internet's default crawler convention, the sitemap pages are extracted according to its rules, and the links contained in the sitemaps are sent to the element extraction layer.
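The robots.txt handling described in claim 3 is supported directly by the Python standard library. The sketch below inlines a robots.txt body for illustration (a real grabber would first fetch `<site>/robots.txt`); the domain name and paths are made up.

```python
# Minimal sketch: honoring the Robots Exclusion Protocol before scheduling a
# grab task, and collecting sitemap links for the element extraction layer.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://news.example/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("*", "https://news.example/articles/1.html")
blocked = parser.can_fetch("*", "https://news.example/private/x.html")
sitemaps = parser.site_maps()  # sitemap links to forward downstream
```

`RobotFileParser.site_maps()` (Python 3.8+) returns the `Sitemap:` entries, matching the claim's flow of sending sitemap links onward.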
- 4. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 3, wherein after a complete HTML page is collected, step (3) enters a page parsing stage and performs link extraction in the following manner: the system first traverses the DOM tree structure of the current page, extracts all <a> tags and other elements that may carry hyperlink attributes, comprehensively judges the required key content through regular-expression matching of URL patterns, path structure depth calculation and semantic feature recognition of the link text and surrounding tags, and directly eliminates functional links.
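The link-extraction pass of claim 4 can be sketched with the standard-library HTML parser: collect every `href`, compute path depth, and drop obviously functional links. The prefix list and depth rule are illustrative heuristics, not the patent's actual criteria.

```python
# Sketch of claim 4's link extraction: DOM traversal for <a> tags, path-depth
# calculation, and elimination of functional links (javascript:, mailto:, #).
from html.parser import HTMLParser
from urllib.parse import urlparse

FUNCTIONAL_PREFIXES = ("javascript:", "mailto:", "#")  # illustrative list

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href and not href.startswith(FUNCTIONAL_PREFIXES):
                self.links.append(href)

def path_depth(url):
    """Depth of the URL path, e.g. /news/2024/a.html -> 3."""
    return len([p for p in urlparse(url).path.split("/") if p])

collector = LinkCollector()
collector.feed('<a href="/news/2024/a.html">story</a>'
               '<a href="javascript:void(0)">menu</a>'
               '<a href="#top">top</a>')
```

Path depth is one of the dimensions the claim names; deeper paths often indicate article detail pages rather than navigation.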
- 5. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 4, wherein step (3) further comprises filtering irrelevant elements according to the following method: (3.1) first, preliminarily filtering nodes that obviously belong to advertisement or promotional content based on the characteristic identifiers of common advertisement and recommendation components; (3.2) calculating the relative positions of related elements in the page hierarchy in combination with the DOM structural position features of the page, and marking nodes in the typical functional areas of the header, sidebars and footer as having low text relevance; (3.3) when the text length is extremely short, the text density is too low, or the page contains a large number of repeated symbols or semantically empty fragments, judging it to be non-text noise and automatically eliminating it; and (3.4) stably filtering out irrelevant content through comprehensive judgment of structural, semantic and layout features.
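The non-text-noise test of step (3.3) can be sketched as a simple predicate. The thresholds (minimum length, alphanumeric density, repetition run) are illustrative assumptions; the patent does not specify concrete values.

```python
# Toy version of step (3.3): very short text, low "density" (alphanumeric
# share), or a long run of one repeated symbol is rejected as non-text noise.
import re

def is_text_noise(text, min_len=20, min_density=0.5, max_repeat=5):
    stripped = text.strip()
    if len(stripped) < min_len:
        return True                       # extremely short text
    alnum = sum(ch.isalnum() for ch in stripped)
    if alnum / len(stripped) < min_density:
        return True                       # too low text density
    if re.search(r"(.)\1{%d,}" % max_repeat, stripped):
        return True                       # long run of a repeated symbol
    return False
```

A production filter would combine this with the structural and layout signals of steps (3.1)-(3.2) rather than rely on text statistics alone.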
- 6. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 4, wherein step (3) further comprises positioning text elements in the following manner: where no explicit title node is found, the most probable title is deduced from the candidate nodes according to text block length, font size and hierarchical position in the DOM structure; time and author information are extracted through regular matching of multiple date and signature patterns together with context cue words; the text region is located at the corresponding <div> or <p> node based on paragraph density, text distribution continuity and text-ratio features, while pictures associated with the text are identified and linked as attachments.
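The date/author regular matching in claim 6 can be sketched as follows. The two date patterns and the signature cue words (作者/记者/By) cover only common news layouts and are illustrative, not an exhaustive rule set.

```python
# Sketch of claim 6's regex-based extraction of time and author information.
# Patterns are examples only; real pages need a much broader pattern library.
import re

DATE_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",           # e.g. 2024-05-08
    r"\d{4}年\d{1,2}月\d{1,2}日",    # e.g. 2024年5月8日
]
AUTHOR_PATTERN = r"(?:作者|记者|By)[:：\s]*([\w\u4e00-\u9fff]+)"

def extract_meta(text):
    date = next((m.group(0) for p in DATE_PATTERNS
                 for m in [re.search(p, text)] if m), None)
    author = re.search(AUTHOR_PATTERN, text)
    return {"date": date, "author": author.group(1) if author else None}
```

The cue words act as the "context prompt words" the claim mentions: a name is only accepted when it follows a byline marker.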
- 7. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 4, wherein step (3) further comprises the following steps: the system records each successfully parsed XPath and its structural characteristics so that it can be directly reused on pages with a similar structure; when the page structure changes and an existing XPath fails to match, the system automatically generates new candidate XPaths according to the structural similarity of the nodes and the text characteristics of the context and verifies their extraction effect; it then dynamically updates the credibility of each XPath according to its success rate and stability, reducing the weight of failing rules and raising the priority of stable rules, thereby continuously maintaining the accuracy, stability and adaptive capability of page parsing.
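The credibility bookkeeping of claim 7 can be sketched as a weighted rule pool: each XPath carries a weight that rises on successful extraction and falls on failure, so stable rules are tried first. The multiplicative update factors are illustrative assumptions.

```python
# Sketch of claim 7's XPath-credibility mechanism. The class name, initial
# weight and update factors are hypothetical choices for illustration.

class XPathRulePool:
    def __init__(self):
        self.weights = {}  # xpath -> credibility in (0, 1]

    def add(self, xpath, weight=0.5):
        self.weights[xpath] = weight

    def report(self, xpath, success, up=1.1, down=0.7):
        """Multiplicative credibility update, clamped to at most 1.0."""
        w = self.weights[xpath] * (up if success else down)
        self.weights[xpath] = min(w, 1.0)

    def ranked(self):
        """Rules ordered by credibility, highest first."""
        return sorted(self.weights, key=self.weights.get, reverse=True)

pool = XPathRulePool()
pool.add("//article//h1")
pool.add("//div[@class='title']")
pool.report("//article//h1", success=True)
pool.report("//div[@class='title']", success=False)
```

Ranking by weight implements the claim's "improves the priority of stability rules": the extractor tries `ranked()[0]` before falling back to lower-credibility XPaths.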
- 8. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 7, wherein step (4) specifically comprises: (4.1) identifying the plain-text region of the text, namely introducing a text extraction tool as a supplement to the XPath rules generated in the previous step for the text range, ensuring compatibility with complex pages; (4.2) filtering advertisements and redundant content, namely removing non-core information including scripts, style sheets and advertisements, and establishing regular-expression filtering rules for common advertisement templates; (4.3) text cleaning and normalization, namely removing HTML tags and keeping only plain text, processing special symbols and escape characters and uniformly replacing them with standard characters, while standardizing the time format to ensure consistency of subsequent analysis; and (4.4) extracting associated elements, namely extracting <img> and attachment tags from the page, extracting the links they contain, downloading the corresponding files, and storing them after computing their MD5 digests, finally forming the JSON-format output.
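Steps (4.3)-(4.4) can be sketched with the standard library: strip tags, unescape entities, collapse whitespace, derive a content-addressed storage name from the MD5 digest, and emit JSON. Regex tag stripping is a simplification; real pages need a proper HTML parser.

```python
# Sketch of steps (4.3)-(4.4): text cleaning/normalization and MD5-named
# attachment storage, producing a JSON-format record.
import hashlib
import html
import json
import re

def clean_text(raw_html):
    text = re.sub(r"<[^>]+>", "", raw_html)   # drop tags, keep plain text
    text = html.unescape(text)                # &amp; -> & etc.
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

def store_key(file_bytes):
    """Content-derived storage name, as in step (4.4)."""
    return hashlib.md5(file_bytes).hexdigest()

record = {
    "text": clean_text("<p>Tom &amp; Jerry</p>"),
    "attachment": store_key(b"fake-image-bytes"),  # stands in for a download
}
payload = json.dumps(record, ensure_ascii=False)
```

Naming files by MD5 digest deduplicates identical attachments across pages for free, which is presumably why the claim stores files "after md5 operation".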
- 9. The method for automatic collection and scheduling of distributed webpage information based on multi-source hierarchical analysis according to claim 8, wherein step (5) specifically comprises: (5.1) rule optimization and automatic correction, namely automatically identifying abnormal records according to the field integrity of the key-element extraction results and marking abnormal texts as parsing-failure samples, which the system sends together with their original HTML into a structure-comparison process to identify whether the page has undergone template variation; (5.2) the system generates candidate XPaths from the analysis model based on DOM characteristics and performs independent parsing and field-coverage evaluation for each XPath; when a candidate XPath parses better than the original strategy, the system writes it into a standby strategy pool and performs parallel verification on subsequent pages of the same domain name or with a similar structure; (5.3) template learning and self-adaptation, wherein the system performs cluster analysis on the text length, similarity characteristics and structural characteristics of page contents to identify and classify page types likely to be invalid content, caches structural templates judged to be stable and reusable as template fingerprints, and when the same domain name or pages matching these structural templates appear later, the corresponding links are skipped during parsing or removed directly; (5.4) unified-language storage and translation, wherein the system first recognizes the page language and performs unified translation processing on texts in different languages, while generating semantic vectors that can be aligned across languages, thereby achieving storage and comparison of multilingual content in the same semantic space; and (5.5) continuous improvement and monitoring, namely the system compares historical and current parsing results within a set time window and automatically identifies a decline in a site's parsing capability by monitoring changes in the parsing success rate; once the success rate of a site falls below a threshold, the system triggers an early warning and executes strategy adjustment, including starting a standby parsing path or regenerating parsing rules, and the related failure samples enter a model-learning process to update the parsing model parameters.
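The monitoring loop of step (5.5) can be sketched as a per-site sliding window of parse outcomes with a threshold alarm. The window size and threshold are illustrative values; the patent leaves them unspecified.

```python
# Sketch of step (5.5): track per-site parse outcomes in a sliding window and
# flag a site as degraded when its success rate drops below a threshold, so a
# standby parsing path or regenerated rule can be tried.
from collections import defaultdict, deque

class ParseMonitor:
    def __init__(self, window=100, threshold=0.8):
        self.threshold = threshold
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, site, ok):
        """Append one parse outcome (True/False) for the site."""
        self.history[site].append(ok)

    def degraded(self, site):
        """True when the windowed success rate falls below the threshold."""
        h = self.history[site]
        return bool(h) and sum(h) / len(h) < self.threshold

mon = ParseMonitor(window=10, threshold=0.8)
for ok in [True] * 7 + [False] * 3:   # 70% success in the window
    mon.record("news.example", ok)
```

`mon.degraded("news.example")` is the early-warning signal: the scheduler would react by switching to a standby rule and routing the failure samples into model retraining.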
- 10. A distributed webpage information automatic acquisition and scheduling device based on multi-source hierarchical analysis, characterized in that the device comprises: a processor configured to execute computer-executable instructions; and a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the distributed webpage information automatic acquisition and scheduling method based on multi-source hierarchical analysis of any one of claims 1 to 9.
- 11. A distributed webpage information automatic acquisition and scheduling processor based on multi-source hierarchical analysis, characterized in that the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the distributed webpage information automatic acquisition and scheduling method based on multi-source hierarchical analysis of any one of claims 1 to 9.
- 12. A computer-readable storage medium having stored thereon a computer program executable by a processor to implement the steps of the distributed webpage information automatic acquisition and scheduling method based on multi-source hierarchical analysis of any one of claims 1 to 9.
Description
Distributed webpage information automatic acquisition scheduling method and device based on multi-source hierarchical analysis, processor and storage medium thereof

Technical Field
The invention relates to the technical field of internet data acquisition and processing, in particular to distributed automatic hierarchical analysis and scheduling of data, and specifically to a distributed webpage information automatic acquisition and scheduling method, device, processor and computer-readable storage medium based on multi-source hierarchical analysis.

Background
Existing webpage acquisition technology mainly relies on two methods. The first is acquisition based on fixed rules, in which the webpage structure is parsed against templates using technical means such as regular expressions, XPath and CSS selectors. This method can be implemented quickly for stable, infrequently updated page structures, but has poor portability and robustness: once the page structure of the target site changes, manual intervention and rule rewriting are needed, making it difficult to adapt to large-scale, dynamically changing acquisition requirements. The second class of methods is scheduled acquisition based on a distributed crawler framework, such as the common Scrapy and Heritrix. Such systems typically improve acquisition efficiency through task distribution and parallel scheduling, but lack the ability to intelligently parse and monitor acquisition targets. In practical application, problems such as incorrect extraction of page content, repeated grabbing and large numbers of invalid links easily occur, wasting storage and bandwidth resources. In recent years, with the development of artificial intelligence, acquisition modes based on machine learning or large language models have appeared, for example automatically writing parsing logic through an AI script generator.
Such methods improve adaptation to webpages to a certain extent, but their training and maintenance costs are extremely high because they require large numbers of labeled samples and substantial computing resources. In addition, these methods mostly take a single-source single page as the core analysis object and can hardly achieve structured extraction across sites and page levels, so their practicality is insufficient in large-scale general acquisition scenarios. In summary, existing webpage collection technology has the following prominent problems: (1) serious rule dependence: acquisition tasks depend heavily on fixed templates and rules and lack dynamic adaptation capability; (2) low resource utilization: effective monitoring and scheduling of acquisition targets is lacking, easily causing repeated acquisition and resource waste; (3) shortcomings of intelligent methods: AI-driven parsing methods have made progress but suffer from high cost and poor generalization; (4) lack of generality: stable and accurate content extraction is difficult in a multi-source, multi-level complex webpage environment. Therefore, a new acquisition framework is urgently needed that can fuse multi-source hierarchical analysis, intelligent scheduling and large-model-driven optimization, improving the stability, universality and automation level of acquisition so as to meet the requirements of large-scale data acquisition and processing.

Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a distributed webpage information automatic acquisition and scheduling method, device, processor and computer-readable storage medium based on multi-source hierarchical analysis.
In order to achieve the above object, the method, device, processor and computer-readable storage medium for automatic acquisition and scheduling of distributed webpage information based on multi-source hierarchical analysis of the present invention are as follows. The distributed webpage information automatic acquisition and scheduling method based on multi-source hierarchical analysis is mainly characterized by comprising the following steps: (1) receiving data to be processed through a link analysis layer, classifying the acquired data according to different dimensions, and completing the link analysis; (2) the page grabbing layer executes hierarchical grabbing tasks according to the input results sent by the link analysis layer, collects data separately according to link type, and schedules in combination with task priorities to achieve efficient coverage; (3) the element extraction layer extracts elements from the grabbing results of the page grabbing layer according to key content, secondary elements and related elements, filters irrelevant elements contained in