CN-121658699-B - Small language webpage self-adaptive collection method and device based on large language model


Abstract

The application relates to the technical field of artificial intelligence and discloses a method and a device for adaptively collecting small-language webpages based on a large language model. The method comprises: loading and rendering a target small-language webpage through a headless browser to obtain a document object model tree structure; parsing and semantically annotating the document object model tree structure based on a large language model, identifying dynamic content nodes, and generating an adaptive content positioning rule; semantically vectorizing at least two consecutively collected webpage contents based on the adaptive content positioning rule and calculating their semantic similarity; querying a preset frequency mapping rule according to the semantic similarity and dynamically adjusting the initiation frequency of subsequent collection requests to generate a collection strategy; and executing the collection strategy to access the target webpage while injecting an operation sequence that simulates human interaction behavior during collection. The application can achieve complete collection of the dynamic content of small-language webpages and keep the collection process continuous and stable.

Inventors

HE ZHONGQING
ZENG WEIJIA
CHEN DAWEI
WANG XUETENG
XU LINGZI
XU KUNYANG
ZHAO SHAN
XIE QIONGBING

Assignees

深圳市明心数智科技有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-02-05

Claims (9)

1. A method for adaptively collecting small-language webpages based on a large language model, characterized by comprising the following steps: loading and rendering a target small-language webpage through a headless browser to obtain a corresponding document object model tree structure; inputting the document object model tree structure into a pre-trained multilingual large language model, and obtaining, through the large language model, structural descriptions and semantic annotation information of nodes in the document object model tree, wherein the semantic annotation information comprises at least node function categories and content language labels; identifying a target node related to dynamic content loading according to the structural descriptions and the semantic annotation information; generating or updating an adaptive content positioning rule for locating the content of the target node based on path information and attribute features of the target node; based on the adaptive content positioning rule, performing semantic vectorization on at least two consecutively collected webpage contents and calculating the corresponding semantic similarity; querying a preset frequency mapping rule according to the semantic similarity, dynamically adjusting the initiation frequency of subsequent collection requests, and generating a collection strategy; and executing the collection strategy to access the target webpage, and injecting an operation sequence for simulating human interaction behavior during the collection process.
2. The method for adaptively collecting small-language webpages based on a large language model according to claim 1, wherein generating or updating the adaptive content positioning rule for locating the content of the target node comprises: constructing an initial positioning path expression according to the path information of the target node; based on the node function category in the semantic annotation information, generalizing or specializing the initial positioning path expression to form the adaptive content positioning rule; and storing the adaptive content positioning rule in a rule base in association with the content language label.
3. The method for adaptively collecting small-language webpages based on a large language model according to claim 1, wherein performing semantic vectorization on at least two consecutively collected webpage contents based on the adaptive content positioning rule and calculating the corresponding semantic similarity comprises: inputting each of the consecutively collected webpage contents into a cross-language semantic encoding model; obtaining, through the cross-language semantic encoding model, vector representations of the webpage contents in a unified semantic space; and calculating the cosine distance between at least two of the vector representations, and taking the cosine distance as the measure of the semantic similarity.
4. The method for adaptively collecting small-language webpages based on a large language model according to claim 1, wherein querying the preset frequency mapping rule according to the semantic similarity, dynamically adjusting the initiation frequency of subsequent collection requests, and generating the collection strategy comprises: presetting at least two semantic similarity threshold intervals, each threshold interval being associated with one collection frequency level; matching the calculated semantic similarity against the threshold intervals to determine the collection frequency level to which the semantic similarity belongs; and adjusting the request time interval of the next batch of collection tasks according to the determined collection frequency level to generate the collection strategy.
5. The method for adaptively collecting small-language webpages based on a large language model according to claim 1, wherein executing the collection strategy to access the target webpage and injecting the operation sequence for simulating human interaction behavior during the collection process comprises: generating a collection schedule containing random delays according to the request time interval in the collection strategy; constructing a script instruction set containing at least one of a simulated mouse movement event, a scroll event, or a click event; and after the headless browser initiates a request and loads a page according to the collection schedule, executing the script instruction set and extracting the corresponding page content.
6. The method for adaptively collecting small-language webpages based on a large language model according to claim 1, wherein executing the collection strategy to access the target webpage and injecting the operation sequence for simulating human interaction behavior during the collection process further comprises: determining the jurisdiction to which the target webpage belongs according to the domain name or Internet Protocol address of the target webpage, and loading a corresponding set of data compliance clauses; parsing the set of data compliance clauses and extracting the sensitive data patterns defined therein; and matching the collected raw data against the sensitive data patterns in real time, and filtering or desensitizing the sensitive data that is successfully matched.
7. The method for adaptively collecting small-language webpages based on a large language model according to claim 6, wherein matching the collected raw data against the sensitive data patterns in real time and filtering or desensitizing the sensitive data that is successfully matched comprises: comparing text fields in the raw data with the sensitive data patterns using a regular expression or a keyword matching algorithm; when the comparison succeeds, performing a handling operation on the corresponding field according to the handling requirements in the set of data compliance clauses, the handling operation including at least one of a replacement operation, a masking operation, or a deletion operation; and recording the operation type, the target field, and the timestamp of the handling operation to generate a compliance audit log.
8. The method for adaptively collecting small-language webpages based on a large language model according to claim 1, wherein before inputting the document object model tree structure into the pre-trained multilingual large language model, the method further comprises: obtaining a webpage corpus training set for the target small language, wherein the webpage corpus training set comprises webpage texts and the corresponding basic document object model structure labels; performing incremental training on a basic multilingual large language model using the webpage corpus training set, so as to improve the basic multilingual large language model's semantic understanding and node classification capabilities for webpage structures in the target small language and obtain an enhanced large language model; and inputting the document object model tree structure into the enhanced large language model.
9. A system for adaptively collecting small-language webpages based on a large language model, characterized by comprising: a webpage rendering module, configured to load and render a target small-language webpage through a headless browser and obtain a corresponding document object model tree structure; a rule generation module, configured to input the document object model tree structure into a pre-trained multilingual large language model, obtain, through the large language model, structural descriptions and semantic annotation information of nodes in the document object model tree, wherein the semantic annotation information comprises at least node function categories and content language labels, identify a target node related to dynamic content loading, and generate or update an adaptive content positioning rule for locating the content of the target node; a semantic analysis module, configured to perform semantic vectorization on at least two consecutively collected webpage contents based on the adaptive content positioning rule and calculate the corresponding semantic similarity; a strategy generation module, configured to query a preset frequency mapping rule according to the semantic similarity, dynamically adjust the initiation frequency of subsequent collection requests, and generate a collection strategy; and a collection execution module, configured to execute the collection strategy to access the target webpage and inject an operation sequence for simulating human interaction behavior during the collection process.
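
The matching and handling steps of claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the pattern table `PATTERNS`, the two example regular expressions, the function name `desensitize`, and the audit-log field names are all hypothetical, and a real system would derive its patterns from the loaded set of data compliance clauses rather than hard-code them.

```python
import re
from datetime import datetime, timezone

# Hypothetical sensitive-data patterns and handling actions. In the claimed
# method these would be extracted from the data compliance clause set of the
# target webpage's jurisdiction; the two entries here are only examples.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "mask"),
    "phone": (re.compile(r"\+?\d[\d\s-]{7,}\d"), "delete"),
}


def desensitize(record, patterns=PATTERNS):
    """Return (cleaned_record, audit_log) for a dict of text fields.

    Assumes every value in `record` is a string.
    """
    cleaned, audit_log = {}, []
    for field, value in record.items():
        new_value = value
        for name, (regex, action) in patterns.items():
            if not regex.search(new_value):
                continue  # this pattern does not occur in the field
            if action == "mask":
                # Keep the length, hide the characters.
                new_value = regex.sub(lambda m: "*" * len(m.group()), new_value)
            elif action == "replace":
                new_value = regex.sub(f"[{name}]", new_value)
            elif action == "delete":
                new_value = regex.sub("", new_value)
            else:
                raise ValueError(f"unknown action: {action}")
            # One audit entry per handled field and pattern: operation type,
            # target field, and timestamp.
            audit_log.append({
                "operation": action,
                "field": field,
                "pattern": name,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
        cleaned[field] = new_value
    return cleaned, audit_log
```

Calling `desensitize({"author": "Contact: a@b.io"})` masks the address and returns one audit entry naming the `author` field. Keeping the audit log as plain dictionaries makes it easy to serialize, which is a design choice of this sketch rather than something the claims prescribe.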

Description

Small language webpage self-adaptive collection method and device based on large language model

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a method and a device for adaptively collecting small-language webpages based on a large language model.

Background

As industries continue to expand overseas amid globalization, the commercial value of data from small-language markets such as Southeast Asia and the Middle East keeps growing, and web crawlers have become a core technical means of obtaining such data across regions, being widely used in scenarios such as market research and business analysis. At present, existing web crawlers generally capture webpage content using preset fixed positioning rules, their collection frequency is relatively fixed, and their operation behavior lacks flexibility.

In the small-language webpage collection scenario, the prior art has several shortcomings. Small-language webpages often present their core content through dynamic loading, and their document object model (DOM) structures are dynamic and diverse, so fixed positioning rules adapt poorly to them and dynamic content is captured incompletely. Meanwhile, fixed-frequency collection easily triggers the anti-crawling mechanism of the target website, and the lack of human-like operation behavior further increases the risk of IP blocking, which affects the continuity of collection. Therefore, the prior art cannot simultaneously achieve complete capture of the dynamic content of small-language webpages and a continuous, stable collection process, and it struggles to meet the practical needs of data collection for small-language markets.
The foregoing description is provided as general background information and does not necessarily constitute prior art.

Disclosure of Invention

The embodiments of the application provide a method and a device for adaptively collecting small-language webpages based on a large language model, which can achieve complete collection of the dynamic content of small-language webpages while reducing the risk of triggering anti-crawling mechanisms by dynamically adjusting the collection frequency and simulating human interaction behavior, thereby keeping the collection process continuous and stable.

In a first aspect, an embodiment of the present application provides a method for adaptively collecting small-language webpages based on a large language model, including: loading and rendering a target small-language webpage through a headless browser to obtain a corresponding document object model tree structure; parsing and semantically annotating the document object model tree structure based on a large language model, identifying dynamic content nodes, and generating corresponding adaptive content positioning rules; based on the adaptive content positioning rule, performing semantic vectorization on at least two consecutively collected webpage contents and calculating the corresponding semantic similarity; querying a preset frequency mapping rule according to the semantic similarity, dynamically adjusting the initiation frequency of subsequent collection requests, and generating a collection strategy; and executing the collection strategy to access the target webpage, and injecting an operation sequence for simulating human interaction behavior during the collection process.
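
The step from semantic similarity to collection frequency described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the disclosed implementation: the thresholds and intervals in `FREQUENCY_RULES`, the 20% jitter, and all function names are hypothetical, and plain lists stand in for the vectors that a semantic encoding model would produce. The sketch uses cosine similarity as the similarity measure, so a higher value means the page changed less between two collections.

```python
import math
import random

# Hypothetical frequency mapping rule: (minimum similarity, request interval
# in seconds), checked from the highest threshold down. The more similar two
# consecutive snapshots are, the less often the page is revisited.
FREQUENCY_RULES = [
    (0.90, 3600),  # nearly identical content: revisit hourly
    (0.50, 600),   # moderate change: revisit every 10 minutes
    (-1.0, 60),    # large change: revisit every minute
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, clamped to [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def request_interval(similarity, rules=FREQUENCY_RULES):
    """Look up the request interval for a similarity value."""
    for threshold, interval in rules:
        if similarity >= threshold:
            return interval
    raise ValueError(f"similarity {similarity} is below every threshold")


def jittered_schedule(interval, count, jitter=0.2, rng=None):
    """Return `count` request offsets (seconds from now) with random delays.

    Each gap is the base interval scaled by a random factor in
    [1 - jitter, 1 + jitter], so requests do not arrive at a fixed rhythm.
    """
    rng = rng or random.Random()
    offsets, elapsed = [], 0.0
    for _ in range(count):
        elapsed += interval * (1 + rng.uniform(-jitter, jitter))
        offsets.append(elapsed)
    return offsets
```

For example, two snapshots whose vectors are orthogonal have a similarity of 0.0, which the table above maps to a 60-second interval, and `jittered_schedule(60, 5)` then yields five request times spaced roughly one minute apart.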
Further, in some embodiments of the present application, parsing and semantically annotating the document object model tree structure based on the large language model, identifying dynamic content nodes, and generating corresponding adaptive content positioning rules includes: inputting the document object model tree structure into a pre-trained multilingual large language model; obtaining, through the large language model, structural descriptions and semantic annotation information of nodes in the document object model tree, wherein the semantic annotation information comprises at least node function categories and content language labels; identifying a target node related to dynamic content loading according to the structural descriptions and the semantic annotation information; and generating or updating an adaptive content positioning rule for locating the content of the target node based on the path information and the attribute features of the target node.

Further, in some embodiments of the present application, generating or updating the adaptive content positioning rule for locating the content of the target node includes: constructing an initial positioning path expression according to the path information of the target node; based on the node function category in the semantic annotation information, performing generalization processing or specialization processing on the initial positioning path expression to for