
CN-121980098-A - Data perception method and system embedded with large language model prompt word optimizer

CN 121980098 A

Abstract

The invention discloses a data perception method and system embedded with a large language model prompt word optimizer, belonging to the technical field of data acquisition in large language model applications. The method acquires a result from the initial intention through a first neural network crawler Agent embedded with a prompt word optimizer and inputs the result into the optimizer; calculates scores based on a multidimensional evaluation system and judges them against preset thresholds; iteratively optimizes the prompt word through an adaptive strategy selection algorithm until it meets the standard; splices the dynamic URL list corresponding to the optimized result with a pre-stored static list to generate a set to be processed; traverses the set through a second neural network crawler Agent without the embedded optimizer, acquiring target webpage content and storing it in a standardized format; and updates the optimization strategy selection results to a rule base. The invention realizes automatic evaluation and optimization of prompt words and improves the quality, integrity, and efficiency of data acquisition.

Inventors

  • LI ZHENG
  • QIU HUI
  • FANG YUJUAN

Assignees

  • Tsinghua University (清华大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-10

Claims (9)

  1. A data perception method embedded with a large language model prompt word optimizer, comprising: S1, inputting an initial prompt word according to a data acquisition intention, acquiring an initial result through a first neural network crawler Agent which is based on a large language model and embedded with a prompt word optimizer, and inputting the initial result into the prompt word optimizer; S2, calculating a relevance score, an accuracy score, an integrity score, and an efficiency score based on a multidimensional evaluation system of the prompt word optimizer, judging whether optimization is required according to a preset threshold, and if so, iteratively optimizing the prompt word through an adaptive strategy selection algorithm until the evaluation result meets a preset standard; S3, splicing the dynamic URL list corresponding to the optimized prompt word with a static URL list pre-stored in network space to generate a URL set to be processed; and S4, traversing the URL set to be processed through a second neural network crawler Agent without the prompt word optimizer embedded therein, acquiring target webpage content, storing it in a standardized format, and updating the strategy selection results of the optimization process to a rule base.
  2. The method of claim 1, wherein inputting the initial prompt word according to the data acquisition intention, acquiring the initial result through a first neural network crawler Agent based on the large language model and embedded with the prompt word optimizer, and inputting the initial result into the prompt word optimizer comprises: S11, the initial prompt word comprises a specific field limiting word and a time range limiting word; S12, when calling through an API, setting an 'Authorization: Bearer <token>' header and an 'X-Respond-With: no-content' parameter in the request head to control the response format.
  3. The method of claim 1, wherein calculating the relevance score, the accuracy score, the integrity score, and the efficiency score based on the multidimensional evaluation system of the prompt word optimizer, judging whether optimization is required according to the preset threshold, and if so, iteratively optimizing the prompt word through the adaptive strategy selection algorithm until the evaluation result meets the preset standard comprises: S21, the relevance score RS is calculated as RS = 0.4×KWM + 0.3×ITM + 0.3×CTX, where KWM is the ratio of keyword occurrences in the response to the total number of expected keywords, ITM is an intent matching score based on a five-level scale, and CTX is 1 minus the ratio of the number of context deviation points to the total number of response paragraphs; the accuracy score AS is calculated as AS = 0.5×FC + 0.3×CC + 0.2×LC, where FC is the factual accuracy rate, CC is the calculation accuracy rate, and LC is the logical consistency; the integrity score CS is calculated as CS = 0.4×RR + 0.4×CR + 0.2×DR, where RR is the required response rate, CR is the coverage rate, and DR is the depth ratio; the efficiency score ES is calculated as ES = 0.5×TE + 0.5×LE, where TE is the time efficiency and LE is the length efficiency; S22, the adaptive strategy selection algorithm optimizes the prompt word by calculating four optimization strategy scores (an instruction definition strategy score IC_SCORE, a context supplement strategy score CS_SCORE, a constraint reinforcement strategy score CE_SCORE, and a structure recombination strategy score SR_SCORE) and selecting the corresponding optimization strategy according to the relative magnitude of each strategy score and preset conditions.
  4. The method of claim 1, wherein traversing the URL set to be processed through the second neural network crawler Agent without the embedded prompt word optimizer, acquiring the target webpage content and storing it in a standardized format, and updating the strategy selection results of the optimization process to the rule base comprises: S41, the standardized format is a markdown format comprising a title, a text, and metadata fields; S42, a unique identification number is generated by parsing the keywords in the URL and joining them with underscores.
  5. The method as recited in claim 1, further comprising: S5, performing time-sequence monitoring on a specific target link in the static URL list, dynamically replacing a keyword part of the URL according to a preset update rule, and generating an updated static URL list; and S6, performing secondary splicing of the updated static URL list and the dynamic URL list to form a mixed URL set containing both historical data and the latest data.
  6. A data perception device embedded with a large language model prompt word optimizer, comprising: an intention analysis module for inputting an initial prompt word according to a data acquisition intention, acquiring an initial result through a first neural network crawler Agent which is based on a large language model and embedded with a prompt word optimizer, and inputting the initial result into the prompt word optimizer; an evaluation calculation module for calculating a relevance score, an accuracy score, an integrity score, and an efficiency score based on a multidimensional evaluation system of the prompt word optimizer, judging whether optimization is required according to a preset threshold, and if so, iteratively optimizing the prompt word through an adaptive strategy selection algorithm until the evaluation result meets a preset standard; a data source aggregation module for splicing the dynamic URL list corresponding to the optimized prompt word with a static URL list pre-stored in network space to generate a URL set to be processed; and a large language model acquisition and storage module for traversing the URL set to be processed through a second neural network crawler Agent without the prompt word optimizer embedded therein, acquiring target webpage content, storing it in a standardized format, and updating the strategy selection results of the optimization process to a rule base.
  7. The apparatus as recited in claim 6, further comprising: a time-sequence monitoring module for performing time-sequence monitoring on a specific target link in the static URL list, dynamically replacing a keyword part of the URL according to a preset update rule, and generating an updated static URL list; and a secondary splicing module for performing secondary splicing of the updated static URL list and the dynamic URL list to form a mixed URL set containing both historical data and the latest data.
  8. An electronic device, comprising: a processor, wherein the processor executes instructions to implement the data perception method embedded with a large language model prompt word optimizer according to any one of claims 1 to 5.
  9. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the data perception method embedded with a large language model prompt word optimizer according to any one of claims 1 to 5.
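The request-head configuration recited in claim 2 (S12) can be illustrated with a minimal Python sketch. The endpoint URL and token value below are hypothetical placeholders; only the 'Authorization: Bearer <token>' header and the 'X-Respond-With: no-content' parameter come from the claim:

```python
import urllib.request

def build_request(url, token):
    """Build an API request whose head carries the parameters from claim 2 (S12).

    The url and token arguments are hypothetical placeholders; the two
    header fields follow the claim.
    """
    head = {
        "Authorization": f"Bearer {token}",  # API authentication
        "X-Respond-With": "no-content",      # controls the response format
    }
    return urllib.request.Request(url, headers=head)
```

In this sketch the request is only constructed, not sent; an actual call would pass it to `urllib.request.urlopen`.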
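The multidimensional evaluation of claim 3 can be sketched as follows. The weights and formulas are taken directly from the claim; the input metrics (KWM, ITM, CTX, and so on) are assumed to be pre-computed and normalized to [0, 1], and the threshold values are illustrative, not specified by the patent:

```python
# Sketch of the multidimensional evaluation system in claim 3 (S21).
# All input metrics are assumed normalized to [0, 1].

def relevance_score(kwm, itm, ctx):
    """RS = 0.4*KWM + 0.3*ITM + 0.3*CTX"""
    return 0.4 * kwm + 0.3 * itm + 0.3 * ctx

def accuracy_score(fc, cc, lc):
    """AS = 0.5*FC + 0.3*CC + 0.2*LC"""
    return 0.5 * fc + 0.3 * cc + 0.2 * lc

def integrity_score(rr, cr, dr):
    """CS = 0.4*RR + 0.4*CR + 0.2*DR"""
    return 0.4 * rr + 0.4 * cr + 0.2 * dr

def efficiency_score(te, le):
    """ES = 0.5*TE + 0.5*LE"""
    return 0.5 * te + 0.5 * le

def needs_optimization(scores, thresholds):
    """Judge whether optimization is required: any score below its
    preset threshold triggers another optimization iteration."""
    return any(scores[name] < thresholds[name] for name in thresholds)
```

An iteration loop would recompute these scores after each prompt-word rewrite and stop once `needs_optimization` returns False.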
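The standardized storage of claim 4 can likewise be sketched. The title/text/metadata markdown layout and the underscore-joined identifier follow S41 and S42; the exact keyword-parsing rule for the URL is not specified in the claim, so the delimiter set below is an assumption:

```python
import re

def make_identifier(url):
    """Generate a unique identification number per claim 4 (S42):
    parse keyword tokens out of the URL and join them with underscores.
    The delimiter set used for tokenizing is an assumed choice."""
    path = url.split("://", 1)[-1]
    keywords = [t for t in re.split(r"[/.\-?=&]+", path) if t]
    return "_".join(keywords)

def to_markdown(title, text, metadata):
    """Store target webpage content in the standardized markdown format
    of claim 4 (S41): a title, the text, and metadata fields."""
    meta_lines = "\n".join(f"- {k}: {v}" for k, v in metadata.items())
    return f"# {title}\n\n{text}\n\n## Metadata\n\n{meta_lines}\n"
```

For example, `make_identifier("https://example.com/news/ai-report")` yields `example_com_news_ai_report`.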

Description

Data perception method and system embedded with large language model prompt word optimizer

Technical Field

The present invention relates to the field of data acquisition technologies in large language model applications, and in particular to a data perception method, system, device, and storage medium embedding a large language model prompt word optimizer.

Background

With the rise of large language model technology, especially in vertical-field applications, combining large models with RAG technology can significantly improve the intelligence level and decision efficiency of industry-specific tasks, making it urgent to quickly construct domain knowledge bases for high-efficiency RAG. Besides existing local static data, the large volume of real-time data in network space is of great value; however, acquiring it quickly and intelligently still suffers from efficiency problems. Traditional data acquisition approaches are varied, such as identification and extraction based on predefined rules and patterns, or operation of a simulated browser environment, but they are neither efficient nor flexible enough, and struggle to meet the demands of complex network environments and artificial intelligence knowledge bases. Moreover, in large language model applications, prompt words directly influence the model's understanding and output quality; accurate, clear, and specific prompt words guide the model to capture user intent more precisely and to generate more relevant, higher-quality content. It is therefore necessary to optimize the prompt words. On the one hand, this improves system response speed and efficiency, reduces misunderstanding and repeated work, and improves user experience and satisfaction; on the other hand, it helps reduce model deviation and error and ensures the reliability and consistency of the generated content.
However, effectively optimizing the prompt words to improve system performance remains an open problem. Although large language models have strong semantic understanding capability, using them directly for data perception in an open network environment faces two major challenges: first, the prompt word structure depends heavily on expert experience and its effect is unstable, so the model's retrieval intent may deviate from the user's real requirement; second, network data is extremely noisy, and the raw results generated by the large model (such as a URL list) lack a quantitative evaluation and closed-loop optimization mechanism, so using them directly as a data source can degrade the quality of the downstream RAG knowledge base. How to construct an intelligent data perception framework that is deeply adapted to large model capabilities and capable of self-optimization has therefore become a technical problem to be solved in this field. Traditional data acquisition methods (such as rule-based pattern matching and static crawlers) are syntactic in nature: they lack semantic understanding of users' deeper intent, cannot generate high-quality prompt words or structured data sources that meet the complex reasoning requirements of a large language model, and are difficult to apply directly to intelligent applications based on large language models. It is therefore important to propose an efficient data acquisition method that facilitates building an artificial intelligence knowledge base.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the invention provides a data perception method embedding a large language model prompt word optimizer.
The method obtains a result from the initial intention through a neural network crawler Agent embedded with an optimizer, performs multidimensional evaluation, and iteratively optimizes the prompt words through an adaptive strategy until they meet the standard; the dynamic links corresponding to the optimized prompt words are spliced with the static list into a set to be processed, which another crawler Agent traverses to obtain the target content and store it in a standardized way, and the optimization strategy is updated to a rule base. This realizes automatic optimization of the prompt words and the data acquisition flow, and improves the accuracy and integrity of data acquisition.

Another object of the present invention is to provide a data perception device embedded with a large language model prompt word optimizer. A third object of the invention is to propose a computer device. A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.

In order to achieve the above object, an aspect of the present invention provides a data perception method for embedding