CN-121996717-A - Data processing method based on prompt engineering and knowledge graph

Abstract

The invention relates to the technical field of data processing and discloses a data processing method based on prompt engineering and a knowledge graph. The method comprises: preprocessing multi-source data to obtain structured data; driving a large model with domain-specific prompts to extract knowledge triples carrying confidence and metadata; constructing a domain knowledge graph through quality evaluation, conflict resolution and human-machine collaborative verification; when responding to a task, identifying target elements with the large model and generating query statements; searching the knowledge graph to obtain query results together with their basis; and finally generating a traceable and interpretable processing report. The invention can improve the automation efficiency and reliability of data processing and ensure the interpretability and traceability of the processing results.

Inventors

ZENG TIJIAN
HE YASHAN
ZHANG YUJI
XIAO JIAN
GU TIANXING
XIE ZHIQI
DU ZEXIN
SU QIAN
LUO YU

Assignees

贵州乌江水电开发有限责任公司 (Guizhou Wujiang Hydropower Development Co., Ltd.)

Dates

Publication Date: 20260508
Application Date: 20251215

Claims (10)

1. A data processing method based on prompt engineering and a knowledge graph, characterized by comprising the following steps: performing a preprocessing operation on multi-source heterogeneous original data to generate standardized structured data; extracting, from the structured data, domain knowledge triples comprising confidence, source metadata and original text positions by means of a multi-modal large language model driven by domain-specific prompt engineering; performing quality evaluation and conflict detection on the domain knowledge triples, and automatically resolving conflicting knowledge according to a preset priority rule to obtain a candidate knowledge set; in response to a data processing task, identifying target elements in the data to be processed through the multi-modal large language model, and generating a structured query statement; querying the domain knowledge graph based on the structured query statement to obtain a query data packet, wherein the query data packet comprises target knowledge and its corresponding metadata; and performing target processing on the data to be processed based on the query data packet, and generating a processing result report accompanied by references to the original basis.
2. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein performing the preprocessing operation on the multi-source heterogeneous original data to generate standardized structured data comprises: acquiring multi-source heterogeneous original data in different formats, and establishing a mapping relation between the unique identifier of each item of original data and its data format type; extracting the text content of scanned data by optical character recognition, and identifying and determining, by a layout analysis algorithm, the boundary information of hierarchical titles, body text, table areas and formula areas in the data; and segmenting the extracted content hierarchically according to the hierarchical titles, aligning and classifying the segmented content according to semantics, performing row-column parsing on the table areas and converting them into a key-value pair format, and removing redundant information to generate the standardized structured data.
3. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein extracting, from the structured data, the domain knowledge triples comprising confidence, source metadata and original text positions by means of the multi-modal large language model driven by domain-specific prompt engineering comprises: constructing a domain-specific prompt template, wherein the prompt template comprises a domain-specific term dictionary, a definition of a mandatory output format for knowledge triples, and a chain-of-thought reasoning guide statement, the domain-specific term dictionary contains explanations of ambiguous terms, and the chain-of-thought reasoning guide statement instructs the multi-modal large language model to analyze the logical associations in the data step by step; splitting the structured data into data fragments by hierarchy, concatenating each data fragment with the prompt template one by one and inputting the result into the multi-modal large language model, and driving the multi-modal large language model to identify core entities and logical relations in the data through chain-of-thought reasoning to generate preliminary knowledge triples, wherein the core entities comprise a subject and an object; and associating the source identifier, hierarchy number and position information corresponding to each preliminary knowledge triple to complete the source metadata, and binding the confidence output by the multi-modal large language model with the preliminary knowledge triple, the source metadata and the original text position to form the domain knowledge triple.
4. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein performing quality evaluation and conflict detection on the domain knowledge triples and automatically resolving conflicting knowledge according to the preset priority rule to obtain the candidate knowledge set comprises: setting a preset confidence threshold, removing domain knowledge triples whose confidence is lower than the preset confidence threshold, and retaining high-confidence knowledge triples; comparing the high-confidence knowledge triples by a knowledge matching algorithm, and determining that a knowledge conflict exists between two high-confidence knowledge triples when their subjects and relations are identical and their objects differ; and invoking the preset priority rule to process the high-confidence knowledge triples in conflict, marking the knowledge triples that conform to the preset priority rule as candidate effective knowledge, marking the remaining conflicting knowledge triples as knowledge to be confirmed by an expert, and combining the candidate effective knowledge and the knowledge to be confirmed by an expert to obtain the candidate knowledge set.
5. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein auditing and confirming the candidate knowledge set through a human-machine collaborative verification mechanism to construct the domain knowledge graph comprises: analyzing, based on an active learning mechanism, the historical verification records of domain experts, identifying knowledge domains in which the error rate of extraction by the multi-modal large language model is higher than a preset error threshold, and preferentially pushing candidate knowledge in those knowledge domains to an expert verification interface; clustering knowledge triples in the candidate knowledge set that have the same source and the same relation type to generate batch verification tasks, and providing an operation entry on the expert verification interface; and acquiring, through the expert verification interface, operation data by which domain experts confirm or adjust the candidate effective knowledge, and, based on the operation data, writing the knowledge triples that pass verification into the domain knowledge graph and recording the effective time, revocation time and version number of each knowledge triple.
6. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein, in response to the data processing task, identifying target elements in the data to be processed through the multi-modal large language model and generating the structured query statement comprises: in response to the data processing task, inputting the data to be processed into the multi-modal large language model, and identifying the target elements in the data to be processed through the multi-modal large language model, the target elements comprising a core object, object parameters and association relations; and automatically generating, through the multi-modal large language model and according to the target elements, a structured query statement conforming to the query algorithm of the domain knowledge graph, wherein the structured query statement comprises a subject to be queried and a relation to be queried.
7. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein querying the domain knowledge graph based on the structured query statement to obtain the query data packet, the query data packet comprising target knowledge and its corresponding metadata, comprises: after the domain knowledge graph receives the structured query statement, matching the target knowledge that is currently in an effective state; and extracting the metadata corresponding to the target knowledge in the effective state, and packaging the target knowledge and the corresponding metadata to form the query data packet, wherein the metadata comprises an original data name, a standard number and a specific clause number.
8. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein performing target processing on the data to be processed based on the query data packet and generating the processing result report accompanied by references to the original basis comprises: extracting the standard values or rule requirements of the target knowledge from the query data packet, comparing the actual values in the data to be processed against the standard values or rule requirements, and generating corresponding processing results; and integrating the processing results, the actual values, the standard values or rule requirements and the corresponding metadata to generate the processing result report, wherein the processing result report comprises conclusions, problem descriptions, actual values, standard values or rule requirements, and original basis reference information.
9. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein the domain knowledge graph adopts a three-level hierarchical architecture comprising an ontology layer, a data layer and a service layer; the ontology layer is used for defining domain core concepts and their association relations to construct a domain knowledge system; the data layer is used for storing knowledge triples and version information in a time-series database and supporting multidimensional retrieval; and the service layer is used for providing knowledge query and incremental update services through a standardized interface to realize integration with a data processing system.
10. The data processing method based on prompt engineering and knowledge graph according to claim 1, wherein the multi-source heterogeneous original data comprises at least one of document data, image data and text data, and the data processing task comprises at least one of compliance review, information query and problem diagnosis.
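
The filtering and conflict-handling steps of claim 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Triple` fields, the 0.8 threshold, the `SOURCE_PRIORITY` table (national over industry over enterprise standards) and the treatment of non-conflicting triples as candidate effective knowledge are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    confidence: float   # confidence reported by the model
    source_type: str    # part of the source metadata (hypothetical field)
    position: str       # original text position


# Hypothetical priority rule: a lower rank wins a conflict.
SOURCE_PRIORITY = {"national": 0, "industry": 1, "enterprise": 2}


def source_rank(triple: Triple) -> int:
    # Unknown source types rank last.
    return SOURCE_PRIORITY.get(triple.source_type, len(SOURCE_PRIORITY))


def build_candidate_set(triples, threshold=0.8):
    """Return (triple, status) pairs for the candidate knowledge set."""
    # Step 1: drop triples whose confidence is below the threshold.
    kept = [t for t in triples if t.confidence >= threshold]

    # Step 2: group by (subject, relation); differing objects mean a conflict.
    groups = defaultdict(list)
    for t in kept:
        groups[(t.subject, t.relation)].append(t)

    candidates = []
    for group in groups.values():
        if len({t.obj for t in group}) == 1:
            # No conflict: assumed here to be candidate effective knowledge.
            candidates.extend((t, "candidate_effective") for t in group)
            continue
        # Step 3: the priority rule picks one triple; the rest go to experts.
        winner = min(group, key=source_rank)
        for t in group:
            status = "candidate_effective" if t is winner else "expert_confirmation"
            candidates.append((t, status))
    return candidates
```

Grouping by `(subject, relation)` keeps conflict detection linear in the number of triples instead of comparing every pair. When two conflicting triples share the same rank, `min` keeps the first one encountered, so a real priority rule would need an explicit tie-breaker.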
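
The effective-state lookup of claim 7 and the comparison report of claim 8 might look like the sketch below. It assumes, purely for illustration, that the knowledge graph is an in-memory list, that each record carries the effective time, revocation time and version number mentioned in claim 5, and that the rule being checked is a numeric upper limit. The record contents (`Example Standard`, `STD-0001`) are hypothetical placeholders, not real standards.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class KnowledgeRecord:
    subject: str
    relation: str
    value: float                 # standard value (an upper limit in this sketch)
    document_name: str           # metadata: original data name
    standard_number: str         # metadata: standard number
    clause_number: str           # metadata: specific clause number
    effective_from: date
    revoked_on: Optional[date] = None
    version: int = 1


# Hypothetical sample data: version 2 supersedes version 1 on 2025-01-01.
SAMPLE_GRAPH = [
    KnowledgeRecord("oil temperature", "upper_limit", 90.0, "Example Standard",
                    "STD-0001", "4.2", date(2020, 1, 1), date(2025, 1, 1), 1),
    KnowledgeRecord("oil temperature", "upper_limit", 85.0, "Example Standard",
                    "STD-0001", "4.2", date(2025, 1, 1), None, 2),
]


def query_effective(graph, subject, relation, today):
    """Return the records matching the query that are in force on `today`."""
    return [
        r for r in graph
        if r.subject == subject
        and r.relation == relation
        and r.effective_from <= today
        and (r.revoked_on is None or today < r.revoked_on)
    ]


def build_report(actual_value, record):
    """Compare an actual value with the standard value and cite the basis."""
    compliant = actual_value <= record.value
    problem = "" if compliant else (
        f"actual value {actual_value} exceeds limit {record.value}"
    )
    return {
        "conclusion": "compliant" if compliant else "non-compliant",
        "problem": problem,
        "actual_value": actual_value,
        "standard_value": record.value,
        "basis": (f"{record.document_name}, {record.standard_number}, "
                  f"clause {record.clause_number}"),
    }
```

For example, `query_effective(SAMPLE_GRAPH, "oil temperature", "upper_limit", date(2025, 6, 1))` returns only the version 2 record, and passing that record to `build_report` with a measured value yields a report whose `basis` field points back to the source document and clause.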

Description

Data processing method based on prompt engineering and knowledge graph

Technical Field

The invention relates to the technical field of data processing, in particular to a data processing method based on prompt engineering and a knowledge graph.

Background

In modern data processing scenarios, the demand for processing multi-source heterogeneous data is growing, and such data spans many fields, including healthcare, electric power and finance. Traditional data processing relies heavily on the knowledge and experience of professionals, who manually consult large volumes of original data and standard documents to complete processing tasks. This manual approach is inefficient, susceptible to human factors, and makes it difficult to ensure consistent results.

As data volumes continue to grow and the related specifications become more complex, the original data comes from diverse sources, in different formats, and is updated frequently. Manual processing struggles to keep pace with updates to the data and specifications, so processing is easily based on outdated material. Meanwhile, conflicts may exist among specifications or data from different sources and versions, and their priority is difficult to judge quickly and accurately by hand, which easily leads to erroneous results. In addition, manual processing rarely leaves a complete record for tracing the basis of a decision; once a processing result is disputed, its basis is difficult to trace back, which undermines the credibility of the result.

Therefore, existing data processing technology suffers from low efficiency and insufficient reliability, and cannot meet the high-quality data processing requirements of various industries.
Disclosure of Invention

The embodiment of the invention provides a data processing method based on prompt engineering and a knowledge graph, which can improve the automation efficiency and reliability of data processing, ensure the interpretability and traceability of processing results, and meet the data processing requirements of various industries. The method comprises the following steps:

S1, preprocessing multi-source heterogeneous raw data to generate standardized structured data;

S2, extracting, from the structured data, domain knowledge triples comprising confidence, source metadata and original text positions by means of a multi-modal large language model driven by domain-specific prompt engineering;

S3, performing quality evaluation and conflict detection on the domain knowledge triples, and automatically resolving conflicting knowledge according to a preset priority rule to obtain a candidate knowledge set;

S4, auditing and confirming the candidate knowledge set through a human-machine collaborative verification mechanism to construct a domain knowledge graph;

S5, in response to a data processing task, identifying target elements in the data to be processed through the multi-modal large language model and generating a structured query statement;

S6, obtaining a query data packet based on the structured query statement, performing target processing on the data to be processed according to the query data packet, and generating a processing result report accompanied by references to the original basis.

Further, in some embodiments of the present invention, step S1, "preprocessing multi-source heterogeneous raw data to generate standardized structured data", specifically includes:

S11. Multi-source heterogeneous raw data in different formats is acquired, and a mapping relationship between the unique identifier of each item of raw data and its data format type is established. The raw data may include document data such as national standards, industry standards and enterprise standards, as well as scanned images, sensor-acquired data and the like, in formats such as PDF, Word, HTML, image or text. To enable efficient management and processing, a unique identifier is allocated to each item of raw data, and a mapping between the identifier and the data format type is established, so that the corresponding processing method can be quickly identified and invoked in subsequent processing. In practical application, the multi-source heterogeneous data can be stored and managed through a document management system, with a universally unique identifier (UUID) generated for each item and information such as data format, source and version associated with it through metadata management, thereby forming a complete data information base.

S12. Extracting the text content of scanned data by adopting an optical character recognition technology, recognizing and determining boun