CN-122021832-A - Large model extraction knowledge graph construction method and device oriented to petroleum exploration and development field and electronic equipment
Abstract
The application provides a large model extraction knowledge graph construction method, device, and electronic equipment for the petroleum exploration and development field, implemented on an intelligent batch-processing architecture. The method comprises: performing semantic segmentation and a knowledge extraction flow comprising named entity recognition, event extraction, relationship reasoning, and entity standardization on documents in the petroleum exploration and development field, based on an initialized and loaded domain knowledge system conforming to petroleum exploration and development operation logic, to obtain standardized subgraphs; realizing entity standardization with a three-layer progressive relation normalization algorithm; merging all extracted and standardized subgraphs and loading them into a graph database to form a knowledge graph for the petroleum exploration and development field; and, throughout the knowledge graph construction process, recording input and output Token consumption and knowledge extraction reasoning paths through a log system. The application offers a new paradigm for intelligent knowledge graph construction that is technically advanced, economically feasible, process-controllable, and reliable in its results.
Inventors
- ZHOU CHANGBING
- GUO ZHIHANG
- ZHAO DENG
- WANG SIHUI
- LIU JIALONG
- PAN RUIXI
Assignees
- China University of Geosciences (Beijing)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-10
Claims (10)
- 1. A large model extraction knowledge graph construction method oriented to the field of petroleum exploration and development, characterized by comprising the following steps, completed in a multitasking parallel mode on an intelligent batch-processing architecture: based on an initialized and loaded domain knowledge system conforming to petroleum exploration and development operation logic, performing semantic segmentation and knowledge extraction processes on documents in the petroleum exploration and development domain to obtain standardized subgraphs, wherein the knowledge extraction processes comprise named entity recognition, event extraction, relationship reasoning, and entity standardization; merging all extracted and standardized subgraphs and loading them into a graph database to form a unified, queryable, and analyzable knowledge graph for the petroleum exploration and development field, and providing a visual query and analysis interface; and, throughout the knowledge graph construction process, recording input and output Token consumption and knowledge extraction reasoning paths through a log system.
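The multitasking parallel, batch-oriented flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `extract_subgraph` is a hypothetical stand-in for the per-document segmentation and extraction steps (which would call a large language model), the merge rule is a simple entity union, and the rate limiter is one plausible way to realize the throttling role of the intelligent batch-processing architecture.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Simple pacing limiter to stay under an LLM provider's API rate limit."""
    def __init__(self, max_calls_per_sec: float):
        self.interval = 1.0 / max_calls_per_sec
        self.lock = threading.Lock()
        self.next_time = 0.0

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_time - now)
            self.next_time = max(now, self.next_time) + self.interval
        if wait > 0:
            time.sleep(wait)

def extract_subgraph(doc: str) -> dict:
    """Hypothetical stand-in for segmentation + NER + event extraction +
    relation reasoning + normalization on one document."""
    return {"doc": doc, "entities": [doc.split()[0]], "relations": []}

def build_graph(documents, max_workers=4, calls_per_sec=10.0):
    """Process documents in parallel with rate limiting, then merge subgraphs."""
    limiter = RateLimiter(calls_per_sec)
    subgraphs = []

    def task(doc):
        limiter.acquire()  # throttle before each (mock) LLM call
        return extract_subgraph(doc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for sg in pool.map(task, documents):
            subgraphs.append(sg)
    # Merge step: union of all entities across per-document subgraphs.
    merged = {"entities": sorted({e for sg in subgraphs for e in sg["entities"]})}
    return merged
```

In a real deployment the merged structure would be loaded into a graph database rather than returned as a dictionary.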
- 2. The method of claim 1, wherein the domain knowledge system conforming to the logic of petroleum exploration and development operations comprises a plurality of definitions, including: concept types, comprising a cause classification, a solution classification, a development stage classification, an index change classification, a reservoir classification, a heterogeneity classification, a sedimentary facies classification, a structural unit classification, and well pattern, well, and production pattern classifications related to engineering parameters; entity types, comprising development index and parameter entities and dynamic analysis and decision entities; and entity relationships, comprising static structural relationships, index attribution relationships, and dynamic causal relationships.
- 3. The method of claim 1, further comprising, before the steps completed in the multitasking parallel mode on the intelligent batch-processing architecture, the steps of Schema relation extraction and mapping construction, relation list structured assembly, Embedding pre-calculation and index construction, and log and monitoring system initialization, wherein the Embedding pre-calculation is performed as a rapid parallel calculation through the intelligent batch-processing architecture.
- 4. The method of claim 1, wherein the semantic segmentation employs LLM-driven semantic segmentation to divide the text into knowledge blocks, each constituting a complete semantic unit.
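The LLM-driven segmentation above might be wrapped as in the following sketch. The `llm` callable (returning the character offsets at which knowledge blocks begin) is a hypothetical interface, not the patent's implementation; when no model is supplied, the sketch falls back to a plain paragraph split.

```python
def semantic_segment(text: str, llm=None):
    """Split text into knowledge blocks, each a complete semantic unit.

    `llm`, if given, maps the text to a list of character offsets where
    blocks begin (a hypothetical LLM interface); otherwise fall back to
    blank-line paragraph boundaries."""
    if llm is not None:
        starts = sorted(set([0] + list(llm(text))))
        bounds = starts + [len(text)]
        return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])
                if text[a:b].strip()]
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

A production version would prompt the model for boundaries that keep, for example, a cause-and-measure discussion of one well in a single block.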
- 5. The method of claim 1, wherein the three-layer progressive relation normalization algorithm sequentially applies, in order of priority from high to low, a rapid lexical matching strategy, an efficient semantic matching strategy, and a deep intelligent decision strategy, and if normalization by any strategy is unsuccessful, continues with the strategy of the next lower priority.
- 6. The method of claim 5, wherein the rapid lexical matching strategy comprises: cleaning the term to be normalized; exactly matching the cleaned term against a standard knowledge base defined in the domain knowledge system; if the match succeeds, performing normalization with the highest confidence; if the match fails, performing fuzzy matching by calculating similarity with the standard terms; and if fuzzy matching fails, passing the term to the efficient semantic matching strategy for normalization; the efficient semantic matching strategy comprises: calling a vectorization model to convert the term into a semantic vector; calculating the similarity between the semantic vector and the standard semantic vectors in the vectorization model; determining a designated number of candidate vectors with the highest similarity ranks; performing normalization based on the candidate vectors whose similarity exceeds a confidence threshold; and if no similarity exceeds the set confidence threshold, performing normalization with the deep intelligent decision strategy, which dynamically constructs a prompt comprising the term to be normalized, its context, and the candidate terms produced by the efficient semantic matching strategy, and submits the prompt to the large language model for a final decision on complex candidates requiring deep semantic understanding.
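The three-layer cascade of claims 5 and 6 can be sketched as follows. This is a minimal, self-contained illustration: the two-entry standard knowledge base, the toy two-dimensional embeddings, the thresholds, and the `embed`/`llm_decide` callables are all hypothetical stand-ins for the vectorization model and large language model the claims describe.

```python
import difflib
import math

# Toy standard knowledge base: term -> precomputed embedding (assumption).
STANDARD_TERMS = {"porosity": [1.0, 0.0], "permeability": [0.0, 1.0]}

def _clean(term):
    return term.strip().lower()

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def normalize(term, embed, llm_decide,
              fuzzy_cutoff=0.85, sem_threshold=0.9, top_k=3):
    """Three-layer progressive normalization: lexical -> semantic -> LLM."""
    t = _clean(term)
    # Layer 1a: exact match against the standard knowledge base.
    if t in STANDARD_TERMS:
        return t, "exact", 1.0
    # Layer 1b: fuzzy lexical match by string similarity.
    hits = difflib.get_close_matches(t, STANDARD_TERMS, n=1, cutoff=fuzzy_cutoff)
    if hits:
        return hits[0], "fuzzy", difflib.SequenceMatcher(None, t, hits[0]).ratio()
    # Layer 2: semantic match over precomputed embeddings, top-k candidates.
    vec = embed(t)
    ranked = sorted(((s, _cosine(vec, v)) for s, v in STANDARD_TERMS.items()),
                    key=lambda x: -x[1])[:top_k]
    if ranked and ranked[0][1] >= sem_threshold:
        return ranked[0][0], "semantic", ranked[0][1]
    # Layer 3: deep decision - hand the term plus candidates to the LLM.
    choice = llm_decide(term, [s for s, _ in ranked])
    return choice, "llm", None
```

Cheap layers resolve most terms, so the expensive LLM call is reserved for the genuinely ambiguous residue, which is the cost rationale behind the priority ordering.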
- 7. The method of claim 1, wherein the step of recording input and output Token consumption by the log system comprises: before calling the large language model, estimating the Token consumption of the forthcoming API call according to the task type and the length of the input text; after each call to the large language model, recording the numbers of input Tokens and output Tokens actually consumed by the call; and after all batches of tasks are completed, automatically generating a cost summary report that counts the total Token consumption of the construction task at each stage.
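The estimate-record-summarize loop of claim 7 might look like the following sketch. The 4-characters-per-Token heuristic, the class name, and the stage labels are assumptions for illustration, not the patent's implementation; real Token counts depend on the model's tokenizer.

```python
class TokenCostTracker:
    """Estimate, record, and summarize Token consumption per pipeline stage."""
    # Rough heuristic: ~4 characters per Token (an assumption; actual
    # tokenizer counts differ by model and language).
    CHARS_PER_TOKEN = 4

    def __init__(self):
        self.records = []

    def estimate(self, task_type: str, input_text: str) -> int:
        """Pre-call estimate from input length (task_type could scale this)."""
        return max(1, len(input_text) // self.CHARS_PER_TOKEN)

    def record(self, stage: str, input_tokens: int, output_tokens: int):
        """Post-call record of actual consumption reported by the API."""
        self.records.append({"stage": stage, "in": input_tokens, "out": output_tokens})

    def summary(self):
        """Cost summary report: per-stage and overall totals."""
        report = {}
        for r in self.records:
            s = report.setdefault(r["stage"], {"in": 0, "out": 0})
            s["in"] += r["in"]
            s["out"] += r["out"]
        report["total"] = {"in": sum(r["in"] for r in self.records),
                           "out": sum(r["out"] for r in self.records)}
        return report
```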
- 8. The method of claim 1, wherein the step of recording knowledge extraction reasoning paths by the log system comprises: when knowledge graph construction starts, initializing the log system, creating a unique log file named with a timestamp, and configuring both a console handler and a file handler; and during execution of the construction flow, recording each key knowledge decision process in a machine-readable format, wherein the knowledge decision process includes the complete path of the normalization flow.
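The log initialization of claim 8, a timestamped run-unique file plus console and file handlers with decisions written in a machine-readable format, can be approximated with Python's standard `logging` module. The function names and the JSON record shape below are illustrative assumptions.

```python
import json
import logging
import os
import tempfile
import time

def init_trace_logger(log_dir: str) -> logging.Logger:
    """Create a run-unique, timestamped log file and attach both a console
    handler and a file handler to one logger."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"kg_build_{time.strftime('%Y%m%d_%H%M%S')}.log")
    logger = logging.getLogger("kg_build")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler())                       # console handler
    logger.addHandler(logging.FileHandler(path, encoding="utf-8"))   # file handler
    logger.log_path = path  # remember where the trace lives (ad-hoc attribute)
    return logger

def log_decision(logger, term, strategy, target, confidence):
    """Record one normalization decision as a machine-readable JSON line,
    e.g. which strategy resolved the term and with what confidence."""
    logger.info(json.dumps({"term": term, "strategy": strategy,
                            "target": target, "confidence": confidence}))
```

Because each decision is one JSON line, the reasoning path behind any relationship in the finished graph can later be replayed from the file.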
- 9. A large model extraction knowledge graph construction device oriented to the field of petroleum exploration and development, characterized in that the device comprises: a knowledge extraction module, applying the intelligent batch-processing architecture and operating in a multitasking parallel mode, for performing semantic segmentation and knowledge extraction processes on documents in the petroleum exploration and development field based on an initialized and loaded domain knowledge system conforming to petroleum exploration and development operation logic to obtain standardized subgraphs, the knowledge extraction processes comprising named entity recognition, event extraction, relationship reasoning, and entity standardization; a graph construction module, for merging all extracted and standardized subgraphs and loading them into a graph database to form a unified, queryable, and analyzable knowledge graph for the petroleum exploration and development field, and providing a visual query and analysis interface; and a logging module, for recording input and output Token consumption and knowledge extraction reasoning paths through the log system throughout the knowledge graph construction process.
- 10. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 8.
Description
Large model extraction knowledge graph construction method and device oriented to petroleum exploration and development field and electronic equipment
Technical Field
The application relates to the technical field of knowledge graph construction, in particular to a large model extraction knowledge graph construction method and device, and electronic equipment, oriented to the field of petroleum exploration and development.
Background
In recent years, the advent of Large Language Models (LLMs) has provided unprecedented opportunities for automated processing of large amounts of unstructured text. Academia and industry have begun to explore using the powerful understanding capabilities of LLMs to automatically construct knowledge graphs. In this context, a batch of LLM-based knowledge graph construction frameworks has emerged, such as the prior art represented by the KAG (Knowledge Augmented Generation) framework of OpenSPG (open semantic-enhanced programmable graph). Frameworks such as KAG provide an advanced, automated knowledge extraction paradigm based on the constraints of a domain knowledge system (Schema). They allow a developer to first define a domain-specific knowledge hierarchy and then, using an LLM as a "knowledge extractor", automatically identify and extract from unstructured text the entities, attributes, and relationships that conform to the hierarchy definition. Compared with traditional NLP (natural language processing) techniques, which rely on large amounts of annotated data and complex model training, this approach has great advantages in flexibility, rapid deployment, and zero-shot extraction capability.
However, in practice it has been found that when the general, prototype-oriented knowledge extraction framework of KAG is applied directly to the industrial field of petroleum exploration and development, which is highly specialized, rigorous, and extremely demanding on cost and stability, it quickly proves a poor fit, specifically expressed as follows. 1. Risk of runaway cost and performance bottlenecks. Reports in the petroleum field are long and dense in content, and frameworks such as KAG typically adopt a simple asynchronous concurrent calling mode; when faced with a corpus of tens of thousands of oilfield documents, they generate massive numbers of API call requests. This very easily triggers the API rate limit (RateLimit) of the LLM service provider, causing a large number of task failures and retries, so that overall processing efficiency does not increase and stability cannot be ensured; meanwhile, the API calling cost can reach hundreds of thousands or even millions of yuan, which is completely unacceptable in an industry pursuing cost reduction and efficiency gains. 2. The "hallucination" problem and factual consistency. Without strict domain knowledge constraints, a generic LLM may "invent" facts that do not conform to geological laws or common engineering knowledge. For example, two geographically distant wells may be erroneously connected, or a formation age that does not exist may be generated. While KAG provides Schema constraints, its underlying prompt engineering design still makes it difficult to completely avoid extraction "hallucinations" arising under complex semantics, which is fatal to the petroleum industry, where rigor is required. 3. Lack of knowledge standardization capability. The petroleum field contains a large number of synonymous, near-synonymous, and hypernym-hyponym term relationships (such as near-synonymous variants of "porosity").
Existing frameworks such as KAG mostly rely on the implicit normalization capability of the LLM itself or on simple string matching. Against the large-scale, deep knowledge heterogeneity and inconsistency in this field, such approaches are poor in effect and high in cost, and lack a robust normalization mechanism that balances cost and effectiveness. 4. The process is a "black box" and the results are not trusted. When the KAG framework performs extraction, its internal decision process (such as why the LLM makes a given determination) is a "black box" for the user. When the LLM returns an extraction result, we are unaware of its reasoning process. If a critical "fault seal" relationship is extracted incorrectly, subsequent drilling decisions may be misled, resulting in economic losses of tens of millions of yuan. Existing research has focused on evaluating final extraction results but lacks records of the decision paths in the knowledge formation process. For example, by what means a relationship was normalized (exact match, fuzzy match, or semantic match) and with what confidence: this process information is critical to building a highly reliable knowledge graph. The lack of traceability and verifiability of the process makes it difficult for business professionals to trust and adopt the results.
Disclosure