
CN-121501984-B - Entity pair-guided scientific literature document level relation extraction method and system


Abstract

The invention provides an entity pair-guided method and system for document-level relation extraction from scientific literature. The method comprises: performing entity recognition on an input scientific literature document to obtain the entity set of the document; screening a candidate entity pair set from all possible entity pairs of the entity set using an entity pair pre-screening mechanism based on multi-sampling and similarity verification; generating, for each candidate entity pair, an enhanced relation description that fuses the corresponding entity type information with the relation semantics between the entities of the pair; retrieving, for each enhanced relation description, a fine-screened candidate relation set from a pre-constructed relation semantic knowledge base using a double-layer filtering mechanism; and guiding a large language model to perform triple fact judgment using the detailed semantic descriptions of the candidate relations to obtain the output result. The invention significantly reduces the computational cost of long-text processing while maintaining high-precision relation extraction, making document-level relation extraction from scientific literature more accurate and efficient.
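The pre-screening mechanism named in the abstract (steps S21 to S25 of claim 1) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `embed` is a toy character-trigram stand-in for a sentence embedding model, `sample_pairs` is a hypothetical callable standing in for one call to LLM(Ψ, D, E), and treating the nearest real pair as the "corresponding real entity pair" is an assumption of this sketch.

```python
import math
from collections import Counter
from itertools import permutations


def embed(text):
    """Toy stand-in for a sentence embedding model: character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prescreen(sample_pairs, entities, k_samples, threshold):
    """Union over K samplings of the proposed pairs that match a real pair.

    sample_pairs(i) is a hypothetical stand-in for one call LLM(Ψ, D, E);
    it returns the (head, tail) strings proposed in sampling i.
    """
    # All ordered pairs of distinct entities, embedded once.
    real = [(pair, embed(" | ".join(pair))) for pair in permutations(entities, 2)]
    candidates = set()
    if not real:
        return candidates
    for i in range(k_samples):
        for proposed in sample_pairs(i):
            u = embed(" | ".join(proposed))
            # Assumption: the "corresponding real pair" is the nearest one.
            best_pair, best_sim = max(
                ((pair, cosine(u, v)) for pair, v in real),
                key=lambda item: item[1],
            )
            if best_sim >= threshold:  # similarity verification (S24)
                candidates.add(best_pair)  # set union merges samplings (S25)
    return candidates


# Hypothetical data: two samplings, one of which proposes a spurious pair.
ENTITIES = ["graphene", "lithium-ion battery", "anode"]
SAMPLES = [
    [("Graphene", "anode")],
    [("graphene", "anode"), ("unicorn", "moon")],
]
result = prescreen(lambda i: SAMPLES[i], ENTITIES, k_samples=2, threshold=0.8)
print(result)
```

In this sketch the spurious pair is dropped because it resembles no real pair closely enough, and the pair proposed in both samplings appears once in the merged set. The threshold value 0.8 is illustrative only; the source does not state one.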

Inventors

  • YANG SHUAI
  • XU YUNDONG
  • LU YING
  • JIANG RUIZHE
  • TU QIONG
  • WANG YIMENG

Assignees

  • Chengdu Library and Information Center, Chinese Academy of Sciences (中国科学院成都文献情报中心)

Dates

Publication Date
2026-05-08
Application Date
2026-01-14

Claims (8)

  1. An entity pair-guided method for document-level relation extraction from scientific literature, characterized by comprising the following steps:
     Step S1, performing entity recognition on an input scientific literature document to obtain the entity set of the document;
     Step S2, screening a candidate entity pair set from all possible entity pairs of the entity set using an entity pair pre-screening mechanism based on multi-sampling and similarity verification;
     Step S3, for each entity pair in the candidate entity pair set, generating an enhanced relation description that fuses the corresponding entity type information with the relation semantics between the entities of the pair;
     Step S4, based on a pre-constructed relation semantic knowledge base, retrieving a fine-screened candidate relation set for each enhanced relation description using a double-layer filtering mechanism comprising coarse screening and fine screening;
     Step S5, for each candidate relation in the fine-screened candidate relation set, guiding a large language model to perform triple fact judgment using the detailed semantic description of the candidate relation, and, if the judgment is true, outputting a triple comprising the head entity, the tail entity and the relation type;
     wherein step S2 comprises:
     Step S21, defining an instruction template Ψ that guides a large language model to identify entity pairs that may hold a relation from the scientific literature document D and the entity set E;
     Step S22, using the large language model with the instruction template Ψ, performing K independent samplings on the document D and the entity set E, each sampling generating a subset Ω_i of entity pairs that may hold a relation: Ω_i = LLM(Ψ, D, E);
     Step S23, for each entity pair (E_p, E_q) in the subset Ω_i obtained by each sampling, calculating, using a sentence embedding model, the cosine similarity σ(u, v) between that entity pair and the corresponding real entity pair (E_p', E_q') in the entity set E;
     Step S24, filtering with a set similarity threshold to obtain a subset of valid entity pairs;
     Step S25, merging the valid entity pair subsets obtained from all K samplings to obtain the final candidate entity pair set;
     and wherein, in step S4, retrieving a fine-screened candidate relation set for each enhanced relation description using the double-layer filtering mechanism comprising coarse screening and fine screening comprises:
     encoding the enhanced relation description Γ*(E_p, E_q) of the entity pair to be queried into a query vector V_q, using the same sentence transformer model Encoder that was used to build the relation semantic knowledge base: V_q = Encoder(Γ*(E_p, E_q));
     calculating the cosine similarity σ(V_q, V_T) between the query vector V_q and each vector V_T in the relation semantic knowledge base K, ranking by σ(V_q, V_T), selecting the samples corresponding to the k most similar vectors, extracting the relation labels of these k samples, and de-duplicating them to form a coarse-screened candidate relation set R_coarse;
     converting each relation label in R_coarse into a natural-language declarative sentence about the entity pair (E_p, E_q), based on a predefined relation conversion template T_rel and entity prior knowledge P_know, to form an initial option set Q;
     inputting the scientific literature document D, the initial option set Q, the entity pair (E_p, E_q) and the fine-screening instruction Ψ_refine into a large language model, with "no relation" added as an independent option, to obtain an answer A(E_p, E_q): A(E_p, E_q) = LLM(Ψ_refine, D, Q, E_p, E_q);
     converting the answer A(E_p, E_q) into the final candidate relation set R_fine.
  2. The entity pair-guided method for document-level relation extraction from scientific literature according to claim 1, wherein, in step S4, the relation semantic knowledge base is pre-constructed by: obtaining from the training set T each entity pair annotated with a relation, together with its relation label Λ(·); generating an enhanced relation description Γ*(·) for each annotated entity pair; encoding each enhanced relation description into a fixed-dimension dense vector V(·) using a pre-trained sentence transformer model Encoder: V(·) = Encoder(Γ*(·)); and establishing the set of mappings that forms the relation semantic knowledge base K: K = {V(·) : Λ(·)}, where (·) stands for an annotated entity pair.
  3. The entity pair-guided method for document-level relation extraction from scientific literature according to claim 1, wherein step S3 comprises: Step S31, for any entity pair (E_p, E_q) in the candidate entity pair set, invoking a large language model to extract from the scientific literature document D entity description information, including type and key attributes, for the entity E_p and the entity E_q respectively: Φ(E_p) = LLM(E_p, D); Φ(E_q) = LLM(E_q, D); Step S32, invoking a large language model to summarize the potential relation semantics between the entities of the pair (E_p, E_q) based on the scientific literature document D, generating a relation description text Γ_(p,q): Γ_(p,q) = LLM(E_p, E_q, D); Step S33, concatenating the entity description information and the relation description text to generate the enhanced relation description Γ*(E_p, E_q): Γ*(E_p, E_q) = Φ(E_p) ⊕ Φ(E_q) ⊕ Γ_(p,q), where ⊕ denotes the text concatenation operation.
  4. The entity pair-guided method for document-level relation extraction from scientific literature according to claim 1, wherein step S5 comprises: Step S51, constructing for each candidate relation r a judgment instruction Ψ_r that explicitly requires the large language model to judge whether the triple (E_p, r, E_q) holds; Step S52, introducing for each candidate relation r a structured relation semantic description Δ_r, where Δ_r comprises at least the definition of the relation r, the type constraint on the entity E_p and the type constraint on the entity E_q; Step S53, submitting the judgment instruction Ψ_r, the relation semantic description Δ_r, and the test input x comprising the scientific literature document D and the entity pair (E_p, E_q) together to a large language model for binary judgment, obtaining an output y whose value is yes or no: y = LLM(Ψ_r, Δ_r, x); Step S54, if the output y is yes, outputting the relation triple (E_p, r, E_q).
  5. The entity pair-guided method for document-level relation extraction from scientific literature according to claim 1, wherein step S1 comprises: performing entity recognition on the input scientific literature document D to obtain the entity set E of the document, all possible entity pairs of which form the set Ω = {(E_i, E_j) | i ≠ j; i, j ∈ {1, ..., N}}, where N is the number of entities contained in the scientific literature document D, and E_i and E_j are the i-th and j-th entities, respectively.
  6. The entity pair-guided method for document-level relation extraction from scientific literature according to claim 1, wherein the large language model is optimized using a staged supervised fine-tuning strategy comprising: performing parameter-efficient fine-tuning with low-rank adaptation on three tasks, namely entity pair pre-screening, multiple-choice question-answering screening, and triple fact judgment.
  7. The entity pair-guided method for document-level relation extraction from scientific literature according to claim 6, wherein optimizing the large language model using the staged supervised fine-tuning strategy further comprises: in the inference stage, using different temperature parameters for different tasks.
  8. An entity pair-guided system for document-level relation extraction from scientific literature, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the entity pair-guided method for document-level relation extraction from scientific literature according to any one of claims 1 to 7.
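The knowledge base construction of claim 2 and the coarse-screening half of step S4 in claim 1 amount to a nearest-neighbour lookup followed by label de-duplication. The sketch below illustrates that idea under stated assumptions: `encoder` is a toy word-count stand-in for the sentence transformer Encoder, the training descriptions, relation labels and templates are invented for illustration, and the entity prior knowledge P_know is omitted.

```python
import math
from collections import Counter


def encoder(text):
    """Toy stand-in for the sentence transformer Encoder: word counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_knowledge_base(annotated):
    """Build K as (vector, relation label) entries from (description, label) pairs."""
    return [(encoder(description), label) for description, label in annotated]


def coarse_screen(query_description, knowledge_base, k):
    """Return the de-duplicated labels of the k entries most similar to the query."""
    v_q = encoder(query_description)
    ranked = sorted(
        knowledge_base, key=lambda entry: cosine(v_q, entry[0]), reverse=True
    )
    r_coarse = []
    for _, label in ranked[:k]:
        if label not in r_coarse:  # de-duplicate while keeping rank order
            r_coarse.append(label)
    return r_coarse


def build_options(r_coarse, templates, head, tail):
    """Turn labels into statements and append "no relation" as its own option."""
    options = [templates[label].format(head=head, tail=tail) for label in r_coarse]
    options.append("no relation")
    return options


# Hypothetical enhanced relation descriptions with relation labels.
TRAINING = [
    ("material graphene used as component anode", "used_in"),
    ("method annealing improves property conductivity", "improves"),
    ("material silicon used as component cathode", "used_in"),
    ("person author wrote document paper", "author_of"),
]
# Hypothetical relation conversion templates (a simplified T_rel).
TEMPLATES = {
    "used_in": "{head} is used in {tail}.",
    "improves": "{head} improves {tail}.",
    "author_of": "{head} is the author of {tail}.",
}

kb = build_knowledge_base(TRAINING)
r_coarse = coarse_screen("material graphite used as component electrode", kb, k=2)
options = build_options(r_coarse, TEMPLATES, "graphite", "electrode")
print(r_coarse, options)
```

Here the two nearest entries share one label, so de-duplication leaves a single coarse candidate. The resulting option list is what a fine-screening prompt would present to the large language model alongside the document. A linear scan is used for brevity; the source does not say how the similarity search is carried out.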

Description

Entity pair-guided scientific literature document level relation extraction method and system

Technical Field

The invention relates to the technical field of data analysis, and in particular to an entity pair-guided method and system for document-level relation extraction from scientific literature.

Background

Entity relation extraction from scientific literature is an important task in natural language processing and knowledge graph construction, with broad application value in fields such as scientific and technical intelligence analysis, literature knowledge mining, and academic recommendation systems. With the rapid development of scientific research, the number of scientific publications is growing exponentially, and automatically extracting structured entity relation information from this mass of literature has become both a focus and a difficulty of current research.

Traditional relation extraction methods mainly target sentence-level short texts and use rule-based, statistical, or deep learning approaches. However, scientific literature typically spans multiple paragraphs, and relations between entities often can only be identified by reasoning across sentences or even across paragraphs, which poses a significant challenge to conventional approaches.

In recent years, document-level relation extraction has become a research focus. Early document-level approaches were based mainly on graph neural networks (GNNs), which model dependencies between entities and sentences by building a graph structure over the document. For example, the EoG method (2019) adopts an edge-oriented graph structure, the GAIN method (2020) designs a double graph inference mechanism, and the SIRE method (2021) separates the intra-sentence and inter-sentence reasoning processes.
These methods improve the performance of document-level relation extraction to some extent, but problems such as the difficulty of predefining the graph structure and the insufficient capture of long-distance dependencies remain when processing complex scientific literature.

Subsequently, methods based on pre-trained language models emerged. For example, the ATLOP method (2021) employs adaptive thresholding and localized context pooling, the EIDER method (2022) introduces efficient evidence extraction and an inference-stage fusion mechanism, and the DREEAM method (2023) improves relation extraction with an evidence-guided attention mechanism. These approaches exploit the strong semantic understanding of the Transformer architecture and have made significant progress on document-level relation extraction, but they remain limited by the parameter scale and training data of the pre-trained model.

Recently, large language models (LLMs) have shown breakthrough capability in natural language processing, opening new research directions for document-level relation extraction. Existing relation extraction methods based on large language models fall mainly into two categories:

(1) Non-fine-tuning methods. For example, the PromptRE method (2023) combines prompting techniques with data programming and strengthens relation extraction through multiple weak supervision sources, and the DocGNRE method (2023) integrates a large language model with a natural language inference module to enhance document-level relation extraction datasets. These methods require no task-specific fine-tuning of the large language model, but their performance is often limited by prompt design and by the quality of the weakly supervised data.

(2) Fine-tuning methods.
The LMRC method (2024) adds the relation set and entity pair information to the prompt and improves relation extraction performance through fine-tuning. The AutoRE method (2024) proposes a three-stage processing paradigm: it first screens the entire document for possible relation types as candidate relations, then identifies head entities in the document, and finally performs triple fact extraction for the head entities and candidate relations.

The AutoRE method has three main characteristics. First, it adopts a document-level candidate relation screening strategy: candidate relations are screened on the basis of the whole document, with a large language model analyzing the document and then outputting a list of possible relation types. The advantage of this approach is that all potential relations in the document can be obtained at once. Second, it introduces a head entity recognition step: after the candidate relations are obtained, it identifies head entities that may be subjects of those relations, which helps narrow the search. Third, in the final stage, relation judgment is carried out on the head entity and the candidate relation, and the relation description is contained in the prompt to guide the large lan