CN-122003675-A - Method for extracting entities and relations from a corpus to populate a knowledge graph
Abstract
The present invention discloses a method for extracting entities and relationships from technical documents in a highly automated manner, achieving more complete and accurate results in less time. Several deep learning models are trained using a corpus from the field of interest labeled by domain experts and linguists. A graph vector model is also trained. Manual labeling and revision are kept to the minimum required to obtain models that can automatically extract entities and relationships from a corpus. Once trained, these models can be used for any corpus within the same knowledge domain.
Inventors
- Fabio Correa Cordero
- Maria Claudia de Freitas
- Christian Enrique Munoz Villalobo
- Lucas Monteiro Martinio
- Antonio Marcelo Azevedo Alexander
- Carlos Gilhelm Silva Tavares
- Evelyn Consejio Santos Batista
- Sofia de Abreu Lima Correa
- Patricia Ferreira Da Silva
- Max de Castro Rodriguez
- Alexander Tessarolo
- Diogo Da Silva Magahas Gomez
- Elvis Alves de Sosa
- Raphael Andrelo Rubo
- Vitol Alcantara Batista
- Leonardo Alfredo Frero Mendoza
- Renato Sayo Kristalino Da Rocha
- Tatiana Sosa de Orlando Cavalcanti
Assignees
- Petróleo Brasileiro S.A. (Petrobras)
- Pontifical Catholic University of Rio de Janeiro (PUC-Rio)
Dates
- Publication Date
- 20260508
- Application Date
- 20240905
- Priority Date
- 20230905
Claims (2)
- 1. A method for extracting entities and relationships from a corpus to populate a knowledge graph, comprising the steps of: 201, filling a base ontology (A1) with entities from a pick list (A2) to generate an incomplete knowledge graph (A3); 202, encoding nodes and edges of the incomplete knowledge graph (A3) by using a graph vectorization algorithm to generate a vector model (A4); 203, identifying entities in the corpus (A5) using a previously trained deep learning model (A6); 204, linking the identified entities in the corpus (A5) to vectors of the vector model (A4) using a previously trained deep learning model (A7), and clustering the identified entities in the corpus (A5) using another previously trained deep learning model (A8); 205, identifying relationships between entities in sentences having more than one entity using a previously trained deep learning model (A9); 206, populating the incomplete knowledge graph (A3) with new entities and relationships extracted from the corpus (A5) to form a populated knowledge graph (A10), and including Uniform Resource Identifiers (URIs) from the populated knowledge graph (A10) as sentence metadata (A11).
- 2. The method according to claim 1, further comprising the steps, prior to step 201, of: 101, selecting a corpus (B1); 102, manually labeling part of the documents in the corpus (B1) to obtain a golden corpus (B3); 103, training a machine learning model (B4) using the golden corpus (B3); 104, labeling another document set by means of the model (B4) to obtain a labeled corpus (B5); 105, developing, by a team of linguists, rules (B6) for labeling entities based on the corpus (B5) and the incomplete knowledge graph (A3); 106, checking the rules (B6) by a team of experts in the field of the corpus (B5) to obtain revised rules (B7); 107, repeating steps 105 and 106 until the revised rules (B7) are determined to cover all labels, and obtaining a golden corpus (B9) based on the revised rules; 108, training model (A6), model (A7) and model (A8) using the golden corpus (B9); 109, labeling the relationships between entities in the knowledge graph (A3) based on the corpus (B9) to obtain a data set (B11) for training the model (A9); 301, training (B12) a word vector model to obtain a trained word vector model (B13); 302, using the trained word vector model (B13) to initialize the training (B14) of the graph vectorization model (A4).
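The data flow of steps 203, 205 and 206 of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the trained models (A6) and (A9) are replaced by trivial string-matching stand-ins, the linking and clustering of step 204 is omitted, and the names `Sentence`, `identify_entities`, `extract_relations`, `populate` and the `http://example.org/kg/` namespace are all hypothetical.

```python
from dataclasses import dataclass, field

BASE_URI = "http://example.org/kg/"  # hypothetical namespace, not from the source


@dataclass
class Sentence:
    text: str
    metadata: dict = field(default_factory=dict)


def uri_for(name):
    """Build an illustrative URI for a graph node."""
    return BASE_URI + name.replace(" ", "_")


def identify_entities(sentence, known_names):
    """Stand-in for model (A6): known names found in the text, in text order."""
    found = [name for name in known_names if name in sentence.text]
    return sorted(found, key=sentence.text.index)


def extract_relations(sentence, entities):
    """Stand-in for model (A9): only sentences with more than one entity qualify."""
    if len(entities) < 2:
        return []
    if "located in" in sentence.text:
        return [(entities[0], "located_in", entities[1])]
    return []


def populate(graph, sentences, known_names):
    """Step 206: add extracted triples and record node URIs as sentence metadata."""
    for sentence in sentences:
        entities = identify_entities(sentence, known_names)
        for triple in extract_relations(sentence, entities):
            graph.add(triple)
        sentence.metadata["uris"] = [uri_for(e) for e in entities]
    return graph


# Illustrative corpus (A5); the sentences are made up for this sketch.
corpus = [
    Sentence("Marlim is located in the Campos Basin."),
    Sentence("Marlim produces oil."),
]
populated = populate(set(), corpus, ["Campos Basin", "Marlim"])
```

In this sketch the second sentence contributes no triple because it mentions a single entity, yet it still receives URI metadata for that entity.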
Description
Method for extracting entities and relations from a corpus to populate a knowledge graph
Technical Field
The invention relates to a method for populating a knowledge graph with information automatically extracted from a collection of technical documents. More specifically, the present invention discloses a method that chains supervised artificial intelligence training techniques to extract the largest and best possible set of entities and tags from a corpus.
Background
The extraction of information from large databases is the basis of many recent technological developments and has become an important research topic. Many companies are interested in collecting strategic information from their document databases. This is particularly relevant for the oil and gas industry, which holds vast databases of geoscience reports accumulated over decades of production.
The cost of developing knowledge representations or ontologies is so high that the problem is known as the "acquisition bottleneck". The most common method of building ontologies brings domain experts and ontology engineers together in a series of interview sessions so that they can describe the concepts, relationships, and attributes that make up a knowledge graph.
An ideal way to overcome the "acquisition bottleneck" is to use a computational algorithm developed for this purpose, which collects the necessary information directly from the data sources. This type of procedure may reduce both the need for expert intervention and the cost of developing the ontology.
Ontology machine learning (or "ontology learning") involves extracting knowledge representations from data sources rich in information about the field in question. The data sources may be relational databases, tabular structures, semi-structured databases (such as HTML pages and XML documents), or unstructured databases composed of text written in natural language.
Prior Art
Document US2023134798A1, entitled "Reasonable language model learning for text generation from a knowledge graph", discloses generating, by a processor, a reasonable language learning model for text data in a knowledge graph on a computing system. One or more data sources and one or more triples from the knowledge graph may be analyzed. Training data may be generated with one or more candidate tags associated with the one or more triples. One or more language models may be trained based on the training data. It is worth emphasizing that document US2023134798A1 uses the text and the knowledge graph to generate new text; no new triples are created.
Document US2019354544A1, entitled "Machine learning-based relationship association and related discovery and search engines", refers to systems and techniques for determining relationships between entities and their associated importance. The systems and techniques automatically identify supply chain relationships between companies based on unstructured text. The system combines a machine learning model, which identifies phrases (evidence) that mention a supply chain relationship between two companies, with an aggregation layer, which weighs the evidence found and assigns confidence scores to the relationships between the companies.
Disclosure of Invention
An automated method is disclosed for obtaining, in a shortened time frame, a knowledge graph (A10) populated with a large number of entities and relationships drawn from a corpus of interest (A5). The invention starts from an incomplete knowledge graph (A3), which is obtained from an ontology (A1) and a structured database (A2). The knowledge graph is encoded using a graph vectorization algorithm, generating a vector model (A4).
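By way of illustration only, the construction of the incomplete knowledge graph (A3) and its encoding into a vector model (A4) might be sketched in Python as follows. The ontology contents, the pick-list records and the function names are hypothetical, and the encoding shown (a row-normalized neighbourhood vector per node) is a deliberately simple stand-in for the graph vectorization algorithm, which the text does not specify.

```python
from collections import defaultdict

# Hypothetical base ontology (A1): classes and the relations allowed per class.
BASE_ONTOLOGY = {
    "classes": ["Basin", "Field", "Formation"],
    "relations": [("Field", "located_in", "Basin"),
                  ("Formation", "occurs_in", "Basin")],
}

# Hypothetical structured records (A2) naming known instances.
PICK_LIST = [
    {"name": "Campos Basin", "class": "Basin"},
    {"name": "Marlim", "class": "Field", "located_in": "Campos Basin"},
]


def build_incomplete_graph(ontology, pick_list):
    """Return the incomplete graph (A3) as a set of (subject, predicate, object)."""
    allowed = {(subj_class, pred) for subj_class, pred, _ in ontology["relations"]}
    triples = set()
    for record in pick_list:
        if record["class"] not in ontology["classes"]:
            continue  # skip instances whose class the ontology does not define
        triples.add((record["name"], "instance_of", record["class"]))
        for key, value in record.items():
            if (record["class"], key) in allowed:
                triples.add((record["name"], key, value))
    return triples


def neighbourhood_vectors(triples):
    """Toy vector model (A4): one row-normalized adjacency vector per node."""
    nodes = sorted({n for s, _, o in triples for n in (s, o)})
    index = {n: i for i, n in enumerate(nodes)}
    neighbours = defaultdict(set)
    for s, _, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)
    vectors = {}
    for node in nodes:
        row = [0.0] * len(nodes)
        for other in neighbours[node]:
            row[index[other]] = 1.0 / len(neighbours[node])
        vectors[node] = row
    return vectors


graph_a3 = build_incomplete_graph(BASE_ONTOLOGY, PICK_LIST)
vector_model_a4 = neighbourhood_vectors(graph_a3)
```

A learned embedding would place related nodes close together in a dense, low-dimensional space; the sparse vectors here only show where such a model sits in the data flow.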
At the same time, previously trained deep learning models are used to extract triples from the corpus (A5): a model (A6) for identifying entities, models (A7 and A8) for linking entities, and a model (A9) for extracting relationships. Finally, the incomplete knowledge graph (A3) is filled with the triples, forming a rich knowledge graph (A10). In addition, the URIs (Uniform Resource Identifiers) of the knowledge graph (A10) are included as metadata (A11) of the sentences. To train the deep learning models, a corpus (B1) from the field of interest is used, parts of which are annotated by a team of experts (B3, B9, B11).
Compared with the prior art, the method disclosed herein has at least the following advantages:
- the method is aimed at processing technical documents, mainly from fields of interest to the oil, gas and energy industries;
- most of the steps can be automated;
- more entities and relationships are obtained, with higher precision;
- the time required to obtain the complete knowledge graph (A10) is much shorter.
The main advantage of the developed approach over the prior art is the linking of models A4, A7 and A8: the approach does not attempt to classify instances into a predefined list, but rather trains a specific model that predicts t