CN-122019743-A - Knowledge extraction method, device, equipment and storage medium
Abstract
The application provides a method for extracting knowledge in the network operation and maintenance domain based on a generative artificial intelligence model. The method uses rare Chinese characters as auxiliary symbols and combines them with a relation word list to form a new input sequence; selects a pre-trained generative artificial intelligence model and fine-tunes it with P-Tuning to adapt it to the knowledge extraction task in the network operation and maintenance domain; and uses specific separator words in the model output to segment and identify the components of each knowledge triplet. The technical scheme of the application can effectively solve the technical problem that prior-art models have difficulty processing complex entity structures and cannot flexibly output knowledge triplets.
Inventors
- XU JING
- LI YILE
- WU PENGCHENG
- CHENG NAN
- NIE ZHENLIN
- He Juanfei
- HUA CHENGMING
- DONG XIAO
- ZENG YU
- Cao tianjiao
- WANG XIANYANG
- WANG XIDIAN
- LI JIAYUAN
- AN RUI
- Gan shu
- LIU QINGYAO
- CHANG YUAN
- WANG HAOBO
- Bai Zhiran
- WANG CHUNSHAN
- WANG JIAYI
- LIANG CHEN
- XU HAOTIAN
- HAN YUNBO
- SHI DUO
- JIA ZIHAN
- YU SHAOSHAO
- WANG YANAN
- Guan Yuankai
Assignees
- China Mobile Group Design Institute Co., Ltd. (中国移动通信集团设计院有限公司)
- China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-10
Claims (10)
- 1. A knowledge extraction method, comprising: determining, according to an obtained original text, a relation word list contained in the original text; performing splicing processing on the original text according to the relation word list and a preset first-type Chinese character set to obtain an input sequence; inputting the input sequence into a pre-trained text processing model, so that the text processing model generates a corresponding output sequence according to the input sequence, wherein the output sequence comprises at least one knowledge triplet corresponding to the original text, and in the output sequence at least one first-type Chinese character in the first-type Chinese character set is used as a separator to divide the output sequence into knowledge triplets; and parsing the output sequence to obtain the at least one knowledge triplet corresponding to the original text.
- 2. The method of claim 1, wherein the first-type Chinese character set is selected from uncommon characters that are outside the general standard Chinese character list and whose frequency of occurrence in a conventional corpus is below a preset threshold.
- 3. The method of claim 1, wherein performing splicing processing on the original text according to the relation word list and the preset first-type Chinese character set to obtain an input sequence specifically includes: splicing each relation word in the relation word list with the original text to obtain an initial input sequence; and separating each relation word in the initial input sequence from the original text by means of at least one first-type Chinese character in the first-type Chinese character set to obtain the input sequence.
- 4. The method according to claim 1, wherein using at least one first-type Chinese character as a separator to divide the output sequence into knowledge triplets comprises: using a first Chinese character in the first-type Chinese character set as a start identifier of a knowledge triplet; using a second Chinese character in the first-type Chinese character set as a separation identifier between the subject and the predicate of the knowledge triplet; using a third Chinese character in the first-type Chinese character set as a separation identifier between the predicate and the object of the knowledge triplet; and using a fourth Chinese character in the first-type Chinese character set as a termination identifier of the knowledge triplet.
- 5. The method of claim 1, wherein the text processing model is constructed based on an encoder-decoder model of the Transformer architecture.
- 6. The method according to claim 5, wherein pre-training the text processing model specifically comprises: acquiring a training data set, wherein the training data set comprises a plurality of training samples, and each training sample comprises an original text and a corresponding knowledge triplet label; determining a relation word list according to the original text of each training sample; performing splicing processing on the original text of each training sample according to the relation word list and a preset first-type Chinese character set to obtain a training input sequence; generating a training output sequence according to the knowledge triplet label and the preset first-type Chinese character set, wherein the first-type Chinese characters are used as separators to divide the knowledge triplets; and performing parameter fine-tuning on a pre-trained model based on the Transformer architecture by using the training input sequence and the training output sequence to obtain the text processing model.
- 7. A knowledge extraction device, comprising: a text acquisition module, configured to acquire an original text; a relation word determining module, configured to determine one or more relation words contained in the original text, so as to form a relation word list; a sequence construction module, configured to perform splicing processing on the original text according to the relation word list and a preset first-type Chinese character set to obtain an input sequence, wherein the first-type Chinese character set consists of uncommon characters; a model processing module, configured to input the input sequence into a pre-trained text processing model to generate a corresponding output sequence, wherein the output sequence comprises at least one knowledge triplet that corresponds to the original text and is separated by Chinese characters in the first-type Chinese character set; and a result parsing module, configured to parse the output sequence and extract the at least one knowledge triplet.
- 8. The device of claim 7, wherein the sequence construction module is specifically configured to: splice each relation word in the relation word list with the original text to obtain an initial input sequence; and separate each relation word in the initial input sequence from the original text by means of at least one first-type Chinese character in the first-type Chinese character set to obtain the input sequence.
- 9. An electronic device, comprising: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method of any one of claims 1 to 6.
- 10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 6.
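The sequence handling described in claims 3, 4 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the five separator characters and the function names `build_input`, `serialize` and `parse_output` are hypothetical, and the characters are not checked against the selection criteria of claim 2. The text processing model itself is omitted; `serialize` stands in for the training output sequence of claim 6.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical separator characters, chosen only for illustration.
# The claims require rare Chinese characters but do not name them.
REL_SEP = "龖"    # separates relation words from each other and from the text
START = "龘"      # start identifier of a triplet
SUBJ_PRED = "靐"  # separator between subject and predicate
PRED_OBJ = "齉"   # separator between predicate and object
END = "爨"        # termination identifier of a triplet


def build_input(relations: List[str], text: str) -> str:
    """Splice the relation words with the original text (claim 3)."""
    return REL_SEP.join(relations + [text])


def serialize(triples: List[Triple]) -> str:
    """Encode triplets as a target sequence for fine-tuning (claim 6)."""
    return "".join(
        f"{START}{s}{SUBJ_PRED}{p}{PRED_OBJ}{o}{END}" for s, p, o in triples
    )


def parse_output(output: str) -> List[Triple]:
    """Recover triplets from a model output sequence (claim 4).

    Malformed or unterminated fragments are skipped rather than raised,
    since generated text is not guaranteed to be well formed.
    """
    triples: List[Triple] = []
    pos = 0
    while True:
        start = output.find(START, pos)
        if start == -1:
            break
        end = output.find(END, start + 1)
        if end == -1:
            break
        body = output[start + 1:end]
        subj, sep1, rest = body.partition(SUBJ_PRED)
        pred, sep2, obj = rest.partition(PRED_OBJ)
        if sep1 and sep2 and subj and pred and obj:
            triples.append((subj, pred, obj))
        pos = end + 1
    return triples


if __name__ == "__main__":
    labels = [("router A", "connects to", "switch B")]
    print(build_input(["connects to", "has port"], "Router A connects to switch B."))
    print(parse_output(serialize(labels)))
```

Because the separators are characters that should not occur in ordinary operation and maintenance text, the parser can locate triplet boundaries with plain string searches and does not need an escaping scheme.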
Description
Knowledge extraction method, device, equipment and storage medium
Technical Field
The application relates to the technical field of network operation and maintenance and artificial intelligence, and in particular to a knowledge extraction method, device, equipment and storage medium.
Background
In the field of natural language processing (Natural Language Processing, NLP), knowledge graph construction is one of the main applications of information extraction. Knowledge triplet extraction is a core step of knowledge graph construction: it converts unstructured text into a structured knowledge representation by automatically analyzing and identifying the entities in the text and the relations between them. Named entity recognition (Named Entity Recognition, NER) and relation extraction (Relation Extraction, RE) are two common techniques for knowledge triplet extraction, and are typically built on deep learning models such as BERT+Bi-LSTM+CRF for entity recognition and relation classification.
Prior-art schemes mainly rely on sequence labeling and relation classification algorithms. Such a scheme uses a combined model of a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory Network, Bi-LSTM) and a conditional random field (Conditional Random Field, CRF) to perform named entity recognition, and then performs relation classification through an additional relation extraction model to construct a knowledge graph.
Although this approach shows some effectiveness and accuracy in processing structured and semi-structured data, the prior art has the following drawbacks in the particular field of network operation and maintenance, especially for highly unstructured text data:
1) Text in the network operation and maintenance field often contains complex entity nesting and crossing, for example in equipment configuration parameters and network topology information. These complex entity structures make it difficult for a traditional NER method based on sequence labeling to accurately identify and distinguish entity boundaries, which reduces the accuracy of triplet extraction.
2) The model output lacks flexibility. When a prior-art model extracts knowledge triplets, it depends excessively on contiguous words that already exist in the text. When a triplet to be extracted does not appear completely in the original text, the model cannot automatically supply the missing information, so the scope of knowledge extraction is limited. In particular, when facing the specialized terms and concepts of network equipment, the generalization and generation capabilities of the model are insufficient, which limits its application in constructing knowledge graphs for the operation and maintenance field.
These deficiencies indicate that the prior art has significant limitations in addressing the specific challenges of the network operation and maintenance field, particularly in handling complex entity structures and flexible output requirements. Therefore, a new solution is needed to overcome the above drawbacks and extract knowledge triplets in the network operation and maintenance field more effectively, so as to improve the accuracy and coverage of knowledge graph construction and meet the growing demands of automated network operation and maintenance.
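The nesting limitation described in drawback 1) can be illustrated with a small sketch. It assumes the common flat BIO labeling scheme, in which each token receives exactly one tag; the token sequence, the entity labels and the function name `bio_encode` are invented for illustration and do not come from the application.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def bio_encode(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode entity spans as flat BIO tags, one tag per token.

    Raises ValueError when two spans share a token, because a single
    tag sequence cannot represent both entities at once.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span {(start, end, label)} overlaps a tagged span")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


if __name__ == "__main__":
    tokens = ["interface", "GE0/0/1", "of", "router", "R1"]
    # Two disjoint entities fit into one tag sequence.
    print(bio_encode(len(tokens), [(0, 2, "PORT"), (3, 5, "DEVICE")]))
    # A device nested inside a longer interface description does not.
    try:
        bio_encode(len(tokens), [(0, 5, "INTERFACE"), (3, 5, "DEVICE")])
    except ValueError as err:
        print("cannot encode nested entities:", err)
```

A generative model that emits triplets as text is not bound by the one-tag-per-token constraint, which is the motivation the background gives for moving away from sequence labeling.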
Disclosure of Invention
The application provides a knowledge extraction method that aims to overcome the evident shortcomings of prior-art knowledge extraction methods based on sequence labeling in processing complex entity structures and flexibly outputting knowledge triplets. On the one hand, traditional models such as LSTM+CRF have difficulty identifying nested or crossing entities, so their ability to process complex structures is limited. On the other hand, when the text information is incomplete, such models cannot generate the missing knowledge triplets, which affects the completeness and accuracy of extraction. Therefore, a knowledge extraction method is needed that can flexibly generate knowledge triplets, adapt to complex entity structures and provide strong semantic understanding, so as to make knowledge graph construction more comprehensive and intelligent.
In a first aspect, a knowledge extraction method is provided. The method comprises: determining, according to an obtained original text, a relation word list contained in the original text; performing splicing processing on the original text according to the relation word list and a preset first-type Chinese character set to obtain an input sequence; and inputting the input sequence into a pre-trained text processing model so that the text processing model generates a corresponding output sequence according to the input sequence, wherein the output sequence contains at least one knowledge triplet corresponding to the original