CN-116662480-B - Text semantic representation method and system based on entity enhancement
Abstract
The invention provides a text semantic representation method and system based on entity enhancement. The method obtains a text coding model and an entity enhancement coding model, wherein the entity enhancement coding model consists of an external entity vector representation module and a context-related entity information coding module connected in series. A first representation vector of the target text is extracted by the text coding model. The external entity vector representation module identifies the word sequences in the target text that mention entities, determines which entities in a knowledge base are associated with those word sequences, and obtains the entity vector representations of the corresponding entities in the knowledge base. The entity information coding module comprises a plurality of entity adapter layers connected in series; its inputs are the entity vector representations and the hidden states of the intermediate layers of the text coding model, and the features output by the last entity adapter layer serve as a second representation vector of the target text. The first representation vector and the second representation vector are fused to obtain the semantic representation result of the target text.
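The two-branch architecture summarized above can be sketched as follows. This is a minimal illustrative skeleton, not the patented implementation: all function names, the toy 4-dimensional vectors, and the fixed gate value are stand-ins for the learned text encoder, the entity enhancement encoder, and the learned gating unit.

```python
def encode_text(text):
    # Stand-in for the text coding model (e.g., a BERT-style encoder):
    # returns a fixed-length first representation vector.
    return [float(len(text))] * 4  # toy 4-dimensional vector

def encode_entities(text):
    # Stand-in for the entity enhancement coding model: recognize entity
    # mentions, look up knowledge-base entity vectors, and pass them (with
    # the text encoder's intermediate hidden states) through stacked entity
    # adapter layers; the last adapter's output is the second vector.
    return [1.0] * 4  # toy 4-dimensional vector

def fuse(v1, v2, g=0.5):
    # Entity information gating unit: element-wise convex combination of
    # the two representation vectors (the gate g would normally be learned).
    return [g * a + (1.0 - g) * b for a, b in zip(v1, v2)]

text = "Who plays Thoros of Myr in Game of Thrones?"
v1 = encode_text(text)
v2 = encode_entities(text)
semantic_repr = fuse(v1, v2)
```

The fusion step mirrors the gated weighted sum described in the claims; the two encoders here are placeholders only.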
Inventors
- XUE YUANHAI
- XIA HAOYUN
- CHEN CUITING
- HE GUANGFU
- YU XIAOMING
- SHEN HUAWEI
- CHENG XUEQI
Assignees
- Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Dates
- Publication Date
- 20260508
- Application Date
- 20230506
Claims (8)
- 1. A text semantic representation method based on entity enhancement, comprising: Step 1, obtaining a target text to be semantically represented, and obtaining a text coding model and an entity enhancement coding model, wherein the entity enhancement coding model consists of an external entity vector representation module and a context-related entity information coding module which are connected in series; Step 2, recognizing, with the external entity vector representation module, the word sequences in the target text that mention entities, determining which entities in a knowledge base are associated with those word sequences, and obtaining the entity vector representations, in the knowledge base, of the entities corresponding to the word sequences; Step 3, wherein the entity information coding module comprises a plurality of entity adapter layers connected in series, the inputs of the entity information coding module are the vector representations of the entities in the target text and the hidden states of intermediate layers of the text coding model, and the input of each entity adapter layer is the hidden state output by the intermediate layer of the text coding model corresponding to that adapter layer together with the output of the previous entity adapter layer; Step 4, fusing the first representation vector and the second representation vector through an entity information gating unit to obtain a semantic representation result of the target text; wherein step 2 comprises: identifying the word sequences that mention entities in the target text through named entity recognition, and determining the entities in the knowledge base corresponding to those word sequences through entity linking; for each word or entity x, obtaining its knowledge-base vector representation w2v(x) with the word/entity vector tool Wikipedia2vec; given the vocabulary V and the entity table E, constructing a mapping function F to map words and entities into the same vector space; obtaining, according to formula 1, a linear transformation matrix W ∈ R^{d×d_e}, where d is the vector dimension of the text coding model and d_e is the dimension of the entity vectors encoded by Wikipedia2vec, such that each word or entity x in the vocabulary V and the entity table E, represented by the vector w2v(x) produced by Wikipedia2vec, is mapped by the linear transformation of formula 1, W·w2v(x), into the space of the embedding layer Emb(·) of the text coding model; obtaining, according to formula 2, the mapping function giving the vector representations of words and entities mapped into the input vector space of the text coding model: F(x) = W·w2v(x) for x in the entity table E, while for entities not present in the entity table the vector representation Emb(x) of the embedding layer of the text coding model is used directly; and wherein step 3 comprises: the hidden state output by the j-th intermediate layer of the text coding model is H_j, of shape (BatchSize, TextLen, d), where BatchSize is the number of samples in the current training batch, TextLen is the maximum number of text tokens of the text encoder, and d is the hidden-layer dimension of the text encoder; the output of the i-th entity adapter layer is denoted A_i, of shape (BatchSize, 1 + EntityLen, d), where the "1" corresponds to the position that stores the text coding model output and EntityLen is the number of entities; for the 1st entity adapter layer the input is A_0, whose 0-th position holds the all-zero vector and whose k-th position (k = 1, …, EntityLen) holds the vector of the k-th entity; for the i-th entity adapter layer, as formulated in formula 3, the hidden state H_j^{[CLS]} of the [CLS] token output by the j-th intermediate layer of the text coding model is added to the 0-th position of the output of the (i−1)-th entity adapter layer, giving the adapter input S_i; according to formula 4, taking S_i as input, a projection layer reduces the dimension from d to d_a, P_i = S_i·W_down; according to formula 5, N transformer layers encode P_i; according to formula 6, a projection layer increases the dimension again, restoring it from d_a to d, U_i = Transformer_N(P_i)·W_up; and U_i is added to S_i to form a residual connection, A_i = U_i + S_i; as shown in formula 7, the final output of the entity enhancement coding model, namely the second representation vector, is the coding result at the 0-th position of the output A_L of the last entity adapter layer, where L is the number of entity adapter layers.
- 2. The entity-enhanced text semantic representation method according to claim 1, wherein step 4 comprises: the control weight of the entity gating unit is calculated as shown in formula 8, with the first representation vector v1, encoded by the text coding model based on a pre-trained language model, and the second representation vector v2 as inputs: g = σ(W_g·[v1; v2] + b_g), wherein [v1; v2] denotes the result of concatenating v1 and v2, W_g and b_g respectively denote the weight and bias of the gating network, σ denotes the Sigmoid function, "·" denotes the dot-product operation, and g is a control weight with a value between 0 and 1; as shown in formula 9, the weight g is used to form a weighted sum of v1 and v2 to obtain the final semantic representation result of the target text: v = g·v1 + (1 − g)·v2.
- 3. The entity-enhancement-based text semantic representation method of claim 1, wherein step 4 further comprises performing a text retrieval or text similarity determination task with the extracted semantic representation result.
- 4. A text semantic representation system based on entity enhancement, comprising: a module 1, which acquires a target text to be semantically represented, and acquires a text coding model and an entity enhancement coding model, wherein the entity enhancement coding model consists of an external entity vector representation module and a context-related entity information coding module which are connected in series; a module 2, which recognizes, with the external entity vector representation module, the word sequences in the target text that mention entities, determines which entities in a knowledge base are associated with those word sequences, and obtains the entity vector representations, in the knowledge base, of the entities corresponding to the word sequences; a module 3, wherein the entity information coding module comprises a plurality of entity adapter layers connected in series, the inputs of the entity information coding module are the vector representations of the entities in the target text and the hidden states of intermediate layers of the text coding model, and the input of each entity adapter layer is the hidden state output by the intermediate layer of the text coding model corresponding to that adapter layer together with the output of the previous entity adapter layer; and a module 4, which fuses the first representation vector and the second representation vector through an entity information gating unit to obtain a semantic representation result of the target text; wherein module 2 comprises: identifying the word sequences that mention entities in the target text through named entity recognition, and determining the entities in the knowledge base corresponding to those word sequences through entity linking; for each word or entity x, obtaining its knowledge-base vector representation w2v(x) with the word/entity vector tool Wikipedia2vec; given the vocabulary V and the entity table E, constructing a mapping function F to map words and entities into the same vector space; obtaining, according to formula 1, a linear transformation matrix W ∈ R^{d×d_e}, where d is the vector dimension of the text coding model and d_e is the dimension of the entity vectors encoded by Wikipedia2vec, such that each word or entity x in the vocabulary V and the entity table E, represented by the vector w2v(x) produced by Wikipedia2vec, is mapped by the linear transformation of formula 1, W·w2v(x), into the space of the embedding layer Emb(·) of the text coding model; obtaining, according to formula 2, the mapping function giving the vector representations of words and entities mapped into the input vector space of the text coding model: F(x) = W·w2v(x) for x in the entity table E, while for entities not present in the entity table the vector representation Emb(x) of the embedding layer of the text coding model is used directly; and wherein module 3 comprises: the hidden state output by the j-th intermediate layer of the text coding model is H_j, of shape (BatchSize, TextLen, d), where BatchSize is the number of samples in the current training batch, TextLen is the maximum number of text tokens of the text encoder, and d is the hidden-layer dimension of the text encoder; the output of the i-th entity adapter layer is denoted A_i, of shape (BatchSize, 1 + EntityLen, d), where the "1" corresponds to the position that stores the text coding model output and EntityLen is the number of entities; for the 1st entity adapter layer the input is A_0, whose 0-th position holds the all-zero vector and whose k-th position (k = 1, …, EntityLen) holds the vector of the k-th entity; for the i-th entity adapter layer, as formulated in formula 3, the hidden state H_j^{[CLS]} of the [CLS] token output by the j-th intermediate layer of the text coding model is added to the 0-th position of the output of the (i−1)-th entity adapter layer, giving the adapter input S_i; according to formula 4, taking S_i as input, a projection layer reduces the dimension from d to d_a, P_i = S_i·W_down; according to formula 5, N transformer layers encode P_i; according to formula 6, a projection layer increases the dimension again, restoring it from d_a to d, U_i = Transformer_N(P_i)·W_up; and U_i is added to S_i to form a residual connection, A_i = U_i + S_i; as shown in formula 7, the final output of the entity enhancement coding model, namely the second representation vector, is the coding result at the 0-th position of the output A_L of the last entity adapter layer, where L is the number of entity adapter layers.
- 5. The entity-enhanced text semantic representation system according to claim 4, wherein module 4 comprises: the control weight of the entity gating unit is calculated as shown in formula 8, with the first representation vector v1, encoded by the text coding model based on a pre-trained language model, and the second representation vector v2 as inputs: g = σ(W_g·[v1; v2] + b_g), wherein [v1; v2] denotes the result of concatenating v1 and v2, W_g and b_g respectively denote the weight and bias of the gating network, σ denotes the Sigmoid function, "·" denotes the dot-product operation, and g is a control weight with a value between 0 and 1; as shown in formula 9, the weight g is used to form a weighted sum of v1 and v2 to obtain the final semantic representation result of the target text: v = g·v1 + (1 − g)·v2.
- 6. The entity-enhancement-based text semantic representation system of claim 4, wherein module 4 further performs a text retrieval or text similarity determination task with the extracted semantic representation result.
- 7. A storage medium storing a program for executing the entity-enhanced text semantic representation method according to any one of claims 1 to 3.
- 8. A client for use in the entity-enhancement-based text semantic representation system of any one of claims 4 to 6.
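Two of the computations the claims describe, the linear mapping of Wikipedia2vec vectors into the text encoder's embedding space (formulas 1 and 2) and the gated fusion of the two representation vectors (formulas 8 and 9), can be sketched numerically. This is an illustrative sketch under assumptions: the patent does not state how the matrix W is fitted, so a least-squares fit over a shared word set is assumed here; all dimensions and the random data are toy values, and the scalar gate follows the claim's description of g as a single control weight between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Mapping Wikipedia2vec vectors into the encoder's embedding space ---
# Assumed least-squares fit of the linear transformation W in formula 1:
# find W that best maps d_e-dimensional Wikipedia2vec vectors onto the
# d-dimensional embeddings the text coding model assigns to the same words.
d_e, d, n_shared = 6, 4, 50              # toy dimensions / shared-word count
X_wiki = rng.normal(size=(n_shared, d_e))  # Wikipedia2vec vectors of shared words
X_enc = rng.normal(size=(n_shared, d))     # encoder embeddings of the same words
W, *_ = np.linalg.lstsq(X_wiki, X_enc, rcond=None)  # W has shape (d_e, d)

def map_entity(vec_wiki):
    # Formula 2 (sketch): an entity with a Wikipedia2vec vector is mapped
    # through W; entities absent from the entity table would instead use
    # the text encoder's own embedding layer directly.
    return vec_wiki @ W

# --- Entity information gating unit (formulas 8 and 9, sketch) ---
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_g = rng.normal(size=2 * d)  # gating weight over the concatenated vectors
b_g = 0.0                     # gating bias

def gated_fusion(v1, v2):
    # g = sigmoid(W_g · [v1; v2] + b_g), then v = g*v1 + (1-g)*v2.
    g = sigmoid(W_g @ np.concatenate([v1, v2]) + b_g)
    return g * v1 + (1.0 - g) * v2

v1 = rng.normal(size=d)                 # first representation vector
v2 = map_entity(rng.normal(size=d_e))   # second (entity-derived) vector
v = gated_fusion(v1, v2)
```

Because g lies strictly between 0 and 1, each component of the fused vector v lies between the corresponding components of v1 and v2, which matches the weighted-sum behavior formula 9 describes.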
Description
Text semantic representation method and system based on entity enhancement

Technical Field

The invention relates to the field of information retrieval, and in particular to a semantic representation method for text in semantic-based retrieval.

Background

With the rapid development of the internet, a great deal of content is generated and widely disseminated on the network every day; the number of websites had reached 3.98 million by June 2022. On platforms such as Zhihu and Xiaohongshu (Little Red Book), users create a large amount of content every day: as of December 31, 2021, Zhihu had accumulated more than 490 million published items, of which question-and-answer content accounted for 420 million. In the field of information retrieval, diversified search scenarios place higher requirements on the characterization of semantic relevance. However, search techniques based on word matching have limitations, such as the inability to handle synonyms, polysemous words, and whole-sentence semantics, and they struggle with diversified search scenarios. Therefore, semantic-based search techniques have become a research direction of great interest. With the introduction of pre-trained language model technology, deep semantic retrieval methods based on dense vector retrieval have achieved very high accuracy and have recently succeeded in surpassing the traditional BM25 algorithm; for example, on the NaturalQuestions dataset, a dense-vector semantic retrieval model can surpass the BM25 algorithm with only 1000 training examples. However, existing semantic-based retrieval models are deficient in capturing the semantics of the entities in a text, so their effectiveness on queries containing entities is not ideal. For the query "Who plays Thoros of Myr in Game of Thrones?",
the BM25 algorithm successfully matches the "Thoros of Myr" entity and finds a text that meets the user's need, while the dense-vector retrieval model instead retrieves the encyclopedia page of another, irrelevant actor; this situation has a strongly negative effect on retrieval quality. Existing semantic retrieval models encode queries or documents into fixed-length dense vectors, such as 128-dimensional or 768-dimensional vectors, primarily through pre-trained language models such as BERT. Current research in this area focuses mainly on how to obtain better vector representations of text, e.g. by adjusting training strategies or mining hard negative samples. There are also some studies attempting to solve the problem of poor semantic representation of entity-bearing text in current semantic search. For example, one line of work seeks to exploit the strength of the BM25 algorithm in retrieving queries that contain entities, and to migrate that strength to a dense-vector retrieval model by way of knowledge distillation. However, experiments found that while this does enhance the retrieval effect of the dense-vector model on queries containing entities, as the strengths of the BM25 algorithm are learned, the model's own strength in capturing text semantic relevance is lost. Others construct a retrieval dataset based on triples representing relationships between entities, with a template designed for each relationship; for example, for the triple <China, capital, Beijing>, the answer to the query "Where is the capital of China?" is "Beijing". In this way, the queries in the training set all contain entities, and the text representation model in the semantic retrieval model is forced, through data augmentation, to pay more attention to entities; attention to entities is indeed enhanced after fine-tuning on the augmented dataset.
However, when tested on a general-purpose retrieval dataset, the retrieval effect was found to deteriorate. Existing text semantic representation models based on pre-trained language models do not capture the semantics of entities well. The main reasons are as follows. First, to reduce vocabulary size, current pre-trained language models usually use the WordPiece segmentation algorithm, which splits unusual words into subwords; for example, the English word "Jinan" (the city of Jinan) does not appear in BERT's vocabulary and is segmented into "Jin" and "##an", where "##" indicates that "an" is a subword rather than the beginning of a word. Such a segmentation approach makes it difficult to recover the semantics of the original word "Jinan". Second, when a conventional semantic retrieval model is trained, the model is not explicitly made to distinguish whether a term is an entity, part of an entity, or an ordinary term, especially for entities composed of multiple terms. For example, in the sentence "WHERE IS WH
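The WordPiece behavior described above can be illustrated with a minimal greedy longest-match tokenizer. This is a simplified sketch of BERT's segmentation, and the tiny vocabulary below is an illustrative assumption, chosen so that "Jinan" is absent while its subwords are present:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first WordPiece segmentation, as used by BERT:
    # repeatedly take the longest vocabulary prefix; non-initial pieces
    # carry a "##" prefix. Returns ["[UNK]"] if no segmentation exists.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary: "Jinan" itself is missing, but its subwords are present.
vocab = {"Jin", "##an", "where", "is"}
print(wordpiece_tokenize("Jinan", vocab))  # ['Jin', '##an']
```

The output shows exactly the segmentation the text describes: the word-initial piece "Jin" plus the subword "##an", from which the semantics of the original city name are hard to recover.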