CN-115391500-B - Conversational information retrieval method based on pre-training language model

Abstract

The invention relates to the technical field of information retrieval methods and discloses a conversational information retrieval method based on a pre-trained language model. By screening the historical query information related to pronouns and by using a dual-tower fine-grained semantic interaction model, the method solves the problem that retrieval in the prior art tends to disregard semantic relations, leaving the query results insufficiently accurate.

Inventors

  • WANG JUNMEI
  • SHENG JINHUA
  • ZENG JING

Assignees

  • Hangzhou Dianzi University (杭州电子科技大学)

Dates

Publication Date
2026-05-05
Application Date
2022-07-27

Claims (8)

  1. A conversational information retrieval method based on a pre-trained language model, characterized in that: S1, obtain the encoded representation of each document using the existing text representation model BERT; S2, for a set of conversational queries, for the current round's query, find the historical queries related to its information need, splice the two, and input them into the text representation model BERT; if a pronoun appears in the current query, splice the current query with the most recent historical query, and additionally check whether a similar pronoun appears in that historical query: if so, continue tracing back and splice the earlier historical information with the current query; if not, do not trace back; S3, through contrastive learning, make the encoded representation of the query constructed by the learned model approach the encoded representation of the manually rewritten query; S4, splice each query sentence of a group of conversational queries with its related historical queries, input the result into the model trained in S3 for encoding, compute the semantic similarity with the document encodings obtained in S1, and rank the documents from largest to smallest; S5, construct a dual-tower fine-grained semantic interaction model using a contrastive learning method, and compute the model's ranking loss using the training set constructed in S4; and S6, retrieve the test-set queries using the model trained in S5 to obtain a ranking result.
  2. The method for conversational information retrieval of claim 1, wherein: in S2, the historical queries related to the information need of the query are selected according to the following rule, in which the selected set denotes the batch of historical rounds related to the current query; for a manually rewritten query, the encoded representation can be obtained using only the single query record.
  3. The method for conversational information retrieval based on a pre-trained language model of claim 2, wherein: in the formula of S2, the encodings of the query and the document are expressed as the outputs of each node of the hidden layer, and the hidden-layer outputs are retained.
  4. The method for conversational information retrieval of claim 1, wherein: in S3, the contrastive-learning loss function is defined such that the batch size equals the maximum allowed input length of the query; the loss function expresses the difference between the encoded representation of the constructed query and the encoded representation of the manually rewritten query; this difference is reduced through the loss function to refine the trained model, and the encoded representation of the manually rewritten query and the encoded representation of the constructed query are learned by contrast and trained separately, so that the encoded representation of the query generated by the trained model approximates the encoded representation of the manually rewritten query.
  5. The method for conversational information retrieval of claim 1, wherein: in S4, perform conversational retrieval using the model trained in S3 and rank the query results; take the N documents nearest to the query, label them, select from those top-N results the positive examples relevant to the query and the negative examples not relevant to it, and construct the training data set of the ranking model as triplets, in which one element denotes the documents labeled as relevant to the query and another denotes the set of negatively related documents.
  6. The method for conversational information retrieval of claim 5, wherein the number of negative examples not related to the query is controlled between 50 and 100.
  7. The method for conversational information retrieval of claim 1, wherein: in S5, construct a dual-tower semantic matching model using BERT and train a model M3 with the training set constructed in S4; the model's ranking loss is computed as a cross-entropy loss over the encoded representations of the documents relevant and not relevant to the query, expressing the similarity between the query and the positive-example documents and the similarity between the query and the negative-example documents.
  8. The method for conversational information retrieval of claim 1, wherein: in S6, perform conversational retrieval using the trained model; the semantic similarity between a query and a document is computed by accumulating, over the query words, the semantic similarity of the word in the document closest to each query word, which helps capture fine-grained relevance to the query, then computing the similarity of the average vector representations of the query and the document, and adding the two as the semantic similarity of the query and the document; the computation uses the encoded representation of the i-th query term of the input query, the encoded representation of the j-th word of the document, the number of query tokens, and the number of document tokens; the results are ranked by this similarity value to obtain the optimized conversational retrieval result.
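The contrastive objective of S3 (claim 4), which pulls the constructed query's encoding toward the encoding of its manually rewritten counterpart, can be sketched as follows. This is a minimal InfoNCE-style illustration under assumed details (cosine similarity, in-batch negatives, a hypothetical temperature of 0.05); the claim's own formula and symbols are omitted in this text, so this is not the patent's exact loss.

```python
import numpy as np

def contrastive_loss(constructed: np.ndarray, rewritten: np.ndarray,
                     temperature: float = 0.05) -> float:
    """InfoNCE-style contrastive loss: each constructed-query encoding is
    pulled toward the encoding of its manually rewritten counterpart, with
    the other rewritten queries in the batch serving as in-batch negatives.
    `constructed` and `rewritten` are [batch, dim] encoding matrices."""
    # Normalize rows so dot products are cosine similarities.
    c = constructed / np.linalg.norm(constructed, axis=1, keepdims=True)
    r = rewritten / np.linalg.norm(rewritten, axis=1, keepdims=True)
    logits = (c @ r.T) / temperature              # [batch, batch]
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss drives each constructed encoding to be more similar to its own rewritten query than to any other rewritten query in the batch, which matches the claim's stated goal of approximating the manually rewritten representation.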
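The query-document scoring of S6 (claim 8), which sums each query word's best match in the document and adds the similarity of the average vectors, can be sketched as follows. Cosine similarity over token embeddings is an assumption on my part; the claim's exact formula and symbols are omitted in this text.

```python
import numpy as np

def fine_grained_similarity(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Score a query against a document per the claim-8 recipe:
    (1) for each query token, take the similarity of its best-matching
        document token and sum these maxima (fine-grained term matching);
    (2) add the similarity of the mean query vector and the mean document
        vector (coarse whole-sequence matching).
    Inputs are [num_tokens, dim] token-embedding matrices."""
    # Normalize token embeddings so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    token_sim = q @ d.T                        # [num_q_tokens, num_d_tokens]
    max_sim_sum = token_sim.max(axis=1).sum()  # best document token per query token
    q_mean, d_mean = q.mean(axis=0), d.mean(axis=0)
    avg_sim = float(q_mean @ d_mean /
                    (np.linalg.norm(q_mean) * np.linalg.norm(d_mean)))
    return float(max_sim_sum + avg_sim)
```

Candidate documents would then be sorted by this score in descending order, as S6 describes.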

Description

Conversational information retrieval method based on pre-training language model

Technical Field

The invention relates to the technical field of information retrieval methods, in particular to a conversational information retrieval method based on a pre-trained language model.

Background

The popularization of a new generation of dialogue assistants (such as Alexa, Siri, Cortana, Bixby and Google Assistant) has widened the application scenarios of conversational information retrieval and increased the importance of conversational information retrieval technology. The goal of the conversational information retrieval task is for the model to understand user behavior during interactive search and to track the shift in the user's information need across query turns. At the same time, the information needs are characterized by complexity (requiring multiple rounds of refinement), diversity (spanning different information categories), open domain (requiring no access to expert domain knowledge), and answerability (with sufficient coverage in the document collection). Conversational information retrieval is one of the directions for next-generation information retrieval proposed by the international text retrieval conference in 2020. It helps satisfy users' complex information needs and can provide convenient and accurate information access through a dialogue interface and portable devices. At present, solutions to conversational retrieval fall mainly into two types. The first rewrites conversational queries into stand-alone, context-independent queries using a generative method based on GPT-3. The second represents the query and the document separately as dense vectors.
The representation of a conversational query is obtained by splicing all historical queries together with the current query. The representation of the document follows that of the earlier ad hoc retrieval, since the document changes little between the two settings. Finally, dense vector retrieval is used. However, not all historical queries are useful for the current query. The invention therefore selects the useful historical information by formulated rules and, at the same time, constructs a model by a contrastive learning method so that the representation of the combined query continually approaches that of the manually rewritten query. This yields a less noisy, more useful query encoding and helps the conversational retrieval model find documents relevant to the context of the current query. Chinese patent CN202110795247.7 discloses a semantic-inference dialogue retrieval method and system based on a pre-trained dual-attention neural network, together with retrieval equipment and a storage medium. That scheme uses a BM25 model built on the binary independence assumption, under which terms are mutually independent. The relevance of a query to a document is measured by counting how often a query term appears in the document and by the term's "rarity" (inverse document frequency) across the collection. However, the binary independence assumption is inaccurate, because the semantics of terms are context-dependent, not independent. That scheme does not use contextual information to screen out the information useful for the current query, so the retrieval result contains irrelevant information and accuracy decreases; and because part of the irrelevant information supports the binary independence assumption, it is difficult to optimize further.
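The pronoun-triggered back-tracking selection of historical queries described above can be sketched as follows. The pronoun list, the `[SEP]` splicing, and the stopping rule's details are illustrative assumptions, not the patent's exact selection formula, which is omitted in this text.

```python
# Sketch of the history-selection rule: splice the current query with prior
# queries, tracing back one extra round each time the most recently added
# query itself contains a pronoun, and stopping otherwise.
PRONOUNS = {"it", "he", "she", "they", "this", "that", "these", "those"}  # hypothetical list

def contains_pronoun(query: str) -> bool:
    """Check whether any token of the query is a (lowercased) pronoun."""
    return any(tok.strip(".,?!").lower() in PRONOUNS for tok in query.split())

def build_input(history: list[str], current: str) -> str:
    """Return the spliced text to be fed into the BERT encoder."""
    selected: list[str] = []
    query = current
    # Walk the history from most recent to oldest while pronouns keep appearing.
    for past in reversed(history):
        if not contains_pronoun(query):
            break  # no unresolved reference: stop tracing back
        selected.append(past)
        query = past  # keep tracing back only if this past query also has a pronoun
    return " [SEP] ".join(list(reversed(selected)) + [current])
```

A query with no pronoun is encoded on its own, while a chain of pronoun-bearing queries pulls in just enough history to resolve the references, which is the noise-reduction behavior the patent aims for.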
Disclosure of Invention

Existing conversational information retrieval techniques splice all historical queries into the input in order to take contextual information into account. However, not all historical information is useful for the current query, and irrelevant information introduces noise into the model, reducing the accuracy of the search results. Aiming at these problems, the invention provides a conversational information retrieval method based on a pre-trained language model, so as to improve the accuracy of information retrieval results. The technical scheme of the invention comprises the following steps. S1, obtain the encoded representation of each document using the existing text representation model BERT. S2, for a set of conversational queries, for the current round's query, find the historical queries related to its information need, splice the two, and input them into the text representation model BERT. S3, through contrastive learning, make the encoded representation of the query constructed by the learned model approach the encoded representation of the manually rewritten query. S4, inputting the spliced query sentences of a group