
CN-121979997-A - Bidirectional information retrieval enhancement generation method for large language model

CN121979997A

Abstract

The invention discloses a bidirectional information retrieval-enhanced generation method for a large language model, belonging to the technical field of artificial intelligence. To overcome the defects of traditional RAG, which performs only one-way "query → document" retrieval and thereby introduces noise and misses key evidence, the invention constructs a bidirectional semantic-aware retrieval-enhanced generation model and a two-stage training framework. In the first stage, contrastive self-supervised learning pulls positive examples closer to, and pushes negative examples away from, the query in the embedding space; in the second stage, supervised binary classification performs fine-grained relevance discrimination on query-document sentence pairs in both directions and outputs probabilistic relevance scores. In the inference stage, the two directional probabilities are fused via a Bayesian formula to obtain the final relevance, and the documents are reranked accordingly, so the method is plug-and-play throughout, with no fine-tuning of the LLM required. The invention significantly improves accuracy and consistency on single-hop question answering, multi-hop question answering, and fact-checking tasks, and has the advantages of being lightweight and cheap to deploy.

Inventors

  • JIA HAO
  • ZHANG ZIXIN
  • HUANG YONG
  • ZHAO LIANG
  • LIU PENG
  • LI MING
  • LIU JUN

Assignees

  • Dalian Customs of the People's Republic of China (中华人民共和国大连海关)

Dates

Publication Date
2026-05-05
Application Date
2026-02-10

Claims (8)

  1. A bidirectional information retrieval-enhanced generation method for a large language model, comprising: first, constructing a bidirectional semantic-aware retrieval-enhanced generation model comprising an encoder and a bidirectional probability fusion module; second, ranking the candidate documents of a query, sampling pseudo-positive and pseudo-negative examples, and assigning pseudo labels to form triples of the query, a pseudo-positive example, and a pseudo-negative example; third, performing semantic-contrast self-supervised training: based on the triples, obtaining the hidden vectors of the query, the pseudo-positive example, and the pseudo-negative example through the encoder, minimizing a contrastive loss function that drives the query vector toward the pseudo-positive example and away from the pseudo-negative example, updating the encoder parameters, and constructing a low-noise semantic space; fourth, performing bidirectional probability matching training: for each query-document pair, constructing two spliced sequences, one with the query first and one with the document first, feeding them into the encoder to obtain hidden states, outputting relevance probabilities in the two directions through a classification head in the bidirectional probability fusion module, taking minimization of the binary cross-entropy loss as the training objective, and jointly updating the encoder parameters and the trainable weights of the classification head to complete model training; and fifth, retrieving a plurality of candidate documents for a user query, constructing the bidirectional spliced sequences for each candidate document, obtaining the relevance probabilities of the two directions through the trained bidirectional semantic-aware retrieval-enhanced generation model, computing a joint probability through a Bayesian fusion formula, ranking the candidate documents by the joint probability, and selecting the top K documents as context input to the large language model to generate an answer.
  2. The method of claim 1, wherein the bidirectional semantic-aware retrieval-enhanced generation model is built upon an existing retrieval-enhanced generation system to replace its existing retriever, and comprises: the encoder, which uses a pre-trained language model and, for each query-document pair, generates the query-first and document-first spliced sequences and obtains the corresponding hidden states; and the bidirectional probability fusion module, which maps the hidden states to the corresponding relevance probabilities through the classification head and computes the joint probability using the Bayesian fusion formula.
  3. The method of claim 2, wherein the classification head is a multi-layer perceptron with a single hidden layer and outputs the relevance probability through a Sigmoid function.
  4. The method of claim 1, wherein the second step comprises: for a query $q$ and its corresponding candidate documents $\{d_i\}$, performing relevance ranking, randomly sampling a pseudo-positive example $d^+$ and a pseudo-negative example $d^-$ according to the relevance scores, and assigning pseudo labels to form the triple $(q, d^+, d^-)$ of the query, the pseudo-positive example, and the pseudo-negative example.
  5. The method of claim 1, wherein the contrastive loss function is:

$$\mathcal{L}_{\mathrm{cl}} = \sum_{(q,\, d^+,\, d^-) \in \mathcal{B}} \max\!\left(0,\; m - \mathrm{sim}(h_q, h_{d^+}) + \mathrm{sim}(h_q, h_{d^-})\right) \quad (2)$$

wherein $h_q$, $h_{d^+}$, and $h_{d^-}$ denote the hidden vectors of the query $q$, the pseudo-positive example $d^+$, and the pseudo-negative example $d^-$, respectively; $m$ is the preset margin; $\max(0, \cdot)$ denotes the hinge function; $\mathrm{sim}(\cdot, \cdot)$ is the similarity in the embedding space; and $\mathcal{B}$ is the current batch of triples.
  6. The method of claim 1, wherein the fourth step, on the basis of the low-noise semantic space, comprises: for each query $q$, assigning true labels $y \in \{0, 1\}$ to the corresponding documents $d$ according to the pseudo labels annotated in the second step, wherein 1 denotes relevant and 0 denotes irrelevant, forming supervision pairs $(q, d, y)$ that constitute a supervision sentence-pair set $\mathcal{S}$; splicing each supervision sentence pair into two sequences, query-first and document-first, and feeding them into the encoder to obtain the hidden states $h_{q \to d}$ and $h_{d \to q}$, respectively; applying the classification head to $h_{q \to d}$ and $h_{d \to q}$ to output the corresponding relevance probabilities $p_{q \to d}$ and $p_{d \to q}$; and training with the objective of minimizing the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{bce}} = -\frac{1}{2|\mathcal{S}|} \sum_{(q,\, d,\, y) \in \mathcal{S}} \left[\, y \log p_{q \to d} + (1 - y) \log(1 - p_{q \to d}) + y \log p_{d \to q} + (1 - y) \log(1 - p_{d \to q}) \,\right] \quad (3)$$

wherein the losses computed in the two directions are summed and averaged to obtain the binary cross-entropy loss, which is back-propagated to jointly update the encoder parameters and the trainable weights of the classification head, yielding the trained bidirectional semantic-aware retrieval-enhanced generation model.
  7. The method of claim 1, wherein the Bayesian fusion formula computes the joint probability $P_{\mathrm{joint}}$ from $p_{q \to d}$ and $p_{d \to q}$, the relevance probabilities in the two directions, respectively.
  8. The method of claim 1, wherein in the fifth step, the parameters of the large language model are frozen when generating the answer.
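To make the claims concrete, here is a minimal PyTorch sketch of the training objectives in equations (2) and (3) and of the inference-time fusion used in the fifth step of claim 1. The cosine similarity used inside the contrastive loss and the normalized-product form of the Bayesian fusion are assumptions made for illustration; the published text names a hinge/margin loss and a "Bayesian fusion formula" without fixing either detail.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_q: torch.Tensor, h_pos: torch.Tensor,
                     h_neg: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Triplet hinge loss in the spirit of equation (2): drive the query
    vector toward the pseudo-positive and away from the pseudo-negative.
    Cosine similarity is an assumed choice of sim(., .)."""
    sim_pos = F.cosine_similarity(h_q, h_pos, dim=-1)
    sim_neg = F.cosine_similarity(h_q, h_neg, dim=-1)
    return torch.clamp(margin - sim_pos + sim_neg, min=0.0).sum()

def bidirectional_bce(p_qd: torch.Tensor, p_dq: torch.Tensor,
                      y: torch.Tensor) -> torch.Tensor:
    """Equation (3): binary cross-entropy over both splicing directions,
    summed and averaged."""
    return 0.5 * (F.binary_cross_entropy(p_qd, y)
                  + F.binary_cross_entropy(p_dq, y))

def bayesian_fuse(p_qd: torch.Tensor, p_dq: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the Bayesian fusion formula of claim 7:
    the normalized product of the two directional probabilities, treating
    them as independent evidence under a uniform prior (an assumption,
    since the claim does not reproduce the exact expression)."""
    joint = p_qd * p_dq
    return joint / (joint + (1.0 - p_qd) * (1.0 - p_dq))

if __name__ == "__main__":
    # Toy tensors stand in for encoder hidden vectors and head outputs.
    h_q, h_pos, h_neg = (torch.randn(4, 768) for _ in range(3))
    print("contrastive:", contrastive_loss(h_q, h_pos, h_neg).item())
    p_qd, p_dq = torch.rand(4), torch.rand(4)
    y = torch.tensor([1.0, 0.0, 1.0, 0.0])
    print("bce:", bidirectional_bce(p_qd, p_dq, y).item())
    print("fused:", bayesian_fuse(p_qd, p_dq))
```

At inference time, `bayesian_fuse` would be applied to the two directional probabilities of each candidate document, the candidates reranked by the fused score, and the top K spliced into the prompt while the LLM parameters stay frozen (claim 8).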

Description

Bidirectional information retrieval enhancement generation method for large language model

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a retrieval-enhanced generation method based on a bidirectional semantic matching mechanism, which is used to improve the question-answering accuracy and context relevance of a large language model on knowledge-intensive tasks.

Background

Large language models (LLMs) have shown strong capability in scenarios such as open-domain question answering, dialogue, and content creation, but their knowledge is entirely solidified in the parameters: updating is costly, timeliness is poor, and "hallucination" easily leads to output that is inconsistent with fact. Retrieval-augmented generation (RAG) injects real-time external documents into the context through a two-stage "retrieve first, then generate" pipeline, so that the LLM can obtain the latest knowledge without incremental training; it has become the mainstream technical route for improving credibility. Compared with continually expanding the parameter scale, RAG realizes "instant learning" with very little compute and has clear advantages in knowledge-service fields such as medicine, finance, and law.

Most existing RAG methods use the classical information-retrieval pipeline: a sparse (BM25) or dense-vector (DPR) model first screens candidate passages, the Top-K are then selected by single-point similarity ranking, and finally the documents are spliced to the end of the prompt for the LLM to read. Recent improvements have focused on training better embedding models, introducing lightweight rerankers, or reducing length with summary compression, but the core remains "one-way" retrieval: only the "query → document" relevance score is computed, without verifying whether "document → query" holds equally, so many passages that hit on the surface but are actually irrelevant noise are fed into the context. In addition, the prior art treats retrieval and generation as mutually independent black boxes, cannot use the deep semantic capability of the LLM to perform contextualized secondary verification of candidate passages, and struggles to meet the higher requirements of complex tasks such as multi-hop reasoning, numerical facts, and fine-grained entities for evidence that is complete, mutually exclusive, and traceable. Therefore, there is growing interest in how to take full advantage of the semantic similarity between queries and documents during retrieval. At present, the following approaches attempt to solve this problem.

(I) Keyword-based retrieval models

Text data can be characterized in an inverted-index format, a data structure that maps keywords to document content: it stores only the keywords, the IDs of the documents in which each keyword appears, the term frequencies, and the position lists, yet it can indicate whether any piece of text contains a given keyword, and in which documents and at which positions the word appears. Through the process of looking up the query keywords, merging the inverted lists, and sorting by weight, all related documents can be fished out of massive text and given a relevance ranking.
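As a concrete illustration of the inverted-index structure just described, here is a minimal Python sketch; the whitespace tokenization and the summed-term-frequency ranking weight are simplifying assumptions standing in for a real analyzer and weighting scheme (e.g., BM25).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Posting: (doc_id, term_frequency, positions) — the three fields the
# text above names for each keyword's inverted list.
Posting = Tuple[int, int, List[int]]

def build_inverted_index(docs: List[str]) -> Dict[str, List[Posting]]:
    """Map each keyword to its posting list across the corpus."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, text in enumerate(docs):
        positions: Dict[str, List[int]] = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

def search(index: Dict[str, List[Posting]], query: str) -> List[Tuple[int, int]]:
    """Merge the inverted lists of the query terms and rank documents by
    summed term frequency (a stand-in for a real weighting scheme)."""
    scores: Dict[int, int] = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf, _ in index.get(term, []):
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    docs = ["apple pie recipe with fresh apple", "apple company quarterly report"]
    idx = build_inverted_index(docs)
    print(search(idx, "apple recipe"))  # doc 0 outranks doc 1
```

Note that the ranking here is purely literal: the two documents above both match "apple" even though only one concerns the fruit, which is exactly the limitation the next paragraph describes.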
However, this approach has several problems. First, keywords cannot fully capture the relevant information: the same concept has many expressions, synonyms, and near-synonyms, while a traditional inverted index can only match the literal form exactly and cannot recognize semantic equivalence, so the model cannot fully exploit similarity information, which affects final performance. Second, keyword matching is easily disturbed by word order, grammatical variation, and noise; for example, "apple" may refer to the fruit or to Apple Inc., and the lack of contextual semantic judgment causes false hits or misses. In addition, the inverted index offers weak support for long-tail or fuzzy queries: when user input is non-standard or slightly deviates, the system struggles to return effective results, which further limits the robustness of retrieval.

(II) Vector-based retrieval models

Another way to exploit similarity information effectively is to map text (words, sentences, or whole documents) into dense high-dimensional vectors using a vector space model, so that semantically similar content lies closer together in the vector space, and the degree of relatedness between texts is measured directly through vector similarity computation (such as cosine similarity or dot product). For vector space models there is a key problem, the semantic gap: a vector representation based on word frequency cannot capture synonyms, contextual semantics, or polysemy (e.g., "apple" may refer to the fruit or the company).
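A minimal sketch of the vector-retrieval idea above, assuming document and query embeddings already exist (random vectors stand in here for the output of a trained dense encoder such as the DPR model mentioned earlier); only the cosine-similarity ranking step is shown.

```python
import numpy as np

def cosine_similarity(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each document row."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def dense_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, top_k: int = 3):
    """Rank documents by cosine similarity; return (index, score) pairs."""
    sims = cosine_similarity(query_vec, doc_vecs)
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_vecs = rng.normal(size=(100, 768))              # stand-ins for encoded passages
    query_vec = doc_vecs[42] + 0.1 * rng.normal(size=768)  # query near document 42
    print(dense_retrieve(query_vec, doc_vecs))          # doc 42 should rank first
```

This one-way similarity ranking is precisely the "query → document" scoring the Background criticizes: it never checks whether the document, read in the other direction, actually supports the query, which motivates the bidirectional method of the claims.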