CN-122019766-A - Generative document reranking method based on retrieval-augmented generation
Abstract
The invention discloses a generative document reranking method based on retrieval-augmented generation, comprising the steps of: preprocessing an original document set to obtain a candidate document set; inputting a query question and the candidate document set into a large language model to generate an answer and a chain of thought; decomposing the chain of thought into atomic reasoning steps and then screening samples; computing an information gain score and a semantic similarity score of each candidate document for each atomic reasoning step, fusing the two with weights, and taking the maximum over steps as the document's final contribution score; sorting the candidate documents by score to form a training data set; training a generative reranking model so that it outputs an ordered sequence of document identifiers for a candidate document set; and applying the pre-trained generative reranking model in an online inference stage. By turning the ranking target toward the documents' actual contribution to the reasoning steps, and combining high-quality supervision signal construction with a sliding-window global reranking strategy, the method improves retrieval accuracy for key documents and the practicality of the reranking model in complex question-answering scenarios.
Inventors
- Li Youhuizi
- Weng Kaiqi
- Yin Yuyu
- Liang Tingting
- Sun Qianqian
- Li Yu
Assignees
- Hangzhou Dianzi University
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-14
Claims (11)
- 1. A generative document reranking method based on retrieval-augmented generation, characterized in that it is executed by a computing device, comprises an offline training stage and an online inference stage, and comprises the following specific steps: (1) cleaning and structuring an original document set, segmenting it into document fragments, building a vector index with a Faiss database after vector encoding, and establishing a mapping between the fragments and the original documents; (2) inputting the query question and the candidate document set into a large language model to generate an answer and a chain of thought, decomposing the chain of thought into atomic reasoning steps, and then screening samples to remove unqualified data samples through answer comparison, evidence verification, and logical consistency assessment; (3) computing an information gain score and a semantic similarity score of each candidate document for each atomic reasoning step, fusing the two with weights and taking the maximum over steps as the document's final contribution score, and sorting the candidate documents by score to form a training data set; (4) encoding the query question and the training data set as input to a large language model, iteratively training a generative reranking model so that it outputs an ordered sequence of document identifiers, and introducing a position-aware weighted loss function during training that assigns higher loss weight to documents ranked near the front; (5) in the online inference stage, if the number of candidate documents does not exceed the input length limit of the generative reranking model, feeding them to the model directly to obtain the ranking result; if it does exceed the limit, determining a window capacity and a sliding stride from the model's context window and achieving global reranking through iterative window sliding.
- 2. The method of claim 1, wherein in step (1) the document is split preferentially at natural paragraph or semantic boundaries; a paragraph exceeding a preset threshold is split again at a fixed token length, an overlapping region is set between adjacent document fragments, and the target fragment length is 256 or 512 tokens.
- 3. The method of claim 1, wherein the vector encoding in step (1) uses the bge-m3 embedding model to map both document fragments and query questions into the same semantic vector space, vector retrieval uses cosine similarity to measure the similarity between query vectors and fragment vectors, and retrieval results are aggregated from the fragment level to the document level to form the candidate document set.
- 4. The method of claim 1, wherein decomposing the chain of thought in step (2) into atomic reasoning steps comprises taking each semantic unit that independently expresses an intermediate inference, delimited by logical sequence or syntactic boundaries in the chain-of-thought text, as one atomic reasoning step.
- 5. The method of claim 1, wherein the sample screening in step (2) comprises comparing the final answer generated by the large language model with the standard answer, determining whether each atomic reasoning step can locate supporting evidence in the candidate documents, and determining whether there is a logical conflict or causal inconsistency between atomic reasoning steps.
- 6. The method of claim 1, wherein the information gain score in step (3) is obtained by: for each candidate document and each atomic reasoning step, computing the probability of generating the reasoning step with the document introduced, computing the probability of generating the reasoning step without the document, and computing the information gain score of the document for the reasoning step from these two probabilities.
- 7. The method of claim 6, wherein obtaining the semantic similarity score in step (3) comprises mapping the atomic reasoning steps and the candidate documents into the same semantic vector space and computing the semantic similarity score between each candidate document and each atomic reasoning step.
- 8. The method of claim 7, wherein the weighted fusion in step (3) is the information gain score multiplied by a weight parameter, plus the semantic similarity score multiplied by one minus that weight parameter, the weight parameter balancing causal contribution against semantic relevance.
- 9. The method of claim 1, wherein the position-aware weighted loss function of step (4) is the negative reciprocal of the number of candidate documents multiplied by the sum, over ranking positions, of the product of each position's weight value and the log probability of generating the document identifier at that position, wherein the weight value of a ranking position is the ratio of one plus a hyperparameter to one plus the logarithm of the position index, the hyperparameter is greater than 0, and the weight value decreases as the ranking position increases.
- 10. The method of claim 1, wherein in step (5) the sliding stride is one half of the window capacity, the window capacity is chosen so that the model input does not exceed a preset proportion of the maximum input length and output space is reserved, and globally high-relevance documents are produced by aggregating the results of iterative window sliding with a candidate updating strategy.
- 11. The method of claim 1, wherein the generative reranking model is obtained by instruction fine-tuning a LLaMA-7B base model in listwise ranking mode, and the large language model is a GPT-4 model with reasoning capability, used to generate chains of thought and construct ranking supervision signals.
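The two-stage splitting of claim 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: whitespace tokenization stands in for a real tokenizer, and the function name and overlap size are illustrative choices.

```python
def chunk_document(text, target_len=256, overlap=32):
    """Split a document into fragments of roughly target_len tokens.

    Paragraphs (natural/semantic boundaries) are kept whole when short
    enough; a paragraph exceeding target_len is re-split at a fixed
    token length, with an overlapping region between adjacent fragments.
    Whitespace tokenization is a stand-in for a real tokenizer.
    """
    fragments = []
    for para in text.split("\n\n"):
        tokens = para.split()
        if not tokens:
            continue
        if len(tokens) <= target_len:
            fragments.append(" ".join(tokens))
        else:
            step = target_len - overlap  # advance less than a full window
            for start in range(0, len(tokens), step):
                fragments.append(" ".join(tokens[start:start + target_len]))
                if start + target_len >= len(tokens):
                    break
    return fragments
```

A 600-token paragraph with `target_len=256` and `overlap=32` yields three fragments, each sharing 32 tokens with its neighbor.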
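The retrieval of claim 3 can be sketched as cosine-similarity scoring over fragment vectors, aggregated from the fragment level to the document level by taking each document's best fragment score. A Faiss `IndexFlatIP` over L2-normalized vectors computes the same scores at scale; plain Python is used here only to keep the sketch self-contained, and the aggregation-by-maximum rule is an assumption since the claim does not fix the aggregation function.

```python
import math

def retrieve_documents(query_vec, frag_vecs, frag_to_doc, top_k=3):
    """Rank documents by the best cosine similarity of any fragment.

    frag_to_doc maps fragment index -> document id (the fragment/original
    document mapping established during indexing).
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    doc_scores = {}
    for frag_id, vec in enumerate(frag_vecs):
        doc = frag_to_doc[frag_id]
        score = cosine(query_vec, vec)
        doc_scores[doc] = max(doc_scores.get(doc, -1.0), score)
    # Highest-scoring documents first, forming the candidate document set.
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]
```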
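The contribution scoring of claims 6 to 8 can be sketched in one function: per-step information gain as the log-probability difference with versus without the document, cosine similarity between step and document embeddings, weighted fusion, and a maximum over steps. The hooks `step_logprob` and `embed` are hypothetical stand-ins for a scoring LLM and an embedding model such as bge-m3; `lam` is the balancing weight parameter.

```python
import math

def contribution_score(doc, steps, query, step_logprob, embed, lam=0.7):
    """Final contribution score of one candidate document.

    For each atomic reasoning step: gain = log p(step | query, doc) -
    log p(step | query); sim = cosine(embed(step), embed(doc));
    fused = lam * gain + (1 - lam) * sim. The maximum fused value over
    all steps is the document's final score.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    doc_vec = embed(doc)
    best = float("-inf")
    for step in steps:
        gain = step_logprob(step, query, doc) - step_logprob(step, query, None)
        sim = cosine(embed(step), doc_vec)
        best = max(best, lam * gain + (1 - lam) * sim)
    return best
```

Documents are then sorted by this score to build the ranking supervision signal of step (3).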
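The position-aware weighted loss of claim 9 translates directly into a formula: L = -(1/N) * sum_i w_i * log p_i, with w_i = (1 + alpha) / (1 + log i) for position i (1-indexed) and alpha > 0, so earlier positions carry larger weight. A minimal sketch, taking the per-position log probabilities as given:

```python
import math

def position_aware_loss(logprobs, alpha=0.5):
    """Position-aware weighted loss over a ranked list.

    logprobs[i] is the log probability the model assigns to the document
    identifier emitted at ranking position i+1. Weight (1 + alpha) /
    (1 + log i) decreases with position, so errors on top-ranked
    documents are penalized more.
    """
    n = len(logprobs)
    total = 0.0
    for i, lp in enumerate(logprobs, start=1):
        w = (1.0 + alpha) / (1.0 + math.log(i))
        total += w * lp
    return -total / n
```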
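The sliding-window global reranking of claims 1 and 10 can be sketched as follows. The back-to-front sliding order and the `rerank_fn` hook (standing in for the generative reranking model, which takes a list of document identifiers and returns them reordered) are assumptions; the stride of half the window capacity follows claim 10.

```python
def sliding_window_rerank(doc_ids, rerank_fn, window=8):
    """Globally rerank a candidate list longer than one model window.

    If the list fits in one window it is reranked directly. Otherwise
    windows of `window` documents are reranked with a stride of
    window // 2, sliding from the tail toward the head; the overlap lets
    strong documents found late in the list bubble toward the front.
    """
    if len(doc_ids) <= window:
        return rerank_fn(doc_ids)
    stride = window // 2
    ids = list(doc_ids)
    start = len(ids) - window
    while True:
        ids[start:start + window] = rerank_fn(ids[start:start + window])
        if start == 0:
            break
        start = max(0, start - stride)
    return ids
```

With `rerank_fn` replaced by an oracle that sorts ids descending, a tail document with the highest score reaches the front after the windows are processed.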
Description
Generative document reranking method based on retrieval-augmented generation
Technical Field
The invention relates to the technical field of natural language processing and information retrieval, and in particular to a document reranking method for retrieval-augmented generation scenarios, aiming to improve document ranking performance.
Background
In a retrieval-augmented generation (RAG) question-answering system, document ranking is a key factor affecting system performance. The system recalls candidate documents related to the query from a large-scale document library, then screens out high-quality documents through a reranking model to provide reliable evidence support for subsequent answer generation. Current document reranking methods mostly rely on surface-level relevance metrics between queries and documents, such as keyword matching scores, sparse retrieval scores, or dense vector similarity scores. Although these methods perform well in simple factual question answering, in complex tasks involving multi-hop reasoning, causal inference, comparative analysis, or evidence integration across documents, the ranking results tend to be mismatched with the evidence support required by the reasoning process. Two typical problems arise. First, "similar but unusable": some documents are semantically highly similar to the query topic but lack the facts or evidence needed to support a key step in the inference chain, and thus fail to truly support answer generation. Second, the converse: some documents contain key evidence, but their phrasing differs greatly from the query, so their surface similarity is low, they are ranked low in the candidate list, and they are difficult for the downstream reasoning module to exploit effectively.
In addition, training of existing reranking models usually relies on coarse-grained manual labels such as "relevant" or "irrelevant", which cannot reflect a document's specific contribution to the different steps of a complex reasoning process. The lack of such supervision signals limits model performance on reasoning-aware ranking tasks. Moreover, in practical deployments the reranking model is constrained by a preset context window length: when the number of candidate documents is large, the model cannot process all of them at once, and the input often has to be reduced by truncation or random sampling, so high-value evidence is dropped and overall system performance degrades. Designing a reranking method that perceives reasoning logic, refines supervision signals, and adapts to long document lists has therefore become a key challenge for improving the accuracy and reliability of retrieval-augmented generation systems.
Disclosure of Invention
To solve the above technical problems, the core objective of the invention is to provide a generative document reranking method that balances inference utility and global ranking capability: by turning the ranking target toward documents' actual contribution to reasoning steps, and by combining high-quality supervision signal construction with a sliding-window global reranking strategy, it improves retrieval accuracy for key documents and the practicality of the reranking model in complex question-answering scenarios.
To achieve the above purpose, the technical scheme adopted by the invention is as follows. A generative document reranking method based on retrieval-augmented generation is executed by a computing device, comprises an offline training stage and an online inference stage, and comprises the following specific steps: (1) cleaning and structuring an original document set, segmenting it into document fragments, building a vector index with a Faiss database after vector encoding, and establishing a mapping between the fragments and the original documents; (2) inputting the query question and the candidate document set into a large language model to generate an answer and a chain of thought, decomposing the chain of thought into atomic reasoning steps, and then screening samples to remove unqualified data samples through answer comparison, evidence verification, and logical consistency assessment; (3) computing an information gain score and a semantic similarity score of each candidate document for each atomic reasoning step, fusing the two with weights and taking the maximum over steps as the document's final contribution score, and sorting the candidate documents by score to form a training data set; (4) encoding the query question and the training data set as model input, iteratively training a generative reranking model so that the reorderin