CN-121997918-A - Two-stage document filtering and robust fine tuning method based on graph attention network

CN121997918A

Abstract

A two-stage document filtering and robust fine tuning method based on a graph attention network relates to the technical fields of natural language processing and deep learning. The method addresses the problems that existing retrieval-augmented generation systems have difficulty accurately identifying useful documents and are easily misled by counterfactual information in mixed document environments. The method comprises: constructing a multi-class document pool comprising correct documents, counterfactual documents, noise documents, and irrelevant documents; constructing a paragraph-level semantic graph; sequentially filtering out irrelevant documents and noise documents with a two-stage graph attention network to obtain a reference document set; constructing document discrimination training samples and question-answer training samples based on the reference document set; and performing instruction fine tuning on a large language model with the combined fine tuning data, so that the model gains document reliability discrimination capability and keeps stable output in mixed document scenarios, thereby improving the factual accuracy and credibility of a retrieval-augmented generation system under counterfactual attacks and in noisy environments.
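The abstract's multi-class document pool (correct, counterfactual, noise, irrelevant documents mixed per query) can be illustrated with a minimal sketch. The pool contents, sampling scheme, and replacement probability below are illustrative assumptions, not values disclosed in the patent:

```python
import random

def build_mixed_set(pools, k=4, p_attack=0.5, seed=0):
    """Sample k candidate documents for one query. With probability
    p_attack a slot is filled from a non-correct pool, simulating the
    mixed document environment the abstract describes."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(k):
        if rng.random() < p_attack:
            kind = rng.choice(["counterfactual", "noise", "irrelevant"])
        else:
            kind = "correct"
        mixed.append({"text": rng.choice(pools[kind]), "source_pool": kind})
    return mixed

# Toy one-entry pools for a single query (hypothetical examples).
pools = {
    "correct":        ["Paris is the capital of France."],
    "counterfactual": ["Lyon is the capital of France."],   # semantically close, factually wrong
    "noise":          ["France exports wine and cheese."],  # topical but unhelpful
    "irrelevant":     ["Photosynthesis occurs in leaves."],
}
mixed = build_mixed_set(pools)
```

The `source_pool` field kept on each document is what later makes the supervision signals for the two discrimination models (and the category labels of claim 8) available at training time.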

Inventors

  • REN WEIWU
  • LIU YANG

Assignees

  • Changchun University of Science and Technology (长春理工大学)

Dates

Publication Date
2026-05-08
Application Date
2026-04-10

Claims (9)

  1. A two-stage document filtering and robust fine tuning method based on a graph attention network, characterized by comprising the following steps: step one, constructing a multi-class document pool and constructing a mixed document set; step two, performing text preprocessing on the mixed document set to construct a paragraph-level semantic graph; step three, training an irrelevant document discrimination model and a noise document discrimination model based on a graph attention network to obtain a reference document set; step four, constructing a document discrimination training sample and a question-answer training sample according to the reference document set obtained in step three, and combining the two to obtain a joint robust fine tuning data set; and step five, carrying out joint instruction fine tuning on the large language model using the joint robust fine tuning data set obtained in step four, integrating the fine-tuned large language model and the irrelevant and noise document discrimination models of step three into a retrieval-augmented generation system, and inputting the document set filtered by the system and the query text into the robustly fine-tuned large language model to output a final answer.
  2. The two-stage document filtering and robust fine tuning method based on a graph attention network of claim 1, wherein in step one, the specific process of constructing the multi-class document pool is as follows: sub-step 1.1, obtaining the original data of a retrieval-augmented generation system and an external knowledge base corpus, vector-encoding the corpus, and establishing a similarity retrieval index; sub-step 1.2, dividing the query texts and corresponding standard answers in the original data into a training set and a test set according to a preset rule, retrieving a plurality of documents for each query text via the index, and forming a correct document pool, an irrelevant document pool, and a noise document pool for each query text; and sub-step 1.3, generating a counterfactual document pool from the query texts and the correct documents in the correct document pool, and aligning it with the correct, irrelevant, and noise document pools by query number to obtain the multi-class document pool.
  3. The two-stage document filtering and robust fine tuning method based on a graph attention network of claim 2, wherein in sub-step 1.3, semantic replacement, rewriting, or perturbation operations are performed on the correct documents in the correct document pool by a large language model to obtain a counterfactual document pool that is superficially plausible but misleading.
  4. The two-stage document filtering and robust fine tuning method based on a graph attention network as recited in claim 3, wherein in step one, documents are drawn from the correct document pool to simulate the candidate document sequence retrieved by a retrieval-augmented generation system, and a mixed document set is constructed according to a set attack type and a set replacement probability.
  5. The two-stage document filtering and robust fine tuning method based on a graph attention network according to claim 1, wherein in step two, the semantic vector of each document is obtained through a sentence-embedding model and document attribute features are computed; the semantic vector and the attribute features are concatenated to form the document node feature vector; and with the query text as the query node and the documents in the mixed document set as document nodes, a paragraph-level semantic graph corresponding to the query text is constructed.
  6. The two-stage document filtering and robust fine tuning method based on a graph attention network of claim 1, wherein in step three, the document nodes in the paragraph-level semantic graph are screened by the irrelevant document discrimination model, specifically: during training, irrelevant document nodes serve as positive-class supervision signals and the remaining document nodes as negative-class signals, model parameters are optimized with a binary cross-entropy loss, and the irrelevant-document decision threshold is determined according to a preset accuracy target; after all document nodes are labeled, the nodes labeled 0 and their associated edges are deleted and the nodes labeled 1 are retained, yielding the document set after first-stage filtering.
  7. The two-stage document filtering and robust fine tuning method based on a graph attention network of claim 6, wherein in step three, the document nodes in the first-stage-filtered document set are screened by the noise document discrimination model, specifically: during training, noise documents serve as the supervision signal for binary classification training, so that the model learns to identify noise documents that may mislead the answer direction or introduce bias, and the noise-document decision threshold is determined according to a preset accuracy target; after all document nodes are labeled, the nodes labeled 0 and their associated edges are deleted and the nodes labeled 1 are retained, and the document set after second-stage filtering is taken as the reference document set.
  8. The two-stage document filtering and robust fine tuning method based on a graph attention network of claim 1, wherein in step four, a classification label is generated for each document in the reference document set: a document sourced from the correct document pool is labeled as category 1, and a document sourced from any other pool is labeled as category 2, forming the document discrimination training samples; and the query text and the reference document set are taken as input and the standard answer as output, forming question-answer training samples for training the large language model to answer questions over multiple documents.
  9. The two-stage document filtering and robust fine tuning method based on a graph attention network of claim 1, wherein in step five, upon receiving a new query, the retrieval-augmented generation system first retrieves a document set based on the similarity between the query text vector and the knowledge base document vectors, then sequentially performs irrelevant-document filtering and noise-document filtering, and finally inputs the filtered document set and the query text into the robustly fine-tuned large language model, which outputs the final answer together with a classification evaluation result for each reference document.
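The two-stage node screening described in claims 5 through 7 can be sketched in plain NumPy. This is a minimal illustration only: the single-head graph attention layer, the feature dimensions, the fully connected graph, the sigmoid readout, and the randomly initialized stand-in weights are all assumptions, since the patent does not disclose its exact architecture or trained parameters:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """One graph-attention layer. H: (N, F) node features, A: (N, N)
    adjacency with self-loops, W: (F, Fp) projection, a: (2*Fp,)
    attention vector. Attention logits e_ij = LeakyReLU(a^T [z_i || z_j])."""
    Z = H @ W
    N = Z.shape[0]
    e = np.full((N, N), -np.inf)          # -inf masks non-edges in softmax
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                e[i, j] = leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.maximum(alpha @ Z, 0.0)     # ReLU nonlinearity

def node_scores(H, A, params):
    """Per-node sigmoid score from one GAT layer plus a linear readout."""
    W, a, w_out = params
    Z = gat_layer(H, A, W, a)
    return 1.0 / (1.0 + np.exp(-(Z @ w_out)))

rng = np.random.default_rng(0)
N, F, Fp = 6, 8, 4             # 1 query node + 5 document nodes (toy sizes)
H = rng.normal(size=(N, F))    # stand-in for spliced semantic + attribute features
A = np.ones((N, N))            # fully connected toy graph with self-loops

# Randomly initialized stand-ins for the trained stage-1 (irrelevant)
# and stage-2 (noise) discrimination models.
stage1 = (rng.normal(size=(F, Fp)), rng.normal(size=2 * Fp), rng.normal(size=Fp))
stage2 = (rng.normal(size=(F, Fp)), rng.normal(size=2 * Fp), rng.normal(size=Fp))

docs = list(range(1, N))       # node 0 is the query node
tau1 = tau2 = 0.5              # thresholds; the patent sets these via an accuracy target

# Stage 1: keep only document nodes whose score clears the threshold (label 1).
s1 = node_scores(H, A, stage1)
kept = [d for d in docs if s1[d] >= tau1]

# Stage 2: re-score the survivors on the reduced subgraph, drop noise nodes.
idx = [0] + kept
s2 = node_scores(H[idx], A[np.ix_(idx, idx)], stage2)
reference_set = [d for d, s in zip(kept, s2[1:]) if s >= tau2]
```

Note the stage-2 pass runs on the subgraph induced by the stage-1 survivors, mirroring the claims' deletion of label-0 nodes and their associated edges before the second filter is applied.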

Description

Two-stage document filtering and robust fine tuning method based on graph attention network

Technical Field

The invention relates to the technical fields of natural language processing, information retrieval, and deep learning, in particular to a document filtering and model training method for improving the robustness of retrieval-augmented generation (RAG) systems, and more particularly to a two-stage document filtering and robust fine tuning method based on a graph attention network.

Background

At present, RAG technology is widely applied in fields such as intelligent question answering, text analysis, and knowledge services. However, as external knowledge bases grow in scale and content complexity, the candidate documents retrieved by a system often simultaneously contain correct evidence, noise fragments, topically irrelevant content, and misleading counterfactual information. Once such content is mixed into the input, a large language model is easily steered away from the true answer during reasoning, causing risks such as factual errors and hallucinated answers and seriously affecting the reliability of the system in practical scenarios. Existing document filtering methods mostly rely on similarity retrieval, single-document scoring, or rule-based ranking strategies, and have limited capability to capture deep semantic relations between a query and multiple documents. When the document set contains counterfactual information that is semantically similar to the evidence but reaches the opposite conclusion, traditional methods struggle to distinguish it, and misleading documents are easily ranked highly; meanwhile, rules and thresholds are sensitive to the data distribution and are easily destabilized by noise.
In addition, reranking methods based on large models incur high computational cost, demand substantial training data, and remain clearly vulnerable to counterfactual attacks. When handling mixed document environments, conventional RAG still relies mainly on the model's attention mechanism to judge document quality; however, large language models have insufficient explicit awareness of document credibility, so they struggle to make reliable choices when multiple documents contradict one another, and when the proportion of noise or counterfactual documents is high, the model is easily pulled toward false evidence and outputs unreliable answers. Therefore, to maintain stable performance in complex retrieval scenarios, a technical solution is needed that can model the overall structure of the document set, effectively identify noisy documents, and simultaneously enhance the robustness of large language models under multi-document conditions. To solve the above problems, the present invention provides a two-stage document filtering and robust fine tuning method based on a graph attention network.

Disclosure of Invention

The invention provides a two-stage document filtering and robust fine tuning method based on a graph attention network, which aims to solve the problems that an existing retrieval-augmented generation system is difficult to accurately identify useful documents and is easily interfered with by counterfactual information in a mixed document environment.
The two-stage document filtering and robust fine tuning method based on a graph attention network is realized through the following steps: step one, constructing a multi-class document pool and constructing a mixed document set; step two, performing text preprocessing on the mixed document set to construct a paragraph-level semantic graph; step three, training an irrelevant document discrimination model and a noise document discrimination model based on a graph attention network to obtain a reference document set; step four, constructing a document discrimination training sample and a question-answer training sample according to the reference document set obtained in step three, and combining the two to obtain a joint robust fine tuning data set; and step five, carrying out joint instruction fine tuning on the large language model using the joint robust fine tuning data set obtained in step four, integrating the fine-tuned large language model and the irrelevant and noise document discrimination models of step three into a retrieval-augmented generation system, and inputting the filtered document set and the query text into the robustly fine-tuned large language model to output the final answer. The invention has the following beneficial effects: the method constructs a two-stage document filtering model based on a graph attention network, namely an irrelevant document discrimination model and a noise document discrimination model based on the graph attention network
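The inference flow of step five (retrieve by vector similarity, filter in two stages, then query the fine-tuned model) can be sketched as follows. The cosine-similarity retriever is a common choice but an assumption here, the two filters are stubs standing in for the trained discrimination models, and the prompt format is entirely hypothetical:

```python
import numpy as np

def cosine_top_k(q, D, k=3):
    """Return indices of the k knowledge-base vectors most similar to query q."""
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]

# Toy knowledge base: 4 document embeddings; the query is built to be
# nearest to doc_c so retrieval behavior is predictable.
rng = np.random.default_rng(1)
kb = ["doc_a", "doc_b", "doc_c", "doc_d"]
D = rng.normal(size=(4, 16))
q = D[2] + 0.01 * rng.normal(size=16)

retrieved = [kb[i] for i in cosine_top_k(q, D, k=3)]

def filter_irrelevant(docs):   # stand-in for the stage-1 GAT discrimination model
    return docs
def filter_noise(docs):        # stand-in for the stage-2 GAT discrimination model
    return docs

reference_docs = filter_noise(filter_irrelevant(retrieved))

# Hypothetical prompt: the fine-tuned model both answers and rates each
# reference document, as claim 9 describes.
prompt = (
    "Answer using only reliable documents; also rate each document.\n"
    + "\n".join(f"[{i}] {d}" for i, d in enumerate(reference_docs))
    + "\nQuestion: <query text>"
)
```

In a real deployment the two stubs would be replaced by the trained irrelevant-document and noise-document discrimination models operating on the paragraph-level semantic graph, and `prompt` would be sent to the robustly fine-tuned large language model.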