CN-121834025-B - FocusedRAG-based trusted academic question answering method and FocusedRAG-based trusted academic question answering device
Abstract
The invention discloses a FocusedRAG-based trusted academic question-answering method and device, belonging to the field of artificial intelligence and information retrieval. The method comprises: downloading academic paper data from a public data source as an original data set; segmenting the document content in the original data set along chapter boundaries; rewriting the user query with a large language model to clarify the retrieval intent; designing a query-adaptive signal-to-noise separation mechanism to screen query-relevant chapters as the search space; extending the query-adaptive signal-to-noise separation mechanism within the selected chapters to obtain a sentence-level signal-to-noise demarcation threshold; designing a sentence selection mechanism to obtain candidate evidence intervals; sorting the candidate evidence intervals by their cumulative relevance scores; greedily selecting candidate evidence intervals under the constraint of a retrieval token budget to form the retrieval context; and inputting the retrieval context together with the user query into the large language model to generate the final answer. The invention establishes an accurate basis for evidence attribution and remarkably improves the accuracy and credibility of question answering.
Inventors
- WANG YINGLONG
- YU RUI
- WANG TIANYI
- LIU RUIXIA
Assignees
- Qilu University of Technology (Shandong Academy of Sciences)
- Shandong Computer Science Center (National Supercomputer Center in Jinan)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-13
Claims (6)
- 1. A FocusedRAG-based trusted academic question-answering method, comprising the steps of: S1, data acquisition: acquiring academic paper data from a public data source as an original data set; S2, preprocessing: segmenting the document content in the original data set along chapter boundaries, and rewriting the user query with a large language model to obtain a rewritten query; S3, search space compression: constructing a query-adaptive signal-to-noise separation mechanism and, based on the rewritten query, screening query-relevant chapters as the search space, specifically comprising: S31, inputting the rewritten query and each target object into a pre-trained intent recognition model to obtain the intent relevance score between that target object and the rewritten query, and aggregating the intent relevance scores of all target objects to obtain an intent relevance distribution; S32, arranging the intent relevance distribution in descending order, dividing the target objects into a signal class and a noise class at each candidate boundary position in turn, and calculating the inter-class variance of each division; S33, identifying the optimal demarcation position by searching for the maximum inter-class variance, and calculating the average of the intent relevance scores adjacent to the optimal demarcation position as the optimal demarcation threshold for the target-object intent relevance scores; S34, taking the chapters whose intent relevance scores exceed the optimal demarcation threshold as signal chapters for subsequent sentence selection; S4, evidence interval screening: extending the query-adaptive signal-to-noise separation mechanism within the search space to obtain a sentence-level signal-to-noise demarcation threshold, obtaining candidate evidence intervals from the search space through a relevance-density-guided sentence selection mechanism, sorting the candidate evidence intervals by their cumulative relevance scores, and preferentially selecting candidate evidence intervals under the constraint of a retrieval token budget to form the retrieval context; the extended query-adaptive signal-to-noise separation mechanism specifically comprises: S411, inputting the rewritten query and all sentences of the signal chapters into the query-adaptive signal-to-noise separation mechanism, and calculating the intent relevance score between the rewritten query and each sentence using the pre-trained intent recognition model to obtain a sentence-level intent relevance distribution; S412, applying Gaussian smoothing to the sentence-level intent relevance distribution to obtain a smoothed intent relevance distribution; S413, applying the operations of S32, S33 and S34 to the smoothed intent relevance distribution to obtain the sentence-level optimal demarcation threshold; the relevance-density-guided sentence selection mechanism specifically comprises: S421, calculating the intent relevance surplus: for the smoothed intent relevance score of each sentence in the signal chapters, calculating its surplus relative to the sentence-level optimal demarcation threshold; S422, constructing the candidate evidence interval set: for each starting position whose intent relevance surplus is greater than zero, identifying the termination position that maximizes the cumulative surplus by the maximum-subarray principle, and collecting all intervals with positive cumulative surplus to form the candidate interval set; S423, constructing the retrieval context: arranging the intervals in the candidate evidence interval set in descending order of their cumulative relevance to obtain a priority sequence, and performing greedy selection on the priority sequence to obtain the final retrieval context; S5, answer generation: inputting the retrieval context and the user query into a large language model, which generates the academic question-answering answer for the user query.
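Steps S31–S34 amount to an Otsu-style search for the boundary that maximizes inter-class variance over the sorted relevance distribution. A minimal Python sketch of this thresholding follows; the function name `snr_threshold` and the midpoint rule for deriving the final threshold from the two scores adjacent to the boundary are illustrative assumptions, not the patent's exact formula:

```python
import numpy as np

def snr_threshold(scores):
    """Query-adaptive signal/noise separation (S32-S33): sort relevance
    scores in descending order, test every candidate boundary, keep the
    split maximizing inter-class variance, and return a threshold."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    n = len(s)
    best_k, best_var = 1, -1.0
    for k in range(1, n):                      # signal = s[:k], noise = s[k:]
        w1, w2 = k / n, (n - k) / n            # class weights
        m1, m2 = s[:k].mean(), s[k:].mean()    # class means
        var = w1 * w2 * (m1 - m2) ** 2         # inter-class variance
        if var > best_var:
            best_var, best_k = var, k
    # assumed rule: midpoint of the two scores adjacent to the boundary
    return (s[best_k - 1] + s[best_k]) / 2.0

# chapter-level intent relevance scores for one query (S34 keeps the signal)
scores = [0.91, 0.88, 0.15, 0.12, 0.85, 0.10]
threshold = snr_threshold(scores)              # 0.5 for this distribution
signal = [x for x in scores if x > threshold]  # the three high-relevance chapters
```

With clearly bimodal scores like these, the variance criterion isolates the high-relevance chapters regardless of how many there are, which is why the mechanism is described as query-adaptive rather than top-k.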
- 2. The FocusedRAG-based trusted academic question-answering method of claim 1, wherein the preprocessing of step S2 specifically comprises: S21, extracting chapter titles from each document in the original data set, dividing the document content into chapter units based on the detected chapter titles, and further dividing the content of each chapter into individual sentences; S22, inputting the user query and all detected chapter titles as context information into a large language model, which re-expresses the query to generate the rewritten query.
- 3. The FocusedRAG-based trusted academic question-answering method of claim 1, wherein the greedy selection specifically comprises: performing greedy selection on the priority sequence under the constraint of a retrieval token budget B: k* = max{ k : sum_{i=1..k} |c_i| <= B }, where k* is the optimal cutoff position and |c_i| denotes the token count of the i-th interval in the priority sequence; the content of the first k* intervals is aggregated into the selected retrieval context, and the intervals in the selected retrieval context are reordered according to their original document positions to form the final retrieval context.
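The budget-constrained greedy selection of claim 3 can be illustrated as follows. The interval representation `(doc_position, tokens, text)` and the function name `greedy_select` are hypothetical; the prefix rule (take intervals in priority order until the token budget is exhausted, then restore document order) follows the claim:

```python
def greedy_select(intervals, budget):
    """Greedy selection under a retrieval token budget: take the longest
    prefix of the priority sequence (intervals already sorted by cumulative
    relevance, descending) whose total token count fits within the budget."""
    selected, used = [], 0
    for pos, tokens, text in intervals:
        if used + tokens > budget:
            break                              # budget exhausted: stop
        selected.append((pos, tokens, text))
        used += tokens
    # reorder by original document position to form the final context
    selected.sort(key=lambda iv: iv[0])
    return " ".join(text for _, _, text in selected)

# priority sequence: highest cumulative relevance first
priority = [(2, 50, "B"), (0, 30, "A"), (5, 40, "C")]
context = greedy_select(priority, budget=80)   # "A B": C exceeds the budget
```

Reordering by document position after selection preserves the discourse order of the evidence, which matters when the language model reads the assembled context.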
- 4. The FocusedRAG-based trusted academic question-answering method of claim 1, wherein the answer generation of step S5 specifically comprises: receiving the final retrieval context output in step S4 and the user query as input, and generating the final answer using the large language model.
- 5. A FocusedRAG-based trusted academic question-answering device, performing the FocusedRAG-based trusted academic question-answering method of any one of claims 1 to 4, comprising: a data acquisition module, which acquires from a public data source an academic paper data set containing fine-grained evidence annotations; a preprocessing module, which extracts the text content, detects the lines containing chapter titles, divides each document along its chapter boundaries, further divides the chapter content into sentences, and rewrites the user query with a large language model in combination with all chapter titles so as to clarify the retrieval intent; a search space compression module, which applies the query-adaptive signal-to-noise separation mechanism to divide the full-text content into query-relevant signal chapters and query-irrelevant noise chapters, compressing the retrieval range from the full-document space to the query-relevant chapter space; an evidence interval screening module, which first extends the query-adaptive signal-to-noise separation mechanism to the sentence level based on the content of the selected signal chapters to establish a sentence-level signal-to-noise separation threshold, then applies the relevance-density-guided sentence selection mechanism to identify contiguous regions of highly relevant sentences as candidate evidence intervals, sorts the candidate evidence intervals by their cumulative relevance scores, and finally performs greedy selection on the sorted candidate evidence intervals under the constraint of a retrieval token budget to form the final retrieval context; and an answer generation module, which receives the final retrieval context and the user query as input, generates the final answer using the large language model, and outputs it.
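The evidence interval screening module's sentence-level pipeline (Gaussian smoothing of the relevance distribution, surplus over the sentence-level threshold, and maximum-subarray construction of candidate intervals, per S412 and S421–S423) might be sketched as below. The kernel width, function names, and interval representation `(start, end, cumulative surplus)` are illustrative assumptions:

```python
import numpy as np

def gaussian_smooth(scores, sigma=1.0, radius=2):
    """S412: 1-D Gaussian smoothing of the sentence-level relevance scores."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(scores, kernel, mode="same")

def candidate_intervals(smoothed, threshold):
    """S421-S422: surplus = smoothed score minus the sentence-level
    threshold; for each start with positive surplus, extend to the end
    position maximizing cumulative surplus (maximum-subarray principle),
    then sort intervals by cumulative surplus to form the priority sequence."""
    surplus = np.asarray(smoothed, dtype=float) - threshold
    intervals = []
    for i in range(len(surplus)):
        if surplus[i] <= 0:
            continue                         # starts must have positive surplus
        run, best, end = 0.0, 0.0, i
        for j in range(i, len(surplus)):
            run += surplus[j]
            if run > best:                   # best end position so far
                best, end = run, j
        if best > 0:
            intervals.append((i, end, best))  # (start, end, cumulative surplus)
    # S423: descending cumulative relevance = priority sequence
    return sorted(intervals, key=lambda t: -t[2])

smoothed = [0.9, 0.8, 0.1, 0.7]
ranked = candidate_intervals(smoothed, threshold=0.5)
# top interval spans sentences 0-1: contiguity lets it absorb both surpluses
```

Because cumulative surplus rewards runs of above-threshold sentences, the mechanism favors spatially concentrated evidence regions over isolated high-scoring sentences, matching the "spatial concentration" property the description attributes to academic evidence.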
- 6. A computer-readable storage medium having a computer program stored therein, the computer program being executable by a processor to implement the FocusedRAG-based trusted academic question-answering method of any one of claims 1 to 4.
Description
FocusedRAG-based trusted academic question answering method and FocusedRAG-based trusted academic question answering device

Technical Field
The invention belongs to the field of artificial intelligence and information retrieval, and particularly relates to a FocusedRAG-based trusted academic question-answering method and device.

Background
Retrieval-Augmented Generation (RAG) is a technique that combines large language models with external knowledge bases to generate more accurate and reliable text. The core of RAG is to provide context support for a large language model by retrieving relevant information from an external knowledge base, thereby improving the quality and accuracy of the generated content. In the field of academic paper processing, the application value of RAG is particularly prominent. For example, in academic research assistance, RAG can provide accurate knowledge support for researchers through rapid retrieval and analysis of related documents, accelerating the research process; in intelligent question-answering systems, RAG can generate high-quality answers based on academic documents to meet academic users' demand for in-depth information. This not only improves the efficiency of academic research, but also promotes the intelligent development of knowledge discovery and academic communication. Existing RAG techniques face significant challenges in academic question answering, because trusted academic question answering requires not only accurate replies but also accurate evidence attribution to ensure interpretability and establish user trust.
On the one hand, traditional RAG based on keyword or vector retrieval cannot distinguish information quality and is easily disturbed by noise: real evidence is squeezed out or diluted by noise, causing context pollution and reducing attribution precision. On the other hand, it cannot adaptively focus on high-value regions, ignoring a key characteristic of academic evidence distribution, namely its spatial concentration. In particular, academic writing practice dictates that content serving a particular intent tends to be concentrated in local functional areas, and the inability to adaptively focus on these areas severely degrades the accuracy of evidence attribution and the overall credibility of academic question answering.

Disclosure of Invention
To solve the technical problems of context pollution and low evidence attribution precision that existing RAG techniques face in academic question-answering scenarios, the invention provides a FocusedRAG-based trusted academic question-answering method and device. Through multidimensional mechanism innovation and flow optimization, it achieves accurate compression of the search space and efficient extraction of high-value evidence, establishes a traceable basis for evidence attribution, and remarkably improves the accuracy and credibility of academic question answering.
The invention provides a FocusedRAG-based trusted academic question-answering method, comprising the following steps: S1, data acquisition: acquiring academic paper data from a public data source as an original data set; S2, preprocessing: segmenting the document content in the original data set along chapter boundaries, and rewriting the user query with a large language model to obtain a rewritten query; S3, search space compression: constructing a query-adaptive signal-to-noise separation mechanism and screening query-relevant chapters as the search space based on the rewritten query; S4, evidence interval screening: extending the query-adaptive signal-to-noise separation mechanism within the search space to obtain a sentence-level signal-to-noise demarcation threshold, obtaining candidate evidence intervals from the search space through a relevance-density-guided sentence selection mechanism, sorting the candidate evidence intervals by their cumulative relevance scores, and preferentially selecting candidate evidence intervals under the constraint of a retrieval token budget to form the retrieval context; S5, answer generation: inputting the retrieval context and the user query into a large language model, which generates the academic question-answering answer for the user query. Further, the preprocessing specifically comprises: extracting chapter titles from each document in the original data set, dividing the document content into chapter units based on the detected chapter titles, and further dividing the content of each chapter into individual sentences; and inputting the user query and all detected chapter titles as context information into a large language model, which re-expresses the query to generate the rewritten query.
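The chapter-title-conditioned query rewriting described in the preprocessing step can be sketched as prompt construction; the prompt wording and the function name `build_rewrite_prompt` are illustrative assumptions, and the actual large-language-model call is omitted:

```python
def build_rewrite_prompt(user_query, chapter_titles):
    """Assemble the query-rewriting prompt: the user query plus all
    detected chapter titles as context, asking the model to restate
    the query so that its retrieval intent is explicit."""
    titles = "\n".join(f"- {t}" for t in chapter_titles)
    return (
        "Given the following chapter titles of an academic paper:\n"
        f"{titles}\n"
        "Rewrite the user query to make its retrieval intent explicit:\n"
        f"Query: {user_query}\n"
        "Rewritten query:"
    )

prompt = build_rewrite_prompt(
    "How was the model evaluated?",
    ["Introduction", "Methods", "Experiments", "Results"],
)
```

Supplying the chapter titles lets the rewrite anchor the query to the paper's actual structure (e.g. mapping "evaluated" to the Experiments and Results chapters), which is what makes the subsequent chapter-level relevance scoring effective.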