CN-121980015-A - Retrieval-augmented generation evaluation method and device based on multi-round dialogue context awareness
Abstract
The application relates to the technical field of retrieval-augmented generation (RAG) evaluation, and provides a RAG evaluation method and device based on multi-round dialogue context awareness. The method comprises: constructing a first large language model, which rewrites questions according to the question and its context; extracting historical multi-round question-and-answer data of real users; generating a multi-round evaluation set based on the historical multi-round question-and-answer data, a knowledge base document, a second large language model, and the RAG system, where the second large language model summarizes the service classification, question type, and interaction paradigm of each multi-round question-and-answer; converting the multi-round evaluation set into a single-round question set with the first large language model; and evaluating the RAG system using evaluation metrics, the multi-round evaluation set, and the single-round question set. The method and device automate the generation of multi-round evaluation sets, reduce the cost of manual annotation and the difficulty of constructing such sets, and enable evaluation of multi-round dialogue RAG tasks.
Inventors
- DENG GANG
- JI YIHUI
- CAO XIANGBO
- ZHANG XINLIN
- ZUO QIAN
- XIE KANGLIN
Assignees
- 湘财证券股份有限公司 (Xiangcai Securities Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-08
Claims (10)
- 1. A retrieval enhancement evaluation method based on multi-round dialogue context awareness, the method comprising: constructing a first large language model, wherein the first large language model is used for rewriting questions according to the question and its context; extracting historical multi-round question-and-answer data of real users from the RAG system; generating a multi-round evaluation set based on the historical multi-round question-and-answer data, a knowledge base document, a second large language model, and the RAG system, wherein the second large language model is used for summarizing the service classification, question type, and interaction paradigm of each multi-round question-and-answer; converting the multi-round evaluation set into an independent single-round question set using the first large language model; and evaluating the RAG system using evaluation metrics, the single-round question set, and the multi-round evaluation set.
- 2. The retrieval enhancement evaluation method based on multi-round dialogue context awareness of claim 1, wherein generating the multi-round evaluation set based on the historical multi-round question-and-answer data, the knowledge base document, the second large language model, and the RAG system comprises: performing cluster analysis on the historical multi-round question-and-answer data to obtain a high-frequency question-and-answer data set and the document slices referenced by answers; summarizing the service classification, question type, and interaction paradigm of each multi-round question-and-answer with the second large language model, based on the high-frequency question-and-answer data set and the document slices referenced by the answers; obtaining the corresponding documents in the knowledge base according to the service classification, and randomly extracting document slices; inputting random combinations of the question type, the interaction paradigm, and the extracted document slices into the second large language model, which automatically generates a multi-round question set; automatically generating multi-round evaluation data with the RAG system based on the multi-round question set, wherein the multi-round evaluation data comprise questions, model answers, and the document slices referenced by the answers; and integrating each item of multi-round evaluation data to obtain the multi-round evaluation set.
- 3. The retrieval enhancement evaluation method based on multi-round dialogue context awareness of claim 1, wherein converting the multi-round evaluation set into an independent single-round question set using the first large language model comprises: converting each item of multi-round evaluation data in the multi-round evaluation set into a plurality of independent single-round questions according to the following step S211, and integrating the plurality of independent single-round questions corresponding to the multi-round evaluation data to obtain the single-round question set; S211: sequentially inputting each round's question and context into the first large language model, which outputs the rewritten question corresponding to each round.
- 4. The retrieval enhancement evaluation method based on multi-round dialogue context awareness of claim 2, wherein evaluating the RAG system using the evaluation metrics, the single-round question set, and the multi-round evaluation set comprises: acquiring the multi-round evaluation set and the corresponding converted single-round question set; generating reference answer data for the single-round question set through the RAG system, wherein the reference answer data comprise model answers and the document slices referenced by the answers; and comparing the answer data of the multi-round evaluation set with the corresponding reference answer data using the evaluation metrics to obtain an evaluation score, wherein the answer data of the multi-round evaluation set comprise model answers and the document slices referenced by the answers.
- 5. The retrieval enhancement evaluation method based on multi-round dialogue context awareness of claim 1, wherein the evaluation metrics include one or more of context consistency, intent understanding rate, context recall rate, answer accuracy, and reference confidence.
- 6. A retrieval enhancement evaluation device based on multi-round dialogue context awareness, the device comprising: a construction module for constructing a first large language model, wherein the first large language model is used for rewriting questions according to the question and its context; an extraction module for extracting historical multi-round question-and-answer data of real users from the RAG system; a generation module for generating a multi-round evaluation set based on the historical multi-round question-and-answer data, a knowledge base document, a second large language model, and the RAG system, wherein the second large language model is used for summarizing the service classification, question type, and interaction paradigm of each multi-round question-and-answer; a conversion module for converting the multi-round evaluation set into an independent single-round question set using the first large language model; and an evaluation module for evaluating the RAG system using evaluation metrics, the single-round question set, and the multi-round evaluation set.
- 7. The retrieval enhancement evaluation device based on multi-round dialogue context awareness of claim 6, wherein generating the multi-round evaluation set based on the historical multi-round question-and-answer data, the knowledge base document, the second large language model, and the RAG system comprises: performing cluster analysis on the historical multi-round question-and-answer data to obtain a high-frequency question-and-answer data set and the document slices referenced by answers; summarizing the service classification, question type, and interaction paradigm of each multi-round question-and-answer with the second large language model, based on the high-frequency question-and-answer data set and the document slices referenced by the answers; obtaining the corresponding documents in the knowledge base according to the service classification, and randomly extracting document slices; inputting random combinations of the question type, the interaction paradigm, and the extracted document slices into the second large language model, which automatically generates a multi-round question set; automatically generating multi-round evaluation data with the RAG system based on the multi-round question set, wherein the multi-round evaluation data comprise questions, model answers, and the document slices referenced by the answers; and integrating each item of multi-round evaluation data to obtain the multi-round evaluation set.
- 8. The retrieval enhancement evaluation device based on multi-round dialogue context awareness of claim 6, wherein converting the multi-round evaluation set into an independent single-round question set using the first large language model comprises: converting each item of multi-round evaluation data in the multi-round evaluation set into a plurality of independent single-round questions according to the following step S211, and integrating the plurality of independent single-round questions corresponding to the multi-round evaluation data to obtain the single-round question set; S211: sequentially inputting each round's question and context into the first large language model, which outputs the rewritten question corresponding to each round.
- 9. The retrieval enhancement evaluation device based on multi-round dialogue context awareness of claim 7, wherein evaluating the RAG system using the evaluation metrics, the single-round question set, and the multi-round evaluation set comprises: acquiring the multi-round evaluation set and the corresponding converted single-round question set; generating reference answer data for the single-round question set through the RAG system, wherein the reference answer data comprise model answers and the document slices referenced by the answers; and comparing the answer data of the multi-round evaluation set with the corresponding reference answer data using the evaluation metrics to obtain an evaluation score, wherein the answer data of the multi-round evaluation set comprise model answers and the document slices referenced by the answers.
- 10. The retrieval enhancement evaluation device based on multi-round dialogue context awareness of claim 6, wherein the evaluation metrics include one or more of context consistency, intent understanding rate, context recall rate, answer accuracy, and reference confidence.
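The five-step procedure of claim 1 can be sketched as a minimal Python pipeline. The patent does not specify an implementation, so all function and variable names here are illustrative placeholders, and the context-aware rewrite performed by the first large language model is stubbed with a trivial string operation:

```python
def rewrite_question(question: str, context: list[str]) -> str:
    """Placeholder for the first large language model's context-aware rewrite.
    A real system would prompt an LLM; here we just prepend the last turn."""
    return f"{context[-1]} -> {question}" if context else question

def build_pipeline(history, knowledge_base):
    # Steps 1-2: the first LLM exists (rewrite_question) and real-user
    # multi-round Q&A history has been extracted (passed in as `history`).
    # Step 3: generate a multi-round evaluation set (stubbed as pass-through;
    # the patent generates it with a second LLM plus the RAG system).
    multi_round_eval_set = [dict(turns=h) for h in history]
    # Step 4: convert each multi-round item into independent single-round
    # questions by rewriting every turn against the accumulated context.
    single_round_set = []
    for item in multi_round_eval_set:
        context = []
        for question in item["turns"]:
            single_round_set.append(rewrite_question(question, context))
            context.append(question)
    # Step 5 (evaluation with metrics) is omitted from this sketch.
    return multi_round_eval_set, single_round_set

multi, single = build_pipeline([["What is RAG?", "How is it evaluated?"]], {})
```

The key property the sketch preserves is that every rewritten single-round question is self-contained, so the same RAG system can be queried with and without dialogue context and the two result sets compared.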
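The first step of claim 2, cluster analysis over historical Q&A data to extract a high-frequency question set together with the document slices their answers cite, might look like the following sketch. A real system would cluster by embedding similarity; this hypothetical version groups by normalized question text purely for illustration, and all names are assumptions:

```python
from collections import Counter

def high_frequency_set(qa_records, min_count=2):
    """Toy stand-in for the cluster-analysis step: group identical
    (normalized) questions and keep clusters seen at least `min_count`
    times, along with the document slices their answers referenced."""
    normalized = [(q.strip().lower(), slices) for q, slices in qa_records]
    counts = Counter(q for q, _ in normalized)
    hf_questions = {q for q, c in counts.items() if c >= min_count}
    hf_set, referenced_slices = [], set()
    for q, slices in normalized:
        if q in hf_questions:
            hf_set.append(q)
            referenced_slices.update(slices)
    return hf_set, referenced_slices

# Each record is (question, [document slice ids cited by the answer]).
records = [
    ("How to open an account?", ["doc1#s3"]),
    (" how to open an account?", ["doc1#s3", "doc2#s1"]),
    ("What is margin?", ["doc5#s2"]),
]
hf, slices = high_frequency_set(records)
```

In the patent's flow, the high-frequency set and its referenced slices would then be handed to the second large language model to summarize service classifications, question types, and interaction paradigms.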
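Step S211 of claim 3 (feeding each round's question plus the accumulated dialogue to the first large language model and collecting the rewritten, self-contained questions) can be sketched as below. The prompt wording and the `llm` callable interface are assumptions for illustration, not taken from the patent:

```python
def rewrite_with_llm(llm, question, context):
    """Stand-in for step S211: prompt the first large language model with the
    dialogue so far plus the current question. `llm` is any callable that
    takes a prompt string and returns the rewritten question."""
    prompt = (
        "Rewrite the final question so it is understandable without the "
        "preceding dialogue.\n"
        + "\n".join(f"Round {i + 1}: {turn}" for i, turn in enumerate(context))
        + f"\nFinal question: {question}"
    )
    return llm(prompt)

def multi_round_to_single_round(llm, multi_round_eval_set):
    """Convert every dialogue into independent single-round questions,
    applying S211 sequentially so each rewrite sees the prior rounds."""
    single_round = []
    for dialogue in multi_round_eval_set:
        context = []
        for question in dialogue:
            single_round.append(rewrite_with_llm(llm, question, context))
            context.append(question)
    return single_round

# Dummy "LLM" that just echoes the prompt's last line, for demonstration.
out = multi_round_to_single_round(
    lambda p: p.splitlines()[-1],
    [["What is RAG?", "How is it evaluated?"]],
)
```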
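The comparison step of claim 4, scoring each multi-round answer record against the reference answer generated from its converted single-round question, is sketched below. Token-level F1 stands in for the "answer accuracy" metric of claim 5 and cited-slice recall for "context recall rate"; the remaining claim-5 metrics (context consistency, intent understanding rate, reference confidence) would typically require an LLM judge and are omitted. All names and formulas are illustrative assumptions:

```python
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 between two answers, a common proxy for accuracy."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def evaluate_pair(multi_round_item, reference_item):
    """Compare one multi-round answer record with the reference record from
    its converted single-round question. Both carry a model answer and the
    document slices the answer cited, as in claim 4."""
    answer_f1 = token_f1(multi_round_item["answer"], reference_item["answer"])
    cited = set(multi_round_item["slices"])
    ref_cited = set(reference_item["slices"])
    slice_recall = len(cited & ref_cited) / len(ref_cited) if ref_cited else 1.0
    return {"answer_f1": answer_f1, "slice_recall": slice_recall}

score = evaluate_pair(
    {"answer": "open an account online", "slices": ["d1"]},
    {"answer": "open an account online", "slices": ["d1", "d2"]},
)
```

Averaging these per-item scores over the whole multi-round evaluation set would yield the batch-level evaluation score the claim refers to.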
Description
Retrieval enhancement evaluation method and device based on multi-round dialogue context awareness

Technical Field

The application relates to the technical field of retrieval-augmented generation evaluation, and in particular to a retrieval enhancement evaluation method and device based on multi-round dialogue context awareness.

Background

With technical progress, frontier technology represented by generative artificial intelligence has driven service, product, and business innovation in the financial field, further raising the level of intelligence in the sector. Generative artificial intelligence offers strong creativity and efficiency, but suffers from model hallucination and limited dynamic adaptability, problems that are especially prominent in fields with extremely high accuracy and safety requirements such as finance. Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with a generative model, using information from specific related data sources to enhance the accuracy and reliability of generative artificial intelligence models. In financial business scenarios, which emphasize market data timeliness and complex information processing, RAG significantly improves a model's knowledge coverage and answer accuracy and relieves the limitation of relying solely on knowledge inherent in the training data. As with all systems, RAG systems require a comprehensive and objective evaluation scheme to support continuous iterative optimization. Current RAG evaluation frameworks build evaluation sets by manual annotation or with large-model assistance, and analyze the RAG system along general dimensions such as context precision, context recall, answer faithfulness, and answer relevance.
The invention patent application No. 202411715139.4 discloses an evaluation method for a long-document retrieval-augmented generation large model, which constructs an automatic question-answer strategy based on focus segments, constructs an evaluation data set, processes the evaluation data set with the automatic question strategy to obtain an optimal evaluation data set for the long-document retrieval-augmented generation large model, and performs evaluation based on that optimal evaluation data set. Aiming at evaluating a production RAG system, the invention patent application No. 202411792590.6 discloses a method for evaluating RAG metrics of a deployed RAG system: it prepares in advance a test set consisting of a list of (original question, correct answer) pairs, calls the RAG module interface of the production environment with the original questions in the test set, acquires the model reply and the reference text list, combines them into evaluation data according to the input format of an evaluation framework, and performs RAG evaluation without access to the model and database interfaces used by the production environment.
The invention patent application No. 202411334441.5 discloses an automatic evaluation method and system for a retrieval-augmented generation system, which evaluate the answers generated by the retrieval-augmented generation system against the answers in an evaluation data set using different answer evaluation algorithms to obtain multiple algorithmic evaluation metrics, calculate several overall evaluation metrics for each retrieval-augmented generation system, and perform principal component analysis on all algorithmic and overall evaluation metrics in order to evaluate the RAG system. However, these schemes typically evaluate based on a single-round dialogue evaluation data set. In practical financial business scenarios such as investment consultation, it is difficult for clients to describe their consultation intent, risk preference, and investment targets in a single round; the system must ultimately give reasonable investment advice based on information obtained over multiple rounds of dialogue. Such complex task scenarios place higher requirements on the end-to-end performance of the RAG system. However, existing RAG evaluation methods give insufficient consideration to multi-round dialogue, mainly because constructing a dedicated multi-round dialogue evaluation data set is difficult and evaluation methods and metrics for multi-round dialogue RAG tasks are lacking. Therefore, the efficient construction of the evaluation data set of the multi