CN-122020084-A - Method and device for evaluating large language model
Abstract
Embodiments of this specification provide a method and a device for evaluating a large language model. The method comprises: obtaining an evaluation set, wherein the evaluation set comprises user questions and intent categories corresponding to the user questions; inputting the user questions into a large language model to obtain answers output by the large language model; determining a plurality of metrics to be evaluated based on the intent categories of the user questions; and evaluating the answers based on the metrics to be evaluated to obtain an evaluation result. The method brings the evaluation of model performance closer to the user's actual experience.
Inventors
- ZHENG JUN
- WANG WENJIE
- WANG HUAN
- DONG CHENGHANG
- SUN GE
- CHEN BIN
- WANG HAOYU
- JIANG YUFENG
- CHEN LIXIN
Assignees
- Alipay (Hangzhou) Digital Service Technology Co., Ltd. (支付宝(杭州)数字服务技术有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-23
Claims (15)
- 1. A method of evaluating a large language model, the method comprising: acquiring an evaluation set, wherein the evaluation set comprises user questions and intent categories corresponding to the user questions; inputting the user questions into a large language model to obtain answers output by the large language model; determining a plurality of metrics to be evaluated based on the intent category of the user question; and evaluating the answer based on each metric to be evaluated to obtain an evaluation result.
- 2. The method of claim 1, wherein determining a plurality of metrics to be evaluated based on the intent category of the user question comprises: determining at least one advanced metric to be evaluated based on the intent category of the user question, wherein the advanced metrics comprise richness, practicability, insight, and heuristics; and wherein evaluating the answer based on each metric to be evaluated to obtain an evaluation result comprises: evaluating the answer based on the at least one advanced metric to be evaluated and basic metrics to obtain an evaluation result, wherein the basic metrics comprise factuality, completeness, and relevance.
- 3. The method of claim 1, wherein evaluating the answer based on each of the metrics to be evaluated to obtain an evaluation result comprises: for each metric, scoring the answer in an evaluation mode corresponding to that metric to obtain a scoring result, wherein the evaluation mode comprises at least one of checklist-based evaluation, scoring-rule-based evaluation, and fact checking; and obtaining the evaluation result based on the scoring results of the metrics.
- 4. The method of claim 1, wherein acquiring the evaluation set comprises: obtaining a user question and an intent category of the user question; and constructing the evaluation set based on the user question and the intent category.
- 5. The method of claim 4, wherein obtaining the intent category of the user question comprises: performing intent classification on the user question with a plurality of first large models, respectively, to obtain corresponding user intent categories; and determining the user intent category that occurs most frequently as the intent category of the user question.
- 6. The method of claim 4, wherein constructing the evaluation set based on the user question and the intent category comprises: generating, in a checklist generation manner corresponding to the type of the user question and based on the user question and a standard answer associated with the user question, a checklist comprising a plurality of check items, wherein the check items correspond to key information points in the user question and/or the standard answer; and constructing the evaluation set based on the user question, the intent category, and the checklist, wherein the checklist in the evaluation set is used for checking key information points in the answer.
- 7. The method of claim 6, wherein the type of the user question is any one of an open question, a time-sensitive question, and a multi-hop reasoning question, wherein the open question is a question without a fixed answer in a real-world scenario, the time-sensitive question is a question whose answer changes over time, and the multi-hop reasoning question is a question that requires multiple intermediate reasoning steps to solve.
- 8. The method of claim 7, wherein acquiring the user question comprises: acquiring news data, and generating the time-sensitive question based on key information points in the news data, wherein the key information points comprise entities, times, places, and data, and the news data contains the standard answer associated with the time-sensitive question.
- 9. The method of claim 8, wherein generating, in the checklist generation manner corresponding to the type of the user question, a checklist comprising a plurality of check items based on the user question and the standard answer associated with the user question comprises: inputting the time-sensitive question and the news data into a second large model, and generating, by the second large model, a checklist containing a plurality of check items.
- 10. The method of claim 7, wherein generating, in the checklist generation manner corresponding to the type of the user question, a checklist comprising a plurality of check items based on the user question and the standard answer associated with the user question comprises: generating a plurality of reference answers by a plurality of third large models based on the open question; generating the standard answer associated with the open question based on the plurality of reference answers; and generating check items respectively corresponding to the key information points in the standard answer to obtain the checklist.
- 11. The method of claim 7, wherein acquiring the user question comprises: extracting a relation triplet based on a first data sample in a knowledge base, wherein the relation triplet represents a relation between a first entity and a second entity; acquiring a second data sample based on the second entity; and generating a multi-hop reasoning question based on the first data sample, the second data sample, and the relation triplet, wherein the multi-hop reasoning question queries a key information point corresponding to the second entity in the second data sample; and wherein generating, in the checklist generation manner corresponding to the type of the user question, a checklist comprising a plurality of check items based on the user question and the standard answer associated with the user question comprises: generating a check item for each of a plurality of intermediate reasoning steps corresponding to the multi-hop reasoning question, based on the intermediate reasoning steps and the standard answers associated with them, to obtain the checklist.
- 12. The method of claim 7, wherein acquiring the user question comprises: extracting a plurality of target triplets based on a target entity of a third data sample in a knowledge base, wherein the target triplets respectively represent relations between the target entity and different tail entities; acquiring a plurality of search keywords based on the tail entities; acquiring a plurality of fourth data samples through the search keywords, wherein the fourth data samples comprise clue information about the tail entities; and generating a multi-hop reasoning question based on the clue information about the tail entities and their reasoning relations with the target entity, wherein the multi-hop reasoning question asks for the target entity to be identified from the clue information about the tail entities, and the target entity is the standard answer associated with the multi-hop reasoning question; and wherein generating, in the checklist generation manner corresponding to the type of the user question, a checklist comprising a plurality of check items based on the user question and the standard answer associated with the user question comprises: generating a checklist containing a plurality of check items based on the target entity and the tail entity in each of the target triplets.
- 13. The method of claim 6, wherein each key information point corresponds to a number of check items used for checking whether the key information point is present and/or whether the key information point is accurate.
- 14. An apparatus for evaluating a large language model, the apparatus comprising: an acquisition module configured to acquire an evaluation set, wherein the evaluation set comprises user questions and intent categories corresponding to the user questions; a reasoning module configured to input the user questions into a large language model to obtain answers output by the large language model; a determining module configured to determine a plurality of metrics to be evaluated based on the intent category of the user question; and an evaluation module configured to evaluate the answer based on each metric to be evaluated to obtain an evaluation result.
- 15. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-13.
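The checklist-based evaluation recited in claims 3, 6, and 13 can be sketched as follows. The patent does not specify how check items are matched against answers; the `check_answer` helper, the substring-matching heuristic, and the sample data below are illustrative assumptions, not the claimed implementation.

```python
# Illustrative sketch of checklist-based answer checking (claims 3, 6, 13).
# Each check item targets one key information point and verifies either its
# presence or its accuracy; substring matching is assumed for illustration.

def check_answer(answer: str, checklist: list[dict]) -> float:
    """Score an answer as the fraction of check items it satisfies."""
    if not checklist:
        return 0.0
    passed = 0
    for item in checklist:
        if item["kind"] == "presence":
            # Presence check: the key information point must appear at all.
            ok = item["key_point"] in answer
        else:
            # Accuracy check: the expected value must appear verbatim.
            ok = item["expected"] in answer
        passed += ok
    return passed / len(checklist)

# Hypothetical checklist for a time-sensitive question about a news event.
checklist = [
    {"kind": "presence", "key_point": "2024"},
    {"kind": "accuracy", "expected": "Paris"},
]
score = check_answer("The event took place in Paris in 2024.", checklist)
```

A real system would replace the substring test with an LLM-based or entailment-based check per item; the fraction-of-items-passed aggregation is one simple way to turn a checklist into the scoring result of claim 3.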
Description
Method and device for evaluating large language model

Technical Field

Embodiments of this specification relate to the technical field of artificial intelligence, and in particular to a method and a device for evaluating a large language model.

Background

A large language model (Large Language Model, LLM), "large model" for short, is a deep learning model for natural language processing that is trained on a large-scale text corpus and contains parameters on the order of one hundred million or more. A user can converse with a large model to obtain answers to the questions the user wants to know about. As the capabilities of large language models continue to improve, they are increasingly used to solve all kinds of problems in people's daily lives, so accurately measuring their performance and reliability is of great importance. Existing evaluations are generally based on a benchmark test set and uniformly apply fixed metrics (such as accuracy and similarity) to compare the consistency of model outputs with reference answers across question types. However, user questions in practical application environments are often open, complex, and multi-intent, and such consistency comparison can hardly produce an evaluation result that reflects the user's actual experience. The evaluation result may therefore deviate from the actual usage effect: a model may score well yet deliver a poor experience when the user actually uses it. A large-model evaluation method is therefore needed that accurately measures whether model performance truly meets users' actual needs.

Disclosure of Invention

Embodiments of this specification provide a method and a device for evaluating a large language model that bring the evaluation of model performance closer to the user's actual experience.
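The core departure from fixed-metric benchmarks is that the set of metrics depends on the question's intent category. A minimal sketch of that selection step follows; the intent category names and their metric assignments are hypothetical examples, since the patent does not enumerate a concrete mapping.

```python
# Hypothetical mapping from intent categories to advanced metrics.
# The category names and assignments are illustrative, not from the patent.
ADVANCED_METRICS_BY_INTENT = {
    "creative_writing": ["richness", "heuristics"],
    "decision_support": ["practicability", "insight"],
    "factual_lookup": [],
}

# Basic metrics are evaluated for every question regardless of intent.
BASIC_METRICS = ["factuality", "completeness", "relevance"]

def metrics_to_evaluate(intent_category: str) -> list[str]:
    """Select the metrics for one question: always the basic metrics,
    plus any advanced metrics associated with the intent category."""
    advanced = ADVANCED_METRICS_BY_INTENT.get(intent_category, [])
    return BASIC_METRICS + advanced
```

Under this sketch, a factual-lookup question is scored only on the basic metrics, while a decision-support question additionally triggers practicability and insight, which is what lets the result track perceived answer quality rather than reference-answer consistency alone.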
According to the method, an evaluation set is obtained, the evaluation set comprising user questions and intent categories corresponding to the user questions; the user questions are input into the large language model to obtain answers output by the large language model; a plurality of metrics to be evaluated are determined based on the intent categories of the user questions; and the answers are evaluated based on the metrics to be evaluated to obtain evaluation results.

In some alternative embodiments, determining the plurality of metrics to be evaluated based on the intent category of the user question comprises determining at least one advanced metric to be evaluated based on the intent category of the user question, wherein the advanced metrics comprise richness, practicability, insight, and heuristics; and evaluating the answer based on each metric to be evaluated to obtain an evaluation result comprises evaluating the answer based on the at least one advanced metric to be evaluated and basic metrics to obtain an evaluation result, the basic metrics comprising factuality, completeness, and relevance.

In some alternative embodiments, evaluating the answer based on the metrics to be evaluated to obtain an evaluation result comprises, for each metric, scoring the answer in the evaluation mode corresponding to that metric to obtain a scoring result, wherein the evaluation mode comprises at least one of checklist-based evaluation, scoring-rule-based evaluation, and fact checking, and obtaining the evaluation result based on the scoring results of the metrics.

In some alternative embodiments, obtaining the evaluation set comprises obtaining a user question and an intent category of the user question, and constructing the evaluation set based on the user question and the intent category.
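The per-metric scoring with mode dispatch and final aggregation can be sketched as follows. The two scorer functions are crude stand-ins (real checklist and scoring-rule evaluation would typically be LLM-driven), and the equal-weight average used to combine metric scores is an assumption, as the patent does not fix an aggregation formula.

```python
# Sketch of per-metric scoring: each metric is scored in its own evaluation
# mode (checklist, scoring rule, fact check), then the scores are combined.

def score_by_rubric(answer: str) -> float:
    # Stand-in for a scoring-rule / fact-check evaluator; here a trivial
    # length-based heuristic capped at 1.0, purely for illustration.
    return min(len(answer) / 100.0, 1.0)

def score_by_checklist(answer: str, items: list[str]) -> float:
    # Stand-in checklist scorer: fraction of check items found in the answer.
    return sum(i in answer for i in items) / len(items) if items else 0.0

def evaluate(answer: str, metric_modes: dict, checklist: list[str]) -> dict:
    """Score the answer under each metric's evaluation mode, then average
    the per-metric scores into one overall evaluation result."""
    scores = {}
    for metric, mode in metric_modes.items():
        if mode == "checklist":
            scores[metric] = score_by_checklist(answer, checklist)
        else:
            # "rubric" stands in for scoring-rule and fact-check modes here.
            scores[metric] = score_by_rubric(answer)
    scores["overall"] = sum(scores.values()) / len(metric_modes)
    return scores
```

In practice each metric could also carry a weight reflecting its importance for the intent category; the dispatch structure stays the same.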
In some alternative embodiments, obtaining the intent category of the user question comprises performing intent classification on the user question with a plurality of first large models, respectively, to obtain corresponding user intent categories, and determining the user intent category that occurs most frequently as the intent category of the user question.

In some alternative embodiments, constructing the evaluation set based on the user question and the intent category comprises generating, in a checklist generation manner corresponding to the type of the user question and based on the user question and a standard answer associated with the user question, a checklist comprising a plurality of check items corresponding to key information points in the user question and/or the standard answer, and constructing the evaluation set based on the intent category and the checklist, wherein the checklist in the evaluation set is used for checking the key information points in the answer.

In some alternative embodiments, the user question is any one of an open question, a time-sensitive question, and a multi-hop reasoning question, wherein the open q
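The majority-vote intent classification described above can be sketched as follows. The votes are hard-coded stand-ins for real calls to the several "first large models", and the category labels are hypothetical.

```python
# Sketch of majority-vote intent classification: several first large models
# each label the question, and the most frequent label wins.
from collections import Counter

def majority_intent(labels: list[str]) -> str:
    """Return the intent category that occurs most often among model votes."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical votes from three first large models for one user question:
votes = ["decision_support", "factual_lookup", "decision_support"]
intent = majority_intent(votes)
```

Note that `Counter.most_common` breaks ties by first-encounter order, so with an even number of voters a deterministic tie-breaking rule (or an odd voter count) would be needed in practice.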