CN-122021937-A - Large model automatic evaluation method, device, equipment and readable storage medium
Abstract
The invention discloses a large model automated evaluation method, apparatus, device, and readable storage medium. The method comprises: receiving an evaluation request for a target large model; determining a question-answer record of the target large model based on the type of the evaluation request, wherein, when the type of the evaluation request is model capability evaluation, the question-answer record comprises a plurality of test question-answer cases, each comprising a test question, an expected answer to the test question, and the final answer output by the target large model for the test question, and, when the type of the evaluation request is model output decision, the question-answer record comprises a real-time question-answer case comprising a question input by a user in real time and a plurality of candidate answers output by the target large model in real time for that question; determining an evaluation index set of the target large model based on the question-answer record; and evaluating the target large model through a preset judge large model based on the type of the evaluation request, the question-answer record, and the evaluation index set, to obtain an evaluation result.
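For orientation, the claimed flow can be sketched in a few lines of Python. The sketch is illustrative only: target_model and judge stand in for the target and judge large models, and the test sample, real-time question, and index names are invented placeholders; none of the identifiers come from the patent itself.

```python
# Illustrative sketch of the claimed flow; all identifiers are invented.
from typing import Callable, Dict, List

Model = Callable[[str], List[str]]                    # question -> answer(s)
Judge = Callable[[str, List[str]], Dict[str, float]]  # prompt, indexes -> scores

def evaluate(request_type: str, target_model: Model, judge: Judge) -> dict:
    if request_type == "capability":
        # Offline: each case holds a test question, the expected answer, and
        # the final answer the target model gives to that question.
        cases = [{"question": "What is 2 + 2?", "expected": "4"}]
        for case in cases:
            case["final_answer"] = target_model(case["question"])[0]
        indexes = ["accuracy", "relevance"]           # evaluation index set
        itemized = [judge(f"Q: {c['question']}\nExpected: {c['expected']}\n"
                          f"Answer: {c['final_answer']}", indexes)
                    for c in cases]
        scores = [v for result in itemized for v in result.values()]
        return {"capability_report": {"mean": sum(scores) / len(scores)}}

    # Online: one real-time question, several candidate answers; the candidate
    # with the best comprehensive score becomes the recommended answer.
    question = "Summarize today's meeting."
    candidates = target_model(question)
    indexes = ["relevance", "fluency"]
    totals = {c: sum(judge(f"Q: {question}\nCandidate: {c}", indexes).values())
              for c in candidates}
    return {"recommended_answer": max(totals, key=totals.get)}
```

For example, `evaluate("capability", lambda q: ["4"], lambda p, idx: {i: 0.9 for i in idx})` returns a one-case capability report, while the "output_decision" branch returns the best-scoring candidate as the recommended answer.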
Inventors
- ZHOU JIANJUN
- ZHUANG MINGGUANG
- SHEN QI
Assignees
- 汇添富基金管理股份有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-02
Claims (10)
- 1. A method for automated evaluation of a large model, the method comprising: receiving an evaluation request for a target large model; determining a question-answer record of the target large model based on the type of the evaluation request, wherein, when the type of the evaluation request is model capability evaluation, the question-answer record comprises a plurality of test question-answer cases, each test question-answer case comprises a test question, an expected answer to the test question, and a final answer of the target large model to the test question, and, when the type of the evaluation request is model output decision, the question-answer record comprises a real-time question-answer case, wherein the real-time question-answer case comprises a question input by a user in real time and a plurality of candidate answers output by the target large model in real time for the question; determining an evaluation index set of the target large model based on the question-answer record; and evaluating the target large model through a preset judge large model based on the type of the evaluation request, the question-answer record, and the evaluation index set, to obtain an evaluation result.
- 2. The method for automated evaluation of a large model according to claim 1, wherein the determining a question-answer record of the target large model based on the type of the evaluation request comprises: when the type of the evaluation request is model capability evaluation, selecting a plurality of test question-answer samples from a locally preset evaluation database, wherein each test question-answer sample comprises a test question and an expected answer to the test question; inputting the test question in each test question-answer sample to the target large model respectively, to obtain a final answer of the target large model to each test question; combining each test question-answer sample with the corresponding final answer respectively, to generate a plurality of test question-answer cases; and taking all the generated test question-answer cases as the question-answer record.
- 3. The method for automated evaluation of a large model according to claim 1, wherein the determining a question-answer record of the target large model based on the type of the evaluation request comprises: when the type of the evaluation request is model output decision, monitoring the target large model in real time; when the target large model is monitored to output a plurality of candidate answers in real time based on a question input by a user in real time, combining the question input by the user in real time with the candidate answers output by the target large model in real time, to generate a real-time question-answer case; and taking the real-time question-answer case as the question-answer record.
- 4. The method for automated evaluation of a large model according to claim 1, wherein the determining the evaluation index set of the target large model based on the question-answer record comprises: traversing each question-answer case in the question-answer record, wherein a question-answer case is a test question-answer case or a real-time question-answer case; and respectively determining, based on each traversed question-answer case, an evaluation index set for evaluating the target large model.
- 5. The method for automated evaluation of a large model according to claim 4, wherein the determining an evaluation index set for evaluating the target large model based on each traversed question-answer case comprises: judging whether the question-answer case further comprises context information related to the question in the question-answer case; when the question-answer case comprises context information, selecting a first number of evaluation indexes from a locally preset evaluation index library as the evaluation index set corresponding to the question-answer case and used for evaluating the target large model; and when the question-answer case does not comprise the context information, selecting a second number of evaluation indexes from the locally preset evaluation index library as the evaluation index set corresponding to the question-answer case and used for evaluating the target large model, wherein the first number and the second number are integers greater than 1, and the first number is greater than the second number.
- 6. The method for automated evaluation of a large model according to claim 4 or 5, wherein the evaluating the target large model through a preset judge large model based on the type of the evaluation request, the question-answer record, and the evaluation index set, and obtaining an evaluation result, comprises: when the type of the evaluation request is model capability evaluation, scoring, through the judge large model, the model capability of the target large model based on each test question-answer case and the evaluation index set corresponding to each test question-answer case, and generating a first itemized scoring result corresponding to each test question-answer case, wherein the first itemized scoring result comprises a score for each evaluation index in the corresponding evaluation index set; and performing aggregation analysis on all the itemized scoring results to generate a capability evaluation report of the target large model.
- 7. The method for automated evaluation of a large model according to claim 1, 4 or 5, wherein the evaluating the target large model through a preset judge large model based on the type of the evaluation request, the question-answer record, and the evaluation index set, and obtaining an evaluation result, comprises: when the type of the evaluation request is model output decision, respectively scoring, through the judge large model, each candidate answer in the real-time question-answer case based on the evaluation index set, and generating a second itemized scoring result for each candidate answer; calculating a comprehensive scoring result for each candidate answer based on the second itemized scoring result of that candidate answer; and determining the candidate answer corresponding to the optimal comprehensive scoring result as a recommended answer.
- 8. The method for automated evaluation of a large model according to claim 1, wherein, prior to the receiving an evaluation request for a target large model, the method further comprises: selecting, from a plurality of locally connected large models, a large model to be evaluated as the target large model; and selecting, from the plurality of locally connected large models, a large model of a higher grade than the target large model as the judge large model.
- 9. A large model automated evaluation apparatus, the apparatus comprising: a receiving module, configured to receive an evaluation request for a target large model; a first determining module, configured to determine a question-answer record of the target large model based on the type of the evaluation request, wherein, when the type of the evaluation request is model capability evaluation, the question-answer record comprises a plurality of test question-answer cases, each test question-answer case comprises a test question, an expected answer to the test question, and a final answer of the target large model to the test question, and, when the type of the evaluation request is model output decision, the question-answer record comprises a real-time question-answer case, and the real-time question-answer case comprises a question input by a user in real time and a plurality of candidate answers output by the target large model in real time for the question; a second determining module, configured to determine an evaluation index set of the target large model based on the question-answer record; and an evaluation module, configured to evaluate the target large model through a preset judge large model based on the type of the evaluation request, the question-answer record, and the evaluation index set, to obtain an evaluation result.
- 10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
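Claims 6 and 7 share one mechanism: the judge large model produces an itemized scoring result (one score per evaluation index), which is then either aggregated across test cases into a capability report or totaled per candidate to pick a recommended answer. A minimal sketch, assuming a hypothetical ask_judge stand-in for the judge-model call and using the per-index mean and the per-candidate sum as illustrative aggregation and comprehensive-scoring rules (the claims do not fix these):

```python
# Sketch of the scoring steps in claims 6 and 7; ask_judge is a stub for the
# judge large model, and the aggregation rules are illustrative assumptions.
import json
from statistics import mean
from typing import Dict, List, Optional

def ask_judge(question: str, answer: str, indexes: List[str],
              expected: Optional[str] = None) -> Dict[str, float]:
    """Stand-in for the judge large model: a real system would send this
    prompt to the judge and parse its reply; here dummy scores are returned."""
    prompt = json.dumps({"question": question, "answer": answer,
                         "expected": expected, "score_each": indexes})
    del prompt                                 # unused by the stub
    return {index: 0.8 for index in indexes}   # itemized scoring result

def capability_report(cases: List[dict], indexes: List[str]) -> Dict[str, float]:
    # Claim 6: one first itemized scoring result per test case, followed by
    # aggregation analysis (here: the per-index mean across all cases).
    itemized = [ask_judge(c["question"], c["final_answer"], indexes,
                          expected=c["expected"]) for c in cases]
    return {i: mean(result[i] for result in itemized) for i in indexes}

def recommend_answer(question: str, candidates: List[str],
                     indexes: List[str]) -> str:
    # Claim 7: a second itemized scoring result per candidate, a comprehensive
    # score per candidate (here: the sum), and the best candidate recommended.
    comprehensive = {c: sum(ask_judge(question, c, indexes).values())
                     for c in candidates}
    return max(comprehensive, key=comprehensive.get)
```

Keeping the itemized results separate from the aggregation step mirrors the claims' split between scoring and aggregation analysis, so either rule can be swapped without touching the judge call.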
Description
Technical Field

The present invention relates to the field of large model evaluation technologies, and in particular, to a large model automated evaluation method, apparatus, device, and readable storage medium.

Background

Evaluation of a large language model is key to measuring its capability, optimizing it, and ensuring its reliability in practical applications. The inventors' research finds that traditional automated evaluation methods mainly have the following limitations: on the one hand, existing evaluation systems can generally handle only a single type of evaluation task, which limits their applicable scenarios and resource utilization; on the other hand, existing evaluation systems usually adopt a preset, fixed index set and cannot adapt it dynamically. No effective solution to these problems of the prior art exists at present.

Disclosure of Invention

The invention aims to provide a large model automated evaluation method, apparatus, device, and readable storage medium that can uniformly support offline model capability evaluation and online real-time output decision, adaptively determine evaluation criteria for different question-answer records, and improve evaluation efficiency, precision, and practicability.

According to one aspect of the present invention, there is provided a large model automated evaluation method, the method comprising: receiving an evaluation request for a target large model; determining a question-answer record of the target large model based on the type of the evaluation request, wherein, when the type of the evaluation request is model capability evaluation, the question-answer record comprises a plurality of test question-answer cases, each test question-answer case comprises a test question, an expected answer to the test question, and a final answer of the target large model to the test question, and, when the type of the evaluation request is model output decision, the question-answer record comprises a real-time question-answer case, wherein the real-time question-answer case comprises a question input by a user in real time and a plurality of candidate answers output by the target large model in real time for the question; determining an evaluation index set of the target large model based on the question-answer record; and evaluating the target large model through a preset judge large model based on the type of the evaluation request, the question-answer record, and the evaluation index set, to obtain an evaluation result.

Optionally, the determining the question-answer record of the target large model based on the type of the evaluation request includes: when the type of the evaluation request is model capability evaluation, selecting a plurality of test question-answer samples from a locally preset evaluation database, wherein each test question-answer sample comprises a test question and an expected answer to the test question; inputting the test question in each test question-answer sample to the target large model respectively, to obtain a final answer of the target large model to each test question; combining each test question-answer sample with the corresponding final answer respectively, to generate a plurality of test question-answer cases; and taking all the generated test question-answer cases as the question-answer record.
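The record-building embodiment above (mirroring claim 2) can be sketched as follows; the in-memory EVAL_DB, the query_model callable, and the sample questions are assumptions standing in for the locally preset evaluation database and the target large model:

```python
# Sketch of building the offline question-answer record from preset samples.
from typing import Callable, List, TypedDict

class TestCase(TypedDict):
    question: str       # test question drawn from the evaluation database
    expected: str       # expected answer stored with the sample
    final_answer: str   # final answer produced by the target large model

# Stand-in for the locally preset evaluation database of test samples.
EVAL_DB = [("What is the capital of France?", "Paris"),
           ("What is 2 + 2?", "4")]

def build_record(query_model: Callable[[str], str],
                 sample_count: int = 2) -> List[TestCase]:
    record: List[TestCase] = []
    for question, expected in EVAL_DB[:sample_count]:  # select test samples
        final_answer = query_model(question)           # query the target model
        record.append({"question": question, "expected": expected,
                       "final_answer": final_answer})  # combine into a case
    return record                                      # the question-answer record
```

For instance, `build_record(lambda q: "Paris")` yields two test question-answer cases whose final answers both read "Paris", ready to be handed to the judge model.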
Optionally, the determining the question-answer record of the target large model based on the type of the evaluation request includes: when the type of the evaluation request is model output decision, monitoring the target large model in real time; when the target large model is monitored to output a plurality of candidate answers in real time based on a question input by a user in real time, combining the question input by the user in real time with the candidate answers output by the target large model in real time, to generate a real-time question-answer case; and taking the real-time question-answer case as the question-answer record.

Optionally, the determining, based on the question-answer record, an evaluation index set of the target large model includes: traversing each question-answer case in the question-answer record, wherein a question-answer case is a test question-answer case or a real-time question-answer case; and respectively determining, based on each traversed question-answer case, an evaluation index set for evaluating the target large model.

Optionally, the determining an evaluation index set for evaluating the target large model based on each traversed question-answer case includes: judging whether the question-answer case further comprises context information related to the question in the question-answer case; when the question-answer case comprises context information, selecting a first number of evaluation indexes from a locally preset evaluation index library as the evaluation index set corresponding to the question-answer case and used for evaluating the target large model; and when the question-answer case does not comprise the context information, selecting a second number of evaluation indexes from the locally preset evaluation index library as the evaluation index set corresponding to the question-answer case and used for evaluating the target large model, wherein the first number and the second number are integers greater than 1, and the first number is greater than the second number.
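A minimal sketch of the index-selection rule just described, where the index library and the two counts (a first number of 4, a second number of 2) are illustrative assumptions that satisfy the stated constraints (both integers greater than 1, the first greater than the second):

```python
# Sketch of context-dependent index selection: cases carrying context get a
# larger index set than bare questions. Library and counts are assumptions.
from typing import Dict, List, Optional

# Stand-in for the locally preset evaluation index library.
INDEX_LIBRARY = ["accuracy", "relevance", "fluency",
                 "context_consistency", "safety"]

FIRST_NUMBER = 4    # indexes for cases with context
SECOND_NUMBER = 2   # indexes for cases without context

def select_indexes(case: Dict[str, Optional[str]]) -> List[str]:
    # Judge whether the case includes context information for its question.
    has_context = bool(case.get("context"))
    count = FIRST_NUMBER if has_context else SECOND_NUMBER
    return INDEX_LIBRARY[:count]    # the case's evaluation index set

# A contextual case gets four indexes; a bare question gets only two.
assert len(select_indexes({"question": "q", "context": "chat history"})) == 4
assert len(select_indexes({"question": "q"})) == 2
```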