CN-121980219-A - Large model classification evaluation method, device, equipment and storage medium

Abstract

The application discloses a large model classification evaluation method, device, equipment and storage medium. The method comprises: in response to a model online instruction for a large model to be evaluated, determining a classified corpus of the large model to be evaluated according to business scenario data of the large model to be evaluated; detecting the knowledge coverage rate of the large model to be evaluated according to the classified corpus to obtain candidate knowledge coverage rates of the large model to be evaluated; determining a target knowledge coverage rate according to the candidate knowledge coverage rates and the business scenario data; when the target knowledge coverage rate does not meet a model integrity condition, evaluating the model usage performance of the large model to be evaluated in the current business system according to a preset sliding time window to obtain at least one candidate performance index set of the large model to be evaluated; and integrating the target knowledge coverage rate and the at least one candidate performance index set to obtain a model evaluation report of the large model to be evaluated. This scheme improves the accuracy of evaluating large model knowledge.

Inventors

CHEN MAOLIN
TANG XINJIE
XU YI
JIANG JIQING
Ye Naikan

Assignees

杭州新中大科技股份有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-16

Claims (10)

1. A large model classification evaluation method, comprising: in response to a model online instruction for a large model to be evaluated, determining a classified corpus of the large model to be evaluated according to business scenario data of the business system in which the large model to be evaluated is currently located, wherein the classified corpus comprises a regional rule corpus, a professional literature corpus and a business system corpus; detecting the knowledge coverage rate of the large model to be evaluated according to the classified corpus to obtain a candidate knowledge coverage rate of the large model to be evaluated under each corpus; determining a target knowledge coverage rate of the large model to be evaluated according to the candidate knowledge coverage rates and the business scenario data; when the target knowledge coverage rate does not meet a model integrity condition, evaluating the model usage performance of the large model to be evaluated in the current business system according to a preset sliding time window to obtain at least one candidate performance index set of the large model to be evaluated; and integrating the target knowledge coverage rate and the at least one candidate performance index set to obtain a model evaluation report of the large model to be evaluated.
2. The method of claim 1, wherein the determining the target knowledge coverage rate of the large model to be evaluated according to the candidate knowledge coverage rates and the business scenario data comprises: determining a knowledge classification category of the large model to be evaluated according to the candidate knowledge coverage rates, wherein the knowledge classification category is either a single classification category or a multiple classification category; and determining the target knowledge coverage rate of the large model to be evaluated according to the knowledge classification category and the business scenario data.
3. The method of claim 2, wherein the determining the target knowledge coverage rate of the large model to be evaluated according to the knowledge classification category and the business scenario data comprises: if the knowledge classification category is a single classification category, determining the candidate knowledge coverage rate under the corpus corresponding to the knowledge classification category as the target knowledge coverage rate of the large model to be evaluated; and if the knowledge classification category is a multiple classification category, determining a corpus weight for each corpus corresponding to the knowledge classification category according to the business scenario data, and performing a weighted summation of the candidate knowledge coverage rates under those corpora according to the corpus weights to obtain the target knowledge coverage rate of the large model to be evaluated.
4. The method of claim 1, wherein the evaluating the model usage performance of the large model to be evaluated in the current business system according to a preset sliding time window to obtain at least one candidate performance index set of the large model to be evaluated comprises: for each preset sliding time window, acquiring historical question-answer data of the large model to be evaluated before the preset sliding time window; determining a target evaluation corpus from the classified corpus according to the question-answer frequency corresponding to each corpus in the historical question-answer data, and determining, from the historical question-answer data, the question-answer data to be evaluated that corresponds to the target evaluation corpus; and evaluating the model usage performance of the large model to be evaluated in the current business system according to the question-answer data to be evaluated to obtain a candidate performance index set of the large model to be evaluated in the preset sliding time window.
5. The method of claim 4, wherein the evaluating the model usage performance of the large model to be evaluated in the current business system according to the question-answer data to be evaluated to obtain the candidate performance index set of the large model to be evaluated in the preset sliding time window comprises: averaging the question hit rate data in the question-answer data to be evaluated to obtain a context precision index of the large model to be evaluated in the preset sliding time window; taking the quotient of the question retrieval knowledge quantity and the answer classification knowledge quantity in the question-answer data to be evaluated to obtain a context recall index of the large model to be evaluated in the preset sliding time window; taking the quotient of the answer association knowledge quantity and the answer standard knowledge quantity in the question-answer data to be evaluated to obtain a faithfulness index of the large model to be evaluated in the preset sliding time window; calculating the cosine similarity between the question retrieval knowledge data and the answer knowledge data in the question-answer data to be evaluated to obtain an answer relevance index of the large model to be evaluated in the preset sliding time window; and integrating the context precision index, the context recall index, the faithfulness index and the answer relevance index to obtain the candidate performance index set of the large model to be evaluated in the preset sliding time window.
6. The method of claim 1, wherein after the determining the target knowledge coverage rate of the large model to be evaluated according to the candidate knowledge coverage rates and the business scenario data, the method further comprises: if the target knowledge coverage rate meets the model integrity condition, determining a model evaluation report of the large model to be evaluated according to the target knowledge coverage rate and the business scenario data.
7. A large model classification evaluation device, characterized by comprising: a corpus determining module, configured to, in response to a model online instruction for a large model to be evaluated, determine a classified corpus of the large model to be evaluated according to business scenario data of the business system in which the large model to be evaluated is currently located, wherein the classified corpus comprises a regional rule corpus, a professional literature corpus and a business system corpus; a coverage rate detection module, configured to detect the knowledge coverage rate of the large model to be evaluated according to the classified corpus to obtain a candidate knowledge coverage rate of the large model to be evaluated under each corpus; a coverage rate determining module, configured to determine a target knowledge coverage rate of the large model to be evaluated according to the candidate knowledge coverage rates and the business scenario data; a performance evaluation module, configured to, when the target knowledge coverage rate does not meet a model integrity condition, evaluate the model usage performance of the large model to be evaluated in the current business system according to a preset sliding time window to obtain at least one candidate performance index set of the large model to be evaluated; and a report generation module, configured to integrate the target knowledge coverage rate and the at least one candidate performance index set to obtain a model evaluation report of the large model to be evaluated.
8. An electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the large model classification evaluation method of any one of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the large model classification evaluation method according to any one of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the large model classification evaluation method according to any one of claims 1-6.
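
As an illustration of the index calculations recited in claim 5, the following Python sketch computes one candidate performance index set for a single time window. It is a minimal sketch under stated assumptions, not the applicant's implementation: the `QARecord` schema and all field and function names are hypothetical, each quotient is assumed to be the first-named quantity divided by the second, the quantities are assumed to be summed over the window before dividing, and the "loyalty" index of the translated text is named `faithfulness` here.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class QARecord:
    """One question-answer record in a time window (hypothetical schema)."""
    question_hit_rate: float            # "question hit rate data"
    retrieved_knowledge_count: int      # "question retrieval knowledge quantity"
    classified_knowledge_count: int     # "answer classification knowledge quantity"
    associated_knowledge_count: int     # "answer association knowledge quantity"
    standard_knowledge_count: int       # "answer standard knowledge quantity"
    retrieval_vector: Sequence[float]   # vector for the question retrieval knowledge data
    answer_vector: Sequence[float]      # vector for the answer knowledge data


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def performance_index_set(records: Sequence[QARecord]) -> Dict[str, float]:
    """Compute the four indices for the records of one time window."""
    if not records:
        raise ValueError("at least one question-answer record is required")
    n = len(records)
    return {
        # Mean of the per-record question hit rates.
        "context_precision": sum(r.question_hit_rate for r in records) / n,
        # Assumed direction: retrieved quantity / classified quantity.
        "context_recall": safe_ratio(
            sum(r.retrieved_knowledge_count for r in records),
            sum(r.classified_knowledge_count for r in records),
        ),
        # Assumed direction: associated quantity / standard quantity.
        "faithfulness": safe_ratio(
            sum(r.associated_knowledge_count for r in records),
            sum(r.standard_knowledge_count for r in records),
        ),
        # Mean cosine similarity between retrieval and answer vectors.
        "answer_relevance": sum(
            cosine_similarity(r.retrieval_vector, r.answer_vector) for r in records
        ) / n,
    }
```

Calling `performance_index_set` once per window yields the "at least one candidate performance index set" that the evaluation report later integrates. How the vectors are produced (for example, by an embedding model) is not specified in the claims and is left to the caller.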

Description

Large model classification evaluation method, device, equipment and storage medium

Technical Field

The embodiments of the application relate to the technical field of computers, in particular to the technical field of artificial intelligence, and specifically to a large model classification evaluation method, device, equipment and storage medium.

Background

Large language models perform well in many natural language processing tasks, such as text generation, question answering and dialogue simulation. However, as large language models grow more complex and their applications diversify, it becomes increasingly difficult to select the large language model best suited to a particular task. The prior art relies mainly on a single evaluation index or on manual experience and cannot evaluate the performance of a model comprehensively and objectively. For example, a logic index can check whether the model's answer is relevant to the user's question and whether the answer follows reasonably from the dialogue context, but it cannot comprehensively evaluate the knowledge integrity of a large language model. When model knowledge is evaluated manually, the scale of the model's parameters means that manual evaluation cannot cover the whole model, and the labor required is also considerable.

Disclosure of Invention

The application provides a large model classification evaluation method, device, equipment and storage medium for improving the accuracy of evaluating large model knowledge.
According to an aspect of the present application, there is provided a large model classification evaluation method including: in response to a model online instruction for a large model to be evaluated, determining a classified corpus of the large model to be evaluated according to business scenario data of the business system in which the large model to be evaluated is currently located, wherein the classified corpus comprises a regional rule corpus, a professional literature corpus and a business system corpus; detecting the knowledge coverage rate of the large model to be evaluated according to the classified corpus to obtain a candidate knowledge coverage rate of the large model to be evaluated under each corpus; determining a target knowledge coverage rate of the large model to be evaluated according to the candidate knowledge coverage rates and the business scenario data; when the target knowledge coverage rate does not meet a model integrity condition, evaluating the model usage performance of the large model to be evaluated in the current business system according to a preset sliding time window to obtain at least one candidate performance index set of the large model to be evaluated; and integrating the target knowledge coverage rate and the at least one candidate performance index set to obtain a model evaluation report of the large model to be evaluated.
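
The flow of this method can be sketched in Python as follows. This is a hedged illustration only: the function names, the dictionary-based report, and the `evaluate_window` callback are hypothetical, the model integrity condition is assumed to be a simple threshold on the target knowledge coverage rate, the corpus weights are assumed to be supplied by the caller (derived from the business scenario data) and to sum to 1, and the caller is assumed to pass only the corpora that belong to the determined knowledge classification category.

```python
from typing import Callable, Dict, List, Optional, Sequence


def target_coverage(
    candidate: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-corpus candidate coverage rates into one target rate.

    One corpus: its candidate rate is used directly.
    Several corpora: weighted sum of the candidate rates.
    """
    if not candidate:
        raise ValueError("at least one candidate coverage rate is required")
    if len(candidate) == 1:
        return next(iter(candidate.values()))
    if weights is None:
        raise ValueError("corpus weights are required for multiple corpora")
    return sum(weights[name] * rate for name, rate in candidate.items())


def build_report(
    candidate: Dict[str, float],
    weights: Optional[Dict[str, float]],
    integrity_threshold: float,
    windows: Sequence[str],
    evaluate_window: Callable[[str], Dict[str, float]],
) -> Dict[str, object]:
    """Assemble an evaluation report from coverage and, if needed, usage performance."""
    coverage = target_coverage(candidate, weights)
    index_sets: List[Dict[str, float]] = []
    # Assumption: the integrity condition is "coverage >= threshold".
    # Only when it is NOT met are the time windows evaluated.
    if coverage < integrity_threshold:
        index_sets = [evaluate_window(window) for window in windows]
    return {
        "target_knowledge_coverage": coverage,
        "performance_index_sets": index_sets,
    }
```

For example, candidate rates of 0.9, 0.6 and 0.5 with weights 0.5, 0.3 and 0.2 give a target knowledge coverage rate of 0.73; under an assumed threshold of 0.8 the sketch would then evaluate every time window, while under a threshold of 0.7 it would report on coverage alone.
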
According to another aspect of the present application, there is provided a large model classification evaluation apparatus including: a corpus determining module, configured to, in response to a model online instruction for a large model to be evaluated, determine a classified corpus of the large model to be evaluated according to business scenario data of the business system in which the large model to be evaluated is currently located, wherein the classified corpus comprises a regional rule corpus, a professional literature corpus and a business system corpus; a coverage rate detection module, configured to detect the knowledge coverage rate of the large model to be evaluated according to the classified corpus to obtain a candidate knowledge coverage rate of the large model to be evaluated under each corpus; a coverage rate determining module, configured to determine a target knowledge coverage rate of the large model to be evaluated according to the candidate knowledge coverage rates and the business scenario data; a performance evaluation module, configured to, when the target knowledge coverage rate does not meet a model integrity condition, evaluate the model usage performance of the large model to be evaluated in the current business system according to a preset sliding time window to obtain at least one candidate performance index set of the large model to be evaluated; and a report generation module, configured to integrate the target knowledge coverage rate and the at least one candidate performance index set to obtain a model evaluation report of the large model to be evaluated.
According to another aspect of the present application, there is provided an electronic apparatus including: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement any one of the large model classification evaluation methods provided by the embodiments of the prese