CN-121980221-A - Evaluation method and device for a large language model, electronic device, and agent

Abstract

The disclosure provides a method and a device for evaluating a large language model, an electronic device, and an agent, and relates to the technical field of artificial intelligence, in particular to deep learning, large models and big data. The method comprises: processing a plurality of sample texts with a large language model to obtain respective category features of the plurality of sample texts, wherein each sample text comprises an instruction sub-text used for guiding the large language model to output the category of the sample text in a generative manner, and the category is obtained by mapping the category feature through an output layer of the large language model; and performing classification performance evaluation on the plurality of category features according to the respective category labels of the plurality of sample texts, to obtain an evaluation result of the classification performance of the large language model.

Inventors

HAN MIAO
Lv Zhonghou
WANG GUOQIU
CHEN MUHAN
Hou Jinchang
Wen Dailin
BAO CHENFU

Assignees

北京百度网讯科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260122

Claims (20)

1. A method of evaluating a large language model, comprising: processing a plurality of sample texts by using a large language model to obtain respective category features of the plurality of sample texts, wherein each sample text comprises an instruction sub-text used for guiding the large language model to output the category of the sample text in a generative manner, and the category is obtained by mapping the category feature through an output layer of the large language model; and performing classification performance evaluation on the plurality of category features according to respective category labels of the plurality of sample texts, to obtain an evaluation result of the classification performance of the large language model.
2. The method of claim 1, wherein the performing classification performance evaluation on the plurality of category features according to the respective category labels of the plurality of sample texts to obtain the evaluation result of the classification performance of the large language model comprises: performing feature extraction on each of the plurality of category features to obtain a plurality of global category features respectively corresponding to the plurality of category features, wherein the plurality of global category features have the same dimension; performing intra-class spatial analysis on a plurality of global category features having the same category label to obtain an intra-class evaluation index value; and determining the evaluation result according to the intra-class evaluation index value.
3. The method of claim 2, wherein the intra-class evaluation index value comprises a structure index value for evaluating whether a multi-semantic structure exists within a class, and the performing intra-class spatial analysis on the plurality of global category features having the same category label to obtain the intra-class evaluation index value comprises: clustering the plurality of global category features having the same category label to obtain at least one intra-class sub-cluster; and determining a sub-cluster distance between a plurality of the intra-class sub-clusters if the number of the intra-class sub-clusters is greater than a number threshold; wherein the structure index value comprises at least one of the number of the intra-class sub-clusters and the sub-cluster distance.
4. The method of claim 3, wherein the determining a sub-cluster distance between a plurality of the intra-class sub-clusters comprises: determining a first class center of each of the plurality of intra-class sub-clusters based on the plurality of global category features belonging to the same intra-class sub-cluster; and determining distances between the plurality of first class centers as the sub-cluster distances between the plurality of intra-class sub-clusters.
5. The method according to claim 3 or 4, wherein the structure index value further comprises an abnormal contour coefficient for evaluating whether a global category feature is misclassified, the method further comprising: determining an abnormal feature from a plurality of global category features having a first category label, wherein the abnormal feature is a global category feature whose distance from the first class center of each of the plurality of intra-class sub-clusters having the first category label is greater than a distance threshold; and determining the abnormal contour coefficient according to the distances between the abnormal feature and the plurality of global category features having the first category label and the distances between the abnormal feature and a plurality of global category features having a second category label, wherein the second category label is different from the first category label.
6. The method of claim 5, wherein the determining the abnormal contour coefficient according to the distances between the abnormal feature and the plurality of global category features having the first category label and the distances between the abnormal feature and the plurality of global category features having the second category label comprises: determining a first average distance according to the distances between the abnormal feature and the plurality of global category features having the first category label; determining a second average distance according to the distances between the abnormal feature and the plurality of global category features having the second category label; and determining the abnormal contour coefficient according to the difference between the second average distance and the first average distance.
7. The method of claim 5, wherein the second category label is determined by: determining inter-class distances between a plurality of category labels according to second class centers respectively corresponding to the plurality of category labels, wherein each second class center is determined by using a plurality of global category features having the same category label; and determining, according to the inter-class distances between the plurality of category labels, the second category label having the smallest distance from the first category label.
8. The method according to any one of claims 2 to 7, wherein the intra-class evaluation index value comprises an intra-class variance and an intra-class average distance for evaluating intra-class spatial distribution, and the performing intra-class spatial analysis on the plurality of global category features having the same category label to obtain the intra-class evaluation index value comprises: determining a second class center of the plurality of global category features having the same category label; and determining the intra-class variance and the intra-class average distance according to the distances between the plurality of global category features having the same category label and the second class center.
9. The method of any one of claims 2-8, further comprising: performing inter-class spatial analysis on a plurality of global category features having different category labels to obtain an inter-class evaluation index value; and determining the evaluation result according to the inter-class evaluation index value.
10. The method of claim 9, wherein the inter-class evaluation index value comprises a global inter-class separation degree for evaluating inter-class differences, and the performing inter-class spatial analysis on the plurality of global category features having different category labels to obtain the inter-class evaluation index value comprises: determining inter-class distances between a plurality of category labels according to the second class centers respectively corresponding to the plurality of category labels; and determining the global inter-class separation degree according to the inter-class distances between the plurality of category labels and the intra-class variances of the plurality of category labels.
11. The method of claim 9, wherein the inter-class evaluation index value comprises a confusion index value for evaluating a degree of confusion between classes, and the performing inter-class spatial analysis on the plurality of global category features having different category labels to obtain the inter-class evaluation index value comprises: determining a plurality of neighbor features of each of the plurality of global category features according to the distances between the plurality of global category features; determining, for each global category feature, a feature quantity, namely the number of its neighbor features whose category label differs from that of the global category feature; and determining a confusion index value for a third category label according to the feature quantities of the global category features having the third category label and the number of the global category features having the third category label.
12. The method of claim 9, wherein the inter-class evaluation index value comprises a semantic overlap index value for evaluating a degree of inter-class semantic overlap, and the performing inter-class spatial analysis on the plurality of global category features having different category labels to obtain the inter-class evaluation index value comprises: determining semantic overlap index values between the categories corresponding to a plurality of category labels according to the second class centers respectively corresponding to the plurality of category labels.
13. The method according to any one of claims 2-12, further comprising: determining the evaluation result according to the intra-class evaluation index value and the inter-class evaluation index value.
14. The method of claim 13, wherein the determining the evaluation result according to the intra-class evaluation index value and the inter-class evaluation index value comprises: determining a target evaluation result when the intra-class evaluation index value of the category corresponding to a fourth category label meets an abnormal condition and the inter-class evaluation index value meets a normal condition, wherein the target evaluation result is used for representing multiple evaluations of the category corresponding to the fourth category label.
15. The method of claim 14, further comprising: performing sample enhancement on a plurality of sample texts having the fourth category label according to the intra-class evaluation index value to obtain a plurality of enhanced sample texts; processing the plurality of enhanced sample texts by using the large language model to obtain respective category features of the plurality of enhanced sample texts; performing intra-class spatial analysis on the respective category features of the plurality of enhanced sample texts to obtain an updated intra-class evaluation index value; and determining an evaluation result of the large language model for the category corresponding to the fourth category label according to the updated intra-class evaluation index value.
16. An evaluation device of a large language model, comprising: a feature acquisition module configured to process a plurality of sample texts by using a large language model to obtain respective category features of the plurality of sample texts, wherein each sample text comprises an instruction sub-text used for guiding the large language model to output the category of the sample text in a generative manner, and the category is obtained by mapping the category feature through an output layer of the large language model; and an evaluation module configured to perform classification performance evaluation on the plurality of category features according to respective category labels of the plurality of sample texts, to obtain an evaluation result of the classification performance of the large language model.
17. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
18. An artificial intelligence based agent, comprising: an input module configured to receive input information; a processing module configured to determine a target task based on the input information received by the input module, determine a large model based on the target task, and obtain output information by calling the large model to perform the method of any one of claims 1 to 15; and an output module configured to output the output information obtained by the processing module.
19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-15.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-15.
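
As an illustration of the spatial analysis recited in claims 5 to 8 and 11, the following is a minimal sketch, not the claimed implementation. It assumes the global category features are rows of a NumPy array, uses Euclidean distance, and treats the abnormal contour coefficient as a silhouette-style score (b - a) / max(a, b); the claims only state that it depends on the difference between the two average distances. The function names, the neighbor count k, and the division of the confusion index by k are all hypothetical choices. Each label is assumed to have at least two features.

```python
import numpy as np

def class_centers(feats, labels):
    """Center of each category label: mean of its global category features."""
    feats, labels = np.asarray(feats, dtype=float), np.asarray(labels)
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels).tolist()}

def intra_class_stats(feats, labels):
    """Per label: (intra-class variance, intra-class average distance) to the center."""
    feats, labels = np.asarray(feats, dtype=float), np.asarray(labels)
    stats = {}
    for c, center in class_centers(feats, labels).items():
        d = np.linalg.norm(feats[labels == c] - center, axis=1)
        stats[c] = (float(np.mean(d ** 2)), float(np.mean(d)))
    return stats

def nearest_other_label(feats, labels, first):
    """Label whose center lies closest to the center of `first` (claim 7)."""
    centers = class_centers(feats, labels)
    others = [c for c in centers if c != first]
    return min(others, key=lambda c: np.linalg.norm(centers[c] - centers[first]))

def contour_coefficient(feats, labels, i):
    """Silhouette-style score for feature i (assumed normalization).

    a: mean distance to the other features sharing its label
    b: mean distance to the features of the nearest other label
    """
    feats, labels = np.asarray(feats, dtype=float), np.asarray(labels)
    first = labels[i]
    second = nearest_other_label(feats, labels, first)
    dist = np.linalg.norm(feats - feats[i], axis=1)
    same = labels == first
    same[i] = False  # leave out the feature itself
    a = dist[same].mean()
    b = dist[labels == second].mean()
    return float((b - a) / max(a, b))

def confusion_index(feats, labels, target, k=3):
    """Share of differently labeled features among the k nearest neighbors
    of the features carrying `target` (dividing by k is an assumption)."""
    feats, labels = np.asarray(feats, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a feature is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    differing = (labels[neighbors] != labels[:, None]).sum(axis=1)
    mask = labels == target
    return float(differing[mask].sum() / (k * mask.sum()))

if __name__ == "__main__":
    # Two well-separated toy categories in a 2-D feature space.
    feats = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                      [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    print(intra_class_stats(feats, labels))
    print(contour_coefficient(feats, labels, 0))
    print(confusion_index(feats, labels, target=0))
```

Under these assumptions, a coefficient near 1 suggests the feature sits well inside its own category, a value near or below 0 suggests it lies closer to the neighboring category, and a confusion index near 0 suggests the category's neighborhoods are dominated by its own label.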

Description

Evaluation method and device for a large language model, electronic device, and agent

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to deep learning, large models and big data, and specifically to a method and a device for evaluating a large language model, an electronic device, and an agent.

Background

With the development of artificial intelligence technology, the application scenarios of large language models (LLMs) keep growing. For example, large language models are used to perform classification tasks, which makes evaluating their classification performance critical.

Disclosure of Invention

The disclosure provides a method and a device for evaluating a large language model, an electronic device, and an agent.

According to one aspect of the disclosure, a method of evaluating a large language model is provided. The method comprises: processing a plurality of sample texts by using the large language model to obtain respective category features of the plurality of sample texts, wherein each sample text comprises an instruction sub-text used for guiding the large language model to output the category of the sample text in a generative manner, and the category is obtained by mapping the category feature through an output layer of the large language model; and performing classification performance evaluation on the plurality of category features according to respective category labels of the plurality of sample texts, to obtain an evaluation result for the classification performance of the large language model.
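
The relationship between sample text, category feature and output layer described above can be pictured with a toy sketch. It is only an illustration under loud assumptions: no real large language model is called, the category feature is replaced by seeded pseudo-random noise, and the instruction wording, label set, feature width and all function names are hypothetical rather than taken from the disclosure.

```python
import zlib
import numpy as np

# Hypothetical instruction sub-text; the disclosure does not give its wording.
INSTRUCTION = "Classify the text below. Reply with the category name only."
CATEGORIES = ["positive", "negative"]  # hypothetical label set
DIM = 8                                # hypothetical feature width

def build_sample_text(body: str) -> str:
    """Sample text = instruction sub-text + the text to classify."""
    return f"{INSTRUCTION}\nText: {body}\nCategory:"

def toy_category_feature(sample_text: str) -> np.ndarray:
    """Stand-in for the model: a deterministic pseudo 'category feature'.

    A real system would read the vector that feeds the model's output layer;
    here it is seeded noise so the sketch runs without any model.
    """
    rng = np.random.default_rng(zlib.crc32(sample_text.encode("utf-8")))
    return rng.normal(size=DIM)

# Stand-in output layer: maps a category feature to one score per category.
OUTPUT_LAYER = np.random.default_rng(0).normal(size=(len(CATEGORIES), DIM))

def predict_category(feature: np.ndarray) -> str:
    """Category obtained by mapping the feature through the output layer."""
    return CATEGORIES[int(np.argmax(OUTPUT_LAYER @ feature))]

def collect_features(samples):
    """samples: list of (body, category_label) pairs.

    Returns the feature matrix and label array that an evaluation step
    would analyze, instead of only comparing predicted categories.
    """
    feats = np.stack([toy_category_feature(build_sample_text(b)) for b, _ in samples])
    labels = np.array([label for _, label in samples])
    return feats, labels

if __name__ == "__main__":
    samples = [("great film", "positive"), ("awful plot", "negative")]
    feats, labels = collect_features(samples)
    print(feats.shape, labels)
```

The point of the sketch is the data flow: the evaluation works on the features that precede the output layer, paired with the category labels, rather than on the generated category strings alone.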
According to another aspect of the disclosure, an evaluation device of a large language model is provided, comprising a feature acquisition module and an evaluation module. The feature acquisition module is configured to process a plurality of sample texts by using the large language model to obtain respective category features of the plurality of sample texts, wherein each sample text comprises an instruction sub-text used for guiding the large language model to output the category of the sample text in a generative manner, and the category is obtained by mapping the category feature through an output layer of the large language model. The evaluation module is configured to perform classification performance evaluation on the plurality of category features according to respective category labels of the plurality of sample texts, to obtain an evaluation result for the classification performance of the large language model.

According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described above.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method described above.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure.
Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are provided for a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 schematically illustrates an exemplary system architecture to which the method and device for evaluating a large language model may be applied, according to an embodiment of the present disclosure.
FIG. 2 schematically illustrates a flow chart of a method of evaluating a large language model according to an embodiment of the disclosure.
FIG. 3 schematically illustrates a scenario diagram of determining an evaluation result by intra-class spatial analysis according to an embodiment of the disclosure.
FIG. 4A schematically illustrates a scenario diagram of a plurality of intra-class sub-clusters according to an embodiment of the disclosure.
FIG. 4B schematically illustrates a scenario diagram of abnormal features within a category having a multi-semantic structure according to an embodiment of the present disclosure.
FIG. 5 schematically illustrates a scenario diagram of determining a confusion index value according to an embodiment of the disclosure.
FIG. 6 schem