CN-119669924-B - Performance evaluation method suitable for large-model agents


Abstract

The invention provides a performance evaluation method suitable for large-model agents. The method comprises: obtaining a sample data volume for the agent to be evaluated and obtaining the task output type of the agent to be evaluated; when the task output type is a continuous output result, obtaining a comparison evaluation result of the agent by means of a paired t-test; and when the task output type is a classification output result, obtaining the comparison evaluation result by means of a paired chi-square test, wherein the comparison evaluation result comprises a comparative experiment evaluation result and an ablation experiment evaluation result. The comparative experiment quantifies the performance difference between the large-model agent and the original LLM, and the ablation experiment analyzes the contribution of each module to overall performance, providing data support and a basis for further optimizing the agent. The performance of the agent is thereby evaluated systematically and comprehensively, and the accuracy and repeatability of the performance evaluation are ensured.

Inventors

LIU JUNPING
HAO JIANJUN
GUO YIJUN
ZHANG ZHILONG
HE XINXIN

Assignees

Beijing University of Posts and Telecommunications (北京邮电大学)

Dates

Publication Date: 20260512
Application Date: 20241114

Claims (10)

1. A performance evaluation method suitable for large-model agents, characterized by comprising the following steps: acquiring a sample data volume for network information authenticity detection by an agent to be evaluated, and acquiring a task output type of the agent to be evaluated; when the task output type of the agent to be evaluated is a continuous output result representing an authenticity value, obtaining a comparison evaluation result of the agent to be evaluated by means of a paired t-test; when the task output type of the agent to be evaluated is a classification output result representing an authenticity classification, obtaining a comparison evaluation result of the agent to be evaluated by means of a paired chi-square test; wherein the comparison evaluation result comprises a comparative experiment evaluation result and an ablation experiment evaluation result.
2. The performance evaluation method suitable for large-model agents according to claim 1, further comprising: when the comparison evaluation result of the agent to be evaluated shows a significant performance difference, selecting either an ROC curve or a PR curve, according to the data type of the sample data volume, to evaluate the agent to be evaluated, and obtaining a comprehensive performance evaluation result of the agent to be evaluated.
3. The performance evaluation method suitable for large-model agents according to claim 2, wherein selecting either an ROC curve or a PR curve, according to the data type of the sample data volume, to evaluate the agent to be evaluated and obtaining the comprehensive performance evaluation result of the agent to be evaluated comprises: when the proportion of positive sample data in the sample data volume is greater than a preset proportion threshold, selecting the PR curve to evaluate the agent to be evaluated, and obtaining the comprehensive performance evaluation result of the agent to be evaluated.
4. The performance evaluation method suitable for large-model agents according to claim 2, wherein selecting either an ROC curve or a PR curve, according to the data type of the sample data volume, to evaluate the agent to be evaluated and obtaining the comprehensive performance evaluation result of the agent to be evaluated comprises: when the proportion of positive sample data in the sample data volume is less than or equal to a preset proportion threshold, selecting the ROC curve to evaluate the agent to be evaluated, and obtaining the comprehensive performance evaluation result of the agent to be evaluated.
5. The performance evaluation method suitable for large-model agents according to claim 1, further comprising: when either the comparative experiment evaluation result or the ablation experiment evaluation result shows a significant performance difference, determining that the comparison evaluation result of the agent to be evaluated shows a significant performance difference.
6. The performance evaluation method suitable for large-model agents according to any one of claims 1 to 5, wherein acquiring the sample data volume for network information authenticity detection by the agent to be evaluated comprises: determining the sample data volume for network information authenticity detection by the agent to be evaluated according to the standardized effect size, the significance level, and the statistical power.
7. A performance evaluation device suitable for large-model agents, comprising: a model data acquisition module, configured to acquire a sample data volume for network information authenticity detection by an agent to be evaluated and to acquire a task output type of the agent to be evaluated; a first comparison evaluation module, configured to obtain a comparison evaluation result of the agent to be evaluated by means of a paired t-test when the task output type of the agent to be evaluated is a continuous output result representing an authenticity value; and a second comparison evaluation module, configured to obtain a comparison evaluation result of the agent to be evaluated by means of a paired chi-square test when the task output type of the agent to be evaluated is a classification output result representing an authenticity classification; wherein the comparison evaluation result comprises a comparative experiment evaluation result and an ablation experiment evaluation result.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the performance evaluation method for large model agents according to any one of claims 1 to 6 when executing the computer program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the performance evaluation method for large model agents according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the performance evaluation method for large model agents according to any one of claims 1 to 6.
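
Claim 6 determines the sample data volume from the effect size, the significance level, and the statistical power, but gives no formula. A minimal sketch of one common way to do this, assuming a two-sided paired test and the usual normal approximation n = ((z_(1-alpha/2) + z_power) / d)^2, is shown below. The function name `paired_sample_size` and its default values are hypothetical and do not come from the patent.

```python
import math
from statistics import NormalDist


def paired_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of paired samples for a two-sided paired test.

    Assumption: normal approximation to the paired t-test, with
    effect_size being the standardized mean difference (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not 0 < alpha < 1 or not 0 < power < 1:
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # A medium effect needs far fewer pairs than a small one.
    print(paired_sample_size(0.5), paired_sample_size(0.2))
```

Because the normal approximation slightly underestimates the requirement of an exact t-based calculation, the returned value is best read as a lower bound when planning an experiment.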

Description

Performance evaluation method suitable for large-model agents

Technical Field

The invention relates to the technical field of authenticity detection of network information, and in particular to a performance evaluation method suitable for large-model agents.

Background

Since the release of GPT-3.5, large language models (LLM, Large Language Model) have received widespread attention for their excellent natural language processing capabilities. With the advent of programmable LLM interfaces, researchers have not only used LLMs for text generation but have also explored their potential as intelligent tools in a variety of applications, and LLM-based agents have increasingly become a research hotspot. Compared with traditional neural networks, an LLM-based agent does not rely on huge datasets or high-performance computing resources for model training. By drawing on the LLM's powerful natural language processing capability and rich prior knowledge, combining modules such as 'action', 'planning', and 'memory', and using carefully designed prompts, it can efficiently complete natural-language tasks in specific application scenarios, and it shows particular promise in the field of network information authenticity detection. However, evaluating the performance of such agents systematically and comprehensively remains a challenge. This involves not only quantifying the performance of an agent on a particular task but also ensuring the accuracy and repeatability of the performance evaluation.

Disclosure of Invention

The invention provides a performance evaluation method suitable for large-model agents, which is used to evaluate the performance of a large-model agent systematically and comprehensively and to ensure the accuracy and repeatability of the performance evaluation.
The invention provides a performance evaluation method suitable for large-model agents, which comprises the following steps: acquiring a sample data volume for an agent to be evaluated, and acquiring a task output type of the agent to be evaluated; when the task output type of the agent to be evaluated is a continuous output result, obtaining a comparison evaluation result of the agent to be evaluated by means of a paired t-test; when the task output type of the agent to be evaluated is a classification output result, obtaining a comparison evaluation result of the agent to be evaluated by means of a paired chi-square test; wherein the comparison evaluation result comprises a comparative experiment evaluation result and an ablation experiment evaluation result.
The performance evaluation method suitable for large-model agents provided by the invention further comprises: when the comparison evaluation result of the agent to be evaluated shows a significant performance difference, selecting either an ROC curve or a PR curve, according to the data type of the sample data volume, to evaluate the agent to be evaluated, and obtaining a comprehensive performance evaluation result of the agent to be evaluated.
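
The selection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes that the "paired chi-square test" refers to McNemar's test (the usual chi-square test for paired binary outcomes), it applies the ROC/PR rule of claims 3 and 4 with a caller-supplied threshold, and all function names and the `output_type` strings are hypothetical. The paired t-test part returns only the statistic and the degrees of freedom, since a p-value needs a t distribution that the Python standard library does not provide (SciPy's `scipy.stats.ttest_rel` is one option).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, agent):
    """Paired t statistic and degrees of freedom for continuous scores."""
    if len(baseline) != len(agent) or len(baseline) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(agent, baseline)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


def mcnemar_test(baseline_correct, agent_correct):
    """McNemar chi-square statistic and p-value (no continuity correction)."""
    if len(baseline_correct) != len(agent_correct):
        raise ValueError("samples must be paired and equally long")
    pairs = list(zip(baseline_correct, agent_correct))
    b = sum(1 for x, y in pairs if x and not y)  # only the baseline is right
    c = sum(1 for x, y in pairs if y and not x)  # only the agent is right
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


def compare(output_type, baseline, agent):
    """Dispatch on the task output type (hypothetical type labels)."""
    if output_type == "continuous":
        return paired_t_statistic(baseline, agent)
    if output_type == "classification":
        return mcnemar_test(baseline, agent)
    raise ValueError(f"unknown output type: {output_type!r}")


def choose_curve(labels, threshold):
    """PR curve if the positive-sample proportion exceeds the threshold, else ROC."""
    ratio = sum(1 for y in labels if y) / len(labels)
    return "PR" if ratio > threshold else "ROC"
```

The same `compare` call would serve both the comparative experiment (agent versus the original LLM) and each ablation experiment (full agent versus the agent with one module removed), since both are paired comparisons on the same samples.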
According to the performance evaluation method suitable for large-model agents provided by the invention, selecting either an ROC curve or a PR curve, according to the data type of the sample data volume, to evaluate the agent to be evaluated and obtaining the comprehensive performance evaluation result of the agent to be evaluated comprises: when the proportion of positive sample data in the sample data volume is greater than a preset proportion threshold, selecting the PR curve to evaluate the agent to be evaluated, and obtaining the comprehensive performance evaluation result of the agent to be evaluated.
According to the performance evaluation method suitable for large-model agents provided by the invention, selecting either an ROC curve or a PR curve, according to the data type of the sample data volume, to evaluate the agent to be evaluated and obtaining the comprehensive performance evaluation result of the agent to be evaluated comprises: when the proportion of positive sample data in the sample data volume is less than or equal to a preset proportion threshold, selecting the ROC curve to evaluate the agent to be evaluated, and obtaining the comprehensive performance evaluation result of the agent to be evaluated.
The performance evaluation method suitable for large-model agents provided by the invention further comprises: when either the comparative experiment evaluation result or the ablation experiment evaluation result shows a significant performance difference, determining that the comparative evaluation resu