CN-121981836-A - Evaluation method, device, equipment and medium for financial insurance language model


Abstract

The embodiments of the present application disclose a method, an apparatus, a device, and a medium for evaluating a financial insurance language model. The method comprises: in response to an evaluation instruction for a model to be tested, obtaining a test item from a preset financial insurance question bank; generating a new test item based on first answer content, given by the model to be tested in response to the test item, and on preset value dimension features; and performing multidimensional evaluation on second answer content based on the value dimension features to obtain an evaluation score of the model to be tested, wherein the second answer content is the answer of the model to be tested to the new test item. Unlike approaches that rely on a static question bank, the present application generates new test items automatically, so that test scenarios combining multiple rules can be produced dynamically and the difficulty can be updated adaptively according to feedback, thereby addressing the technical problem that current evaluations of financial insurance language models perform poorly.
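The closed loop summarized in the abstract can be illustrated with a minimal Python sketch. All names here (`BankItem`, `evaluate_model`, and the three callables passed in) are hypothetical and do not come from the application; the model, the item generator, and the scorer are supplied by the caller, and averaging the per-item scores is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BankItem:
    """A question-bank entry; the fields are assumptions for this sketch."""
    text: str
    difficulty: int


def evaluate_model(
    model: Callable[[str], str],
    question_bank: Sequence[BankItem],
    generate_item: Callable[[BankItem, str], BankItem],
    score_answer: Callable[[BankItem, str], float],
) -> float:
    """Seed item -> first answer -> new item -> second answer -> score.

    Returns the mean score over all seed items (0.0 for an empty bank).
    """
    scores = []
    for seed in question_bank:
        first_answer = model(seed.text)               # answer to the bank item
        new_item = generate_item(seed, first_answer)  # dynamically generated item
        second_answer = model(new_item.text)          # answer to the new item
        scores.append(score_answer(new_item, second_answer))
    return sum(scores) / len(scores) if scores else 0.0
```

Only the second answer is scored in this sketch, mirroring the abstract; the first answer serves solely as input to item generation.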

Inventors

WANG JIANZONG
ZHANG NAN
QU XIAOYANG

Assignees

平安科技（深圳）有限公司

Dates

Publication Date: 20260505
Application Date: 20260202

Claims (10)

1. A method of evaluating a financial insurance language model, the method comprising: in response to an evaluation instruction for a model to be tested, acquiring a test item from a preset financial insurance question bank; generating a new test item based on first answer content of the model to be tested for the test item and on a preset value dimension feature; and performing multidimensional evaluation on second answer content based on the value dimension feature to obtain an evaluation score of the model to be tested, wherein the second answer content is the answer content of the model to be tested for the new test item.
2. The method of claim 1, wherein each test item is labeled with a corresponding test difficulty and each answer content is labeled with a corresponding value alignment score, and wherein generating the new test item based on the first answer content of the model to be tested for the test item and on the preset value dimension feature comprises: determining a second test difficulty based on the value alignment score with which the first answer content is labeled, the first test difficulty with which the test item is labeled, and the value dimension feature; generating candidate test items based on a preset prompt template comprising a plurality of difficulty dimensions, wherein the plurality of difficulty dimensions comprise information integrity, insurance clause complexity, and model trust threshold; and performing difficulty evaluation on the candidate test items to obtain a difficulty evaluation result for each candidate test item, and screening out, based on the difficulty evaluation results, the candidate test items whose difficulty is consistent with the second test difficulty.
3. The method of claim 2, further comprising, prior to determining the second test difficulty based on the value alignment score with which the first answer content is labeled, the first test difficulty with which the test item is labeled, and the value dimension feature: determining a plurality of value dimensions for the financial insurance domain; determining a division basis for each value dimension in the financial insurance domain, wherein the division basis of the compliance value dimension is whether the requirements of regulatory policy and privacy protection are met, the division basis of the fairness value dimension is whether potential bias against a customer group exists, the division basis of the robustness value dimension is decision stability in the face of abnormal input or fraudulent information, and the division basis of the credibility value dimension is whether an answer is transparent, interpretable, and fact-based; encoding the division basis of each value dimension separately to obtain dimension codes; and performing secondary encoding on the plurality of dimension codes according to feature type and measurement level to obtain value dimension features under the different value dimensions, wherein the feature types comprise a numerical type, a categorical type, a binary type, and a text description type, and the measurement levels comprise an individual level, a family level, an enterprise level, and a societal level.
4. The method of claim 2, further comprising, prior to generating the candidate test items based on the preset prompt template comprising a plurality of difficulty dimensions: upon obtaining difficulty parameters for different difficulties, obtaining a relation mapping table that maps each difficulty parameter to a number of elements; extracting, from a preset element library and based on the relation mapping table, as many elements as the number mapped to the difficulty parameter, to obtain a plurality of target elements; and establishing the prompt template comprising a plurality of difficulty dimensions based on the target elements.
5. The method of claim 1, wherein the evaluation dimensions include a local consistency dimension and a global consistency dimension, and wherein performing multidimensional evaluation on the second answer content based on the value dimension feature to obtain the evaluation score of the model to be tested comprises: evaluating the second answer content in the local consistency dimension based on the value dimension feature to obtain a first evaluation score of the model to be tested, wherein the local consistency dimension evaluates the logical consistency of answer content across multiple rounds in the same scenario; evaluating the second answer content in the global consistency dimension based on the value dimension feature to obtain a second evaluation score of the model to be tested, wherein the global consistency dimension evaluates the degree to which the answer content deviates from a standard answer; and determining the evaluation score of the model to be tested based on the first evaluation score and the second evaluation score.
6. The method according to claim 1, wherein performing multidimensional evaluation on the second answer content based on the value dimension feature to obtain the evaluation score of the model to be tested comprises: performing feature extraction on the second answer content based on a preset value dimension model to obtain a second value dimension feature vector corresponding to the second answer content; calculating the similarity between the second value dimension feature vector and a preset standard value dimension feature vector to obtain a vector similarity; and scoring based on the vector similarity to obtain the evaluation score of the model to be tested.
7. The method of claim 1, wherein generating the new test item based on the first answer content of the model to be tested for the test item and on the preset value dimension feature comprises: generating, through a controlled-variable method and on the basis of a preset template and constraint rules, test questions of differing difficulty and with differing semantic interference, based on the first answer content and the preset value dimension feature, wherein the test questions are structured as atomic test units.
8. An apparatus for evaluating a financial insurance language model, the apparatus comprising: an acquisition unit configured to, in response to an evaluation instruction for a model to be tested, acquire a test item from a preset financial insurance question bank; a test item generating unit configured to generate a new test item based on first answer content of the model to be tested for the test item and on a preset value dimension feature; and an evaluation unit configured to perform multidimensional evaluation on second answer content based on the value dimension feature to obtain an evaluation score of the model to be tested, wherein the second answer content is the answer content of the model to be tested for the new test item.
9. A device for evaluating a financial insurance language model, comprising a memory, a processor, and a program for evaluating a financial insurance language model that is stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for evaluating a financial insurance language model of any one of claims 1 to 7.
10. A medium having stored thereon a program for implementing a method of evaluating a financial insurance language model, wherein the program, when executed by a processor, implements the steps of the method of evaluating a financial insurance language model as claimed in any one of claims 1 to 7.
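The scoring steps recited in claims 5 and 6 can be illustrated with a short Python sketch. The claims only require "similarity" between two value dimension feature vectors and a final score determined from two sub-scores; the use of cosine similarity, the linear mapping onto a 0 to 100 scale, and the weighted average below are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_score(similarity: float, max_score: float = 100.0) -> float:
    """Map a similarity onto [0, max_score]; negative values are clamped to 0."""
    clamped = max(0.0, min(1.0, similarity))
    return clamped * max_score


def combine_scores(local_score: float, global_score: float,
                   local_weight: float = 0.5) -> float:
    """Weighted average of the local- and global-consistency scores.

    The weighting scheme is an assumption; the claims do not specify one.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must be between 0 and 1")
    return local_weight * local_score + (1.0 - local_weight) * global_score
```

In such a sketch the vector components might follow the four value dimensions named in claim 3 (compliance, fairness, robustness, credibility), with the standard vector representing the reference answer.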

Description

Evaluation method, device, equipment and medium for financial insurance language model

Technical Field

The present application relates to the field of financial insurance, and in particular to a method, an apparatus, a device, and a medium for evaluating a financial insurance language model.

Background

Intelligent customer service, product recommendation, claim auditing, and risk assessment systems based on large language models are now widely deployed in the financial insurance field, and these models show significant advantages in language understanding, knowledge retrieval, and strategy generation. However, as the scale and complexity of language models increase, the problem of evaluating their compliance has become increasingly prominent. Traditional evaluation methods generally rely on static benchmark sets or manual testing by experts to score aspects such as the compliance of model output. Such methods struggle to capture emerging financial risks, yield overly optimistic results that reduce the reliability of the evaluation, and suffer from poor timeliness, so they cannot truly reflect the current capability of the model. The prior art therefore has the technical problem that evaluations of financial insurance language models perform poorly.

Disclosure of Invention

The embodiments of the present application provide a method, an apparatus, a device, and a medium for evaluating a financial insurance language model, which can address the technical problem that current evaluations of financial insurance language models perform poorly.
In a first aspect, an embodiment of the present application provides a method for evaluating a financial insurance language model, including: in response to an evaluation instruction for a model to be tested, acquiring a test item from a preset financial insurance question bank; generating a new test item based on first answer content of the model to be tested for the test item and on a preset value dimension feature; and performing multidimensional evaluation on second answer content based on the value dimension feature to obtain an evaluation score of the model to be tested, wherein the second answer content is the answer content of the model to be tested for the new test item.
In some embodiments, each test item is labeled with a corresponding test difficulty and each answer content is labeled with a corresponding value alignment score, and generating the new test item based on the first answer content of the model to be tested for the test item and on the preset value dimension feature includes: determining a second test difficulty based on the value alignment score with which the first answer content is labeled, the first test difficulty with which the test item is labeled, and the value dimension feature; generating candidate test items based on a preset prompt template comprising a plurality of difficulty dimensions, wherein the plurality of difficulty dimensions comprise information integrity, insurance clause complexity, and model trust threshold; and performing difficulty evaluation on the candidate test items to obtain a difficulty evaluation result for each candidate test item, and screening out, based on the difficulty evaluation results, the candidate test items whose difficulty is consistent with the second test difficulty.
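The difficulty update and candidate screening described in this embodiment can be sketched as follows. The application does not state how the second test difficulty is computed; the threshold rule, the integer difficulty levels 1 to 5, and the function names below are assumptions for illustration only, and the sketch omits the value dimension feature, which the embodiment also takes into account. The difficulty estimator is supplied by the caller.

```python
from typing import Callable, List, Sequence


def next_difficulty(current: int, alignment_score: float,
                    raise_at: float = 0.8, lower_below: float = 0.5,
                    min_level: int = 1, max_level: int = 5) -> int:
    """Pick the second test difficulty from the first one and the alignment score.

    Assumed rule: raise the level after a well-aligned answer, lower it after
    a poorly aligned one, otherwise keep it; the result stays within bounds.
    """
    if alignment_score >= raise_at:
        step = 1
    elif alignment_score < lower_below:
        step = -1
    else:
        step = 0
    return max(min_level, min(max_level, current + step))


def screen_candidates(candidates: Sequence[str],
                      estimate_difficulty: Callable[[str], int],
                      target: int) -> List[str]:
    """Keep only the candidates whose estimated difficulty equals the target."""
    return [c for c in candidates if estimate_difficulty(c) == target]
```

For example, with a toy estimator that counts semicolon-separated clauses, `screen_candidates(items, lambda t: t.count(";") + 1, next_difficulty(1, 0.9))` would keep the two-clause candidates. A production estimator would more plausibly score the three difficulty dimensions named above.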
In some embodiments, before determining the second test difficulty based on the value alignment score with which the first answer content is labeled, the first test difficulty with which the test item is labeled, and the value dimension feature, the method includes: determining a plurality of value dimensions for the financial insurance domain; determining a division basis for each value dimension in the financial insurance domain, wherein the division basis of the compliance value dimension is whether the requirements of regulatory policy and privacy protection are met, the division basis of the fairness value dimension is whether potential bias against a customer group exists, the division basis of the robustness value dimension is decision stability in the face of abnormal input or fraudulent information, and the division basis of the credibility value dimension is whether an answer is transparent, interpretable, and fact-based; encoding the division basis of each value dimension separately to obtain dimension codes; and performing secondary encoding on the plurality of dimension codes according to feature type and measurement level to obtain value dimension features under the different value dimensions, wherein the feature types comprise a numerical type, a categorical type, a binary type, and a text description type, and the measurement levels comprise an individual level, a family level, an enterprise level, and a societal level. In some embodiments, the