CN-122021915-A - Visual language big model reasoning thinking chain quantitative evaluation method, device and equipment


Abstract

The application relates to the technical field of artificial intelligence, and in particular to a method, a device, and computer equipment for quantitatively evaluating the reasoning thinking chain of a visual language big model. The method comprises: inputting a preset input sample into a model to be evaluated, the model to be evaluated performing reasoning according to sub-thinking chains; obtaining an output score for each sub-thinking chain from the original output value of the correct answer corresponding to the input sample at the last layer of the model to be evaluated; calculating, for each sub-thinking chain, a thinking chain interaction score from the output scores of the sub-thinking chain and of its subsets; and calculating, based on the thinking chain interaction scores, evaluation information of the model to be evaluated. The technical scheme provided by the application offers a quantitative and objective basis for cross-model performance comparison, model optimization, and thinking chain optimization.
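The two scoring steps summarized above can be illustrated with a minimal Python sketch. It rests on assumptions that the application does not spell out: the output score is taken to be the log-probability of the correct answer under a softmax over the raw last-layer outputs (claim 7 only says "logarithmic probability"), and the interaction score is taken to be the signed sum over all subsets, with the sign set by the parity of the step-count difference (the reading suggested by claim 8). The names `output_score`, `interaction_score`, `all_subsets`, and `toy_scores` are hypothetical.

```python
import math
from itertools import combinations


def output_score(logits, correct_index):
    """Log-probability of the correct answer (assumption: softmax over raw logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[correct_index] - log_z


def all_subsets(steps):
    """Yield every subset of a sub-thinking chain, including the empty set."""
    steps = tuple(steps)
    for size in range(len(steps) + 1):
        for combo in combinations(steps, size):
            yield frozenset(combo)


def interaction_score(sub_chain, score_of):
    """Signed sum of subset output scores; the sign follows the parity of
    (number of steps in the sub-chain) - (number of steps in the subset)."""
    sub_chain = frozenset(sub_chain)
    total = 0.0
    for subset in all_subsets(sub_chain):
        sign = -1.0 if (len(sub_chain) - len(subset)) % 2 else 1.0
        total += sign * score_of(subset)
    return total


# Hypothetical output scores for a two-step chain (illustrative values only).
toy_scores = {
    frozenset(): 0.0,
    frozenset({1}): 1.0,
    frozenset({2}): 2.0,
    frozenset({1, 2}): 5.0,
}
pair_interaction = interaction_score({1, 2}, toy_scores.__getitem__)
```

In this toy case the two steps together score 5.0 while the single steps score 1.0 and 2.0, so the remaining 2.0 is attributed to their interaction. In a real evaluation `score_of` would run the model on the input restricted to the given reasoning steps, which this sketch does not attempt.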

Inventors

LI CHAO
YAO KELU
XUE JUNXIAO
CHEN YAYING
XU NUO

Assignees

Zhejiang Lab (之江实验室)

Dates

Publication Date: 20260512
Application Date: 20260206

Claims (10)

1. A method for quantitatively evaluating the reasoning thinking chain of a visual language big model, characterized by comprising the following steps: inputting a preset input sample into a model to be evaluated, wherein the model to be evaluated performs reasoning according to a sub-thinking chain, and the sub-thinking chain is a subset of a complete thinking chain of the model to be evaluated; calculating, according to an original output value of a correct answer corresponding to the input sample at the last layer of the model to be evaluated, an output score corresponding to each sub-thinking chain; for each sub-thinking chain, calculating a thinking chain interaction score of the sub-thinking chain according to the output scores of the sub-thinking chain and of its subsets, wherein the thinking chain interaction score represents the degree to which interaction and cooperation among different reasoning steps within one sub-thinking chain contribute to improving the correctness of the reasoning result; and calculating, based on the thinking chain interaction score, evaluation information of the model to be evaluated, wherein the evaluation information comprises at least one of an interaction strength of the model to be evaluated and a consistency evaluation, the interaction strength represents the degree to which interaction and cooperation among different reasoning steps in the complete thinking chain contribute to improving the correctness of the reasoning result, and the consistency evaluation represents the degree of logical association between two reasoning steps in the complete thinking chain.
2. The method according to claim 1, wherein the calculating based on the thinking chain interaction score to obtain the evaluation information of the model to be evaluated comprises: for each input sample, grouping the sub-thinking chains according to the number of reasoning steps they include; for each group, calculating the mathematical expectation of the thinking chain interaction scores of the sub-thinking chains in the group as an intra-group mathematical expectation; and, over all input samples, calculating for each group the mathematical expectation of the intra-group mathematical expectations of that same group, and normalizing the results to obtain the interaction strength of the model to be evaluated.
3. The method according to claim 1, wherein the calculating based on the thinking chain interaction score to obtain the evaluation information of the model to be evaluated comprises: normalizing the thinking chain interaction scores, and fitting the normalized thinking chain interaction scores to obtain a fitted curve; determining a curvature extreme point of the fitted curve as a significant interaction threshold, and selecting significantly interacting sub-thinking chains to form a significant interaction set, wherein the normalized thinking chain interaction score of each significantly interacting sub-thinking chain is greater than the significant interaction threshold; and calculating, according to the significant interaction set, the evaluation information of the model to be evaluated.
4. The method according to claim 3, wherein the consistency evaluation comprises an average interaction step length; and the calculating, according to the significant interaction set, the evaluation information of the model to be evaluated comprises: counting the number of continuous step pairs in the significant interaction set, wherein a continuous step pair is a combination of two adjacent reasoning steps in the same sub-thinking chain; accumulating the step length of each continuous step pair relative to the complete thinking chain to obtain a total step length, and calculating the quotient of the total step length and the number of continuous step pairs as a first average step length; and calculating the mathematical expectation of the first average step lengths corresponding to all the input samples as the average interaction step length.
5. The method according to claim 3, wherein the consistency evaluation comprises a step accumulated contribution set; and the calculating, according to the significant interaction set, the evaluation information of the model to be evaluated comprises: identifying the continuous step pairs in the significant interaction set, wherein a continuous step pair is a combination of two adjacent reasoning steps in the same sub-thinking chain; and, for each continuous step pair, accumulating the absolute values of the thinking chain interaction scores of the sub-thinking chains in the significant interaction set that include the continuous step pair to obtain the step accumulated contribution of the continuous step pair, and normalizing the step accumulated contributions to obtain the step accumulated contribution set.
6. The method according to claim 3, wherein the evaluation information further comprises a modality dependence; and the calculating, according to the significant interaction set, the evaluation information of the model to be evaluated comprises: adding noise to any one modality of the input sample to obtain a noise input sample, inputting the noise input sample into the model to be evaluated, and calculating the significant interaction set corresponding to the noise input sample as a noise significant interaction set; for each noise input sample and its corresponding original input sample, calculating the intersection and the union of the noise significant interaction set and the significant interaction set, and calculating the quotient of the number of elements in the intersection and the number of elements in the union as a sample modality dependence for the modality to which the noise was added; and, for each modality to which noise was added, calculating the mathematical expectation of all the sample modality dependences as the modality dependence.
7. The method according to claim 1, wherein the calculating, according to the original output value of the correct answer corresponding to the input sample at the last layer of the model to be evaluated, the output score corresponding to each sub-thinking chain comprises: for each sub-thinking chain, acquiring the original output value of the correct answer corresponding to the input sample at the last layer of the model to be evaluated, and calculating the logarithmic probability of the original output value as the output score corresponding to the sub-thinking chain.
8. The method according to claim 1, wherein the calculating, for each sub-thinking chain, the thinking chain interaction score of the sub-thinking chain comprises: for each sub-thinking chain, traversing each subset of the sub-thinking chain, assigning during the traversal a sign according to the parity of the difference between the number of reasoning steps in the sub-thinking chain and the number of reasoning steps in the subset, and multiplying the sign by the output score of the subset to obtain the interaction score of the subset; and calculating the sum of the interaction scores of all the subsets as the thinking chain interaction score.
9. A visual language big model reasoning thinking chain quantitative evaluation device, characterized in that the device comprises: a reasoning module, configured to input a preset input sample into a model to be evaluated, wherein the model to be evaluated performs reasoning according to a sub-thinking chain, and the sub-thinking chain is a subset of a complete thinking chain of the model to be evaluated; a first calculation module, configured to calculate, according to an original output value of a correct answer corresponding to the input sample at the last layer of the model to be evaluated, an output score corresponding to each sub-thinking chain; a second calculation module, configured to calculate, for each sub-thinking chain, a thinking chain interaction score of the sub-thinking chain according to the output scores of the sub-thinking chain and of its subsets, wherein the thinking chain interaction score represents the degree to which interaction and cooperation among different reasoning steps within one sub-thinking chain contribute to improving the correctness of the reasoning result; and an evaluation generation module, configured to calculate, based on the thinking chain interaction score, evaluation information of the model to be evaluated, wherein the evaluation information comprises at least one of an interaction strength of the model to be evaluated and a consistency evaluation, the interaction strength represents the degree to which interaction and cooperation among different reasoning steps in the complete thinking chain contribute to improving the correctness of the reasoning result, and the consistency evaluation represents the degree of logical association between two reasoning steps in the complete thinking chain.
10. A computer device, comprising a memory and a processor, wherein the memory and the processor are communicatively connected, the memory stores computer instructions, and the processor executes the computer instructions so as to perform the visual language big model reasoning thinking chain quantitative evaluation method of any one of claims 1 to 8.
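The aggregate metrics described in claims 2, 4, and 6 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the applicant's implementation: the normalization in claim 2 is assumed to divide by the sum of absolute per-group values, the "step length relative to the complete thinking chain" in claim 4 is assumed to be the difference between the positions of two adjacent steps in the complete chain, and an empty union in claim 6 is assumed to give a dependence of 1.0. All function names are hypothetical, and sub-thinking chains are represented as frozensets of step indices.

```python
from collections import defaultdict
from statistics import mean


def interaction_strength(samples):
    """Claim 2 sketch. `samples` is a list of dicts, one per input sample,
    mapping a sub-chain (frozenset of step indices) to its interaction score."""
    per_order = defaultdict(list)
    for scores in samples:
        groups = defaultdict(list)
        for sub_chain, score in scores.items():
            groups[len(sub_chain)].append(score)  # group by number of steps
        for order, values in groups.items():
            per_order[order].append(mean(values))  # intra-group expectation
    raw = {order: mean(values) for order, values in per_order.items()}
    # Assumption: normalize so that the absolute values sum to 1.
    scale = sum(abs(v) for v in raw.values())
    return {order: (v / scale if scale else 0.0) for order, v in sorted(raw.items())}


def average_step(significant_set):
    """Claim 4 sketch for one sample: mean gap between adjacent steps
    over all sub-chains in the significant interaction set."""
    gaps = []
    for sub_chain in significant_set:
        ordered = sorted(sub_chain)
        # Assumption: step length = difference of positions in the complete chain.
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return sum(gaps) / len(gaps) if gaps else 0.0


def modality_dependence(pairs):
    """Claim 6 sketch. `pairs` is a list of (original_set, noisy_set) tuples,
    each a set of significantly interacting sub-chains for one sample."""
    ratios = []
    for original, noisy in pairs:
        union = original | noisy
        # Assumption: two empty sets count as fully overlapping.
        ratios.append(len(original & noisy) / len(union) if union else 1.0)
    return mean(ratios)
```

The per-sample `average_step` values would still need to be averaged over all input samples to obtain the average interaction step length of claim 4, and `modality_dependence` would be called once per perturbed modality.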

Description

Visual language big model reasoning thinking chain quantitative evaluation method, device and equipment

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a method, a device, and computer equipment for quantitatively evaluating the reasoning thinking chain of a visual language big model.

Background

Quantitative evaluation of the reasoning ability of language big models has long been a technical bottleneck for current big model technology. Existing evaluation methods include the following. Evaluation based on performance indexes such as accuracy or perplexity can reflect how well a task is completed, but struggles to analyze the logic structure inside a model in depth. Post-hoc interpretation methods such as neuron activation analysis or attention visualization can observe the local behavior of a model, but still lack the ability to systematically evaluate its overall reasoning capability. Methods such as concept activation vectors and reasoning path tracking can extract the symbolic hypotheses of intermediate layers and support qualitative analysis of the reasoning process, but cannot comprehensively quantify the reasoning capability of the model as a whole. Therefore, there is a need for a quantitative and comparable method for evaluating the reasoning capability of language models, so that their reasoning performance can be evaluated objectively.
Disclosure of Invention

The application provides a method, a device, and computer equipment for quantitatively evaluating the reasoning thinking chain of a visual language big model, which can generate quantitative evaluation information and enable objective evaluation of the reasoning performance of a language big model.

To achieve the above purpose, the main technical scheme adopted by the application is as follows.

In a first aspect, an embodiment of the present application provides a visual language big model reasoning thinking chain quantitative evaluation method, the method comprising:
inputting a preset input sample into a model to be evaluated, wherein the model to be evaluated performs reasoning according to a sub-thinking chain, and the sub-thinking chain is a subset of a complete thinking chain of the model to be evaluated;
calculating, according to an original output value of a correct answer corresponding to the input sample at the last layer of the model to be evaluated, an output score corresponding to each sub-thinking chain;
for each sub-thinking chain, calculating a thinking chain interaction score of the sub-thinking chain according to the output scores of the sub-thinking chain and of its subsets, wherein the thinking chain interaction score represents the degree to which interaction and cooperation among different reasoning steps within one sub-thinking chain contribute to improving the correctness of the reasoning result; and
calculating, based on the thinking chain interaction score, evaluation information of the model to be evaluated, wherein the evaluation information comprises at least one of an interaction strength of the model to be evaluated and a consistency evaluation, the interaction strength represents the degree to which interaction and cooperation among different reasoning steps in the complete thinking chain contribute to improving the correctness of the reasoning result, and the consistency evaluation represents the degree of logical association between two reasoning steps in the complete thinking chain.

The method provided by the embodiment of the application obtains sub-thinking chains by decomposing the thinking chain of the language big model, calculates thinking chain interaction scores from the outputs of the sub-thinking chains, and from those interaction scores derives evaluations of the interaction strength and the reasoning consistency between the reasoning steps of the language big model. By introducing the output score of the sub-thinking chain and the thinking chain interaction score, the method explicitly quantifies the contribution of each reasoning step, and of the cooperation among steps, to the reasoning result during the reasoning process. The cooperation capability of each reasoning step in the complete thinking chain is reflected through the calculation of the interaction strength, and the logic connection tight