CN-122019363-A - Automated testing method and system for a multimodal large model
Abstract
The invention relates to the field of artificial intelligence testing, and in particular to an automated testing method and system for a multimodal large model. The method comprises: acquiring a reference multimodal sample, wherein the reference multimodal sample comprises first-modality data and second-modality data, and the first-modality data contains a core semantic entity; generating a corresponding causal intervention question based on the core semantic entity; performing automated editing on the first-modality data to remove the core semantic entity and generate a counterfactual multimodal sample, wherein the image structural similarity between the counterfactual sample and the reference sample outside the region where the core semantic entity is located is higher than a preset threshold; inputting the reference sample and the counterfactual sample into the multimodal large model respectively, and obtaining the model's two output answers to the causal intervention question; quantifying the causal sensitivity of the multimodal large model to the core semantic entity by calculating the degree of difference between the two output answers; and generating a robustness assessment report of the multimodal large model based on the causal sensitivity.
Inventors
- Wu Yong
- Cao Tuohuang
- Zhang Lin
- Luo Weijia
- Xue Jian
- Huang Weifeng
- Bu Yuxin
Assignees
- 广州掌测信息科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-23
Claims (10)
- 1. An automated testing method for a multimodal large model, the method comprising: acquiring a reference multimodal sample, wherein the reference multimodal sample comprises first-modality data and second-modality data, and the first-modality data contains a core semantic entity; generating a corresponding causal intervention question based on the core semantic entity, the causal intervention question being configured to verify the degree of dependence of the multimodal large model on the core semantic entity; performing automated editing on the first-modality data to remove the core semantic entity and generate a counterfactual multimodal sample, wherein the image structural similarity between the counterfactual sample and the reference sample outside the region where the core semantic entity is located is higher than a preset threshold; inputting the reference sample and the counterfactual sample into the multimodal large model respectively, and obtaining the model's two output answers to the causal intervention question; quantifying the causal sensitivity of the multimodal large model to the core semantic entity by calculating the degree of difference between the two output answers; and generating a robustness assessment report of the multimodal large model based on the causal sensitivity.
- 2. The automated testing method for a multimodal large model according to claim 1, wherein performing automated editing on the first-modality data to remove the core semantic entity and generating the counterfactual multimodal sample comprises: identifying the modality type of the first-modality data; selecting a corresponding target strategy from a plurality of preset editing strategies according to the identified modality type; and automatically editing the first-modality data using the target strategy to generate the counterfactual multimodal sample.
- 3. The automated testing method for a multimodal large model according to claim 2, wherein selecting a corresponding target strategy from a plurality of preset editing strategies according to the identified modality type comprises: when the first-modality data is a static image, invoking an image segmentation model to identify the precise pixel region of the core semantic entity in the image and generate a corresponding entity mask; inputting the original image and the entity mask into a preset generative inpainting model, which regenerates the content of the pixel region covered by the mask so as to remove the core semantic entity and produce a visually consistent counterfactual image; when the first-modality data is video data, performing frame-by-frame object detection and tracking on the video sequence, and identifying and locking the motion trajectory of the object bearing the core semantic entity throughout the video; locating, according to the motion trajectory, the starting key frame of the event or action triggered or executed with the core semantic entity as its cause; and removing from the video sequence one or more consecutive frames containing the starting key frame so as to eliminate the key causal event or action, then re-splicing the video stream to generate a coherent counterfactual video.
- 4. The automated testing method for a multimodal large model according to claim 2, wherein selecting a corresponding target strategy from a plurality of preset editing strategies according to the identified modality type further comprises: analyzing whether the core semantic entity is a separable visual component of a target object; and if the core semantic entity is a separable visual component of the target object, removing the separable visual component, or replacing the separable visual component with a functionally different component.
- 5. The automated testing method for a multimodal large model according to claim 4, wherein replacing the separable visual component on the target object with a functionally different peer component comprises: analyzing whether the separable visual component is a sufficient condition or a necessary condition for achieving the core function of the target object; if it is a sufficient condition, generating a substitute component that differs from the original component in visual form but realizes the same core function; if it is a necessary condition, generating a failure component that is highly similar to the original component in visual form but cannot realize its core function; and, based on the judgment result, seamlessly integrating the failure component or the substitute component onto the target object through an image generation model to generate the counterfactual multimodal sample.
- 6. The automated testing method for a multimodal large model according to claim 1, further comprising an adaptive testing loop after generating the robustness assessment report, the loop comprising: calculating, based on completed test results, a utility index for each executed test case, wherein the utility index comprises the causal sensitivity value revealed by the test case, the risk level in the business scenario of the core semantic entity targeted by the test case, and the uniqueness of the modality and semantic combination covered by the test; screening out, according to the utility index, covered cases whose utility index is lower than a threshold from a preset test case library, and generating a new batch of candidate test cases based on a reinforcement learning strategy or a diversity-based sampling strategy; inputting the new candidate test cases into a utility prediction model trained to predict the potential test utility of candidate cases, and selecting one or more candidate cases with the highest predicted utility as the reference multimodal samples for the next round of testing; and inputting the selected new reference multimodal samples into the test flow, repeating the adaptive testing loop until a preset test budget is reached.
- 7. The automated testing method for a multimodal large model according to claim 6, wherein after quantifying the causal sensitivity of the multimodal large model to the core semantic entity by calculating the degree of difference between the two output answers, the method further comprises: judging whether the causal sensitivity exceeds a preset high threshold; and if so, executing a depth attribution test, comprising: generating a plurality of intermediate counterfactual samples in which the core semantic entity is partially removed or progressively perturbed; sequentially inputting the reference sample and the intermediate counterfactual samples into the multimodal large model, and obtaining the model's output answers to the causal intervention question; and plotting a curve relating the integrity of the core semantic entity to the confidence of the model's output answers, and including the curve in the robustness assessment report so as to locate the mutation critical point of the model's decision.
- 8. An automated testing system for a multimodal large model, comprising: an acquisition module for acquiring a reference multimodal sample, wherein the reference multimodal sample comprises first-modality data and second-modality data, and the first-modality data contains a core semantic entity; a first generation module for generating a corresponding causal intervention question based on the core semantic entity, the causal intervention question being configured to verify the degree of dependence of the multimodal large model on the core semantic entity; an editing module for performing automated editing on the first-modality data to remove the core semantic entity and generate a counterfactual multimodal sample, wherein the image structural similarity between the counterfactual sample and the reference sample outside the region where the core semantic entity is located is higher than a preset threshold; an input module for inputting the reference sample and the counterfactual sample into the multimodal large model respectively and obtaining the model's two output answers to the causal intervention question; a quantification module for quantifying the causal sensitivity of the multimodal large model to the core semantic entity by calculating the degree of difference between the two output answers; and a second generation module for generating a robustness assessment report of the multimodal large model based on the causal sensitivity.
- 9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
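The core quantification step of claim 1 can be sketched end to end in Python. The `model.answer(sample, question)` interface and the token-overlap difference measure below are illustrative assumptions; the claims leave the concrete model interface and difference metric unspecified.

```python
def answer_difference(ans_ref: str, ans_cf: str) -> float:
    """Degree of difference between two output answers, computed here as
    1 minus the Jaccard similarity of their token sets (an assumed metric)."""
    ref, cf = set(ans_ref.lower().split()), set(ans_cf.lower().split())
    if not ref and not cf:
        return 0.0
    return 1.0 - len(ref & cf) / len(ref | cf)

def causal_sensitivity(model, reference_sample, counterfactual_sample, question) -> float:
    """Ask the causal intervention question on both samples and quantify
    the model's causal sensitivity to the removed core semantic entity."""
    ans_ref = model.answer(reference_sample, question)
    ans_cf = model.answer(counterfactual_sample, question)
    return answer_difference(ans_ref, ans_cf)
```

A sensitivity near 1 indicates the model's answer hinges on the removed entity; near 0 indicates the answer was insensitive to the intervention.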
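Claims 1 and 3 also require that the counterfactual image match the reference image outside the entity mask above a preset threshold. A minimal numpy sketch of that consistency check follows; mean absolute pixel agreement stands in for a full structural-similarity (SSIM) metric, and the 0.95 threshold is an assumption.

```python
import numpy as np

def outside_mask_similarity(reference: np.ndarray,
                            counterfactual: np.ndarray,
                            entity_mask: np.ndarray) -> float:
    """Similarity in [0, 1] over pixels where entity_mask is False.
    Images are float arrays scaled to [0, 1]; the mask is boolean."""
    outside = ~entity_mask
    if not outside.any():
        return 1.0  # nothing outside the mask to compare
    diff = np.abs(reference[outside] - counterfactual[outside])
    return float(1.0 - diff.mean())

def passes_consistency_check(reference, counterfactual, entity_mask,
                             threshold: float = 0.95) -> bool:
    """Accept the counterfactual only if edits are confined to the mask."""
    return outside_mask_similarity(reference, counterfactual, entity_mask) >= threshold
```

In practice the check would run on the inpainting model's output before the sample enters the test flow, rejecting edits that leak outside the entity region.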
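The branching logic of claims 4 and 5 (remove or replace a separable visual component depending on whether it is a sufficient or a necessary condition for the target object's core function) reduces to a small decision table. The enum names and returned edit labels below are assumptions for illustration; the generative editing itself is abstracted away.

```python
from enum import Enum

class ConditionType(Enum):
    SUFFICIENT = "sufficient"  # the component alone suffices for the core function
    NECESSARY = "necessary"    # the core function fails without the component

def plan_component_edit(condition: ConditionType) -> str:
    """Select the counterfactual edit for a separable visual component.
    Sufficient condition: swap in a visually different, functionally
    equivalent substitute. Necessary condition: swap in a visually
    near-identical component with the core function disabled."""
    if condition is ConditionType.SUFFICIENT:
        return "substitute_component"
    return "failure_component"
```

The chosen label would then drive the image generation model that integrates the substitute or failure component onto the target object.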
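The case-selection step of the adaptive testing loop in claim 6 can be sketched as follows. The weighting of the three utility components and the retirement threshold are assumptions; the claim only requires that sensitivity, risk level, and coverage uniqueness all contribute to the utility index.

```python
from dataclasses import dataclass

@dataclass
class TestCaseStats:
    case_id: str
    causal_sensitivity: float  # revealed by the executed test, in [0, 1]
    risk_level: float          # business-scenario risk of the entity, in [0, 1]
    uniqueness: float          # novelty of the modality/semantic combination, [0, 1]

def utility_index(stats: TestCaseStats, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted utility of an executed test case (weights are assumed)."""
    w_s, w_r, w_u = weights
    return (w_s * stats.causal_sensitivity
            + w_r * stats.risk_level
            + w_u * stats.uniqueness)

def select_next_references(executed, candidates, predict_utility, k=1, low=0.3):
    """Retire low-utility covered cases, then pick the k candidates with
    the highest predicted utility as the next round's reference samples."""
    retired = {s.case_id for s in executed if utility_index(s) < low}
    pool = [c for c in candidates if c.case_id not in retired]
    return sorted(pool, key=predict_utility, reverse=True)[:k]
```

Here `predict_utility` stands in for the trained utility prediction model; the loop repeats with the selected samples until the test budget is exhausted.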
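The depth attribution test of claim 7 probes the model with progressively degraded versions of the core semantic entity and locates the integrity level at which answer confidence collapses. In the sketch below, `model_confidence` is an assumed wrapper around the model call; the steepest-drop rule for the mutation critical point is one plausible reading of the claim.

```python
def attribution_curve(model_confidence, integrity_levels):
    """Sample (entity integrity, answer confidence) pairs for the report."""
    return [(x, model_confidence(x)) for x in integrity_levels]

def decision_critical_point(curve):
    """Locate the mutation critical point: the integrity level reached
    after the steepest confidence drop. `curve` is ordered from intact
    (integrity 1.0) down to fully removed (integrity 0.0)."""
    best_drop, best_x = float("-inf"), curve[-1][0]
    for (x0, c0), (x1, c1) in zip(curve, curve[1:]):
        if c0 - c1 > best_drop:
            best_drop, best_x = c0 - c1, x1
    return best_x
```

A sharp critical point suggests a brittle, threshold-like dependency on the entity, while a gradual curve suggests the model integrates the entity with other evidence.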
Description
Automated testing method and system for a multimodal large model
Technical Field
The application relates to the technical field of artificial intelligence testing, and in particular to an automated testing method and system for a multimodal large model.
Background
Automated testing of multimodal large models refers to the process of evaluating the performance of large artificial intelligence models capable of understanding and generating multimodal data (such as text, images, and videos) through systematic technical means. With the increasingly wide application of multimodal models in safety-critical fields such as autonomous driving and healthcare, ensuring the reliability and robustness of their cognitive abilities has become an important issue. Among existing testing methods, counterfactual contrastive analysis has attracted attention as a cutting-edge test paradigm. Its core idea is to probe the cognitive boundary of a model by constructing comparison samples that differ from the original sample only in key respects. However, the prior art faces systematic challenges, from sample generation to evaluation mechanisms, when implementing this idea. At the generation level of counterfactual samples, existing methods typically employ relatively coarse-grained editing strategies, such as adding random noise, performing global style transfer, or performing simple object substitution. While such methods do change the input data, they exhibit significant limitations in practice. The generated samples often lack visual realism or semantic plausibility, possibly resulting in "abnormal" scenes that violate physical laws or everyday common sense. When models perform poorly on such samples, it is difficult for an evaluator to distinguish whether this reflects a true cognitive boundary of the model or a legitimate response to an unreasonable input.
A further problem is that such coarse-grained editing tends to change multiple semantic elements in the scene simultaneously, resulting in causal attribution ambiguity: it is difficult for the tester to determine from which particular semantic entity the model's behavioral change originates, which limits the value of the test results for locating specific cognitive deficits. At the evaluation level, existing methods mainly focus on the surface consistency of the model's output, for example by comparing whether the classification results or descriptive texts for the reference sample and the counterfactual sample are consistent. Although this form of assessment can detect significant output changes, it falls short in probing the model's deeper cognitive mechanisms. Existing methods lack the capability to quantify the model's decision dependencies and cannot accurately measure how strongly the model depends on specific semantic entities. More importantly, existing test paradigms fail to effectively verify the causal reasoning ability of the model: even if the model gives the same output on two samples, it may merely have drawn the same conclusion from different surface features without exhibiting stable causal logic. This limitation makes it difficult for existing methods to uncover defects in the model's understanding of the functional necessity of components; for example, the model may fail to distinguish between the definitional and incidental properties of an object, mistaking common components for essential features. Accordingly, the prior art has drawbacks and needs improvement.
Disclosure of Invention
In order to solve one or more problems in the prior art, the main purpose of the application is to provide an automated testing method and system for a multimodal large model.
In order to achieve the above object, the present application provides an automated testing method for a multimodal large model, the method comprising: acquiring a reference multimodal sample, wherein the reference multimodal sample comprises first-modality data and second-modality data, and the first-modality data contains a core semantic entity; generating a corresponding causal intervention question based on the core semantic entity, the causal intervention question being configured to verify the degree of dependence of the multimodal large model on the core semantic entity; performing automated editing on the first-modality data to remove the core semantic entity and generate a counterfactual multimodal sample, wherein the image structural similarity between the counterfactual sample and the reference sample outside the region where the core semantic entity is located is higher than a preset threshold; inputting the reference sample and the counterfactual sample into the multimodal large model respectively, and obtaining the model's two output answers to the causal intervention question; quantifying the causal sensitivity of the multimodal large model to the core semantic entity by calculating the degree of difference between the two output answers; and generating a robustness assessment report of the multimodal large model based on the causal sensitivity.
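The modules of the claimed system map naturally onto a simple composition of callables. The sketch below wires them into one `run()` pass under assumed interfaces; the claims do not prescribe concrete implementations for any module.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MultimodalTestSystem:
    acquire: Callable       # () -> reference sample
    gen_question: Callable  # reference sample -> causal intervention question
    edit: Callable          # reference sample -> counterfactual sample
    query: Callable         # (sample, question) -> answer
    quantify: Callable      # (answer_ref, answer_cf) -> causal sensitivity
    report: Callable        # causal sensitivity -> robustness assessment report

    def run(self):
        ref = self.acquire()
        question = self.gen_question(ref)
        cf = self.edit(ref)
        ans_ref = self.query(ref, question)
        ans_cf = self.query(cf, question)
        return self.report(self.quantify(ans_ref, ans_cf))
```

Each field corresponds to one module of the system claim (acquisition, first generation, editing, input, quantification, second generation), so alternative implementations can be swapped in without changing the test flow.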