CN-121979753-A - Evaluation method for hallucinations of large visual language models

CN121979753A

Abstract

The invention provides a method for evaluating hallucinations of large visual language models. The method comprises: constructing a plurality of text prompt templates, including templates corresponding to various faithfulness hallucinations and templates corresponding to various factuality hallucinations; constructing an evaluation dataset based on the text prompt templates, the evaluation dataset comprising a plurality of evaluation samples for various tasks, each evaluation sample comprising the input data of the corresponding task and a ground-truth reference answer, the input data comprising a synthetic image and a question instruction about that image; constructing a perturbation dataset comprising a plurality of perturbation samples processed under various perturbation scenarios, each perturbation sample comprising the ground-truth reference answer and the input data perturbed under the corresponding scenario; and evaluating the overall anti-hallucination capability of a model using the evaluation dataset and the perturbation dataset.
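The template-construction step summarized above (filling placeholders in a predefined prompt template by sampling from attribute-matched candidate element pools) can be sketched as follows. The pools, placeholder names, and template text are illustrative assumptions, since the patent does not disclose its actual element lists:

```python
import random

# Hypothetical candidate element pools for a few faithfulness-hallucination
# dimensions (entity type, color, spatial relationship); illustrative only.
CANDIDATE_POOLS = {
    "entity": ["dog", "bicycle", "teapot"],
    "color": ["red", "blue", "green"],
    "relation": ["to the left of", "on top of", "behind"],
}

# A predefined prompt template with attribute-typed placeholders.
TEMPLATE = "A {color} {entity} {relation} a {color2} {entity2}."

def fill_template(template: str, pools: dict, rng: random.Random) -> str:
    """Fill each placeholder by sampling from the pool matching its attribute."""
    values = {
        "entity": rng.choice(pools["entity"]),
        "entity2": rng.choice(pools["entity"]),
        "color": rng.choice(pools["color"]),
        "color2": rng.choice(pools["color"]),
        "relation": rng.choice(pools["relation"]),
    }
    return template.format(**values)

rng = random.Random(0)
prompt = fill_template(TEMPLATE, CANDIDATE_POOLS, rng)
print(prompt)  # one randomly instantiated text prompt for the text-to-image model
```

Each instantiated prompt would then be sent to a text-to-image model to produce the synthetic image, with the sampled elements retained as the ground-truth reference for the question instruction.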

Inventors

  • ZHANG JIE
  • YAN BEI
  • CHEN ZHIYUAN
  • MIN YUECONG
  • SHAN SHIGUANG
  • CHEN XILIN

Assignees

  • Institute of Computing Technology, Chinese Academy of Sciences

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A method for evaluating hallucinations of a large visual language model, the method evaluating the overall anti-hallucination capability of the model with respect to both faithfulness hallucinations and factuality hallucinations, the method comprising: S1, constructing a plurality of text prompt templates, the templates comprising templates corresponding to various faithfulness hallucinations and templates corresponding to various factuality hallucinations; S2, constructing an evaluation dataset based on the plurality of text prompt templates, wherein the evaluation dataset comprises a plurality of evaluation samples for various tasks, each evaluation sample comprises the input data of the corresponding task and a ground-truth reference answer, and the input data comprises a synthetic image generated from a text prompt template and a question instruction about the synthetic image; S3, constructing a perturbation dataset based on the evaluation dataset, wherein the perturbation dataset comprises a plurality of perturbation samples obtained by processing each evaluation sample under a plurality of perturbation scenarios, and each perturbation sample comprises the ground-truth reference answer and the input data perturbed under the corresponding scenario; S4, evaluating the overall anti-hallucination capability of a model using the evaluation dataset and the perturbation dataset, comprising: generating a response with the model from each item of input data, and computing the overall anti-hallucination indices of the model for different tasks under different perturbation scenarios from the differences between the responses to all evaluation samples and perturbation samples and the ground-truth reference answers.
  2. The method according to claim 1, wherein in S1 the plurality of text prompt templates are constructed by: constructing a plurality of candidate element pools, comprising pools corresponding to each of a plurality of faithfulness hallucination dimensions and pools corresponding to each of a plurality of factuality hallucination dimensions, wherein each candidate element pool comprises a plurality of candidate elements of the same dimension; for each faithfulness hallucination dimension, randomly sampling candidate elements from the corresponding pools according to the attribute of each placeholder in a predefined prompt template and filling them into the placeholders to obtain the corresponding text prompt template; and for each factuality hallucination dimension, randomly sampling candidate elements from the corresponding pools according to the attribute of each placeholder in a predefined prompt template and filling them into the placeholders to obtain the corresponding text prompt template.
  3. The method according to claim 1, wherein in S1 the plurality of faithfulness hallucination dimensions comprises faithfulness hallucinations of entity type, color, spatial relationship, and shape; and the plurality of factuality hallucination dimensions comprises factuality hallucinations of sports knowledge, political knowledge, entertainment knowledge, religious knowledge, physical-culture knowledge, and geographic knowledge.
  4. The method according to claim 1, wherein in S2 the plurality of tasks comprises discriminative tasks and generative tasks, and the evaluation dataset is constructed by: S21, inputting the plurality of text prompt templates into a text-to-image model to obtain a plurality of candidate synthetic images, and discarding candidate images of unqualified quality to obtain a plurality of synthetic images; S22, for each task, setting a question instruction and a ground-truth reference answer for each synthetic image according to its text prompt template, wherein setting the question instruction comprises: extracting the semantic information of the text prompt template; for a discriminative task, filling the semantic information into an instruction template preset as a yes/no question or a multiple-choice question, to obtain a question instruction requiring the model to make a judgment; and for a generative task, filling the semantic information into an instruction template preset as free-form question answering or image description, to obtain a question instruction allowing the model to answer freely; S23, constructing the evaluation dataset comprising a plurality of evaluation samples under the plurality of tasks, wherein each synthetic image together with its question instruction under the corresponding task serves as the input data, and the input data together with the ground-truth reference answer form one evaluation sample for that task.
  5. The method according to claim 1, wherein in S3 the perturbation dataset is constructed by: applying a plurality of image perturbation modes to the synthetic image of an evaluation sample while keeping its question instruction and ground-truth reference answer unchanged, to obtain perturbation samples under a plurality of image perturbation scenarios; applying a plurality of semantic perturbation modes to the question instruction of an evaluation sample while keeping its synthetic image and ground-truth reference answer unchanged, to obtain perturbation samples under a plurality of semantic perturbation scenarios; constructing a plurality of combined perturbation modes, each comprising one image perturbation mode selected from the plurality of image perturbation modes and one semantic perturbation mode selected from the plurality of semantic perturbation modes; for each combined perturbation mode, perturbing the synthetic image of an evaluation sample with its image perturbation mode and perturbing the question instruction with its semantic perturbation mode while keeping the ground-truth reference answer unchanged, to obtain perturbation samples under a plurality of combined perturbation scenarios; and collecting the perturbation samples under the image perturbation scenarios, the semantic perturbation scenarios, and the combined perturbation scenarios to obtain the perturbation dataset.
  6. The method according to claim 5, wherein the plurality of image perturbation modes comprises applying style transfer, image corruption, adversarial noise addition, and scene-text injection to the synthetic image of an evaluation sample; and the plurality of semantic perturbation modes comprises applying synonym substitution and misleading context-prefix insertion to the question instruction of an evaluation sample.
  7. The method according to claim 1, wherein in S4 the overall anti-hallucination index of each task under each perturbation scenario is computed by: counting all perturbation samples obtained by processing all evaluation samples of that task under that perturbation scenario; obtaining a response from the model for the input data of each perturbation sample, and determining whether the response is accurate from its difference with the ground-truth reference answer; and taking the ratio of the number of accurate responses to the total number of perturbation samples as the corresponding overall anti-hallucination index.
  8. The method according to claim 1, wherein in S4 the overall anti-hallucination capability further comprises a perturbation-resistance index of the model under the different perturbation scenarios, computed as $R = \dfrac{\sum_i \mathbb{1}[f(x_i)=y_i]\cdot \mathbb{1}[f(\tilde{x}_i)=y_i]}{\sum_i \mathbb{1}[f(x_i)=y_i]}$, wherein $R$ represents the perturbation-resistance index; $f$ represents the model; $v_i$ represents the synthetic image of evaluation sample $i$, $q_i$ its question instruction, and $x_i=(v_i,q_i)$ its input data; $\mathbb{1}[f(x_i)=y_i]$ takes the value 1 or 0, where 1 indicates that the response generated by the model from the input data of the evaluation sample is accurate and 0 indicates that it is erroneous; $\tilde{x}_i$ represents the input data of the perturbation sample obtained by processing evaluation sample $i$ under the corresponding perturbation scenario; $\mathbb{1}[f(\tilde{x}_i)=y_i]$ likewise takes the value 1 or 0, where 1 indicates that the response generated by the model from the input data of the perturbation sample is accurate and 0 indicates that it is erroneous; the numerator counts the samples for which the model responds accurately on both the evaluation sample and the corresponding perturbation sample, and the denominator counts the total number of accurate responses of the model on the evaluation samples.
  9. A computer program product comprising a computer program/instructions which, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
  10. A computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
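The two metrics of claims 7 and 8 can be sketched as follows, assuming per-sample correctness has already been judged against the ground-truth reference answers; the function and variable names are illustrative, not the patent's own:

```python
def overall_index(perturbed_correct: list) -> float:
    """Claim 7: fraction of perturbation samples answered accurately."""
    return sum(perturbed_correct) / len(perturbed_correct)

def robustness_index(clean_correct: list, perturbed_correct: list) -> float:
    """Claim 8: among samples answered accurately in the clean (unperturbed)
    setting, the fraction still answered accurately after perturbation."""
    both = sum(1 for c, p in zip(clean_correct, perturbed_correct) if c and p)
    clean_total = sum(clean_correct)
    return both / clean_total if clean_total else 0.0

# Toy correctness judgments for four evaluation samples under one scenario.
clean = [True, True, False, True]
perturbed = [True, False, False, True]
print(overall_index(perturbed))            # 2 of 4 perturbed samples correct -> 0.5
print(robustness_index(clean, perturbed))  # 2 of 3 clean-correct survive -> 2/3
```

The robustness index deliberately conditions on clean-sample correctness, so it isolates degradation caused by the perturbation rather than re-counting errors the model already made on the unperturbed input.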

Description

Evaluation method for hallucinations of large visual language models

Technical Field

The invention relates to the technical field of artificial intelligence safety, in particular to applications of large visual language models, and more particularly to a method for evaluating hallucinations of large visual language models.

Background

In recent years, large models, including large language models and large visual language models, have developed remarkably in the field of artificial intelligence and have been widely applied in practical scenarios such as autonomous driving and medical image analysis. However, existing models suffer from hallucination: model output that appears plausible but is inconsistent with the user input or with established world knowledge. Depending on the source of the inconsistency, hallucinations can be divided into faithfulness hallucinations, with respect to the input, and factuality hallucinations, with respect to world knowledge. Faithfulness hallucination refers to the model generating content inconsistent with the user input; factuality hallucination refers to the model generating content inconsistent with external, established world knowledge. This phenomenon severely constrains the practicality and reliability of models, and is particularly risky when users lacking expertise rely on them excessively. Existing hallucination evaluation methods focus mainly on large language models: the main technique is to design text question-answering benchmarks and measure the degree of hallucination by checking the proportion of model output segments consistent with the benchmark facts. For large visual language models, however, answers are generated from cross-modal (visual and textual) inputs, so hallucination evaluation must consider the consistency between the cross-modal input and the output, as well as their combined consistency with external world facts.
Faithfulness hallucination evaluation for a large visual language model examines whether the generated answer contains information contradicting the content of the input image. For example, the model may generate an answer mentioning "many guests" when no crowd is actually present in the image; such a departure from the visual facts is a faithfulness hallucination. Factuality hallucination evaluation focuses on answers that are consistent with the image but contradict real-world knowledge. For example, the model may correctly identify the Eiffel Tower in an image but claim it was "built in 1850", whereas the historical fact is that it was built starting in 1887; such an error, inconsistent with accepted knowledge, is a factuality hallucination. This analysis shows that evaluation standards designed for large language models are difficult to apply directly to large visual language models, and that a new evaluation method must be designed around the characteristics of large visual language models. Among current evaluation methods for large visual language models, some studies focus mainly on faithfulness hallucinations, chiefly assessing the consistency of model answers with the input images themselves, as in reference [1]. These methods often ignore factuality hallucinations that contradict established world facts, and their evaluation dimensions and tasks are not comprehensive enough, making a comprehensive, systematic assessment of the overall hallucination behavior of large visual language models difficult. In addition, existing benchmark datasets often rely on costly manual construction or on the reuse of public datasets, which leads to inefficient dataset construction, poor scalability, and a potential risk of data leakage.
Therefore, in existing evaluation methods for large visual language models, factuality hallucination evaluation is easily neglected, the evaluation dimensions and tasks are not comprehensive enough, and the overall hallucination behavior of large visual language models is difficult to evaluate systematically; moreover, the datasets used by existing evaluations are inefficient to construct and carry potential data-leakage risks. It should be noted that this background is provided only to describe information relevant to the invention and to facilitate understanding of its technical solution; it does not imply that this information is necessarily prior art. In the absence of evidence that the related information was published before the filing date of the present application, it should not be considered prior art. The references are as follows: [1] Evaluating Object Hallucination in Large Vision-Language Models (EMNLP, 2023); PhD: A ChatGPT-Prompted Visual Hallucination Evaluation Dataset (CVPR, 2025). Disclosure of Invention It is therefore an object of th