CN-121981268-A - Attention calibration method, device, equipment and storage medium of visual language model

CN121981268A

Abstract

The invention discloses an attention calibration method, device, equipment, and storage medium for a visual language model, relating to the fields of visual language models and computer vision and applicable to the finance and medical fields. The method comprises: determining one or more blind tokens from all image tokens of an input image; constructing a biased visual input; calculating biased log probabilities of candidate text tokens based on the biased visual input; calculating original log probabilities of the candidate text tokens based on the original visual input; obtaining a calibrated probability distribution from the biased and original log probabilities of the candidate text tokens; and sampling from the calibrated distribution to obtain the final text token. The method reduces the likelihood that the visual language model generates text inconsistent with the image content, alleviates the hallucination problem in visual language models, and improves the accuracy and reliability of the generated text descriptions.

Inventors

  • ZHANG XULONG
  • PAN PENG

Assignees

  • PING AN TECHNOLOGY (SHENZHEN) CO., LTD.

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A method of attention calibration for a visual language model, the method comprising: determining one or more blind tokens from all image tokens of an input image, wherein the blind tokens are image tokens with low information content relative to the text query in the input image; constructing a biased visual input, wherein the biased visual input comprises all image tokens of the input image, but the feature vectors of image tokens not determined to be blind tokens are set to zero, while the feature vectors of image tokens determined to be blind tokens are kept unchanged; calculating biased log probabilities of candidate text tokens based on the biased visual input; calculating original log probabilities of the candidate text tokens based on an original visual input, wherein the original visual input comprises all image tokens of the input image with their feature vectors kept at their original values; and obtaining a calibrated probability distribution based on the biased and original log probabilities of the candidate text tokens, and sampling from the calibrated probability distribution to obtain the final text token.
  2. The method of claim 1, wherein determining one or more blind tokens from all image tokens of the input image comprises: screening one or more target network layers from the plurality of network layers of the visual language model, wherein the target network layers play a dominant role in understanding and characterizing image content; and determining one or more blind tokens from all image tokens of the input image based on the attention weights of the screened target network layers.
  3. The method of claim 2, wherein screening one or more target network layers from the plurality of network layers of the visual language model comprises: calculating a layer-wise image attention ratio for each network layer, wherein the layer-wise image attention ratio is the proportion that the sum of the attention weights assigned to all image tokens by all attention heads of that layer accounts for in the sum of the attention weights assigned to image tokens across all network layers; and selecting network layers as target network layers according to a preset selection strategy based on the layer-wise image attention ratio of each network layer.
  4. The method of claim 3, wherein selecting network layers as target network layers according to a preset selection strategy based on the layer-wise image attention ratio of each network layer comprises: sorting the network layers in descending order of their layer-wise image attention ratios to obtain an ordered sequence of network layers; selecting network layers one by one from the start of the ordered sequence and accumulating the sum of the layer-wise image attention ratios of the selected layers; and stopping the selection when the cumulative sum first reaches or exceeds a preset threshold, taking all selected network layers as the target network layers.
  5. The method of claim 2, wherein determining one or more blind tokens from all image tokens of the input image based on the attention weights of the screened target network layers comprises: calculating, for each image token, its average attention weight across all attention heads of all target network layers; calculating the mean and standard deviation of the average attention weights of all image tokens; and determining image tokens whose average attention weight exceeds a dynamic threshold as blind tokens, wherein the dynamic threshold is μ + λσ, μ is the mean, σ is the standard deviation, and λ is a preset hyperparameter.
  6. The method of claim 1, wherein obtaining the calibrated probability distribution based on the biased and original log probabilities of the candidate text tokens comprises: calculating the calibrated log probability of each candidate text token by the formula: calibrated log probability = (1 + α) × original log probability - α × biased log probability, wherein α is a hyperparameter controlling the contrast strength and α > 0; and performing a softmax operation on the calibrated log probabilities to obtain the calibrated probability distribution.
  7. The method of claim 6, wherein the value of the hyperparameter α is in the range of 2.5 to 3.0.
  8. An attention calibration device for a visual language model, the device comprising: a determining module for determining one or more blind tokens from all image tokens of an input image, wherein the blind tokens are image tokens with low information content relative to the text query in the input image; a construction module for constructing a biased visual input, wherein the biased visual input comprises all image tokens of the input image, but the feature vectors of image tokens not determined to be blind tokens are set to zero, while the feature vectors of image tokens determined to be blind tokens are kept unchanged; a first calculation module for calculating biased log probabilities of candidate text tokens based on the biased visual input; a second calculation module for calculating original log probabilities of the candidate text tokens based on an original visual input, wherein the original visual input comprises all image tokens of the input image with their feature vectors kept at their original values; and a sampling module for obtaining a calibrated probability distribution based on the biased and original log probabilities of the candidate text tokens, and sampling from the calibrated probability distribution to obtain the final text token.
  9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the attention calibration method for a visual language model of any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the attention calibration method for a visual language model according to any one of claims 1 to 7.
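The layer screening and blind-token detection of claims 3-5 can be sketched in code. This is a minimal illustration, assuming the attention weights over image tokens have already been extracted into NumPy arrays; the function names, array shapes, and default values are assumptions for the example, not part of the claims.

```python
import numpy as np

def select_target_layers(layer_image_attention, threshold=0.6):
    """Claims 3-4: rank layers by their share of total image attention
    (descending) and keep the smallest prefix whose cumulative share
    first reaches or exceeds the preset threshold."""
    ratios = layer_image_attention / layer_image_attention.sum()
    order = np.argsort(ratios)[::-1]              # descending by attention ratio
    cum = np.cumsum(ratios[order])
    k = int(np.searchsorted(cum, threshold) + 1)  # first prefix reaching threshold
    return order[:k]

def find_blind_tokens(attn, target_layers, lam=1.0):
    """Claim 5: attn has shape [layers, heads, num_image_tokens].
    A token whose average attention weight over all heads of the target
    layers exceeds the dynamic threshold mu + lam * sigma is flagged
    as a blind token."""
    avg = attn[target_layers].mean(axis=(0, 1))   # per-token average weight
    mu, sigma = avg.mean(), avg.std()
    return np.where(avg > mu + lam * sigma)[0]
```

Note that, per claim 5, blind tokens are those receiving *high* average attention despite carrying little query-relevant information, so the comparison is against the upper threshold μ + λσ.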

Description

Attention calibration method, device, equipment and storage medium of a visual language model

Technical Field

The invention relates to the fields of visual language models and computer vision, and can be applied to the fields of finance and medical treatment; in particular, it relates to a method, device, equipment, and storage medium for calibrating the attention of a visual language model.

Background

In recent years, visual language models (Large Visual Language Models, LVLMs for short) have demonstrated significant capabilities in image understanding and text generation tasks, enabling the generation of consistent and semantically relevant text descriptions from input images and text queries. However, visual language models suffer from "hallucination" in the inference stage, that is, they generate information that is inconsistent with the image content or fabricated out of thin air, which limits their application in high-reliability scenarios such as autonomous driving, medical image analysis, and fact checking. How to effectively prevent the hallucination that commonly arises in the inference stage, and thereby improve the accuracy and consistency of generated text descriptions, has become a technical problem to be solved in this field.

Disclosure of Invention

The embodiments of the invention provide a method, device, equipment, and storage medium for calibrating the attention of a visual language model, which address the hallucination problem that visual language models commonly exhibit in the inference stage.
In a first aspect, there is provided a method of attention calibration for a visual language model, the method comprising: determining one or more blind tokens from all image tokens of an input image, wherein the blind tokens are image tokens with low information content relative to the text query in the input image; constructing a biased visual input, wherein the biased visual input comprises all image tokens of the input image, but the feature vectors of image tokens not determined to be blind tokens are set to zero, while the feature vectors of image tokens determined to be blind tokens are kept unchanged; calculating biased log probabilities of candidate text tokens based on the biased visual input; calculating original log probabilities of the candidate text tokens based on an original visual input, wherein the original visual input comprises all image tokens of the input image with their feature vectors kept at their original values; and obtaining a calibrated probability distribution based on the biased and original log probabilities of the candidate text tokens, and sampling from the calibrated probability distribution to obtain the final text token.
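The contrastive calibration of the next-token distribution described above (and made explicit in claim 6) can be sketched as follows. This is a minimal sketch assuming the two forward passes have already produced log-probability vectors over the candidate vocabulary; the function name and the default α are illustrative, with α drawn from the 2.5-3.0 range suggested in claim 7.

```python
import numpy as np

def calibrate(original_logprobs, biased_logprobs, alpha=2.75):
    """Claim 6: calibrated = (1 + alpha) * original - alpha * biased,
    followed by a softmax to recover a probability distribution.
    alpha > 0 controls the contrast strength."""
    logits = (1.0 + alpha) * original_logprobs - alpha * biased_logprobs
    z = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Tokens whose probability is supported by the informative visual evidence (high original, low biased log probability) are amplified, while tokens the model would predict even from the near-empty biased input are suppressed.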
In a second aspect, there is provided an attention calibration device for a visual language model, the device comprising: a determining module for determining one or more blind tokens from all image tokens of an input image, wherein the blind tokens are image tokens with low information content relative to the text query in the input image; a construction module for constructing a biased visual input, wherein the biased visual input comprises all image tokens of the input image, but the feature vectors of image tokens not determined to be blind tokens are set to zero, while the feature vectors of image tokens determined to be blind tokens are kept unchanged; a first calculation module for calculating biased log probabilities of candidate text tokens based on the biased visual input; a second calculation module for calculating original log probabilities of the candidate text tokens based on an original visual input, wherein the original visual input comprises all image tokens of the input image with their feature vectors kept at their original values; and a sampling module for obtaining a calibrated probability distribution based on the biased and original log probabilities of the candidate text tokens, and sampling from the calibrated probability distribution to obtain the final text token.

In a third aspect, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the attention calibration method for a visual language model as described above when executing the computer program.

In a fourth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program which, when executed by a processor, implements the attention calibration method for a visual language model as described above.
The technical scheme has the advantage that the biased log probability of the candidate text token is obtained through the biased visual input, in which only the low-information blind tokens retain their features; contrasting it with the original log probability and sampling from the resulting calibrated distribution reduces the likelihood that the visual language model generates text inconsistent with the image content, alleviates the hallucination problem in visual language models, and improves the accuracy and reliability of the generated text description.
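The biased visual input underpinning this advantage can be sketched as a simple masking step. This is an illustrative assumption about the tensor layout (one feature vector per image token); the function name and shapes are not taken from the patent text.

```python
import numpy as np

def build_biased_input(image_tokens, blind_idx):
    """Construct the biased visual input of claim 1: keep the feature
    vectors of the blind tokens, zero out the feature vectors of every
    other image token. image_tokens has shape [num_tokens, feat_dim]."""
    biased = np.zeros_like(image_tokens)
    biased[blind_idx] = image_tokens[blind_idx]
    return biased
```

Because the biased input preserves only uninformative tokens, the model's predictions on it expose the language prior and attention-sink bias that the calibration step then subtracts out.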