CN-121982494-A - Entropy-triggered visual attention backtracking method for reducing hallucinations in large language models
Abstract
The invention relates to the field of computer technology and provides a hallucination reduction method for large language models based on entropy-triggered visual attention backtracking. The method addresses two problems in multimodal large language models: object hallucination caused by insufficient visual attention in the modality fusion layers, and modality collapse in the posterior layers caused by mismatched feature distributions. The method receives image and text input, extracts visual and text token sequences, and concatenates them; propagates the input sequence forward through a multi-layer Transformer network that is logically divided into shallow layers, modality fusion layers, and posterior layers, which perform low-level feature extraction, cross-modal semantic fusion, and text generation respectively; computes an input entropy value in real time in the modality fusion layers and, when the entropy exceeds a preset threshold, injects the attention scores of the initial layer into the current layer in a weighted manner to strengthen visual attention; and, in the posterior layers, separates the hidden state into visual and text parts, adjusts the visual feature distribution toward the text distribution through normalization and renormalization, and dynamically fuses the transformed and original features according to a dynamically adjusted proportion.
Inventors
- Lv Penghui
- Gou Shuxiang
- Wan Shaohua
Assignees
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China (电子科技大学(深圳)高等研究院)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-26
Claims (5)
- 1. An entropy-triggered visual attention backtracking hallucination reduction method for large language models, characterized by comprising the following steps: Step S1, receiving multimodal input data comprising image data and a text prompt; extracting image features with a visual encoder and mapping them into the language model's embedding space to obtain a visual token sequence; converting the text prompt into a text token sequence with a tokenizer; and concatenating the visual and text token sequences to form the input sequence. Step S2, forward propagation through the multi-layer Transformer network: feeding the input sequence into a backbone network composed of stacked Transformer layers for forward propagation, the backbone being logically divided into shallow layers, modality fusion layers, and posterior layers, where the shallow layers extract low-level features, the modality fusion layers perform deep semantic fusion of visual and text information, and the posterior layers generate the final text output. Step S3, entropy-triggered visual attention backtracking: when inference reaches a modality fusion layer, acquiring the hidden state input to the current layer, mapping it to a vocabulary probability distribution, and deriving from that distribution an uncertainty metric that characterizes the current layer's confusion about the input; when the model's confusion about the input exceeds a preset threshold $\tau$, injecting the attention score matrix of the model's initial layer into the attention matrix of the current layer in a weighted manner, with a backtracking coefficient $\lambda$ controlling the injection intensity. Step S4, moment-matching-based modal protection: in the posterior layers, separating the hidden state into a visual part and a text part; computing the mean and standard deviation of the text features; normalizing the visual features and then renormalizing them with the text features' mean and standard deviation so that the visual distribution approaches the text distribution; and fusing the transformed and original visual features in proportion through dynamic residual fusion, where the fusion coefficient β decays linearly from the first posterior layer to 0 at the output layer, preventing modality collapse in the posterior layers, i.e., preventing the visually enhanced features from being suppressed in the deep network.
- 2. The method according to claim 1, characterized in that step S1 comprises: Step S1.1, the system receives multimodal input data comprising image data $I$ and a text prompt, where the text prompt consists of a system template prompt and a user-supplied prompt. Step S1.2, the image data is processed with a visual encoder to extract visual feature patches, which are mapped into the language model's embedding space through a linear projection layer to obtain the visual token sequence $V = (v_1, \dots, v_m)$, where $v_i$ denotes the $i$-th visual token and $m$ denotes the total number of visual tokens. Step S1.3, the text prompt is processed with a tokenizer and converted into the text token sequence $T = (t_1, \dots, t_n)$, where $t_j$ denotes the $j$-th text token and $n$ denotes the total number of text tokens. Step S1.4, the visual token sequence and the text token sequence are concatenated to form the model's input sequence $X = [V; T]$. (A code sketch of this step appears after the claims.)
- 3. The method according to claim 1, wherein step S2, forward propagation through the multi-layer Transformer network, comprises: feeding the input sequence into the large language model backbone, which is formed by stacking $L$ Transformer layers, $L$ denoting the total number of layers in the backbone; at each layer $l$, the model computes self-attention and updates the hidden state, where $l$ denotes the index of the layer currently being processed; and logically dividing the backbone into three functional regions: the shallow layers, chiefly responsible for low-level feature extraction and alignment; the modality fusion layers, the key region where visual and text information undergo deep semantic fusion and the main region where visual attention backtracking is applied; and the posterior layers, responsible for converting the fused features into the final text output and serving as the main region where "modal protection" is applied. (A dispatch sketch of this partition appears after the claims.)
- 4. The method according to claim 1, wherein step S3, entropy-triggered visual attention backtracking, specifically comprises: Step S3.1, in the initial stage of model inference, recording the attention score matrix of the initial layer $S^{(0)} = \{S^{(0)}_h\}$, where $S^{(0)}_h$ denotes the attention score matrix of the $h$-th attention head at the initial layer and $h$ indexes the attention heads. Step S3.2, when inference reaches a modality fusion layer, letting the current layer be $l$, acquiring the hidden state input to the current layer, computing the vocabulary probability distribution $p^{(l)}$ of the current step through layer normalization and the classification head, and computing the Shannon entropy $H^{(l)} = -\sum_{i=1}^{N} p^{(l)}_i \log p^{(l)}_i$, where $N$ denotes the vocabulary size and $p^{(l)}_i$ denotes the probability of the $i$-th word at layer $l$; setting a preset entropy threshold $\tau$: if the Shannon entropy of the current layer satisfies $H^{(l)} > \tau$, activating the visual attention backtracking mechanism and injecting the initial-layer attention matrix $S^{(0)}$ into the current layer's attention matrix $S^{(l)}$ in a weighted manner, obtaining the corrected attention score $\tilde{S}^{(l)} = S^{(l)} + \lambda \cdot S^{(0)}$, where $\lambda$ denotes the backtracking coefficient and controls the intensity of the injected shallow visual information. Step S3.3, normalizing the corrected attention score to obtain the final attention weights $A^{(l)} = \mathrm{softmax}(\tilde{S}^{(l)})$. (A code sketch of this step appears after the claims.)
- 5. The method according to claim 1, wherein step S4, moment-matching-based modal protection, specifically comprises: Step S4.1, in a posterior layer $l$, separating the hidden state $H^{(l)}$ into a visual part $V^{(l)}$ and a text part $T^{(l)}$, where $l$ denotes the index of the layer currently being processed; computing the mean $\mu_T$ and standard deviation $\sigma_T$ of the text features as $\mu_T = \frac{1}{n}\sum_{i \in \text{text}} h^{(l)}_i$ and $\sigma_T = \sqrt{\frac{1}{n}\sum_{i \in \text{text}} (h^{(l)}_i - \mu_T)^2 + \epsilon}$, where $n$ denotes the number of text tokens, $h^{(l)}_i$ denotes the hidden state of the $i$-th token at layer $l$, and $\epsilon$ is a small constant preventing division by zero; and computing the mean $\mu_V$ and standard deviation $\sigma_V$ of the visual features $V^{(l)}$ analogously, where $m$ denotes the number of visual tokens and the summation ranges over the visual token indices. Step S4.2, transforming the visual features with adaptive instance normalization: normalizing the visual features to a standard normal distribution and then renormalizing them with the mean and standard deviation of the text features, obtaining the transformed visual features $\hat{V}^{(l)} = \sigma_T \cdot \frac{V^{(l)} - \mu_V}{\sigma_V} + \mu_T$. Step S4.3, fusing the original and transformed visual features in a residual manner with a dynamic decay coefficient β, obtaining the corrected visual features $V'^{(l)} = \beta \cdot \hat{V}^{(l)} + (1 - \beta) \cdot V^{(l)}$, where β is a dynamic adjustment parameter that gradually decreases as the layer index increases and controls the intensity of the distribution correction. (A code sketch of this step appears after the claims.)
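The following sketches are editorial illustrations of the claims, not the applicant's reference implementation. First, a minimal PyTorch sketch of step S1 (claim 2); the module names (`vision_encoder`, `proj`, `embed_tokens`) and all dimensions are placeholders introduced for illustration.

```python
# Minimal sketch of step S1 (claim 2), assuming PyTorch. Module names and
# dimensions are placeholders, not the patent's actual components.
import torch
import torch.nn as nn

class InputBuilder(nn.Module):
    def __init__(self, vision_encoder: nn.Module, embed_tokens: nn.Embedding,
                 vision_dim: int, hidden_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder           # extracts visual feature patches (S1.2)
        self.proj = nn.Linear(vision_dim, hidden_dim)  # linear projection into the LM embedding space
        self.embed_tokens = embed_tokens               # LM token-embedding table

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        v = self.proj(self.vision_encoder(image))  # (B, m, hidden_dim): m visual tokens
        t = self.embed_tokens(text_ids)            # (B, n, hidden_dim): n text tokens (S1.3)
        return torch.cat([v, t], dim=1)            # concatenated input sequence X = [V; T] (S1.4)
```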
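Next, a runnable sketch of the three-region dispatch in step S2 (claim 3). The 32-layer depth and the region boundaries are assumptions (the claim leaves them unspecified; the boundaries here merely echo the layers 9-14 fusion window the description mentions for VAF), and the hooks are placed after each layer for simplicity, whereas the injection of claim 4 actually acts inside the layer's attention computation.

```python
# Sketch of the shallow / fusion / posterior partition in step S2 (claim 3).
# Depth and boundary indices are illustrative assumptions only.
from typing import Callable, List
import torch

L = 32
SHALLOW = range(0, 9)      # low-level feature extraction and alignment
FUSION = range(9, 15)      # cross-modal semantic fusion; backtracking acts here (claim 4)
POSTERIOR = range(15, L)   # text generation; modal protection acts here (claim 5)

def forward_backbone(x: torch.Tensor,
                     layers: List[Callable[[torch.Tensor], torch.Tensor]],
                     fusion_hook: Callable[[int, torch.Tensor], torch.Tensor],
                     posterior_hook: Callable[[int, torch.Tensor], torch.Tensor]) -> torch.Tensor:
    for l, layer in enumerate(layers):
        x = layer(x)                   # self-attention + hidden-state update at layer l
        if l in FUSION:
            x = fusion_hook(l, x)      # entropy-triggered visual attention backtracking
        elif l in POSTERIOR:
            x = posterior_hook(l, x)   # moment-matching modal protection
    return x
```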
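For step S3 (claim 4), a sketch of the entropy trigger and the weighted injection. The logit-lens-style mapping (final layer norm plus LM head applied to an intermediate hidden state), the use of the last token position, and the default values of `tau` and `lam` are assumptions; only the entropy formula, the trigger condition, and the weighted injection of the initial layer's scores come from the claim.

```python
# Sketch of step S3 (claim 4): entropy-triggered attention backtracking.
# `tau` and `lam` defaults are illustrative assumptions, not patent values.
from typing import Callable
import torch
import torch.nn.functional as F

def shannon_entropy(hidden: torch.Tensor,
                    norm: Callable[[torch.Tensor], torch.Tensor],
                    lm_head: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    """Map the current layer's input hidden state to a vocabulary
    distribution p and return H = -sum_i p_i log p_i (S3.2)."""
    logits = lm_head(norm(hidden[:, -1, :]))        # (B, N), N = vocabulary size
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)  # (B,)

def backtracked_attention(scores: torch.Tensor,          # S^(l): (B, heads, T, T)
                          initial_scores: torch.Tensor,  # S^(0), recorded in S3.1
                          entropy: torch.Tensor,         # H^(l): (B,)
                          tau: float = 4.0,              # preset entropy threshold
                          lam: float = 0.3) -> torch.Tensor:  # backtracking coefficient
    """Inject the initial layer's scores where H^(l) > tau, then normalize (S3.3)."""
    triggered = (entropy > tau).view(-1, 1, 1, 1)       # broadcast over heads/positions
    corrected = scores + lam * initial_scores           # weighted injection
    mixed = torch.where(triggered, corrected, scores)
    return F.softmax(mixed, dim=-1)                     # final attention weights A^(l)
```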
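Finally, step S4 (claim 5): moment matching via AdaIN-style renormalization followed by dynamic residual fusion. The claim fixes the transform and states that β decays linearly to 0 at the output layer but does not give β's starting value; the sketch assumes it starts at 1 at the first posterior layer.

```python
# Sketch of step S4 (claim 5): moment-matching modal protection.
# beta's starting value of 1 is an assumption; its linear decay to 0 is per the claim.
import torch

def modal_protection(hidden: torch.Tensor,   # H^(l): (B, tokens, D), visual tokens first
                     n_visual: int,          # m, number of visual tokens
                     layer: int,             # current layer index l
                     post_start: int,        # index of the first posterior layer
                     n_layers: int,          # L, total layer count
                     eps: float = 1e-5) -> torch.Tensor:
    v, t = hidden[:, :n_visual, :], hidden[:, n_visual:, :]        # S4.1: split visual/text
    mu_t = t.mean(dim=1, keepdim=True)
    sd_t = (t.var(dim=1, keepdim=True, unbiased=False) + eps).sqrt()
    mu_v = v.mean(dim=1, keepdim=True)
    sd_v = (v.var(dim=1, keepdim=True, unbiased=False) + eps).sqrt()
    v_hat = sd_t * (v - mu_v) / sd_v + mu_t                        # S4.2: AdaIN renormalization
    # S4.3: beta decays linearly from (assumed) 1 at post_start to 0 at the last layer
    beta = max(0.0, 1.0 - (layer - post_start) / max(1, (n_layers - 1) - post_start))
    v_out = beta * v_hat + (1.0 - beta) * v                        # dynamic residual fusion
    return torch.cat([v_out, t], dim=1)
```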
Description
Technical Field
The invention relates to the field of computer technology and provides a hallucination reduction method for large language models based on entropy-triggered visual attention backtracking.
Background
In recent years, with the continuous evolution of deep learning architectures, Multimodal Large Language Models (MLLMs) have made significant breakthroughs in the field of artificial intelligence. By integrating and jointly processing data from multiple modalities such as vision, audio, and text, MLLMs exhibit excellent cross-modal understanding and reasoning capabilities, and show striking performance in application scenarios such as image caption generation (Image Captioning), complex visual question answering (Visual Question Answering, VQA), and multimodal dialog systems. These models are typically built on large-scale pre-training data and, with the powerful sequence modeling capabilities of the Transformer architecture, achieve a leap from perception to cognition. However, while existing multimodal large language models perform well on many benchmarks, they still face significant technical challenges in practical deployment. Among the most prominent and urgent problems is "object hallucination" (Object Hallucination). Specifically, object hallucination refers to the model's generated text response mentioning objects that are not actually present in the input image or data, or providing descriptions that contradict the input data or cannot be verified against it. This phenomenon essentially reflects a misalignment between the model's visual perception features and its language decoding process. In safety- and accuracy-critical application fields such as autonomous driving, medical image diagnosis, clinical decision support, and security monitoring, object hallucination can severely undermine user trust in a system and may even cause unpredictable safety risks and ethical problems. To alleviate the object hallucination problem, existing research has proposed various solutions, which can be broadly grouped into the following categories. The first is retrieval-augmented generation (Retrieval-Augmented Generation, RAG). This approach introduces an external knowledge base and retrieves related information during generation to assist model decisions, thereby reducing hallucination. However, it depends heavily on the quality and coverage of the external retrieval system, and introducing an external knowledge base significantly increases system complexity and maintenance cost. The second category is supplementary fine-tuning (Supplementary Fine-tuning). This strategy aims to improve the self-consistency of model-generated text through further training on specific datasets. A common disadvantage of both the RAG and fine-tuning strategies is that they require additional high-quality annotated data for post-hoc debiasing (Post-hoc Debiasing). This not only increases the labor cost of data collection and cleaning, but also incurs significant computational overhead (Computational Overhead) and long training cycles, making it difficult to meet the needs of rapidly iterating applications.
The third class comprises training-free (Training-free) inference-stage strategies, mainly attention intervention (Attention Intervention) and contrastive decoding (Contrastive Decoding, CD). Attention intervention typically reallocates attention retrospectively during inference, and contrastive decoding adjusts the generation probabilities by building contrast models or contrast paths. While these two families of methods avoid additional model training, they often require complex computation during decoding, such as retrospective searching or parallel inference over multiple models, which results in very high inference latency (Inference Latency) that is difficult to reconcile with the response-speed requirements of real-time interactive systems. Among the above methods, the direction based on visual attention enhancement (Visual Attention Enhancement) has recently attracted attention because it is comparatively simple and efficient. As the closest prior art in this direction, Visual Amplification Fusion (VAF) was proposed to address the hallucination problem. Specifically, the VAF strategy directly amplifies, by manual intervention, the attention weight the model assigns to visual information in the model's modality fusion layers (Modality Fusion Layer, e.g., layers 9-14 of the model). The core logic is that by enhancing the we