
CN-122021866-A - Multi-modal large language model generation method and system based on visual information backtracking

CN122021866A

Abstract

The invention discloses a method and a system for generation with a multi-modal large language model based on visual information backtracking, and belongs to the technical fields of artificial intelligence and multi-modal information processing. The method comprises: step 1, obtaining an image token sequence and a text token sequence; step 2, obtaining the model's true attention to the input image and screening an ROI set using an overlap penalty; step 3, generating the token prediction distribution of the sentence sequence and obtaining the layer-wise normalized entropy; step 4, performing feed-forward network residual injection when the layer-wise normalized entropy exceeds the entropy threshold and injection of the screened ROI set is triggered for the first time; step 5, injecting the ROI set into a feed-forward layer of the multi-modal large language model and updating the visual information in the model, thereby obtaining attention-enhanced visual features.

Inventors

  • LI KAN
  • YIN HANG

Assignees

  • Beijing Institute of Technology (北京理工大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-01

Claims (7)

  1. A multi-modal large language model generation method based on visual information backtracking, characterized by comprising the following steps: step 1, mapping an image into a dimension-aligned image token sequence using a visual encoder and a connector, and simultaneously obtaining a mapping table from image tokens to the visual grid; step 2, multiplying the attention-head average score of the language model obtained from the generated sentence with the attention-head average score of the connector by matrix multiplication to form the attention distribution over the visual grid, and taking the element-wise ratio with the visual grid of a generic description instruction to obtain the model's true attention to the input image; step 3, at the t-th position of the generated sentence sequence, reading the next-token prediction distribution of one or more intermediate layers of the language model, and obtaining the layer-wise normalized entropy shown in formula (8): u^(l) = −(1/log V) Σ_{v=1}^{V} p^(l)(v) log p^(l)(v), with p^(l) = softmax(W_U h_t^(l)), wherein V is the size of the vocabulary, W_U is the shared vocabulary head, h_t^(l) is the hidden state of the l-th layer at step t, u^(l) is the layer-wise normalized entropy, and p^(l) is the next-token distribution of the l-th layer; step 4, when the layer-wise normalized entropy is larger than the entropy threshold and injection of the screened ROI set is triggered for the first time, performing feed-forward network residual injection; step 4.1, setting an entropy threshold γ, and performing ROI set injection when the layer-wise normalized entropy is larger than the entropy threshold and injection of the screened ROI set is triggered for the first time; step 4.2, injecting the ROI set by feed-forward network residual injection, with the injection strength upper-bounded as shown in formula (9): α = min{α_max, max{0, k·(u^(l) − γ)}} (9); step 5, injecting the ROI set into a feed-forward layer of the multi-modal large language model and updating the visual information in the model, thereby obtaining attention-enhanced visual features.
  2. The method for generating with a multi-modal large language model based on visual information backtracking according to claim 1, wherein step 1 is implemented as follows: step 1.1, splicing the images end to end in temporal order and marking the index position of each image in the visual grid to form an image set; step 1.2, mapping the images in the image set into the image token sequence Z shown in formula (1) using the visual encoder, wherein T is the length of the image token sequence and j is the index of an image token; step 1.3, obtaining the mapping table from image tokens to the visual grid using the correspondence between the image token indices and the indices in the visual grid; step 1.4, transforming to the interaction dimension of the language model using the connector to obtain the dimension-aligned image token sequence; step 1.5, encoding the text instruction q to obtain the text token sequence.
  3. The method for generating with a multi-modal large language model based on visual information backtracking according to claim 1, wherein step 2 is implemented as follows: step 2.1, inputting the text token sequence and the dimension-aligned image token sequence into the LLM to obtain a generated sentence; step 2.2, obtaining the attention-head average score of the language model and the attention-head average score of the connector, respectively; step 2.3, obtaining the model's true attention to the input image using the element-wise ratio; step 2.4, screening the ROI set using the overlap penalty.
  4. The method for generating with a multi-modal large language model based on visual information backtracking according to claim 3, wherein step 2.2 is implemented as follows: step 2.2.1, backtracking from the language token at the starting position of the generated sentence to the image tokens through cross-modal attention, and obtaining, per layer and per head, the attention-head average score of the language model shown in formula (2), wherein l is the l-th layer of the model, h denotes the h-th attention head, and H is the total number of attention heads; step 2.2.2, backtracking from the image tokens to the visual grid through the cross-attention of the connector, and obtaining, per layer and per head, the attention-head average score of the connector shown in formula (3), wherein c is the c-th cross-modal attention module and P is the dimension of the image token.
  5. The method for generating with a multi-modal large language model based on visual information backtracking according to claim 3, wherein step 2.3 is implemented as follows: step 2.3.1, performing matrix multiplication of the attention-head average score of the language model and the attention-head average score of the connector to obtain the attention distribution over the visual grid shown in formula (4); step 2.3.2, obtaining the model's true attention to the input image shown in formula (5) by taking the element-wise ratio of the attention distribution over the visual grid to that of the generic description instruction q′, wherein x is the input image.
  6. The method for generating with a multi-modal large language model based on visual information backtracking according to claim 3, wherein step 2.4 is implemented as follows: step 2.4.1, inputting the true-attention scores of the image over the visual grid and obtaining the ROI set using formula (6), wherein IoU_max is the maximum value of the visual-grid scores among the sliding windows, topK_ω(·) selects the K visual grids with the largest scores, and NMS(·) is non-maximum suppression; step 2.4.2, screening the ROI set using the overlap penalty shown in formula (7), wherein b denotes a candidate region and λ is the strength of the overlap penalty.
  7. A multi-modal large language model reasoning system based on visual information backtracking for implementing the method of claim 1, characterized by comprising a visual encoding unit, a connector unit, a language modeling unit, an uncertainty evaluation unit, an evidence localization unit, a secondary gaze unit, and a scheduling unit; the visual encoding unit is used for encoding the visual information of the image, which serves as the input of the connector unit; the connector unit is used for mapping the image token sequence output by the visual encoder to the interaction dimension of the language model; the language modeling unit is used for receiving the dimension-aligned image token sequence and the text token sequence and performing autoregressive generation; the uncertainty evaluation unit is used for computing the normalized entropy of the intermediate layers during generation to assess uncertainty; the evidence localization unit is used for determining a region-of-interest (ROI) set by backtracking the visual information, the ROI set serving as the input of the secondary gaze unit; the secondary gaze unit is used for injecting the ROI set into a feed-forward layer of the language model under the triggering condition; and the scheduling unit is used for comparing the entropy value output by the uncertainty evaluation unit with the threshold to decide whether the secondary gaze operation is triggered, the decision serving as the input of the secondary gaze unit.
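
The uncertainty gate described in claim 1 (formulas (8) and (9)) lends itself to a short sketch. The PyTorch fragment below is a minimal illustration under stated assumptions, not the patent's implementation: `vocab_head` stands in for the shared vocabulary head, and the threshold γ, slope k, and cap α_max values are placeholders, since the record does not give concrete values or the exact injection form.

```python
import torch
import torch.nn.functional as F

def layer_normalized_entropy(hidden_state, vocab_head):
    """Formula (8)-style quantity: project an intermediate-layer hidden state
    through the shared vocabulary head and normalize the entropy of the
    next-token distribution by log V so that u lies in [0, 1]."""
    logits = vocab_head(hidden_state)                      # (V,)
    p = F.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return entropy / torch.log(torch.tensor(float(p.numel())))

def injection_strength(u, gamma=0.6, k=2.0, alpha_max=1.0):
    """Formula (9): alpha = min{alpha_max, max{0, k * (u - gamma)}}.
    Zero below the entropy threshold gamma, capped at alpha_max above it."""
    return min(alpha_max, max(0.0, k * (float(u) - gamma)))

def ffn_residual_injection(ffn_out, roi_features, alpha):
    """Step-5-style residual injection: blend attention-enhanced ROI features
    into the feed-forward layer output with strength alpha (illustrative only)."""
    return ffn_out + alpha * roi_features
```

In this sketch the gate is naturally one-shot: the caller checks `injection_strength(u) > 0` at each generation step and, on the first positive value, performs the injection and disables further triggering, matching the "triggered for the first time" condition of step 4.
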
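As an illustration of the evidence-localization path in claims 2 to 6 (formulas (2) to (7)), the NumPy sketch below shows one plausible way the backtracked grid attention and the overlap-penalized ROI screening could be computed. Function and variable names (`grid_attention`, `true_attention`, `select_rois`, `llm_attn`, `conn_attn`, `lam`) and the soft-penalty form are assumptions for illustration; the exact forms of formulas (2) to (7) are not reproduced in this record.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grid_attention(llm_attn, conn_attn):
    """Formula (4)-style composition: average the language model's cross-modal
    attention over layers and heads, then matrix-multiply with the connector's
    attention so that attention over image tokens is pushed back onto the grid.
    llm_attn : (L, H, T) attention from the answer-start token to T image tokens
    conn_attn: (T, G)    connector cross-attention from T image tokens to G cells
    """
    a_lm = llm_attn.mean(axis=(0, 1))          # (T,) layer/head-averaged score
    return a_lm @ conn_attn                    # (G,) attention over the grid

def true_attention(grid_q, grid_generic, eps=1e-8):
    """Formula (5)-style element-wise ratio: normalize question-conditioned grid
    attention by the attention under a generic description instruction q'."""
    return grid_q / (grid_generic + eps)

def select_rois(scores, boxes, k=5, lam=0.5, iou_thr=0.5):
    """Formulas (6)-(7)-style screening: keep the top-K scored grid cells, then
    suppress candidates by an overlap penalty of strength lam (NMS-like)."""
    order = np.argsort(scores)[::-1][:k]       # top-K candidates by score
    kept = []
    for i in order:
        overlap = max((iou(boxes[i], boxes[j]) for j in kept), default=0.0)
        if overlap < iou_thr and scores[i] - lam * overlap > 0:
            kept.append(i)
    return kept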

Description

Multi-modal large language model generation method and system based on visual information backtracking

Technical Field

The invention relates to a method and a system for generation with a multi-modal large language model based on visual information backtracking, belongs to the technical fields of artificial intelligence and multi-modal information processing, and is applied to alleviating multi-modal hallucination.

Background

Multi-modal Large Language Models (MLLMs) have made significant progress in tasks such as visual question answering, image-text description, and visual reasoning, but when visual evidence is insufficient or the model is disturbed by language priors, the model gives answers inconsistent with the facts in the image. A large number of evaluations show that these errors are especially concentrated in scenes with small targets, fine-grained text, and texture; simply raising the resolution of the whole image or stacking multiple views can help, but brings obvious computation and latency costs and therefore has difficulty meeting the requirements of real-time or large-scale deployment. By composing the cross-modal attention from the answer starting position in the language model to the image tokens with the attention from the image tokens in the connector to the visual grid, and normalizing semantically, a spatial visual grid highly correlated with the question can be obtained. Empirical results show that even when the final answer is wrong, the attention of the model is still focused significantly on the labeled target area, and the model perceives more detail than the evidence localization alone suggests. In the autoregressive generation process, errors and hallucinations often appear at the moment when the uncertainty of the intermediate layers increases; if at that moment the visual evidence related to the question is reused once in the intermediate layers in a lightweight, training-free way, the uncertainty can be reduced significantly and factual consistency improved. Such an approach has the advantage of being minimally invasive to existing models, and the trigger condition can be gated per generation step, thereby focusing the overhead on the region of interest. Taken together, existing schemes still have three types of defects: first, uncontrolled full-image upsampling and fixed-frequency re-query computation; second, re-retrieval based only on global attention easily introduces high-attention noise irrelevant to the question; third, there is no unified generation framework that determines the ROI by backtracking visual information and triggers it by uncertainty gating. Based on the evidence region obtained by visual information backtracking, a one-time secondary gaze is triggered only when the uncertainty exceeds a threshold during generation, so that extra computation is focused on the region of interest and on high-entropy generation steps, alleviating the problem that multi-modal generation produces content inconsistent with the facts of the image. Therefore, how to solve the phenomenon that, owing to the strong prior caused by the text corpus in the pre-training of the model, the multi-modal large language model generates content inconsistent with the facts of the image during image-grounded generation is a problem urgently to be solved.
Disclosure of Invention

The invention aims to solve the technical problem that, in the generation process of a multi-modal large language model, content inconsistent with the facts of the image is generated during image-grounded generation owing to the strong prior caused by the text corpus in the pre-training of the model, and provides a multi-modal large language model generation method and system based on visual information backtracking. The aim of the invention is achieved by the following technical scheme.

The invention discloses a method for generating with a multi-modal large language model based on visual information backtracking, which comprises the following steps: step 1, mapping an image into a dimension-aligned image token sequence using a visual encoder and a connector, and simultaneously obtaining a mapping table from image tokens to the visual grid; step 1.1, splicing the images end to end in temporal order and marking the index position of each image in the visual grid to form an image set; step 1.2, mapping the images in the image set into the image token sequence Z shown in formula (1) using the visual encoder, wherein T is the length of the image token sequence and j is the index of an image token; step 1.3, obtaining the mapping table from image tokens to the visual grid using the correspondence between the image token indices and the indices in the visual grid; step 1.4, transforming to the interaction dimension of the language model using the connector to obtain the dimension-aligned image token sequence; step 1.5, encoding the text instruction q to obtain the text token sequence.
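
To make the token-to-grid mapping table of steps 1.1 to 1.3 concrete, the sketch below shows one plausible bookkeeping scheme under assumptions the record does not state: a square patch grid per image, row-major token ordering, and an illustrative 24x24 grid size. The function name and example values are hypothetical.

```python
def build_token_grid_map(num_images, patches_per_side):
    """Step 1.1-1.3-style bookkeeping: images are concatenated end to end, each
    contributing a patches_per_side x patches_per_side block of visual-grid
    cells; the table maps every image-token index j to its (image, row, col)."""
    mapping = {}
    tokens_per_image = patches_per_side * patches_per_side
    for img in range(num_images):
        for p in range(tokens_per_image):
            j = img * tokens_per_image + p           # flat image-token index
            mapping[j] = (img, p // patches_per_side, p % patches_per_side)
    return mapping

# Example: two images, a 24x24 patch grid each, 1152 image tokens in total.
table = build_token_grid_map(num_images=2, patches_per_side=24)
assert table[0] == (0, 0, 0) and table[24 * 24] == (1, 0, 0)
```

With such a table, attention assigned to an image token in step 2 can be pushed back to a specific cell of the visual grid, which is what the ROI screening operates on.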