CN-121980265-A - Token pruning method, apparatus, device and medium for a multimodal vision-language model
Abstract
The invention relates to the field of artificial intelligence and can be applied to financial technology and medical technology. It discloses a token pruning method, apparatus, device, and medium for a multimodal vision-language model. The method receives a query image and a question text and converts them into a visual token set and a target token set, respectively. Based on these two sets, it computes a contribution matrix with the objective of minimizing the optimal transport cost from the visual tokens to the target tokens, using a visual perception cost function that fuses feature similarity, spatial relationship, and center distance, thereby globally quantifying each visual token's contribution to the generated target tokens. The visual token set is sorted by contribution value and pruned once against a preset threshold to retain the key tokens, and a portion of the pruned tokens is selectively recovered by uniform sampling to suit fine-grained tasks. The invention achieves accurate and efficient token screening, markedly reduces computation and memory cost, introduces no extra computation steps, and improves inference efficiency.
Inventors
- WANG JIANZONG
- ZHANG XULONG
- SHI JIAQI
Assignees
- Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-19
Claims (10)
- 1. A token pruning method for a multimodal vision-language model, the method comprising: receiving a query image and a question text input by a user, converting the query image into a visual token set through a visual encoder, and processing the question text through a language model to obtain a target token set; computing a contribution matrix from the visual token set and the target token set through a preset visual perception cost function, with the objective of minimizing the optimal transport cost from the visual tokens to the target tokens, the contribution matrix characterizing the contribution of the visual tokens to the target tokens; computing a contribution value for each visual token from the contribution matrix; and pruning the visual token set according to the contribution values of the visual tokens and a preset contribution threshold to obtain a target visual token set.
- 2. The method of claim 1, wherein converting the query image into a visual token set through a visual encoder comprises: receiving the query image input by the user, preprocessing it to remove noise interference, and resizing it to the input size expected by the visual encoder; inputting the preprocessed query image into a preset visual encoder and extracting image features through its convolution and pooling layers to obtain a global visual feature map of the image; partitioning the global visual feature map into a plurality of equally sized local feature blocks, each corresponding to a local region of the image; performing dimension mapping and feature encoding on each local feature block through a fully connected layer or an attention layer of the visual encoder, converting each block into a feature vector of a preset dimension; and integrating all the feature vectors to form a visual token set containing a plurality of preset-dimension feature vectors (a toy encoder sketch follows the claims).
- 3. The method of claim 1, wherein processing the question text through a language model to obtain a target token set comprises: receiving the question text input by the user, preprocessing it to remove invalid characters, and standardizing the text format to obtain a uniformly formatted question text; inputting the standardized question text into a preset language model, whose text-splitting module splits it into a plurality of consecutive text segments according to a preset splitting rule; semantically encoding each text segment through the embedding layer of the language model, converting each segment into a semantic feature vector of a preset dimension; performing a validity check on all semantic feature vectors, removing vectors with missing semantic information or noise interference, and retaining those that meet a preset semantic-integrity requirement; and integrating all retained semantic feature vectors to form a target token set containing a plurality of preset-dimension semantic feature vectors.
- 4. The method according to claim 1, wherein computing the contribution matrix from the visual token set and the target token set through the preset visual perception cost function, with the objective of minimizing the optimal transport cost from the visual tokens to the target tokens, comprises: computing the transport cost from each visual token to each target token through the preset visual perception cost function based on the visual token set and the target token set, forming a transport cost matrix; and solving the transport cost matrix with the objective of minimizing the total transport cost to obtain the contribution matrix (a solver sketch follows the claims).
- 5. The method of claim 4, wherein computing the transport cost from each visual token to each target token through the preset visual perception cost function, forming the transport cost matrix, comprises: extracting the feature vector of each visual token from the visual token set and the feature vector of each target token from the target token set, and computing the pairwise similarity between the feature vectors to obtain a feature-similarity parameter; acquiring the spatial position of each visual token in the query image and, based on these positions, computing the relative spatial distances of the visual tokens associated with the corresponding target tokens to obtain a spatial-relationship parameter; determining the center coordinates of the query image and computing the absolute distance from the center of each visual token to the image center to obtain a center-distance parameter; and inputting the feature-similarity, spatial-relationship, and center-distance parameters into the preset visual perception cost function to obtain, by function evaluation, the transport cost from each visual token to each target token (a cost-function sketch follows the claims).
- 6. The method of claim 1, wherein computing a contribution value for each visual token from the contribution matrix comprises: acquiring the contribution matrix characterizing the contribution of each visual token to the target tokens, and determining, for each row of the matrix (corresponding to a single visual token), the contribution coefficients to all target tokens; aggregating all contribution coefficients of each row, the aggregation being any one of summation, weighted summation, or taking the maximum, to obtain an initial contribution value for the visual token of that row; and normalizing the initial contribution values of all visual tokens, mapping them into a preset numerical interval to obtain the final contribution value of each visual token.
- 7. The method of claim 1, wherein pruning the visual token set according to the contribution values of the visual tokens and the preset contribution threshold to obtain the target visual token set comprises: comparing the contribution value of each visual token with the preset contribution threshold, selecting visual tokens whose contribution values are not lower than the threshold as tokens to be retained and visual tokens whose contribution values are lower than the threshold as tokens to be pruned; retaining all tokens to be retained to form an initially pruned visual token set; uniformly sampling the tokens to be pruned and selecting, from the sampling result, those associated with key image regions or containing image detail features as tokens to be recovered; and supplementing the tokens to be recovered into the initially pruned visual token set to obtain the final target visual token set (a pruning-and-recovery sketch follows the claims).
- 8. A token pruning apparatus for a multimodal vision-language model, comprising means for performing the method of any one of claims 1 to 7.
- 9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
- 10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
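The sketches below are illustrative only and are not the patent's implementation. This first one is a toy stand-in for the visual encoder of claim 2: it splits an image into equal patches ("local feature blocks") and projects each to a preset dimension with a fixed random matrix, standing in for the convolution, pooling, and projection layers the claim names. The function name, the random projection, and all parameters are assumptions; the patch-center positions it returns are the kind of spatial information claim 5's cost terms consume.

```python
import numpy as np

def image_to_visual_tokens(img, patch=16, dim=64, seed=0):
    """Toy stand-in for claim 2's visual encoder: split an H x W x C image
    into equal patches, flatten each, and project it to a preset dimension
    with a fixed random matrix (a real encoder learns this mapping)."""
    h, w, c = img.shape
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(patch * patch * c, dim)) / np.sqrt(patch * patch * c)
    tokens, positions = [], []
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            block = img[y:y + patch, x:x + patch, :].reshape(-1)
            tokens.append(block @ proj)
            positions.append((x + patch / 2.0, y + patch / 2.0))  # patch center
    return np.asarray(tokens), np.asarray(positions)

# toy usage: a 64x64 RGB image yields a 4x4 grid of 16 visual tokens
img = np.random.default_rng(1).random((64, 64, 3))
vis_tokens, vis_pos = image_to_visual_tokens(img)
print(vis_tokens.shape, vis_pos.shape)  # (16, 64) (16, 2)
```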
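For the visual perception cost function of claim 5, the sketch below fuses the three named terms into one cost matrix. The weights alpha/beta/gamma are invented, and the spatial-relationship term follows one plausible reading of the claim (distance of each patch to a similarity-weighted centroid of patches per target token); the patent does not specify the exact functional form of any term.

```python
import numpy as np

def perception_cost(vis_feats, txt_feats, vis_pos, img_center,
                    alpha=1.0, beta=0.5, gamma=0.25):
    """Hypothetical visual-perception cost C[i, j] fusing claim 5's three
    terms; alpha/beta/gamma and the exact terms are illustrative choices."""
    # 1) feature-similarity term: 1 - cosine similarity (low cost = similar)
    v = vis_feats / (np.linalg.norm(vis_feats, axis=1, keepdims=True) + 1e-8)
    t = txt_feats / (np.linalg.norm(txt_feats, axis=1, keepdims=True) + 1e-8)
    sim = v @ t.T                                        # (N, M)
    feat_cost = 1.0 - sim
    # 2) spatial-relationship term (one reading of the claim): distance of
    #    each patch to a similarity-weighted centroid of patches per target
    w = np.exp(sim)
    w = w / w.sum(axis=0, keepdims=True)                 # column-wise softmax
    centroids = w.T @ vis_pos                            # (M, 2)
    spat_cost = np.linalg.norm(vis_pos[:, None, :] - centroids[None, :, :], axis=-1)
    spat_cost = spat_cost / (spat_cost.max() + 1e-8)
    # 3) center-distance term: patch center to image center, broadcast over j
    center_cost = np.linalg.norm(vis_pos - img_center, axis=1)
    center_cost = (center_cost / (center_cost.max() + 1e-8))[:, None]
    return alpha * feat_cost + beta * spat_cost + gamma * center_cost

# toy usage: 16 visual tokens (64-d) vs 4 target tokens in a 64x64 image
rng = np.random.default_rng(3)
C = perception_cost(rng.normal(size=(16, 64)), rng.normal(size=(4, 64)),
                    rng.random((16, 2)) * 64, np.array([32.0, 32.0]))
print(C.shape)  # (16, 4)
```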
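Claim 4 only requires solving the transport cost matrix so as to minimize total transport cost; it does not name a solver. A minimal sketch using entropy-regularized Sinkhorn iterations with uniform marginals, one standard choice, followed by claim 6's row aggregation and min-max normalization:

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.05, n_iter=200):
    """Entropy-regularized optimal transport with uniform marginals.
    Sinkhorn is one standard solver, used here purely for illustration;
    the resulting transport plan plays the role of the contribution matrix."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u + 1e-300)
        u = a / (K @ v + 1e-300)
    return u[:, None] * K * v[None, :]       # (N, M) transport plan

def contribution_values(plan, mode="sum"):
    """Claim 6: aggregate each row (one visual token) and min-max normalize."""
    if mode == "sum":
        raw = plan.sum(axis=1)
    elif mode == "max":
        raw = plan.max(axis=1)
    else:                                    # weighted sum; uniform weights here
        raw = plan @ np.full(plan.shape[1], 1.0 / plan.shape[1])
    return (raw - raw.min()) / (raw.max() - raw.min() + 1e-8)

# toy usage: contributions of 16 visual tokens to 4 target tokens
rng = np.random.default_rng(4)
plan = sinkhorn_plan(rng.random((16, 4)))
print(contribution_values(plan).round(2))
```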
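For claim 7, a sketch of one-pass thresholding plus uniform-sampling recovery. "Uniform sampling" is read here as evenly spaced indices over the pruned tokens; the claim's further filter for key-region or detail tokens is task-specific and only noted in a comment. The threshold and recovery ratio are invented values.

```python
import numpy as np

def prune_with_recovery(vis_tokens, scores, threshold=0.3, recover_ratio=0.25):
    """Claim 7 as a sketch: threshold once, then recover an evenly spaced
    ('uniform') subset of the pruned tokens. The claim additionally keeps
    only recovered tokens tied to key regions or fine details; that filter
    is application-specific and omitted here."""
    keep = np.flatnonzero(scores >= threshold)          # tokens to retain
    drop = np.flatnonzero(scores < threshold)           # tokens to prune
    recovered = np.array([], dtype=int)
    if drop.size:
        k = max(1, int(round(recover_ratio * drop.size)))
        idx = np.linspace(0, drop.size - 1, k).astype(int)  # uniform sampling
        recovered = drop[np.unique(idx)]
    final = np.sort(np.concatenate([keep, recovered]))
    return vis_tokens[final], final

# toy usage with random contribution scores for 16 tokens
rng = np.random.default_rng(2)
tokens, scores = rng.normal(size=(16, 64)), rng.random(16)
kept, idx = prune_with_recovery(tokens, scores)
print(idx)  # indices of retained plus recovered tokens
```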
Description
Token pruning method, apparatus, device and medium for a multimodal vision-language model
Technical Field
The invention relates to the field of artificial intelligence and can be applied to financial technology and medical technology; in particular, it relates to a token pruning method, apparatus, device, and medium for a multimodal vision-language model.
Background
Vision-language models (VLMs) are remarkably effective at multimodal tasks, but they place extremely high demands on computing resources. The problem is most pronounced in the visual-processing stage and is especially critical in fields with strict precision and efficiency requirements, such as medical image analysis and insurance-claim image auditing. To understand and analyze image content, existing VLMs (typical models such as LLaVA and InternVL) generally convert an input image (for example, a medical X-ray or a photo of an insurance accident scene) into a large number of visual tokens, often far more than the number of text tokens the model processes. This imbalance directly inflates the computation required during inference and significantly increases memory usage, which raises the hardware cost of model deployment and limits the model's use in compute-constrained scenarios such as medical diagnosis or insurance risk control. To relieve this computation and memory pressure, the prior art generally adopts a visual-token pruning strategy: a portion of the visual tokens is screened out to reduce the subsequent computational burden. The core basis of such pruning strategies is mostly a greedy heuristic, the most common of which judges the importance of visual tokens by attention score, preferentially retaining tokens with higher scores and pruning those with lower scores.
However, token pruning based on a greedy heuristic has two notable defects. On the one hand, an attention score only reflects a token's weight in local attention interactions; it cannot comprehensively and accurately capture each visual token's actual contribution to the model's overall task (such as lesion identification or extraction of information from insurance receipts). Key tokens are therefore easily pruned by mistake, or redundant tokens retained, degrading the model's performance on vision-language tasks. On the other hand, to implement greedy-heuristic pruning, conventional methods generally add extra computation steps outside the model's normal inference process in order to collect and sort the attention scores of visual tokens and execute the pruning. These extra steps not only further increase the computational cost but are also difficult to fit into the model's existing optimization mechanisms, so overall inference efficiency is not effectively improved; performance is particularly poor in latency-sensitive scenarios such as real-time medical diagnostic assistance or automated insurance processing. In summary, existing vision-language models face sharply rising computation cost and memory usage caused by an excessive number of visual tokens, and the currently adopted greedy-heuristic token pruning methods suffer from inaccurate importance assessment, extra computation steps, and limited improvement in inference efficiency, so a better token-processing scheme is needed.
Disclosure of Invention
The invention provides a token pruning method, apparatus, device, and medium for a multimodal vision-language model, aiming to solve the problems that visual-token pruning in existing vision-language models is inaccurate and requires extra computation steps. In a first aspect, a token pruning method for a multimodal vision-language model is provided, the method comprising: receiving a query image and a question text input by a user, converting the query image into a visual token set through a visual encoder, and processing the question text through a language model to obtain a target token set; and computing a contribution matrix with the objective of minimizing the optimal transport cost from the visual tokens to the target tokens, through a preset visual perception cost function, according to the visual token set and the target token set.
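As a rough, non-authoritative illustration of where the first aspect's pruning sits in the overall pipeline (between visual encoding and the language model), the following self-contained sketch uses a stand-in pruner; the names, shapes, and concatenation-style fusion of visual and text tokens are assumptions rather than the patent's specification.

```python
import numpy as np

def forward_with_pruning(vis_tokens, txt_tokens, pruner):
    """Where the first aspect's pruning sits: between visual encoding and
    the language model. 'pruner' is any callable implementing claims 4-7."""
    kept, kept_idx = pruner(vis_tokens, txt_tokens)
    fused = np.concatenate([kept, txt_tokens], axis=0)  # sequence fed to the LM
    return fused, kept_idx

# stand-in pruner: keep the top half of visual tokens by mean absolute
# similarity to the text tokens (NOT the patent's OT criterion; toy only)
def toy_pruner(vis, txt):
    sim = np.abs(vis @ txt.T).mean(axis=1)
    keep = np.sort(np.argsort(sim)[sim.size // 2:])
    return vis[keep], keep

vis = np.random.default_rng(0).normal(size=(16, 8))
txt = np.random.default_rng(1).normal(size=(4, 8))
fused, kept = forward_with_pruning(vis, txt, toy_pruner)
print(fused.shape)  # (12, 8): 8 kept visual tokens + 4 text tokens
```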