CN-121999308-A - Visual token processing method, device, equipment, storage medium and program product
Abstract
The disclosure relates to the technical field of visual processing and discloses a method, apparatus, device, storage medium, and program product for processing visual tokens. The method comprises: obtaining an initial visual token set corresponding to a target image at a current coding layer; dividing the initial visual token set into a first visual token subset and a second visual token subset; extracting multi-dimensional visual features between the first visual token subset and the second visual token subset; and merging the first visual token subset and the second visual token subset based on the multi-dimensional visual features to obtain a target visual token set corresponding to the current coding layer. This technical scheme realizes pruning of visual tokens inside the visual encoder, effectively shortening inference time and improving inference efficiency without significantly reducing the inference performance of the multimodal large language model.
Inventors
- CHEN YUHAO
- SHAN BIN
- YE XIN
- CHEN CHENG
Assignees
- 北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.)
- 抖音视界有限公司 (Douyin Vision Co., Ltd.)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-04
Claims (13)
- 1. A method of processing a visual token, the method comprising: acquiring an initial visual token set corresponding to a target image at a current coding layer; dividing the initial visual token set into a first visual token subset and a second visual token subset; extracting multi-dimensional visual features between the first visual token subset and the second visual token subset; and merging the first visual token subset and the second visual token subset based on the multi-dimensional visual features to obtain a target visual token set corresponding to the current coding layer.
- 2. The method of claim 1, wherein the extracting multi-dimensional visual features between the first visual token subset and the second visual token subset comprises: determining a semantic similarity between each first visual token in the first visual token subset and each second visual token in the second visual token subset; obtaining a plurality of visual token pairs based on the semantic similarity; and determining a target visual density and a target attention weight respectively corresponding to each visual token pair; wherein the multi-dimensional visual features include the semantic similarity, the target visual density, and the target attention weight.
- 3. The method of claim 2, wherein the determining the target visual density corresponding to each visual token pair comprises: acquiring a first feature neighborhood and a second feature neighborhood respectively corresponding to the two visual tokens in the visual token pair; acquiring a first feature distance and a second feature distance respectively corresponding to the two visual tokens in the visual token pair; determining a first visual density based on the first feature distance and the first feature neighborhood; determining a second visual density based on the second feature distance and the second feature neighborhood; and obtaining the target visual density corresponding to the visual token pair based on a fusion result of the first visual density and the second visual density.
- 4. The method according to claim 2 or 3, wherein the determining the target attention weight corresponding to each visual token pair comprises: determining whether either of the two visual tokens in the visual token pair belongs to a target attention set; if either of the two visual tokens belongs to the target attention set, determining that the target attention weight is a first weight; and if neither of the two visual tokens belongs to the target attention set, determining that the target attention weight is a second weight.
- 5. The method according to claim 1 or 2, wherein the merging the first visual token subset and the second visual token subset based on the multi-dimensional visual features to obtain the target visual token set corresponding to the current coding layer comprises: determining a merging degree between each first visual token in the first visual token subset and each second visual token in the second visual token subset based on the multi-dimensional visual features; merging the first visual tokens with the second visual tokens based on the merging degree to obtain a plurality of third visual tokens; and arranging the third visual tokens according to their corresponding visual order to obtain the target visual token set.
- 6. The method of claim 5, wherein each first visual token and each second visual token is merged at most once.
- 7. The method of claim 5, wherein the determining the merging degree between each first visual token in the first visual token subset and each second visual token in the second visual token subset based on the multi-dimensional visual features comprises: multiplying the plurality of visual features included in the multi-dimensional visual features to obtain the merging degree between each first visual token and each second visual token.
- 8. The method of claim 1, wherein the obtaining the initial visual token set corresponding to the target image at the current coding layer comprises: acquiring the current coding layer based on a preset coding layer selection strategy; and inputting the target image into the current coding layer to acquire the initial visual token set corresponding to the target image at the current coding layer.
- 9. The method of claim 1, further comprising: inputting the target visual token set generated by the last coding layer into a visual model to acquire a visual inference result.
- 10. An apparatus for processing a visual token, the apparatus comprising: an acquisition module for acquiring an initial visual token set corresponding to a target image at a current coding layer; a dividing module for dividing the initial visual token set into a first visual token subset and a second visual token subset; an extraction module for extracting multi-dimensional visual features between the first visual token subset and the second visual token subset; and a merging processing module for merging the first visual token subset and the second visual token subset based on the multi-dimensional visual features to obtain a target visual token set corresponding to the current coding layer.
- 11. An electronic device, comprising: a memory and a processor in communication with each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of processing a visual token according to any one of claims 1 to 9.
- 12. A computer-readable storage medium, having stored thereon computer instructions for causing a computer to perform the method of processing a visual token according to any one of claims 1 to 9.
- 13. A computer program product comprising computer instructions for causing a computer to perform the method of processing a visual token according to any one of claims 1 to 9.
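As a rough illustration of claims 1 through 7 (not the patent's own implementation), the per-layer merging can be sketched in Python. The even/odd split into the two subsets, the cosine similarity, the inverse-mean-distance density, the attention weights 0.1/1.0, and the merge ratio are all illustrative assumptions supplied here for concreteness:

```python
import numpy as np

def merge_visual_tokens(tokens, attention, top_attn_k=4, neighborhood_k=3, r=None):
    """Sketch of the claimed per-layer token merging.

    tokens:    (N, D) array of visual token features at one coding layer.
    attention: (N,) array of per-token attention scores (e.g. CLS attention).
    """
    N = tokens.shape[0]
    # Claim 1: split the initial token set into two subsets (here an
    # alternating even/odd split -- one common bipartite choice, assumed).
    idx_a, idx_b = np.arange(0, N, 2), np.arange(1, N, 2)
    A, B = tokens[idx_a], tokens[idx_b]

    # Claim 2: semantic similarity between first and second tokens (cosine).
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = A_n @ B_n.T                                    # (|A|, |B|)
    pair_b = sim.argmax(axis=1)                          # token pairs

    # Claim 3: per-pair visual density fused from the two tokens' feature
    # distances within a small neighborhood (inverse-mean-distance stand-in).
    def density(x, pool, k):
        d = np.linalg.norm(pool - x, axis=1)
        knn = np.sort(d)[1:k + 1]                        # skip self-distance
        return 1.0 / (knn.mean() + 1e-6)
    dens = np.array([0.5 * (density(A[i], tokens, neighborhood_k)
                            + density(B[pair_b[i]], tokens, neighborhood_k))
                     for i in range(len(A))])

    # Claim 4: first weight if either token of the pair is among the most
    # attended tokens (kept from merging), second weight otherwise.
    top_set = set(np.argsort(attention)[-top_attn_k:])
    w = np.where([(idx_a[i] in top_set) or (idx_b[pair_b[i]] in top_set)
                  for i in range(len(A))], 0.1, 1.0)

    # Claim 7: multiply the features to obtain the merging degree.
    degree = sim[np.arange(len(A)), pair_b] * dens * w

    # Claims 5-6: merge the r highest-degree pairs (feature average), each
    # token at most once, then restore the original visual order.
    r = len(A) // 2 if r is None else r
    merged_b, out = set(), {int(i): tokens[i] for i in range(N)}
    for i in np.argsort(degree)[::-1][:r]:
        j = int(pair_b[i])
        if j in merged_b:
            continue                                     # claim 6: once only
        merged_b.add(j)
        out[int(idx_a[i])] = 0.5 * (A[i] + B[j])         # fused third token
        del out[int(idx_b[j])]
    return np.stack([out[i] for i in sorted(out)])       # claim 5: visual order
```

The fused tokens here keep the first token's position so the claim-5 ordering step reduces to a sort of the surviving indices; the patent does not specify this placement, and it is only one reasonable choice.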
Description
Visual token processing method, device, equipment, storage medium and program product

Technical Field

The present disclosure relates to the field of visual processing technology, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for processing a visual token.

Background

In a multimodal large language model (Multimodal Large Language Model, MLLM), visual inputs such as high-resolution images or long-sequence videos generate a huge and continuously growing number of visual tokens, which sharply increases the computation and memory overhead of the visual encoder and makes it the bottleneck of the whole inference process. Although visual token pruning can reduce the token count, existing pruning methods operate only after the visual encoder; they therefore cannot reduce the overhead of the visual encoder itself, and it becomes difficult to balance inference performance and inference efficiency as the input scale grows.

Disclosure of Invention

The present disclosure provides a method, apparatus, device, storage medium, and program product for processing visual tokens, to solve the problem that the number of visual tokens is difficult to reduce inside the visual encoder. In a first aspect, the method for processing visual tokens comprises: obtaining an initial visual token set corresponding to a target image at a current coding layer; dividing the initial visual token set into a first visual token subset and a second visual token subset; extracting multi-dimensional visual features between the first visual token subset and the second visual token subset; and merging the first visual token subset and the second visual token subset based on the multi-dimensional visual features to obtain a target visual token set corresponding to the current coding layer.
In a second aspect, the present disclosure provides an apparatus for processing visual tokens, comprising an acquisition module, a dividing module, an extraction module, and a merging processing module, wherein the acquisition module is used for obtaining an initial visual token set corresponding to a target image at a current coding layer; the dividing module is used for dividing the initial visual token set into a first visual token subset and a second visual token subset; the extraction module is used for extracting multi-dimensional visual features between the first visual token subset and the second visual token subset; and the merging processing module is used for merging the first visual token subset and the second visual token subset based on the multi-dimensional visual features to obtain a target visual token set corresponding to the current coding layer. In a third aspect, the present disclosure provides an electronic device, including a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the method for processing visual tokens of the first aspect or any embodiment thereof. In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method for processing visual tokens of the first aspect or any embodiment thereof. In a fifth aspect, the present disclosure provides a computer program product comprising computer instructions for causing a computer to perform the method for processing visual tokens of the first aspect or any embodiment thereof.
According to the visual token processing method, apparatus, device, storage medium, and program product of the present disclosure, for any coding layer of the visual encoder, an initial visual token set of the target image at the current coding layer is obtained and divided into a first visual token subset and a second visual token subset; the visual tokens are then merged based on the multi-dimensional visual features between the two subsets, so that effective visual tokens are retained and the corresponding target visual token set is obtained. Because the target visual token set preserves effective visual information to the greatest extent, pruning of visual tokens inside the visual encoder is realized, the overhead of the visual encoder itself is reduced, its computational bottleneck is relieved, and high visual fidelity is maintained for downstream multimodal inference tasks. Inference time is therefore effectively shortened, and inference efficiency is improved, without significantly reducing the inference performance of the multimodal large language model.

Drawings

In order to more clearly illus