CN-122023946-A - Method and apparatus for processing image data based on Mamba model

CN122023946A

Abstract

Methods and apparatus for processing image data based on Mamba models are provided. The method includes receiving a data sequence corresponding to image data, processing the data sequence using a Mamba model to obtain a data sequence processing result, and outputting a classification result corresponding to the image data based on the data sequence processing result. The Mamba model includes a Mamba backbone block that processes the data sequence based on a selective state space model, and a plurality of computation blocks located after the Mamba backbone block. Each of the plurality of computation blocks includes a global scoring unit that performs a global evaluation to obtain importance scores, a context holding unit that generates relative position information of the tokens, and a slow-fast scanning unit that performs a first selective state space update and a second selective state space update. The Mamba model can therefore significantly reduce the number of floating-point operations (FLOPs) in the inference phase without substantially sacrificing accuracy.

Inventors

  • WANG PEISONG
  • QIAN JIAHE
  • HU QINGHAO
  • CHENG JIAN

Assignees

  • Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)

Dates

Publication Date
2026-05-12
Application Date
2026-04-09

Claims (11)

  1. A method of processing image data based on a Mamba model, the method comprising: receiving a data sequence corresponding to the image data, the data sequence including a class label and a plurality of tokens; processing the data sequence using the Mamba model to obtain a data sequence processing result; and outputting a classification result corresponding to the image data based on the data sequence processing result; wherein the Mamba model includes: a Mamba backbone block configured to process the data sequence based on a selective state space model and output the processed data sequence, and a plurality of computation blocks connected in series with each other and located after the Mamba backbone block; wherein each of the plurality of computation blocks comprises: a global scoring unit configured to receive the processed data sequence or an updated data sequence as an input sequence, perform a global evaluation on the input sequence to obtain an importance score free of position bias over the full sequence, and select important tokens from the plurality of tokens based on the importance score; a context holding unit configured to generate relative position information of the tokens in the input sequence; and a slow-fast scanning unit configured to perform a first selective state space update on all tokens in the input sequence to obtain a first update result, perform a second selective state space update on the important tokens in the input sequence to obtain a second update result, fuse the second update result into the first update result based on the relative position information to obtain an updated data sequence, and output the updated data sequence.
  2. The method of claim 1, wherein the global scoring unit performs the global evaluation using global information interaction attention, the slow-fast scanning unit performs the first selective state space update and the second selective state space update using dynamic gaze-browsing scanning, and the context holding unit generates the relative position information using context position embedding.
  3. The method of claim 1, wherein the global scoring unit is configured to: strip the class label from the input sequence; and perform a lightweight global aggregate evaluation, in which the query dimension is reduced and the key-value dimension is reduced and then restored, on all tokens of the input sequence with the class label stripped, based on the class label, to obtain a position-bias-free importance score for each token.
  4. The method of claim 1, wherein: when a computation block of the plurality of computation blocks is the computation block immediately following the Mamba backbone block, the global scoring unit of the computation block is configured to receive the processed data sequence from the Mamba backbone block as the input sequence of the computation block; when the computation block is not the computation block immediately following the Mamba backbone block, the global scoring unit of the computation block is configured to receive the updated data sequence from the preceding computation block as the input sequence of the computation block; when the computation block is the last of the plurality of computation blocks, the slow-fast scanning unit of the computation block is configured to output the updated data sequence as the data sequence processing result; and when the computation block is not the last of the plurality of computation blocks, the slow-fast scanning unit of the computation block is configured to output the updated data sequence to the next computation block of the plurality of computation blocks.
  5. The method of claim 1, wherein the important tokens are propagated between adjacent computation blocks in an exclusion-consistent manner.
  6. The method of claim 1, wherein the slow-fast scanning unit is configured to: perform the first selective state space update in a first dimension on all tokens in the input sequence to obtain the first update result; perform the second selective state space update in a second dimension on the important tokens in the input sequence to obtain the second update result; fuse the second update result into the first update result based on the relative position information to obtain the updated data sequence; and output the updated data sequence; wherein the first dimension is smaller than the second dimension.
  7. The method of claim 6, wherein the ratio of the first dimension to the second dimension is 1/8.
  8. The method of claim 1, wherein the context holding unit is configured to generate an embedding matrix encoding the relative position information and provide the embedding matrix to the slow-fast scanning unit.
  9. An apparatus for processing image data based on a Mamba model, the apparatus comprising: a receiving module configured to receive a data sequence corresponding to image data, the data sequence including a class label and a plurality of tokens; a processing module configured to process the data sequence using the Mamba model to obtain a data sequence processing result; and an output module configured to output a classification result corresponding to the image data based on the data sequence processing result; wherein the Mamba model includes: a Mamba backbone block configured to process the data sequence based on a selective state space model and output the processed data sequence, and a plurality of computation blocks connected in series with each other and located after the Mamba backbone block; wherein each of the plurality of computation blocks comprises: a global scoring unit configured to receive the processed data sequence or an updated data sequence as an input sequence, perform a global evaluation on the input sequence to obtain an importance score free of position bias over the full sequence, and select important tokens from the plurality of tokens based on the importance score; a context holding unit configured to generate relative position information of the tokens in the input sequence; and a slow-fast scanning unit configured to perform a first selective state space update on all tokens in the input sequence to obtain a first update result, perform a second selective state space update on the important tokens in the input sequence to obtain a second update result, fuse the second update result into the first update result based on the relative position information to obtain an updated data sequence, and output the updated data sequence.
  10. A computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the method of processing image data based on a Mamba model according to any one of claims 1 to 8.
  11. A computer program product comprising computer instructions which, when executed by at least one processor, implement the method of processing image data based on a Mamba model according to any one of claims 1 to 8.
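The computation block recited in the claims above (global scoring, important-token selection, a fast scan over all tokens in a small state dimension, a slow scan over important tokens in a larger state dimension, and position-based fusion) can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the selective state space update is simplified here to a gated linear recurrence, the global scoring is simplified to a class-token dot product, and all names (`selective_scan`, `computation_block`, `fast_dim`, `slow_dim`) are hypothetical.

```python
import numpy as np

def selective_scan(x, state_dim, seed=0):
    """Stand-in for a selective state space update: an input-gated
    linear recurrence h_t = sigmoid(x_t . w) * h_{t-1} + x_t @ W_in,
    read out as y_t = h_t @ W_out. Randomly initialized weights."""
    rng = np.random.default_rng(seed)
    T, D = x.shape
    W_in = rng.standard_normal((D, state_dim)) / np.sqrt(D)
    W_out = rng.standard_normal((state_dim, D)) / np.sqrt(state_dim)
    w_gate = rng.standard_normal(D) / np.sqrt(D)
    h = np.zeros(state_dim)
    ys = np.empty_like(x)
    for t in range(T):
        gate = 1.0 / (1.0 + np.exp(-(x[t] @ w_gate)))  # input-dependent gating
        h = gate * h + x[t] @ W_in                      # selective state update
        ys[t] = h @ W_out
    return ys

def computation_block(tokens, cls, k=4, fast_dim=2, slow_dim=16):
    """One computation block: score globally, fast-scan everything,
    slow-scan the top-k important tokens, fuse by original position.
    fast_dim : slow_dim = 1 : 8, echoing the ratio in claim 7."""
    # Global scoring: class-token dot product, no position bias
    scores = tokens @ cls
    # Top-k important tokens; np.sort keeps their original (relative) order
    important = np.sort(np.argsort(scores)[-k:])
    # Fast path: first update on ALL tokens, small state dimension
    fast = selective_scan(tokens, fast_dim)
    # Slow path: second update on important tokens only, larger dimension
    slow = selective_scan(tokens[important], slow_dim, seed=1)
    # Fusion: write the slow-path results back at their original positions
    out = fast.copy()
    out[important] += slow
    return out, important
```

Under this sketch, the fast scan touches every token cheaply while the expensive high-dimensional scan runs on only `k` tokens, which is the mechanism by which inference FLOPs drop without discarding the remaining tokens entirely.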

Description

Method and apparatus for processing image data based on Mamba model

Technical Field

The present disclosure relates to deep learning, and more particularly, to a method and apparatus for processing image data based on Mamba models.

Background

In recent years, Mamba models, built on the idea of sequential state space modeling, have exhibited excellent efficiency and accuracy in a variety of scenarios such as vision and language. Mamba replaces the quadratic complexity of attention operations with linear-time selective scanning, which gives it natural advantages in long-sequence modeling, streaming, and edge-side deployment compared with architectures that rely on traditional global attention. However, as task scale and data complexity grow, Mamba still faces non-negligible parameter counts and memory footprints in practical engineering: on the one hand, the convolution kernels and selection gating involved in the state update produce large intermediate activations; on the other hand, the hidden state passed across layers is highly sensitive to its numerical range, and even slightly improper quantization or pruning can cause state drift, destroying long-range dependencies and manifesting as inference instability and accuracy drops. In addition, existing research on model compression focuses on quantization and pruning of dense structures such as Transformers, for example stabilizing quantization scales through equivalent transformations of layer normalization, or allocating bit widths layer by layer with a mixed-precision strategy to balance accuracy and computational overhead. Such schemes achieve considerable resource savings on generic language models, but their assumptions are often built on the premise of layer-independent errors, assuming by default that pruning or quantization errors are not amplified along the temporal dimension.
For Mamba, with its explicit state recursion and selective gating, this assumption does not fully hold: quantization noise accumulates along the scan direction and changes the gating distribution, and the structural sparsity introduced by pruning may disrupt relative-order information and hidden-state alignment, thereby degrading the model's ability to reconstruct global context. Naively applying the compression paradigm of dense architectures easily leads to inference that appears normal early on but degrades markedly as the sequence grows longer. Therefore, designing a compression and inference mechanism that is tailored to the structural characteristics of Mamba and significantly reduces the number of floating-point operations (FLOPs) in the inference stage without substantially sacrificing accuracy has become key to running the model stably on lightweight devices.

Disclosure of Invention

It is an object of the present disclosure to provide a method and apparatus for processing image data based on a Mamba model that can significantly reduce the number of floating-point operations (FLOPs) in the inference phase without substantially sacrificing accuracy. According to an aspect of embodiments of the present disclosure, a method of processing image data based on a Mamba model is provided.
The method includes receiving a data sequence corresponding to image data, the data sequence including a class label and a plurality of tokens; processing the data sequence using a Mamba model to obtain a data sequence processing result; and outputting a classification result corresponding to the image data based on the data sequence processing result; wherein the Mamba model includes a Mamba backbone block configured to process the data sequence based on a selective state space model and output the processed data sequence, and a plurality of computation blocks connected in series with each other and located after the Mamba backbone block, wherein each of the plurality of computation blocks includes a global scoring unit configured to receive the processed data sequence or an updated data sequence as an input sequence, perform a global evaluation on the input sequence to obtain an importance score free of position bias over the full sequence, and select important tokens from the plurality of tokens based on the importance score; a context holding unit configured to generate relative position information of tokens in the input sequence; and a slow-fast scanning unit configured to perform a first selective state space update on all tokens in the input sequence to obtain a first update result, perform a second selective state space update on the important tokens in the input sequence to obtain a second update result, fuse the second update result into the first update result based on the relative position information to obtain an updated data sequence, and output the updated data sequence. Optionally, the global scoring unit performs the global evaluation using global information interaction attention, the slow-fast scanning u