CN-121982310-A - Two-dimensional hybrid position coding system, method, medium, terminal, and program product

Abstract

The application provides a two-dimensional hybrid position coding system, method, medium, terminal, and program product. The system comprises: a data acquisition module for acquiring image blocks; an absolute coding module for adding row codes and column codes to the row positions and column positions into which the image block positions are decomposed, and fusing the added row codes and column codes to obtain absolute position vectors; a rotary relative coding module for applying a two-dimensional rotary position mapping to query vectors and key vectors to obtain rotary query vectors and rotary key vectors; a local relative coding module for adding a sparse relative bias to attention scores based on a sparse relative bias matrix to obtain corrected attention scores; and a fusion gating module for calculating dynamic weights based on context information and fusing the absolute position vectors, the rotary query vectors, and the rotary key vectors. The application preserves and strengthens the two-dimensional grid structure and the relative geometric relationships, and transfers well across resolutions, image block sizes, and window sizes.
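
As an illustration of the rotary relative coding described above, the following is a minimal sketch of one possible two-dimensional rotary position mapping. It assumes an axial scheme that the abstract does not specify: the first half of each vector is rotated by angles proportional to the row index, and the second half by angles proportional to the column index. The names `rotate_pairs` and `rope_2d`, the frequency base of 10000, and the sample vectors are hypothetical.

```python
import math


def rotate_pairs(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by the angles pos * theta_i."""
    n_pairs = len(vec) // 2
    out = []
    for i in range(n_pairs):
        theta = pos * base ** (-i / n_pairs)  # assumed frequency schedule
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_2d(vec, row, col):
    """Assumed axial scheme: first half encodes the row, second half the column.

    The vector length must be divisible by 4 so that each half splits into pairs.
    """
    half = len(vec) // 2
    return rotate_pairs(vec[:half], row) + rotate_pairs(vec[half:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors of dimension 8.
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [0.8, 0.1, -0.6, 1.3, -0.5, 0.4, 1.0, -0.2]

# The same relative offset (row +1, column -2) at two different absolute positions.
score_a = dot(rope_2d(q, 2, 5), rope_2d(k, 3, 3))
score_b = dot(rope_2d(q, 7, 9), rope_2d(k, 8, 7))
```

Under this assumed scheme, `score_a` and `score_b` agree up to floating-point error, because the dot product of the rotated vectors depends only on the row and column offsets between the query position and the key position.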

Inventors

Request for anonymity
Request for anonymity

Assignees

上海光羽芯辰科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260121

Claims (10)

1. A two-dimensional hybrid position coding system, comprising: a data acquisition module configured to acquire a plurality of image blocks obtained by dividing an image to be processed; an absolute coding module configured to decompose the two-dimensional position of each image block into a row position and a column position, add a row code and a column code to the decomposed row position and column position respectively, and fuse the added row code and column code to obtain an absolute position vector; a rotary relative coding module configured to apply a two-dimensional rotary position mapping to a query vector and a key vector, which are calculated from the absolute position vector and the feature vector converted from the image block, to obtain a rotary query vector and a rotary key vector; a local relative coding module configured to construct a sparse relative bias matrix based on a dynamic sparse mechanism and, based on the sparse relative bias matrix, add a sparse relative bias to the attention score calculated from the query vector and the key vector to obtain a corrected attention score; and a fusion gating module configured to calculate, based on context information, dynamic weights for the absolute position vector generated by the absolute coding module, the rotary query vector and the rotary key vector generated by the rotary relative coding module, and the sparse relative bias matrix generated by the local relative coding module, and to execute after fusing them.
2. The two-dimensional hybrid position coding system of claim 1, further comprising a multi-scale level sharing module configured to make the respective layers share the parameter representations in a preset base relative position offset table based on a cross-level sharing mechanism.
3. The two-dimensional hybrid position coding system of claim 1, wherein the local relative coding module comprises: a receiving unit configured to receive the input attention score calculated from the query vector and the key vector; a bias acquisition unit configured to acquire the bias term corresponding to each two-dimensional relative position from a preset two-dimensional relative displacement bias table; and a sparse bias adding unit configured to add the bias terms to the attention scores of the local neighborhood and of the axial key directions based on a sparse mask mechanism, so as to obtain the corrected attention score.
4. The two-dimensional hybrid position coding system of claim 3, wherein the axial key directions comprise a horizontal direction, a vertical direction, and a diagonal direction.
5. The two-dimensional hybrid position coding system of claim 3, wherein the local neighborhood is the set of all neighboring image blocks that are centered on the current query image block and whose spatial distance from it on the image block grid is less than or equal to a preset threshold.
6. The two-dimensional hybrid position coding system of claim 1, wherein the added row codes and column codes are fused by summation or by concatenation followed by a linear transformation.
7. A two-dimensional hybrid position coding method, comprising: acquiring a plurality of image blocks obtained by dividing an image to be processed; decomposing the two-dimensional position of each image block into a row position and a column position, adding a row code and a column code to the decomposed row position and column position respectively, and fusing the added row code and column code to obtain an absolute position vector; applying a two-dimensional rotary position mapping to a query vector and a key vector, which are calculated from the absolute position vector and the feature vector converted from the image block, to obtain a rotary query vector and a rotary key vector; constructing a sparse relative bias matrix based on a dynamic sparse mechanism and, based on the sparse relative bias matrix, adding a sparse relative bias to the attention score calculated from the query vector and the key vector to obtain a corrected attention score; and calculating, based on context information, dynamic weights for the absolute position vector generated by the absolute coding module, the rotary query vector and the rotary key vector generated by the rotary relative coding module, and the sparse relative bias matrix generated by the local relative coding module, and executing after fusing them.
8. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method as claimed in claim 7.
9. A computer program product comprising computer program code means for causing a computer to carry out the method as claimed in claim 7 when said computer program code means are run on the computer.
10. An electronic terminal comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method as claimed in claim 7.
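
The sparse bias addition of claims 3 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the claims do not detail the dynamic sparse mechanism, so a fixed mask is used here; the spatial distance is assumed to be the Chebyshev distance on the image block grid; and the function names, the `radius` parameter, and the bias values are hypothetical.

```python
def sparse_mask(d_row, d_col, radius):
    """Return True if a bias term is added for the displacement (d_row, d_col)."""
    local = max(abs(d_row), abs(d_col)) <= radius  # assumed Chebyshev distance
    horizontal = d_row == 0
    vertical = d_col == 0
    diagonal = abs(d_row) == abs(d_col)
    return local or horizontal or vertical or diagonal


def add_sparse_bias(scores, grid_h, grid_w, bias_table, radius=1):
    """Return a corrected copy of scores; scores[i][j] is query block i vs key block j."""
    n = grid_h * grid_w
    corrected = [row[:] for row in scores]
    for i in range(n):
        q_row, q_col = divmod(i, grid_w)
        for j in range(n):
            k_row, k_col = divmod(j, grid_w)
            d = (k_row - q_row, k_col - q_col)
            if sparse_mask(d[0], d[1], radius):
                # Look up the bias term for this two-dimensional relative position.
                corrected[i][j] += bias_table.get(d, 0.0)
    return corrected


grid_h, grid_w = 4, 4
n = grid_h * grid_w
zero_scores = [[0.0] * n for _ in range(n)]
# Hypothetical table: every displacement maps to 1.0 so the masked entries are visible.
bias_table = {(dr, dc): 1.0 for dr in range(-3, 4) for dc in range(-3, 4)}
corrected = add_sparse_bias(zero_scores, grid_h, grid_w, bias_table, radius=1)
```

With the query block at grid position (0, 0), the key blocks at (0, 3), (1, 1), and (2, 2) receive a bias because they lie on the horizontal axis, in the local neighborhood, and on the diagonal respectively, while the key block at (1, 2) does not.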

Description

Two-dimensional hybrid position coding system, method, medium, terminal, and program product

Technical Field

The present application relates to the field of image processing technologies, and in particular to a two-dimensional hybrid position coding system, method, medium, terminal, and program product.

Background

An existing Vision Transformer (ViT) model typically divides the input image into fixed-size image blocks (patches), linearly maps each patch into a vector, concatenates the vectors into a one-dimensional sequence, superimposes one-dimensional position codes such as sinusoidal encoding or learnable absolute position vectors, and finally inputs the sequence into a standard encoder for feature learning and classification. Although ViT models perform well in many image recognition tasks, this processing ignores the row-column relationships and relative position information of patches on the two-dimensional grid, which causes the following problems in image processing tasks:

(1) In a two-dimensional image, the spatial relationships between pixels or image blocks have a definite row-column directionality, but one-dimensional serialization makes it difficult for the model to distinguish long-range dependencies in the horizontal direction from those in the vertical direction, which limits the model's accurate understanding of the global context of the image.

(2) Insufficient robustness to geometric changes such as scale, translation, and rotation: conventional one-dimensional position codes are usually bound to absolute positions, so when the image undergoes such geometric changes, the encoded positional relationships may become invalid, leaving the model insufficiently robust.

(3) Limited expression of spatial structure in downstream tasks (detection, segmentation, key points, and the like): because one-dimensional position coding can hardly provide two-dimensional structural guidance, model performance is limited when reconstructing or localizing spatial layouts.

(4) Reduced generalization when migrating to an arbitrary resolution or a changed patch size: when the image resolution or the patch size changes, the current one-dimensional absolute position codes must be interpolated or retrained, which increases the computational cost and distorts the position information, thereby affecting the use of the model in image processing tasks.

Accordingly, there is a need for a two-dimensional hybrid position coding system, method, medium, terminal, and program product that address the above problems in the prior art.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present application is to provide a two-dimensional hybrid position coding system, method, medium, terminal, and program product, so as to solve the technical problem that the prior art ignores the row-column relationships and relative position information of image blocks on a two-dimensional grid.
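
The row-column information that one-dimensional serialization discards can be illustrated with a small sketch. It decomposes a flattened patch index into a row position and a column position and fuses a row code with a column code by summation, one of the fusion options named in the claims. The function names and the tiny code tables are hypothetical and chosen only for readability.

```python
def to_row_col(index, grid_w):
    """Decompose a flattened patch index into (row, column) on a grid of width grid_w."""
    return divmod(index, grid_w)


def absolute_position_vector(index, grid_w, row_codes, col_codes):
    """Sum fusion: add the row code and the column code element by element."""
    row, col = to_row_col(index, grid_w)
    return [r + c for r, c in zip(row_codes[row], col_codes[col])]


grid_h, grid_w = 3, 4

# Flat indices 3 and 4 are neighbors in the 1D sequence but sit on different rows.
end_of_row_0 = to_row_col(3, grid_w)    # (0, 3)
start_of_row_1 = to_row_col(4, grid_w)  # (1, 0)

# Hypothetical code tables: one entry per row and one per column (3 + 4 entries),
# instead of one entry per grid position (3 * 4 entries).
row_codes = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
col_codes = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]

vec_4 = absolute_position_vector(4, grid_w, row_codes, col_codes)  # row 1, column 0
vec_7 = absolute_position_vector(7, grid_w, row_codes, col_codes)  # row 1, column 3
```

In this sketch, patches 4 and 7 share the row component of their position vectors because they lie on the same row, a relationship that a single one-dimensional index does not expose directly.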
To achieve the above and other related objects, a first aspect of the present application provides a two-dimensional hybrid position coding system, comprising: a data acquisition module configured to acquire a plurality of image blocks obtained by dividing an image to be processed; an absolute coding module configured to decompose the two-dimensional position of each image block into a row position and a column position, add a row code and a column code to the decomposed row position and column position respectively, and fuse the added row code and column code to obtain an absolute position vector; a rotary relative coding module configured to apply a two-dimensional rotary position mapping to a query vector and a key vector, which are calculated from the absolute position vector and the feature vector converted from the image block, to obtain a rotary query vector and a rotary key vector; a local relative coding module configured to construct a sparse relative bias matrix based on a dynamic sparse mechanism and, based on the sparse relative bias matrix, add a sparse relative bias to the attention score calculated from the query vector and the key vector to obtain a corrected attention score; and a fusion gating module configured to calculate, based on context information, dynamic weights for the absolute position vector generated by the absolute coding module, the rotary query vector and the rotary key vector generated by the rotary relative coding module, and the sparse relative bias matrix generated by the local relative coding module, and to execute after fusing them. In some embodiments of the first aspect of the present application, the method further