CN-121982067-A - Four-branch RGBT target tracking method based on cross Mamba prompt
Abstract
A four-branch RGBT target tracking method based on a cross-Mamba prompt. The method designs a backbone network based on four-branch feature extraction, which can effectively extract all image features of both modalities. To address the problem that TATrack fails to realize effective bidirectional interaction between modalities, it performs bidirectional prompt fusion between modalities layer by layer on top of the four-branch feature-extraction backbone network. To address the problem that TATrack realizes bidirectional prompt fusion using only template features through a self-attention mechanism, it jointly exploits template features and search features and designs a spatio-temporal context information interaction method based on cross Mamba.
Inventors
- Gao Bin
- Xie Xinyang
- Jin Ading
Assignees
- Yunnan University
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-02-05
Claims (4)
- 1. A four-branch RGBT target tracking method based on a cross-Mamba prompt, characterized by comprising five parts: input, backbone network, inter-modality interaction, spatio-temporal context information interaction, and prediction head; wherein the input comprises an initial template image, a dynamic template image and a search-area image, the dynamic template being continuously updated during tracking; the backbone network adopts a four-branch feature-extraction network, in which the first branch extracts features of the RGB-modality initial template and search area, the second branch extracts features of the T-modality initial template and search area, the third branch extracts features of the RGB-modality dynamic template and search area, and the fourth branch extracts features of the T-modality dynamic template and search area; the inter-modality interaction performs layer-by-layer bidirectional interaction between the first and second branches and between the third and fourth branches; the spatio-temporal context information interaction realizes bidirectional interaction of spatio-temporal context information between the first and third branches based on the cross-Mamba method; the prediction head uses the search features output by the four branches to localize the target; the method thereby performs feature extraction of the two modalities, bidirectional inter-modality interaction, and cross-Mamba bidirectional interaction of spatio-temporal context information, achieving an accurate and robust tracking effect.
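The branch layout and interaction pairs described in claim 1 can be summarized in a rough structural sketch. All names and the dict layout here are illustrative, not from the patent text:

```python
# Hypothetical sketch of the four-branch layout in claim 1.
# Branch numbering follows the claim; names are illustrative.

BRANCHES = {
    1: ("RGB", "initial_template"),   # RGB initial template + search area
    2: ("TIR", "initial_template"),   # T (thermal) initial template + search area
    3: ("RGB", "dynamic_template"),   # RGB dynamic template + search area
    4: ("TIR", "dynamic_template"),   # T dynamic template + search area
}

# Layer-by-layer bidirectional inter-modality interaction is between branches
# sharing the same template type but different modalities.
MODAL_PAIRS = [(1, 2), (3, 4)]

# Cross-Mamba spatio-temporal context interaction is between the initial- and
# dynamic-template branches (first and third).
TEMPORAL_PAIR = (1, 3)

def interaction_partners(branch: int):
    """Return the branches a given branch interacts with."""
    partners = []
    for a, b in MODAL_PAIRS + [TEMPORAL_PAIR]:
        if branch == a:
            partners.append(b)
        elif branch == b:
            partners.append(a)
    return partners
```

Under this layout the first branch participates in both kinds of interaction, which matches the claim's statement that after modal interaction it carries both modalities' information.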
- 2. The method according to claim 1, characterized by comprising the following eleven steps. Step one, input the first video frame: a video sequence comprises a plurality of frames, each frame comprising an RGB-modality image, i.e. a visible-light image, and a TIR-modality image, i.e. a thermal infrared image, and containing at least one target; the target position is known in the first frame and unknown in the remaining frames; the number of video frames is a positive integer greater than 1, and the upper-left corner of each frame is the coordinate origin. The first frame is input and a rectangular region of the tracked object, i.e. the selected tracking target, is marked automatically or manually, the region being defined by its upper-left corner coordinates and its width and height; the target selected in the first frame also serves as the current-frame tracking result, with the current frame number initialized to 1. Step two, generate the initial template token sequence: the selected rectangular region of the first frame is enlarged, and the enlarged region serves as the target initial template image of the two modalities, with equal height and width and a given number of channels. Each modality's template image is divided into image blocks of fixed pixel resolution, forming a template image-block sequence for each modality, whose length equals the number of template image blocks and whose elements have the token embedding dimension. Each sequence is passed through a linear projection and added to a position embedding, generating the target template token sequences of the two modalities. Step three, initialize parameters: the feature-extraction process of the invention is divided into four branches; each branch adopts ViT as its backbone network, the four branches share parameters, and each backbone comprises 12 encoder layers, each layer mainly consisting of a multi-head self-attention module and a feed-forward network. The current layer number of the backbone network is initialized to zero. At each layer, each branch's backbone produces output features consisting of a template component and a search component. To reduce the overhead caused by the cross-Mamba spatio-temporal context prompt interaction, the invention performs information interaction only for a subset of backbone layers, represented by a set of layer numbers requiring spatio-temporal context interaction. The dynamic template token sequences of the two modalities are, for the first frame, initialized with the target initial template token sequences. Step four, input the next frame and generate the search-area token sequence: a new video frame is input as the current frame and the current frame number is incremented by 1. The rectangular region of the previous frame's tracking result is enlarged by a factor of 2, and the enlarged region serves as the search-area image of the two modalities, with given height, width and channel number. Each modality's search-area image is divided into image blocks of the same pixel resolution as the template blocks, forming a search-area image-block sequence for each modality with the same token embedding dimension as the template image blocks; each is passed through a linear projection and added to a position embedding, generating the search-area token sequences of the two modalities. The backbone's current layer number is reset to zero, and the input features of the four branch backbones are formed from the corresponding template and search-area token sequences, the superscripts distinguishing the template part and the search-area part. Step five, four-branch feature extraction and modal interaction: the current layer number of the backbone is incremented by 1; this step completes the feature extraction of one encoder layer in each of the four branch backbones. Since all four branches adopt ViT networks with shared parameters, the computation is the same in each; taking the first branch as an example, its layer input features pass through layer normalization and conventional multi-head self-attention with a residual connection, and then through layer normalization and the feed-forward network with a residual connection, yielding the layer output features; the output features of the other branches are computed identically. The first and third branches extract RGB-modality features, and the second and fourth branches extract TIR-modality features; modal interaction is performed between the first and second branches and between the third and fourth branches. Modal interaction may adopt any existing method of inter-modality information interaction; taking the bidirectional adapter method as an example, interaction is realized synchronously during the feature extraction of the two different modalities: a bidirectional adapter consisting of a set of linear projection layers exchanges features between the paired branches during their layer computation, and the feature extraction and modal interaction of the third and fourth branches proceed synchronously in the same way as those of the first and second branches. Step six, cross-Mamba-based spatio-temporal context information interaction: if the current layer number does not belong to the set of backbone layers designated for spatio-temporal context interaction, this step need not be performed. Since the first branch learns the second branch's feature information during modal interaction, the first branch can be considered to contain the feature information of the first two branches, and likewise the third branch that of the last two; spatio-temporal context information interaction can therefore be realized by cross Mamba between the first and third branches. This step completes the information interaction of the current layer's features of those two branches and comprises a normalization operation, a forward scanning operation, a reverse scanning operation and output processing. Step seven, judge whether the last layer is reached: if the current layer number has not reached the last layer, go to step five; otherwise continue with the subsequent steps. Step eight, determine the tracking result. Step nine, judge whether the last frame is reached: if the current frame is the last frame, tracking ends; otherwise continue with the subsequent steps. Step ten, judge whether the update condition is reached: the confidence corresponding to the target center position is obtained from the classification score map produced while determining the current frame's tracking result, and is compared with a confidence threshold. Step eleven, update the dynamic template token sequence: the image of the rectangular region of the current frame's tracking result is enlarged to the dynamic-template size and serves as the dynamic template image of the two modalities, with equal height and width and a given number of channels; each is divided into image blocks of fixed pixel resolution to form the dynamic-template image-block sequences, which are passed through a linear projection and added to a position embedding to generate the dynamic template token sequences of the two modalities; after the dynamic template token sequences are updated, go to step four.
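The ViT-style tokenization used in steps two, four and eleven (split each image into fixed-size blocks, then linearly project and add a position embedding) can be sketched with toy patch bookkeeping. The helper names and single-channel simplification are illustrative assumptions, not from the patent:

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens when an image is split into
    patch x patch blocks (ViT-style), assuming the sizes divide evenly."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

def patchify(image, patch: int):
    """Split a 2-D single-channel image (list of rows) into flattened patch
    vectors -- a toy version of the image-block division in steps 2/4/11."""
    h, w = len(image), len(image[0])
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tokens.append([image[py + dy][px + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens
```

For example, a 128x128 template with 16x16 patches yields 64 tokens, while a 256x256 search area yields 256 tokens; these concrete sizes are illustrative, since the claimed method leaves the template and search-area sizes as parameters.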
- 3. The method according to claim 2, wherein in step six the normalization operation, forward scanning operation, reverse scanning operation and output processing are specifically as follows. Normalization operation: the current-layer features of the two branches each undergo a layer normalization operation, yielding two intermediate variables. Forward scanning operation: a forward scan is performed on each normalized feature. Taking one branch's forward scan as an example, the normalized feature undergoes a linear projection and a separation operation, yielding two components; a further linear projection and separation operation on one component yields the intermediate variable parameter matrices, which, using the zero-order hold rule, transform the learnable parameter matrices in Mamba into discretized intermediate variables; from these parameter matrices a structured convolution kernel is obtained. Similarly, the other branch's normalized feature is processed to obtain its own components and structured convolution kernel. Next, one branch's component is processed by a one-dimensional convolution and an activation function and is input, together with the other branch's structured convolution kernel, into the state space model, generating a cross-processing result; this result is then dot-multiplied with the branch's remaining component passed through an activation function, yielding that branch's forward scan result. The other branch's forward scan result is obtained by the symmetric procedure. Reverse scanning operation: a reverse scan is performed on each of the two normalized features; it differs from the forward scan only in that a sequence-inversion operation is added at the beginning and at the end of the operation, the rest of the process being the same, yielding each branch's reverse scan result. Output processing: the forward and reverse scan results of each branch are added and passed through a linear projection, yielding the bidirectional prompt features of the first and third branches at the current layer; each bidirectional prompt feature is residual-connected with the branch's current-layer features to obtain the layer features after information interaction.
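The cross-scan idea of claim 3 can be illustrated with a toy stand-in. The real method uses Mamba's selective state-space parameters (zero-order hold discretization, structured convolution kernels); here a plain first-order linear recurrence substitutes for the selective scan, and all names are illustrative assumptions:

```python
def scan(xs, gates, decay=0.5):
    """First-order linear recurrence h_t = decay*h_{t-1} + g_t*x_t.
    Toy stand-in for Mamba's selective scan; `gates` comes from the *other*
    branch, which is what makes this a *cross* scan."""
    h, out = 0.0, []
    for x, g in zip(xs, gates):
        h = decay * h + g * x
        out.append(h)
    return out

def cross_bidirectional(a, b):
    """Bidirectional cross scan of sequence `a` gated by sequence `b`,
    followed by a residual connection (cf. claim 3's output processing).
    The reverse scan flips the sequences before and after scanning."""
    fwd = scan(a, b)
    bwd = scan(a[::-1], b[::-1])[::-1]
    return [x + f + r for x, f, r in zip(a, fwd, bwd)]
```

In the claimed method this is done symmetrically for both branches, so each receives a bidirectional prompt feature built from the other branch's information before the residual connection.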
- 4. The method according to claim 3, wherein step eight, determining the tracking result, is specifically: the search-region feature sequences output by the four branch backbones are fused to obtain the fused search-area feature information; taking element-wise addition as an example fusion method, the fused search-area feature information is the element-wise sum of the four sequences. The fused search-area feature information is reinterpreted as a two-dimensional feature map, and the target position is determined from a classification score map, local offsets and related information through a fully convolutional network formed by stacking several Conv-BN-ReLU layers. The total training loss of the tracking model is the sum of a classification loss, computed with the weighted focal loss, and a regression loss, in which the L1 loss and the generalized IoU loss regress the bounding box, each weighted by its corresponding regularization parameter. The target localization result is taken as the current-frame tracking result.
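The fusion and loss composition of claim 4 reduce to a short sketch. The lambda defaults below are illustrative assumptions (the patent leaves the regularization parameters unspecified), and the loss terms are passed in as already-computed scalars:

```python
def fuse(feature_maps):
    """Element-wise addition of the four branches' search-region features,
    the example fusion method named in claim 4."""
    return [sum(vals) for vals in zip(*feature_maps)]

def total_loss(l_cls, l_l1, l_giou, lam_l1=5.0, lam_giou=2.0):
    """Total training loss: weighted-focal classification loss plus
    regularization-weighted L1 and generalized IoU regression losses.
    The lambda defaults here are illustrative, not from the patent."""
    return l_cls + lam_l1 * l_l1 + lam_giou * l_giou
```

Element-wise addition keeps the fused feature the same shape as each branch's output, so the prediction head needs no extra projection; other fusion methods (e.g. concatenation plus projection) would also fit the claim's "taking ... as an example" wording.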
Description
Four-branch RGBT target tracking method based on cross-Mamba prompt
Technical Field
The invention relates to the field of target tracking methods, in particular to a four-branch RGBT target tracking method based on a cross-Mamba prompt.
Background
Visual Object Tracking (VOT) is a fundamental task in the field of computer vision, whose purpose is to predict the positional state of a target in subsequent frames given its initial state in the first frame. To overcome the failures that a single visible-light-based target tracking algorithm faces in complex and changeable real scenes, RGBT visual target tracking fuses information from the visible-light (RGB) modality and the infrared (T) modality to achieve more accurate and robust tracking in complex environments. There are three general approaches to RGBT tracking. The first extracts features only from the RGB-modality image, with the T-modality information used to generate RGB prompt information through a prompter, for example "J. Zhu, S. Lai, X. Chen, D. Wang, and H. Lu, Visual Prompt Multi-Modal Tracking, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9516–9526, 2023"; such methods ignore T-modality feature extraction. The second performs feature extraction on both modality images and fuses the two modalities' features during feature extraction, for example "B. Cao, J. Guo, P. Zhu, and Q. Hu, Bi-directional Adapter for Multimodal Tracking, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, pp. 927–935, 2024". Although such methods effectively extract the two modalities' features and realize interaction between modalities, both of the above approaches rely only on an initial template for tracking and ignore all spatio-temporal context information.
When the target's appearance changes greatly, tracking performance may be obviously affected, and it is difficult to maintain a stable tracking effect. The third approach introduces spatio-temporal context information, for example "H. Wang, X. Liu, Y. Li, M. Sun, D. Yuan, and J. Liu, Temporal Adaptive RGBT Tracking with Modality Prompt, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, pp. 5436–5444, 2024", which constructs a basic template–search matching branch from the initial template and an online branch from the dynamic template, so that the template interacts with temporal information. However, this approach ignores T-modality feature extraction and effective interaction between modalities. The prior-art TATrack tracking method has the following problems and disadvantages: (1) it performs feature extraction only on the RGB modality and omits T-modality feature extraction; (2) it uses the T modality to generate unidirectional prompt information for the RGB modality and cannot realize effective bidirectional interaction between modalities; and (3) its offline/online-branch bidirectional spatio-temporal prompt fusion uses only template features, cannot fully utilize search features, and its fusion method is worth further exploration. By researching and analyzing existing RGBT tracking methods, the invention comprehensively considers the problems of feature extraction from the two modality images, inter-modality information interaction, and spatio-temporal context information interaction. An existing modality prompting method can be adopted for the information interaction among modalities.
For spatio-temporal context information interaction, the invention designs a spatio-temporal context information interaction method based on cross Mamba, drawing on Mamba's efficient long-sequence modeling capability, which effectively uses spatio-temporal context information to enhance the current frame's features and improves tracking accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to provide an RGBT video target tracking method which constructs a four-branch RGBT target tracking framework, realizes information interaction among modalities, and realizes spatio-temporal context information interaction based on cross Mamba, effectively improving the accuracy of RGBT tracking on the basis of promoting modal interaction and spatio-temporal context information interaction. In order to achieve the above purpose, the technical scheme of the invention is as follows: The invention provides a four-branch RGBT target tracking method based on a cross-Mamba prompt, which comprises five parts: input, backbone network, inter-modality interaction, spatio-temporal context information interaction, and prediction. The input comprises an initial template image, a dynamic template image and a search-area image, wherein the dynamic template is continuously updated during tracking. The backbone network adopts a four-branch feature extraction network, wherein the fi