CN-122023464-A - Multi-modal target tracking method and system based on a spatio-temporal conditional denoising Transformer


Abstract

The invention provides a multi-modal target tracking method and system based on a spatio-temporal conditional denoising Transformer, belonging to the field of computer vision. Gaussian noise is superimposed on the RGB and thermal infrared (TIR) search-frame encoding features to generate noisy inputs. The short-term history frame sequence of the TIR modality is fused into the noisy input of the RGB modality to obtain fused features, and the long-term history frame sequence of the TIR modality is then fused into those features to generate refined features. Symmetrically, the short-term history frame sequence of the RGB modality is fused into the noisy input of the TIR modality to obtain fused features, and the long-term history frame sequence of the RGB modality is then fused into those features to generate refined features. The denoising device is trained on these features. After training, Gaussian noise is injected into the encoding features of the available modality to obtain noisy features, which are input to the trained denoising device to generate reconstructed or enhanced features, ensuring tracking stability.
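
The noise-injection step summarized above can be illustrated with a minimal sketch. The source text does not reproduce the noising expression, so the variance-preserving mix used here, the function names `add_noise` and `choose_alpha`, and the two example values of the scheduling parameter are all assumptions chosen for illustration, not the patented formula.

```python
import math
import random


def add_noise(feature, alpha, sigma=1.0, rng=None):
    """Mix a feature vector with Gaussian noise.

    Assumed form (not taken from the source):
        sqrt(alpha) * f + sqrt(1 - alpha) * eps,  eps ~ N(0, sigma^2)
    alpha near 1 gives slight noise; alpha near 0 gives high-intensity noise.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = rng or random.Random()
    keep = math.sqrt(alpha)
    mix = math.sqrt(1.0 - alpha)
    return [keep * f + mix * rng.gauss(0.0, sigma) for f in feature]


def choose_alpha(complementary_ok, slight=0.95, strong=0.2):
    """Hypothetical schedule: slight noise when the complementary modality is
    intact, high-intensity noise when it is missing or damaged."""
    return slight if complementary_ok else strong


rng = random.Random(0)
f_rgb = [0.5, -1.0, 2.0, 0.25]  # toy RGB search-frame encoding feature
noisy_enh = add_noise(f_rgb, choose_alpha(True), rng=rng)   # enhancement case
noisy_rec = add_noise(f_rgb, choose_alpha(False), rng=rng)  # reconstruction case
```

Under this assumption, a single denoising device can serve both cases: the noise level alone signals whether the available feature only needs enhancement or must drive reconstruction of the missing modality.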

Inventors

LU ANDONG
WANG QIONGKE
ZHA ZIYI
LI CHENGLONG

Assignees

Anhui University (安徽大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-15

Claims (10)

1. A multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer, characterized by comprising the following steps: encoding the preprocessed template and search-frame sequences of the RGB and TIR modalities to obtain the search-frame encoding features of the RGB modality and the search-frame encoding features of the TIR modality; adaptively generating Gaussian noise according to the completeness of the RGB modality and the TIR modality, and superimposing the Gaussian noise on the search-frame encoding features to generate the noisy input of the RGB modality and the noisy input of the TIR modality, which enter a denoising device; fusing the short-term history frame sequence of the TIR modality into the noisy input of the RGB modality to obtain fused features, and fusing the long-term history frame sequence of the TIR modality into those fused features to generate refined features; fusing the short-term history frame sequence of the RGB modality into the noisy input of the TIR modality to obtain fused features, and fusing the long-term history frame sequence of the RGB modality into those fused features to generate refined features; jointly training the denoising device based on the feature reconstruction loss, the feature alignment loss, and the tracking loss of each modality to obtain a trained denoising device; and injecting Gaussian noise into the search-frame encoding features of the available modality to obtain noisy features, inputting the noisy features into the trained denoising device to generate reconstructed or enhanced features, and either inputting the reconstructed features concatenated with the features of the available modality into a prediction head, or inputting the enhanced features into the prediction head, to generate a tracking bounding box.
2. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein adaptively generating Gaussian noise according to the completeness of the RGB modality and the TIR modality comprises: when the complementary modality of the RGB modality or of the TIR modality is missing or damaged, injecting high-intensity noise into the available RGB or TIR modality; and when no modality is missing or damaged, injecting slight noise into the available RGB or TIR modality.
3. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the noisy inputs of the RGB modality and the TIR modality are given by an expression [expression not reproduced in the source text] defined in terms of the following quantities: the noisy input of the RGB or TIR modality; the search-frame encoding feature of the RGB or TIR modality; the Gaussian noise; the variance of the Gaussian noise; the identity matrix; and the noise scheduling parameter.
4. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the denoising device comprises a number of cascaded denoising modules, each denoising module comprising a self-attention layer, a cross-attention layer, a FiLM modulation module, and a feed-forward network connected in sequence; the noisy input of the RGB modality or of the TIR modality passes through the self-attention layer, and the resulting features enter the cross-attention layer, which fuses the short-term history frame sequence of the complementary modality to obtain the fused features of the respective modality; the fused features enter the FiLM modulation module, where a global long-term token modulates them through FiLM-style scaling and shifting operations; and the modulated features are input to the feed-forward network to generate the refined features of the respective modality.
5. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the fused features of the RGB modality and the TIR modality are given by an expression [expression not reproduced in the source text] defined in terms of the following quantities: the fused feature of the RGB or TIR modality; the output of the self-attention layer applied to the noisy input of the RGB or TIR modality; the cross-attention operation; and the short-term history frame sequence of the RGB or TIR modality.
6. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the refined features of the RGB modality and the TIR modality are given by an expression [expression not reproduced in the source text] defined in terms of the following quantities: the refined feature of the RGB or TIR modality; the normalization operation; the feed-forward network; the global long-term memory; two learnable projections; and the Hadamard product.
7. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the total loss function used to jointly train the denoising device based on the feature reconstruction loss, the feature alignment loss, and the tracking loss of each modality is given by an expression [expression not reproduced in the source text] combining: the feature reconstruction loss, used to reconstruct the features of the missing modality; the feature alignment loss for modality enhancement, which forces feature statistics to align; and the tracking loss, which follows the standard setting in ODTrack.
8. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the feature reconstruction loss is given by an expression [expression not reproduced in the source text] defined in terms of the refined feature of the RGB or TIR modality and the ground-truth feature of the complementary modality.
9. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer according to claim 1, wherein the feature alignment loss is given by an expression [expression not reproduced in the source text] defined in terms of the refined feature of the RGB or TIR modality, the ground-truth feature of that modality, and the mean and variance computed over the spatial dimension.
10. A multi-modal target tracking system based on a spatio-temporal conditional denoising Transformer, characterized by comprising: an encoding module for encoding the preprocessed template and search-frame sequences of the RGB and TIR modalities to obtain the search-frame encoding features of the RGB modality and the search-frame encoding features of the TIR modality; a noise modulation module for adaptively generating Gaussian noise according to the completeness of the RGB modality and the TIR modality, and superimposing the Gaussian noise on the search-frame encoding features to generate the noisy input of the RGB modality and the noisy input of the TIR modality, which enter a denoising device; a spatio-temporal conditional denoising module for fusing the short-term history frame sequence of the TIR modality into the noisy input of the RGB modality to obtain fused features and fusing the long-term history frame sequence of the TIR modality into those fused features to generate refined features, and for fusing the short-term history frame sequence of the RGB modality into the noisy input of the TIR modality to obtain fused features and fusing the long-term history frame sequence of the RGB modality into those fused features to generate refined features; a joint training module for jointly training the denoising device based on the feature reconstruction loss, the feature alignment loss, and the tracking loss of each modality to obtain a trained denoising device; and a target tracking module for injecting Gaussian noise into the search-frame encoding features of the available modality to obtain noisy features, inputting the noisy features into the trained denoising device to generate reconstructed or enhanced features, and either inputting the reconstructed features concatenated with the features of the available modality into a prediction head, or inputting the enhanced features into the prediction head, to generate a tracking bounding box.
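
The denoising module described in claim 4 (self-attention, cross-attention to the complementary short-term history, FiLM-style modulation by a global long-term token, then a feed-forward network) can be sketched in plain Python as follows. This is an illustrative sketch only: attention is single-head without learned projections, and the residual connections, the placement of layer normalization, the ReLU activation, and all names (`denoise_block`, `w_gamma`, `w_beta`, `w1`, `w2`) are assumptions not stated in the source.

```python
import math


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    weights = [softmax([s * scale for s in row]) for row in matmul(queries, keys_t)]
    return matmul(weights, values)


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(tokens, eps=1e-5):
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def film(tokens, global_token, w_gamma, w_beta):
    """FiLM-style modulation: per-channel scale (gamma) and shift (beta),
    both projected from the global long-term token."""
    gamma = matmul([global_token], w_gamma)[0]
    beta = matmul([global_token], w_beta)[0]
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]


def feed_forward(tokens, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(tokens, w1)]  # ReLU
    return matmul(hidden, w2)


def denoise_block(noisy, short_term, global_token, params):
    x = add(noisy, attention(noisy, noisy, noisy))    # self-attention
    x = add(x, attention(x, short_term, short_term))  # cross-attention to short-term history
    x = film(x, global_token, params["w_gamma"], params["w_beta"])
    return add(x, feed_forward(layer_norm(x), params["w1"], params["w2"]))


def denoiser(noisy, short_term, global_token, blocks):
    """Cascade of denoising blocks, one parameter dict per block."""
    x = noisy
    for params in blocks:
        x = denoise_block(x, short_term, global_token, params)
    return x


# Toy example: 3 noisy RGB tokens, 2 TIR short-term history tokens, 2 channels.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {"w_gamma": identity, "w_beta": zeros, "w1": identity, "w2": identity}
noisy_rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tir_short = [[0.5, 0.5], [1.0, -1.0]]
g_tir = [1.0, 1.0]  # global long-term token of the TIR modality
refined = denoiser(noisy_rgb, tir_short, g_tir, [params, params])
```

In this sketch the short-term history conditions each token individually through cross-attention, while the long-term token applies one scale and shift shared by all tokens, which mirrors the short-term and long-term roles described in the claims.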

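The three training losses named in claims 7 to 9 can be illustrated with a small sketch. The source text does not reproduce the loss expressions, so the choices below are assumptions: mean squared error for feature reconstruction, squared differences of per-channel spatial mean and variance for feature alignment, and a weighted sum with hypothetical weights `w_rec` and `w_align` for the total. The tracking loss is treated as a given number, since the source only says it follows ODTrack.

```python
def reconstruction_loss(refined, target):
    """Assumed MSE between refined features and the ground-truth features
    of the complementary modality (tokens x channels)."""
    count = sum(len(row) for row in refined)
    return sum(
        (p - t) ** 2 for rp, rt in zip(refined, target) for p, t in zip(rp, rt)
    ) / count


def spatial_stats(tokens):
    """Per-channel mean and variance over the spatial (token) dimension."""
    n = len(tokens)
    columns = list(zip(*tokens))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]
    return means, variances


def alignment_loss(refined, target):
    """Assumed statistic matching: squared gaps in spatial mean and variance."""
    mean_p, var_p = spatial_stats(refined)
    mean_t, var_t = spatial_stats(target)
    gap = sum((a - b) ** 2 for a, b in zip(mean_p, mean_t))
    gap += sum((a - b) ** 2 for a, b in zip(var_p, var_t))
    return gap / len(mean_p)


def total_loss(l_track, l_rec, l_align, w_rec=1.0, w_align=1.0):
    """Hypothetical weighted sum of tracking, reconstruction, and alignment losses."""
    return l_track + w_rec * l_rec + w_align * l_align


refined = [[1.0, 2.0], [3.0, 4.0]]  # toy refined features
truth = [[1.0, 2.0], [3.0, 6.0]]    # toy ground-truth features
l_rec = reconstruction_loss(refined, truth)
l_align = alignment_loss(refined, truth)
l_total = total_loss(0.5, l_rec, l_align)
```
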
Description

Multi-modal target tracking method and system based on a spatio-temporal conditional denoising Transformer

Technical Field

The invention relates to the technical field of computer vision and deep learning, and in particular to a multi-modal target tracking method and system based on a spatio-temporal conditional denoising Transformer.

Background

RGBT multi-modal target tracking aims to continuously locate a target in a video sequence using the RGB modality and the thermal infrared (TIR) modality. The RGB modality provides detailed appearance and semantic information under normal illumination, while the thermal infrared modality provides stable complementary radiation cues in low-light or occluded scenes. By exploiting these complementary strengths, RGBT tracking shows great application potential in fields such as autonomous driving, night-time surveillance, and search and rescue. In actual deployment, however, a modality is often missing because of sensor misalignment, occlusion, or hardware failure. When a modality is unavailable, the feature representation learned by the network can become incomplete or unstable, which greatly undermines tracking robustness.

Prior studies have attempted to address this problem as follows:

(1) Methods based on single-modality reconstruction. The paper "Towards good practices for missing modality robust action recognition" (Woo et al., Korea Advanced Institute of Science and Technology (KAIST), in Proceedings of the AAAI Conference on Artificial Intelligence, 2023) uses the single remaining available modality to synthesize or restore the features of the missing modality.

(2) Methods based on multi-modal knowledge transfer. The paper "Toward robust incomplete multimodal sentiment analysis via hierarchical representation learning" (Li et al., Academy for Engineering and Technology, Fudan University, Advances in Neural Information Processing Systems, 2024) uses a full-modality model to guide a model operating on incomplete input.

(3) Methods based on prompt learning. The paper "Distilled prompt learning for incomplete multimodal survival prediction" (Xu et al., Hong Kong University of Science and Technology, in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025) uses learnable tokens as a prompting strategy. The paper "Modality-missing RGBT tracking: invertible prompt learning and high-quality benchmarks" (Lu et al., School of Computer Science and Technology, Anhui University, International Journal of Computer Vision, 2025) generates the missing modality by means of an invertible prompt generation module.

(4) The paper "What you have is what you track: adaptive and robust multimodal tracking" (Tan et al., China Telecom TeleAI, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025) uses a multi-expert architecture to handle the different modality-availability cases separately.

The existing approaches above are designed for static or single-frame tasks and do not explicitly exploit temporal cues. Extending them to video-based tracking is challenging, because temporal cues from historical frames are critical for compensating for missing modalities.

Disclosure of Invention

The technical problem to be solved by the invention is how to address two shortcomings of existing RGBT tracking methods: the feature-restoration distortion and temporal discontinuity caused by relying only on spatial cues in modality-missing scenarios, and the poor generalization and computational redundancy caused by manually switching branches of the model architecture.
The invention solves the above technical problem by the following technical solution. The multi-modal target tracking method based on a spatio-temporal conditional denoising Transformer comprises the following steps: encoding the preprocessed template and search-frame sequences of the RGB and TIR modalities to obtain the search-frame encoding features of the RGB modality and the search-frame encoding features of the TIR modality; adaptively generating Gaussian noise according to the completeness of the RGB modality and the TIR modality, and superimposing the Gaussian noise on the search-frame encoding features to generate the noisy input of the RGB modality and the noisy input of the TIR modality, which enter a denoising device; fusing the short-term history frame sequence of the TIR modality into the noisy input of the RGB modality to obtain fused features, and fusing the long-term history frame sequence of the TIR modality into those fused features to generate refined features; fusing the short-term history frame sequence of the RGB modality into the noisy input of the TIR modality to obtain fused features, and fusing the long-term history frame sequence of the RGB modality into those fused features to generate refined features; jointly training the denoising device based on the feature reconstruction loss, the feature alignment loss, and the tracking loss of each modality to obtain a trained denoising device; and injecting Gaussian noise into the search-frame encoding features of the available modality to obtain noisy features, inputting the noisy features into the trained denoising device to generate reconstruction o