CN-120014299-B - Long-term target tracking method based on time sequence query propagation

CN120014299B

Abstract

A long-term target tracking method based on time-series query propagation comprises: first, obtaining video sequence frames comprising a multi-frame template frame set and a multi-frame search frame set, and flattening them to obtain template patch embeddings and search patch embeddings; second, adding position embeddings; third, designing a target query vector and a background query vector, and inputting the first-frame template patch embeddings and search patch embeddings obtained in the second step, together with the designed target query vector and background query vector, into multiple vision transformer layers for attention computation; fourth, inputting the target query vector and background query vector updated in the third step, together with the template patch embeddings and search patch embeddings of subsequent frames, into the multiple vision transformer layers for attention computation again; and fifth, feeding the finally generated search-frame patch embeddings into a prediction head to obtain the target position coordinates. The invention can improve the robustness and accuracy of long-term tracking.
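The first two steps of the abstract (patch partitioning, flattening, linear projection, and element-wise position embedding) can be illustrated with a minimal NumPy sketch. This is not the patent's implementation: the random matrix `W_proj` merely stands in for the trainable linear projection network, and the patch size (16) and latent dimension (384) are taken from the claims.

```python
import numpy as np

def patch_embed(frame, patch=16, dim=384, seed=0):
    """Split an H x W x 3 frame into patch x patch blocks, flatten each block
    into a 1-D vector, and linearly project it into a dim-dimensional latent space."""
    H, W, C = frame.shape
    blocks = []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            blocks.append(frame[y:y + patch, x:x + patch].reshape(-1))
    tokens = np.stack(blocks)                       # (N, patch*patch*C) flattened blocks
    rng = np.random.default_rng(seed)
    W_proj = rng.standard_normal((tokens.shape[1], dim)) * 0.02  # stand-in for the trainable projection
    return tokens @ W_proj                          # (N, dim) patch embeddings

# Step two: add a learnable position embedding element-wise; here it is just a
# directly defined random tensor standing in for the learned one.
frame = np.zeros((128, 128, 3), dtype=np.float32)
emb = patch_embed(frame)
pos = np.random.default_rng(1).standard_normal(emb.shape) * 0.02
emb = emb + pos
print(emb.shape)  # (64, 384): (128/16)^2 = 64 patches, each a 384-d embedding
```

The same routine would be applied to both template frames and search frames to produce the two embedding sets the later steps consume.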

Inventors

  • LING QIANG
  • XIONG JIABING
  • LIU YUAN
  • FANG YI

Assignees

  • 中国科学技术大学 (University of Science and Technology of China)

Dates

Publication Date
2026-05-12
Application Date
2025-01-26

Claims (9)

  1. A long-term target tracking method based on time-series query propagation, comprising:
     Step one, video sequence partitioning and patch embedding: obtain a multi-frame template frame set and a multi-frame search frame set, divide each frame of image into a series of image blocks, flatten each image block into a one-dimensional vector, and use a trainable linear projection network to map the flattened template frame image blocks and search frame image blocks into a high-dimensional latent space, obtaining template patch embeddings and search patch embeddings;
     Step two, adding position embeddings: add learnable position embeddings to the template patch embeddings and search patch embeddings obtained in step one, respectively;
     Step three, initial query and transformer interaction: design a target query vector and a background query vector, where the target query vector aims to learn and characterize the salient features of the target object and the background query vector aims to learn and characterize the features of the background region; input the first-frame template patch embeddings and search patch embeddings obtained in step two, together with the designed target query vector and background query vector, into multiple vision transformer layers for attention computation; the target query vector is randomly initialized as a one-dimensional vector, while the background query vector takes the token sequence of the background region as input to a multi-scale convolutional network that gradually reduces the resolution of the feature map through layer-by-layer convolution and pooling until a sampling stride of 16 is reached, and the learned two-dimensional feature is flattened into a one-dimensional vector to serve as the final background query vector;
     Step four, time-series query propagation and transformer iteration: input the target query vector and background query vector updated in step three, together with the template patch embeddings and search patch embeddings of subsequent frames, into the multiple vision transformer layers and compute attention again; repeat this cycle of updating the target query vector and background query vector and feeding them into the vision transformer layers, so that the transformer layers continuously compute attention across consecutive video frames;
     Step five, feed the search-frame patch embeddings finally generated by the multi-layer transformer iterative computation in step four into a prediction head to obtain the target position coordinates.
  2. The long-term target tracking method based on time-series query propagation of claim 1, wherein the trainable linear projection network is a linear function.
  3. The long-term target tracking method based on time-series query propagation of claim 1, wherein the dimension of the high-dimensional latent space is 384.
  4. The long-term target tracking method based on time-series query propagation of claim 1, wherein adding the position embeddings to the template patch embeddings and the search patch embeddings means adding the position embeddings to the template patch embeddings and the search patch embeddings element-wise, respectively.
  5. The long-term target tracking method based on time-series query propagation of claim 1, wherein the position embedding is a learnable position embedding vector, obtained by directly defining a tensor that the deep learning network then learns autonomously to encode positional information.
  6. The long-term target tracking method based on time-series query propagation of claim 1, wherein, in the multiple vision transformer layers, the target query vector and the background query vector perform cross-attention with the template patch embeddings and the search patch embeddings, respectively, to extract feature information related to the target and the background; at the same time, the patch embeddings within the template frames and within the search frames also perform self-attention to learn intra-frame context.
  7. The long-term target tracking method based on time-series query propagation of claim 1, wherein the number of vision transformer layers is 24.
  8. The long-term target tracking method based on time-series query propagation of claim 1, wherein the prediction head is composed of a fully connected layer that maps the patch embeddings to the position coordinate information of the target.
  9. The long-term target tracking method based on time-series query propagation of claim 1, wherein the position coordinate information of the target includes the center point coordinates and the width and height of the target.
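The claimed steps three to five can be sketched end to end. The NumPy illustration below is a rough, hypothetical stand-in, not the patented network: single-head scaled dot-product attention replaces the vision transformer layers, the background query is random rather than produced by the multi-scale convolutional network, and a random linear map replaces the fully connected prediction head of claim 8. The shapes (384-d embeddings, 4-d box output) follow claims 3, 8, and 9.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 384  # latent dimension (claim 3)

def attention(q, kv):
    """Single-head scaled dot-product attention: q attends over kv (stand-in for a ViT layer)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

# Step three: target query (randomly initialized 1-D vector, as claimed) and
# background query (here also random; the patent derives it from a multi-scale CNN).
target_q = rng.standard_normal((1, DIM)) * 0.02
background_q = rng.standard_normal((1, DIM)) * 0.02

def track_frame(template_emb, search_emb, target_q, background_q):
    """One propagation step (claims 1 and 6): queries cross-attend to the patch
    embeddings, search patches attend over the full context, and the updated
    queries carry over to the next frame."""
    patches = np.concatenate([template_emb, search_emb])
    target_q = attention(target_q, patches)            # cross-attention: target query -> patches
    background_q = attention(background_q, patches)    # cross-attention: background query -> patches
    search_emb = attention(search_emb, np.concatenate([patches, target_q, background_q]))
    return search_emb, target_q, background_q

# Step four: propagate the queries across consecutive frames.
W_head = rng.standard_normal((DIM, 4)) * 0.02          # stand-in prediction head: FC layer -> (cx, cy, w, h)
for _ in range(3):                                     # three dummy frames
    template = rng.standard_normal((64, DIM))
    search = rng.standard_normal((64, DIM))
    search, target_q, background_q = track_frame(template, search, target_q, background_q)

# Step five: map the final search patch embeddings to target position coordinates.
box = search.mean(axis=0) @ W_head
print(box.shape)  # (4,): center coordinates plus width and height (claim 9)
```

The point of the sketch is the data flow: the query vectors updated on one frame are the inputs for the next, so target and background appearance information accumulates across the sequence instead of being anchored to the initial template alone.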

Description

Long-term target tracking method based on time-series query propagation

Technical Field

The invention relates to the field of pattern recognition and computer vision, and in particular to a long-term target tracking method based on time-series query propagation.

Background

Object tracking is a fundamental and important task in computer vision: given the position or region of an object in the initial frame of a video sequence, the goal is to predict the state of the object, typically its position and size, in subsequent frames. Owing to this core capability, object tracking technology is widely applied in many key fields such as video surveillance, human-computer interaction, autonomous driving, and robot navigation. In recent years, driven by the rapid development of deep learning, deep-learning-based tracking methods have improved markedly in performance and have become the mainstream of research. Nevertheless, despite great progress, existing deep learning trackers still face many challenges in practical applications, especially in long-term tracking tasks, and these challenges limit their robustness and reliability. They are mainly manifested in the following aspects:

Appearance change of the target. The target inevitably undergoes rotation, scaling, deformation, illumination change, and even changes of its own pose during motion. These changes can cause large differences between the target features extracted by the tracker and the initial template, leading to tracking failure.

Background interference. In a complex dynamic background, objects similar in appearance to the target often appear, along with interference factors such as sudden lighting changes and scene switches; the tracker then easily drifts and misjudges the background as the target.
Occlusion and disappearance of the target. The target may be completely or partially occluded by other objects during movement, or may even leave the field of view temporarily. Conventional trackers often have difficulty recovering the track after the target reappears, because they make no efficient use of the target's history information.

Model drift in long-term tracking. Some trackers based on online learning strategies have a certain adaptive capacity, but during long-term operation, if an inaccurate prediction occurs, the erroneous tracking result is used as a positive sample to update the model; errors then accumulate over time, causing model drift and ultimately degrading tracking performance.

As shown in fig. 1 (a), most mainstream deep learning object tracking methods are currently trained on single-frame image pairs, i.e., they learn the feature representation of the target from an image pair consisting of a template image containing the target and a search image in which the target is to be located. This training paradigm reduces the tracking task to a similarity-matching problem between image pairs, whose core is learning the visual similarity between the template image and candidate regions in the search image. However, training on single image pairs essentially treats video frames as independent still images, ignoring the time dimension inherent in a video sequence and the rich temporal information between frames. This neglect of temporal information makes it difficult for the tracker to model the dynamics of the target during motion, so the tracker struggles to cope with challenges such as target appearance change and background interference.

Although some tracking methods attempt to introduce temporal information, the main approach is to dynamically update the template image, i.e., to replace the template image of the initial frame with an image block cropped from a subsequent tracking frame according to the prediction result. The intent of this dynamic update strategy is for the model to learn target features more similar to the current frame. However, the strategy has inherent drawbacks. On the one hand, when the tracking prediction is unreliable, for example when drift occurs or the target is occluded, the updated template image may no longer contain the target to be tracked and instead introduces background noise, accelerating tracking failure. On the other hand, methods based on dynamic template updating usually require careful design and tuning of complex hyperparameters, such as the template update interval, the confidence threshold for updating, and fusion weights; these settings have a significant influence on tracking performance, lack generality, must be adjusted for different scenes, and increase the complexity and deployment difficulty of the method.