
CN-119338906-B - Target tracking method, device, electronic equipment and storage medium

CN119338906B

Abstract

The application provides a target tracking method, a target tracking device, electronic equipment and a storage medium, relating to the technical field of positioning and tracking. The method comprises: obtaining related information for tracking a target object, the related information comprising a satellite image, a text description of the target object contained in the satellite image, a target satellite image of the target object cropped from the satellite image, and a historical track sequence of the target object obtained from multiple frames of historical satellite images preceding the satellite image; and inputting the satellite image, the text description, the target satellite image and the historical track sequence into a joint positioning and sequence tracking model to obtain a target tracking result for the target object. The deep-learning-based joint positioning and sequence tracking model combines the text description of the target object and its historical track sequence with the satellite image to track the target jointly, so that target tracking in satellite video scenes can be realized and its accuracy improved.

Inventors

  • Wu Rihong
  • Fan Huadong
  • Xie Yonghu
  • Bu Dongdong
  • Su Zhijuan

Assignees

  • 北京观微科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2024-09-13

Claims (6)

  1. A target tracking method, comprising: acquiring related information for tracking a target object, wherein the related information comprises a satellite image, a text description of the target object contained in the satellite image, a target satellite image of the target object cropped from the satellite image, and a historical track sequence of the target object obtained based on multiple frames of historical satellite images preceding the satellite image; and inputting the satellite image, the text description, the target satellite image and the historical track sequence into a joint positioning and sequence tracking model to obtain a target tracking result of the target object; wherein the joint positioning and sequence tracking model is trained based on a plurality of related-information samples and tracking-result labels corresponding to the related-information samples; wherein the joint positioning and sequence tracking model comprises a natural language encoder, a visual encoder, a feature fusion module and an autoregressive decoder, the autoregressive decoder comprising a multi-head attention unit and a feed-forward neural network unit; wherein inputting the satellite image, the text description, the target satellite image and the historical track sequence into the joint positioning and sequence tracking model to obtain the target tracking result of the target object comprises: inputting the text description into the natural language encoder and extracting text coding features; inputting the satellite image and the target satellite image into the visual encoder and extracting first image coding features of the satellite image and second image coding features of the target satellite image; inputting the text coding features, the first image coding features and the second image coding features into the feature fusion module to obtain target fusion features, wherein the text coding features are input into a first linear projection unit and linearly projected to obtain target text coding features; the first image coding features and the second image coding features are input into a second linear projection unit and respectively linearly projected to obtain corresponding first target image coding features and second target image coding features, the dimensions of the target text coding features, the first target image coding features and the second target image coding features being the same; the target text coding features, coding features corresponding to an empty target satellite image of the target object and the first target image coding features are spliced to obtain first spliced features; the target text coding features, the second target image coding features and the first target image coding features are spliced to obtain second spliced features; and the first spliced features and the second spliced features are input into a multi-source association fusion unit to obtain the target fusion features, the empty target satellite image being a satellite image comprising only the target satellite image of the target object; and inputting the target fusion features and the historical track sequence into the autoregressive decoder to obtain the target tracking result, wherein the target tracking result is determined by inputting the target fusion features and the historical track sequence into the multi-head attention unit to obtain multi-head attention features, inputting the multi-head attention features into the feed-forward neural network unit to obtain target decoding features, and determining a first sum of the multi-head attention features and the target decoding features.
  2. The target tracking method of claim 1, wherein the multi-head attention unit comprises a multi-head cross-attention layer and a masked multi-head self-attention layer; and inputting the target fusion features and the historical track sequence into the multi-head attention unit to obtain the multi-head attention features comprises: inputting the historical track sequence into the masked multi-head self-attention layer to obtain masked multi-head self-attention features; inputting a second sum of the historical track sequence and the masked multi-head self-attention features, together with the target fusion features, into the multi-head cross-attention layer to obtain multi-head cross-attention features; and determining the sum of the multi-head cross-attention features and the second sum as the multi-head attention features.
  3. A target tracking device, comprising: an acquisition unit configured to acquire related information for tracking a target object, the related information comprising a satellite image, a text description of the target object contained in the satellite image, a target satellite image of the target object cropped from the satellite image, and a historical track sequence of the target object obtained based on multiple frames of historical satellite images preceding the satellite image; and a processing unit configured to input the satellite image, the text description, the target satellite image and the historical track sequence into a joint positioning and sequence tracking model to obtain a target tracking result of the target object; wherein the joint positioning and sequence tracking model is trained based on a plurality of related-information samples and tracking-result labels corresponding to the related-information samples; wherein the joint positioning and sequence tracking model comprises a natural language encoder, a visual encoder, a feature fusion module and an autoregressive decoder, the autoregressive decoder comprising a multi-head attention unit and a feed-forward neural network unit; wherein the processing unit is configured to: input the text description into the natural language encoder and extract text coding features; input the satellite image and the target satellite image into the visual encoder and extract first image coding features of the satellite image and second image coding features of the target satellite image; input the text coding features, the first image coding features and the second image coding features into the feature fusion module to obtain target fusion features, wherein the text coding features are input into a first linear projection unit and linearly projected to obtain target text coding features; the first image coding features and the second image coding features are input into a second linear projection unit and respectively linearly projected to obtain corresponding first target image coding features and second target image coding features, the dimensions of the target text coding features, the first target image coding features and the second target image coding features being the same; the target text coding features, coding features corresponding to an empty target satellite image of the target object and the first target image coding features are spliced to obtain first spliced features; the target text coding features, the second target image coding features and the first target image coding features are spliced to obtain second spliced features; and the first spliced features and the second spliced features are input into a multi-source association fusion unit to obtain the target fusion features, the empty target satellite image being a satellite image comprising only the target satellite image of the target object; and input the target fusion features and the historical track sequence into the autoregressive decoder to obtain the target tracking result, wherein the target tracking result is determined by inputting the target fusion features and the historical track sequence into the multi-head attention unit to obtain multi-head attention features, inputting the multi-head attention features into the feed-forward neural network unit to obtain target decoding features, and determining a first sum of the multi-head attention features and the target decoding features.
  4. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the target tracking method of any one of claims 1 to 2.
  5. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the target tracking method of any one of claims 1 to 2.
  6. A computer program product comprising a computer program which, when executed by a processor, implements the target tracking method of any one of claims 1 to 2.
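The autoregressive decoder step described in claims 1 and 2 can be sketched in PyTorch as follows. This is a minimal illustration assuming standard attention layers; the module names, dimensions and head counts are hypothetical and not specified by the patent. The "second sum" (track sequence plus masked self-attention) and "first sum" (attention features plus feed-forward output) residual paths follow the wording of the claims.

```python
import torch
import torch.nn as nn


class DecoderUnit(nn.Module):
    """Hypothetical sketch of the multi-head attention unit plus
    feed-forward unit of the autoregressive decoder (claims 1-2)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, track_seq, fused):
        # Causal mask: each track point attends only to earlier points.
        T = track_seq.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        sa, _ = self.self_attn(track_seq, track_seq, track_seq,
                               attn_mask=mask)
        # "Second sum": track sequence + masked self-attention features.
        s2 = track_seq + sa
        # Cross-attention of the second sum against the fused features.
        ca, _ = self.cross_attn(s2, fused, fused)
        # Multi-head attention features: cross-attention + second sum.
        attn_feat = ca + s2
        # "First sum": attention features + feed-forward output.
        return attn_feat + self.ffn(attn_feat)
```

The output of the final sum would then be decoded into the tracking result (e.g. box coordinates); that readout head is not detailed in the claims.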

Description

Target tracking method, device, electronic equipment and storage medium

Technical Field

The present application relates to the field of positioning and tracking technologies, and in particular to a target tracking method, device, electronic apparatus, and storage medium.

Background

Compared with target tracking in general video scenes, target objects in satellite video scenes are small, have weak visual characteristics and low resolution, and appear against more complex backgrounds. As a result, target objects are poorly distinguishable from one another and from the background and are prone to background occlusion, which makes the tracking task challenging. Related target tracking technologies include various algorithms. For example, one approach decomposes the tracking task into three subtasks (localization, tracking and integration), each handled by a dedicated module. Another approach defines the tracking task through natural language specifications, provides a dedicated platform for natural-language-specified tracking, releases a new natural-language-based tracking benchmark, TNL2K, and provides two baselines, one initialized by natural language alone and one by natural language together with a bounding box. However, these algorithms target only general video scenes and are not suited to satellite video scenes, so how to implement target tracking in satellite video scenes is a technical problem to be solved by those skilled in the art.
Disclosure of Invention

The application provides a target tracking method, a target tracking device, electronic equipment and a storage medium, which address the problem of realizing target tracking in satellite video scenes and improve its accuracy. The target tracking method provided by the application comprises the following steps: acquiring related information for tracking a target object, wherein the related information comprises a satellite image, a text description of the target object contained in the satellite image, a target satellite image of the target object cropped from the satellite image, and a historical track sequence of the target object obtained based on multiple frames of historical satellite images preceding the satellite image; and inputting the satellite image, the text description, the target satellite image and the historical track sequence into a joint positioning and sequence tracking model to obtain a target tracking result of the target object. The joint positioning and sequence tracking model is trained based on a plurality of related-information samples and tracking-result labels corresponding to those samples.
According to the target tracking method provided by the application, the joint positioning and sequence tracking model comprises a natural language encoder, a visual encoder, a feature fusion module and an autoregressive decoder. Inputting the satellite image, the text description, the target satellite image and the historical track sequence into the joint positioning and sequence tracking model to obtain the target tracking result of the target object comprises the following steps: inputting the text description into the natural language encoder and extracting text coding features; inputting the satellite image and the target satellite image into the visual encoder and extracting first image coding features of the satellite image and second image coding features of the target satellite image; inputting the text coding features, the first image coding features and the second image coding features into the feature fusion module to obtain target fusion features; and inputting the target fusion features and the historical track sequence into the autoregressive decoder to obtain the target tracking result.
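The four-stage pipeline above (natural language encoder, visual encoder, feature fusion, autoregressive decoder) can be sketched as a thin wrapper, assuming the four sub-modules are supplied externally; every name and signature here is hypothetical and for illustration only.

```python
import torch
import torch.nn as nn


class JointTracker(nn.Module):
    """Hypothetical wrapper for the joint positioning and sequence
    tracking pipeline: encoders -> feature fusion -> decoder. The
    four sub-modules are injected, not defined by this sketch."""

    def __init__(self, text_encoder, visual_encoder, fusion, decoder):
        super().__init__()
        self.text_encoder = text_encoder
        self.visual_encoder = visual_encoder
        self.fusion = fusion
        self.decoder = decoder

    def forward(self, satellite_img, text, target_img, track_seq):
        txt = self.text_encoder(text)            # text coding features
        f1 = self.visual_encoder(satellite_img)  # first image coding features
        f2 = self.visual_encoder(target_img)     # second image coding features
        fused = self.fusion(txt, f1, f2)         # target fusion features
        return self.decoder(track_seq, fused)    # target tracking result
```

Note that the same visual encoder is applied to both the full satellite image and the cropped target image, as the method describes a single visual encoder producing both sets of image coding features.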
According to the target tracking method provided by the application, the feature fusion module comprises a first linear projection unit, a second linear projection unit and a multi-source association fusion unit. Inputting the text coding features, the first image coding features and the second image coding features into the feature fusion module to obtain target fusion features comprises: inputting the text coding features into the first linear projection unit and linearly projecting them to obtain target text coding features; and inputting the first image coding features and the second image coding features into the second linear projection unit and respectively linearly projecting them to obtain a corresponding first target image coding feature and a corresponding second target image coding feature.
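The projection-and-splice stage of the feature fusion module can be sketched as follows. This is a simplified illustration: it builds only the second spliced feature (text, target image, full image) and assumes the multi-source association fusion unit is a self-attention layer, which the patent does not specify; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Hypothetical sketch of the feature fusion module: two linear
    projections to a shared dimension, splicing (concatenation) along
    the token axis, and an assumed attention-based fusion unit."""

    def __init__(self, text_dim, img_dim, dim=256, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)  # first linear projection unit
        self.img_proj = nn.Linear(img_dim, dim)    # second linear projection unit
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, sat_feat, tgt_feat):
        t = self.text_proj(txt)        # target text coding features
        f1 = self.img_proj(sat_feat)   # first target image coding features
        f2 = self.img_proj(tgt_feat)   # second target image coding features
        # Splice text, target-image and full-image features, then let
        # the fusion unit attend across the combined token sequence.
        x = torch.cat([t, f2, f1], dim=1)
        out, _ = self.fuse(x, x, x)
        return out
```

Projecting all three feature sets to the same dimension before splicing is what makes the concatenation along the token axis well-formed, matching the claim's requirement that the three projected features have identical dimensions.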