CN-121982058-A - Thermal infrared target tracking method based on efficient fine adjustment of parameters

Abstract

The invention discloses a thermal infrared target tracking method based on parameter-efficient fine-tuning, comprising the following steps: step 1, obtaining a visible light data set and a thermal infrared data set, and dividing the obtained data sets into a training set and a test set; step 2, training a visible light tracker as a base model; step 3, inserting a cross-scale representation adapter into each encoder layer of the visible light tracker serving as the base model, and constructing a thermal infrared target tracking network; step 4, using a parameter-efficient fine-tuning algorithm as the adapter; step 5, fine-tuning the network parameters on the thermal infrared data set, testing on the test set after the loss stabilizes, and retaining the optimal weights. The invention significantly improves thermal infrared target tracking performance on a plurality of public benchmark data sets.

Inventors

LIU JIAHANG
XU YAN
DONG YI

Assignees

Nanjing University of Aeronautics and Astronautics (南京航空航天大学)

Dates

Publication Date: 20260505
Application Date: 20251215

Claims (7)

1. A thermal infrared target tracking method based on parameter-efficient fine-tuning, characterized by comprising the following steps: Step 1, acquiring a visible light data set and a thermal infrared data set, and dividing the acquired data sets into a training set and a test set; Step 2, training a visible light tracker as a base model; Step 3, taking the visible light tracker as the base model and inserting a cross-scale representation adapter into each encoder layer of the visible light tracker, so that general knowledge of visible light target tracking is transferred to the thermal infrared target tracking task, thereby constructing a thermal infrared target tracking network; Step 4, using a parameter-efficient fine-tuning algorithm as the adapter; and Step 5, fine-tuning the network on the thermal infrared data set, iteratively optimizing the network parameters, testing on the test set after the loss stabilizes, and retaining the optimal weights.
2. The thermal infrared target tracking method based on parameter-efficient fine-tuning according to claim 1, wherein in step 2, a visible light tracker is trained as the base model; the base version of the vanilla Vision Transformer (ViT-Base) pre-trained with MAE is adopted as the backbone network for joint feature extraction and relation modeling; the network head is a lightweight fully convolutional network (FCN) with three output branches, namely a classification score map branch, a local offset branch and a normalized bounding box size branch, each branch consisting of 4 stacked convolution-batch normalization-rectified linear unit (Conv-BN-ReLU) layers; the training sets of COCO, LaSOT, GOT-10k and TrackingNet are used in the training process, these data sets being complementary and covering the core scenes and target types of the tracking task; and a basic and effective conventional lightweight data augmentation strategy is adopted, which comprises two core operations: horizontal flipping, namely random horizontal mirror flipping of the image, and brightness jittering, namely randomly increasing or decreasing the brightness of the image.
3. The thermal infrared target tracking method based on parameter-efficient fine-tuning according to claim 1, wherein in step 3, the thermal infrared target tracking network accepts two input images, a template frame and a search frame, and outputs the bounding box coordinates of the target in the search frame; let a video comprise […], with the first frame taken as the template frame […] and each subsequent frame taken as a search frame […]; both are first divided into […] and respectively mapped to image patch embeddings: […], wherein […] is a linear projection matrix and […] is a learnable position encoding; the two are then directly concatenated: […], and fed into the multi-layer ViT encoder; the self-attention mechanism in ViT not only models the interior of a single frame but also performs the matching between the template and the search frame, and the self-attention of the Transformer over the concatenated token sequence is written as: […], wherein […] represents the attention weight of the template tokens to the search tokens and […] represents the attention weight of the search tokens to the template tokens; the search frame tokens output by the Transformer are restored to a two-dimensional spatial structure, and three maps are output by the lightweight FCN: 1) a classification map […], predicting the probability that each location is the target center; 2) a local offset map […], compensating for the grid quantization error; and 3) a target size map […], predicting the width and height of the target; the final target box is determined by the following formula: […], wherein […] is the highest response point in the classification map; the training loss is a weighted sum of the classification loss (Focal Loss) and the regression loss (L1 loss + GIoU loss).
4. The thermal infrared target tracking method based on parameter-efficient fine-tuning according to claim 3, wherein a cross-scale representation adapter is inserted into each Transformer encoder layer; when processing the features of the template and search branches, the cross-scale representation adapter first reshapes the dimension-reduced one-dimensional tokens into a two-dimensional feature map to restore the spatial topology, and then deploys depthwise convolution kernels with multiple receptive field sizes on the feature map in parallel to capture, respectively, the fine-grained local thermal radiation cues and the coarse-grained global context structure of the TIR target; the multi-scale features are fused by weighted summation, thereby realizing target feature modeling at different scales.
5. The thermal infrared target tracking method based on parameter-efficient fine-tuning according to claim 1, wherein in step 4, the adapter adopts two branches with identical architecture, namely a template branch and a search branch, which share all parameters; the concatenated input sequence […] is split into template tokens […] and search tokens […], which are then sent to the corresponding branches for processing; each branch obtains the corresponding token sequence through a series of transformations: first, the input sequence passes through a fully connected down-projection layer […], which compresses the high-dimensional representation into a compact hidden space, reducing the amount of computation while focusing on task-relevant adaptation; the dimension-reduced tokens are reshaped into a two-dimensional feature map according to the spatial arrangement of the original image patch grid; parallel depthwise convolution kernels with multiple receptive field sizes are applied on the restored spatial topology to explicitly encode spatial inductive bias and capture multi-scale spatial patterns; the outputs of these convolution paths are aggregated by weighted summation to realize multi-scale feature fusion that balances local detail restoration and global context consistency; a […] pointwise convolution then refines the fused features and recalibrates the relations among channels; the fused features subsequently pass through a GELU activation function and an up-projection layer […], which restores the transformed features to the original dimension; and a plurality of residual connections are added within the module to maintain input consistency and stabilize the optimization process; in each Transformer encoder layer, the adapter runs as an independent branch in parallel with the original multi-head self-attention layer and feed-forward layer, and the final output of each layer is calculated as follows: […], wherein […] is the multi-head self-attention operation, […] is the feed-forward layer, and […] is the cross-scale representation adapter mapping; during parameter-efficient fine-tuning, the following condition is satisfied: […]; by optimizing only […] on the thermal infrared training data, cross-modal adaptation from the visible light pre-trained model to the thermal infrared target tracking task can be realized, so that tracking performance is maintained or improved while the number of trainable parameters is significantly reduced.
6. The thermal infrared target tracking method based on parameter-efficient fine-tuning according to claim 5, wherein in step 4, the set of convolution kernel sizes […] at least comprises: […]; each convolution branch is decomposed using a "depthwise convolution + pointwise convolution" structure, that is, for each branch […], a channel-independent depthwise convolution is performed first: […], followed by a channel-mixing pointwise convolution: […]; while keeping the number of parameters and the amount of computation under control, this enhances the capability of modeling the local thermal radiation structure of targets with weak texture, low contrast and scale variation in thermal infrared images, thereby improving the robustness of thermal infrared target tracking.
7. The thermal infrared target tracking method based on parameter-efficient fine-tuning according to claim 1, wherein in step 5, classification loss and regression loss are used simultaneously in the training phase; a weighted focal loss is used for the classification task, and the L1 loss and the generalized intersection-over-union (GIoU) loss of the predicted bounding box are combined for bounding box regression; the overall loss function is as follows: […], wherein the hyperparameters are determined by grid search and are set to […], […] in all experiments.
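
The cross-scale representation adapter recited in claims 4 to 6 can be illustrated with the following minimal pure-Python sketch. It is not the patented implementation: the function names, the parameter dictionary layout, the kernel sizes (1, 3, 5), the uniform fusion weights and the zero-initialized up-projection are all assumptions made for illustration. The per-branch pointwise convolution of claim 6 is omitted for brevity, and only one residual connection is shown.

```python
import math
import random

def gelu(x):
    # Exact GELU via the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(tokens, weight, bias):
    # tokens: N x d_in, weight: d_in x d_out, bias: d_out.
    return [[sum(t[i] * weight[i][j] for i in range(len(t))) + bias[j]
             for j in range(len(bias))] for t in tokens]

def depthwise_conv(fmap, kernels):
    # fmap: h x w x c map; kernels: one k x k kernel per channel; zero padding.
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                acc = 0.0
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += fmap[yy][xx][ch] * kernels[ch][dy][dx]
                out[y][x][ch] = acc
    return out

def cross_scale_adapter(tokens, grid_hw, p):
    h, w = grid_hw
    r = len(p["b_down"])
    z = linear(tokens, p["w_down"], p["b_down"])            # down-projection D -> r
    fmap = [[z[y * w + x] for x in range(w)] for y in range(h)]  # tokens -> 2-D map
    fused = [[[0.0] * r for _ in range(w)] for _ in range(h)]
    for alpha, kernels in zip(p["alphas"], p["dw_kernels"]):
        branch = depthwise_conv(fmap, kernels)              # one receptive field size
        for y in range(h):
            for x in range(w):
                for ch in range(r):
                    fused[y][x][ch] += alpha * branch[y][x][ch]  # weighted summation
    flat = [fused[y][x] for y in range(h) for x in range(w)]     # 2-D map -> tokens
    mixed = linear(flat, p["w_point"], p["b_point"])        # pointwise (1x1) convolution
    act = [[gelu(v) for v in t] for t in mixed]
    up = linear(act, p["w_up"], p["b_up"])                  # up-projection r -> D
    return [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, up)]  # residual

def init_params(dim, hidden, kernel_sizes, seed=0):
    # Hypothetical initialization; the up-projection starts at zero (an assumption).
    rng = random.Random(seed)
    def mat(m, n):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
    return {
        "w_down": mat(dim, hidden), "b_down": [0.0] * hidden,
        "dw_kernels": [[mat(k, k) for _ in range(hidden)] for k in kernel_sizes],
        "alphas": [1.0 / len(kernel_sizes)] * len(kernel_sizes),
        "w_point": mat(hidden, hidden), "b_point": [0.0] * hidden,
        "w_up": [[0.0] * dim for _ in range(hidden)], "b_up": [0.0] * dim,
    }

# Template and search tokens are processed separately with shared parameters.
params = init_params(dim=8, hidden=4, kernel_sizes=(1, 3, 5))
template_tokens = [[float(i + j) for j in range(8)] for i in range(4)]  # 2 x 2 grid
search_tokens = [[float(i - j) for j in range(8)] for i in range(6)]    # 2 x 3 grid
template_out = cross_scale_adapter(template_tokens, (2, 2), params)
search_out = cross_scale_adapter(search_tokens, (2, 3), params)
```

With the zero-initialized up-projection assumed here, the adapter initially returns its input unchanged, so the frozen tracker's behavior is preserved at the start of fine-tuning; the claims do not state which initialization is used.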

Description

Thermal infrared target tracking method based on efficient fine adjustment of parameters

Technical Field

The invention relates to the technical field of computer vision, in particular to a thermal infrared target tracking method based on parameter-efficient fine-tuning.

Background

Thermal infrared (TIR) target tracking, by virtue of its all-weather imaging capability, has irreplaceable application value in key fields such as military reconnaissance, wildlife observation and ecological monitoring. However, compared with the technically mature visible light (RGB) target tracking, the development of TIR tracking has long been limited by the lack of large-scale, high-quality annotated data sets; a deep model trained only on TIR data struggles to learn robust and generalizable feature representations, and its performance tends to drop sharply in complex scenes such as target occlusion and bad weather. To break through this data bottleneck, transferring the knowledge of an RGB pre-trained model to the TIR tracking task has become the mainstream solution. Conventional full fine-tuning (FFT) can improve TIR tracking performance, but it incurs huge computation and storage costs, is prone to overfitting on small data sets, and suffers from catastrophic forgetting. For this reason, parameter-efficient fine-tuning (PEFT) techniques have been developed, which adapt to downstream tasks by freezing most of the parameters of the pre-trained backbone network and training only a small number of newly added parameters, effectively balancing computational efficiency and model performance; among them, adapter fine-tuning has become the dominant solution in visual tasks owing to its light weight and flexibility.
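
As a rough illustration of the parameter budget behind PEFT, the following Python sketch compares the frozen weights of a ViT-Base encoder (12 layers, width 768) with the trainable weights of a plain linear bottleneck adapter added to every layer. The bottleneck width of 64 and the helper names are hypothetical and are not taken from this document; the count covers the encoder blocks only (patch embedding, position embeddings and the tracking head are ignored), and the adapter counted here is a generic one rather than the adapter proposed by the invention.

```python
def encoder_layer_params(dim, mlp_ratio=4):
    # Q, K, V and output projections, each dim x dim with a bias.
    attn = 4 * (dim * dim + dim)
    hidden = mlp_ratio * dim
    # Two linear layers of the feed-forward block, with biases.
    ffn = dim * hidden + hidden + hidden * dim + dim
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * dim
    return attn + ffn + norms

def bottleneck_adapter_params(dim, bottleneck):
    # Down-projection and up-projection, each with a bias.
    return dim * bottleneck + bottleneck + bottleneck * dim + dim

DIM, DEPTH = 768, 12   # ViT-Base encoder width and depth
BOTTLENECK = 64        # hypothetical adapter width, chosen for illustration

frozen = DEPTH * encoder_layer_params(DIM)
trainable = DEPTH * bottleneck_adapter_params(DIM, BOTTLENECK)
fraction = trainable / (frozen + trainable)
print(f"frozen={frozen:,} trainable={trainable:,} fraction={fraction:.2%}")
```

Under these assumptions the trainable share stays at a few percent of the encoder, which is the efficiency argument made for PEFT in contrast to full fine-tuning.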
However, conventional visual adapters inherit the linear bottleneck architecture from the field of natural language processing and are not specifically optimized for the modal differences between TIR and RGB data. On the one hand, the linear structure lacks spatial inductive bias and cannot effectively calibrate modality-specific characteristics of TIR data, such as its characteristic thermal radiation contrast and spatial energy distribution. On the other hand, a single linear compression operation can hardly realize multi-scale feature modeling; it cannot adapt to the scale diversity of TIR targets caused by changes in distance and pose, and it struggles to preserve thermal structure details and accurate target boundaries, which restricts the effectiveness of cross-modal transfer and the robustness of TIR tracking. Together, these issues motivate the study of a novel parameter-efficient cross-modal adaptation architecture for TIR tracking.

Disclosure of Invention

The invention aims to provide a thermal infrared target tracking method based on parameter-efficient fine-tuning, which can significantly improve thermal infrared target tracking performance on a plurality of public benchmark data sets.
The technical scheme is as follows: a thermal infrared target tracking method based on parameter-efficient fine-tuning comprises the following steps: Step 1, acquiring a visible light data set and a thermal infrared data set, and dividing the acquired data sets into a training set and a test set; Step 2, training a visible light tracker as a base model; Step 3, taking the visible light tracker as the base model and inserting a cross-scale representation adapter into each encoder layer of the visible light tracker, so that general knowledge of visible light target tracking is transferred to the thermal infrared target tracking task, thereby constructing a thermal infrared target tracking network; Step 4, using a parameter-efficient fine-tuning algorithm as the adapter; and Step 5, fine-tuning the network on the thermal infrared data set, iteratively optimizing the network parameters, testing on the test set after the loss stabilizes, and retaining the optimal weights.

Further, in step 2, a visible light tracker is trained as the base model; the base version of the vanilla Vision Transformer (ViT-Base) pre-trained with MAE is adopted as the backbone network for joint feature extraction and relation modeling; the network head is a lightweight fully convolutional network (FCN) with three output branches, namely a classification score map branch, a local offset branch and a normalized bounding box size branch, each branch consisting of 4 stacked convolution-batch normalization-rectified linear unit (Conv-BN-ReLU) layers; the training sets of COCO, LaSOT, GOT-10k and TrackingNet are used in the training process, these data sets being complementary and covering the core scenes and target types of the tracking task; to improve the generalization capability and robustness of the model, a basic and effective conventional lightweight data augmentation strategy is adopted, the core of which comprises two operations: horizontal flipping, nam