CN-115546252-B - Target tracking method and system based on cross-correlation matching enhanced twin network

CN115546252B

Abstract

The invention discloses a target tracking method and system based on a cross-correlation matching enhanced twin network. The method comprises: cropping an acquired video sequence of the target to be tracked to obtain a template image and a search image for each frame image; inputting the template image and the search image into a constructed and trained cross-correlation matching enhanced twin network, which extracts template features and search features from the template image and the search image, performs cross-correlation matching on the template features and the search features to obtain cross-correlation features, encodes the bounding box information of the template image to obtain bounding box encoding features, and performs classification and regression calculations on the fusion features of the cross-correlation features and the bounding box encoding features to obtain a corresponding classification score map and regression prediction map; and combining the position with the maximum response value in the classification score map with the offsets of the regression prediction map to obtain the final position of the target on the video sequence frame. The invention adapts well to tracking under complex scene changes and achieves high precision.

Inventors

  • HU ZHAOHUA
  • LIU HAONAN
  • LIN XIAO
  • WANG YING

Assignees

  • Nanjing University of Information Science and Technology (南京信息工程大学)

Dates

Publication Date
2026-05-05
Application Date
2022-10-31

Claims (9)

  1. A target tracking method based on a cross-correlation matching enhanced twin network, the method comprising: cropping the acquired video sequence of the target to be tracked to obtain a template image and a search image for each frame image; inputting the template image and the search image into a constructed and trained cross-correlation matching enhanced twin network, extracting template features and search features of the template image and the search image through the cross-correlation matching enhanced twin network, performing cross-correlation matching on the template features and the search features to obtain cross-correlation features, performing bounding box information encoding on the template image to obtain bounding box encoding features, and performing classification calculation and regression calculation on fusion features of the cross-correlation features and the bounding box encoding features to obtain a corresponding classification score map and regression prediction map; and obtaining the final position of the target on the video sequence frame from the position with the maximum response value in the classification score map combined with the offset of the regression prediction map; wherein the cross-correlation matching enhanced twin network comprises a feature extraction network, a cross-correlation matching network, a classification-regression network and a bounding box encoding module; the cross-correlation matching network comprises a lateral scale extraction module, a longitudinal scale extraction module and a cascade dual cross-correlation module; the cascade dual cross-correlation module is used for performing a pixel-matching cross-correlation operation on the template features and the search features and then performing a depth-separable cross-correlation operation with the template features to obtain the cascade dual cross-correlation feature; denoting the template features by Z and the search features by X, the calculation formula is: F_pm = φ_pm(Z, X), F_cd = φ_dw(F_pm, Z); (1); where F_pm denotes the pixel-matching cross-correlation feature of the template features Z and the search features X, φ_pm denotes pixel-matching cross-correlation, F_cd denotes the cascade dual cross-correlation feature, and φ_dw denotes depth-separable cross-correlation; the lateral scale extraction module is used for extracting the lateral-scale branch feature of the template features and the search features, with the calculation formula: F_h = K_{7×3} ∗ F; (2); where F denotes the feature being processed, K_{3×3} denotes a conventional 3×3 convolution kernel, K_{7×3} denotes the lateral convolution kernel of size 7×3 obtained from K_{3×3} through a lateral 3:1 convolution expansion, F_h denotes the lateral-scale branch feature, and ∗ denotes the convolution operation; the longitudinal scale extraction module is used for extracting the longitudinal-scale branch feature of the template features and the search features, with the calculation formula: F_v = K_{3×7} ∗ F; (3); where K_{3×7} denotes the longitudinal convolution kernel of size 3×7 obtained from K_{3×3} through a longitudinal 1:3 convolution expansion, and F_v denotes the longitudinal-scale branch feature.
  2. The target tracking method based on the cross-correlation matching enhanced twin network according to claim 1, wherein the construction and training process of the cross-correlation matching enhanced twin network comprises: acquiring a target video sequence frame data set, and cropping each frame image in the data set according to the target position and the image size to obtain template images and search images of all frame images as a training sample set; constructing the cross-correlation matching enhanced twin network, wherein the feature extraction network is an improved ResNet deep residual network, the improvement comprising removing the fifth convolution layer of the original ResNet deep residual network, setting the convolution stride of the third and fourth layers to 1, and setting the dilated convolution sizes of the third and fourth layers to 4; and training the constructed cross-correlation matching enhanced twin network on the training sample set to obtain the trained cross-correlation matching enhanced twin network.
  3. The target tracking method based on the cross-correlation matching enhanced twin network according to claim 2, wherein cropping each frame image in the data set comprises: cropping the first frame image of the target video sequence, centered on the target, into a template image of size 127×127×3; and, starting from the second frame, cropping each subsequent frame image of the target video sequence, centered on the target, into a search image of size 255×255×3.
  4. The target tracking method based on the cross-correlation matching enhanced twin network according to claim 1, wherein the cross-correlation feature F output by the cross-correlation matching network is calculated from the cascade dual cross-correlation feature F_cd, the lateral-scale branch feature F_h and the longitudinal-scale branch feature F_v as follows: F = α·F_cd + β·F_h + γ·F_v; (4); where α, β and γ denote the fusion coefficients of F_cd, F_h and F_v respectively, and the values of α, β and γ are optimized during network training.
  5. The target tracking method based on the cross-correlation matching enhanced twin network according to claim 4, wherein the bounding box encoding module comprises a plurality of fully connected layers, and the bounding box encoding module encoding the bounding box information of the template image comprises: converting the target bounding box coordinates of the template image into a one-dimensional feature vector B = (x_0, y_0, w, h), where (x_0, y_0) denotes the corner coordinates of the target bounding box, w denotes the width of the target bounding box, and h denotes the height of the target bounding box; and performing vector dimension encoding on the feature vector B through the plurality of fully connected layers to obtain the bounding box encoding feature F_b, expressed as: F_b = FC(B); (5); where FC denotes the fully connected layer structure and F_b denotes the output feature of the feature vector B through the fully connected layers.
  6. The target tracking method based on the cross-correlation matching enhanced twin network according to claim 5, wherein extracting the fusion feature of the cross-correlation feature and the bounding box encoding feature comprises: performing a broadcast addition operation on the cross-correlation feature F and the bounding box encoding feature F_b to obtain the preliminary fusion feature F_0: F_0 = F ⊕ F_b; (6); and performing 1×1 convolutional encoding on the preliminary fusion feature F_0 to obtain the fusion feature F_fuse: F_fuse = Conv_{1×1}(F_0); (7); where ⊕ denotes broadcast addition and Conv_{1×1} denotes the convolutional encoding operation.
  7. The target tracking method based on the cross-correlation matching enhanced twin network according to any one of claims 2 to 6, wherein training the constructed cross-correlation matching enhanced twin network on the training sample set comprises: randomly extracting paired template images and search images from the training sample set as the inputs of the two branches of the cross-correlation matching enhanced twin network; and performing gradient back-propagation using the SGD stochastic gradient descent method with momentum to optimize the network parameters until the joint task loss function converges, the joint task loss function being calculated as: L = λ_1·L_cls + λ_2·L_reg; (8); where L_cls denotes the binary cross-entropy loss function, L_reg denotes the IoU loss function, and λ_1 and λ_2 denote the weights of L_cls and L_reg respectively.
  8. A target tracking system based on a cross-correlation matching enhanced twin network, the system comprising: a cropping module for cropping the acquired video sequence of the target to be tracked to obtain a template image and a search image for each frame image; a classification prediction module for inputting the template image and the search image into a constructed and trained cross-correlation matching enhanced twin network, extracting template features and search features of the template image and the search image through the cross-correlation matching enhanced twin network, performing cross-correlation matching on the template features and the search features to obtain cross-correlation features, performing bounding box information encoding on the template image to obtain bounding box encoding features, and performing classification calculation and regression calculation on fusion features of the cross-correlation features and the bounding box encoding features to obtain a corresponding classification score map and regression prediction map; and a target position obtaining module for obtaining the final position of the target on the video sequence frame from the position with the maximum response value in the classification score map combined with the offset of the regression prediction map; wherein the cross-correlation matching enhanced twin network comprises a feature extraction network, a cross-correlation matching network, a classification-regression network and a bounding box encoding module; the cross-correlation matching network comprises a lateral scale extraction module, a longitudinal scale extraction module and a cascade dual cross-correlation module; the cascade dual cross-correlation module is used for performing a pixel-matching cross-correlation operation on the template features and the search features and then performing a depth-separable cross-correlation operation with the template features to obtain the cascade dual cross-correlation feature; denoting the template features by Z and the search features by X, the calculation formula is: F_pm = φ_pm(Z, X), F_cd = φ_dw(F_pm, Z); (1); where F_pm denotes the pixel-matching cross-correlation feature of the template features Z and the search features X, φ_pm denotes pixel-matching cross-correlation, F_cd denotes the cascade dual cross-correlation feature, and φ_dw denotes depth-separable cross-correlation; the lateral scale extraction module is used for extracting the lateral-scale branch feature of the template features and the search features, with the calculation formula: F_h = K_{7×3} ∗ F; (2); where F denotes the feature being processed, K_{3×3} denotes a conventional 3×3 convolution kernel, K_{7×3} denotes the lateral convolution kernel of size 7×3 obtained from K_{3×3} through a lateral 3:1 convolution expansion, F_h denotes the lateral-scale branch feature, and ∗ denotes the convolution operation; the longitudinal scale extraction module is used for extracting the longitudinal-scale branch feature of the template features and the search features, with the calculation formula: F_v = K_{3×7} ∗ F; (3); where K_{3×7} denotes the longitudinal convolution kernel of size 3×7 obtained from K_{3×3} through a longitudinal 1:3 convolution expansion, and F_v denotes the longitudinal-scale branch feature.
  9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the target tracking method based on a cross-correlation matching enhanced twin network according to any one of claims 1 to 7.
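The cascade dual cross-correlation of claims 1 and 8 (a pixel-matching cross-correlation followed by a depth-separable cross-correlation) can be sketched in NumPy as below. This is a minimal illustration, not the patented implementation: the function names and feature shapes are assumptions, and in practice the pixel-matching output would typically be projected back to the template's channel count (e.g. by a 1×1 convolution) before the depth-separable step.

```python
import numpy as np

def pixel_matching_xcorr(z, x):
    """Pixel-matching cross-correlation: each spatial position of the
    template feature map acts as a 1x1 query over the search features.
    z: template features (C, Hz, Wz); x: search features (C, Hx, Wx).
    Returns a response map of shape (Hz*Wz, Hx, Wx)."""
    C, Hz, Wz = z.shape
    _, Hx, Wx = x.shape
    zq = z.reshape(C, Hz * Wz).T            # (Hz*Wz, C) pixel queries
    xk = x.reshape(C, Hx * Wx)              # (C, Hx*Wx)
    return (zq @ xk).reshape(Hz * Wz, Hx, Wx)

def depthwise_xcorr(feat, kernel):
    """Depth-separable (channel-wise) cross-correlation: each channel of
    `kernel` slides over the matching channel of `feat` (valid mode).
    feat: (C, H, W); kernel: (C, h, w). Returns (C, H-h+1, W-w+1)."""
    C, H, W = feat.shape
    _, h, w = kernel.shape
    out = np.empty((C, H - h + 1, W - w + 1))
    for c in range(C):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                out[c, i, j] = np.sum(feat[c, i:i + h, j:j + w] * kernel[c])
    return out
```

In the cascade of equation (1), `pixel_matching_xcorr(z, x)` would produce F_pm, which (after channel alignment) is cross-correlated depthwise with the template features to give F_cd.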
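The branch fusion of equation (4), the bounding-box encoding of equation (5), and the feature fusion of equations (6)–(7) amount to a weighted sum, a small stack of fully connected layers, and a broadcast addition followed by per-pixel channel mixing. The NumPy sketch below illustrates this chain; the shapes, ReLU activations, random weights, and coefficient values are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Equation (4): weighted fusion of the three correlation branches.
# alpha, beta, gamma are learned during training; fixed here for the sketch.
F_cd, F_h, F_v = (rng.standard_normal((64, 25, 25)) for _ in range(3))
alpha, beta, gamma = 0.5, 0.3, 0.2
F = alpha * F_cd + beta * F_h + gamma * F_v

# Equation (5): the 4-vector (x0, y0, w, h) is lifted by fully connected
# layers to the channel dimension of F.
def mlp_encode(box, weights):
    h = box
    for W, b in weights:
        h = np.maximum(W @ h + b, 0.0)   # fully connected layer + ReLU
    return h

weights = [(rng.standard_normal((32, 4)) * 0.1, np.zeros(32)),
           (rng.standard_normal((64, 32)) * 0.1, np.zeros(64))]
F_b = mlp_encode(np.array([10.0, 20.0, 40.0, 30.0]), weights)  # shape (64,)

# Equations (6)-(7): broadcast addition over spatial dims, then a 1x1
# convolution, which is just a channel-mixing matrix applied per pixel.
F0 = F + F_b[:, None, None]
W1 = rng.standard_normal((64, 64)) * 0.1
F_fuse = np.tensordot(W1, F0, axes=([1], [0]))
```

The fused feature `F_fuse` is what the classification and regression heads would consume.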
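The final localization step of claims 1 and 8 (take the maximum-response position on the classification score map and combine it with the regression offsets at that position) can be sketched as follows. The stride, search-image size, and the (l, t, r, b) center-distance offset convention are assumptions borrowed from typical anchor-free twin trackers, not details stated in the patent.

```python
import numpy as np

def locate_target(cls_score, reg_pred, stride=8, search_size=255):
    """Pick the max-response cell on the classification score map and read
    the four center-distance offsets (l, t, r, b) from the regression map
    at that cell to form the final box in search-image coordinates.
    cls_score: (H, W); reg_pred: (4, H, W). Returns (x1, y1, x2, y2)."""
    H, W = cls_score.shape
    i, j = np.unravel_index(np.argmax(cls_score), cls_score.shape)
    # map the score-map cell back to search-image pixel coordinates
    cx = j * stride + (search_size - (W - 1) * stride) // 2
    cy = i * stride + (search_size - (H - 1) * stride) // 2
    l, t, r, b = reg_pred[:, i, j]
    return (cx - l, cy - t, cx + r, cy + b)
```

For a 25×25 score map with stride 8 on a 255×255 search image, the center cell (12, 12) maps back to pixel (127, 127), the center of the search image.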

Description

Target tracking method and system based on cross-correlation matching enhanced twin network

Technical Field

The invention relates to the technical field of computer vision and target tracking, and in particular to a target tracking method and system based on a cross-correlation matching enhanced twin network.

Background

Target tracking is a basic and challenging task in computer vision and has been one of its most active research subjects in recent decades. It is defined as keeping accurate track of a target in every subsequent frame of a video sequence given only the target's position in the initial frame. Target tracking is widely applied in automatic driving, video surveillance, marine exploration, medical imaging and other fields, and therefore draws attention from both academia and industry. Traditional target tracking methods based on correlation filtering have low robustness. The offline-training, online-tracking paradigm based on the twin (Siamese) network achieves a good balance between tracking precision and inference speed: the twin network performs similarity learning to estimate the most likely position of the target in the next frame, overcoming the shortcomings of the traditional methods.
As a representative twin-network-based target tracking method, SiamFC introduced the cross-correlation structure and truly achieved a balance of speed and precision. SiamRPN improved SiamFC's cross-correlation mode and introduced a region proposal network, making regression more accurate; however, the cross-correlation mode adopted by SiamRPN produces a very large number of parameters, making the network difficult to train and optimize as a whole. SiamRPN++ introduced deep neural networks into the twin tracking network, greatly improving tracking performance, and adopted depth-separable cross-correlation, which removes a large number of parameters and stabilizes the whole training process. However, every existing cross-correlation mode is in essence still a fixed-size sliding-window convolution between two feature maps, so when the object deforms greatly or the target region is relatively small, the cross-correlation introduces a large amount of background information that interferes with tracking of the target object. SiamBAN addresses the problems brought by anchor boxes by directly predicting the foreground-background score and four center-distance offsets on the output feature map to obtain the predicted box at the maximum-response position, reducing the hyperparameters to be tuned; nevertheless, it still does not make full use of available prior information and cannot adapt well to large changes of the target.
Disclosure of Invention

The invention aims to overcome the above defects in the prior art and provides a target tracking method and system based on a cross-correlation matching enhanced twin network. It solves the technical problems of mainstream twin-network-based target tracking methods in the prior art, namely that prior information is not fully utilized and that simple cross-correlation matching brings feature ambiguity; it can reduce irrelevant background and interference information and improve the discrimination capability of the tracking network, making the target position more accurate. To achieve the above purpose, the invention adopts the following technical scheme. In a first aspect, the invention provides a target tracking method based on a cross-correlation matching enhanced twin network, the method comprising: cropping the acquired video sequence of the target to be tracked to obtain a template image and a search image for each frame image; inputting the template image and the search image into a constructed and trained cross-correlation matching enhanced twin network, extracting template features and search features of the template image and the search image through the cross-correlation matching enhanced twin network, performing cross-correlation matching on the template features and the search features to obtain cross-correlation features, performing bounding box information encoding on the template image to obtain bounding box encoding features, and performing classification calculation and regression calculation on fusion features of the cross-correlation features and the bounding box encoding features to obtain a corresponding classification score map and regression prediction map; and obtaining the final position of the target on the video sequence frame from the position with the maximum response value in the classification score map combined with the offset of the regression prediction map. With reference to the first aspect, preferably, the construction and training process of the cross-correlation matching enhanced twin network includes: acquiring a target video sequence frame data set, cu