CN-116051601-B - Depth space-time associated video target tracking method and system
Abstract
The application discloses a depth space-time associated video target tracking method and system that perform target tracking over video sequences and obtain accurate tracking results. First, a space-time feature extractor is designed to extract the space-time features of the template sequence and the search sequence. Second, a feature matching module consisting of a classification branch and a regression branch is introduced: the extracted template and search space-time features are matched for similarity by correlation filtering, and each branch yields multi-channel correlation-filtering features. Then, a target tracking module comprising a classification head and a regression head computes a classification score map and a regression score map from the input multi-channel correlation-filtering features, used respectively to predict the target position and to estimate the target scale. Finally, the space-time associated visual tracking model is optimized by minimizing a defined joint loss. At test time, a confidence-region estimation strategy is proposed to maintain robust and accurate target tracking across the video sequence.
Inventors
- LIANG MIN
- GUI YAN
- LIU BINBIN
Assignees
- Changsha University of Science and Technology (长沙理工大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20221230
Claims (9)
- 1. A method for video object tracking with depth space-time association, the method being performed by a computer and comprising the following steps: S1, constructing a network architecture comprising a space-time feature extractor, a feature matching sub-network, and a target prediction sub-network; the architecture improves the model's detection and localization capability and yields more accurate video target tracking results, wherein: S11, the space-time feature extractor is based on a 3D twin (Siamese) network and comprises a template branch and a search branch; a 3D fully convolutional neural network serves as the backbone with shared weights, so that the extractor extracts template space-time features and search space-time features from an input template sequence block and search sequence block; S12, the feature matching sub-network consists of a classification branch and a regression branch, takes the template and search space-time features as inputs, and performs feature similarity matching via a correlation filtering operation to obtain multi-channel correlation-filtering features; S13, the target prediction sub-network comprises a classification head and a regression head that take the multi-channel correlation-filtering features as input and output a classification score map and a regression score map, respectively; S2, given template-sequence video frames and search-sequence video frames, cropping them into a template sequence block and a search sequence block that serve as inputs to the whole network architecture; S3, constructing the space-time feature extractor as a 3D twin fully convolutional network comprising a template branch and a search branch, with the 3D fully convolutional network as the backbone and shared weights; S4, constructing the feature matching sub-network comprising a classification branch and a regression branch, feeding the obtained template and search space-time features into the two branches, and performing feature similarity matching with the correlation filtering operation so that the classification branch and the regression branch each output multi-channel correlation-filtering features; S5, constructing the target prediction sub-network, which comprises a classification head and a regression head, and feeding the multi-channel correlation-filtering features output by the classification and regression branches into the corresponding heads to obtain a classification score map and a regression score map; S6, locating the target in each video frame of the sequence according to the classification score map, estimating the target scale of each frame according to the regression score map, and obtaining a target prediction box for each frame in the search sequence; S7, optimizing the network model by minimizing a joint loss comprising a classification cross-entropy loss and a regression IoU (intersection-over-union) loss, finally obtaining the video target tracker model; and S8, using the trained network model as a visual tracker to track targets sequence by sequence in a given video, and defining a confidence search-area estimation strategy that crops the search area of the next sequence according to the target states in the current video sequence, thereby reducing error accumulation, ensuring stable and accurate tracking, and precisely localizing the target in every video frame of the search sequence.
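Steps S1-S6 above can be traced as a shape walk-through; every concrete size below (block sizes, stride, channel count) is an assumed example for illustration rather than the patent's actual configuration.

```python
# Hedged shape walk-through of the claim-1 pipeline; all sizes are
# assumed examples, not the patent's actual configuration.
T = 8                                  # frames per sequence block
stride = 8                             # assumed total network stride
Hz = Wz = 128                          # template block size (assumed)
Hx = Wx = 256                          # search block size (assumed)
C = 128                                # feature channels (claim 3 mentions 128)

# S3: the shared 3D backbone keeps the temporal length T, downsamples space.
hz, wz = Hz // stride, Wz // stride    # template feature map: (T, 16, 16, C)
hx, wx = Hx // stride, Wx // stride    # search feature map:   (T, 32, 32, C)

# S4: "valid" correlation of the template feature over the search feature.
hc, wc = hx - hz + 1, wx - wz + 1      # correlation map: (T, 17, 17, C)

# S5-S6: the heads map the correlation features to score maps.
cls_shape = (T, hc, wc, 1)             # classification score map
reg_shape = (T, hc, wc, 4)             # regression map: (l, t, r, b) offsets
```

Under these assumptions, each search frame yields a 17×17 classification map whose peak localizes the target, and a matching 4-channel regression map for the box scale.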
- 2. The depth space-time associated video target tracking method according to claim 1, wherein the template sequence block and the search sequence block are constructed as follows: S21, given a template sequence, acquiring the target's center position, width, and height from the ground-truth information of each video frame in the template sequence, represented as (x, y, w, h); S211, calculating the expansion values of the target-box width and height from each ground-truth box given in S21, and calculating a scaling factor for scaling the expanded target-box area; if the expanded target-box area exceeds the boundary of the video frame, it is filled with the average RGB value of the current frame, and each video frame in the template sequence is finally cropped into a template block of fixed size; S212, cropping each video frame in the template sequence to obtain the template blocks, their number equaling the total number of video frames in the template sequence; S22, given a search sequence, acquiring the target's center position, width, and height from the ground-truth information of the first video frame of the template sequence, likewise represented as (x, y, w, h); S221, calculating the expansion values of the target-box width and height from the ground-truth box given in S22 and calculating a scaling factor; if the target-box area with the added expansion exceeds the boundary of the video frame, it is filled with the average RGB value of the current frame, and each video frame in the search sequence is finally cropped into a search block of fixed size; S222, cropping each video frame in the search sequence to obtain the search blocks, their number equaling the total number of video frames in the search sequence.
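A minimal sketch of the S211/S221 cropping, assuming a SiamFC-style context margin p = (w + h)/2 and a nearest-neighbour resize; the patent's exact expansion values, scaling factor, and output size were not preserved in this text, so `crop_block`, its margin, and the 127-pixel output are all illustrative assumptions.

```python
import numpy as np

def crop_block(frame, cx, cy, w, h, out_size):
    """Crop a square region centred on the target, padding out-of-frame
    pixels with the frame's mean RGB value, then resize by subsampling.
    The context margin p = (w + h) / 2 is an assumed SiamFC-style choice."""
    p = (w + h) / 2.0                       # assumed context margin
    side = int(np.sqrt((w + p) * (h + p)) + 0.5)
    mean_rgb = frame.reshape(-1, 3).mean(axis=0)
    H, W, _ = frame.shape
    patch = np.empty((side, side, 3))
    patch[:] = mean_rgb                     # mean-RGB fill for overflow
    x0, y0 = int(round(cx - side / 2)), int(round(cy - side / 2))
    sx0, sy0 = max(0, x0), max(0, y0)
    sx1, sy1 = min(W, x0 + side), min(H, y0 + side)
    patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = frame[sy0:sy1, sx0:sx1]
    idx = np.arange(out_size) * side // out_size
    return patch[idx][:, idx]               # nearest-neighbour resize

# Target near the top-right corner: the overflow is filled with mean RGB.
frame = np.zeros((240, 320, 3))
block = crop_block(frame, cx=300, cy=20, w=60, h=40, out_size=127)
```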
- 3. The depth space-time associated video target tracking method according to claim 2, wherein the space-time feature extractor is constructed as follows: S31, constructing the feature extraction network, wherein each branch is a Res3D network consisting of five residual blocks; S32, modifying the padding and stride attributes of the first residual block of Res3D, setting the output channels of the fourth residual block and the input channels of the fifth block to 128, and removing the downsampling and the final classification layer of the fifth residual block, so that the output space-time features have the same temporal length as the input video sequence; S33, inputting the template blocks and search blocks obtained in S212 and S222 into the space-time feature extractor to obtain the template space-time features and the search space-time features, respectively.
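The S32 design goal, output space-time features with the same temporal length as the input sequence, can be checked with the standard convolution output-length formula. The temporal kernel/stride/padding values below are assumptions, since the exact numbers were lost from this text.

```python
# Verifying that a 3D conv stack preserves the sequence length T when the
# temporal kernel is 3, stride 1, padding 1 in every block (assumed values).
def conv_out_len(L, kernel, stride, padding):
    """Standard convolution output-length formula."""
    return (L + 2 * padding - kernel) // stride + 1

T = 8
for _ in range(5):                      # five residual blocks (S31)
    T = conv_out_len(T, kernel=3, stride=1, padding=1)
# T is still 8: the temporal length of the input sequence is preserved.
```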
- 4. The depth space-time associated video target tracking method according to claim 3, wherein the feature matching sub-network is constructed as follows: S41, inputting the template space-time features F_Z and the search space-time features F_X obtained in S3 into the classification branch and the regression branch respectively, and performing the correlation filtering operation, computed as: R_cls = F_Z ⋆ F_X (1) and R_reg = F_Z ⋆ F_X (2), where the subscript cls denotes the classification branch, the subscript reg denotes the regression branch, and ⋆ denotes the correlation filter; S42, the classification branch and the regression branch respectively output the multi-channel correlation-filtering features R_cls and R_reg.
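A minimal numpy sketch of the matching operation in equations (1)-(2), implemented as depth-wise correlation, a common Siamese-tracker choice in which each template-feature channel is correlated with the matching search-feature channel so the response keeps all C channels; treating the patent's correlation filter this way is an assumption.

```python
import numpy as np

def depthwise_xcorr(z, x):
    """Depth-wise correlation of a template feature z over a search
    feature x: (th, tw, C) vs (sh, sw, C) -> (sh-th+1, sw-tw+1, C)."""
    th, tw, C = z.shape
    sh, sw, _ = x.shape
    out = np.empty((sh - th + 1, sw - tw + 1, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Per-channel inner product between template and window.
            out[i, j] = (z * x[i:i + th, j:j + tw]).sum(axis=(0, 1))
    return out

z = np.random.rand(4, 4, 16)    # template space-time feature (one frame)
x = np.random.rand(10, 10, 16)  # search space-time feature (one frame)
r_cls = depthwise_xcorr(z, x)   # classification-branch response
r_reg = depthwise_xcorr(z, x)   # regression-branch response
```

In practice the two branches would apply separate learned projections before correlating; here both reuse the raw features to keep the sketch minimal.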
- 5. The depth space-time associated video target tracking method according to claim 4, wherein the target prediction sub-network for video-sequence target tracking is constructed as follows: S51, the classification head is composed of one convolution layer; the multi-channel correlation-filtering features R_cls output by the classification branch in S42 are taken as the input of the classification head, which outputs a classification score map; S52, the regression head is likewise composed of one convolution layer; the multi-channel correlation-filtering features R_reg output by the regression branch in S42 are taken as the input of the regression head, which outputs a regression score map.
- 6. The depth space-time associated video target tracking method according to claim 1, wherein the target position is predicted and the bounding-box scale is estimated as follows: S61, finding the point (i, j) with the largest response value in the classification score map; its position in the original video frame is (x, y) = (j·s, i·s), where s is the total stride of the whole network; S62, the regression score map is a four-channel vector whose channels (l, t, r, b) represent the offsets of the regression target; the coordinate information of the target can then be expressed as (x1, y1) = (x − l, y − t) and (x2, y2) = (x + r, y + b) (3), where (x1, y1) and (x2, y2) are the upper-left and lower-right corner coordinates of the target prediction box.
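S61-S62 can be sketched as follows; the stride, map size, peak location, and offset values are made-up examples for illustration.

```python
import numpy as np

# Decode one prediction: argmax on the classification map gives the target
# location; the 4-channel regression map at that point gives the (l, t, r, b)
# offsets to the box edges, mapped back with the total network stride.
stride = 8                                     # assumed total stride
cls_map = np.zeros((17, 17))
cls_map[5, 9] = 1.0                            # peak response (example)
reg_map = np.zeros((17, 17, 4))
reg_map[5, 9] = [20.0, 12.0, 24.0, 16.0]       # (l, t, r, b) offsets (example)

i, j = np.unravel_index(np.argmax(cls_map), cls_map.shape)
x, y = j * stride, i * stride                  # position in the original frame
l, t, r, b = reg_map[i, j]
x1, y1, x2, y2 = x - l, y - t, x + r, y + b    # equation (3): box corners
```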
- 7. The depth space-time associated video target tracking method according to claim 1, wherein the visual tracking model is trained as follows: S71, the total training loss is defined as L = (1/N) Σ_i L_i (4), where L_i is the loss of the i-th search frame, N is the total number of classification score maps (equivalently, regression score maps), p_i denotes the probability that a position in the i-th search block belongs to the target, and d_i denotes the distances from a position in the i-th regression score map to the four sides of the bounding box; S72, the training loss L_i, comprising the classification cross-entropy loss and the regression IoU loss, is defined as L_i = L_cls(p_i) + 1{pos} · L_IoU(d_i, d*_i) (5), where 1{pos} is an indicator function denoting whether the current position belongs to the target: it is assigned 1 for a positive sample, i.e. the current position belongs to the target, and 0 for a negative sample; L_cls denotes the classification cross-entropy loss; L_IoU denotes the regression IoU loss; and d*_i denotes the offsets from the center position of the ground-truth target in the i-th search block to the four sides of its bounding box.
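A hedged sketch of the joint loss of equations (4)-(5) at a single location: cross-entropy on the classification probability plus an IoU term gated by the positive-sample indicator. The linear 1 − IoU form and the absence of extra weighting are assumptions.

```python
import numpy as np

def iou_loss(pred, gt):
    """IoU loss between two (l, t, r, b) offset vectors measured from the
    same anchor point; identical boxes give loss 0."""
    pl, pt, pr, pb = pred
    gl, gt_, gr, gb = gt
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt, gt_) + min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt_ + gb) - inter
    return 1.0 - inter / union

def joint_loss(p, y, pred_box, gt_box):
    """Per-location loss: binary cross-entropy always, IoU term only when
    the location is a positive sample (the indicator in equation (5))."""
    eps = 1e-9
    ce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + (iou_loss(pred_box, gt_box) if y == 1 else 0.0)

# Positive sample with a perfect box: only the cross-entropy term remains.
loss = joint_loss(0.9, 1, (10, 10, 10, 10), (10, 10, 10, 10))
```

Equation (4) then averages this per-location, per-frame loss over the N search frames of the sequence.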
- 8. The depth space-time associated video target tracking method according to claim 1, wherein the confidence search region is estimated as follows: S81, since the target may undergo large position changes within a video sequence, collecting the set of target prediction boxes of the current search sequence, where B_t is the target prediction box of the t-th frame in the search sequence, and computing the minimum bounding box enclosing them from the upper-left corner coordinates and lower-right corner coordinates of each target box; S82, expanding the minimum bounding box into the search area used to crop the next group of video sequences, guaranteeing that the search area covers the target in each video frame of the search sequence.
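The S81-S82 strategy can be sketched as follows: take the minimum bounding box over all predicted boxes of the current search sequence, then expand it to form the confidence search area for cropping the next sequence. The 50% expansion margin is an assumption, not the patent's value.

```python
def confidence_region(boxes, margin=0.5):
    """boxes: list of (x1, y1, x2, y2) predictions for one search sequence.
    Returns the expanded confidence search area for the next sequence."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)                  # minimum bounding box (S81)
    dx = (x2 - x1) * margin / 2                    # assumed expansion (S82)
    dy = (y2 - y1) * margin / 2
    return (x1 - dx, y1 - dy, x2 + dx, y2 + dy)

# Two predicted boxes drifting right and down across the sequence.
region = confidence_region([(10, 10, 30, 30), (20, 15, 50, 40)])
```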
- 9. A depth space-time associated video object tracking system, characterized in that it performs the steps of claim 1 and comprises the following modules: the image marking module, used for taking a given picture as input and marking random pixel points, including foreground and background marks, according to the picture's ground-truth label, so as to generate a large amount of interaction information simulating user interaction; the video sequence input module, used for taking a group of template-sequence video frames and search-sequence video frames and cropping them into template sequence blocks and search sequence blocks of specified sizes in the manner of S2; the model training module, used for training a video target tracker based on a 3D twin network, the tracker comprising a space-time feature extractor module, a feature matching module, and a target prediction module, wherein the space-time feature extractor takes the template sequence block and the search sequence block as inputs and extracts template and search space-time features from them; the space-time features are input to the feature matching module, which performs similarity matching using the correlation filtering operation to obtain multi-channel correlation-filtering features; these are input in turn to the classification head and the regression head in the target prediction module, finally yielding a classification score map and a regression score map; and the video target tracking module, used in the test stage for estimating the target state and predicting the scale in the video frames of the search sequence from the classification and regression score maps output by the model, so as to obtain the target prediction boxes in the search sequence, deriving a set of confidence search areas from the set of target prediction boxes, and inputting the confidence search areas to the search branch to track the target in subsequent sequence frames.
Description
Depth space-time associated video target tracking method and system

Technical Field

The invention relates to the field of computer vision, and in particular to a depth space-time associated video target tracking method and system.

Background

Video object tracking is the technique of modeling the appearance and motion of an object using the context of a video or image sequence in order to predict the object's motion state and localize it. Typically, the target specified in the first frame of a video is tracked continuously through subsequent frames, achieving both target localization and target scale estimation. Video target tracking has broad application value in fields such as video surveillance, autonomous driving, and precision guidance. In recent years, with the rapid development of deep learning and convolutional networks, more and more convolutional-network-based video object trackers have emerged. Researchers increasingly favour trackers based on twin (Siamese) networks, which not only track quickly but also achieve good accuracy. Such trackers treat visual tracking as a similarity matching problem. In 2016, Bertinetto et al. proposed the SiamFC tracker (Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr: Fully-Convolutional Siamese Networks for Object Tracking. ECCV Workshops (2) 2016: 850-865.), which extracts template and search features with a twin network and computes the cross-correlation between the target template and the search area by correlation filtering. Subsequently, Held et al. proposed the GOTURN tracker (David Held, Sebastian Thrun, Silvio Savarese: Learning to Track at 100 FPS with Deep Regression Networks. ECCV (1) 2016: 749-765.), which regresses from the predicted target box of the previous frame to obtain the target box of the current frame.
To further improve accuracy, in 2018 Li et al. combined the twin network with a region proposal network and proposed the SiamRPN tracker (Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, Xiaolin Hu: High Performance Visual Tracking With Siamese Region Proposal Network. CVPR 2018: 8971-8980.), estimating the target box scale more accurately by introducing region candidates. However, the introduced anchor boxes easily cause ambiguity in similarity matching, which hurts tracking accuracy, causes error accumulation, reduces the robustness of the target tracker, and brings more hyperparameters. In 2020, Chen et al. designed the simple and efficient anchor-free tracker SiamBAN (Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, Rongrong Ji: Siamese Box Adaptive Network for Visual Tracking. CVPR 2020: 6667-6676.), improving performance by adding a feature combination branch and a quality assessment branch. These trackers deliver excellent performance and real-time tracking speed in most video scenes, but existing methods usually treat video object tracking as a frame-by-frame object detection problem, ignoring the rich spatio-temporal information between video frames. A twin-network-based visual tracking method should exploit the rich information across time frames: learning space-time visual features allows better modeling of the target's appearance and thereby improves tracking and localization accuracy.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a depth space-time associated video target tracking method and system.
By exploiting spatio-temporal information, the tracker not only preserves temporally associated feature information but also models the appearance of the video target better, since the template sequence stores the features of different template frames; this improves the tracker's accuracy. Meanwhile, taking a template sequence and a search sequence as inputs yields the target prediction results for the whole search sequence at once. Processing video object tracking sequence by sequence in this way greatly increases tracking speed. To achieve the above object, the present invention provides a depth space-time associated video target tracking method comprising the following steps: S1, constructing a network architecture comprising a space-time feature extractor, a feature matching sub-network, and a target prediction sub-network; the architecture improves the model's detection and localization capability and yields more accurate video target tracking results, wherein: S11, the space-time feature extractor is based on a 3D twin network and comprises a template branch and a search branch; a 3D fully convolutional neural network serves as the backbone with shared weights, so that the space-time feature extr