CN-115953431-B - Multi-target tracking method and system for unmanned aerial vehicle aerial video
Abstract
A multi-target tracking method and system for unmanned aerial vehicle (UAV) aerial video extracts target classes and bounding boxes from video frames with a multi-scale pixel-wise object detection network, MSPNet; extracts appearance feature vectors of the targets with a multi-granularity fusion feature extraction network, MaskMGN; computes appearance feature vectors of the tracks with TCMWA, a weighted moving average method based on temporal order and detection confidence; and associates tracks with detections using the appearance feature vectors and a Kalman motion model to obtain the multi-target tracking result. The invention improves the object detection and appearance feature extraction stages, addresses the complex backgrounds, small target scales, target occlusion, and varying viewing angles characteristic of UAV aerial video, and can effectively improve its multi-target tracking accuracy.
Inventors
- LIU XIAORUI
- RAO RUONAN
- XUE GUANGTAO
- JIANG FENGYI
Assignees
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-12-24
Claims (3)
- 1. An online multi-target tracking method for unmanned aerial vehicle aerial video, characterized in that: after target classes and bounding boxes are extracted from a video frame by a multi-scale pixel-wise object detection network MSPNet, appearance feature vectors of the targets are further extracted by a multi-granularity fusion feature extraction network MaskMGN; appearance feature vectors of the tracks are computed by a weighted moving average method TCMWA based on temporal order and detection confidence; and the current tracks and the detection results are associated using the appearance feature vectors together with motion features based on Kalman filtering, yielding the multi-target tracking result. The method specifically comprises the following steps:
  Step 1: read a video file or image sequence and preprocess it (edge trimming, cropping, and scaling) to obtain video frames of identical size.
  Step 2: perform object detection on the preprocessed frames using the multi-scale pixel-wise object detection network MSPNet as the detector, obtaining the detection results.
  Step 3: extract appearance feature vectors of the targets detected in Step 2 with MaskMGN, compute appearance feature vectors of the tracks with TCMWA, and compute similarities, specifically:
  Step 3.1: obtain the detection results of MSPNet on the current frame and crop out the target region images;
  Step 3.2: extract features from the target images cropped according to the current frame's detection results with MaskMGN, obtaining appearance feature vectors of fixed dimension;
  Step 3.3: predict each tracked track's feature vector in the current frame with the time- and detection-confidence-weighted moving average method TCMWA over the feature vectors of the target's historical frames; the dimension of this vector matches that of the MaskMGN output. In TCMWA, the appearance feature vector observed in historical frame t (t frames ago) with detection confidence c is given weight c/t;
  Step 3.4: compute the cosine distance between each target detected in the current frame and each currently tracked track, d_cos(i, j) = 1 − (f_j · g_i) / (‖f_j‖ ‖g_i‖), where f_j and g_i are the appearance feature vectors of the target and the track respectively, and assemble the values into a similarity matrix.
  Step 4: extract motion features of the targets and tracks and compute similarities, specifically:
  Step 4.1: obtain the bounding-box coordinates of the targets detected in the current frame and convert them to [u, v, a, h] format, where (u, v) is the center position of the target and a, h are the aspect ratio and height of the target box;
  Step 4.2: predict each currently tracked track's position in the current frame with the Kalman filter prediction step x̂_k = F x̂_{k−1}, P_k = F P_{k−1} Fᵀ + Q, where x̂_{k−1} is the estimated motion state of the target at time k−1, x̂_k is the predicted observation vector, P_k is the predicted observation covariance, and F and Q are parameters of the Kalman filter;
  Step 4.3: compute the Mahalanobis distance between the bounding boxes in the detection results and the bounding boxes predicted for the tracks, d_mah(i, j) = (d_j − y_i)ᵀ S_i⁻¹ (d_j − y_i), where d_j and y_i are the motion feature vectors of the target and the track respectively and S_i is the covariance matrix of the track's observation space at the current moment, and assemble the values into a similarity metric matrix.
  Step 5: perform data association between the detection results and the current tracks.
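As a concrete illustration, the TCMWA averaging of Step 3.3 and the cosine distance of Step 3.4 can be sketched in Python. The patent publishes no reference code, so the function names, the list-based vector representation, and the weight normalization are our own assumptions; only the c/t weighting and the cosine formula come from the claim.

```python
import math

def tcmwa_feature(history):
    """TCMWA sketch: weighted moving average of a track's appearance features.

    `history` holds (feature_vector, detection_confidence) pairs ordered from
    the most recent frame backwards, so the pair at index t-1 was observed
    t frames ago and, per the claim, receives weight c / t.
    """
    dim = len(history[0][0])
    acc = [0.0] * dim
    total_w = 0.0
    for t, (feat, conf) in enumerate(history, start=1):
        w = conf / t                       # the c/t weight from the claim
        total_w += w
        for k in range(dim):
            acc[k] += w * feat[k]
    # Normalization is an assumption: it keeps the result on the input scale.
    return [v / total_w for v in acc]

def cosine_distance(f, g):
    """Step 3.4: 1 - cosine similarity between detection and track features."""
    dot = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in g))
    return 1.0 - dot / norm

def similarity_matrix(det_feats, track_feats):
    """Cosine-distance matrix over all detection/track pairs."""
    return [[cosine_distance(f, g) for g in track_feats] for f in det_feats]
```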
- 2. The online multi-target tracking method for unmanned aerial vehicle aerial video according to claim 1, wherein Step 5 specifically comprises:
  Step 5.1: on the first frame of the video, initialize a track for each detected target; the default track type is unreliable, and a track is promoted to reliable only after more than three consecutive successful matches. Subsequent video frames are matched differently according to the current track type.
  Step 5.2: when the current track's type is unreliable, match it with the IoU matching algorithm. Specifically, a cost matrix is built from the IoU values between the target detection boxes and the track prediction boxes, the data association problem is treated as a bipartite graph matching problem, and the cost matrix is solved with the Hungarian algorithm to obtain the matching result. Further, when a detected target has no matching track, a new track is initialized for it; when a track matches no target among the current frame's detections, its type is marked as lost, and the track is deleted if its current number of lost frames exceeds a preset maximum threshold.
  Step 5.3: when the current track's type is reliable, match it with a cascade matching algorithm. Specifically, a cost matrix is constructed from the appearance-feature similarity metric matrix and the motion-feature similarity metric matrix obtained in Steps 3 and 4, and solved with the Hungarian algorithm to obtain the matching result. Further, when the previous frame matched successfully, the lost count is reset to 0; during matching, tracks with fewer lost frames are matched to the detection results first: the cost matrix is divided into several submatrices by lost count, and the submatrices with fewer lost frames are solved first by the Hungarian algorithm. Tracks still unmatched after cascade matching undergo IoU matching once more against the remaining detection results.
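The IoU matching of Step 5.2 can be sketched as follows. To keep the example dependency-free, the Hungarian algorithm named in the claim is replaced by a simple greedy assignment (in practice one would use an optimal solver such as `scipy.optimize.linear_sum_assignment`); the corner-format boxes and the 0.3 IoU gate are illustrative assumptions, not values from the patent.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def iou_match(track_boxes, det_boxes, min_iou=0.3):
    """Greedy stand-in for Hungarian matching on a 1 - IoU cost matrix.

    Returns (matches, unmatched_track_ids, unmatched_det_ids); per the claim,
    unmatched detections spawn new tracks and unmatched tracks are marked lost.
    """
    pairs = sorted(
        (1.0 - iou(t, d), ti, di)
        for ti, t in enumerate(track_boxes)
        for di, d in enumerate(det_boxes)
    )
    used_t, used_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost > 1.0 - min_iou:          # IoU below the gate: stop matching
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d
```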
- 3. A system for realizing the online multi-target tracking method for unmanned aerial vehicle aerial video according to claim 1 or 2, characterized by comprising an MSPNet unit, a MaskMGN unit, a TCMWA unit, and a data association unit, wherein: the MSPNet unit performs object detection on the current video frame to obtain detection results; the MaskMGN unit crops out target images according to the detection results and extracts appearance features to obtain the targets' appearance feature vectors; the TCMWA unit performs weighted averaging over the appearance feature vectors in a track's target historical frames to obtain the track's appearance feature vector; and the data association unit matches the current tracks against the detection results using the appearance feature vectors and Kalman-filter-based motion features to obtain the multi-target tracking result.
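The motion model used by the data association unit (Steps 4.2 and 4.3 of claim 1) can be sketched as below. Two simplifications are our own: the example uses a 4-dimensional constant-velocity state and a diagonal covariance for the Mahalanobis distance, whereas DeepSORT-style trackers typically use an 8-dimensional state [u, v, a, h] plus velocities and a full covariance matrix.

```python
def mat_mul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def kalman_predict(x, P, F, Q):
    """One Kalman prediction step: x' = F x,  P' = F P F^T + Q."""
    x_new = [sum(F[i][k] * x[k] for k in range(len(x))) for i in range(len(F))]
    P_new = mat_mul(mat_mul(F, P), transpose(F))
    for i in range(len(P_new)):
        for j in range(len(P_new)):
            P_new[i][j] += Q[i][j]
    return x_new, P_new

def mahalanobis_diag(d, y, s_diag):
    """Mahalanobis distance (d - y)^T S^-1 (d - y), assuming S is diagonal."""
    return sum((di - yi) ** 2 / s for di, yi, s in zip(d, y, s_diag))
```

With a constant-velocity transition matrix F, the predicted state simply advances each position component by its velocity, which is the per-frame motion prior gated by the Mahalanobis distance during association.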
Description
Multi-target tracking method and system for unmanned aerial vehicle aerial video

Technical Field
The invention relates to a technology in the field of image processing, in particular to a multi-target tracking method and system for unmanned aerial vehicle (UAV) aerial video.

Background
Multi-target tracking of UAV aerial video is widely applied in area monitoring, inspection, reconnaissance, and similar tasks. UAV aerial video is characterized by complex backgrounds, small and dense targets, target occlusion, and varying viewing angles. Although deep-learning-based video multi-target tracking has raised MOTA to 81.0 on common benchmarks such as MOT17, MOTA on UAV aerial video datasets such as VisDrone-MOT is currently less than half of that. Designing and realizing a multi-target tracking system for the UAV aerial photography scenario is therefore of significant practical importance.

Disclosure of Invention
Aiming at the defects of the prior art, namely that targets of different scales cannot be detected separately, missed detections occur easily, and the network structure is complex, the invention provides a multi-target tracking method and system for UAV aerial video. It mainly improves the object detection and appearance feature extraction stages, addresses the complex backgrounds, small target scales, target occlusion, and varying viewing angles of UAV aerial video, and can effectively improve its multi-target tracking accuracy.
The invention is realized by the following technical scheme.

The invention relates to an online multi-target tracking method for UAV aerial video, characterized in that: after target classes and bounding boxes are extracted from a video frame by the multi-scale pixel-wise object detection network MSPNet, appearance feature vectors of the targets are further extracted by the multi-granularity fusion feature extraction network MaskMGN; appearance feature vectors of the tracks are computed by the weighted moving average method TCMWA based on temporal order and detection confidence; and the tracks and detection results are associated using the appearance feature vectors and a Kalman motion model, yielding the multi-target tracking result.

The invention also relates to a system for realizing the method, comprising an MSPNet unit, a MaskMGN unit, a TCMWA unit, and a data association unit, wherein: the MSPNet unit performs object detection on the current video frame to obtain detection results; the MaskMGN unit crops out target images according to the detection results and extracts appearance features to obtain appearance feature vectors; the TCMWA unit performs weighted averaging over the appearance feature vectors of a track's historical frames to obtain the track's appearance feature vector; and the data association unit matches the tracks against the detection results using the appearance feature vectors and Kalman-filter-based motion features to obtain the multi-target tracking result.
Technical Effects
Based on the improved MSPNet, the feature-erasure-based multi-granularity feature extraction network MaskMGN, and the time- and detection-confidence-weighted moving average method TCMWA, the invention markedly improves small-target detection: compared with the baseline model FCOS, the mAP of small-target detection rises from 10.6 to 14.9. MaskMGN and TCMWA strengthen the appearance feature representations of targets and tracks, improve accuracy during data association, and reduce the number of identity (ID) switches in multi-target tracking from 4377 to 1755 compared with the baseline model DeepSORT.

Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the MSPNet network architecture;
FIG. 3 is a schematic diagram of the MaskMGN network architecture;
FIG. 4 is a flowchart of the DeepSORT multi-target tracking algorithm.

Detailed Description
As shown in FIG. 1, this embodiment relates to a multi-target tracking method for UAV aerial video, comprising the following steps:

Step 1: read a video file or image sequence and preprocess it (edge trimming, cropping, and scaling) to obtain video frames of identical size.

Step 2: as shown in FIG. 2, perform object detection on the preprocessed video frames using the multi-scale pixel-wise object detection network MSPNet as the detector, including:

Step 2.1: input the video frames into the backbone network ResNet of the detector for feature extraction,