CN-121982057-A - 3D multi-target tracking method and system based on monocular vision
Abstract
The application provides a 3D multi-target tracking method and system based on monocular vision. The method comprises: obtaining a continuous video image sequence captured by a monocular camera, and preprocessing it to obtain an input image sequence; performing target detection and dual-stage depth estimation and fusion on the input image sequence to obtain a target detection list with fused depth information; constructing a multi-dimensional association cost matrix and, based on the target detection list and the cost matrix, optimally associating current detections with existing tracking trajectories; and updating the tracking trajectory states according to the association results and performing life-cycle management on them. The monocular depth detection technology provided by the application remains low-cost and easy to integrate, and by injecting depth-dimension information into multi-target tracking it effectively remedies the shortcomings of traditional algorithms in occlusion handling, similar-target discrimination, and three-dimensional motion modeling, making it an optimal choice for balancing performance and cost.
Inventors
- HE HAIMING
- WANG FEI
- LI XIA
Assignees
- 中建材信息技术股份有限公司
- 中建材信云智联科技有限公司
- 中建材信息科技有限公司
- 中建材信云智联科技有限公司北京分公司
- 中建材信云智联科技(北京)有限公司
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-09
Claims (10)
- 1. A 3D multi-target tracking method based on monocular vision, comprising: acquiring a continuous video image sequence captured by a monocular camera, and preprocessing the continuous video image sequence to obtain an input image sequence; performing target detection and dual-stage depth estimation and fusion on the input image sequence to obtain a target detection list with fused depth information; constructing a multi-dimensional association cost matrix, and optimally associating the current detections with existing tracking trajectories based on the target detection list and the multi-dimensional association cost matrix; and updating the tracking trajectory states according to the association results, and performing life-cycle management on the tracking trajectory states.
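The four claimed steps can be sketched as a minimal per-frame loop. This is an illustrative skeleton under stated assumptions, not the patented implementation: `detect_and_fuse_depth` is a stand-in for the neural detector plus dual-stage depth estimation, and `associate` uses a naive nearest-depth rule in place of the multi-dimensional cost matrix described in later claims.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Detection:
    box: tuple          # (x, y, w, h) 2D bounding box
    label: str
    confidence: float
    depth: float = 0.0  # fused depth, filled in by step 2

_next_id = count(1)

def preprocess(frames):
    """Step 1: e.g. resize/normalize each raw frame (identity here)."""
    return list(frames)

def detect_and_fuse_depth(frame):
    """Step 2: stand-in detector returning detections with fused depth."""
    return [Detection(box=(10, 10, 40, 80), label="pedestrian",
                      confidence=0.9, depth=12.5)]

def associate(detections, tracks):
    """Step 3: toy nearest-depth association (placeholder for the
    multi-dimensional cost matrix plus optimal assignment)."""
    matches = []
    for d in detections:
        best = min(tracks, key=lambda t: abs(t["depth"] - d.depth)) if tracks else None
        matches.append((d, best))
    return matches

def update_tracks(matches, tracks):
    """Step 4: update matched trajectories, spawn new ones (life cycle)."""
    for det, trk in matches:
        if trk is None:
            tracks.append({"id": next(_next_id), "depth": det.depth, "misses": 0})
        else:
            trk["depth"] = det.depth
            trk["misses"] = 0
    return tracks

tracks = []
for frame in preprocess(["frame0", "frame1"]):
    dets = detect_and_fuse_depth(frame)
    tracks = update_tracks(associate(dets, tracks), tracks)
```

Running the loop over two frames creates one trajectory on the first frame and re-associates it on the second, so a single track identity survives.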
- 2. The method of claim 1, wherein performing target detection and dual-stage depth estimation and fusion on the input image sequence to obtain a target detection list with fused depth information specifically comprises: performing target detection on the current frame image in the input image sequence using a first neural network model, and outputting a two-dimensional bounding box, a category label, and a confidence score for each target; performing dual-stage depth estimation on the current frame and several consecutive preceding frames based on the two-dimensional bounding box, wherein the dual-stage depth estimation comprises a first-stage depth estimation and a second-stage depth estimation; and fusing the relative depth-change information output by the first-stage depth estimation with the absolute depth information output by the second-stage depth estimation to generate a fused depth value and a depth confidence score for each target, thereby forming a target detection list with fused depth information.
- 3. The method according to claim 2, wherein performing dual-stage depth estimation on the current frame and several consecutive preceding frames based on the two-dimensional bounding box specifically comprises: calculating, by triangulation based on feature-point matching and motion parallax between the preceding consecutive frames and the current frame image, the relative depth change rate of the target within the two-dimensional bounding box, and using the relative depth change rate as the relative depth-change information; and performing dense depth estimation on the current frame image through a second neural network model, outputting a depth map as the absolute depth information.
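To give the relative depth change rate a concrete shape, the following sketch uses a bounding-box scale change as a simple stand-in for the feature-matching triangulation named in the claim: under a pinhole camera model the apparent size of a rigid target scales as 1/depth, so the ratio of box heights approximates the depth ratio. This substitution and the function name are illustrative assumptions.

```python
def relative_depth_change(prev_box_h, cur_box_h):
    """Size-based proxy for the triangulated relative depth change rate:
    under a pinhole model, apparent height ~ 1/depth, so
    z_cur / z_prev ~= h_prev / h_cur."""
    if prev_box_h <= 0 or cur_box_h <= 0:
        raise ValueError("bounding-box heights must be positive")
    return prev_box_h / cur_box_h

# A target whose box shrinks from 100 px to 80 px is ~1.25x farther away.
rate = relative_depth_change(100.0, 80.0)
```

A real first stage would triangulate matched feature points across the preceding frames; the scale ratio above is only the cheapest observable consistent with the same geometry.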
- 4. The method according to claim 3, wherein fusing the relative depth-change information output by the first-stage depth estimation with the absolute depth information output by the second-stage depth estimation comprises: extracting depth sample values from the region of the depth map output by the second-stage depth estimation corresponding to the two-dimensional bounding box, and taking their median as a preliminary absolute depth estimate; recursively propagating the fused depth value of the corresponding target in the previous frame using the relative depth change rate output by the first-stage depth estimation, to obtain a geometrically recursed depth value for the current frame; and performing a weighted fusion of the preliminary absolute depth estimate and the geometrically recursed depth value to obtain the fused depth value, wherein the fusion weights are dynamically adjusted according to the motion intensity of the target.
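The median sampling and weighted fusion in this claim can be sketched as follows. The weight schedule (trusting the per-frame absolute estimate more as motion intensity grows) and the specific coefficients are assumptions; the patent only states that the weights depend on motion intensity.

```python
import statistics

def absolute_depth_from_map(depth_samples):
    """Preliminary absolute depth: median of the depth-map samples
    inside the target's 2D bounding box (robust to outlier pixels)."""
    return statistics.median(depth_samples)

def fuse_depth(absolute_depth, prev_fused_depth, rel_change_rate,
               motion_intensity):
    """Blend the stage-2 absolute estimate with the geometrically
    recursed depth.  Assumed schedule: higher motion intensity in
    [0, 1] shifts weight toward the absolute (per-frame) estimate."""
    geometric_depth = prev_fused_depth * rel_change_rate
    w_abs = min(1.0, 0.5 + 0.5 * motion_intensity)
    return w_abs * absolute_depth + (1.0 - w_abs) * geometric_depth

# Median ignores the 55.0 outlier; fusion averages 10.3 and 10.2 at rest.
abs_d = absolute_depth_from_map([10.1, 10.3, 55.0, 10.2, 10.4])
fused = fuse_depth(abs_d, prev_fused_depth=10.0, rel_change_rate=1.02,
                   motion_intensity=0.0)
```

The median makes the preliminary estimate robust to depth-map pixels that fall on the background inside the box, which is why it is preferred over the mean here.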
- 5. The method of claim 1, wherein the multi-dimensional association cost matrix comprises at least five feature-cost dimensions: an appearance feature cost, a two-dimensional motion feature cost, a depth motion feature cost, a depth continuity feature cost, and an occlusion-level feature cost; the appearance feature cost is calculated based on the similarity of the targets' appearance feature vectors; the two-dimensional motion feature cost is calculated based on the intersection-over-union of the two-dimensional image-plane position predicted by Kalman filtering and the detected position; the depth motion feature cost is calculated based on the deviation between the predicted depth value and the current fused depth value; the depth continuity feature cost is calculated based on the smoothness of the current fused depth value with respect to the target's historical depth sequence; and when two-dimensional overlap exists between detected targets, the occlusion-level feature cost judges the front-to-back occlusion relation from the targets' fused depth values and assigns an association protection weight to the occluded target.
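An illustrative entry of such a cost matrix might combine four of the claimed dimensions as a weighted sum. The weight values, the 10 m depth normalizer, and the dictionary field names are assumptions for the sketch; the occlusion-level term is omitted since it modifies weights rather than adding a distance.

```python
from math import sqrt

def cosine_distance(a, b):
    """Appearance cost: 1 - cosine similarity of feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def association_cost(det, trk, weights=(0.3, 0.3, 0.2, 0.2)):
    """Weighted sum of appearance, 2D motion, depth motion, and
    depth continuity costs; weights and scaling are assumptions."""
    w_app, w_mot, w_dep, w_cont = weights
    c_app = cosine_distance(det["feat"], trk["feat"])
    c_mot = 1.0 - iou(det["box"], trk["pred_box"])        # Kalman-predicted box
    c_dep = abs(det["depth"] - trk["pred_depth"]) / 10.0  # depth motion
    hist_mean = sum(trk["depth_hist"]) / len(trk["depth_hist"])
    c_cont = abs(det["depth"] - hist_mean) / 10.0         # depth continuity
    return w_app * c_app + w_mot * c_mot + w_dep * c_dep + w_cont * c_cont

det = {"feat": [1.0, 0.0], "box": (0, 0, 10, 10), "depth": 5.0}
trk = {"feat": [1.0, 0.0], "pred_box": (0, 0, 10, 10),
       "pred_depth": 5.0, "depth_hist": [5.0, 5.0]}
cost = association_cost(det, trk)
```

A detection that matches a trajectory perfectly on every dimension yields a cost of zero, which is what the assignment step then minimizes over all detection/trajectory pairs.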
- 6. The method according to claim 1 or 5, wherein optimally associating the current detections with existing tracking trajectories based on the target detection list and the multi-dimensional association cost matrix specifically comprises: performing optimal association between the current detections and the existing tracking trajectories using the Hungarian algorithm, with minimization of the total cost over the multi-dimensional association cost matrix as the objective.
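Minimum-total-cost assignment can be demonstrated on a toy 3x3 matrix. For brevity this sketch uses exhaustive search, which at this scale returns the same pairing the Hungarian algorithm would; production code would typically call `scipy.optimize.linear_sum_assignment`, which implements the same optimal assignment in polynomial time.

```python
from itertools import permutations

def optimal_assignment(cost):
    """Exhaustive minimum-cost assignment over a square cost matrix.
    Stand-in for the Hungarian algorithm at toy scale (O(n!))."""
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))

# Rows = current detections, columns = existing trajectories.
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.7, 0.3]]
total, matches = optimal_assignment(cost)
```

Greedy row-by-row matching can get trapped by a locally cheap pair; the global optimum is what prevents two similar-looking targets from swapping identities when their individual costs are close.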
- 7. The method according to claim 1, wherein updating the tracking trajectory states according to the association results and performing life-cycle management on them specifically comprises: for an existing tracking trajectory successfully associated with a detection in the current frame, updating the corresponding Kalman filter state using the detection's two-dimensional position information and fused depth value, and updating the trajectory's appearance-feature history cache; for a current-frame detection not associated with any existing trajectory, initializing it as a new tracking trajectory if its detection confidence score is greater than or equal to a preset threshold; when an existing trajectory goes undetected for several consecutive frames because it is occluded by other targets, starting depth-assisted trajectory prediction, predicting the potential three-dimensional motion trajectory of the occluded target based on the occluder's depth information, the motion state, and scene priors, and maintaining the trajectory's tracking state and identity during prediction; dynamically setting a disappearance threshold according to the fused depth value confirmed at each trajectory's last successful association, wherein the number of consecutive unassociated frames corresponding to the disappearance threshold is dynamically determined from the target's fused depth value; and when an anomaly in the depth information is detected, automatically reducing the weights of the depth-related feature dimensions in the multi-dimensional association cost matrix used for data association, and correspondingly increasing the weights of the appearance and two-dimensional motion feature dimensions.
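The depth-dependent disappearance threshold can be sketched as a simple mapping from fused depth to a maximum count of consecutive unassociated frames. The linear schedule, clamp range, and constants below are assumptions; the patent only specifies that the threshold is dynamically determined from the fused depth value. The rationale, implied by the occlusion handling above, is that a nearby track vanishing is more likely a temporary occlusion than a true exit from the scene.

```python
def max_missed_frames(fused_depth, base=30, near=5.0, far=50.0):
    """Depth-dependent disappearance threshold (assumed schedule):
    clamp depth to [near, far] metres, then interpolate linearly so
    near targets keep their identity longer than distant ones."""
    depth = min(max(fused_depth, near), far)
    scale = 1.0 - (depth - near) / (far - near)  # 1.0 at near .. 0.0 at far
    return int(base * (0.5 + 0.5 * scale))       # base frames near, base/2 far

near_limit = max_missed_frames(5.0)   # close pedestrian: longest grace period
far_limit = max_missed_frames(50.0)   # distant vehicle: pruned sooner
```

A trajectory whose `misses` counter exceeds its own `max_missed_frames` would then be retired, while closer occluded targets survive longer under depth-assisted prediction.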
- 8. A 3D multi-target tracking system based on monocular vision, comprising: a data processing module for acquiring a continuous video image sequence captured by a monocular camera, and preprocessing the continuous video image sequence to obtain an input image sequence; a detection and fusion module for performing target detection and dual-stage depth estimation and fusion on the input image sequence to obtain a target detection list with fused depth information; a data association module for constructing a multi-dimensional association cost matrix, and optimally associating the current detections with existing tracking trajectories based on the target detection list and the multi-dimensional association cost matrix; and a trajectory management module for updating the tracking trajectory states according to the association results and performing life-cycle management on the tracking trajectory states.
- 9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the steps of the monocular-vision-based 3D multi-target tracking method of any one of claims 1 to 7.
- 10. A computer-readable storage medium having stored thereon a program for information transfer which, when executed by a processor, realizes the steps of the monocular-vision-based 3D multi-target tracking method according to any one of claims 1 to 7.
Description
3D multi-target tracking method and system based on monocular vision

Technical Field

The invention relates to the technical field of computer vision and intelligent driving, and in particular to a 3D multi-target tracking method and system based on monocular vision.

Background

In the intelligent-driving field, multi-target tracking is a core link in environment perception, and its performance directly affects the safety of vehicle decision-making and control. Urban-road and expressway scenes contain large numbers of dynamic targets (pedestrians, vehicles, non-motorized vehicles, etc.) whose motion states are complex and easily disturbed by factors such as occlusion, illumination change, and similar appearance, so traditional multi-target tracking algorithms face numerous challenges. According to industry statistics, in dense traffic scenes the target ID switch rate of traditional tracking algorithms averages over 20 percent, and the miss rate in occlusion scenes can reach 35 percent. These problems can directly cause a vehicle to misjudge its surroundings, for example by identifying a continuously moving vehicle as several independent targets or failing to re-associate a target after occlusion, creating collision risk. The monocular camera is a mainstream sensor in intelligent driving that is low-cost and easy to integrate, and the images it captures contain rich appearance and scene features. Introducing depth detection can provide multi-target tracking with the distance hierarchy between targets, effectively mitigating the occlusion misjudgment and similar-target confusion that arise when traditional algorithms rely on appearance and motion features alone.
Therefore, research on optimizing multi-target tracking algorithms through monocular depth detection is of great significance for improving the robustness of intelligent-driving environment perception.

Disclosure of Invention

The invention aims to provide a 3D multi-target tracking method and system based on monocular vision, which overcomes the shortcomings of existing monocular multi-target tracking methods in occlusion handling, similar-target discrimination, and three-dimensional motion modeling, and which is low-cost, easy to integrate, and capable of high-robustness, high-precision 3D tracking in complex dynamic environments. An embodiment of the invention provides a 3D multi-target tracking method based on monocular vision, comprising the following steps: acquiring a continuous video image sequence captured by a monocular camera, and preprocessing it to obtain an input image sequence; performing target detection and dual-stage depth estimation and fusion on the input image sequence to obtain a target detection list with fused depth information; constructing a multi-dimensional association cost matrix, and optimally associating the current detections with existing tracking trajectories based on the target detection list and the multi-dimensional association cost matrix; and updating the tracking trajectory states according to the association results, and performing life-cycle management on the tracking trajectory states.
An embodiment of the invention provides a 3D multi-target tracking system based on monocular vision, comprising: a data processing module for acquiring a continuous video image sequence captured by a monocular camera and preprocessing it to obtain an input image sequence; a detection and fusion module for performing target detection and dual-stage depth estimation and fusion on the input image sequence to obtain a target detection list with fused depth information; a data association module for constructing a multi-dimensional association cost matrix and optimally associating the current detections with existing tracking trajectories based on the target detection list and the multi-dimensional association cost matrix; and a trajectory management module for updating the tracking trajectory states according to the association results and performing life-cycle management on them. An embodiment of the invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program realizing the steps of the monocular-vision-based 3D multi-target tracking method when executed by the processor. An embodiment of the invention also provides a computer-readable storage medium, wherein the computer-readable storage medi