CN-122023836-A - Target tracking method, device, equipment and medium based on multi-view fusion
Abstract
The application relates to a target tracking method, device, equipment and medium based on multi-view fusion. The method treats the human targets in the scene as graph nodes of a neural network for spatio-temporal modeling, predicts the position trajectory of each target at the next moment, and computes a hierarchically attenuated confidence for each prediction according to its degree of occlusion, which greatly alleviates the two-dimensional occlusion problem in target tracking. On this basis, the tracking predictions from multiple different depth-camera viewpoints are organically fused so as to realize global tracking of the human targets in the scene. In addition, the system is compatible with multiple different viewpoints, has strong generality and extensibility, and benefits from additional camera viewpoints, which enable more accurate prediction and tracking.
Inventors
- CHEN YING
- Pan Ruihan
Assignees
- 中科弘拓(苏州)智能科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-08
Claims (10)
- 1. A target tracking method based on multi-view fusion, characterized by comprising the following steps: establishing a global coordinate system for the two-dimensional plane where the application scene is located, establishing a sub-coordinate system for each depth camera that acquires two-dimensional images and three-dimensional point clouds in the application scene, and converting target position coordinates under each sub-coordinate system into the global coordinate system; detecting human targets according to the two-dimensional images and three-dimensional point clouds acquired by each depth camera, and binding a global identity tag to the position coordinates of each human target in the global coordinate system; for the human target corresponding to each global identity tag, determining the observation input result and occlusion degree of each human target under the camera view angle of each depth camera at the current moment according to the two-dimensional images and three-dimensional point clouds acquired by the multiple depth cameras; processing the observation input results of each human target under the camera view angle of each depth camera with a spatio-temporal graph neural network to obtain a prediction result for the next moment, determining the confidence of the prediction result according to the occlusion degree corresponding to the observation input result, and performing multi-view fusion on the prediction results to obtain a fused prediction output result; and integrating the prediction output result corresponding to each human target with the observation input result, and back-deriving the visibility of each body part of the human target in combination with the confidence of the prediction result corresponding to each depth camera, so as to realize global tracking of each human target.
- 2. The multi-view fusion-based target tracking method according to claim 1, wherein establishing the global coordinate system for the two-dimensional plane where the application scene is located, establishing a sub-coordinate system for each depth camera that acquires two-dimensional images and three-dimensional point clouds in the application scene, and converting the target position coordinates in each sub-coordinate system into the global coordinate system comprises: establishing a global coordinate system for the two-dimensional plane where the application scene is located, determining the origin and the X axis of the global coordinate system, and rotating the X axis by 90 degrees counter-clockwise in the world coordinate plane about the origin of the global coordinate system to obtain the Y axis; acquiring the position and orientation of each depth camera that acquires two-dimensional images and three-dimensional point clouds in the application scene, determining its view angle from the orientation of the depth camera, and establishing in the global coordinate system a sub-coordinate system corresponding to the view angle of each depth camera; when converting target position coordinates from a sub-coordinate system into the global coordinate system, taking the coordinates of the origin of the sub-coordinate system in the global coordinate system as O = (x_O, y_O), wherein x_O is the abscissa and y_O is the ordinate of the origin of the sub-coordinate system in the global coordinate system, and taking the unit direction vector of the positive X axis of the sub-coordinate system in the global coordinate system as e_x = (x_X, y_X), wherein x_X is the abscissa and y_X is the ordinate of that unit direction vector in the global coordinate system; obtaining the Y axis of the sub-coordinate system by rotating its X axis 90 degrees counter-clockwise in the sub-coordinate plane about the origin of the sub-coordinate system, so that the unit direction vector of the positive Y axis of the sub-coordinate system is e_y = R·e_x = (−y_X, x_X), where R = [[0, −1], [1, 0]] is the rotation matrix; if the target position coordinate in the target sub-coordinate system is P_S = (x_S, y_S), where x_S is the abscissa and y_S is the ordinate of the target position in the target sub-coordinate system, then the target position corresponds to the coordinate P_W in the global coordinate system as P_W = (x_W, y_W) = O + x_S·e_x + y_S·e_y = (x_O + x_S·x_X − y_S·y_X, y_O + x_S·y_X + y_S·x_X), where x_W is the abscissa and y_W is the ordinate of the target position in the global coordinate system (a coordinate-conversion sketch follows the claims).
- 3. The multi-view fusion-based target tracking method according to claim 1, wherein detecting human targets according to the two-dimensional images and three-dimensional point clouds acquired by each depth camera and binding a global identity tag to the position coordinates of each human target in the global coordinate system comprises: applying a pre-trained deep neural network for human target detection to the two-dimensional image acquired by each depth camera, detecting the human key points of each human target, and thereby acquiring the human targets in the application scene; after a human target is detected, acquiring the three-dimensional coordinate of the human target's center from the three-dimensional point cloud acquired by each depth camera, removing the vertical-axis coordinate from that three-dimensional coordinate, and mapping the remaining coordinates onto the two-dimensional camera-view sub-coordinate system to obtain the position coordinate of the human target in the sub-coordinate system; converting the position coordinate of the human target in the sub-coordinate system to obtain its position coordinate in the global coordinate system; when a new human target appears in the camera view of any depth camera, calculating the corresponding position coordinate of the new human target in the global coordinate system and, within a preset distance range, matching it to the nearest human target, i.e. the one with the shortest Euclidean distance between their position coordinates, among the human targets detected by each of the other depth cameras; if the new human target is matched one-to-one with a nearest human target in every other depth camera, binding a global identity tag to the new human target; if the new human target cannot be matched one-to-one with a nearest human target in every other depth camera, setting its global identity tag as undetermined, and matching the undetermined human target as a nearest candidate against subsequent new human targets within the preset distance range (an identity-matching sketch follows the claims).
- 4. The multi-view fusion-based target tracking method according to claim 1, wherein processing the observation input results of each human target under the camera view angle of each depth camera with the spatio-temporal graph neural network to obtain the prediction result for the next moment comprises: for each depth camera, constructing in advance, according to the task requirements, a spatio-temporal graph neural network and training it with collected target position trajectory data, the spatio-temporal graph neural network being used to predict the future motion trajectory and trend of the human targets in the application scene so as to obtain the prediction result for the next moment; in the spatio-temporal graph neural network corresponding to each depth camera, each graph node is associated with one human target under that depth camera's view angle and records the global identity tag bound to the corresponding human target, the global identity tags of the N human targets being ID_1, ID_2, ..., ID_N; the input node features of the spatio-temporal graph neural network are the position coordinates (x, y) and velocity vectors (Δx, Δy) of a human target under the current camera view angle, divided into the predicted value P computed for the current moment by the spatio-temporal graph neural network at the previous moment and the observed value S acquired and detected by the depth camera at the current moment, where x is the abscissa and y is the ordinate of the human target under the current camera view angle, and Δx and Δy are the velocity components of the human target under the current camera view angle along the abscissa and the ordinate, respectively; the input node feature matrix of the spatio-temporal graph neural network has dimension T × N × 8, where N is the number of target nodes and T is the length of the time window, and for a single moment t the input feature matrix has dimension N × 8, its nth row being [x^P_{n,t}, y^P_{n,t}, Δx^P_{n,t}, Δy^P_{n,t}, x^S_{n,t}, y^S_{n,t}, Δx^S_{n,t}, Δy^S_{n,t}], wherein x^P_{n,t} is the predicted value of the nth target node at time t on the abscissa, y^P_{n,t} is the predicted value on the ordinate, Δx^P_{n,t} and Δy^P_{n,t} are the velocity-vector predicted values on the abscissa and ordinate, and x^S_{n,t}, y^S_{n,t}, Δx^S_{n,t} and Δy^S_{n,t} are the corresponding observed values of the nth target node at time t; the node adjacency matrix is a symmetric matrix of dimension N × N, in which the edge weight connecting any two graph nodes is the Euclidean distance between the two corresponding human targets in the sub-coordinate system; the output matrix of the spatio-temporal graph neural network has dimension 1 × N × 4, and the prediction result output by the spatio-temporal graph neural network is the predicted value P of the position coordinates (x, y) and velocity vectors (Δx, Δy) of all human targets under the current depth camera's view angle at the next moment t+1.
- 5. The multi-view fusion-based object tracking method according to claim 4, wherein processing the observation input results of each human target under the camera view angle of each depth camera with the spatio-temporal graph neural network to obtain the prediction result for the next moment further comprises: computing the edge weight between any two graph nodes from the observed values S, the edge weight between graph nodes m and n being W_mn = sqrt((x^S_{m,t} − x^S_{n,t})² + (y^S_{m,t} − y^S_{n,t})²), wherein W^k_t denotes the edge-weight matrix of any two graph nodes under the camera view angle of the kth subsystem at the current moment t, whose entries run from W_11 to W_NN, and W_NN denotes the edge weight between the Nth graph node and the Nth graph node; the prediction result output by the spatio-temporal graph neural network is P^k_{t+1} = [x^P_{1,t+1}, y^P_{1,t+1}, Δx^P_{1,t+1}, Δy^P_{1,t+1}; ...; x^P_{N,t+1}, y^P_{N,t+1}, Δx^P_{N,t+1}, Δy^P_{N,t+1}], wherein P^k_{t+1} represents the prediction result of the kth subsystem for the next moment t+1, x^P_{n,t+1} is the predicted value of the nth target node at time t+1 on the abscissa, y^P_{n,t+1} is the predicted value on the ordinate, and Δx^P_{n,t+1} and Δy^P_{n,t+1} are the velocity-vector predicted values of the nth target node at time t+1 on the abscissa and ordinate (a feature and adjacency construction sketch follows the claims).
- 6. The multi-view fusion-based target tracking method according to claim 5, wherein determining the confidence of the prediction result according to the occlusion degree corresponding to the observation input result, and performing multi-view fusion on the prediction results to obtain a fused prediction output result, comprises: acquiring the total number K of depth cameras in the application scene and treating each depth camera as one subsystem; determining the confidence of the prediction result according to the occlusion degree corresponding to the observation input result so as to compute a prediction-result confidence vector, with C^k_t representing the confidence vector computed for all human targets, according to their respective occlusion degrees, in the camera view of the kth subsystem at the current moment t; performing multi-view fusion according to the prediction-result confidence vectors and the prediction results to obtain a fused prediction output result P_{t+1}, wherein P_{t+1} represents the predicted position coordinates (x, y) and velocity vectors (Δx, Δy) of all targets at the next moment t+1 after fusing the prediction results of all subsystems, and P^k_{t+1} represents the predicted position coordinates (x, y) and velocity vectors (Δx, Δy) of all targets computed by the kth subsystem for the next moment t+1; the fused prediction output result corresponding to the ith single human target is expressed as P_{i,t+1} = ( Σ_{k=1..K} c^k_{i,t} · P^k_{i,t+1} ) / ( Σ_{k=1..K} c^k_{i,t} ), wherein P_{i,t+1} represents the predicted position coordinates (x, y) and velocity vectors (Δx, Δy) of the ith human target at the next moment t+1 after fusing the prediction output results of all subsystems, c^k_{i,t} represents the confidence of the prediction output result computed for the ith human target according to its occlusion degree in the camera view of the kth subsystem at the current moment t, and Σ_{k=1..K} c^k_{i,t} represents the sum, over all subsystem camera views, of the confidences computed for the ith target according to its occlusion degree at the current moment t (a fusion sketch follows the claims).
- 7. The multi-view fusion-based object tracking method according to claim 6, wherein integrating the prediction output result corresponding to each human target with the observation input result and back-deriving the visibility of each body part of the human target in combination with the confidence of the prediction result corresponding to each depth camera comprises: for the ith human target in the camera view of the kth subsystem at the current moment t, obtaining the human key points of the human target; hierarchically attenuating the confidence c^k_{i,t} of the prediction result of the human target under the camera view of the corresponding subsystem at the current moment t according to the occlusion degree, according to the formula c^k_{i,t} = w_head·v_head + w_torso·v_torso + w_lhand·v_lhand + w_rhand·v_rhand, wherein each weight takes a value between 0 and 1, w_head is the head weight, w_torso is the torso weight, w_lhand is the left-hand weight and w_rhand is the right-hand weight, and v_head, v_torso, v_lhand and v_rhand are boolean variable values indicating whether the head, torso, left hand and right hand are visible, respectively; and determining the confidence of the human key points according to the confidence of the prediction result corresponding to each depth camera at the current moment t, and back-deriving the visibility of each body part of the human target according to the confidence of the human key points (an attenuation sketch follows the claims).
- 8. A multi-view fusion-based object tracking device, the device comprising: a coordinate conversion module, used for establishing a global coordinate system for the two-dimensional plane where the application scene is located, establishing a sub-coordinate system for each depth camera that acquires two-dimensional images and three-dimensional point clouds in the application scene, and converting target position coordinates under each sub-coordinate system into the global coordinate system; a target detection and identification module, used for detecting human targets according to the two-dimensional images and three-dimensional point clouds acquired by each depth camera and binding a global identity tag to the position coordinates of each human target in the global coordinate system; an observation and occlusion analysis module, used for determining, for the human target corresponding to each global identity tag, the observation input result and occlusion degree of each human target under the camera view angle of each depth camera at the current moment according to the two-dimensional images and three-dimensional point clouds acquired by the depth cameras; a spatio-temporal graph processing and fusion module, used for processing the observation input results of each human target under the camera view angle of each depth camera with a spatio-temporal graph neural network to obtain a prediction result for the next moment, determining the confidence of the prediction result according to the occlusion degree corresponding to the observation input result, and performing multi-view fusion on the prediction results to obtain a fused prediction output result; and a global tracking and visibility calculation module, used for integrating the prediction output result corresponding to each human target with the observation input result, and back-deriving the visibility of each body part of the human target in combination with the confidence of the prediction result corresponding to each depth camera, so as to realize global tracking of each human target.
- 9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
- 10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
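Claim 2's coordinate conversion is a planar rigid transform: the sub-coordinate origin and its x-axis unit vector are expressed in the global frame, the y-axis follows from a 90-degree counter-clockwise rotation, and a target position is mapped as a linear combination of the two axes. Below is a minimal Python/NumPy sketch of that conversion; the function name `sub_to_global` and the example camera pose and target position are illustrative assumptions, not values from the patent.

```python
import numpy as np

# 90-degree counter-clockwise rotation matrix used to derive the sub-coordinate
# y-axis from its x-axis, as described in claim 2.
R_CCW_90 = np.array([[0.0, -1.0],
                     [1.0,  0.0]])

def sub_to_global(p_sub, origin_global, x_axis_global):
    """Convert a 2-D point from a camera sub-coordinate system to the global frame.

    p_sub          : (x_S, y_S), target position in the sub-coordinate system
    origin_global  : (x_O, y_O), sub-coordinate origin expressed in the global frame
    x_axis_global  : direction vector of the sub-coordinate x-axis in the global frame
    """
    x_axis = np.asarray(x_axis_global, dtype=float)
    x_axis = x_axis / np.linalg.norm(x_axis)          # keep it a unit vector
    y_axis = R_CCW_90 @ x_axis                        # y-axis = x-axis rotated 90 deg CCW
    x_s, y_s = p_sub
    return np.asarray(origin_global, dtype=float) + x_s * x_axis + y_s * y_axis

if __name__ == "__main__":
    # Assumed example values: a camera placed at (3, 2) whose x-axis points along
    # the global 45-degree diagonal; a target observed at (1, 0) in the camera frame.
    p_world = sub_to_global(p_sub=(1.0, 0.0),
                            origin_global=(3.0, 2.0),
                            x_axis_global=(np.cos(np.pi / 4), np.sin(np.pi / 4)))
    print(p_world)   # -> [3.707..., 2.707...]
```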
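Claim 3 binds a global identity tag to a newly detected target only when, within a preset distance, the target can be matched to a nearest neighbour (shortest Euclidean distance) among the targets detected by every other depth camera; otherwise the tag is left undetermined. A minimal sketch of that matching rule follows, assuming each camera reports its detections as a mapping from a local track id to global-frame coordinates; `match_new_target` and the sample detections are hypothetical names and values.

```python
import math

def match_new_target(new_pos, other_camera_detections, max_dist):
    """Try to bind a global identity to a newly detected target (claim 3 sketch).

    new_pos                  : (x, y) of the new target in the global frame
    other_camera_detections  : list of dicts, one per other camera,
                               each mapping a local track id -> (x, y) in the global frame
    max_dist                 : preset distance threshold for a valid match

    Returns the matched (camera_index, track_id) pairs if the new target found a
    nearest neighbour within max_dist in every other camera, else None (claim 3
    then marks the global identity tag as undetermined).
    """
    matches = []
    for cam_idx, detections in enumerate(other_camera_detections):
        best_id, best_d = None, float("inf")
        for tid, (x, y) in detections.items():
            d = math.hypot(new_pos[0] - x, new_pos[1] - y)   # Euclidean distance
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None or best_d > max_dist:
            return None          # no counterpart in this camera -> identity undetermined
        matches.append((cam_idx, best_id))
    return matches

if __name__ == "__main__":
    # Assumed example: two other cameras each already track two targets.
    cams = [{"a1": (1.0, 1.1), "a2": (4.0, 4.2)},
            {"b7": (0.9, 1.0), "b9": (5.1, 3.8)}]
    print(match_new_target((1.0, 1.0), cams, max_dist=0.5))   # -> [(0, 'a1'), (1, 'b7')]
    print(match_new_target((9.0, 9.0), cams, max_dist=0.5))   # -> None
```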
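Claims 4 and 5 specify the inputs of the per-camera spatio-temporal graph network: an N x 8 node-feature matrix per frame (predicted and observed position and velocity for each of the N targets, stacked over a time window of length T) and a symmetric N x N adjacency whose edge weights are the Euclidean distances between observed positions. The sketch below only builds those tensors with NumPy under that reading; the graph network itself is not shown, and the helper names and random example data are assumptions.

```python
import numpy as np

def build_node_features(pred, obs):
    """Per-frame node feature matrix of shape (N, 8) as in claim 4.

    pred : (N, 4) array of predicted (x, y, dx, dy) from the previous time step
    obs  : (N, 4) array of observed  (x, y, dx, dy) at the current time step
    """
    # columns: x^P, y^P, dx^P, dy^P, x^S, y^S, dx^S, dy^S
    return np.concatenate([pred, obs], axis=1)

def build_adjacency(obs_xy):
    """Symmetric (N, N) adjacency whose edge weights are pairwise Euclidean
    distances between observed target positions, as in claim 5."""
    diff = obs_xy[:, None, :] - obs_xy[None, :, :]    # (N, N, 2) pairwise differences
    return np.linalg.norm(diff, axis=-1)              # (N, N) distances, zero diagonal

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, T = 5, 10                                      # assumed: 5 targets, time window of 10
    pred = rng.normal(size=(N, 4))
    obs = rng.normal(size=(N, 4))
    feats_t = build_node_features(pred, obs)          # (5, 8) features for one time step
    adj_t = build_adjacency(obs[:, :2])               # (5, 5) adjacency for one time step
    # Stacking T such frames yields the (T, N, 8) input tensor described in claim 4.
    x = np.stack([feats_t] * T)
    print(x.shape, adj_t.shape)                       # (10, 5, 8) (5, 5)
```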
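Claim 6 fuses the per-subsystem predictions as a confidence-weighted average: each target's K predictions are weighted by its occlusion-derived confidence in the corresponding camera view and normalized by the sum of those confidences. A minimal NumPy sketch of that fusion rule; the array layout and the example numbers are assumptions.

```python
import numpy as np

def fuse_predictions(preds, confs, eps=1e-8):
    """Confidence-weighted multi-view fusion (claim 6 sketch).

    preds : (K, N, 4) predictions (x, y, dx, dy) for N targets from K subsystems
    confs : (K, N) per-target confidences computed from the occlusion degree
    Returns the fused (N, 4) prediction for the next time step.
    """
    preds = np.asarray(preds, dtype=float)
    confs = np.asarray(confs, dtype=float)
    weighted = (confs[:, :, None] * preds).sum(axis=0)      # sum_k c_k * P_k, per target
    total = confs.sum(axis=0)[:, None] + eps                 # sum_k c_k, per target
    return weighted / total

if __name__ == "__main__":
    # Assumed example: two cameras, two targets; camera 1 barely sees target 0.
    preds = np.array([[[1.0, 1.0, 0.1, 0.0], [5.0, 5.0, 0.0, 0.1]],
                      [[1.2, 0.8, 0.1, 0.0], [5.0, 5.2, 0.0, 0.1]]])
    confs = np.array([[0.9, 0.5],
                      [0.1, 0.5]])
    print(fuse_predictions(preds, confs))   # target 0 is dominated by camera 0's prediction
```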
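Claim 7 attenuates a prediction's confidence hierarchically according to which body parts (head, torso, left hand, right hand) are visible, each part contributing a weight between 0 and 1 gated by a boolean visibility flag. One plausible form of that attenuation is sketched below; the specific weight values, the choice of summing the visible parts' weights, and the scaling by a base confidence are assumptions rather than the exact formula in the patent.

```python
def attenuated_confidence(base_conf, visible, weights=None):
    """Hierarchically attenuate a prediction confidence by body-part visibility
    (claim 7 sketch).

    base_conf : confidence of the prediction before attenuation, in [0, 1]
    visible   : dict of boolean flags for 'head', 'torso', 'left_hand', 'right_hand'
    weights   : per-part weights in [0, 1]; the defaults below are assumed example values
    """
    if weights is None:
        weights = {"head": 0.3, "torso": 0.4, "left_hand": 0.15, "right_hand": 0.15}
    # Sum of the weights of the visible parts; fully visible targets keep base_conf.
    visibility_score = sum(weights[p] * float(bool(visible.get(p, False))) for p in weights)
    return base_conf * visibility_score

if __name__ == "__main__":
    # Target whose head and torso are visible but whose hands are occluded.
    print(attenuated_confidence(0.9, {"head": True, "torso": True,
                                      "left_hand": False, "right_hand": False}))  # ~0.63
```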
Description
Target tracking method, device, equipment and medium based on multi-view fusion

Technical Field

The application relates to the technical field of computer vision, in particular to a target tracking method, device, equipment and medium based on multi-view fusion.

Background

In general, prior-art schemes are limited by a single camera view angle and cannot fully cover the targets to be tracked in the scene. At the same time, prior-art schemes often encounter the two-dimensional occlusion problem when tracking targets, that is, tracking IDs are lost, jump, or are wrongly exchanged because targets occlude one another, or are occluded by obstacles, in the two-dimensional image.

Disclosure of Invention

Based on the above, a target tracking method, device, equipment and medium based on multi-view fusion are provided, which are used to solve the technical problems that prior-art schemes cannot well cover all targets to be tracked in a scene, and that tracking IDs of targets are lost, jump, or are wrongly exchanged due to the two-dimensional occlusion problem. In one aspect, a target tracking method based on multi-view fusion is provided, the method comprising: establishing a global coordinate system for the two-dimensional plane where the application scene is located, establishing a sub-coordinate system for each depth camera that acquires two-dimensional images and three-dimensional point clouds in the application scene, and converting target position coordinates under each sub-coordinate system into the global coordinate system; detecting human targets according to the two-dimensional images and three-dimensional point clouds acquired by each depth camera, and binding a global identity tag to the position coordinates of each human target in the global coordinate system; for the human target corresponding to each global identity tag, determining the observation input result and occlusion degree of each human target under the camera view angle of each depth camera at the current moment according to the two-dimensional images and three-dimensional point clouds acquired by the multiple depth cameras; processing the observation input results of each human target under the camera view angle of each depth camera with a spatio-temporal graph neural network to obtain a prediction result for the next moment, determining the confidence of the prediction result according to the occlusion degree corresponding to the observation input result, and performing multi-view fusion on the prediction results to obtain a fused prediction output result; and integrating the prediction output result corresponding to each human target with the observation input result, and back-deriving the visibility of each body part of the human target in combination with the confidence of the prediction result corresponding to each depth camera, so as to realize global tracking of each human target.
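As a simplified illustration of how the confidence-weighted fusion keeps a global track alive through the occlusions described above: when one view loses the target, the remaining views dominate the fusion, and when every view loses it, the last fused state can be propagated so the global identity is not dropped. The sketch below uses a constant-velocity fallback purely for illustration; in the method of this application the propagated state would come from the spatio-temporal graph network's prediction, and all names and numbers here are assumptions.

```python
import numpy as np

def update_track(prev_state, observations, confidences):
    """Keep a target's global track alive across occlusion (illustrative sketch).

    prev_state   : last fused (x, y, dx, dy) of the target
    observations : per-camera observed (x, y, dx, dy), or None where the view has lost it
    confidences  : per-camera occlusion-derived confidences
    """
    prev_state = np.asarray(prev_state, dtype=float)
    seen = [(np.asarray(o, dtype=float), c)
            for o, c in zip(observations, confidences) if o is not None]
    total = sum(c for _, c in seen)
    if total == 0.0:
        # No usable view: extrapolate the fused state so the identity is preserved.
        x, y, dx, dy = prev_state
        return np.array([x + dx, y + dy, dx, dy])
    # Otherwise fuse the remaining views with their confidences (same idea as claim 6).
    return sum(c * o for o, c in seen) / total

if __name__ == "__main__":
    # Target occluded in camera 1; only camera 0 still observes it.
    print(update_track([2.0, 3.0, 0.2, 0.0],
                       [[2.1, 3.0, 0.2, 0.0], None],
                       [0.7, 0.0]))
    # Fully occluded in every view: the track is propagated, not dropped.
    print(update_track([2.0, 3.0, 0.2, 0.0], [None, None], [0.0, 0.0]))
```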
In one embodiment, establishing the global coordinate system on the two-dimensional plane where the application scene is located, establishing a sub-coordinate system for each depth camera that acquires the two-dimensional images and three-dimensional point clouds in the application scene, and converting the target position coordinates under each sub-coordinate system into the global coordinate system includes: establishing a global coordinate system for the two-dimensional plane where the application scene is located, determining the origin and the X axis of the global coordinate system, and rotating the X axis by 90 degrees counter-clockwise in the world coordinate plane about the origin of the global coordinate system to obtain the Y axis; acquiring the position and orientation of each depth camera that acquires two-dimensional images and three-dimensional point clouds in the application scene, determining its view angle from the orientation of the depth camera, and establishing in the global coordinate system a sub-coordinate system corresponding to the view angle of each depth camera; when converting target position coordinates from a sub-coordinate system into the global coordinate system, taking the coordinates of the origin of the sub-coordinate system in the global coordinate system as O = (x_O, y_O), wherein x_O is the abscissa and y_O is the ordinate of the origin of the sub-coordinate system in the global coordinate system, and taking the unit direction vector of the positive X axis of the sub-coordinate system in the global coordinate system as e_x = (x_X, y_X), wherein x_X is the abscissa of the unit direction vector corresponding to the positive X axis of the sub-coordinate system in the global coordinate system