CN-122023458-A - Multi-view joint tracking method for complex target in industrial production scene
Abstract
The invention relates to a multi-view joint tracking method for complex targets in an industrial production scene, comprising the following steps: S1, calibrating the internal and external parameters of all cameras and constructing a physical-space 3D model of the target production area to form a factory coordinate system; S2, collecting multi-view image frames through the calibrated cameras and preprocessing them; S3, detecting target positions in each frame with the deep learning detection model Detectron, carrying out monocular inter-frame association on the detected targets to obtain local tracks, and extracting a depth feature vector for each target; S4, when the same target appears in a plurality of view angles, fusing its space coordinates to obtain accurate 3D positioning, yielding a physical 3D space coordinate candidate set for each actual target; S5, constructing a matching candidate set from the physical 3D space coordinate candidate set and adopting a joint matching algorithm to combine tracks, thereby obtaining a complete 3D track. The invention realizes efficient, multi-view collaborative and precise tracking of multiple targets in a complex industrial scene.
Inventors
- XU CHAO
- CHEN JING
- BAI HAI
- JIANG GELI
- XIE HUIHUI
- LI TAORUI
- ZHU TONG
- LIU YINGTONG
- LI BAILONG
- LI MANTING
Assignees
- 国网湖北省电力有限公司超高压公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260120
Claims (10)
- 1. The multi-view joint tracking method of a complex target in an industrial production scene is characterized by comprising the following steps: S1, performing internal parameter and external parameter calibration on all cameras, constructing a physical-space 3D model of the target production area, forming a factory coordinate system, and acquiring the geometric mapping parameters between each camera and the physical space; S2, acquiring multi-view image frames through the calibrated cameras, obtaining multi-view synchronous original image frames with time stamps, and preprocessing them to obtain preprocessed image frames; S3, detecting the target positions in each frame by using the deep learning detection model Detectron according to the preprocessed image frames, carrying out monocular inter-frame association on the detected targets to obtain local tracks, extracting a depth feature vector for each target, and obtaining the 2D detection frames, local 2D track IDs and appearance features of all targets in the current frame under each camera; S4, back-projecting the 2D detection frame of each target in the current frame into the factory physical 3D space, and, when the same target appears in a plurality of view angles, fusing the space coordinates to obtain accurate 3D positioning, yielding a physical 3D space coordinate candidate set for each actual target; S5, according to the physical 3D space coordinate candidate set, performing multi-mode fusion using space distance, time consistency and appearance features, constructing a matching candidate set, adopting a joint matching algorithm to realize global ID assignment of targets across different cameras and different time slices, and combining tracks to obtain a complete 3D track.
- 2. The multi-view joint tracking method of complex targets in an industrial production scene according to claim 1, characterized in that the internal and external parameters of all cameras are calibrated as follows: camera internal reference calibration adopts the Zhang Zhengyou calibration method, in which a plurality of checkerboard images at different angles are shot by the same camera; the camera model is: s·[u, v, 1]^T = K·[R | t]·[X_w, Y_w, Z_w, 1]^T, with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]; wherein K is the internal reference matrix; f_x, f_y are the focal lengths, c_x and c_y are the principal point coordinates, the rotation matrix R and the translation vector t are the external parameters, and (X_w, Y_w, Z_w) are the three-dimensional point coordinates under the world coordinate system; distortion model: x_d = x·(1 + k_1·r^2 + k_2·r^4 + k_3·r^6) + 2p_1·x·y + p_2·(r^2 + 2x^2), y_d = y·(1 + k_1·r^2 + k_2·r^4 + k_3·r^6) + p_1·(r^2 + 2y^2) + 2p_2·x·y, with r^2 = x^2 + y^2; wherein k_1, k_2, k_3 are radial distortion coefficients, and p_1, p_2 are tangential distortion coefficients; a checkerboard calibration plate with a known size is fixed in the factory scene, ensuring that the cameras can observe it simultaneously, and the external parameters are solved from the 3D world coordinates of the calibration plate corner points and the corresponding 2D pixel coordinates by a PnP algorithm: min_{R,t} Σ_i || p_i − π(K·(R·P_i + t)) ||^2; wherein P_i is the 3D world coordinate of a calibration plate corner point, p_i is the corresponding 2D pixel coordinate, and π(·) is the projection function; the extrinsic matrices R and t of each camera are calculated using the solvePnP function of OpenCV.
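The pinhole-plus-distortion camera model of claim 2 can be sketched in a few lines of numpy. This is a minimal illustration, not the patented implementation: the function name `project_point` is chosen here for clarity, and in practice the calibration itself would be done with OpenCV's `calibrateCamera` and `solvePnP` as the claim states.

```python
import numpy as np

def project_point(K, R, t, P_w, dist=(0.0, 0.0, 0.0, 0.0, 0.0)):
    """Project a 3D world point to pixel coordinates with the pinhole model
    s*[u, v, 1]^T = K [R | t] [X_w, Y_w, Z_w, 1]^T, plus the radial
    (k1, k2, k3) and tangential (p1, p2) distortion model of the claim."""
    k1, k2, k3, p1, p2 = dist
    P_c = R @ P_w + t                        # world -> camera coordinates
    x, y = P_c[0] / P_c[2], P_c[1] / P_c[2]  # normalized image plane
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    u = K[0, 0] * x_d + K[0, 2]              # f_x * x_d + c_x
    v = K[1, 1] * y_d + K[1, 2]              # f_y * y_d + c_y
    return np.array([u, v])
```

With zero distortion, a point on the optical axis lands exactly on the principal point (c_x, c_y), which gives a quick sanity check of any calibration setup.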
- 3. The multi-view joint tracking method for complex targets in an industrial production scene according to claim 2, wherein constructing the physical-space 3D model of the target production area to form a factory coordinate system specifically comprises the following steps: acquiring a 3D point cloud model of the scene through laser radar scanning, wherein the world coordinate system of the calibration plate is aligned with the global coordinate system of the factory 3D model; if the pose of the calibration plate in the factory coordinate system is [R_b, t_b], the camera external parameters are converted into: R_factory = R·R_b^T, t_factory = t − R·R_b^T·t_b; the projection matrix of each camera in the global coordinate system of the plant is P = K·[R_factory | t_factory]; each camera independently collects images, detects the pixel position of the same physical-space target on its local image, and obtains the 3D space coordinates of the observed object point through back projection using its internal and external parameters; the 3D coordinates calculated by each camera are compared with the real known coordinates, or the 3D coordinates of the target calculated by a plurality of cameras are compared with each other, to check whether the error is within the engineering-allowable range; if significant drift or stacking errors are found, the calibration flow, distortion correction or mechanical adjustment precision of each camera is checked.
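The conversion formula itself was dropped from the published text, so the sketch below assumes the usual convention: the calibrated extrinsics (R, t) map board coordinates to camera coordinates, and [R_b, t_b] maps board coordinates to factory coordinates. Under those assumptions the chained transform is R_factory = R·R_b^T and t_factory = t − R·R_b^T·t_b; the helper name `to_factory_frame` is illustrative.

```python
import numpy as np

def to_factory_frame(R, t, R_b, t_b):
    """Re-express board-relative extrinsics (R, t) in the factory frame,
    given the board pose [R_b, t_b] in that frame. Derivation:
    X_cam = R X_board + t and X_board = R_b^T (X_factory - t_b), hence
    R_factory = R R_b^T and t_factory = t - R R_b^T t_b."""
    R_f = R @ R_b.T
    t_f = t - R_f @ t_b
    return R_f, t_f
```

A quick consistency check, matching the claim's verification step, is to transform a known factory-frame point through both the chained and the converted extrinsics and confirm the camera-frame coordinates agree.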
- 4. The multi-view joint tracking method of complex targets in an industrial production scene according to claim 1, characterized in that multi-view image frames are acquired through the calibrated cameras, multi-view synchronous original image frames with time stamps are obtained, and preprocessing is carried out, specifically: a trigger signal line connects all cameras to achieve hardware-level synchronous exposure; all cameras synchronize SDK time through the same upper computer, which uniformly initiates the acquisition instruction; a time stamp is automatically added to each acquired image frame, and at each moment the original frames of all cameras are obtained; the preprocessing comprises gray-level correction, filtering and contrast enhancement: for gray-level correction, the mean values (μ_R, μ_G, μ_B) of the three channels R, G, B of the original image are calculated, the offset of each channel is corrected to the overall average level M, and a nonlinear transformation is applied to the corrected pixel values to adjust the brightness distribution of the overall image; filtering is applied to the brightness-adjusted image; global or local contrast stretching is applied to the filtered image, and adaptive enhancement divides the image into small areas whose contrast is increased separately.
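The gray-level correction step of claim 4 (shifting each channel mean toward the overall average M) can be sketched directly in numpy; the exact offset rule is an assumption, since the claim names the quantities (μ_R, μ_G, μ_B) and M but not the formula, and the subsequent nonlinear transform, filtering and adaptive contrast stages are omitted here.

```python
import numpy as np

def gray_level_correction(img):
    """Shift each RGB channel mean (mu_R, mu_G, mu_B) to the overall
    average level M = (mu_R + mu_G + mu_B) / 3, per claim 4. Assumes an
    additive per-channel offset; other offset rules are possible."""
    img = img.astype(np.float64)
    mu = img.reshape(-1, 3).mean(axis=0)   # per-channel means
    M = mu.mean()                          # overall average level M
    out = img + (M - mu)                   # offset each channel toward M
    return np.clip(out, 0.0, 255.0)        # keep valid pixel range
```

After correction (and absent clipping), all three channel means coincide at M, which removes simple color casts before the contrast-enhancement stages.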
- 5. The multi-view joint tracking method of complex targets in an industrial production scene according to claim 1, characterized in that, according to the preprocessed image frames, a deep learning detection model Detectron is used to detect the target position in each frame, monocular inter-frame association is performed on the detected targets to obtain local tracks, a depth feature vector is extracted for each target, and the 2D detection frames, local 2D track IDs and appearance features of all targets in the current frame under each camera are obtained, specifically as follows: multi-target detection is carried out on each preprocessed frame by YOLOv, outputting a 2D bounding box, category label and confidence score for each target; a target area is cut out for each 2D detection frame and sent into an appearance-feature-extraction depth network to obtain a Re-ID feature vector; continuous-frame detection results of the same camera are tracked per target based on motion and appearance, BYTETrack is adopted to realize inter-frame target association, and unique local 2D track IDs are assigned based on IoU overlap and appearance feature distance; the output of each camera at each frame is a target set: T = {(b_i, s_i, ID_i, f_i)}, i = 1, ..., N; wherein b_i is the two-dimensional detection frame of the i-th target, s_i is the confidence score, ID_i is the local track ID of the camera at the current moment, and f_i is the target's appearance Re-ID feature.
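The per-camera, per-frame target set (b_i, s_i, ID_i, f_i) of claim 5 maps naturally onto a small record type. The class and helper names below are illustrative; the cosine distance on Re-ID features is one common choice for the "appearance feature distance" the claim invokes, not necessarily the patented one.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    """One element of the per-camera, per-frame target set
    {(b_i, s_i, ID_i, f_i)} from claim 5."""
    box: tuple           # 2D detection frame b_i as (x1, y1, x2, y2)
    score: float         # confidence score s_i
    track_id: int        # local 2D track ID_i for this camera
    feature: np.ndarray  # appearance Re-ID feature vector f_i

def appearance_distance(f1, f2):
    """Cosine distance between Re-ID feature vectors; small values mean
    the two detections likely show the same object."""
    return 1.0 - float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))
```

Identical feature vectors yield distance 0, so a threshold on this quantity can gate the appearance-based association used in claims 7 and 9.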
- 6. The multi-view joint tracking method for complex objects in an industrial production scene according to claim 5, wherein multi-object detection is performed on each preprocessed frame by YOLOv and the 2D bounding box, class label and confidence score of each object are output, specifically as follows: normalization and size transformation are performed on each preprocessed image I_in(x, y, c): x' = x·W_model/W_raw, y' = y·H_model/H_raw, with pixel values normalized to [0, 1]; wherein W_raw, H_raw are the original image width and height, and W_model, H_model are the input width and height required by YOLOv; the processed image is input into the trained YOLOv model, which extracts multi-level image features with a deep convolutional network and predicts the position and category information of all targets at different spatial scales; for each feature point YOLOv outputs O = (x_norm, y_norm, w_norm, h_norm, s, p_1, ..., p_k, ..., p_C); wherein (x_norm, y_norm) are the relative coordinates of the bounding box center, (w_norm, h_norm) are the bounding box width and height, s is the confidence score, and p_k is the class probability of the k-th class; the relative center coordinates are restored to actual pixel coordinates in the original image, only detections with sufficiently high confidence are retained by setting a confidence threshold, then highly overlapping detection frames are eliminated with a non-maximum suppression algorithm, keeping only the highest-scoring one in each similar area; finally the two-dimensional frame positions, category labels and confidence scores of all targets per frame are returned.
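The post-processing chain of claim 6 (restore normalized boxes to pixels, apply a confidence threshold, then non-maximum suppression) is sketched below with numpy only. Function names and default thresholds are illustrative assumptions; a real detector's decoder also handles class probabilities, which are omitted here.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def decode_and_nms(preds, W_raw, H_raw, conf_thr=0.5, iou_thr=0.5):
    """Map normalized (x, y, w, h) centers back to original-image pixels,
    drop low-confidence detections, then greedily suppress overlaps.
    Each row of `preds` is (x_norm, y_norm, w_norm, h_norm, score)."""
    boxes, scores = [], []
    for x, y, w, h, s in preds:
        if s < conf_thr:
            continue                         # confidence threshold
        cx, cy = x * W_raw, y * H_raw        # restore to pixel coordinates
        bw, bh = w * W_raw, h * H_raw
        boxes.append([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
        scores.append(s)
    boxes = np.array(boxes)
    keep, order = [], np.argsort(scores)[::-1]   # highest score first
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        if rest.size:
            mask = np.array([iou(boxes[i], boxes[j]) < iou_thr for j in rest], bool)
            order = rest[mask]               # drop boxes overlapping the winner
        else:
            order = rest
    return boxes[keep], [scores[i] for i in keep]
```

On a frame with two coincident high-confidence boxes, one isolated box and one low-confidence box, this keeps exactly two detections, as the claim's "highest-scoring in the similar area" rule requires.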
- 7. The multi-view joint tracking method for complex targets in industrial production scenes according to claim 5, wherein continuous-frame detection results of the same camera are tracked per target based on motion and appearance, inter-frame target association is achieved by BYTETrack, and unique local 2D track IDs are allocated based on IoU overlap and appearance feature distance: according to the two-dimensional frame, category and confidence score of each frame and the Re-ID feature vector extracted for each detected target, BYTETrack divides the detection results into high-confidence candidates HSD and low-confidence candidates LSD, wherein the HSD update active tracks and the LSD can repair track interruptions caused by missed detections; Kalman filtering predicts the state of all tracked targets, giving their possible positions in the current frame; two-stage data association is adopted: first, Hungarian matching between the HSD and the tracks by IoU overlap; then all unassigned detections and tracks are supplementarily matched using the appearance feature distance; if a track can be associated with a new detection, its original ID is maintained and its state and appearance features are updated; unique local 2D track IDs are assigned to all tracked targets of each frame, and the 2D detection frame, class, confidence score, track ID and appearance features are output.
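The two-stage association of claim 7 can be illustrated compactly. This sketch substitutes greedy best-IoU matching where BYTETrack uses Hungarian matching, and omits the Kalman prediction and appearance stage; `associate` and its thresholds are illustrative names, not the patented parameters.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, dets, scores, high_thr=0.6, iou_thr=0.3):
    """Two-stage association in the spirit of BYTETrack: high-score
    detections (HSD) claim tracks first; low-score detections (LSD) may
    then recover tracks the first stage left unmatched."""
    high = [i for i, s in enumerate(scores) if s >= high_thr]
    low = [i for i, s in enumerate(scores) if s < high_thr]
    matches, free_tracks = [], list(range(len(tracks)))
    for stage in (high, low):            # stage 1: HSD, stage 2: LSD
        for d in stage:
            best, best_iou = None, iou_thr
            for t in free_tracks:
                v = iou(tracks[t], dets[d])
                if v > best_iou:
                    best, best_iou = t, v
            if best is not None:
                matches.append((best, d))
                free_tracks.remove(best)  # each track matched at most once
    return matches, free_tracks
```

A low-confidence detection that still overlaps a predicted track box keeps that track alive, which is exactly the missed-detection repair the claim attributes to the LSD stage.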
- 8. The multi-view joint tracking method of complex targets in an industrial production scene according to claim 1, wherein the 2D detection frame of each target in the current frame is back-projected into the factory physical 3D space, and when the same target appears in multiple view angles its space coordinates are fused to obtain accurate 3D positioning, yielding a physical 3D space coordinate candidate set for each actual target, specifically as follows: on each camera image, the central point of each target's 2D detection frame is back-projected, using the camera's internal reference matrix K and external references (R, t), so that back projection yields a ray in the factory physical space rather than a single point; based on Re-ID feature vectors and spatial constraints, 2D detections of the same physical object under different cameras are matched to obtain ray groups; the same object generates in the j-th camera the spatial ray L_j: L_j(λ) = C_j + λ·d_j, λ ≥ 0; wherein C_j is the optical center of the j-th camera in the world coordinate system and d_j is the ray direction vector; the rays of the same target under different camera angles are used to compute their closest point in space, which serves as the three-dimensional localization of the measured object: if the candidate point lies on N camera rays, the closest point minimizes the following loss: Q_3D = argmin_Q Σ_{j=1..N} || (I − d_j·d_j^T)·(Q − C_j) ||^2; wherein Q_3D is the space 3D point coordinate to be solved, N is the number of rays, C_j is the three-dimensional coordinate of the optical center of the j-th camera in the world coordinate system, and d_j is the unit direction vector of the j-th ray; the physical 3D space coordinate candidate set is thus obtained.
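The closest-point loss of claim 8 has a closed-form solution: setting the gradient of Σ_j ||(I − d_j d_j^T)(Q − C_j)||^2 to zero yields a 3×3 linear system. A minimal numpy sketch (function name illustrative):

```python
import numpy as np

def triangulate_rays(centers, dirs):
    """Least-squares closest point Q_3D to N rays L_j(lambda) = C_j + lambda d_j,
    minimizing sum_j || (I - d_j d_j^T)(Q - C_j) ||^2. The zero-gradient
    condition gives the linear system (sum_j M_j) Q = sum_j M_j C_j,
    with M_j the projector orthogonal to ray j."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for C, d in zip(centers, dirs):
        d = d / np.linalg.norm(d)          # unit direction vector d_j
        M = np.eye(3) - np.outer(d, d)     # projector orthogonal to the ray
        A += M
        b += M @ C
    return np.linalg.solve(A, b)           # requires non-parallel rays
```

If all N rays pass exactly through one physical point, the solver recovers that point; with noisy detections it returns the point minimizing the summed squared ray-to-point distances, which is the fused 3D localization of the claim.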
- 9. The multi-view joint tracking method of complex targets in an industrial production scene according to claim 1, wherein multi-mode fusion is performed on the physical 3D space coordinate candidate set using space distance, time consistency and appearance features to construct a matching candidate set, a joint matching algorithm realizes global ID allocation of targets across different cameras and different time slices, and the complete 3D track is obtained by combining tracks, specifically as follows: according to the physical 3D space coordinate candidate set, multi-mode features comprising the three dimensions of space distance, time interval and appearance are used to construct a pairwise matching candidate set over all physical target candidate points, judging whether a group of targets is generated by the same real object; a globally associated joint matching cost function integrating space, time and appearance information is defined, and a cost matrix C of paired candidates is constructed, in which each element is the matching cost Cost(A_a, B_b) between a target pair; the joint matching optimization problem is: min Σ_{a,b} Cost(A_a, B_b)·x_{ab}; wherein x_{ab} ∈ {0, 1} characterizes whether group a and group b are merged, subject to the constraint that each target candidate belongs to at most one global track; a cost graph is constructed, and globally optimal allocation is realized with the Hungarian algorithm; a globally unique target ID is assigned according to the matching result, all matched 3D track fragments are combined, and a complete global 3D target track list is output in which each physical target has a unique ID and a complete three-dimensional track across viewing angles and all moments.
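Claim 9's assignment problem can be illustrated on a toy cost matrix. For brevity this sketch brute-forces permutations instead of running the Hungarian algorithm the claim names (equivalent result on small square matrices); the cost entries would in practice be a weighted blend of spatial distance, time gap and appearance distance, with weights the claim does not specify. The function name and gating threshold are illustrative.

```python
import numpy as np
from itertools import permutations

def match_candidates(cost, max_cost=1.0):
    """Minimize sum_{a,b} Cost(A_a, B_b) * x_ab over one-to-one assignments
    x_ab in {0,1} (each candidate joins at most one global track).
    Brute force over permutations stands in for the Hungarian algorithm,
    which scales to realistic matrix sizes."""
    n = cost.shape[0]
    best, best_val = None, np.inf
    for perm in permutations(range(n)):     # all one-to-one assignments
        val = sum(cost[i, perm[i]] for i in range(n))
        if val < best_val:
            best, best_val = perm, val
    # gate out pairs whose cost exceeds the threshold (no forced merges)
    return [(a, b) for a, b in enumerate(best) if cost[a, b] <= max_cost]
```

Each returned pair then receives one global target ID, and the associated 3D track fragments are concatenated into the complete global track list.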
- 10. A multi-view joint tracking system for complex targets in an industrial production scene, characterized in that it comprises a processor, a memory and a computer program stored on the memory, wherein the processor, when executing the computer program, performs the steps of the multi-view joint tracking method for complex targets in an industrial production scene according to any one of claims 1-9.
Description
Multi-view joint tracking method for complex target in industrial production scene
Technical Field
The invention relates to the field of video processing, in particular to a multi-view joint tracking method for complex targets in an industrial production scene.
Background
Modern industrial production environments are increasingly complex, and demands for automation, precision and intelligence continue to rise. Against the background of intelligent manufacturing, Industry 4.0 and flexible production lines, real-time and accurate tracking of production objects (such as workpieces, AGVs, mechanical arms and personnel) in factory workshops has become a key basic technology for improving production efficiency, guaranteeing safe production and enabling intelligent decision-making. Traditional industrial target tracking methods depend mainly on a single sensor or a limited viewing angle and face the following challenges: equipment is dense, illumination is complex and occlusion is frequent in the factory environment, so targets are easily lost or confused under a single viewing angle; industrial objects are varied in type yet similar in appearance, so identification based on a single feature is difficult to keep stable; accurate three-dimensional spatial positioning is lacking, so high-precision navigation and fine operation cannot be supported; and a single system struggles to achieve full-flow global monitoring, causing data fragmentation.
Disclosure of Invention
In order to solve these problems, the invention provides a multi-view joint tracking method for complex targets in an industrial production scene, realizing efficient, multi-view collaborative and precise tracking of complex targets in a complex industrial scene.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a multi-view joint tracking method of complex targets in an industrial production scene comprises the following steps: S1, performing internal parameter and external parameter calibration on all cameras, constructing a physical-space 3D model of the target production area, forming a factory coordinate system, and acquiring the geometric mapping parameters between each camera and the physical space; S2, acquiring multi-view image frames through the calibrated cameras, obtaining multi-view synchronous original image frames with time stamps, and preprocessing them to obtain preprocessed image frames; S3, detecting the target positions in each frame by using the deep learning detection model Detectron according to the preprocessed image frames, carrying out monocular inter-frame association on the detected targets to obtain local tracks, extracting a depth feature vector for each target, and obtaining the 2D detection frames, local 2D track IDs and appearance features of all targets in the current frame under each camera; S4, back-projecting the 2D detection frame of each target in the current frame into the factory physical 3D space, and, when the same target appears in a plurality of view angles, fusing the space coordinates to obtain accurate 3D positioning, yielding a physical 3D space coordinate candidate set for each actual target; S5, according to the physical 3D space coordinate candidate set, performing multi-mode fusion using space distance, time consistency and appearance features, constructing a matching candidate set, adopting a joint matching algorithm to realize global ID assignment of targets across different cameras and different time slices, and combining tracks to obtain a complete 3D track.
Further, the internal reference and external reference calibration is performed on all cameras, specifically as follows: camera internal reference calibration adopts the Zhang Zhengyou calibration method, in which a plurality of checkerboard images at different angles are shot by the same camera; the camera model is: s·[u, v, 1]^T = K·[R | t]·[X_w, Y_w, Z_w, 1]^T, with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]; wherein K is the internal reference matrix; f_x, f_y are the focal lengths, c_x and c_y are the principal point coordinates, the rotation matrix R and the translation vector t are the external parameters, and (X_w, Y_w, Z_w) are the three-dimensional point coordinates under the world coordinate system; distortion model: x_d = x·(1 + k_1·r^2 + k_2·r^4 + k_3·r^6) + 2p_1·x·y + p_2·(r^2 + 2x^2), y_d = y·(1 + k_1·r^2 + k_2·r^4 + k_3·r^6) + p_1·(r^2 + 2y^2) + 2p_2·x·y, with r^2 = x^2 + y^2; wherein k_1, k_2, k_3 are radial distortion coefficients, and p_1, p_2 are tangential distortion coefficients; a checkerboard calibration plate with a known size is fixed in the factory scene, ensuring that the cameras can observe it simultaneously, and the external parameters are solved from the 3D world coordinates of the calibration plate corner points and the corresponding 2D pixel coordinates by a PnP algorithm: min_{R,t} Σ_i || p_i − π(K·(R·P_i + t)) ||^2; wherein P_i is the 3D world coordinate of a calibration plate corner point, p_i is the corresponding 2D pixel coordinate, and π(·) is the projection function; the extrinsic matrices R and t of each camera are calculated using the solvePnP function of OpenCV. Further, a phy