CN-121639739-B - Multi-target tracking method for complex scene
Abstract
The invention relates to the technical field of computer-vision image processing and discloses a multi-target tracking method for complex scenes. The method comprises: acquiring target image frames; constructing an HND-MOTR model; inputting the target image frames into the HND-MOTR model and training it with a hard negative denoising training paradigm to obtain a trained HND-MOTR model; and acquiring an image to be detected and inputting it into the trained HND-MOTR model to perform multi-target tracking and identification.
Inventors
- Cao Danyang
- Du Canxing
- ZHANG HAOYU
- GAO LEI
- LI JINHONG
- YANG JIAN
Assignees
- North China University of Technology (北方工业大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-05
Claims (7)
- 1. A multi-target tracking method for complex scenes, comprising the following steps:
  S1, acquiring target image frames;
  S2, constructing an HND-MOTR model;
  S3, inputting the target image frames into the HND-MOTR model and training it with a hard negative denoising (HND) training paradigm to generate a trained HND-MOTR model, which specifically comprises:
    passing the current image frame through a backbone network and a Transformer encoder for feature extraction to generate a multi-scale image feature map;
    inputting the current image frame into an HND query generation module to generate the HND queries of the current image frame, which specifically comprises: during training, constructing a plurality of denoising groups, each containing the same number of positive queries and negative queries, wherein the positive queries are generated by applying a small perturbation to the ground-truth box coordinates; then generating hard negative queries by K-NN random sampling, specifically, for each ground-truth box for which a hard negative query is generated, calculating the Euclidean distance between its center coordinates and those of the remaining ground-truth boxes of the same image frame to generate the hard negative query; and meanwhile injecting standard noise into the positive queries, maximum noise into the perturbed false-positive queries, and minimum noise into the hard negative queries;
    inputting the current image frame into a proposal query generation module to generate the proposal queries of the current image frame;
    inputting the previous image frame into a temporal interaction module to generate the tracking queries of the previous image frame;
    using a learnable detection query generation module to perform random initialization according to the number of detection queries and the feature dimension, generating the learnable detection queries;
    concatenating the HND queries, the proposal queries, the tracking queries and the learnable detection queries to generate a joint query set;
    inputting the joint query set and the multi-scale image feature map into a Transformer decoder for decoding, in which the queries interact through multiple layers of self-attention and cross-attention, and outputting a decoding result containing bounding boxes and identity embeddings;
    inputting the decoding result containing bounding boxes and identity embeddings into the temporal interaction module for state update, generating the tracking queries of the next image frame so as to update the tracking queries of the current image frame;
    finally, inputting the decoding result containing bounding boxes and identity embeddings into a prediction head for prediction, generating the target IDs, bounding-box positions, sizes and confidences of the current image frame, and completing the training of the HND-MOTR model to obtain the trained HND-MOTR model;
  S4, acquiring an image to be detected, and inputting the image to be detected into the trained HND-MOTR model to perform multi-target tracking and identification.
- 2. The multi-target tracking method for complex scenes according to claim 1, wherein the HND-MOTR model comprises a feature extraction module, a Transformer decoder, a temporal interaction module and a prediction head; the feature extraction module comprises a backbone network and a Transformer encoder.
- 3. The multi-target tracking method for complex scenes according to claim 2, wherein the HND-MOTR model further comprises a plurality of query generation modules, including a proposal query generation module and a learnable detection query generation module.
- 4. The multi-target tracking method for complex scenes according to claim 3, wherein, when training with the hard negative denoising training paradigm, the plurality of query generation modules of the HND-MOTR model further comprises an HND query generation module.
- 5. The multi-target tracking method for complex scenes according to claim 1, wherein inputting the decoding result containing bounding boxes and identity embeddings into the temporal interaction module for state update, and generating the tracking queries of the next image frame so as to update the tracking queries of the current image frame, comprises: concatenating the decoding result containing bounding boxes and identity embeddings with the tracking queries of the previous image frame to obtain the short-term memory of the current image frame, while taking the track state of the current image frame as the long-term memory; using the short-term memory of the current image frame as the query vector, the decoding result as the value vector, and the track state of the current image frame as the key vector; inputting the query vector, the value vector and the key vector into an attention layer, calculating the correlation weights among them, and performing weighted fusion to generate fused features; summing the fused features and the long-term memory of the current image frame element by element, and inputting the sum into a fully connected layer for feature integration to generate the tracking queries of the current image frame; and smoothing the long-term memory of the current image frame with an exponential moving average to generate the long-term memory of the next image frame and obtain the tracking queries of the next image frame, thereby updating the tracking queries of the current image frame.
- 6. The multi-target tracking method for complex scenes according to claim 5, wherein the exponential moving average formula used to smooth the long-term memory of the current image frame, thereby generating the long-term memory of the next image frame and obtaining the tracking queries of the next image frame, is expressed in terms of the following quantities: the long-term memory of the next image frame; the smoothing coefficient; the long-term memory of the current image frame; and the decoding result of the current image frame output by the Transformer decoder, which contains the bounding boxes and identity embeddings.
- 7. The multi-target tracking method for complex scenes according to claim 1, wherein the formula for calculating the Euclidean distance between the center coordinates of a ground-truth box and those of the remaining ground-truth boxes of the same image frame is: $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \quad \text{s.t. } i \neq j,\ i, j \in \{1, \dots, N\}$; wherein $d_{ij}$ represents the Euclidean distance between the center coordinates of ground-truth box $b_i$ and ground-truth box $b_j$; $x_i$ and $y_i$ respectively represent the horizontal and vertical coordinate values of the center of ground-truth box $b_i$; $x_j$ and $y_j$ respectively represent the horizontal and vertical coordinate values of the center of ground-truth box $b_j$; s.t. denotes the constraint condition; and $N$ represents the total number of ground-truth boxes.
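The denoising-group construction in claim 1 and the center distance in claim 7 can be illustrated with a minimal sketch. The claims do not spell out how a hard negative query is derived from the nearest neighbours, so the code below makes several assumptions: boxes are `(cx, cy, w, h)` tuples, a hard negative for a target is a lightly perturbed copy of one randomly chosen box among its K nearest neighbours, a false-positive query is a heavily perturbed copy of the target itself, and the three noise scales are hypothetical values chosen only to respect the ordering minimum < standard < maximum. All function names are illustrative and do not come from the patent.

```python
import math
import random
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h): assumed box layout


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes (claim 7)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def k_nearest(boxes: List[Box], i: int, k: int) -> List[int]:
    """Indices of the k boxes whose centers are closest to box i, excluding i."""
    others = [j for j in range(len(boxes)) if j != i]
    others.sort(key=lambda j: center_distance(boxes[i], boxes[j]))
    return others[:k]


def jitter(box: Box, scale: float, rng: random.Random) -> Box:
    """Perturb a box with uniform noise proportional to its width and height."""
    cx, cy, w, h = box
    return (
        cx + rng.uniform(-scale, scale) * w,
        cy + rng.uniform(-scale, scale) * h,
        w * (1.0 + rng.uniform(-scale, scale)),
        h * (1.0 + rng.uniform(-scale, scale)),
    )


def build_denoising_group(
    boxes: List[Box], k: int = 2, seed: int = 0
) -> Dict[str, List[Box]]:
    """Build one denoising group with three query types (assumed scheme)."""
    # Hypothetical noise scales: minimum < standard < maximum.
    scales = {"hard_negative": 0.05, "positive": 0.1, "false_positive": 0.4}
    rng = random.Random(seed)
    group: Dict[str, List[Box]] = {
        "positive": [],
        "hard_negative": [],
        "false_positive": [],
    }
    for i, box in enumerate(boxes):
        group["positive"].append(jitter(box, scales["positive"], rng))
        group["false_positive"].append(jitter(box, scales["false_positive"], rng))
        neighbours = k_nearest(boxes, i, k)
        if neighbours:  # a frame with a single box has no neighbour to sample
            j = rng.choice(neighbours)  # K-NN random sampling
            group["hard_negative"].append(
                jitter(boxes[j], scales["hard_negative"], rng)
            )
    return group
```

Calling `build_denoising_group(boxes, k=2)` on the ground-truth boxes of one frame returns the three query lists for a single group; repeating the call with different seeds would give the plurality of groups the claim refers to. In this sketch a frame with only one box yields no hard negatives, so balancing positive and negative counts in that case is left open.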
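The exponential moving average in claims 5 and 6 can likewise be sketched in a few lines. The formula itself did not survive in this text, so the convention used here, new memory = (1 - λ) · current memory + λ · decoding result, is an assumption; the patent may weight the two terms the other way round. Memories are modelled as plain lists of floats, and the names `ema_update` and `run_memory` are hypothetical.

```python
from typing import List


def ema_update(
    long_memory: List[float], decoded: List[float], lam: float
) -> List[float]:
    """One EMA step under the assumed convention:
    next_memory = (1 - lam) * long_memory + lam * decoded."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("smoothing coefficient must lie in [0, 1]")
    if len(long_memory) != len(decoded):
        raise ValueError("memory and decoding result must have the same length")
    return [(1.0 - lam) * m + lam * o for m, o in zip(long_memory, decoded)]


def run_memory(
    initial: List[float], decoded_per_frame: List[List[float]], lam: float
) -> List[List[float]]:
    """Propagate the long-term memory across frames and return every state."""
    memory = list(initial)
    history = [memory]
    for decoded in decoded_per_frame:
        memory = ema_update(memory, decoded, lam)
        history.append(memory)
    return history
```

With a small smoothing coefficient the long-term memory changes slowly, so a few frames of unreliable decoder output (for example during occlusion) only partly overwrite the stored track state; a coefficient near 1 makes the memory follow the latest decoding result almost directly.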
Description
Multi-target tracking method for complex scenes
Technical Field
The invention relates to the technical field of computer-vision image processing, in particular to a multi-target tracking method for complex scenes.
Background
Multi-object tracking (MOT) is a fundamental task in computer vision that aims to detect objects of interest and associate them continuously across video frames. It plays a key role in a wide range of practical applications such as autonomous driving, video surveillance, smart retail and sports analytics. Although the traditional tracking-by-detection paradigm has made significant progress, its robustness is often limited under challenging conditions such as occlusion and frequent target entry and exit, mainly because the detection module and the association module have inconsistent feature requirements.
In recent years, query-based end-to-end tracking has become the mainstream approach. This paradigm unifies object detection and temporal association in a single network and models them jointly, thereby improving overall optimization efficiency and performance. Existing methods under this framework fall broadly into two categories.
The first category, represented by MeMOTR, excels at long-term temporal modeling. It introduces memory-augmented track queries to support explicit cross-frame propagation and thereby handles occlusion effectively. However, such methods typically rely on static query initialization, adapt poorly to newly appearing targets, and are sensitive to the initial localization, which often leads to tracking failures.
The second category, exemplified by MOTRv2, adopts a dynamic, proposal-driven query generation strategy. By using a powerful external detector (e.g., YOLOX), it improves robustness in target detection and initialization.
However, it lacks a structured long-term memory mechanism and relies instead on short-term propagation between adjacent frames, which makes it prone to identity switches under long-term occlusion or rapid motion.
Although these prior-art end-to-end multi-target tracking methods have made remarkable progress, they still have the following problems when pursuing the joint goal of high-quality target detection and robust long-term temporal association:
(1) Target initialization and recall are limited. Temporally focused methods rely on static learnable queries to discover new targets. This mechanism was originally designed so that the model can maintain stable long-term memory, but because the queries are insufficiently coupled with the image information of the current frame, the detection recall and initialization accuracy for new targets are relatively low.
(2) Association relies mainly on short-term autoregressive propagation between adjacent frames. With this design the model cannot make effective use of historical information to recover a track when facing long-term occlusion, abrupt changes in target appearance, or rapidly crossing motion trajectories. Identity switches and track fragmentation therefore occur easily, which damages the long-term robustness of tracking.
(3) Feature discrimination is insufficient under existing training paradigms. Existing end-to-end trackers (including both categories above) are trained mainly with standard detection losses and basic denoising tasks. These training objectives focus on localization accuracy and coarse classification but are inherently insensitive to differences between instances. In crowded, high-similarity scenes, such as groups of dancers or athletes, the visual differences between instances are subtle, and the lack of a dedicated discriminative constraint leaves the learned feature representation unable to distinguish individuals that are spatially adjacent or similar in appearance. This becomes the main bottleneck for identity preservation in high-density tracking scenes.
Disclosure of Invention
To address the above deficiencies of the prior art, the invention provides a multi-target tracking method for complex scenes, which solves the problems of insufficient feature discrimination and insufficient tracking robustness in complex scenes found in existing multi-target tracking methods.
To achieve this aim, the invention adopts the following technical scheme: a multi-target tracking method for complex scenes, comprising the following steps: S1, acquiring target image frames; S2, constructing an HND-MOTR model; S3, inputting target image frames with ground-truth labels into the HND-MOTR model and training it with a hard negative denoising training paradigm to generate a trained HND-MOTR model; S4, acquiring an image to be detected, and inputting the image to be detected into a trai