CN-121999496-A - Automatic image pre-labeling method, device and medium for urban road scene

Abstract

The invention discloses an automatic image pre-labeling method, device and medium for urban road scenes. Several complementary large pre-labeling models work together to automatically identify and label targets and their structural information in complex road scenes. The pre-labeling results are complete and require little manual modification, which greatly improves the efficiency of automatic pre-labeling of urban road scene image data and provides high-quality, low-cost and stable long-term data production support for the subsequent training of automatic driving perception algorithms.

Inventors

Tian Maoshuai
Wang Jie
Yu Zhenghua

Assignees

魔视智能科技(武汉)有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2026-04-09

Claims (10)

1. An automatic image pre-labeling method for urban road scenes, characterized by comprising the following steps:
  inputting the urban road scene image to be processed into a first detection truth model, a second detection truth model and a third detection truth model respectively, to obtain the pre-labeling information output by each model, wherein:
  the first detection truth model is trained to identify vehicle body targets in the urban road scene image, and its output pre-labeling information comprises the vehicle category and the bounding box coordinates of the vehicle body in the image;
  the second detection truth model is trained to identify vehicle structural component targets in the urban road scene image, and its output pre-labeling information comprises the structural component category and the bounding box coordinates of the structural component in the image;
  the third detection truth model is trained to identify weak traffic participant targets in the urban road scene image, and its output pre-labeling information comprises the category of the weak traffic participant target and the bounding box coordinates of the weak traffic participant target in the image;
  in each of the first detection truth model, the second detection truth model and the third detection truth model, a backbone network extracts multi-scale features of the input image, a Transformer encoder performs global context modeling on the extracted multi-scale features through a self-attention mechanism, a Transformer decoder introduces query vectors that interact with the encoded features to generate a set of candidate target representations, and a prediction head arranged at the output end of the decoder outputs the pre-labeling information according to the set of candidate target representations;
  a one-to-one supervision mechanism based on Hungarian matching is adopted when the first detection truth model, the second detection truth model and the third detection truth model are trained; and
  aligning the pre-labeling information output by each model with the corresponding urban road scene image and attaching it, wherein the vehicle body target identified by the first detection truth model is taken as a parent object, the structural component target identified by the second detection truth model is taken as a child object, the parent object is matched with the child object, and the pre-labeling information corresponding to the child object is attached to the corresponding parent object.
2. The automatic image pre-labeling method for urban road scenes according to claim 1, further comprising: reading a preset image index file and loading the urban road scene images to be processed, wherein the image index file records the source path, the file name and the batch information of each image.
3. The automatic image pre-labeling method for urban road scenes according to claim 1, further comprising: before the aligning and attaching, performing unified standardization processing on the pre-labeling information output by the first detection truth model, the second detection truth model and the third detection truth model to form a unified intermediate data structure.
4. The automatic image pre-labeling method for urban road scenes according to claim 3, further comprising: writing the standardized and aligned pre-labeling results into a unified structured labeling file.
5. The method according to claim 1, wherein the pre-labeling information output by the first detection truth model further comprises key point information and attribute information of a vehicle body, the pre-labeling information output by the second detection truth model further comprises attribute information of a vehicle structural component, and the pre-labeling information output by the third detection truth model further comprises key point information and attribute information of a weak traffic participant.
6. The method according to claim 5, wherein during training, based on the one-to-one correspondence between predicted targets and real targets established by the one-to-one supervision mechanism of Hungarian matching, the classification loss, the bounding box regression loss, the key point regression loss and the attribute classification loss are optimized synchronously for the first detection truth model; the classification loss, the bounding box regression loss and the attribute classification loss are optimized synchronously for the second detection truth model; and the classification loss, the bounding box regression loss, the key point regression loss and the attribute classification loss are optimized synchronously for the third detection truth model.
7. The automatic image pre-labeling method for urban road scenes according to claim 1, further comprising: during model inference, retaining only the bounding box with the highest confidence for each target; when matching the parent object with the child object, screening out, according to the intersection-over-union (IoU) value between bounding boxes, the bounding box matching pairs whose IoU value is greater than an IoU threshold, wherein each bounding box matching pair consists of a parent object box and a child object box; and taking the matching pair with the largest IoU value from among the matching pairs to form the final matching relationship between the parent object and the child object; wherein a matched parent object and a matched child object no longer participate in the matching.
8. The automatic image pre-labeling method for urban road scenes according to claim 1, wherein the first, second and third detection truth models perform inference in parallel or in series.
9. An electronic device comprising a memory module that stores instructions to be loaded and executed by a processor, wherein the instructions, when executed, cause the processor to perform the automatic image pre-labeling method for urban road scenes according to any one of claims 1-8.
10. A computer-readable storage medium storing one or more programs which, when executed by a processor, implement the automatic image pre-labeling method for urban road scenes according to any one of claims 1-8.
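The parent/child matching recited in claim 7 can be illustrated with a short sketch. It is not the applicant's implementation: the `(x_min, y_min, x_max, y_max)` box convention, the function names `iou` and `match_children_to_parents`, and the threshold value in the usage lines are all assumptions made for illustration. The sketch filters parent/child box pairs by an IoU threshold, visits the remaining pairs from highest to lowest IoU, and removes a parent and a child from further matching once they are paired.

```python
from typing import Dict, List, Tuple

# Assumed box convention (not stated in the source): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_children_to_parents(
    parents: List[Box], children: List[Box], iou_threshold: float
) -> Dict[int, int]:
    """Return {child_index: parent_index} using greedy max-IoU matching."""
    # Step 1: keep only parent/child pairs whose IoU exceeds the threshold.
    candidates = []
    for pi, parent in enumerate(parents):
        for ci, child in enumerate(children):
            score = iou(parent, child)
            if score > iou_threshold:
                candidates.append((score, pi, ci))

    # Step 2: visit candidate pairs from the highest IoU to the lowest.
    candidates.sort(key=lambda item: item[0], reverse=True)

    matched_parents, matched_children = set(), set()
    assignment: Dict[int, int] = {}
    for _, pi, ci in candidates:
        # Step 3: a matched parent or child no longer participates.
        if pi in matched_parents or ci in matched_children:
            continue
        assignment[ci] = pi
        matched_parents.add(pi)
        matched_children.add(ci)
    return assignment


# Hypothetical example: two vehicle boxes and two component boxes.
vehicles = [(0, 0, 100, 100), (200, 0, 300, 100)]
components = [(210, 10, 260, 90), (0, 0, 50, 100)]
links = match_children_to_parents(vehicles, components, iou_threshold=0.1)
```

Because a component box usually lies inside its vehicle box, the IoU here is roughly the ratio of component area to vehicle area, so a workable threshold would likely be small; the value 0.1 above is only a placeholder.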
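The one-to-one supervision of claims 1 and 6 pairs each real target with exactly one predicted target before the losses are computed. The sketch below shows that idea under stated assumptions: the claims do not specify the matching cost, so the cost used here (negative class probability plus a weighted L1 box distance, in the style of DETR) and the names `pair_cost`, `one_to_one_match` and `box_weight` are hypothetical. For brevity it finds the minimum-cost assignment by brute force over permutations, which yields the same result the Hungarian algorithm computes efficiently; a production system would more plausibly call a solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Prediction = Tuple[Sequence[float], Sequence[float]]  # (class probabilities, box)
GroundTruth = Tuple[int, Sequence[float]]             # (class label, box)


def pair_cost(pred_probs: Sequence[float], pred_box: Sequence[float],
              gt_label: int, gt_box: Sequence[float],
              box_weight: float = 5.0) -> float:
    """Assumed matching cost: low when the class probability is high and boxes are close."""
    l1_distance = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    return -pred_probs[gt_label] + box_weight * l1_distance


def one_to_one_match(preds: List[Prediction],
                     gts: List[GroundTruth]) -> List[Tuple[int, int]]:
    """Return (prediction_index, ground_truth_index) pairs with minimum total cost.

    Brute-force stand-in for Hungarian matching; assumes len(preds) >= len(gts)
    and is only practical for a handful of targets.
    """
    best_cost, best_perm = float("inf"), ()
    # Each permutation assigns a distinct prediction to every ground-truth target.
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(*preds[p], *gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(p, g) for g, p in enumerate(best_perm)]


# Hypothetical example: three decoder queries, two annotated targets.
predictions = [
    ([0.1, 0.9], (0.5, 0.5, 0.2, 0.2)),
    ([0.8, 0.2], (0.2, 0.2, 0.1, 0.1)),
    ([0.5, 0.5], (0.9, 0.9, 0.1, 0.1)),
]
targets = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.5, 0.5, 0.2, 0.2))]
matches = one_to_one_match(predictions, targets)
```

Predictions left unmatched (index 2 in the example) would typically be supervised as background, while the per-target losses listed in claim 6 would be computed only on the matched pairs.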

Description

Automatic image pre-labeling method, device and medium for urban road scene

Technical Field

The invention belongs to the technical field of computer vision and automatic driving data engineering, and particularly relates to an automatic image pre-labeling method, device and medium for urban road scenes.

Background

In the field of automatic driving and intelligent traffic, urban road scene image data is the core basic data for training and verifying perception algorithms. In the prior art, in order to reduce manual labeling cost and improve data production efficiency, image data is generally given preliminary processing by pre-labeling or automatic labeling. The main technical routes fall into the following categories.

The first is manual full labeling combined with spot-check repair. This approach relies entirely on annotators to label the vehicles, pedestrians and related attributes in the images frame by frame, including target box drawing, category selection, attribute filling and key point labeling (such as wheel grounding points), after which quality inspectors perform spot checks and repairs. This scheme achieves high labeling precision, but it depends heavily on manpower, has a high labeling cost and a long cycle, and can hardly support rapid large-scale data production.

The second is pre-labeling based on a single target detection model. This approach generally adopts a general-purpose target detection model such as YOLO, Faster R-CNN or DETR, which outputs the bounding box, category and confidence of each target in the input image; the results are then corrected and supplemented manually. However, this approach generally covers only the overall box of the vehicle target and can hardly provide fine-grained part information such as the vehicle head and tail at the same time, so the manual correction workload remains large.

The third is the multi-task single-model approach, which attempts to learn the overall vehicle box, the vehicle head and tail, and pedestrian/rider information simultaneously with one unified large model. In theory this scheme reduces the number of models, but in the complex scenes of real urban roads the different tasks differ markedly in target scale, appearance characteristics, sample distribution and hard-sample characteristics. Mutual interference among tasks arises easily, the model is difficult to train, convergence is unstable, the engineering deployment cycle is long, and the scheme is difficult to use stably in an actual production environment.

In addition, some prior art employs component supplementation methods based on rules or geometric priors, such as inferring the head or tail position from the geometric proportions of the vehicle bounding box. This approach relies on hand-crafted empirical rules, adapts poorly to occlusion, pose changes and dense multi-vehicle scenes, lacks robustness, and produces large errors in complex urban road scenes.

In summary, the prior art generally has the following problems in complex urban road scenes: first, a single model or simple rules can hardly handle multi-category target detection and fine-grained part localization at the same time, so the pre-labeling results are incomplete and the amount of manual modification is large; second, the multi-task single-model scheme has a high degree of task coupling and high training and parameter tuning costs, and can hardly support long-term data production stably.

Disclosure of Invention

In view of the above technical problems, an automatic image pre-labeling method, device and medium for urban road scenes are provided.

The technical solution adopted by the invention is as follows:

As a first aspect of the present invention, there is provided an automatic image pre-labeling method for urban road scenes, comprising: inputting the urban road scene image to be processed into a first detection truth model, a second detection truth model and a third detection truth model respectively, to obtain the pre-labeling information output by each model, wherein: the first detection truth model is trained to identify vehicle body targets in the urban road scene image, and its output pre-labeling information comprises the vehicle category and the bounding box coordinates of the vehicle body in the image; the second detection truth model is trained to identify vehicle structural component targets in the urban road scene image, and its output pre-labeling information comprises the structural component category and the bounding box coordinates of the structural component in the image; the third detection truth model is trained to identify the weak traffic participant targ