CN-121982677-A - Traffic scene target detection method, device and storage medium based on improved RT-DETR

Abstract

The application provides a traffic scene target detection method, device and storage medium based on improved RT-DETR, belonging to the field of target detection. The method comprises preparing a data set, constructing an improved RT-DETR model, training the model, and outputting results that accurately locate the target regions in a traffic scene and give the category of each target. The improved RT-DETR model comprises a feature extraction network Backbone module, a feature fusion network Neck module and a feature recognition network Head module. The Backbone module integrates a WM-Dual Block module that fuses wavelet transformation with linear selective scanning, and the WM-Dual Block module replaces the SPPF module in the Backbone. An SQ-ACA module, which fuses feature maps from different stages and performs attention interaction along the axial direction, is added to the Neck module.

Inventors

CHEN ZEHUA
XU LIANGHAO
WANG ZHENGJIE
LIU XIAOYAN

Assignees

Taiyuan University of Technology (太原理工大学)

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (10)

1. A traffic scene target detection method based on improved RT-DETR, characterized by comprising the following steps: constructing and dividing a traffic scene target detection data set to obtain a test set and a training set; constructing a traffic scene target detection model, wherein the traffic scene target detection model is obtained by improving an RT-DETR model and comprises a feature extraction network Backbone module, a feature fusion network Neck module and a feature recognition network Head module, wherein the Backbone module integrates a WM-Dual Block module for fusing wavelet transformation and linear selective scanning, an SQ-ACA module for fusing the feature maps output at different stages and, on the basis of the feature fusion, performing attention interaction along the axial direction is added to the Neck module, and the WM-Dual Block module replaces the SPPF module in the Backbone; training the model with the training set images and saving the parameters; and inputting the test set images into the trained model to obtain the final traffic target detection result and labeling the category of each traffic target.
2. The traffic scene target detection method based on improved RT-DETR according to claim 1, characterized in that the WM-Dual Block module comprises a first Split module; the first Split module divides the feature map by channel number into two information streams based on a division factor a; the first information stream is passed directly to a first Concat module without additional processing; the second information stream undergoes wavelet transformation to obtain four feature maps of different frequencies, namely LL, LH, HL and HH; each of these feature maps is processed by a Selective Scan module, and the processed results undergo inverse wavelet transformation to obtain the processed feature map of the second information stream, which is passed to the first Concat module; and the first Concat module splices the two information streams along the channel direction to obtain the output feature map.
3. The traffic scene target detection method based on improved RT-DETR according to claim 2, wherein the Selective Scan module duplicates each of the four feature maps of different frequencies into two copies; one copy passes through a first Conv module and a SiLU activation function to obtain a weight feature map; the other copy passes through a second Conv module, a first SS2D module and a LayerNorm module to obtain a spatial scanning feature map; and finally the weight feature map is multiplied by the spatial scanning feature map to obtain the output feature map.
4. The traffic scene target detection method based on improved RT-DETR according to claim 1, wherein the SQ-ACA module comprises a Positional Embedding module, a Feature Alignment module and a CrossPath Bridge module; the two feature maps input to the SQ-ACA module are processed and fused through the Positional Embedding module to output information streams; the information streams are then feature-aligned by the Feature Alignment module; the aligned information stream undergoes feature extraction in the CrossPath Bridge module based on Conv modules of different scales; the information stream output by the CrossPath Bridge module is then divided by a third Conv module into five information streams, namely Vx, Kx, Q, Ky and Vy; Q and Vx interact to obtain a transverse attention interaction map, and Q and Ky interact to obtain a longitudinal attention interaction map; meanwhile, Q is passed through a DWConv module and added to each of the two attention interaction maps; after the addition, the two attention interaction maps are integrated by a second Concat module; and the fused result is further integrated by an MLP module to obtain the output features.
5. The traffic scene target detection method based on improved RT-DETR according to claim 4, wherein the Feature Alignment module applies average pooling and max pooling to the first input feature map, then applies a convolution with a kernel size of 7, and finally adds the processed first input feature map to the second input feature map to achieve feature alignment and output an aligned feature map.
6. The traffic scene target detection method based on improved RT-DETR according to claim 5, characterized in that the CrossPath Bridge module applies multi-scale convolution to the aligned feature map: a second Split module divides the aligned feature map into four parts along the channel direction; the first part is processed by a fourth Conv module, the second part by a fifth Conv module, the third part by a sixth Conv module, and the fourth part by a seventh Conv module; and a third Concat module then fuses the four parts into one along the channel direction.
7. The traffic scene target detection method based on improved RT-DETR according to claim 6, wherein the fourth Conv module has a convolution kernel size of 5 and a position of 1, the fifth Conv module has a convolution kernel size of 3 and a position of 1, the sixth Conv module has a convolution kernel size of 3 and a position of 2, and the seventh Conv module has a convolution kernel size of 3 and a position of 3.
8. The traffic scene target detection method based on improved RT-DETR according to claim 4, wherein the Positional Embedding module performs a convolution operation on each of the two input feature maps.
9. A traffic scene target detection device based on improved RT-DETR, characterized by comprising: an image acquisition module for acquiring traffic scene images of different types; and an image detection module for inputting the traffic scene images into a traffic scene target detection model, which accurately locates the traffic targets and outputs the category of each traffic target in the image; wherein the traffic scene target detection model is obtained by improving an RT-DETR model and comprises a feature extraction network Backbone module, a feature fusion network Neck module and a feature recognition network Head module, wherein the Backbone module integrates a WM-Dual Block module for fusing wavelet transformation and linear selective scanning, an SQ-ACA module for fusing feature maps from different stages and, on the basis of the feature fusion, performing attention interaction along the axial direction is added to the Neck module, the WM-Dual Block module replaces the SPPF module in the Backbone, and the SQ-ACA module is added before the splicing operation on the feature maps of different scales.
10. A computer-readable storage medium on which a computer program/instructions are stored, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-8.
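
The data flow of the WM-Dual Block module recited in claim 2 can be illustrated with a minimal sketch. It is not the applicant's implementation and rests on several assumptions: the claims do not name the wavelet, so a single-level Haar transform is used; the division factor a is assumed to give the first stream int(a * C) of the C channels; and the Selective Scan module is replaced by a pluggable placeholder (`scan`, identity by default). The function names `haar_dwt2`, `haar_idwt2` and `wm_dual_block` are hypothetical, the LH/HL naming follows one common convention, and plain nested lists stand in for tensors to keep the example dependency-free.

```python
def haar_dwt2(ch):
    """Single-level 2D Haar transform of one channel (H x W, both even).

    Returns four half-resolution sub-bands (LL, LH, HL, HH).
    Assumption: the source does not specify the wavelet; Haar is illustrative.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(ch), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(ch[0]), 2):
            a, b = ch[i][j], ch[i][j + 1]
            c, d = ch[i + 1][j], ch[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # low-pass in both directions
            r_lh.append((a - b + c - d) / 2)  # difference along the width
            r_hl.append((a + b - c - d) / 2)  # difference along the height
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution channel."""
    h, w = len(ll), len(ll[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, t, u, v = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            out[2 * i][2 * j] = (s + t + u + v) / 2
            out[2 * i][2 * j + 1] = (s - t + u - v) / 2
            out[2 * i + 1][2 * j] = (s + t - u - v) / 2
            out[2 * i + 1][2 * j + 1] = (s - t - u + v) / 2
    return out


def wm_dual_block(fmap, a=0.5, scan=lambda band: band):
    """Split -> (identity | DWT -> scan -> IDWT) -> Concat, as in claim 2.

    fmap: list of C channels, each an H x W nested list.
    a:    division factor; the int(a * C) split rule is an assumption.
    scan: placeholder for the Selective Scan module (identity here).
    """
    k = int(len(fmap) * a)
    first, second = fmap[:k], fmap[k:]          # first Split module
    processed = []
    for ch in second:
        bands = haar_dwt2(ch)                   # LL, LH, HL, HH
        processed.append(haar_idwt2(*[scan(b) for b in bands]))
    return first + processed                    # first Concat module (channel axis)
```

With the identity placeholder, the block returns its input unchanged up to floating-point rounding, because the Haar transform is invertible; any real effect would come from whatever replaces `scan`.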

Description

Traffic scene target detection method, device and storage medium based on improved RT-DETR

Technical Field

The application relates to the technical field of target detection, in particular to a traffic scene target detection method, device and storage medium based on improved RT-DETR.

Background

Target detection algorithms play a key role in autonomous driving and intelligent traffic systems. They provide reliable, real-time environment perception input for core tasks such as driver-assistance decision-making, traffic signal perception and pedestrian behavior prediction, and therefore have wide application value. In recent years, with the rapid development of computer vision technology, target detection methods based on deep learning have been widely applied to intelligent traffic scenes. Compared with traditional machine learning methods (for example, extracting hand-crafted features with the SIFT operator and completing recognition with a classifier), modern detection models represented by the YOLO series and RT-DETR achieve a better balance between accuracy and inference efficiency. However, in practical applications, existing algorithms still face a number of challenges: (1) Target occlusion in dense scenes: in urban environments with dense traffic or complex road structures, targets such as pedestrians and non-motor vehicles are often partially or completely occluded by adjacent vehicles, so that the miss rate increases significantly. (2) Perception degradation under low illumination and severe weather: in scenes such as night, overcast days, or rain and fog, image quality is limited by insufficient illumination, reflective interference or sensor noise, which easily causes false detections or missed detections.
Reflections of rainwater on the road surface may be misjudged as traffic signal lights, and traffic signs or pedestrians in a dim environment may not be effectively identified because their features are blurred.

Disclosure of Invention

To solve the above technical problems, the application provides a traffic scene target detection method, device and storage medium based on improved RT-DETR. It aims to further improve robustness and adaptability in complex traffic scenes on the basis of the RT-DETR framework, so that the target regions in a traffic scene can be accurately located and the category of each traffic target in the image can be output.

The technical scheme adopted by the application is a traffic scene target detection method based on improved RT-DETR, comprising the following steps: constructing and dividing a traffic scene target detection data set to obtain a test set and a training set; constructing a traffic scene target detection model, wherein the traffic scene target detection model is obtained by improving an RT-DETR model and comprises a feature extraction network Backbone module, a feature fusion network Neck module and a feature recognition network Head module, wherein the Backbone module integrates a WM-Dual Block module for fusing wavelet transformation and linear selective scanning, an SQ-ACA module for fusing the feature maps output at different stages and, on the basis of the feature fusion, performing attention interaction along the axial direction is added to the Neck module, and the WM-Dual Block module replaces the SPPF module in the Backbone; training the model with the training set images and saving the parameters; and inputting the test set images into the trained model to obtain the final traffic target detection result and labeling the category of each traffic target.
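
To make "attention interaction along the axial direction" more concrete, the following is a generic axial attention sketch, not the SQ-ACA module itself. It borrows the five stream names from claim 4 (Vx, Kx, Q, Ky, Vy), but the way they are combined here is an assumption: a shared Q attends along each row using Kx and Vx (transverse) and along each column using Ky and Vy (longitudinal), with ordinary scaled dot-product softmax attention, and the two results are concatenated along the channel direction. The DWConv branch and the MLP of claim 4 are omitted. The function names `softmax`, `attend` and `axial_attention` are hypothetical, and nested lists of shape H x W x C stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a 1-D line of positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]


def axial_attention(q, kx, vx, ky, vy):
    """Row-wise plus column-wise attention with a shared query.

    All inputs are H x W x C nested lists. Each position attends only to
    positions in its own row (via kx, vx) and its own column (via ky, vy),
    instead of to all H * W positions. The two results are concatenated
    along the channel axis, giving H x W x 2C.
    """
    h, w = len(q), len(q[0])
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Transverse interaction: along the width of row i.
            row = attend(q[i][j], kx[i], vx[i])
            # Longitudinal interaction: along the height of column j.
            col_keys = [ky[r][j] for r in range(h)]
            col_vals = [vy[r][j] for r in range(h)]
            col = attend(q[i][j], col_keys, col_vals)
            out[i][j] = row + col  # channel-wise concatenation
    return out
```

The design point of axial attention in general is cost: each position compares itself with H + W positions rather than H * W, which is one plausible reason for using it in a real-time detector, although the text above does not state the motivation.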
Further, the WM-Dual Block module comprises a first Split module. The first Split module divides the feature map by channel number into two information streams based on a division factor a. The first information stream is passed directly to the first Concat module without additional processing. The second information stream undergoes wavelet transformation to obtain four feature maps of different frequencies, namely LL, LH, HL and HH; each of these feature maps is processed by the Selective Scan module, and the processed results undergo inverse wavelet transformation to obtain the processed feature map of the second information stream, which is passed to the first Concat module. The first Concat module splices the two information streams along the channel direction to obtain the output feature map. Further, the Selective Scan module duplicates each of the four feature maps of different frequencies into two copies, wherein one copy passes through a first Conv module and a SiLU activation function to obtain a weight feature map, and the other copy obtains a spatial scanning feature map through a second Conv module,