CN-121861623-B - Light-weight vehicle target detection method and system based on RT-DETR
Abstract
A lightweight vehicle target detection method and system based on RT-DETR, belonging to the technical field of vehicle target detection. To address the excessive consumption of computing resources by the ResNet-18 backbone network and the insufficient fusion of multi-scale target features, a new CGResNet backbone network is proposed that maintains vehicle detection capability while reducing the parameter count and computation of the original model, thereby improving detection speed. A bidirectional feature pyramid network (BiFPN) is introduced in the feature fusion stage, improving precision through a multi-level feature pyramid and bidirectional information flow while retaining the lightweight advantage. A new loss function, EPGIoU, is proposed for the localization regression of predicted boxes, addressing gradient fluctuation in multi-scale and occluded scenes in the vehicle target detection task.
Inventors
- MA LI
- Zhang Zidie
- ZHU ZENGYAN
Assignees
- Wuxi University (无锡学院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-17
Claims (5)
- 1. A lightweight vehicle target detection method based on RT-DETR, characterized by comprising the following steps:
  S1, data acquisition and division: collecting traffic vehicle images and dividing them into a training set and a validation set according to a proportion;
  S2, constructing a lightweight vehicle target detection model: based on the RT-DETR model, replacing the ResNet-18 backbone network of the RT-DETR model with an improved context-guided residual network, and introducing five bidirectional feature pyramid networks into the encoder part of the RT-DETR model in the following manner: the 1st bidirectional feature pyramid network up-samples the enhanced P5 feature by a factor of 2 so that its size is aligned to the P4 scale, and performs bidirectional weighted fusion of it with the channel-unified original P4 reference feature; the 2nd bidirectional feature pyramid network up-samples the P4 feature by a factor of 2 so that its size is aligned to the P3 scale, and performs bidirectional weighted fusion of it with the channel-unified original P3 reference feature; the 3rd bidirectional feature pyramid network applies convolutional down-sampling to the P2 feature to obtain a low-level texture feature at the P3 scale, and performs three-way weighted fusion of this feature, the original P3 reference feature, and the enhanced P3 feature obtained from top-down fusion; the 4th bidirectional feature pyramid network receives the final enhanced P3 feature, applies convolutional down-sampling to align its size to the P4 scale, and performs weighted fusion of the resulting feature, the original P4 reference feature, and the enhanced P4 feature obtained from top-down fusion; the 5th bidirectional feature pyramid network applies convolutional down-sampling to obtain a feature at the P5 scale, and performs weighted fusion of this feature with the original P5 feature after AIFI global attention;
  S3, training the lightweight vehicle target detection model: training the model on the training set with the loss function EPGIoU, wherein, during training, EPGIoU combines a center-distance normalization penalty term, an aspect-ratio consistency penalty term, and an area-difference optimization term in a multi-constraint collaborative design; the center-distance normalization penalty term adopts a normalization design strongly tied to the target scale: it takes the square of the diagonal length of the minimum bounding box as the normalization factor and applies a normalized penalty to the Euclidean distance between the centers of the two boxes, so that the penalty strength adapts to scale for the multi-scale targets of traffic scenes, specifically: [formula missing in source], where the two symbols denote, respectively, the squared Euclidean distance between the center points of the predicted box A and the ground-truth box B, and the diagonal length of the minimum bounding box of the predicted box A and the ground-truth box B; the aspect-ratio consistency penalty term applies a smooth constraint to the aspect-ratio deviation between the predicted box and the ground-truth box through an exponential function, and introduces a scale factor to adapt dynamically to the aspect-ratio distribution of vehicle targets, specifically: [formula missing in source], where the symbols denote the width and height of the predicted box A, the width and height of the ground-truth box B, and the scale factor;
  S4, detecting vehicle targets in a traffic scene using the trained lightweight vehicle target detection model.
- 2. The lightweight vehicle target detection method based on RT-DETR according to claim 1, wherein the improved context-guided residual network replaces the basic residual modules in the ResNet-18 backbone network of the RT-DETR model with context-guided modules.
- 3. The lightweight vehicle target detection method based on RT-DETR according to claim 2, wherein the area-difference optimization term penalizes the area deviation with the area of the minimum bounding box as the reference, so as to provide a stable optimization signal, specifically: [formula missing in source], where the two symbols denote, respectively, the area of the smallest axis-aligned bounding box that contains both the predicted box A and the ground-truth box B, and the area of the union of the predicted box A and the ground-truth box B.
- 4. The lightweight vehicle target detection method based on RT-DETR according to claim 3, wherein the loss function EPGIoU is specifically: [formula missing in source], where the IoU symbol denotes the intersection over union of the predicted box A and the ground-truth box B.
- 5. A lightweight vehicle target detection system based on RT-DETR, the system comprising:
  a data acquisition and division module, for collecting traffic vehicle images and dividing them into a training set and a validation set according to a proportion;
  a lightweight vehicle target detection model construction module, for replacing, based on the RT-DETR model, the ResNet-18 backbone network of the RT-DETR model with an improved context-guided residual network, and introducing five bidirectional feature pyramid networks into the encoder part of the RT-DETR model in the following manner: the 1st bidirectional feature pyramid network up-samples the enhanced P5 feature by a factor of 2 so that its size is aligned to the P4 scale, and performs bidirectional weighted fusion of it with the channel-unified original P4 reference feature; the 2nd bidirectional feature pyramid network up-samples the P4 feature by a factor of 2 so that its size is aligned to the P3 scale, and performs bidirectional weighted fusion of it with the channel-unified original P3 reference feature; the 3rd bidirectional feature pyramid network applies convolutional down-sampling to the P2 feature to obtain a low-level texture feature at the P3 scale, and performs three-way weighted fusion of this feature, the original P3 reference feature, and the enhanced P3 feature obtained from top-down fusion; the 4th bidirectional feature pyramid network receives the final enhanced P3 feature, applies convolutional down-sampling to align its size to the P4 scale, and performs weighted fusion of the resulting feature, the original P4 reference feature, and the enhanced P4 feature obtained from top-down fusion; the 5th bidirectional feature pyramid network applies convolutional down-sampling to obtain a feature at the P5 scale, and performs weighted fusion of this feature with the original P5 feature after AIFI global attention;
  a lightweight vehicle target detection model training module, for training the model on the training set with the loss function EPGIoU, wherein, during training, EPGIoU combines a center-distance normalization penalty term, an aspect-ratio consistency penalty term, and an area-difference optimization term in a multi-constraint collaborative design; the center-distance normalization penalty term adopts a normalization design strongly tied to the target scale: it takes the square of the diagonal length of the minimum bounding box as the normalization factor and applies a normalized penalty to the Euclidean distance between the centers of the two boxes, so that the penalty strength adapts to scale for the multi-scale targets of traffic scenes, specifically: [formula missing in source], where the two symbols denote, respectively, the squared Euclidean distance between the center points of the predicted box A and the ground-truth box B, and the diagonal length of the minimum bounding box of the predicted box A and the ground-truth box B; the aspect-ratio consistency penalty term applies a smooth constraint to the aspect-ratio deviation between the predicted box and the ground-truth box through an exponential function, and introduces a scale factor to adapt dynamically to the aspect-ratio distribution of vehicle targets, specifically: [formula missing in source], where the symbols denote the width and height of the predicted box A, the width and height of the ground-truth box B, and the scale factor;
  a target detection module, for detecting vehicle targets in a traffic scene using the trained lightweight vehicle target detection model.
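The geometric quantities that the claims name for the loss function can be illustrated with a minimal Python sketch. It is not the patented EPGIoU formula, which does not survive in the claim text above. The sketch computes the intersection over union, the squared center distance normalized by the squared diagonal of the minimum bounding box, and an area term referenced to the minimum bounding box. The names `Box`, `area` and `box_terms` are hypothetical, the form `(C - U) / C` for the area term is an assumption in the style of GIoU, and the exponential aspect-ratio term and the way the terms are combined are left out because their exact forms are not stated.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box as corner coordinates, with x1 < x2 and y1 < y2."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(b: Box) -> float:
    # Clamp at zero so that an empty intersection yields zero area.
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def box_terms(a: Box, b: Box) -> dict:
    """Geometric terms for predicted box a and ground-truth box b.

    Assumes both boxes have non-zero area (no division-by-zero guard).
    """
    inter = area(Box(max(a.x1, b.x1), max(a.y1, b.y1),
                     min(a.x2, b.x2), min(a.y2, b.y2)))
    union = area(a) + area(b) - inter
    # Minimum axis-aligned bounding box that contains both boxes.
    enc = Box(min(a.x1, b.x1), min(a.y1, b.y1),
              max(a.x2, b.x2), max(a.y2, b.y2))
    # Squared Euclidean distance between the two box centers.
    dx = ((a.x1 + a.x2) - (b.x1 + b.x2)) / 2
    dy = ((a.y1 + a.y2) - (b.y1 + b.y2)) / 2
    rho2 = dx ** 2 + dy ** 2
    # Squared diagonal of the enclosing box, used as the normalization factor.
    c2 = (enc.x2 - enc.x1) ** 2 + (enc.y2 - enc.y1) ** 2
    return {
        "iou": inter / union,
        "center_penalty": rho2 / c2,
        # ASSUMPTION: GIoU-style form; the source formula is missing.
        "area_penalty": (area(enc) - union) / area(enc),
    }


if __name__ == "__main__":
    print(box_terms(Box(0, 0, 2, 2), Box(1, 1, 3, 3)))
```

For the two overlapping 2x2 boxes in the usage line, the IoU is 1/7, the center penalty is 2/18 and the area penalty is 2/9. Because the center penalty is divided by the squared diagonal of the enclosing box, the same relative offset yields the same penalty for small and large vehicles, which is the scale-adaptive behavior the claims describe.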
Description
Light-weight vehicle target detection method and system based on RT-DETR
Technical Field
The invention belongs to the technical field of vehicle target detection, and particularly relates to a lightweight vehicle target detection method and system based on RT-DETR.
Background
In autonomous driving scenarios, vehicles are both important participants in road traffic and key objects of traffic research, so they must be detected quickly and accurately in order to plan better driving routes and reduce the incidence of traffic accidents. Research on vehicle target detection is therefore significant for road traffic safety and is a focus of current work. With the continued development of computer technology, target detection algorithms based on deep learning have gradually become the mainstream in this field. These algorithms fall into two main categories: two-stage algorithms based on candidate regions and single-stage algorithms based on regression. In a two-stage algorithm, the first stage generates multiple proposal boxes in the image through a region proposal network and the second stage refines them; classical examples are Fast R-CNN, Faster R-CNN and Mask R-CNN. Because detection is split into two stages, these algorithms achieve good accuracy but are slower and cannot meet the real-time requirements of the vehicle detection task. Single-stage algorithms regress the targets directly; classical examples include SSD and the YOLO series. Single-stage models infer much faster than two-stage models at slightly lower accuracy, and, given the importance of real-time performance for traffic safety, most research focuses on improving single-stage models.
Through the efforts of many researchers, the performance of vehicle target detection has improved steadily; however, vehicle target detection for autonomous driving scenarios still faces challenges. First, high-precision vehicle detection models often rely on complex structures and heavy computation and require high-performance GPU support, which is difficult to provide on edge devices. Second, the vehicles to be detected vary in size, and dense targets are difficult to identify accurately. Because vehicles differ greatly in scale, targets of several scales appear in the same image, which makes it hard for the detector to extract features for each of them. Meanwhile, when traffic volume is high, dense targets tend to occlude one another, so the features of occluded targets are extracted incompletely and detection performance degrades.
Disclosure of Invention
The invention provides a lightweight vehicle target detection method based on RT-DETR, which addresses the technical problems of large parameter counts and difficulty of accurate identification in existing vehicle target detection.
The method comprises the following steps:
S1, data acquisition and division: collecting traffic vehicle images and dividing them into a training set and a validation set according to a proportion;
S2, constructing a lightweight vehicle target detection model: based on the RT-DETR model, replacing the ResNet-18 backbone network of the RT-DETR model with an improved context-guided residual network, and introducing five bidirectional feature pyramid networks into the encoder part of the RT-DETR model in the following manner: the 1st bidirectional feature pyramid network up-samples the enhanced P5 feature by a factor of 2 so that its size is aligned to the P4 scale, and performs bidirectional weighted fusion of it with the channel-unified original P4 reference feature; the 2nd bidirectional feature pyramid network up-samples the P4 feature by a factor of 2 so that its size is aligned to the P3 scale, and performs bidirectional weighted fusion of it with the channel-unified original P3 reference feature; the 3rd bidirectional feature pyramid network applies convolutional down-sampling to the P2 feature to obtain a low-level texture feature at the P3 scale, and performs three-way weighted fusion of this feature, the original P3 reference feature, and the enhanced P3 feature obtained from top-down fusion; the 4th bidirectional feature pyramid network receives the final enhanced P3 feature, applies convolutional down-sampling to align its size to the P4 scale, and performs weighted fusion of the resulting feature, the original P4 reference feature, and the enhanced P4 feature obtained from top-down fusion; carrying out conv