CN-122024205-A - Traffic sign recognition method based on YOLO11

Abstract

The invention discloses a traffic sign recognition method based on YOLO11, which comprises the following steps: S1, collecting multi-source road scene images and preprocessing them to form a small-target reinforcement data set; S2, based on the small-target reinforcement data set, extracting input features with the CSPDarknet backbone network of YOLO11; S3, performing CBS processing and C3K2-DF module processing on the input features multiple times to extract a first feature and a second feature, and modeling through an SPPF module and an MA module to obtain a third feature; S4, performing CBS processing on the features respectively, and carrying out feature fusion and enhancement through three modules in sequence to generate three groups of final fusion features; S5, inputting the final fusion features into a first detection head, a second detection head and a third detection head respectively for classification and positioning to obtain the target detection results. The invention combines the YOLO11 framework with the C3K2-DF module, the MA module, the FB module and adaptive filtering technology, and realizes traffic sign recognition based on YOLO11.

Inventors

  • HONG LIANGKUN
  • LUO QINGLAN
  • SONG AIMI
  • BAO XIAONAN
  • SONG QIUYUN
  • SHI CHUNLI
  • ZHANG XUESONG
  • LI XIAOBO

Assignees

  • Donghua University (东华大学)
  • Shanghai Moule Network Technology Co., Ltd. (上海谋乐网络科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-03

Claims (5)

  1. The traffic sign recognition method based on YOLO11 is characterized by comprising the following steps: S1, acquiring multi-source road scene images, and performing blur enhancement, rain-and-fog simulation, occlusion simulation and illumination disturbance processing to form a small-target reinforcement data set covering multiple scenes; S2, based on the small-target reinforcement data set, extracting input features of the traffic image by using the CSPDarknet backbone network of YOLO11; S3, performing CBS processing and C3K2-DF module processing on the input features a plurality of times, sequentially extracting a first shallow feature, a first feature, a second feature and a second deep feature, and modeling through an SPPF module and an MA module to obtain a third feature; S4, performing CBS processing on the first feature, the second feature and the third feature respectively to obtain three feature maps with consistent channel numbers, and performing feature fusion and enhancement through a first fusion module, a second fusion module and an FB module in sequence to obtain first, second and third final fusion features; S5, inputting the first, second and third final fusion features into a first detection head, a second detection head and a third detection head respectively, classifying and positioning to obtain first, second and third target detection results respectively, and aggregating the three target detection results to generate a final traffic sign recognition result.
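The preprocessing step S1 names four corruptions (blur enhancement, rain-and-fog simulation, occlusion simulation, illumination disturbance) without fixing their implementations. A minimal pure-Python sketch of three of them on a grayscale image represented as a list of rows; all function names and parameter values here are illustrative, not from the patent:

```python
def box_blur(img, k=1):
    """Simple box blur as a stand-in for the claimed 'blur enhancement'."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - k), min(h, y + k + 1))
                    for i in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def occlude(img, y0, x0, size, fill=0.0):
    """Overwrite a square patch to mimic partial occlusion of a sign."""
    out = [row[:] for row in img]
    for y in range(y0, min(len(img), y0 + size)):
        for x in range(x0, min(len(img[0]), x0 + size)):
            out[y][x] = fill
    return out

def illumination(img, gain, bias):
    """Global illumination disturbance: pixel' = gain*pixel + bias, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, gain * p + bias)) for p in row] for row in img]
```

In practice these transforms would be applied with randomized parameters per image so that one source photo yields several training variants.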
  2. The YOLO11-based traffic sign recognition method according to claim 1, wherein S2 specifically comprises: S21, adopting the CSPDarknet of YOLO11 as the backbone network to extract feature information of an input traffic image in the small-target reinforcement data set, wherein the CSPDarknet comprises 5 CBS modules, 4 C3K2-DF modules, 1 SPPF module and 1 MA module; S22, the CBS module sequentially applies a convolution with a 1×1 kernel, batch normalization and an activation function to retain the feature maps and feature information; the C3K2-DF module performs feature conversion on the input image through two convolution layers to extract features of different levels and degrees of abstraction; the SPPF module rescales an input image of arbitrary size to a fixed size and generates fixed-length feature vectors; and the MA module focuses on the target area to optimize attention to key information; S23, the C3K2-DF module divides an input feature into two parts through an initial convolution layer; one part is gradually refined with a Bottleneck structure, capturing feature information of different scales and abstraction degrees by combining Bottleneck and DualConv: group convolution performs initial grouped feature extraction on the input feature map to obtain local features, and point convolution then re-integrates the group-convolution outputs into the global feature space, completing inter-channel interaction and feature enhancement; the other part directly preserves the original features through a shortcut connection; finally the two parts are concatenated along the channel dimension and output through a fusion convolution layer; S24, the MA module optimizes the self-attention computation through a positivity operation on the Query and Key features, rotary position encoding optimizes the expressive capacity of the input features through a complex rotation matrix, and local position encoding optimizes the spatial perception of the features through depthwise convolution: Q = φ(xW_Q), K = φ(xW_K), V = xW_V; wherein x is the input feature, Q, K and V are the Query, Key and Value derived from the input feature, W_Q, W_K and W_V are learnable projection matrices, and φ is an activation function; S25, linear attention reduces complexity by changing the order of the matrix products: o_i = (q_i Σ_{j=1..N} k_j^T v_j) / (q_i Σ_{j=1..N} k_j^T); wherein o_i is the linear attention output vector at the i-th position, q_i is the i-th Query vector, k_j is the j-th Key vector, k_j^T is the transpose of the j-th Key vector, v_j is the j-th Value vector, and N is the sequence length; S26, the rotary position code embeds position information into the Query and Key: q'_i = RoPE(q_i), k'_j = RoPE(k_j); wherein q'_i is the Query vector processed by the rotary position code RoPE, k'_j is the Key vector processed by RoPE, and RoPE adds rotational position information to the input through a complex transformation; S27, the local position code extracts local features by means of depthwise convolution: LePE(V) = DWConv(V); S28, the linear attention and the local features are combined to obtain the final output: y_i = (q'_i Σ_{j=1..N} k'_j^T v_j) / (q'_i Σ_{j=1..N} k'_j^T) + LePE(v_i); wherein y_i is the final output integrating linear attention and local features, the denominator performs the normalization operation, k'_j^T is the transpose of the Key vector, and v_j is the corresponding Value vector.
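The complexity reduction claimed in S25 comes purely from reassociating the product (QKᵀ)V as Q(KᵀV), so the N×N attention matrix is never formed. A small pure-Python sketch, checked against the quadratic form; this is an illustration of the reassociation only (RoPE and LePE are omitted, and all names are illustrative):

```python
def linear_attention(Q, K, V):
    """Linear attention: precompute S = K^T V (d x dv) and z = sum_j k_j once,
    then each output is o_i = (q_i S) / (q_i . z) -- cost O(N*d*dv), not O(N^2)."""
    N, d, dv = len(Q), len(Q[0]), len(V[0])
    S = [[sum(K[n][i] * V[n][j] for n in range(N)) for j in range(dv)]
         for i in range(d)]
    z = [sum(K[n][i] for n in range(N)) for i in range(d)]
    out = []
    for n in range(N):
        denom = sum(Q[n][i] * z[i] for i in range(d))
        out.append([sum(Q[n][i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

def quadratic_attention(Q, K, V):
    """Reference O(N^2) form: o_i = sum_j (q_i.k_j) v_j / sum_j (q_i.k_j)."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        denom = sum(scores)
        out.append([sum(s * v[j] for s, v in zip(scores, V)) / denom
                    for j in range(len(V[0]))])
    return out
```

Both forms give identical outputs whenever the feature map φ keeps Q and K non-negative, which is why S24 applies a positivity operation before the attention.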
  3. The YOLO11-based traffic sign recognition method according to claim 1, wherein S3 specifically comprises: S31, performing CBS processing on the input image twice to obtain a first result, and performing feature extraction on the first result through a first C3K2-DF module to obtain the first shallow feature; S32, performing CBS processing on the first shallow feature, performing feature extraction through a second C3K2-DF module to obtain the first feature, and storing the first feature into a first feature storage module; S33, performing CBS processing on the first feature in the first feature storage module, performing feature extraction through a third C3K2-DF module to obtain the second feature, and storing the second feature into a second feature storage module; S34, performing CBS processing on the second feature in the second feature storage module, performing feature extraction through a fourth C3K2-DF module to obtain the second deep feature, performing feature extraction on the second deep feature through the SPPF module, and further modeling the processed feature through the MA linear attention module.
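The stage order in S31–S34 (a stride-2 CBS followed by a C3K2-DF block, with the shallow/first/second/deep features tapped at successive stages) can be made concrete with a shape tracer. The 640×640 input and the channel widths below follow common YOLO conventions and are illustrative, not figures from the patent:

```python
def backbone_shapes(h, w):
    """Trace feature-map shapes through the claimed stage order:
    CBS, CBS, C3K2-DF (shallow) -> CBS, C3K2-DF (first)
    -> CBS, C3K2-DF (second) -> CBS, C3K2-DF, SPPF, MA (deep).
    Each CBS here is a stride-2 convolution; C3K2-DF, SPPF and MA
    keep the spatial shape.  Channel widths are illustrative."""
    def cbs(h, w, c_out):
        return h // 2, w // 2, c_out
    shapes = {}
    h, w, c = cbs(h, w, 64)          # stem CBS
    h, w, c = cbs(h, w, 128)         # second CBS
    shapes["shallow"] = (h, w, c)    # after first C3K2-DF
    h, w, c = cbs(h, w, 256); shapes["first"] = (h, w, c)
    h, w, c = cbs(h, w, 512); shapes["second"] = (h, w, c)
    h, w, c = cbs(h, w, 1024); shapes["deep"] = (h, w, c)
    return shapes
```

The first and second features stored in S32/S33 are the two finer scales later re-used in S4, which is why they are saved before further downsampling.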
  4. The YOLO11-based traffic sign recognition method according to claim 1, wherein S4 specifically comprises: S41, performing CBS processing on the feature maps of different scales respectively, fusing the second feature and the third feature through a first fusion module, obtaining a first splicing layer through CBS processing of the fusion result, and storing the first splicing layer; S42, inputting the first feature and the second fusion feature into a second fusion module for fusion, obtaining a second splicing layer through CBS processing and storing the fusion result; performing frequency-domain fusion on the second splicing layer through an FB module to generate a second fusion sub-feature, splicing the second fusion sub-feature with the original splicing layer, performing CBS processing again to obtain a second deep fusion feature, and extracting the second deep fusion feature through a C3k2 module to obtain a first deep fusion feature; S43, performing CBS downsampling with a stride of 2 on the original input feature map, inputting it together with the first fusion feature into the FB module for fusion to generate a new first fusion feature, and extracting it through a C3k2 module to obtain the final first fusion feature; S44, performing CBS processing with a stride of 2 on the latest first fusion feature to downsample it to the scale of the second feature, inputting the downsampled feature into the FB module for fusion with the second fusion feature to generate a new second fusion feature, and extracting the new second fusion feature through the C3k2 module to obtain the final second fusion feature; S45, the FB module comprises an adaptive low-pass filter generator, an offset generator and an adaptive high-pass filter generator, which respectively handle high-frequency smoothing, feature resampling and boundary enhancement, wherein a BiFPN structure fuses features of different scales through a bidirectional feature transfer mechanism and combines high-level semantics with low-level details to generate a unified multi-scale feature expression; S46, the adaptive low-pass filter generator smooths high-frequency features through a spatially varying predicted low-pass filter, reducing intra-class inconsistency: w_l^{(i,j)} = softmax(Z_l^{(i,j)}); wherein Z denotes the initial filter weights generated by a 3×3 convolution layer, K×K is the spatial extent of the filter, the softmax operation normalizes the weights over the K×K window, (i, j) are the spatial position coordinates on the input feature map, l is the l-th layer of the network, and p denotes a relative position coordinate within the spatial range K×K; S47, the offset generator predicts an offset by computing local cosine similarity and resamples the features: Δ = α · d; wherein d is the offset direction predicted by the 3×3 convolution layer, α is the offset magnitude predicted by a Sigmoid function, and Δ is the final offset; S48, the adaptive high-pass filter generator obtains its weights by inverting the low-pass filter: w_h^{(i,j)} = E − softmax(Z_l^{(i,j)}); wherein E is an identity matrix and w_h are the high-pass filter weights generated by the inversion operation; S49, BiFPN is another core component of the FB module, configured to implement cross-scale bidirectional feature transfer and multi-scale information fusion: O = Σ_i (w_i / (ε + Σ_j w_j)) · I_i; wherein O is the output feature, I_i is an input feature, and w_i is a weight dynamically adjusted by the weighted feature fusion mechanism.
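The three generators described in S45–S49 reduce to simple kernel arithmetic: the low-pass weights are softmax-normalized, the high-pass kernel is the identity minus the low-pass kernel (so its weights sum to zero and it suppresses constant regions), and BiFPN fusion is a normalized weighted sum. A pure-Python sketch over a flattened 3×3 kernel; shapes are illustrative, since the patent predicts these weights per spatial position with convolutions:

```python
import math

def softmax(xs):
    """Normalize raw filter logits into non-negative weights summing to 1."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def highpass_from_lowpass(lp):
    """Adaptive high-pass kernel: identity kernel minus the low-pass kernel
    (the 'E - softmax(Z)' inversion described for the FB module)."""
    hp = [-v for v in lp]
    hp[len(lp) // 2] += 1.0   # add the identity at the kernel centre
    return hp

def bifpn_fuse(feats, weights, eps=1e-4):
    """BiFPN fast normalized fusion: O = sum(w_i * I_i) / (sum(w_j) + eps),
    with weights clamped non-negative."""
    w = [max(0.0, v) for v in weights]
    z = sum(w) + eps
    n = len(feats[0])
    return [sum(w[k] * feats[k][i] for k in range(len(feats))) / z
            for i in range(n)]
```

The zero-sum property of the high-pass kernel is what makes it a boundary enhancer: on flat regions its response is exactly zero, so only edges survive.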
  5. The YOLO11-based traffic sign recognition method according to claim 1, wherein S5 specifically comprises: S51, inputting the three groups of final fusion features into the corresponding detection heads for classification and positioning respectively, generating three groups of target detection results, and integrating the three groups of target detection results into a final detection result; S52, the YOLO11 detection head network comprises three detection heads, wherein the first detection head is a decoupled two-branch detection head; the first fusion feature is input into one branch, and positioning of small targets is completed through CBS, ordinary convolution and WIoUv Loss processing in sequence; S53, the first fusion feature is input into the other branch, and classification of the small targets is completed through CBS, depthwise separable convolution and BCELoss processing in sequence.
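The two head branches in S52–S53 train with a box-regression loss (WIoUv, a weighted IoU-style loss) and BCELoss. A pure-Python sketch of the underlying primitives; plain IoU stands in for the unspecified WIoUv variant, and the box format is an illustrative assumption:

```python
import math

def iou(a, b):
    """Plain IoU between two (x1, y1, x2, y2) boxes.  The patent's WIoUv loss
    (exact variant not specified in the text) adds a dynamic focusing weight
    on top of an IoU-style term such as 1 - IoU."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for the classification branch (BCELoss),
    with the prediction clamped away from 0 and 1 for numerical safety."""
    p = min(1 - eps, max(eps, p))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

Decoupling the head means the regression branch minimizes a term like 1 − iou(pred, gt) while the classification branch minimizes bce(score, label) independently, instead of sharing one feature path for both objectives.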

Description

Traffic sign recognition method based on YOLO11

Technical Field

The invention relates to the field of image target detection, and in particular to a traffic sign recognition method based on YOLO11.

Background

Against the background of the continuous development of automatic driving technology, traffic sign recognition is one of the core tasks of the intelligent driving perception system, bearing important responsibilities for road information perception, driving decision assistance and traffic safety assurance. Its main aim is to detect and classify traffic signs in the vehicle driving environment accurately and in real time, thereby providing reliable traffic instruction input for the automatic driving system. In practical applications, traffic signs are usually small and of many varieties, and are easily affected by complex factors such as background interference, occlusion, illumination change and bad weather, which cause image quality degradation and weak features and significantly affect the robustness and accuracy of detection algorithms.

Most existing traffic sign recognition techniques rely on deep convolutional neural network models for feature extraction and classification. Early methods based on structures such as ResNet and Faster R-CNN achieve high detection accuracy in standard scenes, but suffer from huge parameter counts and slow inference, and are unsuitable for deployment in delay-sensitive automatic driving systems. As detection speed and efficiency gradually became mainstream application requirements, the YOLO series became a research hotspot in traffic sign detection owing to its single-stage detection framework, high frame rate, small model size and good real-time performance. Successive YOLO versions (such as YOLOv7) make various trade-offs and improvements between precision and efficiency and are widely applied in various scenes.

The prior patent CN118262335A proposes a traffic sign recognition model based on an improved YOLOv5, improving detection precision and recall by designing a cross-layer feature fusion network and optimizing the training strategy. The method obtains a certain detection effect in general scenes, and in particular recognizes medium-scale targets well, but still suffers from a high false detection rate for small traffic signs. Because a small target occupies few pixels in the image, it is easily ignored, or its features are weakened by downsampling in the deep network, which degrades the final detection effect. In addition, although the method improves precision to some extent, its inference speed and computational complexity are not effectively controlled under real-time, high-frame-rate, high-throughput application scenarios, limiting the practical feasibility of the model for edge-side deployment.

Another patent, CN119339350A, further optimizes the YOLO network structure for small target detection: it introduces a multi-scale attention mechanism into the backbone network, replaces the original upsampling operator with the DySample dynamic sampling operator, and combines a WIoUv loss function, effectively improving the detection precision and speed for small traffic signs. The method balances detection precision with part of the computational efficiency and performs well in experiments. However, in complex weather or high-speed scenes with motion blur, the robustness of the model is still insufficient, and a stable, reliable detection effect is difficult to maintain. More importantly, the method's structural improvements focus mainly on the precision-speed balance, and deep exploration of lightweight model design and efficient deployment on edge devices is still lacking.

In light of the need for traffic sign recognition in complex road environments, the prior art has shortcomings in several respects. Firstly, existing network structures have limited feature extraction capability in the face of complex background and occlusion interference, especially weak discrimination of small targets, and easily produce missed and false detections. Secondly, the feature fusion strategies adopted by most methods struggle to fully integrate feature information from different scales and semantic levels, leading to insufficient high-level semantics and loss of low-level detail, which affects the final detection accuracy. Thirdly, although some methods are optimized for precision, the strict requirements of automatic driving for high real-time performance and low computational consumption still cannot be met due to the proble