CN-121982671-A - Lightweight detection method for panoramic driving perception

CN 121982671 A

Abstract

The invention discloses a lightweight detection method for panoramic driving perception. The method comprises the steps of: constructing a panoramic perception teacher model and loading its model weights; constructing a student model; acquiring a driving video data set and inputting it into the teacher model and the student model; aligning feature information between the teacher model and the student model with a dual-domain decoupled distillation module; strengthening the student network's ability to model features with a parallel context extraction module that enhances global semantic extraction; modeling pixel features suited to lane lines and the drivable area with a layout-aware local refinement module that enhances detail features having a specific layout distribution in the detection head; and superposing the outputs of the multiple detection heads to obtain the final panoramic driving perception prediction result. The method explores lightweight panoramic driving perception through dual-domain decoupled distillation, and effectively reduces the parameter count and overall inference time of the model without sacrificing the accuracy of panoramic driving perception.

Inventors

  • TU ZHIWEI
  • ZHANG YUNZUO
  • MA XINNA
  • WANG HUI

Assignees

  • Shijiazhuang Tiedao University (石家庄铁道大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-31

Claims (3)

  1. A lightweight detection method for panoramic driving perception, characterized by comprising the following steps:
     S1, acquiring a driving video data set, inputting the driving video data set into a teacher model and a student model respectively, and acquiring intermediate feature maps of each model, wherein i denotes the level of a feature, covering both features of the backbone network and features of the neck network; the student model comprises a backbone network, a neck network and detection heads, the detection heads comprising a target detection head, a drivable-area detection head and a lane-line detection head;
     S2, utilizing a dual-domain decoupled distillation module to align feature information between the teacher model and the student model by calculating a dual-domain attention distance between a teacher feature map Tea_i and a student feature map Stu_i and obtaining a dual-domain distillation loss L_att, wherein the module comprises a channel calibration regularizer Rc and a spatial alignment regularizer Rs; the dual-domain attention distance comprises a channel attention distance calculated by the channel calibration regularizer Rc and a spatial attention distance calculated by the spatial alignment regularizer Rs, each measuring the difference between the teacher model and the student model in attending to panoramic driving perception target areas; the dual-domain distillation loss L_att is the sum of the spatial attention distance and the channel attention distance;
     S3, utilizing a parallel context extraction module to strengthen the student model's ability to model features by enhancing global semantic extraction, thereby improving the model's detection accuracy on panoramic driving targets, wherein the parallel context extraction module comprises a branch A and a branch B in a parallel structure; branch A comprises a CBS module, max pooling MP, group normalization GroupNorm, channel scaling Channel_scaling and a linear layer Linear: an input feature Fin is passed through the CBS module and max pooling MP to capture a feature FA1; the feature FA1 is passed through GroupNorm and Channel_scaling to reduce the feature dimension, and the result is added to FA1 to obtain a feature FA2; the feature FA2 is passed through GroupNorm and the linear layer Linear to extract a global feature, which is added to FA2 to obtain a feature FA3; and the feature FA3 is added to the input feature Fin to obtain the output FAout of branch A; branch B comprises a dilated convolution DConv, a SiLU activation function, a re-parameterized dilated convolution RepDConv, a convolution layer Conv, layer normalization LayerNorm and channel concatenation Concat: the input feature Fin is passed through the dilated convolution DConv and the SiLU activation function to capture a feature FB1; the feature FB1 captures feature information at multiple scales through the re-parameterized dilated convolution RepDConv and generates a feature weight Wg through Softmax; the feature weight Wg is multiplied by the feature FB1 to obtain a weighted feature FB2; the feature FB2 is adjusted by a 1x1 convolution layer, added to the input feature Fin, and then passed through layer normalization LayerNorm and a 1x1 convolution layer to obtain a feature FB3; the feature FB3 and the input feature Fin are channel-concatenated and compressed in dimension by a 1x1 convolution layer to obtain the output feature FBout of branch B; and the output features FAout and FBout of branch A and branch B are channel-concatenated and compressed in dimension by a 1x1 convolution layer to obtain the output feature Fout; the CBS module is a combination of a convolution layer Conv, a batch normalization layer BatchNorm and a SiLU activation function; the channel scaling Channel_scaling is a depthwise separable convolution with a 1x1 kernel; the re-parameterized dilated convolution RepDConv adopts, in the training stage of the student model, a dilated convolution with kernel size 5 and dilation rate 2, a dilated convolution with kernel size 3 and dilation rate 2, and a 1x1 convolution layer: the input features are passed through the 5x5 dilated convolution, the 3x3 dilated convolution and the 1x1 convolution layer respectively, and the results are added to obtain the output features; in the inference stage of the student model, the RepDConv kernels are re-parameterized into a single dilated convolution with kernel size 5 and dilation rate 2, the re-parameterization meaning that the kernel parameters of the 5x5 dilated convolution, the 3x3 dilated convolution and the 1x1 convolution layer are added together and set as the kernel parameters of the 5x5 dilated convolution;
     S4, utilizing a layout-aware local refinement module to model pixel features suited to lane lines and the drivable area by enhancing detail features with a specific layout distribution, thereby improving the student model's detection performance on lane-line and drivable-area targets; the specific layout distribution indicates that the lane lines and the drivable area are distributed in the lower half of the image; the layout-aware local refinement module comprises a cross-scale layout-aware layer, a C2f module and a feature aggregation unit: an input feature Lin obtains an enhanced feature L1 through the cross-scale layout-aware layer and the C2f module; the feature L1 and the input feature Lin are passed to the feature aggregation unit to obtain an aggregated feature L2; the aggregated feature L2 obtains an enhanced feature L3 through the cross-scale layout-aware layer and the C2f module; and the feature L3 and the aggregated feature L2 are passed to the feature aggregation unit to obtain the output feature Lout; the cross-scale layout-aware layer comprises a convolution layer Conv, a layout-aware block and channel concatenation Concat: an input feature Cin is passed through a 1x1 convolution layer and a 3x3 convolution layer respectively to obtain features C1 and C2; the feature C1 is passed through the layout-aware block and a 3x3 convolution layer and added to C1 to obtain a layout-enhanced feature Ctvc; the feature Ctvc is passed through the layout-aware block and a 3x3 convolution layer and added to Ctvc to obtain a feature Ctvc2; the feature C2 is passed through a 3x3 convolution layer to obtain a feature C3; and the feature Ctvc2 and the feature C3 are channel-concatenated and passed through a 1x1 convolution layer to obtain the output feature Cout; the layout-aware block comprises TVConv, max pooling MP, a convolution layer, a ReLU activation function and a Sigmoid function: an input feature Tin is passed through TVConv to obtain a layout-enhanced feature T1; the input feature Tin is passed through max pooling MP, a 1x1 convolution layer, the ReLU activation function and a 1x1 convolution layer to obtain salient features, from which the attention weight Watt is obtained through the Sigmoid function; and the feature T1 is multiplied by the attention weight Watt to obtain the output feature Tout; the feature aggregation unit comprises a convolution layer Conv, a Sigmoid function and SimAM attention: a global feature UG is passed through a 1x1 convolution layer to obtain a feature UG1; the feature UG1 is passed through the Sigmoid function to obtain weights Wg; a local feature UL is passed through a 1x1 convolution layer to obtain a feature UL1; the feature UL1 is multiplied by the weights Wg and passed through a 1x1 convolution layer to obtain a globally enhanced feature UL2; the features UL2 and UG1 are added to obtain a locally enhanced feature UG2; and the features UL1 and UG2 are added and passed through SimAM attention to obtain the output feature Uout, the SimAM attention being a parameter-free 3D attention;
     and S5, superposing the outputs of the target detection head, the drivable-area detection head and the lane-line detection head to obtain the final panoramic driving perception prediction result.
  2. The lightweight detection method for panoramic driving perception according to claim 1, wherein the feature information between the teacher model and the student model is aligned by calculating the dual-domain attention distillation loss L_att between the teacher model and the student model with the dual-domain decoupled distillation module; the dual-domain decoupled distillation module comprises a channel calibration regularizer Rc and a spatial alignment regularizer Rs; the channel calibration regularizer Rc calculates the Euclidean distance of the channel attention between the teacher model and the student model, promoting the channel information learned by the student model; the spatial alignment regularizer Rs compresses the feature maps of the teacher model and the student model through a channel compression function CF respectively and directly calculates the Euclidean distance between the compressed feature maps, promoting the spatial information learned by the student model; the dual-domain attention loss L_att between the teacher model and the student model is obtained by adding the result of the channel calibration regularizer Rc and the result of the spatial alignment regularizer Rs; the Euclidean distance is the value obtained by squaring the differences between corresponding vector elements, summing them, and taking the square root of the result; the channel compression function CF takes the absolute values of the input feature map Fea, sums them along the channel dimension, and divides by the Euclidean length; the spatial attention distance Ds is expressed as the Euclidean distance between the teacher feature Tea_i and the student feature Stu_i after the channel compression function CF; the channel calibration regularizer Rc comprises average pooling AP, a 1x1 convolution layer and a linear layer Linear: an input feature is average-pooled to obtain a global spatial response feature, and the channel response features are then obtained through the 1x1 convolution layer, the linear layer Linear and a 1x1 convolution layer; a teacher feature Tea_i and a student feature Stu_i are passed through the channel calibration regularizer Rc respectively to obtain channel response features CR_tea_i and CR_stu_i, and the Euclidean distance between the teacher and student channel response features is then calculated to obtain the channel attention distance Dc.
  3. The lightweight detection method for panoramic driving perception according to claim 1, wherein the training step of the student model comprises: constructing a teacher model and a student model, and loading the trained teacher model weights; constructing a training set, wherein the training set comprises an image sequence, target ground-truth coordinates, and mask maps of the drivable area and the lane lines; inputting the training set into the teacher model and the student model, and training the student model; outputting intermediate feature maps from the teacher model and from the student model; calculating the difference between the intermediate feature maps of the teacher model and the student model and back-propagating; respectively calculating the difference between the student model's predicted target coordinates and the ground-truth target coordinates, the difference between the predicted drivable-area map and the ground-truth drivable-area segmentation mask, and the difference between the predicted lane-line map and the ground-truth lane-line segmentation mask, and back-propagating; and when the loss value reaches its minimum, the network converges, training is stopped, and a trained student model is obtained.
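The inference-time re-parameterization of RepDConv in claim 1 folds the training-time 5x5 dilated, 3x3 dilated and 1x1 branches into a single 5x5 dilated kernel. The minimal NumPy sketch below assumes per-channel 2-D kernels sharing dilation rate 2; the centre-embedding rule (the 3x3 kernel occupying the centre of the 5x5 tap grid and the 1x1 kernel its centre tap) follows the usual structural re-parameterization trick and is an assumption, since the claim only states that the kernel parameters are added.

```python
import numpy as np

def merge_repdconv(k5, k3, k1):
    """Fold the three training-time branches into one 5x5 dilated kernel.

    k5: (5, 5) kernel of the 5x5 dilated convolution (dilation 2)
    k3: (3, 3) kernel of the 3x3 dilated convolution (dilation 2)
    k1: (1, 1) kernel of the 1x1 convolution
    With a shared dilation rate, the 3x3 taps coincide with the centre
    3x3 of the 5x5 tap grid, and the 1x1 tap with its centre element,
    so adding the parameters there preserves the summed branch output.
    """
    merged = k5.copy()
    merged[1:4, 1:4] += k3        # 3x3 branch sits at the centre of the grid
    merged[2, 2] += k1[0, 0]      # 1x1 branch adds to the centre tap
    return merged
```

Because convolution is linear in its kernel, a single convolution with the merged kernel equals the sum of the three branch convolutions, so inference runs one dilated convolution instead of three.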
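The feature aggregation unit of claim 1 (S4) can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation: the weight matrices `w_g`, `w_l` and `w_mix` are hypothetical stand-ins for the three 1x1 convolution layers, and the `simam` function uses the energy formulation commonly attributed to SimAM, which the claim does not spell out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simam(x, lam=1e-4):
    # Parameter-free 3-D attention in the SimAM style (assumed formulation):
    # each activation is weighted by a sigmoid of its normalized energy.
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    e_inv = (x - mu) ** 2 / (4.0 * (var + lam)) + 0.5
    return x * sigmoid(e_inv)

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum('oc,chw->ohw', w, x)

def aggregate(ug, ul, w_g, w_l, w_mix):
    ug1 = conv1x1(ug, w_g)            # global feature UG -> UG1
    wg = sigmoid(ug1)                 # weights Wg
    ul1 = conv1x1(ul, w_l)            # local feature UL -> UL1
    ul2 = conv1x1(ul1 * wg, w_mix)    # globally enhanced feature UL2
    ug2 = ul2 + ug1                   # locally enhanced feature UG2
    return simam(ul1 + ug2)           # output feature Uout
```

The unit thus gates the local path with weights derived from the global path, re-injects the global response, and lets the parameter-free attention rescale the fused result.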

Description

Lightweight detection method for panoramic driving perception

Technical Field

The invention relates to a lightweight detection method for panoramic driving perception, and belongs to the technical field of computer vision.

Background

Panoramic driving perception serves as a core component of an intelligent driving environment perception system, and is a key foundation for guaranteeing panoramic decision planning and the safe operation of intelligent driving. By perceiving the environment surrounding the vehicle and accurately identifying multiple targets, it provides the intelligent driving system with complete core information such as road topology, vehicle targets and drivable areas. Against the background of continuous updating and upgrading of intelligent driving technology, the multi-task detection precision and real-time response efficiency of panoramic driving perception directly influence the environment perception capability and hazard avoidance performance of an intelligent driving vehicle in complex traffic scenes. The rise of deep learning has brought breakthrough progress to the field of panoramic driving perception and reshaped the technical path of traditional panoramic perception methods. Traditional panoramic perception schemes mostly depend on manually constructed feature extraction, so the feature construction process is not only complicated and time-consuming but also difficult to adapt to changing road scenes. Deep-learning-driven panoramic perception models convert the traditionally cumbersome manual feature construction into end-to-end autonomous feature learning, significantly improving the detection precision and environmental adaptability of panoramic perception.

In recent years, deep-learning-based panoramic perception methods have continuously improved the comprehensive performance of panoramic driving perception through network architecture optimization and improved feature fusion strategies. Compared with manual feature approaches, deep-learning panoramic perception models can analyze multidimensional feature information from an input image by automatically mining deep semantic features across scenes, and can maintain stable panoramic perception performance even in complex driving scenes such as abrupt illumination changes, dense multi-target occlusion or unstructured roads. Although existing panoramic driving perception methods have achieved preliminary results, many challenges remain before they can be deployed in real, complex driving scenes. On the one hand, real driving scenes contain environmental interference, which easily causes confusion of panoramic perception features or loss of key information, increasing the difficulty of panoramic driving detection. On the other hand, most panoramic perception models suffer from an excessive number of parameters during multi-task parallel detection, making it difficult to guarantee the detection precision of multiple targets such as lane lines and drivable areas while meeting real-time detection requirements.

Disclosure of Invention

The present invention has been made to solve the above-mentioned problems of conventional methods, and its object is to provide a lightweight detection method for panoramic driving perception.
In order to achieve the above purpose, the technical scheme of the invention is as follows: the lightweight detection method for panoramic driving perception is characterized by comprising the following steps: S1, acquiring a driving video data set, inputting the driving video data set into a teacher model and a student model respectively, and acquiring intermediate feature maps of each model, wherein i denotes the level of a feature, covering both features of the backbone network and features of the neck network; the student model comprises a backbone network, a neck network and detection heads, the detection heads comprising a target detection head, a drivable-area detection head and a lane-line detection head; S2, utilizing a dual-domain decoupled distillation module to align feature information between the teacher model and the student model by calculating a dual-domain attention distance between a teacher feature map Tea_i and a student feature map Stu_i and obtaining a dual-domain distillation loss L_att, wherein the module comprises a channel calibration regularizer Rc and a spatial alignment regularizer Rs; the dual-domain attention distance comprises a channel attention distance calculated by the channel calibration regularizer Rc and a spatial attention distance calculated by the spatial alignment regularizer Rs, each measuring the difference between the teacher model and the student model in attending to panoramic driving perception target areas
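The dual-domain distillation loss described above can be sketched in NumPy. The channel compression function CF and the two Euclidean distances follow claim 2; the pooling-only `channel_attention` below is a hypothetical simplification of the regularizer Rc, which in the patent also contains 1x1 convolution and linear layers.

```python
import numpy as np

def cf(feat):
    # Channel compression function CF: sum absolute values along the
    # channel dimension, then divide by the Euclidean length (claim 2).
    # feat: (C, H, W) feature map.
    m = np.abs(feat).sum(axis=0)              # (H, W) spatial map
    return m / (np.linalg.norm(m) + 1e-12)

def euclidean(a, b):
    # Square the element-wise differences, sum, take the square root.
    return np.sqrt(((a - b) ** 2).sum())

def channel_attention(feat):
    # Simplified stand-in for the channel calibration regularizer Rc:
    # only the average-pooling step is kept; the patent's Rc additionally
    # applies 1x1 convolution and linear layers to this response.
    v = feat.mean(axis=(1, 2))                # (C,) global spatial response
    return v / (np.linalg.norm(v) + 1e-12)

def dual_domain_loss(tea, stu):
    ds = euclidean(cf(tea), cf(stu))          # spatial attention distance Ds
    dc = euclidean(channel_attention(tea),
                   channel_attention(stu))    # channel attention distance Dc
    return dc + ds                            # L_att = Dc + Ds
```

The loss vanishes when student features exactly reproduce the teacher's channel and spatial attention, and grows with either kind of misalignment, which is what drives the student toward the teacher's attended regions.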