CN-121999219-A - Traffic road segmentation method based on multi-domain feature alignment under severe weather
Abstract
A traffic road segmentation method under severe weather based on multi-domain feature alignment. To address the degraded road segmentation performance caused by image degradation and low visibility in rain, snow, fog, and low-illumination night scenes, a segmentation network named MDFANet is proposed. The network is built around three novel modules: a dynamic adaptive weighted spatial pyramid pooling module (DAWASPP), a hierarchical attention mechanism, and a progressive feature decoder (PFD), together with multi-stage feature alignment and fusion. Multi-scale dilated convolution with dynamic weight fusion replaces the traditional static fusion strategy, improving the model's adaptability to complex weather; the hierarchical combination of a geometric attention module (GAB) and a channel-spatial attention module (CSAB) suppresses noise and enhances road structure features; and multi-stage feature alignment and fusion resolves the split between semantic and detail information, improving boundary segmentation precision. Experiments show that the method clearly outperforms mainstream models on both a self-built dataset and public datasets, with higher segmentation accuracy, robustness, and real-time performance in severe weather.
Inventors
- DING XIAOBO
- LIU YANRU
- CHEN PAN
- ZHOU HAORAN
- REN ZHENGYANG
Assignees
- 三峡大学 (China Three Gorges University)
Dates
- Publication Date
- 20260508
- Application Date
- 20260122
Claims (10)
- 1. A traffic road segmentation method based on multi-domain feature alignment under severe weather, characterized by comprising the following steps: step 1, collecting traffic road image data under severe weather conditions, and performing image screening and preprocessing; step 2, annotating the dataset with multiple categories of road areas and dynamic traffic objects using an annotation tool, and dividing it into a training set, a validation set, and a test set according to a preset proportion; step 3, constructing a traffic road segmentation network MDFANet based on multi-domain feature alignment, where the network comprises a dynamic adaptive weighted spatial pyramid pooling module DAWASPP, a hierarchical attention mechanism, and a progressive feature decoder PFD; step 4, inputting the dataset into the segmentation network for training, segmenting severe-weather road images with the trained segmentation model, and using the mean intersection over union (MIoU), mean pixel accuracy (MPA), computational cost (FLOPs), parameter count (Params), average inference time (TLatency), precision (Pm), and accuracy (Acc) as evaluation indexes; and step 5, segmenting images to be predicted with the trained model and outputting the segmentation result map.
- 2. The method according to claim 1, wherein in step 1, the collected dataset covers various severe weather scenes such as rain, snow, fog, and low-illumination night, and includes urban road, expressway, and rural road types, to ensure the diversity of the data in weather conditions, illumination intensity, and obstacle types, thereby providing a rich scene basis for training the subsequent model; the collected images are then processed.
- 3. The method of claim 2, wherein processing the collected images includes preprocessing operations and data enhancement; the preprocessing operations comprise image cropping, size normalization, and color correction to ensure the uniformity and quality of data input, and the data enhancement comprises rotation, flipping, and brightness adjustment, further expanding the data distribution range and enhancing the diversity of training samples.
- 4. The method according to claim 1, wherein step 2 specifically comprises: in the data preparation stage, first annotating the image samples in the original dataset frame by frame with an annotation tool, ensuring that the target area and the background area in each image are accurately labeled, so as to provide high-quality supervision information for subsequent model training; after labeling, dividing the dataset into a training set, a validation set, and a test set according to a preset proportion, where the training set is used for parameter learning of the model, the validation set is used for intermediate evaluation and parameter tuning of model performance during training, and the test set is used for objective performance inspection and generalization evaluation of the model after training is completed.
- 5. The method according to claim 1, wherein step 4 specifically comprises: the training process uses an Adam optimizer, sets hyperparameters including the learning rate, weight decay, and number of training epochs, and optimizes model parameters through gradient descent; after training is completed, road images in severe weather are inferred and segmented with the trained segmentation model to obtain the corresponding segmentation results; and, in order to comprehensively and objectively evaluate the performance of the model in practical application, multi-dimensional evaluation indexes are introduced: the mean intersection over union (MIoU) measures the overlap between the prediction result and the ground-truth label on the target area, reflecting the pixel-level segmentation precision of the model; the mean pixel accuracy (MPA) evaluates the average classification accuracy of the model on each category, reflecting the balanced performance of the model across categories; the computational cost (FLOPs) measures the number of floating-point operations required for a single inference, directly reflecting computation speed and energy consumption, and is a key index of actual execution efficiency, often combined with the parameter count for a comprehensive analysis of deployment feasibility; the parameter count (Params) measures the overall complexity and storage overhead of the model, reflecting its deployment feasibility on resource-constrained devices; the average inference time (TLatency) evaluates the average processing speed of the model on a single image, verifying its real-time performance in actual scenes; the precision (Pm) measures the proportion of true positives among the pixels the model predicts as positive, reflecting its ability to suppress false segmentation of non-road areas; and the accuracy (Acc) measures the overall consistency between the model prediction and the ground truth, reflecting the global segmentation performance.
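As a hedged illustration (not part of the claims), the MIoU, MPA, and Acc indexes named above can all be derived from one confusion matrix; the function name and the flat integer-label layout below are assumptions made for this sketch.

```python
import numpy as np

def segmentation_metrics(pred, label, num_classes):
    """Compute MIoU, MPA, and Acc from per-pixel class indices (illustrative sketch)."""
    pred = np.asarray(pred).ravel()
    label = np.asarray(label).ravel()
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(label * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    with np.errstate(invalid="ignore", divide="ignore"):
        per_class_iou = tp / (cm.sum(0) + cm.sum(1) - tp)  # intersection / union
        per_class_pa = tp / cm.sum(1)                      # per-class pixel accuracy
    return {
        "MIoU": float(np.nanmean(per_class_iou)),
        "MPA": float(np.nanmean(per_class_pa)),
        "Acc": float(tp.sum() / cm.sum()),
    }
```

For example, with predictions `[0, 1, 1, 0]` against labels `[0, 1, 0, 0]` and two classes, three of four pixels match, so Acc is 0.75.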
- 6. The method according to claim 1, wherein in step 5, the segmentation result is visualized to show the separation of the traffic road area from the background, so as to verify the practicability of the model in actual severe weather.
- 7. The method according to claim 1, wherein in step 3, the traffic road segmentation network architecture under severe weather based on multi-domain feature alignment is specifically: first, the input image is passed through a MobileNetv backbone, which sequentially outputs a shallow feature map F1, a shallower feature map F2, a middle-layer feature map F3, and a deep feature map F4; the deep feature map F4 is input into a first CSAB module and the DAWASPP module, where the output of the first CSAB module is feature F6 and the output of the dynamic adaptive weighted spatial pyramid pooling module DAWASPP is feature F5; features F5 and F6 serve as the inputs of a first progressive feature decoder PFD, whose output is the deep semantic feature F7; then, the middle-layer feature map F3 passes through a second CSAB module, whose output feature F8, together with the deep semantic feature F7, serves as the input of a second PFD and is fused into the middle-layer multi-scale fused semantic feature F9; next, the shallower feature map F2 passes through a first GAB module, which outputs feature map F10, and F10 together with F9 serves as the input of a third PFD to obtain the shallow multi-scale fused semantic feature map F11; finally, the shallow feature map F1 passes through a second GAB module, and the resulting feature map F12, together with F11, serves as the input of a fourth PFD to obtain the segmentation result map corresponding to the input image; through this multi-stage process of encoding, enhancement, fusion, and reconstruction, the structure achieves accurate segmentation of the road area in complex scenes.
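The four-stage data flow of this claim can be summarized as a minimal wiring sketch. The callables passed in are placeholders standing in for the actual CSAB, GAB, DAWASPP, and PFD implementations; only the connectivity (F1 through F12) follows the claim.

```python
def mdfanet_forward(f1, f2, f3, f4,
                    csab1, csab2, gab1, gab2, dawaspp,
                    pfd1, pfd2, pfd3, pfd4):
    """Wiring of MDFANet's decoder cascade (modules are placeholder callables)."""
    f5 = dawaspp(f4)        # multi-scale semantics from the deep feature map
    f6 = csab1(f4)          # channel-spatial attention on the deep feature map
    f7 = pfd1(f5, f6)       # deep semantic feature
    f8 = csab2(f3)          # attention on the middle-layer feature map
    f9 = pfd2(f7, f8)       # middle-layer multi-scale fused semantics
    f10 = gab1(f2)          # geometric attention on the shallower feature map
    f11 = pfd3(f9, f10)     # shallow multi-scale fused semantics
    f12 = gab2(f1)          # geometric attention on the shallow feature map
    return pfd4(f11, f12)   # segmentation result map
```

Substituting trivial callables (identity modules and additive decoders) makes the dependency order easy to check before the real modules are plugged in.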
- 8. The method according to claim 7, wherein the dynamic adaptive weighted spatial pyramid pooling module DAWASPP specifically operates as follows: the deep feature map F4 is input into a first 1×1 convolution layer of DAWASPP for channel compression and basic semantic extraction, yielding a first convolution feature map, and is also input into three 3×3 dilated convolution layers with different dilation rates, namely a dilated convolution layer with dilation rate 3, one with dilation rate 6, and one with dilation rate 9, yielding a first, second, and third dilated convolution feature map respectively; the three dilated convolution feature maps are then input into independent channel attention mechanisms to adaptively recalibrate the importance of each channel's features: each dilated convolution feature map is input to a global average pooling layer to extract statistics describing the global information of each channel, the channel statistics are input to a first fully connected layer for dimensionality reduction, nonlinearity is introduced through a ReLU activation function, a second fully connected layer restores the original dimensionality, and finally a Sigmoid activation function generates the channel attention weights; next, the first convolution feature map is input to a 3×3 convolution layer, and after a softmax function, adaptive weights w1, w2, w3, w4, and w5 corresponding to the five scales are generated; this weight generation mechanism can dynamically adjust the importance of different scales according to the feature content; subsequently, the first convolution feature map is multiplied by weight w1 to output a first weighted feature map; the first attention-enhanced dilated convolution feature map is multiplied by weight w2 to output a second weighted feature map; the second attention-enhanced dilated convolution feature map is multiplied by weight w3 to output a third weighted feature map; the third attention-enhanced dilated convolution feature map is multiplied by weight w4 to output a fourth weighted feature map; and the globally pooled, up-sampled feature map is multiplied by weight w5 to output a fifth weighted feature map; the five weighted feature maps are then spliced along the channel dimension to obtain a spliced feature map, which is input to a 1×1 convolution layer for channel integration and feature compression, yielding the final output feature map F5 of the DAWASPP module; through multi-scale dilated convolution fused in parallel with a channel attention mechanism and a global semantic path, the module dynamically adjusts and enhances semantic information at different scales, effectively improving feature expression capability in severe weather scenes.
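The dynamic five-scale weighting at the heart of DAWASPP can be sketched as follows. This is an illustrative numpy reduction, not the claimed implementation: `scale_logits` stands in for the output of the 3×3 convolution on the first convolution feature map, and the trailing 1×1 integration convolution is omitted.

```python
import numpy as np

def dynamic_scale_fusion(branches, scale_logits):
    """Softmax-weighted fusion of five scale branches (illustrative sketch).

    branches: list of five feature maps, each shaped (C, H, W).
    scale_logits: (5, H, W) logits, one per scale, at every spatial position.
    """
    # Softmax over the scale axis yields the adaptive weights w1..w5.
    z = scale_logits - scale_logits.max(axis=0, keepdims=True)  # numeric stability
    w = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    # Weight each branch (weights broadcast over channels), then splice channels.
    weighted = [b * w[i] for i, b in enumerate(branches)]
    return np.concatenate(weighted, axis=0)  # followed by a 1x1 conv in the module
```

With equal logits the softmax assigns each scale a weight of 0.2, which is the static-fusion baseline the dynamic mechanism generalizes.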
- 9. The method according to claim 7, wherein the hierarchical attention mechanism is specifically: in the geometric attention module GAB, the shallow feature map F1 and the shallower feature map F2 are each input into several parallel two-dimensional convolution layers with different kernel sizes, comprising a 1×3, a 3×1, a 1×5, a 5×1, a 1×7, and a 7×1 convolution layer, which extract edge, texture, and geometric structure information along different directions and under different receptive fields; the 1×3 and 3×1 convolution feature maps are added element by element to obtain a first direction-fusion feature map, the 1×5 and 5×1 convolution feature maps are added element by element to obtain a second direction-fusion feature map, and the 1×7 and 7×1 convolution feature maps are added element by element to obtain a third direction-fusion feature map; the three direction-fusion feature maps are then spliced along the channel dimension to obtain a direction-fusion spliced feature map; meanwhile, the shallow feature map F1 and the shallower feature map F2 are each input into a lightweight attention module to realize cross-feature-map interaction and information screening: the input feature map is first average-pooled to compress the spatial dimensions and aggregate global context into a compact feature representation; the aggregated feature then passes through a 1×1 convolution layer for dimensionality reduction and preliminary fusion, reducing computation and integrating cross-layer information; a ReLU activation function injects nonlinearity into the model to enhance its expression capability; another 1×1 convolution layer performs dimensionality recovery and feature refinement to learn a more accurate interaction relationship; and finally a Sigmoid function generates a spatially and channel-wise adaptive attention weight map, used to recalibrate the original or intermediate features; this operation dynamically weights the fused features with the weight map to highlight key information, producing a finely calibrated geometric-attention feature map; finally, the shallow feature map F1 and the shallower feature map F2 are multiplied element by element with the geometric-attention feature map, guiding and screening the detail features within them and providing precise spatial supplements to the high-level semantics, and the detail-enhanced attention feature maps F10 and F12 are output; in the channel-spatial attention module CSAB, the input feature map is first passed through a global average pooling layer and a global max pooling layer to obtain, respectively, the mean features characterizing the global channel context and the extreme-value features highlighting salient responses, yielding two groups of channel statistics; each group is passed through a first 1×1 convolution layer for dimensionality reduction, a ReLU activation function to inject nonlinearity, and a second 1×1 convolution layer to restore the dimensionality and further refine the feature relationship, yielding two sub-features; the two sub-features are added element by element to obtain a fused feature, realizing fusion and enhancement of the dual statistics, and the fused feature is normalized by a Sigmoid activation function to generate the channel attention weight map; then, two groups of spatial statistics are spliced along the channel dimension to fuse multi-view spatial information, input into a 1×1 convolution layer for cross-channel interaction and information integration, and passed through a Sigmoid activation function to generate the spatial attention weights; the spatial attention weights are multiplied pixel by pixel with the channel-enhanced feature map, adaptively weighting the spatial dimensions of the feature map, highlighting important areas and suppressing irrelevant background, to obtain the spatially enhanced feature maps F6 and F8, which serve as the outputs of the CSAB module, providing semantic features after channel recalibration and spatial saliency enhancement; through the synergy of shallow geometric structure enhancement and mid-to-deep semantic saliency enhancement, the hierarchical attention mechanism enables the network to effectively suppress noise interference and highlight key areas in severe weather, improving overall feature expression capability and segmentation robustness.
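The channel branch of CSAB (dual pooling, a shared two-layer transform, element-wise summation, and Sigmoid normalization) can be sketched in numpy. The plain weight matrices `w_reduce` and `w_restore` are assumptions standing in for the two 1×1 convolutions; this is an illustration, not the claimed implementation.

```python
import numpy as np

def csab_channel_attention(x, w_reduce, w_restore):
    """CSAB channel-attention sketch. x: (C, H, W);
    w_reduce: (C//r, C) and w_restore: (C, C//r) stand in for the 1x1 convs."""
    avg = x.mean(axis=(1, 2))   # global average pooling -> mean features
    mx = x.max(axis=(1, 2))     # global max pooling -> extreme-value features
    def transform(v):           # reduce -> ReLU -> restore (shared for both paths)
        return w_restore @ np.maximum(w_reduce @ v, 0.0)
    # Element-wise sum of the two sub-features, then Sigmoid normalization.
    gate = 1.0 / (1.0 + np.exp(-(transform(avg) + transform(mx))))
    return x * gate[:, None, None]  # channel recalibration of the input
```

On a constant all-ones input both pooled statistics coincide, so the gate is simply sigmoid of twice the transformed vector, which makes the dual-path summation easy to verify.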
- 10. The method according to claim 7, characterized in that the progressive feature decoder module PFD specifically operates as follows: first, the output feature map F5 of the dynamic adaptive weighted spatial pyramid pooling DAWASPP module is input to a first up-sampling module to obtain a first up-sampled feature map, which is spliced along the channel dimension with the spatially enhanced feature map F6 from the first channel-spatial attention module CSAB, achieving a preliminary fusion of high-level semantic information and middle-level spatial detail and yielding a first spliced feature map; the first spliced feature map is then input to a 1×1 convolution layer for channel compression and feature screening, reducing redundancy while retaining key information; the result is passed sequentially through batch normalization (BN) and a ReLU activation function to normalize the feature distribution and introduce nonlinear expression capability, yielding a preliminary enhanced feature map; the preliminary enhanced feature map is then input into a depthwise separable convolution layer, whose separation of spatial and channel-wise convolution enhances spatial structure modeling while reducing computation; BN and ReLU are applied again for further normalization and activation to obtain a depth-enhanced feature map; the depth-enhanced feature map is input to a global average pooling layer, aggregating global spatial information into a channel descriptor vector to obtain pooled features, which are input to a 1×1 convolution layer for channel-relationship recalibration and then to a Sigmoid activation function to obtain adaptive recalibration weights; the adaptive recalibration weights are multiplied channel by channel with the first spliced feature map to obtain a recalibrated feature map; finally, the recalibrated feature map and the first spliced feature map are added element by element through a residual connection, preserving information and improving gradient flow, to obtain the fused feature map.
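The recalibration-plus-residual tail of the PFD decoder can be sketched as follows. This is an illustrative numpy reduction: the matrix `w` stands in for the 1×1 recalibration convolution, and the up-sampling, splicing, and depthwise-separable stages are assumed to have already produced the two inputs.

```python
import numpy as np

def pfd_recalibrate(spliced, enhanced, w):
    """PFD tail sketch. spliced: first spliced feature map (C, H, W);
    enhanced: depth-enhanced feature map (C, H, W); w: (C, C) recalibration weight."""
    desc = enhanced.mean(axis=(1, 2))         # global average pooling -> channel descriptor
    gate = 1.0 / (1.0 + np.exp(-(w @ desc)))  # 1x1 conv + Sigmoid -> weights in (0, 1)
    recal = spliced * gate[:, None, None]     # channel-by-channel multiplication
    return recal + spliced                    # residual connection back to the spliced map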
Description
Traffic road segmentation method based on multi-domain feature alignment under severe weather

Technical Field

The invention relates to the technical field of computer vision and automatic driving, in particular to a method for segmenting traffic roads in severe weather based on deep learning, used to improve the perception robustness of autonomous vehicles in complex environments such as rain, snow, and heavy fog.

Background

With the rapid development of artificial intelligence technology, autonomous vehicles are gradually moving from the concept stage to large-scale application and have become an important component of intelligent transportation. Against the background of smart-city construction and new-generation traffic infrastructure, autonomous driving technology aims at key tasks such as improving road traffic efficiency, reducing the traffic accident rate, and relieving congestion. At present, in conventional scenes such as urban streetscapes and unstructured roads, the key technologies of autonomous driving systems are relatively mature, achieving high-level environment perception and path planning. However, these advances are mostly based on relatively ideal or single-scene conditions. When the vehicle runs in severe weather such as rain, snow, and heavy fog, or in complex environments such as insufficient illumination at night, the robustness of existing systems remains clearly insufficient.
In heavy fog, rain, and snow, water drops, haze, and light-spot interference in images damage target contours, making it difficult for a model to extract stable features; as a result, the boundary segmentation of the road area is prone to blurring, vehicles have difficulty accurately identifying the road extent, and false or missed detections of dynamic traffic targets such as pedestrians and vehicles often occur, affecting the integrity and reliability of environmental perception. These problems not only impair the decision-making ability of the autonomous vehicle, but may also reduce road traffic efficiency, disorder traffic, and even increase the risk of traffic accidents. Improving the robust perception of autonomous driving in complex environments has therefore become one of the key challenges for its real-world deployment. At present, semantic segmentation of road scenes has advanced remarkably and achieves good segmentation results. The pioneering fully convolutional network (FCN) laid an important foundation for image semantic segmentation research, but it struggles to effectively model global context and multi-scale information. The pyramid scene parsing network (PSPNet) integrates multi-scale context through a pyramid pooling module, alleviating the FCN's limited receptive field and insufficient use of context and remarkably improving global scene parsing performance. Ronneberger et al., by introducing an encoder-decoder structure with skip connections, effectively alleviated the loss of spatial information during up-sampling.
The DeepLab series of models proposed by the Google team adopts dilated convolution to enlarge the receptive field while maintaining resolution, effectively addressing the spatial information loss caused by continuous downsampling in conventional convolutional neural networks. Wang Biyao et al. improved BiSeNetV and constructed a segmentation network for structured roads by enhancing the semantic consistency of lane-line features, achieving higher segmentation accuracy in open road scenes. SegFormer is based on a hierarchical Transformer encoder; its decoder discards complex convolution and up-sampling modules, consisting only of lightweight fully connected layers, and is used for semantic segmentation of road regions. DDRNet proposes a dual-resolution network that achieves higher inference speed while preserving rich detail by maintaining high-resolution feature maps in parallel with deep semantic downsampling. LCFNet retains the original features of the input image through a compensating branch, so that the detail branch and the semantic branch can extract the information they need from the original features. Yang Lu et al. effectively achieved accurate segmentation of the drivable area of unstructured roads by integrating a lightweight pyramid pooling module. Zhang et al. proposed an improved STDC structure combined with DCFFM and CBAM modules to effectively enhance the model's feature expression on complex park roads. Luo et al. constructed the IDS-MODEL shared multitask framework, realizing efficient fused inference of instance segmentation and drivable-area segmentation and markedly reducing perception-system latency and computational cost. In order to cope with the p