CN-121982305-A - Real-time semantic segmentation method based on Transformer improvement
Abstract
The application relates to a real-time semantic segmentation method based on an improved Transformer, comprising the following steps: preprocessing an open-source autonomous driving scene dataset, assigning a class label to each pixel of each image, and dividing the dataset into a training set and a test set; constructing a real-time semantic segmentation model; training the model; and performing semantic segmentation with the trained model. The real-time semantic segmentation model comprises a shallow convolution layer, a semantic feature enhancement branch, a fusion branch, an edge enhancement branch, two semantic segmentation heads and an edge segmentation head. The method addresses three problems of existing real-time semantic segmentation methods: insufficient interaction among multi-branch features, limited attention to salient regions, and limited modeling of multi-scale features.
Inventors
- LIU BO
- YU WENLING
- LIU HUA
- ZHANG KUN
Assignees
- East China University of Technology (东华理工大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260119
Claims (7)
- 1. A real-time semantic segmentation method based on an improved Transformer, characterized by comprising the following steps: S1, preprocessing an open-source autonomous driving scene dataset, assigning a class label to each pixel of each image, and dividing the dataset into a training set and a test set; S2, constructing a real-time semantic segmentation model comprising a shallow convolution layer, a semantic feature enhancement branch, a fusion branch, an edge enhancement branch, two semantic segmentation heads and an edge segmentation head, wherein the shallow convolution layer performs basic feature extraction and downsampling on an input image to generate a main feature map; the semantic feature enhancement branch performs region-scale feature enhancement on the main feature map and outputs a semantic segmentation map through the first semantic segmentation head, from which the semantic cross-entropy loss L_s1 is computed against the ground-truth segmentation map; the fusion branch outputs high-dimensional features to the semantic feature enhancement branch and the edge enhancement branch; the outputs of the semantic feature enhancement branch, the fusion branch and the edge enhancement branch are aligned by downsampling and concatenated into a joint feature, which is passed through the second semantic segmentation head to output a probability map, from which the fused semantic cross-entropy loss L_s2 is computed against the ground-truth segmentation map; and the edge enhancement branch captures object edges and structural details in the main feature map and outputs an edge probability map through the edge segmentation head, from which the boundary cross-entropy loss L_b is computed against the ground-truth boundary map; S3, training the real-time semantic segmentation model, taking the weighted sum of the semantic cross-entropy loss L_s1, the fused semantic cross-entropy loss L_s2 and the boundary cross-entropy loss L_b as the overall loss function; S4, performing semantic segmentation with the trained real-time semantic segmentation model.
- 2. The method according to claim 1, characterized in that the semantic feature enhancement branch comprises a two-stage residual convolution layer and two multi-scale cross-attention modules; in the two-stage residual convolution layer the main feature map undergoes convolution, normalization, ReLU activation, convolution and normalization, and is then added element-wise to the input features to form a first residual output; one path of the first residual output, together with the output of the fusion branch, is enhanced by a multi-scale cross-attention module and fed into the first semantic segmentation head, while the other path, together with the output of the fusion branch, is enhanced again by a multi-scale cross-attention module and then fused with the outputs of the fusion branch and the edge enhancement branch.
- 3. The method according to claim 2, characterized in that in the multi-scale cross-attention module, the first residual output (or a previously enhanced output) is processed by a GPU-friendly attention mechanism; the output of the fusion branch is convolved and residual-connected, then processed by an FFN module and residual-connected again; one path of the resulting output is fused with the output of the GPU-friendly attention mechanism, the fused result serving as the query vector Q, while the other path undergoes convolution and pooling to yield the key vector K and the value vector V; Q, K and V pass through a cross-attention mechanism to achieve cross-scale semantic interaction, followed by a residual connection, then nonlinear enhancement by an FFN module and another residual connection; finally, the result is locally refined by a local window attention mechanism and nonlinearly enhanced by an FFN module to obtain the final enhanced features.
- 4. The method for real-time semantic segmentation based on an improved Transformer according to claim 3, characterized in that the edge enhancement branch comprises a three-stage residual convolution layer and two feature fusion modules; in the three-stage residual convolution layer the main feature map undergoes convolution, normalization, ReLU activation, convolution and normalization, and is then added element-wise to the input features to form a third residual output; the third residual output and the output of the fusion branch undergo feature fusion twice in sequence, with one path of the result fed into the edge segmentation head and the other path fused with the outputs of the fusion branch and the semantic feature enhancement branch.
- 5. The method according to claim 4, characterized in that the fusion branch comprises two three-stage residual convolution layers, a regional attention mechanism module, and a multi-scale dilated convolution and global pooling module; in the first three-stage residual convolution layer the main feature map undergoes convolution, normalization, ReLU activation, convolution and normalization, and is then added element-wise to the input features to form a second residual output, which is fed into the semantic feature enhancement branch and the edge enhancement branch and enhanced and fused with the first and third residual outputs respectively; meanwhile, the second residual output is fed into the regional attention mechanism module, which dynamically weights salient regions to strengthen the model's response to key ground-object regions; the output of the regional attention mechanism module is fed into the semantic feature enhancement branch and the edge enhancement branch, enhanced and fused with the first and third residual outputs respectively, and is also processed by the second three-stage residual convolution layer and fed into the multi-scale dilated convolution and global pooling module to obtain a multi-scale fused output, which is fused with the outputs of the semantic feature enhancement branch and the edge enhancement branch.
- 6. The method according to claim 5, characterized in that the multi-scale dilated convolution and global pooling module comprises a main path, a multi-stage residual convolution feature enhancement branch and a multi-scale dilated convolution branch; the main path comprises a global average pooling layer, a convolution layer and an upsampling layer; the multi-stage residual convolution feature enhancement branch comprises four convolution layers, and the multi-scale dilated convolution branch comprises four dilated convolution layers; the output of the regional attention mechanism module passes through the main path, where global semantic information is extracted by global average pooling, channels are compressed and features enhanced by the convolution layer, and spatial size is aligned by the upsampling operation; it also undergoes stage-wise feature extraction with residual connections in the multi-stage residual convolution feature enhancement branch, and multi-scale context extraction with residual connections in the multi-scale dilated convolution branch; finally, the outputs of the main path and the two branches are concatenated, and channel fusion and feature smoothing are performed by a convolution layer to obtain the multi-scale fused output.
- 7. The method for real-time semantic segmentation based on an improved Transformer according to claim 6, wherein the overall loss function is expressed as: L = λ1·L_s1 + λ2·L_b + λ3·L_s2, wherein L denotes the overall loss function, λ1 denotes the weight coefficient of the semantic cross-entropy loss L_s1, λ2 denotes the weight coefficient of the boundary cross-entropy loss L_b, and λ3 denotes the weight coefficient of the fused semantic cross-entropy loss L_s2.
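The conv → norm → ReLU → conv → norm → element-wise add pattern shared by the residual convolution layers of claims 2, 4 and 5 can be sketched in one dimension. This is a minimal illustrative sketch, not the patent's implementation: the kernels, the normalization (plain standardization here) and the 1-D setting are all assumptions.

```python
import numpy as np

def conv1d(x, k):
    """'Same'-size 1-D convolution (placeholder for a learned conv layer)."""
    return np.convolve(x, k, mode="same")

def norm(x, eps=1e-5):
    """Placeholder normalization: standardize to zero mean, unit std."""
    return (x - x.mean()) / (x.std() + eps)

def residual_block(x, k1, k2):
    """conv -> norm -> ReLU -> conv -> norm, then element-wise addition
    with the input, as described for the residual convolution layers."""
    y = norm(conv1d(x, k1))
    y = np.maximum(y, 0.0)   # ReLU activation
    y = norm(conv1d(y, k2))
    return x + y             # residual (element-wise) addition
```

The residual addition preserves the input's shape, which is what lets the first/second/third residual outputs be routed to several branches unchanged.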
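The core of the multi-scale cross-attention module in claim 3 is a cross-attention step in which Q comes from one feature stream and K, V from another (pooled to a coarser scale). The sketch below shows standard single-head scaled dot-product cross-attention only; the GPU-friendly attention, FFN modules and local window refinement around it are omitted, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention.
    q: (Nq, d) queries from one stream; k, v: (Nk, d) from the other.
    Returns (Nq, d): each query attends over the other stream's tokens."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))    # fine-scale query features
kv = rng.normal(size=(4, 8))    # pooled (coarse-scale) key/value features
out = cross_attention(q, kv, kv)
```

Because K and V come from a pooled (coarser) representation while Q stays at the finer scale, the attention realizes the cross-scale semantic interaction the claim describes.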
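The "dynamic weighting of salient regions" performed by the regional attention mechanism module of claim 5 can be sketched as a spatial reweighting: a saliency map squashed to (0, 1) scales the feature map per pixel. The saliency projection here (a channel mean) is a stand-in assumption; the patent does not specify the module at this level of detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regional_attention(feat):
    """feat: (C, H, W) feature map -> spatially reweighted (C, H, W).
    A per-pixel saliency score (placeholder: channel mean) is mapped to
    (0, 1) and broadcast over channels to emphasize salient regions."""
    saliency = sigmoid(feat.mean(axis=0))   # (H, W) attention weights
    return feat * saliency[None, :, :]      # dynamic per-pixel weighting

f = np.ones((3, 4, 4))
weighted = regional_attention(f)
```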
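The multi-scale dilated convolution and global pooling module of claim 6 combines parallel dilated branches (widening the receptive field at different rates) with a global-average path for image-level context, then concatenates the branch outputs. The 1-D sketch below illustrates that structure only; the kernel values, the dilation rates (1, 2, 4) and the number of branches are assumptions, not the patent's parameters.

```python
import numpy as np

def dilated_conv1d(x, k, rate):
    """'Same'-size 1-D convolution with kernel k dilated by `rate`:
    taps are spaced `rate` samples apart, enlarging the receptive field."""
    pad = (len(k) - 1) * rate // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        for j, w in enumerate(k):
            out[i] += w * xp[i + j * rate]
    return out

def multi_scale_module(x, k=np.array([0.25, 0.5, 0.25])):
    """Parallel dilated branches + global-average path, concatenated."""
    branches = [dilated_conv1d(x, k, r) for r in (1, 2, 4)]
    gap = np.full_like(x, x.mean())          # global pooling, broadcast back
    return np.concatenate(branches + [gap])  # stand-in for channel fusion

x = np.arange(8.0)
fused = multi_scale_module(x)
```

In the patent the concatenated result is further smoothed by a convolution layer; that final fusion conv is omitted here.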
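The overall loss of step S3 and claim 7 is a plain weighted sum of the three cross-entropy terms. As a sketch (the weight names and default values are illustrative; the patent does not fix them):

```python
def overall_loss(l_s1, l_s2, l_b, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the semantic cross-entropy loss L_s1, the fused
    semantic cross-entropy loss L_s2 and the boundary loss L_b."""
    return w1 * l_s1 + w2 * l_s2 + w3 * l_b
```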
Description
Real-time semantic segmentation method based on Transformer improvement
Technical Field
The application relates to the technical field of real-time semantic segmentation, and in particular to a real-time semantic segmentation method based on an improved Transformer.
Background
In autonomous driving scenarios, semantic segmentation enables accurate identification of lane lines, obstacles, traffic signs and free space, providing support for vehicle path planning and decision making. In recent years, as high-resolution semantic segmentation datasets have grown richer, the computational complexity of network models has increased; high-precision semantic segmentation networks such as DeepLabv3+ and PSPNet, although excellent in segmentation accuracy, are difficult to deploy on low-power devices or in real-time inference scenarios. Real-time semantic segmentation networks have therefore become a research hotspot, aiming to markedly improve inference speed while maintaining high accuracy. The invention patent with publication number CN116612288B provides a multi-scale lightweight real-time semantic segmentation method and system: shallow, middle and deep features are extracted by a lightweight encoder and enhanced by an attention module, while an object-context feature fusion module enriches semantic information, achieving high segmentation accuracy on low-compute devices. The invention patent with publication number CN120823388A provides a real-time semantic segmentation method based on hierarchical feature fusion and channel attention enhancement; by constructing a segmentation model comprising multi-layer feature fusion and channel attention enhancement, it alleviates the loss of local detail information and the inconsistency of intra-class semantic labels.
Although the prior art has made significant progress in structural optimization and speed, the following key limitations remain in complex traffic scenarios and multi-class fine-grained segmentation tasks: (1) insufficient information coupling among feature branches: in fusing spatial and semantic features, conventional two- or three-branch architectures still suffer information loss and semantic drift, making it difficult to achieve an optimal balance between boundary detail and global consistency; (2) limited focus on salient features due to inadequate regional attention mechanisms: conventional channel/spatial attention modules (such as SE and CBAM) mainly apply global weighting and lack an adaptive focusing mechanism for salient regions of different scales or shapes (such as buildings and roads); (3) limited multi-scale feature modeling: pyramid pooling modules can enlarge the receptive field but introduce high computational cost, cannot guarantee segmentation consistency between boundaries and body regions, and are unfavorable for deployment on resource-limited terminal devices.
Disclosure of Invention
The invention aims to provide a real-time semantic segmentation method based on an improved Transformer, which addresses the problems of insufficient multi-branch feature interaction, limited salient-region attention capability and limited multi-scale feature modeling in existing real-time semantic segmentation methods.
The technical scheme adopted by the invention is a real-time semantic segmentation method based on an improved Transformer, comprising the following steps: S1, preprocessing an open-source autonomous driving scene dataset, assigning a class label to each pixel of each image, and dividing the dataset into a training set and a test set; S2, constructing a real-time semantic segmentation model comprising a shallow convolution layer, a semantic feature enhancement branch, a fusion branch, an edge enhancement branch, two semantic segmentation heads and an edge segmentation head, wherein the shallow convolution layer performs basic feature extraction and downsampling on an input image to generate a main feature map; the semantic feature enhancement branch performs region-scale feature enhancement on the main feature map and outputs a semantic segmentation map through the first semantic segmentation head, from which the semantic cross-entropy loss L_s1 is computed against the ground-truth segmentation map; the fusion branch outputs high-dimensional features to the semantic feature enhancement branch and the edge enhancement branch; the outputs of the semantic feature enhancement branch, the fusion branch and the edge enhancement branch are aligned by downsampling and concatenated into a joint feature, the joi