
CN-116468740-B - Image semantic segmentation model and segmentation method

CN 116468740 B

Abstract

The invention discloses an image semantic segmentation model built from a designed mixed attention focusing method, an attention-correction residual module (ARRM) and a mixed feature fusion module (MFFM). The model is trained as a whole under deep supervision, incorporates deformable convolution where appropriate, and jointly optimizes a constructed multi-scale spatial attention module (MSP) and a double-pooling attention module (DPA), thereby addressing the difficulty of segmenting small-target features. To demonstrate the model's performance on downstream tasks, transfer learning is used to complete training across different data sets, enlarging the model's range of application. Finally, grouped convolution is added to the backbone network, greatly reducing computational cost; with a reasonable network depth and internal module design, the high training cost of segmentation models is reduced without sacrificing segmentation quality.
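The saving claimed for grouped convolution can be checked with a simple parameter count: splitting a convolution into g channel groups divides its weight count by g. A minimal sketch, with illustrative layer sizes that are not taken from the patent:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution, optionally split into channel groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    # each group convolves c_in/groups input channels to c_out/groups output channels
    return groups * (c_in // groups) * (c_out // groups) * k * k

standard = conv_params(64, 64, 3)            # 36864 weights
grouped = conv_params(64, 64, 3, groups=4)   # 9216 weights, a 4x reduction
```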

Inventors

  • Xiao Hanguang
  • Shi Xinyi
  • Song Wangwang
  • Xue Xufeng
  • Cao Liuyang
  • Li Yulin

Assignees

  • Chongqing University of Technology (重庆理工大学)

Dates

Publication Date
2026-05-05
Application Date
2023-04-26

Claims (1)

  1. An image semantic segmentation method, characterized by comprising the following steps:

(a) Selecting a picture data set with the same characteristics as the pictures to be processed, preprocessing the batched data, setting a window width and window level appropriate to the segmentation target, and converting the data into PNG-format picture data;

(b) Dividing the picture data set obtained in step (a) into non-overlapping training, validation and test sets;

(c) Applying data augmentation to the pictures in the training set, the augmentation comprising random rotation, random horizontal flipping and random vertical flipping;

(d) Designing a loss function to improve segmentation precision, the loss function being a mixture of DiceLoss and BCELoss, where α and β are the weight hyperparameters of DiceLoss and BCELoss respectively and sum to 1; the DiceLoss expression is:

DiceLoss = 1 − (1/C) Σ_c (2 Σ_i y_{c,i} · p_{c,i} + ε) / (Σ_i y_{c,i} + Σ_i p_{c,i} + ε)  (13)

the BCELoss expression is:

BCELoss = −(1/N) Σ_c Σ_i [ y_{c,i} log p_{c,i} + (1 − y_{c,i}) log(1 − p_{c,i}) ]  (14)

and the mixed DiceLoss/BCELoss expression is:

Loss = α · DiceLoss + β · BCELoss  (15)

where N is the total number of pixels, C is the total number of categories, y_{c,i} is the class value of the i-th pixel for the c-th class, p_{c,i} is the predicted probability for the corresponding class, and ε is a smoothing term;

(e) Acquiring an image semantic segmentation model, the model comprising a feature extraction module, a feature fusion module and a deep supervision training module.

The feature extraction module consists of six STDC backbone convolution stages (stages 1 to 6), each stage comprising several basic modules, different attention modules and skip connections of different sizes; the final two stages are combined with the attention-correction residual (ARRM) module to perform attention-corrected feature screening. For low-level features, the multi-scale spatial attention (MSP) module extracts spatial information through three different pooling layers; for high-level features, a channel attention mechanism performs targeted screening of semantic information, and features are retained through skip connections of different sizes, optimizing the result.

The feature fusion module feeds the stage-3 features and the integrated stage-5 features into the mixed feature fusion (MFFM) module for fusion, realizing the extraction and combination of high-level semantic information and low-level spatial information and improving segmentation performance; deformable convolution is introduced into the MFFM module, fine-tuning the sampling positions of the pixels after an ordinary convolution and realizing adaptive expansion of the convolution kernel.

The deep supervision training module upsamples three layers of features, namely the stage-5 features, the reshaped stage-6 features, and the stage-3 and stage-5 features fused by the MFFM module; it takes these three features as inputs of the segmentation head and obtains the final output by weighted averaging.

The MSP module applies AvgPooling, Strip-Pooling and MaxPooling to the input features along the channel dimension, obtaining spatially rich feature information about the segmentation target through the three pooling paths; the three pooled results are stacked along the channel axis, the channel number is reduced to 1 by convolution, and a sigmoid activation normalizes the result into a spatial attention weight, which is finally multiplied element-wise with the original feature matrix. The MSP module is calculated as:

W_s = sigmoid( Conv( Concat( AvgPool(F_input), StripPool(F_input), MaxPool(F_input) ) ) )  (1)

F_output = W_s ⊙ F_input  (2)

where Concat denotes the channel concatenation operation and F_input denotes the input original feature map.

The attention-correction residual ARRM module comprises a DoublePooling-Attention (DPA) module and a residual skip connection. The input feature map is first passed through a 3×3 convolution to reduce its dimension; two parallel MaxPooling and AvgPooling branches pass through a shared MLP layer that compresses and re-expands the channels; the two outputs are added element-wise, and a BN layer and sigmoid activation yield the attention-correction weight matrix, which is multiplied element-wise with the feature matrix and then added to it through the skip connection to obtain the output. The ARRM module is calculated as:

F′ = Conv_{3×3}(F)  (3)

F_MaxPool = MaxPool(F′)  (4)

F_AvgPool = AvgPool(F′)  (5)

DPA(F′) = sigmoid( BN( MLP(F_MaxPool) ⊕ MLP(F_AvgPool) ) )  (6)

F_output = ( DPA(F′) ⊙ F′ ) ⊕ F′  (7)

where F denotes the input original feature map; F′ denotes the new features after channel adjustment by convolution; MaxPool and AvgPool denote the maximum and average pooling operations respectively; MLP denotes the shared multi-layer perceptron; BN denotes batch normalization; ReLU and sigmoid denote the two different activation functions used; ⊕ and ⊙ denote element-wise summation and element-wise multiplication respectively; DPA is the double-pooling attention module; and F_MaxPool and F_AvgPool denote the feature maps obtained after the maximum and average pooling operations.

The mixed feature fusion MFFM module comprises a mixed attention mechanism, consisting of an MSP module and an SE module, together with deformable convolution. Its input consists of low-level features and high-level semantic features: the low-level features pass through the MSP module, the high-level features pass through the SE module, and after attention screening each path undergoes adaptive receptive-field adjustment by a deformable convolution to optimize boundary feature extraction. The two paths are then concatenated along the channel axis; the concatenated features gain nonlinear capability through a 1×1 convolution block and are concatenated again with the processed high-level features to strengthen semantic guidance; the resulting feature matrix passes through another deformable convolution for feature extraction, and the output is finally obtained by adding the result to the MSP-screened low-level features. The MFFM module is calculated as:

F′_low = DConv( MSP(F_low) )  (8)

F′_high = DConv( SE( UpS(F_high) ) )  (9)

F_1 = Conv_{1×1}( Concat(F′_low, F′_high) )  (10)

F_2 = DConv( Concat(F_1, F′_high) )  (11)

F_output = F_2 ⊕ F′_low  (12)

where DConv denotes the deformable convolution operation, UpS the upsampling operation, MSP the MSP module, and SE the squeeze-and-excitation operation; F_low and F_high are the low-level and high-level feature maps input to the MFFM module; F′_low and F′_high are the low-level and high-level features after screening by the mixed attention mechanism; F_1 and F_2 are the results after the two different-scale features are fused; and F_output is the final output feature map;

(f) Training the image semantic segmentation model with the loss function, updating the model weights by gradient back-propagation of the neural network during training, judging after training whether the hyperparameter settings of the model are better according to its segmentation performance on the validation set and updating the stored model weights accordingly, and finally evaluating the segmentation performance on the test set using the evaluation metric of expression (16); after the image semantic segmentation model is trained, the image data to be processed can be subjected to semantic segmentation.
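The mixed loss of step (d) can be sketched in NumPy. The smoothing value and the α, β defaults below are illustrative assumptions, and the sketch flattens all classes and pixels into a single array rather than iterating the per-class sums of expressions (13)-(14) explicitly:

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1e-5):
    """Dice loss over flattened target/probability maps (sketch of expression (13))."""
    inter = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over all pixels (sketch of expression (14))."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

def mixed_loss(y_true, y_pred, alpha=0.5, beta=0.5):
    """alpha * DiceLoss + beta * BCELoss with alpha + beta = 1 (expression (15))."""
    return alpha * dice_loss(y_true, y_pred) + beta * bce_loss(y_true, y_pred)
```

A perfect prediction drives both terms toward zero, while a uniform 0.5 prediction leaves a Dice term near 0.5 plus a BCE term of log 2.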
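The MSP spatial attention of expressions (1)-(2) can likewise be illustrated with a NumPy sketch. The 1×1 convolution that merges the three pooled maps is stood in for by a fixed weighted sum, and the strip pooling is simplified to a broadcast row/column average; both are assumptions for illustration, not the patent's exact layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def msp_attention(f, w=(1 / 3, 1 / 3, 1 / 3)):
    """Spatial attention over f of shape (C, H, W), sketching expressions (1)-(2)."""
    C, H, W = f.shape
    avg = f.mean(axis=0)                          # channel-wise average pooling, (H, W)
    mx = f.max(axis=0)                            # channel-wise max pooling, (H, W)
    row = avg.mean(axis=1, keepdims=True)         # horizontal strip average, (H, 1)
    col = avg.mean(axis=0, keepdims=True)         # vertical strip average, (1, W)
    strip = 0.5 * (row + col) + np.zeros((H, W))  # simplified strip pooling, (H, W)
    pooled = w[0] * avg + w[1] * mx + w[2] * strip  # stand-in for the 1x1 convolution
    attn = sigmoid(pooled)                        # spatial attention weight in (0, 1)
    return attn[None, :, :] * f                   # expression (2): element-wise reweighting
```

Because the weight map lies in (0, 1), the module rescales each spatial position of every channel rather than changing the feature shape.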
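The double-pooling attention of expressions (4)-(7) reduces to per-channel statistics passed through a shared two-layer MLP. In this sketch batch normalization from expression (6) is omitted and the MLP weights are passed in explicitly; both are simplifications, not the patent's trained layers:

```python
import numpy as np

def dpa_block(f, w1, w2):
    """Double-pooling channel attention over f (C, H, W) with shared MLP weights
    w1 (C, r) and w2 (r, C); BN from expression (6) is omitted in this sketch."""
    mx = f.max(axis=(1, 2))        # global max pooling    -> (C,)
    avg = f.mean(axis=(1, 2))      # global average pooling -> (C,)
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2         # shared MLP with ReLU
    attn = 1.0 / (1.0 + np.exp(-(mlp(mx) + mlp(avg))))   # sigmoid of the summed branches
    return f * attn[:, None, None] + f                   # expression (7): reweight + residual
```

The residual addition at the end mirrors the skip connection of the ARRM module, so the block can only amplify or pass through features, never zero them out.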

Description

Image semantic segmentation model and segmentation method

Technical Field

The invention relates to the technical field of image processing, and in particular to an image semantic segmentation model and segmentation method.

Background

Image segmentation is a key technology in image processing and an important component of computer vision; through segmentation, images can be analyzed and understood at a higher level. It subdivides an image into different sub-regions and is a pixel-level image parsing process. Current approaches are broadly divided into semantic segmentation, instance segmentation and panoptic segmentation, distinguished by whether target entities are grouped into different categories, different instances, or a combination of the two. Semantic segmentation is the most basic and important task in image segmentation: it groups pixels of the same class together and classifies them accurately, and it is widely applied in autonomous driving, autonomous UAV cruising, medical image processing, satellite remote-sensing image processing, and other digital image processing fields. Early work relied on traditional methods to segment images, mainly including threshold-, edge- and region-based segmentation; algorithms incorporating theoretical tools, such as morphology-based segmentation, hybrid genetic algorithms, and wavelet-analysis and transform-based techniques; and methods incorporating machine learning, such as FCM clustering and region level sets. Although these traditional methods can reach a certain segmentation precision, they still rely on prior knowledge, are not robust when segmenting complex targets, extract fine-grained information poorly, and cannot be applied well to real-life scenes.
In recent years, deep learning has developed rapidly and achieved excellent performance in image segmentation. Thanks to the fast, efficient execution and strong generalization of deep learning models, high-precision segmentation can be achieved while preserving time and space efficiency. The classical Fully Convolutional Network (FCN) was designed to judge the precise category of each pixel and to enlarge the receptive field during pixel-level segmentation, but its repeated convolution stacking loses attention to image detail. FCN-based improvements have multiplied, such as U-Net with its encoder-decoder structure, the DeepLab series built on atrous spatial pyramid pooling (ASPP), and TransUNet, which combines a Transformer with U-Net. These models still have shortcomings. The U-Net family retains cross-layer information through residual skip connections, integrating high- and low-level feature information and breaking the isolation and loss of information between layers, but this brings extra information redundancy and noise that can greatly reduce the model's segmentation ability. The DeepLab series enlarges the receptive field by adding atrous-convolution pyramid pooling (ASPP), but segments some small target objects poorly.
Across different segmentation tasks, atrous convolution can bring unnecessary receptive field and lacks universality, and the model's overall computational cost is large. Transformer-based segmentation networks are mostly combined with U-Net and achieve good results, but Transformers train slowly, cost much to compute, and demand large amounts of data, making them ill-suited to biomedical image segmentation, where data sets are scarce. In summary, segmentation methods in this technical field currently have several limitations: (1) Most segmentation algorithms are limited to a single application field; the same segmentation model, facing different segmentation targets, does not generalize well across remote-sensing imagery, automatic matting, autonomous driving, biomedical image segmentation and other fields, nor can it simultaneously deliver good performance on image information of different modalities such as 2D and 3D; (2) Segmentation targets whose morphology and size vary greatly show clearly different segmentation results; for example, when segmenting remote-sensing images, streets and buildings of different sizes can be segmented well, but targets with smaller outlines, such as cars and trees along the streets, are segmented poorly and cannot be located well, and the segmentation at the boundary part of the