CN-116912485-B - Scene semantic segmentation method based on feature fusion of thermal image and visible light image
Abstract
The invention discloses a scene semantic segmentation method based on thermal image and visible light image feature fusion. Visible light and thermal images of the same scene are input in pairs into a trained semantic segmentation model to obtain segmentation results for the objects in the images. The segmentation model comprises a double-branch backbone feature extraction network, Segformer, used to extract hierarchical features from the input data; an auxiliary feature selection module, which lets the hierarchical features extracted by the backbone network mutually supplement feature information between the two modalities; a cross-modal feature fusion module, which further fuses the features between the modalities to obtain features rich in semantic information; a progressive feature fusion decoder module, which realizes fine upsampling in the decoder; and a multi-loss supervision module, which supervises the learning of the model. The invention can effectively exploit the characteristics of the visible light image and the thermal image, mine the complementary features between them, and effectively improve the segmentation precision and generalization capability of the scene semantic segmentation model while keeping the parameter count of the model small.
Inventors
- ZHU JIANG
- CHEN HANMEI
- ZHANG JIE
- XU HAIXIA
- LI SAISI
- TIAN SHUJUAN
- LI YANCHUN
Assignees
- Xiangtan University (湘潭大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-05-16
Claims (4)
- 1. A scene semantic segmentation method based on thermal image and visible light image feature fusion, characterized in that a thermal image and a visible light image are input into a trained semantic segmentation model to obtain segmentation results for each class of target in the image, the semantic segmentation model comprising: a backbone feature extraction network A, which adopts the Segformer network, a hierarchical Transformer encoder comprising four layers of feature extraction encoders, to extract features at four corresponding levels from the input visible light image; a backbone feature extraction network B, which adopts the Segformer network to extract features at four corresponding levels from the input thermal image; auxiliary feature selection modules, one placed after each layer of feature extraction encoder of backbone network A and one after each layer of feature extraction encoder of backbone network B, each taking as input the feature of its own branch together with the corresponding feature of the other branch and producing a refined output feature; the auxiliary feature selection module comprises a splicing module, a channel attention module, a spatial attention module and an auxiliary feature fusion module, wherein the splicing module concatenates an input feature P_main and another input feature P_assist along the channel dimension to obtain a feature P1, the feature P1 is input in parallel into the channel attention module and the spatial attention module to obtain two feature outputs P_C and P_S respectively, and the features P_main, P_C and P_S are input into the auxiliary feature fusion module to obtain the final output feature; cross-modal feature fusion modules, four in total, each taking the refined features of the two branches at one level as input and outputting a fused feature; a progressive feature fusion decoder module comprising four layers of decoders, in which the first decoder layer takes the deepest fused feature as input and each subsequent decoder layer takes the output of the previous layer together with the fused feature of the corresponding level as input; and a multi-loss supervision module, which supervises the foreground segmentation prediction, the semantic segmentation prediction and the contour segmentation prediction to realize positioning, segmentation and edge refinement.
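As an illustration of the auxiliary feature selection module of claim 1, the following is a minimal NumPy sketch. The splicing and the parallel channel/spatial attention follow the claim; the concrete attention forms (global-average channel weights, channel-mean spatial map) and the final element-wise fusion rule are assumptions, since the patent does not fix them, and all function names are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    """Channel attention: per-channel weights from global average pooling."""
    w = sigmoid(x.mean(axis=(1, 2)))        # (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    """Spatial attention: per-pixel weights from the channel mean."""
    m = sigmoid(x.mean(axis=0))             # (H, W)
    return x * m[None, :, :]

def auxiliary_feature_selection(p_main, p_assist):
    """Splice -> parallel channel/spatial attention -> fuse with P_main."""
    p1 = np.concatenate([p_main, p_assist], axis=0)   # splice on channel dim
    p_c = channel_attention(p1)                       # output P_C
    p_s = spatial_attention(p1)                       # output P_S
    c = p_main.shape[0]
    # fold the 2C attended channels back to C and add to P_main
    # (this fusion rule is an assumption, not specified by the claim)
    return p_main + (p_c[:c] + p_c[c:]) + (p_s[:c] + p_s[c:])
```

For a pair of level features of shape (C, H, W), the module returns a refined feature of the same shape, so it can be inserted after every encoder layer of either branch.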
- 2. The scene semantic segmentation method based on thermal image and visible light image feature fusion according to claim 1, wherein the cross-modal feature fusion module comprises a 1×1 convolution module, a group convolution module, a dense cascade semantic information module and a residual connection; the 1×1 convolution module learns channel information from the feature obtained by splicing the two input features; further features are obtained through the group convolution module and the dense cascade semantic information module; and the residual connection fuses these features by element-wise addition to obtain the output feature.
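A minimal NumPy sketch of this fusion path, assuming a depthwise box filter as a stand-in for the learned group convolution, a two-stage dense cascade, and randomly initialized 1×1 weights (all names and details hypothetical, not taken from the patent):

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution: mixes channels only. x:(C_in,H,W), w:(C_out,C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def depthwise3x3(x):
    """Zero-padded 3x3 box filter per channel; a stand-in for the group
    convolution module (the real kernels would be learned)."""
    c, h, wdt = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += xp[:, dy:dy + h, dx:dx + wdt]
    return out / 9.0

def cross_modal_fusion(f_a, f_b, rng):
    """Splice the two modal features, learn channel info with a 1x1 conv,
    cascade convolutions densely, and sum all branches residually."""
    c = f_a.shape[0]
    x = np.concatenate([f_a, f_b], axis=0)          # (2C, H, W) spliced input
    w = rng.standard_normal((c, 2 * c)) * 0.1       # hypothetical learned weights
    f0 = conv1x1(x, w)                              # channel-information feature
    f1 = depthwise3x3(f0)                           # first cascade stage
    f2 = depthwise3x3(f0 + f1)                      # dense cascade: sees earlier outputs
    return f0 + f1 + f2                             # residual element-wise addition
```

The element-wise additive fusion at the end mirrors the "characteristic element adding" described in the claim; the number of cascade stages is an assumption.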
- 3. The scene semantic segmentation method based on thermal image and visible light image feature fusion according to claim 1, wherein the progressive feature fusion decoder module comprises a 1×1 convolution module, a 3×3 convolution module and a transpose convolution module; the 1×1 convolution module performs channel-wise semantic information convolution on the input feature; the resulting feature is passed through the 3×3 convolution and the transpose convolution to obtain a first refined upsampling feature, and through the transpose convolution alone to obtain a second refined upsampling feature; and the two refined upsampling features are fused by element-wise addition to obtain the output feature.
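The two-branch refined upsampling can be sketched as follows, with the stride-2 transposed convolution realized as zero insertion followed by a 3×3 filter; the box filter and the random 1×1 weights are placeholders for learned kernels, and the function names are hypothetical:

```python
import numpy as np

def smooth3x3(x):
    """Zero-padded 3x3 box filter per channel (stand-in for a learned 3x3 conv)."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += xp[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def transpose_conv_2x(x):
    """Stride-2 transposed convolution: zero insertion then a 3x3 convolution,
    doubling the spatial resolution."""
    c, h, w = x.shape
    up = np.zeros((c, 2 * h, 2 * w))
    up[:, ::2, ::2] = x
    return smooth3x3(up)

def progressive_decoder_layer(x, rng):
    """1x1 conv -> two refined upsampling branches -> element-wise addition."""
    c = x.shape[0]
    w = rng.standard_normal((c, c)) * 0.1
    f = np.einsum('oc,chw->ohw', w, x)        # 1x1 conv: channel semantic info
    f1 = transpose_conv_2x(smooth3x3(f))      # first refined upsampling feature
    f2 = transpose_conv_2x(f)                 # second refined upsampling feature
    return f1 + f2                            # fuse by element-wise addition
```

Each decoder layer thus doubles the spatial resolution, so stacking four of them recovers the input resolution from the deepest encoder features.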
- 4. The scene semantic segmentation method based on thermal image and visible light image feature fusion according to claim 1, wherein the multi-loss supervision module evaluates the errors between the foreground segmentation prediction, the semantic segmentation prediction, the contour segmentation prediction and the three corresponding ground-truth labels to help the network model learn, using cross entropy loss functions to supervise the training of the three segmentation prediction outputs. The foreground and contour segmentation losses adopt the binary cross entropy loss function, defined as:
L = -(1/N) Σᵢ [ yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ) ]   (1)
where N is the number of samples, yᵢ is the label of sample i (1 for the positive class, 0 for the negative class), and pᵢ is the probability that sample i is predicted to be positive. The semantic segmentation loss adopts the multi-class cross entropy loss function, defined as:
L = -(1/N) Σᵢ Σ_{c=1}^{M} y_{ic}·log(p_{ic})   (2)
where M is the number of categories, y_{ic} is a sign function that equals 1 when the true class of sample i is c and 0 otherwise, and p_{ic} is the predicted probability that sample i belongs to category c. The total model training loss S is the sum of the three losses.
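The three losses of claim 4 can be written out directly from definitions (1) and (2); a plain-Python sketch (function names are illustrative, and the unweighted sum for the total loss follows the claim):

```python
import math

def binary_cross_entropy(y, p):
    """Eq. (1): mean binary cross entropy over N samples.
    y: 0/1 labels, p: predicted positive-class probabilities."""
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n

def multiclass_cross_entropy(y, p):
    """Eq. (2): mean multi-class cross entropy.
    y: one-hot rows of length M, p: per-class probability rows."""
    n = len(y)
    return -sum(yic * math.log(pic)
                for yi, pi in zip(y, p)
                for yic, pic in zip(yi, pi)) / n

def total_loss(l_fg, l_sem, l_ct):
    """Total training loss S: sum of foreground, semantic and contour losses."""
    return l_fg + l_sem + l_ct
```

For example, `binary_cross_entropy([1, 0], [0.8, 0.2])` evaluates to -ln(0.8) ≈ 0.223, since both samples are predicted with probability 0.8 on their true class.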
Description
Scene semantic segmentation method based on feature fusion of thermal image and visible light image

Technical Field

The invention relates to the technical field of semantic segmentation based on deep learning, in particular to a scene semantic segmentation method based on fusion of the features of a thermal image and a visible light image.

Background

With the development of computer vision, robotics and related technologies, unmanned systems represented by robots and unmanned vehicles are widely used in many fields. To realize autonomous navigation of an unmanned system, environmental perception is a crucial link and plays an important role in how a robot understands and interacts with its external environment. Environmental perception mainly comprises two methods, target detection and semantic segmentation; compared with target detection, semantic segmentation achieves pixel-level segmentation, provides richer semantic information, and is more helpful for an unmanned system to recognize and understand the targets in its surroundings. Existing semantic segmentation methods, working on existing RGB image datasets, often fail to achieve good segmentation results and have poor robustness; segmentation performance is often poor when targets occlude each other or under poor illumination or adverse weather conditions.
To improve the robustness of existing scene semantic segmentation methods, many researchers have introduced thermal images into semantic segmentation. Unlike a visible light camera, an infrared thermal imaging camera obtains infrared information from the heat radiated by an object, which strengthens robustness to changes in light and weather; thermal infrared information is very effective against the recognition blur caused by poor illumination. Researchers have therefore turned to the field of multi-modal semantic segmentation, improving the robustness and accuracy of semantic segmentation by combining visible light features, rich in texture and color information, with stable thermal image features. Because fusing the features of visible light and thermal images can introduce unpredictable noise, naively combining the features of the two modalities can make segmentation accuracy worse than that of a single modality. In 2017, Ha et al. proposed the MFNet network and the first RGB-T urban street-scene semantic segmentation dataset; the network uses two encoders to extract features from the RGB and thermal images respectively and a single decoder, and fuses the information of the two modalities before part of the upsampling operations. In 2019, Sun et al. designed RTFNet, whose backbone network is ResNet; the feature maps of the two modalities are summed at each stage of the encoder, and two modules designed in the decoder gradually complete feature extraction and resolution recovery.
In 2020, Shivakumar et al. designed a two-branch neural network structure that effectively fuses RGB and thermal information, and at the same time proposed a method for correcting an RGB-T dataset: the alignment of the RGB and thermal information is corrected by depth information and by the mapping relationship from the thermal image to the RGB image. In 2021, Zhou et al. proposed a multi-level feature multi-label learning network, designing corresponding feature-map processing modules for the features extracted by the encoder and introducing three labels to supervise the network. Subsequently, Liu et al. proposed the CMX model, which calibrates the features of the current modality in the spatial and channel dimensions by combining the features of the other modality. In existing research methods the segmentation precision still falls short of a satisfactory level, and the precision and the parameter count of the model cannot both be accommodated. How to effectively utilize the characteristics of the visible light image and the thermal image, mine the complementary features between them, and reduce the noise introduced by their different imaging mechanisms so as to improve the generalization capability of the model is an important challenge.

Disclosure of Invention

Aiming at the shortcomings of existing methods, the invention provides a scene semantic segmentation method based on thermal image and visible light image feature fusion, which performs selective feature complementation within the two modalities, further makes full use of the feature advantages of the two modalities to realize feature interaction between them, combines a cross-modal feature fusion scheme with a progressive feature fusion decoder, and uses multi-loss supervision to locate, segment and refine target edges so as to improve the semantic segmentation precision.