CN-121458978-B - Camouflage object semantic segmentation method, camouflage object semantic segmentation device and camouflage object semantic segmentation medium based on self-adaptive candidate strategy
Abstract
The technical scheme of the method and device acquires category codes of the image to be segmented through a coarse feature extractor and a classifier, thereby transmitting semantic category information and enhancing the semantic perception of the subsequent segmentation task; performs initial prediction of the camouflage object through a target detector and the self-adaptive candidate strategy to generate an optimal target attention frame, providing robust spatial guidance for the core segmentation task; and combines the category codes with the target attention frame, fuses and reconstructs multi-source information through a multi-guidance feature fusion module and a multi-task perception decoder, and outputs accurate segmentation results.
Inventors
- JIAO GE
- WANG FANGYAN
- WAN XIAOQING
- CHEN JIYOU
- ZHONG ZHENPENG
- YUE GUOWEN
Assignees
- Hengyang Normal University (衡阳师范学院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-11-03
Claims (9)
- 1. A camouflage object semantic segmentation method based on a self-adaptive candidate strategy, characterized by comprising the following steps: acquiring an image to be segmented; extracting multi-scale features of the image to be segmented by using a coarse feature extractor to obtain a plurality of classification feature maps of different scales, and clustering the classification feature maps of different scales by using a classifier to obtain category codes of the image to be segmented; performing coding and positioning-feature extraction on the image to be segmented by using a target detector to obtain a plurality of positioning feature maps of different scales, and dynamically analyzing the positioning feature maps of different scales by using the self-adaptive candidate strategy to determine candidate target frames and a target attention frame corresponding to the camouflage object; cropping the image to be segmented based on the target attention frame to obtain a cropped image block, and performing multi-magnification amplification on the cropped image block to obtain a group of multi-scale perception attention-area image blocks; inputting the category codes and the multi-scale perception attention-area image blocks into a backbone network, the backbone network extracting multi-scale semantic features from the input information to obtain a plurality of semantic feature maps of different scales; performing feature fusion and feature reconstruction on the semantic feature maps of different scales by using a multi-guidance feature fusion module to obtain fine-grained feature expressions of the camouflage object at a plurality of different scales; performing grouped feature interaction and gradual fusion on the fine-grained feature expressions of different scales by using a multi-task perception decoder, and determining a segmentation result of the camouflage object based on the fusion result; the multi-guidance feature fusion module comprises a feature alignment and interaction submodule and a feature reconstruction and enhancement submodule; the feature alignment and interaction submodule comprises a feature preprocessing branch, a spatial alignment branch and a multi-head attention branch; the feature preprocessing branch is used for respectively performing a normalization operation on the plurality of semantic feature maps of different scales and applying a convolution operation with a convolution kernel of a preset size, so as to achieve consistency of feature distribution and suppression of redundant information; the spatial alignment branch is used for processing the features output by the feature preprocessing branch through three parallel max-pooling layers and a global average-pooling layer to generate spatial description information of different scales, and for processing the spatial description information of different scales sequentially through a convolution layer, a normalization layer and a ReLU activation function, so as to achieve self-adaptive adjustment of spatial resolution and alignment of scale features; the multi-head attention branch is used for processing the features output by the spatial alignment branch with a multi-head self-attention mechanism so as to capture long-range dependency relationships and cross-scale feature interaction information, and obtains a feature-enhanced stable feature representation by combining a normalization layer and a skip connection; the feature reconstruction and enhancement submodule comprises a multi-receptive-field convolution branch and a gating activation branch; the multi-receptive-field convolution branch is used for processing the stable feature representation through depthwise convolution layers with three different convolution kernels so as to obtain a joint feature representation combining global and local semantic information; the gating activation branch is used for processing the joint feature representation through convolution, normalization layers, GELU activation and splicing operations, so as to achieve dynamic regulation of the importance of feature channels and spatial positions, as well as information enhancement and noise suppression during fusion, and obtain fine-grained feature expressions of the camouflage object at a plurality of different scales.
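The crop-and-zoom step recited above (cropping by the target attention frame, then multi-magnification amplification into multi-scale attention-area image blocks) can be illustrated in a few lines of NumPy. This is a minimal sketch: the scale factors and the nearest-neighbour resampling are illustrative choices, not the claimed implementation.

```python
import numpy as np

def multiscale_patches(image, frame, scales=(1.0, 1.5, 2.0)):
    """Crop the attention frame, then resize the crop at several magnifications.

    image: (H, W) array; frame: (x0, y0, x1, y1) in pixel coordinates.
    Nearest-neighbour resampling keeps the sketch dependency-free.
    """
    x0, y0, x1, y1 = frame
    patch = image[y0:y1, x0:x1]
    h, w = patch.shape[:2]
    out = []
    for s in scales:
        nh, nw = int(round(h * s)), int(round(w * s))
        rows = (np.arange(nh) * h // nh).astype(int)   # source row per output row
        cols = (np.arange(nw) * w // nw).astype(int)   # source col per output col
        out.append(patch[rows][:, cols])
    return out
```

In the method, each magnified block would then be fed to the backbone network together with the category codes.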
- 2. The method of claim 1, wherein the coarse feature extractor comprises a ResNet backbone network or a Transformer backbone network; the ResNet backbone network is a structure based on residual-block stacking, each residual block comprising two convolution layers, a batch normalization layer and a ReLU activation function; the Transformer backbone network includes a PVTv network, the PVTv network including a multi-head self-attention mechanism, a feed-forward neural network, a normalization layer, and a residual block.
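The residual block referenced in claim 2 (two convolutions, batch normalization, ReLU, plus a skip connection) follows the standard ResNet pattern y = ReLU(F(x) + x). A dependency-free NumPy sketch, with batch normalization simplified to per-channel standardization for brevity; the shapes are illustrative:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, w):
    """Same-padding 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    patches = sliding_window_view(xp, (3, 3), axis=(1, 2))  # (C_in, H, W, 3, 3)
    return np.einsum('oikl,ihwkl->ohw', w, patches)

def norm(x, eps=1e-5):
    """Stand-in for batch normalization: per-channel standardization."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    return (x - mu) / (sd + eps)

def residual_block(x, w1, w2):
    out = np.maximum(norm(conv3x3(x, w1)), 0)   # conv -> BN -> ReLU
    out = norm(conv3x3(out, w2))                # conv -> BN
    return np.maximum(out + x, 0)               # skip connection, then ReLU
```

The skip connection is what lets stacked blocks train deeply: with zero weights the block reduces to ReLU(x), i.e. it can pass its input through unchanged up to the activation.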
- 3. The method for semantic segmentation of camouflage objects based on the self-adaptive candidate strategy according to claim 1, wherein the classifier comprises a minimized intra-class squared-distance error module and an iteratively updated cluster-center module; the minimized intra-class squared-distance error module is used for calculating the distance between an input feature vector and the current cluster centers based on any one of Euclidean distance, cosine similarity or Mahalanobis distance, and assigning the feature vector to the class cluster of the nearest cluster center so as to minimize the intra-class sum of squared errors; the iteratively updated cluster-center module is used for recomputing, after each round of feature-vector assignment, the mean of all feature vectors in each class cluster and taking the new mean as the new cluster center.
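The two modules of claim 3 together form a k-means-style clustering loop: assign each feature vector to its nearest center, then recompute each center as the mean of its class cluster. A minimal NumPy sketch, assuming Euclidean distance (one of the three distances the claim allows); the deterministic initialization and iteration count are illustrative:

```python
import numpy as np

def kmeans(features, k, iters=10):
    """Minimize intra-class squared error by alternating assignment and update.

    features: (N, D) array. Returns (labels, centers).
    """
    # Simple deterministic initialization for the sketch: k spread-out samples.
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centers = features[idx].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every feature vector to every center.
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                # nearest-center assignment
        for c in range(k):
            members = features[labels == c]
            if len(members):                     # recompute center as class mean
                centers[c] = members.mean(axis=0)
    return labels, centers
```

In the method, the resulting cluster labels (e.g. one-hot encoded) would serve as the category codes passed to the backbone network.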
- 4. The method of claim 1, wherein the target detector comprises a Transformer backbone network, wherein the Transformer backbone network comprises a PVTv network, wherein the PVTv network comprises a multi-head self-attention mechanism, a feed-forward neural network, a normalization layer, and a residual block.
- 5. The method of claim 1, wherein the module executing the self-adaptive candidate strategy comprises a prediction-tensor parsing branch, a confidence filtering branch, a crop-box generation branch, and a boundary compensation branch; the prediction-tensor parsing branch is used for perceiving the spatial and semantic information of a potential camouflage object from the input positioning feature map and obtaining a prediction tensor through an end-to-end detection model, the prediction tensor comprising the position information and predicted category of the corresponding potential camouflage object, the position information comprising the center-position offset, the width and height, and the confidence of the potential camouflage object; the confidence filtering branch is used for selecting, as the final candidate frame, the candidate target frame with the highest confidence among those exceeding a preset threshold, and, if the confidences of all candidate target frames are below the preset threshold, using the frame corresponding to the whole positioning feature map as the final candidate frame; the crop-box generation branch is used for constructing a crop box conforming to the side-length constraint with the center coordinates of the final candidate frame as a reference; the boundary compensation branch is used for calculating, when the coordinates of the crop box exceed the boundary of the image to be segmented, the position information of the out-of-range part and performing translation compensation based on it, so that the crop box falls back into the valid area of the image to be segmented while keeping its original size, thereby obtaining the target attention frame.
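The confidence filtering, crop-box generation, and boundary compensation branches of claim 5 reduce to plain array logic. A minimal sketch, assuming (cx, cy, w, h, confidence) candidate tuples in pixel coordinates and a square crop of side `crop_size`; both parameter names and the threshold value are hypothetical:

```python
import numpy as np

def adaptive_candidate(boxes, img_w, img_h, conf_thresh=0.5, crop_size=128):
    """boxes: (N, 5) array-like of (cx, cy, w, h, confidence).

    Returns (x0, y0, x1, y1) of the target attention frame.
    """
    boxes = np.asarray(boxes, dtype=float)
    if len(boxes) == 0 or boxes[:, 4].max() < conf_thresh:
        # Every candidate is below threshold: fall back to the whole image.
        return (0, 0, img_w, img_h)
    cx, cy = boxes[boxes[:, 4].argmax(), :2]      # highest-confidence candidate
    # Crop-box generation: square window centred on the candidate.
    x0, y0 = cx - crop_size / 2, cy - crop_size / 2
    x1, y1 = x0 + crop_size, y0 + crop_size
    # Boundary compensation: translate the window back inside the image
    # while keeping its original size.
    dx = max(0, -x0) - max(0, x1 - img_w)
    dy = max(0, -y0) - max(0, y1 - img_h)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
```

Note the compensation translates rather than clips, so the attention frame always keeps the crop size specified by the side-length constraint.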
- 6. The method of claim 1, wherein the backbone network comprises a Transformer backbone network, wherein the Transformer backbone network comprises a PVTv network, wherein the PVTv network comprises a multi-head self-attention mechanism, a feed-forward neural network, a normalization layer, and a residual block.
- 7. The method for semantic segmentation of camouflage objects based on the self-adaptive candidate strategy according to claim 1, wherein the multi-task perception decoder comprises a grouped interaction submodule and a gradual fusion submodule; the grouped interaction submodule comprises a feature unfolding unit, a feature grouping unit and an intra-group interaction unit; the feature unfolding unit is used for processing the input fine-grained feature expressions of a plurality of different scales sequentially through multi-dilation-rate (atrous) convolution layers and dividing the processing result into a plurality of subsets along the channel dimension; the feature grouping unit is used for processing the subsets group by group, splicing the current subset's input with the output of the previous group and then applying, in sequence, a convolution layer, a normalization layer and a ReLU activation to obtain the corresponding intra-group output features; the intra-group interaction unit is used for performing internal association modeling on the intra-group output features to obtain explicit features; the gradual fusion submodule comprises a feature reconstruction unit, a fusion generation unit and a residual output unit; the feature reconstruction unit is used for dividing the explicit features obtained by the intra-group interaction unit into two subsets and inputting them respectively into a multi-head self-attention mechanism to model cross-regional dependency relationships and obtain two groups of enhanced features; the fusion generation unit is used for processing the two groups of enhanced features through sequentially connected convolution layers of a preset kernel size, a normalization layer and a ReLU activation function to obtain fusion features; the residual output unit is used for processing the fusion features through a skip connection together with a convolution layer of a preset kernel size, a normalization layer and a ReLU activation function to obtain the final fusion features, the skip connection maintaining continuity between the original input information of the multi-task perception decoder and the final fusion features.
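The grouped interaction of claim 7, where channels are split into subsets and each group's input is spliced with the previous group's output, is a Res2Net-style pattern. A minimal NumPy sketch, with each group's convolution/normalization/ReLU reduced to a ReLU'd linear map for clarity; the weight shapes and group count are illustrative, not taken from the patent:

```python
import numpy as np

def grouped_interaction(x, n_groups, weights):
    """x: (C, N) feature matrix; channels are split into n_groups subsets.

    weights[0] maps a group-sized block to itself; each later weights[g] maps
    the concatenation of group g with the previous group's output back to a
    group-sized block.
    """
    groups = np.array_split(x, n_groups, axis=0)
    outputs, prev = [], None
    for g, w in zip(groups, weights):
        inp = g if prev is None else np.vstack([g, prev])  # splice with previous
        prev = np.maximum(w @ inp, 0)     # conv + norm + ReLU stand-in
        outputs.append(prev)
    # Gradual fusion: concatenate all group outputs back along the channels.
    return np.vstack(outputs)
```

Because each group sees the previous group's output, information propagates across the channel subsets instead of each subset being processed in isolation.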
- 8. A camouflage object semantic segmentation device based on a self-adaptive candidate strategy, characterized by comprising: an image acquisition module, used for acquiring an image to be segmented; a primary classification module, used for extracting multi-scale features of the image to be segmented by using a coarse feature extractor to obtain a plurality of classification feature maps of different scales, and clustering the classification feature maps of different scales by using a classifier to obtain category codes of the image to be segmented; a target determination module, used for performing coding and positioning-feature extraction on the image to be segmented by using a target detector to obtain a plurality of positioning feature maps of different scales, and dynamically analyzing the positioning feature maps of different scales by using the self-adaptive candidate strategy to determine a candidate target frame and a target attention frame corresponding to the camouflage object; an image enhancement module, used for cropping the image to be segmented based on the target attention frame to obtain a cropped image block, and performing multi-magnification amplification on the cropped image block to obtain a group of multi-scale perception attention-area image blocks; a depth classification module, used for inputting the category codes and the multi-scale perception attention-area image blocks into a backbone network, the backbone network performing multi-scale semantic feature extraction on the input information to obtain a plurality of semantic feature maps of different scales, and performing feature fusion and feature reconstruction on the semantic feature maps of different scales by using a multi-guidance feature fusion module to obtain fine-grained feature expressions of the camouflage object at different scales; the multi-guidance feature fusion module comprises a feature alignment and interaction submodule and a feature reconstruction and enhancement submodule; the feature alignment and interaction submodule comprises a feature preprocessing branch, a spatial alignment branch and a multi-head attention branch; the feature preprocessing branch is used for respectively performing a normalization operation on the plurality of semantic feature maps of different scales and applying a convolution operation with a convolution kernel of a preset size, so as to achieve consistency of feature distribution and suppression of redundant information; the spatial alignment branch is used for processing the features output by the feature preprocessing branch through three parallel max-pooling layers and a global average-pooling layer to generate spatial description information of different scales, and for processing the spatial description information of different scales sequentially through a convolution layer, a normalization layer and a ReLU activation function, so as to achieve self-adaptive adjustment of spatial resolution and alignment of scale features; the multi-head attention branch is used for processing the features output by the spatial alignment branch with a multi-head self-attention mechanism so as to capture long-range dependency relationships and cross-scale feature interaction information, and obtains a feature-enhanced stable feature representation by combining a normalization layer and a skip connection; the feature reconstruction and enhancement submodule comprises a multi-receptive-field convolution branch and a gating activation branch; the multi-receptive-field convolution branch is used for processing the stable feature representation through depthwise convolution layers with three different convolution kernels so as to obtain a joint feature representation combining global and local semantic information; the gating activation branch is used for processing the joint feature representation through convolution, normalization layers, GELU activation and splicing operations, so as to achieve dynamic regulation of the importance of feature channels and spatial positions, as well as information enhancement and noise suppression during fusion, and obtain fine-grained feature expressions of the camouflage object at a plurality of different scales.
- 9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-7.
Description
Camouflage object semantic segmentation method, camouflage object semantic segmentation device and camouflage object semantic segmentation medium based on self-adaptive candidate strategy

Technical Field
The disclosure relates to the technical field of computer vision and image processing, and in particular to a camouflage object semantic segmentation method, device and medium based on a self-adaptive candidate strategy.

Background
The task of camouflage object semantic segmentation aims to identify, from an input image, target objects that differ very little visually from their surroundings. Because camouflage objects are highly similar to the background in color, texture, and morphology, the task poses a significant challenge to conventional object detection and segmentation techniques and drives research on fine-grained visual perception. The technology has wide application value in many fields, such as polyp segmentation in medical images, surface-defect detection in industrial manufacturing, camouflage identification in military reconnaissance, and emergency rescue scenarios. Prior-art camouflage object semantic segmentation methods struggle to fully capture the diversity of camouflage objects in size, shape, and spatial position, leaving segmentation accuracy and robustness insufficient.

Disclosure of Invention
The disclosure provides at least a camouflage object semantic segmentation method, a camouflage object semantic segmentation device and a camouflage object semantic segmentation medium based on a self-adaptive candidate strategy, so as to solve at least one of the above technical problems.
According to another aspect of the present disclosure, there is provided a camouflage object semantic segmentation method based on a self-adaptive candidate strategy, including: acquiring an image to be segmented; extracting multi-scale features of the image to be segmented by using a coarse feature extractor to obtain a plurality of classification feature maps of different scales, and clustering the classification feature maps of different scales by using a classifier to obtain category codes of the image to be segmented; performing coding and positioning-feature extraction on the image to be segmented by using a target detector to obtain a plurality of positioning feature maps of different scales, and dynamically analyzing the positioning feature maps of different scales by using the self-adaptive candidate strategy to determine candidate target frames and a target attention frame corresponding to the camouflage object; cropping the image to be segmented based on the target attention frame to obtain a cropped image block, and performing multi-magnification amplification on the cropped image block to obtain a group of multi-scale perception attention-area image blocks; inputting the category codes and the multi-scale perception attention-area image blocks into a backbone network, the backbone network extracting multi-scale semantic features from the input information to obtain a plurality of semantic feature maps of different scales; performing feature fusion and feature reconstruction on the semantic feature maps of different scales by using a multi-guidance feature fusion module to obtain fine-grained feature expressions of the camouflage object at a plurality of different scales; performing grouped feature interaction and gradual fusion on the fine-grained feature expressions of different scales by using a multi-task perception decoder, and determining the segmentation result of the camouflage object based on the fusion result.
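Taken together, the method reads as a five-stage pipeline. The skeleton below wires the stages with trivial stand-in functions so it runs end to end; every function name and signature here is a hypothetical illustration, not the patent's actual modules:

```python
# Hypothetical stand-ins so the skeleton executes (not the real modules).
coarse_feature_extractor = lambda img: [img]
classifier_cluster = lambda feats: 0
target_detector = lambda img: img
adaptive_candidate_strategy = lambda feats: (0, 0, 2, 2)
multiscale_crop = lambda img, f: [[row[f[0]:f[2]] for row in img[f[1]:f[3]]]]
backbone = lambda codes, patches: patches
multi_guidance_fusion = lambda feats: feats
multitask_decoder = lambda feats: feats[0]

def segment_camouflage(image):
    # Stage 1: coarse features -> clustered category codes.
    cls_feats = coarse_feature_extractor(image)
    category_codes = classifier_cluster(cls_feats)
    # Stage 2: detector features -> adaptive candidate -> attention frame.
    loc_feats = target_detector(image)
    frame = adaptive_candidate_strategy(loc_feats)
    # Stage 3: crop by the attention frame, amplify at several magnifications.
    patches = multiscale_crop(image, frame)
    # Stage 4: backbone + multi-guidance feature fusion.
    sem_feats = backbone(category_codes, patches)
    fine_feats = multi_guidance_fusion(sem_feats)
    # Stage 5: grouped interaction and gradual fusion in the decoder.
    return multitask_decoder(fine_feats)
```

The point of the ordering is that stages 1 and 2 run on the full image and only stage 4 onward operates on the attention-guided crops, which is how the method keeps both semantic category context and localized spatial detail.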
In one possible implementation, the coarse feature extractor comprises a ResNet backbone network or a Transformer backbone network; the ResNet backbone network is a structure based on residual-block stacking, each residual block comprising two convolution layers, a batch normalization layer and a ReLU activation function; the Transformer backbone network includes a PVTv network, the PVTv network including a multi-head self-attention mechanism, a feed-forward neural network, a normalization layer, and a residual block. In one possible implementation, the classifier includes a minimized intra-class squared-distance error module and an iteratively updated cluster-center module; the minimized intra-class squared-distance error module is used for calculating the distance between an input feature vector and the current cluster centers based on any one of Euclidean distance, cosine similarity or Mahalanobis distance, and assigning the feature vector to the class cluster of the nearest cluster center so as to minimize the intra-class sum of squared errors; the iteratively updated cluster-center module is used for recomputing, after each round of feature-vector assignment, the mean of all feature vectors in each class cluster and taking the new mean as the new cluster center. In one possible implem