CN-120833467-B - Saliency target detection method, system, electronic equipment and storage medium
Abstract
The invention discloses a salient object detection method, system, electronic device, and storage medium, relating to the technical field of salient object detection. In the method, a constructed salient object detection model is trained on a plurality of sample images. Within the constructed model, a dual-branch encoder extracts a plurality of local detail features and a plurality of global semantic features from a target sample image; some of the local detail features are interactively fused with all of the global semantic features to obtain a plurality of enhanced local detail features and a plurality of enhanced global semantic features; each enhanced local detail feature is concatenated with its corresponding enhanced global semantic feature; and a salient object detection result is obtained after all of the concatenated features and the remaining local detail features are processed by an edge-aware enhancement module and a decoder. The trained salient object detection model is then used to accurately detect the salient object in a target image.
Inventors
- XU HUAN
- LIN YUEJIN
- PAN SHIQI
- YE HAIJUAN
- LV JIAHUI
- WANG BEN
- ZHU QIZHONG
- CHEN YING
- ZHANG YANG
Assignees
- 温州电力设计有限公司普华招标咨询分公司 (Wenzhou Electric Power Design Co., Ltd., Puhua Bidding Consulting Branch)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-06-25
Claims (6)
- 1. A salient object detection method, comprising: training a constructed salient object detection model based on a plurality of sample images to obtain a trained salient object detection model, wherein the constructed salient object detection model comprises a dual-branch encoder, cross-model interaction fusion modules, edge-aware enhancement modules and decoders; the first branch of the dual-branch encoder comprises N first network layers arranged in sequence and the second branch comprises M second network layers arranged in sequence, wherein N and M are positive integers and N = M + n − 1; a target sample image is input into the 1st first network layer to obtain the local detail features output by the 1st to N-th first network layers; the local detail features output by the (n−1)-th first network layer are input into the 1st second network layer to obtain the global semantic features output by the 1st second network layer; through the 1st cross-model interaction fusion module, the local detail features output by the n-th first network layer and the global semantic features output by the 1st second network layer are interactively fused, so that the local detail information in the local detail features is fused into the global semantic features to obtain the 1st enhanced global semantic features, and the global semantic information is fused into the local detail features to obtain the 1st enhanced local detail features; the 1st enhanced global semantic features are input into the 2nd second network layer to obtain the global semantic features output by the 2nd second network layer, and through the 2nd cross-model interaction fusion module these are interactively fused with the local detail features output by the (n+1)-th first network layer to obtain the 2nd enhanced local detail features and the 2nd enhanced global semantic features; each enhanced local detail feature and its corresponding enhanced global semantic feature are concatenated, the 2nd concatenated feature being obtained in this way, and so on until the M-th concatenated feature is obtained; each concatenated feature is processed by a convolution layer to obtain a convolution result corresponding to each concatenated feature; the convolution result corresponding to the M-th concatenated feature is input into an ASPP module to obtain multi-scale features; the multi-scale features and the convolution result corresponding to the M-th concatenated feature are input into the 1st decoder; the output of the 1st decoder and the convolution result corresponding to the M-th concatenated feature are input into the 1st edge-aware enhancement module, so that the edge information of the salient object is enhanced by the 1st edge-aware enhancement module; the convolution result corresponding to the (M−1)-th concatenated feature and the output of the 1st edge-aware enhancement module are input into the 2nd decoder; this continues level by level, each edge-aware enhancement module receiving the output of the corresponding decoder together with a convolution result or, once the concatenated features are exhausted, the local detail features output by the (n−1)-th to 1st first network layers, and each decoder receiving the output of the preceding edge-aware enhancement module together with the next feature, until the output of the N-th decoder is obtained and taken as the salient object detection result, wherein the target sample image is any one of the sample images; the cross-model interaction fusion module is specifically configured to: after receiving local detail features and global semantic features, sequentially process the received local detail features through a first channel attention module and a first spatial attention module to obtain first features, process the received local detail features and the received global semantic features through a first cross-model spatial attention module to obtain second features, and add the first features and the second features element by element to obtain the enhanced local detail features; and to sequentially process the received global semantic features through a second channel attention module and a second spatial attention module to obtain third features, process the received local detail features and the received global semantic features through a second cross-model spatial attention module to obtain fourth features, and add the third features and the fourth features element by element to obtain the enhanced global semantic features; and performing salient object detection on a target image by using the trained salient object detection model.
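For illustration, the claimed cross-model interaction fusion can be sketched as follows. This is a minimal NumPy sketch, not the patented implementation: the learned channel-attention MLP, the spatial-attention convolution, and the cross-model attention layers are replaced by parameter-free mean-and-sigmoid stand-ins, and all function names are illustrative. Only the dataflow follows the claim: channel attention then spatial attention on each branch, plus a cross-model spatial-attention branch, combined by element-wise addition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Per-channel gate from global average pooling; x: (C, H, W).
    (A learned MLP would normally follow the pooling; omitted here.)"""
    gate = sigmoid(x.mean(axis=(1, 2)))            # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """Per-pixel gate from the channel-wise mean map."""
    gate = sigmoid(x.mean(axis=0))                 # (H, W)
    return x * gate[None, :, :]

def cross_model_spatial_attention(local_f, global_f):
    """Spatial gate computed jointly from both branches, applied to each."""
    gate = sigmoid(0.5 * (local_f.mean(axis=0) + global_f.mean(axis=0)))
    return local_f * gate[None], global_f * gate[None]

def cross_model_fusion(local_f, global_f):
    """Claimed fusion: (channel attn -> spatial attn) + cross-model attn."""
    first = spatial_attention(channel_attention(local_f))    # first features
    third = spatial_attention(channel_attention(global_f))   # third features
    second, fourth = cross_model_spatial_attention(local_f, global_f)
    enhanced_local = first + second      # element-wise addition, per claim 1
    enhanced_global = third + fourth
    return enhanced_local, enhanced_global
```

In the full model every attention block would carry learned weights; only the element-wise sums producing the enhanced local and global features are exactly as claimed.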
- 2. The salient object detection method according to claim 1, wherein the edge-aware enhancement module is specifically configured to obtain a boundary map from the received convolution result or local detail features, calculate the Hadamard product of the boundary map and the output of the received decoder, process the Hadamard product through a convolution layer to obtain a convolution result corresponding to the Hadamard product, and add the convolution result corresponding to the Hadamard product and the output of the received decoder element by element to obtain the output of the edge-aware enhancement module.
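The dataflow of the edge-aware enhancement module can be sketched as below. The boundary-map layer is not specified in the claim, so a channel-mean-plus-sigmoid stand-in is assumed; the Hadamard product, convolution, and element-wise residual addition follow the claim, and `conv` is a hypothetical callable standing in for the convolution layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_aware_enhancement(feature, decoder_out, conv=None):
    """Claim-2 flow: boundary map -> Hadamard product with the decoder
    output -> convolution -> element-wise residual addition.
    feature: (C, H, W) convolution result or local detail features;
    decoder_out: (C, H, W) output of the preceding decoder;
    conv: optional callable for the convolution layer (identity if None)."""
    # Boundary map from the incoming feature: a learned layer in the
    # patent, a channel-mean + sigmoid stand-in here.
    boundary = sigmoid(feature.mean(axis=0, keepdims=True))    # (1, H, W)
    hadamard = boundary * decoder_out                          # Hadamard product
    conv_out = conv(hadamard) if conv is not None else hadamard
    return conv_out + decoder_out                              # residual add
```

The residual addition means the module can only add edge-gated detail on top of the decoder output, which is why it sharpens boundaries without discarding the decoder's prediction.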
- 3. The salient object detection method according to claim 2, further comprising: when training the constructed salient object detection model, supervising the boundary map with a first loss function and supervising the salient object detection result with a second loss function.
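Claim 3 names a first loss for the boundary map and a second for the detection result but does not specify either. A common choice in salient object detection is binary cross-entropy for both; the sketch below assumes that choice, and the function names are ours.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over a predicted map with values in (0, 1)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def total_loss(boundary_pred, boundary_gt, saliency_pred, saliency_gt):
    """First loss supervises the boundary map, second the saliency map;
    the two terms are summed (relative weighting is an open choice)."""
    return (bce_loss(boundary_pred, boundary_gt)
            + bce_loss(saliency_pred, saliency_gt))
```

A weighting coefficient between the two terms, or an IoU-style term on the saliency map, would be natural refinements; the patent text as given does not commit to any.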
- 4. A salient object detection system, characterized by comprising a model training module and a salient object detection module; the model training module is used for training a constructed salient object detection model based on a plurality of sample images to obtain a trained salient object detection model, wherein the constructed salient object detection model comprises a dual-branch encoder, cross-model interaction fusion modules, edge-aware enhancement modules and decoders; the first branch of the dual-branch encoder comprises N first network layers arranged in sequence and the second branch comprises M second network layers arranged in sequence, wherein N and M are positive integers and N = M + n − 1; a target sample image is input into the 1st first network layer to obtain the local detail features output by the 1st to N-th first network layers; the local detail features output by the (n−1)-th first network layer are input into the 1st second network layer to obtain the global semantic features output by the 1st second network layer; through the 1st cross-model interaction fusion module, the local detail features output by the n-th first network layer and the global semantic features output by the 1st second network layer are interactively fused, so that the local detail information in the local detail features is fused into the global semantic features to obtain the 1st enhanced global semantic features, and the global semantic information is fused into the local detail features to obtain the 1st enhanced local detail features; the 1st enhanced global semantic features are input into the 2nd second network layer to obtain the global semantic features output by the 2nd second network layer, and through the 2nd cross-model interaction fusion module these are interactively fused with the local detail features output by the (n+1)-th first network layer to obtain the 2nd enhanced local detail features and the 2nd enhanced global semantic features; each enhanced local detail feature and its corresponding enhanced global semantic feature are concatenated, the 2nd concatenated feature being obtained in this way, and so on until the M-th concatenated feature is obtained; each concatenated feature is processed by a convolution layer to obtain a convolution result corresponding to each concatenated feature; the convolution result corresponding to the M-th concatenated feature is input into an ASPP module to obtain multi-scale features; the multi-scale features and the convolution result corresponding to the M-th concatenated feature are input into the 1st decoder; the output of the 1st decoder and the convolution result corresponding to the M-th concatenated feature are input into the 1st edge-aware enhancement module, so that the edge information of the salient object is enhanced by the 1st edge-aware enhancement module; the convolution result corresponding to the (M−1)-th concatenated feature and the output of the 1st edge-aware enhancement module are input into the 2nd decoder; this continues level by level, each edge-aware enhancement module receiving the output of the corresponding decoder together with a convolution result or, once the concatenated features are exhausted, the local detail features output by the (n−1)-th to 1st first network layers, and each decoder receiving the output of the preceding edge-aware enhancement module together with the next feature, until the output of the N-th decoder is obtained and taken as the salient object detection result, wherein the target sample image is any one of the sample images; the cross-model interaction fusion module is specifically configured to: after receiving local detail features and global semantic features, sequentially process the received local detail features through a first channel attention module and a first spatial attention module to obtain first features, process the received local detail features and the received global semantic features through a first cross-model spatial attention module to obtain second features, and add the first features and the second features element by element to obtain the enhanced local detail features; and to sequentially process the received global semantic features through a second channel attention module and a second spatial attention module to obtain third features, process the received local detail features and the received global semantic features through a second cross-model spatial attention module to obtain fourth features, and add the third features and the fourth features element by element to obtain the enhanced global semantic features; and the salient object detection module is used for performing salient object detection on a target image by using the trained salient object detection model.
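Claims 1 and 4 route the deepest convolution result through an ASPP module to obtain multi-scale features, but do not detail the module itself. The sketch below assumes the standard atrous-spatial-pyramid-pooling layout: parallel dilated 3×3 branches plus an image-level pooling branch, concatenated along channels and fused by a 1×1 convolution. The dilation rates, shapes, and weights are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def dilated_conv3x3(x, w, rate):
    """'Same'-padded 3x3 atrous convolution.
    x: (Cin, H, W); w: (Cout, Cin, 3, 3); dilation rate >= 1."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            patch = xp[:, i * rate:i * rate + H, j * rate:j * rate + W]
            out += np.tensordot(w[:, :, i, j], patch, axes=([1], [0]))
    return out

def aspp(x, branch_weights, fuse_weight, rates=(1, 6, 12)):
    """Parallel atrous branches plus an image-level pooling branch,
    concatenated along channels and fused by a 1x1 convolution.
    x: (C, H, W); branch_weights: one (C, C, 3, 3) array per rate;
    fuse_weight: (C, 4C) for the 1x1 fusion."""
    branches = [dilated_conv3x3(x, w, r)
                for w, r in zip(branch_weights, rates)]
    # Image-level context: global average pooled, broadcast back to (C, H, W).
    pooled = np.broadcast_to(x.mean(axis=(1, 2))[:, None, None], x.shape)
    branches.append(np.array(pooled))
    cat = np.concatenate(branches, axis=0)                  # (4C, H, W)
    return np.tensordot(fuse_weight, cat, axes=([1], [0]))  # (C, H, W)
```

The different dilation rates sample context at different effective receptive fields over the same feature map, which is what makes the output "multi-scale" without any resolution change.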
- 5. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the salient object detection method according to any one of claims 1 to 3 when executing the computer program.
- 6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the salient object detection method according to any one of claims 1 to 3.
Description
Technical Field
The present invention relates to the field of salient object detection technologies, and in particular to a salient object detection method, a salient object detection system, an electronic device, and a storage medium.
Background
The salient object detection (SOD) task aims to mimic human visual perception by automatically identifying and segmenting the most attention-grabbing objects or regions in natural images. The "salience" of an object may be expressed in terms of its shape, size, color, and spatial location. Salient object detection has wide application value in fields such as autonomous driving, image retrieval, video segmentation, image cropping, semantic segmentation, and object recognition. Early salient object detection methods relied on manually designed features, i.e., hand-crafted color and texture cues used to mine high-contrast regions. However, such methods cannot extract high-level semantic information and perform poorly in complex scenes. The development of deep learning has largely broken through the bottleneck of hand-crafted methods. Deep-learning-based salient object detection mainly relies on deep neural networks to extract discriminative features and segment salient regions that contrast strongly with their surroundings. Although existing methods have made major breakthroughs, challenges such as the irregular topology of salient objects and cluttered backgrounds remain. In particular, salient objects may exhibit complex geometries such as non-rigid deformation, internal voids, bifurcation, or adhesion, and the color or texture difference between the object boundary and the background may be small. This leads to blurred predictions and unclear segmentation edges.
Convolutional-neural-network-based methods can extract discriminative features through multi-layer convolution. However, because of their limited receptive field they cannot capture long-range dependencies, perform poorly in complex scenes, and undermine the integrity of the object. Transformer-based approaches can extract contextual semantic information from a global perspective; however, because of the global self-attention mechanism they cannot obtain enough local detail information, and predictions may be blurred when small objects are processed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a salient object detection method, system, electronic device, and storage medium. In a first aspect, the invention provides a salient object detection method, which specifically includes the following technical solution: training a constructed salient object detection model based on a plurality of sample images to obtain a trained salient object detection model, wherein the constructed model comprises a dual-branch encoder, a cross-model interaction fusion module, an edge-aware enhancement module, and a decoder; a plurality of local detail features and a plurality of global semantic features of a target sample image are extracted by the dual-branch encoder; according to a preset correspondence, some of the local detail features are interactively fused with all of the global semantic features by the cross-model interaction fusion module to obtain a plurality of enhanced local detail features and a plurality of enhanced global semantic features; each enhanced local detail feature is concatenated with its corresponding enhanced global semantic feature to obtain a plurality of concatenated features; and the salient object detection result is obtained after all of the concatenated features and the remaining local detail features are processed by the edge-aware enhancement module and the decoder, wherein the target sample image is any one of the sample images; and performing salient object detection on a target image by using the trained salient object detection model.
The salient object detection method provided by the invention has the following beneficial effects: the dual-branch encoder can extract local detail features and global semantic features separately; the local detail features help capture complex geometric structures of targets, such as non-rigid deformation and internal voids, while the global semantic features help grasp the overall scene and resist cluttered background interference; the features are further enhanced by the cross-model interaction fusion module, so that the salient object detection model can identify targets more accurately, the proble