CN-122023924-A - RGB-D salient object detection method based on hierarchical feature propagation and scale perception
Abstract
The invention relates to the technical field of computer vision and image processing, and in particular to an RGB-D salient object detection method based on hierarchical feature propagation and scale perception. The method comprises the following steps: obtaining an RGB image to be detected and its corresponding depth map, and preprocessing the input data; performing multi-level feature extraction on the RGB image and the depth map with a dual-stream pyramid vision transformer encoder; performing weighted aggregation of adjacent-level features through an adaptive feature aggregation module to enhance the complementarity between high-level semantic information and low-level spatial detail information; applying a hierarchical cross-modal fusion strategy that performs detail-localization fusion on the low-level features and global semantic fusion on the high-level features; and decoding the fused features step by step with a scale-aware decoding structure composed of multi-scale context convolution blocks and perception refinement units to generate the final salient object detection result.
Inventors
- Xiao Ruichao
- Zhao Li
- Zhang Yan
- Hu Jianjun
- Xu Zhanwei
- Xia Chenxing
- Wang Wei
Assignees
- Henan Industry and Trade Vocational College (河南工业贸易职业学院)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-03
Claims (7)
- 1. An RGB-D salient object detection method based on hierarchical feature propagation and scale perception, characterized by comprising the following steps: S1, acquiring an RGB image to be detected and its corresponding depth map, and preprocessing the input data; S2, performing multi-level feature encoding on the RGB image and the depth map, and extracting RGB features and depth features at different semantic levels; S3, performing adaptive feature aggregation on the RGB features and depth features of different semantic levels to generate aggregated features; S4, performing hierarchical cross-modal fusion on the aggregated features to generate fused features; S5, performing scale-aware decoding on the fused features to generate a salient object detection result.
- 2. The RGB-D salient object detection method based on hierarchical feature propagation and scale perception according to claim 1, wherein the specific method of step S1 is as follows: acquiring an RGB-D salient object detection dataset, wherein each sample comprises an RGB image, a depth map spatially aligned with the RGB image, and a corresponding salient object annotation map; and taking the RGB image and its depth map as the bimodal input data of the network, with the corresponding salient object label map serving as supervision information.
- 3. The RGB-D salient object detection method based on hierarchical feature propagation and scale perception according to claim 1, wherein the specific method of step S2 is as follows: inputting the RGB image and the depth map, respectively, into encoding networks of shared structure for feature extraction; and obtaining RGB features and depth features of different spatial resolutions and different semantic levels through step-by-step downsampling and feature mapping operations.
- 4. The RGB-D salient object detection method based on hierarchical feature propagation and scale perception according to claim 1, wherein the specific method of step S3 is as follows: performing spatial scale alignment on the RGB features and depth features of different levels; concatenating the scale-aligned adjacent-level features and performing cross-layer feature aggregation through convolutional mapping; and applying a channel attention mechanism to the aggregated features to generate channel weights, which weight the aggregated features to obtain enhanced features.
- 5. The RGB-D salient object detection method based on hierarchical feature propagation and scale perception according to claim 1, wherein the specific method of step S4 is as follows: for the low-level features, inputting the RGB features and depth features into a detail localization fusion module and performing cross-modal fusion through an attention mechanism; and for the high-level features, inputting the RGB features and depth features into a global semantic fusion module and performing fusion through bidirectional cross-modal semantic interaction.
- 6. The RGB-D salient object detection method based on hierarchical feature propagation and scale perception according to claim 1, wherein the specific method of step S5 is as follows: performing multi-scale context modeling on the fused features to obtain multi-scale feature representations; enhancing the multi-scale feature representations step by step through perception refinement units, which generate channel attention weights and spatial attention weights, weight the features accordingly, and obtain refined features through residual connections; fusing the refined features step by step to obtain decoded features; and mapping the decoded features through an output layer to generate the final salient object detection map.
- 7. The RGB-D salient object detection method based on hierarchical feature propagation and scale perception according to claim 1, wherein the method adopts a deep supervision strategy in the training stage to jointly supervise the prediction results at different levels.
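As a rough illustration of the adaptive aggregation described in claims 3 and 4, the following numpy sketch upsamples a higher-level feature map, concatenates it with the adjacent lower-level map, and re-weights channels with a squeeze-and-excitation-style attention. All shapes, names, and the fixed random weights are hypothetical stand-ins; the patent's learned module is not specified at this level of detail.

```python
import numpy as np

def channel_attention(feat, reduction=2):
    """Squeeze-and-excitation style channel weighting (hypothetical sketch).
    feat: (C, H, W) feature map."""
    c = feat.shape[0]
    squeeze = feat.mean(axis=(1, 2))                 # global average pool -> (C,)
    # two-layer bottleneck with fixed random weights (stands in for learned ones)
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ squeeze, 0.0)           # ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid -> (C,)
    return feat * weights[:, None, None]             # re-weight channels

def aggregate_adjacent(low, high):
    """Upsample the higher-level map to the lower level's resolution,
    concatenate along channels, then apply channel attention."""
    rh = low.shape[1] // high.shape[1]
    rw = low.shape[2] // high.shape[2]
    high_up = high.repeat(rh, axis=1).repeat(rw, axis=2)  # nearest-neighbour upsample
    agg = np.concatenate([low, high_up], axis=0)
    return channel_attention(agg)

low = np.random.rand(8, 16, 16)    # low-level feature  (C=8, 16x16)
high = np.random.rand(8, 8, 8)     # high-level feature (C=8, 8x8)
out = aggregate_adjacent(low, high)
print(out.shape)                   # (16, 16, 16)
```

In the full method a convolutional mapping would follow the concatenation; here the channel attention alone shows how redundant channels can be suppressed by weights in (0, 1).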
Description
RGB-D salient object detection method based on hierarchical feature propagation and scale perception
Technical Field
The invention relates to the technical field of computer vision and image processing, and in particular to an RGB-D salient object detection method based on hierarchical feature propagation and scale perception, which is suitable for fusing RGB appearance information with depth structure information to realize accurate localization of salient object regions.
Background
Salient object detection (SOD) aims to automatically identify the most visually attention-grabbing regions in complex scenes and is an important basis for various vision tasks such as image segmentation, object detection, scene understanding, and human-computer interaction. With the development of deep learning, salient object detection methods based on RGB images have improved markedly in performance. However, a single RGB image often lacks reliable geometric and structural information in complex backgrounds, low-contrast areas, and occluded scenes, which leads to unstable detection results. For this reason, RGB-D salient object detection methods that introduce depth information have attracted attention: the depth map provides spatial structure and geometric cues of the scene, and thus an effective complement for salient object localization. RGB-D SOD aims to accurately locate and segment the most attention-grabbing regions in an image by fusing the color (RGB) image with the depth image.
However, the prior art still faces the following challenges: (1) the lack of an effective information-transfer mechanism between features at different levels makes it difficult for high-level semantic information to guide low-level detail features; (2) cross-modal fusion strategies generally adopt rigid designs that do not fully account for the differences between RGB and depth features at different semantic levels; (3) the decoding stage models multi-scale structural relationships insufficiently, which easily causes incomplete salient regions or blurred boundaries. Therefore, an RGB-D salient object detection method is needed that can simultaneously model multi-level feature dependencies, realize adaptive cross-modal fusion, and possess scale-perception capability.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an RGB-D salient object detection method based on hierarchical feature propagation and scale perception, which realizes high-precision detection of salient object regions by constructing a hierarchical feature propagation mechanism, a hierarchical cross-modal fusion strategy, and a scale-aware decoding structure. The method comprises the following steps: 1. Acquiring and inputting RGB-D image data. 1.1 Acquire a public dataset from the RGB-D salient object detection field, including the LFSD, NJU2K, NLPR, DUT-RGBD, SIP, and STERE datasets. 1.2 Take the RGB image and the corresponding depth map as the bimodal input data of the network, and obtain the corresponding salient object label map as supervision information. 2.
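Step 1 amounts to loading an aligned RGB/depth pair and normalizing both into network-ready tensors. A minimal numpy sketch, assuming a nearest-neighbour resize, per-image min-max depth scaling, and a 352-pixel input size (a common choice in RGB-D SOD pipelines, not specified by the patent):

```python
import numpy as np

def preprocess_pair(rgb, depth, size=352):
    """Normalize an RGB image and its aligned depth map into tensors.
    rgb: (H, W, 3) uint8; depth: (H, W) uint16 or float.
    The target size of 352 is an assumption, not from the patent."""
    h, w = rgb.shape[:2]
    # nearest-neighbour resize via index sampling (keeps the sketch numpy-only)
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    rgb_r = rgb[ys][:, xs].astype(np.float32) / 255.0   # scale RGB to [0, 1]
    d = depth[ys][:, xs].astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)      # per-image min-max scaling
    # channels-first layout: (3, H, W) for RGB, (1, H, W) for depth
    return rgb_r.transpose(2, 0, 1), d[None]

rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
depth = np.random.randint(0, 65535, (480, 640), dtype=np.uint16)
x_rgb, x_d = preprocess_pair(rgb, depth)
print(x_rgb.shape, x_d.shape)   # (3, 352, 352) (1, 352, 352)
```

In practice the ground-truth saliency map would be resized with the same index sampling so that the supervision stays pixel-aligned with the inputs.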
Multi-level feature encoding. 2.1 Input the RGB image and the depth map, respectively, into encoding networks of shared structure, and adopt a Pyramid Vision Transformer (PVT) to extract multi-level features from the input data. 2.2 During encoding, RGB features and depth features are obtained at each level, denoted F_i^rgb and F_i^d respectively, where the index i runs over the semantic levels from low to high. 3. Adaptive feature aggregation. 3.1 Input the multi-level RGB features and depth features obtained in step 2 into an Adaptive Feature Aggregation Module (AFAM) to spatially scale-align the features of different levels. 3.2 Map the aligned features through convolution operations, and concatenate and aggregate adjacent-level features to enhance the complementary relationship between high-level semantic information and low-level spatial detail information. 3.3 Adaptively weight the aggregated features through a channel attention mechanism, suppress redundant feature channels, and obtain the enhanced multi-level RGB features and depth features. 4. Hierarchical cross-modal fusion. 4.1 For the low-level features, input the RGB features and depth features obtained in step 3 into a Detail Localization Fusion Module (DLFM), and perform spatial alignment and noise suppression on the low-level features