CN-122023994-A - Modality-guided enhanced RGB-D video feature fusion and salient object detection method
Abstract
The application discloses a modality-guided enhanced RGB-D video feature fusion and salient object detection method. The fusion method comprises: acquiring an RGB image and an optical flow image of an object; extracting multi-scale features of the RGB image and of the optical flow image; and performing modality-guided enhanced fusion on the multi-scale features of the optical flow image, guided by the multi-scale features of the RGB image, to obtain fusion features. The application improves the quality of the optical flow features and enhances the reliability of the motion information, effectively addresses the under-utilization of features caused by heavy optical flow noise and poor stability in multi-modal data, and provides a high-quality cross-modal fusion foundation for subsequent salient object detection.
Inventors
- LI GONGYANG
- SHI SHIXIANG
Assignees
- Shanghai University (上海大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-10
Claims (10)
- 1. A modality-guided enhanced RGB-D video feature fusion method, comprising: acquiring an RGB image and an optical flow image of an object; extracting multi-scale features of the RGB image and the optical flow image respectively; and performing modality-guided enhanced fusion on the multi-scale features of the optical flow image, guided by the multi-scale features of the RGB image, to obtain fusion features.
- 2. The modality-guided enhanced RGB-D video feature fusion method according to claim 1, wherein performing modality-guided enhanced fusion on the multi-scale features of the optical flow image using the multi-scale features of the RGB image to obtain the fusion features comprises: enhancing the multi-scale features of the RGB image and of the optical flow image through a channel-spatial hybrid network to obtain corresponding preliminary refined features; for the respective preliminary refined features of the RGB image and the optical flow image, performing long-range context modeling with a state update mechanism to obtain respective refined features; and performing cross-modal concatenation and mapping on the refined RGB features and refined optical flow features to generate the fusion features.
- 3. The modality-guided enhanced RGB-D video feature fusion method according to claim 2, characterized in that the refined features of the optical flow image are obtained as follows: passing the preliminary refined features of the optical flow image through a fully connected layer to generate the dynamic parameter matrices B_f, C_f and Δ_f required by a state space model; initializing a state transition matrix A_f and a pass-through coefficient D_f; performing a convolution operation on the preliminary refined RGB features and generating a spatial attention mask weight w_r through a Sigmoid activation function; performing element-wise multiplication of the mask weight w_r with the output mapping parameter C_f of the optical flow branch, and adding the result to the original C_f as a residual to complete the cross-modal modulation of C_f; inputting the modulated parameters {A_f, B_f, C_f, D_f, Δ_f} into an S6 module and executing a selective state space scan; and applying layer normalization, a fully connected transformation and a residual connection to the output of the S6 module in sequence to generate the final refined optical flow features (a sketch of this scan follows the claims).
- 4. A quality-enhanced RGB-D video salient object detection method, characterized by comprising the following steps: extracting a depth image and multi-scale features thereof; performing cross-modal fusion of the multi-scale features of the depth image with the fusion features obtained by the method of any one of claims 1 to 3 to generate conditional features; and inputting the conditional features into a saliency detection network to detect a saliency map of the object.
- 5. The quality-enhanced RGB-D video salient object detection method of claim 4, wherein the cross-modal fusion of the multi-scale features of the depth image with the fusion features to generate conditional features comprises: enhancing the fusion features and the multi-scale features of the depth image through a channel-spatial hybrid network to obtain corresponding preliminary refined features; generating initial saliency prediction maps from the preliminary refined features of the two modalities respectively, computing the difference map of the initial saliency prediction maps, and identifying the preset proportion of pixels with the largest absolute difference as the key difference region; within the key difference region, injecting the preliminary refined features of the two modalities into each other to construct expanded feature representations containing cross-modal context information, namely expanded fusion features and expanded depth features; during modeling, removing the injected foreign-modality components according to their position indices, retaining the original responses of each modality, and fusing and optimizing the bidirectional results after this removal to generate refined features; and concatenating the refined fusion features and the refined depth features across channels to obtain the conditional features (a sketch of the key-difference-region step follows the claims).
- 6. The quality-enhanced RGB-D video salient object detection method of claim 4, wherein the saliency detection network performs multiple rounds of iterative denoising with a diffusion architecture, specifically: reusing the same denoising network at every iteration; continuously feeding the conditional features into the denoising network as its conditional input at each time step; the denoising network being a multi-level U-shaped encoder-decoder that fuses and refines the multi-scale conditional features in latent space through skip connections and attention mechanisms; and outputting a denoised latent variable at each time step, computing from it the input latent variable of the next time step, and, after multiple reverse diffusion iterations, decoding the final latent variable to generate the saliency map (see the diffusion-loop sketch after the claims).
- 7. The quality-enhanced RGB-D video salient object detection method of claim 6, wherein the denoising network is a multi-level U-shaped encoder-decoder that fuses and refines the multi-scale conditional features in latent space through skip connections and attention mechanisms, comprising: an encoding part consisting of a plurality of convolution blocks, the output size of each convolution block matching the size of the corresponding conditional features, the conditional features of the corresponding level being fused with the output of each convolution block and the fusion result serving as the input of the next convolution; and a decoding part consisting of a plurality of convolution blocks, where after upsampling, each block further refines and enhances the upsampled features through channel attention and spatial attention, then performs channel-wise concatenation and fusion with the corresponding-level output of the encoder's modality-guided enhancement module, the fused result serving as the input of the next convolution (see the denoiser sketch after the claims).
- 8. The quality-enhanced RGB-D video salient object detection method of claim 6, wherein the model corresponding to the method comprises: a feature extraction network for extracting multi-scale features of the RGB, optical flow and depth images; a modality-guided enhancement module for performing modality-guided enhanced fusion on the multi-scale features of the optical flow image, guided by the multi-scale features of the RGB image, to obtain fusion features; a modality-aware enhancement module for performing cross-modal fusion of the multi-scale features of the depth image with the fusion features to generate conditional features; and a denoising network, serving as the saliency detection network, into which the conditional features are input and which detects the saliency map of the object. The model is trained as follows: determining a training sample set comprising RGB images, depth images and optical flow images of RGB-D videos and their corresponding saliency ground-truth maps; inputting the samples into an initial detection model to obtain the predicted saliency maps it outputs; computing a binary cross-entropy loss and an intersection-over-union loss from the predicted saliency maps and the ground-truth maps, and computing a total loss from the two (a sketch of the loss follows the claims); and, based on the total loss, iteratively updating the structural parameters of the initial detection model with the alternating direction method of multipliers to obtain the RGB-D video salient object detection model.
- 9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
- 10. An electronic device, comprising: a memory having a computer program stored thereon; and a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1 to 8.
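A minimal PyTorch sketch of the cross-modally modulated selective state space scan described in claim 3, assuming the multi-scale features are flattened to (batch, length, dim) sequences; the state size, the Conv1d mask head and the module layout are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedSelectiveScan(nn.Module):
    """Refines optical flow features; C_f is modulated by an RGB-derived mask."""

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.state = state
        # Fully connected layer generating the dynamic parameters B_f, C_f, delta_f.
        self.to_params = nn.Linear(dim, 2 * state + 1)
        # Initialized state transition matrix A_f (log-parameterized, kept negative)
        # and pass-through coefficient D_f.
        self.A_log = nn.Parameter(torch.zeros(dim, state))
        self.D = nn.Parameter(torch.ones(dim))
        # Convolution on the RGB branch + Sigmoid -> spatial attention mask w_r.
        self.mask_conv = nn.Conv1d(dim, state, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim)
        self.out_fc = nn.Linear(dim, dim)

    def forward(self, flow: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        # flow, rgb: (batch, length, dim) preliminary refined features.
        b, l, d = flow.shape
        B_f, C_f, delta = self.to_params(flow).split([self.state, self.state, 1], -1)
        delta = F.softplus(delta)                                    # (b, l, 1) > 0
        # Mask weight w_r from the preliminary refined RGB features.
        w_r = torch.sigmoid(self.mask_conv(rgb.transpose(1, 2))).transpose(1, 2)
        # Cross-modal modulation of C_f with a residual add of the original C_f.
        C_f = C_f + w_r * C_f
        A = -torch.exp(self.A_log)                                   # (d, n)
        h = flow.new_zeros(b, d, self.state)                         # hidden state
        ys = []
        for t in range(l):                                           # S6 selective scan
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)            # (b, d, n)
            dB = delta[:, t].unsqueeze(-1) * B_f[:, t].unsqueeze(1)  # (b, 1, n)
            h = dA * h + dB * flow[:, t].unsqueeze(-1)               # state update
            ys.append((h * C_f[:, t].unsqueeze(1)).sum(-1) + self.D * flow[:, t])
        y = torch.stack(ys, dim=1)                                   # (b, l, d)
        # Layer normalization, fully connected transform and residual connection.
        return flow + self.out_fc(self.norm(y))
```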
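A sketch of the key-difference-region construction of claim 5, under the same caveat: the 1x1-convolution prediction heads, the 20% ratio and the injection-by-addition inside the masked region are assumptions for illustration.

```python
import torch
import torch.nn as nn

def key_difference_injection(fuse_feat, depth_feat, head_f, head_d, ratio=0.2):
    # fuse_feat, depth_feat: (batch, channels, H, W) preliminary refined features.
    pred_f = torch.sigmoid(head_f(fuse_feat))     # initial saliency map, fusion branch
    pred_d = torch.sigmoid(head_d(depth_feat))    # initial saliency map, depth branch
    diff = (pred_f - pred_d).abs().flatten(1)     # (batch, H*W) difference map
    k = max(1, int(ratio * diff.shape[1]))
    # Preset proportion of pixels with the largest absolute difference.
    idx = diff.topk(k, dim=1).indices             # key difference region indices
    mask = torch.zeros_like(diff).scatter_(1, idx, 1.0).view(pred_f.shape)
    # Mutually inject the other modality's features inside the key region only,
    # yielding the expanded fusion / expanded depth representations.
    expanded_fuse = fuse_feat + mask * depth_feat
    expanded_depth = depth_feat + mask * fuse_feat
    return expanded_fuse, expanded_depth, idx     # idx kept for later removal

# Usage with hypothetical heads and sizes:
head_f = nn.Conv2d(64, 1, kernel_size=1)
head_d = nn.Conv2d(64, 1, kernel_size=1)
f, d = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
ef, ed, idx = key_difference_injection(f, d, head_f, head_d)
```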
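A sketch of the multi-round iterative denoising of claim 6, written as a deterministic DDIM-style reverse loop; the noise schedule, step count and the `denoiser`/`decode` interfaces are assumptions, not the patent's concrete sampler.

```python
import torch

@torch.no_grad()
def reverse_diffusion(denoiser, decode, cond_feats, latent_shape,
                      alphas_cumprod, steps=10):
    # alphas_cumprod: (T,) cumulative noise schedule (standard DDPM quantity).
    z = torch.randn(latent_shape)                          # start from pure noise
    ts = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    for i, t in enumerate(ts):
        # The same denoising network is reused every iteration, with the
        # conditional features injected as conditioning at every time step.
        eps = denoiser(z, t, cond_feats)                   # predicted noise
        a_t = alphas_cumprod[t]
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # denoised latent
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        # Compute the input latent variable of the next time step (DDIM, eta=0).
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps
    return decode(z)                                       # final latent -> saliency map
```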
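A sketch of the conditional U-shaped denoiser of claim 7, reduced to two levels; the channel widths, the fusion-by-addition of same-level conditional features, and the attention layout are assumptions.

```python
import torch
import torch.nn as nn

class AttnRefine(nn.Module):
    """Channel attention then spatial attention on upsampled features."""
    def __init__(self, ch):
        super().__init__()
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.sa = nn.Sequential(nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.ca(x)        # channel reweighting
        return x * self.sa(x)     # spatial reweighting

class CondUNet(nn.Module):
    """Two-level conditional U-shape; cond1/cond2 match each encoder level's size."""
    def __init__(self, c1=64, c2=128):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, c1, 3, padding=1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(c1, c2, 3, padding=1), nn.ReLU(True))
        self.down = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(c2, c1, 2, stride=2)
        self.refine = AttnRefine(c1)
        self.dec = nn.Sequential(nn.Conv2d(2 * c1, c1, 3, padding=1), nn.ReLU(True),
                                 nn.Conv2d(c1, 1, 1))

    def forward(self, z, cond1, cond2):
        # Encoder: fuse same-level conditional features into each block's output.
        e1 = self.enc1(z) + cond1                  # level-1 fusion (by addition)
        e2 = self.enc2(self.down(e1)) + cond2      # level-2 fusion
        # Decoder: upsample, refine with channel/spatial attention, then
        # channel-concatenate with the corresponding encoder-level output.
        d1 = self.refine(self.up(e2))
        return self.dec(torch.cat([d1, e1], dim=1))

# Usage with hypothetical sizes (z: noisy latent, conds from the fusion modules):
net = CondUNet()
out = net(torch.randn(1, 1, 64, 64), torch.randn(1, 64, 64, 64),
          torch.randn(1, 128, 32, 32))
```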
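One common reading of the binary cross-entropy plus intersection-over-union total loss named in claim 8; the equal weighting of the two terms and the soft-IoU smoothing constant are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_logits, gt):
    # pred_logits: raw network output (b, 1, H, W); gt: ground truth in [0, 1].
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt)
    p = torch.sigmoid(pred_logits)
    inter = (p * gt).sum(dim=(1, 2, 3))
    union = (p + gt - p * gt).sum(dim=(1, 2, 3))
    iou = 1.0 - (inter + 1.0) / (union + 1.0)      # soft IoU loss, smoothed
    return bce + iou.mean()                        # total loss = BCE + IoU
```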
Description
Modality-guided enhanced RGB-D video feature fusion and salient object detection method

Technical Field

The application relates to the field of multi-modal video analysis, and in particular to a modality-guided enhanced RGB-D video feature fusion and salient object detection method.

Background

RGB-D video salient object detection is an important research direction in multi-modal video analysis; its core challenges come from the dynamic nature of video content and from the efficient fusion of RGB and depth information. Traditional video saliency detection methods struggle to extract salient objects stably under scene changes, fluctuating illumination and motion blur. In addition, single-modality RGB video is often limited by color similarity and complex-background interference, so the contrast between salient regions and the background is low, which hampers accurate detection of salient objects. Compared with traditional video processing, RGB-D multi-modal video contains depth information and can compensate for the limitations of single-modality RGB video to some extent. However, the quality of depth images in complex dynamic scenes is often limited by sensor precision and noise; in particular, when the depth information of a dynamic object is blurred, how to effectively fuse RGB and depth information, suppress noise interference and accurately localize the salient object in the video remains an open problem.

Existing RGB-D video saliency detection techniques face several challenges. First, the three modalities (RGB image, optical flow image and depth image) are imbalanced during fusion: typically the RGB image contributes most to saliency detection, the depth image less, and the optical flow image least. This imbalance leads to under-utilized information; in particular, the optical flow image fails to supply reliable motion information, which degrades the precise localization of salient objects in dynamic scenes. Second, prior methods usually ignore modality differences and do not perform targeted feature fusion in the regions where modalities disagree, so the quality of the fusion features is limited. The RGB, optical flow and depth images respectively carry texture, motion and geometric information, and each modality behaves markedly differently in specific regions; these difference regions are critical to the multi-modal fusion result. Lacking a targeted fusion strategy, the complementary information in these regions is not fully exploited, the three-modality fusion is poor, and the extraction accuracy of salient objects drops sharply. These problems make it difficult for the prior art to handle saliency detection in complex video scenes, and considerable room for improvement remains in dynamic environments and under variable illumination.

By way of retrieval, Chinese patent application No. 202110910457.6 provides a salient object detection method that improves target localization accuracy in complex scenes and enhances detection robustness through multi-scale feature extraction and fusion.
However, that method does not address modality variability and cannot perform complementary fusion across the difference regions of modalities such as RGB and depth, so the quality of its fusion features is limited.

Disclosure of Invention

In view of one of the drawbacks of the prior art, an object of the present application is to provide a modality-guided enhanced RGB-D video feature fusion and salient object detection method. In a first aspect of the present application, a modality-guided enhanced RGB-D video feature fusion method is provided, comprising: acquiring an RGB image and an optical flow image of an object; extracting multi-scale features of the RGB image and the optical flow image respectively; and performing modality-guided enhanced fusion on the multi-scale features of the optical flow image, guided by the multi-scale features of the RGB image, to obtain fusion features. Optionally, performing modality-guided enhanced fusion on the multi-scale features of the optical flow image using the multi-scale features of the RGB image to obtain the fusion features comprises: enhancing the multi-scale features of the RGB image and of the optical flow image through a channel-spatial hybrid network to obtain corresponding preliminary refined features; for the respective preliminary refined features of the RGB image and the optical flow image, performing long-range context modeling with a state update mechanism to obtain respective refined features; and performing cross-modal concatenation and mapping on the refined RGB features and refined optical flow features to generate the fusion features, as sketched below.
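A minimal sketch of the channel-spatial hybrid enhancement and the cross-modal concatenation-and-mapping described above, assuming a CBAM-like attention layout and illustrative channel sizes; the patent does not specify this exact structure.

```python
import torch
import torch.nn as nn

class ChannelSpatialHybrid(nn.Module):
    """Channel attention followed by spatial attention -> preliminary refined features."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        # Channel attention branch.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        # Spatial attention branch.
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)        # channel-wise reweighting
        return x * self.spatial(x)     # spatial reweighting

def fuse(rgb_refined, flow_refined, proj):
    # Cross-modal concatenation followed by a 1x1-convolution mapping.
    return proj(torch.cat([rgb_refined, flow_refined], dim=1))

# Usage with hypothetical sizes:
enh = ChannelSpatialHybrid(64)
proj = nn.Conv2d(128, 64, kernel_size=1)
fused = fuse(enh(torch.randn(1, 64, 32, 32)), enh(torch.randn(1, 64, 32, 32)), proj)
```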