CN-121982613-A - Multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method
Abstract
The invention belongs to the fields of computer vision and artificial intelligence, and relates to a multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method. The method acquires an RGB-D video sequence; constructs a shared encoder based on SAM2; extracts multi-scale features through a modality-aware mixture-of-experts fine-tuning module (Modality-Aware MoE-LoRA) embedded in parallel; fuses the multi-scale features through a gated multi-level feature fusion module (Gated-MLF) to obtain temporal input features; guides a temporal memory module with a pseudo mask to achieve prompt-free initialization; retrieves historical context information with the temporal memory module and interacts it with the current-frame features; and outputs a prediction mask. The method requires no manual intervention, markedly improves the temporal consistency and edge segmentation accuracy of detection, reduces the parameter count and GPU memory usage, is suitable for fully automatic RGB-D video salient object detection in complex scenes, and has good generalization ability and practical value.
Inventors
- Zhou Xiaofei
- Liu Jiyuan
- Lin Jia
- Bao Liuxin
- Liu Shengping
- Zhang Jiyong
- Liu Zhi
Assignees
- Hangzhou Dianzi University (杭州电子科技大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-30
Claims (5)
- 1. A multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method, characterized by comprising the following steps: Step 1: acquiring an RGB-D video sequence, the sequence comprising a plurality of visible-light (RGB) images and frame-aligned depth images, and preprocessing the image frames. Step 2: constructing a hierarchical visual encoder based on SAM2 as a shared encoder with its pre-trained parameters kept frozen; embedding a modality-aware mixture-of-experts fine-tuning module in parallel with the multi-head self-attention layers of the shared encoder; and extracting multi-scale features of the visible-light image and the depth image with the shared encoder, wherein a modality dispatcher identifies the modality of the current input and dynamically routes the features to the corresponding group of convolutional experts, introducing a spatial-geometric inductive bias. Step 3: constructing a hierarchical decoder with skip connections, and fusing the multi-scale features with a gated multi-level feature fusion module to obtain temporal input features, the module using mid-level decoder features as gating guidance to adaptively weight encoder features of different levels and balance semantic information against spatial detail. Step 4: building a pseudo-mask-guided temporal memory module, generating prompt-free pseudo prior information for the first frame of the video sequence, and initializing the temporal memory module with it. Step 5: retrieving the temporal context of historical frames with the initialized temporal memory module, interacting it with the temporal input features of the current frame, outputting the final salient object prediction mask through the decoder, and updating the memory bank with queue management.
- 2. The multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method according to claim 1, wherein the modality-aware mixture-of-experts fine-tuning module comprises a down-projection layer, a modality dispatcher, several groups of convolutional experts, a gating network and an up-projection layer, and processes the input as follows: S2.1: mapping the input features of the shared encoder into a preset low-rank space through the down-projection layer, the low-rank dimension r being an integer from 2 to 8; S2.2: the modality dispatcher activates the corresponding expert group according to the modality of the current input image, the expert groups comprising an RGB expert group for the visible-light modality, a depth expert group for the depth modality, and a fusion expert group for cross-modal interaction; S2.3: when the input is the visible-light modality, the RGB expert group and the fusion expert group are activated; when the input is the depth modality, the depth expert group and the fusion expert group are activated; S2.4: each activated expert group comprises convolutional experts with different receptive-field sizes, which encode the spatial features of the low-rank representation with convolution kernels of different sizes; S2.5: the gating network computes a weight for each convolutional expert in the group, selects the Top-K experts by weight, and aggregates their outputs by weighted summation; S2.6: the up-projection layer restores the aggregated features to the original feature dimension, and the result is residually connected with the original output of the shared encoder.
- 3. The multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method according to claim 1, wherein the gated multi-level feature fusion module in Step 3 specifically comprises: S3.1: splicing the multi-level features output by the shared encoder and compressing their channels to obtain a context representation; S3.2: enhancing the context representation with dual spatial-convolution and channel attention to obtain an enhanced feature; S3.3: splicing the shallow features of the shared encoder with the enhanced feature, and generating a gating weight map through a convolutional layer; S3.4: weighting and fusing the shallow features and the enhanced feature with the gating weight map by element-wise multiplication, and passing the result through a feed-forward network (FFN); S3.5: splicing the fused encoder features with the mid-level features of the decoder to obtain the temporal input features.
- 4. The multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method according to claim 1, wherein the prompt-free initialization of the temporal memory module in Step 4 comprises: S4.1: at the first frame of the video sequence, generating a coarse mask from the mid-level decoding features of the first frame using a lightweight branch of the decoder, and taking it as pseudo prior information; S4.2: generating an initial memory key and an initial memory value through linear projection layers, the initial memory key being obtained by directly projecting the temporal input features of the first frame, and the initial memory value by projecting the element-wise product of the temporal input features of the first frame and the coarse mask; S4.3: storing the initial memory key and the initial memory value in the memory bank as the starting point of the historical memory.
- 5. The multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method according to claim 1, wherein the temporal interaction and memory update in Step 5 comprise: S5.1: memory reading, namely using the temporal input features of the current frame as the query, computing their similarity to the historical memory keys in the memory bank, retrieving a similarity-weighted combination of the historical memory values as the temporal context features, fusing these with the current-frame features, and feeding the result into the decoder; S5.2: memory updating, namely generating a new memory key and a new memory value from the final prediction mask and temporal input features of the current frame, the new memory key being a projection of the current-frame features and the new memory value a projection of the mask-filtered features; S5.3: queue management, namely writing the new memory key and memory value into the memory bank and, following a first-in-first-out policy, discarding the oldest memory frames beyond a preset temporal window.
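As a rough illustration (not part of the claims), the modality-aware MoE-LoRA adapter of claim 2 might be sketched in PyTorch as follows. The dimensions, kernel sizes, Top-K value, and all module names are assumptions for exposition, not values disclosed in the patent.

```python
# Illustrative sketch only: dims, kernel sizes and top-k are assumptions,
# not values taken from the patent.
import torch
import torch.nn as nn

class MoELoRA(nn.Module):
    def __init__(self, dim=64, rank=4, kernel_sizes=(1, 3, 5), top_k=2):
        super().__init__()
        assert 2 <= rank <= 8                      # S2.1: low-rank dimension r in [2, 8]
        self.down = nn.Linear(dim, rank)           # S2.1 down-projection
        self.up = nn.Linear(rank, dim)             # S2.6 up-projection
        def group():                               # S2.4: conv experts, varied receptive fields
            return nn.ModuleList(
                [nn.Conv2d(rank, rank, k, padding=k // 2) for k in kernel_sizes])
        self.experts = nn.ModuleDict({m: group() for m in ("rgb", "depth", "fuse")})
        self.gates = nn.ModuleDict(                # S2.5: one gating net per expert group
            {m: nn.Linear(rank, len(kernel_sizes)) for m in ("rgb", "depth", "fuse")})
        self.top_k = top_k

    def _run_group(self, name, z):
        # S2.5: score experts from spatially pooled features, keep Top-K, aggregate
        w = torch.softmax(self.gates[name](z.mean(dim=(2, 3))), dim=-1)
        topv, topi = w.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for b in range(z.size(0)):
            for v, i in zip(topv[b], topi[b]):
                out[b] = out[b] + v * self.experts[name][int(i)](z[b:b + 1])[0]
        return out

    def forward(self, x, modality):
        # x: (B, C, H, W) feature map from the frozen shared-encoder branch
        z = self.down(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # S2.1
        # S2.3: dispatcher routes rgb -> rgb+fuse experts, depth -> depth+fuse
        z = self._run_group(modality, z) + self._run_group("fuse", z)
        delta = self.up(z.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # S2.6
        return x + delta                                            # residual connection
```

The key point of the design is that only these small adapters are trained while the SAM2 encoder stays frozen, and the convolutional experts restore the spatial inductive bias that a plain linear LoRA lacks.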
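Similarly, the gated multi-level fusion module of claim 3 could be sketched as below. The exact attention form and the gated-fusion formula are assumptions reconstructed from the wording of the claim (the original formula image is not reproduced in this text).

```python
# Illustrative sketch only: the dual-attention and fusion forms are assumptions.
import torch
import torch.nn as nn

class GatedMLF(nn.Module):
    def __init__(self, dim=32, levels=3):
        super().__init__()
        self.compress = nn.Conv2d(levels * dim, dim, 1)       # S3.1 splice + channel compression
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1)      # S3.2 spatial-convolution branch
        self.channel = nn.Sequential(                         # S3.2 channel-attention branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.gate = nn.Sequential(                            # S3.3 gating weight map
            nn.Conv2d(2 * dim, 1, 3, padding=1), nn.Sigmoid())
        self.ffn = nn.Sequential(                             # S3.4 feed-forward network
            nn.Conv2d(dim, dim, 1), nn.GELU(), nn.Conv2d(dim, dim, 1))

    def forward(self, feats, shallow, dec_mid):
        # feats: encoder levels resampled to a common (B, dim, H, W) shape
        ctx = self.compress(torch.cat(feats, dim=1))          # S3.1 context representation
        enh = self.spatial(ctx) * self.channel(ctx)           # S3.2 dual-attention enhancement
        g = self.gate(torch.cat([shallow, enh], dim=1))       # S3.3 gating weights in (0, 1)
        fused = self.ffn(g * shallow + (1 - g) * enh)         # S3.4 gated element-wise fusion
        return torch.cat([fused, dec_mid], dim=1)             # S3.5 temporal input features
```

The gate lets shallow encoder detail dominate where the decoder's semantic context is weak, and vice versa, which is how the claim balances semantics against spatial detail.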
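Finally, the pseudo-mask-initialized temporal memory of claims 4 and 5 admits a compact sketch. The projection shapes, the attention-style read, and the window size are illustrative assumptions; tokens are assumed flattened to (B, N, dim) and masks to (B, N, 1).

```python
# Illustrative sketch only: projections, dims and window size are assumptions.
import collections
import torch
import torch.nn as nn

class TemporalMemory(nn.Module):
    def __init__(self, dim=32, key_dim=16, window=5):
        super().__init__()
        self.to_key = nn.Linear(dim, key_dim)
        self.to_val = nn.Linear(dim, dim)
        self.keys = collections.deque(maxlen=window)   # S5.3 FIFO queue management
        self.vals = collections.deque(maxlen=window)

    def init_first_frame(self, feat, coarse_mask):
        # S4.2: key from the raw first-frame feature; value from the feature
        # gated by the prompt-free coarse (pseudo) mask of S4.1
        self.keys.clear(); self.vals.clear()
        self.keys.append(self.to_key(feat))            # S4.3 start of history memory
        self.vals.append(self.to_val(feat * coarse_mask))

    def read(self, feat):
        # S5.1: current feature as query; similarity-weighted retrieval over
        # stored keys/values (cross-attention style), fused with the frame
        q = self.to_key(feat)                          # (B, N, key_dim)
        k = torch.cat(list(self.keys), dim=1)          # (B, T*N, key_dim)
        v = torch.cat(list(self.vals), dim=1)          # (B, T*N, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return feat + attn @ v

    def update(self, feat, pred_mask):
        # S5.2 write new key/value; deque(maxlen) drops the oldest frame (S5.3)
        self.keys.append(self.to_key(feat))
        self.vals.append(self.to_val(feat * pred_mask))
```

Because the first memory entry comes from the decoder's own coarse mask rather than a user-supplied prompt, the loop runs fully automatically, which is the distinction the claims draw against SAM2/XMem-style manual initialization.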
Description
Multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method

Technical Field

The invention relates to a multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method, in particular to a salient object detection method for RGB-D video data based on fine-tuning a large vision model, belonging to the technical fields of computer vision and artificial intelligence.

Background

RGB-D video salient object detection (VSOD) aims to continuously and accurately locate and segment the most visually attractive objects in a video sequence using visible-light (RGB) appearance information and depth geometry information. Unlike still-image detection, video detection must make full use of inter-frame temporal context to ensure the continuity and stability of the segmentation results, and it has important application value in fields such as autonomous driving, intelligent surveillance, and robot vision. Existing RGB-D salient object detection methods fall mainly into two categories, both of which have technical defects that are difficult to overcome.

1. Image detection methods based on conventional convolutional neural networks or generic Transformers. These methods mainly adopt a dual-stream architecture, extracting single-frame RGB and depth features separately through a CNN (such as ResNet-34 or ResNet-50) or a Swin Transformer and achieving feature interaction through a fusion module, for example the adaptive feature fusion method disclosed in Chinese patent CN113538442B, the dynamic feature selection method proposed in Chinese patent CN113392727B, and the cross-modal calibration module introduced in Chinese patent CN120673073A.
However, the above methods have three core defects: (1) temporal information is severely wasted: designed for single-frame images, they process video frame by frame and completely ignore inter-frame temporal correlation, so the segmentation results flicker and jump and temporal consistency is poor; (2) generalization is limited: built on ImageNet-pre-trained classification networks with limited training samples, their performance in unseen scenes such as complex backgrounds and extreme illumination falls far short of a vision foundation model (such as SAM2) trained on massive data; (3) efficiency is low: separate dual-stream encoders process the RGB and depth data independently, leading to a large parameter count and slow inference.

2. Fine-tuning methods based on a vision foundation model (such as SAM). To exploit the generalization ability of large models, some methods apply SAM to salient object detection, such as the SAM-based RGB-D image detection method disclosed in Chinese patent CN117876656A, which fine-tunes the depth-map branch with LoRA and fuses the results.
However, such methods still face unresolved technical bottlenecks: (1) temporal information is unused: still designed for static images, they cannot meet the temporal-consistency requirements of video detection; (2) spatial detail is lost: conventional LoRA uses linear projection layers, which struggle to capture the local spatial-geometric features required for pixel-level dense prediction, so edge segmentation accuracy is insufficient; (3) reliance on manual prompts: the memory-bank initialization of existing video segmentation methods (such as SAM2 and XMem) requires manual intervention such as a first-frame mask, clicks, or a bounding box, so fully automatic detection cannot be achieved.

In summary, the prior art cannot simultaneously deliver strong generalization, fine spatial-structure preservation, and fully automatic use of temporal information, and thus falls short of the practical requirements of RGB-D video salient object detection.

Disclosure of the Invention

The invention provides a multi-modal mixture-of-experts and memory-enhanced RGB-D video salient object detection method to overcome this deficiency of the prior art. The method requires no manual intervention, markedly improves the temporal consistency and edge segmentation accuracy of detection, reduces the parameter count and GPU memory usage, is suitable for fully automatic RGB-D video salient object detection in complex scenes, and has good generalization ability and practical value. The multi-modal mixture-of-experts and memory-enhanced RGB-D