CN-117036838-B - RGB-D image salient object detection method with weak supervision of category


Abstract

The invention discloses a category weakly supervised RGB-D image salient object detection method comprising a training stage and an updating stage. In the training stage, a pseudo label supervises the training of an image salient object detection model, which outputs a saliency map; at the same time, a category label and the pseudo label supervise the training of a vision-language matching model, which outputs a category similarity vector and a category-aware saliency map. In the updating stage, the image is masked by the category-aware saliency map and by the pseudo label, and the category similarities of the two masked images are used as weights to sum the category-aware saliency map and the pseudo label, thereby updating the pseudo label. The updating stage occurs within the training stage. In the testing stage, the trained image salient object detection model is applied to any input RGB-D image and outputs the final saliency map. The invention supervises model training with category labels and pseudo labels and requires no pixel-level ground-truth labels.

Inventors

LIU ZHENGYI
ZHANG ZHILI
HAN LI

Assignees

Anhui University (安徽大学)

Dates

Publication Date: 20260508
Application Date: 20220427

Claims (8)

1. A category weakly supervised RGB-D image salient object detection method, characterized by comprising a training stage and an updating stage; in the training stage, a pseudo label supervises the training of an image salient object detection model, which outputs a saliency map; in the training stage, an initial pseudo label pm is generated from the input image x by a conventional image processing method based on a center-dark channel prior; meanwhile, category labels and pseudo labels supervise the training of a vision-language matching model, which outputs category similarity vectors and category-aware saliency maps; in the updating stage, first, two groups of mask images are formed by masking with the category-aware saliency map and with the pseudo label, respectively; then, the two groups of mask images are input into the vision-language matching model to obtain the corresponding category similarity vectors, and the confidence scores of the category-aware saliency map and of the pseudo label are obtained according to the category labels; finally, the category-aware saliency map and the pseudo label are weighted and summed, with the confidence scores as weights, to update the pseudo label; to obtain the confidence scores, the position corresponding to the category label is indexed in each of the two groups of category similarity vectors, the values at those positions are read out, a normalization operation is applied to the two groups of values, and the two resulting groups of probability values are used as the confidence score of the category-aware saliency map and the confidence score of the pseudo label, respectively; the updating stage occurs within the training stage; in the testing stage, the trained image salient object detection model is applied to any input RGB-D image and outputs the final saliency map.
2. A category weakly supervised RGB-D image salient object detection method as recited in claim 1, wherein during the training stage, the pseudo label pm supervises the training of the image salient object detection model $M_{SOD}$, which outputs the saliency map m: $m = M_{SOD}(x, pm)$; wherein the first parameter of the function $M_{SOD}(\cdot,\cdot)$ represents the image input of the model, the second parameter represents the supervision signal of model training, and the function returns the output of the model, namely the saliency map m; at the same time, the category label cls and the pseudo label pm supervise the training of the vision-language matching model $M_{VL}$, which outputs the category similarity vector csv and the category-aware saliency map cm: $[csv, cm] = M_{VL}(x, t, cls, pm)$; wherein the first parameter of the function $M_{VL}(\cdot,\cdot,\cdot,\cdot)$ represents the image input of the model, the second parameter represents the text input t formed by all salient categories in the training set, the third parameter represents the supervision signal category label cls of the model, and the fourth parameter represents the supervision signal pseudo label pm of the model; the function returns the two outputs of the model, namely the category similarity vector csv and the category-aware saliency map cm; the second parameter is fixed during the whole training process.
3. A category weakly supervised RGB-D image salient object detection method as claimed in claim 1, wherein in the updating stage, the two groups of category similarity vectors of the image masked by the category-aware saliency map cm and by the pseudo label pm are used to weight and sum the category-aware saliency map cm and the pseudo label pm, and the result updates the pseudo label pm; the specific process is as follows: first, the input image x is masked by the category-aware saliency map cm and by the pseudo label pm to form the category-aware saliency map mask image and the pseudo label mask image, respectively, wherein the mask is the category-aware saliency map or the pseudo label, and the masking uses a Gaussian smoothing operation and a convolution operation; second, the category-aware saliency map mask image and the pseudo label mask image are fed into the vision-language matching model $M_{VL}$ for testing, which outputs the category similarity vector corresponding to each mask image, wherein the third and fourth parameters of the model $M_{VL}$ are null, indicating that testing is performed here and no supervision signal is required, and the second output is ignored; then, according to the category similarity indicated by the category label cls, the confidence scores of the category-aware saliency map cm and of the pseudo label pm are formed by extracting, from each category similarity vector, the value at the position of the category label cls and applying a softmax function to the two extracted values; finally, with the confidence scores as weights, the category-aware saliency map cm and the pseudo label pm are weighted and summed to form the updated pseudo label.
4. A category weakly supervised RGB-D image salient object detection method as recited in claim 1, wherein the updating stage occurs within the training stage: one update is performed after each fixed interval of training, and, weighing model performance against time cost, the interval is set to 3.
5. A category weakly supervised RGB-D image salient object detection method as claimed in claim 2, wherein the image salient object detection model $M_{SOD}$ adopts a U-Net structure with four-channel input fusion of the RGB-D image.
6. A category weakly supervised RGB-D image salient object detection method as claimed in claim 2, wherein the loss function of the image salient object detection model $M_{SOD}$ is defined by a pixel position aware loss function, which computes the pixel position matching loss between the saliency map and the pseudo label.
7. A category weakly supervised RGB-D image salient object detection method as recited in claim 2, wherein the vision-language matching model $M_{VL}$ consists of a visual encoder and a language encoder, both of which are the visual encoder and language encoder of CLIP; the visual encoder encodes the input image x to form high-level features, and global features are formed from them by a global average pooling (GAP) operation; the high-level features and the global features are each processed by multi-head self-attention (MHSA) to form high-level signals and global signals; the language encoder encodes the texts of all salient categories in the training set to form category text features, wherein each text is the text representation of one salient category; similarity is computed between the category text features and the high-level signals to form a category similarity matrix; the category similarity matrix is concatenated with the high-level features and sent to a Decoder to generate the category-aware saliency map cm, wherein the Decoder performs concatenation and deconvolution operations from the high level to the low level; at the same time, similarity is computed between the category text features and the global signals to form the category similarity vector csv.
8. A category weakly supervised RGB-D image salient object detection method as recited in claim 2, wherein the loss function of the vision-language matching model $M_{VL}$ is defined in terms of a pixel position aware loss function, which computes the pixel position matching loss between the saliency map and the pseudo label, and a cross-entropy loss.
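
The pseudo label update described in claims 1 and 3 can be illustrated with a minimal sketch. It assumes the two category similarity vectors have already been obtained from the vision-language matching model, that each image carries a single category label given as an index, and that saliency maps are plain nested lists of floats. The names `confidence_scores`, `update_pseudo_label`, `csv_cm`, `csv_pm`, `cls_index`, `s_cm`, and `s_pm` are hypothetical and do not come from the patent text; masking and model inference are not shown.

```python
import math


def confidence_scores(csv_cm, csv_pm, cls_index):
    """Return (s_cm, s_pm): softmax over the two values read at the label index.

    csv_cm, csv_pm: category similarity vectors for the image masked by the
        category-aware saliency map and by the pseudo label (hypothetical names).
    cls_index: position of the category label (single label is an assumption).
    """
    a = csv_cm[cls_index]
    b = csv_pm[cls_index]
    peak = max(a, b)  # subtract the maximum for numerical stability
    exp_a = math.exp(a - peak)
    exp_b = math.exp(b - peak)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


def update_pseudo_label(cm, pm, csv_cm, csv_pm, cls_index):
    """Weighted sum of cm and pm, using the confidence scores as weights."""
    s_cm, s_pm = confidence_scores(csv_cm, csv_pm, cls_index)
    return [
        [s_cm * c + s_pm * p for c, p in zip(row_cm, row_pm)]
        for row_cm, row_pm in zip(cm, pm)
    ]


if __name__ == "__main__":
    cm = [[1.0, 0.0], [0.5, 0.5]]   # toy category-aware saliency map
    pm = [[0.0, 0.0], [0.5, 1.0]]   # toy pseudo label
    csv_cm = [0.1, 0.8, 0.1]        # toy similarity vectors over 3 categories
    csv_pm = [0.2, 0.4, 0.4]
    print(update_pseudo_label(cm, pm, csv_cm, csv_pm, cls_index=1))
```

Because the two scores come from a softmax, they sum to 1, so the updated pseudo label stays within the value range of its two inputs.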

Description

RGB-D image salient object detection method with weak supervision of category

Technical Field

The invention relates to the field of computer vision, in particular to a category weakly supervised RGB-D image salient object detection method.

Background

RGB-D image salient object detection aims to extract salient objects from RGB-D images by combining color and depth information. The extraction simulates the human visual attention mechanism: only objects that attract the human eye count as salient objects. The traditional approach trains an RGB-D image salient object detection model with a fully supervised strategy and then outputs a saliency map for an input RGB-D image. This approach relies on a large amount of manually annotated data. To reduce the dependence on manual annotation, methods that use weak supervision signals such as category information, bounding boxes, points, scribbles, and counts have gradually been proposed and have made some progress. Category information is readily available, and the recently proposed CapS dataset provides category information for an RGB-D image salient object detection training dataset; the present invention therefore focuses on using category information to achieve weakly supervised RGB-D image salient object detection.

Existing category weakly supervised methods, such as WSS (from "Learning to Detect Salient Objects with Image-Level Supervision"), ASMO (from "Weakly Supervised Salient Object Detection Using Image Labels"), MSW (from "Multi-Source Weak Supervision for Saliency Detection"), and MFNet (from "MFNet: Multi-filter Directive Network for Weakly Supervised Salient Object Detection"), train classification models on the ImageNet or COCO datasets, then generate class activation maps from the features of the classification models as pseudo labels to supervise the training of the image salient object detection model. These methods need pretraining on a large-scale classification dataset, and the generated class activation maps are not accurate enough: they highlight only the discriminative features related to the category and cannot attend to the whole object, so a model trained with class activation maps as the supervision signal inevitably loses accuracy.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a category weakly supervised RGB-D image salient object detection method that uses category labels, assisted by pseudo labels provided by an unsupervised method, to detect salient objects in RGB-D images.

The invention discloses a category weakly supervised RGB-D image salient object detection method comprising a training stage and an updating stage. In the training stage, a pseudo label supervises the training of an image salient object detection model, which outputs a saliency map; at the same time, a category label and the pseudo label supervise the training of a vision-language matching model, which outputs a category similarity vector and a category-aware saliency map. In the updating stage, the image is masked by the category-aware saliency map and by the pseudo label, and the category similarities of the two masked images are used as weights to sum the category-aware saliency map and the pseudo label, thereby updating the pseudo label. The updating stage occurs within the training stage. In the testing stage, the trained image salient object detection model is applied to any input RGB-D image and outputs the final saliency map.
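
How the updating stage can sit inside the training stage is sketched below as a plain control loop. This is an illustration under stated assumptions, not the patented implementation: `train_with_label_updates`, `train_step`, `update_step`, and `update_every` are hypothetical names, the two callables stand in for model training and pseudo label refreshing, and treating the update interval as a number of epochs is an assumption (claim 4 gives the value 3 for the interval).

```python
def train_with_label_updates(num_epochs, update_every, train_step, update_step,
                             pseudo_labels):
    """Train with the current pseudo labels and refresh them at a fixed interval.

    train_step(pseudo_labels): hypothetical callable that trains both models
        for one epoch under the current pseudo labels.
    update_step(pseudo_labels): hypothetical callable that returns refreshed
        pseudo labels.
    update_every: update interval; the unit (epochs) is an assumption.
    """
    for epoch in range(1, num_epochs + 1):
        train_step(pseudo_labels)
        if epoch % update_every == 0:
            pseudo_labels = update_step(pseudo_labels)
    return pseudo_labels


if __name__ == "__main__":
    seen = []  # pseudo label "version" seen by each training epoch

    final_version = train_with_label_updates(
        num_epochs=7,
        update_every=3,
        train_step=seen.append,
        update_step=lambda version: version + 1,
        pseudo_labels=0,
    )
    print(seen, final_version)
```

In this toy run the pseudo labels are represented only by a version counter, which makes it easy to see which epochs train on which generation of labels.
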
Further, in the category weakly supervised RGB-D image salient object detection method, in the training stage an initial pseudo label pm is generated from the input image x by an unsupervised method, namely the method described in the paper "An Innovative Salient Object Detection Using Center-Dark Channel Prior"; the pseudo label pm supervises the training of the image salient object detection model $M_{SOD}$, which outputs the saliency map m: $m = M_{SOD}(x, pm)$, wherein the first parameter of the function $M_{SOD}(\cdot,\cdot)$ represents the image input of the model, the second parameter represents the supervision signal of model training, and the function returns the output of the model. Meanwhile, the category label cls and the pseudo label pm supervise the training of the vision-language matching model $M_{VL}$, which outputs a category similarity vector csv and a category-aware saliency map cm: $[csv, cm] = M_{VL}(x, t, cls, pm)$. The first parameter of the function $M_{VL}(\cdot,\cdot,\cdot,\cdot)$ represents the image input of the model, the second parameter represents the text input formed by all salient categories in the training set, the third parameter represents the supervision signal category label cls of the model, and the fourth parameter represents the supervision signal pseudo label pm of the model; the function returns two outputs of the model, namely a category similarity vect