US-12626371-B2 - Generating image object segmentations utilizing graph-cut partitioning in self-supervised object discovery
Abstract
The present disclosure is directed toward systems, methods, and non-transitory computer readable media that provide self-supervised object discovery systems that combine motion and appearance information to generate segmentation masks from a digital image or digital video and delineate one or more salient objects within the digital image/digital video. The disclosed systems utilize a neural network encoder to generate a fully connected graph based on image patches from the digital input, incorporating image patch feature and optical flow patch feature similarities to produce edge weights. The disclosed systems partition the generated graph to produce a segmentation mask. Furthermore, the disclosed systems iteratively train a segmentation network based on the segmentation mask as a pseudo-ground truth via a bootstrapped, self-training process. By utilizing both motion and appearance information to generate a bi-partitioned graph, the disclosed systems produce high-quality object segmentation masks that represent a foreground and background of digital inputs.
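The abstract describes edge weights built from a weighted linear combination of image patch feature and optical flow patch feature similarities, binarized with a threshold (claims 1 and 4). The following minimal NumPy sketch illustrates that construction; the coefficient `lam` and threshold `tau` are illustrative values chosen here, not values recited in the patent.

```python
import numpy as np

def edge_weights(img_feats, flow_feats, lam=0.7, tau=0.2):
    """Combine appearance and motion similarities into binary edge weights.

    img_feats / flow_feats: (N, D) per-patch image and optical-flow
    features. `lam` is the linear combination coefficient (the claims only
    require the two weights to differ) and `tau` is the threshold
    similarity edge-weight hyper-parameter; both values are illustrative.
    """
    def cosine_sim(a):
        # Row-normalize, then take pairwise dot products.
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        return a @ a.T

    # Weighted linear combination of the two cosine-similarity matrices.
    w = lam * cosine_sim(flow_feats) + (1.0 - lam) * cosine_sim(img_feats)
    # Binarize: keep an edge only where combined similarity clears tau.
    return (w >= tau).astype(float)
```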
Inventors
- Silky Singh
- Shripad Vilasrao Deshmukh
- Mausoom Sarkar
- Balaji Krishnamurthy
Assignees
- ADOBE INC.
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2023-09-29
Claims (20)
- 1 . A computer-implemented method comprising: determining, from a digital image, optical flow patch features representing a movement of visual elements corresponding to a plurality of image patches extracted from the digital image; determining, utilizing a neural network encoder, image patch features corresponding to the plurality of image patches extracted from the digital image; and generating a segmentation mask comprising a foreground region and a background region of the digital image based on a bi-partitioned graph comprising edge weights determined from a weighted linear combination of the optical flow patch features and the image patch features, wherein the weighted linear combination comprises a linear combination coefficient that weights the optical flow patch features at a first weight and the image patch features at a second weight different from the first weight.
- 2 . The computer-implemented method of claim 1 , further comprising: generating a plurality of nodes based on the plurality of image patches and edges between the plurality of nodes; generating the bi-partitioned graph comprising the plurality of nodes partitioned based on the edge weights.
- 3 . The computer-implemented method of claim 2 , wherein generating the bi-partitioned graph comprises partitioning the plurality of nodes based on a cosine similarity of the optical flow patch features and a cosine similarity of the image patch features.
- 4 . The computer-implemented method of claim 1 , further comprising: generating the bi-partitioned graph by determining a similarity measure for the optical flow patch features and the image patch features using the weighted linear combination; and converting the similarity measure into a binary similarity measure by normalizing the similarity measure utilizing a threshold similarity edge weight hyper-parameter.
- 5 . The computer-implemented method of claim 1 , wherein generating the segmentation mask comprises generating the segmentation mask based on a post-processing step utilizing a probabilistic graphical model to determine a binary segmentation based on individual pixels within the plurality of image patches.
- 6 . The computer-implemented method of claim 1 , further comprising: generating, utilizing a segmentation model comprising a segmentation encoder neural network and a segmentation decoder neural network, an initial segmentation mask from the digital image; and updating initial parameters of the segmentation decoder neural network of the segmentation model with fixed parameters for the segmentation encoder neural network based on a difference between the segmentation mask and the initial segmentation mask.
- 7 . The computer-implemented method of claim 6 , further comprising, in response to updating the initial parameters of the segmentation decoder neural network of the segmentation model, iteratively generating pseudo-ground-truth segmentation masks utilizing the segmentation model and modifying parameters of the segmentation model based on the pseudo-ground-truth segmentation masks.
- 8 . The computer-implemented method of claim 7 , further comprising generating, utilizing the segmentation model with the modified parameters, an additional segmentation mask comprising an additional foreground region and an additional background region of an additional digital image.
- 9 . The computer-implemented method of claim 1 , further comprising determining, utilizing the neural network encoder, the optical flow patch features from optical flow values in a color space.
- 10 . The computer-implemented method of claim 1 , further comprising generating, for storage on one or more memory devices, labels for the digital image indicating the foreground region and the background region.
- 11 . A system comprising: one or more memory devices comprising a digital video; and one or more processors configured to cause the system to: determine, for the digital video, optical flow patch features indicating a distribution of velocities of visual elements corresponding to a plurality of image patches from a video frame of the digital video; determine, utilizing a neural network encoder, image patch features corresponding to the plurality of image patches from the video frame of the digital video; generate a bi-partitioned graph comprising a plurality of nodes corresponding to the plurality of image patches and edges between the plurality of nodes with corresponding edge weights based on a similarity measure between the plurality of nodes determined using a weighted linear combination of the optical flow patch features and the image patch features, wherein the weighted linear combination comprises a linear combination coefficient that weights the optical flow patch features at a first weight and the image patch features at a second weight different from the first weight; and generate, for storage on one or more memory devices, labels for the digital video indicating a foreground region and a background region from a segmentation mask based on the bi-partitioned graph.
- 12 . The system of claim 11 , further comprising: generating, utilizing a segmentation model comprising a segmentation encoder neural network and a segmentation decoder neural network, an initial segmentation mask from the video frame based on using the segmentation mask as a pseudo-ground truth; and updating initial parameters of the segmentation decoder neural network of the segmentation model with fixed parameters for the segmentation encoder neural network based on a difference between the segmentation mask and the initial segmentation mask.
- 13 . The system of claim 12 , further comprising, in response to updating the initial parameters of the segmentation decoder neural network of the segmentation model, iteratively generating a set of pseudo-ground-truth segmentation masks utilizing the segmentation model and modifying parameters of the segmentation model based on the set of pseudo-ground-truth segmentation masks.
- 14 . The system of claim 13 , further comprising determining, utilizing the segmentation model, an object of an input digital image.
- 15 . The system of claim 11 , wherein generating the bi-partitioned graph comprises generating a fully connected graph based on the weighted linear combination comprising a cosine similarity of the optical flow patch features and a cosine similarity of the image patch features; and converting the similarity measure into a binary similarity measure by normalizing the similarity measure.
- 16 . The system of claim 11 , further comprising generating the bi-partitioned graph by generating an adjacency matrix comprising the plurality of nodes and edge weights by incorporating motion signals corresponding to the optical flow patch features based on the similarity measure between the plurality of nodes determined using the weighted linear combination of the optical flow patch features and the image patch features.
- 17 . A non-transitory computer readable medium storing executable instructions which, when executed by a processing device, cause the processing device to perform operations comprising: determining, for a digital video, optical flow patch features representing movement of visual elements corresponding to a plurality of image patches from a video frame of the digital video; determining, utilizing a neural network encoder, image patch features corresponding to the plurality of image patches from the video frame of the digital video; generating a bi-partitioned graph comprising a plurality of nodes corresponding to the plurality of image patches and edges between the plurality of nodes with corresponding edge weights based on a similarity measure between the plurality of nodes, wherein the similarity measure comprises a linear combination coefficient that weights the optical flow patch features at a first weight and the image patch features at a second weight different from the first weight; and generating a segmentation mask comprising a foreground region and a background region based on the bi-partitioned graph.
- 18 . The non-transitory computer readable medium of claim 17 , further comprising generating the similarity measure between the plurality of nodes determined using the weighted linear combination comprising a cosine similarity of the optical flow patch features and a cosine similarity of the image patch features.
- 19 . The non-transitory computer readable medium of claim 17 , further comprising generating the segmentation mask based on a binary segmentation of the video frame by partitioning the video frame based on the corresponding edge weights.
- 20 . The non-transitory computer readable medium of claim 17 , further comprising: generating, utilizing a segmentation model comprising a segmentation encoder neural network and a segmentation decoder neural network, an initial segmentation mask from the video frame; updating initial parameters of the segmentation decoder neural network of the segmentation model with fixed parameters for the segmentation encoder neural network based on a difference between the segmentation mask and the initial segmentation mask; in response to updating the initial parameters of the segmentation decoder neural network of the segmentation model, iteratively generating pseudo-ground-truth segmentation masks utilizing the segmentation model and modifying parameters of the segmentation model based on the pseudo-ground-truth segmentation masks; and determining, utilizing the segmentation model, an object of an input digital image.
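The claims recite a bi-partitioned graph over patch nodes but do not fix a particular cut algorithm. One standard choice, sketched below under that assumption, is to split on the sign of the Fiedler vector (second-smallest eigenvector) of the symmetrically normalized graph Laplacian, the usual normalized-cut relaxation:

```python
import numpy as np

def bipartition(adj):
    """Split a patch graph in two via the Fiedler vector of the
    symmetrically normalized Laplacian (a standard normalized-cut
    relaxation). This is one common choice, not necessarily the exact
    cut procedure used by the claimed systems."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]            # second-smallest eigenvector
    return fiedler >= 0             # boolean side assignment per node
```

On a graph whose edge weights encode combined motion/appearance similarity, the two sides of this cut correspond to the foreground and background regions from which the segmentation mask is formed.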
Description
BACKGROUND
Advancements in computing devices and computer design applications have given rise to a variety of innovations in computer image analysis and editing software. For example, image design systems have been developed that provide tools for discovering and classifying visual objects within digital content across multiple domains, such as visual digital content used in autonomous driving, augmented reality, human-computer interaction, and video summarization. For example, some computer design applications separate regions of a video sequence into foreground and background regions to predict regions containing visual objects. Other computer design applications provide tools to analyze information from digital images in order to localize objects within those images. To localize regions containing visual objects, many current computer design applications use deep neural networks trained on large, annotated datasets. Notably, partly due to the complexity inherent in visual object classification, it is often difficult for computer systems to produce high-quality object segmentation masks in a timely manner, with limited computing resources, across a variety of objects given differing image/video qualities and object boundaries. Accordingly, the state of the art exhibits a number of shortcomings with regard to flexibility, accuracy, and computational efficiency when analyzing, discovering, and segmenting visual digital content.
SUMMARY
One or more embodiments provide benefits and/or solve one or more of the foregoing or other problems in the art with systems, methods, and non-transitory computer readable storage media that provide a self-supervised object discovery system that combines motion and appearance information to generate a segmentation mask from a digital image or digital video and delineate one or more salient objects within the digital image/digital video.
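The summary describes training a segmentation model against the graph-cut mask as a pseudo-ground truth while the encoder stays fixed and only the decoder is updated. The toy sketch below illustrates that frozen-encoder / trained-decoder split; every name here is an illustrative stand-in (a fixed random projection for the encoder, a per-pixel logistic classifier for the decoder), not the patent's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all names illustrative): the frozen "encoder" is a fixed
# random projection; the trainable "decoder" is a per-pixel logistic
# classifier fit against a pseudo-ground-truth mask.
X = rng.normal(size=(200, 8))             # per-pixel input features
W_enc = rng.normal(size=(8, 4))           # encoder weights (never updated)
feats = np.tanh(X @ W_enc)                # frozen encoder output
pseudo_gt = (X[:, 0] > 0).astype(float)   # graph-cut mask stand-in

w = np.zeros(4)                           # decoder parameters (trained)
b = 0.0

def forward(f):
    return 1.0 / (1.0 + np.exp(-(f @ w + b)))

def bce(p, y):
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

loss_before = bce(forward(feats), pseudo_gt)
for _ in range(300):                      # gradient steps on decoder only
    grad_logit = forward(feats) - pseudo_gt   # d(BCE)/d(logit)
    w -= 0.1 * feats.T @ grad_logit / len(X)
    b -= 0.1 * grad_logit.mean()
loss_after = bce(forward(feats), pseudo_gt)
```

In the bootstrapped, self-training stage described next, the model's own outputs would replace `pseudo_gt` on later iterations, and the updated model would generate masks for new inputs.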
In particular, in one or more implementations, the disclosed systems provide a digital video to a neural network encoder to generate a segmentation mask in a graph-cut process that leverages motion information in combination with appearance information. For example, the disclosed systems utilize the neural network encoder to generate a fully connected graph based on image patches from the digital input, incorporating image patch feature and optical flow patch feature similarities to produce edge weights. In certain embodiments, the disclosed systems partition the generated graph to produce a segmentation mask representing the foreground and background of the digital input. Furthermore, in some implementations, the disclosed systems perform an initial training operation on a segmentation model using the segmentation mask as a pseudo-ground truth. In addition, in some implementations, the disclosed systems iteratively train the segmentation model based on its own outputs via a bootstrapped, self-training process. By utilizing both motion and appearance information to generate a bi-partitioned graph, the disclosed systems produce high-quality object segmentation masks in a self-supervised object discovery approach.
BRIEF DESCRIPTION OF THE DRAWINGS
This disclosure will describe one or more example implementations of the systems and methods with additional specificity and detail by referencing the accompanying figures. The following paragraphs briefly describe those figures, in which:
FIG. 1 illustrates a schematic diagram of an example environment of a graph-cut partitioning system in accordance with one or more embodiments;
FIG. 2 illustrates an example overview of using a graph-cut partitioning system to generate a segmentation mask and detect an object within a digital video in accordance with one or more embodiments;
FIG. 3 illustrates an example of generating a segmentation mask from a digital video via a graph-cut process in accordance with one or more embodiments;
FIG. 4A illustrates an example of an initial training iteration of a segmentation model using a segmentation mask from a graph-cut process as a pseudo-ground truth in accordance with one or more embodiments;
FIG. 4B illustrates an example of an iterative training process for a segmentation model using generated segmentation masks as pseudo-ground truths in accordance with one or more embodiments;
FIG. 5 illustrates an example of a segmentation mask generated by the graph-cut partitioning system in accordance with one or more embodiments;
FIG. 6 illustrates a comparison of qualitative examples of segmentation masks generated from video frames of digital videos by the graph-cut partitioning system in accordance with one or more embodiments;
FIG. 7 illustrates qualitative examples of segmentation masks generated from digital images by the graph-cut partitioning system in accordance with one or more embodiments;
FIG. 8 illustrates a schematic diagram of a graph-cut partitioning system in accordance with one or more embodiments;
FIG. 9 illustrates a flowchart of a series of a