US-20260127844-A1 - TRAINING AND UTILIZING LARGE LANGUAGE MODELS TO GENERATE GROUPS OF SEGMENTATION MASKS FOR DIGITAL IMAGES FROM VISION OR LANGUAGE INPUT FEATURES
Abstract
The present disclosure relates to systems, non-transitory computer-readable media, and methods for grouping segmentation masks from digital images. The disclosed system generates, utilizing a segmentation model, a set of candidate segmentation masks for objects portrayed in a digital image. In addition, the disclosed system generates, utilizing a mask projector model, a set of mask tokens from the set of candidate segmentation masks. Moreover, the disclosed system selects, utilizing a large language model, a group of segmentation masks from the set of candidate segmentation masks based on the set of mask tokens, wherein the group of segmentation masks satisfies a mask group classification threshold. Further, the disclosed system provides, for display via a client device, the group of segmentation masks for the digital image.
Inventors
- Zijun Wei
- Shengcao Cao
- Jason Wen Yong Kuen
- Kangning Liu
- Lingzhi Zhang
- Jiuxiang Gu
- Hyun Joon Jung
Assignees
- ADOBE INC.
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-01
Claims (20)
- 1 . A computer-implemented method comprising: generating, utilizing a segmentation model, a set of candidate segmentation masks for objects portrayed in a digital image; generating, utilizing a mask projector model, a set of mask tokens from the set of candidate segmentation masks; selecting, utilizing a large language model, a group of segmentation masks from the set of candidate segmentation masks based on the set of mask tokens, wherein the group of segmentation masks satisfies a mask group classification threshold; and providing, for display via a client device, the group of segmentation masks for the digital image.
- 2 . The computer-implemented method of claim 1 , further comprising: receiving, from the client device, a reference mask; generating, utilizing the mask projector model, a reference mask token from the reference mask; and selecting, utilizing the large language model, the group of segmentation masks from the set of candidate segmentation masks based on the reference mask token and the set of mask tokens.
- 3 . The computer-implemented method of claim 1 , further comprising: receiving, from the client device, language input corresponding to the digital image; generating, utilizing a text tokenizer, a set of text tokens associated with the language input from the client device; and selecting, utilizing the large language model, the group of segmentation masks based on the set of text tokens and the set of mask tokens.
- 4 . The computer-implemented method of claim 1 , further comprising: generating, utilizing a plurality of visual backbone models, a set of global visual tokens associated with the digital image; and selecting, utilizing the large language model, the group of segmentation masks based on the set of global visual tokens and the set of mask tokens.
- 5 . The computer-implemented method of claim 1 , wherein generating the set of mask tokens comprises, for a candidate mask of the set of candidate segmentation masks: generating, utilizing a plurality of visual backbone models, a localized candidate mask feature map for the candidate mask from the digital image; and converting, utilizing the mask projector model, the localized candidate mask feature map for the candidate mask into a mask token for the candidate mask.
- 6 . The computer-implemented method of claim 1 , wherein selecting the group of segmentation masks comprises: generating, utilizing a classification machine learning model, a mask group classification probability prediction from a mask token corresponding to a candidate segmentation mask of the set of candidate segmentation masks; and selecting the candidate segmentation mask for the group of segmentation masks by comparing the mask group classification probability prediction to the mask group classification threshold.
- 7 . The computer-implemented method of claim 1 , further comprising: generating, utilizing the large language model, client response text based on the set of mask tokens; and providing, for display via the client device, the client response text and the group of segmentation masks.
- 8 . The computer-implemented method of claim 1 , further comprising: generating a group mask extraction training dataset comprising a training image, training candidate masks for the training image, a ground truth training mask group for the training image, and training text descriptions corresponding to training candidate masks; and training the large language model to generate mask groups for individual digital images utilizing the group mask extraction training dataset.
- 9 . The computer-implemented method of claim 8 , wherein generating the group mask extraction training dataset comprises: extracting localized regions of the training image utilizing the training candidate masks; generating, utilizing one or more large language models, the training text descriptions of the training candidate masks from the localized regions of the training image; and generating, utilizing at least one large language model, the ground truth training mask group for the training image from the training text descriptions and the training candidate masks.
- 10 . A system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices that cause the system to perform operations comprising: generating, utilizing a segmentation model, candidate segmentation masks for objects portrayed in a digital image; generating, utilizing a large language model, latent feature vectors for the candidate segmentation masks; generating, utilizing a classification machine learning model, group classification predictions for the candidate segmentation masks from the latent feature vectors; and selecting a group of segmentation masks for the digital image based on the group classification predictions for the candidate segmentation masks.
- 11 . The system of claim 10 , wherein the operations further comprise: training the large language model by comparing the group of segmentation masks to a ground truth mask group for the digital image.
- 12 . The system of claim 11 , wherein the operations further comprise generating the ground truth mask group by: extracting localized regions of the digital image utilizing the candidate segmentation masks; generating, utilizing one or more large language models, text descriptions of the candidate segmentation masks from the localized regions of the digital image; and generating, utilizing at least one large language model, the ground truth mask group for the digital image from the text descriptions and the candidate segmentation masks.
- 13 . The system of claim 12 , wherein the operations further comprise selecting the digital image to include in a group mask extraction training dataset for training the large language model based on comparing a quantity of the objects portrayed in the digital image with an object quantity threshold.
- 14 . The system of claim 10 , wherein the operations further comprise providing, for display via a client device, the group of segmentation masks for the digital image.
- 15 . The system of claim 10 , wherein the operations further comprise: receiving at least one of a reference mask or a language input corresponding to the digital image; generating one or more sets of tokens based on at least one of the reference mask or the language input; and generating the group classification predictions based on the one or more sets of tokens.
- 16 . The system of claim 15 , wherein the one or more sets of tokens comprises at least one of a set of text tokens, a set of mask tokens, or a set of global visual tokens.
- 17 . A non-transitory computer-readable medium storing instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: generating, utilizing a segmentation model, a set of candidate segmentation masks for objects portrayed in a digital image; generating, utilizing a mask projector model, a set of mask tokens from the set of candidate segmentation masks; selecting, utilizing a large language model, a group of segmentation masks from the set of candidate segmentation masks based on the set of mask tokens, wherein the group of segmentation masks satisfies a mask group classification threshold; and providing, for display via a client device, the group of segmentation masks for the digital image.
- 18 . The non-transitory computer-readable medium of claim 17 , wherein the operations further comprise: generating, utilizing the mask projector model, a reference mask token from a reference mask; and selecting, utilizing the large language model, the group of segmentation masks from the set of candidate segmentation masks based on the reference mask token and the set of mask tokens.
- 19 . The non-transitory computer-readable medium of claim 17 , wherein the operations further comprise: receiving, from the client device, language input corresponding to the digital image; generating a set of text tokens associated with the language input from the client device; and selecting, utilizing the large language model, the group of segmentation masks based on the set of text tokens and the set of mask tokens.
- 20 . The non-transitory computer-readable medium of claim 17 , wherein generating the set of mask tokens comprises, for a candidate mask of the set of candidate segmentation masks: generating, utilizing a visual backbone model, a localized candidate mask feature map for the candidate mask from the digital image; and converting, utilizing the mask projector model, the localized candidate mask feature map for the candidate mask into a mask token for the candidate mask.
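The pipeline recited in claim 1 — generate candidate masks, project each mask to a token, score the tokens with a large language model, and keep the masks that satisfy the classification threshold — can be sketched as follows. This is a hypothetical illustration only: the function names, placeholder models, and token format are assumptions standing in for the segmentation model, mask projector model, and large language model the claims actually recite.

```python
# Hedged sketch of the claim-1 method. Every "model" below is a stub.

def segment(image):
    """Stub segmentation model: produce candidate masks (claim 1, step 1)."""
    return [{"id": i} for i in range(4)]

def project(mask):
    """Stub mask projector model: convert a mask to a token (step 2)."""
    return [float(mask["id"])]  # placeholder embedding

def llm_group_score(token):
    """Stub LLM classification head: probability the mask is in the group."""
    return 0.9 if token[0] < 2 else 0.1

def select_group(image, threshold=0.5):
    """Keep candidate masks whose score satisfies the threshold (step 3)."""
    candidates = segment(image)
    tokens = [project(m) for m in candidates]
    return [m for m, t in zip(candidates, tokens)
            if llm_group_score(t) >= threshold]

group = select_group(image=None)
print([m["id"] for m in group])  # masks 0 and 1 satisfy the threshold
```

In this toy run the two masks whose stub scores exceed 0.5 form the selected group; in the disclosed system the score would come from the large language model's mask group classification probability prediction (claim 6).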
Description
BACKGROUND

Recent years have seen significant improvements in hardware and software platforms for image segmentation. For example, conventional systems utilize computer-implemented models to extract a mask for a visual entity portrayed in a digital image. To illustrate, some conventional systems can utilize machine learning approaches, such as convolutional neural networks, to detect an entity and select pixels in the image that correspond to the detected entity. However, such conventional systems have a number of technical deficiencies with regard to accuracy, flexibility, and efficiency of implementing computing devices.

BRIEF SUMMARY

Embodiments of the present disclosure provide benefits and/or solve one or more of the foregoing or other problems in the art with systems, non-transitory computer-readable media, and methods for training and utilizing a group segmentation machine learning model to generate groups of segmentation masks for digital images from vision or language input features. To illustrate, in one or more implementations, the disclosed systems utilize a segmentation model to generate a pool of candidate masks for a digital image and utilize a large language model to intelligently select a group of related masks from the pool. In some examples, the disclosed systems select a group of masks using one or more computer vision and/or natural language features. To illustrate, the disclosed systems receive a natural language input and/or a reference mask from a client device and convert these inputs to tokens for utilization in a large language model. For example, in some implementations the disclosed systems select a group of masks by utilizing projector models to generate mask tokens associated with the pool of candidate masks, text tokens associated with a natural language input, and/or reference mask tokens associated with pertinent computer vision features.
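The multi-modal token assembly described above — combining mask tokens with optional text tokens, a reference mask token, and global visual tokens before the large language model processes them — can be sketched as below. The ordering convention and token values are illustrative assumptions; the disclosure does not specify a concrete sequence layout.

```python
# Hypothetical assembly of one LLM input sequence from multi-modal tokens.

def build_llm_input(mask_tokens, text_tokens=None,
                    reference_token=None, global_visual_tokens=None):
    """Concatenate optional global visual, text, and reference-mask tokens
    with the candidate-mask tokens into a single input sequence."""
    sequence = []
    if global_visual_tokens:          # global visual tokens for the image
        sequence.extend(global_visual_tokens)
    if text_tokens:                   # tokens from a natural language input
        sequence.extend(text_tokens)
    if reference_token is not None:   # token from a client-provided reference mask
        sequence.append(reference_token)
    sequence.extend(mask_tokens)      # one token per candidate mask
    return sequence

seq = build_llm_input(["m0", "m1"], text_tokens=["t0"], reference_token="r")
# seq == ["t0", "r", "m0", "m1"]
```

Each optional input mirrors a dependent claim: text tokens correspond to language input, the reference token to a client-supplied reference mask, and the global visual tokens to features from the visual backbone models.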
In one or more embodiments, the disclosed systems process these various tokens with a large language model to generate and provide various multi-modal responses to client devices, including groups of related masks and/or natural language responses. By grouping masks using computer vision and natural language, the disclosed systems can realize improved accuracy, efficiency, and flexibility for image segmentation tasks and higher practicality for various segmentation applications.

As mentioned, in some implementations the disclosed systems also train a group segmentation machine learning model to generate groups of segmentation masks for individual digital images. For example, the disclosed systems generate a group mask extraction training dataset for training a mask grouping model. To illustrate, the disclosed systems identify an image dataset, generate candidate masks for the image dataset (utilizing a segmentation model), generate dense descriptions for the candidate masks (utilizing a multi-modal large language model), and generate training mask groups with explanations (utilizing an additional large language model). In one or more implementations, the disclosed systems utilize this annotation pipeline to generate an image dataset for scalable and low-computational cost training data generation. Moreover, in some embodiments the disclosed systems utilize the training dataset to modify parameters of the group segmentation machine learning model for improved accuracy in generating groups of segmentation masks for individual digital images from various multi-modal inputs.

Additional features and advantages of one or more embodiments of the present disclosure are outlined in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
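The annotation pipeline summarized above — extract localized regions with the candidate masks, generate dense descriptions for each region, and derive a ground truth mask group from the descriptions — can be sketched as follows. The stub functions are placeholders standing in for the multi-modal large language model and the additional large language model the disclosure names; their behavior here is purely illustrative.

```python
# Hedged sketch of the group mask extraction training dataset pipeline.

def crop_region(image, mask):
    """Stand-in for extracting the localized region a candidate mask covers."""
    return (image, mask)

def describe_region(region):
    """Stand-in for the multi-modal LLM that writes a dense description."""
    _, mask = region
    return f"description of mask {mask}"

def propose_ground_truth_group(descriptions, masks):
    """Stand-in for the LLM that groups related masks from the
    descriptions; this stub simply keeps every mask."""
    return list(masks)

def build_training_example(image, candidate_masks):
    """Assemble one entry of the group mask extraction training dataset."""
    regions = [crop_region(image, m) for m in candidate_masks]
    descriptions = [describe_region(r) for r in regions]
    group = propose_ground_truth_group(descriptions, candidate_masks)
    return {"image": image, "candidate_masks": candidate_masks,
            "descriptions": descriptions, "ground_truth_group": group}

example = build_training_example("img.png", ["m0", "m1", "m2"])
```

In training, the disclosed systems would compare model-selected groups against `ground_truth_group` to update the group segmentation model's parameters; a filtering step (e.g., the object quantity threshold of claim 13) could decide which images enter the dataset at all.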
BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description provides one or more embodiments with additional specificity and detail through the use of the accompanying drawings, as briefly described below.

FIG. 1 illustrates a diagram of an environment in which a mask group system operates in accordance with one or more embodiments.

FIG. 2 illustrates generating a group of segmentation masks utilizing a mask grouping model in accordance with one or more embodiments.

FIG. 3 illustrates an example architecture of a mask grouping model generating a group of segmentation masks from a digital image and multi-modal inputs in accordance with one or more embodiments.

FIG. 4 illustrates an annotation pipeline for generating a group mask extraction training dataset in accordance with one or more embodiments.

FIG. 5 illustrates training a mask grouping model in accordance with one or more embodiments.

FIG. 6 illustrates a diagram of an example architecture of the mask group system in accordance with one or more implementations.

FIG. 7 illustrates a flowchart of a series of acts for grouping segmentation masks in accordance with one or more embodiments.

FIG. 8 illustrates a block diagram of an example computing device for implementing one or