US-12626069-B2 - Image description generation with varying levels of detail


Abstract

One or more aspects of the method, apparatus, and non-transitory computer readable medium include obtaining an image and a detail level, wherein the detail level comprises a value indicating a level of detail for a description of the image. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include identifying a set of regions for the image based on the detail level using a machine learning model, and generating a description for the image based on the set of regions, wherein an amount of detail in the description is based on the detail level.
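The flow summarized in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each region proposal already carries a relevance score and a per-region caption, and that the detail level maps linearly to a region budget. The names `RegionProposal`, `select_regions`, `describe`, and `regions_per_level` are hypothetical and do not come from the disclosure, which performs the selection with a trained machine learning model.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A bounding box as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class RegionProposal:
    box: Box
    relevance: float  # hypothetical score standing in for a learned classifier output
    caption: str      # hypothetical text standing in for a learned generator output


def select_regions(
    proposals: List[RegionProposal],
    detail_level: int,
    regions_per_level: int = 2,  # assumption: a linear level-to-budget mapping
) -> List[RegionProposal]:
    """Keep more bounding boxes as the detail level increases."""
    if detail_level < 1:
        raise ValueError("detail_level must be at least 1")
    budget = detail_level * regions_per_level
    # Highest-relevance proposals first; sorted() is stable, so ties keep input order.
    ranked = sorted(proposals, key=lambda p: p.relevance, reverse=True)
    return ranked[:budget]


def describe(proposals: List[RegionProposal], detail_level: int) -> str:
    """Join the captions of the selected regions into one description."""
    regions = select_regions(proposals, detail_level)
    return " ".join(region.caption for region in regions)
```

With this mapping, a higher detail level selects more regions and therefore yields a longer description. A trained classifier conditioned on the detail level would replace the fixed budget used here.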

Inventors

Akshay Ganesh Iyer
Nikunj Goyal
Kanad Shrikar Pardeshi
Pranamya Prashant Kulkarni
Abhilasha Sancheti
Praneetha Vaddamanu
Aparna Garimella
Apoorv Umang Saxena
Vishwa Vinay

Assignees

ADOBE INC.

Dates

Publication Date: 2026-05-12
Application Date: 2023-07-20

Claims (20)

1. A method comprising: obtaining an image and a detail level, wherein the detail level comprises a value indicating a level of detail for a description of the image; identifying a set of region proposals, wherein each of the set of region proposals comprises a region of the image represented by a bounding box; identifying a set of regions for the image based on the detail level using a machine learning model, wherein a number of bounding boxes associated with the set of regions corresponds to the detail level; and generating the description for the image based on the set of regions, wherein an amount of detail in the description is based on the detail level.
2. The method of claim 1, further comprising: generating the set of region proposals for the image using an encoder of the machine learning model.
3. The method of claim 2, further comprising: generating context features of the image using the encoder, wherein the set of regions is identified based on the context features.
4. The method of claim 2, further comprising: generating region features for at least one of the set of region proposals using the encoder, wherein the description is based on the region features.
5. The method of claim 2, further comprising: generating a bounding box for at least one of the set of region proposals using the encoder, wherein the set of regions is selected from the set of region proposals based on the bounding box.
6. The method of claim 2, further comprising: classifying each of the set of region proposals based on the detail level using a classifier of the machine learning model, wherein the set of regions is based on the classifying.
7. The method of claim 1, further comprising: generating a region description for each of the set of regions using a generator of the machine learning model; and combining the region description for each of the set of regions to obtain the description.
8. The method of claim 7, further comprising: filtering the region description for each of the set of regions based on a sentence similarity score, wherein the combining is based on the filtering.
9. A method of training a machine learning model, comprising: obtaining a training data set including a region of an image, a detail level, and a ground truth classification of the region, wherein the detail level comprises a value indicating a level of detail for a description of the image; identifying a set of region proposals, wherein each of the set of region proposals comprises a region of the image represented by a bounding box; identifying a set of regions for the image based on the detail level using the machine learning model, wherein a number of bounding boxes associated with the set of regions corresponds to the detail level; classifying the region of the image based on the detail level and the set of regions using a classifier of the machine learning model to obtain a region classification; and training the classifier to classify image regions based on the region classification and the ground truth classification.
10. The method of claim 9, further comprising: computing a classification loss by comparing the ground truth classification and the region classification.
11. The method of claim 9, further comprising: training an encoder of the machine learning model to generate the set of region proposals, wherein the region of the image is based on an output of the encoder.
12. The method of claim 9, further comprising: training a generator of the machine learning model to generate a region description for the region of the image.
13. The method of claim 9, further comprising: obtaining a ground truth description of the image corresponding to the detail level; generating a region description of the region of the image; and comparing the region description to the ground truth description of the image, wherein the region description is based on the comparison.
14. An apparatus comprising: one or more processors; one or more memories including instructions executable by the one or more processors; and a machine learning model including parameters stored in the one or more memories, wherein the machine learning model is trained to: obtain an image and a detail level, wherein the detail level comprises a value indicating a level of detail for a description of the image; identify a set of region proposals, wherein each of the set of region proposals comprises a region of the image represented by a bounding box; identify a set of regions for the image based on the detail level, wherein a number of bounding boxes associated with the set of regions corresponds to the detail level; and generate the description for the image based on the set of regions, wherein an amount of detail in the description is based on the detail level.
15. The apparatus of claim 14, wherein: the machine learning model comprises an encoder configured to generate the set of region proposals, wherein the set of regions is identified based on the set of region proposals.
16. The apparatus of claim 15, wherein: the encoder comprises a region proposal network (RPN).
17. The apparatus of claim 14, wherein: the machine learning model comprises a classifier configured to classify the set of regions of the image based on the detail level, wherein the set of regions is identified based on the classification.
18. The apparatus of claim 17, wherein: the classifier comprises a feed forward network.
19. The apparatus of claim 14, wherein: the machine learning model comprises a generator configured to generate a region description for each of the set of regions, wherein the description is based on the region description.
20. The apparatus of claim 19, wherein: the generator comprises a transformer model or Long Short-Term Memory (LSTM) model.
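The filtering and combining steps recited in claims 7 and 8 can be sketched as follows. The claims do not say how the sentence similarity score is computed, so this sketch substitutes Jaccard overlap of word tokens as an assumed stand-in. The function names and the `threshold` value of 0.6 are likewise hypothetical and are not taken from the disclosure.

```python
import re
from typing import List


def jaccard_similarity(a: str, b: str) -> float:
    """Assumed stand-in for a sentence similarity score: word-set overlap in [0, 1]."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_descriptions(descriptions: List[str], threshold: float = 0.6) -> List[str]:
    """Greedily drop any region description too similar to one already kept."""
    kept: List[str] = []
    for sentence in descriptions:
        if all(jaccard_similarity(sentence, other) < threshold for other in kept):
            kept.append(sentence)
    return kept


def combine_descriptions(descriptions: List[str], threshold: float = 0.6) -> str:
    """Filter near-duplicate region descriptions, then join the rest in order."""
    return " ".join(filter_descriptions(descriptions, threshold))
```

Overlapping bounding boxes often produce near-identical region descriptions, so filtering before combining keeps the final description from repeating itself. A learned sentence-embedding similarity could replace the token overlap used here without changing the surrounding logic.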

Description

BACKGROUND

The present disclosure relates to generating descriptions having varying levels of detail for an image in response to a user prompt. Images can depict various objects and actions in a scene that can be included in an informative description, such that images and sentences can be associated. With image captioning, the task of classifying an image into one of a fixed set of categories based on an object has expanded to labeling an image with a sequence of words able to express richer concepts. Dense captioning involves using a model to predict a set of descriptions across regions of an image.

SUMMARY

Embodiments of the present disclosure provide a machine learning model including a generative network trained to generate a text description for an image, where the description can include a varying amount of detail based on a detail level. The detail level can be provided by a user. A description generator can construct different detail level descriptions for a given image based on detected objects and activities occurring in the image.

A method, apparatus, and non-transitory computer readable medium for image description generation are described. One or more aspects of the method, apparatus, and non-transitory computer readable medium include obtaining an image and a detail level, wherein the detail level comprises a value indicating a level of detail for a description of the image, identifying a set of regions for the image based on the detail level using a machine learning model, and generating a description for the image based on the set of regions, wherein an amount of detail in the description is based on the detail level.

A method, apparatus, and non-transitory computer readable medium for training a machine learning model are described. One or more aspects of the method, apparatus, and non-transitory computer readable medium include obtaining a training data set including a region of an image, a detail level, and a ground truth classification of the region, classifying the region of the image based on the detail level using a classifier of the machine learning model to obtain a region classification, and training the classifier to classify image regions based on the region classification and the ground truth classification.

A method, apparatus, and non-transitory computer readable medium for image description generation are described. One or more aspects of the method, apparatus, and non-transitory computer readable medium include one or more processors, one or more memories including instructions executable by the one or more processors, and a machine learning model including parameters stored in the one or more memories, wherein the machine learning model is trained to identify a set of regions for an image based on a detail level and to generate a description for the image based on the set of regions.

A method, apparatus, and non-transitory computer readable medium for training a description generation network are described. One or more aspects of the method, apparatus, and non-transitory computer readable medium include a training data set including a description paragraph having multiple sentences as ground truth descriptions of image features.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustrative depiction of a high-level diagram of a user interacting with a description generation system, including a neural network for generating image descriptions, according to aspects of the present disclosure.
FIG. 2 shows a flow diagram illustrating an example of a description generation method applied to an image and description level, according to aspects of the present disclosure.
FIG. 3 shows a block diagram of an example of a description generator, according to aspects of the present disclosure.
FIG. 4 shows a diagram of a description generator for receiving an image and outputting a text description, according to aspects of the present disclosure.
FIG. 5 shows a diagram illustrating an example of a method of generating a text description at a level of detail for an image, according to aspects of the present disclosure.
FIG. 6 shows a diagram of an example of a method of training a description generator, according to aspects of the present disclosure.
FIG. 7 shows a diagram of an example of a method of generating a text description for an image, according to aspects of the present disclosure.
FIG. 8 shows a diagram of an example of a method of training a classifier of an image generator model, according to aspects of the present disclosure.
FIG. 9 shows a diagram of an example of a method of training a description generation network, according to aspects of the present disclosure.
FIG. 10 shows an example of an input image for description generation at different detail levels, according to aspects of the present disclosure.
FIG. 11 shows an example of a computing device for a description generator, according to aspects of the present disclosure.

DETAILED DESCRIPTION

The