US-20260127901-A1 - REGION-TEXT CAPTION GENERATION USING GLOBAL CAPTION INFORMATION
Abstract
Approaches presented herein may be used to generate captions using raw caption information. Raw caption information may be used, with an associated image, to generate a detailed image caption. Object lists may then be generated from the image and/or the detailed image caption to produce an image including bounding box proposals for objects within the image. One or more trained machine learning systems may then be used to generate region of interest captions that infuse the global caption context associated with the raw caption information.
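As an illustration only, and not as part of the claimed subject matter, the following is a minimal Python sketch of the workflow summarized above. The callables passed into the function are placeholders standing in for the trained vision language model, object-listing model, and detector; the names, prompts, and data structures below are assumptions introduced for this sketch and do not come from the disclosure.

```python
# Illustrative, hypothetical sketch of the captioning workflow summarized in
# the abstract. The callables are placeholders for trained models (a vision
# language model, an object lister, and a detector); no specific library or
# model API is implied.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class RegionCaption:
    box: Box
    label: str
    caption: str

def generate_dense_captions(
    image,                                        # decoded image (e.g., pixel array)
    raw_caption: str,                             # short or noisy caption for the image
    vlm: Callable[..., str],                      # (image, prompt=...) -> text
    list_objects: Callable[[str], List[str]],     # caption text -> object names
    detect: Callable[..., List[Tuple[Box, str]]], # (image, labels) -> labeled boxes
) -> Tuple[str, List[RegionCaption]]:
    # 1. Expand the raw caption into a detailed, image-level ("global") caption.
    global_caption = vlm(image, prompt=f"Describe this image in detail. Hint: {raw_caption}")

    # 2. Build an object list (here, only from the global caption; the
    #    disclosure also derives objects directly from the image) and obtain
    #    bounding box proposals for those objects.
    labels = sorted(set(list_objects(global_caption)))
    boxes = detect(image, labels)

    # 3. Caption each region twice -- once locally, once with the global
    #    caption as context -- and merge the two into the final region caption.
    regions: List[RegionCaption] = []
    for box, label in boxes:
        local = vlm(image, prompt=f"Describe the region {box}.")
        contextual = vlm(image, prompt=f"Given the scene: {global_caption}\nDescribe the region {box}.")
        regions.append(RegionCaption(box=box, label=label, caption=f"{label}: {local} {contextual}"))
    return global_caption, regions
```

In practice, each callable would wrap a trained machine learning system, for example a VLM for the global and region captions and a detection or grounding model for the bounding box proposals.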
Inventors
- Subhashree Radhakrishnan
- Shijia Liao
- Charul Verma
- Zhiding Yu
- Sifei Liu
- Sean Cha
Assignees
- NVIDIA CORPORATION
Dates
- Publication Date
- 20260507
- Application Date
- 20241107
Claims (20)
- 1 . A processor, comprising: one or more circuits to: generate an object list for an input image based on the input image and a first caption corresponding to the input image; generate one or more bounding boxes for objects at least partially depicted in the input image; generate a first selected caption for a selected bounding box of the one or more bounding boxes; determine an object, within the selected bounding box, based on the first selected caption; generate a second selected caption, based on the first caption and the selected bounding box; and generate a merged caption based on the first selected caption, the second selected caption, and the object.
- 2 . The processor of claim 1 , wherein at least one of the first caption, the first selected caption, or the second selected caption is generated using a vision language model.
- 3 . The processor of claim 1 , wherein the one or more circuits are further to: generate a first object list from the input image; generate a second object list from the first caption; and combine the first object list and the second object list to form the object list.
- 4 . The processor of claim 3 , wherein the one or more circuits are further to: provide the object list to one or more trained machine learning systems to generate the one or more bounding boxes.
- 5 . The processor of claim 1 , wherein the one or more circuits are further to: identify a plurality of bounding boxes, of the one or more bounding boxes, having a common label; determine at least a portion of the plurality of bounding boxes are within a threshold distance; and combine the portion of the plurality of bounding boxes into a single bounding box.
- 6 . The processor of claim 1 , wherein the one or more circuits are further to: determine a first label associated with the selected bounding box; determine a second label associated with the object; determine a similarity metric between the first label and the second label is below a threshold; and identify the merged caption for review.
- 7 . The processor of claim 1 , wherein an input to a trained machine learning system used to generate the first caption includes raw caption data for the input image.
- 8 . The processor of claim 1 , wherein the one or more circuits are further to: receive one or more prompts associated with an output configuration for at least one of the first caption, the first selected caption, or the second selected caption.
- 9 . The processor of claim 1 , wherein the processor is comprised in at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering graphical output; a system for performing deep learning operations; a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system incorporating one or more Virtual Machines (VMs); a system for performing operations for a conversational AI application; a system for performing operations for a generative AI application; a system for performing operations using a language model; a system for performing one or more operations using a large language model (LLM); a system for performing one or more operations using a vision language model (VLM); a system implemented at least partially in a data center; a system for performing hardware testing using simulation; a system for performing one or more generative content operations using a language model; a system for synthetic data generation; a collaborative content creation platform for 3D assets; or a system implemented at least partially using cloud computing resources.
- 10 . A computer-implemented method, comprising: obtaining a set of bounding boxes and labels for an image based on an object list that includes one or more identified objects at least partially depicted in the image and one or more described objects from a first caption of the image; determining, for a selected bounding box of the set of bounding boxes, a second caption; determining, from the second caption, a bounding box object; determining, for the selected bounding box, a third caption, based on the first caption; and generating a fourth caption based on the second caption, the third caption, and the bounding box object.
- 11 . The computer-implemented method of claim 10 , further comprising: generating, using a raw caption associated with the image, the first caption; and generating a first object list based on the first caption.
- 12 . The computer-implemented method of claim 11 , further comprising: identifying the one or more identified objects in the image using a trained machine learning model; generating a second object list based on the one or more identified objects; and generating the object list based on the first object list and the second object list.
- 13 . The computer-implemented method of claim 10 , further comprising: determining a plurality of bounding boxes, of the set of bounding boxes, have a common label; determining at least a portion of the plurality of bounding boxes are within a threshold distance; and combining the portion of the plurality of bounding boxes into a single bounding box with the common label.
- 14 . The computer-implemented method of claim 10 , further comprising: receiving a prompt corresponding to an output format for the fourth caption.
- 15 . The computer-implemented method of claim 10 , further comprising: determining a first label associated with the selected bounding box; determining a second label associated with the bounding box object; determining a similarity metric between the first label and the second label is below a threshold; and identifying the fourth caption for review.
- 16 . A system, comprising: processing circuitry to generate a caption for an input image based on a set of region captions, wherein individual region captions of the set of region captions are generated using a first region caption based on a labeled bounding box for an object within the input image and a second region caption based on a global description of the input image.
- 17 . The system of claim 16 , wherein the processing circuitry is further to generate an object list for the input image based on a first object list corresponding to a first output of an object recognition model and a second object list corresponding to a second output of a large language model.
- 18 . The system of claim 17 , wherein the object recognition model processes the input image and the large language model processes an image caption generated by a vision language model based on the input image and raw caption data.
- 19 . The system of claim 16 , wherein the processing circuitry is further to provide the caption to a human-in-the-loop review engine responsive to determining a similarity metric between labeled objects in the input image is below a threshold.
- 20 . The system of claim 16 , wherein the system is comprised in at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering graphical output; a system for performing deep learning operations; a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system incorporating one or more Virtual Machines (VMs); a system for performing operations for a conversational AI application; a system for performing operations for a generative AI application; a system for performing operations using a language model; a system for performing one or more operations using a large language model (LLM); a system for performing one or more operations using a vision language model (VLM); a system implemented at least partially in a data center; a system for performing hardware testing using simulation; a system for performing one or more generative content operations using a language model; a system for synthetic data generation; a collaborative content creation platform for 3D assets; or a system implemented at least partially using cloud computing resources.
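Claims 5 and 13 recite combining bounding boxes that share a common label and lie within a threshold distance of one another into a single bounding box. The following is a minimal, self-contained sketch of one way such a merge could be implemented; the box representation, the gap-based distance measure, and the threshold value are assumptions chosen for illustration and are not specified by the claims.

```python
# Hypothetical sketch of merging same-label bounding boxes that lie within a
# threshold distance of each other. Boxes are (x1, y1, x2, y2) tuples; the
# "distance" is the axis-aligned gap between boxes, one of several
# reasonable choices.
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]

def box_gap(a: Box, b: Box) -> float:
    """Smallest axis-aligned gap between two boxes (0.0 if they overlap)."""
    dx = max(0.0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0.0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)

def merge_boxes(boxes: List[Tuple[Box, str]], threshold: float) -> List[Tuple[Box, str]]:
    """Combine boxes with a common label that are within `threshold` of each other."""
    by_label: Dict[str, List[Box]] = defaultdict(list)
    for box, label in boxes:
        by_label[label].append(box)

    merged: List[Tuple[Box, str]] = []
    for label, group in by_label.items():
        remaining = list(group)
        while remaining:
            current = remaining.pop()
            changed = True
            while changed:
                changed = False
                for other in list(remaining):
                    if box_gap(current, other) <= threshold:
                        # Combine the two boxes into a single enclosing box.
                        current = (min(current[0], other[0]), min(current[1], other[1]),
                                   max(current[2], other[2]), max(current[3], other[3]))
                        remaining.remove(other)
                        changed = True
            merged.append((current, label))
    return merged

# Example: two nearby "person" boxes collapse into one; the "dog" box is unchanged.
print(merge_boxes([((0, 0, 10, 10), "person"), ((12, 0, 20, 10), "person"),
                   ((50, 50, 60, 60), "dog")], threshold=5.0))
```

Claims 6, 15, and 19 similarly recite flagging a generated caption for review when a similarity metric between two labels falls below a threshold. The sketch below uses token-overlap (Jaccard) similarity purely as an illustrative stand-in; the claims do not specify which similarity metric (for example, embedding cosine similarity) is used, and the threshold is an assumption.

```python
# Hypothetical sketch of the human-in-the-loop check: compare the label of the
# selected bounding box against the object label extracted from the region
# caption, and flag the merged caption for review when the similarity falls
# below a threshold. Jaccard similarity over lowercase tokens is used here
# only as an illustrative metric.
def label_similarity(first_label: str, second_label: str) -> float:
    a, b = set(first_label.lower().split()), set(second_label.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def needs_review(box_label: str, caption_object_label: str, threshold: float = 0.5) -> bool:
    """Return True when the merged caption should be routed to a reviewer."""
    return label_similarity(box_label, caption_object_label) < threshold

# Example: a box labeled "delivery truck" whose region caption describes a
# "school bus" is flagged, while "delivery truck" vs. "truck" is accepted.
print(needs_review("delivery truck", "school bus"))  # True  -> flag for review
print(needs_review("delivery truck", "truck"))       # False -> accept
```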
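These sketches are intended only to make the claimed operations concrete; any production implementation would depend on the particular trained models, label vocabularies, and thresholds used.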
Description
BACKGROUND

Annotating dense images is a slow process and often error-prone when offloaded to different automated systems. Automated systems may provide a holistic description of a scene shown within an image, but fail to provide object-level detail or rich information. As a result, detailed captions are often generated by human reviewers who can incorporate context into the captions for particular regions, groupings, or objects within an image. With large datasets, it may be impractical for humans to generate the captions, and only providing partial captions with existing systems may fail to describe all elements of an image, while also being time and labor intensive.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
- FIG. 1 illustrates an example environment for caption generation, in accordance with various embodiments;
- FIG. 2A illustrates an example environment for generating image-level captions using raw caption information, in accordance with various embodiments;
- FIG. 2B illustrates an example representation of a generated image-level caption, in accordance with various embodiments;
- FIG. 3A illustrates an example environment for generating images including bounding box proposals, in accordance with various embodiments;
- FIG. 3B illustrates an example representation of an object list generated from an object detection system and a large language model, in accordance with various embodiments;
- FIG. 3C illustrates an example representation of an image including bounding box proposals, in accordance with various embodiments;
- FIG. 3D illustrates an example representation of a merged bounding box, in accordance with various embodiments;
- FIG. 4A illustrates an example environment for generating captions with infused global caption information, in accordance with various embodiments;
- FIG. 4B illustrates an example representation of region of interest captions, in accordance with various embodiments;
- FIG. 4C illustrates an example representation of a detected object list, in accordance with various embodiments;
- FIG. 4D illustrates an example representation of a caption with infused global caption information, in accordance with various embodiments;
- FIG. 4E illustrates an example representation of a human-in-the-loop review engine for a caption, in accordance with various embodiments;
- FIG. 5A illustrates an example process for generating a merged caption with infused global context, in accordance with various embodiments;
- FIG. 5B illustrates an example process for generating a merged caption with infused global context, in accordance with various embodiments;
- FIG. 6 illustrates components of a distributed system that can be utilized to update or perform inferencing using a machine learning model, according to at least one embodiment;
- FIG. 7A illustrates inference and/or training logic, according to at least one embodiment;
- FIG. 7B illustrates inference and/or training logic, according to at least one embodiment;
- FIG. 8 illustrates an example data center system, according to at least one embodiment;
- FIG. 9 illustrates a computer system, according to at least one embodiment;
- FIG. 10 illustrates a computer system, according to at least one embodiment;
- FIG. 11 illustrates at least portions of a graphics processor, according to one or more embodiments;
- FIG. 12 illustrates at least portions of a graphics processor, according to one or more embodiments;
- FIG. 13 is an example data flow diagram for an advanced computing pipeline, in accordance with at least one embodiment;
- FIG. 14 is a system diagram for an example system for training, adapting, instantiating, and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment;
- FIGS. 15A and 15B illustrate a data flow diagram for a process to train a machine learning model, as well as client-server architecture to enhance annotation tools with pre-trained annotation models, in accordance with at least one embodiment;
- FIG. 16A is a block diagram of an example generative language model system, in accordance with at least one embodiment;
- FIG. 16B is a block diagram of an example generative language model that includes a transformer encoder-decoder, in accordance with at least one embodiment; and
- FIG. 16C is a block diagram of an example generative language model that includes a decoder-only transformer architecture, in accordance with at least one embodiment.

DETAILED DESCRIPTION

In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.