US-12626496-B2 - Utilizing interactive deep learning to select objects in digital visual media
Abstract
Systems and methods are disclosed for selecting target objects within digital images utilizing a multi-modal object selection neural network trained to accommodate multiple input modalities. In particular, in one or more embodiments, the disclosed systems and methods generate a trained neural network based on training digital images and training indicators corresponding to various input modalities. Moreover, one or more embodiments of the disclosed systems and methods utilize a trained neural network and iterative user inputs corresponding to different input modalities to select target objects in digital images. Specifically, the disclosed systems and methods can transform user inputs into distance maps that can be utilized in conjunction with color channels and a trained neural network to identify pixels that reflect the target object.
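The transformation the abstract describes (converting user inputs into distance maps used in conjunction with an image's color channels) can be pictured with a short sketch. The following is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, and the Euclidean metric and the 255 truncation cap are assumptions chosen for demonstration.

```python
import numpy as np

def clicks_to_distance_map(height, width, clicks, truncate=255.0):
    """Truncated Euclidean distance map for a set of (row, col) clicks.

    Each output pixel holds the distance to its nearest click, capped at
    `truncate` so that far-away pixels share a common "no signal" value.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    dist = np.full((height, width), truncate, dtype=np.float32)
    for r, c in clicks:
        d = np.sqrt((rows - r) ** 2 + (cols - c) ** 2)
        dist = np.minimum(dist, d)
    return dist

def build_network_input(rgb_image, positive_clicks, negative_clicks):
    """Stack RGB color channels with two distance maps into an H x W x 5 array."""
    h, w = rgb_image.shape[:2]
    pos = clicks_to_distance_map(h, w, positive_clicks)
    neg = clicks_to_distance_map(h, w, negative_clicks)
    return np.dstack([rgb_image.astype(np.float32), pos, neg])
```

In this sketch, positive clicks (on the target object) and negative clicks (off the object) each produce their own distance channel, so a network consuming the stacked input can distinguish "include" signals from "exclude" signals.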
Inventors
- Brian Price
- Scott Cohen
- Mai Long
- Jun Hao Liew
Assignees
- ADOBE INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-01-30
Claims (20)
- 1. A method comprising: generating, utilizing a neural network, predicted pixels corresponding to training objects utilizing digital training images portraying the training objects and training indicators comprising pixels and indications of how the pixels correspond to the training objects by: generating, for a digital training image of the digital training images, a training distance map by determining distances between pixels in the digital training image and a training indicator corresponding to the digital training image; and generating the predicted pixels from the training distance map utilizing the neural network; comparing the predicted pixels corresponding to the training objects with training ground truth masks; and training the neural network by comparing the predicted pixels corresponding to the training objects with the training ground truth masks.
- 2. The method of claim 1, further comprising: generating an additional training distance map for an additional digital training image of the digital training images; and generating additional predicted pixels for the additional digital training image from the additional training distance map utilizing the neural network.
- 3. The method of claim 2, wherein generating the additional training distance map comprises determining distances between pixels in the additional digital training image and an additional training indicator corresponding to the additional digital training image.
- 4. The method of claim 1, wherein a digital training image comprises a training object and further comprising generating the training indicators by generating a positive training indicator comprising at least one pixel of the digital training image that is part of the training object.
- 5. The method of claim 4, wherein generating the training indicators comprises generating a negative training indicator comprising a background pixel of the digital training image that is not part of the training object.
- 6. The method of claim 5, further comprising generating the negative training indicator by randomly sampling the background pixel from a plurality of background pixels that are not part of the training object in the digital training image.
- 7. The method of claim 5, further comprising generating the negative training indicator by sampling the background pixel from a plurality of background pixels based on a distance between the background pixel and another negative training indicator.
- 8. The method of claim 1, further comprising: identifying a training object and an untargeted object in a digital training image of the digital training images; and generating a negative training indicator by sampling a pixel from the untargeted object in the digital training image.
- 9. A system comprising: a memory component; and one or more processing devices coupled to the memory component, the one or more processing devices to perform operations comprising: generating, utilizing a neural network, predicted pixels corresponding to training objects utilizing digital training images portraying the training objects and training indicators comprising pixels and indications of how the pixels correspond to the training objects by: generating, for a digital training image of the digital training images, a training distance map by determining distances between pixels in the digital training image and a training indicator corresponding to the digital training image; and generating the predicted pixels from the training distance map utilizing the neural network; comparing the predicted pixels corresponding to the training objects with training ground truth masks; and training the neural network by comparing the predicted pixels corresponding to the training objects with the training ground truth masks.
- 10. The system of claim 9, further comprising: generating an additional training distance map for an additional digital training image of the digital training images; and generating additional predicted pixels for the additional digital training image from the additional training distance map utilizing the neural network.
- 11. The system of claim 9, wherein a digital training image comprises a training object and further comprising generating the training indicators by generating a positive training indicator comprising at least one pixel of the digital training image that is part of the training object.
- 12. The system of claim 11, wherein generating the training indicators comprises generating a negative training indicator comprising a background pixel of the digital training image that is not part of the training object.
- 13. The system of claim 12, further comprising generating the negative training indicator by randomly sampling the background pixel from a plurality of background pixels that are not part of the training object in the digital training image.
- 14. The system of claim 12, further comprising generating the negative training indicator by sampling the background pixel from a plurality of background pixels based on a distance between the background pixel and another negative training indicator.
- 15. The system of claim 9, further comprising: identifying a training object and an untargeted object in a digital training image of the digital training images; and generating a negative training indicator by sampling a pixel from the untargeted object in the digital training image.
- 16. A non-transitory computer-readable medium storing executable instructions which, when executed by a processing device, cause the processing device to perform operations comprising: generating, utilizing a neural network, predicted pixels corresponding to training objects utilizing digital training images portraying the training objects and training indicators comprising pixels and indications of how the pixels correspond to the training objects by: generating, for a digital training image of the digital training images, a training distance map by determining distances between pixels in the digital training image and a training indicator corresponding to the digital training image; and generating the predicted pixels from the training distance map utilizing the neural network; comparing the predicted pixels corresponding to the training objects with training ground truth masks; and training the neural network by comparing the predicted pixels corresponding to the training objects with the training ground truth masks.
- 17. The non-transitory computer-readable medium of claim 16, further comprising: generating an additional training distance map for an additional digital training image of the digital training images; and generating additional predicted pixels for the additional digital training image from the additional training distance map utilizing the neural network.
- 18. The non-transitory computer-readable medium of claim 17, wherein generating the additional training distance map comprises determining distances between pixels in the additional digital training image and an additional training indicator corresponding to the additional digital training image.
- 19. The non-transitory computer-readable medium of claim 16, wherein a digital training image comprises a training object and further comprising generating the training indicators by generating a positive training indicator comprising at least one pixel of the digital training image that is part of the training object.
- 20. The non-transitory computer-readable medium of claim 16, further comprising: identifying a training object and an untargeted object in a digital training image of the digital training images; and generating a negative training indicator by sampling a pixel from the untargeted object in the digital training image.
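Claims 6 through 8 above (mirrored in claims 13 through 15 and claim 20) recite three strategies for generating negative training indicators. As a rough, hypothetical sketch of how such sampling could look in practice (the function names, the numpy implementation, and all parameter choices below are assumptions for illustration, not the claimed method):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_random_background(object_mask, num_samples):
    """Claim-6-style sampling: draw negative indicators uniformly at random
    from pixels outside the target object's boolean ground truth mask."""
    bg_rows, bg_cols = np.nonzero(~object_mask)
    idx = rng.choice(len(bg_rows), size=num_samples, replace=False)
    return list(zip(bg_rows[idx], bg_cols[idx]))

def sample_spread_background(object_mask, num_samples, min_separation):
    """Claim-7-style sampling: accept a background pixel only if it lies at
    least `min_separation` pixels from every previously chosen negative
    indicator (may return fewer than `num_samples` for small images)."""
    bg_rows, bg_cols = np.nonzero(~object_mask)
    chosen = []
    for i in rng.permutation(len(bg_rows)):
        candidate = (bg_rows[i], bg_cols[i])
        if all(np.hypot(candidate[0] - r, candidate[1] - c) >= min_separation
               for r, c in chosen):
            chosen.append(candidate)
            if len(chosen) == num_samples:
                break
    return chosen

def sample_untargeted_object(untargeted_mask, num_samples):
    """Claim-8-style sampling: place negative indicators on a different,
    untargeted object so the network learns to exclude it."""
    rows, cols = np.nonzero(untargeted_mask)
    idx = rng.choice(len(rows), size=num_samples, replace=False)
    return list(zip(rows[idx], cols[idx]))
```

Intuitively, spreading negative clicks apart (claim 7) and placing them on distractor objects (claim 8) expose the network to harder, more informative "exclude" signals than uniform background sampling alone.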
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a divisional of U.S. patent application Ser. No. 16/376,704, filed Apr. 5, 2019, which is a continuation-in-part of U.S. patent application Ser. No. 16/216,739, filed Dec. 11, 2018, now issued as U.S. Pat. No. 11,314,982, which is a continuation of U.S. patent application Ser. No. 14/945,245, filed Nov. 18, 2015, now issued as U.S. Pat. No. 10,192,129. The aforementioned patents and applications are hereby incorporated by reference in their entirety.

BACKGROUND

Recent years have seen a rapid proliferation in the use of digital visual media. Indeed, with advancements in digital cameras, smartphones, and other technology, the ability to capture, access, and utilize digital images and video has steadily increased. Accordingly, engineers have made significant developments in digital object selection systems that capture, manage, and edit digital images. For example, some conventional object selection systems can identify and select objects portrayed within digital images.

Although some conventional systems can identify and select digital objects, these systems have a variety of problems and shortcomings. For example, some common digital object selection systems detect user tracing of an area within a digital image and select pixels within the traced area. Although such systems allow a user to select pixels in a digital image, they are often rough, over-inclusive, under-inclusive, and/or time consuming. Indeed, conventional systems that rely upon manual tracing commonly fail to provide sufficient precision to accurately select objects. Moreover, in order to achieve increased accuracy, such systems often require an exorbitant amount of time and user interactions to trace an object in a digital image.

Similarly, some common digital object selection systems are trained to identify pixels corresponding to common object classes. For example, some common digital systems are trained to identify and select pixels corresponding to dogs, cats, or other object classes. Although such systems are capable of identifying and selecting common objects, they are limited by the particular classifications with which they are trained. Because the number and type of object classes in the world is so vast, such common digital object selection systems can severely limit the ability to identify, select, and modify objects in digital visual media. Moreover, because such common systems identify pixels corresponding to a particular object type, they often have difficulty distinguishing between multiple objects belonging to the same class.

In addition, some conventional digital object selection systems are tied to fixed types of user input. For example, some conventional digital object systems can identify digital objects based on rigid input of area tracing inputs. By specializing in a specific type of user input, these conventional systems reduce flexibility and accuracy, inasmuch as a particular type of user input may only be effective at identifying target objects in specific circumstances (e.g., tracing may only be efficient for easily identifiable and traceable shapes).

These and other problems exist with regard to identifying objects in digital visual media.
BRIEF SUMMARY

One or more embodiments described herein provide benefits and solve one or more of the foregoing or other problems in the art with systems, methods, and non-transitory computer readable media that identify target objects utilizing a unified multi-modal interactive deep learning model. In particular, in one or more embodiments, the disclosed systems utilize a unified deep learning model that accepts and aggregates a plurality of interactive user inputs corresponding to different input modalities (e.g., regional clicks, boundary clicks, natural language expressions, bounding boxes, attention masks, and/or soft clicks) to select objects portrayed within digital visual media. For instance, the disclosed systems can train a multi-modal object segmentation neural network based on generic training digital images and training indicators corresponding to different input modalities. Based on this training, the disclosed systems can utilize a multi-modal object segmentation neural network to identify one or more objects based on user input corresponding to a variety of different input modalities.

This approach allows for improved efficiency by reducing the user interaction and time required to identify objects portrayed in digital images. Additionally, this approach provides improved flexibility and accuracy by utilizing deep learning techniques and by introducing multiple different user input modalities that can be uniquely applied to the unique circumstances and features of individual digital images.

To illustrate, in one or more embodiments, the disclosed systems receive multiple different types of user input (e.g., a regional input modality, a boundary input modality, and/or a language input modality). In
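The training described above (generating predicted pixels from distance-map inputs and comparing them against ground truth masks, per claim 1) can be pictured with a minimal training step. This is a sketch under assumptions only: the document does not specify a framework or architecture, so the tiny convolutional stand-in network, the Adam optimizer, and the per-pixel binary cross-entropy loss below are illustrative choices, not the disclosed model.

```python
import torch
import torch.nn as nn

# Stand-in for the multi-modal segmentation network: it consumes a
# 5-channel input (RGB + positive/negative distance maps) and emits
# one logit per pixel.
model = nn.Sequential(
    nn.Conv2d(5, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()  # compares predicted pixels to the mask

def training_step(batch_inputs, ground_truth_masks):
    """One update: predict pixels for a batch of (N, 5, H, W) inputs, then
    train by comparing predictions to (N, 1, H, W) masks of 0s and 1s."""
    logits = model(batch_inputs)
    loss = loss_fn(logits, ground_truth_masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeated over many training images and sampled training indicators, a step like this realizes the claimed comparison of predicted pixels against training ground truth masks.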