
US-12626429-B2 - Joint framework for object-centered shadow detection, removal, and synthesis


Abstract

The present disclosure relates to systems, methods, and non-transitory computer-readable media that detect, remove, and synthesize shadows in a joint framework. In particular, the disclosed systems access an object mask of an object and a digital image depicting the object and a shadow of the object. Furthermore, the disclosed systems perform object-centered shadow detection and removal utilizing a shadow analyzer model to generate a modified digital image without the shadow. Moreover, the disclosed systems receive a user interaction to manipulate the object and generate a modified shadow utilizing a shadow synthesis model, where the shadow synthesis model is conditioned on a shadow mask generated by the shadow analyzer model.
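For illustration, the joint framework summarized above can be sketched as a minimal two-stage pipeline: analyze (detect and remove), then synthesize. This is a hypothetical PyTorch-style sketch, not the patented implementation; the names MetaShadowPipeline, analyzer, synthesizer, and user_edit are illustrative assumptions.

```python
# Minimal sketch of the joint framework described in the abstract.
# All class and argument names are hypothetical placeholders; the patent
# does not publish an implementation.
import torch

class MetaShadowPipeline:
    def __init__(self, analyzer, synthesizer):
        self.analyzer = analyzer        # GAN-based shadow detection + removal model
        self.synthesizer = synthesizer  # diffusion-based shadow synthesis model

    @torch.no_grad()
    def edit(self, image, object_mask, user_edit):
        # 1) Jointly detect and remove the object's shadow; the analyzer
        #    returns the shadow-free image and a predicted shadow mask.
        shadow_free, shadow_mask = self.analyzer(image, object_mask)
        # 2) Apply the user's manipulation (e.g., relocate the object).
        moved_image, moved_mask = user_edit(shadow_free, object_mask)
        # 3) Synthesize a new shadow, conditioning the diffusion model on
        #    the shadow mask produced by the analyzer.
        return self.synthesizer(moved_image, moved_mask, cond=shadow_mask)
```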

Inventors

  • Tianyu Wang
  • Soo Ye Kim
  • Luis Figueroa
  • Haitian Zheng
  • Jianming Zhang
  • Zhihong Ding
  • Scott Cohen
  • Zhe Lin
  • Wei Xiong

Assignees

  • Adobe Inc.

Dates

Publication Date
2026-05-12
Application Date
2024-04-30

Claims (20)

  1. A computer-implemented method comprising: receiving a digital image depicting a scene comprising an object and a shadow of the object; accessing an object mask of the object in the digital image; and generating a modified digital image by: extracting, from a combination of the digital image and the object mask, utilizing an encoder of a shadow analyzer model, multi-scale features, global features, and spatial features; generating, from the global features and the spatial features, utilizing a global-spatial decoder, the modified digital image without the shadow of the object; and generating, from the multi-scale features of a spatial decoder of the global-spatial decoder, utilizing a shadow detector, a shadow mask of the shadow.
  2. The computer-implemented method of claim 1, wherein generating the modified digital image comprises: receiving, via a user interaction with the digital image, an indication to remove the shadow of the object; and generating a fill corresponding to the removed shadow of the object that is consistent with the digital image based on the global features and the spatial features.
  3. The computer-implemented method of claim 1, wherein accessing the object mask further comprises receiving lighting data and geometry of the scene to generate pixels for the shadow removed from the digital image that are globally and locally consistent with a remainder of the modified digital image.
  4. The computer-implemented method of claim 1, wherein generating the modified digital image without the shadow of the object comprises generating, utilizing a generative inpainting neural network of the shadow analyzer model, pixel values consistent with the scene and without the shadow by modulating the generative inpainting neural network based on the object mask of the object.
  5. The computer-implemented method of claim 1, wherein generating the shadow mask of the shadow comprises: utilizing a shadow detector integrated with the global-spatial decoder to identify a shadow region of the digital image by upsampling multi-scale features to a uniform size; and combining the upsampled multi-scale features into a feature map to generate the shadow mask of the shadow.
  6. The computer-implemented method of claim 1, further comprising: receiving an additional digital image with a plurality of objects and a plurality of shadows corresponding to the plurality of objects; and accessing an empty mask for the additional digital image.
  7. The computer-implemented method of claim 6, further comprising, in response to receiving the empty mask, removing the plurality of shadows corresponding to the plurality of objects in the additional digital image.
  8. The computer-implemented method of claim 1, further comprising: receiving an additional digital image comprising an additional shadow cast from an additional object outside of a frame of the additional digital image; accessing an empty mask in response to identifying that the additional shadow is cast from the additional object outside of the frame; and in response to the empty mask, removing the additional shadow from the additional digital image.
  9. A non-transitory computer-readable medium storing instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: in response to receiving a user interaction to modify an object depicted in a digital image, accessing an object mask of the object; combining the object mask of the object and the digital image to generate a combined representation; receiving, from a shadow analyzer model, a shadow mask of the object; generating, from the combined representation and utilizing a shadow synthesis diffusion model, a new shadow for the object by conditioning the shadow synthesis diffusion model with the shadow mask of the object received from the shadow analyzer model; and generating a modified digital image that includes the object with the new shadow.
  10. The non-transitory computer-readable medium of claim 9, wherein: receiving the user interaction to modify the object comprises relocating the object by moving the object in the digital image from a first location in the digital image to a second location in the digital image; and combining the object mask of the object and the digital image comprises combining the object mask and the digital image depicting the object moved to the second location to generate the combined representation.
  11. The non-transitory computer-readable medium of claim 9, wherein: receiving the user interaction to modify the object comprises relocating the object by adding the object to a location of the digital image, the object coming from an additional digital image; and combining the object mask of the object and the digital image comprises combining the object mask and the digital image depicting the added object in the location to generate the combined representation.
  12. The non-transitory computer-readable medium of claim 9, wherein the operations further comprise: accessing, from the shadow analyzer model, shadow property data comprising intensity, softness, color, and direction of a shadow corresponding to the shadow mask of the object; and combining the shadow property data to generate a feature map of the shadow mask.
  13. The non-transitory computer-readable medium of claim 12, wherein the operations further comprise resizing the feature map of the shadow mask from a first size to a second size, the second size being smaller than the first size.
  14. The non-transitory computer-readable medium of claim 13, wherein conditioning the shadow synthesis diffusion model with the shadow mask of the object comprises: utilizing an adapter of the shadow synthesis diffusion model to align the resized feature map with text tokens; and utilizing a cross-attention mechanism of the shadow synthesis diffusion model to condition an iterative denoising process of the shadow synthesis diffusion model with the resized feature map aligned with the text tokens.
  15. The non-transitory computer-readable medium of claim 9, wherein generating the new shadow for the object comprises: generating, at a first denoising process of the shadow synthesis diffusion model, a first additional shadow mask of the object based on conditioning the first denoising process with the shadow mask of the object; generating, at a second denoising process of the shadow synthesis diffusion model, a second additional shadow mask of the object based on conditioning the second denoising process with the shadow mask of the object; and generating the new shadow for the object based on the first additional shadow mask and the second additional shadow mask.
  16. A system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices that cause the system to perform operations comprising: accessing an object mask of an object and a digital image depicting a scene comprising the object and a shadow of the object; performing object-centered shadow detection and removal to generate a modified digital image without the shadow of the object by extracting, via an encoder of a shadow analyzer model, features from a combination of the digital image and the object mask; in response to receiving a user interaction to manipulate the object, combining the object mask of the object and the modified digital image to generate a combined representation; and generating, from the combined representation and utilizing a shadow synthesis diffusion model, a modified shadow for the object responsive to the user interaction by conditioning the shadow synthesis diffusion model with a shadow mask of the object generated by the shadow analyzer model.
  17. The system of claim 16, wherein the operations further comprise: extracting the features from the combination of the digital image and the object mask by utilizing an encoder of the shadow analyzer model to extract multi-scale features, global features, and spatial features; and generating, utilizing a shadow detector of the shadow analyzer model, the shadow mask of the object from the multi-scale features.
  18. The system of claim 16, wherein receiving the user interaction to manipulate the object comprises: moving the object in the digital image from a first location in the digital image to a second location in the digital image; or adding the object to a location in an additional digital image.
  19. The system of claim 16, wherein conditioning the shadow synthesis diffusion model with the shadow mask of the object comprises: accessing shadow property data comprising intensity, softness, color, and direction of a shadow corresponding to the shadow of the object to generate a feature map of the shadow; aligning the feature map with text tokens by utilizing an adapter of the shadow synthesis diffusion model to generate shadow embeddings of the shadow mask; and conditioning the shadow synthesis diffusion model with the shadow embeddings of the shadow.
  20. The system of claim 16, wherein the operations further comprise: adding an additional object to the modified digital image, wherein the additional object comes from an additional digital image without a corresponding shadow; and accessing the modified shadow as a reference shadow to generate a new shadow for the additional object.
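Claims 1 and 5 above describe a shadow detector that upsamples multi-scale decoder features to a uniform size and combines them into a feature map from which the shadow mask is predicted. The following is a minimal sketch of such a detector head, assuming a PyTorch implementation; the module name ShadowDetectorHead, the channel counts, and the output size are hypothetical and not taken from the patent.

```python
# Hypothetical sketch of the shadow detector of claims 1 and 5: multi-scale
# features are upsampled to a uniform size, concatenated into a feature map,
# and projected to a single-channel shadow mask. Shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShadowDetectorHead(nn.Module):
    def __init__(self, channels_per_scale, out_size=256):
        super().__init__()
        self.out_size = out_size
        # 1x1 conv fuses the concatenated multi-scale feature map into a mask.
        self.fuse = nn.Conv2d(sum(channels_per_scale), 1, kernel_size=1)

    def forward(self, multi_scale_feats):
        # Upsample every scale to one uniform spatial size (claim 5).
        upsampled = [
            F.interpolate(f, size=(self.out_size, self.out_size),
                          mode="bilinear", align_corners=False)
            for f in multi_scale_feats
        ]
        # Combine into a single feature map and predict the shadow mask.
        fused = torch.cat(upsampled, dim=1)
        return torch.sigmoid(self.fuse(fused))

# Usage with three hypothetical decoder feature scales:
feats = [torch.randn(1, c, s, s) for c, s in [(256, 32), (128, 64), (64, 128)]]
mask = ShadowDetectorHead([256, 128, 64])(feats)  # -> (1, 1, 256, 256)
```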

Description

BACKGROUND

Recent years have seen significant advancement in hardware and software platforms for performing computer vision and image editing tasks. Indeed, systems provide a variety of image-related tasks, such as object identification, classification, segmentation, composition, style transfer, and image inpainting. For instance, systems provide image editing tools for creating shadows in an image, as shadows play a vital role in enhancing the realism of an image. Despite these advances in shadow-oriented tasks in digital image editing, systems suffer from a number of deficiencies with regard to efficiency, accuracy, and operational flexibility.

SUMMARY

One or more embodiments described herein provide benefits and/or solve one or more problems in the art with systems, methods, and non-transitory computer-readable media that implement a meta-shadow system to facilitate flexible and efficient scene-based image editing. To illustrate, in one or more embodiments, a disclosed system utilizes a joint framework for object-centered shadow detection, removal, and synthesis. Specifically, the disclosed system contains a framework with a GAN-based shadow detection and removal model and a diffusion-based shadow synthesis pipeline that leverages features from the GAN-based shadow detection and removal model. For example, given a digital image and an object mask, the disclosed system simultaneously detects and removes a shadow cast by an associated object and leverages intermediate removal features as a reference for synthesizing a shadow when relocating the associated object (e.g., moving the object to another digital image or moving the object within the digital image). Additional features and advantages of one or more embodiments of the present disclosure are outlined in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

This disclosure will describe one or more embodiments of the invention with additional specificity and detail by referencing the accompanying figures. The following paragraphs briefly describe those figures, in which:

FIG. 1 illustrates an example environment in which a meta-shadow system operates in accordance with one or more embodiments;
FIGS. 2A-2B illustrate an overview of the meta-shadow system detecting, removing, and synthesizing shadows in accordance with one or more embodiments;
FIG. 3 illustrates an example diagram of the meta-shadow system receiving a digital image and an object mask, removing a shadow in the digital image, and generating a shadow mask prediction in accordance with one or more embodiments;
FIG. 4 illustrates an example diagram of the meta-shadow system extracting multi-scale features from a digital image and an object mask in accordance with one or more embodiments;
FIGS. 5A-5D illustrate example diagrams of the meta-shadow system removing one or more shadows under various conditions in accordance with one or more embodiments;
FIG. 6 illustrates an example diagram of the meta-shadow system synthesizing a new shadow for an object in a digital image based on intermediate features obtained from the shadow analyzer model in accordance with one or more embodiments;
FIG. 7 illustrates an example diagram of the joint framework of the meta-shadow system that incorporates both the shadow analyzer model and the shadow synthesis model in accordance with one or more embodiments;
FIG. 8 illustrates an example diagram of the meta-shadow system training the shadow analyzer model and the shadow synthesis model in accordance with one or more embodiments;
FIG. 9 illustrates experimental results of the meta-shadow system synthesizing a shadow based on an empty shadow mask and with a shadow mask prediction in accordance with one or more embodiments;
FIG. 10 illustrates experimental results of the meta-shadow system detecting shadows compared to prior systems in accordance with one or more embodiments;
FIG. 11 illustrates experimental results of the meta-shadow system synthesizing shadows compared to prior systems in accordance with one or more embodiments;
FIG. 12 illustrates experimental results of the meta-shadow system synthesizing shadows for an image dataset and a video dataset compared to prior systems in accordance with one or more embodiments;
FIG. 13 illustrates experimental ablation results of the meta-shadow system synthesizing shadows utilizing intermediate features from the shadow analyzer model in accordance with one or more embodiments;
FIG. 14 illustrates a schematic diagram of the meta-shadow system in accordance with one or more implementations;
FIG. 15 illustrates a flowchart of a series of acts for generating a shadow mask in accordance with one or more embodiments;
FIG. 16 illustrates a flowchart of a series of acts for generating a modified digital image that includes a new shadow in accordance with one or more embodiments;
FIG. 17 illustrat
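The diffusion-based synthesis pipeline of the summary, elaborated in claims 12-14 and 19, conditions an iterative denoising process on a shadow-mask feature map that is resized, aligned with text tokens by an adapter, and injected through cross-attention. Below is a minimal sketch of that conditioning path, assuming a PyTorch implementation; the name ShadowAdapter, the grid size, and all dimensions are hypothetical, and nn.MultiheadAttention merely stands in for a denoiser's cross-attention layers.

```python
# Hypothetical sketch of the conditioning path of claims 12-14: a shadow-mask
# feature map (carrying intensity, softness, color, and direction cues) is
# downsized, projected by an adapter into the text-token embedding space, and
# injected into the denoiser through cross-attention. All names and sizes are
# illustrative assumptions, not the patented implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShadowAdapter(nn.Module):
    def __init__(self, feat_channels, token_dim, grid=16):
        super().__init__()
        self.grid = grid
        # Linear projection aligning shadow features with text tokens.
        self.proj = nn.Linear(feat_channels, token_dim)

    def forward(self, shadow_feat_map):
        # Resize the feature map to a smaller, fixed grid (claim 13).
        f = F.interpolate(shadow_feat_map, size=(self.grid, self.grid),
                          mode="bilinear", align_corners=False)
        # Flatten spatial positions into a token sequence (claim 14).
        tokens = f.flatten(2).transpose(1, 2)   # (B, grid*grid, C)
        return self.proj(tokens)                # (B, grid*grid, token_dim)

# Cross-attention inside one denoising step: latent queries attend to the
# concatenated text tokens and shadow tokens.
adapter = ShadowAdapter(feat_channels=320, token_dim=768)
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

latents = torch.randn(1, 64 * 64, 768)          # denoiser queries
text_tokens = torch.randn(1, 77, 768)           # e.g., text-encoder embeddings
shadow_tokens = adapter(torch.randn(1, 320, 64, 64))
cond = torch.cat([text_tokens, shadow_tokens], dim=1)
out, _ = attn(latents, cond, cond)              # conditioned latent update
```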