US-20260127719-A1 - 3D-CONSISTENT IMAGE INPAINTING WITH DIFFUSION MODELS
Abstract
The present disclosure relates to image editing and inpainting techniques leveraging a generator model conditionally trained on one or more in-context images during a reverse diffusion process. At inference, the generator model performs inpainting of an image by accessing a set of images, varying in context, that depict the same or a similar scene. A masked version of the image may be generated by obscuring portions of the image using a masking technique. After masking, a noisy image may be generated by iteratively introducing noise to the masked version of the image based on a noise schedule. The noisy image may act as a starting point for the subsequent reverse process, which leverages the generator model configured to receive an iterated version of the image and the one or more in-context images. Based on the generator model, a transformed version of the image may be generated by iteratively denoising the noisy image.
Inventors
- Boris Chidlovskii
- Leonid Antsfeld
Assignees
- NAVER CORPORATION
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-05
Claims (20)
- 1. A computer-implemented method for image editing including: accessing a set of images, wherein each image of the set of images depicts a scene; generating a masked version of an image of the set of images that obscures or removes one or more portions of the image by a masking technique; at each timestep of a plurality of timesteps: generating a noisy image by introducing noise to the masked version of the image based on a noise schedule that defines an amount of the noise added at each timestep of the plurality of timesteps; and generating a transformed version of the image by denoising the noisy image based on the noise schedule leveraging a generator model, wherein the generator model is configured to receive an iterated version of the image and one or more in-context images of the set of images, and wherein the one or more portions of the masked version of the image are iteratively transformed by the generator model using the one or more in-context images of the set of images; and outputting the transformed version of the image.
- 2. The computer-implemented method of claim 1, further including: segmenting each image of the set of images independently into a set of patches that are equally sized and non-overlapping.
- 3. The computer-implemented method of claim 1, wherein the generator model includes: an encoder comprising one or more encoder-transformers, including a self-attention layer, configured to generate an encoded representation by processing a set of patches associated with the noisy image and to simultaneously generate one or more encoded representations by processing one or more sets of patches associated with the one or more in-context images, wherein the encoder shares weights across the noisy image and the one or more in-context images; and a decoder comprising one or more decoder-transformers, including the self-attention layer and a cross-attention layer, configured to process the encoded representation and the one or more encoded representations to generate the iterated version of the image.
- 4. The computer-implemented method of claim 1, wherein the masking technique includes random masking or semantic masking that obscures the one or more portions of the image that are predicted to correspond to any of one or more types of predefined depictions.
- 5. The computer-implemented method of claim 1, wherein the noise schedule modulates a frequency of timesteps of the plurality of timesteps during the denoising, based on an importance sampling technique that dynamically determines the amount of noise to introduce at specific timesteps of the plurality of timesteps based on predefined criteria involving the iterated version of the image.
- 6. The computer-implemented method of claim 1, wherein the noise schedule is generated from a Laplace distribution.
- 7. The computer-implemented method of claim 1, wherein the noise that is iteratively introduced to the masked version of the image has a Gaussian distribution.
- 8. The computer-implemented method of claim 1, wherein the scene of the masked version of the image is the same as the scene of the one or more in-context images of the set of images.
- 9. The computer-implemented method of claim 1, wherein the generator model is conditionally trained on one or more in-context images during a reverse diffusion process to generate a less noisy image from an intermediate noisy image.
- 10. The computer-implemented method of claim 1, wherein the method includes inpainting, and wherein the one or more portions of the masked version of the image are transformed by being reconstructed by the generator model using the one or more in-context images of the set of images.
- 11. The computer-implemented method of claim 1, wherein accessing the set of images is in response to a user input, and wherein outputting the transformed version of the image is to a display of a computing system.
- 12. A system comprising: one or more data processors; and a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform a set of operations including: accessing a set of images, wherein each image of the set of images depicts a scene; generating a masked version of an image of the set of images that obscures or removes one or more portions of the image by a masking technique; at each timestep of a plurality of timesteps: generating a noisy image by introducing noise to the masked version of the image based on a noise schedule that defines an amount of the noise added at each timestep of the plurality of timesteps; and generating a transformed version of the image by denoising the noisy image based on the noise schedule leveraging a generator model, wherein the generator model is configured to receive an iterated version of the image and one or more in-context images of the set of images, and wherein the one or more portions of the masked version of the image are iteratively transformed by the generator model using the one or more in-context images of the set of images; and outputting the transformed version of the image.
- 13. The system of claim 12, wherein the set of operations further includes: segmenting each image of the set of images into a set of patches that are equally sized and non-overlapping.
- 14. The system of claim 12, wherein the generator model includes: an encoder comprising one or more encoder-transformers, including a self-attention layer, configured to generate an encoded representation by processing a set of patches associated with the noisy image and to simultaneously generate one or more encoded representations by processing sets of patches associated with the one or more in-context images, wherein the encoder shares weights across the noisy image and the one or more in-context images; and a decoder comprising one or more decoder-transformers, including the self-attention layer and a cross-attention layer, configured to process the encoded representation and the one or more encoded representations to generate the iterated version of the image.
- 15. The system of claim 12, wherein the masking technique includes random masking or semantic masking that obscures the one or more portions of the image that are predicted to correspond to any of one or more types of predefined depictions.
- 16. The system of claim 12, wherein the noise schedule modulates a frequency of timesteps of the plurality of timesteps during the denoising, based on an importance sampling technique that dynamically determines the amount of noise to be introduced at specific timesteps of the plurality of timesteps based on predefined criteria involving the iterated version of the image, and wherein the noise schedule is generated from a Laplace distribution.
- 17. The system of claim 12, wherein the noise has a Gaussian distribution.
- 18. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform a set of operations comprising: accessing a set of images, wherein each image of the set of images depicts a scene; generating a masked version of an image of the set of images that obscures or removes one or more portions of the image by a masking technique; at each timestep of a plurality of timesteps: generating a noisy image by introducing noise to the masked version of the image based on a noise schedule that defines an amount of the noise added at each timestep of the plurality of timesteps; and generating a transformed version of the image by denoising the noisy image based on the noise schedule leveraging a generator model, wherein the generator model is configured to receive an iterated version of the image and one or more in-context images of the set of images, and wherein the one or more portions of the masked version of the image are iteratively transformed by the generator model using the one or more in-context images of the set of images; and outputting the transformed version of the image.
- 19. The computer-program product of claim 18, wherein the generator model includes: an encoder comprising one or more encoder-transformers, including a self-attention layer, configured to generate an encoded representation by processing a set of patches associated with the noisy image and to simultaneously generate one or more encoded representations by processing sets of patches associated with the one or more in-context images, wherein the encoder shares weights across the noisy image and the one or more in-context images; and a decoder comprising a series of decoder-transformers, including the self-attention layer and a cross-attention layer, configured to process the encoded representation and the one or more encoded representations to generate the iterated version of the image.
- 20. The computer-program product of claim 18, wherein the noise schedule modulates a frequency of timesteps of the plurality of timesteps during the denoising, based on an importance sampling technique that dynamically determines the amount of noise to introduce at specific timesteps of the plurality of timesteps based on predefined criteria involving the iterated version of the image, and wherein the noise schedule is generated from a Laplace distribution.
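Claims 3, 14, and 19 recite an encoder that shares weights across the noisy image and the in-context images, and a decoder that combines self-attention with cross-attention over the context encodings. As an illustration only (the claims do not specify dimensions, head counts, or parameterization, so all sizes and weight initializations below are hypothetical), a minimal single-head NumPy sketch of such a shared-weight encoder and cross-attention decoder might look like:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

class SharedEncoder:
    """Self-attention encoder applied with the SAME weights to the noisy
    image's patch tokens and to each in-context image's patch tokens."""
    def __init__(self, d, rng):
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d)
                                     for _ in range(3))

    def __call__(self, patches):                      # patches: (n_patches, d)
        return attention(patches @ self.Wq, patches @ self.Wk, patches @ self.Wv)

class Decoder:
    """Self-attention over the noisy-image encoding, then cross-attention
    against the concatenated in-context encodings."""
    def __init__(self, d, rng):
        self.self_w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
        self.cross_w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

    def __call__(self, enc, ctx_encs):
        Wq, Wk, Wv = self.self_w
        h = attention(enc @ Wq, enc @ Wk, enc @ Wv)   # self-attention layer
        ctx = np.concatenate(ctx_encs, axis=0)        # pool all context patches
        Wq, Wk, Wv = self.cross_w
        return attention(h @ Wq, ctx @ Wk, ctx @ Wv)  # cross-attention layer

rng = np.random.default_rng(0)
d, n = 16, 64                                         # hypothetical sizes
encoder, decoder = SharedEncoder(d, rng), Decoder(d, rng)
noisy = rng.standard_normal((n, d))                   # noisy-image patch tokens
contexts = [rng.standard_normal((n, d)) for _ in range(3)]

enc = encoder(noisy)                                  # shared weights: the same
ctx_encs = [encoder(c) for c in contexts]             # encoder processes context
out = decoder(enc, ctx_encs)                          # iterated image tokens
```

The property mirrored from the claims is weight sharing: the identical `encoder` object processes both the noisy image and every in-context image, while the decoder's cross-attention layer is what lets masked patches draw content from the context views.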
Description
BACKGROUND
Image inpainting is a digital image processing technique that reconstructs or fills in missing, damaged, or distorted parts of an image, restoring the image to a visually plausible state such that the inpainted areas look seamless and natural. Inpainting techniques may find application in various fields, including photo editing, image restoration, object removal, and forensic analysis, where recovery or preservation of visual integrity may be a concern. The inpainting process may involve masking specific portions of the image, designating areas for restoration where the reconstruction of content is to be performed. Regardless of the technique used, successful inpainting may involve semantic consistency and visual harmony of the generated or reconstructed content with the surrounding elements of the image. Therefore, inpainting techniques may analyze the surrounding pixel information, predicting what the obscured content should look like in order to reconstruct the damaged portions of the image. However, without sufficient contextual understanding, the reconstruction may suffer from inaccuracies, leading to visually inconsistent results or artifacts that may disrupt the overall coherence of the image.
Additionally, inpainting techniques may face several other challenges, particularly when masking occludes significant portions of an image. Models that are trained on specific types of masks may exhibit limited generalization capabilities when given different masking configurations, which can hinder their effectiveness in real-world applications. Achieving three-dimensional (3D) consistency and a natural blend between the inpainted regions and the surrounding pixels may be a concern, particularly in images with intricate details or textures. Inpainting techniques may often face difficulties in grasping the contextual and semantic information of a scene, which can result in unrealistic outcomes.
Similarly, each environment setting may present particular visual cues and spatial relationships, and may require accounting for depth and geometry to produce realistic results, all of which influence effective inpainting. For example, variations in training datasets comprising different environments, such as indoor and outdoor scenes, may complicate the inpainting process. Models trained on particular contexts, environment settings, or mask distributions may encounter difficulties in generalization when faced with unfamiliar scenarios, potentially leading to suboptimal inpainting performance.
SUMMARY
Certain aspects and features of the present disclosure relate to image inpainting techniques leveraging a denoising diffusion probabilistic model (DDPM), referred to herein as the generator model, trained by conditioning on one or more in-context images. The generator model may utilize a diffusion process that encompasses a forward diffusion process, which may incrementally add noise to a base image over multiple timesteps, and a reverse diffusion process, in which the generator model may learn to iteratively denoise the base image by taking guidance from the visible content provided by the one or more in-context images. During inference, the generator model may perform inpainting of an image by accessing a set of images including one or more in-context images. Each image of the set of images may depict the same or a similar scene with variation in context, such as camera poses, camera angles, time of day, weather conditions, or other dynamics. A masked version of the image may be generated by obscuring or removing one or more portions of the image by applying a masking technique. After masking, a noisy image may be generated in the forward diffusion process by iteratively introducing noise to the masked version of the image based on a noise schedule.
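The masking step above, combined with the equally sized, non-overlapping patches of claim 2 and the random masking of claim 4, can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the patch size, mask ratio, and zero fill value are arbitrary choices, and `random_patch_mask` is a hypothetical helper name.

```python
import numpy as np

def random_patch_mask(image, patch=16, ratio=0.5, rng=None):
    """Obscure a random fraction of non-overlapping `patch` x `patch` blocks.

    Returns the masked image and the boolean patch-grid mask (True = obscured).
    Illustrative only; the disclosure leaves patch size, mask ratio, and fill
    value unspecified, and also contemplates semantic (not random) masking.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch                 # patch-grid dimensions
    n_mask = int(round(ratio * gh * gw))            # number of patches to obscure
    flat = np.zeros(gh * gw, dtype=bool)
    flat[rng.choice(gh * gw, size=n_mask, replace=False)] = True
    mask = flat.reshape(gh, gw)

    masked = image.copy()
    for i, j in zip(*np.nonzero(mask)):             # zero out each chosen block
        masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return masked, mask

img = np.ones((64, 64))
masked, mask = random_patch_mask(img, patch=16, ratio=0.5,
                                 rng=np.random.default_rng(0))
```

The masked image produced here is what the forward diffusion process would then progressively noise, while the mask itself identifies the portions the generator model must reconstruct from the in-context views.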
The noise schedule may comprise the multiple timesteps, where at each timestep an amount of the noise to be added (or a noise variance) may be determined. For example, the noise may be added to the masked version of the image in gradual timesteps that are defined by the noise schedule until a completely noisy image is obtained. The noise may be sampled from various noise distributions, including a Gaussian, Laplace, or uniform distribution. In one aspect of the present disclosure, a Gaussian noise distribution is used for generating the noisy image. The noisy or fully noisy image may act as a starting point for the subsequent reverse diffusion process, which leverages the generator model configured to receive an iterated version of the image and the one or more in-context images. Based on the noise schedule, a transformed version of the image may be generated during the reverse diffusion process by iteratively denoising the noisy image using the generator model. The transformed image may be output, depicting a denoised and inpainted version of the image, where the one or more masked portions are reconstructed to align seamlessly with the surrounding non-masked areas. In some aspects
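In the standard DDPM formulation, iteratively adding Gaussian noise per such a schedule has the closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with eps ~ N(0, I), where abar_t is the cumulative product of the per-timestep signal-retention factors. A minimal sketch, using a common linear beta-schedule purely for concreteness (the disclosure also contemplates schedules generated from a Laplace distribution, and the endpoint values 1e-4 and 0.02 are assumptions, not from the source):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # per-timestep noise amounts
alphas_bar = np.cumprod(1.0 - betas)        # cumulative signal retention abar_t

def q_sample(x0, t, rng):
    """Forward diffusion: noise x0 directly to timestep t (0-indexed).

    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I),
    the closed form of iteratively adding Gaussian noise per the schedule.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))          # e.g. the masked image, normalized
x_mid = q_sample(x0, 200, rng)              # partially noised
x_T = q_sample(x0, T - 1, rng)              # nearly pure Gaussian noise
```

The fully noised `x_T` is the starting point of the reverse diffusion process, which would repeatedly apply the generator model, conditioned on the in-context images, to strip the noise back out according to the same schedule.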