
US-20260127789-A1 - GENERATION OF REALISTIC IMAGES BY GENERATIVE MACHINE LEARNING MODELS

US20260127789A1

Abstract

A method for improving the conformity of output images produced by a generative image-to-image machine learning model (GMLM), with the domain and/or distribution to which a given input image belongs. The method includes: processing, by the GMLM, at least one input image into one or more output images; comparing, by a predetermined similarity measure, the one or more output images produced from the input image to the input image; and based on the result of this comparison: optimizing one or more parameters that influence the behavior of the GMLM towards the goal of making subsequent output images produced from the input image more similar to the input image; and/or modifying at least a portion of at least one output image towards the goal of making this output image more similar to the input image.
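The closed loop described in the abstract can be sketched as follows. This is a minimal, self-contained illustration only: the stand-in generative model, the pixel-level similarity measure, and the tuned behavior parameter (here called "strength") are all hypothetical placeholders, not the actual GMLM, measure, or parameters of the claimed method.

```python
import random

def fake_gmlm(image, strength, seed):
    # Stand-in generative model: perturbs each pixel value; a higher
    # `strength` means a larger deviation from the input image.
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.uniform(-strength, strength)))
            for p in image]

def similarity(a, b):
    # Toy similarity measure: 1 minus the mean absolute pixel difference.
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def generate_conforming(image, threshold=0.9, strength=0.5, max_iters=10):
    # Option (i) of the method: re-optimize a parameter that influences
    # the model's behavior until the output meets the similarity criterion.
    for i in range(max_iters):
        out = fake_gmlm(image, strength, seed=i)
        if similarity(out, image) >= threshold:
            return out, strength
        strength *= 0.5  # steer subsequent outputs closer to the input
    return out, strength

image = [0.2, 0.5, 0.8, 0.4]
out, final_strength = generate_conforming(image)
```

Because each perturbation is bounded by `strength`, halving the parameter guarantees that the similarity criterion is eventually met; a real optimizer over diffusion-model parameters would of course be far less direct.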

Inventors

  • Koustav Mullick
  • Yoel Shapiro

Assignees

  • ROBERT BOSCH GMBH

Dates

Publication Date
2026-05-07
Application Date
2025-10-17
Priority Date
2024-11-04

Claims (17)

  1. A method for improving conformity of output images produced by a generative image-to-image machine learning model (GMLM), with a domain and/or distribution to which a given input image belongs, the method comprising the following steps: processing, by the GMLM, at least one input image into one or more output images; comparing, by a predetermined similarity measure, the one or more output images produced by the processing from the input image to the input image; and based on a result of the comparison: (i) optimizing one or more parameters that influence a behavior of the GMLM towards a goal of making subsequent output images produced from the input image more similar to the input image, and/or (ii) modifying at least a portion of at least one output image towards a goal of making the output image more similar to the input image, wherein the modifying includes, when the input and output images are divided into patches and/or object instances and/or features, and the similarity measure is computed with respect to individual patches and/or individual object instances and/or individual features: in response to determining that a similarity with respect to a particular patch and/or a particular object instance and/or a particular feature meets a predetermined criterion, amending and/or replacing the particular patch and/or the particular object instance and/or the particular feature with content from at least one alternate image source.
  2. The method of claim 1, wherein: the GMLM includes a neural network with a plurality of neurons or other processing units, inputs to each neuron or other processing unit are weighted with weights and are summed in a weighted sum to form an activation of the neuron or other processing unit, and at least a portion of the weights remain frozen when optimizing the one or more parameters that influence the behavior of the GMLM.
  3. The method of claim 2, wherein at least 80% of the weights remain frozen when optimizing the one or more parameters that influence the behavior of the GMLM.
  4. The method of claim 1, wherein the one or more parameters that influence the behavior of the GMLM and that are optimized include one or more of: a desired degree of adherence of the output image to an input image and/or to a text prompt from which the output image is generated; a number of iterations including de-noising steps of a diffusion model to be performed by the GMLM; an algorithm that rates an outcome of each iteration of the GMLM and adapts a next iteration accordingly; a desired style of the output image; and a text prompt that supplements the input image.
  5. The method of claim 1, wherein at least one calibration image that is known to be realistic with respect to a given use case is chosen as the input image.
  6. The method of claim 1, wherein: the input and output images are divided into patches and/or object instances and/or features, and the similarity measure is computed with respect to individual patches and/or individual object instances and/or individual features.
  7. The method of claim 6, wherein multiple values of the similarity measure computed: for the individual patches and/or the individual object instances and/or the individual features and/or for the image as a whole, are aggregated to form an overall rating of the similarity of patches and/or object instances and/or features and/or the image as a whole.
  8. The method of claim 7, wherein the aggregating of individual similarity values includes one or more of: multiplying the individual similarity values; forming a linear combination of the similarity values; selecting a best one of the individual similarity values; and selecting a worst one of the individual similarity values.
  9. The method of claim 6, wherein the dividing into the object instances and/or features is performed using ground truth that is available regarding a presence of object instances and/or features in the input image.
  10. The method of claim 1, wherein the alternate image source includes one or more of: the output produced by a further machine learning model from the same input image; and the input image.
  11. The method of claim 1, wherein a simulated image of a given scenery is chosen as the input image.
  12. The method of claim 1, wherein the predetermined similarity measure is chosen to combine vectorial embeddings from multiple machine learning models in one common space.
  13. The method of claim 1, further comprising: manufacturing a physical product, and/or setting up a physical scenery, according to an output image obtained from the GMLM, or a modified version of the output image obtained from the GMLM.
  14. The method of claim 1, further comprising: training an image processing machine learning model towards a given task using as training images: one or more output images from the GMLM or modified versions of the one or more output images from the GMLM.
  15. The method of claim 14, further comprising: processing, by the trained image processing machine learning model, one or more images recorded by at least one sensor; computing, from output of the trained image processing machine learning model, an actuation signal; and actuating, with the actuation signal, a vehicle and/or a driving assistance system and/or a robot and/or a quality inspection system and/or a surveillance system and/or a medical imaging system.
  16. A non-transitory computer-readable data carrier on which is stored a computer program including machine-readable instructions for improving conformity of output images produced by a generative image-to-image machine learning model (GMLM), with a domain and/or distribution to which a given input image belongs, the instructions, when executed by one or more computers and/or compute instances, causing the one or more computers and/or compute instances to perform the following steps: processing, by the GMLM, at least one input image into one or more output images; comparing, by a predetermined similarity measure, the one or more output images produced by the processing from the input image to the input image; and based on a result of the comparison: (i) optimizing one or more parameters that influence a behavior of the GMLM towards a goal of making subsequent output images produced from the input image more similar to the input image, and/or (ii) modifying at least a portion of at least one output image towards a goal of making the output image more similar to the input image, wherein the modifying includes, when the input and output images are divided into patches and/or object instances and/or features, and the similarity measure is computed with respect to individual patches and/or individual object instances and/or individual features: in response to determining that a similarity with respect to a particular patch and/or a particular object instance and/or a particular feature meets a predetermined criterion, amending and/or replacing the particular patch and/or the particular object instance and/or the particular feature with content from at least one alternate image source.
  17. One or more computers and/or compute instances with a non-transitory computer-readable data carrier on which is stored a computer program including machine-readable instructions for improving conformity of output images produced by a generative image-to-image machine learning model (GMLM), with a domain and/or distribution to which a given input image belongs, the instructions, when executed by the one or more computers and/or compute instances, causing the one or more computers and/or compute instances to perform the following steps: processing, by the GMLM, at least one input image into one or more output images; comparing, by a predetermined similarity measure, the one or more output images produced by the processing from the input image to the input image; and based on a result of the comparison: (i) optimizing one or more parameters that influence a behavior of the GMLM towards a goal of making subsequent output images produced from the input image more similar to the input image, and/or (ii) modifying at least a portion of at least one output image towards a goal of making the output image more similar to the input image, wherein the modifying includes, when the input and output images are divided into patches and/or object instances and/or features, and the similarity measure is computed with respect to individual patches and/or individual object instances and/or individual features: in response to determining that a similarity with respect to a particular patch and/or a particular object instance and/or a particular feature meets a predetermined criterion, amending and/or replacing the particular patch and/or the particular object instance and/or the particular feature with content from at least one alternate image source.
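Claims 1, 6, 7 and 8 together describe dividing images into patches, computing per-patch similarity, aggregating the individual values, and replacing patches that fail the criterion with content from an alternate source. The following sketch illustrates these operations on toy 1-D "patches" with a mean-absolute-difference similarity; both the patch representation and the measure are illustrative assumptions, not the claimed similarity measure.

```python
def patch_similarity(p_out, p_in):
    # Toy per-patch similarity: 1 minus the mean absolute difference.
    return 1.0 - sum(abs(a - b) for a, b in zip(p_out, p_in)) / len(p_out)

def aggregate(sims, mode="linear", weights=None):
    # Claim 8: combine individual similarity values into an overall rating.
    if mode == "product":
        result = 1.0
        for s in sims:
            result *= s
        return result
    if mode == "linear":
        w = weights or [1.0 / len(sims)] * len(sims)
        return sum(wi * si for wi, si in zip(w, sims))
    if mode == "best":
        return max(sims)
    if mode == "worst":
        return min(sims)
    raise ValueError(f"unknown aggregation mode: {mode}")

def repair_patches(out_patches, in_patches, threshold=0.8):
    # Claim 1, option (ii): replace patches whose similarity fails the
    # criterion with content from an alternate source (here: the input image,
    # one of the alternate sources listed in claim 10).
    return [p_in if patch_similarity(p_out, p_in) < threshold else p_out
            for p_out, p_in in zip(out_patches, in_patches)]

inp = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]
out = [[0.1, 0.25], [0.9, 0.1], [0.85, 0.8]]
sims = [patch_similarity(o, i) for o, i in zip(out, inp)]
repaired = repair_patches(out, inp)  # only the middle patch is replaced
```

In practice the "patches" could equally be object instances or feature maps, and the replacement source could be the output of a further model, as claim 10 provides.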

Description

CROSS REFERENCE

The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 24 21 0638.3, filed on Nov. 4, 2024, which is expressly incorporated herein by reference in its entirety.

FIELD

The present invention relates to the generation of realistic images by generative machine learning models. For example, these generated images may be used as training images for training a downstream machine learning model towards a given task.

BACKGROUND INFORMATION

The training of image processing machine learning models towards a given task requires a large set of training images, and these training images need to be acquired somehow. If the training is supervised, each training image needs to be labelled with the "ground truth" that the image processing machine learning model should ideally produce when given the respective training image. Training images are therefore a scarce resource. In particular, it is difficult to achieve sufficient variability in the set of training images, so that this set also covers situations that occur rarely but nonetheless need to be handled correctly.

Generative image-to-image machine learning models are therefore used to augment the set of available training images. If a generated image is basically a variation of a training image for which a ground truth label is known, then the generated image may be used as a new, different training image, and the ground truth label may be re-used. However, the generated image should be free from added "hallucinations" or other artifacts that have no correspondence in the ground truth labels.

SUMMARY

The present invention provides a method for improving the conformity of output images produced by a generative image-to-image machine learning model, GMLM, with the domain and/or distribution to which a given input image belongs.
In particular, this domain and/or distribution may relate to the semantic content of the input image, and/or to the rendering of this semantic content into the input image. For example, images of sceneries in the environment of a vehicle and/or robot may belong to different domains and/or distributions depending on the compositions of object instances therein, and also depending on generic conditions of the respective sceneries. For example, images acquired in fine-weather conditions on a sunny day may be considered to belong to one domain and/or distribution, and images acquired at nighttime, and/or in other poor-visibility conditions such as rain, fog or snow, may be considered to belong to another domain and/or distribution. One and the same image may belong to multiple domains and/or distributions. For example, the image may belong to a first domain and/or distribution by virtue of the composition of object instances therein, and it may belong to a second domain and/or distribution by virtue of the weather conditions in which it was taken. In particular, the GMLM may be trained to generate, from an input image that is in a source domain and/or distribution with respect to at least one property (such as object composition or weather conditions), an output image that is in a different target domain and/or distribution with respect to this property. In one example, the GMLM may be trained to generate, from an input image taken in fine-visibility conditions, an output image that looks as if it has been taken in poorer-visibility conditions, but otherwise still resembles the input image. In particular, the semantic content of the output image may still be substantially the same as the semantic content of the input image. That is, the GMLM may be used to perform a controlled domain transfer of the input image. 
Compared to domain transfer with a generative adversarial network, GAN, the advantage is that there is more control over whether "ground truth" labels for the input image are re-usable for the output image.

According to an example embodiment of the present invention, in the course of the method, at least one input image is processed into one or more output images by the GMLM. For example, if the GMLM is a diffusion model, each such processing may start from a version of the image that has been corrupted with a different noise sample, e.g., represented by different "seeds" from which the processing starts. In this manner, repeated processing of one and the same input image may produce different output images.

The one or more output images produced from the input image are compared to the input image by a predetermined similarity measure. In particular, this similarity measure may be specific to the application at hand and measure which properties in the output image should somehow adhere to the respective properties of the input image. In one example, the similarity measure may measure whether the output image has a semantic content that is substantially the same as the semantic content of the input image. The similarity measure may be computed based on