EP-4738265-A1 - IMPROVING THE GENERATION OF REALISTIC IMAGES BY GENERATIVE MACHINE LEARNING MODELS
Abstract
A method (100) for improving the conformity of output images (3) produced by a generative image-to-image machine learning model, GMLM (2), with the domain and/or distribution to which a given input image (1) belongs, comprising the steps of: • processing (110), by the GMLM (2), at least one input image (1) into one or more output images (3); • comparing (120), by a predetermined similarity measure (4), the one or more output images (3) produced from the input image (1) to the input image (1); and based on the result (4a) of this comparison: • optimizing (130) one or more parameters (2a) that influence the behavior of the GMLM (2) towards the goal of making subsequent output images (3) produced from the input image (1) more similar to the input image (1); and/or • modifying (140) at least a portion of at least one output image (3) towards the goal of making this output image (3) more similar to the input image (1).
Inventors
- SHAPIRO, YOEL
- Mullick, Koustav
Assignees
- Robert Bosch GmbH
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2024-11-04
Claims (19)
- A method (100) for improving the conformity of output images (3) produced by a generative image-to-image machine learning model, GMLM (2), with the domain and/or distribution to which a given input image (1) belongs, comprising the steps of: • processing (110), by the GMLM (2), at least one input image (1) into one or more output images (3); • comparing (120), by a predetermined similarity measure (4), the one or more output images (3) produced from the input image (1) to the input image (1); and based on the result (4a) of this comparison: • optimizing (130) one or more parameters (2a) that influence the behavior of the GMLM (2) towards the goal of making subsequent output images (3) produced from the input image (1) more similar to the input image (1); and/or • modifying (140) at least a portion of at least one output image (3) towards the goal of making this output image (3) more similar to the input image (1).
- The method (100) of claim 1, wherein • the GMLM (2) comprises a neural network with a plurality of neurons or other processing units, • the inputs to each neuron are weighted with weights and thereby summed in a weighted sum to form an activation of the respective neuron or other processing unit, and • at least a portion of these weights remain frozen (131) when optimizing the one or more parameters that influence the behavior of the GMLM (2).
- The method (100) of claim 2, wherein at least 80 % of the weights, preferably all of the weights, remain frozen (131a) when optimizing the one or more parameters that influence the behavior of the GMLM (2).
- The method (100) of any one of claims 1 to 3, wherein the parameters (2a) that influence the behavior of the GMLM (2) and that are optimized comprise (132) one or more of: • a desired degree of adherence of the output image (3) to an input image (1), and/or to a text prompt, from which it is generated; • a number of iterations, such as de-noising steps of a diffusion model, to be performed by the GMLM (2); • an algorithm that rates the outcome of each iteration of the GMLM (2) and adapts the next iteration accordingly; • a desired style of the output image (3); and • a text prompt that supplements the input image (1).
- The method (100) of any one of claims 1 to 4, wherein at least one calibration image that is known to be realistic with respect to a given use case is chosen (111) as an input image (1).
- The method (100) of any one of claims 1 to 5, wherein • the input (1) and output (3) images are divided (121) into patches, object instances and/or features (1a, 3a); and • the similarity measure (4) is computed (122) with respect to individual patches, object instances and/or features (1a, 3a).
- The method (100) of claim 6, wherein multiple values (4a) of the similarity measure (4) computed for individual patches, object instances and/or features (1a, 3a), and/or for the image (1, 3) as a whole, are aggregated (123) to form an overall rating of the similarity of patches, object instances, features (1a, 3a), and/or the image (1, 3) as a whole.
- The method (100) of claim 7, wherein the aggregating of individual similarity values (4a) comprises (123a) one or more of: • multiplying the individual similarity values (4a); • forming a linear combination of the similarity values (4a); • selecting the best of the individual similarity values (4a); and • selecting the worst of the individual similarity values (4a).
- The method (100) of any one of claims 6 to 8, wherein the dividing into object instances and/or features is performed (121a) using ground truth that is available regarding the presence of object instances and/or features in the input image (1).
- The method (100) of any one of claims 6 to 9, wherein the modifying of the output image (3) comprises: in response to determining (141) that the similarity (4a) with respect to a particular patch, object instance and/or feature (1a, 3a) meets a predetermined criterion, amending and/or replacing (142) this patch, object instance and/or feature (1a, 3a) with content from at least one alternate image source (5).
- The method (100) of claim 10, wherein the alternate image source (5) comprises (142a) one or more of: • the output produced by a further machine learning model from the same input image (1); and • the input image (1).
- The method (100) of any one of claims 1 to 11, wherein a simulated image of a given scenery is chosen (112) as the input image (1).
- The method (100) of any one of claims 1 to 12, wherein the given similarity measure (4) is chosen (124) to combine vectorial embeddings from multiple machine learning models in one common space.
- The method (100) of any one of claims 1 to 13, further comprising: manufacturing (150) a physical product, and/or setting up a physical scenery, according to an output image (3) obtained from the GMLM (2), or a modified version (3') of such an output image (3).
- The method (100) of any one of claims 1 to 14, further comprising: training (160) an image processing machine learning model (6) towards a given task using one or more output images (3) from the GMLM (2), or modified versions (3') of these output images (3), as training images.
- The method (100) of claim 15, further comprising: • processing (170), by the trained image processing machine learning model (6*), one or more images (7) recorded by at least one sensor (8); • computing (180), from the output (9) of the trained image processing machine learning model (6*), an actuation signal (180a); and • actuating (190) a vehicle (50), a driving assistance system (51), a robot (60), a quality inspection system (70), a surveillance system (80), and/or a medical imaging system (90), with the actuation signal (180a).
- A computer program, comprising machine-readable instructions that, when executed by one or more computers and/or compute instances, cause the one or more computers and/or compute instances to perform the method (100) of any one of claims 1 to 16.
- A non-transitory computer-readable data carrier, and/or a download product, with the computer program of claim 17.
- One or more computers and/or compute instances with the computer program of claim 17, and/or with the non-transitory computer-readable data carrier and/or with the download product of claim 18.
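The aggregation options recited in claim 8 (multiplying, forming a linear combination, selecting the best, selecting the worst) can be sketched as follows. This is an illustrative toy implementation, not part of the claimed subject matter; per-patch similarity values are assumed to be given as a list of floats in [0, 1].

```python
# Sketch of the similarity-aggregation options of claim 8.
# The function and mode names are illustrative, not from the patent.
import math

def aggregate(similarities, mode="mean"):
    """Combine per-patch/per-instance similarity values into one rating."""
    if mode == "product":   # multiplying the individual similarity values
        return math.prod(similarities)
    if mode == "mean":      # a linear combination with equal weights
        return sum(similarities) / len(similarities)
    if mode == "best":      # selecting the best individual value
        return max(similarities)
    if mode == "worst":     # selecting the worst individual value
        return min(similarities)
    raise ValueError(f"unknown mode: {mode}")

# Example: three patch similarities for one output image
vals = [0.9, 0.8, 0.5]
```

Selecting the worst value yields a conservative overall rating that flags an output image as soon as any single patch is dissimilar, whereas the mean tolerates isolated outliers.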
Description
The present invention relates to the generation of realistic images by generative machine learning models. For example, these generated images may be used as training images for training a downstream machine learning model towards a given task.

Background

The training of image processing machine learning models towards a given task requires a large set of training images, and these images need to be acquired somehow. If the training is a supervised training, each training image needs to be labelled with "ground truth" that the image processing machine learning model should ideally produce when given the respective training image. Training images are therefore a scarce resource. In particular, it is difficult to achieve sufficient variability in the set of training images, so that this set also covers situations that occur rarely but nonetheless need to be handled correctly.

Generative image-to-image machine learning models are therefore used to augment the set of available training images. If a generated image is basically a variation of a training image for which a ground truth label is known, the generated image may be used as a new, different training image, and the ground truth label may be re-used. However, the generated image should be free from added "hallucinations" or other artifacts that have no correspondence in the ground truth labels.

Disclosure of the invention

The invention provides a method for improving the conformity of output images produced by a generative image-to-image machine learning model, GMLM, with the domain and/or distribution to which a given input image belongs. In particular, this domain and/or distribution may relate to the semantic content of the input image, and/or to the rendering of this semantic content into the input image.
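The process-compare-optimize loop recited in claim 1 can be sketched as a minimal, fully runnable toy example. Images are reduced to lists of numbers, the GMLM is stubbed as a noisy identity function, and all names are illustrative stand-ins rather than the claimed implementation.

```python
# Toy sketch of steps 110-130: generate (110), compare by a similarity
# measure (120), and keep the parameter setting whose output is most
# similar to the input (130). `strength` stands in for an optimizable
# generation parameter such as the degree of adherence to the input.
import random

def similarity(a, b):
    # toy similarity measure: 1 / (1 + mean absolute difference)
    mad = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mad)

def gmlm(image, strength=1.0, seed=0):
    # stub generator: perturbs the input image; a smaller `strength`
    # yields an output closer to the input
    rng = random.Random(seed)
    return [x + strength * rng.uniform(-1, 1) for x in image]

def improve_conformity(image, strengths=(1.0, 0.5, 0.1), seed=0):
    # try several parameter settings and keep the most conformal output
    candidates = [gmlm(image, s, seed) for s in strengths]
    return max(candidates, key=lambda out: similarity(out, image))

img = [0.2, 0.5, 0.9]
best = improve_conformity(img)
```

In a real system the stub generator would be replaced by the GMLM, the toy similarity by the predetermined similarity measure of step 120, and the grid over `strengths` by whatever parameter optimization the application uses.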
For example, images of sceneries in the environment of a vehicle and/or robot may belong to different domains and/or distributions depending on the compositions of object instances therein, and also depending on generic conditions of the respective sceneries. For example, images acquired in fine-weather conditions on a sunny day may be considered to belong to one domain and/or distribution, and images acquired at nighttime, and/or in other poor-visibility conditions such as rain, fog or snow, may be considered to belong to another domain and/or distribution.

One and the same image may belong to multiple domains and/or distributions. For example, the image may belong to a first domain and/or distribution by virtue of the composition of object instances therein, and it may belong to a second domain and/or distribution by virtue of the weather conditions in which it was taken.

In particular, the GMLM may be trained to generate, from an input image that is in a source domain and/or distribution with respect to at least one property (such as object composition or weather conditions), an output image that is in a different target domain and/or distribution with respect to this property. In one example, the GMLM may be trained to generate, from an input image taken in fine-visibility conditions, an output image that looks as if it has been taken in poorer-visibility conditions, but otherwise still resembles the input image. In particular, the semantic content of the output image may still be substantially the same as the semantic content of the input image. That is, the GMLM may be used to perform a controlled domain transfer of the input image. Compared to domain transfer with a generative adversarial network, GAN, the advantage is that there is more control over whether "ground truth" labels for the input image are re-usable for the output image.

In the course of the method, at least one input image is processed into one or more output images by the GMLM.
For example, if the GMLM is a diffusion model, each such processing may start from a version of the image that has been corrupted with a different noise sample, e.g., represented by different "seeds" from which the processing starts. In this manner, repeated processing of one and the same input image may produce different output images.

The one or more output images produced from the input image are compared to the input image by a predetermined similarity measure. In particular, this similarity measure may be specific to the application at hand and measure which properties of the output image should adhere to the respective properties of the input image. In one example, the similarity measure may measure whether the output image has substantially the same semantic content as the input image.

The similarity measure may be computed based on one single output image, but it may also, for example, be computed based on multiple output images. For example, when computing multiple output images from one and the same input image, the respective similarities of the output images to the input image may be aggregated, e.g., averaged.
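The seed-dependent generation and averaged similarity described above can be sketched as follows. The generator and the embedding are toy stand-ins (a seeded perturbation and the identity embedding), not a real diffusion model; a practical system would use the GMLM's own sampling and, for example, learned image embeddings.

```python
# Sketch: generate several outputs from the same input with different
# noise seeds, then aggregate their similarities to the input by
# averaging. All functions are illustrative stand-ins.
import random

def embed(image):
    # hypothetical embedding; here simply the image itself (a number list)
    return image

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def generate(image, seed):
    # stand-in for a diffusion model started from a seed-dependent
    # noise sample; different seeds yield different outputs
    rng = random.Random(seed)
    return [x + rng.uniform(-0.1, 0.1) for x in image]

def aggregated_similarity(image, seeds):
    outputs = [generate(image, s) for s in seeds]
    sims = [cosine(embed(o), embed(image)) for o in outputs]
    return sum(sims) / len(sims)   # aggregate by averaging

img = [0.3, 0.6, 0.9]
score = aggregated_similarity(img, seeds=[0, 1, 2])
```

Because each seed deterministically fixes the noise sample, the same input and seed always reproduce the same output, while different seeds probe the variability of the GMLM around the input image.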