CN-121999066-A - Generating realistic images by means of generative machine learning models
Abstract
A method (100) for improving the conformity of an output image (3) generated by a generative image-to-image machine learning model, GMLM (2), with a domain and/or distribution to which a given input image (1) belongs, comprising the steps of: processing (110) at least one input image (1) by the GMLM (2) into one or more output images (3); comparing (120) the one or more output images (3) generated from the input image (1) with the input image (1) by means of a predetermined similarity measure (4); and, based on the result (4a) of this comparison, optimizing (130) one or more parameters (2a) affecting the behavior of the GMLM (2) towards the goal of making a subsequent output image (3) generated from the input image (1) more similar to the input image (1), and/or modifying (140) at least a part of at least one output image (3) towards the goal of making the output image (3) more similar to the input image (1).
Inventors
- K. Murik
- Y. Shapiro
Assignees
- Robert Bosch GmbH
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-11-03
- Priority Date: 2024-11-04
Claims (18)
- 1. A method (100) for improving the conformity of an output image (3) produced by a generative image-to-image machine learning model, GMLM (2), with a domain and/or distribution to which a given input image (1) belongs, comprising the steps of: processing (110) at least one input image (1) by said GMLM (2) into one or more output images (3); comparing (120) the one or more output images (3) generated from the input image (1) with the input image (1) by means of a predetermined similarity measure (4); and, based on the result (4a) of the comparison: optimizing (130) one or more parameters (2a) affecting the behavior of said GMLM (2) towards the goal of making a subsequent output image (3) generated from said input image (1) more similar to said input image (1), and/or modifying (140) at least a part of at least one output image (3) towards the goal of making the output image (3) more similar to the input image (1); wherein the input image (1) and the output image (3) are divided (121) into patches, object instances and/or features (1a, 3a), and the similarity measure (4) is calculated (122) with respect to individual patches, object instances and/or features (1a, 3a); and wherein the modifying (140) comprises, in response to determining (141) that the similarity (4a) with respect to a particular patch, object instance and/or feature (1a, 3a) fulfils a predetermined criterion, correcting and/or replacing (142) that patch, object instance and/or feature (1a, 3a) with content from at least one alternative image source (5).
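The following is a minimal, self-contained Python sketch of the loop this claim describes. All names (gmlm, similarity, guidance) are hypothetical stand-ins rather than the patent's terminology, and the model and similarity measure are toy placeholders used only to make the control flow of steps (110), (120), (130) and (140)/(142) concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def gmlm(image: np.ndarray, guidance: float) -> np.ndarray:
    """Stand-in for the generative image-to-image model (2): adds noise
    whose strength shrinks as the guidance parameter grows."""
    noise = rng.normal(scale=1.0 / max(guidance, 1e-6), size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Toy similarity measure (4): 1 - mean absolute pixel difference."""
    return 1.0 - float(np.abs(a - b).mean())

input_image = rng.random((64, 64))
guidance = 1.0                        # parameter (2a) steering the GMLM

for step in range(5):                 # optimizing (130)
    output = gmlm(input_image, guidance)
    if similarity(output, input_image) >= 0.9:
        break
    guidance *= 2.0                   # push toward stronger adherence

# modifying (140)/(142): replace patches (1a, 3a) that remain dissimilar
# with content from an alternative image source (5), here the input itself
patch = 16
for y in range(0, 64, patch):
    for x in range(0, 64, patch):
        sl = (slice(y, y + patch), slice(x, x + patch))
        if similarity(output[sl], input_image[sl]) < 0.9:
            output[sl] = input_image[sl]
```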
- 2. The method (100) of claim 1, wherein the GMLM (2) comprises a neural network having a plurality of neurons or other processing units, wherein the inputs of each neuron or other processing unit are weighted with weights and summed in a weighted sum to form the activation of the respective neuron or other processing unit, and wherein at least a portion of these weights remains frozen (131) while optimizing the one or more parameters (2a) affecting the behavior of the GMLM (2).
- 3. The method (100) according to claim 2, wherein at least 80%, preferably all, of the weights remain frozen (131a) when optimizing the one or more parameters affecting the behavior of the GMLM (2).
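A minimal PyTorch sketch of what claims 2 and 3 describe: the network weights stay frozen while a single external steering parameter is optimized against a dissimilarity loss. The tiny backbone, the guidance parameter, and the way it enters the forward pass are all illustrative assumptions, not the patent's architecture:

```python
import torch

model = torch.nn.Sequential(                 # stand-in for the GMLM backbone
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8, 3, 3, padding=1),
)
for p in model.parameters():                 # (131a): all weights frozen
    p.requires_grad_(False)

guidance = torch.nn.Parameter(torch.tensor(1.0))   # behavior parameter (2a)
optimizer = torch.optim.Adam([guidance], lr=1e-2)  # only this is optimized

x = torch.rand(1, 3, 32, 32)                 # stand-in input image (1)
for _ in range(10):
    y = model(x) * guidance                  # toy way the parameter steers output
    loss = (y - x).abs().mean()              # dissimilarity to the input image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```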
- 4. The method (100) according to any one of claims 1 to 3, wherein the parameters (2a) affecting the behavior of the GMLM (2) that are being optimized comprise (132) one or more of the following: a desired degree of adherence of the output image (3) to the input image (1) and/or to a text prompt from which the output image (3) is generated; a number of iterations to be performed by the GMLM (2), such as denoising steps of a diffusion model; an evaluation of the result of each iteration of said GMLM (2) and a corresponding adaptation of the algorithm for the next iteration; a desired style of the output image (3); and a supplement to the text prompt for the input image (1).
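Purely for illustration, these steerable parameters could be grouped in a configuration object as below; the field names and default values are hypothetical, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class GMLMControls:
    guidance_scale: float = 7.5   # desired adherence to input image / prompt
    num_iterations: int = 50      # e.g. denoising steps of a diffusion model
    adapt_each_step: bool = True  # evaluate each iteration, adapt the next
    target_style: str = "photo"   # desired style of the output image
    prompt_suffix: str = ""       # supplement to the input image's text prompt
```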
- 5. The method (100) according to any one of claims 1 to 4, wherein at least one calibration image, which is known to be realistic with respect to a given use case, is selected (111) as the input image (1).
- 6. The method (100) according to any one of claims 1 to 5, wherein the input image (1) and the output image (3) are divided (121) into patches, object instances and/or features (1a, 3a), and said similarity measure (4) is calculated (122) with respect to individual patches, object instances and/or features (1a, 3a).
- 7. The method (100) according to claim 6, wherein a plurality of values (4a) of the similarity measure (4), calculated for individual patches, object instances and/or features (1a, 3a) and/or for the image (1, 3) as a whole, are aggregated (123) into an overall assessment of the similarity of the patches, object instances and/or features (1a, 3a) and/or of the image (1, 3) as a whole.
- 8. The method (100) of claim 7, wherein the aggregation of the individual similarity values (4a) comprises (123a) one or more of: multiplying the individual similarity values (4a); forming a linear combination of the individual similarity values (4a); selecting the best of the individual similarity values (4a); and selecting the worst of the individual similarity values (4a).
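A minimal sketch of these four aggregation options, assuming the per-patch or per-instance similarity values have already been computed (the function name and the equal-weight choice for the linear combination are illustrative):

```python
import math

def aggregate(values: list[float], mode: str) -> float:
    if mode == "product":   # multiplying the individual values (4a)
        return math.prod(values)
    if mode == "linear":    # linear combination (equal weights here)
        return sum(values) / len(values)
    if mode == "best":      # best of the individual values
        return max(values)
    if mode == "worst":     # worst of the individual values
        return min(values)
    raise ValueError(mode)

scores = [0.92, 0.85, 0.99, 0.40]   # e.g. per-patch similarities
print(aggregate(scores, "worst"))   # 0.4: one bad patch dominates
```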
- 9. The method (100) according to any one of claims 6 to 8, wherein the division into object instances and/or features is performed (121a) using available ground truth regarding the presence of these object instances and/or features in the input image (1).
- 10. The method (100) according to any one of claims 1 to 9, wherein the alternative image source (5) comprises (142a) one or more of the following: an output generated from the same input image (1) by a further machine learning model; and the input image (1) itself.
- 11. The method (100) according to any one of claims 1 to 10, wherein a simulated image of a given scene is selected (112) as the input image (1).
- 12. The method (100) according to any one of claims 1 to 11, wherein the predetermined similarity measure (4) is chosen (124) to combine vector embeddings from multiple machine learning models in one common space.
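A hedged sketch of such a combined-embedding similarity measure: both images are embedded by several encoders and compared in one concatenated common space. Real systems might use learned encoders such as CLIP; here the encoders are random projections purely to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three stand-in "models", each mapping a flattened 64x64 image to a vector.
encoders = [rng.normal(size=(4096, 128)) for _ in range(3)]

def embed(image: np.ndarray) -> np.ndarray:
    flat = image.reshape(-1)                 # 64*64 = 4096 values
    parts = [flat @ W for W in encoders]     # one embedding per model
    common = np.concatenate(parts)           # one common comparison space
    return common / np.linalg.norm(common)

a, b = rng.random((64, 64)), rng.random((64, 64))
similarity_value = float(embed(a) @ embed(b))   # cosine similarity
```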
- 13. The method (100) according to any one of claims 1 to 12, further comprising: manufacturing (150) a physical product, and/or setting up a physical scene, based on the output image (3) obtained from said GMLM (2) or a modified version (3') of such an output image (3).
- 14. The method (100) according to any one of claims 1 to 13, further comprising: training (160) an image processing machine learning model (6) for a given task using one or more output images (3) from the GMLM (2), or modified versions (3') of these output images (3), as training images.
- 15. The method (100) of claim 14, further comprising: processing (170) one or more images (7) recorded by at least one sensor (8) with the trained image processing machine learning model (6); calculating (180) an actuation signal (180a) from an output (9) of the trained image processing machine learning model (6); and actuating (190) a vehicle (50), a driving assistance system (51), a robot (60), a quality inspection system (70), a monitoring system (80) and/or a medical imaging system (90) with the actuation signal (180a).
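For illustration only, the end-to-end flow of claim 15 could look like the following; trained_model and actuation_signal are placeholder stand-ins for the trained model (6) and the calculation (180), and the braking threshold is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def trained_model(image: np.ndarray) -> np.ndarray:
    """Stand-in for the image processing model (6) trained on GMLM images."""
    return (image > 0.5).astype(float)       # e.g. a free-space segmentation

def actuation_signal(output: np.ndarray) -> float:
    """Stand-in for calculating (180) a control quantity from output (9)."""
    return float(output.mean())              # e.g. fraction of free space

sensor_image = rng.random((64, 64))          # image (7) from a sensor (8)
signal = actuation_signal(trained_model(sensor_image))  # steps (170)+(180)
if signal < 0.3:                             # step (190): actuate, e.g. brake
    print("brake")                           # in a driving assistance system
```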
- 16. A computer program comprising machine readable instructions which, when executed by one or more computers and/or computing instances, cause the one or more computers and/or computing instances to perform the method (100) of any one of claims 1 to 15.
- 17. A non-transitory computer readable data carrier and/or download product having a computer program according to claim 16.
- 18. One or more computers and/or computing instances having a computer program according to claim 16, and/or having a non-transitory computer readable data carrier according to claim 17, and/or having a download product according to claim 17.
Description
Generating realistic images by means of generative machine learning models

Technical Field

The invention relates to generating realistic images by means of a generative machine learning model. For example, these generated images may be used as training images for training a downstream machine learning model for a given task.

Background

Training an image processing machine learning model for a given task requires a large set of training images. These training images need to be obtained in some way. If the training is supervised, each training image needs to be labeled with the "ground truth" that the image processing machine learning model should ideally produce when given the corresponding training image. Training images are therefore a scarce resource. In particular, it is difficult to achieve sufficient variability in the training image set so that it also covers situations that occur rarely but still need to be handled correctly. Therefore, generative image-to-image machine learning models are used to augment the set of available training images. If a generated image is essentially a variant of a training image with a known ground truth label, the generated image can be used as a new, different training image while the ground truth label is reused. However, the generated image should then be free of added "hallucinations", i.e. any artifacts that have no correspondence in the ground truth label.

Disclosure of Invention

The present invention provides a method for improving the conformity of an output image generated by a generative image-to-image machine learning model (GMLM) with the domain and/or distribution to which a given input image belongs. In particular, the domain and/or distribution may relate to the semantic content of the input image and/or to the rendering of that semantic content in the input image. For example, images of a scene in the environment of a vehicle and/or robot may belong to different domains and/or distributions depending on the composition of the object instances therein, and also depending on the general conditions of the respective scene. For example, images acquired under good weather conditions on a sunny day may be considered to belong to one domain and/or distribution, while images acquired at night and/or under other poor visibility conditions, such as rain, fog or snow, may be considered to belong to another domain and/or distribution. The same image may belong to multiple domains and/or distributions. For example, an image may belong to a first domain and/or distribution due to the composition of the object instances therein, and to a second domain and/or distribution due to the weather conditions under which the image was taken.

In particular, the GMLM may be trained to generate, from input images in a source domain and/or distribution with respect to at least one attribute (such as object composition or weather conditions), output images in a different target domain and/or distribution with respect to that attribute. In one example, the GMLM may be trained to generate, from an input image taken under good visibility conditions, an output image that appears to have been taken under less good visibility conditions but otherwise still resembles the input image. In particular, the semantic content of the output image may still be substantially the same as that of the input image. That is, the GMLM may be used to perform a controlled domain transfer of the input image.
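As a concrete illustration of such a controlled domain transfer, the open-source diffusers library (which the patent does not name) provides image-to-image diffusion pipelines; the model id, prompt, strength, file names and seed below are illustrative choices, not values from the patent:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load an off-the-shelf image-to-image diffusion pipeline (illustrative id).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

input_image = Image.open("sunny_scene.png").convert("RGB")  # source domain

# Re-render the same scene in a different target domain; the fixed seed is
# the "seed" from which one particular generation process starts.
output_image = pipe(
    prompt="the same street scene at night in heavy rain",
    image=input_image,
    strength=0.5,          # moderate strength keeps the semantic content close
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output_image.save("rainy_night_scene.png")
```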
An advantage over domain transfer with a generative adversarial network (GAN) is that there is more control over whether the "ground truth" label of the input image remains reusable for the output image.

In the course of this method, at least one input image is processed by the GMLM into one or more output images. For example, if the GMLM is a diffusion model, each such process may begin with a version of the image that has been corrupted with a different noise sample (e.g., represented by a different "seed" from which the process starts). In this way, repeated processing of the same input image can produce different output images.

One or more output images generated from the input image are compared to the input image by means of a predetermined similarity measure. In particular, the similarity measure may be specific to the application at hand and measure which properties of the output image should in some way respect the corresponding properties of the input image. In one example, the similarity measure may measure whether the output image has substantially the same semantic content as the input image. The similarity measure may be calculated based on a single output image, but may also be calculated based on a plurality of output images. For example, when multiple output images are computed from the same input image, the respective similarities of these output images to the input image may be aggregated, e.g., averaged. For example, when using a diffusion model as GMLM, this m