EP-4738280-A1 - CONTROLLED DEFECT AUGMENTATION VIA TEXT AND IMAGE GUIDED DIFFUSION MODEL


Abstract

A machine learning (ML) system includes a vision language model (VLM) and a diffusion model. The VLM is finetuned prior to training the diffusion model with data pairs. A data pair includes image data displaying an anomaly and text data describing the image data. The finetuned VLM includes an image encoder that generates image embeddings using the image data and a text encoder that generates text embeddings using the text data. Semantic subcode is generated using the image embeddings and the text embeddings. The diffusion model generates stochastic subcode using the image data. The diffusion model generates a reconstructed image using the stochastic and semantic subcodes. A loss is optimized based on an expected value of a difference between predicted noise of a noisy instance of the image data at a particular time and actual noise of that noisy instance. Parameters of the diffusion model are updated using the loss.
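
The loss described above can be illustrated with a minimal sketch. It is not the implementation from this publication: the text only says the loss is based on an expected value of a difference between predicted and actual noise, so the squared error, the linear noise schedule, the `NoisePredictor` signature, and every function name below are assumptions chosen for illustration. The expectation is approximated by averaging over randomly drawn time steps and noise samples.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical signature: (noisy pixels, time step, semantic subcode) -> predicted noise.
NoisePredictor = Callable[[Sequence[float], int, Sequence[float]], List[float]]


def linear_alpha_bar(num_steps: int, beta_start: float = 1e-4,
                     beta_end: float = 0.02) -> List[float]:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(pixels: Sequence[float], noise: Sequence[float],
              alpha_bar_t: float) -> List[float]:
    """Noisy instance of the image at one time step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + sigma * e for x, e in zip(pixels, noise)]


def noise_prediction_loss(pixels: Sequence[float], semantic_subcode: Sequence[float],
                          predictor: NoisePredictor, alpha_bar: Sequence[float],
                          num_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of E[(predicted noise - actual noise)^2], averaged per pixel."""
    total = 0.0
    for _ in range(num_samples):
        t = rng.randrange(len(alpha_bar))                # random time step
        noise = [rng.gauss(0.0, 1.0) for _ in pixels]    # actual noise
        noisy = add_noise(pixels, noise, alpha_bar[t])   # noisy image at time t
        predicted = predictor(noisy, t, semantic_subcode)
        total += sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(pixels)
    return total / num_samples


# Toy data: a four-pixel "image" and a two-dimensional semantic subcode.
rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)
pixels = [0.2, -0.5, 0.9, 0.0]
z_sem = [0.1, 0.3]


def zero_predictor(noisy, t, z):
    """Stand-in for an untrained model: always predicts zero noise."""
    return [0.0 for _ in noisy]


def oracle_predictor(noisy, t, z):
    """Stand-in for a perfect model: inverts add_noise using the known clean pixels."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    return [(x_t - signal * x0) / sigma for x_t, x0 in zip(noisy, pixels)]


print(noise_prediction_loss(pixels, z_sem, zero_predictor, alpha_bar, 500, rng))
print(noise_prediction_loss(pixels, z_sem, oracle_predictor, alpha_bar, 500, rng))
```

The zero predictor gives a loss close to the variance of the sampled noise, while the oracle drives it to nearly zero. In an actual training loop, the parameters of the diffusion model would be updated with the gradient of this loss, a step the sketch leaves out.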

Inventors

AZARI, Bahare
QIU, Chen
SCHMEDDING, Sabrina
LIN, Wan-Yi

Assignees

Robert Bosch GmbH

Dates

Publication Date: 2026-05-06
Application Date: 2025-10-28

Claims (20)

A computer-implemented method of a machine learning system that includes an image encoder, a text encoder, and a diffusion model, the method comprising: receiving a training dataset with data pairs, the data pairs include at least a first data pair that has at least (i) image data that displays an anomaly and (ii) text data describing the corresponding image data including the anomaly; generating, via the image encoder, image embeddings using pixels of the image data; generating, via the text encoder, text embeddings using the text data; generating semantic subcode using the image embeddings and the text embeddings; generating, via the diffusion model, stochastic subcode using the pixels of the image data; generating, via the diffusion model, reconstructed image data using the stochastic subcode and the semantic subcode; optimizing a loss based at least on an expected value of a difference between a predicted noise of a noisy image at a particular time and an actual noise of the noisy image at the particular time during the generation of the reconstructed image data; and updating parameters of the diffusion model using the loss.
The computer-implemented method of claim 1, wherein: the semantic subcode is a sum of an image component and a text component; the text component is computed by multiplying the text embeddings by a first coefficient that is a value of 0, a value between 0 and 1, or a value of 1; and the image component is computed by multiplying the image embeddings by a second coefficient, the second coefficient being one minus the first coefficient.
The computer-implemented method of claim 1, wherein: the image data displays an object; and the anomaly is a defect on the object.
The computer-implemented method of claim 1, further comprising: finetuning a pretrained vision language model (VLM) using a finetuning dataset, the finetuning dataset including (i) a first subset of digital images that includes non-anomalous image data and a first subset of corresponding text data describing the non-anomalous image data and (ii) a second subset of digital images that includes anomalous image data and a second subset of corresponding text data describing the anomalous image data, wherein, the image encoder is a finetuned image encoding component of the pretrained VLM, and the text encoder is a finetuned text encoding component of the pretrained VLM.
The computer-implemented method of claim 4, wherein: the finetuning dataset of the pretrained VLM includes at least another data pair; the another data pair includes another digital image displaying another image data and another text data describing the another image data; and the another text data includes (i) a data type indicating whether or not the another image data displays an object that is anomalous or non-anomalous, (ii) one or more attribute data indicative of one or more attributes of a defect of the object when the data type is anomalous.
The computer-implemented method of claim 5, wherein the another text data of the finetuning dataset that finetunes the VLM is more descriptive than the text data of the training dataset that trains the diffusion model.
The computer-implemented method of claim 1, further comprising: receiving a source image with source image data that is non-anomalous; receiving text input that describes (i) a desired anomaly to be generated with respect to the source image and (ii) at least one attribute of the anomaly; and generating, via the machine learning system, a synthetic image using the source image and the text input, wherein the synthetic image displays the source image data with the desired anomaly as described by the text input.
The computer-implemented method of claim 7, further comprising: creating a new dataset that includes at least the source image and the synthetic image; and training an anomaly detector using the new dataset, the anomaly detector including at least one machine learning model.
A system comprising: one or more processors; one or more computer memories in data communication with the one or more processors, the one or more computer memories having computer readable data stored thereon, the computer readable data including instructions that, when executed by the one or more processors, cause the one or more processors to perform a method of a machine learning system that includes an image encoder, a text encoder, and a diffusion model, the method including receiving a training dataset with data pairs, the data pairs include at least a first data pair that has at least (i) image data that displays an anomaly and (ii) text data describing the corresponding image data including the anomaly; generating, via the image encoder, image embeddings using pixels of the image data; generating, via the text encoder, text embeddings using the text data; generating semantic subcode using the image embeddings and the text embeddings; generating, via the diffusion model, stochastic subcode using the pixels of the image data; generating, via the diffusion model, reconstructed image data using the stochastic subcode and the semantic subcode; optimizing a loss based at least on an expected value of a difference between a predicted noise of a noisy image at a particular time and an actual noise of the noisy image at the particular time during the generation of the reconstructed image data; and updating parameters of the diffusion model using the loss.
The system of claim 9, wherein: the semantic subcode is a sum of an image component and a text component; the text component is computed by multiplying the text embeddings by a first coefficient that is a value of 0, a value between 0 and 1, or a value of 1; and the image component is computed by multiplying the image embeddings by a second coefficient, the second coefficient being one minus the first coefficient.
The system of claim 9, wherein: the image data displays an object; and the anomaly is a defect on the object.
The system of claim 9, wherein the method further comprises: finetuning a pretrained vision language model (VLM) using a finetuning dataset, the finetuning dataset including (i) a first subset of digital images that includes non-anomalous image data and a first subset of corresponding text data describing the non-anomalous image data and (ii) a second subset of digital images that includes anomalous image data and a second subset of corresponding text data describing the anomalous image data, wherein, the image encoder is a finetuned image encoding component of the VLM, and the text encoder is a finetuned text encoding component of the VLM.
The system of claim 12, wherein: the finetuning dataset of the pretrained VLM includes at least another data pair; the another data pair includes another digital image displaying another image data and another text data describing the another image data; and the another text data includes (i) a data type indicating whether or not the another image data is anomalous or non-anomalous, (ii) one or more attribute data indicative of one or more attributes of a defect displayed in the another image data when the data type is anomalous.
The system of claim 13, wherein the another text data of the finetuning dataset that finetunes the VLM is more descriptive than the text data of the training dataset that trains the diffusion model.
The system of claim 9, wherein the method further comprises: receiving a source image with source image data that is non-anomalous; receiving text input that describes (i) a desired anomaly to be generated with respect to the source image and (ii) at least one attribute of the anomaly; and generating, via the machine learning system, a synthetic image using the source image and the text input, wherein the synthetic image displays the source image data with the desired anomaly as described by the text input.
The system of claim 15, wherein the method further comprises: creating a new dataset that includes at least the source image and the synthetic image; and training an anomaly detector using the new dataset, the anomaly detector including at least one machine learning model.
A computer implemented method of generating a dataset for training a machine learning model, the method comprises: receiving a source image with source image data that is non-anomalous; receiving text input that describes (i) an anomaly to be generated with respect to the source image data and (ii) one or more attributes of the anomaly; generating, via an image encoder, source image embeddings using pixels of the source image; generating, via a text encoder, text input embeddings using the text input; generating a semantic subcode using the source image embeddings and the text input embeddings; generating, via a diffusion model, a stochastic subcode using the pixels of the source image; and generating, via the diffusion model, a synthetic image using the stochastic subcode and the semantic subcode, the synthetic image displaying the source image data with the anomaly as described by the text input, wherein, the dataset includes at least the source image and the synthetic image, and the dataset is configured to train the machine learning model to perform an anomaly detection task.
The computer-implemented method of claim 17, wherein: the semantic subcode is a sum of an image component and a text component; the text component is computed by multiplying the text embeddings by a first coefficient that is a value of 0, a value between 0 and 1, or a value of 1; and the image component is computed by multiplying the image embeddings by a second coefficient, the second coefficient being one minus the first coefficient.
The computer-implemented method of claim 17, wherein: the source image data displays an object; and the anomaly is a defect on the object.
The computer-implemented method of claim 19, wherein the one or more attributes of the anomaly include (i) a size of the defect and (ii) a location of the defect.
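
The weighted sum recited in claims 2, 10, and 18 can be sketched in a few lines. This is an illustration under stated assumptions rather than the claimed implementation: the function name `semantic_subcode`, the parameter name `text_weight` (standing for the first coefficient), and the requirement that both embeddings share one dimensionality are hypothetical choices made here.

```python
from typing import List, Sequence


def semantic_subcode(image_emb: Sequence[float], text_emb: Sequence[float],
                     text_weight: float) -> List[float]:
    """Return text_weight * text_emb + (1 - text_weight) * image_emb, element by element."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must share one dimensionality (an assumption here)")
    image_weight = 1.0 - text_weight  # second coefficient: one minus the first
    return [text_weight * t + image_weight * i for i, t in zip(image_emb, text_emb)]


# Toy embeddings, invented for illustration only.
image_emb = [1.0, 0.0, 2.0]
text_emb = [0.0, 1.0, 4.0]

print(semantic_subcode(image_emb, text_emb, 0.0))  # image embeddings only
print(semantic_subcode(image_emb, text_emb, 0.5))  # equal blend
print(semantic_subcode(image_emb, text_emb, 1.0))  # text embeddings only
```

A first coefficient of 0 reduces the semantic subcode to the image embeddings, a value of 1 reduces it to the text embeddings, and intermediate values interpolate linearly between the two.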

Description

TECHNICAL FIELD

This disclosure relates generally to computer vision, and more particularly to controlled defect augmentation via a diffusion model guided by text and images.

BACKGROUND

A significant challenge in training efficient anomaly detection models is the scarcity of balanced datasets that contain both normal and defective images in suitable proportions. In manufacturing settings, for example, defective images are far less available and less diverse than normal images, which makes it difficult to train anomaly detection models for those settings. In addition, traditional defect augmentation methods based on generative models can be biased toward their training data. They often suffer from mode collapse, in which they repeatedly generate overly similar outputs and fail to produce diverse, authentic images, which limits their usefulness for building augmented datasets of defective images.

SUMMARY

The following is a summary of certain embodiments described in detail below. The described aspects are presented merely to provide the reader with a brief summary of these embodiments, and the description of these aspects is not intended to limit the scope of this disclosure. Indeed, this disclosure may encompass a variety of aspects that may not be explicitly set forth below.

According to at least one aspect, a computer-implemented method relates to training at least a diffusion model with a training dataset that includes data pairs. The data pairs include at least a first data pair. The first data pair includes at least (i) image data that displays an anomaly and (ii) text data that describes the corresponding image data including the anomaly. The method includes generating, via an image encoder, image embeddings using pixels of the image data. The method includes generating, via a text encoder, text embeddings using the text data.
The method includes generating semantic subcode using the image embeddings and the text embeddings. The method includes generating, via the diffusion model, stochastic subcode using the pixels of the image data. The method includes generating, via the diffusion model, reconstructed image data using the stochastic subcode and the semantic subcode. The reconstructed image data is a reconstruction of the image data via the diffusion model. The method includes optimizing a loss based on an expected value of a difference between a predicted noise of a noisy image at a particular time and an actual noise of the noisy image at the particular time during the generation of the reconstructed image data. The method includes updating parameters of the diffusion model using the loss.

According to at least one aspect, a system includes one or more processors and at least one computer memory, which is in data communication with the one or more processors. The at least one computer memory has computer readable data stored thereon. The computer readable data includes instructions that, when executed by the one or more processors, cause the one or more processors to perform a method of training at least a diffusion model with a training dataset that includes data pairs. The data pairs include at least a first data pair. The first data pair includes at least (i) image data that displays an anomaly and (ii) text data that describes the corresponding image data including the anomaly. The method includes generating, via an image encoder, image embeddings using pixels of the image data. The method includes generating, via a text encoder, text embeddings using the text data. The method includes generating semantic subcode using the image embeddings and the text embeddings. The method includes generating, via the diffusion model, stochastic subcode using the pixels of the image data.
The method includes generating, via the diffusion model, reconstructed image data using the stochastic subcode and the semantic subcode. The reconstructed image data is a reconstruction of the image data via the diffusion model. The method includes optimizing a loss based on an expected value of a difference between a predicted noise of a noisy image at a particular time and an actual noise of the noisy image at the particular time during the generation of the reconstructed image data. The method includes updating parameters of the diffusion model using the loss.

According to at least one aspect, a computer-implemented method relates to generating a dataset for training a machine learning model. The method includes receiving a source image with source image data that is non-anomalous. The method includes receiving text input that describes (i) an anomaly to be generated on the source image and (ii) at least one attribute of the anomaly. The method includes generating, via an image encoder, source image embeddings using pixels of the source image. The method includes generating, via a text encoder, text input embeddings using the text input. The method includes generating a semantic subcod