CN-121998937-A - Construction method, data generation method and system of diffusion data enhancement model based on prototype prompt

CN121998937A

Abstract

The invention belongs to the technical field of industrial visual quality inspection, and discloses a construction method, a data generation method, and a system for a diffusion data enhancement model based on prototype prompts. (1) A CLIP image encoder extracts image features from the image to be redrawn, and a CLIP text encoder encodes the prompt text features; together these form the prompt features. (2) The original image and the image with its foreground masked out are input into a VAE encoder to obtain latent features; noise is added to the latent features of the original image to obtain noised latent features, from which the predicted noise is obtained. (3) The denoised latent features are estimated in reverse from the predicted noise and passed through a VAE decoder to obtain an estimated image; the mean square error loss between the predicted noise and the true noise and the perceptual mixed loss between the estimated image and the original image are computed, and the U-Net network parameters are adjusted accordingly to obtain the diffusion data enhancement model. The invention can generate realistic training samples and improve the accuracy of detection models in small-sample scenarios.

Inventors

  • Yang Hua
  • Sun Linqing
  • He Zhongtian
  • Pan Qianfeng

Assignees

  • Huazhong University of Science and Technology (华中科技大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-23

Claims (10)

  1. A method for constructing a diffusion data enhancement model based on prototype prompts, characterized by comprising the following steps: (1) extracting image features of the image to be redrawn with a CLIP image encoder, and taking the average feature of the foreground region, computed from the image features and a foreground mask, as the prototype embedding; (2) after masking out the image foreground, inputting the original image and the foreground-masked image into a VAE encoder to obtain latent features, and adding noise to the latent features of the original image to obtain noised latent features, from which the U-Net network predicts the noise; (3) estimating the denoised latent features in reverse from the predicted noise, inputting them into a VAE decoder to obtain an estimated image, computing the mean square error loss between the predicted noise and the true noise and the perceptual mixed loss between the estimated image and the original image, and adjusting the U-Net network parameters based on these two losses to obtain the diffusion data enhancement model, wherein the diffusion data enhancement model comprises a CLIP image encoder, a CLIP text encoder, a VAE encoder, a U-Net network, and a VAE decoder.
  2. The method for constructing a diffusion data enhancement model based on prototype prompts according to claim 1, wherein the prototype embedding is obtained by computing the region-of-interest average-pooled features of the 4th, 8th, and 12th layer outputs of the CLIP image encoder, according to a formula [not reproduced in the source] whose symbols denote, respectively, the input image, the CLIP image encoder, the target ground-truth box, and the index i of the corresponding CLIP image encoder output layer.
  3. The method for constructing a diffusion data enhancement model based on prototype prompts according to claim 1, wherein the prompt features are obtained by concatenating the redraw-foreground prototype embedding, the prompt text features, and trainable tokens, and inputting them into a projection network composed of a linear layer and a self-attention layer, according to a formula [not reproduced in the source] whose symbols denote, respectively, the linear projection weights and biases for the i-th layer features, the projected prototype output, the feature formed by concatenating the multi-layer prototypes, the predicted gating strength coefficient, the visual prompt features transformed by the self-attention mechanism, the trainable position tokens, the prompt text features, and the concatenated visual-text prompt features.
  4. The method for constructing a diffusion data enhancement model based on prototype prompts according to claim 1, wherein the reverse estimation process restores the noised image in reverse according to the predicted noise, according to a formula [not reproduced in the source] whose symbols denote, respectively, the scalar coefficient of the diffusion process (the accumulation of all noise scaling factors from time 0 to time step t), the noise tensor predicted by the U-Net network, the VAE decoder, the latent features estimated from the predicted noise, and the estimated image.
  5. The method for constructing a diffusion data enhancement model based on prototype prompts according to any one of claims 1-4, wherein the perceptual mixed loss comprises a pixel-level mean square error loss and a feature-level mean square error loss, according to formulas [not reproduced in the source] whose symbols denote, respectively, the input image, the estimated image, the foreground mask of the redrawn region, the CLIP image encoder, the downsampled foreground mask, the pixel-level mean square error loss, and the feature-level mean square error loss.
  6. The method for constructing a diffusion data enhancement model based on prototype prompts according to any one of claims 1-4, wherein the parameters of the U-Net network are fine-tuned using a low-rank adapter (LoRA).
  7. The method according to claim 3, wherein during training all parameters of the diffusion data enhancement model are kept frozen except its low-rank weights and the projection network.
  8. A data generation method, characterized in that simulation data is generated using a diffusion data enhancement model constructed by the method for constructing a diffusion data enhancement model based on prototype prompts according to any one of claims 1-7.
  9. A system for constructing a diffusion data enhancement model based on prototype prompts, characterized by comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, carries out the method for constructing a diffusion data enhancement model based on prototype prompts according to any one of claims 1-7.
  10. A computer-readable storage medium, characterized in that it stores machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method for constructing a diffusion data enhancement model based on prototype prompts according to any one of claims 1-7 or the data generation method according to claim 8.
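The prototype embedding of claims 1-2 (foreground-masked average pooling of multi-layer CLIP features) can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the function name, the nearest-neighbour resizing of the mask to each layer's resolution, and the feature shapes are assumptions for the sketch.

```python
import numpy as np

def prototype_embedding(layer_feats, fg_mask):
    # layer_feats: list of (C, H_l, W_l) feature maps, e.g. CLIP layers 4/8/12
    # fg_mask: (H, W) binary foreground mask derived from the target box
    protos = []
    for f in layer_feats:
        _, h, w = f.shape
        ys = np.arange(h) * fg_mask.shape[0] // h     # nearest-neighbour
        xs = np.arange(w) * fg_mask.shape[1] // w     # downsampling indices
        m = fg_mask[np.ix_(ys, xs)].astype(float)     # mask at layer resolution
        denom = max(m.sum(), 1.0)                     # avoid divide-by-zero
        protos.append((f * m).sum(axis=(1, 2)) / denom)  # masked mean, (C,)
    return np.concatenate(protos)                     # stacked prototype vector
```

With uniform foreground features, the pooled prototype simply recovers the per-channel foreground value, which is a useful sanity check for the masking logic.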
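The forward noising of claim 1 and the reverse estimation of claim 4 follow the standard diffusion parameterization, in which the clean latent is recovered from the noised latent and the predicted noise using the accumulated noise scaling factor. A sketch under that assumption (the source formula itself is not reproduced; the DDPM-style parameterization here is an assumption):

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    # forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def estimate_z0(zt, eps_pred, alpha_bar_t):
    # reverse estimation: invert the forward step using the U-Net's
    # predicted noise to recover the (denoised) latent features
    return (zt - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
```

When the predicted noise equals the true noise, the reverse estimate recovers the original latent exactly; in training it is the estimate that is fed to the VAE decoder to obtain the estimated image.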
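The perceptual mixed loss of claim 5 combines a pixel-level MSE restricted by the foreground mask with a feature-level MSE restricted by the downsampled mask. A minimal sketch; the function name, the normalization by mask area, and the weighting coefficient `w_feat` are illustrative assumptions, and `feat_fn` stands in for the CLIP image encoder:

```python
import numpy as np

def mixed_loss(x, x_hat, mask, mask_ds, feat_fn, w_feat=1.0):
    # pixel-level MSE over the redrawn foreground region
    l_pix = float((((x - x_hat) * mask) ** 2).sum() / max(mask.sum(), 1.0))
    # feature-level MSE between encoder features, under the downsampled mask
    fx, fh = feat_fn(x), feat_fn(x_hat)
    l_feat = float((((fx - fh) * mask_ds) ** 2).sum() / max(mask_ds.sum(), 1.0))
    return l_pix + w_feat * l_feat
```

Restricting both terms to the masked region keeps the loss focused on the redrawn foreground rather than the unchanged background.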
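The low-rank adapter of claims 6-7 keeps a base weight frozen and trains only a rank-r update. A generic LoRA-style sketch (the class name, zero-initialization of B, and the scale factor are conventional choices assumed here, not details taken from the patent):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A."""

    def __init__(self, W, r=4, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen, (out, in)
        self.A = rng.normal(0.0, 0.01, (r, W.shape[1]))  # trainable down-proj
        self.B = np.zeros((W.shape[0], r))               # trainable, zero-init
        self.scale = scale

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T; with B = 0 at initialization,
        # the adapted layer reproduces the frozen base layer exactly
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Only `A` and `B` would be updated during fine-tuning, which matches claim 7's requirement that everything except the low-rank weights (and the projection network) stays frozen.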

Description

Construction method, data generation method and system of diffusion data enhancement model based on prototype prompt

Technical Field

The invention belongs to the technical field of industrial visual quality inspection, and particularly relates to a method for constructing a diffusion data enhancement model based on prototype prompts, a data generation method, and a system.

Background

Visual inspection technology based on deep learning has become a core means of improving the automation and intelligence of quality control in manufacturing. In scenarios such as electronic component assembly (e.g., PCBA), precision part manufacturing, and textile inspection, target detection and defect detection models can efficiently and accurately identify various flaws on product surfaces, significantly outperforming traditional manual visual inspection. However, these data-driven deep learning models are essentially complex function-fitting processes, and the full realization of their performance depends strongly on large-scale, high-quality, diverse labeled datasets for training. In practical industrial applications, constructing such ideal datasets faces serious challenges, mainly the following:

(1) Sample scarcity. Industrial production processes are highly optimized and yields are usually extremely high, so defective samples occur very rarely, creating a typical "small-sample" learning scenario. For example, in some precision packaging processes, the incidence of certain defect types may be below one in a thousand. Collecting a sufficient number of defect samples often requires a long period and high cost, severely restricting the timely deployment and iterative updating of detection models.

(2) Data diversity and domain adaptation. Industrial field environments are complex and variable; factors such as illumination conditions, camera parameters, workpiece pose, and background interference all cause significant differences in image data distribution (i.e., domain differences). Even on the same production line, data collected in different batches or on different machines may exhibit distribution shift. Although existing data enhancement techniques (such as rotation, scaling, cropping, and color jittering) can simply expand the data volume, the generated samples are limited to linear transformations of the original data distribution and can hardly simulate the complex nonlinear changes and new defect modes of the real world, so models generalize poorly to new environments and new defect forms.

(3) In recent years, deep generative models such as generative adversarial networks and diffusion models have achieved breakthroughs in image synthesis. However, their application to data enhancement suffers from unstable training, weak semantic control, and insufficient domain fidelity, which may cause the feature domain of the generated data to deviate too far from the source data and harm the performance of the detection model. Accordingly, there is a strong need in the art for a data enhancement solution that overcomes the above limitations.

The ideal method would have the following characteristics: it can be trained stably with only a small number of real samples; it can precisely control the generated content, particularly the attributes of foreground targets, according to semantic cues (such as text descriptions); and it ensures that generated images remain highly consistent with the real data domain at both the pixel level and the feature level, thereby effectively improving the accuracy and robustness of downstream detection models in data-scarce scenarios.

Disclosure of Invention

Aiming at the above deficiencies or improvement needs of the prior art, the invention provides a construction method, a data generation method, and a system for a diffusion data enhancement model based on prototype prompts, which aim to solve the problems of insufficient accuracy and weak generalization of existing industrial detection models in small-sample scenarios. To achieve the above object, according to one aspect of the invention, there is provided a method for constructing a diffusion data enhancement model based on prototype prompts, comprising the following steps: (1) extracting image features of the image to be redrawn with a CLIP image encoder, and taking the average feature of the foreground region, computed from the image features and a foreground mask, as the prototype embedding; (2) after masking out the image foreground, inputting the original image and the foreground-masked image into a VAE encoder to o