CN-121686146-B - Industrial defect data generation method based on visual language big model


Abstract

The application relates to the technical field of defect detection and provides an industrial defect data generation method based on a visual language big model. The method extracts visual feature vectors of industrial product images and semantic feature vectors of defect text descriptions through the visual language big model, maps the visual feature vectors and the semantic feature vectors to a unified semantic space, and dynamically generates semantic guidance vectors by means of a cross attention mechanism. Directional simulation noise is generated based on the visual morphology of the defects, the semantic guidance vectors are encoded into modulation signals through the language branch of a diffusion model, and the noise-added visual features and the modulation signals are deeply fused in a denoising decoder. The semantic matching degree between the defect images and the defect text descriptions is verified through the visual language big model, and the positioning accuracy of the defect areas is verified by means of class activation diagrams. The verification results serve as reward signals, and the parameters of the diffusion model and the weight parameters of the cross attention mechanism are alternately optimized through reinforcement learning, so that the finally generated industrial defect data set is stable in quality and reliable in attributes.

Inventors

ZHU ZUNJIE
JIN HENG
LI RUYUAN
ZHAO QIANG
WANG HONGKUI
CHEN YUNKAI
HU MEIQIN
DING GUIGUANG
WEN HONGFA
JIANG SHAOWEI

Assignees

Lishui Institute of Hangzhou Dianzi University (杭州电子科技大学丽水研究院)

Dates

Publication Date: 2026-05-05
Application Date: 2026-02-09

Claims (9)

1. An industrial defect data generation method based on a visual language big model, characterized by comprising the following steps: acquiring an industrial product image and a defect text description, extracting a visual feature vector and a semantic feature vector respectively through a visual language big model, and mapping the visual feature vector and the semantic feature vector to a unified semantic space; fusing the visual feature vector and the semantic feature vector based on a cross attention mechanism to dynamically generate a semantic guidance vector, wherein dynamically generating the semantic guidance vector comprises: analyzing the defect position and the size parameter based on the defect text description, and assigning initial weights to the visual feature vector to generate weighted visual features; analyzing the defect type based on the defect text description to segment the semantic feature vector into hierarchical semantic features comprising core semantic features and auxiliary semantic features; taking the weighted visual features and the core semantic features as initial fusion objects, fusing them through the cross attention mechanism, calculating the matching degree between the fusion result and the defect text description, and dynamically adjusting the fusion coefficient of the auxiliary semantic features according to the matching degree; and iteratively executing the cross attention fusion, wherein each iteration updates the initial weights of the visual feature vector and the hierarchical semantic features with the current fusion result, until the matching degree converges, so as to dynamically generate the semantic guidance vector; adding analog noise to the industrial product image and inputting the noise-added image into a visual branch of a diffusion model to extract noise-added visual features, and inputting the semantic guidance vector into a language branch of the diffusion model and encoding it into a modulation signal; performing, in a denoising decoder, depth fusion of the noise-added visual features and the modulation signal to generate a defect image; judging the semantic matching degree between the defect image and the defect text description through the visual language big model, verifying the positioning accuracy of the defect area through a class activation diagram, and comprehensively judging whether the defect image is qualified based on the semantic matching degree and the positioning accuracy; and, if the defect image is unqualified, taking the results of the semantic matching degree and the positioning accuracy as a reward signal, alternately optimizing the parameters of the diffusion model and the weight parameters of the cross attention mechanism through reinforcement learning, and regenerating the defect image; or, if the defect image is qualified, generating an industrial defect data set.
2. The method for generating industrial defect data based on a visual language big model according to claim 1, wherein the mapping to the unified semantic space comprises: processing the industrial product image to obtain an enhanced image, inputting the industrial product image and the enhanced image into the visual language big model, and extracting the visual feature vector through residual fusion of enhanced features and basic visual features; acquiring and analyzing the defect text description, inputting the analysis result into the visual language big model, and extracting the semantic feature vector by embedding domain knowledge and dynamically adjusting the weight ratio through a gating mechanism; and calculating the feature similarity between the visual feature vector and the semantic feature vector, and performing feature projection and normalization based on the feature similarity so as to map both vectors to the unified semantic space.
3. The method for generating industrial defect data based on a visual language big model according to claim 2, wherein the encoding into a modulation signal comprises: inputting the semantic guidance vector into the language branch of the diffusion model, extracting feature representations comprising a defect type identifier, a position constraint boundary and a morphological constraint, and assigning coding weights according to the complexity of the defect type identifier; dynamically correcting the coding weights based on the degree of region overlap between the position constraint boundary and the industrial product image and on the matching degree between the morphological constraint and a defect morphology template; and hierarchically encoding, in sequence, the feature representations with the corrected coding weights to generate an initial modulation signal, calculating the relevance between the initial modulation signal and the noise-added visual features, adjusting the coding weights according to the relevance feedback, and encoding iteratively until the modulation signal is generated.
4. The method for generating industrial defect data based on a visual language big model according to claim 3, wherein the extracting of the noise-added visual features comprises: generating directional simulation noise matched with the defect type, and dynamically adjusting the intensity of the directional simulation noise according to the size parameter; judging the defect severity according to the size parameter, presetting an initial ratio between the industrial product image and the directional simulation noise, and superimposing the two according to the initial ratio to obtain a plurality of groups of noise images; inputting the plurality of groups of noise images into the visual branch of the diffusion model, extracting noise-added image features at different levels through multi-scale convolution, and screening the noise-added image features according to the gradient direction of the directional simulation noise; and performing cross-level fusion on the screened noise-added image features to obtain the noise-added visual features.
5. The method for generating industrial defect data based on a visual language big model according to claim 4, wherein the generating of the defect image comprises: in the denoising decoder, decomposing the noise-added visual features by level into low-order spatial features and high-order semantic features, and parsing the modulation signal into a position constraint factor and a morphological constraint factor; dynamically assigning a spatial fusion weight based on the coordinate matching degree between the position constraint boundary and the low-order spatial features, so as to spatially fuse the low-order spatial features with the position constraint factor; dynamically adjusting a semantic fusion weight according to the morphological similarity between the morphological constraint and the high-order semantic features, so as to semantically fuse the high-order semantic features with the morphological constraint factor; and concatenating the spatial fusion result and the semantic fusion result to generate an initial defect image, calculating the degree of overlap between the defect region in the initial defect image and the position constraint boundary and the consistency between the defect morphology and the morphological constraint, so as to judge whether to adjust the fusion weights in reverse, and iteratively executing the depth fusion until the defect image is generated.
6. The method for generating industrial defect data based on a visual language big model according to claim 5, wherein optimizing the parameters of the diffusion model comprises: extracting matching results of different dimensions and an offset result of the defect area from the reward signal, and adjusting the weight ratio between the matching results and the offset result according to the defect severity to generate a reinforcement learning reward value; directionally adjusting the visual branch and the language branch of the diffusion model according to the matching results of different dimensions, and adjusting the spatial fusion weight in the denoising decoder according to the offset result of the defect area; and regenerating the defect image based on the diffusion model after parameter adjustment, and verifying the semantic matching degree and the positioning accuracy again until the defect image is qualified, so as to iteratively optimize the parameters of the diffusion model.
7. The method for generating industrial defect data based on a visual language big model according to claim 6, wherein the judging of the semantic matching degree comprises: extracting the defect type, the morphological description and the size parameter from the defect text description to form text semantic features, and inputting the defect image into the visual language big model to extract image semantic features of the same dimensions; calculating the matching degree between the text semantic features and the image semantic features in each dimension, and dynamically assigning dimension weights to the matching degrees according to the defect severity; and performing a weighted summation of the matching degrees of the different dimensions according to the dimension weights to obtain an initial semantic matching degree, judging whether the initial semantic matching degree is qualified with the type matching degree as the judgment basis, and, if it is qualified, judging the semantic matching degree in combination with the weighted summation result and outputting the matching deviations of the different dimensions.
8. The method for generating industrial defect data based on a visual language big model according to claim 7, wherein the verifying of the positioning accuracy of the defect area comprises: generating a heat map of the defect image through the class activation diagram, taking the region of the heat map whose response value is greater than a response threshold as a potential defect region, and dividing the potential defect region into a core attention region and an edge attention region according to the response value; calculating the core matching degree between the core attention region and the position constraint boundary and the edge matching degree between the edge attention region and the position constraint boundary; dynamically assigning weights to the core matching degree and the edge matching degree based on the spatial properties of the defect type, obtaining an initial accuracy by weighted summation, and analyzing the offset direction and the offset distance between the geometric center of the potential defect region and the geometric center of the position constraint boundary; and judging the positioning result by combining the initial accuracy and the offset distance, so as to verify the positioning accuracy of the defect area.
9. The method for generating industrial defect data based on a visual language big model according to claim 8, wherein the optimizing of the weight parameters of the cross attention mechanism comprises: extracting matching results of different dimensions and an offset result of the defect area from the reward signal, and associating them respectively with the hierarchical semantic features and the initial weights of the visual feature vector in the cross attention mechanism; determining the update priority of the weight coefficients according to the defect type and the defect severity, so as to directionally adjust the weight parameters of the cross attention mechanism; and regenerating the semantic guidance vector based on the cross attention mechanism after the weight coefficient adjustment, so as to regenerate the defect image and verify the semantic matching degree and the positioning accuracy again until the defect image is qualified, thereby iteratively optimizing the weight parameters of the cross attention mechanism.
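The positioning check of claim 8 can be illustrated with a minimal sketch. The claims do not define how the matching degrees are computed, so several things here are assumptions made only for illustration: the matching degree is taken to be the share of region pixels that fall inside the position constraint boundary, the boundary is an axis-aligned box, and the names `Box`, `inside_fraction`, `verify_positioning` and all threshold values are hypothetical. The response map is assumed to be already computed and normalized to [0, 1].

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Hypothetical position constraint boundary: x0 <= x < x1, y0 <= y < y1."""
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1

    def center(self):
        # Geometric center, with pixel centers at integer indices.
        return ((self.x0 + self.x1 - 1) / 2, (self.y0 + self.y1 - 1) / 2)


def inside_fraction(pixels, box):
    """Assumed matching degree: share of pixels inside the boundary (0 if empty)."""
    if not pixels:
        return 0.0
    return sum(box.contains(x, y) for x, y in pixels) / len(pixels)


def verify_positioning(heatmap, box, response_thr=0.3, core_thr=0.7,
                       core_weight=0.7, min_accuracy=0.6, max_distance=2.0):
    """Check a class activation response map against a position boundary."""
    core, edge = [], []
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > response_thr:  # pixel belongs to the potential defect region
                (core if value >= core_thr else edge).append((x, y))
    region = core + edge
    if not region:
        return {"qualified": False, "accuracy": 0.0,
                "offset": None, "distance": None}

    # Weighted sum of core and edge matching degrees gives the initial accuracy.
    # core_weight is fixed here; the claim assigns it per defect type.
    accuracy = (core_weight * inside_fraction(core, box)
                + (1.0 - core_weight) * inside_fraction(edge, box))

    # Offset between the geometric centers of the region and of the boundary.
    cx = sum(x for x, _ in region) / len(region)
    cy = sum(y for _, y in region) / len(region)
    bx, by = box.center()
    offset = (cx - bx, cy - by)
    distance = hypot(*offset)

    qualified = accuracy >= min_accuracy and distance <= max_distance
    return {"qualified": qualified, "accuracy": accuracy,
            "offset": offset, "distance": distance}


heat = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.9, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
result = verify_positioning(heat, Box(1, 1, 4, 4))
print(result["qualified"], result["offset"])
```

The returned offset carries both direction (sign of each component) and distance, which is the kind of information claims 6 and 9 draw from the reward signal when adjusting the spatial fusion weight and the cross attention weights.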

Description

Industrial defect data generation method based on visual language big model

Technical Field

The application relates to the technical field of defect detection, and in particular to an industrial defect data generation method based on a visual language big model.

Background

To improve the ability of industrial vision algorithms to identify defects, a large number of defect samples are usually required for training. However, current defect data generation methods have certain limitations. Such methods can reduce the visual abruptness of generated defects to some extent, but they usually rely on unconditional generation of the whole image or lack explicit target guidance, so the background content is easily destroyed and the overall form of the image is distorted. As a result, the generated data differ markedly from actual production defects in the position, morphology, material properties and other aspects of the defect area, which limits the adaptability and generalization capability of the data in industrial detection algorithms. To overcome these shortcomings of the prior art, the application aims to generate, based on a visual language big model, an industrial defect data set that is consistent with the product both visually and semantically.

Disclosure of Invention

In view of the shortcomings of the prior art, the application provides an industrial defect data generation method based on a visual language big model, which comprises: acquiring an industrial product image and a defect text description, extracting a visual feature vector and a semantic feature vector respectively through the visual language big model, mapping the visual feature vector and the semantic feature vector to a unified semantic space, and fusing the visual feature vector and the semantic feature vector based on a cross attention mechanism to dynamically generate a semantic guidance vector; adding analog noise to the industrial product image and inputting the noise-added image into a visual branch of a diffusion model to extract noise-added visual features, inputting the semantic guidance vector into a language branch of the diffusion model and encoding it into a modulation signal, and performing depth fusion of the noise-added visual features and the modulation signal in a denoising decoder to generate a defect image; judging the semantic matching degree between the defect image and the defect text description through the visual language big model, verifying the positioning accuracy of the defect area through a class activation diagram, and comprehensively judging whether the defect image is qualified based on the semantic matching degree and the positioning accuracy; and, if the defect image is unqualified, taking the results of the semantic matching degree and the positioning accuracy as a reward signal, alternately optimizing the parameters of the diffusion model and the weight parameters of the cross attention mechanism through reinforcement learning, and regenerating the defect image, or, if the defect image is qualified, generating an industrial defect data set.
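The generate, verify and optimize cycle described above can be outlined as a short control loop. This is only a sketch under stated assumptions: the generator, the two verifiers and the two update steps are passed in as hypothetical callables, the reinforcement learning update itself is not modeled, "alternately" is read as switching between the two update targets on successive failed rounds, and the thresholds and the `max_rounds` cap are illustrative values that do not come from the application.

```python
def build_defect_dataset(generate, semantic_match, positioning_accuracy,
                         update_diffusion, update_attention, n_images,
                         match_thr=0.8, loc_thr=0.8, max_rounds=10):
    """Collect qualified defect images, optimizing alternately on failures.

    All callables are hypothetical placeholders:
      generate()                 -> a defect image
      semantic_match(img)        -> semantic matching degree in [0, 1]
      positioning_accuracy(img)  -> positioning accuracy in [0, 1]
      update_diffusion(reward)   -> adjusts the diffusion model parameters
      update_attention(reward)   -> adjusts the cross attention weights
    """
    dataset = []
    for _ in range(n_images):
        for round_idx in range(max_rounds):  # cap is an added safety assumption
            image = generate()
            match = semantic_match(image)
            loc = positioning_accuracy(image)

            # Qualified images go straight into the data set.
            if match >= match_thr and loc >= loc_thr:
                dataset.append(image)
                break

            # Otherwise both verification results form the reward signal,
            # and the two parameter groups are updated in alternation.
            reward = {"semantic_match": match, "positioning_accuracy": loc}
            if round_idx % 2 == 0:
                update_diffusion(reward)
            else:
                update_attention(reward)
    return dataset
```

Passing the verifiers and update steps as callables keeps the loop independent of any particular diffusion model or reinforcement learning algorithm, neither of which is fixed by the text above.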
As an alternative embodiment, dynamically generating the semantic guidance vector includes: analyzing the defect position and the size parameter based on the defect text description, and assigning initial weights to the visual feature vector to generate weighted visual features; analyzing the defect type based on the defect text description to segment the semantic feature vector into hierarchical semantic features comprising core semantic features and auxiliary semantic features; taking the weighted visual features and the core semantic features as initial fusion objects, fusing them through the cross attention mechanism, calculating the matching degree between the fusion result and the defect text description, and dynamically adjusting the fusion coefficient of the auxiliary semantic features according to the matching degree; and iteratively executing the cross attention fusion, wherein each iteration updates the initial weights of the visual feature vector and the hierarchical semantic features according to the current fusion result, until the matching degree converges, so as to dynamically generate the semantic guidance vector.

As an alternative embodiment, the mapping to the unified semantic space includes: processing the industrial product image to obtain an enhanced image, inputting the industrial product image and the enhanced image into the visual language big model, and extracting the visual feature vector through residual fusion of enhanced features and basic visual features; acquiring and analyzing the defect text description, inputting the analysis result into the visual language big model, and extracting the semantic feature vector by embedding domain knowledge and dynamically adjusting the weight ratio through a gating mechanism; feature similarity of the visual feature vector and the semantic feature vector is calculated, and feature projection and normalization p