
CN-122023419-A - Unsupervised detection method for metal surface defects based on a variational autoencoder


Abstract

The invention discloses an unsupervised detection method for metal surface defects based on a variational autoencoder network, and relates to the technical fields of computer vision and industrial nondestructive testing. The method adopts an overall encoder / variational bottleneck / decoder architecture that takes a single-channel grayscale image as input and outputs a reconstructed image of the same size together with intermediate reconstruction results at three scales. The multi-scale dense attention variational autoencoder network comprises six modules: three-domain decoupled attention gates refine features along three complementary dimensions (channel, spatial, and frequency domain) at each stage of the encoder; the decoder selectively fuses encoder detail features through attention-gated skip connections; a multi-scale reconstruction head provides supervision signals at three resolution levels; and learnable scale-fusion weights adaptively fuse the three-scale anomaly maps. The method achieves high-precision, high-robustness detection of multi-scale defects on metal surfaces and is suitable for industrial nondestructive inspection of metal surfaces.

Inventors

  • CHANG YASHENG
  • YAN CHENG

Assignees

  • 源泰(连云港)仪器有限公司

Dates

Publication Date
20260512
Application Date
20260414

Claims (5)

  1. An unsupervised metal surface defect detection method based on a variational autoencoder, characterized in that the multi-scale dense attention variational autoencoder network implementing the method is as follows. The network adopts an overall encoder / variational bottleneck / decoder architecture, takes a single-channel grayscale image as input, and outputs a reconstructed image of the same size as the input together with intermediate reconstruction results at three scales. The network comprises the following six modules:
     Module 1, multi-scale dense encoding block: each encoding stage uses four convolution branches in parallel, namely a 1×1 pointwise convolution branch, a 3×3 standard convolution branch, a 3×3 dilated convolution branch with dilation rate 2, and a 7×7 depthwise separable convolution branch; the four outputs are concatenated along the channel dimension and then fused by a 1×1 convolution with batch normalization, achieving dense aggregation of multi-scale features from fine to coarse granularity.
     Module 2, three-domain decoupled attention gate: three complementary attention mechanisms are connected in series, namely channel attention, spatial attention, and frequency-domain attention. Channel attention applies global average pooling and global max pooling to the feature map simultaneously, concatenates the results, and generates channel-wise attention weights through a fully connected layer, selectively enhancing each channel's features. Spatial attention computes the mean and maximum of the channel-attention-weighted feature map along the channel dimension, concatenates them, and generates spatial attention weights through a 7×7 convolution, focusing on the spatial location of defects. Frequency-domain attention applies a two-dimensional fast Fourier transform to the channel mean of the feature map, extracts the amplitude spectrum, and generates frequency-domain attention weights through a 1×1 convolution, enhancing sensitivity to periodic texture anomalies. The three attention mechanisms act on the feature map in series, achieving collaborative refinement along the channel, spatial, and frequency-domain dimensions.
     Module 3, variational bottleneck layer: the flattened feature vector output by the encoder is mapped to a mean vector μ and a log-variance vector log σ². In the training phase, the reparameterization trick samples the latent vector from the posterior distribution: z = μ + σ ⊙ ε, where z is the reparameterized latent vector; μ is the latent mean vector obtained by a fully connected mapping of the encoder output features; log σ² is the logarithm of the latent variance σ²; σ = exp(½ log σ²) is the latent standard deviation vector; ε is a random noise vector drawn from the standard normal distribution N(0, I); I is the identity matrix; and the addition and multiplication are performed element-wise. In the inference phase, z = μ is used directly. A variational constraint is imposed through a KL divergence loss term so that the latent representations of normal samples converge to the standard normal distribution, providing a statistical basis for the subsequent Mahalanobis-distance anomaly discrimination.
     Module 4, attention-gated skip connection: at each upsampling stage of the decoder, an attention gate selectively transfers the skip features of the corresponding encoder layer; the attention coefficients are determined jointly by the decoder gating signal and the encoder skip features, the skip features are weighted element by element after Sigmoid activation, and redundant feature transfer from background regions is suppressed so that the decoder focuses on detail information relevant to reconstructing defects.
     Module 5, multi-scale reconstruction head: reconstruction output heads consisting of a 1×1 convolution and Sigmoid activation are placed at the 1/4-resolution, 1/2-resolution, and full-resolution stages of the decoder, generating reconstructed images at three scales.
     Module 6, learnable scale-fusion weights: a learnable parameter vector α is introduced and normalized by softmax to serve as the fusion weights of the three-scale error maps: E = Σᵢ wᵢ · Up(Eᵢ), where E is the final error map after fusing the three scale error maps; α is a trainable parameter vector representing the contribution of each scale's error map to the fusion; wᵢ is the i-th weight coefficient obtained by softmax normalization of α; Up(Eᵢ) is the i-th scale error map upsampled to the original resolution; and the subscript i is the scale index. The weights are optimized automatically during training, requiring no manual tuning.
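The reparameterization step of Module 3 and the frequency-domain attention branch of Module 2 can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the 1×1 convolution of the frequency branch is reduced to a single scalar weight and bias (`w`, `b` are assumed placeholders), and all function names are illustrative.

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """Module 3 sketch: sample z = mu + sigma * eps with eps ~ N(0, I).

    mu, log_var: 1-D arrays produced by the encoder's fully connected heads.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)        # sigma = exp(0.5 * log sigma^2)
    eps = rng.standard_normal(mu.shape)  # standard normal noise vector
    return mu + sigma * eps              # element-wise reparameterized sample

def frequency_attention(feat, w=1.0, b=0.0):
    """Module 2 frequency branch, heavily simplified: channel mean ->
    2-D FFT amplitude spectrum -> scalar affine map standing in for the
    1x1 convolution -> sigmoid gate broadcast over all channels.

    feat: array of shape (C, H, W).
    """
    chan_mean = feat.mean(axis=0)                # (H, W) channel mean
    amp = np.abs(np.fft.fft2(chan_mean))         # amplitude spectrum
    gate = 1.0 / (1.0 + np.exp(-(w * amp + b)))  # sigmoid attention weight
    return feat * gate[None, :, :]               # gate applied per position
```

At inference time the patent uses z = μ directly, so `reparameterize` is only needed during training.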
  2. The method for unsupervised detection of metal surface defects according to claim 1, wherein the composite loss function for training the multi-scale dense attention variational autoencoder network is a weighted sum of four losses: L = λ₁·L_MSE + λ₂·L_SSIM + λ₃·L_KL + λ₄·L_freq, where L is the total objective loss function used in model training; λ₁, λ₂, λ₃, and λ₄ are preset non-negative weight coefficients; L_MSE is the pixel-level mean squared error loss term; L_SSIM is the structural similarity loss term; L_KL is the KL divergence loss term between the latent-variable distribution and the prior distribution; and L_freq is the frequency-domain consistency loss term. L_MSE computes the mean squared error between each of the three-scale reconstructed images and the correspondingly downsampled target image and takes a weighted sum over the scales. L_SSIM computes the structural similarity index SSIM between the full-resolution reconstructed image and the input image and takes 1 − SSIM as the loss, constraining the structural integrity of the reconstruction. L_freq applies a two-dimensional fast Fourier transform to the reconstructed image and the input image and computes the mean squared error between their amplitude spectra, constraining frequency-domain structural consistency. L_KL constrains the latent distribution to converge to the standard normal prior; following the β-VAE idea, a small weight is set to prevent posterior collapse: L_KL = −½ Σⱼ (1 + log σⱼ² − μⱼ² − σⱼ²), where μ is the mean of the latent-variable Gaussian distribution output by the encoder; σ² is the variance of that distribution; log σ² is the logarithmic representation of the variance; and Σⱼ denotes summation over the dimensions of the latent variable.
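The KL divergence and frequency-domain consistency terms of the composite loss admit a direct NumPy sketch. The weight values in `lam` below are placeholders, not values disclosed in the patent, and the SSIM term is passed in precomputed rather than implemented here.

```python
import numpy as np

def kl_loss(mu, log_var):
    """KL(q(z|x) || N(0, I)) summed over latent dimensions, matching the
    claim: -0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def freq_loss(recon, target):
    """Frequency-domain consistency: MSE between 2-D FFT amplitude spectra."""
    a = np.abs(np.fft.fft2(recon))
    b = np.abs(np.fft.fft2(target))
    return np.mean((a - b) ** 2)

def total_loss(recon, target, mu, log_var,
               lam=(1.0, 0.5, 1e-3, 0.1), ssim_term=0.0):
    """Weighted sum of the four terms. lam holds the preset non-negative
    weights (placeholder values); the small KL weight follows the beta-VAE
    idea of preventing posterior collapse. ssim_term is 1 - SSIM, assumed
    precomputed to keep this sketch short."""
    l_mse = np.mean((recon - target) ** 2)  # single-scale stand-in for
                                            # the scale-weighted MSE sum
    return (lam[0] * l_mse + lam[1] * ssim_term
            + lam[2] * kl_loss(mu, log_var) + lam[3] * freq_loss(recon, target))
```

With μ = 0 and log σ² = 0 the latent distribution equals the prior, so the KL term vanishes, which is a convenient sanity check.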
  3. The method for unsupervised detection of metal surface defects according to claim 1 or 2, wherein the method comprises a training phase and an inference phase. In the training phase, the network is trained using normal samples only; after training, inference is run on all training samples, the latent mean vectors are collected to build a normal-sample mean vector library, and the covariance matrix of the library is computed. The inference phase comprises: inputting the image to be detected into the multi-scale dense attention variational autoencoder network to obtain the three-scale reconstructed images and the latent mean vector; computing the three-scale pixel-level absolute error maps and upsampling them to the original resolution; fusing them with the learnable scale-fusion weights to obtain an initial anomaly map; computing the Mahalanobis distance between the current sample's latent mean vector and the normal-sample mean vector library; mapping the Mahalanobis distance to a modulation coefficient and amplitude-modulating the anomaly map; and applying the Sauvola adaptive threshold algorithm to the modulated anomaly map to generate a binary defect localization map.
  4. The method for unsupervised detection of metal surface defects according to claim 3, wherein in the training phase only normal samples are used; an Adam optimizer is combined with a cosine-annealing learning-rate schedule, and gradients are clipped to ensure training stability. After training is complete, inference is run on all training samples, the latent mean vectors of all samples are collected to construct the normal-sample mean vector library, and the covariance matrix of the library is computed for use in the Mahalanobis-distance calculation of the inference phase.
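Building the normal-sample mean vector library described in this claim amounts to stacking the collected latent mean vectors and computing their covariance. A minimal NumPy sketch follows; the ridge term `eps` is an assumed regularizer for numerical stability and is not specified in the patent.

```python
import numpy as np

def build_normal_library(mu_vectors, eps=1e-6):
    """Collect latent mean vectors of all normal training samples and
    return the library mean and the inverse of the (regularized)
    covariance matrix, for Mahalanobis scoring at inference time.

    mu_vectors: iterable of N latent mean vectors, each of dimension d.
    """
    M = np.asarray(mu_vectors)            # (N, d) stacked mean vectors
    lib_mean = M.mean(axis=0)             # library mean vector
    cov = np.cov(M, rowvar=False)         # (d, d) sample covariance
    cov += eps * np.eye(M.shape[1])       # assumed ridge for invertibility
    return lib_mean, np.linalg.inv(cov)
```

Precomputing the inverse covariance once avoids a matrix solve per test image during inference.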
  5. The method for unsupervised detection of metal surface defects according to claim 3, wherein the inference phase performs the following steps: S1, input the image to be detected into the multi-scale dense attention variational autoencoder network to obtain the three-scale reconstructed images and the latent mean vector; S2, compute the three-scale pixel-level absolute error maps and bilinearly upsample the 1/4- and 1/2-resolution error maps to the original resolution; S3, weight and fuse the three-scale error maps with the learnable scale-fusion weights to obtain the initial fused anomaly map.
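Steps S1 to S3 above, together with the Mahalanobis scoring and Sauvola thresholding of claim 3, can be sketched in NumPy as follows. The window size and the `k`, `R` constants of the Sauvola rule are common defaults, not values from the patent, and the dense sliding-window implementation is chosen for clarity over speed.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def fuse_error_maps(err_maps, alpha):
    """Module 6: softmax-normalized learnable weights fuse the three
    (already upsampled) scale error maps into one anomaly map."""
    w = softmax(np.asarray(alpha, dtype=float))
    return sum(wi * e for wi, e in zip(w, err_maps))

def mahalanobis(mu, lib_mean, inv_cov):
    """Distance of a test sample's latent mean to the normal library."""
    d = mu - lib_mean
    return float(np.sqrt(d @ inv_cov @ d))

def sauvola_threshold(img, window=15, k=0.2, R=0.5):
    """Sauvola adaptive binarization, T = m * (1 + k*(s/R - 1)), computed
    with an explicit local window at each pixel for readability."""
    H, W = img.shape
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros((H, W), dtype=bool)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + window, j:j + window]
            m, s = patch.mean(), patch.std()
            out[i, j] = img[i, j] > m * (1 + k * (s / R - 1))
    return out
```

In the full pipeline the Mahalanobis distance would first be mapped to a modulation coefficient that rescales the fused anomaly map before thresholding; that mapping is not detailed in the claims, so it is omitted here.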

Description

Unsupervised detection method for metal surface defects based on a variational autoencoder

Technical Field

The invention relates to the technical fields of computer vision, deep learning, and industrial nondestructive testing, and in particular to an unsupervised detection method for metal surface defects based on a multi-scale dense attention variational autoencoder network (Multi-Scale Dense Attention Variational Autoencoder Network, MSDAVAENET), belonging to the technical field of image anomaly detection.

Background

Metal surface defect detection is a core step of industrial quality control and is important for guaranteeing product reliability. Existing detection techniques fall mainly into two categories.

(I) Conventional machine vision methods. Traditional methods rely on manually designed image features (such as gradients and texture statistics) and localize defects through threshold segmentation or template matching. Such methods are sensitive to illumination variation and background texture complexity, have limited generalization capability, and adapt poorly to diverse industrial scenes.

(II) Supervised methods based on deep learning. Supervised deep learning methods (such as object detection networks and semantic segmentation networks) perform excellently when labeled data are sufficient, but industrial defect samples are rare and labeling is extremely costly, which limits their practical application.

Existing unsupervised autoencoder methods based on reconstruction error need only normal samples for training and require no defect labeling, which is of important practical value. However, existing methods have the following technical drawbacks:

1. Single receptive field: the encoder uses convolution kernels of a fixed size (usually 3×3) and cannot simultaneously perceive defect features of different sizes, so detection capability is unbalanced between tiny defects and large-area defects.
2. Loss function limitations: only mean squared error (MSE) is used as the reconstruction loss, ignoring the structural similarity information and frequency-domain characteristics of the image, which makes the model insensitive to structural defects.
3. Lack of attention mechanism: all regions are treated equally during encoding and decoding, the feature response of defect-related regions cannot be highlighted, and background texture noise causes obvious interference.
4. Loss of detail features: the decoder cannot reuse the detail features extracted by the encoder, high-frequency details of the reconstructed image are lost, and the contrast of reconstruction errors in defect regions is reduced.
5. Unconstrained latent space: the latent-space distribution of an ordinary autoencoder is scattered, the latent representations of normal and abnormal samples lack statistical separability, and latent-space information is difficult to exploit for discrimination.
6. Single-scale anomaly detection: reconstruction errors are computed only at the original resolution, so sensitivity to defects of different sizes is inconsistent.

In summary, existing unsupervised defect detection methods have obvious shortcomings in detection accuracy, robustness, and multi-scale adaptability, and a systematic improvement is needed.
Disclosure of the Invention

Aiming at the shortcomings of the prior art, the invention provides an unsupervised detection method for metal surface defects based on a multi-scale dense attention variational autoencoder network. Through coordinated innovations the method systematically overcomes the drawbacks of the prior art and achieves high-precision, high-robustness detection of multi-scale defects on metal surfaces without relying on defect-labeled data. The technical problem addressed by the invention is solved by the following technical scheme. The invention relates to an unsupervised metal surface defect detection method based on a variational autoencoder, characterized in that the multi-scale dense attention variational autoencoder network implementing the method is as follows: it adopts an overall encoder / variational bottleneck / decoder architecture, takes a single-channel grayscale image as input, and outputs a reconstructed image of the same size as the input together with intermediate reconstruction results at three scales; the network comprises the following six modules: Module 1, multi-scale dense encoding block: each encoding stage uses four convolution branches in parallel, namely a 1×1 pointwise convolution branch, a