CN-121982009-A - Industrial anomaly image real-time detection method based on self-supervision visual large model
Abstract
The invention discloses a real-time detection method for industrial anomaly images based on a self-supervised visual large model, belonging to the field of industrial vision. Built on an architecture that couples offline deep training with online lightweight inference, the method is suited to real-time industrial image anomaly detection. A self-supervised large model trained on LVD-1689M with the DINOv3 framework serves as a general feature extractor; its weights are kept frozen while fine-tuning is performed on the real industrial anomaly dataset MVTec AD 2. A ViT-Adapter is introduced to extract multi-scale features, which are fused by a Linear Bottleneck layer; after images are aligned with the DINO pre-training resolution, L1 Loss, GIoU Loss, DIoU Loss and Focal Loss are used for convergence, achieving efficient transfer from general features to the specific industrial defect task. In the online stage, only the lightweight network is deployed: image streams are received through a Flask framework and millisecond-level real-time inference is completed. The method effectively addresses the high false-alarm rate, coarse localization and poor real-time performance of existing methods.
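The Linear Bottleneck fusion mentioned in the abstract (layer normalization, activation, then a linear projection from 1024 to 256 dimensions) can be sketched as follows. This is a minimal NumPy illustration under assumed choices (a GELU activation and randomly initialized, untrained projection weights), not the patented implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (learnable affine omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_bottleneck(features, W, b):
    # F_out = W * sigma(LN(F)): project high-dimensional backbone
    # features (1024-d) down to the detector head's 256 dimensions.
    return gelu(layer_norm(features)) @ W + b

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 1024))    # e.g. 14x14 patch tokens, 1024-d
W = rng.standard_normal((1024, 256)) * 0.02  # hypothetical projection weights
b = np.zeros(256)
out = linear_bottleneck(tokens, W, b)
print(out.shape)  # (196, 256)
```

In practice the projection weights would be trained jointly with the detection head while the backbone stays frozen, as the claims describe.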
Inventors
- ZHANG QIAN
- FANG PENG
Assignees
- East China Normal University
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-29
Claims (3)
- 1. A real-time detection method for industrial anomaly images based on a self-supervised visual large model, comprising a data preparation stage, an offline training stage, an online reasoning stage and a result verification stage, and specifically comprising:
Data preparation stage: Step 1, data preparation, namely preparing in advance a large natural dataset for training a general visual feature extractor, and a labeled industrial anomaly image dataset for supervised training;
Offline training stage: Step 2, feature extraction, namely pre-training on the LVD-1689M dataset with the DINOv3 training framework (DINO: Emerging Properties in Self-Supervised Vision Transformers) to obtain a universal visual feature extractor; introducing a ViT-Adapter, a module consisting of a cross-attention layer and a feed-forward network, to extract multi-scale features; and adding a Linear Bottleneck layer after multi-scale feature extraction to fuse the features and reduce them from 1024 to 256 dimensions, according to the formula:
$F_{out} = W\,\sigma(\mathrm{LN}(F_{ms}))$
wherein $\mathrm{LN}$ is layer normalization, $\sigma$ is the activation function, and $W$ represents a linear transformation projecting the high-dimensional DINOv3 features into the dimension required by the detector head;
Step 3, detection head design, namely taking only the last-layer features from the DINOv3 feature extractor, introducing a Focus Query mechanism so that the model attends more to local regions, and aligning the image with the DINO pre-training resolution by Sliding Window Inference;
Step 4, detection head training strategy, namely freezing the backbone parameters, adopting the AdamW optimizer with weight decay set between 0.2 and 0.3, and using the loss function L1 Loss + GIoU Loss + DIoU Loss + Focal Loss;
Focal Loss is given by:
$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$
wherein $p_t$ is the model's predicted probability for the correct category and $\gamma$ is the focusing parameter; the larger $\gamma$ is, the stronger the suppression of easy samples;
GIoU Loss, i.e., generalized intersection-over-union loss, is given by:
$L_{GIoU} = 1 - \mathrm{IoU} + \dfrac{|C| - |B \cup B^{gt}|}{|C|}$
wherein $B$ is the prediction box, $B^{gt}$ is the ground-truth box, and $C$ is the minimum rectangle containing both the prediction box and the ground-truth box;
DIoU Loss, i.e., distance intersection-over-union loss, is given by:
$L_{DIoU} = 1 - \mathrm{IoU} + \dfrac{\rho^2(b, b^{gt})}{c^2}$
wherein $\rho(b, b^{gt})$ is the Euclidean distance between the center point $b$ of the prediction box and the center point $b^{gt}$ of the ground-truth box, and $c$ is the diagonal length of the minimum enclosing rectangle containing both boxes;
Online reasoning stage: Step 5, model deployment, namely performing lightweight deployment, via the Flask framework, of the lightweight detection network obtained from the fine-tuning in Step 2 and Step 3; a user selects an industrial image through the software interface, the industrial control software preprocesses the image, inference is performed with the fine-tuned lightweight network, and high-confidence defect categories and defect locations are annotated on the original image and displayed intuitively in the software interface;
Result verification stage: Step 6, result verification, namely keeping the deployed lightweight detection model under the same fps condition, counting the mAP and AU-PRO values of the model's detections, and judging the detection performance of the model.
- 2. The method for real-time detection of industrial anomaly images based on a self-supervised visual large model according to claim 1, wherein the detection head training strategy of Step 4 adopts a supervised learning strategy: the fine-grained labels provided by the MVTec AD industrial defect dataset are used to train the DEIMv2 detection head end to end, during which the weights of the DINOv3 feature extractor are kept frozen and only the detection head parameters are optimized.
- 3. The method for real-time detection of industrial anomaly images based on a self-supervised visual large model according to claim 2, wherein the fine-grained labels provided by the MVTec AD industrial defect dataset include defect bounding boxes, class labels, and pixel-level segmentation masks.
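As a plain-Python illustration of the loss terms named in claim 1, the sketch below implements Focal Loss, GIoU Loss and DIoU Loss for axis-aligned boxes `(x1, y1, x2, y2)`. The helper names and the scalar (per-box) formulation are illustrative assumptions, not the patent's implementation:

```python
import math

def _iou_terms(box_p, box_g):
    # Returns IoU, union area, and the minimum enclosing rectangle C
    # of a prediction box and a ground-truth box.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    c = (min(box_p[0], box_g[0]), min(box_p[1], box_g[1]),
         max(box_p[2], box_g[2]), max(box_p[3], box_g[3]))
    return inter / union, union, c

def giou_loss(box_p, box_g):
    # L_GIoU = 1 - IoU + (|C| - |union|) / |C|
    iou, union, c = _iou_terms(box_p, box_g)
    c_area = (c[2] - c[0]) * (c[3] - c[1])
    return 1.0 - iou + (c_area - union) / c_area

def diou_loss(box_p, box_g):
    # L_DIoU = 1 - IoU + rho^2(b, b_gt) / c^2, with rho the distance
    # between box centers and c the diagonal of the enclosing rectangle.
    iou, _, c = _iou_terms(box_p, box_g)
    pc = ((box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2)
    gc = ((box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2)
    rho2 = (pc[0] - gc[0]) ** 2 + (pc[1] - gc[1]) ** 2
    c2 = (c[2] - c[0]) ** 2 + (c[3] - c[1]) ** 2
    return 1.0 - iou + rho2 / c2

def focal_loss(p_t, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t): down-weights easy samples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
print(round(diou_loss((0, 0, 2, 2), (1, 1, 3, 3)), 4))
print(round(focal_loss(0.9), 6), round(focal_loss(0.1), 6))
```

Note how both IoU-based losses vanish for identical boxes, while Focal Loss is far larger for a hard sample (low $p_t$) than for an easy one, matching the claim's description of easy-sample suppression.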
Description
Industrial anomaly image real-time detection method based on self-supervised visual large model
Technical Field
The invention relates to the technical field of computer vision, in particular to the application of self-supervised visual large models and real-time image detection to industrial anomaly image scenarios, and more particularly to a real-time detection method for industrial anomaly images based on a self-supervised visual large model.
Background
Industrial anomaly image detection is an important subtask of computer vision. Given an original image supplied by a user, the task requires a powerful visual detection model to label the type of defect anomaly in the image and the coordinates locating the defect. The field specifically studies how to achieve reliable anomaly detection when defective samples are scarce, mainly through two families of methods: reconstruction-based and embedding-based. Although many methods have been proposed to address data scarcity, bottlenecks remain. First, high false-alarm rates: such methods are overly sensitive to small variations in normal samples, such as illumination or positional shifts, and easily raise false alarms that disrupt the production rhythm. Second, coarse localization: the anomaly heat maps produced by reconstruction methods tend to have blurred boundaries, making pixel-level defect segmentation difficult. Third, many embedding methods require nearest-neighbor search or distribution computation in a high-dimensional feature space, which is computationally expensive and hard to reconcile with the real-time demands of high-speed online inspection. Lacking an efficient framework, such algorithms typically cannot effectively "distill" or "solidify" their core ideas, such as normal-feature modeling, into a lightweight detection network that can run in real time.
With technological progress, it has been found that self-supervised pre-training on very large unlabeled datasets, i.e., self-supervised visual large models, can learn extremely general, robust and semantically rich visual feature representations. Such a model can understand the deep semantics, geometric structure and contextual relations of images; it is highly robust to partial occlusion, deformation and illumination changes of objects; it can show excellent zero-shot or few-shot performance on dense prediction tasks such as segmentation without training for specific categories; and, in particular, its training requires no manual annotation, fundamentally relieving the dependence on large-scale labeled datasets. Although a self-supervised visual large model appears well suited to industrial inspection requirements, its inherent characteristics prevent direct application in industrial real-time scenarios. The parameter count of a visual large model and the volume of its training data tend to be extremely large, resulting in huge computational overhead: a single inference takes far too long to meet the millisecond-level real-time response required on a production line, and deployment on edge devices is costly. Moreover, while such a model has a strong generalization foundation, it lacks targeted optimization for specific industrial domains, such as particular material textures and defect modes, so directly applying it to specialized tasks exhibits the coexistence of "knowledge redundancy" and "insufficient specificity".
Disclosure of Invention
The invention aims to improve the accuracy of real-time industrial image detection with a method of small computational cost.
By utilizing a self-supervised visual large model combined with a lightweight detection and inference framework suited to the downstream task, a detection method is designed that meets industrial anomaly detection accuracy requirements while achieving millisecond-level real-time performance. To achieve the above purpose, the present invention adopts the following technical scheme. The industrial anomaly image real-time detection method based on a self-supervised visual large model comprises a data preparation stage, an offline training stage, an online reasoning stage and a result verification stage, and specifically comprises the following steps:
Data preparation stage
Step 1, data preparation: preparing in advance a large natural dataset for training a general visual feature extractor, and a labeled industrial anomaly image dataset for supervised training.
Offline training stage
Step 2, feature extraction: pre-training on the LVD-1689M dataset with the DINOv3 training framework (DINO: Emerging Properties in Self-Supervised Vision Transformers) to obtain a universal visual feature extractor, introducing a ViT-Adapter, extracting multi-scale features by using a module