CN-116883933-B - Security inspection contraband detection method based on multi-scale attention and data enhancement

Abstract

The invention discloses a security inspection contraband detection method based on multi-scale attention and data enhancement. The method comprises: constructing a contraband target detection model; training its backbone network on a natural image classification data set to obtain pre-training weights of the backbone network; training the contraband target detection model, loaded with the backbone pre-training weights, on a natural image target detection data set to obtain pre-training weights of the whole model; initializing the contraband target detection model with these weights and retraining it on a security inspection X-ray image data set to obtain a trained model; and adjusting the trained model and inputting an X-ray image to be detected to obtain the class and bounding box of the contraband. The invention can effectively locate contraband in X-ray images with cluttered backgrounds and mutually overlapping objects, improving the efficiency and reliability of security inspection.

Inventors

QIU JIAN
YE XIAOFENG
PENG LI
HAN PENG
LUO KAIQING
LIU DONGMEI

Assignees

South China Normal University (华南师范大学)

Dates

Publication Date: 20260505
Application Date: 20230620

Claims (9)

1. A security inspection contraband detection method based on multi-scale attention and data enhancement, characterized by comprising the following steps:
S1, constructing a contraband target detection model MSA-DETR, wherein the model comprises a data enhancement module, a backbone network, a position coding module, a Transformer encoder-decoder and a target detection head, the backbone network is MSANet, and its basic module is the MSA Module;
the multi-scale feature extraction module divides the input feature map into S sub-feature maps along the channel dimension, passes each sub-feature map through its own group of convolution layers, and outputs S feature maps after superimposing and fusing the convolution results of all groups;
the multi-scale channel attention module applies global average pooling and global max pooling in the spatial dimension to each sub-feature map output by the multi-scale feature extraction module to aggregate global spatial information, captures the interdependence among channels through a shared multi-layer perceptron, superimposes the two results, and activates the channel attention map with a sigmoid function; to realize cross-channel information interaction between feature maps of different scales, a softmax operation is performed on all channel attention maps to obtain a multi-scale channel attention map, enabling the contraband target detection model to adaptively select specific channel information of a specific-scale feature map; each feature map is then recalibrated according to the channel weights in the multi-scale channel attention map, and the recalibrated feature maps are spliced in the channel dimension to obtain the output feature map of the multi-scale channel attention module;
the multi-scale spatial attention module applies global average pooling and global max pooling in the channel dimension to each sub-feature map output by the multi-scale feature extraction module to aggregate global channel information, concatenates the two results, captures the interdependence between spatial positions through a deformable convolution layer, and activates the spatial attention map with a sigmoid function;
S2, training the backbone network MSANet on a natural image classification data set to obtain pre-training weights of MSANet, and then training the contraband target detection model MSA-DETR, loaded with the pre-training weights of MSANet, on a natural image target detection data set to obtain pre-training weights of MSA-DETR;
S3, adjusting the target detection head of the contraband target detection model, loading the pre-training weights of MSA-DETR obtained in S2 to initialize the contraband target detection model, and then retraining the model on a security inspection X-ray image data set to obtain a trained MSA-DETR model; and
S4, removing the data enhancement module from the trained contraband target detection model, and inputting the X-ray image to be detected into the trained model to obtain the class and the bounding box of the contraband.
2. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 1, wherein in step S1 the data enhancement module preprocesses the input image data using the ObjectMix method, specifically as follows: all contraband target regions of another input image are cropped and fused, at a set proportion, into random regions of one input image, and the fused image is used as input data; after the output of the contraband target detection model is obtained, the loss function of the model is calculated according to the fusion proportion. The quantities involved are: the set of all contraband target regions in the other image; the set of binary image masks of the fused random regions, the two sets having the same number of elements; the fusion ratio, obtained by random sampling; a random number which, together with a hyperparameter, controls the proportion of data-enhanced samples; element-wise matrix multiplication; the addition of the fused regions at the masked positions; the loss function of the image before fusion; the loss function between the output of the target detection model and the bounding box information corresponding to the new target set; and the loss function of the target detection model after the fusion processing.
3. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 1, wherein in step S1 the backbone network MSANet passes the input through a 7×7 convolution layer and a 3×3 max pooling layer, each with a stride of 2, then through a network layer formed by stacking a plurality of basic residual blocks (MSANet Block), and, when trained on the image classification task, finally obtains the MSANet output through a global average pooling layer, a fully connected layer and a softmax function in sequence; each basic residual block MSANet Block is formed by connecting a 1×1 convolution layer, a basic module MSA Module and a 1×1 convolution layer in a residual manner, wherein the basic module MSA Module consists of a multi-scale feature extraction module, a multi-scale channel attention module and a multi-scale spatial attention module; for a given input image size, the backbone network outputs feature maps of three different sizes, and the smallest feature map is further passed through a convolution with a stride of 2, finally yielding four image feature maps of different scales.
4. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 3, wherein the multi-scale feature extraction module divides the input feature map into S sub-feature maps along the channel dimension, passes each sub-feature map through its own group of convolution layers, and outputs S feature maps after superimposing and fusing the convolution results of all groups, such that the convolution producing the i-th output feature map also incorporates the multi-scale information of the preceding i-1 sub-feature maps; each sub-feature map has its own corresponding group of convolution layers, and, to reduce the number of parameters of the multi-scale feature extraction module, the convolution operation employs grouped convolution.
5. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 4, wherein the multi-scale channel attention module applies global average pooling and global max pooling in the spatial dimension to each sub-feature map output by the multi-scale feature extraction module to aggregate global spatial information; the interdependence among channels is then captured through a shared multi-layer perceptron, the two results are superimposed, and a sigmoid function is used to activate the channel attention map; to realize cross-channel information interaction between feature maps of different scales, a softmax operation is performed on all channel attention maps to obtain a multi-scale channel attention map, enabling the contraband target detection model to adaptively select specific channel information of a specific-scale feature map; each feature map is then recalibrated according to the channel weights in the multi-scale channel attention map, wherein the MLP is a multi-layer perceptron comprising a hidden layer, the activation is the sigmoid function, and the recalibration multiplies corresponding matrix elements; finally, the recalibrated feature maps are spliced in the channel dimension to obtain the output feature map of the multi-scale channel attention module.
6. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 4, wherein the multi-scale spatial attention module applies global average pooling and global max pooling in the channel dimension to each sub-feature map output by the multi-scale feature extraction module to aggregate global channel information; the two results are concatenated, the interdependence between spatial positions is captured through a deformable convolution layer, and a sigmoid function is used to activate the spatial attention map.
7. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 6, wherein, to realize cross-space information interaction between feature maps of different scales, a softmax operation is performed on all spatial attention maps to obtain a multi-scale spatial attention map, enabling the contraband target detection model to adaptively select specific spatial position information of a specific-scale feature map; each feature map is then recalibrated according to the weight of each spatial position in the multi-scale spatial attention map, wherein the convolution is a deformable convolution of a specified kernel size, used to enlarge the receptive field and adapt to changes in the shape and size of the target, the activation is the sigmoid function, and the recalibration multiplies corresponding matrix elements; finally, the recalibrated feature maps are spliced in the channel dimension to obtain the output feature map of the multi-scale spatial attention module.
8. The security inspection contraband detection method based on multi-scale attention and data enhancement according to any one of claims 3 to 7, wherein the output feature map of the basic module MSA Module is obtained by summing the corresponding elements of the output feature map of the multi-scale channel attention module and the output feature map of the multi-scale spatial attention module.
9. The security inspection contraband detection method based on multi-scale attention and data enhancement according to claim 1, wherein in step S1 the overall detection flow of the contraband target detection model MSA-DETR is as follows: the input X-ray images are preprocessed by the data enhancement module, which converts part of the input images into ObjectMix-transformed images; the backbone network MSANet extracts features from the processed image data to obtain multi-scale feature maps; position coding is applied to each multi-scale feature map to represent the position information within it; the multi-scale feature maps and their position codes undergo dimension transformation into sequence data, which is input into the Transformer encoder-decoder structure; the encoder-decoder outputs the feature information of a fixed number N of target rectangular boxes, wherein the self-attention module of the encoder and the cross-attention module of the decoder use multi-scale deformable attention modules; and finally the feature information of the N predicted target rectangular boxes is decoupled, through a projection matrix in the target detection head, into the corresponding classes and bounding box coordinates.
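
The ObjectMix preprocessing of claim 2 can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the linear blend inside the pasted region, the form of `mixed_loss`, the use of `p` as the share of augmented samples, and the names `object_mix`, `mixed_loss`, `lam` and `p` are hypothetical and are not taken from the claim. Images are single-channel nested lists for brevity.

```python
import random

def object_mix(img_a, img_b, boxes_b, lam, p, rng):
    """Blend the target regions boxes_b of img_b into random places of img_a.

    Images are 2-D lists of floats; each box is (top, left, height, width).
    Returns the mixed image and the list of paste boxes in img_a.
    """
    mixed = [row[:] for row in img_a]
    pasted = []
    if rng.random() >= p:  # p: assumed share of samples that get augmented
        return mixed, pasted
    rows, cols = len(img_a), len(img_a[0])
    for top_b, left_b, h, w in boxes_b:
        # Random paste location that keeps the patch inside img_a.
        top_a = rng.randint(0, rows - h)
        left_a = rng.randint(0, cols - w)
        for dy in range(h):
            for dx in range(w):
                a = mixed[top_a + dy][left_a + dx]
                b = img_b[top_b + dy][left_b + dx]
                # Assumed blend: lam weights the pasted target, 1 - lam the background.
                mixed[top_a + dy][left_a + dx] = lam * b + (1.0 - lam) * a
        pasted.append((top_a, left_a, h, w))
    return mixed, pasted

def mixed_loss(loss_a, loss_pasted, lam):
    """Assumed combination: original loss plus the pasted-target loss scaled by lam."""
    return loss_a + lam * loss_pasted

rng = random.Random(0)
img_a = [[0.0] * 6 for _ in range(6)]
img_b = [[1.0] * 6 for _ in range(6)]
mixed, pasted = object_mix(img_a, img_b, [(1, 1, 2, 3)], lam=0.5, p=1.0, rng=rng)
```

Unlike a random CutMix-style crop, only annotated target regions are pasted in this sketch, so every pasted patch comes with a valid bounding box label.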

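The MSA Module described in claims 4 to 8 can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the patented implementation: feature maps are lists of channels with flattened positions, `smooth3` stands in for a learned (grouped) convolution, the fusion rule "add the previous output before convolving" is assumed, `tiny_mlp` is a hypothetical shared MLP with fixed weights, and a plain weighted sum replaces the deformable convolution in the spatial branch.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def smooth3(channel):
    """Stand-in for a learned convolution: 3-tap mean filter with zero padding."""
    padded = [0.0] + channel + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(channel))]

def extract_multi_scale(x, s):
    """Split channels into s groups; each group also sees the previous group's output."""
    step = len(x) // s
    outputs, prev = [], None
    for i in range(s):
        group = x[i * step:(i + 1) * step]
        if prev is not None:
            # Assumed fusion rule: add the previous output before convolving.
            group = [[a + b for a, b in zip(ch, p)] for ch, p in zip(group, prev)]
        prev = [smooth3(ch) for ch in group]
        outputs.append(prev)
    return outputs

def softmax_over_scales(maps):
    """maps: S lists of equal length; softmax across the S entries at each index."""
    exps = [[math.exp(v) for v in m] for m in maps]
    totals = [sum(e[i] for e in exps) for i in range(len(maps[0]))]
    return [[e[i] / totals[i] for i in range(len(e))] for e in exps]

def channel_branch(ys, mlp):
    """Pool over positions, shared MLP, sigmoid, softmax over scales, recalibrate."""
    att = []
    for y in ys:
        avg = mlp([sum(ch) / len(ch) for ch in y])
        mx = mlp([max(ch) for ch in y])
        att.append([sigmoid(a + m) for a, m in zip(avg, mx)])
    weights = softmax_over_scales(att)
    # Scale every channel by its weight, then concatenate all scales along channels.
    return [[w * v for v in ch] for y, ws in zip(ys, weights) for ch, w in zip(y, ws)]

def spatial_branch(ys, w_avg=1.0, w_max=1.0):
    """Pool over channels, mix, sigmoid, softmax over scales, recalibrate."""
    att = []
    for y in ys:
        n = len(y[0])
        avg = [sum(ch[p] for ch in y) / len(y) for p in range(n)]
        mx = [max(ch[p] for ch in y) for p in range(n)]
        # Plain weighted sum used here in place of the deformable convolution.
        att.append([sigmoid(w_avg * a + w_max * m) for a, m in zip(avg, mx)])
    weights = softmax_over_scales(att)
    return [[w * v for w, v in zip(ws, ch)] for y, ws in zip(ys, weights) for ch in y]

def msa_module(x, s, mlp):
    ys = extract_multi_scale(x, s)
    f_c = channel_branch(ys, mlp)
    f_s = spatial_branch(ys)
    # Element-wise sum of the two branch outputs, as in claim 8.
    return [[a + b for a, b in zip(c1, c2)] for c1, c2 in zip(f_c, f_s)]

def tiny_mlp(vec):
    """Hypothetical shared MLP: one hidden ReLU unit with fixed weights."""
    hidden = max(0.0, sum(vec) / len(vec))
    return [0.5 * hidden for _ in vec]

x = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]  # 4 channels, 4 positions
out = msa_module(x, s=2, mlp=tiny_mlp)
```

The softmax across scales is what lets one scale's channel or position outweigh the same channel or position at another scale; with identical inputs at every scale, each of the S scales simply receives weight 1/S.
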
Description

Security inspection contraband detection method based on multi-scale attention and data enhancement

Technical Field

The invention belongs to the field of target detection, and particularly relates to a security inspection contraband detection method based on multi-scale attention and data enhancement.

Background

In large, crowded public transportation hubs such as subway stations, railway stations and airports, the X-ray security inspection machine is an essential device for checking whether passengers' luggage contains prohibited articles, such as controlled knives, lighters and guns, so as to ensure public security. In the actual security inspection process, however, a human security inspector must examine the X-ray image of every piece of luggage to determine whether it contains contraband. Because contraband is rare in real scenes, inspectors easily miss items through fatigue or lapses in concentration, and long periods of repetitive work are harmful to their physical and mental health. In addition, the background articles in the scanned images are cluttered, which limits the inspectors' detection efficiency and causes passenger congestion during peak travel periods. An automatic security inspection contraband detection algorithm therefore has important practical value.

The rapid development of deep learning has made computer vision algorithms based on Convolutional Neural Networks (CNNs) a mainstream tool for image processing and visual understanding in many scenarios, and locating contraband in security inspection X-ray images can be treated as a target detection problem in computer vision.
Currently, contraband detection algorithms use CNN-based models, including two-stage algorithms that focus on accuracy and single-stage algorithms that focus on real-time performance. With the introduction of the Transformer structure into computer vision in recent years, DETR has emerged as a new detection framework that treats target detection as a direct set prediction problem, eliminating the need for many hand-designed components and simplifying the detection flow; however, DETR has not yet been applied to contraband detection.

In the field of contraband detection, few high-quality public data sets are available, so models tend to overfit for lack of training data and have poor robustness. To address this problem, data enhancement methods can be used to generate additional synthetic samples, increasing the training data and improving the generalization ability of the model. CutOut randomly selects a region of the image and erases all pixels in it, but this easily occludes key targets, so that key features in the image are lost and model learning is impaired. Mixup generates a new image by weighted fusion of two images, forming new mixed-class samples between samples of different classes; this smooths the distribution of minority-class samples and improves the generalization of the model, but on a data set with severe aliasing the method aggravates background aliasing and the improvement in model accuracy is limited.
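
The two baseline augmentations just described can be sketched in a few lines. This is a minimal illustration on single-channel nested lists; the function names `cutout` and `mixup` and the fixed ratio `lam` are chosen here for illustration (in practice the Mixup ratio is usually sampled at random and the labels are mixed with the same ratio).

```python
def cutout(image, top, left, size):
    """Erase (zero) a size x size square region; returns a new image."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = 0.0
    return out

def mixup(img_a, img_b, lam):
    """Pixel-wise weighted fusion of two equally sized images."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

img_a = [[1.0] * 4 for _ in range(4)]
img_b = [[3.0] * 4 for _ in range(4)]
erased = cutout(img_a, top=1, left=1, size=2)   # 2 x 2 block set to zero
blended = mixup(img_a, img_b, lam=0.25)         # every pixel mixes both images
```

The sketch makes the trade-off visible: `cutout` discards whatever lies under the erased square, including any target, while `mixup` alters every pixel of the image, background included.
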
CutMix replaces a region of the current image with a region of a different image instead of blending the whole image as Mixup does, which reduces the background aliasing of Mixup to some extent; but because the regions are selected at random, non-target regions are easily chosen, causing label mismatches for target objects, and key targets may be cut off, causing the loss of key features.

The difficulty in contraband detection is that X-rays are penetrating and the objects in scanned luggage are placed at random, so target objects of different sizes in security inspection X-ray images easily overlap with or are occluded by other objects, and the detection process is disturbed by noise in cluttered background regions. Existing solutions mainly use higher-level features with stronger semantic information to suppress lower-level noise, and introduce attention mechanisms to enhance the edge contours, colors and other cues of the target. The CHR model inserts reverse connections between different levels of the network, using the high-level supervision information that high-level features provide for low-level features to eliminate the interference of background noise, but its refinement function has no explicit expression, and it studies the classification problem. The DOAM module makes the model focus more on the contour shape and color-texture features of objects, but the module acts only on the input end of the model and does not consider the multi-scale information of the object, so that the e