CN-115294356-B - Target detection method based on wide-area receptive field spatial attention
Abstract
The invention discloses a target detection method based on wide-area receptive field spatial attention. The method comprises preparing an image data set for training and testing; constructing a target detection network based on wide-area receptive field spatial attention, the network comprising Backbone, Neck, Head and MSA parts; and extracting features from the test set images with the trained network. The invention captures pixel-level feature information from the perspective of a wide-area receptive field while also considering the interaction among different feature information, thereby greatly improving feature extraction without significantly increasing the number of parameters or the amount of computation.
Inventors
- WANG GAIHUA
- CAO QINGCHENG
- ZHAI QIANYU
- GAN XIN
Assignees
- Hubei University of Technology (湖北工业大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20220726
Claims (4)
- 1. A target detection method based on wide-area receptive field spatial attention, characterized by comprising the following steps:
  Step 1, preparing an image data set for testing and training;
  Step 2, constructing a target detection network based on wide-area receptive field spatial attention, wherein the network consists of a Backbone, a Neck, a Head and MSA (wide-area receptive field spatial attention) modules; the Backbone adopts a ResNet backbone network and extracts image features; the Neck connects the Backbone and the Head and fuses features; the Head detects objects, performing classification and regression of targets; and MSA is placed between the Backbone and the Neck and between the Neck and the Head;
  the MSA has the following structure: let $F \in \mathbb{R}^{C \times H \times W}$ be the input tensor, where C, H and W denote channel, height and width respectively; the height and width of F are halved by a 3×3 convolution to obtain $F' \in \mathbb{R}^{C \times H/2 \times W/2}$, which then passes through an ordinary convolution branch to obtain $F_0 \in \mathbb{R}^{1 \times H/2 \times W/2}$ and through three depthwise separable convolution branches to obtain $F_1, F_2, F_3 \in \mathbb{R}^{C/2 \times H/2 \times W/2}$; F1, F2 and F3 are then reshaped from three dimensions to two dimensions to give M1, M2 and M3, namely:
    $M_i = \mathrm{reshape}(F_i), \quad i = 1, 2, 3$  (1)
  M1, M2 and M3 have the same matrix shape [H/2*W/2, C/2], where H/2*W/2 and C/2 are the numbers of rows and columns; each of M1, M2 and M3 is multiplied by its own transpose to obtain three relation matrices N1, N2 and N3, in which each value represents the relation between a pair of pixels in the feature; N1, N2 and N3 are calculated as:
    $N_i = M_i \otimes M_i^{T}, \quad i = 1, 2, 3$  (2)
  where $\otimes$ denotes matrix multiplication and $M_1^{T}, M_2^{T}, M_3^{T}$ are the transposes of M1, M2 and M3; N1, N2 and N3 have the shape [H/2*W/2, H/2*W/2], where H/2*W/2 and H/2*W/2 are the numbers of rows and columns;
  N1, N2 and N3 are reshaped into T1, T2 and T3 with the shape [H/2*W/2, H/2, W/2], where H/2*W/2, H/2 and W/2 denote channel, height and width respectively; in order to obtain an output containing more useful global priors, […] is concatenated with T1, T2 and T3 to obtain the concatenated feature (3), which lies in $\mathbb{R}^{(H/2 \cdot W/2) \cdot 3 \times H/2 \times W/2}$, where H/2, W/2 and (H/2*W/2)*3 denote height, width and channel;
  the concatenated feature is reshaped to Y1 to generate the attention weight; the attention weight Y1 is then resized to Y2 by an interpolation algorithm so that it has the same spatial size as the input feature Input; Y2 is reshaped into a three-dimensional tensor of size [1, W, H], passed through a Sigmoid function, and multiplied with the input feature Input to obtain the final Output;
  Step 3, training the target detection network model based on wide-area receptive field spatial attention using the training set images; and
  Step 4, performing target detection on the test set images using the network model trained in step 3.
- 2. The target detection method based on wide-area receptive field spatial attention according to claim 1, wherein in step 1 all images are resized to 512×512 for multi-scale training, and data augmentation is applied to the image data set, comprising random flipping, padding, random cropping, normalization and image distortion.
- 3. The target detection method based on wide-area receptive field spatial attention according to claim 1, wherein in step 2 the ResNet backbone network outputs four feature maps [C1, C2, C3, C4] of different sizes, with strides [4, 8, 16, 32] and channel sizes [256, 512, 1024, 2048]; the Neck takes the three Backbone feature maps [C2, C3, C4], reduces their channels to 256 by 1×1 convolution, and performs feature fusion through [P1, P2, P3] in an FPN structure; P3 is then downsampled twice to obtain P4 and P5; finally, a 3×3 convolution is applied to the feature maps to remove aliasing, and five feature maps of different sizes are output, with strides [8, 16, 32, 64, 128] and a channel size of 256.
- 4. The target detection method based on wide-area receptive field spatial attention according to claim 1, wherein in step 3 the training set images are uniformly resized to 512×512, the learning rate is set to 0.001, the batch_size is set to 4, training runs for 12 epochs, and the learning rate is reduced to 1/10 of the original learning rate at the 8th and 11th epochs.
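The shape bookkeeping in claims 1 and 3 can be sanity-checked with a short sketch. It is not the patented implementation: the names `feature_map_sizes`, `msa_shapes` and `relation_matrix` are hypothetical, pure Python stands in for a deep learning framework, and feeding the 512-channel, stride-8 feature map C2 into MSA is an assumption made only to obtain concrete numbers.

```python
def feature_map_sizes(image_size, strides):
    """Side length of each square feature map for a square input image."""
    return [image_size // s for s in strides]


def msa_shapes(C, H, W):
    """Trace the tensor shapes stated in claim 1 for an input of shape (C, H, W).

    Dictionary keys are illustrative labels, not names used by the patent.
    """
    h, w = H // 2, W // 2      # the 3x3 convolution halves height and width
    n = h * w                  # number of spatial positions after halving
    return {
        "input": (C, H, W),
        "halved": (C, h, w),
        "ordinary_branch": (1, h, w),
        "separable_branches": (C // 2, h, w),   # F1, F2, F3
        "reshaped": (n, C // 2),                # M1, M2, M3
        "relation": (n, n),                     # N1, N2, N3
        "relation_reshaped": (n, h, w),         # T1, T2, T3
        "concat": (3 * n, h, w),                # channel count as stated in claim 1
    }


def relation_matrix(M):
    """Compute N = M * M^T for a matrix given as a list of rows (formula (2))."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in M]


# Backbone and Neck feature map sizes for a 512x512 input (claim 3).
print(feature_map_sizes(512, [4, 8, 16, 32]))         # [128, 64, 32, 16]
print(feature_map_sizes(512, [8, 16, 32, 64, 128]))   # [64, 32, 16, 8, 4]

# Assumed MSA input: C2 with 512 channels and a 64x64 spatial size.
for name, shape in msa_shapes(512, 64, 64).items():
    print(name, shape)

# Tiny relation matrix: 2 pixels with 2 channels each.
print(relation_matrix([[1, 2], [3, 4]]))              # [[5, 11], [11, 25]]
```

The relation matrix grows with the square of the number of spatial positions, so halving the height and width before forming it reduces its entry count by a factor of 16.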
Description
Target detection method based on wide-area receptive field spatial attention
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a target detection method based on wide-area receptive field spatial attention.
Background
With the development of deep learning, convolutional neural networks have become widely accepted and their applications increasingly common. Deep learning-based target detection algorithms use a Convolutional Neural Network (CNN) to select features automatically, and the features are then fed into a detector to classify and localize targets. In neural network learning, the more parameters a model has, the greater its expressive power and the more information it stores, but this leads to information overload. An attention mechanism focuses on the input information that is most critical to the current task, reduces the attention paid to other information, and can even filter out irrelevant information; this addresses the information overload problem and improves the efficiency and accuracy of task processing. In recent years, attention mechanisms have been widely used in deep learning tasks such as object detection, semantic segmentation and pose estimation. Attention is divided into soft and hard attention. The soft attention mechanism covers three attention domains: spatial, channel and hybrid. The spatial domain refers to the corresponding spatial transformation in the image. The channel domain concentrates information directly in the global channel. The hybrid domain combines channel attention and spatial attention. In order for the network to focus more on the area around the salient object, the present invention proposes a wide-area receptive field spatial attention module to process the extracted feature map.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a target detection method based on wide-area receptive field spatial attention, which improves the feature expression capability of the network without excessively increasing the number of model parameters. The invention mainly comprises pooling operations, reshaping operations, dilated convolution blocks, upsampling operations and the like, and greatly enhances the expression of important feature information. To achieve the above purpose, the technical scheme provided by the invention is a target detection method based on wide-area receptive field spatial attention, comprising the following steps:
Step 1, preparing an image data set for testing and training.
Step 2, constructing a target detection network based on wide-area receptive field spatial attention.
Step 3, training the target detection network model based on wide-area receptive field spatial attention using the training set images.
Step 4, performing target detection on the test set images using the network model trained in step 3.
In step 1, all images are resized to 512×512 for multi-scale training, and data augmentation is applied to the image data set, comprising random flipping, padding, random cropping, normalization and image distortion.
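In a detection data set, the resizing and random flipping named in step 1 have to be applied to the ground-truth boxes as well as to the pixels. The following is a minimal sketch of that bookkeeping only. The `(x1, y1, x2, y2)` box format, the horizontal flip direction, the flip probability of 0.5 and the function names `resize_box`, `hflip_box` and `augment_box` are all assumptions for illustration; the source does not specify them.

```python
import random


def resize_box(box, src_w, src_h, dst_w=512, dst_h=512):
    """Scale an (x1, y1, x2, y2) box when the image is resized to dst_w x dst_h."""
    sx, sy = dst_w / src_w, dst_h / src_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


def hflip_box(box, img_w=512):
    """Mirror an (x1, y1, x2, y2) box horizontally inside an image of width img_w."""
    x1, y1, x2, y2 = box
    # The left and right edges swap roles after mirroring.
    return (img_w - x2, y1, img_w - x1, y2)


def augment_box(box, src_w, src_h, flip_prob=0.5, rng=random):
    """Resize to 512x512, then flip horizontally with probability flip_prob (assumed)."""
    box = resize_box(box, src_w, src_h)
    if rng.random() < flip_prob:
        box = hflip_box(box)
    return box


resized = resize_box((100, 50, 300, 250), 1024, 1024)
print(resized)             # (50.0, 25.0, 150.0, 125.0)
print(hflip_box(resized))  # (362.0, 25.0, 462.0, 125.0)
```

Padding, random cropping, normalization and image distortion are left out here; cropping and padding would need analogous box adjustments.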
Moreover, the target detection network based on wide-area receptive field spatial attention in step 2 consists of a Backbone, a Neck, a Head and MSA (wide-area receptive field spatial attention) modules. The Backbone adopts a ResNet backbone network and extracts image features; the Neck connects the Backbone and the Head and fuses features; the Head detects objects, performing classification and regression of targets; and MSA is placed between the Backbone and the Neck and between the Neck and the Head. The ResNet backbone network outputs four feature maps [C1, C2, C3, C4] of different sizes, with strides [4, 8, 16, 32] and channel sizes [256, 512, 1024, 2048]. The Neck takes the three Backbone feature maps [C2, C3, C4], reduces their channels to 256 by 1×1 convolution, and performs feature fusion through [P1, P2, P3] in the FPN structure; P3 is then downsampled twice to obtain P4 and P5; finally, a 3×3 convolution is applied to the feature maps to remove aliasing, and five feature maps of different sizes are output, with strides [8, 16, 32, 64, 128] and a channel size of 256. The MSA has the following structure: let $F \in \mathbb{R}^{C \times H \times W}$ be the input tensor, where C, H and W denote channel, height and width respectively; $F' \in \mathbb{R}^{C \times H/2 \times W/2}$ is obtained by halving the height and width of F through a 3×3 convolution, and $F_0 \in \mathbb{R}^{1 \times H/2 \times W/2}$ is obtained by a