
CN-121982367-A - Smoke detection method, device and medium based on multi-scale depthwise convolution


Abstract

The invention provides a smoke detection method, device and medium based on multi-scale depthwise convolution. The method comprises the steps of: obtaining an input image and preprocessing it; extracting features from the image with a YOLOv8 backbone network; constructing a feature fusion module MSGF based on multi-scale depthwise convolution and an attention mechanism, and replacing the C2f modules in the Neck of YOLOv8 with it to obtain a new Neck; performing feature fusion on the extracted features with the new Neck; and predicting the coordinates, category and confidence of smoke from the fused features with a detection head. The invention realizes weighted fusion of features at different scales and reduces computation while maintaining detection accuracy.

Inventors

  • WANG WEIQIANG

Assignees

  • 福州新视智能科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-24

Claims (9)

  1. A smoke detection method based on multi-scale depthwise convolution, characterized by comprising the following steps: S1, acquiring an input image and preprocessing the input image; S2, extracting features from the image with a YOLOv8 backbone network; S3, constructing a feature fusion module MSGF based on multi-scale depthwise convolution and an attention mechanism to replace the C2f modules in the Neck of YOLOv8, obtaining a new Neck, and performing feature fusion on the extracted features with the new Neck; and S4, predicting the coordinates, category and confidence of smoke from the fused features with a detection head.
  2. The method of claim 1, wherein the new Neck is obtained from the Neck of YOLOv8 by replacing its last three C2f modules with feature fusion modules MSGF.
  3. The method of claim 1, wherein the feature fusion module MSGF comprises a channel compression unit, a multi-scale depthwise convolution unit, an attention module, a dynamic weight fusion unit, and an output fusion and residual unit (a code sketch of this module follows the claims); the channel compression unit comprises a 1×1 convolution layer and is used for adjusting the channel number of the input feature map according to a channel compression ratio r set at initialization; the multi-scale depthwise convolution unit comprises a 3×3 convolution layer, a 5×5 convolution layer and a 7×7 convolution layer, and is used for performing three depthwise convolutions with different kernel sizes on the compressed features in parallel to obtain a first-scale feature map, a second-scale feature map and a third-scale feature map; the attention module is used for performing grouped feature deconstruction on an input feature map, extracting features through a depthwise-separable convolution, exchanging information among channels through an MLP, capturing spatial context with adaptive pooling, generating attention weights through a gating mechanism, and finally weighting the features and restoring their shape; the dynamic weight fusion unit is used for generating three scale weight maps from the feature maps produced by the attention module, adjusted by a learnable temperature parameter temp, and performing pixel-wise weighted summation of the first-, second- and third-scale feature maps obtained in the multi-scale depthwise convolution unit to obtain a fused feature map; the output fusion and residual unit is used for adjusting the channel number of the fused feature map with a 1×1 convolution and BN regularization to form the output features, and performing residual addition when the input and output sizes are consistent.
  4. The method of claim 3, wherein the attention module comprises: a grouped feature deconstruction module, used for dividing the input feature map into a plurality of subgroups by channel, each subgroup containing the same number of channels, so as to reduce the computational complexity within each subgroup and promote locally differentiated expression; a depthwise-separable convolution module, used for replacing conventional convolution with a 7×7 depthwise convolution followed by a 1×1 pointwise convolution to capture local features, and applying a nonlinear transformation through the combination of GELU, LayerNorm and two pointwise convolutions; an adaptive pooling module, used for performing adaptive average pooling in the horizontal and vertical directions respectively to obtain the global context in each direction; and a fusion gating module, used for adding the feature maps of the two directions element-wise, forming spatially sensitive attention weights through a 1×1 convolution and Sigmoid activation, multiplying the attention weights element-wise with the depthwise convolution features, and restoring the spatially adaptive feature selection to the input dimensions (see the attention sketch after the claims).
  5. A smoke detection device based on multi-scale depthwise convolution, characterized in that the device comprises: an image input module, used for acquiring an input image and preprocessing the input image; a feature extraction module, used for extracting features from the image with a YOLOv8 backbone network; a feature fusion module, used for constructing a feature fusion module MSGF based on multi-scale depthwise convolution and an attention mechanism, replacing the C2f modules in the Neck of YOLOv8 to obtain a new Neck, and performing feature fusion on the extracted features with the new Neck; and a target detection module, used for predicting the coordinates, category and confidence of smoke from the fused features with a detection head.
  6. The apparatus of claim 5, wherein the new Neck is obtained from the Neck of YOLOv8 by replacing its last three C2f modules with feature fusion modules MSGF.
  7. The apparatus of claim 5, wherein the feature fusion module MSGF comprises a channel compression unit, a multi-scale depthwise convolution unit, an attention module, a dynamic weight fusion unit, and an output fusion and residual unit; the channel compression unit comprises a 1×1 convolution layer and is used for adjusting the channel number of the input feature map according to a channel compression ratio r set at initialization; the multi-scale depthwise convolution unit comprises a 3×3 convolution layer, a 5×5 convolution layer and a 7×7 convolution layer, and is used for performing three depthwise convolutions with different kernel sizes on the compressed features in parallel to obtain a first-scale feature map, a second-scale feature map and a third-scale feature map; the attention module is used for performing grouped feature deconstruction on an input feature map, extracting features through a depthwise-separable convolution, exchanging information among channels through an MLP, capturing spatial context with adaptive pooling, generating attention weights through a gating mechanism, and finally weighting the features and restoring their shape; the dynamic weight fusion unit is used for generating three scale weight maps from the feature maps produced by the attention module, adjusted by a learnable temperature parameter temp, and performing pixel-wise weighted summation of the first-, second- and third-scale feature maps obtained in the multi-scale depthwise convolution unit to obtain a fused feature map; the output fusion and residual unit is used for adjusting the channel number of the fused feature map with a 1×1 convolution and BN regularization to form the output features, and performing residual addition when the input and output sizes are consistent.
  8. The apparatus of claim 7, wherein the attention module comprises: a grouped feature deconstruction module, used for dividing the input feature map into a plurality of subgroups by channel, each subgroup containing the same number of channels, so as to reduce the computational complexity within each subgroup and promote locally differentiated expression; a depthwise-separable convolution module, used for replacing conventional convolution with a 7×7 depthwise convolution followed by a 1×1 pointwise convolution to capture local features, and applying a nonlinear transformation through the combination of GELU, LayerNorm and two pointwise convolutions; an adaptive pooling module, used for performing adaptive average pooling in the horizontal and vertical directions respectively to obtain the global context in each direction; and a fusion gating module, used for adding the feature maps of the two directions element-wise, forming spatially sensitive attention weights through a 1×1 convolution and Sigmoid activation, multiplying the attention weights element-wise with the depthwise convolution features, and restoring the spatially adaptive feature selection to the input dimensions.
  9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 4.
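
To make claims 3 and 4 concrete, the following is a minimal PyTorch sketch of the MSGF module and its attention submodule, reconstructed only from the structure the claims describe. All class and argument names (MSGFAttention, MSGF, groups, r, temp, and the 1×1 convolution that maps attention features to three scale weights) are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSGFAttention(nn.Module):
    """Attention submodule of claim 4: grouped feature deconstruction,
    7x7 depthwise + 1x1 pointwise conv, channel MLP (GELU / LayerNorm /
    two pointwise convs), directional adaptive pooling, fusion gate."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must split evenly into groups"
        self.groups = groups
        gc = channels // groups                               # channels per subgroup
        self.dw = nn.Conv2d(gc, gc, 7, padding=3, groups=gc)  # 7x7 depthwise
        self.pw = nn.Conv2d(gc, gc, 1)                        # 1x1 pointwise
        self.norm = nn.GroupNorm(1, gc)     # LayerNorm over channels (assumption)
        self.mlp = nn.Sequential(           # two pointwise convs with GELU
            nn.Conv2d(gc, 2 * gc, 1), nn.GELU(), nn.Conv2d(2 * gc, gc, 1))
        self.gate = nn.Sequential(nn.Conv2d(gc, gc, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # fold the channel subgroups into the batch dimension
        x = x.reshape(b * self.groups, c // self.groups, h, w)
        feat = self.mlp(self.norm(self.pw(self.dw(x))))
        ctx_v = F.adaptive_avg_pool2d(feat, (h, 1))   # vertical global context
        ctx_h = F.adaptive_avg_pool2d(feat, (1, w))   # horizontal global context
        weight = self.gate(ctx_v + ctx_h)             # broadcast to an (h, w) map
        out = feat * weight                           # element-wise weighting
        return out.reshape(b, c, h, w)                # restore the input shape

class MSGF(nn.Module):
    """Feature fusion module of claim 3: channel compression, three parallel
    depthwise convolutions (3x3 / 5x5 / 7x7), attention-driven dynamic weight
    fusion with a learnable temperature, 1x1 conv + BN output, residual."""
    def __init__(self, c1: int, c2: int, r: int = 2):
        super().__init__()
        cm = c1 // r                                   # channel compression ratio r
        self.compress = nn.Conv2d(c1, cm, 1)
        self.dw3 = nn.Conv2d(cm, cm, 3, padding=1, groups=cm)
        self.dw5 = nn.Conv2d(cm, cm, 5, padding=2, groups=cm)
        self.dw7 = nn.Conv2d(cm, cm, 7, padding=3, groups=cm)
        self.attn = MSGFAttention(cm)
        self.temp = nn.Parameter(torch.ones(1))        # learnable temperature temp
        self.to_weights = nn.Conv2d(cm, 3, 1)          # three scale weight maps
        self.out = nn.Sequential(nn.Conv2d(cm, c2, 1), nn.BatchNorm2d(c2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.compress(x)
        f3, f5, f7 = self.dw3(z), self.dw5(z), self.dw7(z)  # three scales
        # temperature-scaled softmax over per-pixel scale weights
        w = torch.softmax(self.to_weights(self.attn(z)) / self.temp, dim=1)
        fused = w[:, 0:1] * f3 + w[:, 1:2] * f5 + w[:, 2:3] * f7
        y = self.out(fused)
        return x + y if y.shape == x.shape else y     # residual when sizes match
```

A quick shape check: `MSGF(256, 256)(torch.rand(1, 256, 40, 40))` returns a 1×256×40×40 tensor, so with matching input and output channels the module is a drop-in replacement for a C2f block, as the substitution of the last three C2f modules in claim 2 requires.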

Description

Smoke detection method, device and medium based on multi-scale depthwise convolution

Technical Field

The invention relates to the technical field of image processing, in particular to a smoke detection method, device and medium based on multi-scale depthwise convolution.

Background

In recent years, with the development of computer vision technology, target detection algorithms based on convolutional neural networks (Convolutional Neural Network, CNN) have been widely applied in fields such as image recognition, autonomous driving and security monitoring. The YOLO (You Only Look Once) series of algorithms has become a mainstream real-time target detection framework because of advantages such as high detection speed and simple end-to-end inference. In versions such as YOLOv5, YOLOv7 and YOLOv8, the network typically consists of three parts: Backbone, Neck and Head. The Neck module is mainly responsible for fusing and enhancing features of different scales from the Backbone, and is a key component affecting detection accuracy and speed.

In the prior art, the Neck of YOLOv8 is generally built from C2f (Cross Stage Partial with feature fusion) structures. The C2f module is formed by stacking several Bottleneck units: channels are compressed by a 1×1 convolution, features are extracted by multiple layers of 3×3 convolution, and the results are finally fused by concatenation (Concat) along the channel dimension. Following the Cross Stage Partial (CSP) idea, the input features are divided into two parts, one transmitted directly and the other concatenated with the former after feature extraction through several convolution layers. Its fusion mechanism relies mainly on channel concatenation, without explicitly modeling the importance of features at different spatial positions or channel dimensions. The module is commonly used in the Neck and Backbone of YOLOv8 to enhance feature reuse and reduce parameter redundancy; a sketch of this baseline structure is given below.
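For reference, here is a simplified PyTorch rendering of the C2f baseline just described (1×1 compression, a CSP-style split, stacked Bottleneck units, channel concatenation). The layout loosely follows the common Ultralytics implementation, but it is reconstructed from the description above rather than quoted from the patent, and activation and normalization choices are assumptions.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional shortcut, as described above."""
    def __init__(self, c: int, shortcut: bool = True):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU())
        self.add = shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(x) if self.add else self.conv(x)

class C2f(nn.Module):
    """1x1 compression -> CSP split -> n Bottlenecks -> Concat -> 1x1 fuse."""
    def __init__(self, c1: int, c2: int, n: int = 1):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = nn.Conv2d(c1, 2 * self.c, 1)         # channel compression
        self.cv2 = nn.Conv2d((2 + n) * self.c, c2, 1)   # fuse after Concat
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = list(self.cv1(x).chunk(2, dim=1))  # CSP split: one branch passes through
        for m in self.m:
            y.append(m(y[-1]))                 # each Bottleneck extends the list
        return self.cv2(torch.cat(y, dim=1))   # channel-dimension concatenation
```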
Although the C2f structure is computationally efficient, it has several shortcomings: (1) the convolution kernel size of the C2f module is fixed, so it is difficult to fully capture target features under different receptive fields, which limits detection performance on scale-sensitive targets (such as small or distant targets); (2) multi-scale feature alignment is difficult: fusing feature maps of different resolutions in a detection task often suffers from inconsistent spatial dimensions, which easily causes feature distortion or information loss; and (3) the computation and parameter cost is large: the C2f module usually stacks several Bottleneck structures containing many 3×3 convolutions, which significantly increases the parameter count and computation (FLOPs) and hinders lightweight deployment.

To further enhance feature expression capability, some recent improvements have introduced attention mechanisms: the CBAM module (Convolutional Block Attention Module) enhances feature expression by applying channel attention and spatial attention in series; the EMA module (Efficient Multi-axis Attention) enlarges the spatial receptive field of the network at low computational overhead by performing feature convolution and fusion in the horizontal and vertical directions respectively; and the Coordinate Attention (CA) module introduces coordinate information into channel attention, modeling position-sensitive features via global pooling in the horizontal and vertical directions.

However, these attention-based modules still have obvious drawbacks: the spatial attention of CBAM is based only on a single-channel two-dimensional convolution and its computational complexity is high; the EMA module must still perform multi-axis convolution and result recombination separately, so its computation path is long, its channel-wise nonlinear modeling capacity is limited, it slows model inference, and it cannot effectively integrate multi-scale feature extraction; and the Coordinate Attention (CA) module is not combined with a multi-scale convolution structure, so it is difficult to capture multi-scale information under lightweight constraints.

In summary, in the prior art, neither the C2f module adopted by the YOLOv8 algorithm nor the various attention-based improvement modules can satisfy the requirements of multi-scale feature fusion and directional attention modeling at the same time; they generally suffer from large parameter counts and computation, and the complexity of the model is difficult to control effectively.