CN-121982366-A - Flame and smoke detection method, system and medium based on YOLOv double-backbone network

CN121982366A

Abstract

The invention provides a flame and smoke detection method, system and medium based on a YOLOv5 double-backbone network. The method comprises: acquiring and preprocessing an input image; constructing a double-backbone network model that extracts different types of image features from the preprocessed image, the model comprising a backbone network A and a backbone network B; connecting a 1×1 convolution layer to the bottom of each of the backbone network A and the backbone network B, unifying the output formats of the two backbone networks through the 1×1 convolution layers, linking the format-unified features, and inputting them into a C3 module for feature integration to obtain a first P5-layer feature; inputting the first P5-layer feature into a standard YOLOv5 Neck module for further feature-pyramid integration; and outputting the position, confidence and category of a target based on the features output by the Neck module, wherein the target comprises flame and/or smoke. The detection network can improve the detection efficiency of smoke and flame in complex scenes (such as fire monitoring).

Inventors

  • WANG WEIQIANG

Assignees

  • 福州新视智能科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-24

Claims (10)

  1. A flame and smoke detection method based on a YOLOv5 double-backbone network, characterized by comprising the following steps: S1, acquiring an input image and preprocessing the input image; S2, constructing a dual-backbone network model and extracting different types of image features from the preprocessed image, wherein the dual-backbone network model comprises a backbone network A and a backbone network B, the backbone network A is a YOLOv5-based backbone network structure in which the C3 modules are replaced by ECA_C3 modules, the ECA_C3 module comprises an efficient channel attention mechanism for enhancing semantic features, and the backbone network B is an original YOLOv5 backbone network structure for extracting basic contours and spatial features of the image; S3, connecting a 1×1 convolution layer to the bottom of each of the backbone network A and the backbone network B, unifying the output formats of the two backbone networks through the 1×1 convolution layers, linking the format-unified features, and inputting them into a C3 module for feature integration to obtain a first P5-layer feature; S4, acquiring the first P5-layer feature and the P3-layer and P4-layer features in the backbone network A, inputting them into a standard YOLOv5 Neck module for linking and further feature-pyramid integration, and outputting updated P3-layer, P4-layer and P5-layer features; and S5, outputting, through a detection head, the position, confidence and category of a target based on the features output by the Neck module, wherein the target comprises flame and/or smoke.
  2. The method of claim 1, wherein the preprocessing comprises slicing and recombining the input image through a Focus module to downsample the input image.
  3. The method of claim 1, wherein the ECA_C3 module inherits from the C3 module of the original YOLOv5 and replaces the original C3 modules of the three feature extraction layers P3, P4 and P5 in the backbone network, wherein the original default Bottleneck sub-module is replaced by an ECA-attention-enhanced ECABottleneck and the rest is consistent with the original C3 module, and wherein the ECA_C3 module specifically comprises a first branch, a second branch, a channel splicing layer and a Conv module, the first branch comprises a Conv1×1 and an ECABottleneck, the second branch comprises a Conv1×1, and the outputs of the first branch and the second branch are spliced by the channel splicing layer and then integrated by the Conv module.
  4. The method of claim 3, wherein the ECABottleneck comprises a Conv1×1 module, a Conv3×3 module, an AvgPool module, an ECA Att module, a Sigmoid module and a feature fusion layer, wherein the Conv1×1 module is used for channel dimension reduction, compressing the input channel number c1 to a preset hidden channel number; the Conv3×3 module is used for restoring the hidden channel number to the output channel number and extracting features; the AvgPool module is used for performing global average pooling on the feature map to generate a global feature vector; the ECA Att module is used for standard ECA attention calculation, in which the attention weight of each channel is calculated through a one-dimensional convolution and then restored to the same dimension as the original feature map; the Sigmoid module is used for normalizing the attention weights; and the feature fusion layer is used for multiplying the attention weights with the original feature map to perform channel weighting.
  5. The method of claim 1, wherein the step S4 comprises: upsampling the first P5-layer feature obtained in step S3 to obtain a first P4-layer feature; linking the first P4-layer feature with the P4-layer feature in the backbone network A and inputting the linked features into a C3 module for feature integration to obtain a second P4-layer feature; upsampling the second P4-layer feature to obtain a first P3-layer feature; linking the first P3-layer feature with the P3-layer feature in the backbone network A, inputting the linked features into a C3 module for feature integration to obtain a second P3-layer feature, and outputting the second P3-layer feature as the updated P3-layer feature to a subsequent detection head; downsampling the second P3-layer feature to obtain a third P4-layer feature; linking the third P4-layer feature with the second P4-layer feature, inputting the linked features into a C3 module for feature integration to obtain a fourth P4-layer feature, and outputting the fourth P4-layer feature as the updated P4-layer feature to the subsequent detection head; downsampling the fourth P4-layer feature to obtain a second P5-layer feature; and linking the second P5-layer feature with the first P5-layer feature, inputting the linked features into a C3 module for feature integration to obtain a third P5-layer feature, and outputting the third P5-layer feature as the updated P5-layer feature to the subsequent detection head.
  6. A flame and smoke detection system based on a YOLOv5 double-backbone network, characterized in that the system comprises: a data input module, used for acquiring an input image and preprocessing the input image; a feature extraction module, used for constructing a dual-backbone network model and extracting different types of image features from the preprocessed image, wherein the dual-backbone network model comprises a backbone network A and a backbone network B, the backbone network A is a YOLOv5-based backbone network structure in which the C3 modules are replaced by ECA_C3 modules, the ECA_C3 module comprises an efficient channel attention mechanism for enhancing semantic features, and the backbone network B is an original YOLOv5 backbone network structure for extracting basic contours and spatial features of the image; a feature integration module, used for connecting a 1×1 convolution layer to the bottom of each of the backbone network A and the backbone network B, unifying the output formats of the two backbone networks through the 1×1 convolution layers, linking the format-unified features, and inputting them into a C3 module for feature integration to obtain a first P5-layer feature; a Neck module, configured to obtain the first P5-layer feature and the P3-layer and P4-layer features in the backbone network A, input them into a standard YOLOv5 Neck module for linking and further feature-pyramid integration, and output updated P3-layer, P4-layer and P5-layer features; and a detection head module, used for outputting, through a detection head, the position, confidence and category of a target based on the features output by the Neck module, wherein the target comprises flame and/or smoke.
  7. The system of claim 6, wherein the ECA_C3 module inherits from the C3 module of the original YOLOv5 and replaces the original C3 modules of the three feature extraction layers P3, P4 and P5 in the backbone network, wherein the original default Bottleneck sub-module is replaced by an ECA-attention-enhanced ECABottleneck and the rest is consistent with the original C3 module, and wherein the ECA_C3 module specifically comprises a first branch, a second branch, a channel splicing layer and a Conv module, the first branch comprises a Conv1×1 and an ECABottleneck, the second branch comprises a Conv1×1, and the outputs of the first branch and the second branch are spliced by the channel splicing layer and then integrated by the Conv module.
  8. The system of claim 7, wherein the ECABottleneck comprises a Conv1×1 module, a Conv3×3 module, an AvgPool module, an ECA Att module, a Sigmoid module and a feature fusion layer, wherein the Conv1×1 module is used for channel dimension reduction, compressing the input channel number c1 to a preset hidden channel number; the Conv3×3 module is used for restoring the hidden channel number to the output channel number and extracting features; the AvgPool module is used for performing global average pooling on the feature maps to generate global feature vectors; the ECA Att module is used for standard ECA attention calculation, in which the attention weight of each channel is calculated through a one-dimensional convolution and then restored to the same dimension as the original feature maps; the Sigmoid module is used for normalizing the attention weights; and the feature fusion layer is used for multiplying the attention weights with the original feature maps to perform channel weighting.
  9. The system of claim 6, wherein the Neck module is configured to: upsample the first P5-layer feature output by the feature integration module to obtain a first P4-layer feature; link the first P4-layer feature with the P4-layer feature in the backbone network A and input the linked features into a C3 module for feature integration to obtain a second P4-layer feature; upsample the second P4-layer feature to obtain a first P3-layer feature; link the first P3-layer feature with the P3-layer feature in the backbone network A, input the linked features into a C3 module for feature integration to obtain a second P3-layer feature, and output the second P3-layer feature as the updated P3-layer feature to a subsequent detection head; downsample the second P3-layer feature to obtain a third P4-layer feature; link the third P4-layer feature with the second P4-layer feature, input the linked features into a C3 module for feature integration to obtain a fourth P4-layer feature, and output the fourth P4-layer feature as the updated P4-layer feature to the subsequent detection head; downsample the fourth P4-layer feature to obtain a second P5-layer feature; and link the second P5-layer feature with the first P5-layer feature, input the linked features into a C3 module for feature integration to obtain a third P5-layer feature, and output the third P5-layer feature as the updated P5-layer feature to the subsequent detection head.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
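As an illustration of the ECA attention path described in claims 3, 4, 7 and 8 (global average pooling, a one-dimensional convolution across channels, sigmoid normalization, then channel weighting), the following is a minimal pure-Python sketch. The uniform convolution weights and edge padding are placeholders for the learned 1-D convolution and are assumptions, not the patented implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def eca_attention(feature_maps, kernel=3):
    """Sketch of ECA channel attention.
    feature_maps: list of channels, each a 2-D list (H x W)."""
    # Global average pooling -> one scalar per channel
    gap = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
           for fm in feature_maps]
    # 1-D convolution across the channel dimension (edge padding,
    # uniform weights stand in for learned weights: an assumption)
    pad = kernel // 2
    padded = [gap[0]] * pad + gap + [gap[-1]] * pad
    w = [1.0 / kernel] * kernel
    conv = [sum(w[j] * padded[i + j] for j in range(kernel))
            for i in range(len(gap))]
    # Sigmoid normalization -> per-channel attention weight
    att = [sigmoid(c) for c in conv]
    # Feature fusion: multiply each channel by its attention weight
    return [[[a * v for v in row] for row in fm]
            for fm, a in zip(feature_maps, att)]
```

For two constant channels with means 1.0 and 2.0, the attention weights come out as sigmoid(4/3) and sigmoid(5/3), so the second (stronger) channel keeps proportionally more of its activation.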

Description

Flame and smoke detection method, system and medium based on YOLOv5 double-backbone network

Technical Field

The invention relates to the technical field of computer vision, and in particular to a flame and smoke detection method, system and medium based on a YOLOv5 double-backbone network.

Background

At present, target detection technology based on deep learning is widely applied in the field of computer vision, among which the YOLO (You Only Look Once) series of models is widely adopted due to its high detection speed and high precision. YOLOv5 is an important version of the series: it adopts a single backbone network structure, extracts image features through convolution layers and residual modules, performs multi-scale feature fusion in the Neck, and finally outputs detection results through the Head. The original YOLOv5 framework mainly comprises a Backbone, a Neck and a Head. The Backbone extracts features layer by layer using C3 modules in a CSP (Cross Stage Partial) structure to obtain feature maps (P3, P4 and P5) at different resolutions; the Neck performs top-down and bottom-up feature fusion using an FPN+PAN structure to enhance the detection of targets at different scales; and the Head predicts class probabilities and bounding-box parameters at multiple scales to realize end-to-end detection. However, because backbone features are insufficient, false detections or omissions easily occur for highly similar targets such as smoke and flame. There is also a YOLOv5+ECA improvement: to further improve detection accuracy, some researchers introduce an ECA (Efficient Channel Attention) module into the Backbone of YOLOv5 and replace the original C3 module with a C3_ECA module.
The core idea of this method is to introduce an attention mechanism in the channel dimension so that the model can more effectively and selectively attend to the features of different channels, improving its ability to distinguish small or blurred targets against complex backgrounds. However, introducing ECA increases model complexity; when the data volume is insufficient, the model tends to fit the training data well while generalizing poorly, so its effectiveness drops in actual tests. In view of this, a target detection network for highly similar targets such as smoke and flame is needed, with improved detection accuracy, efficiency and robustness.

Disclosure of Invention

The invention aims to solve the technical problem of providing a flame and smoke detection method, system and medium based on a YOLOv5 double-backbone network, which can improve the detection efficiency of smoke, flame and the like in complex scenes (such as fire monitoring).
In a first aspect, the invention provides a flame and smoke detection method based on a YOLOv5 dual-backbone network, the method comprising: S1, acquiring an input image and preprocessing the input image; S2, constructing a dual-backbone network model and extracting different types of image features from the preprocessed image, wherein the dual-backbone network model comprises a backbone network A and a backbone network B, the backbone network A is a YOLOv5-based backbone network structure in which the C3 modules are replaced by ECA_C3 modules, the ECA_C3 module comprises an efficient channel attention mechanism for enhancing semantic features, and the backbone network B is an original YOLOv5 backbone network structure for extracting basic contours and spatial features of the image; S3, connecting a 1×1 convolution layer to the bottom of each of the backbone network A and the backbone network B, unifying the output formats of the two backbone networks through the 1×1 convolution layers, linking the format-unified features, and inputting them into a C3 module for feature integration to obtain a first P5-layer feature. This single-point fusion acts only at the 1/32 level and performs no backbone fusion at the P3 and P4 levels, so as to reduce computation and feature redundancy. S4, acquiring the first P5-layer feature and the P3-layer and P4-layer features in the backbone network A, inputting them into a standard YOLOv5 Neck module for linking and further feature-pyramid integration, and outputting updated P3-layer, P4-layer and P5-layer features; and S5, outputting, through a detection head, the position, confidence and category of a target based on the features output by the Neck module, wherein the target comprises flame and/or smoke.
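The single-point P5 fusion in step S3 can be followed as shape bookkeeping: each backbone's 1/32-resolution output is remapped by a 1×1 convolution, the two results are linked channel-wise, and a C3 module integrates the result. The sketch below only tracks (channels, height, width) tuples; the 640×640 input size, the 1024-channel P5 width (YOLOv5-l) and the 512-channel unified width are assumptions, since the patent does not fix concrete dimensions.

```python
def conv1x1(shape, out_channels):
    # A 1x1 convolution remaps channels; spatial size is unchanged
    _, h, w = shape
    return (out_channels, h, w)

def link(a, b):
    # Channel-wise concatenation of same-resolution feature maps
    assert a[1:] == b[1:], "spatial sizes must match for linking"
    return (a[0] + b[0], a[1], a[2])

# P5-level (1/32 resolution) outputs of the two backbones for an
# assumed 640x640 input with YOLOv5-l channel widths
p5_a = (1024, 20, 20)  # backbone A (ECA_C3-enhanced)
p5_b = (1024, 20, 20)  # backbone B (original YOLOv5)

# Unify output formats via 1x1 convolutions, link, then integrate
# with a C3 module (modelled here as another channel remap)
ua = conv1x1(p5_a, 512)
ub = conv1x1(p5_b, 512)
linked = link(ua, ub)             # (1024, 20, 20)
first_p5 = conv1x1(linked, 1024)  # stand-in for C3 integration
```

Because the fusion touches only this 1/32 level, the P3 and P4 paths of backbone A flow into the Neck unchanged, which is where the claimed savings in computation and feature redundancy come from.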
Further, the preprocessing comprises slicing and recombining the input image through a Focus module, so as to realize downsampling of the input image.
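The Focus-style slice-and-recombine preprocessing can be sketched as follows: every second pixel is sampled at the four phase offsets and the four slices are stacked as channels, turning a (C, H, W) input into (4C, H/2, W/2) with no information loss. This is a pure-Python sketch on nested lists; the slice ordering is an implementation detail assumed here, not fixed by the patent.

```python
def focus_slice(img):
    """Focus-style slicing: sample the four pixel phase offsets of
    each channel and stack the slices as new channels, halving the
    spatial resolution.
    img: list of channels, each a 2-D list with even H and W."""
    out = []
    for dy in (0, 1):          # row phase offset
        for dx in (0, 1):      # column phase offset
            for ch in img:
                out.append([row[dx::2] for row in ch[dy::2]])
    return out
```

For a single 4×4 channel this yields four 2×2 channels, so the downsampling keeps every original pixel while quartering the per-channel area.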