CN-121661597-B - Window state detection method based on learnable multi-mode fusion gating mechanism

CN121661597B

Abstract

The invention discloses a window state detection method based on a learnable multi-modal fusion gating mechanism, in the technical field of state detection. The method comprises: acquiring time-synchronous, spatially aligned RGB video frames and IR images with a camera device; having the user mark window corner points and generating a window region Mask on the RGB video frame; cropping the RGB video frame and the IR image to obtain an RGB sub-image img_rgb and an IR sub-image img_ir; and introducing a learnable gated fusion model FGM at the dual-modality input stage to realize adaptive weighted fusion of the RGB modality features Frgb and the infrared modality features Fir. The learnable parameter alpha is optimized automatically during network training and constrained by a Sigmoid function, so the model can dynamically adjust the fusion weights according to illumination conditions and infrared signal quality. This keeps window state recognition stable and accurate in complex environments such as strong light, reflection or low illumination, and effectively improves the adaptability of the system to different environments.

Inventors

  • WANG SHIWANG
  • ZHOU DAIYI

Assignees

  • 上海源控自动化技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-02-03

Claims (9)

  1. A window state detection method based on a learnable multi-modal fusion gating mechanism, characterized by comprising the following steps: S1, acquiring a time-synchronous, spatially aligned RGB video frame and IR image with a camera device, marking window corner points by a user, and generating a window region Mask on the RGB video frame; S2, cropping the RGB video frame and the IR image to obtain an RGB sub-image Img_rgb and an IR sub-image Img_ir; S3, inputting the RGB sub-image Img_rgb into a first encoder network enc_rgb to extract the RGB modality feature representation Frgb, and inputting the IR sub-image Img_ir into a second encoder network enc_ir to extract the infrared modality feature representation Fir; S4, introducing a learnable parameter alpha, constraining it to the range 0 to 1 through a Sigmoid function, and performing weighted fusion of the feature representation Frgb and the infrared modality feature representation Fir to obtain the fused feature Ffused; S5, inputting the fused feature Ffused into a classification network and outputting the window state category.
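A minimal sketch of the weighted fusion in steps S4 and S5: the gate alpha = sigmoid(z) lies in (0, 1) and blends the two modality features element-wise. The function names and toy feature vectors below are illustrative, not from the patent.

```python
import math

def sigmoid(z):
    """Constrain a scalar to (0, 1), as the patent's Sigmoid gate does."""
    return 1.0 / (1.0 + math.exp(-z))

def fuse(frgb, fir, z):
    """Ffused = alpha * Frgb + (1 - alpha) * Fir, with alpha = sigmoid(z)."""
    alpha = sigmoid(z)
    return [alpha * r + (1.0 - alpha) * i for r, i in zip(frgb, fir)]
```

With z = 0 the gate is exactly 0.5 and both modalities contribute equally; a large positive z drives the fusion toward the RGB features, a large negative z toward the infrared features.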
  2. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 1, wherein the camera device comprises a visible-light acquisition unit RGB_cam and an infrared acquisition unit IR_cam; the two units are synchronized at the sampling-frame level by a time synchronization module TSM and aligned by a spatial calibration module SCM; the time synchronization module TSM keeps the sampling clock signals of the two modalities consistent, the spatial calibration module SCM calculates the extrinsic matrix and intrinsic parameters between the two modalities, and the spatial alignment mapping of the images is completed through a geometric transformation matrix T_cal.
  3. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 2, wherein the process of completing the spatial alignment mapping through the geometric transformation matrix T_cal comprises the following steps: S11, extracting feature point sets from images of the same calibration target in both modalities; S12, matching the feature point sets to obtain the spatial mapping relation from the RGB coordinate system to the IR coordinate system; S13, solving the extrinsic rotation matrix R_ext and translation vector T_ext by a least-squares optimization algorithm, and constructing the geometric transformation matrix T_cal by combining the respective intrinsic parameters K_rgb and K_ir; S14, applying the geometric transformation matrix T_cal to the pixel coordinates of the RGB video frame in homogeneous coordinates to obtain the projected pixel positions in the infrared coordinate system, thereby spatially aligning the RGB video frame with the IR image.
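The homogeneous mapping in S14 can be sketched as applying a 3x3 transform to a pixel. This assumes T_cal has been reduced to a planar homography between the RGB and IR image planes (the patent only states it is built from K_rgb, K_ir, R_ext and T_ext); the matrix values in the usage example are illustrative.

```python
def map_rgb_to_ir(u, v, T):
    """Map RGB pixel (u, v) through a 3x3 transform T (T_cal as a homography).

    The pixel is lifted to homogeneous coordinates (u, v, 1), multiplied by T,
    and projected back by dividing by the third component.
    """
    x = T[0][0] * u + T[0][1] * v + T[0][2]
    y = T[1][0] * u + T[1][1] * v + T[1][2]
    w = T[2][0] * u + T[2][1] * v + T[2][2]
    return (x / w, y / w)
```

For example, a pure pixel offset between the two sensors corresponds to a translation-only T; a real T_cal would also encode rotation and the intrinsic scale difference between the cameras.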
  4. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 1, wherein cropping the RGB video frames and the IR image comprises: S21, extracting the four vertex coordinates from the user-marked corner coordinate set C_set = {(x1, y1), (x2, y2), (x3, y3), (x4, y4)}; S22, calculating the central position parameter Ctr of the window region from the corner coordinates; S23, calculating the width parameter Wth and the height parameter Hth of the window region; S24, verifying the effective pixel distribution of the window region based on the window region Mask, and performing a morphological dilation operation at the Mask edge; S25, defining the cropping region centered on Ctr with width Wth and height Hth, and cropping the RGB video frame and the IR image synchronously.
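Steps S21–S23 and S25 can be sketched as computing an axis-aligned box from the four marked corners. The patent does not give the exact formulas, so this assumes Ctr is the corner centroid and Wth/Hth are the bounding-box extents; the optional `pad` argument stands in for the margin gained by the Mask dilation in S24.

```python
def crop_box(corners, pad=0):
    """Cropping region (x, y, Wth, Hth) from 4 user-marked corner points.

    Ctr is taken as the centroid of the corners; Wth and Hth are the
    bounding-box width and height, optionally padded on each side.
    """
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    cx, cy = sum(xs) / 4.0, sum(ys) / 4.0   # central position Ctr
    wth = max(xs) - min(xs) + 2 * pad       # width parameter Wth
    hth = max(ys) - min(ys) + 2 * pad       # height parameter Hth
    return (int(cx - wth / 2), int(cy - hth / 2), int(wth), int(hth))
```

The same box is then applied to both the RGB frame and the spatially aligned IR image, which is what makes the cropping "synchronous".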
  5. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 1, wherein S3 comprises: in the RGB feature extraction branch, concatenating the synchronously cropped RGB sub-image RGB_sub with the corresponding binarized Mask along the channel dimension to form a multi-channel input tensor T_rgb = Concat(RGB_sub, Mask), and inputting it into a convolutional encoder network CNN_rgb; the convolutional encoder CNN_rgb comprises a feature extraction stack of multiple stages of convolutional layers Conv, batch normalization layers BN and ReLU activation layers, extracts the low-level texture and edge features and the mid-level semantic structure features of the window region, and outputs the RGB modality feature representation Frgb.
  6. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 1, wherein S3 further comprises: in the IR feature extraction branch, concatenating the synchronously cropped IR sub-image IR_sub with the corresponding Mask along the channel dimension to form an input tensor T_ir = Concat(IR_sub, Mask), and inputting it into a lightweight encoder network CNN_ir; the lightweight encoder CNN_ir consists of depthwise convolution layers DConv and pointwise convolution layers PConv of a depthwise-separable convolution, extracts the thermal distribution features and illumination compensation features of the infrared modality, and outputs the infrared modality feature representation Fir.
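A quick way to see why the DConv + PConv structure in this claim makes CNN_ir "lightweight" is to compare parameter counts against a standard convolution. The kernel size and channel counts below are illustrative, not from the patent; biases are omitted for simplicity.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution: one k x k x c_in kernel per output channel."""
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    """Depthwise-separable: one k x k kernel per input channel (DConv),
    then a 1 x 1 convolution mixing channels (PConv)."""
    return k * k * c_in + c_in * c_out
```

For a 3x3 layer with 32 input and 64 output channels this is 18432 versus 2336 weights, roughly an 8x reduction, which is what makes the structure attractive for the IR branch.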
  7. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 1, wherein the learnable parameter alpha is output by a fusion model FGM; the fusion model FGM comprises an input layer, a plurality of hidden layers and an output layer; the input layer receives the feature vector formed by concatenating the RGB modality features Frgb and the IR modality features Fir; the hidden layers comprise a first, a second and a third hidden layer with 128, 256 and 128 neurons respectively, each applying a ReLU activation function for nonlinear mapping; a BatchNormalization layer is arranged between the second and third hidden layers to stabilize training, and a Dropout layer with a dropout rate of 0.3 follows the third hidden layer to prevent overfitting; the output layer applies a Sigmoid activation function to generate the learnable parameter alpha, which performs the weighted fusion of the feature representation Frgb and the infrared modality feature representation Fir to obtain the fused feature Ffused.
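The FGM forward pass in this claim can be sketched in pure Python: concatenated features pass through 128/256/128-neuron ReLU layers and a Sigmoid output. BatchNormalization and Dropout act only at training time and are omitted from this inference sketch; the weight initialization and input feature sizes are illustrative.

```python
import math
import random

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    # y_j = sum_i x[i] * W[j][i] + b[j]
    return [sum(xi * wi for xi, wi in zip(x, Wj)) + bj for Wj, bj in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Illustrative uniform init scaled by 1/sqrt(n_in); biases start at zero."""
    s = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def fgm_alpha(frgb, fir, rng):
    """Gate value alpha: concat(Frgb, Fir) -> 128 -> 256 -> 128 -> Sigmoid."""
    x = list(frgb) + list(fir)
    sizes = [len(x), 128, 256, 128]
    for n_in, n_out in zip(sizes, sizes[1:]):
        W, b = init_layer(n_in, n_out, rng)
        x = relu(linear(x, W, b))
    W, b = init_layer(128, 1, rng)
    z = linear(x, W, b)[0]              # output-layer linear result z
    return 1.0 / (1.0 + math.exp(-z))   # Sigmoid keeps alpha in (0, 1)
```

Because the gate is a function of the concatenated features themselves, alpha changes per input, which is how the model adapts the RGB/IR weighting to illumination and infrared signal quality.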
  8. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 7, wherein the optimization process of the learnable parameter alpha is as follows: the window state classification error is taken as the loss function Loss, and an adaptive moment estimation (Adam) optimizer performs gradient descent iterations on the weight parameters of the fusion model FGM; in each forward propagation, the learnable parameter alpha is obtained by feeding the output-layer linear transformation result z into the Sigmoid function for nonlinear mapping, with the calculation formula: alpha = 1 / (1 + e^(-z)); wherein z is the linear weighted result of the output layer of the fusion model FGM, e is the natural constant, and the value range of alpha is thereby limited between 0 and 1.
  9. The window state detection method based on the learnable multi-modal fusion gating mechanism according to claim 1, wherein the fused feature Ffused is input into a classification network Clsnet; the classification network Clsnet comprises a feature compression module and a multi-layer perceptron structure MLP, realizing feature abstraction through multi-layer nonlinear mapping; a Softmax classification layer is arranged at the output end of the MLP, and the class probability vector P is calculated as P_i = e^(z_i) / Σ_j e^(z_j), where z_i is the i-th output of the MLP; when the probability value corresponding to the open state in the class probability vector P is greater than a preset threshold THR, an open label is output for the window, otherwise a closed label is output.
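The Softmax decision at the output of Clsnet can be sketched as follows. The two-class logits, default threshold and label strings are illustrative; the patent only fixes the rule "open when P(open) > THR".

```python
import math

def softmax(logits):
    """Class probability vector P: P_i = exp(z_i) / sum_j exp(z_j).

    Subtracting the max logit first is the usual numerically stable form.
    """
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def window_label(logits, thr=0.5, open_idx=0):
    """Output the open label when P(open) exceeds threshold THR, else closed."""
    p = softmax(logits)
    return "open" if p[open_idx] > thr else "closed"
```

Raising THR above 0.5 trades recall for precision on the open state, which may be preferable when false "open" alarms are costly.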

Description

Window state detection method based on learnable multi-mode fusion gating mechanism

Technical Field

The invention relates to the technical field of state detection, in particular to a window state detection method based on a learnable multi-mode fusion gating mechanism.

Background

With the rapid development of intelligent buildings and energy-saving control systems, windows are an important factor affecting energy consumption, air circulation and safety monitoring in building environments, and window state detection has gradually become an important research direction for indoor intelligent environment perception. Traditional window state detection mainly relies on mechanical sensors or reed switches, but this approach has high installation and maintenance costs and complex wiring, and is easily affected by environmental factors, producing false alarms or missed alarms. In recent years, computer-vision-based detection has been applied to window state recognition: window images are acquired by a camera and classified by a deep learning model, realizing non-contact state judgment. However, single-modality visible-light images are prone to recognition errors in environments with strong reflection, backlight or low illumination at night, so detection accuracy is insufficient. To solve this problem, researchers have gradually introduced infrared imaging and multi-modal fusion strategies, combining the robust heat-radiation information of the infrared modality with the structural texture features of RGB images to improve recognition robustness under complex illumination conditions; reliable, non-contact detection of whether a window is open or closed is thus increasingly important for intelligent window state management and window safety management.

However, existing window detection frames easily include a large number of irrelevant regions (background and parts of other windows) when features are extracted; under scenes with adjacent windows, glass reflection or illumination changes, feature confusion is severe and classification accuracy suffers. In addition, when classification is performed directly on full-image features, the existing frames are strongly interfered with by the environmental background, making it difficult to recognize the state of a designated window; this performs poorly in multi-window scenes. In view of the above technical drawbacks, a solution is now proposed.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a window state detection method based on a learnable multi-mode fusion gating mechanism.
In order to achieve the above purpose, the invention is realized by the following technical scheme. The window state detection method based on the learnable multi-mode fusion gating mechanism comprises the following steps: S1, acquiring a time-synchronous, spatially aligned RGB video frame and IR image with a camera device, marking window corner points by a user, and generating a window region Mask on the RGB video frame; S2, cropping the RGB video frame and the IR image to obtain an RGB sub-image Img_rgb and an IR sub-image Img_ir; S3, inputting the RGB sub-image Img_rgb into a first encoder network enc_rgb to extract the RGB modality feature representation Frgb, and inputting the IR sub-image Img_ir into a second encoder network enc_ir to extract the infrared modality feature representation Fir; S4, introducing a learnable parameter alpha, constraining it to the range 0 to 1 through a Sigmoid function, and performing weighted fusion of the feature representation Frgb and the infrared modality feature representation Fir to obtain the fused feature Ffused; S5, inputting the fused feature Ffused into a classification network and outputting the window state category.

The camera device comprises a visible-light acquisition unit RGB_cam and an infrared acquisition unit IR_cam; a time synchronization module TSM keeps the sampling clock signals of the two modalities consistent, a spatial calibration module SCM calculates the extrinsic matrix and intrinsic parameters between the two modalities, and the spatial alignment mapping of the images is completed through a geometric transformation matrix T_cal. The process for completing the alignment mapping of the image space thr