CN-121971061-A - Vital sign monitoring method and system based on multi-modal sensing and spatio-temporal repair
Abstract
The invention discloses a vital sign monitoring method and system based on multi-modal sensing and spatio-temporal repair, belonging to the technical field of artificial intelligence. An original video frame sequence of a target object is acquired, and a multi-modal input tensor containing physical prior information is constructed from it. The method comprises: inputting the multi-modal input tensor into a feature extraction network; extracting spatio-temporal features and stability features through parallel processing, and fusing them to obtain basic features; generating attention weights from the motion information in the multi-modal input tensor, and applying dynamic weighted correction to the basic features to obtain anti-interference features; and performing long-range temporal dependency modeling on the anti-interference features, in which feature segments currently disturbed by motion are logically repaired and smoothed using undisturbed historical feature segments, so that the rPPG signal is predicted. The method adapts to industrial illumination environments, senses physical motion, and performs long-range temporal repair when the signal is lost.
Inventors
- LIU DONGSHENG
- SUN RENJIE
- JIAO TING
- GAN QIYUN
- ZHANG FENGFAN
- WANG XINGEN
- WU TONG
- GUO FEIPENG
- LI YING
- JIN RUI
- XIAO JUN
- SHEN ZHONGTAO
Assignees
- Zhejiang Gongshang University (浙江工商大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-03
Claims (9)
- 1. A vital sign monitoring method based on multi-modal sensing and spatio-temporal repair, characterized by comprising the following steps: acquiring a video frame sequence of a target object and constructing a multi-modal input tensor containing physical prior information; constructing a feature extraction network, extracting spatio-temporal features and stability features from the multi-modal input tensor through parallel processing, and fusing the spatio-temporal features and the stability features to obtain basic features; generating attention weights based on the motion information in the multi-modal input tensor, and applying dynamic weighted correction to the basic features using the attention weights to obtain anti-interference features; and performing long-range temporal dependency modeling on the anti-interference features, repairing the feature segments currently disturbed by motion using undisturbed feature segments to generate vital sign signals; wherein a motion-invariance adversarial loss is constructed, a discriminator is introduced to judge which motion state the features belong to, the feature extraction network acts as a generator producing features the discriminator cannot distinguish, and the generator is trained by minimizing the adversarial loss so as to decouple the features from motion information.
- 2. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 1, wherein the multi-modal input tensor is constructed by: computing normalized images of the video frame sequence to eliminate illumination baseline drift; computing a color-invariance feature map via a projection algorithm to obtain a physiological prior with specular reflection removed; computing pixel displacements between adjacent frames to obtain an optical-flow motion field encoding the object's physical motion; generating a mask from facial key points to compute a facial structure heat map capturing facial geometry; and concatenating the normalized images, the color-invariance feature map, the optical-flow motion field, and the facial structure heat map along the channel dimension to obtain the multi-modal input tensor.
- 3. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 1, wherein an optical-flow motion field is separated from the multi-modal input tensor; convolution and pooling operations are performed on the optical-flow motion field to extract a feature map characterizing motion intensity; the feature map is mapped to spatial attention weights using an activation function; and the basic features are multiplied element-wise by the spatial attention weights, establishing a physical feedback mechanism in which regions of high optical-flow intensity receive low weights, thereby suppressing the feature responses of those regions.
- 4. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 1, wherein the anti-interference features are flattened into a feature sequence; a multi-head self-attention mechanism computes a correlation matrix between current-time features and historical features; undisturbed historical context information is aggregated according to the correlation matrix to reconstruct the corrupted feature segments at the current time; and a one-dimensional vital sign signal is finally obtained through a fully connected mapping layer.
- 5. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 1, wherein a frequency-domain consistency loss is constructed by computing the Euclidean distance between the power spectral densities of the predicted signal in two different time windows, constraining the frequency distributions of different windows to be consistent so that consistent vital sign spectra are obtained under different motion states.
- 6. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 5, wherein the vital sign spectrum is a heart rate spectrum, and the vital sign signal generated by the long-range temporal dependency modeling is a remote photoplethysmography signal.
- 7. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 5, wherein a composite loss is constructed comprising the motion-invariance adversarial loss, the frequency-domain consistency loss, and a time-domain cross-entropy loss.
- 8. The vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 1, wherein the feature extraction network comprises a spatio-temporal convolution branch and a feature stabilization branch; the spatio-temporal convolution branch uses a convolutional network to extract high-dimensional spatio-temporal features from the multi-modal input tensor so as to capture temporal changes in skin color; the feature stabilization branch uses a lightweight convolution module to extract global stability features from the multi-modal input tensor; and the high-dimensional spatio-temporal features and the global stability features are added element-wise through a residual connection to obtain the basic features.
- 9. A vital sign monitoring system based on multi-modal sensing and spatio-temporal repair, comprising an input construction module, a feature extraction module, an attention correction module, and a temporal prediction module, characterized in that the vital sign monitoring method based on multi-modal sensing and spatio-temporal repair of claim 1 is adopted to sequentially perform construction of the multi-modal input tensor, extraction of the basic features, generation of the anti-interference features, and generation of the vital sign signals.
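The frequency-domain consistency loss of claim 5 can be read as the Euclidean distance between power spectral densities estimated over two time windows of the predicted signal. The sketch below is an illustrative interpretation, assuming a simple periodogram PSD estimate and adjacent non-overlapping windows; the window length, sampling rate, and function names are choices for illustration, not specified by the patent:

```python
import numpy as np

def periodogram_psd(x: np.ndarray) -> np.ndarray:
    """Simple periodogram estimate of the power spectral density."""
    x = x - x.mean()                      # remove DC so the spectrum reflects pulsation only
    spectrum = np.fft.rfft(x)
    return (np.abs(spectrum) ** 2) / len(x)

def frequency_consistency_loss(signal: np.ndarray, win: int) -> float:
    """Euclidean distance between the PSDs of two adjacent time windows,
    following the patent's frequency-domain consistency constraint."""
    w1, w2 = signal[:win], signal[win:2 * win]
    return float(np.linalg.norm(periodogram_psd(w1) - periodogram_psd(w2)))

# A stationary sinusoid (steady heart rate) yields a small loss;
# a frequency jump between the two windows yields a large loss.
t = np.arange(0, 4, 1 / 30)                       # 4 s of signal at 30 fps
steady = np.sin(2 * np.pi * 1.2 * t)              # ~72 bpm throughout
jump = np.concatenate([np.sin(2 * np.pi * 1.2 * t[:60]),
                       np.sin(2 * np.pi * 2.5 * t[:60])])
print(frequency_consistency_loss(steady, 60) < frequency_consistency_loss(jump, 60))
```

Minimizing this distance during training pushes the predicted spectrum, and hence the dominant heart-rate peak, to stay stable across motion states.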
Description
Vital sign monitoring method and system based on multi-modal sensing and spatio-temporal repair

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a vital sign monitoring method and system based on multi-modal sensing and spatio-temporal repair.

Background

Remote photoplethysmography (rPPG) can extract physiological indicators such as heart rate in a non-contact manner using an ordinary camera, and is widely applied in fields such as telemedicine. However, when applied to high-risk industrial scenes such as mines and tunnels, prior-art rPPG monitoring methods generally rely on RGB video frames alone as input and extract features with a convolutional neural network. This approach has three defects. First, industrial illumination is extremely poor and dust occlusion is common, so a single RGB input can hardly distinguish physiological color changes from ambient light and shadow changes. Second, operators often undergo severe, irregular body motion; lacking an explicit physical motion sensing mechanism, the prior art easily misjudges head shaking as a pulse signal. Finally, when the signal is momentarily lost due to violent shaking, conventional models lack long-range temporal repair capability and monitoring is interrupted.
Disclosure of Invention

To remedy the defects of the prior art and achieve vital sign monitoring with multi-modal sensing and spatio-temporal repair, the invention adopts the following technical scheme.

A vital sign monitoring method based on multi-modal sensing and spatio-temporal repair comprises the following steps: acquiring a video frame sequence of a target object and constructing a multi-modal input tensor containing physical prior information; constructing a feature extraction network, extracting spatio-temporal features and stability features from the multi-modal input tensor through parallel processing, and fusing them to obtain basic features; generating attention weights based on the motion information in the multi-modal input tensor, and applying dynamic weighted correction to the basic features using the attention weights to obtain anti-interference features; and performing long-range temporal dependency modeling on the anti-interference features, logically repairing and smoothing the feature segments currently disturbed by motion using undisturbed historical feature segments to generate vital sign signals. A motion-invariance adversarial loss is constructed: a discriminator is introduced to judge which motion state the features belong to, the feature extraction network acts as a generator producing features the discriminator cannot distinguish, and the generator is trained by minimizing the adversarial loss so as to decouple the features from motion information, forcing the model to learn physiological features fully decoupled from motion and ensuring that the vital sign (heart rate) signal does not vary with large body swings.
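The adversarial decoupling above pairs a motion-state discriminator with the feature extractor as generator. The numpy sketch below illustrates one possible concrete form of the two objectives; the patent only states that the generator's features must be indistinguishable to the discriminator, so using cross-entropy for the discriminator and a KL-divergence-to-uniform "confusion" loss for the generator is an assumption, and all function names are hypothetical:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)         # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_loss(logits: np.ndarray, motion_labels: np.ndarray) -> float:
    """Cross-entropy for classifying which motion state each feature came from."""
    probs = softmax(logits)
    picked = probs[np.arange(len(motion_labels)), motion_labels]
    return float(-np.log(picked + 1e-12).mean())

def generator_adversarial_loss(logits: np.ndarray) -> float:
    """Confusion loss (assumed form): KL divergence from the discriminator's
    output to the uniform distribution. Zero when the discriminator cannot
    tell motion states apart, i.e. the features carry no motion cue."""
    probs = softmax(logits)
    k = probs.shape[-1]
    kl = np.sum(probs * (np.log(probs + 1e-12) - np.log(1.0 / k)), axis=-1)
    return float(kl.mean())

# Confident motion classification -> large generator loss;
# uniform (confused) discriminator output -> near-zero generator loss.
confident = np.array([[8.0, -8.0, -8.0]])
confused = np.zeros((1, 3))
print(generator_adversarial_loss(confident) > generator_adversarial_loss(confused))
```

In training, the discriminator would minimize `discriminator_loss` while the feature extractor minimizes `generator_adversarial_loss`, alternating or via a gradient-reversal layer.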
Further, the multi-modal input tensor is constructed by: computing normalized images of the video frame sequence to eliminate illumination baseline drift; computing a color-invariance feature map via a projection algorithm to obtain a physiological prior with specular reflection removed; computing pixel displacements between adjacent frames to obtain an optical-flow motion field encoding the object's physical motion; generating a mask from facial key points to compute a facial structure heat map capturing facial geometry; and concatenating the normalized images, the color-invariance feature map, the optical-flow motion field, and the facial structure heat map along the channel dimension to obtain the multi-modal input tensor. In this way the method does not rely on RGB visual information alone in extreme low-light, dust-diffused environments; the physiological region is located through the geometric structure and color-invariance priors, turning the original single video frame sequence into a multi-modal input tensor.

Further, an optical-flow motion field is separated from the multi-modal input tensor; convolution and pooling operations are performed on it to extract a feature map characterizing motion intensity; the feature map is mapped to spatial attention weights valued between 0 and 1 using an activation function; and the basic features are multiplied element-wise by the spatial attention weights, establishing a physical feedback mechanism in which regions of high optical-flow intensity receive low weights, used for suppressing the feature responses of those regions.
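The optical-flow attention step can be sketched as follows. This is a minimal numpy illustration, assuming average pooling for the motion-intensity map and a sigmoid on the negated intensity so that stronger motion maps to a smaller weight; the pooling size, the choice of sigmoid, and the function names are illustrative, not taken from the patent:

```python
import numpy as np

def flow_attention_weights(flow: np.ndarray, pool: int = 2) -> np.ndarray:
    """Map an optical-flow field of shape (H, W, 2) to spatial attention
    weights in (0, 1): strong motion -> small weight, implementing the
    inverse physical feedback. Assumes H and W are divisible by pool."""
    magnitude = np.linalg.norm(flow, axis=-1)          # per-pixel motion intensity
    h, w = magnitude.shape
    pooled = magnitude.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
    up = np.repeat(np.repeat(pooled, pool, axis=0), pool, axis=1)  # back to (H, W)
    return 1.0 / (1.0 + np.exp(up))                    # sigmoid(-intensity)

def apply_attention(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Element-wise multiplication of (H, W, C) base features by the weights."""
    return features * weights[..., None]

# A strongly moving patch in the top-left corner gets down-weighted.
flow = np.zeros((4, 4, 2))
flow[0, 0] = [5.0, 0.0]                                # large displacement at one pixel
w = flow_attention_weights(flow)
print(w[0, 0] < w[3, 3])                               # prints: True
```

In the patent's pipeline a learned convolution would replace the fixed average pooling, but the suppression mechanism, low weight where the flow magnitude is large, is the same.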