CN-121999806-A - Cardiopulmonary disease risk prediction method and system based on self-supervision diffusion enhancement

Abstract

The invention discloses a cardiopulmonary disease risk prediction method and system based on self-supervised diffusion enhancement. The method first acquires and preprocesses data, then compresses the original audio data into a latent space with a deep compression autoencoder (DCVAE) and adds noise at a specified time step to obtain a noisy latent representation. A Diffusion Transformer (DiT) model then performs self-supervised feature learning: the noisy latent representation is propagated forward through the DiT model and intermediate-layer features are collected. Finally, pyramid pooling is applied to the intermediate-layer features for risk prediction. By combining signal processing with a deep learning model, the invention aims to predict disease risk efficiently and accurately, to support early diagnosis and personalized treatment, and to lay a foundation for the development of intelligent health monitoring equipment.
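The noise-adding step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a standard DDPM-style forward process with a linear variance schedule; the schedule endpoints, the number of steps, the latent shape, and the function names `make_alpha_bar` and `add_noise` are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention coefficients for an assumed linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Return the noisy latent at step t and the Gaussian noise that was mixed in."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((8, 8, 32))  # hypothetical stand-in for a DCVAE latent
zt, eps = add_noise(z0, t=250, alpha_bar=alpha_bar, rng=rng)
```

A larger `t` keeps less of the clean latent, so the choice of time step sets how much corruption the DiT model sees when its intermediate-layer features are collected.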

Inventors

GUO LIQUAN
ZHANG BOCHAO
WANG JIPING
XIONG DAXI
ZHOU WEINAN
ZHOU WEI

Assignees

中国科学院苏州生物医学工程技术研究所

Dates

Publication Date: 20260508
Application Date: 20260407

Claims (9)

1. A cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement, characterized by comprising the following steps: step 1, data acquisition and preprocessing; step 2, compressing the original audio data into a latent space through a deep compression autoencoder (DCVAE), and adding noise at a specified time step to obtain a noisy latent representation; step 3, performing self-supervised feature learning with a Diffusion Transformer (DiT) model by propagating forward through the DiT model and collecting intermediate-layer features; and step 4, applying pyramid pooling to the intermediate-layer features for risk prediction.
2. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 1, characterized in that in step 1, the acquired breath sound and heart sound signals of a patient are downsampled to 4000 Hz and denoised with a third-order Butterworth filter; the cardiopulmonary sounds are segmented into cycles by an autocorrelation technique; each segmented cycle is converted into a time-frequency representation by the short-time Fourier transform; and a Mel filter bank is further used to logarithmically compress the time-frequency map, enhancing the visibility of low-frequency features.
3. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 1, wherein in step 2, the Mel time-frequency map is feature-compressed by the DCVAE, a residual autoencoding process reduces the optimization difficulty at high compression ratios, and a decoupled high-resolution adaptation process ensures that the model generalizes from low-resolution training to high-resolution application, specifically comprising the following steps:
Step 2-1, residual autoencoding: non-parametric shortcuts based on space-to-channel operations are added to the downsampling and upsampling blocks so that the neural network modules learn residuals; H and W respectively denote the height and width of the input image, $\mathbb{R}$ denotes the set of real numbers, and the encoder performs feature extraction and spatial downsampling on the image through a series of convolution layers and downsampling blocks;
Step 2-2, decoupled high-resolution adaptation: a training process that alleviates the generalization penalty of the autoencoder with a high spatial compression ratio.
4. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 3, wherein in the downsampling step of step 2-1, for each downsampling block in the encoder, the input is $X \in \mathbb{R}^{H' \times W' \times C'}$ and the output target has size $\frac{H'}{2} \times \frac{W'}{2} \times 2C'$, where H' denotes the height, W' the width, and C' the number of channels of the input feature map of the current downsampling block; the output is obtained by the following sub-steps:
Step 2-1-1-1, space-to-channel operation: a space-to-channel operation is applied to the input feature map X, halving each spatial dimension and increasing the number of channels by a factor of 4; input: $X \in \mathbb{R}^{H' \times W' \times C'}$; output: $Y \in \mathbb{R}^{\frac{H'}{2} \times \frac{W'}{2} \times 4C'}$; the operation rearranges each 2x2 spatial region into the channel dimension, so that each element $Y_{i,j,k}$ of the output feature map Y takes the value of the input element at row 2i plus a row offset and column 2j plus a column offset, the two offsets (each 0 or 1) giving the position within the 2x2 region, in input channel $c = k \bmod C'$, where i denotes the output row index, j the output column index, and k the output channel index;
Step 2-1-1-2, splitting: the output Y is divided along the channel dimension into two feature maps $Y_1$ and $Y_2$, each of size $\frac{H'}{2} \times \frac{W'}{2} \times 2C'$;
Step 2-1-1-3, averaging operation: the two groups are averaged element by element to obtain the shortcut output $\frac{1}{2}(Y_1 + Y_2)$, of size $\frac{H'}{2} \times \frac{W'}{2} \times 2C'$; finally, the output of the downsampling block is the sum of the output of the neural network module and the shortcut output;
Step 2-1-1-4, generating the final latent representation: after all encoder stages, a preliminary latent feature map is obtained whose spatial size is reduced by the downsampling factor accumulated so far; this feature map is converted into the final high-compression-ratio latent representation Z by a space-to-channel operation that compresses the spatial dimensions of the feature map into the channel dimension, giving a latent of shape $\frac{H}{f} \times \frac{W}{f} \times c$, where f is the total spatial compression ratio and c is the number of latent channels.
5. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 3, wherein in the upsampling step of step 2-1, for each upsampling block in the decoder, the input has size $\frac{H}{2} \times \frac{W}{2} \times 2C$ and the output feature map has size $H \times W \times C$, where C denotes the number of channels of the output feature map of the current upsampling block, and the steps are as follows:
Step 2-1-2-1, channel-to-space operation: a channel-to-space operation is applied to the input feature map, doubling each spatial dimension and reducing the number of channels by a factor of 4; input: $X \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 2C}$; output: $Y \in \mathbb{R}^{H \times W \times \frac{C}{2}}$; the operation rearranges the channel dimension into the spatial dimensions, with output indices $i = 0, \dots, H-1$, $j = 0, \dots, W-1$, and $k = 0, \dots, \frac{C}{2}-1$;
Step 2-1-2-2, duplication: the output Y is copied into two identical feature maps $Y_1 = Y_2 = Y$;
Step 2-1-2-3, concatenation: the two copies are concatenated along the channel dimension to obtain the shortcut output $\mathrm{Concat}(Y_1, Y_2) \in \mathbb{R}^{H \times W \times C}$, where Concat denotes splicing two feature maps together along the channel dimension; finally, the output of the upsampling block is the sum of the output of the neural network module and the shortcut output, where the neural network module consists of a transposed-convolution or interpolation upsampling layer, a convolution layer, and a nonlinear activation function;
Step 2-1-2-4, outputting the reconstructed image: after upsampling through all decoder stages, a final convolution layer outputs the reconstructed image.
6. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 3, wherein in step 2-2, the training process comprises the following steps:
Step 2-2-1, low-resolution full training: the image reconstruction capability, namely recovering the overall structure of the input image from the compressed latent representation, is learned on a low-resolution dataset; the input is low-resolution images and the training scope is the entire autoencoder; the first-stage loss function is the weighted sum of the L1 loss and the LPIPS loss between the reconstructed image and the ground-truth image, each loss having its own weight; optimizer parameters: AdamW optimizer, learning rate, weight decay, betas = (0.9, 0.999);
Step 2-2-2, high-resolution latent adaptation: the latent space of the model is adapted to high-resolution input to alleviate the generalization penalty; the input is high- and low-resolution images, and only the middle layers are trained while the other layers are frozen; the second-stage loss function is the sum of the L1 loss and the LPIPS loss; optimizer parameters: AdamW optimizer, learning rate, weight decay, betas = (0.9, 0.999);
Step 2-2-3, low-resolution local refinement: the local detail quality of the reconstructed image is improved with an adversarial loss while avoiding the instability of high-resolution GAN training; the input is low-resolution images, and only the head layers of the decoder are trained while the other layers are frozen; the third-stage loss function is the weighted sum of the L1 loss, the LPIPS loss, and a PatchGAN-based adversarial loss, each loss having its own weight; optimizer parameters: AdamW optimizer, learning rate, betas = (0.5, 0.9).
7. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 1, wherein in step 3, feature learning is further performed by a self-supervised diffusion model to enhance the model's generalized learning of different heart sound features, specifically comprising the following steps:
Step 3-1, latent feature map input: the original latent feature map output by the downsampling encoder of the DCVAE, with spatial height H, width W, and C channels after DCVAE compression, is taken as the input to the self-supervised diffusion and is standardized to obtain the normalized feature map;
Step 3-2, conditional encoder processing: the original latent feature map and its enhanced version are fed to the encoder; the encoder first partitions the latent feature map into patches, generating a patch sequence; each patch is then mapped to an embedding space by a linear projection with a projection weight matrix, giving a patch embedding sequence of the embedding dimension; to preserve spatial information, the encoder adds a learnable position encoding, forming a position-aware embedded representation; the sequence is then processed by a stack of Transformer modules to generate a high-level feature representation; finally, a global condition representation is obtained by average pooling over the feature vectors of all patches output by the Transformer stack, and this representation guides the subsequent diffusion process as the condition information;
Step 3-3, latent diffusion process: the framework builds a forward diffusion chain by progressive noise addition; specifically, at time step t the noisy latent feature map $z_t$ is computed as the linear combination $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, where $\epsilon$ is standard Gaussian noise, $\mathcal{N}$ denotes the normal distribution, I denotes the identity matrix, "$\sim$" means "is distributed as", $z_0$ is the initial latent feature map, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the retention coefficient accumulated up to time step t, $\alpha_s$ is the retention coefficient of step s, and the retention coefficient of step t, $\alpha_t = 1 - \beta_t$, is determined by a predefined sequence of noise variances $\beta_t$; at the same time, the time step t is converted into a time-step embedding by sinusoidal embedding followed by a multi-layer perceptron, and this embedding modulates the denoising process jointly with the condition representation;
Step 3-4, conditional denoising model: the conditional denoising model is the core component of the framework, and its objective is to recover the original signal from the noisy latent feature map; the model first fuses the time-step embedding with the condition representation to obtain a joint condition vector; based on this vector, several small multi-layer perceptrons generate the adaptive parameters of each DiT block, the scale and shift parameters being obtained by splitting the output of one perceptron and the residual-connection scaling parameter being computed from the joint condition vector by another; these parameters then modulate the internal computation of the DiT blocks so that they process the noisy input adaptively according to the condition information; the goal of the noise prediction network is to minimize the difference between the predicted noise and the true noise, and the training objective is the expectation, taken as an average over initial noise-free latent feature maps, noise samples, and diffusion time steps, of the difference between the true noise and the noise predicted by the model from the noisy latent feature map at time step t, the time step t, and the condition representation;
Step 3-5, classifier-free guidance sampling: to use the condition information effectively during sampling, the framework adopts a classifier-free guidance strategy, which computes both a conditional prediction $\epsilon_{\mathrm{cond}}$ and an unconditional prediction $\epsilon_{\mathrm{uncond}}$, the latter using a null-condition embedding, and obtains the guided noise prediction by linear interpolation of the two, $\hat{\epsilon} = \epsilon_{\mathrm{uncond}} + s\,(\epsilon_{\mathrm{cond}} - \epsilon_{\mathrm{uncond}})$, where the guidance scale $s > 1$ controls the strength of the condition information;
Step 3-6, iterative denoising sampling: sampling is performed by iterative denoising; starting from pure noise, the denoised latent feature map of each time step is generated step by step, each sampling step being computed from the current latent feature map, the noise predicted by the model, the variance of the diffusion process, and freshly drawn random noise; after T iterations the denoised latent feature map at the initial time step t=0 is obtained, which retains the key semantic information of the original signal while removing noise interference;
Step 3-7, downstream task adaptation: high-level representations are first extracted from the input latent feature map with the pre-trained conditional encoder, and task-specific prediction heads are then attached so that the learned representation can be adapted efficiently to various downstream tasks.
8. The cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to claim 1, wherein in step 4, the feature map obtained from the deep autoencoding and self-supervised diffusion stages is divided into patches, and the patches are encoded by patch embedding and fed into a Pyramid Pooling Transformer (P2T), specifically comprising the following steps:
Step 4-1, patch embedding: the latent feature map, of height H' and width W', is first divided into patches of size $s \times s$ and encoded by patch embedding, so that there are $H'/s$ patches per column and $W'/s$ patches per row, and $N = (H'/s)(W'/s)$ patch tokens in total; the patch embedding is obtained with a convolution Conv whose kernel size and stride are both s, that is, the kernel size equals the patch size and the stride equals the patch size so that the patches do not overlap;
Step 4-2, designing a pyramid pooling attention mechanism;
Step 4-2-1, given the input X, the query Q is first obtained by a linear layer with learnable parameters, followed by a reshape operation that separates the attention heads, where h is the number of attention heads and d is the number of channels per attention head;
Step 4-2-2, before the pooling operation, the input X is reshaped into a two-dimensional spatial map, and pooling operations at different scales are then performed in the spatial dimensions: $P_i = \mathrm{AvgPool}_i(X)$, $i = 1, \dots, n$, where $\mathrm{AvgPool}_i$ denotes average pooling, $P_i$ denotes the generated pyramid feature maps, and n is the number of pooling layers;
Step 4-2-3, the generated multi-scale pyramid feature maps are then fed into a depthwise convolution with a 3x3 kernel for relative position encoding, where $P_i$ denotes the pyramid feature map generated by the i-th pooling layer; each position-encoded feature map is restored to the same sequence form as the input, with $N_i$ denoting the number of tokens remaining after the i-th pooling layer;
Step 4-2-4, the position-encoded pyramid feature maps are then concatenated and layer-normalized to give the concatenated sequence P; letting $M = \sum_{i=1}^{n} N_i$, P contains M tokens; P contains the context information of the input X and can serve as a strong substitute for X when computing the multi-head self-attention module;
Step 4-2-5, the key matrix K and the value matrix V are generated, with learnable parameters, from P, which fuses context information at different scales, and are reshaped together with the query matrix Q into a shape convenient for computation; the token-to-region attention matrix is then computed and used to weight and sum the value matrix V: $A = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_K}}\right) V$, where Softmax is the activation function, $\mathrm{Softmax}(Q K^{\top} / \sqrt{d_K})$ is the token-to-region attention matrix, $d_K$ is the channel size of K, A is the output of the pooled attention, h denotes the number of heads of the multi-head attention, N the number of query tokens, M the number of key/value tokens, and D the output channel dimension.
9. A cardiopulmonary disease risk prediction system based on self-supervised diffusion enhancement, which is based on the cardiopulmonary disease risk prediction method based on self-supervised diffusion enhancement according to any one of claims 1-8, characterized by comprising a data acquisition and preprocessing module, a feature compression module, a self-supervised diffusion feature learning module, and a risk prediction module, wherein the feature compression module compresses the original audio data into a latent space through a deep compression autoencoder (DCVAE) and adds noise at a specified time step to obtain a noisy latent representation; the self-supervised diffusion feature learning module performs self-supervised feature learning with a Diffusion Transformer (DiT) model, propagating forward through the DiT model and collecting intermediate-layer features; and the risk prediction module applies pyramid pooling to the intermediate-layer features for risk prediction.
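The non-parametric shortcuts of claims 4 and 5 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the feature maps use a height-width-channel layout, the channel ordering inside each 2x2 region is one possible choice, and the function names are hypothetical. The learned neural network module that is added to each shortcut is omitted.

```python
import numpy as np

def space_to_channel(x):
    """(H, W, C) -> (H/2, W/2, 4C): move each 2x2 spatial region into channels."""
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)   # axes: (i, p, j, q, c)
    x = x.transpose(0, 2, 1, 3, 4)           # axes: (i, j, p, q, c)
    return x.reshape(h // 2, w // 2, 4 * c)  # output channel k = (2p + q) * C + c

def channel_to_space(y):
    """(H, W, 4C) -> (2H, 2W, C): inverse of space_to_channel."""
    h, w, c4 = y.shape
    c = c4 // 4
    y = y.reshape(h, w, 2, 2, c)             # axes: (i, j, p, q, c)
    y = y.transpose(0, 2, 1, 3, 4)           # axes: (i, p, j, q, c)
    return y.reshape(2 * h, 2 * w, c)

def down_shortcut(x):
    """Downsampling shortcut: space-to-channel, split channels in two, average."""
    y1, y2 = np.split(space_to_channel(x), 2, axis=-1)
    return 0.5 * (y1 + y2)                   # (H/2, W/2, 2C)

def up_shortcut(x):
    """Upsampling shortcut: channel-to-space, duplicate, concatenate on channels."""
    y = channel_to_space(x)
    return np.concatenate([y, y], axis=-1)

x = np.arange(4 * 6 * 8, dtype=float).reshape(4, 6, 8)  # toy feature map
s_down = down_shortcut(x)     # shape (2, 3, 16)
s_up = up_shortcut(s_down)    # shape (4, 6, 8)
```

In a full block the shortcut would be added to the output of the learned module of the same shape, so that the module only has to learn a residual.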

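The pyramid pooling attention of claim 8 can likewise be sketched in NumPy. This is a simplified, single-head illustration: the depthwise-convolution position encoding and the multi-head reshape are left out, the pooling ratios (2, 4, 8) and all function names are assumptions, and the weights are random placeholders.

```python
import numpy as np

def avg_pool2d(x, r):
    """Non-overlapping r x r average pooling on an (H, W, D) map; r must divide H and W."""
    h, w, d = x.shape
    return x.reshape(h // r, r, w // r, r, d).mean(axis=(1, 3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def pyramid_pooling_attention(x, wq, wk, wv, ratios=(2, 4, 8)):
    """Queries come from all N tokens; keys and values come from M pooled tokens."""
    h, w, d = x.shape
    tokens = x.reshape(h * w, d)                                  # N x D
    pooled = [avg_pool2d(x, r).reshape(-1, d) for r in ratios]    # pyramid levels
    p = np.concatenate(pooled, axis=0)                            # M x D
    p = (p - p.mean(-1, keepdims=True)) / np.sqrt(p.var(-1, keepdims=True) + 1e-6)  # layer norm
    q, k, v = tokens @ wq, p @ wk, p @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))                # N x M attention matrix
    return attn @ v, attn

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))  # toy patch-embedded feature map
wq, wk, wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out, attn = pyramid_pooling_attention(x, wq, wk, wv)
```

With an 8x8 map the query side has N = 64 tokens while the pooled key/value side has only M = 16 + 4 + 1 = 21, which is where the computational saving of pooled attention comes from.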
Description

Cardiopulmonary disease risk prediction method and system based on self-supervision diffusion enhancement

Technical Field

The invention belongs to the interdisciplinary technical field of medical artificial intelligence and computer-aided diagnosis, and particularly relates to an intelligent technique for predicting the risk of pulmonary complications based on cardiopulmonary sound analysis. The field further covers learning general representations from raw cardiopulmonary sound signals by self-supervised learning, and using a diffusion model for feature enhancement to address the scarcity of medical data, so as to construct a robust and accurate deep learning risk prediction model. The aim is to predict the risk probability of pulmonary complications and to form a complete system solution spanning signal processing, model construction, and clinical decision support.

Background

Early screening and accurate risk assessment of cardiovascular and respiratory diseases are critical to improving public health. Heart and lung sounds, the most direct and noninvasive acoustic manifestation of cardiopulmonary physiological activity, contain abundant pathological information. Auscultation is a long-established and widely used diagnostic technique, but its accuracy depends heavily on the physician's clinical experience, and it has inherent limitations such as strong subjectivity and difficulty of quantification. In recent years, with the rapid development of artificial intelligence, and deep learning in particular, research on automatic classification and diagnosis based on heart and lung sound signals has progressed significantly, opening a new route to objective and efficient auxiliary diagnosis.
In existing research, self-supervised learning has been widely applied to heart and lung sound analysis to alleviate the scarcity of labeled data. For example, the Health Acoustic Representations framework (HeAR; Baur, Sebastien, et al. "HeAR--Health Acoustic Representations." arXiv preprint arXiv:2403.02522 (2024).) pre-trains a masked autoencoder on large-scale audio data, learns general health acoustic representations, and achieves excellent performance on multiple tasks such as lung sound classification. Song et al. (Song, Wenjie, Jiqing Han, and Hongwei Song. "Contrastive embeddind learning method for respiratory sound classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.) model the quasi-periodic dependency of breath sounds through self-supervised pre-training, and use a periodic consistency loss to strengthen the model's capture of the internal structure of the respiratory cycle. However, such methods focus on general representation learning, are not deeply integrated with advanced generative models (such as diffusion models), and still exploit the periodic structure only coarsely, without fine-grained feature learning and enhancement at the level of the individual cycle. On the other hand, generative models, particularly diffusion models, have shown great capability in medical imaging and data augmentation, but their application to heart and lung sound analysis is still at an early stage. Existing approaches diffuse and denoise the entire time-series signal directly, ignoring the strong inherent periodicity of cardiopulmonary sound signals.
For example, the patent "A heart sound signal abnormality detection algorithm based on mask generation distillation" (CN 117436014 A) uses a self-supervised Vision Transformer as the teacher model and trains a lightweight student model through masked generative distillation to detect heart sound abnormalities. Although it introduces self-supervision, its core is model compression and knowledge transfer, and it does not exploit the data augmentation and representation learning potential of generative models (especially diffusion models) to cope with data scarcity and periodic structure modeling. Similarly, "A multi-channel heart-lung sound abnormality recognition system and device based on low-rank tensor learning" (CN 111554319 A) performs abnormality recognition through multi-channel signal fusion and low-rank tensor learning; it focuses on signal processing and parameter optimization and does not involve in-depth analysis or use of physiological cycles. In addition, "A prediction method for heart sound signals and a heart auscultation device using the same" (CN 115089206 B) refers to mining "high-level temporal semantic information", but it still operates on the whole data segment. The avoidance or coarse granularity processing of the period segmentation makes i