
EP-4734833-A1 - ENHANCED VISION-BASED VITALS MONITORING USING MULTI-MODAL SOFT LABELS


Abstract

A method performed by at least one processor includes obtaining an image of a subject; preprocessing the image of the subject; inputting the preprocessed image into a machine learning model trained in accordance with a first frequency distribution corresponding to a first ground truth obtained from one or more sensors performing a vital measurement on one or more test subjects; and obtaining, from the machine learning model, an estimate of a signal corresponding to the vital measurement of the subject.

Inventors

  • Vatanparvar, Korosh
  • Speth, Jeremy
  • Rashid, Nafiul
  • Zhu, Li
  • Gwak, Migyeong
  • Kuang, Jilong
  • Gao, Jun

Assignees

  • Samsung Electronics Co., Ltd.

Dates

Publication Date
2026-05-06
Application Date
2024-09-05

Claims (15)

  1. [Claim 1] A method performed by at least one processor, the method comprising: obtaining an image of a subject; preprocessing the image of the subject; inputting the preprocessed image into a machine learning model trained in accordance with a first frequency distribution corresponding to a first ground truth obtained from one or more sensors performing a vital measurement on one or more test subjects; and obtaining, from the machine learning model, an estimate of a signal corresponding to the vital measurement of the subject.
  2. [Claim 2] The method according to claim 1, further comprising: obtaining, from one or more sensors performing the vital measurement on the subject, a second frequency distribution corresponding to a second ground truth of the vital measurement; converting the estimate of the signal corresponding to the vital measurement of the subject to a frequency domain signal; determining an error between the frequency domain signal and the second frequency distribution; and updating the machine learning model based on the determined error.
  3. [Claim 3] The method according to claim 1, wherein the first frequency distribution is a Gaussian distribution centered at the first ground truth and having a first standard deviation that is a function of N frames sampled at f frames per second, wherein N and f are positive integers.
  4. [Claim 4] The method according to claim 2, wherein the first frequency distribution is a Gaussian distribution centered at the first ground truth and having a second standard deviation that is a function of N frames sampled at f frames per second, wherein N and f are positive integers.
  5. [Claim 5] The method according to claim 3, wherein the determining the error comprises: determining a mean squared error (MSE) loss between the frequency domain signal and the second frequency distribution.
  6. [Claim 6] The method according to claim 5, wherein the determining the error further comprises: determining a signal-to-noise ratio (SNR) based on a proportion of a power centered at a peak frequency of the frequency domain signal compared to a sum of power between a lower cutoff frequency and an upper cutoff frequency of the frequency domain signal.
  7. [Claim 7] The method according to claim 6, wherein the determining the error further comprises: determining an irrelevant power ratio (IPR) based on a proportion of a power between the lower cutoff frequency and the upper cutoff frequency of the frequency domain signal compared to a total power of the frequency domain signal.
  8. [Claim 8] The method of claim 1, wherein the preprocessing the image of the subject comprises: detecting a region of interest of the image of the subject; and resizing the region of interest of the image of the subject.
  9. [Claim 9] The method of claim 8, wherein the region of interest is at least a portion of a face of the subject.
  10. [Claim 10] The method of claim 1, wherein the vital measurement is one of a pulse rate, a blood pressure, or an oxygen saturation level.
  11. [Claim 11] The method of claim 1, wherein the machine learning model is a three-dimensional (3D) Convolutional Neural Network (CNN).
  12. [Claim 12] An apparatus comprising: a memory; processing circuitry coupled to the memory, the processing circuitry configured to: obtain an image of a subject, preprocess the image of the subject, input the preprocessed image into a machine learning model trained in accordance with a first frequency distribution corresponding to a first ground truth obtained from one or more sensors performing a vital measurement on one or more test subjects, and obtain, from the machine learning model, an estimate of a signal corresponding to the vital measurement of the subject.
  13. [Claim 13] The apparatus according to claim 12, wherein the processing circuitry is further configured to: obtain, from one or more sensors performing the vital measurement on the subject, a second frequency distribution corresponding to a second ground truth of the vital measurement, convert the estimate of the signal corresponding to the vital measurement of the subject to a frequency domain signal, determine an error between the frequency domain signal and the second frequency distribution, and update the machine learning model based on the determined error.
  14. [Claim 14] The apparatus according to claim 12, wherein the first frequency distribution is a Gaussian distribution centered at the first ground truth and having a first standard deviation that is a function of N frames sampled at f frames per second, wherein N and f are positive integers.
  15. [Claim 15] The apparatus according to claim 13, wherein the first frequency distribution is a Gaussian distribution centered at the first ground truth and having a second standard deviation that is a function of N frames sampled at f frames per second, wherein N and f are positive integers.
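Claims 3, 4, 14, and 15 recite a Gaussian frequency distribution centered at the ground-truth rate, with a standard deviation that is a function of N frames sampled at f frames per second. The sketch below shows one plausible construction of such a soft label over the FFT bins of an N-frame window; the default sigma of one frequency bin (f/N Hz) is an illustrative assumption, not the claimed formula.

```python
import numpy as np

def gaussian_soft_label(gt_hz, n_frames, fps, sigma_hz=None):
    """Build a soft frequency-domain target: a Gaussian centered at the
    ground-truth rate gt_hz, discretized over the rFFT bins of an
    n_frames-long window sampled at fps frames per second.

    sigma_hz defaults to one frequency bin (fps / n_frames) -- an
    assumed, illustrative choice of the claimed 'function of N and f'."""
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / fps)  # bin centers in Hz
    if sigma_hz is None:
        sigma_hz = fps / n_frames                   # one-bin resolution
    label = np.exp(-0.5 * ((freqs - gt_hz) / sigma_hz) ** 2)
    return freqs, label / label.sum()               # normalize to a distribution
```

For example, a 60 bpm pulse (1.0 Hz) with a 300-frame window at 30 fps yields a distribution peaked at the 1.0 Hz bin with 0.1 Hz resolution.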

Description

Title of Invention: ENHANCED VISION-BASED VITALS MONITORING USING MULTI-MODAL SOFT LABELS

Technical Field

[0001] This disclosure is directed to utilizing soft labels for enhanced vision-based vitals monitoring.

Background Art

[0002] Heart rate (HR) and heart rate variability (HRV) are valuable vitals, biomarkers, and physiological parameters for estimating a person's cardiac function. Most devices that measure the cardiac pulse require contact with the subject's body, such as fingertip oximeters (PPG) or electrocardiogram (ECG) patches. Additionally, measurement devices may be prohibitively expensive, resulting in measurements only being taken during visits to medical establishments.

[0003] A vision-based method for non-contact measurement of a blood volume pulse from a camera has been introduced. This vision-based method is known as remote photoplethysmography (rPPG). rPPG enables low-cost and ubiquitous health monitoring from low-cost cameras, which are readily available in mobile phones, computers, tablets, etc. The rPPG signal may be analyzed to extract multiple physiological parameters including, but not limited to, HR, HRV, RR (respiration rate), SpO2 (oxygen saturation), or BP (blood pressure). While the beneficial impacts are clear, implementing accurate rPPG systems is difficult in practice.

[0004] rPPG allows for non-contact measurement of the blood volume pulse from commodity cameras. The vast majority of research has evaluated the robustness of rPPG systems via the frequency (e.g., pulse rate in beats per minute (bpm)) over short time windows. As the systems improve, it is beneficial to support more challenging measurement configurations.

[0005] Although camera-based vitals measurements have improved over recent years, traditional rPPG methods follow step-by-step transformations from a single input video to a time signal representing the pulse (rPPG).
Popular methods include color transformations, blind source separation, and signal processing. These methods do not always handle noise factors (e.g., motion) in an environment. To create the most robust rPPG algorithms, researchers have begun exploring data-driven methods such as deep learning using convolutional neural networks (CNNs) or transformers to predict an rPPG time signal from only the video. The neural networks are trained with supervised learning frameworks, where a PPG or ECG ground truth signal is used as the target label during backpropagation.

[0006] However, for deep learning systems to be trustworthy and generalizable, the current solutions require large training datasets with a diverse set of skin tones, lighting, camera sensors, motion, and coverage of the physiological ranges. Collecting such diverse data is challenging due to the need for simultaneous capture of a physiological ground truth. Many modern deep learning frameworks for rPPG even require a time-synchronized PPG waveform.

Summary of Invention

[0007] According to an aspect of the disclosure, a method performed by at least one processor comprises obtaining an image of a subject; preprocessing the image of the subject; inputting the preprocessed image into a machine learning model trained in accordance with a first frequency distribution corresponding to a first ground truth obtained from one or more sensors performing a vital measurement on one or more test subjects; and obtaining, from the machine learning model, an estimate of a signal corresponding to the vital measurement of the subject.
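The frequency-domain training signals described in claims 2 and 5-7 (an MSE loss against the soft label, a signal-to-noise ratio around the spectral peak, and an irrelevant power ratio over a pass band) can be sketched as follows. The 0.66-3.0 Hz cutoffs (roughly 40-180 bpm) and the ±0.1 Hz peak window are illustrative assumptions, not values taken from the claims.

```python
import numpy as np

def spectral_losses(rppg, soft_label, fps, low_hz=0.66, high_hz=3.0):
    """Sketch of the frequency-domain error terms of claims 5-7.

    - mse: mean squared error between the normalized power spectrum of the
      estimated rPPG signal and the Gaussian soft label (claim 5).
    - snr: power within +/-0.1 Hz of the in-band peak frequency, relative
      to the total in-band power (claim 6; window width is an assumption).
    - ipr: in-band power relative to the total spectral power (claim 7).
    """
    spec = np.abs(np.fft.rfft(rppg)) ** 2
    spec = spec / spec.sum()                          # normalized power spectrum
    freqs = np.fft.rfftfreq(len(rppg), d=1.0 / fps)

    mse = np.mean((spec - soft_label) ** 2)

    band = (freqs >= low_hz) & (freqs <= high_hz)     # physiological pass band
    peak_hz = freqs[band][np.argmax(spec[band])]      # dominant in-band frequency
    near_peak = band & (np.abs(freqs - peak_hz) <= 0.1)
    snr = spec[near_peak].sum() / spec[band].sum()
    ipr = spec[band].sum() / spec.sum()
    return mse, snr, ipr
```

During training, the MSE term would drive the model toward the soft label, while SNR and IPR reward spectra whose power concentrates at a single physiologically plausible frequency.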
[0008] According to an aspect of the disclosure, an apparatus comprises: a memory; processing circuitry coupled to the memory, the processing circuitry configured to: obtain an image of a subject, preprocess the image of the subject, input the preprocessed image into a machine learning model trained in accordance with a first frequency distribution corresponding to a first ground truth obtained from one or more sensors performing a vital measurement on one or more test subjects, and obtain, from the machine learning model, an estimate of a signal corresponding to the vital measurement of the subject. According to an aspect of the disclosure, a non-transitory computer-readable medium has instructions stored therein which, when executed by a processor, cause the processor to execute a method comprising: obtaining an image of a subject; preprocessing the image of the subject; inputting the preprocessed image into a machine learning model trained in accordance with a first frequency distribution corresponding to a first ground truth obtained from one or more sensors performing a vital measurement on one or more test subjects; and obtaining, from the machine learning model, an estimate of a signal corresponding to the vital measurement of the subject.

Brief Description of Drawings

[0009] Further features, the nature, and various advantages of the disclosed subject matter will be more apparent from the following detailed description and the accompanying drawings in which:
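The preprocessing recited in claims 8 and 9 (detecting a face region of interest and resizing it) can be sketched as below. The fixed ROI coordinates, the 64x64 output size, and the nearest-neighbor resize are assumptions chosen to keep the example dependency-free; a real system would use a face detector and bilinear resizing.

```python
import numpy as np

def preprocess_frames(frames, roi, size=(64, 64)):
    """Sketch of claim 8's preprocessing: crop each frame to a region of
    interest (e.g., the subject's face per claim 9) and resize it.

    frames: (T, H, W, C) video clip; roi: (y0, y1, x0, x1) crop box,
    here assumed to come from an upstream face detector."""
    y0, y1, x0, x1 = roi
    out = []
    for frame in frames:
        crop = frame[y0:y1, x0:x1]
        # Nearest-neighbor resize via index sampling (illustrative only).
        ys = np.linspace(0, crop.shape[0] - 1, size[0]).astype(int)
        xs = np.linspace(0, crop.shape[1] - 1, size[1]).astype(int)
        out.append(crop[np.ix_(ys, xs)])
    return np.stack(out)  # (T, size[0], size[1], C) clip for a 3D CNN
```

The stacked (T, H, W, C) clip matches the input shape expected by the 3D CNN of claim 11, which convolves over time as well as space.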