CN-121999534-A - Rehabilitation training human body posture estimation method and system based on conditional spatio-temporal graph diffusion model

Abstract

The invention discloses a rehabilitation training human body posture estimation method and system based on a conditional spatio-temporal graph diffusion model, belonging to the field of artificial intelligence in rehabilitation medicine. The system comprises a multi-view image acquisition module, a two-dimensional posture detection module, a coarse three-dimensional reconstruction module, a standard action library module, a conditional spatio-temporal graph diffusion optimization module and an output module. The method acquires video with multiple low-cost cameras and obtains a coarse posture sequence through two-dimensional detection and three-dimensional reconstruction. It then matches the coarse sequence against a standard action library and extracts semantic features, which serve as the condition that guides a conditional spatio-temporal graph diffusion model in jointly optimizing the coarse sequence in space and time, generating a three-dimensional posture sequence that is high in quality, smooth and consistent with clinical semantics. The invention constructs a cooperative framework of a lightweight front end and an intelligent back end, markedly improves the accuracy and smoothness of human body posture estimation in rehabilitation scenarios while keeping hardware cost under control, and provides medical staff with a practical and reliable intelligent rehabilitation assessment tool.

Inventors

WANG PINGZHI
QIAO PENGYAN
NIU QING
WANG ZHENGTAO
WANG MINGYANG
HUANG XULI
ZHONG KUNHUA
CHEN YUWEN

Assignees

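Shanxi Bethune Hospital (Shanxi Academy of Medical Sciences; Tongji Shanxi Hospital of Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology; Third Hospital of Shanxi Medical University; Third Clinical Medical College of Shanxi Medical University)
Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences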

Dates

Publication Date: 20260508
Application Date: 20260130

Claims (9)

1. A rehabilitation training human body posture estimation system based on a conditional spatio-temporal graph diffusion model, characterized by comprising a multi-view image acquisition module (1), a two-dimensional posture detection module (2), a coarse three-dimensional reconstruction module (3), a standard action library module (4), a conditional spatio-temporal graph diffusion optimization module (5) and an output module (6); the multi-view image acquisition module (1) comprises four RGB cameras (11) and a synchronous controller (12) and is used for synchronously acquiring multi-view video streams of a rehabilitation training area, wherein the cameras (11) are arranged in a rectangular layout surrounding the area so as to cover the main viewing angles of the patient's movement, and the input end of the synchronous controller (12) is connected with the output ends of all the cameras (11) and is used for synchronously aligning the multi-view video streams; the input end of the two-dimensional posture detection module (2) is connected with the output end of the synchronous controller (12) of the multi-view image acquisition module (1) and is used for extracting two-dimensional human body key point coordinates and their confidences from the video streams of all viewing angles in real time; the input end of the coarse three-dimensional reconstruction module (3) is connected with the output end of the multi-view image acquisition module (1) and the output end of the two-dimensional posture detection module (2) respectively, and is used for fusing the multi-view key points, in combination with camera calibration parameters and based on multi-view geometry, to generate a coarse three-dimensional human body posture coordinate sequence; the input end of the standard action library module (4) is connected with the output end of the coarse three-dimensional reconstruction module (3) and is used for storing a standard action library of standardized rehabilitation actions and computing the standard action features that best match the input coarse three-dimensional human body posture coordinate sequence; the input end of the conditional spatio-temporal graph diffusion optimization module (5) is connected with the output end of the coarse three-dimensional reconstruction module (3) and the output end of the standard action library module (4) respectively, the module being a conditional spatio-temporal graph diffusion model used for jointly optimizing the coarse three-dimensional posture sequence in space and time under the guidance of the standard action features to generate a high-quality three-dimensional posture sequence; and the input end of the output module (6) is connected with the output end of the conditional spatio-temporal graph diffusion optimization module (5) and is used for providing three-dimensional visualization and a data export interface for the high-quality three-dimensional posture sequence.
2. The rehabilitation training human body posture estimation system based on the conditional spatio-temporal graph diffusion model according to claim 1, wherein the conditional spatio-temporal graph diffusion model is a conditional diffusion model based on a spatio-temporal graph U-Net, and comprises a spatio-temporal graph construction submodule (51), a condition encoder submodule (52), a time step encoder submodule (53), a U-Net network (54) based on a spatio-temporal graph neural network, an iterative optimization controller (55) and a sequence reconstruction submodule (56); the spatio-temporal graph construction submodule (51) is used for constructing, frame by frame, a spatio-temporal graph from the coarse three-dimensional human body posture coordinate sequence, the graph consisting of a node set in which each node is a pair of a joint and a time frame, a spatial edge set following the connections of the human skeleton, and a temporal edge set connecting the same joint in adjacent frames; the condition encoder submodule (52) is a multi-layer perceptron (MLP) whose input end is connected with the output end of the standard action library module (4), and is used for encoding the standard action features into condition vectors applicable to the different network layers; the time step encoder submodule (53) employs sinusoidal position encoding to generate a time embedding vector for the time step of the diffusion model; the U-Net network (54) based on the spatio-temporal graph neural network comprises an encoder (541), a bottleneck layer (542), a decoder (543), skip connections (544) and a linear layer (545), and, under the guidance of the action condition features, takes as input frame by frame the coarse three-dimensional human body posture coordinate sequence together with its corresponding spatio-temporal graph and time embedding vector, and predicts the noise of the coarse three-dimensional human body posture coordinate sequence; the iterative optimization controller (55) implements the reverse process of the diffusion model, its input end being connected with the output end of the U-Net network (54) based on the spatio-temporal graph neural network and its output end being connected with the time step encoder submodule (53), and is used for denoising step by step through iteration according to the noise predicted by the U-Net network (54) to obtain optimized single-frame three-dimensional human body posture coordinates; and the input end of the sequence reconstruction submodule (56) is connected with the output end of the iterative optimization controller (55), its output end is connected with the output module (6), and it is used for reassembling the single-frame three-dimensional human body posture coordinates frame by frame into sequence format to obtain the high-quality three-dimensional posture sequence.
3. The rehabilitation training human body posture estimation system based on the conditional spatio-temporal graph diffusion model according to claim 2, characterized in that the encoder (541) consists of L stacked spatio-temporal graph neural layers with downsampling, each spatio-temporal graph neural layer being connected with the condition encoder submodule (52); the bottleneck layer (542) is a multi-head self-attention mechanism whose input ends are connected with the output end of the encoder (541) and the output end of the condition encoder submodule (52) respectively; the decoder (543) is structurally symmetric to the encoder (541) and consists of L stacked spatio-temporal graph neural layers with upsampling, each spatio-temporal graph neural layer being connected with the condition encoder submodule (52); the skip connections (544) connect the output of each encoder (541) layer directly to the input of the corresponding decoder (543) layer so as to concatenate the two features; and the linear layer (545) is used for mapping the noise features to the noise of the coarse three-dimensional human body posture coordinate sequence and outputting it.
4. The rehabilitation training human body posture estimation system based on the conditional spatio-temporal graph diffusion model according to claim 2 or 3, characterized in that each spatio-temporal graph neural layer consists of a spatial graph convolution (GCN) and a temporal convolution (TCN) in series; the input of the spatio-temporal graph neural layer is combined with the output of the temporal convolution through a residual connection and then output through an adaptive normalization layer (adaLN), and the adaptive normalization layer (adaLN) is connected with the output end of the condition encoder submodule (52) and generates scaling parameters and offset parameters from the standard action features.
5. A rehabilitation training human body posture estimation method based on a conditional spatio-temporal graph diffusion model, characterized by comprising the following steps: S1, arranging four RGB cameras in a rehabilitation training area to form a surrounding observation array, achieving microsecond-level synchronization through a hardware trigger or software timestamps, and acquiring multi-view video streams of the rehabilitation training area; S2, inputting the multi-view video streams in parallel into a lightweight two-dimensional posture estimation model, and detecting the two-dimensional human body key points and their confidences in each video frame in real time; S3, based on the calibration parameters, performing confidence-weighted triangulation on the two-dimensional key points of the four viewing angles, and computing the three-dimensional joint coordinates frame by frame to form a coarse three-dimensional human body posture coordinate sequence; S4, based on a standard action library of standardized rehabilitation actions, computing the features of the standard action that best matches the coarse three-dimensional human body posture coordinate sequence; S5, building a conditional spatio-temporal graph diffusion model and, under the guidance of the standard action features, performing spatio-temporal-graph-based noise prediction and iterative diffusion denoising on the coarse three-dimensional posture sequence to generate a high-quality three-dimensional posture sequence; S6, rendering the high-quality three-dimensional posture sequence into a three-dimensional skeleton animation in real time, and outputting joint rotation data in BVH or JSON format for use by professional analysis software.
6. The method according to claim 5, wherein step S4 specifically comprises: S401, using a high-precision optical motion capture system to acquire action time-series data of the three-dimensional key point coordinates of standard rehabilitation actions performed by a professional therapist, and storing the data in the standard action library; S402, matching the coarse three-dimensional human body posture coordinate sequence against the action time-series data in the standard action library by similarity, using Principal Component Analysis (PCA) projection distance combined with Dynamic Time Warping (DTW), and identifying the best-matching standard action category for the coarse three-dimensional human body posture coordinate sequence; S403, using a convolutional network to extract the standard action features from the action time-series data of that standard action category.
7. The method according to claim 5, wherein step S5 specifically comprises a model training phase and a model inference phase: (I) the model training phase comprises: S5011, gradually adding Gaussian noise to the three-dimensional key point coordinates of a standard action to obtain the noisy three-dimensional key point coordinates at a randomly sampled time step; S5012, inputting the noisy three-dimensional key point coordinates, their corresponding spatio-temporal graph, the condition encoding of the standard action features and the encoding of the time step together into the U-Net network (54) based on the spatio-temporal graph neural network to predict the noise of the coarse three-dimensional human body posture coordinate sequence; S5013, establishing a loss function using the mean square error, and optimizing the parameters of the U-Net network (54) based on the spatio-temporal graph neural network through back propagation; (II) the model inference phase comprises: S5021, taking the coarse three-dimensional human body posture coordinate sequence, frame by frame, as the initial noisy state and constructing the spatio-temporal graph, wherein the maximum number of diffusion steps is preset; S5022, under the guidance of the standard action features, executing N reverse diffusion iterations, in which the trained U-Net network predicts the noise of the coarse three-dimensional human body posture coordinate sequence and the state is updated step by step according to the reverse denoising formula of the diffusion model; S5023, after the iterations are completed, obtaining the optimized high-quality three-dimensional posture sequence.
8. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the at least one processor to perform the rehabilitation training human body posture estimation method based on a conditional spatio-temporal graph diffusion model of any one of claims 5 to 7.
9. A computer-readable storage medium having stored thereon computer instructions for causing a processor to perform the rehabilitation training human body posture estimation method based on a conditional spatio-temporal graph diffusion model according to any one of claims 5 to 7.
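
The confidence-weighted triangulation named in step S3 of claim 5 can be illustrated with a minimal sketch. The claims do not give the formulation, so this assumes the common linear (DLT) approach, with each view's two equations scaled by the detector confidence. The helper names `make_camera`, `project` and `triangulate_weighted`, and the synthetic camera parameters, are hypothetical and used only for illustration.

```python
import numpy as np


def make_camera(theta, focal=1000.0, center=(640.0, 360.0), distance=4.0):
    """Hypothetical calibrated camera (assumption, not from the patent):
    rotate the scene by `theta` about the y axis, then place it `distance`
    units in front of the camera. Returns a 3x4 projection matrix."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    t = np.array([[0.0], [0.0], [distance]])
    K = np.array([[focal, 0.0, center[0]],
                  [0.0, focal, center[1]],
                  [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, t])


def project(P, X):
    """Project a 3D point X (length 3) to pixel coordinates with matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_weighted(proj_mats, points_2d, confidences):
    """Confidence-weighted linear triangulation of a single joint.

    proj_mats:   sequence of V projection matrices, each 3x4
    points_2d:   (V, 2) detected pixel coordinates of the joint per view
    confidences: (V,) detector confidences, used as per-view row weights
    """
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, confidences):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # The least-squares solution is the right singular vector belonging to
    # the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


if __name__ == "__main__":
    cams = [make_camera(a) for a in np.deg2rad([0.0, 90.0, 180.0, 270.0])]
    joint = np.array([0.2, -0.1, 0.3])
    pixels = np.array([project(P, joint) for P in cams])
    print(triangulate_weighted(cams, pixels, np.array([0.9, 0.8, 0.7, 0.6])))
```

Running the function once per joint and per frame yields a coarse sequence of the kind step S3 describes. A view whose key point is occluded contributes little when its confidence is low, and nothing when its confidence is zero.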

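The training and inference phases of claim 7 can be sketched as follows. The claims refer only to "the reverse denoising formula of the diffusion model", so this sketch assumes the standard DDPM formulation with a linear noise schedule. The schedule values, the function names and the `predict_noise` callable (a stand-in for the trained U-Net network (54)) are assumptions made for illustration, not details taken from the patent.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values); returns betas, alphas, alpha_bars."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def add_noise(x0, t, alpha_bars, rng):
    """Training step S5011: noise a clean pose sequence x0 to time step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


def mse_loss(eps_pred, eps):
    """Training step S5013: mean square error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))


def reverse_step(x_t, t, eps_pred, schedule, rng):
    """One DDPM reverse (denoising) update from step t to step t - 1."""
    betas, alphas, alpha_bars = schedule
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)


def refine(coarse_seq, predict_noise, cond, schedule, rng):
    """Inference steps S5021 to S5023: treat the coarse sequence as the
    initial noisy state and run the reverse iterations under condition `cond`."""
    x = coarse_seq.copy()
    for t in reversed(range(len(schedule[0]))):
        x = reverse_step(x, t, predict_noise(x, t, cond), schedule, rng)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    schedule = make_schedule(50)
    coarse = rng.standard_normal((8, 17, 3))  # frames x joints x xyz (toy data)
    # Placeholder predictor that returns zeros; a trained network would go here.
    refined = refine(coarse, lambda x, t, c: np.zeros_like(x), None, schedule, rng)
    print(refined.shape)
```

Unlike unconditional sampling, which starts from pure Gaussian noise, the inference loop here starts from the coarse sequence itself, as step S5021 describes. Whether the patent's model truncates the schedule or rescales the coarse input before denoising is not stated in the claims.
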
Description

Rehabilitation training human body posture estimation method and system based on conditional spatio-temporal graph diffusion model

Technical Field

The invention belongs to the field of artificial intelligence in rehabilitation medicine, and in particular relates to a rehabilitation training human body posture estimation method and system based on a conditional spatio-temporal graph diffusion model.

Background

In rehabilitation medicine, accurate and objective quantitative evaluation of a patient's degree of motor impairment is clinically important for drawing up personalized rehabilitation plans and monitoring rehabilitation progress. Vision-based three-dimensional human body posture estimation offers an innovative way to achieve contactless, low-cost and quantifiable rehabilitation assessment. By accurately capturing the three-dimensional joint trajectories of a patient performing rehabilitation training actions, it allows rehabilitation physicians to objectively evaluate key rehabilitation indicators such as joint range of motion and movement coordination, overcoming the subjectivity of traditional manual assessment. It also enables real-time monitoring of and feedback on the quality of rehabilitation training, provides visual guidance for the patient, and builds a long-term, continuous rehabilitation data record that supports efficacy analysis and plan optimization. The technology has broad application prospects in neurological rehabilitation (such as motor function reconstruction after stroke), orthopedic rehabilitation (such as functional recovery after joint replacement surgery) and geriatric rehabilitation (such as fall prevention training), and has become a research hot spot in intelligent rehabilitation medicine.

Three-dimensional posture estimation for rehabilitation training currently relies mainly on the following three technical routes, each of which has clear shortcomings.

First, solutions based on professional motion capture systems. Optical motion capture systems such as Vicon and OptiTrack attach reflective markers to the patient's body surface and use multiple infrared cameras to capture three-dimensional motion with millimeter-level precision. Although extremely precise, this approach has serious limitations: the hardware is expensive (hundreds of thousands to millions of yuan), a dedicated laboratory environment is required, and attaching the markers is time-consuming and can interfere with the patient's natural movement. It is therefore difficult to deploy in routine clinical rehabilitation.

Second, lightweight solutions based on monocular vision. Monocular two-dimensional posture estimation models such as MediaPipe Pose and OpenPose, combined with a depth estimation algorithm, can extract human body key points in real time using only an ordinary RGB camera. This approach is easy to deploy and inexpensive, but it has inherent defects: monocular vision suffers from depth ambiguity, which limits three-dimensional reconstruction accuracy; it is sensitive to occlusion, so key points are easily lost or jump; it lacks temporal consistency constraints, so the output sequence jitters noticeably; and it cannot make full use of the spatio-temporal structure of rehabilitation actions or of domain knowledge to guide optimization.

Third, solutions based on conventional multi-view geometric reconstruction. Multiple RGB cameras synchronously capture the scene from different viewing angles, and three-dimensional postures are reconstructed by triangulation. Although accuracy improves to some extent, this approach still faces several challenges: it places extremely high demands on camera calibration accuracy, and calibration errors propagate directly into the reconstruction results; it is sensitive to occlusion and noise, so the reconstruction results can be ambiguous; traditional optimization methods (such as bundle adjustment) are computationally complex and struggle to meet real-time requirements; and rehabilitation medical knowledge (such as standard action libraries and physiological motion constraints) is not systematically integrated into the optimization process.

Taken together, the prior art faces a core contradiction: high-precision systems are costly and hard to popularize, while low-cost systems are not accurate enough to meet clinical needs. The specific technical challenges include: first, how to achieve reconstruction accuracy close to that of a professional motion capture system at limited hardware cost; second, how to effectively handle the noise types specific to rehabilitation (such as patient tremor and non-standard movement patterns) in a rehabilitation scen