CN-121999098-A - Three-dimensional digital human-object interactive motion synthesis method and system based on diffusion model
Abstract
The invention belongs to the fields of computer vision, computer graphics, and robotics, and relates to a diffusion-model-based method and system for synthesizing interactive motion between a three-dimensional digital human and an object. The method comprises: obtaining three types of feature vectors, namely the three-dimensional mesh shape of the object, the frame-by-frame motion sequence of the object, and the body-shape feature vector of the digital human; performing conditional feature encoding on the three types of feature vectors to obtain a condition vector; based on a diffusion model, iteratively predicting the noise-free estimate of the current time step with a denoiser; optimizing the generated interactive motion sequence of the three-dimensional digital human and the object according to a generation guidance strategy; obtaining the three-dimensional mesh of the human body from the interactive motion sequence of the three-dimensional digital human and the object; and obtaining the interactive motion synthesis result of the three-dimensional digital human and the object through three-dimensional modeling software. The invention can synthesize highly realistic interactive motion sequences of a three-dimensional digital human and an object, in which the synthesized digital human exhibits torso and hand motions simultaneously.
Inventors
- DENG XIAOMING
- ZHANG YONGHAO
- HE QIANG
- MA CUIXIA
- ZHANG YINDA
- WANG HONGAN
Assignees
- Institute of Software, Chinese Academy of Sciences
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-12
- Priority Date: 2024-12-13
Claims (10)
- 1. A three-dimensional digital human-object interactive motion synthesis method based on a diffusion model, characterized by comprising the following steps: obtaining three types of feature vectors, namely the three-dimensional mesh shape of an object, a frame-by-frame motion sequence of the object, and the body-shape feature vector of a digital human; performing conditional feature encoding on the three types of feature vectors to obtain a condition vector; based on a diffusion model, iteratively predicting the noise-free estimate of the current time step with a denoiser using the condition vector, and optimizing the generated interactive motion sequence of the three-dimensional digital human and the object according to a generation guidance strategy; and, based on the interactive motion sequence of the three-dimensional digital human and the object output by the diffusion model, obtaining the three-dimensional mesh of the human body through the inference process of a human parametric model, and obtaining the interactive motion synthesis result of the three-dimensional digital human and the object through three-dimensional modeling software.
- 2. The method of claim 1, wherein the conditional feature encoding of the three types of feature vectors comprises: performing two-level position encoding on the three types of feature vectors to obtain a primary condition vector, wherein the two-level position encoding comprises feature-level position encoding and sequence-frame-level position encoding; and inputting the primary condition vector into a condition encoder to obtain the condition vector.
- 3. The method of claim 1, wherein the loss function employed during the training phase of the diffusion model is a weighted sum of the diffusion model loss, a contact-aware reconstruction loss, and a contact-aware interaction loss, where the terms are weighted by balance coefficients.
- 4. The method according to claim 3, wherein the contact-aware reconstruction loss and the contact-aware interaction loss are computed as follows. First, the contact-aware reconstruction loss is computed: the time step of the diffusion process is uniformly sampled between 1 and N and the mathematical expectation is taken; the denoiser receives the noisy data at the sampled step together with the condition vector generated by the condition encoder, and outputs denoised clean data. After the denoised clean human pose data are generated, all joints of the left and right hands are inferred through SMPL-X, and the generated relative wrist positions are added to the centroid position of the object to obtain the wrist positions of the left and right hands; the reconstruction loss is then formed using the contact labels of the left and right hands and a balance coefficient. The contact-aware interaction loss is then computed from the hand joints inferred by SMPL-X and the ground-truth hand joints, traversing the left and right hands, with an exponentially decaying distance-aware weight introduced for each hand joint.
- 5. The method of claim 1, wherein the generation guidance strategy comprises grip stability guidance, hand contact guidance, and foot-floor penetration guidance.
- 6. The method of claim 5, wherein the grip stability guidance, the hand contact guidance, and the foot-floor penetration guidance are computed as follows: 1) Grip stability guidance: the grip stability guide function compares the wrist position of the human body, obtained from the denoised clean human pose data through the SMPL-X inference process, with the corrected wrist position in the world coordinate system; the optimized upper-body parameters are obtained from the initial upper-body parameters by gradient descent on the noisy upper-body parameters of step n, with the corresponding gradient and a learning rate controlling the step size of each parameter update; the guidance uses the human parametric representation with all noise removed and the contact labels of frames 1 to T. 2) Hand contact guidance: a hyperparameter balances contact and penetration; the optimized hand parameters are obtained from the initial hand parameters by gradient descent on the noisy hand parameters of step n, with the corresponding gradient and a learning rate controlling the step size of each parameter update, where the hand-object contact guide function is formed from the penetration distance and the contact distance. 3) Foot-floor penetration guidance: the foot-floor penetration guide function yields the optimized whole-body parameters from the initial whole-body parameters by gradient descent on the noisy whole-body parameters of step n, with the corresponding gradient and a learning rate controlling the step size of each parameter update, evaluated over all vertices of the human body.
- 7. The method according to claim 1, wherein the three-dimensional digital human-object interactive motion synthesis result is obtained through three-dimensional modeling software and can be viewed in three-dimensional form or rendered as a two-dimensional video.
- 8. A diffusion-model-based three-dimensional digital human-object interactive motion synthesis system, comprising: a feature vector acquisition module for obtaining three types of feature vectors, namely the three-dimensional mesh shape of an object, a frame-by-frame motion sequence of the object, and the body-shape feature vector of a digital human; a conditional feature encoding module for performing conditional feature encoding on the three types of feature vectors to obtain a condition vector; a diffusion model generation module for iteratively predicting the noise-free estimate of the current time step with a denoiser based on a diffusion model, and optimizing the generated interactive motion sequence of the three-dimensional digital human and the object according to a generation guidance strategy; and an interactive motion result acquisition module for obtaining the three-dimensional mesh of the human body through the inference process of a human parametric model, based on the interactive motion sequence of the three-dimensional digital human and the object output by the diffusion model, and obtaining the three-dimensional digital human-object interactive motion synthesis result through three-dimensional modeling software.
- 9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
- 10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.
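The iterative denoising of claim 1, in which the denoiser predicts a noise-free (clean) estimate of the motion sequence at each time step conditioned on the condition vector, can be sketched as below. This is a minimal, hypothetical illustration using a standard DDPM-style posterior and a toy denoiser; the patent does not specify the noise schedule, sampler, or network architecture.

```python
import numpy as np

def ddpm_sample(denoiser, cond, shape, n_steps=50, seed=0):
    """Iterative sampling: at each step the denoiser predicts a clean
    estimate x0_hat of the motion sequence, conditioned on the condition
    vector, and the sample is re-noised toward the previous step.
    The linear beta schedule is an assumption, not from the patent."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                  # start from pure noise
    for n in reversed(range(n_steps)):
        x0_hat = denoiser(x, n, cond)               # noise-free estimate
        if n > 0:
            # posterior mean of q(x_{n-1} | x_n, x0_hat)
            c0 = np.sqrt(abar[n - 1]) * betas[n] / (1 - abar[n])
            cn = np.sqrt(alphas[n]) * (1 - abar[n - 1]) / (1 - abar[n])
            var = betas[n] * (1 - abar[n - 1]) / (1 - abar[n])
            x = c0 * x0_hat + cn * x + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = x0_hat                              # final clean output
    return x

# Toy denoiser that always predicts the conditioning target:
# the chain should converge to it exactly at the last step.
target = np.full((4, 6), 0.5)   # 4 frames, 6 pose dims (hypothetical sizes)
denoiser = lambda x, n, cond: cond
sample = ddpm_sample(denoiser, target, target.shape)
print(np.allclose(sample, target))  # True
```

In the patent's method, the real denoiser would be a trained network and the generation guidance of claims 5-6 would adjust the intermediate estimates between steps.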
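The two-level position encoding of claim 2 (feature-level plus sequence-frame-level) can be sketched as follows. This is a hypothetical NumPy illustration assuming standard sinusoidal encodings; the patent does not specify the encoding functions, dimensions, or the condition encoder that consumes the primary condition vector.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Standard sinusoidal positional encoding (an assumption; the
    patent does not fix the encoding function)."""
    pos = np.asarray(positions, dtype=np.float64)[:, None]   # (P, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]       # (1, dim/2)
    angles = pos / np.power(10000.0, 2.0 * i / dim)
    enc = np.zeros((pos.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def two_level_position_encoding(features):
    """features: (num_frames, num_feature_types, dim), stacking the
    object-shape, object-motion, and body-shape feature vectors.
    Adds a feature-level and a sequence-frame-level encoding to form
    the primary condition vector."""
    T, F, D = features.shape
    feat_pe = sinusoidal_encoding(np.arange(F), D)    # which feature type
    frame_pe = sinusoidal_encoding(np.arange(T), D)   # which frame
    return features + feat_pe[None, :, :] + frame_pe[:, None, :]

# Toy usage: 8 frames, 3 feature types, 16-dim features.
x = np.zeros((8, 3, 16))
primary = two_level_position_encoding(x)
print(primary.shape)  # (8, 3, 16)
```

The primary condition vector produced here would then be passed through the condition encoder (e.g. a transformer encoder, not specified by the patent) to obtain the final condition vector.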
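The gradient-descent guidance of claim 6 (grip stability, hand contact, foot-floor penetration) can be illustrated with a toy foot-floor penetration guide. Everything here is a simplified stand-in: the "body model" is a fixed set of foot vertices under a global translation rather than SMPL-X, and the gradient is estimated numerically instead of by backpropagation.

```python
import numpy as np

def foot_floor_penetration(vertices):
    """Penetration objective: penalize vertices below the floor (z < 0).
    A simplified stand-in for the patent's penetration guide function."""
    return np.sum(np.clip(-vertices[:, 2], 0.0, None) ** 2)

def guided_update(params, forward, objective, lr=0.1, eps=1e-4):
    """One generation-guidance step: estimate the gradient of the
    objective w.r.t. the noisy parameters by finite differences and
    take a gradient-descent step, with the learning rate controlling
    the step size, as in claim 6's guides."""
    grad = np.zeros_like(params)
    base = objective(forward(params))
    for i in range(params.size):
        p = params.copy()
        p.flat[i] += eps
        grad.flat[i] = (objective(forward(p)) - base) / eps
    return params - lr * grad

# Toy "body model": parameters are a global translation applied to two
# foot vertices that start below the floor (hypothetical stand-in for
# the SMPL-X inference process).
foot_verts = np.array([[0.0, 0.0, -0.05], [0.1, 0.0, -0.02]])
forward = lambda t: foot_verts + t[None, :]
t = np.zeros(3)
for _ in range(50):
    t = guided_update(t, forward, foot_floor_penetration)
print(foot_floor_penetration(forward(t)) < 1e-6)  # True: feet lifted above floor
```

The grip stability and hand contact guides of claim 6 follow the same pattern, with objectives built from wrist-position error and from contact and penetration distances respectively, applied to the upper-body and hand parameters.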
Description
Three-dimensional digital human-object interactive motion synthesis method and system based on diffusion model

Technical Field

The invention relates to the fields of computer vision, computer graphics, and robotics, and in particular to a diffusion-model-based method and system for synthesizing interactive motion between a three-dimensional digital human and an object.

Background

Virtual three-dimensional digital humans have entered people's daily lives, and studying how to improve their realism is important. Research on synthesizing interactive motion between a digital human and an object is a major breakthrough in the digital human field. However, due to the uncertainty of human joint control, the complexity of fine interactive motion between human hands and objects, and the lack of related datasets, existing algorithms for synthesizing three-dimensional digital human-object interactive motion cannot produce realistic results.

Existing algorithms can be divided, by the generated result, into methods that synthesize a single frame of a digital human grasping an object and methods that synthesize a digital human-object interactive motion sequence. Single-frame synthesis is comparatively simple, has been widely studied, and is well developed, so it is not discussed in the invention. The invention mainly addresses the synthesis of digital human-object interactive motion sequences. Existing deep-learning-based algorithms for synthesizing such sequences can be divided into methods based on reinforcement learning and methods based on generative algorithms.

Reinforcement-learning-based methods encourage a human hand to grasp an object and generate a certain degree of subsequent interaction through carefully designed reward and penalty functions for human-object interactive motion, and can produce relatively realistic results from a small amount of data. Generative-algorithm-based methods learn statistical regularities from massive data through a deep generative model such as a diffusion model and generate results that satisfy the conditions within the data distribution; their synthesis quality is highly dependent on large amounts of high-quality data. Reinforcement-learning-based methods are still at an early stage: because of the complexity of designing reward and penalty functions and the high degrees of freedom introduced when a digital human manipulates an object with both hands, such algorithms can only generate simple motions in which the digital human grasps and moves an object with one hand, and cannot generate whole-body motions involving both hands, so the realism of the synthesized results is weak. Generative-algorithm-based methods are limited by the fitting capability of the generative model on complex data and mainly fall into methods without hands and methods with only one or both hands. Methods without hands can only synthesize motions of the torso joints and mainly handle interaction scenes between a digital human and larger objects; methods with only one or both hands synthesize only hand motions, without the body, and suit interaction scenes involving the manipulation of smaller objects. Neither type can produce whole-body motion of the digital human, so the realism of the synthesized results is not strong. Generative models used in prior work include the variational autoencoder (VAE) and the diffusion model; the diffusion model is newer, and its research and application in digital human-object interactive motion synthesis are still insufficient.

Disclosure of Invention

The invention aims to use a diffusion model to synthesize a highly realistic three-dimensional digital human-object interactive motion sequence in which the synthesized digital human exhibits torso and hand motions simultaneously. The technical scheme adopted by the invention is as follows. A three-dimensional digital human-object interactive motion synthesis method based on a diffusion model comprises the following steps: obtaining three types of feature vectors, namely the three-dimensional mesh shape of an object, a frame-by-frame motion sequence of the object, and the body-shape feature vector of a digital human; performing conditional feature encoding on the three types of feature vectors to obtain a condition vector; based on a diffusion model, predicting the noise-free estimate of the current time step by using a denoiser in an iterative manner by u