CN-116386141-B - Multi-stage human motion capturing method, device and medium based on monocular video
Abstract
A multi-stage human motion capture method, device and medium based on monocular video. For monocular RGB video shot with a fixed camera, human motion capture is divided into multiple stages. The first stage uses a human pose detector to estimate two-dimensional human joint points on the input monocular video frame by frame. The second stage uses deep learning to acquire spatial and temporal information from the video sequence, learns the mapping from the two-dimensional human joint points to three-dimensional human joint points in camera space, and perceives the motion trajectory and ground-contact state of the human body in three-dimensional space. The third stage introduces the idea of inverse kinematics and fits a three-dimensional human mesh model onto the three-dimensional skeleton by formulating a reasonable penalty strategy, so as to characterize the motion sequence more faithfully. The invention markedly improves the alignment between the reconstructed human model and the input RGB image, estimates foot-ground contact more realistically, and yields a clear visual improvement.
Inventors
- WANG LIMIN
- TIAN YATING
- WU GANGSHAN
Assignees
- Nanjing University
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-03-30
Claims (8)
- 1. A multi-stage human motion capture method based on monocular video, characterized in that, for monocular RGB video shot with a fixed camera, human motion capture is divided into a plurality of stages. The first stage performs human pose detection on the input monocular video frame by frame, estimates two-dimensional human key points, and constructs a local space under the camera coordinate system of the three-dimensional human model SMPL; the first stage realizes data preprocessing through a preprocessing module: two-dimensional human key points are detected frame by frame in the monocular video, the two-dimensional key-point sequence is normalized, the coordinates of the 24 joint points of the three-dimensional human model SMPL are converted from the world coordinate system to the camera coordinate system, and joint points other than the root node are converted to coordinates relative to the root node, thereby constructing the local space; meanwhile, the heights of the left-foot and right-foot joints are clustered to estimate the ground height and to generate ground-contact labels for the left-foot and right-foot joints. The second stage estimates the pose through a camera correction module, a human pose estimation module and a human trajectory estimation module: the human pose estimation module estimates the positions of the three-dimensional human joint points in the local space, obtaining a three-dimensional human pose sequence from the two-dimensional key-point sequence; the camera correction module obtains the intrinsic and extrinsic parameters of the camera from the monocular video, i.e. the viewing angle of the video; the human trajectory estimation module predicts the displacement and ground-contact state of the human body in the camera coordinate system, obtaining a three-dimensional human trajectory sequence and ground-contact probabilities from the two-dimensional key-point sequence and the camera intrinsics and extrinsics. The third stage fits the three-dimensional human pose sequence, the three-dimensional human trajectory sequence and the ground-contact probabilities through a model fitting module based on inverse kinematics to obtain a motion sequence of the three-dimensional human model; the model fitting module iteratively optimizes the shape parameters and joint rotations of the human mesh model step by step according to the predicted pose sequence, trajectory sequence and contact probabilities, and assigns the results to the parameterized three-dimensional human model to obtain a realistic driving result. The model fitting module comprises four steps: S51, iteratively fitting the three-dimensional human shape parameters: during real-time motion capture the bone lengths of the human body are updated dynamically; video frames are input to the human pose estimation module to estimate three-dimensional joint coordinates, from which the length of each bone is computed; a skeleton-length estimate based on past frames is maintained and, after the current-frame estimate is computed by the human pose estimation module, is updated according to an update rule, and the updated value is used as the latest skeleton length in the three-dimensional human model fitting stage; bone vectors are obtained from the three-dimensional joint positions of the human model, the shape parameters of the SMPL model are taken as the fitting target to be updated, a loss function between the model's bone lengths and the latest skeleton lengths is defined, and the final fitting result is obtained after iteration. S52, computing the global rotation: in the SMPL pose parameters, the first component represents the global rotation; a rigid structure is determined by three joints of the SMPL model, and from the initial positions and the predicted positions of these three joints a rotation matrix is obtained that minimizes the sum of distances between the rotated initial positions and the predicted positions; the rotation is constrained to SO(3), the group of three-dimensional rotation matrices, and the minimization admits a closed-form solution via singular value decomposition (SVD). S53, computing the rotation of each joint: parent nodes are defined along the kinematic chains of the three-dimensional human model; for each node, its initial position in the model, its target position and its position in the reconstructed model are recorded, together with the bone vector to its parent; the pose parameter of a node represents its rotation relative to its parent, and its absolute rotation is the composition of the rotations along the chain; first the root nodes of the SMPL model are aligned and the global rotation is obtained through S52; then, moving along the kinematic chains of the three-dimensional human model from the root node to the leaf nodes, the relative rotation of each node is computed step by step: for the current node, the rotation influence of its parent is cancelled from the initial bone vector to obtain the target bone vector, the rotation axis and rotation angle between the two vectors are recorded, and the rotation matrix is obtained by the Rodrigues formula, in which the identity matrix and the skew-symmetric matrix of the rotation axis appear; this yields the relative rotation of the node. S54, optimization based on the detected two-dimensional key points and contact labels: from the shape parameters and pose parameters obtained in steps S51, S52 and S53 an SMPL model is obtained and regressed to three-dimensional joint coordinates by linear mapping; combined with the camera intrinsic matrix estimated by the camera correction module, the two-dimensional key-point coordinates after projection of the SMPL model are solved; this step combines the normalized two-dimensional human key-point sequence and the ground-contact probabilities predicted by the human trajectory estimation module to further iteratively optimize the human model, the parameters to be updated being the pose parameters; a loss function is defined accordingly and the final optimized result is obtained by iteration.
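The closed-form SVD solution named in step S52 and the Rodrigues formula named in step S53 are standard results; since the patent's formula images are not reproduced in the source, the sketch below uses the standard Kabsch algorithm and the standard Rodrigues rotation formula (function names and the generic N-point interface are my own, not the patent's):

```python
import numpy as np

def fit_global_rotation(initial, predicted):
    """Closed-form rotation (Kabsch/SVD) aligning initial joint positions
    (N, 3) to predicted ones, constrained to SO(3) as in step S52."""
    A = initial - initial.mean(axis=0)       # centre both point sets
    B = predicted - predicted.mean(axis=0)
    H = A.T @ B                              # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T                    # optimal rotation matrix

def rodrigues(axis, angle):
    """Rotation matrix from axis-angle via the Rodrigues formula:
    R = I + sin(t) K + (1 - cos(t)) K^2, with K the skew matrix of axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
```

In step S53 the axis would be the normalized cross product of the initial and target bone vectors and the angle the arc cosine of their normalized dot product, computed per node while walking the kinematic chain from root to leaves.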
- 2. The monocular-video-based multi-stage human motion capture method of claim 1, wherein the first-stage data preprocessing comprises three subtasks: extracting two-dimensional human key points, converting three-dimensional human joint points, and generating ground-contact labels. S11, extracting and normalizing two-dimensional human key points: the original videos in the training set are downsampled into frames, and the video frames are fed into a human pose detector to estimate the image-coordinate sequence of 25 two-dimensional human key points; the coordinates are then normalized with respect to the width and height of the video frames while maintaining the aspect ratio. S12, converting the three-dimensional human joint points: the three-dimensional joint coordinates are converted from the world coordinate system to the camera coordinate system for supervision, using the camera's rotation matrix and translation as the viewpoint transformation; the conversion is applied to the world-coordinate positions of the 24 joint points of the three-dimensional human model SMPL; finally, the spatial position of the root node in the camera coordinate system is retained, and the three-dimensional coordinates of the root node are subtracted from those of the other joint points to convert them into coordinates relative to the root node; the ground truth of the three-dimensional trajectory is the three-dimensional coordinate sequence of the root node in the camera coordinate system, and the depth of the root node is obtained from it by a slicing operation along the corresponding dimension. S13, generating binary ground-contact labels: for the three-dimensional joint-point sequence, the displacement speed of each joint point is first computed by differencing adjacent frames; if the speed of a joint point is below a set threshold, the joint point is considered static in that frame; the heights of the static left- and right-foot joints are then clustered with the DBSCAN method from machine learning, and the minimum cluster median minus an offset constant is taken as the ground height; for the eight joint points of the left foot, right foot, left ankle, right ankle, left knee, right knee, left wrist and right wrist, if the speed is below the set threshold and the height difference between the joint point and the ground is within the set range, the joint point is considered to be in contact with the ground and its contact label is 1.
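The ground-estimation and contact-labeling procedure of step S13 can be sketched as follows. The thresholds are illustrative (the claim sets none), the height axis is assumed to be y, and a simple one-dimensional gap clustering stands in for DBSCAN to keep the sketch dependency-free:

```python
import numpy as np

def estimate_ground_and_contacts(joints, vel_thresh=0.01, eps=0.05,
                                 offset=0.02, height_range=0.05):
    """joints: (T, J, 3) joint sequence. Returns (ground height, (T, J)
    binary contact labels), following the recipe of claim 2 / step S13."""
    # per-frame displacement speed via adjacent-frame differences
    vel = np.linalg.norm(np.diff(joints, axis=0), axis=-1)        # (T-1, J)
    static = np.concatenate([vel < vel_thresh,
                             np.zeros((1, joints.shape[1]), bool)])
    heights = joints[..., 1]
    static_h = np.sort(heights[static])
    # split sorted static heights into clusters wherever a gap exceeds eps
    clusters, start = [], 0
    for i in range(1, len(static_h)):
        if static_h[i] - static_h[i - 1] > eps:
            clusters.append(static_h[start:i])
            start = i
    clusters.append(static_h[start:])
    # minimum cluster median minus an offset constant = ground height
    ground = min(np.median(c) for c in clusters if len(c)) - offset
    # contact: static AND close enough to the ground
    contact = static & (np.abs(heights - ground) < height_range)
    return ground, contact.astype(int)
```

In the patent the clustering is applied only to the left- and right-foot joints and the contact test only to the eight listed joints; the sketch applies both to all joints passed in for brevity.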
- 3. The monocular-video-based multi-stage human motion capture method of claim 1, wherein the human pose estimation module comprises an encoding stage and a decoding stage and takes the normalized two-dimensional key-point coordinate sequence as input; the encoder extracts features of the input sequence, after which the features are fed into the decoder to predict, for the intermediate frame, the relative positions of the human joint points in three-dimensional space. Specifically: in the encoding stage, long-range features are extracted by grouping along the channel dimension according to the kinematic chains and by dilated convolution along the temporal dimension; the encoder is a fully convolutional network implemented with dilated convolutions and residual learning; the human skeleton is divided into six partial kinematic chains, namely the head, root-node, left-arm, right-arm, left-leg and right-leg chains, whose motions are relatively independent, and the input is divided into 6 groups along the channel dimension so that features are extracted group-wise; in the encoder, the output of each layer is divided into 7 parts, comprising 6 group features and one global feature; from layer to layer, a global convolution extracts global information while each group convolution operates on the concatenated {global, group} features; all convolutions are dilated convolutions with stride 3, kernel size 3 and dilation factor 3, and concatenation is performed along the channel dimension; the encoder finally outputs the grouped joint features. Then, in the decoding stage, the 24 joint points of the parameterized three-dimensional human model are regressed region by region in a multi-branch form: the grouped local features are respectively input to 6 independent decoders, each predicting the positions of the joints on the corresponding local kinematic chain of the human model.
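The dilated temporal convolution the encoder relies on can be illustrated minimally. This is a generic "valid" dilated 1-D convolution, not the patent's network; for simplicity it uses stride 1, whereas the claim's layers also use stride 3:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=3):
    """x: (C_in, T) temporal features, w: (C_out, C_in, K) kernel.
    'Valid' dilated temporal convolution: each output step mixes K input
    frames spaced `dilation` apart, enlarging the receptive field."""
    C_out, C_in, K = w.shape
    span = (K - 1) * dilation            # temporal extent of one kernel
    T_out = x.shape[1] - span
    y = np.zeros((C_out, T_out))
    for t in range(T_out):
        taps = x[:, t : t + span + 1 : dilation]          # (C_in, K)
        y[:, t] = np.tensordot(w, taps, axes=([1, 2], [0, 1]))
    return y
```

With kernel size 3 and dilation 3 each output frame sees 7 consecutive input frames; stacking such layers is how the encoder accumulates long-range temporal context per kinematic-chain group.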
- 4. The method of claim 1, wherein the camera correction module uses a pre-trained ResNet-50 model as the backbone network and replaces the last layer with three separate MLP head networks that estimate, respectively, the vertical field of view (vfov) of the camera, the rotation about the x-axis (pitch) and the rotation about the z-axis (roll); the camera focal length is computed from vfov, giving the camera intrinsic matrix; a projection matrix of the camera is obtained from the intrinsic and extrinsic parameters, and in the training stage this projection matrix projects the three-dimensional human joint points predicted by the network onto the two-dimensional plane to obtain the two-dimensional coordinates of the corresponding joints in the image, and a distance loss against the two-dimensional ground truth is computed for supervision. For each of the three quantities a value range is predefined and 256 candidate values are uniformly sampled within it; each of the three MLP head networks outputs a 256-dimensional vector representing the probabilities of the candidate values; the index corresponding to the maximum probability is found approximately by a softargmax function, normalized, and mapped into the candidate-value range, and the candidate value at the corresponding position is taken as the prediction of the network. In training, a standard loss function is adopted for the pitch and roll predictions, while the Geman-McClure loss function is used for the vfov prediction.
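The two computations claim 4 names without formulas are standard: the pinhole relation between vertical field of view and focal length, and the softargmax decoding of a 256-bin classification head into a continuous value. A hedged sketch (the expectation-based softargmax below is the usual differentiable form; the patent's exact expression is not reproduced in the source):

```python
import numpy as np

def focal_from_vfov(vfov, img_h):
    """Standard pinhole relation: f = (H / 2) / tan(vfov / 2)."""
    return 0.5 * img_h / np.tan(0.5 * vfov)

def softargmax_decode(logits, lo, hi):
    """Decode a classification head over uniformly sampled candidates in
    [lo, hi]: softmax -> expected (soft) index -> candidate value."""
    p = np.exp(logits - logits.max())     # numerically stable softmax
    p /= p.sum()
    idx = (p * np.arange(len(p))).sum()   # soft index near the argmax
    return lo + (hi - lo) * idx / (len(p) - 1)
</test>```

The soft index is differentiable, which is presumably why the claim approximates the maximum-probability index this way rather than using a hard argmax.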
- 5. The monocular-video-based multi-stage human motion capture method of claim 4, wherein, for real-time motion capture application scenarios, the image captured by the camera at the current moment is sent to the camera correction module to estimate the camera parameters; an estimate based on past frames is maintained and, after the network computes an estimate based on the current frame, the stored estimate is updated according to an update rule, and the updated value is used as the latest camera estimate by the subsequent modules.
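The update rule in claim 5 (and the analogous skeleton-length update in step S51) is an image in the source and not reproduced; an exponential moving average is one plausible form of such a past/current blend, sketched here purely as an assumption:

```python
def ema_update(past, current, momentum=0.9):
    """One plausible form of the claim-5 update rule (assumed, not from
    the patent): blend the running past-frame estimate with the
    current-frame estimate, then use the result as the latest estimate."""
    return momentum * past + (1.0 - momentum) * current
```

The same smoothing would damp per-frame jitter in both the camera estimate and the bone-length estimate during real-time capture.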
- 6. The monocular-video-based multi-stage human motion capture method of claim 1, wherein the human trajectory estimation module comprises an encoding stage and a decoding stage; from the normalized two-dimensional pose key-point coordinate sequence, only the key points on the left-leg and right-leg kinematic chains are taken as input; in the encoding stage, the encoder extracts features of the input sequence with a dilated convolutional network; in the decoding stage, the extracted features pass through two different head networks that respectively predict, for the intermediate frame, the three-dimensional coordinates of the human root node in the camera coordinate system and the ground-contact probability of each joint point; from the camera intrinsic matrix estimated by the camera correction module, the root-relative three-dimensional joint coordinates predicted by the human pose estimation module, and the three-dimensional root-node coordinates predicted by the human trajectory estimation module, the coordinates of the joint points projected onto the two-dimensional image plane are obtained.
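The projection at the end of claim 6 is standard perspective projection: absolute camera-space joints are the root position plus the root-relative offsets, then multiplied by the intrinsic matrix and divided by depth. A minimal sketch (function name mine):

```python
import numpy as np

def project_joints(rel_joints, root, K):
    """rel_joints: (J, 3) root-relative joints, root: (3,) camera-space
    root position, K: (3, 3) camera intrinsics. Returns (J, 2) pixel
    coordinates via perspective division, as described in claim 6."""
    pts = rel_joints + root          # absolute camera-space positions
    uvw = pts @ K.T                  # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # divide by depth
```

Supervising this reprojection against the detected two-dimensional key points couples the trajectory and pose estimates through the full perspective camera, rather than the weak-perspective model the description criticizes.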
- 7. An electronic device for human motion capture from monocular video, comprising a storage medium and a processor, the storage medium storing a computer program and the processor executing the computer program, which, when executed, implements the monocular video-based multi-stage human motion capture method of any one of claims 1-6.
- 8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed, implements the monocular video-based multi-stage human motion capture method of any one of claims 1-6.
Description
Multi-stage human motion capturing method, device and medium based on monocular video

Technical Field

The invention belongs to the technical field of computer software and relates to human motion capture under a monocular camera, in particular to a method for predicting the three-dimensional rotations of human joints from monocular RGB video shot with a fixed camera position, so as to drive a parameterized three-dimensional human model.

Background

Motion capture is an animation technique that accurately estimates the motion of a person and produces corresponding motion animations for virtual assets. The technology is widely applied in fields such as entertainment, sports and medical treatment. Monocular motion capture has low equipment requirements and is simple to deploy, which greatly lowers the barrier to use and gives it a larger market. With the development of deep learning, the precision of monocular motion capture continues to improve. SMPL (Skinned Multi-Person Linear Model) is a vertex-based three-dimensional human model capable of accurately representing different human shapes and poses; it comprises 24 joint points, namely 23 body joints and 1 root node. In the field of deep learning, existing monocular motion capture methods can be divided into two main categories according to the number of stages. The first is single-stage methods, such as HMR and PyMAF, which use neural networks to regress the pose and rotation parameters of the human model end to end from the raw RGB input, without explicit intermediate states or intermediate supervision. But the mapping from the raw image to the abstract model parameters is highly non-linear, so the predicted results are often not aligned accurately enough with the image.
The second category is multi-stage methods, such as NBF and Pose2Mesh, which reduce the fitting difficulty of each step by decomposing the task: the network gradually outputs intermediate representations, such as human joint points, and continues predicting on that basis until the target result is obtained. Since each stage in a multi-stage method takes the output of the previous stage as input, prediction errors accumulate stage by stage. In addition, most existing motion capture methods are image-based; for monocular video input they can only predict frame by frame, and because temporal information cannot be extracted and exploited, the estimated video results often jitter and are sensitive to occlusion. Moreover, mainstream methods adopt a weak-perspective camera model, i.e. the camera is assumed to be far from the person and the depth of the human body is ignored, which is inconsistent with real scenes and cannot represent the characteristics of perspective projection. Mainstream regression networks also simply omit supervision of human contact with the environment, easily leading to visually inconsistent and unrealistic results.

Disclosure of Invention

The invention aims to solve two problems: the precision of the deep learning schemes adopted by existing monocular video motion capture methods does not meet requirements; and existing motion capture methods are image-based and do not consider the temporal information and scene depth information of video, which affects the effectiveness of motion capture and the accuracy of the captured results.
The technical scheme of the invention is a multi-stage human motion capture method based on monocular video, characterized in that, for monocular RGB video shot with a fixed camera, human motion capture is divided into a plurality of stages: the first stage performs human pose detection frame by frame on the input monocular video, estimates two-dimensional human key points and constructs a local space under the camera coordinate system of the three-dimensional human model SMPL; the second stage uses deep learning to acquire spatial and temporal information in the video sequence, learns the mapping from the two-dimensional human key points of the video frames to the three-dimensional human joint points in camera space, and detects and perceives the motion trajectory and ground-contact state of the human body in three-dimensional space; the third stage fits a three-dimensional human mesh model to the three-dimensional skeleton according to inverse kinematics by formulating a penalty strategy, so as to describe the motion sequence and realize motion capture. Further, the first stage is to implement data preprocessing through a preprocessing module, detect two-dimensional human body key points frame by fram