CN-121982254-A - Monocular image three-dimensional human body reconstruction method based on dense normal alignment

Abstract

The invention discloses a monocular-image three-dimensional human body reconstruction method based on dense normal alignment. The method first crops the input RGB image, extracts two-dimensional key points, a surface normal map, and a continuous surface embedding vector map, and uses a regressor to obtain initial human pose and shape parameters together with the translation parameters of the monocular camera. It then introduces a pixel-level normal alignment algorithm that uses the continuous surface embedding vector map to establish correspondences between image pixels and human mesh vertices and generates an aligned normal map. Iterative optimization follows: a total energy function is computed from the two-dimensional key points, the surface normal map, the continuous surface embedding vector map, and the aligned normal map, and the pose and shape parameters are updated by back-propagation until convergence. Finally, the three-dimensional human mesh is output, completing the reconstruction. By making full use of the dense three-dimensional geometric information in the surface normal map, the invention significantly improves the accuracy and robustness of human body reconstruction from monocular images in the presence of depth ambiguity and unusual body shapes.

Inventors

LI GUIQING
ZHANG ZHAOBO
NIE YONGWEI

Assignees

South China University of Technology (华南理工大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-26

Claims (6)

1. A monocular-image three-dimensional human body reconstruction method based on dense normal alignment, characterized by comprising the following steps: S1, inputting an original RGB image that contains a person and was captured by a monocular camera, and obtaining a cropped image of fixed size through human target detection and image cropping; S2, extracting features from the cropped image to obtain two-dimensional key points, a surface normal map, and a continuous surface embedding vector map, and applying to the cropped image a human mesh regressor associated with the parametric human body model SMPL and the monocular camera to obtain initial SMPL pose parameters and shape parameters and to output initial translation parameters of the monocular camera; S3, constructing and solving a least-squares problem using the extracted two-dimensional key points, and computing a moving average of the solution and the initial translation parameters of the monocular camera to obtain combined optimal translation parameters of the monocular camera; S4, for given SMPL pose parameters and shape parameters, generating a human mesh, setting the camera parameters of a differentiable renderer to the combined optimal translation parameters of the monocular camera, rendering the human mesh with the differentiable renderer to obtain an SMPL rendered normal map, and computing an aligned normal map by a pixel-level normal alignment algorithm, wherein the pixel-level normal alignment algorithm uses the continuous surface embedding vector map to determine, for each pixel of the human region of the cropped image, its unique corresponding vertex on the human mesh, and then uses the SMPL rendered normal map to compute the average normal vector of the triangular faces adjacent to that vertex as the normal vector of the pixel, thereby obtaining the aligned normal map; S5, taking the SMPL pose parameters and shape parameters as the target optimization parameters, starting iterative optimization with the initial SMPL pose parameters and shape parameters obtained in step S2 as their initial values, and, in each iteration, computing the aligned normal map corresponding to the current target optimization parameters by step S4, computing a total energy function in combination with the two-dimensional key points, the surface normal map, and the continuous surface embedding vector map, obtaining the gradients of the target optimization parameters by back-propagation, and updating the target optimization parameters so as to minimize the energy; and S6, generating an optimized human mesh from the optimized SMPL pose parameters and shape parameters by the linear blend skinning algorithm, rendering the optimized human mesh with a renderer, and superimposing the rendering on the cropped image to obtain the three-dimensional human body reconstruction result for the monocular image.
2. The method of claim 1, wherein in step S2: the two-dimensional key points are extracted by the two-dimensional key point detector OpenPose; the surface normal map is extracted by the surface normal regressor Sapiens; the continuous surface embedding vector map is extracted by the dense correspondence regressor DensePose and is a feature map of the same resolution as the cropped image, in which each pixel p corresponds to a 16-dimensional embedding vector in $\mathbb{R}^{16}$, where $\mathbb{R}$ denotes the set of real numbers and $\mathbb{R}^{16}$ denotes the 16-dimensional real space; and the human mesh regressor CLIFF predicts the initial pose parameters and initial shape parameters of SMPL and simultaneously outputs the initial translation parameters of the monocular camera.
3. The method for monocular-image three-dimensional human body reconstruction based on dense normal alignment according to claim 2, wherein step S3 comprises the following steps: S31, using a full perspective projection model, setting the focal length of the monocular camera to a value defined in terms of the height and width of the image [formula missing]; S32, for the given pose parameters and shape parameters, and with the translation of the monocular camera as the variable to be solved, expressing the two-dimensional projections of the three-dimensional joints of SMPL by a formula [formula missing] in terms of the three-dimensional joints of SMPL under the given pose and shape parameters and the projection matrix of the monocular camera, the projection matrix being given explicitly [formula missing]; S33, for the given pose parameters and shape parameters, solving a least-squares problem [formula missing], which can be solved by singular value decomposition, the solution being the initial optimal translation parameters of the monocular camera; S34, computing a moving average of the initial translation parameters output in step S2 and the initial optimal translation parameters obtained in step S33 to obtain the combined optimal translation parameters of the monocular camera [formula missing]; in the subsequent optimization, the translation parameters of the monocular camera are fixed to the combined optimal translation parameters and the focal length is fixed to the value set in step S31.
4. The method for monocular-image three-dimensional human body reconstruction based on dense normal alignment according to claim 3, wherein step S4 comprises the following steps: S41, for the given SMPL pose parameters and shape parameters, generating the human mesh using the linear blend skinning algorithm; S42, reading a pre-built static lookup table that stores the mapping from the predefined SMPL vertices to embedding vectors, so that every vertex index in the SMPL vertex set has a corresponding 16-dimensional embedding vector, and defining a pixel-to-vertex mapping that, for any pixel of the human region of the cropped image, gives the index of the mesh vertex whose embedding vector has the smallest distance to the embedding vector of that pixel [formula missing], the formula involving the total number of SMPL vertices; S43, using the differentiable renderer NMR, with its monocular camera translation parameters set to the combined optimal translation parameters and its focal length set to the value from step S31, rendering the normals of the human mesh to obtain the SMPL rendered normal map, in which each pixel has a corresponding normal vector; S44, for each triangular face of the human mesh, defining the normal vector of that face, computed from the rendered normal vectors of the pixels that the face covers and the number of those pixels [formula missing]; S45, defining a vertex-to-pixel mapping in which, for each indexed vertex of the human mesh, the normal vector of the corresponding pixel is computed from the set of triangular faces adjacent to the vertex and the number of those faces [formula missing]; S46, generating, for the given pose and shape parameters, an aligned normal map of the same size as the cropped image, in which the normal vector at each pixel is defined [formula missing] in terms of the set of human-region pixels of the cropped image and the zero vector, the zero vector being used to set the normal vectors of non-human-region pixels to zero.
5. The method for monocular-image three-dimensional human body reconstruction based on dense normal alignment according to claim 4, wherein step S5 comprises the following steps: S51, using PyTorch as the back-propagation framework, using the PyTorch implementation of SMPL with the pose parameters and shape parameters as the target optimization parameters, applying the PyTorch version of NMR, and using the Adam optimizer to update the target optimization parameters, with a preset maximum number of iterations, wherein the Adam optimizer is one of the adaptive-learning-rate optimization algorithms provided by PyTorch; S52, constructing a total energy function comprising five error terms, namely a two-dimensional joint reprojection error term, a dense vertex reprojection error term, a dense normal alignment error term, a global pose prior term, and a local pose prior term, with weight parameters assigned to the error terms, the specific form of the total energy function being [formula missing]; the two-dimensional joint reprojection error term minimizes the two-dimensional projection error of the three-dimensional joints on the cropped image [formula missing], the formula involving the number of SMPL joints, the confidence estimated by OpenPose for each joint, the robust differentiable Geman-McClure function, the three-dimensional position of each SMPL joint under the current pose and shape parameters, and the two-dimensional position of each joint predicted by OpenPose; the dense vertex reprojection error term measures the error between the pixels of the human region in the cropped image and the two-dimensional projected positions of the corresponding SMPL surface vertices [formula missing]; the dense normal alignment error term is the error between the normal vectors of the human-region pixels in the cropped image and the normal vectors of the corresponding SMPL surface vertices, that is, the error after pixel-level normal alignment [formula missing]; the global pose prior term constrains the pose to a reasonable range [formula missing], using a Gaussian mixture model with 8 Gaussian components trained on the CMU data set, the formula involving the weight of each Gaussian component, the mean vector and covariance matrix of each component, the probability density of each component evaluated at the input pose, and a constant; the local pose prior term penalizes unnatural bending of the elbow and knee joints to maintain anatomical plausibility [formula missing], the formula involving the rotation parameters of the individual SMPL joints and an exponential function that imposes a sharply increasing penalty on joint rotation angles beyond the normal physiological range of motion; S53, performing iterative computation, wherein in each iteration the aligned normal map is regenerated for the current pose and shape parameters using the process of step S4, the total energy function is recomputed, back-propagation is performed with PyTorch to obtain the gradients of the pose and shape parameters, and the pose and shape parameters are updated using the Adam optimizer; when the total energy function converges or the number of iterations reaches the preset maximum, the iterative computation ends and the optimized SMPL pose parameters and shape parameters are output.
6. The method for monocular-image three-dimensional human body reconstruction based on dense normal alignment according to claim 5, wherein in step S6, an optimized human mesh is generated from the optimized SMPL pose parameters and shape parameters using the linear blend skinning algorithm, the monocular camera translation parameters and focal length of the renderer are set to the combined optimal translation parameters and the focal length fixed in step S3, an image of the optimized human mesh is rendered, and the image is superimposed on the cropped image to obtain the three-dimensional human body reconstruction result for the monocular image.
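
The pixel-level normal alignment described in steps S42, S45 and S46 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, squared Euclidean distance is assumed for the embedding comparison (the claims only say "smallest distance"), the per-face normals of step S44 are taken as given inputs, and the toy data uses 2-dimensional embeddings in place of the 16-dimensional ones.

```python
import math


def nearest_vertex(pixel_embedding, vertex_embeddings):
    """Return the index of the vertex whose embedding is closest to the pixel's.

    Squared Euclidean distance is an assumption made for this sketch.
    """
    best_index, best_distance = -1, math.inf
    for index, embedding in enumerate(vertex_embeddings):
        distance = sum((a - b) ** 2 for a, b in zip(pixel_embedding, embedding))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def vertex_normal(vertex_index, faces, face_normals):
    """Average the normals of the triangular faces adjacent to a vertex (S45)."""
    adjacent = [face_normals[i] for i, face in enumerate(faces) if vertex_index in face]
    if not adjacent:
        return [0.0, 0.0, 0.0]
    return [sum(n[k] for n in adjacent) / len(adjacent) for k in range(3)]


def aligned_normal_map(embedding_map, body_mask, vertex_embeddings, faces, face_normals):
    """Build the aligned normal map (S46): one normal per body pixel, zero elsewhere."""
    result = []
    for row_embeddings, row_mask in zip(embedding_map, body_mask):
        row = []
        for pixel_embedding, is_body in zip(row_embeddings, row_mask):
            if is_body:
                vertex = nearest_vertex(pixel_embedding, vertex_embeddings)
                row.append(vertex_normal(vertex, faces, face_normals))
            else:
                row.append([0.0, 0.0, 0.0])
        result.append(row)
    return result


# Toy data (hypothetical): 4 vertices, 2 faces, a 1 x 2 image.
vertex_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
faces = [(0, 1, 2), (1, 2, 3)]
face_normals = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
embedding_map = [[[0.1, 0.0], [0.9, 0.1]]]
body_mask = [[True, True]]
normals = aligned_normal_map(embedding_map, body_mask, vertex_embeddings, faces, face_normals)
```

In the toy data the first pixel maps to vertex 0, which touches only the first face, while the second pixel maps to vertex 1, which touches both faces and therefore receives the mean of their normals. A practical version would replace the linear scan with a batched tensor operation so that gradients can flow through the renderer, as step S5 requires.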

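The two-dimensional joint reprojection term described in step S52 can be sketched in Python as follows. This is a hedged illustration only: the formula itself is not reproduced in this text, so the Geman-McClure form `x^2 / (x^2 + sigma^2)`, the default `sigma`, the pinhole projection with a principal point at the image center, and all function names are assumptions introduced for the example.

```python
def geman_mcclure(residual, sigma=100.0):
    """Robust penalty that grows like x^2 near zero and saturates below 1.

    This particular form and the default sigma are assumptions of the sketch.
    """
    squared = residual * residual
    return squared / (squared + sigma * sigma)


def project(point, translation, focal, center):
    """Perspective projection of a 3D point after adding the camera translation."""
    x, y, z = (p + t for p, t in zip(point, translation))
    return (focal * x / z + center[0], focal * y / z + center[1])


def joint_reprojection_energy(joints_3d, keypoints_2d, confidences,
                              translation, focal, center, sigma=100.0):
    """Confidence-weighted robust distance between projected joints and detections."""
    energy = 0.0
    for joint, keypoint, confidence in zip(joints_3d, keypoints_2d, confidences):
        u, v = project(joint, translation, focal, center)
        energy += confidence * (geman_mcclure(u - keypoint[0], sigma)
                                + geman_mcclure(v - keypoint[1], sigma))
    return energy


# Toy data (hypothetical): two joints, a camera 5 units away, a 224 x 224 crop.
joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
translation = (0.0, 0.0, 5.0)
focal, center = 1000.0, (112.0, 112.0)
perfect = [(112.0, 112.0), (312.0, 112.0)]
shifted = [(112.0, 112.0), (412.0, 112.0)]   # second detection is 100 px off
confidences = [1.0, 0.5]

energy_perfect = joint_reprojection_energy(joints, perfect, confidences,
                                           translation, focal, center)
energy_shifted = joint_reprojection_energy(joints, shifted, confidences,
                                           translation, focal, center)
```

Because the penalty saturates, a single badly detected key point contributes at most its confidence weight per coordinate, which is the usual reason for choosing a robust function over a plain squared error in this kind of fitting.
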
Description

Monocular image three-dimensional human body reconstruction method based on dense normal alignment

Technical Field

The invention relates to the technical field of three-dimensional human body reconstruction, and in particular to a monocular-image three-dimensional human body reconstruction method based on dense normal alignment.

Background

With the rapid development of virtual reality, intelligent surveillance, digital humans, and related fields, the demand for accurately reconstructing three-dimensional human shape and pose from a single RGB image keeps growing. Existing methods fall into two main categories: optimization-based methods and regression-based methods. The former, such as SMPLify, reconstruct the three-dimensional human body by fitting SMPL to two-dimensional joints, but they rely on sparse two-dimensional cues, so the results suffer from local pose errors caused by depth ambiguity and from body shapes that tend toward the average. The latter, such as HMR and CLIFF, use deep learning to regress SMPL parameters directly and are fast, but they depend on large amounts of high-quality annotated data. In-the-wild data with real annotations is currently scarce, so these methods rely heavily on pseudo-ground-truth annotations produced by optimization-based methods, and their reconstruction results therefore also suffer from local pose errors caused by depth ambiguity. In recent years contour and depth information have been introduced, but most of these constraints remain in two-dimensional space or act only at a coarse level, so they struggle to resolve the two-dimensional-to-three-dimensional ambiguity of monocular images, and the reconstructed human pose and shape are unreliable.

Disclosure of Invention

The invention aims to overcome the defects and shortcomings of the prior art by providing a monocular-image three-dimensional human body reconstruction method based on dense normal alignment, which effectively addresses the problems of depth ambiguity and averaged body shape in existing monocular human body reconstruction and improves the accuracy of three-dimensional human pose and shape reconstruction.

To achieve this purpose, the technical scheme provided by the invention is a monocular-image three-dimensional human body reconstruction method based on dense normal alignment, comprising the following steps: S1, inputting an original RGB image that contains a person and was captured by a monocular camera, and obtaining a cropped image of fixed size through human target detection and image cropping; S2, extracting features from the cropped image to obtain two-dimensional key points, a surface normal map, and a continuous surface embedding vector map, and applying to the cropped image a human mesh regressor associated with the parametric human body model SMPL and the monocular camera to obtain initial SMPL pose parameters and shape parameters and to output initial translation parameters of the monocular camera; S3, constructing and solving a least-squares problem using the extracted two-dimensional key points, and computing a moving average of the solution and the initial translation parameters of the monocular camera to obtain combined optimal translation parameters of the monocular camera; S4, for given SMPL pose parameters and shape parameters, generating a human mesh, setting the camera parameters of a differentiable renderer to the combined optimal translation parameters of the monocular camera, rendering the human mesh with the differentiable renderer to obtain an SMPL rendered normal map, and computing an aligned normal map by a pixel-level normal alignment algorithm, wherein the pixel-level normal alignment algorithm uses the continuous surface embedding vector map to determine, for each pixel of the human region of the cropped image, its unique corresponding vertex on the human mesh, and then uses the SMPL rendered normal map to compute the average normal vector of the triangular faces adjacent to that vertex as the normal vector of the pixel, thereby obtaining the aligned normal map; S5, taking the SMPL pose parameters and shape parameters as the target optimization parameters, starting iterative optimization with the initial SMPL pose parameters and shape parameters obtained in step S2 as their initial values, and, in each iteration, computing the aligned normal map corresponding to the current target optimization parameters by step S4, computing a total energy function in combination with the two-dimensional key points, the surface normal map, and the continuous surface embedding vector d