CN-122023721-A - Multi-person whole-body mesh three-dimensional reconstruction method and system based on hierarchical query and graph convolution refinement
Abstract
The invention provides a multi-person whole-body mesh three-dimensional reconstruction method and system based on hierarchical query and graph convolution refinement. A hierarchical query decoder predicts the parameters of each human body part in parallel, from coarse to fine. All preliminary pose parameters are aggregated into a global human skeleton graph structure, whole-body pose collaborative optimization is performed by a pose graph refiner based on a graph convolution pyramid to obtain an optimized whole-body fusion parameter set, and the optimized whole-body fusion parameter set, the shape features and the facial expression parameters are input into the SMPL-X human body model for three-dimensional human mesh reconstruction. The invention explicitly captures the natural coarse-to-fine dependencies among human body parts and addresses the spatial decoupling of body parts while remaining efficient. By explicitly modeling the dependencies among joints and fusing multi-scale features, it improves the structural consistency of pose estimation, corrects local pose distortion, and improves robustness and accuracy.
Inventors
- GAO QING
- ZHANG MINGYU
Assignees
- Sun Yat-sen University, Shenzhen (中山大学·深圳)
- Sun Yat-sen University (中山大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (10)
- 1. A multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement, characterized by comprising the following steps: S1) performing multi-scale feature extraction on an image to be processed and screening human body candidate targets; S2) performing coarse-to-fine parallel prediction of the parameters of each human body part using a hierarchical query decoder (HQ-Decoder) to obtain the body pose, shape features, limb parameters, hand pose, facial expression parameters and mandibular pose; S3) aggregating all preliminary pose parameters into a global human skeleton graph structure, and performing whole-body pose collaborative optimization through a pose graph refiner based on a graph convolution pyramid to obtain an optimized whole-body fusion parameter set; S4) inputting the optimized whole-body fusion parameter set, the shape features and the facial expression parameters into the SMPL-X human body model for three-dimensional human mesh reconstruction to obtain a high-quality whole-body three-dimensional mesh.
- 2. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 1, wherein in step S1), the image to be processed is feature-encoded by a backbone network (Backbone) to obtain an image feature sequence; human body confidence scores are calculated from the image feature sequence and sorted in descending order, and the image features corresponding to the top K human body confidence scores are selected as human body query vectors.
- 3. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 2, wherein in step S2), the hierarchical query decoder (HQ-Decoder) follows the kinematic tree structure of the human body and comprises a human decoder, a body decoder (Body Decoder), a limb decoder (Limbs Decoder), face decoders (FaceDecoders) and hand decoders (Hand Decoders); the human decoder decodes the human body query vectors together with the image feature sequence and outputs class probabilities, human bounding box coordinates and human body features; feature extraction and parameter regression are then performed in sequence by the Body Decoder, the Limbs Decoder, the FaceDecoders and the Hand Decoders.
- 4. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 3, wherein in step S2), the query vector of each body part is fused with an embedding of the average parameters of that body part and a position embedding of the detection box, by a formula that uses a concatenation (splicing) operation, so as to obtain an enhanced body part query vector; the enhanced body part query vector is fed into the Body Decoder, which sequentially performs cross-attention with the image features, self-attention between queries, and a feed-forward network operation, by formulas that use a cross-attention operation, a self-attention operation between queries, layer normalization and a multi-layer perceptron, and that yield in turn the cross-attention feature, the self-attention feature between queries, and the body feature; the body feature is decoded to obtain the body base rotation vector and the shape features.
- 5. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 4, wherein in step S2), the limb decoder (Limbs Decoder) decodes and regresses the body feature together with the limb position embeddings to obtain the features and bounding boxes of the face and hands and the limb parameters; the face decoders (FaceDecoders) and hand decoders (Hand Decoders) decode the face and hand features in parallel to obtain the hand pose, the facial expression parameters and the mandibular pose.
- 6. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 1, wherein in step S3), the pose graph refiner based on the graph convolution pyramid stacks graph convolution blocks into a pyramid structure, and uses the multiple layers of graph convolution blocks to mine the spatial topological constraints among joints and to correct spatial inconsistencies among local parameters.
- 7. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 6, wherein initial pose features are received through an input layer; downsampling and global feature extraction are performed through multiple layers of graph convolution blocks; the extracted global features are then upsampled and restored along a reverse path, feeding global structure information back to each specific joint node; each graph convolution block uses graph convolution operations to pass information explicitly between adjacent joints, so that elbow motion is constrained by the shoulder and knee motion is constrained by the hip; local distortions that violate human anatomy are corrected using the coarse-grained features of the top pyramid layer; and a corrected whole-body fusion parameter set is finally obtained.
- 8. The multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as set forth in claim 7, wherein in step S4), the SMPL-X human body model maps the whole-body fusion parameters, the shape features and the facial expression parameters to the final 3D mesh through a linear blend skinning function, by a formula defined in terms of the linear blend skinning function, the final output human body mesh comprising a plurality of vertices, the 3D joint positions, and the blend weights.
- 9. A multi-person whole-body mesh three-dimensional reconstruction system based on hierarchical query and graph convolution refinement, characterized by comprising: a feature extraction module for performing multi-scale feature extraction on an image to be processed and screening human body candidate targets; a parameter estimation module for performing coarse-to-fine parallel prediction of the parameters of each human body part using a hierarchical query decoder (HQ-Decoder) to obtain the body pose, shape features, limb parameters, hand pose, facial expression parameters and mandibular pose; a pose optimization module for aggregating all preliminary pose parameters into a global human skeleton graph structure and performing whole-body pose collaborative optimization through a pose graph refiner based on a graph convolution pyramid to obtain an optimized whole-body fusion parameter set; and a human mesh reconstruction module for calling the SMPL-X human body model and performing three-dimensional human mesh reconstruction based on the optimized whole-body fusion parameter set, the shape features and the facial expression parameters to obtain a high-quality whole-body three-dimensional mesh.
- 10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement as claimed in any one of claims 1-8.
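The linear blend skinning step named in claim 8 can be illustrated with a minimal sketch. It is not the SMPL-X implementation: the full model also adds shape, expression and pose-corrective offsets to a template mesh and derives the joint transforms from the kinematic chain. Here the posed joint transforms are simply assumed to be given, and the names `linear_blend_skinning`, `IDENTITY` and `SHIFT_X` are hypothetical.

```python
from typing import List, Sequence

# A rigid joint transform stored as a 3x4 matrix [R | t] (assumed layout).
Transform = Sequence[Sequence[float]]

IDENTITY: Transform = [[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]]
# Hypothetical joint that translates by 2 units along x.
SHIFT_X: Transform = [[1.0, 0.0, 0.0, 2.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]]


def apply_transform(transform: Transform, point: Sequence[float]) -> List[float]:
    """Apply a 3x4 rigid transform to a 3D point."""
    return [row[0] * point[0] + row[1] * point[1] + row[2] * point[2] + row[3]
            for row in transform]


def linear_blend_skinning(vertices: Sequence[Sequence[float]],
                          joint_transforms: Sequence[Transform],
                          blend_weights: Sequence[Sequence[float]]) -> List[List[float]]:
    """Move each vertex by the weighted sum of its per-joint transformed positions.

    blend_weights[i][k] is the influence of joint k on vertex i; each row is
    expected to sum to 1.
    """
    skinned = []
    for vertex, weights in zip(vertices, blend_weights):
        blended = [0.0, 0.0, 0.0]
        for transform, weight in zip(joint_transforms, weights):
            moved = apply_transform(transform, vertex)
            blended = [b + weight * m for b, m in zip(blended, moved)]
        skinned.append(blended)
    return skinned


# The first vertex follows only the identity joint; the second is split evenly
# between both joints, so it moves half of the 2-unit shift.
example = linear_blend_skinning(
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [IDENTITY, SHIFT_X],
    [[1.0, 0.0], [0.5, 0.5]],
)
print(example)
```

Blending transformed positions rather than the transforms themselves is what makes the skinning linear: a vertex near a joint boundary moves smoothly between the motions of the bones that influence it.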
Description
Multi-person whole-body mesh three-dimensional reconstruction method and system based on hierarchical query and graph convolution refinement
Technical Field
The invention relates to the technical field of computer vision and three-dimensional reconstruction, and in particular to a multi-person whole-body mesh three-dimensional reconstruction method and system based on hierarchical query and graph convolution refinement.
Background
With the rapid development of fields such as human body understanding, animation production, games and embodied intelligence, the demand for multi-person three-dimensional human mesh reconstruction (3D Human Mesh Recovery, HMR) is increasingly urgent. In computer vision and human body understanding, multi-person whole-body mesh reconstruction aims to jointly estimate the body poses, hand poses and facial expressions of multiple people from a single image. Unlike conventional body mesh reconstruction, which focuses only on the torso, whole-body mesh reconstruction must handle the large-scale motion of the body and the fine details of the hands and face at the same time. Traditional multi-stage methods depend heavily on a detect-then-crop pipeline. This design severs the intrinsic connections between the body parts and the individual, increases system complexity, and is very sensitive to detection box noise, which easily leads to unnatural artifacts and inconsistent recovered poses. To address the high system complexity and the severed connections between body parts caused by the reliance of multi-stage methods on detection box cropping, single-stage (one-stage) frameworks have gradually become a research hotspot.
However, although existing single-stage methods avoid the cropping step, they usually adopt a flattened token representation that lacks explicit anatomical structure guidance, forcing the network to learn the complex structural relationships of the human body implicitly. This often leads to spatial decoupling of the body parts, and the generated poses lack overall consistency.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a multi-person whole-body mesh three-dimensional reconstruction method and system based on hierarchical query and graph convolution refinement. By introducing a hierarchical query decoder that follows the human kinematic tree and a pose graph refiner based on graph convolution, features can be extracted progressively from coarse to fine and the dependencies among joints can be modeled explicitly, so that the structural consistency and accuracy of multi-person whole-body mesh three-dimensional reconstruction are enhanced without intermediate detection and cropping.
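The explicit modeling of joint dependencies through graph convolution can be illustrated with a minimal sketch of a single graph convolution layer over a skeleton graph. This is an assumption-laden illustration rather than the refiner of the invention: it uses a simple mean-aggregation layer with self-loops, a hypothetical four-joint arm chain (`JOINTS`, `EDGES`), and omits the pyramid's downsampling and upsampling paths.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

# Hypothetical 4-joint arm chain; a real whole-body skeleton has many more joints.
JOINTS = ["shoulder", "elbow", "wrist", "hand"]
EDGES: List[Tuple[int, int]] = [(0, 1), (1, 2), (2, 3)]


def normalized_adjacency(num_joints: int, edges: Sequence[Tuple[int, int]]) -> Matrix:
    """Row-normalized adjacency with self-loops: each row averages a joint and its neighbours."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in edges:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return [[value / sum(row) for value in row] for row in adj]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(features: Matrix, adj_norm: Matrix, weight: Matrix) -> Matrix:
    """One layer: ReLU(A_norm @ X @ W), mixing each joint's features with its neighbours'."""
    mixed = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, value) for value in row] for row in mixed]


# One scalar feature per joint; the elbow carries an outlier value that the
# layer spreads over (and averages with) its neighbouring joints.
adjacency = normalized_adjacency(len(JOINTS), EDGES)
smoothed = graph_conv([[0.0], [9.0], [0.0], [0.0]], adjacency, [[1.0]])
print(dict(zip(JOINTS, (row[0] for row in smoothed))))
```

Because information only flows along skeleton edges, the hand in this example is unaffected by the elbow after one layer; stacking layers, or pooling joints as a pyramid does, widens the range over which joints constrain one another.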
In a first aspect, the invention provides a multi-person whole-body mesh three-dimensional reconstruction method based on hierarchical query and graph convolution refinement, which comprises the following steps:
S1) performing multi-scale feature extraction on an image to be processed and screening human body candidate targets;
S2) performing coarse-to-fine parallel prediction of the parameters of each human body part using a hierarchical query decoder (HQ-Decoder) to obtain the body pose, shape features, limb parameters, hand pose, facial expression parameters and mandibular pose;
S3) aggregating all preliminary pose parameters into a global human skeleton graph structure, and performing whole-body pose collaborative optimization through a pose graph refiner based on a graph convolution pyramid to obtain an optimized whole-body fusion parameter set;
S4) inputting the optimized whole-body fusion parameter set, the shape features and the facial expression parameters into the SMPL-X human body model for three-dimensional human mesh reconstruction to obtain a high-quality whole-body three-dimensional mesh.
Preferably, in step S1), the image to be processed is feature-encoded by a backbone network (Backbone) to obtain an image feature sequence; human body confidence scores are calculated from the image feature sequence and sorted in descending order, and the image features corresponding to the top K human body confidence scores are selected as human body query vectors.
Preferably, in step S2), the hierarchical query decoder (HQ-Decoder) follows the kinematic tree structure of the human body and comprises a human decoder, a body decoder (Body Decoder), a limb decoder (Limbs Decoder), face decoders (FaceDecoders) and hand decoders (Hand Decoders).
Preferably, in step S3), the pose graph refiner based on