CN-121280635-B - Cross-modal human body posture data generation method based on image-generated point clouds

CN 121280635 B

Abstract

The invention discloses a cross-modal human body posture data generation method based on image-generated point clouds, which uses a conditional diffusion model to perform two-stage diffusion generation, taking single-view image data of a human body posture as condition information to control the generation of the corresponding point cloud modality. First, in a voxel diffusion stage, noise voxels are denoised with bird's-eye-view features of a depth map as condition to generate a coarse global posture voxel distribution, alleviating the problem of the unbalanced distribution of posture key points. Second, in a point cloud diffusion stage, the noise point cloud in each local voxel space is denoised with posture key point features extracted from a color image as condition to generate fine local point clouds. Finally, the fine local point clouds are merged into a complete point cloud. The generated point cloud is similar to the real point cloud in spatial distribution and posture representation, and can be used to expand data sets for model training.

Inventors

  • SHI ZHENYU
  • HE SHIBO
  • GU CHAOJIE
  • QIAN BIN

Assignees

  • ZHEJIANG UNIVERSITY

Dates

Publication Date
2026-05-08
Application Date
2025-12-08

Claims (8)

  1. A cross-modal human body posture data generation method based on image-generated point clouds, characterized by comprising the following steps: S1, acquiring a data set of paired images containing human body postures and point clouds, wherein the images comprise color images and depth images; S2, establishing an image-to-point-cloud cross-modal generation model, the model taking the point cloud and the images as input, performing global voxel diffusion on the point cloud with the depth image as condition to generate a coarse posture voxel distribution, and then generating a fine point cloud in each voxel through local point cloud diffusion with the color image as condition; wherein the global voxel diffusion in the cross-modal generation model is implemented with a noise prediction network: first, the three-dimensional space of the input point cloud is divided into voxels; Gaussian noise is sampled over the divided three-dimensional voxel space to obtain noise voxels; the noise voxels are input into the noise prediction network for noise prediction; the noise is gradually removed through the standard reverse diffusion process, iteratively recovering the coarse posture voxel distribution; and condition information from the depth image is fused into the voxel data during denoising; a posture classification model is further used to predict, in real time, the posture label corresponding to the voxel distribution of the current denoising step, and the loss between the prediction result and the true label is computed with a cross-entropy loss function as a semantic guidance loss; S3, training the cross-modal generation model in two stages using the data set; and S4, generating fine point clouds with the trained cross-modal generation model, and merging them into a complete point cloud similar to the real point cloud.
  2. The method for generating cross-modal human body posture data based on image-generated point clouds as claimed in claim 1, wherein the image data comprises a color image acquired by an ordinary RGB camera and a depth image acquired by a depth camera, and the point cloud data is acquired by a lidar sensor.
  3. The method for generating cross-modal human body posture data based on image-generated point clouds as claimed in claim 1, wherein the loss function used for training the global voxel diffusion part of the cross-modal generation model comprises joint supervision of a voxel noise loss constructed with an L2 norm and a semantic guidance loss constructed with a cross-entropy loss.
  4. The method for generating cross-modal human body posture data based on image-generated point clouds as claimed in claim 1, wherein the local point cloud diffusion specifically comprises: first, normalizing the coarse posture voxel distribution and performing point cloud filling so that all voxels with non-zero point cloud density in the posture voxels contain the same number of points; taking a Gaussian noise sphere distribution with a fixed number of points in the unit voxel space as the noise point cloud; then inputting the fixed-point noise point cloud and its corresponding condition features into a noise prediction network based on the PointNet architecture, which takes the three-dimensional coordinates of each point in the point cloud as input and directly predicts the position noise added to each point; iteratively executing the denoising process to gradually recover local point clouds conforming to the real posture distribution; and fusing condition information into the point cloud data during denoising, using posture key points of the color image as the condition information.
  5. The method for generating cross-modal human body posture data based on image-generated point clouds according to claim 4, wherein, in the local point cloud diffusion process, the initial noise distribution variance of each local point cloud is further predicted by a multi-layer perceptron network and serves as prior information for the point cloud diffusion, so that the standard Gaussian noise is replaced and the number of denoising steps is reduced, and a prior loss is used to supervise the prediction of the initial distribution of the local point cloud.
  6. The method for generating cross-modal human body posture data based on image-generated point clouds of claim 5, wherein the loss function used for training the local point cloud diffusion part of the cross-modal generation model comprises joint supervision of a point cloud noise loss constructed with the standard diffusion L2 loss and a prior loss constructed with a KL divergence upper-bound estimate.
  7. A cross-modal human body posture data generation device based on image-generated point clouds, comprising a memory and one or more processors, wherein executable code is stored in the memory, characterized in that, when the processor executes the executable code, the cross-modal human body posture data generation method based on image-generated point clouds according to any one of claims 1-6 is implemented.
  8. A computer-readable storage medium having a program stored thereon, characterized in that the program, when executed by a processor, implements the cross-modal human body posture data generation method based on image-generated point clouds according to any one of claims 1-6.
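Claims 4-6 describe the local point cloud diffusion losses: an L2 noise loss on the PointNet noise predictor plus a prior loss supervising the MLP-predicted initial variance. The numpy sketch below is a toy illustration, not the patented implementation: the PointNet predictor and the MLP prior head are replaced by placeholder values, and a closed-form Gaussian KL stands in for the patent's KL upper-bound estimate.

```python
import numpy as np

def gaussian_kl(sigma_pred, sigma_ref):
    """Closed-form KL(N(0, sigma_pred^2) || N(0, sigma_ref^2)) per dimension."""
    return (np.log(sigma_ref / sigma_pred)
            + sigma_pred**2 / (2.0 * sigma_ref**2) - 0.5)

rng = np.random.default_rng(0)

# A local point cloud inside one unit voxel, and its empirical spread
local_pts = rng.standard_normal((8, 3)) * 0.3
sigma_ref = local_pts.std()

# Placeholder for the MLP prior head of claim 5: a predicted initial noise std
sigma_pred = 0.25

prior_loss = gaussian_kl(sigma_pred, sigma_ref)   # prior loss (KL term, claim 6)

# Placeholder for the PointNet noise predictor of claim 4
eps = rng.standard_normal(local_pts.shape)        # noise actually added
eps_pred = np.zeros_like(eps)                     # stand-in network output
noise_loss = np.mean((eps_pred - eps) ** 2)       # point cloud noise loss (L2)

total = noise_loss + prior_loss                   # joint supervision (claim 6)
```

Because the prior term is a true Gaussian KL it is non-negative and vanishes exactly when the predicted variance matches the reference, which is what makes it usable as a supervision signal for the initial-distribution prediction.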

Description

Cross-modal human body posture data generation method based on image-generated point clouds

Technical Field

The invention relates to a cross-modal data generation method, in particular to a cross-modal human body posture data generation method based on image-generated point clouds.

Background

In recent years, multi-modal human body posture estimation has shown remarkable advantages in precision and robustness by fusing heterogeneous modal data such as images and point clouds. However, constructing a high-quality multi-modal data set is costly; point cloud data in particular has become a key bottleneck limiting model performance, because the acquisition equipment is expensive and acquisition is difficult. Generating point clouds from easily acquired image data has therefore become an effective path to expanding training data at low cost. The core difficulty of cross-modal data generation is the intrinsic semantic gap and structural heterogeneity between modalities. The image is dense structured data, while the point cloud is a sparse, unordered, unstructured three-dimensional spatial sample; the two differ greatly in data structure, density distribution and physical meaning, so direct mapping is prone to geometric distortion or semantic misalignment. Especially in a human body posture estimation scene, the generation model is required both to reproduce the spatial contour of the point cloud and to accurately align the key nodes of the posture representation. The diffusion model provides a new path to solving this problem by virtue of its progressive denoising mechanism and strong condition control capability: its condition injection mechanism allows semantic guidance from different modalities to be fused at different stages, so that the generated content is precisely controlled.
Disclosure of Invention

The invention aims to generate corresponding point cloud data based on image data of human body postures, and provides a cross-modal data generation method. The method adopts a two-stage generation scheme: it first performs global voxel diffusion generation and then performs local point cloud diffusion generation, so that point cloud data carrying posture representation can be effectively generated and used to expand existing human posture data. The aim of the invention is realized by the following technical scheme. The cross-modal human body posture data generation method based on image-generated point clouds comprises the following steps: S1, acquiring a data set of paired images containing human body postures and point clouds, wherein the images comprise color images and depth images; S2, establishing an image-to-point-cloud cross-modal generation model, the model taking the point cloud and the images as input, performing global voxel diffusion on the point cloud with the depth image as condition to generate a coarse posture voxel distribution, and then generating a fine point cloud in each voxel through local point cloud diffusion with the color image as condition; S3, training the cross-modal generation model in two stages using the data set; and S4, generating fine point clouds with the trained cross-modal generation model, and merging them into a complete point cloud similar to the real point cloud. Further, the image data includes a color image acquired by an ordinary RGB camera and a depth image acquired by a depth camera, and the point cloud data is acquired by a lidar sensor.
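The S1-S4 workflow can be sketched end to end. The numpy toy below is a schematic illustration, not the patented implementation: the two `stage*` functions stand in for the trained noise prediction networks, the depth bird's-eye-view features are replaced by a simple voxel-count grid, and the "denoising" updates are deliberately simplistic shrinkage steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def voxelize(points, grid=4):
    """Divide the unit cube into voxels and count points per voxel."""
    idx = np.clip((points * grid).astype(int), 0, grid - 1)
    vox = np.zeros((grid, grid, grid))
    for i, j, k in idx:
        vox[i, j, k] += 1
    return vox

def stage1_voxel_diffusion(cond, steps=10):
    """Stage 1 (toy): denoise noise voxels toward the depth condition."""
    x = rng.standard_normal(cond.shape)            # noise voxels
    for _ in range(steps):                         # schematic reverse diffusion
        x = 0.8 * x + 0.2 * cond
    return np.maximum(x, 0.0)                      # coarse posture voxel distribution

def stage2_local_diffusion(vox, pts_per_voxel=8, steps=10):
    """Stage 2 (toy): generate a fine local point cloud in each occupied voxel."""
    grid = vox.shape[0]
    clouds = []
    for i, j, k in np.argwhere(vox > 0.5):
        p = rng.standard_normal((pts_per_voxel, 3))   # local noise point cloud
        for _ in range(steps):
            p *= 0.7                                  # toy denoising toward the voxel centre
        centre = (np.array([i, j, k]) + 0.5) / grid
        clouds.append(centre + p / grid)              # place local cloud in the global frame
    return np.concatenate(clouds) if clouds else np.empty((0, 3))

# S1: one paired sample (a random stand-in point cloud; its voxel counts
# play the role of the depth-image condition features)
real_points = rng.random((200, 3))
depth_cond = voxelize(real_points)
# S2/S4: two-stage generation, then merge into a complete point cloud
vox = stage1_voxel_diffusion(depth_cond)
cloud = stage2_local_diffusion(vox)
```

The key structural point the sketch preserves is the factorization: stage 1 only decides where density goes at voxel resolution, and stage 2 only refines geometry inside each occupied voxel, so the two networks can be trained separately as in step S3.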
Further, the global voxel diffusion in the cross-modal generation model is implemented with a noise prediction network: first, the three-dimensional space of the input point cloud is divided into voxels; Gaussian noise is sampled over the divided three-dimensional voxel space to obtain noise voxels; the noise voxels are input into the noise prediction network for noise prediction; the noise is gradually removed through the standard reverse diffusion process, iteratively recovering the target posture voxel distribution; and condition information from the depth image is fused into the voxel data during denoising. Further, in the global voxel diffusion process, a posture classification model is also used to predict, in real time, the posture label corresponding to the voxel distribution of the current denoising step, and the loss between the prediction result and the true label is computed with a cross-entropy loss function as a semantic guidance loss. Further, the loss function used for training the global voxel diffusion part of the cross-modal generation model comprises joint supervision of a voxel noise loss constructed with an L2 norm and a semantic guidance loss constructed with a cross-entropy loss. Further, the local point cloud diffusion specifically includes: first, normalizing the coarse posture voxel distribution and performing point cloud filling
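The joint supervision of the global voxel diffusion stage (L2 noise loss plus cross-entropy semantic guidance) can be written out on a toy example. In the numpy sketch below, the noise prediction network and the posture classifier are placeholders (a zero prediction and a random linear head over 5 hypothetical posture classes), and the noise schedule is a simple linear one chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, T=100):
    """Forward diffusion (toy linear schedule): x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    a = 1.0 - t / T
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

def cross_entropy(logits, label):
    """Semantic guidance loss: CE between predicted and true posture label."""
    z = logits - logits.max()                 # numerically stable log-softmax
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

# Toy data: a 4x4x4 voxel grid and a posture label (class index 2 of 5)
x0 = rng.random((4, 4, 4))
x_t, eps = add_noise(x0, t=30)

eps_pred = np.zeros_like(eps)                 # placeholder noise-network output
W = rng.standard_normal((5, 64)) * 0.01       # placeholder posture classifier head

noise_loss = np.mean((eps_pred - eps) ** 2)   # voxel noise loss (L2 norm)
logits = W @ x_t.reshape(-1)                  # classify the current noisy voxels
guide_loss = cross_entropy(logits, label=2)   # semantic guidance loss (CE)
total_loss = noise_loss + 0.1 * guide_loss    # joint supervision (weight assumed)
```

In the patent the classifier is applied to the voxel distribution at each denoising step, so the cross-entropy term steers the reverse process toward voxel layouts that remain recognizable as the target posture; the 0.1 weighting here is an assumption, not a value from the source.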