CN-122023197-A - Multi-modal point cloud completion method and system
Abstract
The invention relates to the technical field of three-dimensional computer vision and point cloud processing, and in particular discloses a multi-modal point cloud completion method and system. The method renders a single-modality incomplete point cloud into multi-view depth maps and generates a corresponding text description; repairs the depth maps with a diffusion model conditioned on the text; extracts features of the point cloud, the repaired images, and the text in parallel with pre-trained, aligned encoders; fuses text semantic guidance with image geometric enhancement through a fusion network that combines cross-attention and a projection mechanism to obtain global fusion features; generates a seed point cloud from these features and applies hierarchical upsampling under cross-modal constraints; and outputs a high-quality complete point cloud.
Inventors
- ZHOU FENG
- LIU SHIBO
- LI JIN
- LIU JIE
Assignees
- North China University of Technology (北方工业大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (10)
- 1. A multi-modal point cloud completion method, comprising: rendering the input single-modality incomplete point cloud from at least two orthogonal view angles through a virtual camera to generate corresponding two-dimensional incomplete depth maps, and generating a corresponding text description based on the semantic category of the incomplete point cloud; using the generated text description as a control condition, repairing the incomplete depth maps with a diffusion-model-based repair model, and outputting repaired depth maps containing the geometric outline of the complete object; extracting, in parallel, point cloud global features of the incomplete point cloud, image features of the repaired depth maps, and text semantic features of the text description, using a point cloud encoder, an image encoder, and a text encoder that are pre-trained and aligned across modal feature spaces; fusing the point cloud global features, the image features, and the text semantic features through a fusion network to obtain global fusion features, wherein the fusion process comprises applying semantic guidance to the point cloud global features via a cross-attention mechanism using the text semantic features, and applying geometric enhancement to the point cloud global features by projecting the image features into three-dimensional space using camera view parameters; inputting the global fusion features to a seed point cloud generation module based on a multi-layer perceptron to generate a sparse seed point cloud representing the basic topological structure of the object; and inputting the sparse seed point cloud into a hierarchical upsampling module for hierarchical upsampling, gradually increasing the point cloud density and refining geometric details, wherein at least one upsampling module integrates a cross-modal reconstruction transformer used to reference the image features of the repaired depth maps to constrain the geometric structure of the generated points during upsampling, and finally outputting the completed three-dimensional point cloud.
- 2. The multi-modal point cloud completion method of claim 1, wherein the at least two orthogonal view angles include a front view angle, a side view angle, and a top view angle.
- 3. The multi-modal point cloud completion method of claim 1, wherein the diffusion model-based repair model is a fine-tuned ControlNet model.
- 4. The multi-modal point cloud completion method of claim 1, wherein the pre-trained point cloud encoder, image encoder, and text encoder aligned across modal feature spaces belong to the ULIP family of models.
- 5. The multi-modal point cloud completion method of claim 1, wherein, when the fusion network performs feature fusion, cross-attention is first computed using the point cloud global features as the query vectors and the text semantic features as the key and value vectors, so as to obtain semantically enhanced point cloud features.
- 6. The multi-modal point cloud completion method of claim 5, wherein, when the fusion network performs feature fusion, a correspondence between two-dimensional image pixels and three-dimensional point cloud coordinates is then established using the camera view parameters applied when generating the incomplete depth maps, and the image features are projected into three-dimensional space through a cross-attention mechanism and fused with the semantically enhanced point cloud features to obtain image-guided point cloud features.
- 7. The multi-modal point cloud completion method of claim 6, wherein, when the fusion network performs feature fusion, the point cloud global features, the semantically enhanced point cloud features, and the image-guided point cloud features are finally concatenated, and global dependencies are modeled through a self-attention module to output the global fusion features.
- 8. The multi-modal point cloud completion method of claim 1, wherein the hierarchical upsampling module performs 1×, 4×, and 8× upsampling in sequence.
- 9. A multi-modal point cloud completion system, comprising: a multi-modal data preprocessing module for rendering the input single-modality incomplete point cloud into multi-view incomplete depth maps and generating a corresponding text description; a depth map repair module for repairing the incomplete depth maps with a diffusion-model-based repair model, using the text description as a control condition, and outputting repaired depth maps; a cross-modal feature extraction and fusion module for extracting point cloud, image, and text features and fusing them to obtain global fusion features, wherein, during fusion, semantic guidance is applied to the point cloud features using the text features and geometric enhancement is applied to the point cloud features using the image features; and a two-stage point cloud reconstruction module comprising a seed point cloud generation module and a hierarchical upsampling module, wherein the seed point cloud generation module generates a sparse seed point cloud from the global fusion features, the hierarchical upsampling module performs hierarchical upsampling on the sparse seed point cloud while referencing the image features of the repaired depth maps through a cross-modal reconstruction transformer, and the completed point cloud is finally output.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the multi-modal point cloud completion method as claimed in any one of claims 1 to 8.
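The rendering step in claims 1 and 2 (projecting the incomplete point cloud to depth maps from orthogonal front, side, and top views) can be sketched with a simple orthographic z-buffer. This is a minimal NumPy illustration, not the patent's implementation: it assumes points normalized to the unit cube, and the function name and resolution are invented for the example.

```python
import numpy as np

def render_depth_map(points, view, resolution=64):
    """Render a point cloud to a depth map from one orthogonal view.

    `view` selects the axis looked along: 'front' keeps (x, y) and stores z,
    'side' keeps (z, y) and stores x, 'top' keeps (x, z) and stores y.
    Points are assumed normalized to [0, 1]^3.
    """
    axes = {'front': (0, 1, 2), 'side': (2, 1, 0), 'top': (0, 2, 1)}
    u_ax, v_ax, d_ax = axes[view]
    depth = np.full((resolution, resolution), np.inf)
    # Rasterize each point to a pixel; keep the nearest depth (z-buffer).
    u = np.clip((points[:, u_ax] * (resolution - 1)).astype(int), 0, resolution - 1)
    v = np.clip((points[:, v_ax] * (resolution - 1)).astype(int), 0, resolution - 1)
    np.minimum.at(depth, (v, u), points[:, d_ax])
    depth[np.isinf(depth)] = 0.0  # empty pixels become background
    return depth
```

Each view simply drops one coordinate and keeps the nearest point per pixel; a real virtual-camera renderer would add intrinsics, extrinsics, and point splatting.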
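The fusion order in claims 5 to 7 (text-guided cross-attention, then image-guided cross-attention, then concatenation with self-attention) can be illustrated with single-head, projection-free attention. This is a toy sketch under strong simplifying assumptions: no learned query/key/value projections, no multi-head splitting, and the function names are hypothetical, not from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: query (Nq, d), key/value (Nk, d).
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def fuse(pc_feat, text_feats, img_feats):
    """pc_feat: (1, d) point cloud global feature; text_feats: (Nt, d)
    text token features; img_feats: (Ni, d) projected image features."""
    # Claim 5: semantic guidance -- the point cloud feature queries the text tokens.
    sem = cross_attention(pc_feat, text_feats, text_feats)
    # Claim 6: geometric enhancement -- query the camera-projected image features.
    img_guided = cross_attention(sem, img_feats, img_feats)
    # Claim 7: concatenate the three features, model global dependencies with
    # self-attention, and pool into one global fusion feature.
    stacked = np.concatenate([pc_feat, sem, img_guided], axis=0)   # (3, d)
    return cross_attention(stacked, stacked, stacked).mean(axis=0)  # (d,)
```

In the patent, each attention stage would carry learned projections and the image features would be projected into 3D space first; here the stages only demonstrate the claimed query/key/value roles.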
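Claim 8's hierarchical upsampling schedule (1×, then 4×, then 8×) compounds to a 32× density increase over the seed cloud. The sketch below reproduces only the point-count bookkeeping, replacing the patent's learned, image-constrained offsets and cross-modal reconstruction transformer with random jitter; all names are illustrative.

```python
import numpy as np

def upsample(points, factor, noise_scale=0.02, rng=None):
    """Split each point into `factor` children with small random offsets
    (a stand-in for the patent's learned, image-constrained offsets)."""
    if rng is None:
        rng = np.random.default_rng(0)
    children = np.repeat(points, factor, axis=0)
    return children + rng.normal(scale=noise_scale, size=children.shape)

seeds = np.zeros((128, 3))   # sparse seed point cloud from the MLP stage
pc = seeds
for factor in (1, 4, 8):     # claim 8: 1x, 4x, 8x upsampling in sequence
    pc = upsample(pc, factor)
# total density increase: 1 * 4 * 8 = 32x, i.e. 128 -> 4096 points
```

The 1× stage changes no point count; in the patent it presumably still refines point positions, which the jitter here only caricatures.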
Description
Multi-modal point cloud completion method and system
Technical Field
The invention relates to the technical field of three-dimensional computer vision and point cloud processing, and in particular discloses a multi-modal point cloud completion method and system.
Background
In three-dimensional vision tasks, point cloud data acquired by sensors such as LiDAR often have large-area defects due to occlusion, distance, or material properties, forming incomplete point clouds that directly affect downstream applications such as recognition and reconstruction. Existing point cloud completion methods mainly rely on the geometric information of a single modality (the point cloud itself); when the missing region is large, they lack sufficient context and semantic priors to generate a reasonable and accurate complete geometry, and easily produce blurred, distorted, or topologically incorrect results. Although some research has attempted to introduce multi-view images as a supplement, how to effectively fuse features from different modalities (especially two-dimensional visual information containing complete geometric cues) and make them impose accurate semantic and geometric constraints on the point cloud completion process in three-dimensional space remains an open technical problem.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a multi-modal point cloud completion method and system. The method overcomes the shortage of geometric information in single-modality point cloud completion under large-area missing conditions, and achieves more reasonable and accurate three-dimensional completion of incomplete point clouds by introducing and jointly exploiting text semantic guidance and the fine geometric priors of repaired depth images.
The invention provides a multi-modal point cloud completion method, which comprises the following steps: rendering the input single-modality incomplete point cloud from at least two orthogonal view angles through a virtual camera to generate corresponding two-dimensional incomplete depth maps, and generating a corresponding text description based on the semantic category of the incomplete point cloud; using the generated text description as a control condition, repairing the incomplete depth maps with a diffusion-model-based repair model, and outputting repaired depth maps containing the geometric outline of the complete object; extracting, in parallel, point cloud global features of the incomplete point cloud, image features of the repaired depth maps, and text semantic features of the text description, using a point cloud encoder, an image encoder, and a text encoder that are pre-trained and aligned across modal feature spaces; fusing the point cloud global features, the image features, and the text semantic features through a fusion network to obtain global fusion features, wherein the fusion process comprises applying semantic guidance to the point cloud global features via a cross-attention mechanism using the text semantic features, and applying geometric enhancement to the point cloud global features by projecting the image features into three-dimensional space using camera view parameters; inputting the global fusion features to a seed point cloud generation module based on a multi-layer perceptron to generate a sparse seed point cloud representing the basic topological structure of the object; and inputting the sparse seed point cloud into a hierarchical upsampling module for hierarchical upsampling, gradually increasing the point cloud density and refining geometric details, wherein at least one upsampling module integrates a cross-modal reconstruction transformer used to reference the image features of the repaired depth maps to constrain the geometric structure of the generated points during upsampling, and finally outputting the completed three-dimensional point cloud.
The invention further provides a multi-modal point cloud completion system, which comprises: a multi-modal data preprocessing module for rendering the input single-modality incomplete point cloud into multi-view incomplete depth maps and generating a corresponding text description; a depth map repair module for repairing the incomplete depth maps with a diffusion-model-based repair model, using the text description as a control condition, and outputting repaired depth maps; and a cross-modal feature extraction and fusion module for extracting point cloud, image, and text features and fusing them to obtain global fusion features, wherein, during fusion, semantic guidance is applied to the point cloud features using the text features and geometric enhancement is applied to the point cloud features using the image features.
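The description's correspondence between two-dimensional image pixels and three-dimensional point coordinates via camera view parameters reduces, for a pinhole camera, to back-projecting each valid depth pixel into camera-frame 3D coordinates. A minimal sketch, assuming a standard pinhole model with hypothetical intrinsics fx, fy, cx, cy (the patent does not specify a camera model):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map to camera-frame 3D points via the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth[v, u].
    Zero-depth (background) pixels are dropped."""
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3)
```

This inverse mapping is what lets image features at pixel (u, v) be attached to the 3D locations they constrain during fusion and upsampling.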
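The seed point cloud generation module "based on a multi-layer perceptron" can be sketched as a two-layer perceptron mapping the global fusion feature to num_seeds × 3 coordinates. The dimensions and randomly initialized weights below are illustrative stand-ins for the patent's learned parameters:

```python
import numpy as np

def mlp_seed_generator(global_feat, weights, num_seeds=128):
    """Map a global fusion feature to a sparse seed point cloud with a
    two-layer perceptron (ReLU hidden layer), reshaped to (num_seeds, 3)."""
    W1, b1, W2, b2 = weights
    hidden = np.maximum(global_feat @ W1 + b1, 0.0)  # hidden layer + ReLU
    out = hidden @ W2 + b2                           # (num_seeds * 3,)
    return out.reshape(num_seeds, 3)

# Randomly initialized stand-in weights for a 256-d fusion feature.
rng = np.random.default_rng(0)
d, h, n = 256, 512, 128
weights = (rng.normal(size=(d, h)) * 0.01, np.zeros(h),
           rng.normal(size=(h, n * 3)) * 0.01, np.zeros(n * 3))
seeds = mlp_seed_generator(rng.normal(size=d), weights, n)
```

In training, the seed cloud would be supervised against the ground-truth shape before the hierarchical upsampling stages densify it.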