CN-122023920-A - Point cloud self-supervision learning method based on multi-feature perception and auxiliary reconstruction
Abstract
The invention relates to the technical field of three-dimensional sensing, and in particular to a point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction. The method comprises the following steps: S1, point cloud grouping with a random mask strategy; S2, multi-feature fusion embedding of local tokens; S3, pre-training of the autoencoder backbone network; S4, GaussPoint augmentation in the fine-tuning stage; S5, an auxiliary reconstruction branch in the fine-tuning stage; and S6, multi-task evaluation covering classification and segmentation. By combining a multi-feature-aware fusion embedder, an auxiliary reconstruction branch and the GaussPoint data augmentation method, the invention builds a Point-MAR network model on the masked-autoencoding paradigm, markedly improves feature expression, geometric awareness and generalization capability, and is competitive with several mainstream models on each downstream task.
Inventors
- FENG YUPING
- ZHU YONGPING
- HUO MINGLIANG
- SHAO YOUJIA
- QIN HAOHUA
- ZHANG XIANJUN
- WANG MINGJIA
- GUO LANTIAN
- XUN RUOBING
Assignees
- 青岛科技大学 (Qingdao University of Science and Technology)
Dates
- Publication Date: 20260512
- Application Date: 20260203
Claims (10)
- 1. A point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction, characterized by comprising the following steps: S1, point cloud grouping with a random mask strategy: dividing the point cloud into local point patches using FPS and KNN and applying a high-ratio random mask to strengthen the self-supervised learning signal; S2, multi-feature fusion embedding of local tokens: defining a multi-feature-aware fusion embedder that extracts local semantic, explicit geometric and affine features in parallel and adaptively fuses them through channel attention to enrich the feature representation; S3, pre-training of the autoencoder backbone network: adopting an asymmetric encoder-decoder built from standard Transformer blocks that processes visible tokens, encoded visible tokens and learnable mask tokens, so as to carry out the masked-reconstruction pre-training task; S4, GaussPoint augmentation in the fine-tuning stage: proposing the GaussPoint data augmentation method, which generates smooth geometric deformations from a periodic Gaussian function to improve the robustness and generalization capability of the model; S5, an auxiliary reconstruction branch in the fine-tuning stage: introducing a geometry-aware generator, weighted using relative position encoding and random initialization, that reconstructs the point cloud to strengthen geometric feature learning and alleviate forgetting; S6, multi-task evaluation covering classification and segmentation: the object classification task covers real-world and clean-object dataset classification, the few-shot learning evaluation comprises four settings and reports mean accuracy and standard deviation, and object part segmentation uses the class-average and instance-average intersection over union as evaluation metrics.
- 2. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 1, wherein the point cloud grouping with a random mask strategy of step S1 comprises the following steps: S11, dividing the input point cloud into a series of local point patches through the FPS and KNN algorithms: for an input point cloud X containing N points, sample n points from X with the FPS algorithm as the center point set C, and then, based on C, search the k nearest neighbors of each center point in X with the KNN algorithm to construct the n local point patches P, i.e. C = FPS(X) and P = KNN(X, C), wherein the points inside each local patch are expressed as relative coordinates through a centering operation; S12, applying a high-ratio random mask strategy at the local patch level: the local patches are encoded into a token sequence T by an embedder E, T = E(P); with the mask ratio set to r, T is divided into visible tokens and mask tokens; the visible tokens are fed into the encoder to extract high-level contextual features, and the mask tokens are replaced by a shared learnable mask token that serves as input to the decoder.
- 3. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 2, wherein the multi-feature fusion embedding of local tokens of step S2 comprises the following steps: S21, generating the token representation of each local point patch with the Point-MAE approach, based on a lightweight PointNet encoder that extracts local semantic features through a shared MLP and pooling aggregation; S22, defining the multi-feature-aware fusion embedder on top of the PointNet encoder through the following features: S221, local semantic features: given a point patch, the lightweight PointNet encoder extracts per-point semantic features through a shared MLP convolutional layer with learnable parameters, and the global semantic feature of the patch is then obtained by max-pooling aggregation; S222, explicit geometric features: a geometric mapping network explicitly encodes each center point coordinate, the coordinates of its neighborhood points, their relative displacement and point-wise products of these quantities, generating an explicit geometric feature that injects a local geometric prior into the model; S223, affine transformation features: a geometric affine module standardizes the local feature tensor using its mean and standard deviation (with a small stabilizing factor) and applies the introduced learnable affine parameters, alleviating the distribution inconsistency between local patches and enhancing feature robustness; S224, channel attention fusion: the three types of features are adaptively fused by a channel attention module, in which a fully connected network produces the learned fusion weights, and the multi-feature-aware fusion embedder outputs the fused embedding of every point patch.
- 4. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 3, wherein the pre-training of the autoencoder backbone network of step S3 comprises the following steps: S31, setting up the autoencoder backbone as an asymmetric encoder-decoder built from standard Transformer blocks, wherein the encoder processes only the visible tokens and adds the position embedding of the visible centers in every Transformer block to provide positional information, yielding the encoded visible tokens; S32, the Transformer blocks of the decoder take the encoded visible tokens and the mask tokens as input and add the complete position embedding in every Transformer block to provide positional information for all tokens; in this encoder-decoder structure, with mask ratio r, number of local patches n and embedding dimension d, the encoder operates on the (1-r)·n visible tokens while the decoder operates on all n tokens; S33, the decoder output is turned into the reconstruction target by a prediction head: the prediction head uses a lightweight fully connected (FC) layer to project the features into vectors whose dimension equals the total number of coordinates in a local patch, and the predicted masked local patches are then generated by reshaping.
- 5. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 4, wherein in the pre-training of the autoencoder backbone network of step S3, the model is pre-trained on the ShapeNet dataset; ShapeNet covers 55 object classes and contains 51,300 3D models; 1,024 points are sampled from each model via farthest point sampling (FPS) and divided into 64 point patches of 32 points each; the patches are embedded into local patch features, randomly masked with mask ratio r = 0.6, and then input into the Transformer for modeling; the Encoder consists of 12 Transformer blocks and the Decoder of 4 Transformer blocks, each with feature dimension 384 and 6 self-attention heads; the optimizer is AdamW with an initial learning rate of 0.001 and a weight decay coefficient of 0.05, the learning rate is scheduled with the cosine annealing (CosLR) strategy, and data augmentation in the pre-training stage uses only random translation.
- 6. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 4, wherein the GaussPoint augmentation in the fine-tuning stage of step S4 comprises the following steps: S41, GaussPoint applies a smooth, periodic geometric deformation to the input point cloud, using a periodic Gaussian function as a residual mapping while maintaining topological consistency; the periodic function is parameterized by a Gaussian kernel width that controls the smoothness of the deformation, a function period, and a truncation range set to balance computational efficiency against geometric continuity; S42, global deformation GaussPoint-SGL: the point cloud is normalized into the unit sphere, and a single periodic Gaussian function deforms the whole cloud globally; given the input point cloud, the augmented cloud is obtained by adding the residual deformation, parameterized by a deformation amplitude, a frequency factor, and a randomly generated phase shift; S43, local deformation GaussPoint-MUL: local geometric deformation of the point cloud is realized through a multi-anchor weighting mechanism that simulates small geometric variations of the object surface; M anchor points are selected by farthest point sampling or random sampling, the average offset of each point to all anchor points is computed, and a local displacement field is constructed from these offsets; S44, adaptive smoothing mechanism: GaussPoint adjusts the deformation amplitude and the Gaussian kernel width according to the standard deviation of the input point cloud, ensuring that the deformation remains smooth and proportional to the geometric scale.
- 7. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 6, wherein in the GaussPoint augmentation of the fine-tuning stage of step S4, the fine-tuning model reuses the multi-feature-aware fusion embedder and the encoder from the pre-training stage, a classification/segmentation head is added after the encoder, the auxiliary reconstruction branch is introduced, and the fine-tuning model performs feature extraction and discrimination for the classification or segmentation task.
- 8. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 6, wherein the auxiliary reconstruction branch in the fine-tuning stage of step S5 comprises the following steps: S51, introducing a geometry-aware generator as the auxiliary reconstruction branch in the fine-tuning stage; the generator is built on a Transformer Decoder architecture and takes a relative position encoding (RPE) as input, modeling the relative directional cues between groups to capture the geometric-topological relations among point cloud groups; for the set of group centers, where each element is the center coordinate of one group and the set size is the number of groups, the RPE captures the relative geometry of the point cloud by computing the normalized direction vector between adjacent group centers; the direction of the i-th group relative to the j-th group is defined as the difference of their center coordinates divided by its norm plus a small constant that avoids division by values near zero, while the first group uses its absolute center coordinates; S52, the generator takes the output features of the encoder together with the RPE as input, and the generator output is finally mapped to point cloud coordinates by a simple prediction head.
- 9. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 8, wherein in the multi-task evaluation covering classification and segmentation of step S6, the optimization objective of the fine-tuning model is a weighted combination of the downstream task loss and the reconstruction loss, the latter computed between the predicted and real point sets, with a balancing hyperparameter weighting the contribution of each loss.
- 10. The point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction according to claim 9, wherein in the multi-task evaluation covering classification and segmentation of step S6, the downstream tasks used to assess model performance include: object classification, covering a real-world dataset and a clean-object dataset, the real-world dataset comprising the OBJ_BG, OBJ_ONLY and PB_T50_RS variants; few-shot learning, with the dataset organized in n-way, m-shot episodes, where n is the number of categories randomly drawn from the category set and m is the number of samples randomly selected per category; the experiments cover four settings, namely {5-way, 10-shot}, {5-way, 20-shot}, {10-way, 10-shot} and {10-way, 20-shot}, with 10 independent runs per setting and the mean accuracy and standard deviation reported; and object part segmentation, using a lightweight segmentation head consistent with Point-MAE and reporting the mean intersection over union over all instances (instance mIoU) and over all categories (class mIoU).
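The grouping-and-masking pipeline of steps S11 and S12 in claims 1 and 2 can be sketched as follows. This is a minimal NumPy version for illustration only; the patch count, neighborhood size and seeds are illustrative stand-ins, with the mask ratio 0.6 taken from claim 5.

```python
import numpy as np

def farthest_point_sample(points, n_centers, seed=0):
    """Greedy farthest point sampling: pick n_centers points spread
    across the cloud. points: (N, 3) array."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]
    dist = np.full(n, np.inf)
    for _ in range(n_centers - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def group_and_mask(points, n_centers=8, k=16, mask_ratio=0.6, seed=0):
    """Split the cloud into n_centers local patches of k nearest
    neighbours, centre each patch on its centre point (relative
    coordinates), and randomly mask a fraction of the patches."""
    centers = points[farthest_point_sample(points, n_centers, seed)]   # (G, 3)
    d = np.linalg.norm(points[None] - centers[:, None], axis=-1)       # (G, N)
    knn_idx = np.argsort(d, axis=1)[:, :k]                             # (G, k)
    patches = points[knn_idx] - centers[:, None, :]                    # relative coords
    rng = np.random.default_rng(seed)
    n_mask = int(round(mask_ratio * n_centers))
    mask = np.zeros(n_centers, dtype=bool)
    mask[rng.choice(n_centers, n_mask, replace=False)] = True
    return patches, centers, mask

pts = np.random.default_rng(1).normal(size=(256, 3))
patches, centers, mask = group_and_mask(pts)
print(patches.shape, mask.sum())   # (8, 16, 3) with ~60% of the patches masked
```

Because each center is itself a cloud point, the first relative coordinate of every patch is the zero vector, which is the centering operation the claim describes.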
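The channel attention fusion of step S224 in claim 3 can be illustrated as below. The patent does not detail its fully connected network, so this sketch assumes a single linear scoring layer followed by a softmax; `w` and `b` are hypothetical learned parameters.

```python
import numpy as np

def channel_attention_fuse(f_sem, f_geo, f_aff, w, b):
    """Adaptive fusion of the three feature branches (semantic,
    geometric, affine): a fully connected layer scores the branches
    from the concatenated descriptor, a softmax turns the scores into
    fusion weights, and the fused embedding is the weighted sum."""
    stacked = np.stack([f_sem, f_geo, f_aff], axis=0)          # (3, C)
    scores = np.concatenate([f_sem, f_geo, f_aff]) @ w + b     # (3,)
    scores -= scores.max()                                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()             # softmax fusion weights
    fused = (weights[:, None] * stacked).sum(axis=0)            # (C,)
    return fused, weights

f_sem = np.array([1.0, 0.0, 0.0, 0.0])
f_geo = np.array([0.0, 1.0, 0.0, 0.0])
f_aff = np.array([0.0, 0.0, 1.0, 0.0])
# with zero scoring parameters the weights are uniform and the fusion is a plain mean
fused, weights = channel_attention_fuse(f_sem, f_geo, f_aff, np.zeros((12, 3)), np.zeros(3))
```

With trained, non-zero parameters the weights shift toward whichever branch is most informative for the current patch, which is the adaptive behavior the claim attributes to the module.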
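The global deformation GaussPoint-SGL of step S42 in claim 6 can be sketched as follows. The patent's exact periodic Gaussian is not reproduced here; as a stand-in, a Gaussian bump is made periodic via a sine argument, which is smooth and periodic as claim 6 requires. All parameter values and names (`amplitude`, `freq`, `sigma`) are illustrative, not the patent's.

```python
import numpy as np

def gausspoint_global(points, amplitude=0.05, freq=3.0, sigma=0.3, seed=0):
    """Global smooth deformation: normalise the cloud into the unit
    sphere (as S42 specifies), then add a bounded periodic-Gaussian
    residual with a randomly generated phase shift."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0, 2 * np.pi, size=3)      # random phase shift per axis
    centred = points - points.mean(axis=0)
    unit = centred / np.linalg.norm(centred, axis=1).max()
    t = np.sin(freq * unit + phase)                # periodic argument
    residual = amplitude * np.exp(-t**2 / (2 * sigma**2))
    return unit + residual                          # residual mapping

pts = np.random.default_rng(2).normal(size=(100, 3))
deformed = gausspoint_global(pts)
```

The residual is bounded by the amplitude, so the deformation perturbs the geometry without tearing it, consistent with the topology-consistency requirement of S41.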
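The weighted fine-tuning objective of claim 9 can be written out as below. The claim names neither loss explicitly, so this sketch assumes cross entropy for the downstream task loss and symmetric Chamfer distance (the usual Point-MAE choice) for the reconstruction loss between predicted and real point sets.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets (an assumed
    choice of reconstruction metric): mean nearest-neighbour distance
    in both directions."""
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def finetune_loss(logits, label, pred_pts, true_pts, lam=0.1):
    """Claim 9 objective: downstream loss + lam * reconstruction loss,
    with lam the balancing hyperparameter."""
    z = logits - logits.max()                          # stable log-softmax
    ce = -z[label] + np.log(np.exp(z).sum())           # cross entropy
    return ce + lam * chamfer_distance(pred_pts, true_pts)

pts = np.random.default_rng(3).normal(size=(20, 3))
# with a perfect reconstruction the objective reduces to the cross entropy alone
loss = finetune_loss(np.zeros(4), 0, pts, pts, lam=0.5)
```

Keeping the reconstruction term active during fine-tuning is what the patent credits with alleviating the forgetting of the pre-trained representation.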
Description
Point cloud self-supervision learning method based on multi-feature perception and auxiliary reconstruction
Technical Field
The invention relates to the technical field of three-dimensional sensing, and in particular to a point cloud self-supervised learning method based on multi-feature perception and auxiliary reconstruction.
Background
With the rapid development of three-dimensional sensing technology, point cloud data has become increasingly easy to acquire. As an important data form for representing three-dimensional geometric structure, the point cloud is widely applied in fields such as autonomous driving, robot navigation, smart cities and healthcare. Unlike 2D images, point cloud data is unordered, sparse and irregular, which makes deep neural network processing and effective extraction of point cloud features highly challenging. Some works project the point cloud into multi-view 2D images and extract features with mature 2D convolutional networks, or discretize the point cloud into a fixed-resolution three-dimensional voxel grid and apply a 3D convolutional neural network (CNN) to the volumetric representation for shape classification. However, projection to 2D images or voxelization may lose geometric information and incurs high computational cost. PointNet processes the raw point cloud directly: it extracts a global feature from all points with max pooling, concatenates the global feature with each point's local feature in the segmentation task, and makes point-wise predictions through a multi-layer perceptron (MLP). This sparked extensive research on point-based point cloud feature extraction methods.
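The PointNet pipeline described above (a shared per-point MLP followed by max pooling) can be sketched as follows; the two weight matrices are hypothetical placeholders for trained parameters.

```python
import numpy as np

def pointnet_global_feature(points, w1, w2):
    """Minimal PointNet-style encoder: a shared two-layer MLP with
    ReLU applied to every point independently, then max pooling over
    the point dimension. The max pool makes the global feature
    invariant to the ordering of the input points."""
    h = np.maximum(points @ w1, 0)    # shared MLP layer 1, per point
    h = np.maximum(h @ w2, 0)         # shared MLP layer 2, per point
    return h.max(axis=0)              # order-invariant global feature

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 32))
pts = rng.normal(size=(128, 3))
g = pointnet_global_feature(pts, w1, w2)
```

Permuting the input points leaves `g` unchanged, which is exactly the property that lets PointNet handle the disorder of point clouds noted in the background.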
Although the rapid development of three-dimensional sensing has markedly reduced the cost and raised the efficiency of point cloud acquisition, high-quality, large-scale point cloud datasets remain relatively scarce compared with the massive, mature two-dimensional image datasets. Meanwhile, the disorder and sheer volume of point clouds make annotation difficult and manual labeling expensive, which greatly restricts the spread of supervised learning on point cloud tasks and has pushed researchers toward semi-supervised and self-supervised learning on point clouds, mining the latent representations and semantic information of the point cloud while reducing the dependence on labeled data. In recent years, point cloud self-supervised learning (SSL) methods based on the Transformer masked-autoencoding (MAE) paradigm have gradually become a research hotspot. Point-MAE, for example, uses an asymmetric Transformer encoder-decoder architecture to effectively capture long-range dependencies in the point cloud and learn a robust global context representation. It adopts a pre-training paradigm of dividing the point cloud into irregular point patches and reconstructing a high-ratio masked region, learning latent feature representations from large amounts of unlabeled point cloud data and markedly improving downstream task performance, which fully demonstrates the potential and effectiveness of the standard Transformer in three-dimensional point cloud modeling. Despite the remarkable success of Point-MAE, the following three shortcomings remain.
First, geometric priors are under-used in the feature extraction stage: only local semantic features are extracted with PointNet, and the patch tokens are then modeled globally by a standard Transformer architecture, whose self-attention captures long-range dependencies but lacks effective modeling of the explicit geometric structure of local point cloud neighborhoods. Moreover, because of the inherent sparsity and irregularity of point cloud data, feature distributions differ greatly between local patches, so a single feature channel cannot fully capture the rich geometric details and structural relations in the point cloud, limiting the model's representational depth. Second, in the fine-tuning stage there is no explicit constraint linking the pre-training task and the downstream task; the weak relevance between the downstream task and the pre-training pretext task is ignored, and usually only the cross-entropy loss of the downstream task is optimized, so the latent task relevance of the pre-training stage is not exploited, the model easily overfits the limited labeled data, and the general representations learned during pre-training are forgotten. In particular, point clouds in different datasets and ta