CN-116310310-B - Point cloud semantic segmentation method and device
Abstract
The invention discloses a point cloud semantic segmentation method and device. The method comprises: collecting point cloud data in a scene as a dataset; taking 3D sparse convolution as the encoder-decoder and designing an inter-frame local attention module based on Point Transformer to construct a semantic segmentation network that fuses time-sequence information; training the semantic segmentation network on the obtained dataset to obtain a model for semantic segmentation; and carrying out semantic segmentation on the point cloud data to be segmented with the trained model. The invention can handle large-scale outdoor scenes; the added inter-frame local attention module alleviates the problems of occluded objects and low segmentation accuracy for rare categories, and avoids the interference of useless information that a generic Transformer introduces through its global features.
Inventors
- LIU ERYUN
- XU JINGWEI
Assignees
- ZHEJIANG UNIVERSITY (浙江大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20221227
Claims (5)
- 1. A point cloud semantic segmentation method, characterized by comprising the following steps: step 1) collecting point cloud data in a scene as a dataset through data collection equipment; step 2) taking 3D sparse convolution as the encoder-decoder, designing an inter-frame local attention module based on Point Transformer, and constructing a semantic segmentation network structure fusing time-sequence information, wherein the point cloud semantic segmentation network fusing time-sequence information comprises a data preprocessing module, a feature encoding module, an inter-frame local attention module, a feature decoding module and a post-processing module which are connected in sequence; step 3) training the semantic segmentation network constructed in step 2) on the dataset obtained in step 1) to obtain a model for semantic segmentation; step 4) carrying out semantic segmentation on the point cloud data to be segmented with the semantic segmentation model. The feature encoding module and the feature decoding module in step 2) are designed based on the Encoder-Decoder structure of the UNet network, with the inter-frame local attention module arranged between them; the output of the data preprocessing module is input into the feature encoding module to obtain a feature vector F_t of the current frame and a feature vector F_{t-1} of the previous frame; F_t and F_{t-1} are input into the inter-frame local attention module, which outputs a feature vector F_t''; F_t'' then passes through the feature decoding module into the post-processing module. The inter-frame local attention module in step 2) consists of two linear layers and an inter-frame local attention layer, wherein the inter-frame local attention layer comprises three linear learning layers, two neighborhood searching layers, a position coding layer and a learnable weight layer. The inter-frame local attention module finds the k-neighborhood of the previous frame corresponding to each non-empty voxel of the current frame by utilizing the index positions of the two frames, and then updates the features of the current frame through a weight matrix, specifically: 1) after the feature vector F_t of the current frame and the feature vector F_{t-1} of the previous frame are input into the first linear layer, the feature vectors F_t' and F_{t-1}' are obtained respectively; 2) linear learning layers: F_t' is linearly transformed by the first linear learning layer to obtain the Q vector, and F_{t-1}' is linearly transformed by the second and third linear learning layers to obtain the K and V vectors respectively, where Q ∈ R^(n2×C), K ∈ R^(n1×C), V ∈ R^(n1×C); n1 and n2 denote the numbers of non-empty voxels of the previous frame and the current frame, and C denotes the number of feature channels; 3) neighborhood searching layers: the K vector undergoes the neighborhood search of the first neighborhood searching layer to obtain the k-neighborhood of the previous frame corresponding to each non-empty voxel of the current frame, outputting a feature vector K' and relative positions p_r; the V vector undergoes the neighborhood search of the second neighborhood searching layer likewise, outputting a feature vector V'; here K' ∈ R^(n2×k×C), V' ∈ R^(n2×k×C), and p_r ∈ R^(n2×k×3) is the relative position of the k neighbors with respect to the center point; 4) position coding layer: the relative positions p_r output by the first neighborhood searching layer are input into the position coding layer, which unifies the feature dimension of p_r to obtain p_r' ∈ R^(n2×k×C); 5) learnable weight layer: it consists of an MLP and a softmax function, the MLP comprising two linear layers and a ReLU nonlinearity; its inputs are the vectors Q, K' and p_r'; the relevance between the neighborhood and the center point is learned by the MLP, and the weight matrix is then obtained through the softmax function; 6) feature updating layer: the output p_r' of the position coding layer is combined with the feature vector V' and multiplied by the weight matrix output by the learnable weight layer to obtain the updated non-empty voxel features; 7) the updated features are input into the second linear layer, and the output of the second linear layer, F_t'', is the output of the inter-frame local attention module.
- 2. The point cloud semantic segmentation method according to claim 1, characterized in that step 1) comprises: firstly determining the semantic segmentation categories according to the task requirements; then acquiring different areas of the same scene with a laser radar and a camera to obtain a plurality of sequences, each sequence comprising point cloud data acquired by the laser radar and images acquired by the camera; and finally labeling the categories of the point cloud data according to the correspondence between point cloud points and image pixels established from the camera's intrinsic and extrinsic parameters, obtaining a ground-truth label for each point and thereby completing the acquisition of the dataset.
- 3. The method according to claim 1, wherein in step 2) the data preprocessing module comprises a preprocessing operation, an MLP layer and a cylindrical voxelization layer; the preprocessing operation converts the Cartesian coordinates of the input point cloud into cylindrical coordinates and then computes, from manually set hyper-parameters, the voxel index to which each point belongs and the initial feature of each point; the point clouds of the current frame and of the previous frame in the sequence are input into the data preprocessing module for voxelization, which outputs the features and indexes of each non-empty voxel of the current frame and the previous frame.
- 4. The point cloud semantic segmentation method according to claim 1, wherein the post-processing module in step 2) comprises a DDCM module, a de-voxelization module and a point-by-point refinement module which are connected in sequence; the output of the feature decoding module is input into the DDCM module and then de-voxelized by the de-voxelization module; the output of the de-voxelization module and the output of the MLP layer in the data preprocessing module are added and input into the point-by-point refinement module to learn features point by point, yielding a fine-grained segmentation result.
- 5. A point cloud semantic segmentation device, characterized by comprising a 3D sensing device and a computing device; the 3D sensing device comprises a laser radar sensor and a camera, the laser radar sensor being used for acquiring point cloud data of a scene; the computing device comprises a memory and a processor, the memory storing a computer program that runs on the processor, and the processor implements the steps of the point cloud semantic segmentation method of any one of claims 1 to 4 when the computer program is executed.
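As a concrete reading of the inter-frame local attention layer in claim 1 (steps 2 to 6), a forward pass can be sketched in NumPy. This is a hedged illustration, not the patent's implementation: random matrices stand in for the learned linear layers and for the position-encoding MLP, the two-layer relevance MLP is replaced by a simple channel sum before the softmax, and the first and second linear layers surrounding the attention layer are omitted. All function and variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_frame_local_attention(F_t, F_prev, pos_t, pos_prev, k, rng):
    """Sketch of the inter-frame local attention layer.

    F_t:      (n2, C) features of non-empty voxels, current frame
    F_prev:   (n1, C) features of non-empty voxels, previous frame
    pos_t:    (n2, 3) voxel centre positions, current frame
    pos_prev: (n1, 3) voxel centre positions, previous frame
    """
    n2, C = F_t.shape
    # Random projections standing in for the three linear learning layers.
    Wq = rng.standard_normal((C, C)) / np.sqrt(C)
    Wk = rng.standard_normal((C, C)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)
    Q = F_t @ Wq                     # (n2, C)
    K = F_prev @ Wk                  # (n1, C)
    V = F_prev @ Wv                  # (n1, C)

    # Neighborhood search: k nearest previous-frame voxels per current voxel.
    d = np.linalg.norm(pos_t[:, None, :] - pos_prev[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]           # (n2, k)
    K_n = K[idx]                                 # (n2, k, C) -> K'
    V_n = V[idx]                                 # (n2, k, C) -> V'
    p_r = pos_prev[idx] - pos_t[:, None, :]      # (n2, k, 3) relative positions

    # Position coding: lift p_r to C channels (a linear map here; an MLP in the patent).
    Wp = rng.standard_normal((3, C)) / np.sqrt(3.0)
    p_enc = p_r @ Wp                             # (n2, k, C) -> p_r'

    # Learnable weight layer stand-in: relevance of each neighbor, softmax over k.
    scores = (Q[:, None, :] - K_n + p_enc).sum(-1)   # (n2, k)
    w = softmax(scores, axis=1)

    # Feature update: weighted sum of position-encoded neighbor values.
    return (w[..., None] * (V_n + p_enc)).sum(axis=1)    # (n2, C)
```

Because the search is restricted to each voxel's k previous-frame neighbors, the attention stays local, which is the mechanism the claims credit with avoiding the useless global information of a generic Transformer.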
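The cylindrical voxelization of claim 3 (Cartesian coordinates converted to cylindrical coordinates, then bucketed into voxels by manually set hyper-parameters) can be sketched as follows. The grid resolution and range bounds are illustrative assumptions, not values from the patent.

```python
import numpy as np

def cylindrical_voxel_index(points, rho_max, z_min, z_max, grid=(480, 360, 32)):
    """Map Cartesian points to cylindrical voxel indices.

    points: (N, 3) x, y, z in the lidar frame
    grid:   voxel counts along (rho, phi, z) -- illustrative hyper-parameters
    Returns (N, 3) integer voxel indices and the (N, 3) cylindrical coordinates.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.hypot(x, y)             # radial distance from the sensor axis
    phi = np.arctan2(y, x)           # azimuth in [-pi, pi)
    cyl = np.stack([rho, phi, z], axis=1)

    lo = np.array([0.0, -np.pi, z_min])
    hi = np.array([rho_max, np.pi, z_max])
    n = np.array(grid)
    idx = np.floor((cyl - lo) / (hi - lo) * n).astype(int)
    idx = np.clip(idx, 0, n - 1)     # clamp out-of-range points to edge voxels
    return idx, cyl
```

A cylindrical rather than cubic partition keeps the voxel density roughly matched to the radially sparse lidar sweep, which is why such grids are common for outdoor scenes.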
Description
Point cloud semantic segmentation method and device

Technical Field

The invention relates to a laser radar semantic segmentation method and device for autonomous driving scenes, and in particular to a point cloud semantic segmentation method and device that fuses time-sequence information based on Point Transformer.

Background

Because of the sparsity of point clouds, single-frame point cloud semantic segmentation methods perform poorly on rare categories, yet time-sequence information is still little used in point cloud semantic segmentation. Existing approaches either treat time as an extra dimension and process 4D occupancy grids with sparse 4D convolution, at a huge cost in memory and computation, or use methods such as RNN and LSTM, which suffer from problems of parallelization and network convergence. In recent years the Transformer, originally developed for natural language processing, has begun to be applied to three-dimensional scene understanding; however, because the Transformer is computationally heavy and its self-attention mechanism attends to global information, it introduces excessive useless information when applied to point cloud semantic segmentation, and the segmentation results are unsatisfactory. The invention combines a deep learning backbone network with Point Transformer, improves the attention mechanism, and exploits the semantic information of the previous frame to make the segmentation result more accurate.

Disclosure of Invention

In order to solve the problems in the background art, the invention provides a point cloud semantic segmentation method and device, which are suitable for scenes moving at low speed, can effectively overcome the data sparsity of single-frame point clouds, and achieve higher accuracy than single-frame point cloud segmentation.
The technical scheme adopted by the invention is as follows:

1. Point cloud semantic segmentation method

The method comprises the following steps: step 1) collecting point cloud data in a scene as a dataset through data collection equipment; step 2) taking 3D sparse convolution as the encoder-decoder, designing an inter-frame local attention module based on Point Transformer, and constructing a semantic segmentation network structure fusing time-sequence information, the network comprising a data preprocessing module, a feature encoding module, an inter-frame local attention module, a feature decoding module and a post-processing module connected in sequence; step 3) training the semantic segmentation network constructed in step 2) on the dataset obtained in step 1) to obtain a model for semantic segmentation; step 4) carrying out semantic segmentation on the point cloud data to be segmented with the semantic segmentation model. Step 1) comprises: firstly determining the semantic segmentation categories according to the task requirements (taking autonomous driving as an example, the categories of interest mainly include vehicles, people, drivable roads, sidewalks and other obstacles); then acquiring different areas of the same scene with a laser radar and a camera to obtain a plurality of sequences, each sequence comprising point cloud data acquired by the laser radar and images acquired by the camera; and finally establishing the correspondence between point cloud points and image pixels from the camera's intrinsic and extrinsic parameters and labeling the categories of the point cloud data, obtaining a ground-truth label for each point and thereby completing the acquisition of the dataset.
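The point-to-pixel labeling step just described (projecting lidar points into the camera image with the intrinsic and extrinsic parameters, then copying per-pixel labels to the points) might look like the following minimal sketch. The function name and the convention that the extrinsics map lidar coordinates to camera coordinates are assumptions for illustration.

```python
import numpy as np

def label_points_from_image(points, sem_image, K, R, t):
    """Transfer per-pixel semantic labels to lidar points.

    points:    (N, 3) lidar points in the lidar frame
    sem_image: (H, W) integer label map aligned with the camera image
    K:         (3, 3) camera intrinsic matrix
    R, t:      extrinsics, assumed lidar -> camera rotation and translation
    Returns (N,) labels, -1 for points that do not project into the image.
    """
    cam = points @ R.T + t                   # lidar frame -> camera frame
    labels = np.full(len(points), -1, dtype=int)
    in_front = cam[:, 2] > 0                 # keep only points in front of the camera
    uvw = cam[in_front] @ K.T
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)   # perspective divide
    H, W = sem_image.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    vis = np.flatnonzero(in_front)[ok]
    labels[vis] = sem_image[uv[ok, 1], uv[ok, 0]]
    return labels
```

In practice the label map would come from an annotated camera image, and points falling outside every camera's frustum would need labels from another sensor view or manual annotation.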
The frequencies of the laser radar and the camera need to be kept consistent so that the two devices capture the scene at the same moment. In step 2), the data preprocessing module comprises a preprocessing operation, an MLP layer and a cylindrical voxelization layer (CYLINDRICAL VOXELIZATION); the preprocessing operation converts the Cartesian coordinates of the input point cloud into cylindrical coordinates, and then computes, from manually set hyper-parameters, the voxel index to which each point belongs and the initial feature of each point; the point clouds of the current frame and of the previous frame in the sequence are input into the data preprocessing module for voxelization, which outputs the features and indexes of each non-empty voxel of the current frame and the previous frame. The feature encoding module and the feature decoding module in step 2) are designed based on the Encoder-Decoder structure of the UNet network, and the inter-frame local attention module is arranged between the feature encoding module and the feature decoding module.