CN-122023790-A - Semantic segmentation method and device of three-dimensional point cloud and electronic equipment
Abstract
The invention provides a semantic segmentation method and device for three-dimensional point clouds, and electronic equipment. In each layer of the coding network, a local embedding module extracts local geometric features and local semantic features and determines a local semantic aggregation result, while a global embedding module extracts a global semantic context vector and determines the fusion result output by that layer. The fusion result of the last coding layer is then decoded to determine the semantic result of each point, improving semantic understanding while preserving the geometric structure of the point cloud.
Inventors
- LIU XIAOHE
- GUO LIYONG
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
- Zhongtai Hulian (Beijing) Technology Co., Ltd. (中钛互联(北京)科技有限公司)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2025-12-31
Claims (10)
- 1. A semantic segmentation method for a three-dimensional point cloud, the method comprising: acquiring a three-dimensional point cloud set for a target scene, wherein the three-dimensional point cloud set comprises the spatial coordinates and optional attribute information of each point; inputting the three-dimensional point cloud set into a multi-layer coding network, and obtaining, through layer-by-layer downsampling, a first point cloud set corresponding to each layer of the coding network, wherein each layer of the coding network comprises a local embedding module and a global embedding module; for each layer of the coding network, extracting, by the local embedding module in that layer, local geometric features and local semantic features based on the previous-layer point cloud set input to that layer, so as to output a local semantic aggregation result; extracting, by the global embedding module in that layer, a global semantic context vector based on the previous-layer point cloud set, so as to determine, based on the global semantic context vector and the local semantic aggregation result, a fusion result output by the global embedding module of that layer; taking the fusion result output by the global embedding module in the last layer of the coding network as a final fusion result, and inputting the final fusion result into a multi-layer decoding network, so that the multi-layer decoding network upsamples the final fusion result layer by layer based on the features in each layer of the coding network to obtain a final decoding result; and outputting a semantic result for each point in the three-dimensional point cloud set based on the final decoding result.
- 2. The method of claim 1, wherein inputting the three-dimensional point cloud set into the multi-layer coding network and obtaining the first point cloud set corresponding to each layer through layer-by-layer downsampling comprises: inputting the three-dimensional point cloud set into the multi-layer coding network, and, for each layer of the coding network, downsampling the previous-layer point cloud set input to that layer using a farthest point sampling algorithm to obtain the first point cloud set corresponding to that layer.
- 3. The method of claim 1, wherein, for each layer of the coding network, extracting, by the local embedding module in that layer, local geometric features and local semantic features based on the previous-layer point cloud set so as to output a local semantic aggregation result comprises: for each layer of the coding network, determining, by the local embedding module in that layer using a K-nearest-neighbor algorithm, a plurality of neighborhood points corresponding to each point in the previous-layer point cloud set; calculating the relative coordinate difference between each neighborhood point and the point; mapping the point and each relative coordinate difference through a multi-layer perceptron to obtain the local geometric feature of each neighborhood point; extracting the local semantic feature of each neighborhood point through the multi-layer perceptron; and determining the local semantic aggregation result corresponding to the point by applying a max pooling operation to the local semantic features of the neighborhood points corresponding to the point.
- 4. The method of claim 3, further comprising: determining a first weighting coefficient for each neighborhood point based on the local semantic feature of that neighborhood point; determining a weighted semantic feature for each neighborhood point based on each first weighting coefficient and the local semantic feature of the point; concatenating the local geometric features, the local semantic aggregation result, and the weighted semantic features of the neighborhood points to obtain a first concatenation result; and compressing and mapping the first concatenation result through the multi-layer perceptron to obtain a fused local embedded feature.
- 5. The method of claim 3, wherein extracting, by the global embedding module in that layer, a global semantic context vector based on the previous-layer point cloud set so as to determine, based on the global semantic context vector and the local semantic aggregation result, the fusion result output by the global embedding module comprises: extracting features of the previous-layer point cloud set through the global embedding module in that layer to obtain a feature set; sequentially applying a max pooling operation and a nonlinear activation operation to the feature set to obtain the global semantic context vector; concatenating the features corresponding to the point, each relative coordinate difference, and the global semantic context vector to obtain a second concatenation result for each neighborhood point; determining a second weighting coefficient for each neighborhood point based on each second concatenation result; determining a global geometric feature based on the second weighting coefficients and the second concatenation results; and outputting the fusion result corresponding to the point based on the global geometric feature, the local semantic aggregation result, and the global semantic context vector.
- 6. The method of claim 1, wherein upsampling the final fusion result layer by layer through the multi-layer decoding network based on the features in each layer of the coding network to obtain a final decoding result comprises: upsampling the final fusion result layer by layer through the multi-layer decoding network to obtain a second point cloud set corresponding to each layer of the decoding network; taking the first decoding layer as the current layer and, for each point in the current layer, determining a plurality of neighboring points corresponding to the point from the next-layer point cloud set corresponding to the next decoding layer using a K-nearest-neighbor algorithm; calculating a weighted interpolation feature for the point based on the plurality of neighboring points; concatenating the weighted interpolation feature with the features in the coding layer of the corresponding level to obtain a feature fusion result; repeating, with the next decoding layer as the new current layer, the step of determining a plurality of neighboring points for each point in the current layer from the next-layer point cloud set using the K-nearest-neighbor algorithm, until the feature fusion result corresponding to the last decoding layer is obtained; and determining the feature fusion result corresponding to the last decoding layer as the final decoding result.
- 7. The method of claim 1, wherein outputting the semantic result for each point in the three-dimensional point cloud set based on the final decoding result comprises: outputting, through a fully connected layer and based on the final decoding result, a plurality of raw scores indicating how strongly each point in the three-dimensional point cloud set belongs to each of a plurality of preset categories; inputting the raw scores corresponding to each point into a preset activation function so as to output the probability that the point belongs to each category; and determining the category with the highest probability as the semantic result corresponding to the point.
- 8. A semantic segmentation apparatus for a three-dimensional point cloud, the apparatus comprising: an acquisition module, configured to acquire a three-dimensional point cloud set for a target scene, wherein the three-dimensional point cloud set comprises the spatial coordinates and optional attribute information of each point; a downsampling module, configured to input the three-dimensional point cloud set into a multi-layer coding network and obtain, through layer-by-layer downsampling, a first point cloud set corresponding to each layer of the coding network; an extraction module, configured to, for each layer of the coding network, extract local geometric features and local semantic features based on the previous-layer point cloud set input to that layer through the local embedding module in that layer, so as to output a local semantic aggregation result; a determining module, configured to extract a global semantic context vector based on the previous-layer point cloud set through the global embedding module in that layer, so as to determine, based on the global semantic context vector and the local semantic aggregation result, a fusion result output by the global embedding module of that layer; an upsampling module, configured to take the fusion result output by the global embedding module in the last layer of the coding network as a final fusion result, input the final fusion result into a multi-layer decoding network, and upsample the final fusion result layer by layer through the multi-layer decoding network based on the features in each layer of the coding network to obtain a final decoding result; and an output module, configured to output the semantic result of each point in the three-dimensional point cloud set based on the final decoding result.
- 9. An electronic device comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor, wherein the processor executes the machine-executable instructions to implement the semantic segmentation method of a three-dimensional point cloud according to any one of claims 1-7.
- 10. A machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the semantic segmentation method of a three-dimensional point cloud according to any one of claims 1-7.
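The local embedding steps recited in claim 3 (KNN neighborhoods, relative coordinate differences, an MLP mapping, max pooling of neighborhood semantic features) can be sketched as follows. This is an illustrative reconstruction, not the patented module: the "MLP" is stood in for by a single random linear layer with ReLU, and all weights are placeholders.

```python
import numpy as np

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    return np.argsort(dists)[1:k + 1]

def local_embedding(points, sem_feats, k=4):
    """For each point: find k neighbors via KNN, compute relative coordinate
    differences, map them through a stand-in 'MLP' (random linear layer plus
    ReLU), and max-pool the neighbors' semantic features into the local
    semantic aggregation result."""
    n, c = sem_feats.shape
    rng = np.random.default_rng(0)
    w = rng.standard_normal((3, 8))               # hypothetical MLP weights
    geo_feats = np.empty((n, k, 8))
    agg = np.empty((n, c))
    for i in range(n):
        nbr = knn_indices(points, i, k)
        rel = points[nbr] - points[i]             # relative coordinate differences
        geo_feats[i] = np.maximum(rel @ w, 0.0)   # local geometric features
        agg[i] = sem_feats[nbr].max(axis=0)       # max pooling aggregation
    return geo_feats, agg
```

The max pooling over a KNN neighborhood makes the aggregation result independent of the order in which neighbors are returned, which matches the unordered nature of point clouds.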
Description
Semantic segmentation method and device of three-dimensional point cloud and electronic equipment

Technical Field

The invention relates to the technical field of computer vision and artificial intelligence, in particular to a semantic segmentation method and device for three-dimensional point clouds, and electronic equipment.

Background

With the rapid development of artificial intelligence and computer vision technologies, three-dimensional point clouds have gradually become important data for scene perception, target detection, and environmental modeling. Point cloud semantic segmentation aims to semantically classify each point in a three-dimensional scene so as to distinguish semantic categories such as ground, buildings, roads, vehicles, and pedestrians. The technology has broad application prospects in fields such as autonomous driving, robot navigation, virtual reality, intelligent manufacturing, and geographic mapping.

In the related art, PointNet-series methods can be used for point cloud semantic segmentation. These methods directly process unordered point clouds through a multi-layer perceptron (MLP) and aggregate features through a global pooling layer, which gives them high computational efficiency. However, because they use only global information and lack a description of the local neighborhood geometric structure, local feature extraction is insufficient, the geometric and semantic relations among neighborhoods are difficult to model accurately, and semantic segmentation accuracy in complex scenes is reduced.

Disclosure of Invention

The invention aims to provide a semantic segmentation method and device for three-dimensional point clouds, and electronic equipment, so as to improve semantic segmentation accuracy in complex scenes.
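The PointNet-style baseline described in the background (a shared per-point MLP followed by global pooling) can be sketched as follows. The two weight matrices are random placeholders, not a trained model; the sketch only illustrates why such a pipeline captures global but not local structure.

```python
import numpy as np

def pointnet_global_feature(points, w1, w2):
    """Shared per-point MLP (two linear layers with ReLU) followed by a
    global max pool, in the spirit of the PointNet family: the output is a
    single global feature vector for the whole unordered point set."""
    h = np.maximum(points @ w1, 0.0)   # shared layer 1, applied to every point
    h = np.maximum(h @ w2, 0.0)        # shared layer 2
    return h.max(axis=0)               # global max pooling over all points

# Usage with random placeholder weights:
rng = np.random.default_rng(0)
pts = rng.standard_normal((128, 3))
w1 = rng.standard_normal((3, 16))
w2 = rng.standard_normal((16, 32))
g = pointnet_global_feature(pts, w1, w2)
```

Because the pooling is a symmetric function, the global feature is invariant to the order of the input points; at the same time, no step looks at a point's neighborhood, which is exactly the limitation the invention's local embedding module addresses.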
The semantic segmentation method of the three-dimensional point cloud comprises the steps of: obtaining a three-dimensional point cloud set for a target scene, wherein the three-dimensional point cloud set comprises the spatial coordinates and optional attribute information of each point; inputting the three-dimensional point cloud set into a multi-layer coding network and obtaining, through layer-by-layer downsampling, a first point cloud set corresponding to each layer of the coding network, wherein each layer of the coding network comprises a local embedding module and a global embedding module; for each layer of the coding network, extracting local geometric features and local semantic features based on the previous-layer point cloud set input to that layer through the local embedding module in that layer, so as to output a local semantic aggregation result; extracting a global semantic context vector based on the previous-layer point cloud set through the global embedding module in that layer, so as to determine, based on the global semantic context vector and the local semantic aggregation result, a fusion result output by the global embedding module of that layer; taking the fusion result output by the global embedding module in the last layer of the coding network as a final fusion result and inputting it into a multi-layer decoding network, which upsamples the final fusion result layer by layer based on the features in each layer of the coding network to obtain a final decoding result; and outputting the semantic result of each point in the three-dimensional point cloud set based on the final decoding result.
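The layer-by-layer downsampling step above relies on farthest point sampling. A minimal NumPy sketch of the greedy FPS algorithm (a standard technique, not the patented network itself) looks like this:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedily select m point indices: start from point 0, then repeatedly
    add the point farthest from the already-selected set, so the sample
    covers the cloud's geometry evenly."""
    chosen = [0]
    # Distance from every point to the current chosen set.
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())        # farthest remaining point
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.asarray(chosen)
```

Each coding layer would apply such a selection to the previous-layer point cloud set and keep only the chosen points (and their features) as that layer's first point cloud set.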
Further, the step of inputting the three-dimensional point cloud set into the multi-layer coding network and obtaining the first point cloud set corresponding to each layer through layer-by-layer downsampling comprises: inputting the three-dimensional point cloud set into the multi-layer coding network, and, for each layer of the coding network, downsampling the previous-layer point cloud set input to that layer using a farthest point sampling algorithm to obtain the first point cloud set corresponding to that layer. Further, for each layer of the coding network, the step of extracting local geometric features and local semantic features based on the previous-layer point cloud set through the local embedding module in that layer so as to output a local semantic aggregation result comprises: for each layer of the coding network, determining, by the local embedding module in that layer using a K-nearest-neighbor algorithm, a plurality of neighborhood points corresponding to each point in the previous-layer point cloud set; calculating the relative coordinate difference between each neighborhood point and the point; mapping the point and each relative coordinate difference through a multi-layer perceptron to obtain the local geometric feature of each neighborhood point; extracti