US-12616559-B2 - Object detection and instance segmentation of 3D point clouds based on deep learning

US 12616559 B2

Abstract

A method of object detection in a point cloud includes: determining first features associated with points of a point cloud, the point cloud representing one or more objects in at least a 3D space, the first features defining geometrical information for each point of the point cloud, a first type of deep neural network being configured to receive points of the point cloud as input; determining second features based on the first features, the second features defining local geometrical information about the point cloud at positions of nodes of a uniform 3D grid; generating an object proposal, the object proposal defining a 3D bounding box containing points that may define an object, the 3D bounding box defining a 3D anchor; and determining, by a third type of deep neural network, a score for the 3D anchor indicating a probability that the 3D anchor includes points defining an object, the determining being based on second features that are located in the 3D anchor.
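
Read as a data flow, the abstract outlines a three-stage pipeline: per-point features, grid-node features, and anchor scoring. The following is a minimal sketch of that flow, assuming the three networks are available as opaque callables; all names, shapes and the anchor construction are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def detect_objects(points, feature_net, proposal_net, classifier_net,
                   grid_nodes, anchor_size):
    """Illustrative sketch of the three-stage detection pipeline.

    points:      (N, 3) array, the input point cloud
    grid_nodes:  (M, 3) array, nodes of a uniform 3D grid spanning the cloud
    anchor_size: (3,) array, edge lengths of the 3D bounding box (anchor)
    """
    # Stage 1: first features, geometrical information per point.
    point_feats = feature_net(points)                           # (N, F1)

    # Stage 2: second features, local geometrical information at grid nodes.
    node_feats = proposal_net(points, point_feats, grid_nodes)  # (M, F2)

    scored_anchors = []
    for node in grid_nodes:
        # An object proposal: a 3D bounding box (3D anchor) around a node.
        lo, hi = node - anchor_size / 2, node + anchor_size / 2
        # Gather the second features whose nodes lie inside the anchor.
        inside = np.all((grid_nodes >= lo) & (grid_nodes <= hi), axis=1)
        # Stage 3: score the anchor, i.e. the probability that it contains
        # points defining an object.
        score = classifier_net(node_feats[inside])
        scored_anchors.append(((lo, hi), score))
    return scored_anchors
```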

Inventors

  • Farhad Ghazvinian Zanjani
  • Teo Cherici
  • Frank Theodorus Catharina Claessen

Assignees

  • PROMATON HOLDING B.V.

Dates

Publication Date
2026-05-05
Application Date
2020-07-15
Priority Date
2019-07-15

Claims (18)

  1. A method of object detection in a point cloud generated by a 3D optical scanner, the method comprising: determining, by a first deep neural network representing a feature extraction network, first features associated with points of the point cloud, the point cloud including points representing one or more objects representing one or more teeth in at least a 3D space of the point cloud, the first features defining geometrical information for each point of the point cloud, the first deep neural network being configured to receive points of the point cloud as input; determining, by a second deep neural network representing an object proposal network, second features based on the first features and a uniform or non-uniform 3D grid of nodes in the space of the IOS point cloud, wherein the non-uniform 3D grid includes a dense distribution of nodes close to the surface of an object and a sparse distribution of nodes at distances further away from the surface of an object, the second features defining local geometrical information about the point cloud at positions of the nodes of the 3D grid in the 3D space of the point cloud; generating one or more object proposals based on the second features, an object proposal defining a 3D bounding box positioned around a node of the 3D grid, the 3D bounding box containing points of the point cloud that may define a tooth, the 3D bounding box defining a 3D anchor; determining, by a third deep neural network representing an object classification network, a score for the 3D anchor, the score indicating a probability that the 3D anchor includes points defining a tooth or part of a tooth, the determining being based on second features that are located in the 3D anchor.
  2. The method according to claim 1, wherein the first features include first feature vectors, each first feature vector being associated with a point of the point cloud; and/or, the second features include second feature vectors, each second feature vector being associated with a node of the 3D grid.
  3. The method according to claim 1, wherein the first deep neural network defines a feature extraction network configured to receive points of the point cloud and to generate the first features.
  4. The method according to claim 3, wherein the first deep neural network includes a plurality of convolutional layers including multilayer perceptrons (MLPs), the feature extraction network being configured to receive points of a point cloud at its input and to generate a feature vector for each point of the point cloud at its output.
  5. The method according to claim 3, wherein the feature extraction network includes one or more χ-Conv layers, each χ-Conv layer being configured to weigh and permute points and corresponding features provided to the input of the χ-Conv layer and to subsequently subject the permuted points and features to a convolution kernel (an illustrative sketch of such a layer follows the claims).
  6. The method according to claim 1, wherein the second deep neural network represents an object proposal network, the object proposal network including a plurality of convolutional layers, each of the plurality of convolutional layers including a multilayer perceptron (MLP) including one or more convolutional kernels.
  7. The method according to claim 6, wherein the object proposal network is configured as a Monte Carlo Convolutional Network (MCCNet), comprising a plurality of Monte Carlo (MC) spatial convolutional layers.
  8. The method according to claim 6, wherein the object proposal network is configured as a Monte Carlo Convolutional Network (MCCNet), comprising a plurality of Monte Carlo (MC) spatial convolutional layers, wherein each MC spatial convolutional layer comprises a convolutional kernel configured for determining a convolution at a location of a node x located in the 3D space of the point cloud.
  9. The method according to claim 6, wherein the object proposal network is configured as a Monte Carlo Convolutional Network (MCCNet), comprising a plurality of Monte Carlo (MC) spatial convolutional layers, wherein each MC spatial convolutional layer comprises a convolutional kernel configured for determining a convolution at a location of a node x located in the 3D space of the point cloud, and wherein determining the convolution includes: determining neighbouring points y within the receptive field r, the receptive field defining the field of view (FOV) of the convolutional kernel; determining for each neighbouring point y a probability density function p(x,y); and determining the convolution at a node based on a Monte Carlo estimation using the neighbouring points y and the probability density value p(x,y) for each neighbouring point (an illustrative sketch of this estimate follows the claims).
  10. The method according to claim 1, wherein the third deep neural network represents an object classification network, the third deep neural network including a plurality of fully connected (FC) multilayer perceptron (MLP) layers, the third deep neural network being configured to receive features associated with a 3D anchor and to use the features to determine a score associated with the 3D anchor, the score indicating a probability that the 3D anchor includes points defining a tooth or part of a tooth.
  11. A computer program product comprising software code portions configured for, when run in the memory of a computer, executing the method steps according to claim 1.
  12. The method according to claim 1, wherein the first features include first feature vectors, each first feature vector being associated with a point of the point cloud, and wherein the first deep neural network defines a feature extraction network configured to receive points of the point cloud and to generate the first feature vectors associated with the points of the point cloud.
  13. The method according to claim 1, wherein the second deep neural network represents an object proposal network, the object proposal network including a plurality of convolutional layers, each of the plurality of convolutional layers including a multilayer perceptron (MLP) including one or more convolutional kernels, and wherein at least one of the plurality of convolutional layers is configured to receive the first features and nodes of the 3D grid and to determine the second features based on the first features.
  14. A method of instance segmentation of a point cloud generated by a 3D optical scanner, the method comprising: determining, by a first deep neural network representing a feature extraction network, first features associated with points of a point cloud, the point cloud including points representing one or more objects representing one or more teeth in a 3D space of the point cloud, the first features defining geometrical information for each point of the point cloud, the first deep neural network being configured to receive points of the point cloud as input; determining, by a second deep neural network representing an object proposal network, second features based on the first features and a uniform or non-uniform 3D grid of nodes in the 3D space of the IOS point cloud, the second features defining local geometrical information about the point cloud at the position of the nodes of the 3D grid in the 3D space of the IOS point cloud, wherein the non-uniform 3D grid includes a dense distribution of nodes close to the surface of an object and a sparse distribution of nodes at distances further away from the surface of an object; generating object proposals based on the second features, an object proposal defining a 3D volume containing points that may define a tooth, the 3D volume of an object proposal defining a 3D anchor positioned around a node of the 3D grid; determining a classified 3D anchor, by a third deep neural network representing an object classification network, the determining being based on a second feature set, the second feature set being a subset of the second features that are located in the 3D anchor; determining an object volume, by a fourth deep neural network representing an object location predictor network, a centre position of the object volume coinciding with a centre location of an object instance and the dimensions of the object volume matching the outer dimensions of the object instance, the determining being based on the second feature set; and determining classified points, by a fifth deep neural network representing a mask predictor network, based on a set of points and a set of first features that are located in the object volume, the classified points including first classified points belonging to a tooth instance and second classified points not belonging to a tooth.
  15. The method according to claim 14, wherein the first deep neural network defines a feature extraction network, the feature extraction network including a plurality of convolutional layers including multilayer perceptrons (MLPs), the feature extraction network being configured to receive points of a point cloud at its input and to generate a feature vector for each point of the point cloud at its output; and/or, wherein the second deep neural network represents an object proposal network, the object proposal network being configured as a Monte Carlo Convolutional Network (MCCNet) comprising a plurality of Monte Carlo (MC) spatial convolutional layers, each layer including a multilayer perceptron (MLP) including one or more convolutional kernels; and/or, wherein the third deep neural network represents an object classification network, the third deep neural network including a plurality of fully connected (FC) multilayer perceptron (MLP) layers; and/or, wherein the fourth deep neural network represents an object location predictor network, the fourth deep neural network including a plurality of fully connected (FC) multilayer perceptron (MLP) layers; and/or, wherein the fifth deep neural network represents a mask predictor network, the fifth deep neural network including one or more χ-Conv layers, each χ-Conv layer being configured to weigh and permute points and corresponding features provided to the input of the χ-Conv layer and to subsequently subject the permuted points and features to a convolution kernel.
  16. The method according to claim 15, wherein the object proposal network includes a plurality of convolutional layers, each layer including a multilayer perceptron (MLP) including one or more convolutional kernels, and wherein at least one of the plurality of convolutional layers is configured to receive the first features and nodes of the 3D grid and to transform the first features to the second features.
  17. A computer system adapted for tooth detection in a point cloud generated by a 3D optical scanner, the system comprising: a computer readable storage medium having computer readable program code embodied therewith, the program code including a pre-processing algorithm and at least a trained first 3D deep neural network; and a processor coupled to the computer readable storage medium, wherein responsive to executing the computer readable program code, the processor is configured to perform executable operations comprising: determining, by a first deep neural network representing a feature extraction network, first features associated with points of a point cloud, the point cloud including points representing one or more teeth in a 3D space of the point cloud, the first features defining geometrical information for each point of the point cloud, the first deep neural network being configured to receive points of the point cloud as input; determining, by a second deep neural network representing an object proposal network, second features based on the first features, the second features defining local geometrical information about the point cloud at the position of nodes of a uniform or non-uniform 3D grid in the 3D space of the point cloud, wherein the non-uniform 3D grid includes a dense distribution of nodes close to the surface of an object and a sparse distribution of nodes at distances further away from the surface of an object; generating one or more object proposals based on the second features, an object proposal defining a 3D bounding box positioned around a node of the 3D grid, the 3D bounding box containing points of the point cloud that may define a tooth, the 3D bounding box defining a 3D anchor; and determining, by a third deep neural network representing an object classification network, a score for the 3D anchor, the score indicating a probability that the 3D anchor includes points defining a tooth or part of a tooth, the determining being based on second features that are located in the 3D anchor.
  18. A computer system adapted for instance segmentation of a point cloud generated by a 3D optical scanner, the system comprising: a computer readable storage medium having computer readable program code embodied therewith, the program code including a pre-processing algorithm and at least a trained first 3D deep neural network; and a processor coupled to the computer readable storage medium, wherein responsive to executing the computer readable program code, the processor is configured to perform executable operations comprising: determining, by a first deep neural network representing a feature extraction network, first features associated with points of a point cloud, the point cloud including points representing one or more teeth in a 3D space of the point cloud, the first features defining geometrical information for each point of the point cloud, the first deep neural network being configured to receive points of the IOS point cloud as input; determining, by a second deep neural network representing an object proposal network, second features based on the first features, the second features defining local geometrical information about the point cloud at the position of nodes of a uniform or non-uniform 3D grid spanning the 3D space of the point cloud, wherein the non-uniform 3D grid includes a dense distribution of nodes close to the surface of an object and a sparse distribution of nodes at distances further away from the surface of an object; generating object proposals based on the second features, an object proposal defining a 3D volume containing points that may define a tooth, the 3D volume of an object proposal defining a 3D anchor positioned around a node of the 3D grid in the 3D space of the point cloud; determining a classified 3D anchor, by a third deep neural network representing an object classification network, the determining being based on a second feature set, the second feature set being a subset of the second features that are located in the 3D anchor; determining an object volume, by a fourth deep neural network representing an object location predictor network, a centre position of the object volume coinciding with a centre location of the object instance and the dimensions of the object volume matching the outer dimensions of the object instance, the determining being based on the second feature set; and determining classified points, by a fifth deep neural network representing a mask predictor network, based on a set of points and a set of first features that are located in the object volume, the classified points including first classified points belonging to a tooth and second classified points not belonging to a tooth.
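
Claims 7 to 9 characterise the object proposal network as a Monte Carlo Convolutional Network (MCCNet): the convolution at a node x is a Monte Carlo estimate over the neighbouring points y inside a receptive field r, each weighted by a probability density value p(x,y). Below is a minimal numerical sketch of such an estimate; the Gaussian kernel density estimate and the learned kernel g are placeholder assumptions, not specifics taken from the patent.

```python
import numpy as np

def mc_convolution(x, points, feats, g, r):
    """Monte Carlo estimate of a convolution at node x (illustrative sketch).

    x:      (3,) node position in the 3D space of the point cloud
    points: (N, 3) point cloud positions
    feats:  (N, F) first features associated with the points
    g:      learned kernel, maps a (3,) normalized offset to a scalar weight
    r:      receptive field, the field of view (FOV) of the kernel
    """
    # 1. Neighbouring points y within the receptive field r.
    d = np.linalg.norm(points - x, axis=1)
    idx = np.where(d < r)[0]

    # 2. A probability density value p(x, y) per neighbour; here a simple
    #    Gaussian kernel density estimate over the neighbourhood (assumption).
    p = np.array([np.mean(np.exp(-np.sum((points[idx] - points[j]) ** 2, axis=1)
                                 / (2 * (0.25 * r) ** 2))) for j in idx])

    # 3. Monte Carlo estimation: density-compensated average of kernel
    #    responses over the neighbouring points.
    out = np.zeros(feats.shape[1])
    for j, pj in zip(idx, p):
        out += feats[j] * g((points[j] - x) / r) / max(pj, 1e-8)
    return out / max(len(idx), 1)
```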

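Claims 5 and 15 refer to χ-Conv layers that weigh and permute the input points and their corresponding features before subjecting them to a convolution kernel. The sketch below shows one such step under the same caveat: the learned maps are reduced to plain callables, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def chi_conv(center, neighbors, feats, mlp_chi, kernel):
    """Illustrative sketch of a χ-Conv step on one neighbourhood of K points.

    center:    (3,) point at which the layer is evaluated
    neighbors: (K, 3) neighbouring points
    feats:     (K, F) features of the neighbouring points
    mlp_chi:   learned map from (K, 3) local coordinates to a (K, K) matrix
               that jointly weighs and permutes the inputs
    kernel:    convolution kernel applied to the transformed features,
               mapping (K, F) to (F_out,)
    """
    local = neighbors - center     # express the points relative to the centre
    chi = mlp_chi(local)           # (K, K) weighting/permutation matrix
    transformed = chi @ feats      # weigh and permute the features
    return kernel(transformed)     # subject the result to the conv kernel
```
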
Description

CROSS-REFERENCE TO RELATED APPLICATION(S)

This Application is a Section 371 National Stage Application of International Application No. PCT/EP2020/070046, filed Jul. 15, 2020, and published as WO 2021/009258 A1 on Jan. 21, 2021, and further claims priority to European Patent Application No. 19186357.0, filed Jul. 15, 2019.

FIELD OF THE INVENTION

The invention relates to object detection and instance segmentation of 3D point clouds based on deep learning, and in particular, though not exclusively, to methods and systems for object detection in 3D point clouds using deep learning, to methods and systems for instance segmentation of 3D point clouds using deep learning, a deep neural network system for object detection in 3D point clouds, a deep neural network system for instance segmentation of 3D point clouds, and a computer program product for executing such methods.

BACKGROUND OF THE INVENTION

In image processing, instance segmentation refers to the process of object detection wherein specific objects in an image are detected (typically by determining bounding boxes comprising each of the detected objects) and a pixel mask is created for each identified object. Instance segmentation can be thought of as object detection where the output is a pixel mask instead of just a bounding box. Thus, unlike semantic segmentation, which aims to categorize each pixel in an image, instance segmentation aims to label pixels in determined bounding boxes. Recently, fast and reliable instance segmentation for 2D camera images based on the so-called Mask R-CNN deep learning scheme is seeing increasing application in solving real-world problems. However, in many applications such as autonomous driving, robotics and certain medical applications, the sensor information that needs to be analyzed represents a 3D scene, not a 2D scene. These 3D applications rely on information generated by optical scanners, e.g. laser scanners such as LiDAR used in surveying applications and intra-oral scanners used in dentistry, which typically generate non-uniform 3D volumetric data in the form of a point cloud. These data are not structured in the form of a homogeneous grid of data such as pixels or—in the case of non-optical 3D scanners, e.g. CT scanners—voxels.

Data acquisition schemes based on optical scanners typically generate 3D volumetric data in the form of a point cloud data set or—in short—a point cloud. Data points of a point cloud may represent the surface of objects. Typically, point clouds include a large number of points which are non-uniformly distributed in the 3D space. The 3D space may include areas of densely distributed data points, areas of sparsely distributed data points and areas that do not have data points at all, e.g. the void space 'inside' objects. The term point cloud may refer to any type of 3D data set wherein each point may be represented as a vector in a 3D space. The points may be associated with further attributes, e.g. color or the like. Special types of point clouds include 3D surface definitions such as triangle meshes or polygon meshes.
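
The preceding paragraph characterises a point cloud as an unordered, non-uniformly distributed set of points, each representable as a vector in 3D space and optionally carrying further attributes such as color. A minimal container along those lines, purely illustrative (the class and field names are not from the patent):

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class PointCloud:
    """Non-uniformly distributed set of 3D points (illustrative sketch)."""
    positions: np.ndarray                # (N, 3) each point as a vector in 3D space
    colors: Optional[np.ndarray] = None  # optional (N, 3) per-point attributes
```
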
Although 3D analysis based on a point cloud is a rapidly growing field of technology, schemes for 3D object detection and instance segmentation are still in their infancy when compared to their 2D counterparts. Currently, only a few sources are known that address 3D instance segmentation. Qi et al. describe in their article "Frustum pointnets for 3D object detection from RGB-D data", IEEE Conference on Computer Vision and Pattern Recognition, pp. 918-927 (2018), a hybrid framework involving two stages, wherein in a first stage 2D bounding boxes of objects are detected in 2D images and in a second stage a 3D point cloud is processed in a 3D search space, partially bound by the 2D bounding boxes. Similarly, Hou et al. describe in their article "3D-SIS: 3D semantic instance segmentation of RGB-D scans", arXiv preprint arXiv:1812.07003 (2018), a model wherein first 2D images are processed by a 2D convolutional network. Thereafter, the learned features are back-projected onto voxelized point cloud data, where the extracted 2D features and the geometric information are combined to obtain object proposals and per-voxel mask predictions. The dependency of the above-described models on 2D image(s) and voxelization limits the performance of such approaches. In another approach, Yi et al. described in their article "Generative shape proposal network for 3D instance segmentation in point cloud", arXiv preprint arXiv:1812.03320 (2018), an analysis-by-synthesis strategy wherein, instead of directly determining object bounding boxes in a point cloud, a conditional variational auto-encoder (CVAE) is used. However, training of this generative shape proposal network (GSPN) requires a rather complex separate two-stage training of the CVAE part and the region-based networks (which perform the classification, regression and mask generation on the proposals). In yet another approach, object proposals are determined based on a clustering scheme. Wang et al. described in