US-12626498-B2 - Network architecture for a three-dimensional object detection in a point cloud, method and vehicle

US 12626498 B2

Abstract

A network architecture for three-dimensional object detection in a point cloud, a related method, and a vehicle are disclosed. The network architecture for three-dimensional object detection in a point cloud may include an encoder, a backbone, and a head, wherein the backbone and/or the head may include a three-dimensional convolution component for processing three-dimensional data of the point cloud within a three-dimensional voxel grid.

Inventors

  • Christopher Plachetka
  • Tim Fingscheidt
  • Benjamin Sertolli

Assignees

  • VOLKSWAGEN AKTIENGESELLSCHAFT

Dates

Publication Date
2026-05-12
Application Date
2023-09-14
Priority Date
2022-10-06

Claims (20)

  1. A point-cloud processing system for three-dimensional object detection in a point cloud comprising: an encoder; a backbone; and a head, wherein the encoder, the backbone, and the head are implemented by one or more processors, and wherein the backbone and/or the head are configured with a three-dimensional convolution component for processing three-dimensional data of the point cloud stored in memory within a three-dimensional voxel grid to detect vertically stacked objects in a vehicle environment and output a plurality of three-dimensional object detections comprising three-dimensional positions of the vertically stacked objects in the point cloud, wherein the plurality of three-dimensional object detections comprise respective three-dimensional positions of at least a first object and a second object that are vertically stacked relative to one another.
  2. The point-cloud processing system according to claim 1, wherein an input to the encoder comprises data from a LiDAR point cloud and/or a RADAR point cloud.
  3. The point-cloud processing system according to claim 1, wherein the head comprises a detection and regression head providing at least two three-dimensional convolutions in parallel.
  4. The point-cloud processing system according to claim 3, wherein the detection and regression head is configured to operate in a single-shot fashion.
  5. The point-cloud processing system according to claim 3, wherein the detection and regression head is configured to detect three-dimensional objects using a three-dimensional anchor grid.
  6. The point-cloud processing system according to claim 5, wherein the detection and regression head is configured to match ground truth poles to the anchor grid using an intersection-over-smaller-area criterion that includes a “don't care” state.
  7. The point-cloud processing system according to claim 5, wherein the three-dimensional anchor grid comprises anchors for poles with a predefined voxel size, wherein the anchors for the poles are represented as cylinders placed in the three-dimensional anchor grid with a predefined diameter.
  8. The point-cloud processing system according to claim 5, wherein the three-dimensional anchor grid comprises anchors for lights with a predefined voxel size, wherein the anchors are modeled as a squared-sized bounding box of a predefined height and a predefined width.
  9. The point-cloud processing system according to claim 8, wherein the detection and regression head is configured to match ground truth lights to the anchor grid by selecting candidates in an x-y-plane using an intersection-over-smaller-area criterion, and filtering candidates according to an overlap criterion in the z-dimension.
  10. The point-cloud processing system according to claim 5, wherein the three-dimensional anchor grid comprises anchors for signs with a predefined voxel size, and wherein the anchors for the signs are represented as rectangles placed in the three-dimensional anchor grid with a predefined height and width.
  11. The point-cloud processing system according to claim 1, wherein the point cloud comprises a high-density point cloud, and wherein the point-cloud processing system comprises a neural network trained using high-density point clouds.
  12. A method for detecting three-dimensional objects, comprising: receiving a point cloud input in an encoder; performing, in a backbone and/or a head, a three-dimensional convolution by processing three-dimensional data of the point cloud within a three-dimensional voxel grid, wherein the receiving and the performing are executed by one or more processors, and the three-dimensional voxel grid is stored in memory; and outputting three-dimensional object detections comprising three-dimensional positions of vertically stacked objects in the point cloud, wherein the three-dimensional object detections comprise respective three-dimensional positions of at least a first object and a second object that are vertically stacked relative to one another.
  13. The method according to claim 12, wherein the detecting of the three-dimensional objects comprises detecting the three-dimensional objects using a three-dimensional anchor grid.
  14. The method according to claim 13, further comprising matching ground truth poles to the anchor grid using an intersection-over-smaller-area criterion that includes a “don't care” state.
  15. The method according to claim 13, wherein the three-dimensional anchor grid comprises anchors for poles with a predefined voxel size, wherein the anchors for the poles are represented as cylinders placed in the three-dimensional anchor grid with a predefined diameter.
  16. The method according to claim 13, wherein the three-dimensional anchor grid comprises anchors for lights with a predefined voxel size, wherein the anchors are modeled as a squared-sized bounding box of a predefined height and a predefined width.
  17. The method according to claim 13, further comprising matching ground truth lights to the anchor grid by selecting candidates in an x-y-plane using an intersection-over-smaller-area criterion, and filtering candidates according to an overlap criterion in the z-dimension.
  18. The method according to claim 13, wherein the three-dimensional anchor grid comprises anchors for signs with a predefined voxel size, and wherein the anchors for the signs are represented as rectangles placed in the three-dimensional anchor grid with a predefined height and width.
  19. The method according to claim 12, wherein the point cloud comprises a high-density point cloud, and wherein the method is performed using a neural network trained using high-density point clouds.
  20. A vehicle for a three-dimensional object detection, comprising: one or more LiDAR sensors and/or one or more RADAR sensors for providing one or more outputs; an encoder configured to receive the one or more outputs to generate a point cloud; a backbone; and a head, wherein the backbone and/or the head are configured with a three-dimensional convolution component for processing three-dimensional data of the point cloud within a three-dimensional voxel grid to detect vertically stacked objects in a vehicle environment and output a plurality of three-dimensional object detections comprising respective three-dimensional positions of the vertically stacked objects in the point cloud, and provide the three-dimensional object detections to a vehicle control system comprising at least one of a motion planning system, a localization system, and a mapping system, wherein the plurality of three-dimensional object detections comprise respective three-dimensional positions of at least a first object and a second object that are vertically stacked relative to one another.
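The anchor-matching criteria recited in claims 6, 9, 14, and 17 can be illustrated with a minimal sketch. The thresholds, box representation, and function names below are hypothetical illustrations, not taken from the patent: intersection-over-smaller-area normalizes the x-y overlap by the smaller rectangle's area, a middle band of scores yields the “don't care” state, and light candidates are additionally filtered by relative overlap in z.

```python
def iosa_xy(a, b):
    """Intersection-over-smaller-area of two axis-aligned x-y rectangles.

    Rectangles are (x_min, y_min, x_max, y_max). The intersection area is
    normalized by the smaller of the two rectangle areas, so a small ground
    truth object fully covered by a large anchor still scores 1.0.
    """
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (ix * iy) / min(area_a, area_b)

def match_state(score, t_pos=0.6, t_neg=0.3):
    """Assign an anchor state from a matching score (hypothetical thresholds).

    Scores between the two thresholds fall into the "don't care" state and
    would typically be excluded from the training loss.
    """
    if score >= t_pos:
        return "positive"
    if score >= t_neg:
        return "dont_care"
    return "negative"

def z_overlap_ratio(a_z, b_z):
    """Overlap along z relative to the smaller vertical extent.

    a_z and b_z are (z_min, z_max) intervals; used here as a second-stage
    filter for light candidates selected in the x-y plane.
    """
    ov = max(0.0, min(a_z[1], b_z[1]) - max(a_z[0], b_z[0]))
    return ov / min(a_z[1] - a_z[0], b_z[1] - b_z[0])
```

Because the normalization uses the smaller area, two vertically stacked objects sharing an x-y footprint can both match anchors in the plane; the z-overlap filter then separates them by height.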

Description

RELATED APPLICATIONS

The present application claims priority to European Patent Application No. 22200113.3 to Plachetka et al., filed Oct. 6, 2022, titled “Network Architecture For A Three-Dimensional Object Detection In A Point Cloud, Method And Vehicle,” the contents of which are incorporated by reference in their entirety herein.

TECHNICAL FIELD

The present disclosure relates to a network architecture for detecting vertically stacked objects in a vehicle environment, and to related methods and a vehicle incorporating the network architecture.

BACKGROUND

Deep neural network architectures for object detection in point clouds (such as LiDAR or RADAR point clouds) published to date lack the capability to detect vertically stacked objects, as current technologies focus on the movable parts of the environment for automated vehicles, where single objects are placed on the ground plane only. High-definition (HD) maps are a vital component for an automated vehicle and are suitable for numerous tasks of autonomous and semi-autonomous driving. In the field of automated driving, however, to generate HD maps, to detect map deviations onboard while driving, or to generate an environment model on-the-fly without a map, the detection of vertically stacked objects such as traffic signs or lights (stacked with signs) is necessary. Such vertically stacked objects in a three-dimensional space may also occur in other technology fields, e.g., in physics.

BRIEF SUMMARY

Aspects of the present disclosure are directed to improving 3D object detection. In particular, there may be a need to detect vertically stacked objects in point clouds, for instance for, but not limited to, a vehicle environment. Accordingly, a network architecture, a method, and a vehicle are disclosed according to the subject-matter of the independent claims. Further exemplary embodiments and improvements are provided by the subject-matter of the dependent claims.
In some examples, a network architecture is disclosed for three-dimensional object detection in a point cloud, and may include an encoder, a backbone, and a head, wherein the backbone and/or the head comprises a three-dimensional convolution component for processing three-dimensional data of the point cloud within a three-dimensional voxel grid. As used herein, “objects” in this context may be understood as real-world objects. The term “ground truth objects” refers to objects originating from the dataset used to train and test the network, assumed to be a ground truth without annotation errors. The term “ground truth (GT) distribution” may be understood as a distribution of a parameter (e.g., pole diameter) obtained from the ground truth dataset. Further, the term “bounding shape parameters” may be understood as, e.g., the height, width, and orientation of a bounding rectangle. The network architecture may preferably be a deep neural network architecture.

In some examples, a network architecture is disclosed for detecting vertically stacked objects in a vehicle environment, comprising an encoder, a backbone, and a head, wherein the backbone and/or the head comprises a convolution component that is adapted to perform a convolution within a three-dimensional voxel grid. Some aspects of the present disclosure are directed to the detection of traffic signs as vertically stacked objects. However, the network architecture is not limited to these specific objects. Any object having a vertical dimension (being a three-dimensional object) may be applicable as a detected object in point clouds.

In some examples, a method is disclosed for detecting three-dimensional objects in a point cloud, comprising providing an encoder, providing a backbone, providing a head, and performing in the backbone and/or in the head a three-dimensional convolution by processing three-dimensional data of the point cloud within a three-dimensional voxel grid.
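The core operation recited above, a convolution applied within a three-dimensional voxel grid, can be sketched minimally. The following naive single-channel "valid" 3D convolution is an illustration only, not code from the patent; a real backbone or head would use an optimized library implementation with learned multi-channel kernels:

```python
import numpy as np

def conv3d_valid(grid, kernel):
    """Naive 'valid'-mode 3D convolution (cross-correlation) over a voxel grid.

    grid:   (Z, Y, X) array of voxel features (e.g., occupancy).
    kernel: (kz, ky, kx) array of weights.
    Returns an array of shape (Z-kz+1, Y-ky+1, X-kx+1); each output voxel is
    the weighted sum of the kernel-sized neighborhood around it, so features
    are aggregated along the vertical (z) axis as well, which is what lets a
    3D backbone separate vertically stacked structures.
    """
    gz, gy, gx = grid.shape
    kz, ky, kx = kernel.shape
    out = np.zeros((gz - kz + 1, gy - ky + 1, gx - kx + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(grid[z:z + kz, y:y + ky, x:x + kx] * kernel)
    return out
```

The key contrast with 2D bird's-eye-view pipelines is that the kernel extends over z, so the output retains a vertical dimension instead of collapsing it.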
In some examples, a vehicle is disclosed, comprising the network architecture according to the present disclosure or its exemplary embodiments, and at least one LiDAR sensor and/or one or more RADAR sensors, wherein the sensor is adapted to provide data for the point cloud. In some examples, the network may be implemented in a vehicle in the automotive domain. However, regarding the generation of HD maps (or other technology fields), the network may instead be implemented on a graphics processing unit (GPU), such as a GPU cluster detecting traffic signs (or other vertically stacked objects) in large amounts of data rather than in single laser scans originating from onboard data, for instance. In some examples, the point cloud may be provided by one or more sensors, such as a LiDAR sensor, a plurality of LiDAR sensors, a RADAR sensor, or a plurality of RADAR sensors. Any other source of point clouds, e.g., originating from depth estimation, would also be applicable. The vehicle may be a self-driving vehicle. However, a vehicle application is only one example.
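As an illustration of how sensor points might be binned into the three-dimensional voxel grid that the encoder operates on, the following sketch counts points per voxel. The grid parameters and the function name are hypothetical, and the patent's encoder is not specified to work this way; this only shows the general point-to-voxel step that precedes 3D convolution:

```python
import numpy as np

def voxelize(points, voxel_size, grid_min, grid_shape):
    """Bin 3D points into a count grid (a simple voxel-grid encoder stand-in).

    points:     (N, 3) array of x, y, z coordinates.
    voxel_size: edge length of a cubic voxel.
    grid_min:   (3,) origin of the grid in sensor coordinates.
    grid_shape: (nx, ny, nz) number of voxels per axis.
    Points falling outside the grid are discarded.
    """
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=np.int32)
    # np.add.at accumulates correctly even when several points hit one voxel.
    np.add.at(grid, tuple(idx[inside].T), 1)
    return grid
```

A count (or occupancy) grid of this form preserves the vertical axis, so objects stacked along z occupy distinct voxel layers and remain separable by the downstream 3D convolutions.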