
US-12620170-B2 - Sparse voxel transformer for camera-based 3D semantic scene completion


Abstract

An artificial intelligence framework is described that incorporates a number of neural networks and a number of transformers for converting a two-dimensional image into three-dimensional semantic information. Neural networks convert one or more images into a set of image feature maps, depth information associated with the one or more images, and query proposals based on the depth information. A first transformer implements a cross-attention mechanism to process the set of image feature maps in accordance with the query proposals. The output of the first transformer is combined with a mask token to generate initial voxel features of the scene. A second transformer implements a self-attention mechanism to convert the initial voxel features into refined voxel features, which are up-sampled and processed by a lightweight neural network to generate the three-dimensional semantic information, which may be used by, e.g., an autonomous vehicle for various advanced driver assistance system (ADAS) functions.
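Read as a pipeline, the abstract describes: (1) backbone networks that produce image feature maps, a depth map, and depth-derived query proposals; (2) a cross-attention transformer that updates the sparse queries from the image features; (3) densification of the voxel grid with a learned mask token; (4) a self-attention transformer that refines the full grid; and (5) up-sampling plus a lightweight classification head. The PyTorch-style sketch below is only a minimal illustration of that flow under assumed shapes and module choices; standard multi-head attention stands in for the deformable attention mechanisms, and SparseVoxelSSC and its parameters are hypothetical names, not the patented implementation.

```python
# Minimal, illustrative sketch only. Module names, tensor shapes, and the use of
# standard multi-head attention (instead of the deformable cross-/self-attention
# described in the disclosure) are assumptions for readability.
import torch
import torch.nn as nn

class SparseVoxelSSC(nn.Module):
    def __init__(self, d_model=128, n_heads=8, n_classes=20, n_voxels_coarse=2048):
        super().__init__()
        # Learned token used to fill voxels that received no query proposal.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Stage 1: sparse voxel queries attend to 2D image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: all voxel features attend to one another for refinement.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Assumed 2x-per-axis up-sampling: each coarse voxel expands into 8 children.
        self.up = nn.Linear(d_model, d_model * 8)
        # Lightweight head of fully connected layers producing per-voxel class logits.
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, n_classes))
        self.n_voxels = n_voxels_coarse

    def forward(self, img_feats, query_feats, query_idx):
        # img_feats:   (B, N_pixels, d) flattened image feature maps from a 2D backbone
        # query_feats: (B, N_q, d)      sparse query proposals from the depth/occupancy path
        # query_idx:   (B, N_q)         coarse-voxel index of each query proposal
        B, _, d = img_feats.shape
        updated_q, _ = self.cross_attn(query_feats, img_feats, img_feats)
        # Scatter updated queries into the coarse grid; non-queried voxels keep the mask token.
        voxels = self.mask_token.expand(B, self.n_voxels, d).clone()
        idx = query_idx.long().unsqueeze(-1).expand(-1, -1, d)
        voxels.scatter_(1, idx, updated_q)
        refined, _ = self.self_attn(voxels, voxels, voxels)    # dense refinement, illustration only
        fine = self.up(refined).reshape(B, self.n_voxels * 8, d)
        return self.head(fine)                                  # (B, 8 * n_voxels_coarse, n_classes)

# Example usage with random tensors of assumed sizes.
model = SparseVoxelSSC()
img_feats = torch.randn(1, 4800, 128)                  # e.g., a flattened 60x80 feature map
queries = torch.randn(1, 600, 128)                     # 600 depth-derived query proposals
query_idx = torch.randint(0, 2048, (1, 600))
logits = model(img_feats, queries, query_idx)          # (1, 16384, 20) per-voxel class logits
```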

Inventors

  • Yiming Li
  • Zhiding Yu
  • Christopher B. Choy
  • Chaowei Xiao
  • Jose Manuel Alvarez Lopez
  • Sanja Fidler
  • Animashree Anandkumar

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2023-11-20

Claims (18)

  1. A computer-implemented method, comprising: receiving one or more input images, wherein each image of the one or more input images is a two-dimensional (2D) image of a scene; and processing, via a plurality of models implemented by one or more processors, the one or more input images to generate three-dimensional (3D) semantic information for the scene, the processing comprising: extracting a set of image feature maps from the one or more input images by at least one feature extraction network, generating a depth map by processing the one or more input images by a depth estimation network, generating 3D point cloud data based on the depth map, generating a first binary voxel grid occupancy map at a first resolution based on the 3D point cloud, converting the first binary voxel grid occupancy map at the first resolution to a second binary voxel grid occupancy map at a second resolution by a depth correction network, and generating the three-dimensional semantic information based on the set of image feature maps by using at least one transformer.
  2. The method of claim 1, wherein the processing, via the plurality of models implemented by the one or more processors, the one or more input images to generate the three-dimensional semantic information for the scene further comprises: generating, via a query proposal network, a set of query proposals by processing at least one of the depth map or the second binary voxel grid occupancy map.
  3. The method of claim 1, wherein the at least one feature extraction network is a convolutional neural network (CNN).
  4. The method of claim 2, wherein the generating the three-dimensional semantic information based on the set of image feature maps by using the at least one transformer comprises: processing, via a first transformer, the set of image feature maps using a deformable cross-attention (DCA) mechanism in accordance with the set of query proposals to generate an updated set of query proposals.
  5. The method of claim 4, wherein the generating the three-dimensional semantic information based on the set of image feature maps by using the at least one transformer comprises: generating initial voxel features by combining the updated set of query proposals with a mask token; and processing, via a second transformer, the initial voxel features using a deformable self-attention (DSA) mechanism to generate refined voxel features.
  6. The method of claim 5, wherein the generating the three-dimensional semantic information based on the set of image feature maps by using the at least one transformer comprises: up-sampling the refined voxel features; and processing the up-sampled refined voxel features via a neural network comprising one or more fully connected layers to generate the three-dimensional semantic information.
  7. The method of claim 1, wherein the plurality of models are trained in accordance with a loss criterion defined by: $\mathcal{L} = -\sum_{k=1}^{K} \sum_{c=c_0}^{c_M} m_c \, \hat{y}_{k,c} \log\left(\frac{e^{y_{k,c}}}{\sum_{c'} e^{y_{k,c'}}}\right)$, where k is a voxel index, K is the total number of voxels, c indexes a plurality of semantic classes, y_{k,c} is the predicted logit for the k-th voxel belonging to class c, ŷ_{k,c} is the k-th element of Ŷ_t, and m_c is a weight for each class set according to the inverse of the class frequency.
  8. The method of claim 1, further comprising: capturing, via an image sensor, the one or more input images.
  9. The method of claim 8, wherein the image sensor is integrated in an autonomous vehicle, the method further comprising: performing at least one advanced driver assistance systems (ADAS) function based on the three-dimensional semantic information, wherein the at least one ADAS function includes one or more of the following: emergency braking; pedestrian detection; collision avoidance; route planning; lane departure warning; or object avoidance.
  10. The method of claim 1, wherein the at least one feature extraction network includes a convolutional neural network (CNN) configured to process the one or more images to generate the set of image feature maps, and wherein the at least one transformer includes a first transformer configured to implement a deformable cross-attention mechanism and a second transformer configured to implement a deformable self-attention mechanism.
  11. A system, comprising: a memory storing one or more input images, wherein each image of the one or more input images is a two-dimensional (2D) image of a scene; and one or more processors, connected to the memory, to: process, via a plurality of models, the one or more input images to generate three-dimensional (3D) semantic information for the scene, by: extracting a set of image feature maps from the one or more input images by at least one feature extraction network, generating a depth map by processing the one or more input images by a depth estimation network, generating 3D point cloud data based on the depth map, generating a first binary voxel grid occupancy map at a first resolution based on the 3D point cloud, converting the first binary voxel grid occupancy map at the first resolution to a second binary voxel grid occupancy map at a second resolution by a depth correction network, and generating the three-dimensional semantic information based on the set of image feature maps by at least one transformer.
  12. The system of claim 11, wherein the processing, via the plurality of models, the one or more input images to generate the 3D semantic information comprises: generating, via a query proposal network, a set of query proposals by processing at least one of the depth map or the second binary voxel grid occupancy map.
  13. The system of claim 12, wherein the processing, via the plurality of models, the one or more input images to generate the 3D semantic information comprises: processing, via a first transformer, the set of image feature maps using a deformable cross-attention (DCA) mechanism in accordance with the set of query proposals to generate an updated set of query proposals; generating initial voxel features by combining the updated set of query proposals with a mask token; processing, via a second transformer, the initial voxel features using a deformable self-attention (DSA) mechanism to generate refined voxel features; up-sampling the refined voxel features; and processing the up-sampled refined voxel features via a neural network comprising one or more fully connected layers to generate the three-dimensional semantic information.
  14. The system of claim 11, wherein the plurality of models are trained in accordance with a loss criterion defined by: $\mathcal{L} = -\sum_{k=1}^{K} \sum_{c=c_0}^{c_M} m_c \, \hat{y}_{k,c} \log\left(\frac{e^{y_{k,c}}}{\sum_{c'} e^{y_{k,c'}}}\right)$, where k is a voxel index, K is the total number of voxels, c indexes a plurality of semantic classes, y_{k,c} is the predicted logit for the k-th voxel belonging to class c, ŷ_{k,c} is the k-th element of Ŷ_t, and m_c is a weight for each class set according to the inverse of the class frequency.
  15. The system of claim 11, further comprising: an image sensor, wherein the one or more input images are captured by the image sensor.
  16. The system of claim 15, wherein the system comprises an autonomous vehicle, and wherein the autonomous vehicle performs at least one advanced driver assistance systems (ADAS) function based on the three-dimensional semantic information, wherein the at least one ADAS function includes one or more of the following: emergency braking; pedestrian detection; collision avoidance; route planning; lane departure warning; or object avoidance.
  17. A non-transitory computer-readable media storing computer instructions that, responsive to being executed by one or more processors, cause a device to perform the steps of: receiving one or more input images, wherein each image of the one or more input images is a two-dimensional (2D) image of a scene; and processing, via a plurality of models implemented by one or more processors, the one or more input images to generate three-dimensional (3D) semantic information for the scene, the processing comprising: extracting a set of image feature maps from the one or more input images by at least one feature extraction network, generating a depth map by processing the one or more input images by a depth estimation network, generating 3D point cloud data based on the depth map, generating a first binary voxel grid occupancy map at a first resolution based on the 3D point cloud, converting the first binary voxel grid occupancy map at the first resolution to a second binary voxel grid occupancy map at a second resolution by a depth correction network, and generating the three-dimensional semantic information based on the set of image feature maps by at least one transformer.
  18. The non-transitory computer-readable media of claim 17, wherein the processing, via the plurality of models implemented by the one or more processors, the one or more input images to generate the 3D semantic information comprises: generating, via a query proposal network, a set of query proposals by processing at least one of the depth map or the second binary voxel grid occupancy map; processing, via a first transformer, the set of image feature maps using a deformable cross-attention (DCA) mechanism in accordance with the set of query proposals to generate an updated set of query proposals; generating initial voxel features by combining the updated set of query proposals with a mask token; processing, via a second transformer, the initial voxel features using a deformable self-attention (DSA) mechanism to generate refined voxel features; up-sampling the refined voxel features; and processing the up-sampled refined voxel features via a neural network comprising one or more fully connected layers to generate the three-dimensional semantic information.
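For orientation, the depth-to-occupancy path recited in claims 1, 11, and 17 (depth map to 3D point cloud, to a binary voxel occupancy grid at a first resolution, to a coarser occupancy grid at a second resolution) can be illustrated with a short sketch. The pinhole back-projection, scene bounds, voxel size, and the max-pooling stand-in for the learned depth correction network below are assumptions made for illustration only, not details taken from the disclosure.

```python
# Hypothetical sketch of the depth-to-occupancy path recited in claims 1, 11, and 17.
# The pinhole intrinsics, scene bounds, and voxel size are assumed values, and the
# learned depth correction network is approximated here by simple max-pooling.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                        # keep only points with valid (positive) depth

def points_to_occupancy(pts, bounds, voxel_size):
    """Voxelize points into a binary occupancy grid over the given (lo, hi) bounds per axis."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    dims = np.ceil((hi - lo) / voxel_size).astype(int)
    idx = np.floor((pts - lo) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < dims), axis=1)
    grid = np.zeros(dims, dtype=bool)
    grid[tuple(idx[valid].T)] = True
    return grid

def downsample_occupancy(grid, factor=2):
    """Coarsen the grid; a coarse voxel is occupied if any of its children is occupied."""
    d = (np.array(grid.shape) // factor) * factor
    g = grid[:d[0], :d[1], :d[2]].reshape(
        d[0] // factor, factor, d[1] // factor, factor, d[2] // factor, factor)
    return g.any(axis=(1, 3, 5))

# Example with a synthetic depth map and assumed camera/grid parameters.
depth = np.random.uniform(1.0, 50.0, size=(64, 128))
pts = depth_to_points(depth, fx=500.0, fy=500.0, cx=64.0, cy=32.0)
occ_fine = points_to_occupancy(pts, bounds=[(-25.6, 25.6), (-2.0, 4.4), (0.0, 51.2)],
                               voxel_size=0.2)       # first-resolution occupancy map
occ_coarse = downsample_occupancy(occ_fine)          # second (lower) resolution occupancy map
```

Claims 7 and 14 recite training with a class-weighted cross-entropy over all voxels, with weights m_c derived from inverse class frequencies. A minimal sketch of that loss, assuming integer ground-truth labels per voxel (a one-hot reading of Ŷ_t) and the hypothetical helper name weighted_voxel_ce, is:

```python
# Sketch of the class-weighted cross-entropy loss in claims 7 and 14 (hypothetical helper).
import numpy as np

def weighted_voxel_ce(logits, labels, class_freq):
    """logits: (K, C) per-voxel class logits y_{k,c}; labels: (K,) ground-truth class indices;
    class_freq: (C,) class frequencies from which inverse-frequency weights m_c are derived."""
    m = 1.0 / np.maximum(class_freq, 1e-6)                              # inverse-frequency weights
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # per-voxel log-softmax
    k = np.arange(labels.shape[0])
    return -(m[labels] * log_p[k, labels]).sum()                        # -sum_k m_{c_k} log p_{k,c_k}
```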

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 63/426,497, filed on Nov. 18, 2022, which is herein incorporated by reference in its entirety.

BACKGROUND

Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This ability is central to the strong human capability for recognition and understanding; however, computer systems and, specifically, artificial intelligence systems lack this innate ability. Such an ability would be particularly beneficial in autonomous vehicle (AV) perception systems, where it supports downstream tasks such as planning and map reconstruction. Obtaining accurate and complete 3D information in the real world is difficult: the task is challenging due to limited sensing resolution and incomplete observations caused by a restricted field of view and occlusions. To tackle such tasks, semantic scene completion (SSC) has been proposed to infer the complete scene geometry and semantics from limited observations provided by sensors such as a camera and/or a depth sensor. An SSC solution must do two things well: (1) reconstruct visible areas despite limited sensor resolution; and (2) hallucinate occluded or non-visible areas.

Most existing SSC solutions consider LiDAR as an additional modality to enable accurate 3D geometric measurement. However, LiDAR sensors are expensive and less portable, while cameras are cheaper and provide richer visual cues of the driving scene. This has motivated the study of camera-based SSC solutions, first proposed in the pioneering MonoScene work described in Cao et al., "MonoScene: Monocular 3D Semantic Scene Completion," Computer Vision and Pattern Recognition, pp. 3991-4001 (2022), which is incorporated herein by reference in its entirety. MonoScene lifts 2D image inputs to 3D using dense feature projection. However, such a projection inevitably assigns 2D features of visible regions to empty or occluded voxels; for example, an empty voxel occluded by a car will still receive the car's visual features. As a result, the generated 3D features contain many ambiguities for subsequent geometric completion and semantic segmentation, resulting in unsatisfactory performance. Moreover, MonoScene requires a large number of parameters (~1.8 GB) due to the 3D convolutional neural networks (CNNs) used to process 3D features. There is a need for addressing these issues and/or other issues associated with the prior art.

SUMMARY

Embodiments of the present disclosure relate to a sparse voxel transformer for camera-based 3D semantic scene completion. Systems and methods are disclosed that infer the complete scene geometry and semantics from limited observations using a 2D camera.

In accordance with a first aspect of the present disclosure, a method is disclosed for performing semantic scene completion. The method includes: receiving one or more input images; and processing, via a plurality of models implemented by one or more processors, the one or more input images to generate three-dimensional (3D) semantic information for the scene. Each image of the one or more input images is a two-dimensional (2D) image of a scene. The plurality of models includes at least one neural network to extract a set of image feature maps from the one or more input images and at least one transformer to convert the set of image feature maps into the three-dimensional semantic information.
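The background above contrasts dense 2D-to-3D feature projection, which copies features of visible surfaces into empty or occluded voxels, with the sparse-query approach summarized here, in which only voxels that the corrected, depth-derived occupancy map marks as occupied are promoted to query proposals. The NumPy sketch below illustrates that selection step only; direct index selection stands in for the learned query proposal network described in the disclosure, and the grid size and embedding construction are assumptions for illustration.

```python
# Hypothetical sketch of sparse query-proposal selection: only coarse voxels marked
# occupied by the corrected, depth-derived occupancy map become queries, so features
# of visible surfaces are never assigned to empty or occluded voxels by construction
# (those positions are later filled by a learned mask token instead).
import numpy as np

def propose_queries(occ_coarse, voxel_embed):
    """occ_coarse: (X, Y, Z) boolean occupancy; voxel_embed: (X*Y*Z, d) per-voxel embeddings."""
    flat_idx = np.flatnonzero(occ_coarse.reshape(-1))     # indices of occupied coarse voxels
    return voxel_embed[flat_idx], flat_idx                 # (N_q, d) query features and voxel ids

# Example on an assumed 64x8x64 coarse grid with an assumed embedding dimension.
grid_shape, d_model = (64, 8, 64), 128
embed = np.random.randn(np.prod(grid_shape), d_model).astype(np.float32)
occ = np.zeros(grid_shape, dtype=bool)
occ[20:40, 1:4, 10:50] = True                              # pretend these voxels are occupied
queries, query_idx = propose_queries(occ, embed)           # sparse inputs for cross-attention
```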
In at least one embodiment of the first aspect, the processing, via the plurality of models, the one or more input images to generate three-dimensional semantic information for the scene comprises: generating a depth map, using a depth estimation network, by processing the one or more input images.

In at least one embodiment of the first aspect, the processing, via the plurality of models, the one or more input images to generate three-dimensional semantic information for the scene further comprises: generating 3D point cloud data based on the depth map; generating a first binary voxel grid occupancy map at a first resolution based on the 3D point cloud data; and converting, via a depth correction network, the first binary voxel grid occupancy map at the first resolution to a second binary voxel grid occupancy map at a second resolution that is lower than the first resolution.

In at least one embodiment of the first aspect, the processing, via the plurality of models, the one or more input images to generate three-dimensional semantic information for the scene further comprises: generating, via a query proposal network, a set of query proposals by processing at least one of the depth map or the second binary voxel grid occupancy map.

In at least one embodiment of the first aspect, the processing, via the plurality of models, the one or more input images to generate three-dimensional semantic information for the scene further comprises: processing, via a convolutional neural network (CNN),