
US-20260127824-A1 - SYSTEMS AND METHODS FOR GENERATING THREE-DIMENSIONAL (3D) OCCUPANCY DATA


Abstract

Systems and techniques are described herein for generating three-dimensional (3D) occupancy data. For instance, a method for generating three-dimensional (3D) occupancy data is provided. The method may include processing an image of a scene using an image encoder to generate image features; processing the image features to generate bird's-eye-view (BEV) features; generating a first 3D occupancy prediction based on the BEV features; generating a second 3D occupancy prediction based on the image features; and combining the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.
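The dual-branch pipeline described in the abstract can be sketched as follows. This is a minimal illustrative stand-in using NumPy: the pooling, tiling, and averaging operations here replace the learned image encoder, BEV view transform, and query-based decoder, and all function names and tensor shapes are hypothetical rather than taken from the publication.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image):
    # Stand-in encoder: average-pool 4x4 patches into a coarse feature map.
    h, w, c = image.shape
    return image.reshape(h // 4, 4, w // 4, 4, c).mean(axis=(1, 3))

def to_bev(image_feats, depth_bins=8):
    # Stand-in view transform: "lift" image features along hypothetical depth
    # bins, then collapse the image-height axis to form a BEV (X, Y, C) grid.
    lifted = np.stack([image_feats] * depth_bins, axis=0)  # (D, Hf, Wf, C)
    return lifted.mean(axis=1)                             # (D, Wf, C)

def bev_branch(bev_feats, z_bins=4):
    # First branch: predict 2D (BEV) occupancy logits, then convert to 3D by
    # tiling along the vertical axis (as in claims 2 and 10).
    occ2d = bev_feats.mean(axis=-1)                  # (X, Y) occupancy logits
    return np.repeat(occ2d[..., None], z_bins, -1)   # (X, Y, Z)

def query_branch(image_feats, shape3d):
    # Second branch: stand-in for a query-based 3D prediction driven directly
    # by the image features.
    return np.full(shape3d, image_feats.mean())

image = rng.random((32, 32, 3))
feats = image_encoder(image)                 # image features
bev = to_bev(feats)                          # BEV features
pred1 = bev_branch(bev)                      # first 3D occupancy prediction
pred2 = query_branch(feats, pred1.shape)     # second 3D occupancy prediction
pred3 = 0.5 * (pred1 + pred2)                # combined third 3D prediction
print(pred3.shape)
```

The averaging in the final line is one simple way to "combine" the two branch outputs; the publication does not commit the claims to any particular combination operator.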

Inventors

  • Yunxiao SHI
  • Hong Cai
  • Shizhong Steve HAN
  • Yinhao ZHU
  • Jisoo JEONG
  • Fatih Murat PORIKLI
  • Amin Ansari

Assignees

  • QUALCOMM INCORPORATED

Dates

Publication Date
2026-05-07
Application Date
2025-01-24

Claims (20)

  1. An apparatus for generating three-dimensional (3D) occupancy data, the apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: process an image of a scene using an image encoder to generate image features; process the image features to generate bird's-eye-view (BEV) features; generate a first 3D occupancy prediction based on the BEV features; generate a second 3D occupancy prediction based on the image features; and combine the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.
  2. The apparatus of claim 1, wherein, to generate the first 3D occupancy prediction, the at least one processor is configured to: process the BEV features to generate a 2D occupancy prediction; and convert the 2D occupancy prediction into the first 3D occupancy prediction.
  3. The apparatus of claim 1, wherein, to generate the second 3D occupancy prediction, the at least one processor is configured to: refine queries based on the image features to generate a 3D prediction; and convert the 3D prediction into the second 3D occupancy prediction.
  4. The apparatus of claim 3, wherein the queries are refined using a cross-attention machine-learning model.
  5. The apparatus of claim 4, wherein the queries are further refined using a self-attention machine-learning model.
  6. The apparatus of claim 1, wherein: the first 3D occupancy prediction is generated by a first branch of a machine-learning model; the second 3D occupancy prediction is generated by a second branch of the machine-learning model; and the first branch of the machine-learning model and the second branch of the machine-learning model are trained together in an end-to-end training process.
  7. The apparatus of claim 6, wherein: the first branch of the machine-learning model is trained using training data; the second branch of the machine-learning model is trained using a subset of the training data; and the first branch of the machine-learning model and the second branch of the machine-learning model are trained together using the training data.
  8. The apparatus of claim 6, wherein the at least one processor is configured to cross-attend 2D features of the first branch with 3D features of the second branch to generate combined features, wherein the first 3D occupancy prediction is further based on the combined features.
  9. A method for generating three-dimensional (3D) occupancy data, the method comprising: processing an image of a scene using an image encoder to generate image features; processing the image features to generate bird's-eye-view (BEV) features; generating a first 3D occupancy prediction based on the BEV features; generating a second 3D occupancy prediction based on the image features; and combining the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.
  10. The method of claim 9, wherein generating the first 3D occupancy prediction comprises: processing the BEV features to generate a 2D occupancy prediction; and converting the 2D occupancy prediction into the first 3D occupancy prediction.
  11. The method of claim 9, wherein generating the second 3D occupancy prediction comprises: refining queries based on the image features to generate a 3D prediction; and converting the 3D prediction into the second 3D occupancy prediction.
  12. The method of claim 11, wherein the queries are refined using a cross-attention machine-learning model.
  13. The method of claim 12, wherein the queries are further refined using a self-attention machine-learning model.
  14. The method of claim 9, wherein: the first 3D occupancy prediction is generated by a first branch of a machine-learning model; the second 3D occupancy prediction is generated by a second branch of the machine-learning model; and the first branch of the machine-learning model and the second branch of the machine-learning model are trained together in an end-to-end training process.
  15. The method of claim 14, wherein: the first branch of the machine-learning model is trained using training data; the second branch of the machine-learning model is trained using a subset of the training data; and the first branch of the machine-learning model and the second branch of the machine-learning model are trained together using the training data.
  16. The method of claim 14, further comprising cross-attending 2D features of the first branch with 3D features of the second branch to generate combined features, wherein the first 3D occupancy prediction is further based on the combined features.
  17. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to: process an image of a scene using an image encoder to generate image features; process the image features to generate bird's-eye-view (BEV) features; generate a first 3D occupancy prediction based on the BEV features; generate a second 3D occupancy prediction based on the image features; and combine the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.
  18. The non-transitory computer-readable storage medium of claim 17, wherein, to generate the first 3D occupancy prediction, the instructions, when executed by at least one processor, cause the at least one processor to: process the BEV features to generate a 2D occupancy prediction; and convert the 2D occupancy prediction into the first 3D occupancy prediction.
  19. The non-transitory computer-readable storage medium of claim 17, wherein, to generate the second 3D occupancy prediction, the instructions, when executed by at least one processor, cause the at least one processor to: refine queries based on the image features to generate a 3D prediction; and convert the 3D prediction into the second 3D occupancy prediction.
  20. The non-transitory computer-readable storage medium of claim 19, wherein the queries are refined using a cross-attention machine-learning model.
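Claims 3-5 (and 11-13) describe refining queries against the image features using cross-attention, optionally preceded by self-attention among the queries. The refinement loop can be sketched as below, assuming single-head scaled dot-product attention; learned projection matrices, layer normalization, and feed-forward sublayers are omitted, and all names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def refine_queries(queries, image_feats, n_layers=2):
    # Each layer: self-attention among the queries, then cross-attention
    # from the queries to the image features, with residual connections.
    for _ in range(n_layers):
        queries = queries + attention(queries, queries, queries)
        queries = queries + attention(queries, image_feats, image_feats)
    return queries

rng = np.random.default_rng(1)
queries = rng.standard_normal((16, 32))      # 16 3D/voxel queries, dim 32
image_feats = rng.standard_normal((64, 32))  # flattened image features
refined = refine_queries(queries, image_feats)
print(refined.shape)
```

The refined queries would then be decoded into the second 3D occupancy prediction; that decoding step is not shown here.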

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/717,863, filed Nov. 7, 2024, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure generally relates to three-dimensional (3D) occupancy data. For example, aspects of the present disclosure include systems and techniques for generating 3D occupancy data.

BACKGROUND

Many devices include one or more cameras. For example, a vehicle may include cameras facing one or more directions away from the vehicle. A camera can capture images using an image sensor, which can include an array of photodetectors. Some devices can analyze image data captured by an image sensor to detect an object within the image data. Object detections based on perception data (such as images from a camera) may inform a driving system (e.g., an autonomous, semi-autonomous, or assisted driving system, such as an advanced driver assistance system (ADAS)) of what area is drivable and what objects (e.g., road users, other vehicles, bikes, pedestrians, etc.) are present and/or moving in the environment around the vehicle. The driving system then makes decisions about how to move (e.g., slowing, speeding up, stopping, changing lanes, turning, choosing a path, etc.) based on the object detections, such as drivable areas and/or detected objects.

SUMMARY

The following presents a simplified summary relating to one or more aspects disclosed herein. Thus, the following summary should not be considered an extensive overview relating to all contemplated aspects, nor should the following summary be considered to identify key or critical elements relating to all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the following summary presents certain concepts relating to one or more aspects of the mechanisms disclosed herein in a simplified form to precede the detailed description presented below.
Systems and techniques are described for generating three-dimensional (3D) occupancy data. According to at least one example, a method is provided for generating three-dimensional (3D) occupancy data. The method includes: processing an image of a scene using an image encoder to generate image features; processing the image features to generate bird's-eye-view (BEV) features; generating a first 3D occupancy prediction based on the BEV features; generating a second 3D occupancy prediction based on the image features; and combining the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.

In another example, an apparatus for generating three-dimensional (3D) occupancy data is provided that includes at least one memory and at least one processor (e.g., configured in circuitry) coupled to the at least one memory. The at least one processor is configured to: process an image of a scene using an image encoder to generate image features; process the image features to generate bird's-eye-view (BEV) features; generate a first 3D occupancy prediction based on the BEV features; generate a second 3D occupancy prediction based on the image features; and combine the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.

In another example, a non-transitory computer-readable medium is provided that has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: process an image of a scene using an image encoder to generate image features; process the image features to generate bird's-eye-view (BEV) features; generate a first 3D occupancy prediction based on the BEV features; generate a second 3D occupancy prediction based on the image features; and combine the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.
In another example, an apparatus for generating three-dimensional (3D) occupancy data is provided. The apparatus includes: means for processing an image of a scene using an image encoder to generate image features; means for processing the image features to generate bird's-eye-view (BEV) features; means for generating a first 3D occupancy prediction based on the BEV features; means for generating a second 3D occupancy prediction based on the image features; and means for combining the first 3D occupancy prediction and the second 3D occupancy prediction to generate a third 3D occupancy prediction.

In some aspects, one or more of the apparatuses described herein is, can be part of, or can include an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a vehicle (or a computing device, system, or component of a vehicle), a mobile device (e.g., a mobile telephone or so-called "smart phone", a tablet computer, or other type of mobile device), a smart or connected device (e.g.,