
US-12620164-B2 - Using neural radiance fields for label efficient image processing


Abstract

A device may include one or more memories storing a frontal view image. The device may include one or more processors coupled to the one or more memories and configured to: obtain, based on the frontal view image and via an implicit field engine in a geometry pathway, a depth map; obtain, based on the frontal view image, a masked image; generate, based on the masked image and via a semantic pathway, a reconstructed image; train, based on the depth map and the reconstructed image, a model; and finetune the model using a portion of task-specific labels to obtain a finetuned model that performs semantic view mapping on input images.

Inventors

  • Nikhil GOSALA
  • Kürsat PETEK
  • Kiran BANGALORE RAVI
  • Senthil Kumar Yogamani
  • Abhinav VALADA

Assignees

  • QUALCOMM TECHNOLOGIES, INC.

Dates

Publication Date
2026-05-05
Application Date
2024-02-27

Claims (18)

  1. An apparatus to generate a semantic map from one or more images, the apparatus comprising: one or more memories configured to store the one or more images; and one or more processors coupled to the one or more memories and configured to: generate, based on the one or more images using an implicit field engine in a geometry pathway of a machine learning model, a depth map; generate, based on the one or more images, a masked image; generate, based on the masked image using a semantic pathway of the machine learning model, a reconstructed image; train, based on the depth map and the reconstructed image, the machine learning model; finetune the machine learning model using a portion of task-specific labels to obtain a finetuned model that performs semantic view mapping on input images; and reconstruct, using the masked image, a non-masked image using a temporal masked auto-encoder that uses temporal consistency by randomly masking patches of the one or more images while minimizing photometric and red-green-blue losses.
  2. The apparatus of claim 1, wherein the one or more processors are configured to train the machine learning model using unsupervised representation learning.
  3. The apparatus of claim 1, wherein the one or more processors are further configured to: generate a volumetric density of a field; and generate the depth map further based on ray casting using the volumetric density of the field.
  4. The apparatus of claim 3, wherein, to generate the depth map, the one or more processors are configured to: project image features for sampled points on the one or more images onto a two-dimensional image plane; compute a value for each projection location of a plurality of projection locations on the two-dimensional image plane using bilinear interpolation; and process, using a multi-layer perceptron (MLP), an image feature with positional encodings based on the value.
  5. The apparatus of claim 1, wherein the one or more processors are further configured to: generate, via a transformer, transformer-based features at a plurality of different scales; fuse the transformer-based features at the plurality of different scales to generate input data; and provide the input data to the implicit field engine to generate the depth map.
  6. The apparatus of claim 1, wherein the one or more processors are further configured to: reconstruct at least one non-masked image for at least one corresponding future time stamp.
  7. The apparatus of claim 6, wherein the one or more processors are further configured to: use a current masked image to reconstruct a current time-stamped image and a future time-stamped image.
  8. The apparatus of claim 1, wherein the task-specific labels are associated with at least one of depth estimation, semantic segmentation, instance retrieval, semantic scene segmentation tasks, 3D scene generation based on input two-dimensional images, or autonomous driving.
  9. The apparatus of claim 1, wherein the portion of the task-specific labels comprises less than 5% of the task-specific labels.
  10. The apparatus of claim 1, wherein the semantic map is associated with generating a bird's eye view of a scene associated with the one or more images.
  11. The apparatus of claim 1, wherein the one or more images include one or more frontal view images.
  12. The apparatus of claim 11, wherein a frontal view image of the one or more frontal view images comprises one of a monocular frontal view image, a temporal image, a spatial image, a 3D image, a 360-degree image, radar data, or LiDAR data.
  13. The apparatus of claim 1, further comprising one or more cameras configured to capture the one or more images.
  14. A method of generating a semantic map from one or more images, the method comprising: generating, based on the one or more images using an implicit field engine in a geometry pathway of a machine learning model, a depth map; generating, based on the one or more images, a masked image; generating, based on the masked image using a semantic pathway of the machine learning model, a reconstructed image; training, based on the depth map and the reconstructed image, the machine learning model; finetuning the machine learning model using a portion of task-specific labels to obtain a finetuned model that performs semantic view mapping on input images; and reconstructing, using the masked image, a non-masked image using a temporal masked auto-encoder that uses temporal consistency by randomly masking patches of the one or more images while minimizing photometric and red-green-blue losses.
  15. The method of claim 14, further comprising training the machine learning model using unsupervised representation learning.
  16. The method of claim 14, further comprising: generating a volumetric density of a field; and generating the depth map further based on ray casting using the volumetric density of the field.
  17. The method of claim 14, further comprising: generating, via a transformer, transformer-based features at a plurality of different scales; fusing the transformer-based features at the plurality of different scales to generate input data; and providing the input data to the implicit field engine to generate the depth map.
  18. A non-transitory computer-readable medium having stored thereon instructions which, when executed by one or more processors, cause the one or more processors to be configured to: generate, based on one or more images using an implicit field engine in a geometry pathway of a machine learning model, a depth map; generate, based on the one or more images, a masked image; generate, based on the masked image using a semantic pathway of the machine learning model, a reconstructed image; train, based on the depth map and the reconstructed image, the machine learning model; finetune the machine learning model using a portion of task-specific labels to obtain a finetuned model that performs semantic view mapping on input images; and reconstruct, using the masked image, a non-masked image using a temporal masked auto-encoder that uses temporal consistency by randomly masking patches of the one or more images while minimizing photometric and red-green-blue losses.
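
The claims above describe several mechanisms that can be illustrated concretely. Claims 3, 4, and 16 derive the depth map by ray casting through a volumetric density field: sampled 3D points are projected onto the two-dimensional image plane, image features at each projection location are computed by bilinear interpolation, and an MLP processes those features together with positional encodings. The sketch below shows a minimal NeRF-style depth renderer along those lines, assuming PyTorch; the class name ImplicitFieldEngine, the tensor shapes, and all hyperparameters are illustrative assumptions, not the patent's actual implementation.

    # Minimal sketch, assuming PyTorch, of NeRF-style depth rendering as
    # evoked by claims 3-4: bilinearly interpolate image features at each
    # sample's 2D projection, decode a volumetric density with an MLP over
    # features plus positional encodings, and alpha-composite an expected
    # depth per ray. All names and shapes are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImplicitFieldEngine(nn.Module):
        def __init__(self, feat_dim=64, pe_freqs=4):
            super().__init__()
            self.pe_freqs = pe_freqs
            in_dim = feat_dim + 3 * 2 * pe_freqs   # image feature + positional encoding
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, 128), nn.ReLU(),
                nn.Linear(128, 1),                 # scalar volumetric density per sample
            )

        def positional_encoding(self, x):
            # Standard sin/cos encoding of 3D sample coordinates.
            bands = 2.0 ** torch.arange(self.pe_freqs, device=x.device)
            angles = x[..., None] * bands          # (..., 3, pe_freqs)
            return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

        def forward(self, feat_map, points_3d, points_2d, z_vals):
            # feat_map:  (B, C, H, W) frontal-view image features
            # points_3d: (B, R, S, 3) S samples along each of R rays
            # points_2d: (B, R, S, 2) their projections, normalized to [-1, 1]
            # z_vals:    (B, R, S)    sample depths along each ray
            feats = F.grid_sample(feat_map, points_2d,
                                  mode='bilinear', align_corners=True)
            feats = feats.permute(0, 2, 3, 1)      # (B, R, S, C)
            pe = self.positional_encoding(points_3d)
            sigma = F.relu(self.mlp(torch.cat([feats, pe], -1))).squeeze(-1)
            # Composite the densities into one expected depth per ray.
            deltas = z_vals[..., 1:] - z_vals[..., :-1]
            deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], -1)
            alpha = 1.0 - torch.exp(-sigma * deltas)
            trans = torch.cumprod(
                torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha], -1), -1
            )[..., :-1]
            return (alpha * trans * z_vals).sum(-1)   # (B, R) depth per ray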
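Claims 1, 6, and 7 describe a temporal masked auto-encoder: patches of the current frame are randomly masked, and the model reconstructs both the current time-stamped image and a future time-stamped image while minimizing photometric and red-green-blue losses. A minimal sketch of that idea follows, again assuming PyTorch; the tiny convolutional architecture, the masking ratio, and all names are illustrative assumptions.

    # Minimal sketch, assuming PyTorch, of the temporal masked auto-encoding
    # described in claims 1, 6, and 7. Random non-overlapping patches of the
    # current frame are zeroed out; two decoder heads reconstruct the current
    # and a future frame under a per-pixel (photometric/RGB) L1 loss.
    import torch
    import torch.nn as nn

    def random_patch_mask(images, patch=16, mask_ratio=0.75):
        # images: (B, 3, H, W) with H and W divisible by `patch`.
        B, _, H, W = images.shape
        keep = (torch.rand(B, 1, H // patch, W // patch,
                           device=images.device) > mask_ratio).float()
        mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
        return images * mask, mask

    class TemporalMaskedAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            )
            # One head per time stamp: the current frame and a future frame.
            self.decode_now = nn.Conv2d(64, 3, 3, padding=1)
            self.decode_future = nn.Conv2d(64, 3, 3, padding=1)

        def forward(self, masked_now):
            z = self.encoder(masked_now)
            return self.decode_now(z), self.decode_future(z)

    def tmae_loss(model, frame_now, frame_future):
        masked, _ = random_patch_mask(frame_now)
        recon_now, recon_future = model(masked)
        # Photometric / RGB reconstruction losses over both time stamps.
        return (torch.abs(recon_now - frame_now).mean()
                + torch.abs(recon_future - frame_future).mean())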
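Finally, claims 1, 9, 14, and 18 describe a two-phase regime: pretrain the machine learning model without labels using the depth and reconstruction objectives, then finetune on a small portion (per claim 9, less than 5%) of task-specific labels. A hedged sketch of such a loop follows; the model interface (depth_loss, tmae_loss, predict_bev) is hypothetical and stands in for whatever heads the pathways above expose.

    # Minimal sketch, assuming PyTorch and a hypothetical model interface, of
    # the pretrain-then-finetune regime in claims 1, 9, 14, and 18.
    import torch
    import torch.nn.functional as F

    def pretrain(model, unlabeled_loader, optimizer, epochs=10):
        # Unsupervised phase: only unlabeled frame pairs (current, future).
        for _ in range(epochs):
            for frame_now, frame_future in unlabeled_loader:
                loss = (model.depth_loss(frame_now)                  # geometry pathway
                        + model.tmae_loss(frame_now, frame_future))  # semantic pathway
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    def finetune(model, labeled_loader, optimizer, epochs=5):
        # Label-efficient phase: labeled_loader is assumed to yield only the
        # small labeled subset (e.g., under 5% of the task-specific BEV labels).
        for _ in range(epochs):
            for image, bev_label in labeled_loader:
                pred = model.predict_bev(image)   # semantic view mapping head
                loss = F.cross_entropy(pred, bev_label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()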

Description

PRIORITY CLAIM

The present application claims priority to U.S. Provisional Application No. 63/613,686, filed on Dec. 21, 2023, the contents of which are incorporated herein by reference.

TECHNICAL FIELD

The present disclosure generally relates to processing images using machine learning systems. For example, aspects of the present disclosure relate to systems and techniques for using neural radiance fields (NeRFs) for label-efficient image processing (e.g., Bird's Eye View (BEV) semantic segmentation, or other image processing).

BACKGROUND

Various image processing techniques are valuable for many types of systems, such as driving systems (e.g., autonomous or semi-autonomous driving systems), extended reality (XR) systems, and robotics systems, among others. In one example, images can be processed to generate semantic Bird's Eye View (BEV) maps from a BEV perspective. Semantic BEV maps can be useful for driving systems (e.g., autonomous driving systems), as they offer rich, occlusion-aware information for height-agnostic applications including object tracking, collision avoidance, and motion control. In other examples, images can be processed to generate semantic segmentation masks, to perform object detection and/or tracking, among others.

In some cases, machine learning systems (e.g., neural networks) can be used to process images and generate a corresponding output (e.g., a semantic BEV map). The machine learning systems can be trained using supervised training, semi-supervised training, unsupervised training, or another type of training technique. Supervised training involves the use of labeled data, such as annotated images. Machine learning-based image processing techniques (e.g., instantaneous BEV map estimation) that do not rely on large amounts of labeled/annotated data can be crucial for the rapid deployment of certain technologies, such as driving systems (e.g., for autonomous or semi-autonomous driving vehicles), XR systems, robotics systems, etc. However, many machine learning-based image processing systems (e.g., existing BEV mapping approaches) follow a fully supervised learning paradigm and thus rely on large amounts of annotated data (e.g., annotated data in BEV). The large amount of annotated data can be arduous to obtain and can hinder the scalability of systems to novel environments.

SUMMARY

Systems and techniques are described herein for using neural radiance fields (NeRFs) for label-efficient image processing (e.g., Bird's Eye View (BEV) semantic segmentation, or other image processing). In some cases, the systems and techniques can implement an unsupervised representation learning approach to generate an output (e.g., semantic BEV maps, semantic segmentation maps from other perspectives, object detection outputs, etc.) from images, such as from a monocular frontal view (FV) image, in a label-efficient manner. Such systems and techniques can reduce the amount of manual labeling needed for training data (e.g., BEV ground-truth training data) for training models used for various tasks (e.g., autonomous or semi-autonomous driving tasks, XR tasks, robotics tasks such as navigation or scene understanding, or other tasks).

In one illustrative example, a machine learning model (e.g., a neural network) is configured on a device (e.g., a vehicle, an XR device, a robotics system, etc.). The machine learning system can receive a two-dimensional image from a camera configured on the device. The model predicts or estimates a high-definition (HD) map or a three-dimensional space (from a bird's eye view) around the device. Achieving such a task ordinarily requires a large dataset of human-labeled images. The systems and techniques described herein provide an approach that reduces the need for manual labeling when training a model to generate such output. The systems and techniques can pretrain a machine learning model (e.g., a neural network model) with images (e.g., frontal view images) such that the machine learning model learns the geometry of a scene and the semantics of the scene. Unannotated images are used to pretrain the network (in an unsupervised manner), which makes it possible to learn the required features such that a very small number of labeled BEV images can be used to train the network as part of a finetuning phase. Using such systems and techniques, the machine learning model can provide results that match current state-of-the-art models while using a small amount of annotated data (e.g., 1%, 5%, or another relatively small percentage of the available labeled data).

In some aspects, the techniques described herein relate to an apparatus to generate a semantic map from one or more images, the apparatus including: one or more memories storing the one or more images; and one or more processors coupled to the one or more memories and configured to: generate, based on the one or more images using an implicit field engine in a geometry pathway of a machine learning model, a depth map; generate, based on the one