US-12625926-B2 - Temporal-based perception
Abstract
In various examples, temporal-based perception for autonomous or semi-autonomous systems and applications is described. Systems and methods are disclosed that use a machine learning model (MLM) to intrinsically fuse feature maps associated with different sensors and different instances in time. To generate a feature map, image data generated using image sensors (e.g., cameras) located around a vehicle are processed using an MLM that is trained to generate the feature map. The MLM may then fuse the feature maps in order to generate a final feature map associated with a current instance in time. The feature maps associated with the previous instances in time may be preprocessed using one or more layers of the MLM, where the one or more layers are associated with performing temporal transformation before the fusion is performed. The MLM may then use the final feature map to generate one or more outputs.
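To make the data flow in the abstract concrete, below is a minimal PyTorch sketch of the described pipeline: per-camera features are encoded and combined, stored fused maps from earlier times are passed through a temporal-transformation layer, and everything is fused into a single temporally informed map that feeds an output head. All class names, layer choices, and shapes are hypothetical placeholders, not the patented implementation.

```python
# Minimal sketch of the temporal fusion loop described in the abstract.
# All module and parameter names are hypothetical stand-ins.
import torch
import torch.nn as nn

class TemporalPerception(nn.Module):
    def __init__(self, channels=64, history=3):
        super().__init__()
        self.encoder = nn.Conv2d(3, channels, 3, padding=1)          # stands in for a per-camera backbone
        self.temporal_transform = nn.Conv2d(channels, channels, 1)   # layers that re-align stored past maps
        self.fuse = nn.Conv2d(channels * (history + 1), channels, 1) # temporal fusion of past + current maps
        self.head = nn.Conv2d(channels, 1, 1)                        # stands in for the task head(s)
        self.history = history

    def forward(self, images, past_maps):
        # Encode the current multi-camera images and average them into one fused map.
        current = torch.stack([self.encoder(img) for img in images]).mean(dim=0)
        # Temporally transform the stored past fused maps before fusion.
        aligned = [self.temporal_transform(m) for m in past_maps]
        fused = self.fuse(torch.cat([current] + aligned, dim=1))
        return self.head(fused), current  # output, plus the new map to store

model = TemporalPerception()
imgs = [torch.randn(1, 3, 32, 32) for _ in range(6)]   # six surround cameras
past = [torch.randn(1, 64, 32, 32) for _ in range(3)]  # three stored fused maps
outputs, new_map = model(imgs, past)
```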
Inventors
- Jiwoong Choi
- Jose Manuel Alvarez Lopez
- Shiyi Lan
- Yashar Asgarieh
- Zhiding Yu
Assignees
- NVIDIA Corporation
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-03-16
Claims (20)
- 1. A method comprising: generating, using one or more machine learning models and based at least on one or more first fused feature maps associated with one or more first times, one or more second fused feature maps by at least temporally transforming the one or more first fused feature maps; generating, using the one or more machine learning models and based at least on image data representative of a plurality of images obtained using a plurality of sensors of a machine, a third fused feature map associated with a second time that is after the one or more first times; generating, using the one or more machine learning models and based at least on the third fused feature map and the one or more second fused feature maps, a temporally fused feature map; generating, based at least on the temporally fused feature map, one or more outputs; and performing one or more operations by the machine based at least on the one or more outputs.
- 2. The method of claim 1, wherein: the generating the one or more second fused feature maps uses one or more layers of the one or more machine learning models that are associated with a temporal transformation; and the method further comprises storing the one or more second fused feature maps in a memory.
- 3. The method of claim 1, wherein the generating the third fused feature map is performed at least partially in parallel with the generating the one or more second fused feature maps.
- 4. The method of claim 1, wherein the generating the third fused feature map comprises: generating, using the one or more machine learning models and based at least on the image data, one or more feature maps; and generating, using the one or more machine learning models, the third fused feature map based at least on aggregating the one or more feature maps. [A sketch of this aggregation step appears after the claims.]
- 5. The method of claim 1, further comprising: storing the one or more second fused feature maps in a memory; and based at least on the generating the third fused feature map: removing at least a second fused feature map of the one or more second fused feature maps from the memory; and storing the third fused feature map in the memory. [A sketch of this memory behavior appears after the claims.]
- 6. The method of claim 1, further comprising: generating, using the one or more machine learning models and based at least on second image data representative of a second plurality of images, a fourth fused feature map associated with a third time that is after the second time; generating, using the one or more machine learning models and based at least on the fourth fused feature map, the third fused feature map, and at least one second fused feature map of the one or more second fused feature maps, a second temporally fused feature map; and generating, based at least on the second temporally fused feature map, one or more second outputs. [A sketch of this rolling update appears after the claims.]
- 7. The method of claim 1, wherein: the plurality of images are associated with a two-dimensional coordinate system; and the one or more second fused feature maps, the third fused feature map, and the temporally fused feature map are associated with a three-dimensional coordinate system. [A sketch of this coordinate relationship appears after the claims.]
- 8. The method of claim 1, wherein the generating the one or more outputs comprises generating, based at least on the temporally fused feature map, at least one of: a first output indicating a free-space within an environment; a second output indicating one or more locations of one or more objects depicted by the plurality of images; a third output corresponding to a semantic segmentation mask; a fourth output corresponding to an instance segmentation mask; a fifth output representing a map indicating the one or more locations of the one or more objects; or a sixth output representing parking information. [A multi-head sketch of these outputs appears after the claims.]
- 9. A system comprising: one or more processors to: determine, using one or more machine learning models and based at least on one or more first fused feature maps associated with one or more first times, one or more second fused feature maps by at least temporally transforming the one or more first fused feature maps; determine, using the one or more machine learning models and based at least on image data representative of one or more images obtained using one or more sensors of a machine, a third fused feature map associated with a second time that is after the one or more first times; determine, using the one or more machine learning models and based at least on the third fused feature map and the one or more second fused feature maps, a temporally fused feature map; determine, based at least on the temporally fused feature map, one or more outputs; and perform one or more operations by the machine based at least on the one or more outputs.
- 10. The system of claim 9, wherein: the one or more second fused feature maps are determined using one or more layers of the one or more machine learning models that are associated with a temporal transformation; and the one or more processors are further to store the one or more second fused feature maps in a memory.
- 11. The system of claim 9, wherein the third fused feature map is determined at least partially in parallel with the one or more second fused feature maps being determined.
- 12. The system of claim 9, wherein the third fused feature map is determined, at least, by: generating, using the one or more machine learning models and based at least on the image data, one or more feature maps; and generating, using the one or more machine learning models, the third fused feature map based at least on aggregating the one or more feature maps.
- 13. The system of claim 9, wherein the one or more processors are further to: store the one or more second fused feature maps in a memory; and based at least on the third fused feature map being determined: remove at least a second fused feature map of the one or more second fused feature maps from the memory; and store the third fused feature map in the memory.
- 14. The system of claim 9, wherein the one or more processors are further to: determine, using the one or more machine learning models and based at least on second image data representative of one or more second images, a fourth fused feature map associated with a third time that is after the second time; determine, using the one or more machine learning models and based at least on the fourth fused feature map, the third fused feature map, and at least one second fused feature map of the one or more second fused feature maps, a second temporally fused feature map; and determine, based at least on the second temporally fused feature map, one or more second outputs.
- 15. The system of claim 9, wherein the one or more images are associated with a two-dimensional coordinate system; and the one or more second fused feature maps, the third fused feature map, and the temporally fused feature map are associated with a three-dimensional coordinate system.
- 16. The system of claim 9, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system for generating synthetic data; a system implementing one or more large language models (LLMs); a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
- 17. One or more processors comprising processing circuitry to: determine, using one or more machine learning models and based at least on image data representative of one or more images obtained using one or more sensors of a machine, a first fused feature map associated with a first time, the one or more images associated with a two-dimensional coordinate system; determine, using the one or more machine learning models and based at least on the first fused feature map and one or more second fused feature maps associated with one or more second times prior to the first time, a temporally fused feature map associated with a three-dimensional coordinate system; determine, based at least on the temporally fused feature map, one or more outputs; and perform one or more operations by the machine based at least on the one or more outputs.
- 18. The one or more processors of claim 17, wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system for generating synthetic data; a system implementing one or more large language models (LLMs); a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
- 19. The one or more processors of claim 17, wherein the one or more processors are further to determine, using one or more layers of the one or more machine learning models and based at least on one or more third fused feature maps, the one or more second fused feature maps by temporally transforming the one or more third fused feature maps.
- 20. The one or more processors of claim 19, wherein the first fused feature map is determined at least partially in parallel with the one or more second fused feature maps being determined.
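Claims 4 and 12 split fused-map generation into two steps: producing per-camera feature maps, then aggregating them into one fused map. The claims do not name an aggregation operator, so the sketch below assumes simple channel-wise concatenation followed by a 1x1 convolution; `aggregate` and `mixer` are hypothetical names.

```python
# Hypothetical aggregation step for claims 4/12: per-camera feature maps are
# concatenated along the channel axis and mixed with a 1x1 convolution.
import torch
import torch.nn as nn

def aggregate(feature_maps, mixer):
    # feature_maps: list of (B, C, H, W) tensors, one per camera.
    stacked = torch.cat(feature_maps, dim=1)  # (B, N*C, H, W)
    return mixer(stacked)                     # (B, C, H, W)

num_cams, channels = 6, 64
mixer = nn.Conv2d(num_cams * channels, channels, kernel_size=1)
maps = [torch.randn(2, channels, 32, 32) for _ in range(num_cams)]
fused = aggregate(maps, mixer)  # the "third fused feature map" of the claim
```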
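Claims 5 and 13 describe a memory that evicts at least one older fused feature map once the new one is generated and stored. A minimal sketch of that eviction behavior, assuming a fixed-capacity first-in-first-out buffer (the `FeatureMapMemory` class and its capacity are hypothetical choices):

```python
# Sketch of the memory behavior in claims 5/13: a fixed-capacity buffer that
# drops the oldest stored fused feature map when a new one is added.
from collections import deque

class FeatureMapMemory:
    def __init__(self, capacity=3):
        self.maps = deque(maxlen=capacity)  # oldest entry is evicted automatically

    def store(self, fused_map):
        self.maps.append(fused_map)

    def recall(self):
        return list(self.maps)

memory = FeatureMapMemory(capacity=3)
for t in range(5):      # after 5 steps only maps 2, 3, 4 remain
    memory.store(f"fused_map_t{t}")
print(memory.recall())  # ['fused_map_t2', 'fused_map_t3', 'fused_map_t4']
```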
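Claims 6 and 14 extend the same procedure one timestep further: the newest fused map is fused with the previous one and with at least one stored earlier map, so each new map joins the history used at the next step. A toy sketch of that rolling pattern (real fusion would operate on tensors, not strings):

```python
# Rolling update across timesteps (claims 6/14): each new fused map is
# combined with stored earlier maps, then becomes part of the history itself.
from collections import deque

history = deque(maxlen=2)  # holds the two most recent fused maps

def step(new_fused_map):
    temporally_fused = (new_fused_map, tuple(history))  # stand-in for real fusion
    history.append(new_fused_map)
    return temporally_fused

for t in ["t1", "t2", "t3"]:
    print(step(f"map_{t}"))
# ('map_t1', ()) -> ('map_t2', ('map_t1',)) -> ('map_t3', ('map_t1', 'map_t2'))
```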
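Claims 7 and 15 draw a coordinate-system boundary: the input images live on a 2D pixel grid, while the fused and temporally fused maps live in a 3D coordinate system (for example, a bird's-eye-view grid). The sketch below illustrates only the geometric relationship, assuming a standard pinhole camera model; the intrinsics and sample points are invented for illustration.

```python
# Hypothetical illustration of the 2D/3D split in claims 7/15: 3D grid cells
# are projected into the image with a pinhole model, showing where each 3D
# location would gather its 2D image features from.
import torch

def project_points(points_3d, intrinsics):
    # points_3d: (N, 3) camera-frame points with z > 0; intrinsics: (3, 3).
    uvw = points_3d @ intrinsics.T   # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # (N, 2) pixel coordinates

K = torch.tensor([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
cells = torch.tensor([[2.0, 0.0, 10.0],  # a few 3D grid cells in camera frame
                      [-1.0, 0.5, 5.0]])
pixels = project_points(cells, K)        # where each cell looks in the image
print(pixels)
```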
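Claim 8 lists several outputs that may be decoded from the single temporally fused map. A hedged sketch of that multi-head arrangement, with hypothetical head shapes (one channel for free-space, seven per-cell box parameters for detection, a class map for semantic segmentation):

```python
# Sketch of multi-task heads over one temporally fused feature map (claim 8).
# Head designs are hypothetical stand-ins for the listed outputs.
import torch
import torch.nn as nn

class PerceptionHeads(nn.Module):
    def __init__(self, channels=64, num_classes=10):
        super().__init__()
        self.free_space = nn.Conv2d(channels, 1, 1)           # drivable free-space mask
        self.detection = nn.Conv2d(channels, 7, 1)            # per-cell box parameters
        self.semantic = nn.Conv2d(channels, num_classes, 1)   # semantic segmentation

    def forward(self, fused_map):
        return {
            "free_space": torch.sigmoid(self.free_space(fused_map)),
            "detection": self.detection(fused_map),
            "semantic": self.semantic(fused_map),
        }

heads = PerceptionHeads()
out = heads(torch.randn(1, 64, 50, 50))
print({k: tuple(v.shape) for k, v in out.items()})
```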
Description
BACKGROUND

Autonomous and semi-autonomous vehicles and machines often employ machine learning models—such as deep neural networks (DNNs)—to perceive and reason about the surrounding environment. For instance, an autonomous or semi-autonomous vehicle or machine may generate sensor data using sensors located around the vehicle or machine, and may then process the sensor data using a machine learning model(s) that is trained to output information in two-dimensional (2D) space (e.g., image space) and/or three-dimensional (3D) space associated with an environment in which the autonomous or semi-autonomous vehicle or machine is navigating. For instance, the output may indicate the locations of detected objects, a drivable free-space region in which the vehicle or machine may navigate, a parking region in which the vehicle or machine may park, and/or the like. The autonomous or semi-autonomous vehicle or machine may then use the output to perform various operations—such as planning, control, navigation, or actuation operations.

In order to improve the accuracy of these machine learning models, the machine learning models may use temporal information. For instance, for 3D perception, systems employing these machine learning models may rely on temporal fusion of outputs from previous timestamps and a current timestamp to generate a temporal understanding of the environment. For example, depth-based approaches perform aggregation by naively aligning and concatenating the features from multiple timestamps. Additionally, query-based approaches perform aggregation by sampling previous features using queries from the current frames. However, processing the previous frames and fusing them with the current frame may require a large amount of computing resources and may introduce latency into the system—thereby making these processes less suitable for real-time or near real-time applications. As such, these conventional approaches may rely on a limited number of frames and may use naïve fusion methods.

SUMMARY

Embodiments of the present disclosure relate to temporal-based perception for autonomous or semi-autonomous systems and applications. Systems and methods are disclosed that—internal to a machine learning model or DNN—fuse feature maps associated with different instances in time. To generate a feature map, image data (or more generally, sensor data) generated using image sensors (e.g., cameras) and/or other sensor modalities (e.g., RADAR, LiDAR, etc.) located around a vehicle or machine are processed using a machine learning model(s)—such as a DNN—that is trained to generate the feature map. The machine learning model(s) may then fuse the feature maps—using one or more layers of the machine learning model(s)—in order to generate a final, temporally informed, feature map associated with a current instance in time. In some examples, to improve the fusion processes, the feature maps associated with the previous instances in time are preprocessed—e.g., in parallel with the processing of the current feature map to reduce latency—using one or more layers of the machine learning model(s), where the one or more layers are associated with performing temporal transformation before the fusion is performed (e.g., before the feature maps are stored in memory).
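As a concrete illustration of the parallelism described above, the sketch below temporally transforms stored past maps on a worker thread while the current frame is encoded on the main thread; a production system might instead overlap these on separate CUDA streams. All modules here are stand-ins, not the disclosed network.

```python
# Hedged sketch of the latency-hiding idea in the summary: stored past maps
# are temporally transformed while the current frame is being encoded.
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 64, 3, padding=1)   # current-frame path
temporal_transform = nn.Conv2d(64, 64, 1)  # past-map path

def transform_past(past_maps):
    return [temporal_transform(m) for m in past_maps]

past = [torch.randn(1, 64, 32, 32) for _ in range(3)]
frame = torch.randn(1, 3, 32, 32)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(transform_past, past)  # past maps, on a worker thread
    current = encoder(frame)                    # current frame, main thread
    aligned = future.result()

fused = torch.cat([current] + aligned, dim=1)   # ready for temporal fusion
```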
The machine learning model(s) may then use the final feature map to generate one or more outputs—such as object detection outputs, drivable free-space outputs, semantic or instance segmentation outputs, etc.—associated with the environment surrounding the vehicle or machine.

In contrast to conventional systems, such as those described above, the current systems, in some embodiments, perform one or more processes to improve the temporal fusion of the machine learning model(s). For example, the current systems may process the feature maps associated with the previous instances in time using the one or more layers that are associated with temporal transformation before the feature maps are fused with the feature map associated with the current instance in time. This may improve the fusion processes by better transforming the previous feature maps to a coordinate frame of the current feature map before the feature maps are fused together. Additionally, in contrast to the conventional systems, the current systems, in some embodiments, perform various processes in parallel in order to reduce the overall latency. For example, the current systems may, in parallel, process one or more previous feature maps using the one or more layers associated with temporal transformation while also processing the current feature map using aggregation and/or fusion in order to generate the final feature map. In this way, the final feature map provides a temporal understanding of the surrounding environment that is suitable for real-time or near real-time applications.

BRIEF DESCRIPTION OF THE DRAWINGS

The present systems and methods for temporal-based perception in au