US-20260127869-A1 - TIME-CONTINUOUS RECURRENT NEURAL NETWORKS FOR COMPUTER VISION
Abstract
An apparatus configured to perform a perception task may generate sensor features from data from one or more sensors, process the sensor features with a time-continuous recurrent neural network (RNN) to produce time-continuous features, and perform the perception task using the time-continuous features. The time-continuous features may be defined by a first feature vector value corresponding to a first observation time of the one or more sensors, a prediction of a steady state feature vector value, and estimated feature vector values between the first feature vector value and the steady state feature vector value, the estimated feature vector values being defined by a function.
Inventors
- Per Albert SIDEN
- Per Cronvall
- Gustav Nils Ture Persson
- Meysam Sadeghigooghari
- Jacob Roll
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date
- 20260507
- Application Date
- 20241104
Claims (20)
- 1 . An apparatus configured to perform a perception task, the apparatus comprising: a memory; and processing circuitry connected to the memory, the processing circuitry configured to: generate sensor features from data from one or more sensors; process the sensor features with a time-continuous recurrent neural network (RNN) to produce time-continuous features; and perform the perception task using the time-continuous features.
- 2 . The apparatus of claim 1 , wherein the time-continuous features are defined by a first feature vector value corresponding to a first observation time of the one or more sensors, a prediction of a steady state feature vector value, and estimated feature vector values between the first feature vector value and the steady state feature vector value, the estimated feature vector values being defined by a function.
- 3 . The apparatus of claim 2 , wherein the function is an exponential decay function or is defined by an ordinary differential equation.
- 4 . The apparatus of claim 2 , wherein to perform the perception task using the time-continuous features, the processing circuitry is configured to: perform the perception task using estimated feature vector values from a time after the first observation time.
- 5 . The apparatus of claim 1 , wherein the sensor features are birds-eye-view (BEV) sensor features, and wherein to generate the sensor features from the data from the one or more sensors, the processing circuitry is configured to: generate respective sensor features from the one or more sensors; and generate, using the respective sensor features, a BEV representation having the BEV sensor features.
- 6 . The apparatus of claim 5 , wherein to process the sensor features with the time-continuous RNN to produce time-continuous features, the processing circuitry is configured to: receive current BEV sensor features at a current time; receive previous BEV sensor features from a previous time; warp the previous BEV sensor features to a pose of the current BEV sensor features to create warped BEV sensor features; combine the warped BEV sensor features and the current BEV sensor features to form combined BEV sensor features; and process the combined BEV sensor features with the time-continuous RNN to form the time-continuous features.
- 7 . The apparatus of claim 6 , wherein to perform the perception task using the time-continuous features, the processing circuitry is configured to: process the time-continuous features and the current BEV sensor features using a transformer decoder.
- 8 . The apparatus of claim 1 , wherein the perception task includes one or more of semantic segmentation, semantic occupancy prediction, lane tracking, or 3D object detection.
- 9 . The apparatus of claim 1 , wherein the processing circuitry is further configured to: train the time-continuous RNN using training feature vectors from non-consecutive observation times.
- 10 . The apparatus of claim 1 , wherein the one or more sensors include one or more camera sensors, one or more sonar sensors, one or more radar sensors, or one or more LiDAR sensors, and wherein to generate the sensor features from the data from the one or more sensors, the processing circuitry is configured to: receive the data from the one or more sensors at asynchronous observation times; and generate the sensor features from the data from the one or more sensors at each of the asynchronous observation times.
- 11 . The apparatus of claim 1 , wherein the processing circuitry is part of an advanced driver assistance system (ADAS), and wherein the ADAS is configured to control a vehicle at least in part based on an output of the perception task.
- 12 . A method for performing a perception task, the method comprising: generating sensor features from data from one or more sensors; processing the sensor features with a time-continuous recurrent neural network (RNN) to produce time-continuous features; and performing the perception task using the time-continuous features.
- 13 . The method of claim 12 , wherein the time-continuous features are defined by a first feature vector value corresponding to a first observation time of the one or more sensors, a prediction of a steady state feature vector value, and estimated feature vector values between the first feature vector value and the steady state feature vector value, the estimated feature vector values being defined by a function.
- 14 . The method of claim 13 , wherein the function is an exponential decay function or is defined by an ordinary differential equation.
- 15 . The method of claim 13 , wherein performing the perception task using the time-continuous features comprises: performing the perception task using estimated feature vector values from a time after the first observation time.
- 16 . The method of claim 12 , wherein the sensor features are birds-eye-view (BEV) sensor features, and wherein generating the sensor features from the data from the one or more sensors comprises: generating respective sensor features from the one or more sensors; and generating, using the respective sensor features, a BEV representation having the BEV sensor features.
- 17 . The method of claim 16 , wherein processing the sensor features with the time-continuous RNN to produce time-continuous features comprises: receiving current BEV sensor features at a current time; receiving previous BEV sensor features from a previous time; warping the previous BEV sensor features to a pose of the current BEV sensor features to create warped BEV sensor features; combining the warped BEV sensor features and the current BEV sensor features to form combined BEV sensor features; and processing the combined BEV sensor features with the time-continuous RNN to form the time-continuous features.
- 18 . The method of claim 17 , wherein performing the perception task using the time-continuous features comprises: processing the time-continuous features and the current BEV sensor features using a transformer decoder.
- 19 . The method of claim 12 , further comprising: training the time-continuous RNN using training feature vectors from non-consecutive observation times.
- 20 . The method of claim 12 , wherein the one or more sensors include one or more camera sensors, one or more sonar sensors, one or more radar sensors, or one or more LiDAR sensors, and wherein generating the sensor features from the data from the one or more sensors comprises: receiving the data from the one or more sensors at asynchronous observation times; and generating the sensor features from the data from the one or more sensors at each of the asynchronous observation times.
Description
TECHNICAL FIELD
This disclosure relates to computer vision techniques.
BACKGROUND
Computer vision applications, including automotive applications, make use of the detection and analysis of three-dimensional (3D) objects. 3D object detection may include the identification and localization of objects in 3D space using sensors such as cameras, LiDAR, and radar. Algorithms process this data to recognize and position objects accurately, enhancing real-time situational awareness. Example computer vision tasks for automotive applications include semantic occupancy prediction, semantic segmentation, lane tracking, and 3D object detection. Semantic occupancy prediction involves predicting the presence and category of objects in a 3D space, typically represented as a grid or voxel space, helping to understand the structure and content of the environment. Semantic segmentation is the process of classifying each pixel in an image into predefined categories, enabling more precise identification and localization of different objects and regions within the image. Lane tracking involves identifying and following lane markings in images or video frames, which is important for autonomous driving systems to navigate and stay within traffic lanes accurately. 3D object detection aims to identify and localize objects within a 3D space, providing detailed information about the position, dimensions, and categories of objects in the environment.
SUMMARY
In general, this disclosure describes techniques for performing perception tasks that may be used in computer vision and automotive use cases. In particular, this disclosure describes techniques for using time-continuous recurrent neural networks (RNNs) when performing a perception task. Time-continuous RNNs differ from traditional RNNs in that time-continuous RNNs are not limited to observations at fixed-interval timepoints. Rather, time-continuous RNNs may model feature vector dynamics over time.
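As a minimal numerical sketch of such dynamics (the function name, vector values, and time constant `tau` below are illustrative assumptions, not taken from the disclosure), a feature vector can be estimated at any query time by exponential decay from an observed value toward a predicted steady state:

```python
import numpy as np

def feature_at(t, t0, h_obs, h_ss, tau):
    """Estimate the feature vector at query time t >= t0.

    h_obs is the value observed at time t0, h_ss the predicted
    long-term steady state, and tau an illustrative time constant.
    """
    return h_ss + (h_obs - h_ss) * np.exp(-(t - t0) / tau)

h_obs = np.array([1.0, -2.0])   # feature vector at the observation time
h_ss = np.array([0.2, 0.0])     # predicted long-term steady state

print(feature_at(0.0, 0.0, h_obs, h_ss, tau=0.1))  # at t0: the observed value
print(feature_at(1.0, 0.0, h_obs, h_ss, tau=0.1))  # much later: near h_ss
```

Because the estimate is defined for any `t`, features can be read out between or after observation times, which is the property the claims rely on.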
For example, a time-continuous RNN may use an exponential decay function or another function to estimate feature vector values between a start value (e.g., a feature vector value associated with an observation) and a predicted long-term steady state value at a future time. By explicitly accounting for the time between inputs, time-continuous RNNs can update their internal states smoothly across uneven intervals. This allows a time-continuous RNN to more accurately reflect the temporal dependencies in data that might not be regularly spaced, as is often the case with asynchronous sensor inputs. As such, a time-continuous RNN may more accurately represent feature vector values for systems with multiple asynchronous sensor inputs, such as automotive computer vision systems that may use multiple camera sensors, as well as other sensors such as LiDAR, radar, sonar, and others. Furthermore, a time-continuous RNN may allow for better training, as gradients can readily be generated from disparate time instances, thus allowing a time-continuous RNN to be trained using long-term temporal dependencies in the training dataset. Accordingly, the use of time-continuous RNNs as described herein may result in more accurate outputs for various perception tasks, such as semantic segmentation, semantic occupancy prediction, lane tracking, or 3D object detection.
In one example, this disclosure describes an apparatus configured to perform a perception task, the apparatus comprising a memory, and processing circuitry connected to the memory, the processing circuitry configured to generate sensor features from data from one or more sensors, process the sensor features with a time-continuous RNN to produce time-continuous features, and perform the perception task using the time-continuous features.
In another example, this disclosure describes a method for performing a perception task, the method comprising generating sensor features from data from one or more sensors, processing the sensor features with a time-continuous RNN to produce time-continuous features, and performing the perception task using the time-continuous features.
In another example, this disclosure describes a non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors of a device configured to perform a perception task to generate sensor features from data from one or more sensors, process the sensor features with a time-continuous RNN to produce time-continuous features, and perform the perception task using the time-continuous features.
In another example, this disclosure describes a device configured to perform a perception task, the device comprising means for generating sensor features from data from one or more sensors, means for processing the sensor features with a time-continuous RNN to produce time-continuous features, and means for performing the perception task using the time-continuous features.
The details of one or more examples are set forth
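The processing summarized above — sensor features arriving at irregular, asynchronous times and updating a recurrent state that decays toward a predicted steady state — can be sketched as follows. This is a minimal illustration under assumed dynamics; the weight matrices, time constant, and update rule are hypothetical and not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                      # feature dimension (illustrative)
W_ss = rng.normal(scale=0.1, size=(d, d))  # predicts the steady-state value
W_in = rng.normal(scale=0.1, size=(d, d))  # folds in a new observation
tau = 0.05                                 # assumed decay time constant

def step(h, x, dt):
    """Advance the state by elapsed time dt, then incorporate observation x."""
    h_ss = np.tanh(W_ss @ h)                   # predicted steady state
    h = h_ss + (h - h_ss) * np.exp(-dt / tau)  # exponential decay toward it
    return np.tanh(h + W_in @ x)               # update with the observation

h = np.zeros(d)
times = [0.00, 0.03, 0.11, 0.12]  # uneven timestamps, e.g. asynchronous sensors
prev = times[0]
for t in times:
    x = rng.normal(size=d)        # stand-in for per-observation sensor features
    h = step(h, x, t - prev)
    prev = t
print(h.shape)  # the time-continuous feature vector after all updates: (4,)
```

Because `dt` appears explicitly in the update, the same cell handles any spacing of observations, which is what allows training on feature vectors from non-consecutive observation times.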