WO-2026096227-A1 - RESOLVING OBJECT DETECTION OUTPUTS FROM MULTIPLE OBJECT DETECTION PIPELINES
Abstract
A method includes generating, by a first object detection model of a perception system of an autonomous vehicle and based on a first portion of sensor data describing a first portion of an environment of the autonomous vehicle, a first object detection output that comprises a first predicted class for an object; generating, by a second object detection model of the perception system and based on a second portion of sensor data describing a second portion of the environment, a second object detection output that comprises a predicted distribution over candidate classes for the object; adapting at least a portion of the first object detection output into an adapted representation of the first object detection output; providing the adapted representation in a shared prediction output space with the predicted distribution from the second object detection output; and generating a resolved object detection output based on the adapted representation and the predicted distribution.
Inventors
- GARDNER, Rachel Lyn
- HUANG, Yanda
Assignees
- AURORA OPERATIONS, INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20251017
- Priority Date
- 20241101
Claims (15)
- 1. A computer-implemented method, comprising: generating, by a first object detection model of a perception system of an autonomous vehicle and based on a first portion of sensor data describing a first portion of an environment of the autonomous vehicle, a first object detection output that comprises a first predicted class for an object; generating, by a second object detection model of the perception system and based on a second portion of sensor data describing a second portion of the environment, a second object detection output that comprises a predicted distribution over candidate classes for the object; adapting, by a prediction resolution model, at least a portion of the first object detection output into an adapted representation of the first object detection output; providing the adapted representation in a shared prediction output space with the predicted distribution from the second object detection output; and generating, by the prediction resolution model, a resolved object detection output based on the adapted representation and the predicted distribution, wherein the resolved object detection output indicates a second predicted class for the object.
- 2. The computer-implemented method of claim 1, comprising: querying a data structure comprising one or more precomputed logit estimation tensors using the first predicted class; and retrieving a precomputed logit tensor associated with the first predicted class and the first object detection model, wherein the precomputed logit tensor indicates an estimated distribution over candidate classes.
- 3. The computer-implemented method of claim 1, comprising: transforming, by a calibration model configured to adapt outputs from the first object detection model into the shared prediction output space, the portion of the first object detection output into the adapted representation of the first object detection output.
- 4. The computer-implemented method of claim 1, wherein: the first object detection output comprises first update data for an object track stored by the perception system to track an object in the environment, wherein the first update data indicates a first update to the object track; and the second object detection output comprises second update data for the object track, wherein the second update data indicates a second update to the object track.
- 5. The computer-implemented method of claim 1, wherein the first portion of sensor data comprises: a modality of data not present in the second portion of sensor data; and/or data describing a different field of view of the environment as compared to the second portion of sensor data.
- 6. The computer-implemented method of claim 1, wherein the prediction resolution model executes periodically to resolve conflicts between object detection outputs from the first object detection model and the second object detection model.
- 7. The computer-implemented method of claim 1, wherein the first object detection model generates the first object detection output at a first time, and wherein the second object detection model generates the second object detection output at a second time, and optionally wherein one or more output layers of the prediction resolution model discount a contribution of the first object detection model based on the first time.
- 8. The computer-implemented method of claim 1, wherein one or more output layers of the prediction resolution model are optimized using a global optimizer over a single batch of ground truth examples.
- 9. The computer-implemented method of claim 1, wherein one or more output layers of the prediction resolution model are optimized using a non-uniformly downsampled batch of a dataset of ground truth examples, wherein a ratio of a number of a respective category of examples in the batch to a number of the respective category in the dataset is inversely correlated with an error rate associated with the respective category.
- 10. The computer-implemented method of claim 3, comprising: transforming, by a calibration model configured to adapt outputs from the second object detection model, the predicted distribution into a second adapted representation; wherein the resolved object detection output is based on a linear combination of the adapted representation and the second adapted representation.
- 11. A computer-implemented method, comprising: generating, at a first time, by a first object detection model of a perception system of an autonomous vehicle and based on a first portion of sensor data describing a first portion of an environment of the autonomous vehicle, a first object detection output; generating, at a second time, by a second object detection model of the perception system and based on a second portion of sensor data describing a second portion of the environment, a second object detection output; and generating, by a prediction resolution model and based on the first object detection output and the second object detection output in a shared prediction output space, a resolved object detection output, wherein the prediction resolution model executes periodically to resolve conflicts between object detection outputs from the first object detection model and the second object detection model.
- 12. The computer-implemented method of claim 11, wherein: the first object detection output comprises first update data for an object track stored by the perception system to track an object in the environment, wherein the first update data indicates a first update to the object track; and the second object detection output comprises second update data for the object track, wherein the second update data indicates a second update to the object track.
- 13. The computer-implemented method of claim 11, comprising: processing, by the prediction resolution model, a predicted class from the first object detection output; processing, by the prediction resolution model, a predicted distribution over candidate classes from the second object detection output; adapting, by the prediction resolution model, at least a portion of the first object detection output into an adapted representation of the first object detection output; providing the adapted representation in a shared prediction output space with the predicted distribution from the second object detection output; and generating, by the prediction resolution model, the resolved object detection output based on the adapted representation and the predicted distribution.
- 14. A computer-implemented method, comprising: generating, by a first object detection model of a perception system of an autonomous vehicle and based on a first portion of sensor data describing a first portion of an environment of the autonomous vehicle, a first object detection output that comprises first update data for an object track stored by the perception system to track an object in the environment, wherein the first update data indicates a first update to the object track; generating, by a second object detection model of the perception system and based on a second portion of sensor data describing a second portion of the environment, a second object detection output that comprises second update data for the object track, wherein the second update data indicates a second update to the object track, and optionally wherein the first portion of sensor data comprises data describing a different field of view of the environment as compared to the second portion of sensor data; and generating, by a prediction resolution model and based on the first update data and the second update data in a shared prediction output space, a resolved object detection output that comprises a resolved update for the object track.
- 15. The computer-implemented method of claim 14, wherein the prediction resolution model executes periodically to resolve conflicts between object detection outputs from the first object detection model and the second object detection model.
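The non-uniform downsampling recited in claim 9 can be illustrated with a brief sketch: each category's batch-to-dataset ratio is made inversely correlated with that category's error rate. The concrete ratio rule (`1 - error_rate`), the category names, and the example data below are illustrative assumptions, not the claimed implementation.

```python
from collections import defaultdict

def downsample(dataset, error_rates):
    """Build a non-uniformly downsampled batch from a ground-truth dataset.

    dataset: list of (category, example) pairs.
    error_rates: mapping of category -> error rate in [0, 1).
    The per-category keep ratio (1 - error_rate) is an assumed rule chosen so
    that the batch/dataset ratio is inversely correlated with the error rate.
    """
    by_cat = defaultdict(list)
    for category, example in dataset:
        by_cat[category].append(example)

    batch = []
    for category, examples in by_cat.items():
        ratio = 1.0 - error_rates[category]  # higher error rate -> smaller ratio
        keep = round(ratio * len(examples))
        batch.extend((category, ex) for ex in examples[:keep])
    return batch

# Hypothetical dataset: 10 vehicles (low error rate), 10 cones (high error rate).
dataset = [("vehicle", i) for i in range(10)] + [("cone", i) for i in range(10)]
batch = downsample(dataset, {"vehicle": 0.1, "cone": 0.5})
```

Under these assumed error rates, the low-error "vehicle" category retains 9 of 10 examples while the high-error "cone" category retains only 5 of 10.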
Description
PCT/US25/51543 17 October 2025 (17.10.2025) RESOLVING OBJECT DETECTION OUTPUTS FROM MULTIPLE OBJECT DETECTION PIPELINES

PRIORITY

[0001] This application claims priority to United States Patent Application no. 18/934,680, filed November 1, 2024. United States Patent Application no. 18/934,680 is hereby incorporated by reference herein in its entirety.

BACKGROUND

[0002] An autonomous platform can process data to perceive an environment through which the autonomous platform travels. For example, an autonomous vehicle can perceive its environment using a variety of sensors and identify objects around the autonomous vehicle. The autonomous vehicle can identify an appropriate path through the perceived surrounding environment and navigate along the path with minimal or no human input.

SUMMARY

[0003] Example implementations of the present disclosure provide for object detection system architectures and training techniques that improve the ability of an autonomous vehicle to navigate in dynamic real-world environments. In an example aspect, a perception system architecture may include multiple object detection pipelines. For instance, an example perception system architecture may use distinct object detection models for different perception tasks (e.g., long-range perception; single-modality perception; multimodal perception; etc.). Each object detection pipeline may generate object detection outputs based on the same or different portions of sensor data, and each object detection model may operate with the same or different timing (e.g., sweep or other cycle time). Each object detection model may be specifically adapted for a particular task within its respective pipeline. An example perception system architecture may include a prediction resolution model that ingests individual detection outputs from the different pipelines to generate an overall detection output.
In this manner, for instance, the prediction resolution model may leverage the strengths of each respective model executing over its respective sensor inputs at its respective operating frequency to obtain a unified understanding of the environment.

[0004] In an example, an object tracking system of the perception system may store object tracks. An object track may record a category of an object and movement of the object within an environment over time. An example object track stores a series of keypoints indicating the current and one or more past locations of the object within the environment. The object tracking system may maintain a current representation of an object track by ingesting updates from multiple different object detection pipelines. For instance, one object detection pipeline may focus on obtaining long-range detections. Another object detection pipeline may focus on matching sensor data to existing tracks and generating updates to those tracks. Another object detection pipeline may focus on image data, while another may focus on multimodal or LIDAR-only data. Each pipeline may execute different models at different frequencies, which may be set based on an availability of new sensor data (e.g., LIDAR sweeps) or a latency demand of various downstream subroutines. Each pipeline may publish updates to an object track for a tracked object, or proposals for new object tracks.

[0005] A prediction resolution model may execute asynchronously with the respective component models to update the object tracks based on the full scope of available information. In this manner, for instance, a prediction resolution model may effectively facilitate "voting" among different object detection expert models without disrupting or blocking execution of the different pipelines on their respective tasks.
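The "voting" behavior described above can be sketched as follows: a hard class label from one pipeline is adapted into the shared prediction output space (e.g., via a precomputed logit lookup, as in claim 2) and then linearly combined with a soft distribution from another pipeline (as in claim 10). All class names, logit values, and weights below are illustrative assumptions, not the disclosed implementation.

```python
import numpy as np

CLASSES = ["vehicle", "pedestrian", "cyclist", "cone"]

# Hypothetical precomputed logit estimation tensors, keyed by the hard class
# predicted by the first model (values are illustrative only).
PRECOMPUTED_LOGITS = {
    "vehicle":    np.array([3.0, 0.2, 0.4, 0.3]),
    "pedestrian": np.array([0.2, 3.0, 0.6, 0.1]),
    "cyclist":    np.array([0.3, 0.7, 3.0, 0.2]),
    "cone":       np.array([0.5, 0.2, 0.3, 3.0]),
}

def softmax(z):
    """Convert a logit vector into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

def resolve(hard_class, soft_distribution, weight=0.5):
    """Adapt a hard prediction into the shared space, then linearly combine."""
    adapted = softmax(PRECOMPUTED_LOGITS[hard_class])   # adapted representation
    resolved = weight * adapted + (1.0 - weight) * soft_distribution
    return CLASSES[int(np.argmax(resolved))], resolved

# One pipeline publishes the hard label "cone"; another publishes a soft
# distribution over the same candidate classes.
cls, dist = resolve("cone", np.array([0.1, 0.1, 0.2, 0.6]))
```

Because both inputs are expressed in the same output space, the combination step itself is a simple weighted average, and the `weight` parameter is one place where the tunable calibration described in paragraph [0005] could act.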
For instance, some object detection pipelines may support a perception task that demands extremely low latency; others may use larger detection models that operate at a slower frequency. An example implementation of the present disclosure may obtain the benefits of multi-expert voting without impacting the performance of the system on its various subtasks by accumulating results from the respective pipelines and then generating a resolved detection output that may be published for use by the respective system(s) for future cycles or by downstream system(s) for, for instance, planning motions of the vehicle, understanding or mapping a current state of the environment, or other tasks. The prediction resolution model may be tuned (e.g., using one or more learnable parameters or hyperparameters) to adjust a performance (e.g., improve a performance) of a downstream system.

[0006] A prediction resolution model may facilitate improved long-range detections by allowing individual detection pipelines to publish results earlier, with the expectation that, as more data becomes available over time from other or the same pipelines, the detection result may be updated and resolved. For instance, at long distances, a traffic cone may appear