
US-12617413-B1 - Contrastive training of object trajectory encoders and text encoders


Abstract

Techniques are described herein for training contrastive models including object trajectory encoders and text encoders for evaluating, classifying, and/or predicting the movements and behaviors of dynamic objects in driving environments. A training system may receive sets of ground truth trajectory data describing movements of objects within driving environments, and associated text descriptions related to the trajectory data. The training system may jointly train the trajectory encoder and the text encoder, using contrastive loss, based on the related sets of trajectory data and text data. Once trained, the trajectory encoder and/or the text encoder may operate as pre-trained models for subsequently training and executing additional models with different output heads and/or various other downstream encoding tasks. In some examples, contrastive pre-trained trajectory encoders trained as described herein may be used for training and executing motion forecasting models within autonomous vehicles.

Inventors

  • Ethan Miller Pronovost
  • Sean Konz

Assignees

  • Zoox, Inc.

Dates

Publication Date
2026-05-05
Application Date
2023-10-31

Claims (5)

  1. A system comprising: one or more processors; and one or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause the one or more processors to perform operations comprising: receiving driving scene data associated with a driving environment; receiving an object trajectory of an object in the driving environment; receiving a text description associated with the object trajectory of the object; determining, using a trajectory encoder, and based at least in part on the driving scene data and the object trajectory, a first trajectory encoding; determining, using a text encoder, and based at least in part on the text description associated with the object trajectory, a first text encoding; and jointly training the trajectory encoder and the text encoder, wherein the jointly training comprises: determining, based at least in part on a similarity between the first trajectory encoding and the first text encoding, a first loss associated with the trajectory encoder, and a second loss associated with the text encoder; modifying the trajectory encoder, based at least in part on the first loss; and modifying the text encoder, based at least in part on the second loss.
  2. The system of claim 1, wherein the text description indicates at least one of: a relationship between the object and a second object in the driving environment; or a relationship between the object and a map data feature in the driving environment.
  3. The system of claim 1, wherein the text encoder comprises: a first set of transformer blocks associated with a large language model; and a second set of transformer blocks associated with descriptions of object movements in the driving environment, wherein jointly training the trajectory encoder and the text encoder comprises modifying the second set of transformer blocks.
  4. The system of claim 1, wherein jointly training the trajectory encoder and the text encoder comprises: training, during a first training stage, a trained trajectory encoder; and wherein the operations further comprise: training, during a second training stage after the first training stage, an object motion forecasting model including the trained trajectory encoder.
  5. The system of claim 4, the operations further comprising: transmitting the object motion forecasting model to a vehicle, wherein the vehicle is configured to be controlled based at least in part on the object motion forecasting model.
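The joint-training operations recited in claim 1 can be pictured with a minimal numerical sketch. Everything concrete here is an illustrative assumption rather than the patent's actual models: the linear "encoders" stand in for transformer-based encoders, the feature and embedding sizes are arbitrary, and the cosine-based loss is one simple way a similarity between the two encodings could yield a first and second loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: linear "encoders" projecting input features
# into a shared 4-d embedding space (shapes are illustrative only).
W_traj = rng.normal(size=(8, 4))   # trajectory-encoder parameters
W_text = rng.normal(size=(8, 4))   # text-encoder parameters
traj_feat = rng.normal(size=8)     # driving scene + object trajectory features
text_feat = rng.normal(size=8)     # associated text-description features

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def losses():
    # First trajectory encoding and first text encoding, then losses
    # derived from their similarity: for a related pair, both losses
    # shrink as the two encodings become more similar.
    sim = cosine(traj_feat @ W_traj, text_feat @ W_text)
    return 1.0 - sim, 1.0 - sim    # (first loss, second loss)

# Modify each encoder based on its own loss, here via one
# numerical-gradient descent step per encoder.
lr, eps = 0.1, 1e-6
loss_before = losses()[0]
for idx, W in ((0, W_traj), (1, W_text)):
    base = losses()[idx]
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W[i, j] += eps
            grad[i, j] = (losses()[idx] - base) / eps
            W[i, j] -= eps
    W -= lr * grad                 # in-place update of this encoder
loss_after = losses()[0]
```

After the two updates, the encodings of the related pair are more similar, so both losses have decreased — the per-pair behavior that the claimed joint training relies on.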

Description

BACKGROUND

Autonomous and semi-autonomous vehicles may utilize systems and components to traverse driving environments that include various dynamic objects, such as other moving or stationary vehicles (autonomous or otherwise), pedestrians, bicycles, and animals, as well as static objects such as curbs, sidewalks, road debris, and other potential road obstructions. When traversing such an environment, the vehicle may determine a trajectory based on sensor data from the perception systems of the vehicle, as well as map data of the environment. For example, a planning component within an autonomous or semi-autonomous vehicle may determine a trajectory and a corresponding set of actions for the vehicle to take to navigate the operating environment. Trajectories may be selected based in part on avoiding the other objects present in the environment, which may include predicting and/or anticipating the movements or behaviors of those objects. For example, a planning system may determine an action to yield to a walking pedestrian, change lanes to avoid another vehicle in the road, etc. The perception systems of the vehicle may utilize sensor data to perceive the environment, which enables the prediction and planning systems to determine and evaluate potential actions for the vehicle to perform based on the current driving environment. However, in certain circumstances, the complexity of such environments may preclude accurate prediction of the future states and trajectories of other objects in the environment and/or efficient determination of optimized trajectories for the vehicle, especially in ever more complicated scenarios.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears.
The use of the same reference numbers in different figures indicates similar or identical components or features.

FIG. 1 illustrates an example architecture of a contrastive training system for training a trajectory-text model including a trajectory encoder and a text encoder, in accordance with one or more examples of the disclosure.

FIGS. 2A and 2B depict examples of object trajectory data and associated text description data, in accordance with one or more examples of the disclosure.

FIG. 3 is a diagram illustrating an example contrastive training technique for jointly training a trajectory encoder and a text encoder, in accordance with one or more examples of the disclosure.

FIGS. 4A-4C depict additional examples of a contrastive training technique for jointly training a trajectory encoder and a text encoder, based on many-to-one object trajectory data and associated text description data, in accordance with one or more examples of the disclosure.

FIG. 5 illustrates an autonomous vehicle including motion forecasting models that use a pre-trained trajectory encoder, in which the autonomous vehicle uses the motion forecasting models to predict object trajectories and determine a trajectory to control the autonomous vehicle along a route in a driving environment.

FIG. 6 depicts a block diagram of an example system for implementing various techniques described herein.

FIG. 7 is a flow diagram illustrating an example process for contrastive training of a trajectory encoder and a text encoder, based on object trajectory data and associated text description data, and using the trained encoders for various downstream tasks, in accordance with one or more examples of the disclosure.

DETAILED DESCRIPTION

This application describes techniques for training contrastive trajectory-text models based on multimodal object trajectory data and associated text descriptions, during which a trajectory encoder and a text encoder may be jointly trained using contrastive loss.
The trajectory encoders and text encoders trained as described herein may then be used for evaluating, classifying, and/or predicting the movements and behaviors of dynamic objects (or agents), such as vehicles, bicycles, and pedestrians, in driving environments. In various examples, a contrastive training system may receive sets of ground truth trajectory data describing an object's movements within a driving environment, and associated text descriptions corresponding to the trajectory data. The contrastive training system may jointly train a model including a trajectory encoder and a text encoder, during which contrastive losses are determined by comparing the encodings from related and unrelated sets of trajectory data and text descriptions. After training a trajectory encoder and/or a text encoder using the contrastive training techniques described herein, one or both of the encoders may operate as pre-trained models for use in subsequent training stages for different model output heads and/or various other downstream encoding tasks. For e
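The batch-level comparison of related and unrelated trajectory/text pairs can be sketched with a CLIP-style symmetric contrastive loss. The temperature value, batch layout, and exact loss form here are illustrative assumptions, not the patent's specific formulation:

```python
import numpy as np

def contrastive_losses(traj_enc, text_enc, temperature=0.07):
    """Symmetric contrastive losses over a batch of encodings.

    Row i of traj_enc and row i of text_enc come from a *related*
    trajectory/text pair; every off-diagonal pairing serves as an
    unrelated (negative) pair.
    """
    # L2-normalize so dot products are cosine similarities.
    traj = traj_enc / np.linalg.norm(traj_enc, axis=1, keepdims=True)
    text = text_enc / np.linalg.norm(text_enc, axis=1, keepdims=True)
    logits = (traj @ text.T) / temperature   # [B, B] similarity matrix

    def xent_diag(l):
        # Softmax cross-entropy whose targets are the diagonal,
        # i.e., each encoding should best match its own partner.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    loss_traj = xent_diag(logits)     # trajectory -> matching text
    loss_text = xent_diag(logits.T)   # text -> matching trajectory
    return loss_traj, loss_text
```

When the encodings of related pairs dominate the diagonal, both losses approach zero; mismatching one modality raises them. Gradients of these losses pull related trajectory/text encodings together and push unrelated ones apart, which is the training signal described above.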