
US-12623688-B2 - Systems and methods related to training a machine learning model used in controlling autonomous vehicle(s)


Abstract

Systems and methods related to controlling an autonomous vehicle (“AV”) are described herein. Implementations can obtain a plurality of instances that each include input and output. The input can include actor(s) from a given time instance of a past episode of locomotion of a vehicle, and stream(s) in an environment of the vehicle during the past episode. The actor(s) may be associated with an object in the environment of the vehicle at the given time instance, and the stream(s) may each represent candidate navigation paths in the environment of the vehicle. The output may include ground truth label(s) (or reference label(s)). Implementations can train a machine learning (“ML”) model based on the plurality of instances, and subsequently use the ML model in controlling the AV. In training the ML model, the actor(s) and stream(s) can be processed in parallel.

Inventors

  • James Andrew Bagnell
  • Arun Venkatraman
  • Sanjiban Choudhury
  • Venkatraman Narayanan

Assignees

  • AURORA OPERATIONS, INC.

Dates

Publication Date
2026-05-12
Application Date
2021-12-17

Claims (20)

  1. A method for training a machine learning (“ML”) model for use by an autonomous vehicle (“AV”), the method comprising: generating a plurality of training instances, each of the plurality of training instances comprising: training instance input, the training instance input comprising: one or more associated actors at a given time instance of an associated past episode of locomotion of a vehicle, wherein each of the one or more associated actors corresponds to an object in an environment of the vehicle during the associated past episode of locomotion, and wherein the vehicle is the AV or an additional vehicle that is in addition to the AV; and a plurality of associated streams in an environment of the vehicle during the associated past episode of locomotion, wherein each stream, of the plurality of associated streams, corresponds to a candidate navigation path for the vehicle or one of the associated actors; and training instance output, the training instance output comprising: one or more reference labels that are associated with the past episode of locomotion, wherein the one or more reference labels include a respective probability that the object will follow the candidate navigation path of each of the plurality of associated streams; and training the ML model using the plurality of training instances, wherein the trained ML model is subsequently utilized in controlling the AV.
  2. The method of claim 1, wherein the respective probability that the object will follow the candidate navigation path of each of the plurality of associated streams is represented as a ground truth probability distribution for each of the one or more associated actors.
  3. The method of claim 2, further comprising, for a particular training instance of the plurality of training instances: generating the respective ground truth probability distribution for each of the one or more associated actors, wherein generating the respective ground truth probability distribution for each of the one or more associated actors comprises: extracting, for a plurality of time instances of the past episode that are subsequent to the given time instance, a plurality of associated features associated with each of the one or more associated actors; determining, based on the plurality of associated features associated with each of the one or more associated actors, and for each of the plurality of time instances, a lateral distance between each of the one or more associated actors and each of the plurality of associated streams; and generating, based on the lateral distance between each of the one or more associated actors and each of the plurality of associated streams, and for each of the plurality of time instances, the respective ground truth probability distribution for each of the one or more associated actors.
  4. The method of claim 1, wherein each of the one or more reference labels further includes a ground truth constraint, for the vehicle, and associated with the given time instance or a subsequent time instance that is subsequent to the given time instance, and wherein the ground truth constraint includes information related to where the vehicle cannot be located, at the given time instance, and in the environment of the past episode of locomotion.
  5. The method of claim 1, wherein each of the one or more reference labels further includes a ground truth action, for the vehicle, and associated with the given time instance or a subsequent time instance that is subsequent to the given time instance, and wherein the ground truth action includes information related to an action performed by the vehicle, at the given time instance, and in the environment of the past episode of locomotion.
  6. The method of claim 1, wherein each of the one or more associated actors from the given time instance of the past episode includes a plurality of associated features, wherein the plurality of associated features for each of the associated actors comprise at least one of: velocity information for the object, the velocity information including at least one of: a current velocity of the object, or historical velocities of the object; distance information for the object, the distance information including a distance between the object and each of the plurality of streams; or pose information associated with the object, the pose information including at least one of: location information, or orientation information for the object in the past episode.
  7. The method of claim 1, wherein each stream, of the plurality of associated streams, corresponds to a sequence of poses that represent the candidate navigation path, in the environment of the vehicle, for the vehicle or one of the associated actors.
  8. The method of claim 7, wherein the plurality of associated streams include at least one of: a target stream corresponding to the candidate navigation path the vehicle will follow, a joining stream that merges into the target stream, a crossing stream that is transverse to the target stream, an adjacent stream that is parallel to the target stream, or an additional stream that is one-hop from the joining stream, the crossing stream, or the adjacent stream.
  9. The method of claim 1, wherein the object includes at least one of an additional vehicle that is located in the environment of the vehicle, a bicyclist, or a pedestrian.
  10. The method of claim 9, wherein the object is dynamic in the environment of the vehicle along a particular stream of the plurality of streams.
  11. The method of claim 1, further comprising, for one or more of the plurality of training instances: receiving user input that defines one or more of the reference labels.
  12. The method of claim 1, wherein training the ML model based on the plurality of training instances comprises, for each of the plurality of training instances: processing, using the ML model, the one or more actors at the given time instance of the associated past episode of locomotion and the plurality of associated streams in the environment of the vehicle during the associated past episode of locomotion to generate predicted output; determining, based on the predicted output, a respective predicted probability that the object will follow the candidate navigation path of each of the plurality of associated streams at a future time instance of the associated past episode of locomotion; comparing the respective probability that the object will follow the candidate navigation path of each of the plurality of associated streams to the respective predicted probability that the object will follow the candidate navigation path of each of the plurality of associated streams at the future time instance to generate an error; and updating the ML model based on the error.
  13. The method of claim 12, wherein the ML model is a transformer ML model that includes at least a plurality of layers, and wherein the plurality of layers include at least a plurality of encoding layers, a plurality of decoding layers, and a plurality of attention layers.
  14. The method of claim 1, wherein subsequently utilizing the trained ML model in controlling the AV comprises: processing, using the trained ML model, sensor data generated by one or more sensors of the AV to generate predicted output; and causing the AV to be controlled based on the predicted output.
  15. The method of claim 14, wherein causing the AV to be controlled based on the predicted output comprises: processing, using one or more additional layers of the ML model, one or more rules, or one or more additional ML models, the output to rank AV control strategies; and causing the AV to be controlled based on one or more of the ranked AV control strategies.
  16. A system for training a machine learning (“ML”) model for use by an autonomous vehicle (“AV”), the system comprising: at least one processor; and memory storing instructions that, when executed, cause the at least one processor to: generate a plurality of training instances, each of the plurality of training instances comprising: training instance input, the training instance input comprising: one or more associated actors at a given time instance of an associated past episode of locomotion of a vehicle, wherein each of the one or more associated actors corresponds to an object in an environment of the vehicle during the associated past episode of locomotion, and wherein the vehicle is the AV or an additional vehicle that is in addition to the AV; and a plurality of associated streams in an environment of the vehicle during the associated past episode of locomotion, wherein each stream, of the plurality of associated streams, corresponds to a candidate navigation path for the vehicle or one of the associated actors; and training instance output, the training instance output comprising: one or more reference labels that are associated with the past episode of locomotion, wherein the one or more reference labels include a respective probability that the object will follow the candidate navigation path of each of the plurality of associated streams; and train the ML model using the plurality of training instances, wherein the trained ML model is subsequently utilized in controlling the AV.
  17. The system of claim 16, wherein the respective probability that the object will follow the candidate navigation path of each of the plurality of associated streams is represented as a ground truth probability distribution for each of the one or more associated actors.
  18. The system of claim 16, wherein each of the one or more reference labels further includes a ground truth constraint, for the vehicle, and associated with the given time instance or a subsequent time instance that is subsequent to the given time instance, and wherein the ground truth constraint includes information related to where the vehicle cannot be located, at the given time instance, and in the environment of the past episode of locomotion.
  19. The system of claim 16, wherein each of the one or more reference labels further includes a ground truth action, for the vehicle, and associated with the given time instance or a subsequent time instance that is subsequent to the given time instance, and wherein the ground truth action includes information related to an action performed by the vehicle, at the given time instance, and in the environment of the past episode of locomotion.
  20. The system of claim 16, wherein the instructions to train the ML model based on the plurality of training instances further cause the at least one processor to, for each of the plurality of training instances: process, using the ML model, the one or more actors at the given time instance of the associated past episode of locomotion and the plurality of associated streams in the environment of the vehicle during the associated past episode of locomotion to generate predicted output; determine, based on the predicted output, a respective predicted probability that the object will follow the candidate navigation path of each of the plurality of associated streams at a future time instance of the associated past episode of locomotion; compare the respective probability that the object will follow the candidate navigation path of each of the plurality of associated streams to the respective predicted probability that the object will follow the candidate navigation path of each of the plurality of associated streams at the future time instance to generate an error; and update the ML model based on the error.
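
The label generation of claim 3 and the error computation of claims 12 and 20 can be sketched as follows. This is a minimal illustration only: the claims require that the ground truth distribution be derived from per-stream lateral distances and that an error be generated by comparing predicted and reference probabilities, but they do not specify the mapping or the loss. The softmax-of-negative-distance mapping and the cross-entropy error below are assumptions chosen for concreteness.

```python
import numpy as np

def ground_truth_stream_distribution(lateral_distances, temperature=1.0):
    """Map one actor's lateral distances to each associated stream into a
    probability distribution over the streams' candidate navigation paths.

    `lateral_distances` is a 1-D array of lateral offsets (e.g., averaged
    over the time instances subsequent to the given time instance, per
    claim 3). Smaller distance yields higher probability; the softmax
    form itself is an illustrative assumption.
    """
    logits = -np.asarray(lateral_distances, dtype=float) / temperature
    logits -= logits.max()            # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def training_error(predicted_probs, reference_probs, eps=1e-9):
    """Cross-entropy between predicted and ground truth distributions,
    one assumed realization of the 'error' of claims 12 and 20."""
    p = np.asarray(predicted_probs, dtype=float)
    q = np.asarray(reference_probs, dtype=float)
    return float(-(q * np.log(p + eps)).sum())
```

For example, an actor 0.2 m from a crossing stream but several meters from the target and adjacent streams would receive a reference distribution concentrated on the crossing stream, and a model update would then reduce the cross-entropy between the model's predicted per-stream probabilities and that reference distribution.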

Description

BACKGROUND

As computing and vehicular technologies continue to evolve, autonomy-related features have become more powerful and widely available, and are capable of controlling vehicles in a wider variety of circumstances. For automobiles, for example, the automotive industry has generally adopted SAE International standard J3016, which designates six levels of autonomy. A vehicle with no autonomy is designated Level 0. With Level 1 autonomy, a vehicle controls steering or speed (but not both), leaving the operator to perform most vehicle functions. With Level 2 autonomy, a vehicle is capable of controlling steering, speed, and braking in limited circumstances (e.g., while traveling along a highway), but the operator is still required to remain alert and be ready to take over operation at any instant, as well as to handle any maneuvers such as changing lanes or turning. Starting with Level 3 autonomy, a vehicle can manage most operating variables, including monitoring the surrounding environment, but an operator is still required to remain alert and take over whenever the vehicle encounters a scenario it is unable to handle. Level 4 autonomy provides an ability to operate without operator input, but only in specific conditions, such as on certain types of roads (e.g., highways) or in certain geographical areas (e.g., specific cities for which adequate mapping data exists). Finally, Level 5 autonomy represents a level of autonomy at which a vehicle is capable of operating free of operator control under any circumstances in which a human operator could also operate. The fundamental challenges of any autonomy-related technology relate to collecting and interpreting information about a vehicle's surrounding environment, and to making and implementing decisions to appropriately control the vehicle given the current environment within which the vehicle is operating.
Therefore, continuing efforts are being made to improve each of these aspects, and by doing so autonomous vehicles are increasingly able to reliably handle a wider variety of situations and to accommodate both expected and unexpected conditions within an environment.

SUMMARY

As used herein, the term actor or track refers to an object in an environment of a vehicle during an episode (e.g., past or current) of locomotion of the vehicle (e.g., an AV, a non-AV retrofitted with sensors, or a simulated vehicle). For example, the actor may correspond to an additional vehicle navigating in the environment of the vehicle, an additional vehicle parked in the environment of the vehicle, a pedestrian, a bicyclist, or other static or dynamic objects encountered in the environment of the vehicle. In some implementations, actors may be restricted to dynamic objects. Further, the actor may be associated with a plurality of features. The plurality of features can include, for example, velocity information (e.g., historical, current, or predicted future) associated with the corresponding actor, distance information between the corresponding actor and each of a plurality of streams in the environment of the vehicle, pose information (e.g., location information and orientation information), or any combination thereof. In some implementations, the plurality of features may be specific to the corresponding actors. For example, the distance information may include a lateral distance or a longitudinal distance between a given actor and a closest object, and the velocity information may include the velocity of the given actor and the object along a given stream. In some additional or alternative implementations, the plurality of features may be relative to the AV.
For example, the distance information may include a lateral distance or longitudinal distance between each of the plurality of actors and the AV, and the velocity information may include relative velocities of each of the actors with respect to the AV. As described herein, these features, which can include those generated by determining geometric relationships between actors, can be features that are processed using the ML model. In some implementations, multiple actors are generally present in the environment of the vehicle, and the actors can be captured in sensor data instances of sensor data generated by one or more sensors of the vehicle. As used herein, the term stream refers to a sequence of poses representing a candidate navigation path, in the environment of the vehicle, for the vehicle or the actors. The streams can be one of a plurality of disparate types of streams. The types of streams can include, for example, a target stream corresponding to the candidate navigation path the vehicle is following or will follow within a threshold amount of time, a joining stream corresponding to any candidate navigation path that merges into the target stream, a crossing stream corresponding to any candidate navigation path that is transverse to the target stream, an adjacent stream corresponding to any candidate navigation pa