US-12626106-B2 - Machine learning models for behavior understanding

US 12626106 B2

Abstract

A method for performing one or more tasks, wherein each of the one or more tasks includes predicting behavior of one or more agents in an environment, the method comprising: obtaining a three-dimensional (3D) input tensor representing behaviors of the one or more agents in the environment across a plurality of time steps; generating an encoded representation of the 3D input tensor by processing the 3D input tensor using an encoder neural network, wherein the 3D input tensor comprises a plurality of observed cells and a plurality of masked cells; and processing the encoded representation of the 3D input tensor using a decoder neural network to generate a 4D output tensor.
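The abstract's central data structure — a 3D input tensor whose observed cells hold known behavior features and whose masked cells hold placeholder values for behavior to be predicted — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; the function name, the `[agents, time, features]` axis ordering, and the use of `0.0` as the placeholder value are all assumptions.

```python
import numpy as np

def build_input_tensor(observed, num_total_steps, placeholder=0.0):
    """Hypothetical sketch of the 3D input tensor described in the abstract.

    observed: array of shape [num_agents, num_observed_steps, num_features]
    holding the observed behavior features. Cells at time steps beyond the
    observed window are filled with a placeholder and flagged as masked
    (to-be-predicted).
    """
    num_agents, num_observed_steps, num_features = observed.shape
    tensor = np.full((num_agents, num_total_steps, num_features), placeholder)
    tensor[:, :num_observed_steps, :] = observed
    # Boolean companion mask: True marks a masked (to-be-predicted) cell.
    is_masked = np.zeros((num_agents, num_total_steps, num_features), dtype=bool)
    is_masked[:, num_observed_steps:, :] = True
    return tensor, is_masked

# 3 agents observed for 4 time steps with 2 features each; predict 2 more steps.
observed = np.random.rand(3, 4, 2)
tensor, is_masked = build_input_tensor(observed, num_total_steps=6)
```

Here `tensor` has shape `(3, 6, 2)`: the first four time steps are observed cells and the last two are masked cells awaiting prediction.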

Inventors

  • Jonathon Shlens
  • Ashish Venugopal
  • Vijay Vasudevan
  • Jiquan Ngiam
  • Benjamin James Caine
  • Zhengdong Zhang
  • Zhifeng Chen
  • Hao-Tien Chiang
  • David Joseph Weiss
  • Jeffrey Ling

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-12
Application Date
2022-05-31

Claims (20)

  1. A method for performing one or more tasks, wherein each of the one or more tasks includes predicting behavior of one or more agents in an environment, the method comprising: obtaining a three-dimensional (3D) input tensor representing behaviors of the one or more agents in the environment across a plurality of time steps, the 3D input tensor having (i) an agent dimension that represents the one or more agents, (ii) a feature dimension that represents, for each of the one or more agents, features corresponding to behavior of the agent, and (iii) a time dimension that represents the plurality of time steps; generating an encoded representation of the 3D input tensor by processing the 3D input tensor using an encoder neural network, wherein the 3D input tensor comprises a plurality of observed cells and a plurality of masked cells, wherein each observed cell includes a specified value for a respective feature that corresponds to an observed behavior of a respective agent at a respective time step, and each masked cell includes a placeholder value for a respective feature that corresponds to a to-be-predicted behavior of a respective agent at a respective time step in the future; and processing the encoded representation of the 3D input tensor using a decoder neural network to generate a 4D output tensor, the 4D output tensor having (i) an agent dimension that represents the one or more agents, (ii) a feature dimension that represents, for each of the one or more agents, features corresponding to behavior of the agent, (iii) a time dimension that represents the plurality of time steps, and (iv) a future dimension of size F that represents F possible future behaviors for each of the one or more agents at each time step.
  2. The method of claim 1, further comprising generating an output for the one or more tasks from the F possible future behaviors in the 4D output tensor corresponding to each of the masked cells.
  3. The method of claim 2, further comprising providing the output of the one or more tasks to a control system for a particular agent of the one or more agents for use in controlling the particular agent.
  4. The method of claim 1, wherein the one or more tasks comprise at least one of a behavior prediction task, a conditional behavior prediction task, or a goal-directed planning task.
  5. The method of claim 1, wherein when the one or more tasks include a behavior prediction task, all of the cells in the 3D input tensor up to a specified time step are observed cells and all of the cells in the 3D input tensor after the specified time step are masked cells.
  6. The method of claim 1, wherein when the one or more tasks include a conditional behavior prediction task, all of the cells in the 3D input tensor up to a specified time step are observed cells, and all of the cells in the 3D input tensor after the specified time step except the cells corresponding to one or more specified agents are masked cells.
  7. The method of claim 1, wherein when the one or more tasks include a goal-directed planning task, all of the cells in the 3D input tensor up to a specified time step are observed cells, and all of the cells in the 3D input tensor after the specified time step except the cells corresponding to a specified agent at a final time step are masked cells.
  8. The method of claim 1, wherein obtaining the 3D input tensor comprises: receiving as input a temporal representation that includes a characterization of the environment at each of a plurality of time steps, and processing, using an input neural network, the temporal representation to generate the 3D input tensor.
  9. The method of claim 1, further comprising processing the 4D output tensor to generate, for each masked cell in the 3D input tensor, a probability distribution over the F possible future behaviors of the respective agent at the respective time step.
  10. The method of claim 9, further comprising using the probability distribution over the F possible future behaviors to control at least one of the one or more agents.
  11. 11 . The method of claim 1 , wherein the encoder neural network comprises a plurality of encoder attention layers, each of the plurality of encoder attention layers configured to receive a respective encoder attention layer input and to apply an attention mechanism over the respective encoder attention layer input to generate a respective encoder attention layer output.
  12. 12 . The method of claim 11 , wherein the encoder attention layer input of a first encoder attention layer in the plurality of the encoder attention layers is the 3D input tensor, and wherein each of the plurality of encoder attention layers following the first encoder attention layer is configured to receive the encoder attention layer output of the preceding encoder attention layer as the respective encoder attention layer input.
  13. 13 . The method of claim 11 , wherein the plurality of encoder attention layers includes an encoder self-attention layer that is configured to apply a self-attention mechanism over the respective encoder attention layer input across the time dimension.
  14. 14 . The method of claim 11 , wherein the plurality of encoder attention layers includes an encoder self-attention layer that is configured to apply a self-attention mechanism over the respective encoder attention layer input across the agent dimension.
  15. 15 . The method of claim 11 , wherein the plurality of encoder attention layers includes a cross-attention layer that is configured to apply a cross-attention mechanism over the respective encoder attention layer input across the agent dimension using a road graph.
  16. 16 . The method of claim 15 , wherein the road graph represents dynamic and static elements of the environment.
  17. 17 . The method of claim 16 , wherein the static elements include a lane structure and a road layout of the environment.
  18. 18 . The method of claim 16 , wherein the dynamic elements include elements in the environment that change over time.
  19. 19 . The method of claim 11 , wherein the decoder neural network comprises a plurality of decoder attention layers, each of the plurality of decoder attention layers configured to receive a respective decoder attention layer input and to apply an attention mechanism over the respective decoder attention layer input to generate a respective decoder attention layer output.
  20. 20 . The method of claim 19 , wherein the decoder attention layer input of a first decoder attention layer in the plurality of decoder attention layers is the encoded representation, and wherein each of the plurality of decoder attention layers following the first decoder attention layer is configured to receive the decoder attention layer output of the preceding decoder attention layer as the respective decoder attention layer input.
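The three task-specific masking schemes recited in claims 5-7 differ only in which cells remain observed after the specified time step. They can be sketched as follows; the function names and the simplified `[agents, time]` mask layout (collapsing the feature dimension) are illustrative assumptions, not language from the claims. `True` marks a masked (to-be-predicted) cell.

```python
import numpy as np

def behavior_prediction_mask(num_agents, num_steps, t_obs):
    """Claim 5: every cell after the specified time step t_obs is masked."""
    mask = np.zeros((num_agents, num_steps), dtype=bool)
    mask[:, t_obs:] = True
    return mask

def conditional_bp_mask(num_agents, num_steps, t_obs, specified_agents):
    """Claim 6: like claim 5, except cells of the specified agents stay observed."""
    mask = behavior_prediction_mask(num_agents, num_steps, t_obs)
    mask[specified_agents, :] = False
    return mask

def goal_directed_mask(num_agents, num_steps, t_obs, planning_agent):
    """Claim 7: like claim 5, except the specified agent's cell at the
    final time step (its goal) stays observed."""
    mask = behavior_prediction_mask(num_agents, num_steps, t_obs)
    mask[planning_agent, -1] = False
    return mask
```

Under this sketch, switching tasks amounts to switching which mask is applied to the same 3D input tensor, which mirrors the description's point that the unified model performs individual tasks by changing how it is queried.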

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a non-provisional of and claims priority to U.S. Provisional Patent Application No. 63/194,577, filed on May 28, 2021, the entire contents of which are hereby incorporated by reference.

BACKGROUND

This specification relates to a system that performs one or more machine learning tasks that require predicting the behavior of one or more agents in an environment using a neural network. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

This specification describes a neural network system, implemented as computer programs on one or more computers in one or more locations, that includes a neural network configured to perform one or more tasks, in which each of the one or more tasks includes predicting or planning behavior of one or more agents (or predicting behavior of a plurality of agents jointly) in an environment. The one or more tasks include at least one of a behavior prediction task, a conditional behavior prediction task, or a goal-directed planning task.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Navigating dynamic environments necessitates predicting the interactions of multiple agents and the cascading effects of each potential future behavior of each agent. This problem is acute in many cases, for example, for autonomous vehicles, where agents (e.g., other vehicles and pedestrians) and their associated behaviors may be diverse, and where the decisions of the autonomous vehicle itself may influence the environment significantly. Many existing approaches decompose the problem into independently predicting future behaviors for each agent and subsequently planning against these fixed predictions. However, these approaches suffer from an inability to accurately model interactions, and thus systematically fail to predict the behavior of each agent.

In the case of navigation, some prior work has focused on dividing the problem of navigation into behavior prediction (or motion forecasting), i.e., predicting the potential future behaviors/trajectories of agents such as vehicles and pedestrians, and goal-directed planning, i.e., selecting a route that efficiently arrives at the destination taking into account the actions of other agents. Such a division, however, suffers from several challenges. First, propagating uncertainty across these tasks is challenging, especially if the systems are not trained with a single, unified objective. Second, interactions between agents largely dictate uncertainty, and formulating the problem into independent subtasks ignores this dimension.

To address the drawbacks of conventional systems, the techniques described in this specification allow a neural network to predict future behaviors of one or more agents, or to jointly predict future behaviors of multiple agents in an environment, in a unified manner. The neural network can capture the large, cascading interactions between agents and can be trained simultaneously on a plurality of tasks (e.g., behavior prediction, conditional behavior prediction, and goal-directed planning) by synergistically leveraging information from distinct data sources. The trained neural network is a unified model that can perform individual tasks by changing how one queries the model. By unifying the modeling, individual tasks often performed through heuristics may instead be learned from a large corpus of data.

The described techniques eschew an agent-centric approach and instead develop a global representation for all agents in an environment (including the autonomous vehicle itself). The techniques employ a simple variant of self-attention in which the attention mechanism is factorized across separate dimensions (instead of across multiple dimensions at the same time, which is computationally intensive and not scalable). The resulting neural network architecture only requires alternating attention between dimensions representing time and agents across the scene, making it more computationally efficient and scalable than prior behavior prediction architectures.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example neural network system that inclu
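The factorized self-attention described in the summary — attending over one dimension at a time, alternating between the time axis and the agent axis, rather than jointly over all (agent, time) pairs — can be sketched as follows. This is an assumption-laden illustration of the general technique, not the patented architecture: single-head, unprojected attention with no learned weights, feed-forward layers, or road-graph cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Plain scaled dot-product self-attention over the second-to-last axis.
    x: [..., seq, d]. (Real layers would also apply learned Q/K/V projections.)
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_attention_block(x):
    """x: [agents, time, d]. First attend over time within each agent's track,
    then swap axes and attend over agents within each time step. Cost is
    O(A*T^2 + T*A^2) per block instead of O((A*T)^2) for joint attention."""
    x = self_attention(x)                                         # time axis
    x = np.swapaxes(self_attention(np.swapaxes(x, 0, 1)), 0, 1)   # agent axis
    return x

x = np.random.rand(4, 6, 8)   # 4 agents, 6 time steps, feature width 8
y = factorized_attention_block(x)
```

Each block preserves the `[agents, time, features]` shape, so such blocks can be stacked to form the alternating-attention encoder the text describes.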