US-12626162-B2 - System and method for utilizing a recursive reasoning graph in multi-agent reinforcement learning
Abstract
A system and method for utilizing a recursive reasoning graph in multi-agent reinforcement learning that includes receiving data associated with an ego agent and a target agent that are traveling within a multi-agent environment and utilizing a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent. The system and method also include performing level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent. The system and method further include controlling at least one of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent.
Inventors
- David F. ISELE
- Xiaobai MA
- Jayesh K. GUPTA
- Mykel J. Kochenderfer
Assignees
- HONDA MOTOR CO., LTD.
- THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2021-02-11
Claims (20)
- 1 . A computer-implemented method for utilizing a recursive reasoning graph in multi-agent reinforcement learning, comprising: receiving data associated with an ego agent and a target agent that are traveling within a multi-agent environment; utilizing a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent; wherein: the multi-agent central actor-critic framework includes a central actor component for each of the ego agent and the target agent; performing level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent, wherein: performing the level-k recursive reasoning includes, representing each of the ego agent and the target agent as a respective central actor node in the recursive reasoning graph, receiving, by each central actor component, data reflecting prior level actions of other central actor nodes using a message passing between central actor nodes, and generating a higher-level response action as part of the level-k recursive reasoning including updating level-k actions for each central actor node by incorporating prior level actions of other central actor nodes, wherein an output of the level-k recursive reasoning is used to learn an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent; and autonomously controlling at least one operational system of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: the agent action policy that is associated with the ego agent and the agent action policy that is associated with the target agent, by executing the agent action policy learned from the level-k recursive reasoning on an electronic control unit (ECU) of the ego agent or the target agent to actuate at least one of vehicle steering, acceleration, or braking.
- 2 . The computer-implemented method of claim 1 , wherein receiving data associated with the multi-agent environment includes receiving image data and LiDAR data from at least one of: the ego agent and the target agent, and wherein the image data and the LiDAR data are processed using artificial intelligence capabilities to conduct multimodal fusion into fused environmental data.
- 3 . The computer-implemented method of claim 2 , wherein the image data and the LiDAR data are aggregated to determine a simulated multi-agent environment in which a simulation of a virtual environment is processed to execute at least one iteration of a stochastic game.
- 4 . The computer-implemented method of claim 3 , wherein utilizing the multi-agent central actor-critic framework includes augmenting the ego agent and the target agent based on the central actor component to model each agent's conditional response to one another within the at least one iteration of the stochastic game.
- 5 . The computer-implemented method of claim 4 , wherein utilizing the multi-agent central actor-critic framework includes determining a reward for each agent, wherein a transition probability is conditioned on a current state of each agent as well as actions of the ego agent and the target agent, wherein a Nash Equilibrium is utilized where the ego agent and the target agent act in response to each other's current strategy.
- 6 . The computer-implemented method of claim 1 , wherein the level-k recursive reasoning further includes using a message passing process in the recursive reasoning graph, wherein a node set of the recursive reasoning graph contains the central actor node for each agent and an edge set contains edges between the ego agent and the target agent, wherein each central actor node takes an input message based on a level-k policy and outputs the higher level response based on the prior level actions of an opposing agent.
- 7 . The computer-implemented method of claim 1 , wherein the level-k recursive reasoning further includes implementing a level-zero policy as a base policy, wherein the ego agent and the target agent treat each other as obstacles in reaching respective goals based on the level-zero policy.
- 8 . The computer-implemented method of claim 7 , wherein the level-k recursive reasoning further includes implementing a level-k policy of each agent that takes into account actions of an opposing agent that are based on past actions of a respective agent.
- 9 . The computer-implemented method of claim 8 , wherein complete message passing through the recursive reasoning graph gives one-level up in recursion, wherein the agent action policy is output for the ego agent and the agent action policy is output for the target agent based on respective level-k policies output from the recursive reasoning graph.
- 10 . A system for utilizing a recursive reasoning graph in multi-agent reinforcement learning, comprising: a memory storing instructions that, when executed by a processor, cause the processor to: receive data associated with an ego agent and a target agent that are traveling within a multi-agent environment; utilize a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent; wherein: the multi-agent central actor-critic framework includes a central actor component for each of the ego agent and the target agent; perform level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent, wherein: performing the level-k recursive reasoning includes, representing each of the ego agent and the target agent as a respective central actor node in the recursive reasoning graph, receiving, by each central actor component, data reflecting prior level actions of other central actor nodes using a message passing between central actor nodes, and generating a higher-level response action as part of the level-k recursive reasoning including updating level-k actions for each central actor node by incorporating prior level actions of other central actor nodes, wherein an output of the level-k recursive reasoning is used to learn an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent; and autonomously control at least one operational system of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: the agent action policy that is associated with the ego agent and the agent action policy that is associated with the target agent, by executing the agent action policy learned from the level-k recursive reasoning on an electronic control unit (ECU) of the ego agent or the target agent to actuate at least one of vehicle steering, acceleration, or braking.
- 11 . The system of claim 10 , wherein receiving data associated with the multi-agent environment includes receiving image data and LiDAR data from at least one of: the ego agent and the target agent, and wherein the image data and the LiDAR data are processed using artificial intelligence capabilities to conduct multimodal fusion into fused environmental data.
- 12 . The system of claim 11 , wherein the image data and the LiDAR data are aggregated to determine a simulated multi-agent environment in which a simulation of a virtual environment is processed to execute at least one iteration of a stochastic game.
- 13 . The system of claim 12 , wherein utilizing the multi-agent central actor-critic framework includes augmenting the ego agent and the target agent based on the central actor component to model each agent's conditional response to one another within the at least one iteration of the stochastic game.
- 14 . The system of claim 13 , wherein utilizing the multi-agent central actor-critic framework includes determining a reward for each agent, wherein a transition probability is conditioned on a current state of each agent as well as actions of the ego agent and the target agent, wherein a Nash Equilibrium is utilized where the ego agent and the target agent act in response to each other's current strategy.
- 15 . The system of claim 10 , wherein the level-k recursive reasoning further includes using a message passing process in the recursive reasoning graph, wherein a node set of the recursive reasoning graph contains the central actor node for each agent and an edge set contains edges between the ego agent and the target agent, wherein each central actor node takes an input message based on a level-k policy and outputs the higher level response based on the prior level actions of an opposing agent.
- 16 . The system of claim 10 , wherein the level-k recursive reasoning further includes implementing a level-zero policy as a base policy, wherein the ego agent and the target agent treat each other as obstacles in reaching respective goals based on the level-zero policy.
- 17 . The system of claim 16 , wherein the level-k recursive reasoning further includes implementing a level-k policy of each agent that takes into account actions of an opposing agent that are based on past actions of a respective agent.
- 18 . The system of claim 17 , wherein complete message passing through the recursive reasoning graph gives one-level up in recursion, wherein the agent action policy is output for the ego agent and the agent action policy is output for the target agent based on respective level-k policies output from the recursive reasoning graph.
- 19 . A non-transitory computer readable storage medium storing instructions that, when executed by a computer which includes a processor, perform a method, the method comprising: receiving data associated with an ego agent and a target agent that are traveling within a multi-agent environment; utilizing a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent; wherein: the multi-agent central actor-critic framework includes a central actor component for each of the ego agent and the target agent; performing level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent, wherein: performing the level-k recursive reasoning includes, representing each of the ego agent and the target agent as a respective central actor node in the recursive reasoning graph, receiving, by each central actor component, data reflecting prior level actions of other central actor nodes using a message passing between central actor nodes, and generating a higher-level response action as part of the level-k recursive reasoning including updating level-k actions for each central actor node by incorporating prior level actions of other central actor nodes, wherein an output of the level-k recursive reasoning is used to learn an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent; and autonomously controlling at least one operational system of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: the agent action policy that is associated with the ego agent and the agent action policy that is associated with the target agent, by executing the agent action policy learned from the level-k recursive reasoning on an electronic control unit (ECU) of the ego agent or the target agent to actuate at least one of vehicle steering, acceleration, or braking.
- 20 . The non-transitory computer readable storage medium of claim 19 , wherein the level-k recursive reasoning further includes using a message passing process in the recursive reasoning graph, wherein a node set of the recursive reasoning graph contains the central actor node for each agent and an edge set contains edges between the ego agent and the target agent, wherein each central actor node takes an input message based on a level-k policy and outputs the higher level response based on the prior level actions of an opposing agent.
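The level-k message-passing scheme recited in claims 1 and 6-9 can be illustrated with a minimal sketch: each agent is a central actor node in the reasoning graph, a level-zero base policy heads toward the goal while treating the other agent merely as an obstacle (claim 7), and each complete message pass conditions every node's action on the opposing node's prior-level action, raising the recursion one level (claims 8-9). All names here (ActorNode, level_k_actions) and the 0.5 response weight are hypothetical illustrations; the patent does not disclose source code or learned-policy details.

```python
import math

class ActorNode:
    """A central actor node representing one agent in the reasoning graph."""

    def __init__(self, name, goal):
        self.name = name
        self.goal = goal  # (x, y) goal position

    def level_zero_action(self, state):
        # Level-0 base policy: unit step toward the goal, treating the
        # other agent as a static obstacle (claim 7).
        gx, gy = self.goal
        px, py = state[self.name]
        dx, dy = gx - px, gy - py
        norm = math.hypot(dx, dy)
        return (dx / norm, dy / norm) if norm else (0.0, 0.0)

    def respond(self, state, opponent_action):
        # Level-k response: condition the base action on the opponent's
        # prior-level action (claim 8); the 0.5 weight is illustrative.
        bx, by = self.level_zero_action(state)
        ox, oy = opponent_action
        return (bx - 0.5 * ox, by - 0.5 * oy)

def level_k_actions(nodes, state, k):
    """One complete message pass over the graph raises the recursion
    by one level (claim 9)."""
    actions = {n.name: n.level_zero_action(state) for n in nodes}
    for _ in range(k):
        # Each central actor node receives the prior-level actions of the
        # other nodes as messages along the graph edges and emits its
        # higher-level response (claims 1 and 6, two-agent case).
        actions = {n.name: n.respond(state, actions[m.name])
                   for n, m in zip(nodes, reversed(nodes))}
    return actions
```

For example, `level_k_actions([ego, target], state, k=2)` performs two complete message passes, so each agent's returned action is a level-2 response to the other's level-1 action.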
Description
CROSS-REFERENCE TO RELATED APPLICATION This application claims priority to U.S. Provisional Application Ser. No. 63/139,690 filed on Jan. 20, 2021, which is expressly incorporated herein by reference. BACKGROUND Many real-world scenarios involve interactions between multiple agents with limited information exchange. Multi-robot navigation and autonomous driving applications such as highway merging, four-way stops, and lane changing are examples of situations where interaction is required between multiple mobile agents. For example, two mobile agents may be attempting to make maneuvers that may cross each other's paths. Modeling interactions of various agents may be difficult as continuous learning is necessary. In scenarios where complex interactions may occur between numerous agents, adequate machine-based understanding of multi-agent reasoning to properly model such interactions has not been successfully achieved. BRIEF DESCRIPTION According to one aspect, a computer-implemented method for utilizing a recursive reasoning graph in multi-agent reinforcement learning includes receiving data associated with an ego agent and a target agent that are traveling within a multi-agent environment. The computer-implemented method also includes utilizing a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent. The computer-implemented method additionally includes performing level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent. An output of the level-k recursive reasoning is used to learn an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent.
The computer-implemented method further includes controlling at least one of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: the agent action policy that is associated with the ego agent and the agent action policy that is associated with the target agent. According to another aspect, a system for utilizing a recursive reasoning graph in multi-agent reinforcement learning includes a memory storing instructions that, when executed by a processor, cause the processor to receive data associated with an ego agent and a target agent that are traveling within a multi-agent environment. The instructions also cause the processor to utilize a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent. The instructions additionally cause the processor to perform level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent. An output of the level-k recursive reasoning is used to learn an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent. The instructions further cause the processor to control at least one of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: the agent action policy that is associated with the ego agent and the agent action policy that is associated with the target agent. According to yet another aspect, a non-transitory computer readable storage medium storing instructions that, when executed by a computer which includes a processor, perform a method that includes receiving data associated with an ego agent and a target agent that are traveling within a multi-agent environment.
The method also includes utilizing a multi-agent central actor-critic framework to analyze the data associated with the ego agent and the target agent. The method additionally includes performing level-k recursive reasoning based on the multi-agent central actor-critic framework to calculate higher level recursion actions of the ego agent and the target agent. An output of the level-k recursive reasoning is used to learn an agent action policy that is associated with the ego agent and an agent action policy that is associated with the target agent. The method further includes controlling at least one of: the ego agent and the target agent to operate within the multi-agent environment based on at least one of: the agent action policy that is associated with the ego agent and the agent action policy that is associated with the target agent. BRIEF DESCRIPTION OF THE DRAWINGS The novel features believed to be characteristic of the disclosure are set forth in the appended claims. In the descriptions that follow, like parts are marked throughout the specification and drawings with the same numerals, respectively. The drawing figures are not necessarily drawn to scale and certain figures can be shown in exaggerated or generalized form in the interest of clarity and conciseness. The disclosure itself, however, as well as a preferred mode of