CN-114761966-B - System and method for robust optimization of trajectory-centric model-based reinforcement learning

CN 114761966 B

Abstract

A controller for optimizing a local control policy for a trajectory-centric reinforcement learning system is provided. The controller performs the steps of: learning a stochastic predictive model of the system from a set of data collected during trial-and-error experiments performed with an initial random control policy; estimating the associated mean prediction and uncertainty; determining, using the learned stochastic system model, a local set of deviations of the system from the nominal system state when the control input is applied at the current time step; determining the system state with the worst-case deviation; determining the gradient of a robustness constraint; formulating and solving a robust policy optimization problem using nonlinear programming to simultaneously obtain a system trajectory and a stabilizing local policy; updating the control data according to the solved optimization problem; and outputting the updated control data via the interface.
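To make the sequence of steps in the abstract concrete, here is a minimal, self-contained Python sketch of one iteration on a toy one-dimensional system. Everything in it is illustrative, not taken from the patent: a bootstrap ensemble of linear fits stands in for the stochastic predictive model, a grid search stands in for the nonlinear program, and a ±2σ interval stands in for the local deviation set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D plant x' = 0.9x + 0.5u + noise; the learner does not know these values.
def plant(x, u):
    return 0.9 * x + 0.5 * u + 0.01 * rng.standard_normal()

# Step 1: collect data with an initial random control policy.
X, U, Y = [], [], []
x = 1.0
for _ in range(200):
    u = rng.uniform(-1.0, 1.0)
    y = plant(x, u)
    X.append(x); U.append(u); Y.append(y)
    x = y

# Step 2: learn a stochastic predictive model. A bootstrap ensemble of
# linear least-squares fits stands in for whatever learner is used; the
# ensemble spread serves as the uncertainty estimate.
Phi = np.column_stack([X, U])
Yv = np.asarray(Y)
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(Yv), len(Yv))
    w, *_ = np.linalg.lstsq(Phi[idx], Yv[idx], rcond=None)
    ensemble.append(w)

def predict(x, u):
    preds = np.array([w @ np.array([x, u]) for w in ensemble])
    return preds.mean(), preds.std()   # mean prediction and uncertainty

# Steps 3-6 for a single time step: grid search stands in for the
# nonlinear program; the local deviation set is taken as +/- 2 sigma.
x_nom, x_goal, eps = 1.0, 0.0, 0.05
best = None
for u_ff in np.linspace(-2.0, 2.0, 81):        # feedforward candidate
    for k in np.linspace(-3.0, 0.0, 61):       # feedback gain candidate
        mu, sigma = predict(x_nom, u_ff)       # nominal next state
        worst = max(
            abs(predict(x_nom + s * 2 * sigma, u_ff + k * s * 2 * sigma)[0] - mu)
            for s in (-1.0, 1.0))              # worst-case deviated next state
        if worst <= eps:                       # robustness constraint
            cost = (mu - x_goal) ** 2 + 0.1 * u_ff ** 2
            if best is None or cost < best[0]:
                best = (cost, u_ff, k)

print("u_ff = %.2f, feedback gain = %.2f" % best[1:])
```

The robustness check mirrors the abstract: the worst-case deviated state, corrected by the feedback gain, must land within the tolerance `eps` of the nominal next state before a candidate is allowed to compete on cost.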

Inventors

  • D. Jha
  • P. Kolaric
  • A. Raghunathan
  • M. Benosman
  • D. Romeres

Assignees

  • Mitsubishi Electric Corporation

Dates

Publication Date
2026-05-08
Application Date
2020-12-04
Priority Date
2019-12-12

Claims (15)

  1. A controller for optimizing a local control policy for a trajectory-centric reinforcement learning system, the controller comprising: an interface configured to receive data comprising tuples of system states, control data, and state transitions measured by sensors; a memory storing a program executable by a processor, the program comprising a stochastic predictive learning model for generating, in response to a task command of the system received via the interface, a nominal state and control trajectory over a desired time range as a function of time steps, the control trajectory being an open-loop trajectory, a control policy comprising a machine learning algorithm and an initial random control policy, and the local control policy for correcting deviations along the nominal trajectory; and at least one processor configured to: learn a stochastic predictive model of the system using a set of data collected during trial-and-error experiments performed with the initial random control policy; estimate a mean prediction and uncertainty associated with the stochastic predictive model; formulate a trajectory-centric controller synthesis problem to simultaneously compute a nominal trajectory with feedforward control and a stabilizing time-invariant feedback control; determine, using the learned stochastic system model, a local set of deviations of the system from the nominal system state when the control input is applied at the current time step; determine the system state, within the local set of deviations, having the worst-case deviation from the nominal system state; determine a gradient of a robustness constraint by computing the first derivative of the robustness constraint at the system state with the worst-case deviation; determine, as the local control policy, an optimal system state trajectory, feedforward control input, and local time-invariant feedback policy for regulating the system state to the nominal trajectory, by solving with nonlinear programming a robust policy optimization problem that minimizes the cost of the state-control trajectory while satisfying the state and input constraints; update the control data according to the solved optimization problem; and output the updated control data through the interface.
  2. The controller of claim 1, wherein the system is a discrete-time dynamic system.
  3. The controller of claim 1, wherein the at least one processor is configured to perform the trajectory-centric reinforcement learning of the system by synthesizing a trajectory-centric control policy from a time-dependent feedforward control and a local time-invariant feedback control that stabilizes the time-dependent feedforward control.
  4. The controller of claim 2, wherein the synthesis of the trajectory-centric control policy for the discrete-time dynamic system is formulated as a nonlinear optimization program with nonlinear constraints.
  5. The controller of claim 4, wherein the nonlinear constraints are the system dynamics and a stability constraint for the local time-invariant feedback policy.
  6. The controller of claim 1, wherein the local time-invariant feedback policy is determined as the solution of a mathematical formulation of a robust trajectory optimization problem that minimizes the trajectory cost and additionally satisfies, at each step along the trajectory, a robustness constraint that pushes the state of the system that is in the worst-case deviated state at the current time step into a tolerance region around the trajectory at the next time step.
  7. The controller of claim 1, wherein the set of local uncertainties along the nominal trajectory is obtained from a stochastic function approximator used to learn a forward dynamics model of the system.
  8. The controller of claim 1, wherein the worst-case deviated state of the system at each state along the nominal trajectory, within the defined set of system states, is obtained by solving an optimization problem.
  9. The controller of claim 1, wherein the formulated nonlinear program with the robustness constraint is solved using the gradient of the robustness constraint at the worst-case deviated state to obtain the feedforward control and the time-invariant feedback control; or wherein at least one of the sensors performs wireless communication via the interface; or wherein at least one of the sensors is a 3D camera providing video comprising depth images; or wherein the trajectory-centric controller synthesis problem is a nonlinear program; or wherein the local policy is a time-invariant feedback policy or a local stabilizing controller; or wherein the control trajectory is an open-loop trajectory.
  10. The controller of claim 9, wherein the sensors are disposed in the system and at predetermined peripheral locations.
  11. The controller of claim 10, wherein at least one of the predetermined peripheral locations is determined by a viewpoint such that the 3D camera captures the range of motion of the system.
  12. A computer-implemented method for optimizing a local control policy for a trajectory-centric reinforcement learning system, the method comprising the steps of: learning a stochastic predictive model of the system using a set of data collected during trial-and-error experiments performed with an initial random control policy; estimating a mean prediction and uncertainty associated with the stochastic predictive model; formulating a trajectory-centric controller synthesis problem to simultaneously compute a nominal trajectory, a feedforward control, and a stabilizing time-invariant feedback control; determining, using the learned stochastic system model, a local set of deviations of the system from the nominal system state when the control input is applied at the current time step; determining the system state, within the local set of deviations, having the worst-case deviation from the nominal system state; determining a gradient of a robustness constraint by computing the first derivative of the robustness constraint at the system state with the worst-case deviation; determining, as the local control policy, an optimal system state trajectory, feedforward control input, and local time-invariant feedback policy for regulating the system state to the nominal trajectory, by formulating and solving, with nonlinear programming, a robust policy optimization problem that minimizes the cost of the state-control trajectory while satisfying the state and input constraints; updating the control data according to the solved optimization problem; and outputting the updated control data through the interface.
  13. The method of claim 12, wherein the system is a discrete-time dynamic system.
  14. The method of claim 12, wherein, to perform the trajectory-centric reinforcement learning of the system, a trajectory-centric control policy is synthesized from a time-dependent feedforward control and a local time-invariant feedback control that stabilizes the time-dependent feedforward control.
  15. The method of claim 13, wherein the synthesis of the trajectory-centric control policy for the discrete-time dynamic system is formulated as a nonlinear optimization program with nonlinear constraints.
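Claims 6 and 8 state that the worst-case deviated state along the nominal trajectory is obtained by solving an optimization problem. The sketch below illustrates one hypothetical way to solve that inner problem; the function names, the ball-shaped deviation set, and the projected-gradient solver are all assumptions for illustration, not taken from the patent.

```python
import numpy as np

def worst_case_deviation(f_mean, x_nom, u_ff, K, radius, iters=50, lr=0.1):
    # Sketch of the inner maximization of claims 6 and 8: find the
    # deviation delta, inside a ball of the given radius around the
    # nominal state, that maximizes the next-state deviation under the
    # stabilizing feedback K. Projected gradient ascent with
    # finite-difference gradients stands in for whatever solver an
    # implementation would actually use.
    n = x_nom.size
    y_nom = f_mean(x_nom, u_ff)

    def objective(d):
        return np.linalg.norm(f_mean(x_nom + d, u_ff + K @ d) - y_nom)

    delta = 1e-3 * np.ones(n)                 # small nonzero starting point
    for _ in range(iters):
        grad = np.zeros(n)
        for i in range(n):                    # finite-difference gradient
            e = np.zeros(n); e[i] = 1e-5
            grad[i] = (objective(delta + e) - objective(delta - e)) / 2e-5
        delta = delta + lr * grad             # ascent step
        norm = np.linalg.norm(delta)
        if norm > radius:                     # project back onto the ball
            delta *= radius / norm
    return delta

# Usage on a toy linear mean model x' = Ax + Bu with feedback gain K:
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
f = lambda x, u: A @ x + B @ u
K = np.array([[-1.0, -2.0]])
print(worst_case_deviation(f, np.zeros(2), np.zeros(1), K, radius=0.1))
```

On this linear toy model the ascent converges to the boundary of the ball in the direction that the closed-loop map A + BK amplifies the most, which is exactly the state whose first derivative the gradient of the robustness constraint is evaluated at in claim 1.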

Description

System and method for robust optimization of trajectory-centric model-based reinforcement learning

Technical Field

The present invention relates generally to systems and methods for simultaneously optimizing a local policy and a control trajectory for trajectory-centric reinforcement learning.

Background

Reinforcement learning (RL) is a learning framework for sequential decision problems, in which an "agent" or decision maker learns a policy that optimizes a long-term reward by interacting with an (unknown) environment. At each step, the RL agent obtains evaluative feedback (called reward or cost) about the performance of its action, allowing it to improve (maximize or minimize) the performance of subsequent actions. In general, global learning and optimization for an arbitrary nonlinear system can be extremely challenging, both computationally and algorithmically. However, many of the tasks that systems need to perform are trajectory-centric, and local learning and optimization can therefore be very data efficient. Trajectory-centric control is nevertheless challenging for nonlinear systems because of the time-varying nature of the controller. Deviations from the planned trajectory during operation are common in practical systems, owing to incorrect models or to noise in observation and actuation. Machine learning methods make it possible to learn, and then predict, the uncertainty in the evolution of a controlled trajectory. From a control perspective, it is desirable to design a local state-dependent policy that can use such a learned uncertainty model to stabilize the controlled trajectory. Most techniques cannot use the uncertainty knowledge present in the system model to stabilize the desired control trajectory. It is also desirable to design the trajectory and the corresponding stabilizing policy simultaneously, since this naturally trades off the optimality of the control trajectory against its stability. Intuitively, under such a setup the policy optimization algorithm will avoid regions of the state space that may be harder to control, so the uncertainty in the model can be exploited to design a robust, optimal trajectory-centric controller. Most current techniques perform these two steps (trajectory design and controller synthesis) separately and therefore cannot take advantage of this knowledge of model uncertainty. In view of the facts and challenges described above, there is a need for better policy optimization methods that can use uncertain statistical models of physical systems and exploit the structure of those models to achieve robust performance over a wide range of tasks.

Disclosure of Invention

Recent research has led to significant successes of RL algorithms in various domains, such as computer games. In trajectory-centric RL, the goal is to optimize a policy that can successfully perform a task starting from an initial state of the system and guiding the system to a desired final state. Trajectory-centric approaches have the advantage that they can learn faster, because they learn local predictive models and use them to optimize the policy in a local neighborhood of the system. Reinforcement learning algorithms can be broadly divided into two categories: model-based methods and model-free methods. Model-based reinforcement learning (MBRL) techniques are generally considered data efficient because they learn a task-independent predictive model of the system. The learned model is then used to synthesize a policy for the system using stochastic control methods.
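Written out, the simultaneous trajectory and feedback design described above can be posed as a nonlinear program. The notation below is a reconstruction from the description and claims, not taken from the patent: the cost of the state-control trajectory is minimized over the nominal trajectory, feedforward controls, and a time-invariant feedback gain, subject to the learned dynamics and the robustness constraint of claim 6,

$$\min_{x^{\mathrm{nom}},\,u^{\mathrm{ff}},\,K}\ \sum_{k=0}^{T-1} c\bigl(x_k^{\mathrm{nom}}, u_k^{\mathrm{ff}}\bigr) \quad \text{s.t.} \quad x_{k+1}^{\mathrm{nom}} = \bar f\bigl(x_k^{\mathrm{nom}}, u_k^{\mathrm{ff}}\bigr),$$

$$\max_{\delta x \in \mathcal{D}_k} \bigl\| \bar f\bigl(x_k^{\mathrm{nom}} + \delta x,\ u_k^{\mathrm{ff}} + K\,\delta x\bigr) - x_{k+1}^{\mathrm{nom}} \bigr\| \le \epsilon, \qquad k = 0,\dots,T-1,$$

where $\bar f$ is the mean of the learned stochastic model, $\mathcal{D}_k$ is the local set of deviations implied by the model uncertainty at step $k$, and $\epsilon$ is the tolerance defining the tube around the nominal trajectory into which the worst-case deviated state must be pushed at the next time step.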
However, such MBRL methods are often difficult to train and can therefore yield low-performance policies. There are several reasons for the low performance of such algorithms. One key challenge is that the predictive model of the system estimated during learning is stochastic in nature, owing to the noise present in the data collected during the learning process. As a result, an incorrect model can drive the optimization algorithm into a part of the state space where the system is unstable, and the learning process may then diverge. Another challenging aspect of MBRL is that the estimated model may have varying degrees of uncertainty in different regions of the state space, so the subsequent policy optimization step should exploit this structure of the learned statistical model to achieve optimal performance. Most policy optimization techniques either ignore this information or fail to incorporate it during policy optimization. MBRL has the advantage that the predictive models estimated during learning are task independent, so they can be used for multiple tasks, making learning across multiple tasks more efficient. MBRL thus allows the learned model to be reused to compute policies for different tasks. Consequently, MBRL has the potential to learn effective policies for many physical systems in which collecting large amounts of data to optimize a policy would be very expensive. According to some embodiments of the present invention, policy optimization is perf
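The point about varying model uncertainty across the state space can be made concrete with a small sketch: if the learned model reports both a mean prediction and an uncertainty, the policy optimizer can penalize the uncertainty so that it avoids poorly modeled regions. The model, the uncertainty profile, and the weighting below are invented for illustration and are not the patent's method.

```python
import numpy as np

# Hypothetical illustration of exploiting the structure of a learned
# stochastic model: penalize predictive uncertainty so the optimizer
# avoids regions of the state space where the model is poor.
def predict(x, u):
    mean = 0.9 * x + 0.5 * u            # stand-in for a learned mean model
    std = 0.02 + 0.5 * abs(x + u)       # stand-in: less certain far from the data
    return mean, std

def uncertainty_aware_cost(x, u, x_goal, lam=1.0):
    mean, std = predict(x, u)
    return (mean - x_goal) ** 2 + 0.1 * u ** 2 + lam * std ** 2

candidates = np.linspace(-2.0, 2.0, 201)
u_best = min(candidates, key=lambda u: uncertainty_aware_cost(1.0, u, 0.0))
print(f"uncertainty-aware control: u = {u_best:.2f}")
```

Raising `lam` pulls the chosen control toward regions where the model is confident, at the expense of nominal cost; this is the trade-off between trajectory optimality and robustness that the description attributes to simultaneous design.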