EP-4239423-B1 - DEVICE AND METHOD FOR DETERMINING SAFE ACTIONS TO BE EXECUTED BY A TECHNICAL SYSTEM
Inventors
- Straehle, Christoph-Nikolas
- Geiger, Philipp
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2022-03-03
Claims (15)
- Computer-implemented method (100) for training a machine learning system (60), wherein the machine learning system is configured to determine a control signal (A) characterizing an action to be executed by a technical system (40, 100, 200), wherein the method for training comprises the steps of: • Obtaining a safe action ( a ) to be executed by the technical system (100, 200), wherein obtaining the safe action ( a ) comprises the steps of: ∘ Obtaining (101) a state signal (s), wherein the state signal ( s ) characterizes a state of the environment of the technical system; ∘ Determining (102), by a parametrized policy module (61) of the machine learning system (60), a distribution of potentially unsafe actions that could be executed by the technical system (100, 200), wherein the policy module (61) determines the distribution based on the obtained state signal (s); ∘ Sampling (103) a potentially unsafe action ( â ) from the distribution; ∘ Obtaining (104), by a safety module (62) of the machine learning system (60), the safe action ( a ), wherein the safe action ( a ) is obtained based on the sampled potentially unsafe action ( â ) and a set of safe actions ( A ) with respect to a current environment of the technical system (40, 100, 200), wherein obtaining (104) the safe action ( a ) comprises mapping the potentially unsafe action ( â ) to a safe action if the potentially unsafe action ( â ) is not in the set of safe actions ( A ); • Determining (105) a loss value based on the state signal (s) and the safe action ( a ), wherein the loss value characterizes a reward obtained based on the safe action ( a ); • Training (106) the machine learning system (60) by updating parameters (Φ) of the policy module (61) according to a gradient of the loss value with respect to the parameters (Φ).
- Method (100) according to claim 1, wherein obtaining (104) the safe action ( a ) by the safety module (62) comprises mapping the potentially unsafe action ( â ) to an action from the set of safe actions ( A ) if the potentially unsafe action ( â ) is not in the set of safe actions ( A ), wherein the mapping is performed by means of a piecewise diffeomorphism.
- Method (100) according to claim 2, wherein mapping the potentially unsafe action ( â ) to an action from the set of safe actions ( A ) comprises • Determining a countable partition (M) of the space of actions; • Determining, for each set (u, k) of the countable partition (M), whether the set (u, k) is a safe set (k) or an unsafe set (u), wherein a set is determined as a safe set (k) if the set only comprises actions from the set of safe actions ( A ) and if there exists a trajectory of actions for future states that comprises only safe actions, and wherein a set is determined as an unsafe set (u) otherwise; • If the potentially unsafe action ( â ) is in an unsafe set (u): ∘ Determining a safe set (k) from the partition (M) based on the distribution of the potentially unsafe actions; ∘ Mapping the potentially unsafe action ( â ) to an action from the safe set (k); ∘ Providing the action as safe action ( a ); • Otherwise, providing the potentially unsafe action ( â ) as safe action ( a ).
- Method (100) according to claim 3, wherein determining the safe set (k) comprises determining, for each safe set in the partition, a probability density of a representative action of the respective safe set of the partition with respect to the distribution of potentially unsafe actions, wherein the safe set comprising the representative action with highest probability density value is provided as determined safe set (k).
- Method (100) according to claim 3, wherein determining the safe set (k) comprises determining, for each safe set in the partition, a probability density of a representative action of the respective safe set of the partition with respect to the distribution of potentially unsafe actions, wherein a safe set is sampled based on the determined probability densities and the sampled safe set is provided as determined safe set (k).
- Method (100) according to claim 3, wherein the safe set (k) is determined by choosing the set from the partition that is deemed safe and has a minimal distance to the potentially unsafe action ( â ).
- Method (100) according to any one of the claims 3 to 6, wherein mapping the potentially unsafe action ( â ) to an action from the safe set (k) and providing the action as safe action ( a ) comprises determining a relative position of the potentially unsafe action ( â ) in the unsafe set (u) and providing the action at the relative position in the safe set (k) as safe action ( a ).
- Method (100) according to any one of the claims 3 to 6 wherein mapping the potentially unsafe action ( â ) to an action from the safe set (k) and providing the action as safe action ( a ) comprises determining an action from the safe set (k) that has a minimal distance to the potentially unsafe action ( â ) and providing the action as safe action ( a ).
- Method (100) according to any one of the claims 1 to 7, wherein the loss value is determined by a discriminator and training the machine learning system (60) comprises training the policy module (61) and the discriminator according to generative adversarial imitation learning.
- Computer-implemented method for determining a control signal (A) for controlling an actuator (10) of a technical system (100, 200), wherein the method comprises the steps of: • Training a machine learning system (60) using the method according to any one of the claims 1 to 10; • Determining the control signal (A) by means of the trained machine learning system (60) and based on a state signal (s) of an environment of the technical system.
- Machine learning system (60) trained according to claim 1.
- Computer-implemented method for training the machine learning system (60) according to claim 11, wherein the policy module is trained according to a reinforcement learning paradigm or an imitation learning paradigm, wherein during inference of the machine learning system (60) potentially unsafe actions ( â ) provided by the policy module (61) are mapped to safe actions ( a ) according to the step (104) of obtaining, by the safety module (62) of the machine learning system (60), the safe action ( a ) according to any one of the claims 1 to 9.
- Training system (140), which is configured to carry out the training method according to any one of the claims 1 to 9 or 12.
- Computer program that is configured to cause a computer to carry out the method according to any one of the claims 1 to 10 or 12 with all of its steps if the computer program is carried out by a processor (45, 145).
- Machine-readable storage medium (46, 146) on which the computer program according to claim 14 is stored.
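The training loop of claims 1 to 7 can be illustrated in code: a parametrized policy proposes a potentially unsafe action, a safety module maps it into a safe set of a partition of the action space (choosing the safe set by the density of a representative action, as in claim 4, and preserving the action's relative position inside its set, as in claim 7), and the policy parameter is updated from a reward. The following Python sketch is purely illustrative: the 1-D action space, the interval partition, the safety labels, the reward, and the score-function gradient update are invented placeholders, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D action space partitioned into intervals (claim 3);
# in practice, which sets are "safe" would depend on the current state.
PARTITION = [(-2.0, -1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
SAFE = [False, True, True, False]  # placeholder safety labels

def gaussian_density(a, mean, std=0.5):
    """Density of the policy's Gaussian action distribution."""
    return np.exp(-0.5 * ((a - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def to_safe(a_hat, mean):
    """Safety module (step 104): if the sampled action lies in an unsafe
    set, pick the safe set whose representative (midpoint) has the highest
    density under the policy distribution (claim 4) and map the action to
    the same relative position inside that safe set (claim 7)."""
    for (lo, hi), safe in zip(PARTITION, SAFE):
        if lo <= a_hat < hi:
            if safe:
                return a_hat  # already a safe action
            rel = (a_hat - lo) / (hi - lo)  # relative position in unsafe set
            best = max(
                (p for p, s in zip(PARTITION, SAFE) if s),
                key=lambda p: gaussian_density(0.5 * (p[0] + p[1]), mean),
            )
            return best[0] + rel * (best[1] - best[0])
    return a_hat  # outside the partition: returned unchanged in this sketch

# Training loop (claim 1): sample, make safe, score, update the parameter.
phi = -1.5  # policy mean; the only trainable parameter in this sketch
for step in range(200):
    a_hat = rng.normal(phi, 0.5)   # sample potentially unsafe action (103)
    a = to_safe(a_hat, phi)        # obtain safe action (104)
    reward = -(a - 0.5) ** 2       # placeholder reward: prefer a near 0.5
    grad_logp = (a_hat - phi) / 0.5 ** 2   # score-function gradient
    phi += 0.05 * reward * grad_logp       # gradient update (105, 106)
```

Note that the patent's piecewise diffeomorphism is precisely what makes the mapping differentiable set-by-set; the simple score-function estimator above is only a stand-in for whichever gradient estimator the training paradigm uses.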
Description
Technical field

The invention relates to a computer-implemented method for training a control system, a training system, a control system, a computer program, and a machine-readable storage medium.

Prior art

Bhattacharyya et al. 2020, "Modeling Human Driving Behavior through Generative Adversarial Imitation Learning", https://arxiv.org/abs/2006.06412v1, discloses the use of generative adversarial imitation learning for learning-based driver modeling.

EP 3 838 505 A1 discloses a computer-implemented method (700) of configuring a system which interacts with a physical environment. An action of the system in a state of the physical environment results in an updated state of the physical environment according to a transition probability. A safe set of state-action pairs known to be safely performable and an unsafe set of state-action pairs to be avoided are indicated. During an environment interaction, a safe set of state-action pairs is updated by estimating a transition probability for a state-action pair based on an empirical transition probability of a similar other state-action pair, and including the state-action pair in the safe set of state-action pairs only if the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability.

US 2020/0104680 A1 discloses methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection policy neural network, wherein the action selection policy neural network is configured to process an observation characterizing a state of an environment to generate an action selection policy output, wherein the action selection policy output is used to select an action to be performed by an agent interacting with an environment.
In one aspect, a method comprises: obtaining an observation characterizing a state of the environment subsequent to the agent performing a selected action; generating a latent representation of the observation; processing the latent representation of the observation using a discriminator neural network to generate an imitation score; determining a reward from the imitation score; and adjusting the current values of the action selection policy neural network parameters based on the reward using a reinforcement learning training technique.

Technical background

Modern technical devices oftentimes interact with their environment by executing certain actions. For example, a robot arm may move from one point to another, wherein the movement may constitute an action. An at least partially automated vehicle may execute a longitudinal and/or lateral acceleration, e.g., by steering and/or accelerating the wheels. A manufacturing robot may further execute actions specific to a tool mounted to the robot, e.g., gripping, cutting, welding, or soldering.

An action to be executed by a technical system is typically determined by a control system. In most modern systems, actions may be formulated abstractly by the control system, wherein further components of the technical system may then translate the abstract action into actuator commands such that the action is executed. For example, the control system of the manufacturing robot from above may determine to execute the action "gripping" and submit a control signal characterizing the action "gripping" to another component, wherein the other component translates the abstract action into, e.g., electric currents for a pump of a hydraulic system of the robot or for a motor, e.g., a servo motor, controlling a gripper of the robot.

In general, an action executed by a robot may be considered safe with respect to some desired safety goal and/or desired behavior.
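The reward scheme of US 2020/0104680 A1 summarized in the prior-art discussion above (observation → latent representation → discriminator → imitation score → reward) can be sketched as follows. All dimensions, weight initializations, and the tanh/sigmoid choices are hypothetical placeholders chosen for illustration; the cited publication does not fix any of them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes and randomly initialized weights (placeholders).
OBS_DIM, LATENT_DIM = 8, 4
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1  # encoder weights
w_disc = rng.normal(size=LATENT_DIM) * 0.1            # discriminator weights

def imitation_reward(obs):
    """One step of the cited scheme: encode the observation into a latent
    representation, score it with a discriminator, and turn the imitation
    score into a reward (here: log of the 'expert-like' probability)."""
    z = np.tanh(W_enc @ obs)                     # latent representation
    score = 1.0 / (1.0 + np.exp(-(w_disc @ z)))  # imitation score in (0, 1)
    return np.log(score + 1e-8)                  # reward for the RL update

obs = rng.normal(size=OBS_DIM)   # placeholder observation
r = imitation_reward(obs)        # reward fed to the policy update
```

The reward is negative by construction (log of a probability), so a policy trained on it is pushed toward states the discriminator scores as expert-like; the actual reinforcement-learning update is outside the scope of this sketch.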
An autonomous vehicle may, for example, be considered as performing safe actions if an action determined by the control system does not lead to a collision of the vehicle with other road participants and/or environmental entities.

Determining safe actions for a technical system is typically a non-trivial problem. This is in part due to the fact that safe actions (e.g., stopping in the emergency lane) may not always contribute to a desired behavior of the technical system (e.g., travelling to a desired destination) and may in fact even be detrimental to achieving the desired behavior. Hence, it is desirable to obtain a control system for a technical system that is able to determine actions to be executed by the technical system, wherein the actions are safe with respect to one or multiple safety goals and wherein the actions further contribute to achieving a desired behavior.

Advantageously, the method with the features of the independent claim 1 allows for determining a machine learning system that is configured to provide a control signal characterizing safe actions to be performed by a technical system. In addition to being safe, the actions determined by the machine learning system advantageo