EP-4235317-B1 - METHOD FOR CONTROLLING A MACHINE BY MEANS OF A LEARNING BASED CONTROL AGENT AND CONTROL DEVICE


Inventors

  • Swazinna, Phillip
  • Udluft, Steffen
  • Runkler, Thomas

Dates

Publication Date
2026-05-13
Application Date
2022-02-28

Claims (13)

  1. Computer-implemented method for controlling a machine (M) by means of a learning-based control agent (POL), wherein
     a) a performance evaluator (PEV) is provided and uses a control signal (A, A1) to determine a performance (RET) for controlling the machine (M) by means of the control signal (A, A1), wherein the performance (RET) relates to a power, an efficiency, a resource consumption, a yield, a pollutant emission, a product quality and/or other operating parameters of the machine (M),
     b) an action evaluator (VAE) is provided and uses the control signal (A, A1) to determine a deviation (D) from a predefined control sequence (SP),
     c) a multiplicity of weight values (W) for weighting the performance (RET) with respect to the deviation (D) are generated,
     d) a multiplicity of state signals (S, S1) and the weight values (W) are fed into the control agent (POL), wherein
        - a respectively resulting output signal (A, A1) from the control agent (POL) is fed as a control signal into the performance evaluator (PEV) and into the action evaluator (VAE),
        - a performance (RET) respectively determined by the performance evaluator (PEV) is weighted with respect to a deviation (D) respectively determined by the action evaluator (VAE) by a target function (TF) according to the respective weight value (W), and
        - the control agent (POL) is trained to use a state signal (S, S1) and a weight value (W) to output a control signal (A, A1) optimizing the target function (TF), and
     e) in order to control the machine (M)
        - an operating weight value (WO) and an operating state signal (SO) from the machine (M) are fed into the trained control agent (POL), and
        - a resulting output signal (AO) from the trained control agent (POL) is supplied to the machine (M),
     and wherein the weight value (W) is gradually changed when controlling the machine (M) in such a manner that the performance (RET) is increasingly given a higher weighting with respect to the deviation (D).
  2. Method according to Claim 1, characterized in that a respective performance (RET) is determined by the performance evaluator (PEV) and/or a respective deviation (D) is determined by the action evaluator (VAE) on the basis of a respective state signal (S, S1).
  3. Method according to one of the preceding claims, characterized in that a performance value (R) is respectively read in for a multiplicity of state signals (S) and control signals (A) and quantifies a performance resulting from application of a respective control signal (A) to a state of the machine (M) specified by a respective state signal (S), and in that the performance evaluator (PEV) is trained to reproduce an associated performance value (R) on the basis of a state signal (S) and a control signal (A).
  4. Method according to one of the preceding claims, characterized in that the performance evaluator (PEV) is trained to determine a performance (RET) accumulated over a future period of time by means of a Q-learning method and/or another Q-function-based reinforcement learning method.
  5. Method according to one of the preceding claims, characterized in that a multiplicity of state signals (S) and control signals (A) are read in, in that the action evaluator (VAE) is trained to use a state signal (S) and a control signal (A) to reproduce the control signal (A) following information reduction, wherein a reproduction error (DR) is determined, and in that the deviation (D) is determined on the basis of the reproduction error (DR).
  6. Method according to one of the preceding claims, characterized in that the deviation (D) is determined by the action evaluator (VAE) by means of a variational autoencoder, by means of an autoencoder, by means of generative adversarial networks and/or by a comparison, in particular a state-signal-dependent comparison, with predefined control signals.
  7. Method according to one of the preceding claims, characterized in that the weight values (W) are generated in a randomized manner.
  8. Method according to one of the preceding claims, characterized in that a gradient-based optimization method, a stochastic optimization method, particle swarm optimization and/or a genetic optimization method is/are used to train the control agent (POL), the performance evaluator (PEV) and/or the action evaluator (VAE).
  9. Method according to one of the preceding claims, characterized in that the control agent (POL), the performance evaluator (PEV) and/or the action evaluator (VAE) comprise(s) an artificial neural network, a recurrent neural network, a convolutional neural network, a multilayer perceptron, a Bayesian neural network, an autoencoder, a variational autoencoder, a deep learning architecture, a support vector machine, a data-driven trainable regression model, a k-nearest neighbour classifier, a physical model and/or a decision tree.
  10. Method according to one of the preceding claims, characterized in that the machine (M) is a robot, a motor, a manufacturing plant, a factory, an energy supply device, a gas turbine, a wind turbine, a steam turbine, a milling machine or another device or another installation.
  11. Controller (CTL) for controlling a machine (M), configured to carry out a method according to one of the preceding claims.
  12. Computer program product configured to carry out a method according to one of Claims 1 to 10.
  13. Computer-readable storage medium having a computer program product according to Claim 12.
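
Read as an algorithm, claims 3 and 4 describe a performance evaluator that is trained on recorded performance values and, via a Q-function-based reinforcement learning method, estimates the performance accumulated over a future period of time. The following sketch is a minimal illustration in PyTorch; all class and function names are hypothetical, and a SARSA-like fitted variant over recorded next control signals stands in for the broader Q-learning family the claims allow.

```python
import torch
import torch.nn as nn

class PerformanceEvaluator(nn.Module):
    """Q-style performance evaluator PEV: maps (state signal S, control signal A)
    to an estimate of the performance RET accumulated over a future period."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def td_step(pev, target_pev, batch, optimizer, gamma=0.99):
    """One temporal-difference step on recorded transitions
    (state S, control A, performance value R, next state S', next control A')."""
    s, a, r, s_next, a_next = batch
    with torch.no_grad():
        # bootstrapped estimate of the future accumulated performance
        target = r + gamma * target_pev(s_next, a_next)
    loss = ((pev(s, a) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Using the recorded next control signal keeps the estimate anchored to the observed control sequence; a max-based Q-learning target would be an alternative the claims equally cover.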
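Claims 5 and 6 characterize the action evaluator as, for example, a variational autoencoder whose reconstruction error after information reduction serves as the deviation measure. A minimal sketch of that reading follows, assuming a Gaussian latent space; the class and function names are again illustrative only.

```python
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    """Variational autoencoder VAE over control signals, conditioned on the
    state signal: encodes (S, A) into a small latent space and reconstructs A."""
    def __init__(self, state_dim, action_dim, latent_dim=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, action):
        h = self.encoder(torch.cat([state, action], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(torch.cat([state, z], dim=-1))
        return recon, mu, logvar

def deviation(vae, state, action):
    """Deviation D: reconstruction error DR of the control signal after
    information reduction through the latent bottleneck. Control signals far
    from the recorded control sequence reconstruct poorly and score high."""
    recon, _, _ = vae(state, action)
    return ((recon - action) ** 2).sum(dim=-1, keepdim=True)
```

During training, the reconstruction loss would be combined with the usual KL regularizer on the latent parameters; at evaluation time only the reconstruction error DR enters the deviation D.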

Description

Data-driven machine learning methods, particularly reinforcement learning, are increasingly used to control complex technical systems such as robots, motors, production facilities, power supply systems, gas turbines, wind turbines, steam turbines, milling machines, and other machinery. In these methods, control agents, especially artificial neural networks, are trained using large amounts of training data to generate a control signal optimized for a given state of the technical system according to a predefined objective function. Such a control agent trained to control a technical system is often referred to as a policy or simply as an agent. The objective function evaluates the performance to be optimized, such as power output, efficiency, resource consumption, yield, emissions, product quality, and/or other operating parameters of the technical system. Such an objective function is also frequently referred to as a reward function, cost function, or loss function.

Using such a performance-optimizing control agent can significantly increase the performance of a controlled technical system in many cases. However, the productive use of such a control agent also carries risks, because the generated control signals are often untested and therefore potentially unreliable. This is especially true under operating conditions that are not adequately covered by the training data. Furthermore, a user is often less familiar with such an optimized control sequence, and their expertise may therefore be less useful for monitoring the correct behavior of the technical system.

EP 3 940 596 A1 discloses a computer-implemented method for training a learning-based control agent for a technical system, wherein the control agent is configured by the training to optimize the control of the technical system. US 2015/370227 A1 discloses a method for controlling a target system by a weighted combination of several control strategies.

The object of the present invention is to provide a method and a control device for controlling a machine by means of a learning-based control agent, which allow for more efficient and/or more reliable control. This problem is solved by a method with the features of claim 1, by a control device with the features of claim 11, by a computer program product with the features of claim 12, and by a computer-readable storage medium with the features of claim 13.

To control a machine using a learning-based control agent, a performance evaluator and an action evaluator are provided. The performance evaluator determines a performance for controlling the machine using a control signal, and the action evaluator uses the same control signal to determine a deviation from a predefined control sequence. Furthermore, a multiplicity of weight values are generated to weight the performance against the deviation. These weight values, along with a multiplicity of state signals, are fed into the control agent. Each resulting output signal from the control agent is fed as a control signal into the performance evaluator and into the action evaluator; a performance determined by the performance evaluator is weighted against a deviation determined by the action evaluator according to the respective weight value by an objective function; and the control agent is trained to use a state signal and a weight value to output a control signal that optimizes the objective function.
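
As a concrete illustration of the training step just described, the following sketch conditions the control agent on the weight value and performs gradient ascent on a convex combination w * RET - (1 - w) * D. This functional form is one plausible reading of the weighting; the patent does not prescribe a specific objective, and all names below are hypothetical.

```python
import torch
import torch.nn as nn

class ControlAgent(nn.Module):
    """Weight-conditioned control agent POL: maps (state signal, weight value)
    to a control signal."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, weight):
        return self.net(torch.cat([state, weight], dim=-1))

def train_step(agent, perf_evaluator, action_evaluator, states, optimizer):
    """One gradient step on the weighted objective function. The evaluators are
    assumed pretrained and frozen; only the agent's parameters are optimized."""
    w = torch.rand(states.shape[0], 1)        # randomized weight values W (cf. claim 7)
    actions = agent(states, w)                # resulting output signals, used as control signals
    ret = perf_evaluator(states, actions)     # performance RET
    dev = action_evaluator(states, actions)   # deviation D from the predefined control sequence
    objective = w * ret - (1.0 - w) * dev     # one possible target function TF
    loss = -objective.mean()                  # maximize TF by descending its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the weight value is an input rather than a fixed hyperparameter, a single trained agent covers the whole family of trade-offs between performance and proximity to the predefined control sequence.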
To control the machine, an operating weight value and an operating state signal of the machine are then fed into the trained control agent, and a resulting output signal of the trained control agent is supplied to the machine. When controlling the machine, the weight value is successively changed so that performance is given increasingly greater weight than deviation.

To carry out the method according to the invention, a corresponding control device, a computer program product and a computer-readable, preferably non-volatile, storage medium are provided. The method and the control device according to the invention can be carried out or implemented, for example, by means of one or more computers, processors, application-specific integrated circuits (ASICs), digital signal processors (DSPs) and/or field-programmable gate arrays (FPGAs).

A particular advantage of the invention lies in the fact that, simply by entering or changing an operating weight value, different control strategies can be selected during operation of the machine. These strategies are optimized for different weightings of performance against deviation from a predefined control sequence. In this way, the trained control agent can be temporarily configured during operation to prioritize proximity to the predefined control sequence over machine performance. This can be particularly advantageous during safety-critical operating phases of the machine. Conversely, in less critical operating phases, performance can be weighted more heavily.
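
The gradual change of the weight value during operation could, for example, follow a simple linear schedule. In the hypothetical sketch below, the machine interface (read_state returning a state tensor, apply_control accepting a control tensor) is an assumption for illustration, as is the agent signature from the previous sketch.

```python
import torch

def operate(machine, agent, steps, w_start=0.0, w_end=1.0):
    """Control the machine with the trained agent while successively raising
    the operating weight value WO, so that the performance is increasingly
    given a higher weighting with respect to the deviation."""
    for t in range(steps):
        frac = t / max(steps - 1, 1)
        w_o = torch.tensor([[w_start + (w_end - w_start) * frac]])  # operating weight value WO
        s_o = machine.read_state()    # operating state signal SO (assumed shape: 1 x state_dim)
        a_o = agent(s_o, w_o)         # resulting output signal AO
        machine.apply_control(a_o)    # supplied to the machine
```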