CN-122021966-A - Reservoir dispatching safety reinforcement learning method based on Lagrangian method improvement
Abstract
The invention discloses a safe reinforcement learning method for reservoir dispatching based on an improved Lagrangian method, in the technical field of reservoir dispatching. Compared with correction and penalty methods, the Lagrangian method adaptively adjusts the constraint weight, markedly improving objective optimization performance while guaranteeing constraint satisfaction. Experimental results show that the Lagrangian method outperforms the other methods in both constraint satisfaction and power generation, reducing the degree of constraint violation by more than 80%. An action mask mechanism coordinates front-end defense with posterior optimization to effectively improve training stability: a complete rule mask precisely delimits the safety boundary, reduces the violation rate from 21%-79% to zero, and accelerates policy convergence.
Inventors
- Xie Yu
- Xu Wei
- Xie Zaichao
- Zhou Mi
- Li Zhengxuan
- Pu Siyu
Assignees
- Chongqing Jiaotong University (重庆交通大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-26
Claims (5)
- 1. An improved reservoir dispatching safety reinforcement learning method based on the Lagrangian method, characterized by comprising at least the following steps: S1, treating the reservoir dispatching problem as a sequential decision problem with safety constraints, formally modeling it as a constrained Markov decision process (CMDP), and representing the reservoir system as a seven-tuple; S2, introducing a state-driven action mask mechanism in the policy sampling stage, which dynamically filters out actions that clearly violate the physical constraints according to the current reservoir state before each decision step, thereby building a front line of defense outside policy learning; that is, constructing the state-driven action mask mechanism by computing a feasible flow interval from the water balance equation and the physical constraints, converting it into an action index interval through the action-flow mapping, and filtering illegal actions during policy sampling; S3, performing nonlinear action mapping and resolution design: under a unified framework, a nonlinear action mapping function is adopted, and by adjusting its shape, different action densities are allocated to different flow intervals; S4, adopting a constraint enforcement method based on Lagrangian duality at the policy optimization level, so that the policy satisfies the CMDP constraints on a long-term scale, converting the constrained CMDP optimization problem into a saddle-point search over the policy space and the dual space.
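The saddle-point search named in step S4 can be written compactly. The Lagrangian below is the standard CMDP form implied by the claims (the symbols $\pi$, $\gamma$, $r$, $c$, and the cost budget $d$ are the ones defined in claim 2):

```latex
\min_{\lambda \ge 0}\;\max_{\pi}\; L(\pi,\lambda)
  \;=\; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, r(s_t,a_t)\Big]
  \;-\; \lambda\,\Big(\mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, c(s_t,a_t)\Big] - d\Big)
```

The inner maximization over $\pi$ is the policy update of S4, and the outer minimization over $\lambda$ is the dual update of claim 5.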
- 2. The method for improving reservoir dispatching safety reinforcement learning based on the Lagrangian method according to claim 1, wherein S1 comprises the following steps: the reservoir system is represented as the seven-tuple $\langle S, A, P, r, c, \mu_0, \gamma\rangle$, wherein $S$ is the state space, with $s_t \in S$ denoting the system state at time $t$; $A$ is the action space, with $a_t \in A$ denoting the scheduling decision selected by the agent at time $t$; $P(s_{t+1}\mid s_t, a_t)$ is the state transition kernel, describing the probability that the system evolves from $s_t$ to $s_{t+1}$ after executing action $a_t$; $r(s_t, a_t)$ is the instantaneous reward function, describing comprehensive benefits such as power generation, flood control, ecology, and navigation; $c(s_t, a_t)$ is the constraint cost function, quantifying the degree of violation of safety constraints such as water level and flow; $\mu_0$ is the initial state distribution; and $\gamma$ is the discount factor; additionally, a threshold $d$ characterizes the acceptable security risk level as an upper bound on the allowed long-term constraint cost. In a specific implementation, the state at least comprises the current water level, the inflow, a forecast of inflow over several future days, the time-dependent upper and lower water-level limits, and a date code:

  $$s_t = \big(h_t,\; I_t,\; \hat{I}_{t+1:t+k},\; h^{\min}_t,\; h^{\max}_t,\; \tau_t\big)$$

  wherein $h_t$ is the current water level; $I_t$ is the current inflow; $\hat{I}_{t+1:t+k}$ is the inflow forecast vector for the next $k$ days; $h^{\min}_t$ and $h^{\max}_t$ are respectively the time-varying minimum and maximum operating water levels; and $\tau_t$ is the normalized date code characterizing seasonality. The storage-water-level evolution follows conservation of mass: denote $V_t$ the storage at time $t$, $O_t$ the outflow, and $\Delta t$ the scheduling time step; the water balance equation is

  $$V_{t+1} = V_t + (I_t - O_t)\,\Delta t.$$

  The storage and water level are linked by a monotone function obtained from topographic measurement:

  $$h_t = f^{-1}(V_t),$$

  wherein $f^{-1}$ is the inverse of the storage-water-level curve $f$, ensuring that the environment dynamics strictly obey water conservation and the topographic conditions, and providing a physical basis for the subsequent constraint design. Under the CMDP framework, the optimization objective of the policy $\pi$ is to maximize the discounted cumulative reward while satisfying the long-term constraint:

  $$\max_\pi\; \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} \gamma^t\, r(s_t,a_t)\Big] \quad \text{s.t.}\quad \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} \gamma^t\, c(s_t,a_t)\Big] \le d,$$

  wherein $T$ is the episode length, $\mathbb{E}_\pi$ denotes expectation under policy $\pi$, $\gamma$ is the discount factor, $r$ is the reward function, and $c$ is the constraint cost. This condenses the safe-operation requirements of reservoir dispatching into an expectation constraint, providing a unified mathematical starting point for the subsequent constraint-handling mechanism and the Lagrangian method.
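A minimal sketch of the water-balance dynamics above. The linear stand-in for the measured storage-water-level curve and all numeric values are illustrative assumptions, not values from the patent:

```python
def step_storage(V_t, inflow, outflow, dt):
    """Water balance: V_{t+1} = V_t + (I_t - O_t) * dt (volumes in m^3, flows in m^3/s)."""
    return V_t + (inflow - outflow) * dt

def level_from_storage(V):
    """Stand-in for the measured monotone curve h = f^{-1}(V); a real curve
    comes from topographic survey data, not this illustrative linear form."""
    return 100.0 + V / 1.0e6

# One daily step: 120 m^3/s in, 100 m^3/s out, starting from 5e6 m^3 of storage.
V1 = step_storage(5.0e6, inflow=120.0, outflow=100.0, dt=86400.0)
h1 = level_from_storage(V1)  # net gain of 20 m^3/s over one day = 1,728,000 m^3
```

Because $f$ is monotone, every storage bound translates directly into a water-level bound, which is what claim 3 exploits to build the mask.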
- 3. The method for improving reservoir dispatching safety reinforcement learning based on the Lagrangian method according to claim 2, wherein S2 comprises the following steps:

  S2.1, determining the probabilistic form of the mask. Denote the discrete action space $A = \{0, 1, \dots, K-1\}$, wherein $K$ is the action dimension. Given a state $s$, a mask function $m(s, a) \in \{0, 1\}$ is defined, wherein $m(s,a)=1$ indicates that action index $a$ is feasible in state $s$, and $m(s,a)=0$ indicates that the action is masked for violating the current water-level, flow, or storage constraints. Denote the unmasked log-probability output by the policy network as $\ell_a(s)$; the masked log-probability is written as

  $$\tilde{\ell}_a(s) = \begin{cases} \ell_a(s), & m(s,a)=1 \\ -\infty, & m(s,a)=0 \end{cases}$$

  wherein assigning $-\infty$ ensures that the probability of a masked action after softmax is strictly zero; specifically, the masked policy is

  $$\tilde{\pi}(a \mid s) = \frac{m(s,a)\,\exp(\ell_a(s))}{\sum_{a'} m(s,a')\,\exp(\ell_{a'}(s))}.$$

  This is equivalent to shrinking the policy's domain from $A$ to the feasible subset $A_{\mathrm{feas}}(s) = \{a : m(s,a)=1\}$; thus, the mask mechanism injects the physical constraints into the action sampling process deterministically without changing the policy network structure.

  S2.2, mapping between the action space and the flow space. To express the engineering water-level and storage constraints through the action mask, a mapping between action indices and physical outflow must be established. Denote $O_{\min}$ and $O_{\max}$ respectively the minimum and maximum feasible outflows; a discrete action index $i$ is mapped to a physical flow $O_i$. Accordingly, define the normalized action variable $u = i/(K-1) \in [0,1]$ representing the "continuous position" in the action space; its relation to the physical flow is written as

  $$O = O_{\min} + (O_{\max} - O_{\min})\, g(u),$$

  wherein $g$ is a monotonically increasing function, different choices of $g$ corresponding to different linear or nonlinear mappings. Given $g$, define the inverse mapping from flow to normalized action,

  $$u = g^{-1}\!\Big(\frac{O - O_{\min}}{O_{\max} - O_{\min}}\Big),$$

  and further quantize $u$ into a discrete action index $i = \mathrm{round}\big(u\,(K-1)\big)$, wherein $\mathrm{round}(\cdot)$ denotes rounding. Thus, any given flow-restriction interval $[O^{\mathrm{lb}}, O^{\mathrm{ub}}]$ can be mapped into an action index interval, and further into a contiguous "movable window" in the mask vector.

  S2.3, computing the mask interval based on water balance. The key to the masking mechanism is how to compute the allowed normalized action interval, and thereby the corresponding action index interval, from the current state $s_t$. The water balance equation and the water-level limits directly give upper and lower bounds on the outflow $O_t$. On the one hand, to prevent the water level from exceeding the day's upper limit $h^{\max}_t$, write the condition that the end-of-step storage does not exceed the limit storage $V^{\max}_t = f(h^{\max}_t)$:

  $$V_t + (I_t - O_t)\,\Delta t \le V^{\max}_t,$$

  from which the minimum allowed outflow is solved:

  $$O^{\mathrm{lb}}_t = I_t - \frac{V^{\max}_t - V_t}{\Delta t}.$$

  When $I_t\,\Delta t > V^{\max}_t - V_t$, $O^{\mathrm{lb}}_t$ is positive, indicating that storage drawdown must be achieved through a larger outflow; when $I_t\,\Delta t < V^{\max}_t - V_t$, $O^{\mathrm{lb}}_t$ is negative, allowing the outflow to be smaller than the inflow so as to store water. On the other hand, to prevent the storage from falling below the day's lower limit $V^{\min}_t = f(h^{\min}_t)$, the requirement

  $$V_t + (I_t - O_t)\,\Delta t \ge V^{\min}_t$$

  must be satisfied, giving the maximum allowed outflow:

  $$O^{\mathrm{ub}}_t = I_t + \frac{V_t - V^{\min}_t}{\Delta t}.$$

  The feasible flow interval can then be written as $[O^{\mathrm{lb}}_t, O^{\mathrm{ub}}_t]$, whose endpoints can be further tightened by combining engineering constraints such as the minimum ecological flow, the navigation flow, and the equipment capacity limit. Finally, $O^{\mathrm{lb}}_t$ and $O^{\mathrm{ub}}_t$ are mapped to the normalized action interval and the action index interval, and only the actions within that interval are retained in the mask. Therefore, the mask mechanism gives, at each time step, an action-feasible region consistent with the physical constraints based on water balance, realizing a one-step mapping from physical constraints to the flow interval to the action interval.
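A sketch of steps S2.2-S2.3: compute the feasible outflow interval from the water balance and convert it into a 0/1 mask over action indices. The function names, the linear choice of $g$, and all numbers are illustrative assumptions; the patent permits any strictly monotone mapping:

```python
import math

def feasible_outflow(V_t, inflow, V_min, V_max, dt):
    """[O_lb, O_ub] such that V_min <= V_t + (inflow - O)*dt <= V_max."""
    O_lb = inflow - (V_max - V_t) / dt   # keep end-of-step storage <= V_max
    O_ub = inflow + (V_t - V_min) / dt   # keep end-of-step storage >= V_min
    return O_lb, O_ub

def mask_vector(O_lb, O_ub, O_min, O_max, K):
    """1 for feasible indices, assuming the linear map O = O_min + u*(O_max-O_min)."""
    def to_index(O, round_fn):
        u = (O - O_min) / (O_max - O_min)          # inverse mapping g^{-1}
        return round_fn(min(max(u, 0.0), 1.0) * (K - 1))
    lo = to_index(O_lb, math.ceil)    # round inward so the window stays feasible
    hi = to_index(O_ub, math.floor)
    return [1 if lo <= i <= hi else 0 for i in range(K)]

# Toy numbers: K=5 actions covering 0..400 m^3/s, i.e. flows 0, 100, 200, 300, 400.
O_lb, O_ub = feasible_outflow(V_t=5.0e6, inflow=100.0,
                              V_min=4.0e6, V_max=6.0e6, dt=1.0e4)
mask = mask_vector(O_lb, O_ub, O_min=0.0, O_max=400.0, K=5)
```

Masked logits would then be set to `-inf` wherever `mask[i] == 0` before the softmax, which is exactly the probabilistic form of S2.1.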
- 4. The method for improving reservoir dispatching safety reinforcement learning based on the Lagrangian method according to claim 3, wherein S3 comprises the following steps:

  S3.1, mapping function family and inverse function. Given the discrete action space size $K$, denote the normalized position corresponding to action index $i$ as $u_i = i/(K-1)$. On this basis, a unified form of mapping is defined:

  $$O = O_{\min} + (O_{\max} - O_{\min})\, g(u),$$

  wherein $g$ is a strictly monotonically increasing function whose family at least includes linear, convex, concave, and logarithmic/exponential mappings. For any given $g$, as long as $g$ is continuous and strictly increasing on $[0,1]$, the inverse function $g^{-1}$ exists, through which a physical flow is mapped back to the normalized action space. This inverse mapping is particularly important for the action mask, where the given upper and lower flow bounds need to be converted into an action interval, which is realized through $g^{-1}$.

  S3.2, resolution analysis. The concept of action density is introduced to quantitatively analyze the resolution difference of different mapping functions over each flow interval. Denote $[O_a, O_b]$ a flow interval of interest, with interval length $\Delta O = O_b - O_a$. After mapping the interval to the normalized space, the number of actions available within it is approximately

  $$N \approx (K-1)\,\big(g^{-1}(\hat{u}_b) - g^{-1}(\hat{u}_a)\big),$$

  wherein $\hat{u}_a$ and $\hat{u}_b$ are respectively the normalized flow positions corresponding to $O_a$ and $O_b$. Further define the average action density $\rho = N / \Delta O$: the larger $\rho$ is over a flow interval, the higher the action resolution there.
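The mapping family and the density analysis of S3 can be sketched with one illustrative member, a power-law $g(u) = u^{\beta}$ (the choice of $g$ and the value of $\beta$ are assumptions for demonstration; any strictly increasing $g$ from the claim works the same way):

```python
def flow_from_action(i, K, O_min, O_max, beta=2.0):
    """O = O_min + (O_max - O_min) * g(u) with g(u) = u**beta, u = i/(K-1)."""
    u = i / (K - 1)
    return O_min + (O_max - O_min) * u ** beta

def action_density(O_a, O_b, K, O_min, O_max, beta=2.0):
    """Average number of discrete actions per unit flow over [O_a, O_b]."""
    def g_inv(O):  # inverse mapping u = ((O - O_min)/(O_max - O_min))**(1/beta)
        return ((O - O_min) / (O_max - O_min)) ** (1.0 / beta)
    n_actions = (K - 1) * (g_inv(O_b) - g_inv(O_a))
    return n_actions / (O_b - O_a)

# With the convex g (beta=2), more action indices crowd into the low-flow range,
# so small outflows can be controlled more finely than large ones.
rho_low  = action_density(0.0,   100.0, K=101, O_min=0.0, O_max=400.0)
rho_high = action_density(300.0, 400.0, K=101, O_min=0.0, O_max=400.0)
```

Swapping in a concave $g$ reverses this and concentrates resolution at high flows, which is exactly the design freedom S3 exploits.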
- 5. The method for improving reservoir dispatching safety reinforcement learning based on the Lagrangian method according to claim 4, wherein S4 comprises the following steps: given the dual variable $\lambda \ge 0$, define the modified value target

  $$\tilde{r}(s_t, a_t) = r(s_t, a_t) - \lambda\, c(s_t, a_t),$$

  and optimize it with any policy-gradient-based deep reinforcement learning algorithm; the policy update only needs to replace the instantaneous reward with $\tilde{r}$, and the rest of the implementation remains unchanged. On the other hand, after several rounds of policy updates, the average constraint cost $\hat{J}_c$ is estimated from the empirical trajectories under the current policy, and the dual variable is updated according to

  $$\lambda \leftarrow \big[\lambda + \eta\,(\hat{J}_c - d)\big]_+,$$

  wherein $\eta$ is the dual learning rate and $[\cdot]_+$ denotes projection onto the non-negative axis. If the constraint cost of the current policy is significantly higher than the threshold $d$, then $\lambda$ increases, strengthening the constraint penalty in subsequent policy updates; conversely, if the constraint cost is far below the threshold, $\lambda$ gradually decreases, releasing more optimization room for the main task reward. In engineering implementation, in order to improve numerical stability, the constraint cost is normalized and its violation part is extracted. Denote $c^{\mathrm{raw}}_t$ the original constraint cost returned by the environment at time $t$, $Z$ the normalization factor, and $\bar{d}$ the violation threshold; the normalized cost and the violation metric are written as

  $$\bar{c}_t = \frac{c^{\mathrm{raw}}_t}{Z}, \qquad e_t = \max\big(0,\; \bar{c}_t - \bar{d}\big).$$

  Employing the violation metric $e_t$ in place of the original cost in the reward reshaping and dual update above reduces the optimization difficulty caused by scale differences while keeping the CMDP objective structure unchanged.
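The three numerical pieces of S4 fit in a few lines; this is a minimal sketch with illustrative names (`eta`, `norm_factor`), not the patent's implementation:

```python
def shaped_reward(r, lam, cost):
    """Modified objective for policy updates: r_tilde = r - lambda * cost."""
    return r - lam * cost

def dual_update(lam, avg_cost, d, eta):
    """Projected dual ascent: lambda <- max(0, lambda + eta * (J_c - d))."""
    return max(0.0, lam + eta * (avg_cost - d))

def violation_metric(raw_cost, norm_factor, d_norm):
    """Normalize the raw cost and keep only its excess over the threshold."""
    c_bar = raw_cost / norm_factor
    return max(0.0, c_bar - d_norm)

# If the measured average cost exceeds the budget d, lambda grows and the
# penalty bites harder on the next round of policy updates; the projection
# keeps lambda from going negative when the policy is well inside the budget.
lam = dual_update(lam=0.5, avg_cost=1.2, d=1.0, eta=0.1)
```

In a full training loop, `violation_metric` would replace the raw cost both inside `shaped_reward` and in the trajectory average fed to `dual_update`.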
Description
Reservoir dispatching safety reinforcement learning method based on Lagrangian method improvement
Technical Field
The invention relates to the technical field of reservoir dispatching, in particular to a reservoir dispatching safety reinforcement learning method based on an improved Lagrangian method.
Background
Reservoir scheduling is essentially a safety-critical decision problem. Unlike general optimization tasks, reservoir operation is subject to strict physical and safety constraints: the flood-limit water level must be maintained during the flood season (June-September) to reserve flood-control storage, and violating it can cause downstream flood disasters; the minimum operating water level must be maintained during the dry season (January-May) to guarantee power generation and water supply, since an excessively low level forces unit shutdown and ecological degradation; and the outflow must be controlled within the bearing capacity of the equipment and the safe discharge of the downstream river, since an excessive outflow endangers the dam structure and downstream lives and property. These hard constraints form a strict "safety envelope", and any violation may lead to catastrophic results. Globally, over 58,000 large reservoirs bear 15% of the world's generated energy and irrigation water supply; their scheduling decisions directly affect the safety and economic development of hundreds of millions of people, with safety requirements as severe as those in safety-critical fields such as autonomous driving and medical diagnosis. Traditional scheduling methods rely mainly on empirical rules and mathematical optimization.
Empirical rules ensure safety through preset water level-period-outflow lookup tables, but they struggle to cope with non-stationary inflow patterns under climate change and cannot dynamically balance multiple objectives (flood control, power generation, ecology, navigation). Mathematical optimization methods (such as dynamic programming and linear programming) can solve for the optimal solution, but they are limited by the "curse of dimensionality" and by model assumptions; under coupled spatio-temporal scales and complex constraint combinations, their computational cost is high and real-time response is difficult. In recent years, deep reinforcement learning (DRL) has demonstrated potential in single- and multi-reservoir scheduling by virtue of its end-to-end learning and adaptive decision-making capabilities. However, standard reinforcement learning targets only the maximization of cumulative reward and may frequently, even severely, violate safety constraints during exploration, leading to training failure or dangerous, undeployable strategies. How to strictly guarantee physical constraints such as water level and flow during learning, that is, how to realize safe reinforcement learning (Safe Reinforcement Learning), has become the core bottleneck restricting the intelligent deployment of reservoir dispatching; its urgency and technical difficulty are comparable to the "collision-avoidance guarantee" in autonomous driving and the "stability guarantee" in robot control. Safe reinforcement learning aims to train strategies that maximize cumulative reward while satisfying safety constraints, and has shown important application value in safety-critical fields such as autonomous driving, robot control, and power systems. Its key challenge is how to avoid dangerous states during policy exploration while ensuring learning efficiency and final performance.
The Constrained Markov Decision Process (CMDP) provides a theoretical framework for this: it formalizes safety constraints as upper bounds on expected costs, solved by methods such as Lagrangian duality and trust-region optimization. However, general SafeRL methods often assume that the constraint functions are known and differentiable, and they are difficult to apply directly in practical engineering systems. In the reservoir dispatching scenario in particular, the safety constraints are multi-source and heterogeneous (water level, flow, equipment capacity, downstream bearing capacity), state-dependent (different constraints apply in the flood season and the dry season), and hard-thresholded (violation can cause disaster), and existing methods face three prominent problems. First, constraint-handling mechanisms lack systematic comparison and theoretical guarantees. The current reservoir dispatching literature mostly adopts fixed-weight penalty terms, adding constraint violations to the reward function with a fixed coefficient, but this approach has three defects: (1) the weight coefficient requires a great deal of trial-and-error adjustment