
CN-122026538-A - Power distribution network scheduling method and system based on multi-stage safety reinforcement learning

CN122026538A

Abstract

The invention discloses a power distribution network scheduling method and system based on multi-stage safety reinforcement learning, belonging to the technical field of operation control of power systems. The method comprises: constructing a cost reward function aimed at minimizing the cost of the power distribution network; constructing a cost agent whose cost action space is the candidate output adjustments of a cost scheduling entity; obtaining the actual output adjustment of the cost scheduling entity with the cost agent, subject to the output boundary constraints of that entity; constructing an adaptive reward function aimed at maximizing the safety scheduling performance of a safety scheduling entity; constructing a grid safety penalty function aimed at maximizing the safety of the power distribution network; constructing a safety agent whose safety action space is the candidate output adjustments of the safety scheduling entity; obtaining the actual output adjustment of the safety scheduling entity with the safety agent; and scheduling the power distribution network based on the actual output adjustments of all entities. This solves the technical problem in the prior art that the safety of the power distribution network and the scheduling performance of renewable energy are difficult to balance.

Inventors

  • MA XIANG
  • YANG XU
  • YANG LINING
  • LI JIANG
  • HUANG YINQIANG
  • LIU HAOTIAN
  • ZHENG RAN
  • WU XUEFENG
  • GU WEI
  • LV LEIYAN
  • WU LIFENG
  • TONG CUNZHI
  • FANG XUAN
  • LIU DONG

Assignees

  • Jinhua Power Supply Company of State Grid Zhejiang Electric Power Co., Ltd. (国网浙江省电力有限公司金华供电公司)
  • Sichuan Energy Internet Research Institute, Tsinghua University (清华四川能源互联网研究院)

Dates

Publication Date
2026-05-12
Application Date
2026-04-16

Claims (10)

  1. A power distribution network scheduling method based on multi-stage safety reinforcement learning, characterized by comprising the following steps: constructing a cost reward function aimed at minimizing the cost of the power distribution network, taking the candidate output adjustments of a cost scheduling entity as a cost action space, and constructing a cost agent based on the cost reward function and the cost action space; obtaining the actual output adjustment of the cost scheduling entity with the cost agent, subject to the output boundary constraints of the cost scheduling entity; constructing an adaptive reward function aimed at maximizing the safety scheduling performance of a safety scheduling entity, constructing a grid safety penalty function aimed at maximizing the safety of the power distribution network, taking the candidate output adjustments of the safety scheduling entity as a safety action space, and constructing a safety agent based on the adaptive reward function, the grid safety penalty function, and the safety action space; obtaining the actual output adjustment of the safety scheduling entity with the safety agent; and scheduling the power distribution network based on the actual output adjustments of the cost scheduling entity and the safety scheduling entity.
  2. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 1, wherein constructing the cost agent based on the cost reward function and the cost action space comprises: determining the number of neurons of a first input layer from the number of state variables of the cost scheduling entity and the number of state variables of the power distribution network nodes, and determining the number of neurons of a first output layer from the number of action variables in the cost action space, thereby constructing a first policy neural network; summing the numbers of neurons of the first input layer and the first output layer to obtain the number of neurons of a second input layer, thereby constructing a first reward neural network; constructing a first policy loss function from a guiding term that guides the first policy neural network to select optimal actions from the cost action space and a first predicted reward term that characterizes the quality of the selected actions according to the first reward neural network, and training the first policy neural network with the first policy loss function; obtaining a first actual reward term that characterizes the quality of the selected actions according to the cost reward function, obtaining a first reward loss function from the first predicted reward term and the first actual reward term, and training the first reward neural network based on the first reward loss function; and obtaining the cost agent from the trained first policy neural network and the trained first reward neural network.
  3. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 1, wherein obtaining the actual output adjustment of the cost scheduling entity with the cost agent, subject to the output boundary constraints of the cost scheduling entity, comprises: constructing a first penalty neural network based on a fully connected neural network, and constructing an entity safety-performance penalty function aimed at maximizing both the safety of the cost scheduling entity and its cost scheduling performance; obtaining a first actual penalty term that characterizes the quality of the selected actions according to the entity safety-performance penalty function, and obtaining a first predicted penalty term that characterizes the quality of the selected actions based on the first penalty neural network; obtaining a first penalty loss function from the first actual penalty term and the first predicted penalty term, training the first penalty neural network based on the first penalty loss function, and improving the cost agent with the trained first penalty neural network so as to obtain the actual output adjustment of the cost scheduling entity; wherein the number of neurons of the input layer of the first penalty neural network is determined from the number of constraint variables in the output boundary constraints, the number of state variables of the cost scheduling entity, and the number of action variables in the cost action space.
  4. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 3, wherein constructing the first penalty neural network based on a fully connected neural network and constructing the entity safety-performance penalty function aimed at maximizing both the safety and the cost scheduling performance of the cost scheduling entity comprises: constructing a safety penalty neural network and a performance penalty neural network based on fully connected neural networks; constructing an entity safety penalty function aimed at maximizing the safety of the cost scheduling entity, and constructing an entity performance penalty function aimed at maximizing the cost scheduling performance of the cost scheduling entity; wherein the safety penalty neural network and the performance penalty neural network together form the first penalty neural network.
  5. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 3, wherein improving the cost agent with the trained first penalty neural network so as to obtain the actual output adjustment of the cost scheduling entity comprises: improving the policy loss function in the cost agent based on the trained first penalty neural network, and training the policy neural network in the cost agent with the improved policy loss function to obtain a final cost agent; and obtaining an initial output adjustment of the cost scheduling entity through the final cost agent; if the initial output adjustment exceeds the output boundary constraints, applying a safe projection to the initial output adjustment to obtain the actual output adjustment of the cost scheduling entity; and if the initial output adjustment does not exceed the output boundary constraints, taking the initial output adjustment as the actual output adjustment of the cost scheduling entity.
  6. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 1, wherein constructing the safety agent based on the adaptive reward function, the grid safety penalty function, and the safety action space comprises: determining the number of neurons of a third input layer from the numbers of state variables of the cost scheduling entity, of the power distribution network nodes, and of the safety scheduling entity, and determining the number of neurons of a second output layer from the number of action variables in the safety action space, thereby constructing a second policy neural network; summing the numbers of neurons of the third input layer and the second output layer to obtain the number of neurons of a fourth input layer, thereby constructing a second reward neural network; constructing a second penalty neural network based on a fully connected neural network; constructing a second policy loss function from a guiding term that guides the second policy neural network to select optimal actions from the safety action space, a second predicted reward term that characterizes the quality of the selected actions according to the second reward neural network, and a second predicted penalty term, obtained from the second penalty neural network, that characterizes the quality of the selected actions, and training the second policy neural network with the second policy loss function; training the second reward neural network and the second penalty neural network according to the adaptive reward function and the grid safety penalty function; and obtaining the safety agent from the trained second policy neural network, the trained second reward neural network, and the trained second penalty neural network.
  7. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 6, wherein training the second reward neural network and the second penalty neural network according to the adaptive reward function and the grid safety penalty function comprises: obtaining a second actual reward term that characterizes the quality of the selected actions based on the adaptive reward function, obtaining a second reward loss function from the second predicted reward term and the second actual reward term, and training the second reward neural network based on the second reward loss function; and obtaining a second actual penalty term that characterizes the quality of the selected actions based on the grid safety penalty function, obtaining a second penalty loss function from the second predicted penalty term and the second actual penalty term, and training the second penalty neural network based on the second penalty loss function.
  8. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 6, further comprising, before training the second policy neural network with the second policy loss function: obtaining a stability penalty term using the augmented Lagrangian method and the second predicted penalty term, and improving the second policy loss function through the stability penalty term.
  9. The power distribution network scheduling method based on multi-stage safety reinforcement learning according to claim 8, wherein improving the second policy loss function through the stability penalty term comprises: obtaining an adaptive loss function for the adaptive multiplier in the second policy loss function from the second predicted penalty term and the stability penalty term, updating the adaptive multiplier based on the adaptive loss function, and fusing the updated adaptive multiplier, the stability penalty term, and the second policy loss function.
  10. A power distribution network scheduling system based on multi-stage safety reinforcement learning, adapted to the power distribution network scheduling method based on multi-stage safety reinforcement learning according to any one of claims 1 to 9, characterized by comprising: a cost agent construction module for constructing a cost reward function aimed at minimizing the cost of the power distribution network, taking the candidate output adjustments of the cost scheduling entity as a cost action space, and constructing a cost agent based on the cost reward function and the cost action space; a first output adjustment acquisition module for obtaining the actual output adjustment of the cost scheduling entity with the cost agent, subject to the output boundary constraints of the cost scheduling entity; a safety agent construction module for constructing an adaptive reward function aimed at maximizing the safety scheduling performance of the safety scheduling entity, constructing a grid safety penalty function aimed at maximizing the safety of the power distribution network, taking the candidate output adjustments of the safety scheduling entity as a safety action space, and constructing a safety agent based on the adaptive reward function, the grid safety penalty function, and the safety action space; a second output adjustment acquisition module for obtaining the actual output adjustment of the safety scheduling entity with the safety agent; and a scheduling module for scheduling the power distribution network based on the actual output adjustments of the cost scheduling entity and the safety scheduling entity.
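The mechanisms in claims 5, 8, and 9 (safe projection onto output boundary constraints, and a policy loss augmented with a multiplier-weighted penalty and a stability term via the augmented Lagrangian method) can be sketched compactly. The following is a minimal illustrative sketch, not the patented formulation: it assumes box-shaped output boundary constraints and a scalar predicted penalty, and the function names, the quadratic form of the stability term, and the multiplier update rule are all assumptions made for illustration.

```python
import numpy as np


def safe_projection(action, lower, upper):
    """Claim 5 (sketch): if the initial output adjustment exceeds its
    boundary constraints, project it back into the feasible box;
    otherwise it passes through unchanged. Assumes box constraints."""
    return np.clip(action, lower, upper)


def augmented_lagrangian_policy_loss(reward_q, penalty_q, lam, rho):
    """Claims 8-9 (sketch): a policy loss that maximizes the predicted
    reward (so we minimize its negative) while adding a multiplier-
    weighted predicted-penalty term and a quadratic stability penalty
    term, in the spirit of the augmented Lagrangian method. The
    quadratic form 0.5 * rho * penalty_q**2 is an illustrative choice."""
    stability_term = 0.5 * rho * penalty_q ** 2
    return -reward_q + lam * penalty_q + stability_term


def update_adaptive_multiplier(lam, penalty_q, rho, lam_max=100.0):
    """Claim 9 (sketch): grow the adaptive multiplier in proportion to
    the predicted constraint violation, clamped to [0, lam_max] for
    numerical stability. The exact update rule is an assumption."""
    return min(max(lam + rho * penalty_q, 0.0), lam_max)


# Toy usage: a 2-dimensional output adjustment with bounds [0, 1].
projected = safe_projection(np.array([1.5, -0.2]),
                            np.array([0.0, 0.0]),
                            np.array([1.0, 1.0]))
loss = augmented_lagrangian_policy_loss(reward_q=2.0, penalty_q=0.4,
                                        lam=1.0, rho=10.0)
lam_new = update_adaptive_multiplier(lam=1.0, penalty_q=0.4, rho=10.0)
```

In this sketch the projection guarantees feasibility at execution time regardless of what the policy network proposes, while the multiplier update makes repeated predicted violations progressively more expensive in the policy loss, which is the usual way an augmented Lagrangian scheme balances performance against constraint satisfaction.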

Description

Power distribution network scheduling method and system based on multi-stage safety reinforcement learning

Technical Field

The invention relates to the technical field of operation control of power systems, and in particular to a power distribution network scheduling method and system based on multi-stage safety reinforcement learning.

Background

Distributed renewable energy, represented by distributed photovoltaics, together with novel loads such as electric vehicles and energy storage systems, is being connected to the power distribution network on a large scale, making the operating characteristics of the network increasingly complex. The strong uncertainty and volatility of both sources and loads have grown markedly, so the traditional passive scheduling mode is difficult to adapt: it responds with a lag to real-time dynamic changes and easily causes safety problems such as node voltage violations and line overloads. Meanwhile, the conservative operating strategies adopted to avoid risk limit the scheduling performance of renewable energy, increase network losses and electricity purchase costs, and reduce the overall economy of the system. To guarantee the scheduling performance of renewable energy, the prior art gradually learns a control strategy under the complex uncertainty of the power distribution network through the exploration of reinforcement learning agents in the environment. However, traditional reinforcement learning is designed mainly to maximize the long-term cumulative reward, and its core exploration mechanism essentially allows the agent to try actions that may bring higher returns but carry unknown risks. In a power distribution network scheduling scenario, this mechanism easily produces unsafe actions such as voltage violations and equipment overloads.
Once such actions are executed in a real system, equipment lifetime is shortened, power quality deteriorates, and grid faults may even be triggered, so the safety of the power distribution network is difficult to guarantee. Therefore, applying traditional reinforcement learning directly to power distribution network scheduling lacks an inherent safety guarantee mechanism and carries unacceptable safety risks. To ensure the safety of the power distribution network, some research proposes safe reinforcement learning methods, whose mainstream idea is to add penalty terms for safety constraint violations to the reward function, so that the agent learns to avoid penalties and thereby achieves safety. However, safety constraints in the power distribution network are diverse, and handling all constraint types with a single penalty mechanism lacks the ability to treat different constraint characteristics differently: if the penalty is too strong, it limits the maximization of renewable energy scheduling performance; if too weak, it is difficult to guarantee the safety of the power distribution network. In addition, the agent may risk violations in pursuit of rewards, or become overly conservative to avoid penalties. Therefore, how to balance the safety of the power distribution network with the scheduling performance of renewable energy is a technical problem that is difficult to overcome in the prior art.
Disclosure of Invention

Aiming at the technical problem that the prior art finds it difficult to balance the safety of the power distribution network with the scheduling performance of renewable energy, the invention provides a power distribution network scheduling method and system based on multi-stage safety reinforcement learning. Through the decoupling of staged decision-making and reward-penalty mechanisms, a cost agent determines the actual output adjustment under the output boundary constraints of the cost scheduling entity; meanwhile, a safety agent is formed from an independently constructed adaptive reward function and grid safety penalty function, and this safety agent outputs the actual output adjustment of the safety scheduling entity. Renewable energy scheduling performance is thereby fully improved on the premise of guaranteeing the safety of the power distribution network, solving the technical problem that the safety of the power distribution network and the scheduling performance of renewable energy are difficult to balance in the prior art. In order to solve the above technical problems, the invention provides a power distribution network scheduling method based on multi-stage safety reinforcement learning, which comprises the following steps: constructing a cost reward function aimed at minimizing the cost of the power distribution