CN-122026415-A - Offshore wind farm deep reinforcement learning reactive power optimization method considering equipment fatigue loss and application thereof

CN 122026415 A

Abstract

The invention discloses a deep reinforcement learning reactive power optimization method for an offshore wind farm. The method first establishes a reactive power-voltage coordination control model of the offshore wind farm that includes loss characteristics. It then designs a composite reward function comprising voltage deviation, network loss and equipment fatigue loss, quantifying fatigue by penalizing action amplitude and smoothness, and introduces a state space extension mechanism to satisfy the Markov property. Next, it constructs an agent based on the proximal policy optimization (PPO) algorithm, introducing a clipped surrogate objective function and generalized advantage estimation to ensure the stability of policy updates, and finally performs offline training based on historical and simulation data. The invention effectively reduces voltage deviation and network loss, markedly reduces the action frequency of equipment, and prolongs equipment service life.

Inventors

  • ZHOU YUCHI
  • SHI JIANPING
  • ZHENG ZIJUN

Assignees

  • 南京师范大学 (Nanjing Normal University)

Dates

Publication Date
2026-05-12
Application Date
2026-01-26

Claims (10)

  1. A deep reinforcement learning reactive power optimization method for an offshore wind farm taking equipment fatigue loss into consideration, characterized by comprising the following steps: S1, establishing a reactive power-voltage coordination control mathematical model of the offshore wind farm in consideration of equipment operation constraints and loss characteristics, wherein the model covers the wind turbine generators, the collector-system cables, the main transformer and the reactive power compensation devices; S2, constructing a reinforcement learning environment based on a Markov decision process, and defining an extended state space S, a continuous action space A and a composite reward function R containing an equipment fatigue loss penalty term; S3, constructing a deep reinforcement learning network architecture based on the proximal policy optimization (PPO) algorithm, comprising a policy network for generating the probability distribution of reactive power control instructions and a value network for evaluating state values; S4, performing offline training of the PPO model using historical operation data of the offshore wind farm and simulation scenario data constructed through data augmentation, optimizing the network parameters by maximizing a clipped surrogate objective function to realize the mapping from the operating state of the offshore wind farm to the optimal reactive power control instruction, and introducing an entropy regularization term during training to enhance exploration capability; and S5, deploying the converged PPO policy model in the offshore wind farm energy management system, and generating online, according to measurement data acquired in real time, a reactive voltage coordination control strategy that accounts for voltage quality, system network loss and equipment health.
  2. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 1, wherein the objective function of the reactive power-voltage coordination control mathematical model established in step S1 comprises node voltage deviation, system active loss and equipment fatigue loss terms; the objective function F is expressed as: F = λ1·Σ_{i=1}^{N} ΔU_i + λ2·P_loss + λ3·C_fat, where N is the total number of system nodes, ΔU_i is the voltage deviation of the i-th node, P_loss is the system active network loss, C_fat is the equipment fatigue loss cost, and λ1, λ2, λ3 are the dynamic weight coefficients of the respective optimization objectives.
  3. The offshore wind farm deep reinforcement learning reactive power optimization method taking into account equipment fatigue loss according to claim 2, wherein the equipment fatigue loss cost C_fat is a control cost model simplified from the thermal-stress cycling principle of power electronic devices, calculated as: C_fat = Σ_{k=1}^{K} [ α·|Q_{k,t} − Q_{k,t−1}|/S_k + β·((Q_{k,t} − Q_{k,t−1})/S_k)² + γ·I(|Q_{k,t}| > η·S_k) ], where K is the total number of adjustable reactive power devices, Q_{k,t} and Q_{k,t−1} are respectively the reactive power output instructions of the k-th device at times t and t−1, S_k is the rated capacity of the device, α is the action amplitude penalty coefficient used to suppress large power jumps, β is the smoothness penalty coefficient used to suppress continuous oscillation, γ is the penalty coefficient for high-load operation, I(·) is the indicator function, and η is the high-load threshold coefficient.
  4. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 1, wherein in step S2, in order to satisfy the Markov property after introducing differential rewards, the state space S adopts an extended definition that includes the historical action: s_t = [P_t, Q_t, U_t, P_loss,t, a_{t−1}], where P_t is the active power output of each wind turbine generator at time t, Q_t is the current reactive power output of each unit, U_t is the voltage magnitude of each node of the whole network, P_loss,t is the current network loss, and a_{t−1} is the reactive power instruction vector issued to all adjustable devices in the previous control period.
  5. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 1, wherein the policy update of the proximal policy optimization algorithm in step S3 adopts a clipped surrogate objective function L^CLIP(θ) to limit the policy update step size, expressed as: L^CLIP(θ) = E_t[ min( r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ], where θ is the policy network parameter, r_t(θ) = π_θ(a_t|s_t)/π_θold(a_t|s_t) is the probability ratio of the new policy to the old policy, Â_t is the advantage function calculated using generalized advantage estimation, ε is the clipping hyper-parameter, and clip(·) is the clipping function ensuring that the probability ratio is confined within the interval [1−ε, 1+ε].
  6. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 5, wherein the advantage function Â_t is calculated by generalized advantage estimation (GAE) as: δ_t = r_t + γ·V(s_{t+1}) − V(s_t); Â_t = Σ_{l=0}^{∞} (γλ)^l·δ_{t+l}, where δ_t is the temporal-difference (TD) error, γ is the discount factor, λ is the GAE smoothing parameter, and V(·) is the state value estimate output by the value network.
  7. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 1, wherein the policy network in step S3 outputs multidimensional Gaussian distribution parameters over the continuous action space, and its specific structure comprises: an input layer receiving the state vector s_t; hidden layers consisting of multiple fully connected neural network layers with activation functions; and an output layer split into two branches, a mean branch μ and a standard deviation branch σ. The reactive power control action a_t is obtained by sampling from the Gaussian distribution N(μ, σ²), mapping the sample to the interval [−1, 1] through an activation function, and scaling it to the actual reactive capacity range of each device.
  8. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 1, wherein the offline training process in step S4 comprises an equipment health protection mechanism coupled with the reinforcement learning strategy: in the initial training stage the fatigue loss weight λ3 is set small, guiding the agent to learn voltage control first; λ3 is then gradually increased as training rounds accumulate, guiding the agent to search for the minimum-action-cost strategy within the voltage safety domain; and when the action generated by the agent causes the number of equipment adjustments within a short time to exceed a preset threshold, or the adjustment amplitude to exceed a preset limit, an additional strong penalty term is applied in the reward function and the current training episode is forcibly terminated, accelerating the agent's learning of the safe action boundary.
  9. The method for deep reinforcement learning reactive power optimization of an offshore wind farm in consideration of equipment fatigue loss according to claim 3, wherein the smoothness penalty coefficient is further combined with a sliding-window-based slew rate limit: if the variance of a device's output over a window of consecutive time steps exceeds a threshold, an exponential negative reward is applied to prevent high-frequency oscillation.
  10. An offshore wind farm reactive voltage control system applying the method according to any one of claims 1-9, comprising a data acquisition module, a deep reinforcement learning control module and an execution module, wherein the data acquisition module acquires in real time, via a SCADA system, the voltage, current and active power of each node of the offshore wind farm and the operating states of the SVGs/converters; the deep reinforcement learning control module contains the trained PPO policy model, receives the state data and calculates the optimal reactive power distribution instruction through forward inference; and the execution module receives the reactive power distribution instruction and adjusts the reactive power output of the SVGs and the wind turbine converters.
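As a rough illustration of the composite objective of claims 2-3, the sketch below combines voltage deviation, network loss and a fatigue penalty on action amplitude, smoothness and high-load operation. All coefficient names and default values (lam, alpha, beta, gamma, eta) are illustrative assumptions, not values from the patent.

```python
# Hypothetical sketch of the composite objective F of claims 2-3.
# Coefficients and their defaults are assumptions for illustration only.

def fatigue_cost(q_t, q_prev, s_rated, alpha=1.0, beta=0.5, gamma=0.2, eta=0.9):
    """Fatigue loss cost C_fat summed over K adjustable devices."""
    cost = 0.0
    for qt, qp, sn in zip(q_t, q_prev, s_rated):
        delta = (qt - qp) / sn            # normalized action amplitude
        cost += alpha * abs(delta)        # penalize large power jumps
        cost += beta * delta ** 2         # penalize oscillatory adjustment
        cost += gamma * (1.0 if abs(qt) > eta * sn else 0.0)  # high-load flag
    return cost

def composite_objective(v, v_ref, p_loss, q_t, q_prev, s_rated,
                        lam=(1.0, 0.1, 0.05)):
    """F = lam1 * sum |V_i - V_ref| + lam2 * P_loss + lam3 * C_fat."""
    dv = sum(abs(vi - v_ref) for vi in v)
    return (lam[0] * dv + lam[1] * p_loss
            + lam[2] * fatigue_cost(q_t, q_prev, s_rated))
```

In a reinforcement learning environment the per-step reward would typically be the negative of this objective, so that minimizing F maximizes reward.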
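The clipped surrogate objective and generalized advantage estimation referenced in claims 5-6 are standard PPO components; a minimal NumPy sketch (with assumed hyper-parameter defaults) is:

```python
import numpy as np

# Standard PPO building blocks as described in claims 5-6.
# Hyper-parameter defaults (gamma, lam, eps) are illustrative assumptions.

def gae(rewards, values, gamma=0.99, lam=0.95):
    """delta_t = r_t + gamma*V(s_{t+1}) - V(s_t); A_t = sum (gamma*lam)^l delta_{t+l}."""
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):       # backward recursion over the episode
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def clipped_surrogate(ratio, adv, eps=0.2):
    """L^CLIP = E[min(r*A, clip(r, 1-eps, 1+eps)*A)], limiting the update step."""
    return np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv))
```

The `min` with the clipped ratio is what keeps the new-to-old policy probability ratio effectively inside [1−ε, 1+ε], which is the stability mechanism the description contrasts with DDPG.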
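The action post-processing of claim 7 (sample from the Gaussian policy head, squash to [−1, 1], then rescale to each device's reactive capacity range) can be sketched as follows; the helper name and the tanh squashing choice are assumptions, since the claim only says "an activation function":

```python
import math
import random

# Sketch of claim 7's action mapping: Gaussian sample -> tanh -> device range.
# Function name and tanh choice are illustrative assumptions.

def sample_action(mu, sigma, q_min, q_max, rng=random.Random(0)):
    raw = [rng.gauss(m, s) for m, s in zip(mu, sigma)]   # sample from N(mu, sigma)
    squashed = [math.tanh(x) for x in raw]               # map to [-1, 1]
    # linear rescale from [-1, 1] to each device's [q_min_k, q_max_k]
    return [lo + (x + 1.0) * 0.5 * (hi - lo)
            for x, lo, hi in zip(squashed, q_min, q_max)]
```

Because tanh is bounded, the emitted instruction can never exceed a device's reactive capacity regardless of how extreme the raw Gaussian sample is.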

Description

Offshore wind farm deep reinforcement learning reactive power optimization method considering equipment fatigue loss and application thereof

Technical Field

The invention belongs to the technical field of wind power generation and power system automatic control, and particularly relates to a deep reinforcement learning reactive power optimization method for an offshore wind farm that takes equipment fatigue loss into consideration.

Background

With the rapid development of offshore wind power, the charging power characteristics of submarine cables make reactive voltage control of offshore wind farms a major challenge. Reactive power optimization by adjusting Static Var Generators (SVGs) and turbine converters is currently the mainstream means of maintaining voltage stability and reducing grid losses. Existing control methods fall into two main categories: traditional optimization methods based on a physical model and data-driven reinforcement learning methods. Traditional methods such as Optimal Power Flow (OPF) depend heavily on accurate system parameters, their computational complexity grows exponentially with the number of nodes, they can hardly meet the second-level real-time control requirement of an offshore wind farm, and parameter drift easily causes model mismatch. In recent years, deep reinforcement learning methods represented by the Deep Deterministic Policy Gradient (DDPG) have been widely studied for their model-free and fast-response characteristics. However, the prior art has serious drawbacks in practical engineering applications. First, it neglects equipment health and fatigue loss: existing algorithms usually target only voltage and network loss, and in pursuit of marginal mathematical optimality often generate high-frequency, large-amplitude adjustment instructions.
Such severe power fluctuations lead to frequent thermal cycling stresses on the power electronics (e.g., IGBTs), accelerated equipment aging, and even early failure, greatly increasing full-life-cycle operation and maintenance costs. Second, the algorithms are not stable enough: the existing DDPG algorithm is extremely sensitive to hyper-parameters and, being based on a deterministic policy, easily falls into local optima or produces policy oscillation, so that training is difficult to converge and the method is difficult to deploy directly in a high-reliability power system. In view of the above, there is a need for an improved reactive power optimization control method for offshore wind farms that can not only ensure voltage safety but also effectively suppress equipment action fatigue and prolong equipment life. Through retrieval, the Chinese patent with publication number CN113541192A discloses a deep reinforcement learning-based reactive power-voltage coordination control method for an offshore wind farm, belonging to the technical field of wind power generation. That method comprises: S1, building a reactive power-voltage coordination control model of the offshore wind farm; S2, building a Markov decision process model based on the reactive power-voltage control strategy of the wind farm, defining the system's states, actions and reward function; S3, training the reactive power-voltage coordination control model based on the deep deterministic policy gradient combined with random unit output data, to realize the mapping from wind turbine states to reactive power instructions; and S4, deploying the reactive power-voltage coordination control strategy of the offshore wind farm online. The comparison document differs from the present application as follows. 1. Difference in core algorithm architecture and training stability: CN113541192A employs the Deep Deterministic Policy Gradient (DDPG) algorithm.
As an algorithm based on a deterministic policy, DDPG depends heavily on fine tuning of hyper-parameters and is easily affected by Q-value overestimation during training, which makes policy updates unstable; it easily falls into local optima or oscillates, and can hardly directly meet the offshore wind farm's requirement for a highly reliable control strategy. The present patent adopts a deep reinforcement learning architecture based on the proximal policy optimization (PPO) algorithm. It uses the PPO algorithm's specific Clipped Surrogate Objective function and, by introducing the clipping mechanism, strictly limits the probability ratio (Ratio) of new-to-old policy updates, forcing the policy update step size to remain within a reasonable interval [1−ε, 1+ε]. This mechanism mathematically ensures monotonic non-decrease of the policy optimization, effectively solving the technical problems in CN113541192A that the DDPG algorithm is difficult to converge and the policy easily diverges in training under the complex