
CN-121978888-A - Electric vehicle charging station safety scheduling method, equipment and medium based on PID-Lagrange deep reinforcement learning


Abstract

The invention discloses an electric vehicle charging station safety scheduling method, equipment and medium based on PID-Lagrange deep reinforcement learning, relating to the technical field at the intersection of smart grids and artificial intelligence. The method first constructs a constrained Markov decision process model of the charging station, defining a reward function that captures grid power tracking error or economic profit, and a cost function based on the physical capacity limit of the distribution transformer and on user satisfaction. During deep reinforcement learning training, a PID control mechanism is introduced to dynamically update the Lagrange multipliers, adjusting the penalty weights with proportional, integral and derivative terms of the safety constraint violation. The method overcomes the severe multiplier oscillation and slow convergence of the traditional Lagrangian relaxation method when handling hard constraints. Experiments show that the invention strictly ensures the transformer is not overloaded while maximizing the operating profit of the charging station, effectively accounts for users' charging demands, and offers stable convergence, high safety and strong adaptability.

Inventors

  • CHEN RUI
  • CHEN JINXIANG
  • GENG PENG
  • RUI XIONGLI
  • CHEN XINGYOU
  • ZHANG TIANQI

Assignees

  • Nanjing Institute of Technology (南京工程学院)

Dates

Publication Date
2026-05-05
Application Date
2026-01-19

Claims (10)

  1. An electric vehicle charging station safety scheduling method based on PID-Lagrange deep reinforcement learning, characterized by comprising the following steps (an illustrative training-loop sketch follows the claims):
Step S1, constructing a constrained Markov decision process model of the electric vehicle charging station, wherein the model is defined by the quintuple $(S, A, P, R, C)$, in which $S$ denotes the state space, $A$ the action space, $P$ the state transition probability, $R$ the reward function and $C$ the safety cost function;
Step S2, initializing a deep reinforcement learning agent comprising a policy network for generating strategies, a value network for value evaluation and a PID Lagrangian multiplier updater for constraint control;
Step S3, having the deep reinforcement learning agent interact with the environment, specifically: acquiring the operating state of the charging station in real time, inputting it into the policy network, outputting the power scheduling action of each charging pile, calculating the reward value and the transformer overload cost at the current time, receiving the operating state of the charging station at the next time, and storing the current charging station state, the power scheduling action, the reward value, the transformer overload cost and the next charging station state in the experience replay pool of the constrained Markov decision process model;
Step S4, after each complete interaction episode ends, updating the Lagrangian multiplier with the PID Lagrangian multiplier updater;
Step S5, constructing, based on the updated Lagrange multiplier, a composite objective function containing the original reward function and a weighted safety cost penalty term, and alternately updating the parameters of the policy network and the value network with this objective function;
Step S6, judging whether the policy network and the value network have converged or the maximum number of iteration steps has been reached; if so, the policy network outputs the optimal safe scheduling strategy; otherwise, return to step S3.
  2. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein in step S1 the safety cost function is specifically an expected cumulative cost function, constructed as follows (see the overload cost sketch after the claims):
read in real time the rated capacity $P_{\max}$ of the distribution transformer of the charging station and the base load $P^{\text{base}}_t$ at the current time;
count all $N$ charging piles in the station and the actual output power of each pile at the current time $t$, and calculate the total power demand of the charging station at time $t$:
$$P^{\text{total}}_t = P^{\text{base}}_t + \sum_{i=1}^{N} p_{i,t}$$
where $p_{i,t}$ denotes the charge-discharge power of the $i$-th charging pile at time $t$, a positive value representing charging and a negative value representing discharging to the grid;
construct the transformer overload cost function, quantifying the degree to which the current scheduling action violates the physical capacity limit of the transformer:
$$c(s_t, a_t) = \frac{1}{\kappa}\,\mathrm{ReLU}\!\left(P^{\text{total}}_t - P_{\max}\right)$$
where $\mathrm{ReLU}(\cdot)$ is the linear rectification function, ensuring that a positive cost value is generated only when the total load exceeds the transformer capacity and that the cost is zero when the transformer is not overloaded; $\kappa$ is a normalization coefficient; and $c(s_t, a_t)$ is the transformer overload cost of taking action $a_t$ in state $s_t$;
based on the transformer overload cost function, construct the expected cumulative cost function for constrained reinforcement learning training:
$$J_C(\pi) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{T}\gamma^{t}\,c(s_t, a_t)\right] \le d$$
where $d$ is the preset safety threshold representing the maximum allowable cumulative violation cost; $J_C(\pi)$ is the expected cumulative cost of policy $\pi$; $\mathbb{E}$ is the mathematical expectation operator; $\tau$ denotes a state-action trajectory; $T$ is the time-domain length of the reinforcement learning training; $\gamma$ is the discount factor; $s_t$ is the system state at time $t$; and $a_t$ is the scheduling action at time $t$.
  3. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein in step S1 the state space is a high-dimensional feature vector containing multi-source information describing the environment dynamics, and specifically comprises (see the time-encoding sketch after the claims):
transformer load features: the real-time load rate $\rho_t$ of the transformer and a margin index reflecting overload risk;
time-period features: the current minute count $m_t$ and hour count $h_t$ are mapped to a trigonometric encoding to obtain the time-period feature vector $x^{\text{time}}_t$:
$$x^{\text{time}}_t = \left[\sin\frac{2\pi h_t}{24},\ \cos\frac{2\pi h_t}{24},\ \sin\frac{2\pi m_t}{60},\ \cos\frac{2\pi m_t}{60}\right]$$
grid price features, comprising the electricity purchase price $\lambda^{\text{buy}}_t$ and electricity selling price $\lambda^{\text{sell}}_t$ at the current time, a historical price sequence of the past $H$ hours extracted with a sliding window, and a day-ahead predicted price sequence for the future $H'$ hours;
aggregate or individual features of the electric vehicles, including, for each charging port, the connection state, the current battery state of charge $SoC_{i,t}$, the user-set target state of charge $SoC^{\text{tgt}}_i$, the amount of energy already charged, the predicted remaining parking time, the rated battery capacity, and the maximum allowable charge-discharge power;
environment-interaction features, namely the action $a_{t-1}$ executed at the previous time step, used to introduce a smoothing term into the reward function.
  4. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein in step S1, when the charging station participates in grid frequency regulation and peak-shaving/valley-filling ancillary services, the reward function is a grid power tracking reward based on a quadratic penalty, constructed as follows (see the sketch after the claims): the set power point $P^{\text{set}}_t$ is compared with the sum $P^{\text{avail}}_t$ of the maximum charge-discharge power potential available from all vehicles currently online at the station, the smaller of the two is taken as the practically feasible tracking target, and the deviation between this target and the total station power $P^{\text{total}}_t$ after the actions are actually executed is used to construct a negative reward in quadratic penalty form:
$$r^{\text{track}}_t = -w\left(\min\!\left(P^{\text{set}}_t,\, P^{\text{avail}}_t\right) - P^{\text{total}}_t\right)^2$$
where $r^{\text{track}}_t$ is the grid power tracking reward; $P^{\text{avail}}_t$ is the sum of the maximum charge-discharge power potential that all currently online vehicles of the station can provide; $P^{\text{total}}_t$ is the total power after the actions are actually executed; $w$ is a weight coefficient; and $\min(\cdot,\cdot)$ is the minimum function.
  5. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein in a commercial operation scenario, in step S1 the reward function is a profit-maximization reward mechanism based on mixed multiple objectives, and the total profit reward function is constructed as (see the sketch after the claims):
$$r^{\text{profit}}_t = r^{\text{energy}}_t - r^{\text{sat}}_t$$
with the net electricity income
$$r^{\text{energy}}_t = \sum_{i=1}^{N}\left[\lambda^{\text{sell}}_t\,\mathbb{1}(p_{i,t}>0) + \lambda^{\text{buy}}_t\,\mathbb{1}(p_{i,t}<0)\right] p_{i,t}\,\Delta t$$
and the user satisfaction penalty
$$r^{\text{sat}}_t = \sum_{v\in\mathcal{V}_t}\left(\frac{\max\!\left(E^{\text{tgt}}_v - E^{\text{dep}}_v,\,0\right)}{E^{\text{tgt}}_v}\right)^{\beta}$$
where $r^{\text{profit}}_t$ denotes the total profit reward, $r^{\text{energy}}_t$ the net electricity income and $r^{\text{sat}}_t$ the user satisfaction penalty; $\lambda^{\text{sell}}_t$ and $\lambda^{\text{buy}}_t$ are the electricity selling and purchasing prices respectively; $\mathbb{1}(\cdot)$ takes 1 when the condition in brackets is satisfied and 0 otherwise; $a_{i,t}$ denotes the scheduling action executed by the $i$-th charging pile; $p_{i,t}$ the charge-discharge power of the $i$-th pile at time $t$; $N$ the total number of charging piles; $\mathcal{V}_t$ the set of vehicles departing at the current time; $v$ a vehicle in the set $\mathcal{V}_t$; $E^{\text{tgt}}_v$ the target energy of vehicle $v$; $E^{\text{dep}}_v$ the actual energy of vehicle $v$ at departure; and $\beta$ a shape parameter.
  6. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein step S3 is specifically (see the sampling sketch after the claims): the operating state of the charging station comprises grid parameters and vehicle states, and each component of the collected multidimensional state vector $s_t$ is normalized; the normalized state vector is input to the policy network $\pi_\theta$, which outputs the mean $\mu_\theta(s_t)$ and standard deviation $\sigma_\theta(s_t)$ of a Gaussian distribution; standard normal noise $\epsilon$ is introduced and the raw action $u_t$ is computed with the reparameterization technique:
$$u_t = \mu_\theta(s_t) + \sigma_\theta(s_t)\odot\epsilon,\qquad \epsilon\sim\mathcal{N}(0, I)$$
and $\tanh(\cdot)$ is used to limit the value range, giving the bounded raw action $\tilde a_t = \tanh(u_t) \in (-1, 1)$, where $\epsilon$ is a random noise vector following the standard normal distribution, $I$ is the identity matrix and $\mathcal{N}$ denotes the normal distribution; the raw action $\tilde a_t$ is then mapped to the actual power command vector of each charging pile:
$$p_{i,t} = P^{\max}_i\,\tilde a_{i,t}$$
where $P^{\max}_i$ is the maximum allowable power of charging pile $i$, and the power command obtained by the mapping is the scheduling action: when $p_{i,t} > 0$, the charging pile is controlled to charge the electric vehicle with power $p_{i,t}$; when $p_{i,t} < 0$ and both the electric vehicle and the grid allow V2G discharge, the charging pile is controlled to discharge to the grid at the absolute value of the power; after the power command is executed, the environment feeds back, according to the physical response of the grid and the dynamics of the vehicle batteries, the next state $s_{t+1}$, the current reward value $r_t$ and the transformer overload cost $c_t$, and the quintuple $(s_t, a_t, r_t, c_t, s_{t+1})$ is stored in the experience replay pool.
  7. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein step S4 is specifically (see the updater sketch after the claims): after each complete interaction episode $k$ has ended, the cumulative safety cost from the beginning to the end of the episode, $J^{(k)}_C = \sum_{t=0}^{T} c_t$, is calculated as a single-sample estimate of the expected cumulative cost $J_C(\pi)$; the safety constraint violation error $e_k$ is calculated, defined as the difference between the cumulative cost $J^{(k)}_C$ and the preset safety threshold $d$:
$$e_k = J^{(k)}_C - d$$
based on PID control theory, the gradient update step $\Delta_k$ of the Lagrangian multiplier is composed of three parts: a proportional term $K_P e_k$, where $K_P$ denotes the proportional coefficient; an integral term $I_k = I_{k-1} + K_I e_k\,\Delta t$, where $K_I$ denotes the integral coefficient used to eliminate the steady-state error of the system, $I_{k-1}$ is the value of the integral term at the previous step and $\Delta t$ is the sampling period; and a differential term $K_D (e_k - e_{k-1})/\Delta t$ used to capture the changing trend of the error, where $K_D$ denotes the differential coefficient; the sum of the three parts is the gradient update step:
$$\Delta_k = K_P e_k + I_k + K_D\,\frac{e_k - e_{k-1}}{\Delta t}$$
combined with the learning rate $\eta$, the Lagrangian multiplier at the next step is updated with a non-negative projection operation:
$$\lambda_{k+1} = \mathrm{ReLU}\!\left(\lambda_k + \eta\,\Delta_k\right)$$
where $\lambda_k$ denotes the Lagrangian multiplier at the current step and $\mathrm{ReLU}(\cdot)$ the linear rectification function.
  8. The method for safely scheduling electric vehicle charging stations based on PID-Lagrangian deep reinforcement learning according to claim 1, wherein step S5 is specifically (see the loss sketch after the claims): a policy entropy term $\mathcal{H}(\pi(\cdot\mid s_t))$ and a safety cost penalty term weighted by the updated Lagrangian multiplier $\lambda$ are introduced into the reinforcement learning objective, so as to construct a composite objective function comprising the original reward function and the weighted safety cost penalty term, which maximizes the cumulative reward and the policy randomness while minimizing the safety cost:
$$J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T}\gamma^{t}\left(r_t + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\right)\right] - \lambda\,J_C$$
where $J(\pi)$ denotes the composite objective function of policy $\pi$; $\mathbb{E}_\pi$ denotes the expectation under policy $\pi$; the first term is the reward function part, in which $\gamma$ is the discount factor, $t$ the time step, $r_t$ the reward value at time $t$, $\alpha$ the temperature coefficient and $\mathcal{H}(\pi(\cdot\mid s_t))$ the policy entropy term; $\lambda$ is the updated Lagrangian multiplier; and $J_C$ is the cumulative safety cost of the current interaction episode.
The specific process of updating the value network parameters is as follows: the value network comprises two reward Q networks of identical structure, $Q_{r,1}$ and $Q_{r,2}$, and two safety cost Q networks of identical structure, $Q_{c,1}$ and $Q_{c,2}$; when calculating the target values, the minimum of the outputs of the two reward Q networks and the minimum of the outputs of the two safety cost Q networks are taken:
$$y_r = r_t + \gamma\,(1-d_t)\left[\min_{j=1,2} Q_{r,j}(s_{t+1}, a_{t+1}) - \alpha\log\pi(a_{t+1}\mid s_{t+1})\right]$$
$$y_c = c_t + \gamma\,(1-d_t)\,\min_{j=1,2} Q_{c,j}(s_{t+1}, a_{t+1})$$
where $y_r$ is the reward target value; $r_t$ denotes the reward at the current time; $\gamma$ is the discount factor; $d_t$ denotes the episode termination flag at time $t$; $Q_{r,j}(s_{t+1}, a_{t+1})$ represents the output of the $j$-th reward Q network at the next time step; $s_{t+1}$ denotes the state at the next time step and $a_{t+1}$ the action at the next time step; $\alpha$ is the temperature coefficient; $\pi(a_{t+1}\mid s_{t+1})$ represents the probability density value of the policy network outputting action $a_{t+1}$ in state $s_{t+1}$; $y_c$ is the cost target value; $c_t$ denotes the transformer overload cost at the current time; and $Q_{c,j}(s_{t+1}, a_{t+1})$ represents the output of the $j$-th safety cost Q network at the next time step.
The mean square error between the output $Q_{r,j}(s_t, a_t)$ of the $j$-th reward Q network at the current time and the reward target value $y_r$ is calculated; the mean square error between the output $Q_{c,j}(s_t, a_t)$ of the $j$-th safety cost Q network at the current time and the cost target value $y_c$ is calculated; and the parameters of the reward Q networks $Q_{r,1}$, $Q_{r,2}$ and the safety cost Q networks $Q_{c,1}$, $Q_{c,2}$ are updated by minimizing these mean square errors through gradient descent.
The specific process of updating the policy network parameters is as follows: a loss function of the policy network is derived from the composite objective function $J(\pi)$ with the policy gradient method, and the policy network parameters are updated by minimizing this loss function. Specifically, with the value-function approximation method, the cumulative reward in the composite objective $J(\pi)$ is estimated with the reward Q network $Q_{r,j}$, the cumulative safety cost $J_C$ is estimated with the safety cost Q network $Q_{c,j}$, the policy entropy term is represented by the log probability $\log\pi(a_t\mid s_t)$, and, combined with the updated Lagrangian multiplier $\lambda$, maximizing the composite objective $J(\pi)$ is converted into minimizing the loss function $L_\pi$:
$$L_\pi = \mathbb{E}\!\left[\alpha\log\pi(a_t\mid s_t) - Q_{r,j}(s_t, a_t) + \lambda\,Q_{c,j}(s_t, a_t)\right]$$
where $L_\pi$ is the loss function of the policy network; $\mathbb{E}$ is the mathematical expectation operator; $Q_{c,j}(s_t, a_t)$ is the output of the $j$-th safety cost Q network at the current time; $Q_{r,j}(s_t, a_t)$ is the output of the $j$-th reward Q network at the current time; $\alpha$ is the temperature coefficient; and $\pi(a_t\mid s_t)$ represents the probability density value of the policy network outputting action $a_t$ in state $s_t$.
  9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the electric vehicle charging station safety scheduling method based on PID-Lagrangian deep reinforcement learning of any one of claims 1-8.
  10. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the electric vehicle charging station safety scheduling method based on PID-Lagrangian deep reinforcement learning according to any one of claims 1 to 8.
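The sketches below are editorial illustrations of the claimed method, not code from the patent. First, a minimal Python skeleton of the overall flow of claim 1 (steps S3-S6); the `env`, `agent` and `pid_updater` objects and all method names are hypothetical.

```python
def train(env, agent, pid_updater, max_episodes=500):
    """Illustrative skeleton of claim 1: interact (S3), update the multiplier
    (S4), update the networks (S5), and stop on convergence (S6)."""
    for episode in range(max_episodes):
        state, episode_cost, done = env.reset(), 0.0, False
        while not done:                                  # S3: one interaction episode
            action = agent.act(state)                    # per-pile power scheduling
            next_state, reward, cost, done = env.step(action)
            agent.replay_pool.store(state, action, reward, cost, next_state)
            episode_cost += cost                         # transformer overload cost
            state = next_state
        lam = pid_updater.update(episode_cost)           # S4: PID multiplier update
        agent.learn(lam)                                 # S5: composite-objective updates
    return agent.policy                                  # S6: safe scheduling policy
```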
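Next, a sketch of the per-step transformer overload cost of claim 2; the normalization coefficient value is an assumption.

```python
import numpy as np

def overload_cost(p_piles, p_base, p_rated, kappa=1.0):
    """Sketch of c(s_t, a_t): ReLU of the load excess over the transformer's
    rated capacity, scaled by a normalization coefficient kappa (assumed)."""
    p_total = p_base + np.sum(p_piles)           # total demand P_total_t (kW)
    return max(p_total - p_rated, 0.0) / kappa   # positive only when overloaded
```

Summing the discounted per-step costs over an episode yields the single-sample estimate of the expected cumulative cost $J_C(\pi)$ that is compared against the safety threshold $d$.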
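A sketch of the trigonometric time-period encoding in the claim-3 state vector; the exact periods (24 h, 60 min) are an assumed but conventional choice that makes 23:59 and 00:00 adjacent in feature space.

```python
import numpy as np

def time_features(hour, minute):
    """Map the clock time onto the unit circle so the encoding is continuous
    across day boundaries (assumed form of the claim-3 encoding)."""
    return np.array([
        np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24),
        np.sin(2 * np.pi * minute / 60), np.cos(2 * np.pi * minute / 60),
    ])
```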
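A sketch of the quadratic grid power tracking reward of claim 4; the weight value is illustrative.

```python
def tracking_reward(p_set, p_avail, p_total, w=1.0):
    """Claim-4 style negative quadratic reward: the feasible target is the
    smaller of the grid set-point and the fleet's available power potential."""
    target = min(p_set, p_avail)        # practically feasible tracking target
    return -w * (target - p_total) ** 2
```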
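A sketch of the mixed profit reward of claim 5. The sign conventions (charging is settled at the selling price, V2G discharge at the purchase price) and the exact form of the satisfaction penalty are assumptions consistent with the symbols listed in the claim.

```python
def profit_reward(lam_sell, lam_buy, p_piles, departing, dt=0.25, beta=2.0):
    """Hypothetical claim-5 reward: net electricity income minus a shaped
    penalty on unmet target energy for vehicles departing this step.

    p_piles   : per-pile powers (kW), >0 charging, <0 discharging to the grid
    departing : (e_target_kwh, e_actual_kwh) pairs for vehicles leaving now
    dt        : interval length in hours; beta: shape parameter
    """
    income = sum((lam_sell if p > 0 else lam_buy) * p * dt for p in p_piles)
    penalty = sum((max(e_tgt - e_act, 0.0) / e_tgt) ** beta
                  for e_tgt, e_act in departing)
    return income - penalty
```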
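A sketch of the reparameterized squashed-Gaussian action generation of claim 6, in PyTorch; the network producing `mu` and `log_std` is assumed to exist elsewhere.

```python
import torch

def sample_action(mu, log_std, p_max):
    """Claim-6 style sampling: u = mu + sigma * eps (reparameterization),
    tanh squashes u into (-1, 1), then scaling gives power commands."""
    std = log_std.exp()
    eps = torch.randn_like(mu)   # eps ~ N(0, I)
    u = mu + std * eps           # raw action, differentiable w.r.t. mu and std
    a = torch.tanh(u)            # bounded raw action in (-1, 1)
    return p_max * a             # per-pile power command (kW); sign = charge/discharge
```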
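A sketch of the PID Lagrangian multiplier updater of claim 7, the core contribution of the method; all gain values are illustrative.

```python
class PIDLagrangeUpdater:
    """Claim-7 style updater: proportional, integral and derivative terms of
    the constraint violation drive the multiplier, with a non-negative clamp."""

    def __init__(self, kp=0.1, ki=0.01, kd=0.05, lr=1.0, dt=1.0, threshold=10.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lr, self.dt, self.d = lr, dt, threshold
        self.lam = 0.0          # Lagrange multiplier lambda_k
        self.integral = 0.0     # integral term I_k (eliminates steady-state error)
        self.prev_err = 0.0     # previous violation error e_{k-1}

    def update(self, episode_cost):
        err = episode_cost - self.d                         # e_k = J_C - d
        self.integral += self.ki * err * self.dt            # I_k accumulation
        deriv = self.kd * (err - self.prev_err) / self.dt   # violation trend
        step = self.kp * err + self.integral + deriv        # Delta_k
        self.lam = max(self.lam + self.lr * step, 0.0)      # ReLU projection
        self.prev_err = err
        return self.lam
```

The derivative term is what distinguishes this from the integral-only update of standard Lagrangian relaxation: it reacts to the trend of the violation and damps the multiplier oscillation described in the background.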
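Finally, a sketch of the claim-8 policy loss; the Q values and log-probability are assumed to come from the twin reward/cost critics and the policy network described in that claim.

```python
import torch

def policy_loss(q_r_min, q_c_min, log_prob, alpha, lam):
    """Claim-8 style SAC-Lagrangian policy loss L_pi: entropy-regularized
    reward maximization plus a lambda-weighted safety cost penalty.

    q_r_min  : elementwise min over the two reward Q networks, Q_r(s_t, a_t)
    q_c_min  : elementwise min over the two safety cost Q networks, Q_c(s_t, a_t)
    log_prob : log pi(a_t | s_t); alpha: temperature; lam: Lagrange multiplier
    """
    return (alpha * log_prob - q_r_min + lam * q_c_min).mean()
```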

Description

Electric vehicle charging station safety scheduling method, equipment and medium based on PID-Lagrange deep reinforcement learning

Technical Field

The invention relates to the technical field at the intersection of smart grids, the energy Internet and artificial intelligence, in particular to an electric vehicle charging station safety scheduling method, equipment and medium based on PID-Lagrange deep reinforcement learning, applied in particular to orderly charging and discharging (V2G) management of electric vehicles in regions with limited transformer capacity.

Background

With the rapid growth in the number of electric vehicles (EVs), large-scale charging load access presents a significant challenge to the distribution network. Particularly in places with limited distribution capacity, such as old residential communities or commercial buildings, disorderly charging behavior of electric vehicles can easily overload the distribution transformer and cause safety incidents such as voltage limit violations and equipment damage. In the prior art, charging scheduling mainly adopts rule-based methods or Model Predictive Control (MPC). The MPC method depends heavily on accurate prediction of future information, and its computational complexity increases exponentially with the number of charging piles, making it difficult to meet the timeliness requirements of real-time control. In recent years, Deep Reinforcement Learning (DRL) has been widely applied because of its strong real-time decision-making capability. However, conventional unconstrained DRL struggles to guarantee hard physical safety constraints. Constrained reinforcement learning (Constrained RL) based on the Lagrangian relaxation method is the mainstream solution, but existing methods generally update the Lagrangian multiplier only with the gradient information of the constraint violation (equivalent to integral-only control). This mechanism has serious lag, so the multiplier oscillates severely during training, the policy swings repeatedly between extreme conservatism and frequent violations, and it is difficult to converge quickly to an optimal policy that balances safety and efficiency. Therefore, there is a need for an electric vehicle charging station safety scheduling method that can effectively suppress Lagrangian multiplier oscillation, improve algorithm convergence speed and stability, and strictly guarantee the physical safety constraints of grid equipment.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an electric vehicle charging station safety scheduling method, equipment and medium based on PID-Lagrange deep reinforcement learning, so as to solve the problems in the prior art that economy and hard physical safety constraints are difficult to balance and that traditional constrained reinforcement learning training is unstable.
In order to achieve the above purpose, the present invention adopts the following technical scheme: an electric vehicle charging station safety scheduling method based on PID-Lagrange deep reinforcement learning, comprising the following steps:

Step S1, constructing a constrained Markov decision process model of an electric vehicle charging station, wherein the model is defined by the quintuple $(S, A, P, R, C)$, in which $S$ denotes the state space, $A$ the action space, $P$ the state transition probability, $R$ the reward function and $C$ the safety cost function;

Step S2, initializing a deep reinforcement learning agent comprising a policy network for generating strategies, a value network for value evaluation and a PID Lagrangian multiplier updater for constraint control;

Step S3, having the deep reinforcement learning agent interact with the environment, specifically: acquiring the operating state of the charging station in real time, inputting it into the policy network, outputting the power scheduling action of each charging pile, calculating the reward value and the transformer overload cost at the current time, receiving the operating state of the charging station at the next time, and storing the current charging station state, the power scheduling action, the reward value, the transformer overload cost and the next charging station state in the experience replay pool of the constrained Markov decision process model;

Step S4, after each complete interaction episode ends, updating the Lagrangian multiplier with the PID Lagrangian multiplier updater;

Step S5, constructing a composite objective function containing the original reward function and a weighted safety cost penalty term based on the