
CN-121684539-B - Demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning

CN121684539B

Abstract

The invention discloses a demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning. Switchable preference parameters are introduced simultaneously at the reward and policy levels, enabling continuous trade-offs among multiple objectives such as economy, comfort, and energy-storage device health. A dual-graph structure consisting of a homogeneous graph and a heterogeneous graph is constructed, and cross-type key interaction objects are screened by a scale-matching key-interaction-object selection mechanism. Local and global attention vectors aggregate competitive and cooperative features respectively, forming a hybrid competition-cooperation feature that is fed into a preference-conditioned policy network. Finally, a proximal policy optimization algorithm completes centralized training with distributed execution, yielding a hierarchical demand response collaborative decision scheme whose policy behavior can be switched via the preference parameters at the operation stage. The method addresses the difficulty of effectively making decisions that match the real-time preferences of multi-objective user-side agents while executing demand response tasks.

Inventors

  • HUA HAOCHEN
  • MA LUYAO
  • MEI FEI
  • WANG BO

Assignees

  • Hohai University (河海大学)

Dates

Publication Date
2026-05-05
Application Date
2026-02-10

Claims (10)

  1. A demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning, comprising the following steps: establishing a hierarchical demand response system, the system being a user-side agent set composed of an aggregator set, a shared energy storage operator set, and a user set; expressing the behavior of each user-side agent as a partially observable Markov decision process based on the hierarchical demand response system; establishing a demand response behavior model for the user set, establishing a comfort characterization model of the user set to quantify the comfort loss of the demand response behavior model, establishing a shared energy storage model for the shared energy storage operator set, establishing a time-of-use compensation pricing model for the aggregator set based on the demand response behavior model of the user set, and characterizing the demand response peak-shaving target gap of the hierarchical demand response system based on the demand response behavior model of the user set; constructing switchable preference parameters, and using them to define a preference-conditioned multi-objective reward over the objectives of the comfort characterization model, the shared energy storage model, the time-of-use compensation pricing model, and the demand response peak-shaving target; calculating interaction strength weights from the real-time physical state quantities of each agent output by the demand response behavior model, the shared energy storage model, and the time-of-use compensation pricing model, and performing scale-matched key interaction object selection; selecting, according to the selected key interaction objects, heterogeneous key neighbors and homogeneous cooperative neighbors of each agent from the user-side agent set to form a heterogeneous key neighbor set and a homogeneous cooperative neighbor set, and constructing a dual-graph structure for each user-side agent from these two sets; expressing the hybrid competition-cooperation relationship between adjacent user-side agents via a hybrid graph attention mechanism on the dual-graph structures to obtain a hybrid competition-cooperation relationship feature; constructing an agent preference embedding vector based on the hybrid competition-cooperation relationship feature, inputting the preference embedding vector into a preference-conditioned policy network to obtain action distribution parameters, and inputting it into a value network to obtain preference-conditioned state values; and training the preference-conditioned policy network and the value network with the action distribution parameters, the preference-conditioned state values, and the preference-conditioned multi-objective reward, using the proximal policy optimization reinforcement learning algorithm until convergence, obtaining the final policy parameters of the preference-conditioned policy network and of the value network for each agent in the user-side agent set, and using the converged final policy parameters for preference switching and collaborative decision output in the online operation stage.
  2. The method according to claim 1, wherein constructing the switchable preference parameters and using them to define the preference-conditioned multi-objective reward over the objectives of the comfort characterization model, the shared energy storage model, the time-of-use compensation pricing model, and the demand response peak-shaving target comprises: introducing preference parameters and allowing them to be switched during the operation stage; and setting preference parameters for the aggregator, the shared energy storage operator, and the user respectively, and defining the preference-conditioned multi-objective rewards with them, as follows: the aggregator's reward is defined in terms of the aggregator's preference parameter, the multi-objective reward weights, the total response amount in the aggregated region, the aggregator's compensation expenditure at time t, the response gap, and the average discomfort within the jurisdiction; the shared energy storage operator's reward is defined in terms of the operator's preference parameter, the multi-objective reward weights, an energy storage profit index, the state of charge, a reference state of charge, the storage power, the maximum storage power (taken in absolute value), and a power stress penalty term; and the user's reward is defined in terms of the user's preference parameter, the multi-objective reward weights, the compensation benefit, the electricity cost, and the discomfort.
  3. The method according to claim 1, wherein calculating the interaction strength weights and performing scale-matched key interaction object selection, based on the real-time physical state quantities of each agent output by the demand response behavior model, the shared energy storage model, and the time-of-use compensation pricing model, comprises: introducing a scale-matching key-interaction-object selection mechanism and defining an interaction strength weight for each type of agent, wherein a user's interaction strength weight at time t is defined from the user's net load, a shared energy storage operator's interaction strength weight is defined from its charge/discharge power at time t relative to its maximum charge/discharge power, and an aggregator's interaction strength weight at time t is defined from the aggregated responses within its range; defining, for a target agent and a cross-type candidate agent, a scale matching score as an exponential function of their interaction strength weights and a scale parameter; screening, based on the scale matching scores between the target agent and the cross-type candidate agents, a heterogeneous key neighbor set from the heterogeneous candidate set to construct the heterogeneous graph edge set, wherein the key interaction object is defined as the object whose score is greatest (the argument of the maximum), and the heterogeneous key neighbor set is defined as the K cross-type objects with the highest scale matching scores, K being a preset number of neighbors, from which the heterogeneous graph edge set is written; and, for the homogeneous candidate set, defining a state-quantity similarity between the target agent and a candidate agent during execution of the demand response, in terms of the Euclidean norm of the difference of their state quantities and a similarity scale parameter, and selecting from the homogeneous candidate set the K most similar neighbors according to the state-quantity similarity, forming the homogeneous cooperative neighbor set and obtaining the homogeneous graph edge set.
  4. The method according to claim 1, wherein expressing the hybrid competition-cooperation relationship between adjacent user-side agents via a hybrid graph attention mechanism on the dual-graph structures of all user-side agents, to obtain the hybrid competition-cooperation relationship feature, comprises: mapping each agent's state quantity during execution of the demand response into a unified hidden space through learnable neural network parameters, yielding a vector representation identifiable by the neural network, and likewise mapping the state quantities of the target agent's key heterogeneous interaction agents into the same hidden space; constructing, for the heterogeneous key neighbor set, an attention score of the target agent for each cross-type candidate agent serving as a heterogeneous neighbor, using learnable neural network parameters, a hyperbolic tangent function, vector concatenation, and transposition; normalizing the scores over the heterogeneous key neighbor set to obtain the attention weights; aggregating the heterogeneous key neighbors with the local attention vector, from the hidden-space representations of the target agent's key heterogeneous interaction agents, to obtain the target agent's competitive feature; aggregating the homogeneous cooperative neighbor set, formed by the nearest homogeneous neighbors, with the global attention vector over the set cardinality of the hidden-space representations, to obtain the target agent's cooperative feature; and hierarchically concatenating and fusing the target agent's heterogeneous competitive feature and homogeneous cooperative feature through a fusion network and a concatenation operator to obtain the hybrid competition-cooperation feature.
  5. The method according to claim 1, wherein constructing the agent preference embedding vector based on the hybrid competition-cooperation relationship feature, inputting the preference embedding vector into the preference-conditioned policy network to obtain the action distribution parameters, and inputting it into the value network to obtain the preference-conditioned state value, comprises: constructing an embedding map of the target agent's preference parameter through a learnable embedding function to obtain the agent preference embedding vector, the preference parameter taking the value of the aggregator's, the shared energy storage operator's, or the user's preference parameter; inputting the preference embedding vector into the preference-conditioned policy network, which, given the target agent's observation vector, the hybrid competition-cooperation feature, and the preference embedding vector, outputs the action distribution parameters, namely the action distribution mean and the logarithm of the action distribution standard deviation; adopting a Gaussian policy with the action range constrained by the hyperbolic tangent, wherein the agent's action is the hyperbolic tangent of the mean plus the Hadamard product of the standard deviation with noise drawn from a zero-mean Gaussian distribution with unit (identity-matrix) covariance, the mean and standard deviation both being policy network outputs; and inputting the preference embedding vector into the value network, which, given the target agent's state quantity during execution of the demand response, the hybrid competition-cooperation feature, and the preference embedding vector, outputs the preference-conditioned state value.
  6. The method according to claim 1, wherein establishing the demand response behavior model of the user set for each user-side agent's partially observable Markov decision process comprises: defining, for each user, the user's net load power at time t in terms of the user's fixed load power, air-conditioning power, electric-vehicle charging power, curtailable load power, transferable load power, and distributed generation output power at time t; setting the user's baseline net load at time t without response participation; and defining the curtailment-type demand response amount, via the maximum-value operator, from the user's baseline net load without response participation and the user's actual net load power at time t, thereby obtaining the expression of the demand response behavior model of the user set.
  7. The method according to claim 1, wherein establishing the comfort characterization model of the user set to quantify the comfort loss of the demand response behavior model comprises: updating the user's indoor temperature with a first-order thermal inertia model, in which the user's indoor temperature at the next time is determined by the indoor temperature at the current time, the outdoor temperature at the current time, the user's heat exchange coefficient, the equivalent coefficient of the influence of air-conditioning power on room-temperature change, and the user's air-conditioning power at the current time; defining the user's discomfort at time t as the non-negative deviation of the indoor temperature from the comfort zone bounded by the lower and upper limits of the comfortable temperature interval, thereby obtaining the expression of the comfort characterization model of the user set; and quantifying the comfort loss of the demand response behavior model with the comfort characterization model of the user set.
  8. The method according to claim 1, wherein constructing the shared energy storage model for the shared energy storage operator set comprises: updating, for each shared energy storage operator, the state of charge such that the state of charge at the next time is determined by the state of charge at the current time, the charging efficiency, the discharging efficiency, the rated capacity, the time step, the charging power component, and the discharging power component, truncated to the interval between the lower and upper bounds of the state of charge; and constraining the storage power, which is positive when charging and negative when discharging, so that its absolute value does not exceed the maximum allowed charge/discharge power.
  9. The method according to claim 1, wherein constructing the time-of-use compensation pricing model of the aggregator set based on the demand response behavior model of the user set comprises: defining the aggregator's compensation electricity price at time t from the peak, flat, and valley compensation electricity prices selected by an indicator function of the period type; bounding the compensation electricity price between its lower and upper limits; and obtaining the expression of the time-of-use compensation pricing model of the aggregator set, in which the aggregator's compensation expenditure at time t is the sum, over the users in the user set, of the compensation electricity price times each user's demand response amount at time t.
  10. The method according to claim 1, wherein characterizing the demand response peak-shaving target gap of the hierarchical demand response system based on the demand response behavior model of the user set comprises: setting the aggregator's target response amount at time t, and defining the response gap of the demand response peak-shaving target of the hierarchical demand response system, via the maximum-value operator, as the non-negative shortfall of the sum of the users' demand response amounts at time t within the aggregator's jurisdiction relative to the target response amount.
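The preference-conditioned rewards of claim 2 combine per-objective terms under switchable preference parameters, but the exact formulas are not reproduced in this text. The following sketch assumes one plausible scalarisation: a preference value in [0, 1] tilts fixed multi-objective weights toward the first objective. The tilt-toward-first-objective rule and the function name are illustrative, not the patent's definition.

```python
import numpy as np

def preference_reward(objectives, base_weights, pref):
    """Illustrative preference-conditioned scalarisation: `pref` in [0, 1]
    tilts the fixed multi-objective reward weights toward the first
    objective as it approaches 1, so the reward trades off continuously
    between objectives (economy, comfort, device health, ...)."""
    w = np.asarray(base_weights, dtype=float)
    obj = np.asarray(objectives, dtype=float)
    tilt = np.zeros_like(w)
    tilt[0] = 1.0                      # assumed: first entry is economy
    w_pref = (1.0 - pref) * w + pref * tilt
    w_pref = w_pref / w_pref.sum()     # keep weights normalised
    return float(w_pref @ obj)
```

Because the weights stay normalised, switching `pref` at run time changes the policy's effective objective without retraining the reward scale.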
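Claim 3's scale-matched neighbor selection can be sketched as below. The claim only states that the score is an exponential of the interaction strength weights with a scale parameter, so a Gaussian kernel is assumed here; the function names are illustrative. Homogeneous cooperative neighbors would be chosen analogously, scoring the Euclidean distance of state quantities instead of the weights.

```python
import numpy as np

def scale_match_scores(w_target, w_candidates, sigma=1.0):
    """Scale-matching score between the target agent's interaction-strength
    weight and each cross-type candidate's (assumed Gaussian kernel with
    scale parameter sigma)."""
    w = np.asarray(w_candidates, dtype=float)
    return np.exp(-((w_target - w) / sigma) ** 2)

def top_k(scores, k):
    """Indices of the K highest-scoring candidates, best first: the
    heterogeneous key neighbour set, whose first element is the key
    interaction object (the argmax of the score)."""
    return np.argsort(np.asarray(scores))[-k:][::-1].tolist()
```

With K preset, the returned index list directly yields the heterogeneous graph edge set for the target agent.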
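The dual-graph aggregation of claim 4 can be sketched numerically: an attention-weighted sum over heterogeneous (competitive) neighbors using a local attention vector, mean pooling over homogeneous (cooperative) neighbors, then fusion of the concatenation. Here `a` and `W_f` stand in for the claim's learnable parameters, and a plain linear fusion replaces the fusion network.

```python
import numpy as np

def softmax(x):
    z = np.asarray(x, dtype=float) - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def hybrid_feature(h_i, H_het, H_homo, a, W_f):
    """Sketch of the hybrid graph attention step: tanh attention scores over
    heterogeneous neighbours (rows of H_het), softmax-normalised into
    weights; mean of homogeneous neighbours (rows of H_homo); then the
    concatenation fused by W_f."""
    scores = np.array([np.tanh(np.concatenate([h_i, h_j]) @ a) for h_j in H_het])
    alpha = softmax(scores)                  # attention weights
    comp = alpha @ H_het                     # competitive feature
    coop = H_homo.mean(axis=0)               # cooperative feature
    fused = np.concatenate([comp, coop]) @ W_f
    return comp, coop, fused
```

With a zero attention vector the weights are uniform, which is a convenient sanity check that the aggregation reduces to a mean.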
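Claim 5's action rule is a standard tanh-squashed Gaussian policy: the action is the hyperbolic tangent of the mean plus the elementwise (Hadamard) product of the standard deviation with unit Gaussian noise, so every action component lies in (-1, 1). A minimal sketch, with the mean and log-standard-deviation taken as given network outputs:

```python
import numpy as np

def squashed_gaussian_action(mu, log_std, rng):
    """Gaussian policy with hyperbolic-tangent squashing:
    a = tanh(mu + std * eps), eps ~ N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    std = np.exp(np.asarray(log_std, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return np.tanh(mu + std * eps)
```

As the log-standard-deviation goes to minus infinity the policy becomes deterministic, returning tanh(mu), which is how a converged policy is executed in the distributed operation stage.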
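Claim 7's comfort model is concrete enough to sketch. The coefficient names `alpha` (heat exchange) and `beta` (equivalent AC-power effect, assumed here to cool the room) are illustrative; the discomfort is the non-negative deviation from the comfort band, exactly as claimed.

```python
def indoor_temperature_next(T_in, T_out, alpha, beta, p_ac):
    """First-order thermal inertia update (assumed sign convention:
    AC power cools): T[t+1] = T[t] + alpha * (T_out - T[t]) - beta * p_ac."""
    return T_in + alpha * (T_out - T_in) - beta * p_ac

def discomfort(T_in, T_lo, T_hi):
    """Non-negative deviation of indoor temperature from the comfort
    band [T_lo, T_hi]; zero inside the band."""
    return max(T_lo - T_in, 0.0) + max(T_in - T_hi, 0.0)
```

With the AC off, the room temperature relaxes geometrically toward the outdoor temperature, which is the defining behaviour of a first-order model.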
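Claim 8's state-of-charge update can be sketched directly from its ingredients: charging scaled by efficiency, discharging divided by efficiency, normalisation by rated capacity and time step, and truncation to the SoC bounds. Parameter names are assumptions; the separate power-magnitude constraint |p| <= p_max is noted in the docstring.

```python
def soc_next(soc, p, dt, cap, eta_c, eta_d, soc_lo=0.0, soc_hi=1.0):
    """Shared-storage state-of-charge update: p > 0 charges (scaled by
    eta_c), p < 0 discharges (divided by eta_d), result truncated to
    [soc_lo, soc_hi].  Separately, |p| must not exceed the maximum
    allowed charge/discharge power."""
    p_ch, p_dis = max(p, 0.0), max(-p, 0.0)
    soc = soc + (eta_c * p_ch - p_dis / eta_d) * dt / cap
    return min(max(soc, soc_lo), soc_hi)
```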
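Claims 9 and 10 reduce to a few arithmetic rules: a period-indexed compensation price bounded to its limits, expenditure as price times summed responses, and the peak-shaving gap as the shortfall floored at zero. A minimal sketch with assumed names:

```python
def compensation_price(period, prices, p_lo, p_hi):
    """Time-of-use compensation price: the period type ('peak', 'flat',
    'valley') selects the price, which is then bounded to [p_lo, p_hi]."""
    return min(max(prices[period], p_lo), p_hi)

def compensation_expenditure(price, responses):
    """Aggregator's compensation expenditure at time t: price times each
    user's demand response amount, summed over users."""
    return price * sum(responses)

def response_gap(target, responses):
    """Peak-shaving response gap: shortfall of the aggregated responses
    versus the target, floored at zero by the maximum operator."""
    return max(target - sum(responses), 0.0)
```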

Description

Demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning

Technical Field

The invention relates to the field of demand response, in particular to a demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning.

Background

In distribution-system operation scenarios with a high share of distributed resources, demand response by user-side flexible resources has gradually evolved from traditional single-agent load control into a multi-agent cooperative regulation process. The system generally comprises three core agents: aggregators, shared energy storage operators, and users holding various flexible resources. User-side flexible resource operation is constrained by comfort, device physical boundaries, and energy-service requirements; the shared energy storage operator assists energy time-shifting and supply-demand mutual aid among users through charge/discharge power control, limited by state-of-charge and battery health constraints; and the aggregator organizes user participation mainly through incentive signals, improving the response scale and compliance probability while controlling supply-demand adjustment cost and users' comfort loss. The three interact on the same time scale, so decisions exhibit significant cross-agent coupling, objective conflict, and uncertainty. Existing research on this hierarchical collaborative decision problem falls into three categories.
The first category, optimization-based centralized or hierarchical planning, usually takes user responses, power control, and guidance signals as decision variables and balances system regulation targets and energy efficiency through constrained optimization. The second, game-theoretic methods, focuses on describing the strategic interaction and equilibrium between agents (such as aggregators and users). The third is data-driven methods based on multi-agent reinforcement learning (MARL), which typically use graph representation learning to describe the agent interaction structure and train with fixed-weight multi-objective rewards. In a MARL-based data-driven method, the aggregator, the shared energy storage operator, and the users are modeled as agents interacting in discrete time: the aggregator outputs incentive signals from its observations, users schedule their flexible resources according to the signals, the shared storage outputs charge/discharge power and supply-demand mutual-aid actions, and the environment updates its state under physical constraints and feeds back interaction data including response amount, supply-demand adjustment cost, and device state.
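The "near-end policy optimization" named in the abstract and claims is the usual machine-translation rendering of proximal policy optimization (PPO), the algorithm used for the centralized training step. Its core clipped surrogate objective can be sketched as follows (a generic PPO term, not the patent's specific loss):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate of proximal policy optimisation: the new/old
    policy probability ratio is clipped to [1 - eps, 1 + eps], and the
    pessimistic minimum with the unclipped term is averaged (to be
    maximised during training)."""
    r = np.asarray(ratio, dtype=float)
    A = np.asarray(advantage, dtype=float)
    return float(np.minimum(r * A, np.clip(r, 1.0 - eps, 1.0 + eps) * A).mean())
```

The clipping bounds how far a single update can move the policy, which is what makes the centralized training phase stable enough for the multi-agent setting described above.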
However, the prior art has the following defects: (1) it is difficult to effectively make decisions matching the real-time preferences of multi-objective user-side agents while executing demand response tasks; (2) it is difficult to characterize both the decision migration among similar cooperating user-side agents and the behavioral restriction effect under heterogeneous competition; (3) it is difficult to accurately describe the marginal restriction relations and resource contention among the multiple types of user-side agents in demand response tasks; and (4) conventional methods train unstably and generalize poorly when the number of agents grows or states change rapidly, making them ill-suited to multi-period, multi-scenario demand response tasks.

Disclosure of Invention

The invention aims to provide a demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning, which solves the technical problem that, in the prior art, multi-objective user-side agents can hardly make decisions matching their real-time preferences while executing demand response tasks. The technical scheme is as follows: the demand response user-side multi-agent collaborative decision-making method based on switchable-preference dual-graph reinforcement learning comprises the following steps: establishing a hierarchical demand response system, wherein the system is a user-side agent set composed of an aggregator set, a shared energy storage oper