CN-121985377-A - Method and system for long-term energy efficiency optimization of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization
Abstract
The invention discloses a method and a system for optimizing the long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization, and relates to the technical field of power control and energy efficiency optimization. A Lyapunov function is constructed to derive a Lyapunov drift term, which is embedded into the reinforcement learning reward function. The reinforcement learning part adopts an adaptive double deep Q-network structure based on a Dropout mechanism: it takes a state vector as input and achieves an adaptive balance between exploration and exploitation by randomizing the Q-value distribution. The agent dynamically adjusts the transmission power of the EH-D2D system according to the optimal action output by the training network, stabilizes parameter updating through a target network, and achieves long-term energy efficiency optimization and queue stability in the energy-harvesting environment.
Inventors
- LUO YING
- ZHU XUBIN
- ZENG MIN
- XU GUANJUN
- LI XINGFENG
- REN ZHENWEN
- ZHANG LI
Assignees
- Southwest University of Science and Technology (西南科技大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-10
Claims (10)
- 1. An energy-harvesting D2D communication system long-term energy efficiency optimization method based on reinforcement learning and Lyapunov optimization, characterized by comprising the following steps: constructing a wireless communication system model comprising a plurality of device nodes, and defining a system state; inputting the current system state into a training network to output a power control action; after the power control action is executed, the system obtains an environmental feedback reward and transitions to the next state; constructing a current interaction sample from the current system state, the power control action, the environmental feedback reward and the next state, storing the current interaction sample into an experience memory bank, randomly sampling sample data from the experience memory bank, and updating the parameters of the training network; calculating the Lyapunov drift term of the queue, introducing the drift term into the optimization objective, and using the resulting optimization value as the environmental feedback reward of the training network; and iterating the training process until the network converges, obtaining an optimal power control strategy, and performing joint optimization of long-term energy efficiency and system stability constraints.
- 2. The method for optimizing long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization of claim 1, wherein each of said device nodes harvests energy from three energy sources: solar energy, wind energy and radio frequency energy.
- 3. The method for long-term energy efficiency optimization of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization of claim 1, wherein the system states include node energy states, data queue states, and channel condition information.
- 4. The method for optimizing long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization according to claim 1, wherein the training network is an n-layer neural network with Dropout units arranged between hidden layers; the system state is used as input, action values of all actions in the action set are output, and the action with the largest action value is selected as the power control action.
- 5. The method for long-term energy efficiency optimization of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization of claim 4, wherein the action value characterizes the long-term cumulative return obtained by selecting a specific power action in a given state, and is defined as: Q(s, a; θ) = E[ Σ_{k=0}^{∞} γ^k · r_{t+k} | s_t = s, a_t = a ]; wherein s is the system state, Q(s, a; θ) is the action value, θ denotes the neural network parameters, γ is the discount rate, and Q(s, a; θ) is an approximation of the optimal action value Q*(s, a).
- 6. The method for long-term energy efficiency optimization of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization according to claim 4, wherein the Dropout unit performs adaptive perturbation of the Q-value distribution by randomly masking neurons, and the action is calculated as: a_t = argmax_a Q(s_t, a; θ, z_t); wherein a_t is the action to be performed and z_t is the randomly sampled Dropout mask value.
- 7. The method for optimizing long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization according to claim 1, wherein the randomly sampling of sample data from the experience memory bank and the parameter updating of the training network comprise computing the target value: y_t = r_t + γ · Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ⁻); wherein Q(s_t, a_t; θ) is the predicted value, r_t is the true observed reward, and y_t is defined as the target value evaluated with the target network parameters θ⁻.
- 8. The method for optimizing long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization of claim 7, further comprising optimizing with a loss function that minimizes the difference between the predicted Q value and the target Q value, and performing parameter synchronization via a target network: L(θ) = E[ (y_t − Q(s_t, a_t; θ))² ]; parameter updating: θ ← θ − α · ∇_θ L(θ); wherein α is the learning rate, and the target network parameters are periodically synchronized as θ⁻ ← θ.
- 9. The method for optimizing long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization according to claim 1, wherein the calculating of the queue Lyapunov drift term, the introducing of the drift term into the optimization objective, and the updating of the resulting optimization value into the environmental feedback reward of the training network comprise: L(Q(t)) = (1/2) Σ_i Q_i(t)², Δ(t) = E[ L(Q(t+1)) − L(Q(t)) | Q(t) ]; wherein Q(t) is the set of queues, L(·) is the Lyapunov function, Δ(t) is the Lyapunov drift function, E[·] denotes expectation, Q_i(t) is a virtual queue, R(t) is the data transmission rate, and E(t) is the energy consumption; the drift term is introduced into the optimization objective r_t = V · R(t)/E(t) − Δ(t), and r_t is used as the reward for training the network, thereby performing joint control of long-term energy efficiency and queue stability.
- 10. A reinforcement learning and Lyapunov optimization-based energy-harvesting D2D communication system long-term energy efficiency optimization system, comprising: a system model building module for building a wireless communication system model comprising a plurality of device nodes and defining a system state; a power control action module for inputting the current system state into the training network to output a power control action, the system acquiring an environmental feedback reward after executing the power control action and transitioning to the next state; a parameter updating module for constructing a current interaction sample from the current system state, the power control action, the environmental feedback reward and the next state, storing the current interaction sample into the experience memory bank, randomly sampling sample data from the experience memory bank, and updating the parameters of the training network; a Lyapunov optimization module for calculating the Lyapunov drift term of the queue, introducing the drift term into the optimization objective, and using the resulting optimization value as the environmental feedback reward of the training network; and a joint optimization module for iterating the training process until the network converges, obtaining an optimal power control strategy, and performing joint optimization of long-term energy efficiency and system stability constraints.
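As an illustrative sketch only (not taken from the patent text), the drift-plus-penalty reward shaping described in claim 9 can be written in a few lines of Python. The function names, the quadratic Lyapunov form, and the trade-off weight `V` are assumptions based on standard Lyapunov optimization practice, not the patent's actual implementation:

```python
def lyapunov(queues):
    """Quadratic Lyapunov function L(Q) = 1/2 * sum(Q_i^2)."""
    return 0.5 * sum(q * q for q in queues)

def drift(queues_next, queues):
    """One-step Lyapunov drift: L(Q(t+1)) - L(Q(t))."""
    return lyapunov(queues_next) - lyapunov(queues)

def shaped_reward(rate, energy, queues, queues_next, V=10.0):
    """Drift-plus-penalty reward: reward energy efficiency (rate/energy)
    while penalizing queue growth, with V weighting the trade-off."""
    energy_efficiency = rate / energy
    return V * energy_efficiency - drift(queues_next, queues)
```

A larger `V` favors energy efficiency at the cost of larger queue backlogs; a smaller `V` keeps queues shorter but may transmit less efficiently.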
Description
Energy collection D2D communication system long-term energy efficiency optimization method and system based on reinforcement learning and Lyapunov optimization
Technical Field
The invention relates to the technical field of power control and energy efficiency optimization, in particular to a method and a system for optimizing the long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization.
Background
With the rapid development of the Internet of Things (IoT) and 5G/6G communication technologies, Device-to-Device (D2D) communication has become an important technical means for improving spectrum utilization and reducing communication delay. Meanwhile, with the introduction of Energy Harvesting (EH) technology, terminal devices can acquire energy from environmental sources such as solar, wind and radio frequency energy, providing a new direction for building green, low-carbon communication systems. However, EH-D2D systems often exhibit significant non-stationarity in actual deployment: energy sources such as solar and wind are affected by weather, shielding and time of day, and the user traffic load also changes continuously with the application scenario. This dual non-stationarity causes traditional power control methods based on a steady-state assumption to degrade markedly when the environment changes, easily leading to energy exhaustion and traffic backlog, and lacking the robustness required for engineering operation. Existing power control methods are generally based on a static energy model or a short-term optimal strategy and cannot effectively cope with the non-stationary characteristics of energy and task arrivals, so the energy utilization rate is easily reduced or the system becomes unstable.
Meanwhile, traditional reinforcement learning algorithms easily become trapped in local optima or converge slowly when trained in a non-stationary environment, making it difficult to achieve both energy efficiency maximization and queue stability. Therefore, how to provide a method and a system for optimizing the long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization, so as to realize long-term energy efficiency optimization and system stability assurance in a non-stationary environment, is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a method and a system for optimizing the long-term energy efficiency of an energy-harvesting D2D communication system based on reinforcement learning and Lyapunov optimization, so as to realize long-term energy efficiency optimization and system stability assurance of the energy-harvesting D2D communication system in a non-stationary environment.
In order to achieve the above purpose, the present invention adopts the following technical scheme. An energy-harvesting D2D communication system long-term energy efficiency optimization method based on reinforcement learning and Lyapunov optimization comprises the following steps: constructing a wireless communication system model comprising a plurality of device nodes, and defining a system state; inputting the current system state into a training network to output a power control action; after the power control action is executed, the system obtains an environmental feedback reward and transitions to the next state; constructing a current interaction sample from the current system state, the power control action, the environmental feedback reward and the next state, storing it into an experience memory bank, randomly sampling sample data from the experience memory bank, and updating the parameters of the training network; calculating the Lyapunov drift term of the queue, introducing the drift term into the optimization objective, and using the resulting optimization value as the environmental feedback reward of the training network; and iterating the training process until the network converges, obtaining an optimal power control strategy, and performing joint optimization of long-term energy efficiency and system stability constraints. Optionally, each of the device nodes harvests energy from three energy sources, namely solar energy, wind energy and radio frequency energy. Optionally, the system state includes a node energy state, a data queue state, and channel condition information. Optionally, the training network is an n-layer neural network with Dropout units arranged between hidden layers; the system state is taken as input, action values of all actions in the action set are output, and the action with the largest action value is selected as the power control action.
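The double-network training step and the Dropout-driven exploration described above can be sketched minimally in Python. This is an illustrative reading of the scheme, not the patent's implementation: the double-DQN target lets the online network select the best next action while the target network evaluates it, and the Q-value masking shown here is a deliberate simplification of masking hidden-layer neurons (all names are hypothetical):

```python
import random

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.9, done=False):
    """Double-DQN target: the online network's Q values pick the argmax
    action for the next state; the target network's Q value for that
    action is used in the bootstrap, stabilizing parameter updates."""
    if done:
        return reward
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[a_star]

def perturbed_q(q_values, p=0.2, rng=random):
    """Simplified Dropout-style perturbation: each Q entry is zeroed with
    probability p, randomizing the greedy power-action choice so that
    exploration and exploitation are balanced adaptively."""
    return [0.0 if rng.random() < p else q for q in q_values]
```

The greedy power control action would then be the argmax over `perturbed_q(...)`, and the squared difference between `double_dqn_target(...)` and the online network's prediction gives the loss minimized during training.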