
CN-116456493-B - D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm

CN116456493B

Abstract

The invention discloses a D2D user resource allocation method and a storage medium based on a deep reinforcement learning algorithm, and relates to the technical field of wireless communication. The method comprises: constructing a wireless network model and discretizing the D2D transmit power; constructing a user signal-to-noise ratio calculation model; taking maximization of the communication system's throughput as the optimization target; setting a prediction policy network π, a prediction value network Q, a target policy network π', and a target value network Q'; modeling the D2D communication environment as a Markov decision process with each D2D transmitter as an agent; cyclically loading the target policy network parameters so that the generated policy interacts with the environment; determining a state space, an action space, and a reward function; performing policy optimization for each D2D user with the MAAC algorithm; cyclically updating the parameters of the target policy network and the target value network by soft update until learning and training are complete; and having the D2D users download the trained target policy network parameters to perform policy improvement.

Inventors

  • LI JUN
  • LIU XINGXIN
  • LIU ZIYI
  • SHEN GUOLI
  • ZHANG QIANQIAN
  • LI CHEN

Assignees

  • Wuxi University (无锡学院)

Dates

Publication Date
2026-05-05
Application Date
2023-04-20

Claims (10)

  1. A D2D user resource allocation method based on a deep reinforcement learning algorithm, characterized by comprising the following steps: constructing a wireless network model and discretizing the D2D transmit power into K power levels, wherein the wireless network model comprises a macro base station, L cellular users, N D2D user pairs, and M orthogonal spectrum resource blocks within the coverage area of the macro base station network, and the parameters configured for the wireless network model include user positions; establishing a user signal-to-noise ratio calculation model for computing the signal-to-noise ratio information of the D2D users and the cellular users, setting the QoS requirements for communication between the D2D users and the cellular users, and optimizing the wireless network model with maximization of the throughput of the communication system formed by the D2D users and the cellular users as the optimization target, wherein the user signal-to-noise ratio comprises the signal-to-noise ratio at the D2D receiver and the signal-to-noise ratio of the cellular user; the macro base station sets, for each agent, a prediction policy network $\pi$, a prediction value network $Q$, a target policy network $\pi'$, and a target value network $Q'$; modeling the D2D communication environment as a Markov decision process and regarding each D2D transmitter as an agent, which cyclically loads the parameters of the target policy network $\pi'$ so that the generated policy interacts with the environment, and determining a state space, an action space, and a reward function; on the premise of meeting the QoS requirements, each agent selects the communication mode to adopt at time t and, according to the currently observed state $s_t$, executes an action $a_t$, obtains a reward $r_t$, and transitions to the next state $s_{t+1}$, and the experience tuple $(s_t, a_t, r_t, s_{t+1})$ is uploaded to an experience pool for centralized training, wherein the communication modes comprise a dedicated mode, a reuse mode, and a waiting mode, the state comprises the position information and signal-to-noise ratio information of the D2D users and the cellular users, and the actions comprise selecting a power level and a resource block for communication; performing policy optimization for each D2D user with the MAAC algorithm: mini-batches are randomly sampled from the experience pool for centralized training, the prediction value network is updated with a TD algorithm, the parameters of the prediction value network are updated by gradient descent, the cumulative reward is calculated from the rewards the agents obtain by executing actions, a policy gradient is set according to the cumulative reward, and the parameters of the prediction policy network are cyclically updated by gradient ascent based on the policy gradient, the learning objective of the MAAC algorithm being to learn a policy for each agent so as to obtain the maximum cumulative return; based on the parameters of the prediction policy network and the prediction value network, cyclically updating the parameters of the target policy network and the target value network by soft update until learning and training are complete; and the D2D users download the parameters of the trained target policy network, perform policy improvement, and select a communication mode, a resource block and/or a transmit power according to the currently observed environment.
  2. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein the user signal-to-noise ratio calculation model includes the SINR at the receiving end of the m-th D2D user and the SINR of the l-th cellular user (see the SINR and throughput sketch following the claims). The SINR at the receiving end of the m-th D2D user is expressed as

$$\gamma_m^D = \frac{p_m^D h_m^D}{\rho_m p_l^C h_{l,m} + \sum_{n \neq m} \varepsilon_{n,m} p_n^D h_{n,m} + \sigma^2}$$

where $p_m^D$ represents the transmit power of the D2D transmitter; $h_m^D$ represents the channel gain between the D2D transmitter and the D2D receiver; $\rho_m$ represents the cellular resource sharing coefficient used to distinguish the D2D communication modes: when the m-th D2D user communicates on an idle channel, i.e. no cellular spectrum resource block is multiplexed and there is no cellular interference, $\rho_m = 0$, and when a cellular user's spectrum resource block is multiplexed, $\rho_m = 1$; $p_l^C$ represents the transmit power of the cellular user; $h_{l,m}$ represents the channel gain from the cellular user to the D2D receiver; $\varepsilon_{n,m}$ represents the D2D resource sharing coefficient: if another, n-th, D2D user multiplexes the same resource block as the m-th D2D user at that moment, $\varepsilon_{n,m} = 1$, otherwise $\varepsilon_{n,m} = 0$; $p_n^D$ represents the transmit power of the other D2D users; $h_{n,m}$ represents the channel gain from the other D2D transmitters to the D2D receiver; and $\sigma^2$ represents Gaussian white noise. The SINR of the l-th cellular user is expressed as

$$\gamma_l^C = \frac{p_l^C g_{B,l}}{\sum_{n=1}^{N} \beta_{n,l} p_n^D g_{n,l} + \sigma^2}$$

where $p_l^C$ represents the transmit power of the cellular user; $g_{B,l}$ represents the channel gain from the macro base station to the cellular user; $\beta_{n,l}$ represents the resource block multiplexing coefficient: $\beta_{n,l} = 1$ means that D2D user n multiplexes the resource block of cellular user l, otherwise $\beta_{n,l} = 0$; $p_n^D$ represents the transmit power of the n-th D2D pair; $g_{n,l}$ represents the channel gain from D2D user n to cellular user l; and $\sigma^2$ represents Gaussian white noise. The system throughput $T_p$ is expressed as

$$T_p = \sum_{l=1}^{L} B^C \log_2\!\left(1+\gamma_l^C\right) + \sum_{m=1}^{N} B^D \log_2\!\left(1+\gamma_m^D\right)$$

where $B^C$ represents the bandwidth between the cellular user and the macro base station and $B^D$ represents the bandwidth between the D2D transmitter and the D2D receiver; the first sum represents the throughput on the cellular user side and the second the throughput on the D2D user side. The QoS requirements for communication between the D2D user pairs and the cellular users are set, and the wireless network model is optimized with maximization of the throughput of the communication system formed by the D2D users and the cellular users as the optimization target, described as

$$\max \; T_p \quad (3a)$$
$$\text{s.t.} \quad \gamma_m^D \ge \gamma_{\min}^D \quad (3b)$$
$$\gamma_l^C \ge \gamma_{\min}^C \quad (3c)$$
$$P_{\min}^D \le p_n^D \le P_{\max}^D \quad (3d)$$
$$p_l^C = P^C \quad (3e)$$

where formula (3a) represents the throughput-maximization objective, formulas (3b) and (3c) represent the SINR requirements of the D2D receiver and the cellular user, and formulas (3d) and (3e) represent the constraints on the transmit power of the D2D transmitter and the cellular user; $\gamma_{\min}^D$ represents the minimum D2D signal-to-noise requirement; $\gamma_{\min}^C$ represents the minimum signal-to-noise requirement of the cellular user; $P_{\min}^D$ represents the minimum D2D transmit power; $P_{\max}^D$ represents the maximum D2D transmit power; $p_n^D$ represents the transmit power of the n-th D2D pair; $p_l^C$ represents the transmit power of the cellular user; and $P^C$ is a constant, the transmit power of all cellular users in the environment being a fixed value.
  3. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 2, wherein modeling the D2D communication environment as a Markov decision process, regarding the D2D transmitter as an agent, cyclically loading the parameters of the target policy network $\pi'$ so that the generated policy interacts with the environment, determining the state space, action space and reward function, and, on the premise of meeting the QoS requirements, having each agent select the communication mode to adopt at time t, execute an action according to the currently observed state, obtain a reward, transition to the next state, and upload the experience tuple to the experience pool for centralized training, specifically comprises: modeling the D2D communication environment as a Markov decision process and treating the D2D transmitter as an agent; the agent cyclically loads the parameters of the target policy network $\pi'$, the generated policy interacts with the environment, the communication mode to adopt at time t is selected, and according to the state $s_t^m$ observed at time t an action $a_t^m$ is executed, a reward $r_t^m$ is obtained, and a transition is made to the next state $s_{t+1}^m$, all actions executed by the agent being under the constraint of the QoS requirements; the state space of the m-th D2D user at time t is defined as $s_t^m = \{I_t^m, I_t^C\}$, where $I_t^m = \{L_t^m, \gamma_t^m\}$ represents the basic information of the D2D user itself at time t, including the D2D user's position information $L_t^m$ and signal-to-noise ratio information $\gamma_t^m$, and $I_t^C = \{L_t^C, \gamma_t^C\}$ represents the basic information of the cellular users, including the cellular users' position information $L_t^C$ and signal-to-noise ratio information $\gamma_t^C$; the action space of the m-th D2D user at time t is defined as $a_t^m = \{c_t^m, p_t^m\}$, where $c_t^m$ represents the D2D user's selection of a resource block, with M dimensions in total, and $p_t^m$ represents the selection of a power level for communication, with K choices; the reward obtained by the m-th user for executing an action at time t is defined as

$$r_t^m = \begin{cases} R_t^m, & \text{QoS requirements satisfied} \\ C, & \text{otherwise} \end{cases}$$

where $C$ is a constant less than 0, $R_t^m = B^D \log_2(1+\gamma_t^m)$, $\gamma_t^m$ represents the signal-to-noise ratio of the m-th D2D user at time t, and $B^D$ represents the D2D user bandwidth; and the pre-transition environment $s_t^m$, the executed action $a_t^m$, the post-transition environment $s_{t+1}^m$, and the reward $r_t^m$ are combined into the experience tuple $(s_t^m, a_t^m, r_t^m, s_{t+1}^m)$ and uploaded to the experience pool (see the reward sketch following the claims).
  4. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein each agent selecting the communication mode to adopt at time t comprises: judging whether an idle channel exists in the system, and if so, communicating in the dedicated mode; otherwise, judging whether the QoS requirements of the D2D user and the cellular user are still met after the resource block is multiplexed, and if so, the D2D user enters the reuse mode and shares the cellular user's resources for communication; otherwise, it enters the waiting mode and does not communicate until an idle channel exists in the system, at which point it initiates a communication request again (see the mode-selection sketch following the claims).
  5. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein the cumulative reward is expressed as

$$G_t = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right]$$

where $\gamma$ represents the discount factor, taking a value in the interval [0,1], $\mathbb{E}$ denotes the reward expectation, and $r_{t+k}$ denotes the instant reward (see the discounted-return sketch following the claims).
  6. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 5, wherein performing policy optimization for each D2D user with the MAAC algorithm, randomly sampling mini-batches from the experience pool for centralized training, updating the prediction value network with the TD algorithm, updating the parameters of the prediction value network by gradient descent, calculating the cumulative reward from the rewards the agents obtain by executing actions, setting the policy gradient according to the cumulative reward, and cyclically updating the parameters of the prediction policy network by gradient ascent based on the policy gradient, comprises: in the multi-agent environment, defining the prediction policy networks and prediction value networks of all agents as $\pi = \{\pi_1, \dots, \pi_N\}$ and $Q = \{Q_1, \dots, Q_N\}$ respectively, and the target policy networks and target value networks of all agents as $\pi' = \{\pi'_1, \dots, \pi'_N\}$ and $Q' = \{Q'_1, \dots, Q'_N\}$ respectively; judging whether the number of experience tuples stored in the experience pool meets a preset threshold, and executing centralized training if so, otherwise performing no operation; wherein the centralized training comprises: randomly sampling a mini-batch from the experience pool to establish the training data set of the current round; the prediction policy network of the m-th agent takes the state $s_t^m$ as input and selects actions with an $\epsilon$-greedy policy, the agent executes the action $a_t^m$, the state transitions to $s_{t+1}^m$, and the reward $r_t^m$ is obtained, the policy being expressed as

$$a_t^m = \begin{cases} \pi_m(s_t^m), & \text{with probability } 1-\epsilon \\ \text{a random action in } A, & \text{with probability } \epsilon \end{cases}$$

where A represents the action set of the agent and the value of $\epsilon$ is continuously attenuated as learning proceeds; the action-value function is approximated with the prediction value network, the prediction value network is updated with the TD algorithm, and the Q function, i.e. the action-value function, is learned with the Bellman equation $Q(s_t, a_t) = \mathbb{E}\left[r_t + \gamma Q(s_{t+1}, a_{t+1})\right]$; the prediction value network of the m-th agent takes the agent's state $s_t^m$ and action $a_t^m$ as input and outputs the action-value function $Q_m(s_t^m, a_t^m)$, and the target value network takes the post-transition state $s_{t+1}^m$ and the next-moment action $a_{t+1}^m$ as input and outputs the next-moment action-value function $Q'_m(s_{t+1}^m, a_{t+1}^m)$; using a function approximation method, the prediction value network is updated by minimizing a loss function over the outputs of the prediction value network and the target value network, the loss function being

$$L(\theta^Q) = \mathbb{E}\!\left[\left(y_t - Q(s_t, a_t; \theta^Q)\right)^2\right]$$

where $y_t$ is the target value generated by the target value network, $y_t = r_t + \gamma Q'(s_{t+1}, a_{t+1}; \theta^{Q'})$, and $\gamma$ represents the discount factor with a value in the interval [0,1]: a smaller $\gamma$ indicates less concern for future benefits, $\gamma = 0$ means that only the immediate benefit is considered, and $\gamma$ tending to 1 represents a tendency to value future benefits; $Q(s_t, a_t; \theta^Q)$ is the predicted value output by the prediction value network; the TD error is defined as $\delta_t = y_t - Q(s_t, a_t; \theta^Q)$, and the parameters $\theta^Q$ of the prediction value network are updated by gradient descent so that the prediction error decreases; according to the cumulative reward of the m-th agent, the policy gradient is defined as

$$\nabla_{\theta^\pi} J = \mathbb{E}_{s \sim D}\!\left[\left.\nabla_a Q(s, a; \theta^Q)\right|_{a=\pi(s)} \nabla_{\theta^\pi} \pi(s)\right]$$

where $\nabla_a Q$ represents the gradient of the Q function obtained from the prediction value network, $\nabla_{\theta^\pi} \pi$ represents the deterministic policy gradient of the prediction policy network, and D represents the experience pool; and based on this policy gradient, the parameters $\theta^\pi$ of the prediction policy network are updated by gradient ascent (see the actor-critic update sketch following the claims).
  7. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 6, wherein the input of the prediction value network introduces a neighbor-user mechanism, specifically: a distance constraint value $\delta$ is set; each j-th agent whose distance $d_{ij}$ from the i-th agent is less than the constraint value $\delta$ is put into the neighbor set $N_i = \{j \mid d_{ij} < \delta\}$, in which case the i-th and j-th agents are neighbor users, wherein the distance between different agents is the distance between their D2D transmitters, computed by the Euclidean distance formula: the distance between the i-th agent at position $(x_i, y_i)$ and the j-th agent at position $(x_j, y_j)$ is

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

and the input of the i-th agent's prediction value network includes the state and action of the i-th agent as well as the states and actions of the agents in the set $N_i$, the output being the action-value function $Q_i$ of the i-th agent (see the neighbor-set sketch following the claims).
  8. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein an eligibility-trace mechanism is introduced into the update process of the parameters $\theta^\pi$ and $\theta^Q$ of the prediction policy network and the prediction value network, specifically:

$$\theta \leftarrow \theta + \alpha\, \delta_t\, e_t$$

where $\delta_t$ denotes the TD error, $\delta_t = G_t^{(n)} - Q(s_t, a_t)$; $Q(s_t, a_t)$ represents the action-value function output by the prediction value network; and $G_t^{(n)}$ represents the n-step temporal-difference return, expressed as

$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})$$

$\lambda$ is the decay-rate parameter with a value in the interval [0,1]: when $\lambda = 0$ the return is $G_t^{(1)}$, i.e. the single-step return, and the update algorithm of the return is the single-step temporal-difference error algorithm; when $\lambda = 1$ the return is the complete return up to the final time T, $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$, and the update algorithm of the return is the Monte Carlo algorithm; $e_t^\pi$ represents the eligibility trace of the prediction policy network and $e_t^Q$ the eligibility trace of the prediction value network, updated as

$$e_t^\pi = \gamma \lambda\, e_{t-1}^\pi + \nabla_{\theta^\pi} \pi, \qquad e_t^Q = \gamma \lambda\, e_{t-1}^Q + \nabla_{\theta^Q} Q$$

where $\lambda$ is the decay-rate parameter, $\lambda \in [0,1]$; $\gamma$ is the discount coefficient; $\nabla_{\theta^\pi} \pi$ represents the gradient of the prediction policy network and $\nabla_{\theta^Q} Q$ the gradient of the prediction value network; the eligibility trace accumulates a gradient value at each step and decays by the factor $\gamma\lambda$, so that the components of the weight vector that made a positive or negative contribution to the most recent state estimates are tracked (see the eligibility-trace sketch following the claims).
  9. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein the parameter soft-update process of the target policy network and the target value network is

$$\theta^{\pi'} \leftarrow \tau\, \theta^{\pi} + (1-\tau)\, \theta^{\pi'}, \qquad \theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1-\tau)\, \theta^{Q'}$$

where $\theta^{\pi'}$ represents the parameters of the target policy network; $\tau$ represents the parameter update coefficient, taking a value in the interval [0,1]; $\theta^{\pi}$ represents the parameters of the prediction policy network; $\theta^{Q'}$ represents the parameters of the target value network; and $\theta^{Q}$ represents the parameters of the prediction value network (see the soft-update sketch following the claims).
  10. A computer storage medium having instructions stored therein which, when executed on a computer, cause the computer to perform the method of any one of claims 1-9.
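
The sketches below are editorial illustrations of the claimed method, not part of the claims. This first one renders the SINR and throughput model of claim 2 in Python; the variable names, scalar/array shapes, and the use of NumPy are assumptions rather than the patent's notation.

```python
import numpy as np

def d2d_sinr(p_d, h_d, rho, p_c, h_cd, eps, p_other, h_dd, sigma2):
    """SINR at the m-th D2D receiver: desired power over cellular
    interference (gated by the 0/1 coefficient rho), co-channel D2D
    interference (gated by the 0/1 coefficients eps), and noise."""
    interference = rho * p_c * h_cd + np.sum(eps * p_other * h_dd) + sigma2
    return p_d * h_d / interference

def cellular_sinr(p_c, g_bl, beta, p_d_all, g_dl, sigma2):
    """SINR of the l-th cellular user: base-station link power over
    interference from D2D pairs that reuse its resource block."""
    return p_c * g_bl / (np.sum(beta * p_d_all * g_dl) + sigma2)

def system_throughput(b_c, sinr_c, b_d, sinr_d):
    """Shannon-style sum throughput Tp over cellular and D2D links."""
    return b_c * np.sum(np.log2(1.0 + np.asarray(sinr_c))) + \
           b_d * np.sum(np.log2(1.0 + np.asarray(sinr_d)))
```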
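
A minimal sketch of the reward of claim 3, assuming the reward equals the D2D user's own rate when every QoS constraint holds and the fixed penalty C otherwise; this case split is inferred from the claim's definitions, and the value of C is an assumption.

```python
import numpy as np

C = -1.0  # penalty constant (< 0); the concrete value is an assumption

def reward(sinr_d, sinr_min_d, sinr_c_all, sinr_min_c, b_d):
    """Reward r_t^m of one D2D agent: the rate B^D * log2(1 + SINR) when
    its own and every cellular QoS requirement hold, otherwise C."""
    if sinr_d >= sinr_min_d and np.all(np.asarray(sinr_c_all) >= sinr_min_c):
        return b_d * np.log2(1.0 + sinr_d)
    return C
```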
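
The mode-selection rule of claim 4 as a small decision function; `idle_channel_free` and `qos_ok_after_reuse` are hypothetical predicates that a caller would compute from the network state.

```python
def choose_mode(idle_channel_free: bool, qos_ok_after_reuse: bool) -> str:
    """Prefer an idle channel (dedicated mode); otherwise reuse a cellular
    resource block only if both sides' QoS survive; otherwise wait and
    re-request once an idle channel appears."""
    if idle_channel_free:
        return "dedicated"
    if qos_ok_after_reuse:
        return "reuse"
    return "wait"
```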
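
The cumulative reward of claim 5 evaluated over a finite trajectory; gamma = 0.95 is an assumed value.

```python
def discounted_return(rewards, gamma=0.95):
    """G_t = sum_k gamma^k * r_{t+k}, accumulated backwards for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. discounted_return([1.0, 1.0, 1.0]) == 1 + 0.95 + 0.95**2
```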
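
One centralized-training step of claim 6 for a single agent, sketched with PyTorch: a TD-target critic update by gradient descent, then a deterministic-policy-gradient actor update by gradient ascent. Network sizes, learning rates, and the random stand-in mini-batch are all assumptions; a full MAAC critic would also consume neighbors' states and actions, per claim 7.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma, batch = 8, 2, 0.95, 64

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

pi, q = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
pi_t, q_t = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
pi_t.load_state_dict(pi.state_dict()); q_t.load_state_dict(q.state_dict())

opt_q = torch.optim.Adam(q.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(pi.parameters(), lr=1e-4)

# A random mini-batch stands in for samples drawn from the experience pool.
s, a = torch.randn(batch, state_dim), torch.randn(batch, action_dim)
r, s2 = torch.randn(batch, 1), torch.randn(batch, state_dim)

# Critic: TD target y = r + gamma * Q'(s', pi'(s')); gradient descent on MSE.
with torch.no_grad():
    y = r + gamma * q_t(torch.cat([s2, pi_t(s2)], dim=1))
critic_loss = ((q(torch.cat([s, a], dim=1)) - y) ** 2).mean()
opt_q.zero_grad(); critic_loss.backward(); opt_q.step()

# Actor: ascend the policy gradient, i.e. descend -Q(s, pi(s)).
actor_loss = -q(torch.cat([s, pi(s)], dim=1)).mean()
opt_pi.zero_grad(); actor_loss.backward(); opt_pi.step()
```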
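
The neighbor-user mechanism of claim 7; `positions` is assumed to be an (N, 2) array of D2D transmitter coordinates.

```python
import numpy as np

def neighbor_set(positions, i, delta):
    """Indices j != i whose transmitter lies within Euclidean distance
    delta of agent i's transmitter."""
    pos = np.asarray(positions)
    d = np.linalg.norm(pos - pos[i], axis=1)
    return [j for j in range(len(pos)) if j != i and d[j] < delta]
```

The critic input for agent i would then concatenate its own state-action pair with those of every agent in the returned set.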
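
The accumulating-trace update of claim 8 in vector form; the step size alpha and the flat parameter/gradient vectors are assumptions.

```python
import numpy as np

def update_trace(trace, grad, gamma, lam):
    """e_t = gamma * lambda * e_{t-1} + grad; lambda = 0 recovers one-step
    TD, and lambda = 1 approaches the Monte Carlo return."""
    return gamma * lam * np.asarray(trace) + np.asarray(grad)

def td_lambda_step(params, trace, td_error, alpha):
    """theta <- theta + alpha * delta_t * e_t."""
    return np.asarray(params) + alpha * td_error * np.asarray(trace)
```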
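
The soft update of claim 9, written over plain parameter lists; with tau = 1 it degenerates to a hard copy of the prediction networks into the targets.

```python
def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', for tau in [0, 1]."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]
```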

Description

D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm

Technical Field

The invention relates to the technical field of wireless communication, and in particular to a D2D user resource allocation method and storage medium based on a deep reinforcement learning algorithm.

Background

In today's era of rapid technological development, wireless communication technology permeates daily life. Demand for mobile communication is growing rapidly and expectations keep rising: mobile devices that once only needed a simple call function were later expected to support basic internet search, and are now used for streaming video and music, with steadily increasing requirements on video definition and sound quality. The shortage of spectrum resources is especially prominent in environments where users are dense and mutual communication interference is strong, and many methods have been proposed to solve this problem. One of these technologies, device-to-device (D2D) communication, exchanges information directly between neighboring devices in a communication network. Compared with traditional cellular communication, D2D communication does not need the base station as a relay, so communication can take place far from a base station or even without one, effectively reducing the base station's transmission load; D2D can also share the spectrum resources of cellular users, greatly improving spectrum utilization, raising system throughput, and improving the performance of the whole communication system.

In D2D communication, reasonable power allocation and resource block allocation are important for D2D users (D2D User Equipment, DUE). Since DUEs mainly multiplex the spectrum resources occupied by cellular users (Cellular User Equipment, CUE), interference exists among the DUEs, the CUEs, and the base station (BS). Many solutions have been proposed to avoid this interference and improve the quality of service (Quality of Service, QoS) of D2D users. For example, in recent years the problems of channel allocation and power control have been handled with popular machine learning techniques, but most of this work assumes an idealized model in which the information of all users is known. In a real environment, however, both DUEs and CUEs are dynamic (in position information, channel gain, and so on), the amount of information is huge, and rapid scene changes cause great computational complexity, so conventional optimization methods cannot be applied.

Disclosure of the Invention

The invention provides a D2D user resource allocation method and a storage medium based on a deep reinforcement learning algorithm, to overcome the inability of the prior art to adapt to a dynamic environment.
In order to solve the above technical problems, the technical scheme of the invention is as follows.

In a first aspect, a D2D user resource allocation method based on a deep reinforcement learning algorithm includes: constructing a wireless network model and discretizing the D2D transmit power into K power levels, wherein the wireless network model comprises a macro base station, L cellular users, N D2D user pairs, and M orthogonal spectrum resource blocks within the coverage area of the macro base station network, and the parameters configured for the wireless network model include user positions; establishing a user signal-to-noise ratio calculation model for computing the signal-to-noise ratio information of the D2D users and the cellular users, setting the QoS requirements for communication between the D2D users and the cellular users, and optimizing the wireless network model with maximization of the throughput of the communication system formed by the D2D users and the cellular users as the optimization target, wherein the user signal-to-noise ratio comprises the signal-to-noise ratio at the D2D receiver and the signal-to-noise ratio of the cellular user; the macro base station sets a prediction policy network π, a prediction value network Q, a target policy network π', and a target value network Q' for each agent; the D2D communication environment is modeled as a Markov decision process with each D2D transmitter regarded as an agent, the parameters of the target policy network are cyclically loaded, the generated policy interacts with the environment, and a state space, an action space, and a reward function are determined; on the premise of meeting the QoS requirements, each agent selects the communication mode to adopt at time t.
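
To tie the first aspect together, here is the overall interact-store-train loop in schematic Python; `env`, the `agents`, and `pool` are hypothetical stand-ins for the wireless environment, the D2D transmitters, and the experience pool, not an API defined by the patent.

```python
def train(env, agents, pool, episodes, batch_size):
    """Interact with the environment, store experience tuples, and
    perform centralized training on sampled mini-batches."""
    for _ in range(episodes):
        states = env.reset()
        done = False
        while not done:
            # Each D2D transmitter picks a mode, resource block and power
            # level via its policy, subject to QoS constraints.
            actions = [ag.act(s) for ag, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            for s, a, r, s2 in zip(states, actions, rewards, next_states):
                pool.add((s, a, r, s2))          # experience tuple
            if len(pool) >= batch_size:          # threshold check (claim 6)
                batch = pool.sample(batch_size)  # centralized training
                for ag in agents:
                    ag.update_critic(batch)      # TD + gradient descent
                    ag.update_actor(batch)       # policy-gradient ascent
                    ag.soft_update_targets()     # soft update (claim 9)
            states = next_states
```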