
CN-116074966-B - IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient


Abstract

An IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient (DDPG) comprises the following steps: 1) establishing an IEEE 802.11be WiFi network model; 2) determining the mobility model, path loss model and interference model adopted by the network; 3) deriving an expression for the throughput rate of the network; 4) formulating an optimization problem whose objective is to maximize the minimum throughput rate, optimizing the allocation of power, channels and resource units in real time so as to raise the minimum throughput rate of the network; and 5) designing a DDPG-based real-time resource allocation algorithm that solves the optimization problem, thereby realizing real-time resource allocation for IEEE 802.11be WiFi. The invention can effectively improve the minimum throughput rate of the network.

Inventors

  • QIU SHUWEI
  • CAO RONG
  • ZHOU ZEXUN
  • DONG XIAOQING
  • MIAO LIMING
  • WANG HUILIN

Assignees

  • Hanshan Normal University (韩山师范学院)
  • Guangdong Shantou Preschool Normal College (广东汕头幼儿师范高等专科学校)

Dates

Publication Date: 2026-05-05
Application Date: 2022-12-28

Claims (11)

  1. An IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient (DDPG), characterized by comprising the following steps: 1) establishing an IEEE 802.11be WiFi network model; 2) determining the mobility model, path loss model and interference model adopted by the network; 3) deriving an expression for the throughput rate of the network; 4) formulating an optimization problem whose objective is to maximize the minimum throughput rate, i.e. optimizing the allocation of power, channels and resource units in real time so as to raise the minimum throughput rate of the network; and 5) designing a DDPG-based real-time resource allocation algorithm that solves the optimization problem, thereby realizing real-time resource allocation for IEEE 802.11be WiFi.
  2. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 1, wherein in step 1) the IEEE 802.11be WiFi network consists of a network controller, a plurality of APs and a plurality of nodes; the network controller allocates resources to the APs and coordinates their working states so as to reduce interference between adjacent basic service sets (BSSs). The AP set is denoted A and the node set S; each AP j is equipped with N antennas and each node i with 1 antenna, j ∈ A, i ∈ S. The network uses spatial multiplexing: an AP with N antennas provides N spatial streams, and each node communicates with the AP over one of them. The network uses four frequency bands, 2.4G, 5G-I, 5G-II and 6G, each containing a plurality of channels. The channel width is denoted w (in MHz), w ∈ W = {20, 40, 80, 160, 320} MHz, where W is the set of channel bandwidths, and one channel is allocated to each AP. The set B = {2.4G, 5G-I, 5G-II, 6G} denotes the 4 bands used by the APs, and C = ∪_{b∈B} C_b denotes the set of all channels of the network, where C_b is the set of channels in band b; the channel c_{j,b} used by AP j in band b is selected from C_b. Each channel is divided into a plurality of mutually orthogonal resource units (RUs), each consisting of k data subcarriers, k ∈ K = {26, 52, 106, 242, 484, 996, 2×996, 4×996}. The AP allocates one or more RUs to node i; the RUs allocated to node i form the set RU_i, and different nodes transmit data concurrently on different RU sets. When the RUs of a w MHz channel are divided into m_w RU sets, an AP with N spatial streams supports at most N·m_w nodes sending or receiving data simultaneously. The transmit power p_{j,b} used by AP j in band b is selected from a given power set P_b. TXOP, PIFS, SIFS, M-BA and OFDMA-BA denote the transmission opportunity, PCF inter-frame space, short inter-frame space, multi-node block acknowledgement and OFDMA block acknowledgement, respectively.
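The capacity statement in claim 2 (at most N·m_w nodes per parallel transmission) can be made concrete with a small helper. This is a minimal sketch, assuming that m_w is the number of 106-tone RUs in a w MHz channel (the definition given later in step 5.4.1) and that those counts follow the 802.11ax/be tone plan (2, 4, 8, 16, 32 for 20 to 320 MHz); the table `M_W` and the function `max_concurrent_nodes` are hypothetical names, not part of the patent.

```python
# ASSUMPTION: m_w = number of 106-tone RUs in a w-MHz channel (step 5.4.1),
# with counts taken from the 802.11ax/be tone plan. Hypothetical table.
M_W = {20: 2, 40: 4, 80: 8, 160: 16, 320: 32}

CHANNEL_WIDTHS_MHZ = (20, 40, 80, 160, 320)  # the set W


def max_concurrent_nodes(num_streams: int, width_mhz: int) -> int:
    """Upper bound N * m_w on nodes served in one parallel transmission."""
    if width_mhz not in CHANNEL_WIDTHS_MHZ:
        raise ValueError(f"unsupported channel width {width_mhz} MHz")
    if num_streams < 1:
        raise ValueError("an AP needs at least one spatial stream")
    # Each of the m_w RU sets is reused on every spatial stream.
    return num_streams * M_W[width_mhz]
```

For example, a 4-stream AP on an 80 MHz channel would serve at most 32 nodes per transmission under these assumptions.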
  3. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 1 or 2, wherein in step 2) the movement of mobile nodes is described by the random waypoint (RWP) mobility model, under which a node moves as follows: 2.1.1) the initial position of node i is (x_i, y_i), its target position is (x'_i, y'_i) and its initial speed is v_i; the initial and target positions are chosen at random within the target area, and the initial speed is chosen at random in the interval [v_min, v_max]; 2.1.2) node i moves in a straight line from (x_i, y_i) to (x'_i, y'_i) at speed v_i; 2.1.3) after reaching (x'_i, y'_i), the node pauses for a time t_p chosen at random in the interval [t_min, t_max]; 2.1.4) the node chooses a new random position (x''_i, y''_i) in the target area as its new target and a new random speed v'_i in [v_min, v_max]; it then sets the start position (x_i, y_i) = (x'_i, y'_i), the target position (x'_i, y'_i) = (x''_i, y''_i) and the speed v_i = v'_i, and returns to step 2.1.2).
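Steps 2.1.1 to 2.1.4 describe a standard random waypoint loop. The sketch below is one possible time-stepped implementation, assuming a rectangular target area [0, width] × [0, height] and v_min > 0; the class and method names (`RWPNode`, `RWPModel`, `advance`) are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class RWPNode:
    """State of one node under the random waypoint model (steps 2.1.1-2.1.4)."""
    x: float
    y: float
    tx: float  # target x
    ty: float  # target y
    v: float   # current speed, m/s
    pause_left: float = 0.0  # remaining pause time at the current waypoint


class RWPModel:
    def __init__(self, width, height, v_min, v_max, t_min, t_max, seed=None):
        if not 0 < v_min <= v_max:
            raise ValueError("require 0 < v_min <= v_max")
        self.w, self.h = width, height
        self.v_range = (v_min, v_max)
        self.pause_range = (t_min, t_max)
        self.rng = random.Random(seed)

    def _random_point(self):
        return self.rng.uniform(0, self.w), self.rng.uniform(0, self.h)

    def new_node(self) -> RWPNode:
        # 2.1.1: random start, random target, random speed in [v_min, v_max]
        x, y = self._random_point()
        tx, ty = self._random_point()
        return RWPNode(x, y, tx, ty, self.rng.uniform(*self.v_range))

    def _new_leg(self, node: RWPNode) -> None:
        # 2.1.4: the reached target is already the start; draw new target and speed
        node.tx, node.ty = self._random_point()
        node.v = self.rng.uniform(*self.v_range)

    def advance(self, node: RWPNode, dt: float) -> None:
        """Move node forward by dt seconds, handling moves, pauses and new legs."""
        remaining = dt
        while remaining > 0:
            if node.pause_left > 0:  # 2.1.3: pausing at the waypoint
                used = min(node.pause_left, remaining)
                node.pause_left -= used
                remaining -= used
                if node.pause_left <= 0:
                    self._new_leg(node)
                continue
            dist = math.hypot(node.tx - node.x, node.ty - node.y)
            travel = node.v * remaining
            if travel < dist:  # 2.1.2: straight-line move toward the target
                frac = travel / dist
                node.x += frac * (node.tx - node.x)
                node.y += frac * (node.ty - node.y)
                remaining = 0.0
            else:  # target reached within this step: snap and start a pause
                node.x, node.y = node.tx, node.ty
                remaining -= dist / node.v
                node.pause_left = self.rng.uniform(*self.pause_range)
                if node.pause_left <= 0:
                    self._new_leg(node)
```

Sampling node positions at the start of every time slot (as step 5.2.12 does) then amounts to calling `advance` once per slot with dt equal to the slot length.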
  4. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 1 or 2, wherein in step 2) the following path loss model is adopted to describe signal propagation between the AP and the node: (1); where d is the distance between sender and receiver in meters, L_FS(d) is the free-space path loss in dB, and d_BP is the breakpoint distance from the sender in meters. If the distance between sender and receiver is less than or equal to d_BP, the path between them is a line-of-sight (LoS) path, i.e. there is no obstacle between them; if the distance is greater than d_BP, the path is a non-line-of-sight (NLoS) path, i.e. there is an obstacle between them. L_FS(d) is given by: (2); where f is the frequency in Hz, set to the center frequency of the band, and SF is the shadow fading in dB. SF follows a log-normal distribution with mean 0, whose distribution function is: (3); where σ is the standard deviation of the shadow fading. Given d_BP, the path loss between sender and receiver is determined by d and f, so L(d) is rewritten as L(d, f_b), where f_b denotes the center frequency of band b.
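Equations (1) to (3) are not reproduced in this text, so the following is only a sketch of a breakpoint model of the kind described, not the patent's formula. It assumes free-space loss 20·log10(4πdf/c) up to d_BP, an extra 35 dB per decade beyond d_BP, and zero-mean Gaussian shadow fading in dB (log-normal in linear units). The 5 m breakpoint and 35 dB/decade slope are illustrative defaults in the style of the IEEE 802.11ax (TGax) residential channel model; the function names are hypothetical.

```python
import math
import random

C_LIGHT = 299_792_458.0  # speed of light, m/s


def free_space_loss_db(d_m: float, f_hz: float) -> float:
    """Free-space path loss 20*log10(4*pi*d*f/c) in dB (d in m, f in Hz)."""
    return 20.0 * math.log10(4.0 * math.pi * d_m * f_hz / C_LIGHT)


def path_loss_db(d_m: float, f_hz: float, d_bp_m: float = 5.0,
                 nlos_slope_db: float = 35.0, sigma_db: float = 0.0,
                 rng: random.Random = None) -> float:
    """Assumed breakpoint model L(d, f_b): free space up to d_BP (LoS), then
    nlos_slope_db per decade beyond d_BP (NLoS), plus shadow fading
    SF ~ N(0, sigma_db^2) in dB when sigma_db > 0."""
    if d_m <= 0:
        raise ValueError("distance must be positive")
    if d_m <= d_bp_m:
        loss = free_space_loss_db(d_m, f_hz)
    else:
        loss = (free_space_loss_db(d_bp_m, f_hz)
                + nlos_slope_db * math.log10(d_m / d_bp_m))
    if sigma_db > 0:
        loss += (rng or random).gauss(0.0, sigma_db)
    return loss
```

With sigma_db = 0 the function is deterministic, which is convenient for inverting it into communication and interference distances.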
  5. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 1 or 2, wherein in step 2) the interference model of the network is determined as follows: 2.3.1) Interference between adjacent BSSs. From the path loss model, P_R = P_T − L(d, f_b), where P_R is the received signal strength (RSS) at the receiver and P_T is the transmit power of the sender; solving for d gives the distance between sender and receiver: (4); Let r_j and r_j^I denote the communication distance and the interference distance of AP j or node j, respectively: if P_R = η_D then d = r_j, and if P_R = η_I then d = r_j^I, where η_D is the threshold for decoding a data frame and η_I is the threshold of interfering signal strength. η_D is obtained from η_SINR = η_D / (P_I + P_noise), where P_I is the total interference signal strength from nearby signal sources and P_noise is the thermal noise power. Node x is denoted STA x and node y STA y; the distance between AP i and STA x is d_{i,x} and the distance between AP j and STA y is d_{j,y}; r_x^I and r_y^I denote the interference distances of STA x and STA y, respectively; the link between AP i and STA x is denoted l_{i,x} and the link between AP j and STA y is denoted l_{j,y}. The set of nodes associated with AP i in band b is S_b(i), the set of nodes associated with AP j in band b is S_b(j), and the interference range D^b_{i,j} of AP i and AP j in band b is defined as: (5); If the distance between AP i and AP j (i ≠ j) is less than or equal to D^b_{i,j} and the channels used by the two APs overlap, links l_{i,x} and l_{j,y} interfere with each other, and under the coordination of the network controller the two links must use the channel in turn to avoid mutual interference. 2.3.2) Interference from nearby signal sources. The SINR model describes the level of interference from nearby signal sources, with SINR defined as: SINR = P_R / (P_I + P_noise) (6); where P_R is the RSS at the receiver. At a given data rate, a packet is received successfully only if the receiver's SINR is greater than or equal to a given threshold η_SINR. Given η_SINR, P_I and P_noise, setting P_R = η_D in η_SINR = η_D / (P_I + P_noise) yields the threshold η_D for correctly decoding a data frame at the receiver.
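The relations in claim 5 can be chained: the SINR threshold gives the decode threshold (written η_D here), inverting P_R = P_T − L(d, f_b) gives the communication distance r_j (at η_D) or the interference distance r_j^I (at the weaker threshold η_I), and two APs must take turns when they lie within the interference range and their channels overlap. A minimal sketch, assuming the same illustrative breakpoint path loss as before (shadow fading set to 0) and a hypothetical (center MHz, width MHz) channel representation. Equation (5) is not reproduced, so the interference range is passed in by the caller; all names are hypothetical.

```python
import math

C_LIGHT = 299_792_458.0  # m/s


def fspl_db(d_m: float, f_hz: float) -> float:
    """Free-space path loss in dB (assumed form of L_FS)."""
    return 20.0 * math.log10(4.0 * math.pi * d_m * f_hz / C_LIGHT)


def distance_for_loss(loss_db: float, f_hz: float,
                      d_bp_m: float = 5.0, slope_db: float = 35.0) -> float:
    """Invert the assumed deterministic breakpoint model (SF = 0):
    the distance d at which L(d, f_b) equals loss_db."""
    loss_bp = fspl_db(d_bp_m, f_hz)
    if loss_db <= loss_bp:  # LoS region: invert free space
        return C_LIGHT * 10.0 ** (loss_db / 20.0) / (4.0 * math.pi * f_hz)
    return d_bp_m * 10.0 ** ((loss_db - loss_bp) / slope_db)


def dbm_to_mw(p_dbm: float) -> float:
    return 10.0 ** (p_dbm / 10.0)


def decode_threshold_dbm(sinr_th_db: float, p_i_dbm: float, p_noise_dbm: float) -> float:
    """eta_D from eta_SINR = eta_D / (P_I + P_noise); the sum is taken in mW."""
    return sinr_th_db + 10.0 * math.log10(dbm_to_mw(p_i_dbm) + dbm_to_mw(p_noise_dbm))


def range_m(p_tx_dbm: float, threshold_dbm: float, f_hz: float) -> float:
    """Distance at which P_R = P_T - L(d, f_b) drops to threshold_dbm
    (r_j for the decode threshold, r_j^I for the interference threshold)."""
    return distance_for_loss(p_tx_dbm - threshold_dbm, f_hz)


def channels_overlap(ch_a, ch_b) -> bool:
    """Hypothetical channels given as (center_MHz, width_MHz)."""
    (fa, wa), (fb, wb) = ch_a, ch_b
    return abs(fa - fb) < (wa + wb) / 2.0


def must_take_turns(ap_distance_m: float, interference_range_m: float, ch_a, ch_b) -> bool:
    """Links of two APs conflict when the APs are within the interference
    range and their channels overlap; they must then transmit in turn."""
    return ap_distance_m <= interference_range_m and channels_overlap(ch_a, ch_b)
```

Because η_I is lower than η_D, the interference distance always exceeds the communication distance, which is why neighboring BSSs can disturb each other even when their nodes cannot decode each other's frames.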
  6. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 5, wherein in step 3) the throughput expression of the network is derived as follows: 3.1) Downlink rate of a node. Assume channel c_{j,b} and power p_{j,b} in band b are allocated to AP j. By the path loss model, the RSS of node i is p_{j,b} − L(d_{i,j}, f_b), where d_{i,j} is the distance between node i and AP j. Let K_c denote the total number of data subcarriers in channel c_{j,b}, from which the average RSS of node i in a k-tone RU is obtained. Let SINR_{i,k} denote the SINR of node i in the k-tone RU; then: (7); where I_{i,k} denotes the average of the total interference power plus the thermal noise power of node i in the k-tone RU. The bit rate of node i in the k-tone RU, in bps/Hz, is obtained from equation (8): (8); where β_x and η_x (x = 1, 2, ..., X) are the known bit rates and their corresponding SINR thresholds. Assuming RU set RU_i is assigned to node i, the downlink rate of node i is: (9); where W_sc is the subcarrier bandwidth in Hz. 3.2) Uplink rate of a node. Uplink PPDU transmission is scheduled by a trigger frame, through which the AP specifies the target RSS of node i's uplink transmission. Let RSS_i denote the target RSS of node i's uplink transmission in RU_i; the uplink target RSS of node i in a k-tone RU follows from RSS_i, and the transmit power of node i in the k-tone RU is: (10); where f_k denotes the center frequency of the k-tone RU; the total transmit power of node i in RU_i follows from its per-RU transmit powers. Let SINR_{j,k} denote the SINR of AP j in the k-tone RU; then: (11); The uplink data rate of node i is determined from SINR_{j,k} and the values β_x and η_x (x = 1, 2, ..., X). 3.3) Throughput rate of a node. Let S_b(j) denote the set of nodes associated with AP j in band b, so that the number of such nodes is |S_b(j)|, and let N_b denote the number of spatial streams of AP j in band b. Assume the channel width of AP j is w and the channel is divided into m_w RU sets, each shared by the N_b spatial streams, so that each data transmission supports at most N_b·m_w nodes. AP j therefore needs Y_{j,b} data transmissions to complete one round of transmission, in which every node in S_b(j) completes one uplink transmission and one downlink reception; hence: Y_{j,b} = ⌈|S_b(j)| / (N_b·m_w)⌉ (12); The durations of the trigger frame, PIFS, SIFS, uplink PPDU, M-BA, downlink PPDU and OFDMA-BA are denoted t_TF, t_PIFS, t_SIFS, t_UL_PPDU, t_M_BA, t_DL_PPDU and t_OFDMA_BA, respectively; the uplink transmission time T_UL and the downlink transmission time T_DL are obtained from these durations. Let T_{j,b} denote the period of one round of transmission of AP j in band b; then: T_{j,b} = (T_UL + T_DL)·Y_{j,b} (13); The throughput rate of node i in band b of AP j is therefore: (14); In equation (14), the numerator is the total number of bits transmitted by node i in one round of transmission and the denominator is the total time required for one round. An indicator, the channel collision indicator (CCI), expresses the interference level of an AP: CCI_{j,b} is the interference that AP j suffers from neighboring BSSs in band b, defined as the number of neighboring BSSs whose channels belong to the same overlapping channel set as the channel of AP j in band b. Thus the factor CCI_{j,b} + 1 in equation (14) reflects that AP j and its CCI_{j,b} neighboring BSSs interfere with each other in band b, so these CCI_{j,b} + 1 BSSs must transmit in turn.
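The chain in claim 6 (SINR to bit rate, bit rate to link rate, transmissions per round, round period, and a CCI-shared throughput) can be sketched as below. Only Y_{j,b} = ⌈|S_b(j)| / (N_b·m_w)⌉ and T_{j,b} = (T_UL + T_DL)·Y_{j,b} are taken directly from the text. The following are assumptions, since equations (8), (9) and (14) are not reproduced: the rate lookup picks the highest β_x whose threshold η_x the SINR meets; the link rate sums k·W_sc·β over the assigned RUs with W_sc = 78.125 kHz; and the bits per round are one uplink PPDU plus one downlink PPDU. All function names are hypothetical.

```python
import bisect
import math
from typing import Sequence


def bit_rate_bps_per_hz(sinr_db: float, thresholds_db: Sequence[float],
                        rates: Sequence[float]) -> float:
    """Assumed reading of eq. (8): highest beta_x whose threshold eta_x <= SINR.
    thresholds_db must be sorted ascending; rates[x] pairs with thresholds_db[x]."""
    idx = bisect.bisect_right(thresholds_db, sinr_db) - 1
    return 0.0 if idx < 0 else rates[idx]


def link_rate_bps(ru_tones: Sequence[int], sinr_db_per_ru: Sequence[float],
                  thresholds_db, rates, w_sc_hz: float = 78_125.0) -> float:
    """Assumed form of eq. (9): sum over the node's RUs of k * W_sc * beta."""
    return sum(k * w_sc_hz * bit_rate_bps_per_hz(s, thresholds_db, rates)
               for k, s in zip(ru_tones, sinr_db_per_ru))


def transmissions_per_round(n_nodes: int, n_streams: int, m_w: int) -> int:
    """Eq. (12): Y_{j,b} = ceil(|S_b(j)| / (N_b * m_w))."""
    if n_nodes < 1:
        raise ValueError("the AP must serve at least one node")
    return math.ceil(n_nodes / (n_streams * m_w))


def node_throughput_bps(r_ul: float, r_dl: float, t_ul_ppdu: float, t_dl_ppdu: float,
                        t_ul: float, t_dl: float, n_nodes: int, n_streams: int,
                        m_w: int, cci: int) -> float:
    """Throughput in the spirit of eq. (14): bits per round divided by the round
    period, shared among the CCI + 1 mutually interfering BSSs."""
    period = (t_ul + t_dl) * transmissions_per_round(n_nodes, n_streams, m_w)  # eq. (13)
    bits = r_ul * t_ul_ppdu + r_dl * t_dl_ppdu  # ASSUMED numerator
    return bits / ((cci + 1) * period)
```

For instance, 33 nodes on a 4-stream AP with m_w = 8 need two transmissions per round, so a single extra node beyond 32 doubles the round period.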
  7. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 6, wherein in step 4): while the network operates, its state changes over time, and the goal is to maximize the minimum throughput rate by optimizing the resource allocation of APs and nodes in each time slot. From the derivation of equation (14), the throughput rate of node i is a function of the power p_{j,b}, the channel c_{j,b} and the RU set RU_i, so maximizing the minimum throughput rate is achieved by optimizing p_{j,b}, c_{j,b} and RU_i, as described by the following optimization problem: (15); where θ_i(t) is the throughput rate of node i in time slot t, t = 1, 2, 3, ..., T, and T is the total number of time slots; p_{j,b} and c_{j,b} are the power and channel used by AP j in band b, and RU_i is the set of RUs assigned to node i by AP j. Constraints C1 to C3 ensure that p_{j,b}, c_{j,b} and RU_i take values in their given ranges; in constraint C3, RU_{w,m} denotes the RUs of a channel of bandwidth w combined into m_w RU sets, m ∈ {1, 2, ..., m_w}, and RU_i is a subset of one of these RU sets. The running time of the network is discretized into T time slots, and the throughput rates θ_i(t), i = 1, 2, ..., |S|, of the |S| users at the start of slot t constitute the current state of the network: at the start of slot 1 the state is STATE_1 = {θ_i(1) | i = 1, 2, 3, ..., |S|}, at the start of slot 2 it is STATE_2 = {θ_i(2) | i = 1, 2, 3, ..., |S|}, and in general at the start of slot x it is STATE_x = {θ_i(x) | i = 1, 2, 3, ..., |S|}. At the start of slot t, the network controller must obtain a feasible solution of optimization problem (15) that maximizes the minimum throughput rate. Let D_t = {p_{j,b}, c_{j,b}, RU_i | j ∈ A, b ∈ B, i ∈ S} denote the resource allocation decision obtained by the network controller at the start of slot t; the controller then allocates power, channels and RUs to the APs and nodes according to D_t so that the minimum throughput rate is maximized. Optimization problem (15) must therefore be solved in real time at the start of each slot t.
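The max-min objective of problem (15) is simple to evaluate but expensive to search: with one (power, channel) pair per AP there are (|P_b|·|C_b|)^|A| joint choices per slot, which illustrates the time-complexity concern raised in the Background. The minimal sketch below searches them exhaustively for a toy instance. `toy_evaluate` is a hypothetical stand-in for equation (14), assuming one node per AP whose rate grows with the power index and is shared among APs on the same channel; it is not the patent's throughput model, and RU assignment is omitted.

```python
import itertools
from typing import Callable, List, Sequence, Tuple

# One (power index, channel index) pair per AP.
JointAction = Tuple[Tuple[int, int], ...]


def brute_force_maxmin(n_aps: int, n_powers: int, n_channels: int,
                       evaluate: Callable[[JointAction], Sequence[float]]):
    """Enumerate every joint (power, channel) choice and keep the one whose
    minimum per-node throughput is largest (the max-min objective)."""
    per_ap = list(itertools.product(range(n_powers), range(n_channels)))
    best_action, best_value = None, float("-inf")
    for action in itertools.product(per_ap, repeat=n_aps):
        value = min(evaluate(action))
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value


def toy_evaluate(action: JointAction) -> List[float]:
    """Hypothetical stand-in for eq. (14): rate (power_idx + 1), divided by
    CCI + 1, the number of APs sharing the same channel."""
    rates = []
    for power_idx, ch in action:
        cci = sum(1 for _, other in action if other == ch) - 1
        rates.append((power_idx + 1) / (cci + 1))
    return rates
```

With 2 APs, 2 power levels and 2 channels the search visits 16 joint actions and selects maximum power on distinct channels; the count grows exponentially in |A|, which is what a trained policy network avoids at run time.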
  8. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 7, wherein step 5) proceeds as follows: 5.1) Defining the reinforcement learning model. For optimization problem (15), a trained, converged neural network outputs a reasonable resource allocation policy in real time from the current WiFi network state. To design a DDPG-based algorithm that solves (15), the problem is transformed into a reinforcement learning model comprising four quantities, a state space E, an action space H, a reward function g(·) and a discount coefficient γ, defined as follows: 5.1.1) State space E: E = {θ_i(t) | i = 1, 2, ..., |S|}. Since θ_i(t) reflects the node positions, the SINRs of the APs and nodes, and the number of nodes associated with each AP in band b, the AP needs to track θ_i(t) and take it as the current network state. 5.1.2) Action space H: H = {(H_1, H_2, ..., H_|A|) | H_j = (p_{j,b}, c_{j,b}), p_{j,b} ∈ P_b, c_{j,b} ∈ C_b, b ∈ B, j = 1, 2, ..., |A|}. At the start of slot t the network controller generates an action h(t) ∈ H, i.e. it allocates a power and a channel to each AP; once an AP's power and channel are fixed, the AP generates the corresponding RU sets according to the channel bandwidth and allocates an RU set to each associated node. Because the power and channel output by the DDPG-based algorithm are values in continuous real intervals, they must be mapped to discrete values that satisfy constraints C1 and C2. 5.1.3) Reward function g(t): from the network state e(t) ∈ E, the minimum throughput rate min_{i∈S} θ_i(t) is obtained, and the reward function is defined as g(t) = min_{i∈S} θ_i(t). Since the objective of optimization problem (15) is to maximize min_{i∈S} θ_i(t), this is equivalent to maximizing the reward g(t) in the reinforcement learning model. 5.1.4) Discount coefficient.
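The state, reward and discount definitions of step 5.1 fit in a few lines. A minimal sketch, assuming the state is simply the tuple of per-node throughput rates and the discount coefficient lies in [0, 1]; the function names are hypothetical. The reward is the minimum entry of the state, so maximizing the return pushes the worst-served node upward, which is the max-min objective of problem (15).

```python
from typing import Sequence, Tuple


def network_state(throughputs: Sequence[float]) -> Tuple[float, ...]:
    """e(t) = (theta_1(t), ..., theta_|S|(t)), the per-node throughput rates."""
    return tuple(float(x) for x in throughputs)


def reward(state: Sequence[float]) -> float:
    """g(t) = min_i theta_i(t): the max-min objective expressed as a reward."""
    if not state:
        raise ValueError("the network must contain at least one node")
    return min(state)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum over k of gamma^k * g(t + k), accumulated from the last slot backwards."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("discount coefficient must lie in [0, 1]")
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

A small gamma makes the agent favor the current slot's minimum throughput; a gamma close to 1 weights future slots almost equally, which matters when node mobility makes today's allocation affect later states.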
  9. To weigh the current reward against the future rewards obtained after performing action h(t), the reinforcement learning model introduces a discount coefficient γ with a value between 0 and 1, which is adjustable. 5.2) The algorithm uses two deep neural networks, a value network and a policy network, each with its own parameters, which are updated separately during iterative training so as to iteratively find a better resource allocation policy. During training, the value network and the policy network are copied to create a target value network and a target policy network with the same structures; the difference is that the parameters of the target networks are updated more slowly, which stabilizes training. Let M denote the total number of training rounds (episodes), T the number of time slots per round, and M_0 the number of rounds the agent spends in initial training. The DDPG-based algorithm proceeds as follows: 5.2.1) randomly initialize the value network Q and the policy network μ with weights ω^Q and ω^μ, respectively, and create the target networks Q' and μ'; 5.2.2) randomly allocate power p_{j,b} and channel c_{j,b} to each AP j, perform node-AP association with the Association algorithm, and allocate RU set RU_i to each node i with the RUassignment algorithm; 5.2.3) set episode = 1; 5.2.4) if episode > M, go to step 5.2.21); 5.2.5) compute the throughput rate of each node with equation (14) to obtain the initial network state e(0) of the current episode; 5.2.6) set t = 1; 5.2.7) if t > T, set episode = episode + 1 and return to 5.2.4);
5.2.8) if episode ≤ M_0, randomly allocate power p_{j,b} and channel c_{j,b} to AP j, perform node-AP association with the Association algorithm, and allocate RU set RU_i to node i with the RUassignment algorithm; 5.2.9) if episode > M_0, output the action ĥ(t) of the policy network μ for the state e(t); 5.2.10) call the Discretize algorithm to map ĥ(t) to a feasible solution h(t) ∈ H; 5.2.11) execute h(t), i.e. allocate power p_{j,b} and channel c_{j,b} to AP j, perform node-AP association with the Association algorithm, and allocate RU set RU_i to node i with the RUassignment algorithm; 5.2.12) sample the current user positions according to the RWP mobility model, compute the throughput rate θ_i(t) of each node with equation (14), obtain the reward g(t) = min_{i∈S} θ_i(t), and obtain the new network state e(t+1); 5.2.13) store (e(t), h(t), g(t), e(t+1)) in the buffer; 5.2.14) randomly sample N_s tuples (e_i, h_i, g_i, e_{i+1}), i = 1, 2, ..., N_s, from the buffer; 5.2.15) compute the targets y_i, i = 1, 2, ..., N_s; 5.2.16) update the value network Q by minimizing the loss; 5.2.17) update the policy network μ using the policy gradient; 5.2.18) update the target value network; 5.2.19) update the target policy network; 5.2.20) set t = t + 1 and return to 5.2.7); 5.2.21) end.
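The formulas for steps 5.2.15 to 5.2.19 are not reproduced above, so the sketch below uses the standard DDPG bookkeeping as an assumption: TD target y_i = g_i + γ·Q'(e_{i+1}, μ'(e_{i+1})), mean-squared critic loss, and soft target updates ω' ← τ·ω + (1 − τ)·ω' with a hypothetical rate τ. It also shows the warm-up rule of steps 5.2.8 and 5.2.9 (random allocation for the first M_0 episodes). Networks are passed in as plain callables and parameters as flat lists; the gradient steps of 5.2.16 and 5.2.17 need an autodiff framework and are omitted. All names are hypothetical.

```python
import random
from collections import deque
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, ...]
Transition = Tuple[Vec, Vec, float, Vec]  # (e(t), h(t), g(t), e(t+1))


class ReplayBuffer:
    """Bounded experience buffer for steps 5.2.13 and 5.2.14."""

    def __init__(self, capacity: int, seed=None):
        self._data: deque = deque(maxlen=capacity)  # oldest tuples drop out first
        self._rng = random.Random(seed)

    def push(self, e, h, g, e_next) -> None:
        self._data.append((tuple(e), tuple(h), float(g), tuple(e_next)))

    def sample(self, n: int) -> List[Transition]:
        # uniform minibatch, sampled without replacement
        return self._rng.sample(list(self._data), n)

    def __len__(self) -> int:
        return len(self._data)


def choose_action(episode: int, m0: int, state: Vec,
                  actor: Callable[[Vec], Vec], random_action: Callable[[], Vec]) -> Vec:
    """Steps 5.2.8/5.2.9: random allocation while episode <= M_0, policy output afterwards."""
    return random_action() if episode <= m0 else actor(state)


def td_targets(batch: Sequence[Transition], target_actor, target_critic,
               gamma: float) -> List[float]:
    """Standard DDPG target y_i = g_i + gamma * Q'(e_{i+1}, mu'(e_{i+1}))."""
    return [g + gamma * target_critic(e_next, target_actor(e_next))
            for _, _, g, e_next in batch]


def critic_loss(batch: Sequence[Transition], critic, targets: Sequence[float]) -> float:
    """Mean squared TD error (1/N_s) * sum_i (y_i - Q(e_i, h_i))^2."""
    return sum((y - critic(e, h)) ** 2
               for (e, h, _, _), y in zip(batch, targets)) / len(batch)


def soft_update(target: Sequence[float], source: Sequence[float], tau: float) -> List[float]:
    """omega' <- tau * omega + (1 - tau) * omega', element-wise."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]
```

A small τ is what makes the target networks "update more slowly", as the claim puts it; τ = 1 would reduce to a hard copy.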
  10. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 8, wherein step 5) further includes the following: 5.3) Association algorithm, which associates nodes with APs, as follows: 5.3.1) node i measures the signal strength of the surrounding APs and sends an association request to the AP with the strongest signal; 5.3.2) AP j counts the total number of nodes requesting association with it and computes the number of nodes allocated to band b according to equations (16) and (17): (16); (17); where n_{j,b} denotes the number of nodes associated with band b of AP j, the function round(·) rounds to the nearest integer, S_j^req denotes the set of nodes requesting association with AP j, and the remaining term denotes the number of basic channels in band b; 5.3.3) AP j uniformly selects n_{j,b} nodes from S_j^req and associates them with band b of AP j; the set of these n_{j,b} nodes is denoted S_b(j), i.e. |S_b(j)| = n_{j,b}. 5.4) RUassignment algorithm, with which AP j assigns RU set RU_i to each associated node i, as follows: 5.4.1) define m_w RU sets RU_{w,m}, w ∈ W, m = 1, 2, ..., m_w, where m_w is the maximum number of 106-tone RUs in a channel of bandwidth w MHz; 5.4.2) determine the maximum number of nodes supported per parallel transmission: since band b of AP j has N_b spatial streams and channel c_{j,b} is divided into at most m_w RU subsets, this number is N_b·m_w; 5.4.3) group the nodes: dividing |S_b(j)| by N_b·m_w gives quotient q and remainder rem; if rem ≠ 0, the |S_b(j)| nodes are partitioned into q + 1 groups, where group x (x = 1, 2, ..., q) contains N_b·m_w nodes and group q + 1 contains rem nodes; if rem = 0, the |S_b(j)| nodes are partitioned into q groups of N_b·m_w nodes each; 5.4.4) RU allocation: if rem ≠ 0, the m_w RU sets RU_{w,m} are assigned to the x-th group of nodes, x = 1, 2, ..., q, and for the nodes of group q + 1 one of the N_b spatial streams is selected uniformly for each node, and the required RU sets are determined by the number of nodes served by each spatial stream; if rem = 0, the m_w RU sets are assigned to each of the q groups. When RU sets are allocated to the nodes of a group, nodes farther from the AP receive larger RU sets (RU sets with more data subcarriers in total) and nodes closer to the AP receive smaller RU sets (fewer data subcarriers in total), which balances the data rates of the nodes.
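Steps 5.4.3 and 5.4.4 (grouping by N_b·m_w, then giving larger RU sets to farther nodes) can be sketched as follows. This is a simplification under stated assumptions: an RU set is represented as a list of RU tone counts, each RU set is paired with every spatial stream to form one (RU set, stream) slot per node, and the special stream-selection rule for the last, partial group is not reproduced; partial groups simply take the largest slots first. The function names are hypothetical, and the association step (equations (16) and (17)) is not modeled.

```python
from typing import Dict, List, Sequence, Tuple


def group_nodes(node_ids: Sequence[int], n_streams: int, m_w: int) -> List[List[int]]:
    """Step 5.4.3: q full groups of N_b*m_w nodes, plus one group of rem nodes if rem != 0."""
    cap = n_streams * m_w
    q, rem = divmod(len(node_ids), cap)
    groups = [list(node_ids[x * cap:(x + 1) * cap]) for x in range(q)]
    if rem:
        groups.append(list(node_ids[q * cap:]))
    return groups


def assign_ru_sets(group: Sequence[int], distance: Dict[int, float],
                   ru_sets: Sequence[Sequence[int]], n_streams: int
                   ) -> Dict[int, Tuple[int, int]]:
    """Step 5.4.4 ordering rule (simplified): the farthest nodes get the RU sets
    with the most data subcarriers. Returns node -> (RU set index, stream index)."""
    by_size = sorted(range(len(ru_sets)), key=lambda m: sum(ru_sets[m]), reverse=True)
    slots = [(m, s) for m in by_size for s in range(n_streams)]
    if len(group) > len(slots):
        raise ValueError("group larger than N_b * m_w")
    far_first = sorted(group, key=lambda i: distance[i], reverse=True)
    return dict(zip(far_first, slots))
```

Pairing the longest links with the widest RU sets compensates for their lower SINR (and hence lower bits per subcarrier), which is the rate-balancing rationale stated in step 5.4.4.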
  11. The IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient according to claim 9, wherein step 5) further includes the following: 5.5) Discretize algorithm, which maps the actions produced by the policy network μ into the legal candidate solution space so as to output a feasible solution, with the following steps: 5.5.1) tile the candidate power values and candidate channel numbers of band b along the x axis and y axis of a plane coordinate system, respectively, to generate a continuous, bounded candidate solution plane; 5.5.2) map the action ĥ(t) output by the policy network μ to the corresponding point in the candidate solution plane, ĥ(t) ∈ Ĥ = {(H_1, H_2, ..., H_|A|) | H_j = (p_{j,b}, c_{j,b}), p_{j,b} ∈ [1, |P_b|], c_{j,b} ∈ [1, |C_b|], b ∈ B, j = 1, 2, ..., |A|}, where [1, |P_b|] and [1, |C_b|] are both real intervals; 5.5.3) compute the integer coordinates nearest to these points to obtain h(t) ∈ H = {(H_1, H_2, ..., H_|A|) | H_j = (p_{j,b}, c_{j,b}), p_{j,b} ∈ P_b, c_{j,b} ∈ C_b, b ∈ B, j = 1, 2, ..., |A|}, where p_{j,b} and c_{j,b} are integers in the specified integer sets; 5.5.4) output h(t).
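The Discretize step snaps each AP's continuous (power, channel) point to the nearest integer grid point on the candidate solution plane and then reads off the actual power and channel values. A minimal sketch, assuming that ties are rounded half up, that out-of-range outputs are clamped to [1, |P_b|] and [1, |C_b|], and that the policy network ends in a tanh layer whose output in [-1, 1] is rescaled onto [1, n] by a hypothetical helper `tanh_to_interval`; the per-AP lists of candidate powers and channels are inputs, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def tanh_to_interval(u: float, n: int) -> float:
    """Hypothetical helper: rescale a tanh output u in [-1, 1] onto [1, n]."""
    return 1.0 + (u + 1.0) * (n - 1) / 2.0


def nearest_index(x: float, n: int) -> int:
    """Nearest integer to x (round half up), clamped to [1, n]."""
    return min(max(math.floor(x + 0.5), 1), n)


def discretize(action: Sequence[Tuple[float, float]],
               power_sets: Sequence[Sequence[float]],
               channel_sets: Sequence[Sequence[int]]) -> List[Tuple[float, int]]:
    """Steps 5.5.2-5.5.3: for each AP, snap (p, c) on the candidate plane to the
    nearest integer grid point and look up the corresponding power and channel."""
    feasible = []
    for (p, c), powers, channels in zip(action, power_sets, channel_sets):
        pi = nearest_index(p, len(powers))
        ci = nearest_index(c, len(channels))
        feasible.append((powers[pi - 1], channels[ci - 1]))  # indices are 1-based
    return feasible
```

Rounding half up with `math.floor(x + 0.5)` is used instead of Python's built-in `round`, which rounds ties to the nearest even integer.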

Description

IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient

Technical Field

The invention belongs to the field of network resource allocation and relates to an IEEE 802.11be WiFi real-time resource allocation method.

Background

WiFi based on the IEEE 802.11be standard can be used for network access by indoor nodes. IEEE 802.11be is the next-generation WiFi standard and is currently still being drafted and refined. According to the technical drafts of IEEE 802.11be, compared with existing WiFi standards such as IEEE 802.11ac and IEEE 802.11ax, the IEEE 802.11be standard stands out by supporting enhanced MU-MIMO, enhanced OFDMA, multi-RU aggregation, multi-link transmission and multi-AP coordination. These new features can effectively improve network performance, so deploying WiFi networks based on IEEE 802.11be is important for improving the user experience. In such a network, dynamically allocating network resources to APs and nodes in real time according to the current network state, such as the node locations and the strength of network interference, is an effective way to improve network throughput. There are generally two common approaches to resource allocation in WiFi networks. The first is fixed resource allocation, in which the power and channel used by an AP are essentially fixed.
Although a network administrator can configure the power and channel of an AP and has many options to choose from, once configuration is complete the power and channel stay fixed and cannot adapt to changes in the network state; network resources are therefore used inefficiently, and some nodes in the network easily end up with low throughput rates. The second is dynamic resource allocation based on heuristic algorithms. In recent years, to improve the utilization of network resources and thus network throughput, dynamic resource allocation techniques based on heuristic algorithms have been proposed. In these techniques the AP has a parameter-tuning function and can compute an optimal resource allocation scheme from the current network state. Their main drawback, however, is the high time complexity of the algorithm: while the AP is computing the optimal allocation for a given network state, the state may already have changed, so the allocation obtained performs poorly in real time.

Disclosure of Invention

To overcome these shortcomings of the prior art, the invention provides an IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient that can effectively improve the minimum throughput rate of the network.
The technical solution adopted to solve the technical problem is as follows: an IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient comprises the following steps: 1) establishing an IEEE 802.11be WiFi network model; 2) determining the mobility model, path loss model and interference model adopted by the network; 3) deriving an expression for the throughput rate of the network; 4) formulating an optimization problem whose objective is to maximize the minimum throughput rate, i.e. optimizing the allocation of power, channels and resource units in real time so as to raise the minimum throughput rate of the network; and 5) designing a real-time resource allocation algorithm based on deep deterministic policy gradient that solves the optimization problem, thereby realizing real-time resource allocation for IEEE 802.11be WiFi. The main advantage of the invention is that it effectively improves the minimum throughput rate of the network.

Drawings

Fig. 1 is a diagram of the frequency bands and channels of the network. Fig. 2 is a diagram of the data transmission process of an IEEE 802.11be WiFi network. Fig. 3 is a diagram of the interference range between link l_{i,x} and link l_{j,y}. Fig. 4 is a timeline of network operation. Fig. 5 is a diagram of the reinforcement learning model. Fig. 6 is a schematic diagram of the RUs in a 40 MHz channel. Fig. 7 is the candidate solution plane for power and channel.

Detailed Description

The invention is further described below with reference to the accompanying drawings. Referring to Figs. 1 to 7, an IEEE 802.11be WiFi real-time resource allocation method based on deep deterministic policy gradient includes the following steps: 1) An IEEE 802.11be WiFi network model