CN-122002539-A - Joint AP scheduling and power allocation method based on distributed collaborative deep reinforcement learning in cell-free massive MIMO
Abstract
The invention discloses a joint AP scheduling and power allocation method based on distributed collaborative deep reinforcement learning in cell-free massive MIMO, belonging to the field of wireless communication. The method comprises: first, constructing a cell-free massive MIMO downlink communication scenario oriented to diversified user demands, in which the APs can be dynamically scheduled between sleep and active states; then, in each time slot, calculating the downlink achievable rate of each user served by each AP and the spectrum demand satisfaction rate at each AP, and from these the per-slot energy efficiency of the system; next, taking maximization of the cumulative energy efficiency of the system as the optimization objective, under the constraints that the sum of the allocated powers is smaller than the total power and that the cumulative user spectral-efficiency satisfaction is greater than a threshold; modeling the optimization problem as a Markov decision process; and finally solving it with a collaborative A2C optimization algorithm, maximizing the energy efficiency of the system on the premise of satisfying the constraints.
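The per-slot energy efficiency that the method maximizes (sum of downlink achievable rates divided by total downlink power consumption) can be sketched minimally as follows; the function name and the example numbers are illustrative assumptions, not values from the patent:

```python
import numpy as np

def slot_energy_efficiency(rates, powers):
    """Per-slot energy efficiency: sum of the downlink achievable rates of
    all served users divided by the total downlink power consumption.
    `rates` (bit/s) and `powers` (W) are illustrative inputs; the patent's
    exact symbols are elided in the translation."""
    return np.sum(rates) / np.sum(powers)

# Example: three active APs serving users at the given rates and powers.
rates = [12e6, 8e6, 5e6]    # achievable rates (bit/s)
powers = [2.0, 1.5, 1.5]    # allocated downlink powers (W)
ee = slot_energy_efficiency(rates, powers)   # 25e6 bit/s over 5 W
```

The cumulative objective of step four would then be the (discounted) sum of these per-slot values over an episode.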
Inventors
- Wang Chaowei
- Xu Jisong
- Deng Danhao
- Li Yehao
- Zhang Zhi
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-02-11
Claims (8)
- 1. A joint AP scheduling and power allocation method based on distributed collaborative deep reinforcement learning in cell-free massive MIMO, characterized by comprising the following specific steps: Step one, constructing a cell-free massive MIMO downlink communication scenario oriented to diversified user demands; the scenario contains a number of distributed-deployed APs, each AP can be dynamically scheduled between sleep and active states, and all active APs can serve all users within range simultaneously; Step two, in time slot t, calculating the downlink achievable rate of each user served by the m-th AP and the spectrum demand satisfaction rate at the m-th AP; Step three, collecting the downlink achievable rates of all users served by each AP in time slot t, and calculating the ratio of the sum of all downlink achievable rates to the total downlink power consumption, i.e. the energy efficiency at time slot t; this calculation involves the downlink achievable rate of the k-th user, the set of all active APs in time slot t, the set of users served by the m-th AP in time slot t, and the sum of the power allocations of the users served by the m-th AP in time slot t; Step four, taking maximization of the cumulative energy efficiency of the system as the optimization objective, under the constraints that the sum of the power allocations is smaller than the total power and that the cumulative user spectral-efficiency satisfaction is greater than a threshold; Step five, modeling the optimization problem as a Markov decision process; the system state comprises the AP local state and the global state of the system, wherein the AP local state comprises the channel state information matrix of all users currently served by the AP, the beamforming vectors, the cumulative spectral-efficiency satisfaction, and the cumulative energy efficiency; the global state of the system is generated by the CPU from the local state summaries reported by each AP, and comprises the global cumulative spectral-efficiency satisfaction and the global cumulative energy efficiency; the system reward adopts a cumulative reward mechanism, giving a positive reward when the power and satisfaction constraints are met and a penalty reward otherwise; Step six, solving the Markov decision process with a collaborative A2C optimization algorithm, maximizing the energy efficiency of the system on the premise of satisfying the constraints.
- 2. The method as claimed in claim 1, wherein in the first step, each AP is connected to the CPU through a backhaul link, and the APs exchange local state information through a neighborhood communication link.
- 3. The method according to claim 1, wherein in the second step, the downlink achievable rate of the k-th user is calculated as a Shannon rate from its signal-to-interference-plus-noise ratio, the calculation involving the channel coefficient matrix between the m-th AP and the k-th user in time slot t, the zero-forcing beamforming precoding weight vector between the m-th AP and the k-th user in time slot t, the interference from other users experienced by the k-th user in time slot t, and the additive white Gaussian noise of the k-th user; the spectrum demand satisfaction rate at the m-th AP is determined from the downlink achievable rate of each user served by the m-th AP and that user's downlink rate requirement in time slot t.
- 4. The method of claim 1, wherein in the fourth step, the global spectral-efficiency satisfaction of the system is calculated based on the spectral-efficiency satisfaction of each AP.
- 5. The method of claim 1, wherein in the fifth step, the sleep/active state scheduling of the APs and the power allocation are jointly incorporated into the action space to construct a Markov-process dynamic decision framework, the system action comprising the sleep/active state schedule of all APs and the power allocation of the active APs.
- 6. The method of claim 1, wherein in the sixth step, an advantage function is introduced in conjunction with the A2C optimization algorithm to characterize the benefit of selecting a given action in a given state; the advantage function is estimated by the Monte Carlo method as the discounted cumulative return minus the state value, with the discount factor γ.
- 7. The method of claim 1, wherein in the sixth step, the collaborative A2C optimization algorithm comprises, at each AP, a local policy network and a local value network; the policy network outputs an action policy based on the current system state, and the value network evaluates the value of the current state; the parameters of the policy network are updated by gradient ascent, with an entropy regularization mechanism added to encourage exploration, the update and loss involving the temporal-difference error, the policy entropy, the entropy regularization factor, and the learning rate of the policy network; the parameters of the value network are updated by gradient descent on the squared temporal-difference error, with the learning rate of the value network.
- 8. The method according to claim 6 or 7, wherein the sixth step specifically comprises: Step 701, initialization: initializing the parameters of the policy network and the value network, the learning rates, the discount factor, and the satisfaction threshold; initializing the policy network and the value network; initializing the total number of training rounds and the total number of time slots per round; Step 702, state acquisition: in each time slot of each training round, each AP collects its own local state through the neighborhood communication link, and at every preset global-state acquisition period the CPU collects the global state of the system through the backhaul link and generates the global constraint indicators; Step 703, action decision: each AP, based on its current local state, outputs an AP state scheduling instruction and a power allocation instruction through its policy network; Step 704, execution and feedback: each AP adjusts its sleep/active state and transmission power according to the instructions, calculates the user rates, the spectrum demand satisfaction rate, and the energy efficiency of the time slot in real time, and generates the reward; Step 705, local network update: according to the state transition, calculating the temporal-difference target and the temporal-difference error of the collaborative A2C algorithm and updating the policy network and the value network; Step 706, global network update: the CPU periodically updates the global value network and feeds the gradient back to all APs; the gradient is fused into the update of the local policy network at each AP's next local update, ensuring that local decisions conform to the global optimization objective; Step 707, iterative optimization: repeating steps 702 to 706 until the total number of training rounds is reached; the networks converge to the optimal policy, and the joint AP scheduling and power allocation scheme is output.
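The per-user rate and AP-level spectrum demand satisfaction rate of claims 2-3 can be sketched as follows. The exact formulas are elided in this translation, so the textbook Shannon-rate form under zero-forcing precoding and the "fraction of users whose rate meets their requirement" reading of the satisfaction rate are assumptions for illustration, as are the function names:

```python
import numpy as np

def downlink_rate(p, h, w, interference, noise_power, bandwidth=1.0):
    """SINR-based achievable rate for one user under zero-forcing precoding.
    p: allocated power, h: channel vector to the user, w: ZF precoding
    vector, interference: power of interference from other users."""
    signal = p * np.abs(np.vdot(h, w)) ** 2
    sinr = signal / (interference + noise_power)
    return bandwidth * np.log2(1.0 + sinr)

def demand_satisfaction_rate(rates, demands):
    """Fraction of an AP's served users whose achievable rate meets their
    per-slot downlink rate requirement (assumed definition)."""
    rates, demands = np.asarray(rates), np.asarray(demands)
    return float(np.mean(rates >= demands))
```

For example, with unit power, a perfectly aligned unit channel/precoder, no interference, and unit noise power, the rate is log2(1 + 1) = 1 bit/s/Hz.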
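The A2C quantities named in claims 6-7 (Monte Carlo advantage, temporal-difference error, entropy-regularized policy loss, squared-TD value loss) follow standard actor-critic form; since the patent's own expressions are elided in the translation, this sketch uses the textbook versions, and all function names are illustrative:

```python
import numpy as np

def monte_carlo_advantage(rewards, values, gamma):
    """Claim 6: A(s_t, a_t) = G_t - V(s_t), where G_t is the discounted
    Monte Carlo return over the remainder of the episode."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):  # backward accumulation
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns - np.asarray(values)

def td_error(r, v_s, v_next, gamma):
    """Temporal-difference error: delta = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

def actor_loss(log_prob, delta, entropy, beta):
    """Claim 7 policy objective, written as a loss to minimize:
    -(log pi(a|s) * delta + beta * H(pi)), beta = entropy factor."""
    return -(log_prob * delta + beta * entropy)

def critic_loss(delta):
    """Claim 7 value-network loss: squared TD error (gradient descent)."""
    return delta ** 2
```

With rewards [1, 0, 1], a constant value estimate 0.5, and gamma = 0.9, the returns are [1.81, 0.9, 1.0], giving advantages [1.31, 0.4, 0.5].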
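The per-slot loop of steps 701-707 can be shown as a skeleton. This is a toy stand-in, not the patent's method: the policy, environment, and reward here are random placeholders, the network updates of steps 705-706 are elided to a comment, and all names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state):
    """Step 703 stand-in: output a sleep/active decision (0/1) and a
    power-allocation value for one AP."""
    return int(rng.integers(0, 2)), float(rng.uniform(0.0, 1.0))

def environment_step(state, action):
    """Step 704 stand-in: apply the action, return next state and reward.
    Reward is nonzero only when the AP is active (toy rule)."""
    next_state = state + 1
    reward = float(action[0]) * action[1]
    return next_state, reward

def train(num_rounds=3, slots_per_round=4):
    """Skeleton of the iterative loop in claim 8: per slot, collect state,
    act, receive reward (the local/global network updates of steps 705-706
    are elided), and repeat until the round budget is exhausted (step 707)."""
    episode_rewards = []
    for _ in range(num_rounds):
        state, total = 0, 0.0
        for _ in range(slots_per_round):           # steps 702-705 per slot
            action = policy(state)                 # step 703
            state, reward = environment_step(state, action)  # step 704
            total += reward                        # step 705 would update nets
        episode_rewards.append(total)
    return episode_rewards

rewards = train()
```

In the patent's scheme each AP runs this loop on its local state, while the CPU's periodic global-value update (step 706) feeds gradients back into the local policy updates.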
Description
Joint AP scheduling and power allocation method based on distributed collaborative deep reinforcement learning in cell-free massive MIMO

Technical Field

The invention belongs to the field of wireless communication, and particularly relates to a joint AP scheduling and power allocation method based on distributed collaborative deep reinforcement learning in cell-free massive MIMO.

Background

Facing the upcoming 6G network, the access of massive heterogeneous user terminals and the continuously growing demand for high-rate, low-latency, and highly reliable communication services cause the system energy consumption to rise rapidly, which has become the core bottleneck restricting the development of next-generation mobile communication. How to reduce system power consumption while guaranteeing quality of service (Quality of Service, QoS) has therefore become a key research direction in the wireless communication field. Cell-free massive multiple-input multiple-output (Cell-Free Massive MIMO), as a core technology of next-generation mobile communication, constructs distributed communication connections with the user equipment (UE) in the coverage area through the random deployment of a large number of distributed lightweight access points (APs) and the centralized cooperative scheduling of a central processing unit (Central Processing Unit, CPU). It thereby breaks through the technical bottlenecks of the traditional cellular network, namely pronounced cell-boundary interference and degraded edge-user performance, and offers higher spectral efficiency, stronger interference-suppression capability, and more flexible resource allocation.
Meanwhile, the close-range deployment of APs and users greatly reduces signal attenuation, providing a foundation for dynamic power adjustment and energy-consumption optimization, and has become a key technology supporting green communication and efficient resource utilization. The traditional cell-free massive MIMO system takes the improvement of spectral efficiency as its core optimization objective, realizes interference suppression through techniques such as cooperative transmission of distributed APs and zero-forcing beamforming, and relies on the centralized signal-processing capability of the CPU to improve the communication quality of edge users. However, since no dynamic state-switching mechanism is designed according to the actual contribution of each AP, considerable unnecessary energy consumption is incurred, which is difficult to reconcile with green-communication requirements. In terms of resource allocation and optimization, the optimization approaches of traditional cell-free massive MIMO systems fall into two categories. One is single-dimensional power-allocation optimization, i.e. guaranteeing spectral efficiency only by adjusting the transmission power of the APs to the users, for example dynamically adjusting the power allocation according to the channel state information to reduce multi-user interference; however, this approach rarely considers the overall energy efficiency of the system and is difficult to apply to practical scenarios. The other is static AP-selection optimization, i.e. pre-screening a fixed subset of APs to serve the users, for example selecting the APs with the best channel conditions based on the channel gain.
However, this scheme can hardly adapt to dynamic changes in the users' downlink rate requirements: facing the heterogeneous downlink rate requirements of users in different periods, a fixed AP subset is prone to leaving the QoS of high-demand users unsatisfied while wasting the resources of low-demand users. Moreover, traditional static optimization methods such as greedy and heuristic algorithms, being designed only for short-term performance targets in specific scenarios, cannot solve the problem of long-term gain optimization in a dynamically changing demand environment. Further, the centralized CPU decision architecture adopted by conventional schemes has significant drawbacks. First, the CPU needs to collect the global channel state information (Channel State Information, CSI) and user requirements of all APs through the backhaul link, so the transmission overhead of the backhaul link is huge and grows linearly with the number of APs (a 6G massive-deployment scenario). Second, centralized computation depends on the single-point computing power of the CPU, so the decision latency is high and it is difficult to respond to the users' dynamic rate requirements in real time. Third, a single point of failure at the CPU can paralyze the whole system, so the robustness is insufficient.

Disclosure of Invention

Aiming at the characteristics and chal