CN-115665878-B - Air interface spectrum efficiency improving method for multiple intelligent agents based on reinforcement learning


Abstract

The invention discloses a reinforcement-learning-based method for improving the air interface spectrum efficiency of multiple agents. It relates to the technical field of high-throughput communication systems and uses the attention mechanism of the Transformer structure to solve the problems of data dimension explosion and low sample efficiency in wireless resource allocation schemes based on deep reinforcement learning. By adopting deep reinforcement learning combined with the attention mechanism of the Transformer structure, the invention can mine and analyze the correlation of user position distribution in a multi-user cellular network and the allocation relationship among resources, avoid co-channel interference to a certain extent, improve the system spectrum efficiency, and solve the problems of data dimension explosion and low sample efficiency in deep-reinforcement-learning-based resource allocation schemes.

Inventors

  • YU HANG
  • YI LONGTENG
  • FENG XUAN
  • DONG ZANYANG
  • QIN PENGFEI
  • QI KAIQIANG
  • ZHANG CHENG
  • ZHOU YEJUN

Assignees

  • China Academy of Space Technology (中国空间技术研究院)

Dates

Publication Date
20260512
Application Date
20221018

Claims (9)

  1. A method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning, characterized in that the resource allocation problem in a multi-user cellular network is modeled as a dual-sequence decision process and solved by combining a deep reinforcement learning tool with a Transformer, the method comprising: mining the correlation of the user position distribution and the allocation relationship among resources using the attention mechanism of the Transformer structure, so as to obtain a multi-user resource allocation decision within a single transmission time interval; and performing strategy learning through dynamic interaction between an agent and the cellular network environment in deep reinforcement learning, so as to obtain a resource allocation scheme over a plurality of consecutive transmission time intervals; wherein the radio resource allocation method based on deep reinforcement learning with a Transformer structure comprises the following steps: S1, constructing a resource allocation strategy model based on deep reinforcement learning with a Transformer structure; S2, collecting, by the agent, the observation state of the multi-user cellular network; S3, mapping the multi-dimensional request information of each user into a one-dimensional user tag; S4, inputting the user sequence formed by the user tags into the Transformer network and outputting an allocation decision for each resource block; S5, executing the resource allocation decision and acquiring feedback reward information from the multi-user cellular network; S6, evaluating, by the agent, the value of the current environment state and of the resource allocation action; S7, repeating steps S2 to S6 to collect data and compute advantages; S8, training the resource allocation strategy network model offline using the collected data; S9, fine-tuning the strategy network model trained in step S8; and S10, outputting an optimal resource allocation scheme based on the state at subsequent moments.
  2. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S1, a resource allocation strategy model based on deep reinforcement learning with a Transformer structure is built at the central controller: a Transformer-based resource allocation model is built for a single transmission time interval, and a deep-reinforcement-learning-based resource allocation model is built for a plurality of consecutive transmission time intervals.
  3. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S2, the central controller acts as the agent and collects the observation states of the multi-user cellular network, the observation states mainly comprising the state of each user, the state of each resource block, and the request information of each user; these pieces of information are used together as the state of the multi-user cellular network.
  4. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S3, part of the information in each user request is selected as the key factors affecting the resource allocation effect and a user tag is extracted from it, so as to prevent the resource allocation problem from falling into the curse of dimensionality; the set of user tags forms a user sequence that is input into the Transformer network.
  5. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S4 the resource allocation action is generated by the Transformer network: the user tag set is input into the encoder of the Transformer structure and the start bit of resource allocation is input into the decoder; the attention mechanism is used to capture the correlation between user requests and resource allocation, and the allocation result of the first resource block is output by sampling; this result is then combined with the start bit and used as the decoder input to obtain the allocation result of the second resource block; the cycle repeats until the allocation of all resource blocks is obtained.
  6. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S5, according to the resource allocation scheme given in step S4, each user transmits data with a given power on its assigned resource block, and reward information concerning system spectrum efficiency and user fairness is obtained, in which one term is the spectral efficiency of the system normalized by its theoretical limit, another term represents user fairness, and each of the two terms is given its own weight coefficient.
  7. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S6, the value of the observed state is evaluated on the basis of a Critic network; and in step S7, a plurality of pieces of training data are collected and stored in a data buffer, from which the advantage function can be calculated.
  8. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S8, the training data in the data buffer is used to update the network parameters so that the resource allocation strategy gradually converges to the optimum; the loss function of the Actor network is defined in terms of the Actor network parameters and the ratio of the probabilities of the new strategy to the old strategy, to which a clipping function is applied; the loss function of the Critic network is defined in terms of the Critic network parameters.
  9. The method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning according to claim 1, characterized in that in step S9, the trained strategy model interacts continuously with the multi-user cellular network, and the resource allocation strategy model is fine-tuned online at intervals using newly collected data so as to guarantee a real-time optimal resource allocation strategy; and in step S10, the central controller collects the state information at subsequent moments and inputs it into the strategy model of step S9, thereby obtaining the optimal resource allocation scheme.
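The encoder-decoder allocation loop described in claims 1, 4, and 5 can be sketched in miniature. The NumPy toy below is an illustrative assumption, not the patent's actual network: the embedding dimensions, the single untrained attention layer, and the random embeddings are all made up for demonstration. It shows only the mechanism: user tags are encoded with self-attention, and the decoder autoregressively "points" one resource block at a time to a user, feeding each decision back as the next decoder input (steps S3 and S4).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def allocate_resource_blocks(user_tags, n_blocks, d=16):
    """Autoregressively assign each resource block to one user.

    user_tags: 1-D labels distilled from multi-dimensional request
    info (step S3). Each decoding step attends over the encoded user
    sequence and emits one block's allocation (step S4).
    """
    n_users = len(user_tags)
    embed = rng.normal(size=(n_users + 1, d))   # last row: start token
    # "Encoder": one self-attention pass over the user sequence
    enc = attention(embed[:n_users], embed[:n_users], embed[:n_users])
    dec_input = embed[n_users:n_users + 1]      # start bit of allocation
    decisions = []
    for _ in range(n_blocks):
        ctx = attention(dec_input[-1:], enc, enc)  # decoder cross-attention
        probs = softmax(ctx @ enc.T)               # pointer over users
        u = int(probs.argmax())                    # greedy; training would sample
        decisions.append(u)
        dec_input = np.vstack([dec_input, enc[u]]) # feed decision back in
    return decisions

alloc = allocate_resource_blocks(user_tags=np.arange(5), n_blocks=4)
print(alloc)  # one chosen user index per resource block
```

In the real scheme the embeddings and attention weights would be learned (steps S8 and S9) rather than random, and decoding would sample from `probs` during training to explore allocations.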
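Claims 7 and 8 describe the standard actor-critic machinery of PPO: a clipped probability ratio for the Actor loss, value regression for the Critic, and advantages computed from a buffer of collected transitions. A minimal sketch of those pieces, assuming a one-step advantage estimate and hypothetical variable names (the patent's exact formulas are not reproduced in the source):

```python
import numpy as np

def ppo_actor_loss(new_logp, old_logp, advantage, eps=0.2):
    """Clipped PPO surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    where r is the new-to-old policy probability ratio (claim 8)."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(np.minimum(ratio * advantage, clipped * advantage))

def critic_loss(values, returns):
    """Critic regression: mean squared error to the empirical return."""
    return np.mean((values - returns) ** 2)

def advantage_estimate(rewards, values, gamma=0.99):
    """One-step advantage A_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    computed over a buffer of collected transitions (claim 7)."""
    next_v = np.append(values[1:], 0.0)
    return rewards + gamma * next_v - values

rewards = np.array([1.0, 0.5, 0.8])   # per-step reward from the environment
values  = np.array([0.9, 0.7, 0.6])   # Critic's state-value estimates
adv = advantage_estimate(rewards, values)
new_logp = np.log(np.array([0.3, 0.4, 0.5]))
old_logp = np.log(np.array([0.25, 0.5, 0.5]))
print(ppo_actor_loss(new_logp, old_logp, adv))
print(critic_loss(values, rewards))
```

When the new and old policies coincide (ratio of 1 everywhere), the actor loss reduces to the negative mean advantage, which is a quick sanity check on the clipping logic.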

Description

Air interface spectrum efficiency improving method for multiple intelligent agents based on reinforcement learning

Technical Field

The invention relates to the technical field of high-throughput communication systems, and in particular to a method for improving the air interface spectrum efficiency of multiple intelligent agents based on reinforcement learning.

Background

To avoid interference between adjacent beams, traditional multi-beam satellites allocate the frequency band of each beam according to the four-color theorem, ensuring that adjacent beams do not use the same frequency and thereby reducing co-channel interference. To achieve a gigabit high-throughput satellite system that maximizes the available data rate and spectrum utilization, a full frequency reuse scheme may be employed, but this scheme introduces serious co-channel interference problems. Dynamic resource allocation is considered an efficient means of interference management. The following resource allocation methods currently exist for multi-beam satellite communication: (1) Conventional base-station-level radio resource allocation methods. The central idea is to divide a cell into a central area and an edge area and allocate specific radio resources to each. For example, soft frequency reuse and partial frequency reuse better accommodate the distribution of traffic within and around the cell by adjusting the power threshold ratio of the sub-carriers to the main carrier. Although this scheme improves the throughput of cell-edge users, the power threshold ratio must be readjusted whenever the traffic distribution changes, making it difficult to adapt to a dynamic wireless network environment. (2) Traditional user-level wireless resource allocation methods, which comprise the polling algorithm, the maximum carrier-to-interference ratio algorithm, and the proportional fairness algorithm.
The polling algorithm pursues maximum fairness: it periodically distributes resources to users in a fixed order and is simple to implement, but it ignores factors such as service characteristics and user priority. The maximum carrier-to-interference ratio algorithm pursues maximum performance: within a scheduling period it allocates all resources to the user with the best signal quality, achieving the highest resource utilization while entirely ignoring fairness. The proportional fairness algorithm is a compromise between the polling and maximum carrier-to-interference ratio algorithms. (3) Wireless resource allocation methods based on deep reinforcement learning. Deep reinforcement learning integrates the perception capability of deep learning with the decision capability of reinforcement learning, remedying the lack of dynamism and intelligence in traditional resource allocation methods. It models the wireless resource allocation problem as continuous dynamic interaction between an agent and the wireless network environment, learning the dynamics of the wireless environment from the feedback the environment provides, so that optimal resource allocation decisions can be made. However, this approach generally suffers from data dimension explosion and a large demand for training data, so it is difficult to achieve ideal results when the number of users is large and the services are complex.
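The three user-level schedulers described above can each be captured in a few lines. The sketch below is purely illustrative (the carrier-to-interference ratio and average-rate values are made up), showing why proportional fairness sits between polling and maximum carrier-to-interference ratio: it weighs instantaneous channel quality against each user's historical throughput.

```python
import numpy as np

def round_robin(step, n_users):
    """Polling: serve users in fixed cyclic order (maximum fairness)."""
    return step % n_users

def max_cir(cir):
    """Maximum carrier-to-interference ratio: always pick the
    best-channel user (maximum performance, zero fairness)."""
    return int(np.argmax(cir))

def proportional_fair(cir, avg_rate, eps=1e-9):
    """Compromise: rank by instantaneous quality divided by the
    user's historical average rate, so starved users catch up."""
    return int(np.argmax(cir / (avg_rate + eps)))

cir = np.array([3.0, 10.0, 1.5])   # per-user carrier-to-interference ratio
avg = np.array([0.5, 5.0, 0.2])    # per-user historical average throughput
print(round_robin(7, 3))           # -> 1 (fixed cyclic order)
print(max_cir(cir))                # -> 1 (best channel wins regardless)
print(proportional_fair(cir, avg)) # -> 2 (starved user gets priority)
```

Note how user 2 has the worst channel yet wins under proportional fairness because its historical rate is so low; max C/I would never schedule it.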
Although existing radio resource allocation methods can avoid interference to a certain extent and thereby improve system spectrum efficiency, some disadvantages remain: (1) Traditional resource allocation methods have high computational complexity in the optimization process, their iterative algorithms take a long time, and they lack the dynamism and intelligence needed to adapt to a dynamic wireless network environment. (2) Resource allocation methods based on deep reinforcement learning depend on a large amount of interaction data and can suffer from the curse of dimensionality and data explosion in large-scale networks. The invention aims to solve the problems of data dimension explosion and low sample efficiency in wireless resource allocation schemes based on deep reinforcement learning.

Disclosure of Invention

The invention aims to solve the above problems and provides a method for improving the air interface spectrum efficiency of multiple intelligent agents based on reinforcement learning. To achieve this purpose, the present invention adopts the following technical scheme: a method for improving the air interface spectrum efficiency of multiple agents based on reinforcement learning, which models the resource allocation problem in a multi-user cellular network as a dual-sequence decision process and solves it by combining a deep reinforcement learning tool with a Transformer, comprising the following steps: Di