CN-122028075-A - Unmanned aerial vehicle resource allocation method, system, equipment and medium based on multi-agent reinforcement learning
Abstract
A multi-agent reinforcement learning-based unmanned aerial vehicle resource allocation method, system, equipment and medium comprise the steps of: establishing a multi-unmanned-aerial-vehicle resource allocation optimization problem for a multi-mode unmanned aerial vehicle assisted integrated sensing and communication (ISAC) data service scenario; designing an unmanned aerial vehicle data service evaluation index based on average peak information age; expressing the optimization problem as a Markov decision problem and designing a state space, an action space and corresponding immediate reward functions; designing a conditional variational autoencoder to characterize the coupling relation between discrete actions and continuous actions; combining the MATD3 multi-agent reinforcement learning algorithm to obtain a coupling-relation-characterized MATD3 algorithm; training and updating the network parameters of the unmanned aerial vehicles with this algorithm according to the Markov decision model; and applying the trained networks to the data service scenario for communication resource allocation.
Inventors
- SHI JIA
- SUN WENTAO
- LI ZAN
- BAI ZIXUAN
- WEI QING
Assignees
- XIDIAN UNIVERSITY (西安电子科技大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260209
Claims (7)
- 1. The unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning is characterized by comprising the following steps: Step 1, establishing an optimization problem of multi-unmanned aerial vehicle resource allocation for a multi-mode unmanned aerial vehicle assisted integrated sensing and communication (ISAC) data service scenario, and simultaneously designing an unmanned aerial vehicle data service evaluation index of average peak information age; Step 2, expressing the optimization problem of multi-unmanned aerial vehicle resource allocation in the data service scenario established in step 1 as a Markov decision problem, and respectively designing a state space, an action space and a corresponding immediate reward function; Step 3, characterizing the coupling relation between discrete actions and continuous actions by designing a conditional variational autoencoder, combining the MATD3 multi-agent reinforcement learning algorithm to obtain a coupling-relation-characterized MATD3 algorithm, training and updating the network parameters of the unmanned aerial vehicles using this algorithm according to the Markov decision model obtained in step 2, and applying the trained unmanned aerial vehicles to the data service scenario to perform communication resource allocation.
- 2. The unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning according to claim 1, wherein the specific method of step 1 comprises: Step 1.1, making specific assumptions for the data service scenario: a plurality of unmanned aerial vehicles are deployed as aerial base stations in a designated area to provide data service support for ground users; during operation, each unmanned aerial vehicle either selects a radar sensing mode to locate ground users or communicates with a plurality of ground users based on an orthogonal frequency division multiplexing (OFDM) scheme; meanwhile, the positions of some ground users change continuously and their data demands vary over time; further, the unmanned aerial vehicles provide data service by sending data packets to the ground users, and each unmanned aerial vehicle, acting as an aerial base station, selects its own behavior mode, namely the radar sensing mode or the data transmission mode, while deciding its trajectory and performing power control in each mode; Step 1.2, establishing a communication link channel model between the unmanned aerial vehicle and a ground user, wherein the channel capacity of the communication link is: C_{m,n} = B·log2(1 + p_{m,n}·h_{m,n} / (σ²·B + I_{m,n})), wherein B is the bandwidth of the spectral sub-band, h_{m,n} is the probabilistic LoS channel gain of the communication link between unmanned aerial vehicle m and ground user n, p_{m,n} is the transmit power of the unmanned aerial vehicle on the communication link, σ² is the power spectral density of additive white Gaussian noise, and I_{m,n} is the co-channel interference on the communication link between unmanned aerial vehicle m and ground user n, specifically: I_{m,n} = Σ_{i=1,i≠m}^{M} Σ_{j=1}^{N} p_{i,j}·a_{i,j}·c_{j,n}·h_{i,n}, wherein M and N are the numbers of unmanned aerial vehicles and ground users respectively, p_{i,j} is the transmit power of the communication link between unmanned aerial vehicle i and ground user j, a_{i,j} is a binary value indicating whether unmanned aerial vehicle i is serving ground user j (1 for serving, 0 for not serving), and c_{j,n} is a binary value indicating whether ground user j and ground user n reuse the same frequency band (1 for reuse, 0 for no reuse); Step 1.3, designing the unmanned aerial vehicle data service evaluation index of average peak information age: at time t, the age of an update packet with time stamp u is Δ(t) = t − u; when the time stamp of the update equals the current time t, the age is zero and the update packet is considered fresh; the average peak information age of ground user i in time slot j is: Ā_{i,j} = (1/Φ_{i,j}) Σ_{k=1}^{Φ_{i,j}} Δ^{peak}_{i,j,k}, wherein Φ_{i,j} is the cumulative number of age peaks of ground user i in slot j and Δ^{peak}_{i,j,k} is the k-th peak age; finally, the average peak information age over all ground users and all time slots is: Ā = (1/(N·K)) Σ_{i=1}^{N} Σ_{j=1}^{K} Ā_{i,j}, wherein K is the total number of slots; Step 1.4, combining the scenario assumptions of step 1.1, the communication link channel model established in step 1.2 and the average peak information age evaluation index designed in step 1.3 to obtain the optimization problem of multi-unmanned aerial vehicle resource allocation in the data service scenario: minimize the average peak information age Ā over the unmanned aerial vehicle trajectories, behavior mode selection, user association and transmit power allocation, subject to constraints (a)-(g), wherein (a), (b) and (c) enforce the maximum flight speed and position constraints of the unmanned aerial vehicles; (d) formulates unmanned aerial vehicle behavior mode selection; (e) ensures that each ground user is served by only one unmanned aerial vehicle; (f) ensures that the number of ground users served by each unmanned aerial vehicle does not exceed the number of subcarriers; and (g) ensures that the power consumption of each unmanned aerial vehicle does not exceed the transmit power upper limit.
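The capacity and co-channel interference expressions of step 1.2 can be sketched numerically. The sketch below is illustrative only: the array names, shapes and the scalar loop form are assumptions, not the patent's implementation.

```python
import numpy as np

def cochannel_interference(p, a, c, h, m, n):
    """Co-channel interference seen by ground user n on its link from drone m.

    p[i, j]: transmit power of drone i toward user j
    a[i, j]: 1 if drone i serves user j, else 0
    c[j, n]: 1 if users j and n reuse the same sub-band, else 0
    h[i, n]: channel gain from drone i to user n
    """
    M, N = p.shape
    total = 0.0
    for i in range(M):
        if i == m:
            continue  # a drone does not interfere with its own link
        for j in range(N):
            total += p[i, j] * a[i, j] * c[j, n] * h[i, n]
    return total

def link_capacity(bandwidth, p_mn, h_mn, noise_psd, interference):
    """Shannon capacity of the drone-to-user link: B * log2(1 + SINR)."""
    sinr = p_mn * h_mn / (noise_psd * bandwidth + interference)
    return bandwidth * np.log2(1.0 + sinr)
```

With zero interference and p·h equal to the noise power, the SINR is 1 and the capacity equals the sub-band bandwidth, which is a quick sanity check of the formula.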
- 3. The unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning according to claim 1, wherein the specific method of step 2 comprises: Step 2.1, designing a state space, wherein the state space comprises the state changes of the ground users and the unmanned aerial vehicle, namely the relative distance between the unmanned aerial vehicle and each ground user, the data demand, and the relative distances between the unmanned aerial vehicle and the other unmanned aerial vehicles, specifically: s_m = {d^x_{m,n}, d^y_{m,n}, q_n, d^x_{m,i}, d^y_{m,i}}, wherein d^x_{m,n} and d^y_{m,n} are the relative distances between unmanned aerial vehicle m and ground user n on the x axis and y axis respectively, q_n is the data demand of ground user n, and d^x_{m,i} and d^y_{m,i} are the relative distances between unmanned aerial vehicle m and another unmanned aerial vehicle i on the x axis and y axis respectively; Step 2.2, designing an action space, wherein the action of the unmanned aerial vehicle consists of a discrete action and continuous actions: the discrete action selects the behavior mode, namely the radar sensing mode or the data transmission mode, and the continuous actions are the continuous parameters of each behavior mode; when the unmanned aerial vehicle selects the radar sensing mode, the radar power is fixed and the continuous actions only change the x-axis and y-axis position coordinates of the unmanned aerial vehicle; when the data transmission mode is selected, the continuous actions further comprise the transmit power allocated on each subcarrier; Step 2.3, designing the reward function as: r(t) = −ω_t·Ā(t), wherein ω_t is the average peak information age penalty factor of period t.
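The peak-age bookkeeping behind the reward of step 2.3 can be sketched as follows. This is a minimal sketch under assumptions: per-user integer ages, unit time slots, and hypothetical function names; the patent's actual reward shaping may differ.

```python
def step_age(age, delivered, dt=1):
    """Advance each user's age of information by dt; when a fresh packet
    arrives, record the peak age reached just before the update and reset to 0."""
    peaks, new_age = [], []
    for a, d in zip(age, delivered):
        if d:
            peaks.append(a + dt)   # age peaks just before the update is applied
            new_age.append(0)      # a fresh packet resets the age to zero
        else:
            peaks.append(None)     # no peak recorded this slot
            new_age.append(a + dt)
    return new_age, peaks

def reward(peaks_per_user, penalty):
    """Immediate reward: negative penalty-weighted average of recorded peaks."""
    recorded = [p for p in peaks_per_user if p is not None]
    if not recorded:
        return 0.0
    return -penalty * sum(recorded) / len(recorded)
```

For example, a user whose age was 3 and who receives a fresh packet contributes a peak of 4 (one slot later), while an unserved user's age simply keeps growing.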
- 4. The unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning according to claim 1, wherein the specific method of step 3 comprises: Step 3.1, designing a conditional variational autoencoder to characterize the coupling relation between discrete actions and continuous actions: a decodable latent action space is constructed to represent the dependency in the hybrid action space; an embedding table E is constructed to represent the K discrete actions, wherein each row e_k is a continuous vector of dimension d corresponding to one discrete action and k is the row index; a latent representation space of the continuous action is then built, and the dependency is modeled and embedded in an implicit manner using a conditional variational autoencoder consisting of an encoder and a decoder; for discrete action k, continuous action x_c and state s, the encoder q_φ, with φ as parameters and conditioned on the state s and the embedding vector e_k, encodes the continuous action x_c as a latent action z; the encoder uses the Gaussian distribution N(μ_φ, σ_φ), wherein μ_φ and σ_φ are the mean and standard deviation output by the encoder; any latent action is decoded deterministically by the decoder; furthermore, the decoder p_ψ, parameterized by ψ, decodes the continuous action x_c from the latent action z under the same conditions, so that any embedding query and any latent action are decoded into the hybrid action (k, x_c) by nearest-neighbor lookup in the embedding table and by the decoder, which defines the encoding and decoding process; then, using the environment dynamics, a squared-error loss based on state change prediction is designed to further refine the hybrid action representation, taking the data service amount of each unmanned aerial vehicle after performing the hybrid action (i.e. the data service amount delivered by unmanned aerial vehicle m to ground user n) as the state change prediction; with the parameters of the transition network, the decoder network and the state change prediction network, and using batches of states, hybrid actions and latent actions sampled from an experience pool, the embedding table E and the conditional VAE are trained by minimizing a loss function comprising the squared reconstruction error of the VAE-reconstructed continuous action and the Kullback-Leibler (KL) divergence between the encoder distribution and the standard Gaussian distribution; furthermore, the squared error between the state change prediction and its target is minimized, so that the total training loss of the conditional variational autoencoder is the sum of these terms; in addition, two mechanisms are adopted, namely latent action constraint and training experience correction, which handle unreliable latent actions and outdated off-policy training experience respectively: the latent actions are restricted to a reasonable region, i.e. for each latent action in the experience pool, boundaries are obtained by computing a centered range, and each dimension of the latent action is then rescaled to the bounded range; the training experience correction mechanism checks the timeliness of latent actions in the experience pool and corrects outdated latent actions using the latest latent policy; Step 3.2, combining the MATD3 multi-agent reinforcement learning algorithm to obtain the coupling-relation-characterized MATD3 algorithm, and training and updating the network parameters of the unmanned aerial vehicles using this algorithm according to the Markov decision model obtained in step 2: the latent policy is trained by the MATD3 algorithm; the Actor network of each unmanned aerial vehicle outputs latent action vectors, and each agent uses its learned conditional variational autoencoder, which characterizes the coupling relation between discrete and continuous actions, to decode the latent action into the hybrid action; each agent has a twin Critic network taking global information as input to approximate the state-action value function; a batch of experience is randomly selected from the experience pool as training samples, and the mean squared error loss function of each Critic network is computed with respect to the estimated target Q value, expressed as y = r + γ·min_{i=1,2} Q'_i(s', ã'), wherein the primed quantities denote the parameters of the target networks; the Actor network of each agent is trained to maximize the state-action value of the latent action output by the network, which is updated by the policy gradient, so the loss function of the Actor network is the negative of the Critic's value of the Actor's latent action; Step 3.3, applying the trained unmanned aerial vehicles to the data service scenario to allocate communication resources: with the hybrid action space designed, each unmanned aerial vehicle is provided with an Actor network and a conditional variational autoencoder; the Actor network outputs a latent action, which is input into the trained conditional variational autoencoder for decoding to obtain the discrete action and the continuous actions; the discrete action determines the behavior mode, namely the radar sensing mode or the data transmission mode, and the continuous actions provide the continuous parameters of each mode; after observing the current state of the environment, the unmanned aerial vehicle selects the radar sensing mode or the data transmission mode according to the discrete action and then obtains the corresponding continuous parameters from the continuous actions; when the unmanned aerial vehicle selects the radar sensing mode, it activates the radar to sense the ground user positions and adjusts its own position; when the unmanned aerial vehicle selects the data transmission mode, it transmits data to the ground users on each channel with the transmit power obtained from the continuous actions, and adjusts its own position.
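The hybrid-action decoding of step 3.1 (nearest-neighbor lookup in the embedding table for the discrete part, a learned decoder for the continuous part) can be sketched as below. The linear-plus-tanh "decoder" is a stand-in for the trained conditional-VAE decoder, and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D_EMB, D_LAT, D_ACT = 3, 4, 2, 2   # discrete actions, embedding/latent/continuous dims
embed_table = rng.normal(size=(K, D_EMB))        # one row per discrete action
W_dec = rng.normal(size=(D_LAT + D_EMB, D_ACT))  # stand-in for the trained decoder weights

def nearest_discrete(e_query):
    """Nearest-neighbor lookup in the embedding table -> discrete action index."""
    dists = np.linalg.norm(embed_table - e_query, axis=1)
    return int(np.argmin(dists))

def decode(z, e_query):
    """Decode (latent action, embedding query) into a hybrid action.

    Discrete part: nearest embedding row. Continuous part: a linear stand-in
    for the conditional-VAE decoder, squashed to [-1, 1] like bounded controls.
    """
    k = nearest_discrete(e_query)
    cond = np.concatenate([z, embed_table[k]])   # condition on the matched embedding
    x_c = np.tanh(cond @ W_dec)
    return k, x_c

def clip_latent(z, lo, hi):
    """Latent-action constraint: keep each latent dimension in a bounded range."""
    return np.clip(z, lo, hi)
```

The latent-action constraint in the claim computes the bounds from a centered range over the experience pool; `clip_latent` only shows the final rescaling step.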
- 5. A multi-agent reinforcement learning-based unmanned aerial vehicle resource allocation system based on the method of claim 1, comprising: an optimization problem establishing module, configured to establish an optimization problem of multi-unmanned aerial vehicle resource allocation for a multi-mode unmanned aerial vehicle assisted integrated sensing and communication (ISAC) data service scenario, and simultaneously design an unmanned aerial vehicle data service evaluation index of average peak information age; a Markov decision problem expression module, configured to express the optimization problem of multi-unmanned aerial vehicle resource allocation in the data service scenario as a Markov decision problem, and respectively design a state space, an action space and a corresponding immediate reward function; and an unmanned aerial vehicle communication resource allocation module, configured to characterize the coupling relation between discrete actions and continuous actions by designing a conditional variational autoencoder, obtain a coupling-relation-characterized MATD3 algorithm by combining the MATD3 multi-agent reinforcement learning algorithm, train and update the network parameters of the unmanned aerial vehicles using this algorithm according to the Markov decision model, and apply the trained unmanned aerial vehicles to the data service scenario for communication resource allocation.
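The clipped double-Q target used by the MATD3 critic update in step 3.2 of claim 4 can be sketched in scalar form. The function names and the scalar (rather than batched-network) form are illustrative assumptions.

```python
import numpy as np

def td_target(r, gamma, q1_next, q2_next, done):
    """TD3-style target: y = r + gamma * min(Q1', Q2'), zeroed at episode end.

    Taking the minimum of the two target critics curbs Q-value overestimation.
    """
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def critic_loss(q_pred, y):
    """Mean squared TD error for one critic network."""
    return float(np.mean((np.asarray(q_pred) - np.asarray(y)) ** 2))
```

The actor loss in the claim is then simply the negative of one critic's value at the actor's latent action, so maximizing value corresponds to gradient descent on that loss.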
- 6. Unmanned aerial vehicle resource allocation equipment based on multi-agent reinforcement learning, characterized by comprising: a memory, storing a computer program of the unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning according to any one of claims 1 to 4; and a processor, configured to implement the unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning according to any one of claims 1 to 4 when executing the computer program.
- 7. A computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program when executed by a processor is capable of implementing a multi-agent reinforcement learning-based unmanned aerial vehicle resource allocation method according to any one of claims 1 to 4.
Description
Unmanned aerial vehicle resource allocation method, system, equipment and medium based on multi-agent reinforcement learning Technical Field The invention belongs to the technical field of unmanned aerial vehicle wireless communication, and particularly relates to an unmanned aerial vehicle resource allocation method, system, equipment and medium based on multi-agent reinforcement learning. Background In recent years, unmanned aerial vehicles have shown the advantages of multi-functionality, high maneuverability, flexible deployment, low cost and a high probability of establishing line-of-sight links, and research on unmanned aerial vehicles as aerial base stations has attracted wide attention worldwide. On the one hand, small unmanned aerial vehicles play an important role in realizing airspace coverage in 6G space-air-ground integrated systems and can serve as airborne-assisted communication platforms; for example, in high-density communication scenarios, an unmanned aerial vehicle can act as a temporary base station or relay station to support wireless communication and enhance user capacity. On the other hand, in the field of emergency communication, although fixed communication infrastructure can generally handle daily communication loads, communication can be supported by unmanned aerial vehicles when an emergency or an unusual temporary scenario occurs. For example, during network reconstruction after a major natural disaster, unmanned aerial vehicles can quickly reposition themselves to provide rapid post-disaster wireless service recovery. These characteristics provide an opportunity to establish an efficient, flexible and reliable drone-assisted communication network when fixed ground infrastructure is insufficient or malfunctioning.
However, resource allocation in current data service scenarios still faces many challenges. On the one hand, the burstiness and randomness of ground users' data demands pose significant challenges to the fairness of data services and to improving the service quality of ground users. On the other hand, prior-information assumptions about ground users are too idealized: although some studies consider dynamic changes of ground users, they only derive unmanned aerial vehicle downlink transmission strategies under the condition that ground user information (such as position distribution) is known a priori, without considering how to sense the states of the ground users. Meanwhile, because discrete and continuous optimization variables coexist in the resource allocation process, the optimization difficulty is further increased, and characterizing and decoupling the coupling relationship between discrete and continuous optimization variables is a major difficulty. Unmanned aerial vehicle flight trajectory planning, mode selection and transmit power allocation are the basis for ensuring energy-efficient unmanned aerial vehicle assisted wireless networks. The trajectory of the unmanned aerial vehicle should be precisely designed to meet the dynamic data service requirements of the ground users in real time. Meanwhile, the transmit power of the unmanned aerial vehicle should be properly allocated to adapt in real time to the dynamic changes of the channel state information between the unmanned aerial vehicle and the ground users, and to reduce co-channel interference between unmanned aerial vehicles as much as possible. In addition, the unmanned aerial vehicle needs the ability to select a behavior mode in a timely manner to sense ground user information, thereby enhancing its ability to meet ground user data requirements.
In summary, it is urgently important to develop more practical methods to effectively advance unmanned aerial vehicle assisted wireless networks. The patent document with application number 202411903606.6 discloses an unmanned aerial vehicle resource allocation method based on multi-agent reinforcement learning, which realizes hybrid action decisions of agents by separately designing a discrete Actor network and a continuous Actor network, and introduces the MATD3 deep reinforcement learning algorithm for network training to realize mode selection, path planning and power allocation for multiple unmanned aerial vehicles. However, that method decides discrete actions and continuous actions separately and does not consider the coupling relationship between them, which often results in poor model performance after training convergence. Disclosure of Invention In order to overcome the defects of the prior art, the invention aims to provide an unmanned aerial vehicle resource allocation method, system, equipment and medium based on multi-agent reinforcement learning, which enable an unmanned aerial vehicle base station to select a behavior mode and allocate communication resources in a dynamic random data service scenario by designing discrete and