CN-122021356-A - Unmanned aerial vehicle offline reinforcement learning training method combining implicit Q learning
Abstract
The invention discloses an unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning, which comprises the following steps: building an unmanned aerial vehicle simulation environment and a reinforcement learning training environment; in combination with the unmanned aerial vehicle offline reinforcement learning simulation environment, building a rule policy agent through an unmanned aerial vehicle maneuver decision statistical method; making the rule policy agent interact with the unmanned aerial vehicle offline reinforcement learning simulation environment, so that a first unmanned aerial vehicle and a second unmanned aerial vehicle both maneuver using the rule policy agent; collecting the interaction training data and storing it in an offline reinforcement learning experience replay pool; building a Soft Actor-Critic algorithm module combined with implicit Q-learning; and randomly extracting batch data from the experience replay pool and inputting it into the IQL-SAC unmanned aerial vehicle offline reinforcement learning algorithm module to train the neural networks. By performing offline reinforcement learning training with a Soft Actor-Critic algorithm combined with implicit Q-learning, the invention improves the offline reinforcement learning training effect and the maneuver decision capability of the unmanned aerial vehicle.
Inventors
- JIAO PENGPENG
- GU ZHENGHUI
- YUAN YINLONG
Assignees
- South China University of Technology (华南理工大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-13
Claims (10)
- 1. An unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning, characterized by comprising the following steps: S1, constructing an unmanned aerial vehicle offline reinforcement learning simulation environment based on an unmanned aerial vehicle dynamics and kinematics model, and constructing an offline reinforcement learning experience replay pool based on a data standard; S2, in combination with the unmanned aerial vehicle offline reinforcement learning simulation environment, constructing a rule policy agent through an unmanned aerial vehicle maneuver decision statistical method, so that the agent can output a discrete maneuver to the environment according to the current state variable information; S3, using the rule policy agent to interact with the unmanned aerial vehicle offline reinforcement learning simulation environment, the unmanned aerial vehicles of both interacting parties making maneuver decisions with the rule policy agent, while collecting the interaction training data and storing it into the offline reinforcement learning experience replay pool as unmanned aerial vehicle offline reinforcement learning experience data; S4, based on the unmanned aerial vehicle offline reinforcement learning experience data obtained in step S3, constructing an IQL-SAC unmanned aerial vehicle offline reinforcement learning algorithm module, i.e., a Soft Actor-Critic algorithm module combined with implicit Q-learning; S5, randomly extracting batch data from the experience replay pool, and inputting the batch data into the IQL-SAC unmanned aerial vehicle offline reinforcement learning algorithm module to train the neural network parameters of the policy network, the action-state value network and the state value network.
- 2. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 1, wherein the offline reinforcement learning environment construction of step S1 comprises: S11, establishing a single unmanned aerial vehicle maneuver simulation model based on unmanned aerial vehicle dynamics and kinematics models, and expanding the single unmanned aerial vehicle maneuver simulation model into a maneuver decision reinforcement learning simulation environment for two interacting unmanned aerial vehicles, wherein the input of the simulation environment is the maneuver control variables of each unmanned aerial vehicle, and the output of the simulation environment is the simulated situation information set of each unmanned aerial vehicle; S12, based on the input and output variables of the unmanned aerial vehicle offline reinforcement learning simulation environment, adapting them to the offline reinforcement learning experience replay training data standard, specifically state variable adaptation, action variable adaptation and reward function setting; S13, constructing an offline reinforcement learning experience replay pool based on the state variable, action variable and reward function variable data, for collecting data during offline reinforcement learning training.
- 3. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 2, wherein the simulated situation information set of each single unmanned aerial vehicle comprises three-dimensional coordinate position information, speed scalar information, pitch angle information and heading angle information, and the maneuver input variables of each unmanned aerial vehicle maneuver simulation model comprise tangential overload, normal overload and roll angle, these input variables serving as the maneuver variables for controlling the unmanned aerial vehicle.
- 4. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 1, wherein the rule policy agent construction of step S2 comprises: S21, combining the three-dimensional model of the unmanned aerial vehicle with the offline reinforcement learning environment, and designing a library of low-level maneuver macro actions; S22, constructing the rule policy agent by the unmanned aerial vehicle maneuver decision statistical method, as follows: selecting one maneuver from the low-level maneuver macro action library for simulated action evolution, deducing the position information of the unmanned aerial vehicle at the next moment, calculating the angle value, distance value, height value and speed value of the maneuver macro action from that position information, and then calculating the value mean and standard deviation of the maneuver macro action; S23, selecting the maneuver with the largest action value mean as the output of the rule policy agent and finally outputting that specific maneuver; if several maneuvers share the same largest action value mean, selecting among them the maneuver with the smallest standard deviation.
- 5. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 1, wherein the collection of interaction training data in step S3 comprises: S31, making the initialized rule policy agent interact with the unmanned aerial vehicle offline reinforcement learning simulation environment, the unmanned aerial vehicle using the rule policy agent to make maneuver decisions; S32, initializing the unmanned aerial vehicle in the unmanned aerial vehicle simulation environment with the initialization data; S33, processing the current unmanned aerial vehicle situation information into offline reinforcement learning standard state variables; S34, the initialized rule policy agent generating, based on the offline reinforcement learning standard state variables, the rule policy actions under the current state variables to form a rule policy action set; S35, inputting the rule policy action set into the unmanned aerial vehicle simulation environment for deduction to obtain the next-step state variables; S36, calculating the reward function value from the next-step state variables; S37, forming the offline reinforcement learning training data generated in steps S32 to S35 into four-tuples, and storing them in the offline reinforcement learning experience replay pool.
- 6. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 5, wherein determining whether collection of the offline reinforcement learning experience replay pool is complete is specifically: if the experience replay pool has reached its maximum capacity, building the IQL-SAC unmanned aerial vehicle offline reinforcement learning algorithm module in step S4; if the experience replay pool has not reached its maximum capacity and the current simulation episode has reached the maximum step count, returning to step S32, reinitializing the unmanned aerial vehicle for a new simulation, and continuing with new data collection; if the experience replay pool has not reached its maximum capacity and the current simulation episode has not reached the maximum step count, returning to step S33 and continuing data collection in the current simulation episode.
- 7. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 1, wherein in step S4 the IQL-SAC unmanned aerial vehicle offline reinforcement learning algorithm module comprises three neural network modules, namely a policy network (Actor) module, an action-state value network (Critic) module and a state value network (Value) module, wherein the Actor module outputs the mean and variance parameters of a Gaussian distribution according to the input state variables and maps actions to a continuous action space through an arc tangent function, the Critic module outputs the corresponding action value estimate according to the input state-action pair, and the Value module independently predicts the state value and approximates the optimal implicit value function by means of expectile regression.
- 8. The unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning according to claim 7, wherein the IQL-SAC unmanned aerial vehicle offline reinforcement learning algorithm comprises: S51, randomly extracting batch data from the offline reinforcement learning experience replay pool, the batch data comprising state variables, action variables, reward function values and next-step state variables; S52, inputting the state-action data of the batch into the two Critic networks of the Critic module to obtain the corresponding action values, inputting the state variable data of the batch into the Value network of the Value module to obtain the state values, and updating the Value network by expectile regression; S53, inputting the next-step state variable data of the batch into the target Value network of the Value module to obtain the next-step state values, inputting the state-action data of the batch into the two Critic networks of the Critic module to obtain the corresponding action values, and updating the Critic networks based on a mean squared error loss function; S54, inputting the state-action data of the batch into the two Critic networks of the Critic module to obtain the corresponding action values, inputting the state variable data of the batch into the Value network of the Value module to obtain the state values, calculating the advantage values according to the advantage function, and updating the policy network based on a behavior cloning loss weighted by the advantage function; S55, updating the target Value network of the Value module and the target Critic network of the Critic module by a soft update mechanism; repeating steps S51 to S55 until a preset convergence condition is met or a preset maximum number of training steps is reached.
- 9. A computer device comprising a memory and a processor, the memory being electrically connected to the processor, the memory storing a computer program, wherein the computer program, when executed by the processor, causes the processor to implement the method as claimed in any one of claims 1 to 8.
- 10. A computer readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the processor implements the method according to any one of claims 1 to 8.
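As a concrete illustration of the situation and maneuver variables listed in claim 3, the per-vehicle state (three-dimensional position, speed scalar, pitch angle, heading angle) and the maneuver input (tangential overload, normal overload, roll angle) could be modeled as simple records. This is a hypothetical sketch; the field names and units are assumptions, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class UavState:
    """Simulated situation information set of one unmanned aerial vehicle (claim 3)."""
    x: float        # three-dimensional coordinate position
    y: float
    z: float
    speed: float    # speed scalar information (m/s, assumed)
    pitch: float    # pitch angle information (rad, assumed)
    heading: float  # heading angle information (rad, assumed)

@dataclass
class UavAction:
    """Maneuver input control variables of the simulation model (claim 3)."""
    nx: float    # tangential overload
    nz: float    # normal overload
    roll: float  # roll angle (rad, assumed)

# Example instances, e.g. one step of a transition stored in the replay pool.
s = UavState(x=0.0, y=0.0, z=1000.0, speed=200.0, pitch=0.0, heading=0.0)
a = UavAction(nx=1.0, nz=2.0, roll=0.3)
```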
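The selection rule of claim 4 (step S23) — pick the macro action with the largest value mean, breaking ties by the smallest standard deviation — can be sketched as follows. The macro action names and sample values are illustrative assumptions:

```python
import statistics

def select_macro_action(stats):
    """Pick the maneuver with the largest value mean (claim 4, S23);
    ties are broken by the smallest standard deviation."""
    best_key, best_name = None, None
    for name, values in stats.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        key = (mean, -std)  # maximize mean first, then minimize std
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name

# Hypothetical angle/distance/height/speed value samples per macro action.
macro_value_samples = {
    "climb": [0.8, 0.6, 0.9, 0.7],  # mean 0.75, small spread
    "turn":  [0.9, 0.9, 0.6, 0.6],  # mean 0.75, larger spread: loses tie-break
    "dive":  [0.5, 0.5, 0.5, 0.5],  # mean 0.50
}
chosen = select_macro_action(macro_value_samples)  # -> "climb"
```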
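The collection control flow of claims 5 and 6 — store (state, action, reward, next state) four-tuples until either the pool is full (go to step S4), the episode hits its step cap (return to S32), or neither (return to S33) — can be mirrored in a minimal sketch. Class and label names are assumptions for illustration:

```python
class ReplayPool:
    """Minimal offline-RL experience replay pool (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []  # (state, action, reward, next_state) four-tuples (S37)

    def add(self, transition):
        if len(self.data) < self.capacity:
            self.data.append(transition)

    def full(self):
        return len(self.data) >= self.capacity

def next_collection_step(pool, step, max_steps):
    """Branching described in claim 6."""
    if pool.full():
        return "build_IQL_SAC_module"    # proceed to step S4
    if step >= max_steps:
        return "reinitialize_simulation"  # return to step S32
    return "continue_episode"             # return to step S33

pool = ReplayPool(capacity=2)
pool.add(((0,), 0, 0.0, (1,)))
decision_1 = next_collection_step(pool, step=5, max_steps=5)  # episode hit step cap
pool.add(((1,), 1, 1.0, (2,)))
decision_2 = next_collection_step(pool, step=1, max_steps=5)  # pool now full
```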
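The three network modules of claim 7 can be sketched with a toy, untrained numpy stand-in: an Actor producing Gaussian mean/log-std parameters and squashing sampled actions into a bounded range via an arc tangent mapping, a Critic scoring state-action pairs, and a Value network predicting state values. Layer sizes, dimensions and the (-1, 1) action range are assumptions for illustration, not the patent's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim):
    """Random two-layer MLP parameters (toy stand-in for a trained network)."""
    return (rng.normal(0, 0.1, (in_dim, hidden)), np.zeros(hidden),
            rng.normal(0, 0.1, (hidden, out_dim)), np.zeros(out_dim))

def mlp_forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

state_dim, action_dim = 6, 3  # assumed: 6 situation variables, 3 maneuver controls

# Actor module: mean and log-std of a Gaussian over unsquashed actions
actor = init_mlp(state_dim, 16, 2 * action_dim)
# Critic module: Q(s, a) from a concatenated state-action pair
critic = init_mlp(state_dim + action_dim, 16, 1)
# Value module: V(s), trained by expectile regression in IQL
value = init_mlp(state_dim, 16, 1)

def sample_action(state):
    out = mlp_forward(actor, state)
    mean, log_std = out[:action_dim], out[action_dim:]
    raw = mean + np.exp(log_std) * rng.standard_normal(action_dim)
    # squash into (-1, 1); claim 7 maps actions via an arc tangent function
    return (2.0 / np.pi) * np.arctan(raw)

s = rng.standard_normal(state_dim)
a = sample_action(s)
q = mlp_forward(critic, np.concatenate([s, a]))[0]
v = mlp_forward(value, s)[0]
```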
Description
Unmanned aerial vehicle offline reinforcement learning training method combining implicit Q learning
Technical Field
The invention belongs to the technical field of artificial-intelligence-driven unmanned aerial vehicle autonomous control and decision-making, and relates to an unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning.
Background
In recent years, unmanned aerial vehicles have been increasingly used in industry, agriculture, logistics and other fields due to their high maneuverability, high safety, low operating cost and excellent capability in complex task execution. Deep Reinforcement Learning (DRL) has become a powerful tool for solving the autonomous maneuver decision problem of unmanned aerial vehicles because it adaptively learns an optimal policy through interaction with the environment. A large body of research shows that modeling the policy of the target unmanned aerial vehicle with a reinforcement learning model, combined with interactive training, can effectively improve the autonomous decision-making capability of the unmanned aerial vehicle in complex flight environments.
However, existing reinforcement learning methods still have the following defects. First, the cold start problem is prominent: traditional reinforcement learning depends on a large number of online interaction samples for training, online training of an unmanned aerial vehicle in a real environment is costly and carries a high safety risk, and in the initial stage a reinforcement learning algorithm struggles to obtain effective experience, so convergence is slow and training is unstable. Second, unmanned aerial vehicle maneuver decision data usually come from real flight tests or high-fidelity simulation platforms, so their acquisition cost is high and their quantity is limited; existing reinforcement learning methods cannot fully utilize the available historical interaction data, the data utilization rate is low, and an efficient training closed loop is difficult to form. Finally, offline reinforcement learning performance is limited: traditional offline reinforcement learning algorithms are easily affected by distribution shift and overestimation problems during policy learning, and the performance of the learned policy degrades in actual deployment. Although the Soft Actor-Critic (SAC) algorithm performs well in continuous control tasks, it still faces policy overfitting and insufficient stability in offline scenarios. To solve the above problems, the invention provides an unmanned aerial vehicle offline reinforcement learning training method, system and device combined with Implicit Q-Learning (IQL).
The method achieves performance improvement and training efficiency optimization through the following innovations: 1) a unified unmanned aerial vehicle offline reinforcement learning training framework is established, in which model optimization relies only on maneuver decision data conforming to the offline reinforcement learning training standard, without depending on any particular reinforcement learning algorithm or unmanned aerial vehicle simulation environment, thereby improving data utilization; 2) the Soft Actor-Critic algorithm is improved by combining it with implicit Q-learning, introducing implicit advantage function estimation and a policy regularization mechanism into policy learning, which effectively mitigates distribution shift and improves the stability and generalization of offline training; 3) an unmanned aerial vehicle offline reinforcement learning pre-training mechanism is introduced, using existing unmanned aerial vehicle maneuver decision data for policy initialization, which alleviates the cold start problem of reinforcement learning.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle offline reinforcement learning training method combined with implicit Q-learning, which comprises: constructing an unmanned aerial vehicle simulation environment and a reinforcement learning training environment; designing a flight decision policy agent and performing maneuver interaction with the environment to generate offline reinforcement learning training data; constructing a Soft Actor-Critic offline reinforcement learning algorithm framework based on implicit Q-learning; and, based on this framework, sampling data from the experience replay pool for model training.
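The implicit Q-learning improvement described above reduces, in the IQL-SAC updates of claim 8 (steps S52 to S55), to three losses plus a soft target update: an expectile-regression value loss, a mean-squared TD critic loss, and an advantage-weighted behavior cloning policy loss. The sketch below shows these loss shapes on a toy batch; the expectile parameter tau, temperature beta, clipping bound and batch values are illustrative assumptions, not values from the patent:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss for the Value update (S52): positive
    residuals (Q - V > 0) weighted by tau, negative ones by 1 - tau."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def critic_loss(q, r, v_next, gamma=0.99):
    """Mean-squared TD error toward r + gamma * V_target(s') (S53)."""
    target = r + gamma * v_next
    return float(np.mean((q - target) ** 2))

def policy_loss(q, v, log_pi, beta=3.0, clip=20.0):
    """Advantage-weighted behavior cloning (S54): exp(beta * A) weights,
    clipped for numerical stability."""
    advantage = q - v
    weights = np.exp(np.minimum(beta * advantage, clip))
    return float(-np.mean(weights * log_pi))

def soft_update(target_params, online_params, rho=0.995):
    """Polyak averaging of the target Value/Critic networks (S55)."""
    return [rho * t + (1.0 - rho) * o for t, o in zip(target_params, online_params)]

# Toy batch: action values, state values, rewards, next-step values, log-probs.
q  = np.array([1.0, 2.0])
v  = np.array([0.5, 2.5])
r  = np.array([0.1, 0.2])
vn = np.array([1.0, 1.0])
lp = np.array([-1.0, -2.0])
vl = expectile_loss(q - v)   # -> 0.125 with tau = 0.7
cl = critic_loss(q, r, vn)
pl = policy_loss(q, v, lp)
```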
The unmanned aerial vehicle reinforcement learning training method aims to solve the problems of difficult cold start, low data utilization rate, limited offline training performance and the l