
CN-122018525-A - Reinforcement learning enhanced continuous domain ant colony optimization method for three-dimensional path planning of unmanned aerial vehicle

CN122018525A

Abstract

The invention provides a reinforcement learning enhanced continuous domain ant colony optimization method for unmanned aerial vehicle (UAV) three-dimensional path planning, and belongs to the technical field of UAV path planning. The method integrates a reinforcement learning mechanism into a continuous domain ant colony optimization framework so that each search individual acts as an independent agent that adaptively selects an optimization strategy. Four cooperative operations are designed: quantum-behavior exploration, multi-elite guided development, Bézier curve smoothing, and information-entropy-minimization adjustment. Together these address the balance between exploration and exploitation, path feasibility, and trajectory regularity, overcoming the single-strategy limitation of existing methods. In terms of algorithm performance, flight-dynamics constraints and complex three-dimensional multi-obstacle environments are taken into account. The designed reinforcement learning enhanced continuous domain ant colony optimization algorithm generates high-quality, smooth and safe flight paths, obtains the best solutions in four test cases, improves on the second-ranked algorithm by 8.3% on average, and exhibits fast convergence and good stability.

Inventors

  • LONG YUE
  • LIU TAO
  • LI TIESHAN
  • CHEN ZHENLEI

Assignees

  • University of Electronic Science and Technology of China (电子科技大学)

Dates

Publication Date
20260512
Application Date
20260206

Claims (10)

  1. A reinforcement learning enhanced continuous domain ant colony optimization method for unmanned aerial vehicle three-dimensional path planning, characterized by comprising the following steps:
     step 1, establishing a UAV path-planning mathematical model and constructing a comprehensive cost function comprising path length, collision avoidance, height constraint and smoothness;
     step 2, initializing the reinforcement learning enhanced continuous domain ant colony optimization framework, configuring an independent Q-table for each individual, and defining a state space and an action space; the specific process is as follows: setting the population scale and randomly initializing the individuals, the path scheme represented by the i-th individual containing n waypoint coordinates, each being the three-dimensional coordinates of the j-th waypoint; configuring an independent Q-table for each individual, all Q-tables being initialized to zero matrices; defining a state space whose states correspond respectively to the four cooperative operations designed subsequently, and an action space in which each action indicates selecting the m-th cooperative operation for execution; calculating the cost of each individual with the cost function defined in step 1, finding the globally optimal solution, setting the maximum number of iterations, and initializing the current iteration counter;
     step 3, designing four cooperative operations, namely a quantum-behavior exploration operation, a multi-elite guided development operation, a Bézier curve smoothing operation and an information-entropy-minimization adjustment operation;
     the quantum-behavior exploration operation specifically comprises: generating the step length through a Lévy flight mechanism to characterize the breadth of the exploration range, the step being built from random variables following the corresponding normal distributions, the Lévy index, and a standard deviation computed with the gamma function; generating the d-th dimensional component of the new solution from a characteristic length parameter L, the Lévy index, a uniform random number on the unit interval, a sign function, the value of the selected guide solution in the d-th dimension, and the standard deviation of the d-th dimension, where d denotes a coordinate component of a waypoint on the path and a logarithmic factor simulates the quantum tunneling effect;
     the multi-elite guided development operation specifically comprises: defining an elite set containing a given number of elite solutions, assigning a ranking-based weight to each elite solution, and constructing a new solution through a Gaussian kernel function, the d-th dimensional component of the new solution being sampled from a normal distribution with the corresponding mean and standard deviation; introducing an optimal-solution bias governed by a bias intensity coefficient and the value of the current optimal individual in the d-th dimension;
     the Bézier curve smoothing operation specifically comprises: constructing a quadratic Bézier curve over every three consecutive waypoints, with a curve parameter, and processing the quadratic Bézier curve with a weighted mixing strategy, governed by a smoothing weight coefficient, to avoid excessive smoothing;
     the information-entropy-minimization adjustment operation specifically comprises: calculating the direction-change entropy of the path from the normalized probability distribution of the integrated angle change, the integrated angle change combining the turning angle and the climbing angle at the j-th waypoint; obtaining the ideal position of each waypoint by minimizing the direction-change entropy, the waypoint update employing an adaptive step-size strategy whose adjustment coefficient depends on the current iteration number and the maximum iteration number;
     step 4, executing Q-learning adaptive strategy selection, selecting a cooperative operation for each individual through an ε-greedy policy;
     step 5, evaluating the new solutions and updating the Q-tables, defining an instant reward according to the cost improvement, and updating the Q-values with an adaptive learning rate;
     step 6, updating the population and the global optimal solution, and judging whether the maximum number of iterations has been reached; if not, returning to step 4 to continue iterating, otherwise executing step 7;
     step 7, taking the global optimal solution, comprising the optimized waypoints, as the final UAV three-dimensional path-planning scheme.
     (Illustrative sketches of the Lévy-flight step, the Bézier smoothing and the Q-learning update follow the claims.)
  2. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein the specific process of step 1 is: for any path P_i from the start point S to the end point T containing n waypoints, with i the path index, the spatial coordinates of the j-th waypoint on path P_i are P_{i,j}, consisting of the horizontal coordinate, the vertical coordinate and the height coordinate of the j-th waypoint of the i-th path; the path length cost is the sum of the Euclidean distances between adjacent waypoints; the collision cost is accumulated over the obstacles, where K is the total number of obstacles and a collision evaluation parameter is computed for each; the height cost penalizes deviation of the height of the j-th waypoint from the desired flight height set according to the mission requirements, using a low-altitude flight penalty factor, a high-altitude flight penalty factor, the lowest safe altitude at the j-th waypoint and the highest allowable altitude at the j-th waypoint; the smoothness cost penalizes, with corresponding penalty factors, the turning angle at the j-th waypoint and the climbing angle at the j-th waypoint; the total cost function is the weighted sum of these costs, with weight coefficients for the different cost terms. (An illustrative sketch of the path-length and angle computations appears at the end of the description.)
  3. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 2, wherein the collision evaluation parameter is determined from the height coordinates of two adjacent waypoints and the horizontal distance from the cylinder centre to the path segment connecting those waypoints, compared against a safety threshold, where D is the radius of the unmanned aerial vehicle, L is the dangerous distance, and each obstacle is enclosed by a smallest cylinder of a given radius and height.
  4. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 2, wherein the climbing angle of the j-th waypoint is computed with the height difference of the two adjacent waypoints as the numerator and the horizontal distance of the path segment as the denominator, and the turning angle between adjacent path segments is computed from the horizontal projections of the path segments using the vector dot product, the vector modulus, the unit vector in the positive direction of the vertical axis, and the vector cross product.
  5. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein, in step 3, the parameter values are set as , , , .
  6. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein, in step 3, the number of elite solutions is set, and the mean value and standard deviation used for sampling are calculated from the value of the r-th elite solution in the given dimension, the ranking-based weight of the r-th elite solution, and a concentration parameter.
  7. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein, in step 3, the special handling of boundary waypoints in the Bézier curve smoothing operation is: for the first waypoint, a triple is constructed that includes the start point; for the last waypoint, a triple is constructed that includes the end point.
  8. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein the specific process of step 4 is: for the current iteration, updating the adaptive learning rate from the initial learning rate and the final learning rate; for each individual, based on the current state and its Q-table, selecting an action with the ε-greedy policy, where ε is the exploration rate parameter (see the ε-greedy and Q-update sketch after the claims).
  9. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein the specific process of step 5 is: calculating the cost of the new solution; defining an instant reward function according to the cost improvement, based on the position of the individual at the current iteration and its updated position; updating the Q-table according to the Q-learning update rule, using the adaptive learning rate and a discount factor; if the new solution improves the cost, accepting it, otherwise maintaining the current position.
  10. The reinforcement learning enhanced continuous domain ant colony optimization method of claim 1, wherein the specific process of step 6 is: for each individual, if the cost of the new solution is better, updating the individual position; updating the global optimal solution, where the argmin operator returns the index of the individual that minimizes the cost function; recording the optimal cost value of the current iteration; incrementing the iteration counter; if the maximum number of iterations has not been reached, returning to step 4 to continue the iterative optimization, otherwise terminating the iteration and entering step 7.
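The quantum-behavior exploration operation of claim 1 builds its step length from a Lévy flight. Below is a minimal Python sketch of one common way to draw such steps (Mantegna's algorithm, with the standard gamma-function scale factor) and to perturb a guide solution with a logarithmic "tunneling" factor. The patent's exact formulas are not reproduced on this page, so the function names, the default Lévy index of 1.5 and the way the jump is assembled are illustrative assumptions, not the claimed equations.

```python
import math
import random

def levy_step(beta: float = 1.5) -> float:
    """Draw one Levy-distributed step via Mantegna's algorithm: step = u / |v|**(1/beta)."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = random.gauss(0.0, sigma_u)   # u ~ N(0, sigma_u^2)
    v = random.gauss(0.0, 1.0)       # v ~ N(0, 1)
    return u / abs(v) ** (1 / beta)

def quantum_explore(guide, sigma, length_scale: float = 1.0, beta: float = 1.5):
    """Perturb each coordinate of a guide solution (hypothetical helper, not the claimed formula).

    guide : per-dimension values of the selected guide solution
    sigma : per-dimension standard deviations of the population
    The |ln(1/u)| term mimics the 'quantum tunneling' jump described in the patent.
    """
    new = []
    for g_d, s_d in zip(guide, sigma):
        u = random.uniform(1e-12, 1.0)                    # avoid log(0)
        jump = length_scale * s_d * abs(math.log(1.0 / u))
        sign = 1.0 if random.random() < 0.5 else -1.0     # random sign of the perturbation
        new.append(g_d + sign * jump * levy_step(beta))
    return new

print(quantum_explore([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```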
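Claims 1 and 7 smooth each run of three consecutive waypoints with a quadratic Bézier curve and blend the result with the original waypoint to avoid over-smoothing. A minimal sketch of that idea follows; evaluating the curve at t = 0.5 and the default blend weight of 0.5 are assumptions, and the first and last waypoints are simply left fixed rather than rebuilt with the boundary triples of claim 7.

```python
import numpy as np

def bezier_smooth(path: np.ndarray, w: float = 0.5, t: float = 0.5) -> np.ndarray:
    """Smooth interior waypoints of an (n, 3) path with quadratic Bezier blending.

    For each interior waypoint p[j], the Bezier point
        B(t) = (1 - t)^2 * p[j-1] + 2*t*(1 - t) * p[j] + t^2 * p[j+1]
    is mixed with the original waypoint: w * B(t) + (1 - w) * p[j].
    """
    smoothed = path.astype(float).copy()
    for j in range(1, len(path) - 1):
        p0, p1, p2 = path[j - 1], path[j], path[j + 1]
        b = (1 - t) ** 2 * p0 + 2 * t * (1 - t) * p1 + t ** 2 * p2
        smoothed[j] = w * b + (1 - w) * p1   # weighted mix keeps part of the original shape
    return smoothed

# Example: smooth a short four-waypoint path
waypoints = np.array([[0, 0, 0], [1, 2, 1], [2, 2, 2], [3, 0, 1]], dtype=float)
print(bezier_smooth(waypoints))
```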
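Claims 8 and 9 describe ε-greedy operation selection over a per-individual Q-table and a Q-learning value update with an adaptive learning rate and a discount factor. The sketch below shows the standard forms of both rules; the linear learning-rate decay and the discount factor of 0.9 are assumptions, since the patent's exact schedule and reward definition are not reproduced on this page.

```python
import random

ACTIONS = [0, 1, 2, 3]   # indices of the four cooperative operations of claim 1

def select_action(q_row, epsilon: float) -> int:
    """Epsilon-greedy selection: explore with probability epsilon, else pick the best-valued action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_row[a])

def q_update(Q, s, a, reward, s_next, alpha: float, gamma: float = 0.9) -> None:
    """Standard Q-learning rule: Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a])."""
    td_target = reward + gamma * max(Q[s_next][b] for b in ACTIONS)
    Q[s][a] += alpha * (td_target - Q[s][a])

def adaptive_alpha(t: int, t_max: int, alpha_0: float = 0.9, alpha_end: float = 0.1) -> float:
    """Assumed linear decay of the learning rate from alpha_0 to alpha_end over the run."""
    return alpha_0 + (alpha_end - alpha_0) * t / t_max

# Example: one individual's Q-table over 4 states x 4 actions, initialized to zeros as in claim 1
Q = [[0.0] * len(ACTIONS) for _ in range(4)]
a = select_action(Q[0], epsilon=0.1)
q_update(Q, s=0, a=a, reward=1.0, s_next=a, alpha=adaptive_alpha(10, 100))
```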

Description

Reinforcement learning enhanced continuous domain ant colony optimization method for three-dimensional path planning of unmanned aerial vehicle

Technical Field

The invention belongs to the technical field of unmanned aerial vehicle path planning, and particularly relates to a reinforcement learning enhanced continuous domain ant colony optimization method for unmanned aerial vehicle three-dimensional path planning.

Background

Unmanned aerial vehicles are widely used as important tools for exploring complex environments in fields such as precision agriculture, search and rescue, infrastructure inspection and cargo transportation. In these application scenarios, autonomous navigation capability is one of the core requirements of an unmanned aerial vehicle system. Path planning is a key technology of autonomous navigation; its aim is to compute an optimal path from a starting point to an end point for the unmanned aerial vehicle in a complex environment. In complex terrain environments with dense obstacles in particular, efficient and reliable path-planning algorithms are required to achieve autonomous flight. Existing research mainly focuses on graph-search algorithms [1] and meta-heuristic algorithms [2]. The ant colony optimization algorithm is an important meta-heuristic algorithm, and its continuous-domain extension can be used to solve the continuous coordinate-optimization problem in path planning. In recent years, researchers have tried to combine reinforcement learning with the ant colony algorithm to improve performance, for example the reinforcement-learning ant colony optimization algorithm proposed by Zhang et al. [3]; however, it adopts a fixed search strategy, performs adaptive optimization only for a single parameter, lacks a multi-strategy cooperative mechanism and a comprehensive consideration of the practical constraints of the path-planning scenario, and therefore cannot be effectively applied to three-dimensional path planning of unmanned aerial vehicles. It is therefore of great significance to study a UAV path-planning algorithm with adaptive decision-making capability in complex three-dimensional environments.

[1] Ding Ningning, Chen Lei, Niu Xiaodong, Chen Haotian, Zhang Yuxue, Liu Guodong. An unmanned aerial vehicle path planning method based on a fused A-star potential field algorithm [P]. Zhejiang Province: CN120029308A, 2025-05-23.
[2] Peng Shuyan. A three-dimensional path planning method for unmanned aerial vehicles based on a neuron chaos wolf algorithm [P]. Jiangsu Province: CN120315451A, 2025-07-15.
[3] Zhang W, Wang C, Lin W, et al. Continuous-domain ant colony optimization algorithm based on reinforcement learning [J]. International Journal of Wavelets, Multiresolution and Information Processing, 2021, 19(03): 2050084.

Disclosure of Invention

Aiming at the problems existing in the background art, the invention provides a reinforcement learning enhanced continuous domain ant colony optimization method for unmanned aerial vehicle three-dimensional path planning.
The method integrates a reinforcement learning mechanism into the continuous domain ant colony optimization framework so that each search individual acts as an independent agent that adaptively selects an optimization strategy. Four cooperative operations are designed: quantum-behavior exploration, multi-elite guided development, Bézier curve smoothing and information-entropy-minimization adjustment. Together these address the balance between exploration and exploitation, path feasibility and trajectory regularity, and overcome the single-strategy limitation of existing methods. To achieve the above purpose, the technical scheme of the invention is as follows. A reinforcement learning enhanced continuous domain ant colony optimization method for unmanned aerial vehicle three-dimensional path planning comprises the following steps: step 1, establishing a UAV path-planning mathematical model and constructing a comprehensive cost function comprising path length, collision avoidance, height constraint and smoothness; step 2, initializing the reinforcement learning enhanced continuous domain ant colony optimization framework, configuring an independent Q-table for each individual, and defining a state space and an action space, the specific process being as follows: setting the population scale and randomly initializing the individuals, the path scheme represented by the i-th individual containing n waypoint coordinates, each being the three-dimensional coordinates of the j-th waypoint; configuring an independent Q-table for each individual, all Q-tables being initialized to zero matrices, and defining a state space
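As a reading aid for the cost function of step 1 (and of claims 2 and 4), the sketch below computes two of its ingredients: the path-length cost as the sum of Euclidean distances between adjacent waypoints, and the turning and climbing angles used by the smoothness cost. The collision and height terms, their penalty factors and the weight coefficients are patent-specific and are not reconstructed here; the function names are illustrative assumptions.

```python
import numpy as np

def path_length(path: np.ndarray) -> float:
    """Sum of Euclidean distances between consecutive waypoints of an (n, 3) path."""
    return float(np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1)))

def turn_and_climb_angles(path: np.ndarray):
    """Turning angle between the horizontal projections of consecutive segments, and
    climbing angle of each outgoing segment (height change over horizontal distance).
    The patent also uses a vector cross product (e.g. for the turn direction); that sign is omitted here."""
    turns, climbs = [], []
    for j in range(1, len(path) - 1):
        v_in = path[j] - path[j - 1]
        v_out = path[j + 1] - path[j]
        h_in, h_out = v_in.copy(), v_out.copy()
        h_in[2] = h_out[2] = 0.0                      # project onto the horizontal plane
        denom = np.linalg.norm(h_in) * np.linalg.norm(h_out)
        cos_turn = np.dot(h_in, h_out) / denom if denom > 0 else 1.0
        turns.append(float(np.arccos(np.clip(cos_turn, -1.0, 1.0))))
        climbs.append(float(np.arctan2(v_out[2], np.linalg.norm(h_out))))
    return turns, climbs

waypoints = np.array([[0, 0, 0], [1, 1, 0.5], [2, 1, 1.0], [3, 0, 1.0]], dtype=float)
print(path_length(waypoints), turn_and_climb_angles(waypoints))
```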