CN-121979252-A - Autonomous underwater vehicle path following control method based on value distribution reinforcement learning
Abstract
The invention discloses an autonomous underwater vehicle (AUV) path following control method based on value distribution reinforcement learning. The method first defines the AUV path following problem, including determining the AUV system inputs, determining the AUV system outputs, and defining the path following control errors. Second, it establishes a Markov decision model of the AUV path following problem by modeling the problem as a Markov decision process. A policy network and a value network based on the value distribution are then constructed. Finally, the Markov decision model is solved through the policy network and the value network, and the two networks are trained to obtain an optimal path following policy for the autonomous underwater vehicle. By fully exploiting the standard deviation of the value distribution and the trajectory rewards, the AUV gains a better understanding of the environmental information, training efficiency is improved, the global learning and training process is more stable, and the AUV path following precision is higher.
Inventors
- Wen Mingxing
- Gao Farong
- Lei Shaowei
- Zhang Qizhong
Assignees
- Hangzhou Dianzi University
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-05
Claims (8)
- 1. An autonomous underwater vehicle path following control method based on value distribution reinforcement learning, characterized by comprising the following steps: S1, defining the autonomous underwater vehicle (AUV) path following problem, which comprises determining the AUV system input, determining the AUV system output, and defining the path following control error; S2, establishing a Markov decision model of the AUV path following problem by modeling the problem as a Markov decision process; S3, constructing a policy network and a value network based on the value distribution; and S4, solving the Markov decision model through the policy network and the value network, and training the policy network and the value network to obtain an optimal path following policy for the autonomous underwater vehicle.
- 2. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 1, wherein in step S1 the determining of the AUV system input specifically means letting the system input vector of the AUV be $a_t = [T_t, \delta_t]^\top$, where $T_t$ and $\delta_t$ are respectively the propeller thrust and the rudder angle, and the subscript $t$ denotes the $t$-th time step; the value ranges of $T_t$ and $\delta_t$ are respectively $[0, T_{\max}]$ and $[-\delta_{\max}, \delta_{\max}]$, where $T_{\max}$ and $\delta_{\max}$ are respectively the maximum propeller thrust and the maximum rudder angle. The determining of the AUV system output means letting the AUV system output vector be $\eta_t = [x_t, y_t, \psi_t]^\top$, where $x_t$ and $y_t$ are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame at the $t$-th time step, and $\psi_t$ is the angle between the advancing direction of the AUV and the Y axis of the fixed coordinate frame at the $t$-th time step. The defining of the path following control error specifically means that a track reference point $(x_d, y_d)$ with yaw angle $\psi_d$ is selected on the target path of the AUV at the $t$-th time step; the cross-track error $e_{y,t}$ and the yaw angle error $e_{\psi,t}$ are defined from the track reference point and the yaw angle, and a speed error $e_{u,t}$ is introduced.
- 3. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 2, wherein the Markov decision model of the AUV path following problem consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function, and the next state $s_{t+1}$. Defining the state space: the position information of the AUV in a two-dimensional underwater operating environment comprises the lateral position $x$, the longitudinal position $y$, and the yaw angle $\psi$; the operating environment of the AUV further requires the surge velocity $u$, the sway velocity $v$, and the yaw angular velocity $r$, and the cross-track error $e_y$ and the yaw angle error $e_\psi$ are introduced; based on the path-tracking MDP framework, the state of the AUV path following problem is defined as $s_t = [x_t, y_t, \psi_t, u_t, v_t, r_t, e_{y,t}, e_{\psi,t}]^\top$. Defining the action space: the action vector of the $t$-th time step is the AUV system input vector of that time step, i.e. $a_t = [T_t, \delta_t]^\top$. Defining the reward function: the defined track-following control errors $e_{y,t}$, $e_{\psi,t}$, and $e_{u,t}$ are each multiplied by a negative coefficient; the current position of the AUV is mapped to the arc length of the reference track, the arc length $\Delta s_t$ mapped within one time step is divided by the sailing distance $u_d \Delta t$ covered at the desired cruising speed, and the quotient is multiplied by a positive coefficient, finally giving the AUV reward function of the $t$-th time step (a sketch of this reward appears after the claims): $r(s_t, a_t) = -k_1 |e_{y,t}| - k_2 |e_{\psi,t}| - k_3 |e_{u,t}| + k_4 \frac{\Delta s_t}{u_d \Delta t}$.
- 4. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 3, wherein the construction of the value-distribution-based policy network and value network includes constructing a value distribution soft policy iteration framework and constructing a risk-sensitive policy function, and performing parameter setting: respectively setting the maximum number of iteration episodes $M$, the maximum number of time steps per episode $T$, the training batch size $N$ extracted from experience replay, the learning rate $\lambda_Z$ of the value distribution network, the learning rate $\lambda_\pi$ of the policy network, and the learning rate $\lambda_\alpha$ for updating the policy entropy coefficient, and initializing the target network update parameter $\tau$ and the discount factor $\gamma$.
- 5. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 4, wherein the constructing of the value distribution soft policy iteration framework uses a value distribution function $Z^\pi(s_t, a_t)$ and a random policy function $\pi(a_t \mid s_t)$; the framework embeds, within maximum entropy reinforcement learning, a return distribution function that learns the continuous distribution of state-action returns, where $s_t$ and $a_t$ are the state and action of the $t$-th time step. The value network containing the value distribution function is realized with a fully connected deep neural network whose input is the state-action pair $(s_t, a_t)$, and the value distribution soft policy iteration framework evaluates the policy by describing the distribution of the random cumulative return. The constructing of the risk-sensitive policy function means constructing, for risk-sensitive scenarios, a risk-sensitive state set from the standard deviation of the value distribution and the average reward: $\mathcal{S}_{risk} = \{ s_t \mid \sigma(Z(s_t, a_t)) > \sigma_{th},\ R_t < \bar{R} \}$, where $\sigma_{th}$ is a preset value distribution standard deviation threshold and $\bar{R}$ is the current average reward value over all trajectories; $\mathcal{S}_{risk}$ and $\mathcal{A}_{risk}$ respectively denote the risk-sensitive state space and action space. Combining the random policy function with uniform sampling over the risk-sensitive action space yields a risk-sensitive policy function $\pi_{risk}$ for acquiring actions; a corresponding policy optimization objective function $J(\pi)$ is constructed, and the policy parameters $\phi$ are optimized by maximizing the value of the policy objective function (a sketch follows the claims). The policy network containing the risk-sensitive policy function is realized with a fully connected deep neural network whose input is the state vector $s_t$ and whose output is the action vector $a_t$.
- 6. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 5, wherein the policy network and the value network are implemented as follows (a training-loop sketch follows the claims): randomly initialize the weight parameters $\theta$ and $\phi$ of the value distribution network and the policy network, initialize the policy entropy coefficient $\alpha$ and the preset value distribution standard deviation threshold $\sigma_{th}$, construct the experience set $\mathcal{D}$ with maximum capacity $C$, store it in an experience cache pool, and initialize it to be empty. To train the value network and the policy network, initialize the iteration count, set the current time step, randomly initialize the AUV state variables, let $s_t$ be the state variable of the current time step, initialize the initial time step, and determine the action $a_t$ of the current time step. The AUV executes action $a_t$ in the current state $s_t$, calculates the current reward value according to the reward function $r(s_t, a_t)$, and observes the new state $s_{t+1}$; the trajectory $(s_t, a_t, r_t, s_{t+1})$ is recorded as an experience sample. If the experience set $\mathcal{D}$ has reached the maximum capacity $C$, the earliest added sample is deleted before the new experience sample is stored in $\mathcal{D}$; otherwise the experience sample is stored in $\mathcal{D}$ directly. From the experience set $\mathcal{D}$, $N$ experience samples are selected: when the number of samples in $\mathcal{D}$ does not exceed $N$, all experience samples in $\mathcal{D}$ are selected; when the number of samples in $\mathcal{D}$ exceeds $N$, $N$ samples are drawn from the experience set according to priority. Through iterative learning, the standard deviation of each trajectory and the average reward over all trajectories are calculated to obtain the risk-sensitive action space $\mathcal{A}_{risk}$, and the risk-sensitive policy function is constructed as follows: with $\varepsilon$ an annealing hyperparameter that increases as the training iterations increase, $t$ the time step of the current iteration episode, and mod the remainder operation, in the initial training period ($\varepsilon = 0$) the original policy function $\pi_\phi(\cdot \mid s_t)$ outputs the policy, while in the later training stage actions are acquired by uniform sampling $U(\mathcal{A}_{risk}(s_t))$ from the risk-sensitive action space of state $s_t$ whenever $t \bmod \varepsilon = 0$; the selected $N$ experience samples are then passed through the policy network to compute the output action $a_t$.
- 7. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 6, wherein the implementation of the value distribution soft policy iteration framework on the policy network and the value network is specifically as follows. The state-action value function of the value network outputs the random cumulative return $Z^\pi(s_t, a_t)$ generated under the current policy $\pi$, and the policy network outputs the policy $\pi$; the policy network updates its parameters through the policy optimization objective function, and the value network updates its parameters through the value distribution update objective function. The random cumulative return is defined as $Z^\pi(s_t, a_t) = \sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k})$; its probability density function, also called the value distribution function, corresponds to the state-action value function $Q^\pi(s_t, a_t) = \mathbb{E}[Z^\pi(s_t, a_t)]$. Let the current policy be $\pi_{old}$ with state-action value function $Q^{\pi_{old}}$; the policy update objective is to obtain a new policy $\pi_{new}$ that maximizes the state-action value function and the policy entropy, defined as $\pi_{new} = \arg\max_{\pi} \mathbb{E}_{a_t \sim \pi} [\, Q^{\pi_{old}}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \,]$, where $\alpha$ is the policy entropy coefficient (sketched after the claims). Finally, the path following control of the autonomous underwater vehicle is completed by the new policy $\pi_{new}$.
- 8. The autonomous underwater vehicle path following control method based on value distribution reinforcement learning of claim 7, wherein in step S4 the weight parameters $\theta$ and $\phi$ of the value network and the policy network are updated by computing the gradients of the loss functions with respect to the weight parameters; the corresponding target network weight parameters $\bar{\theta}$ and $\bar{\phi}$ are updated at the synchronization rate $\tau$; and the policy entropy coefficient $\alpha$ is updated through a dynamic adjustment mechanism with the formula $\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha \mathbb{E}_{a_t \sim \pi} [\, -\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \,]$, where $\bar{\mathcal{H}}$ is the target entropy, i.e. the minimum expected entropy, and $\lambda_\alpha$ is the update parameter (sketched after the claims). Let $t = t + 1$ and compare $t$ against $T$: if $t < T$, the $N$ experience samples are passed through the policy network again to output the next action, which is taken as the input for the AUV to continue following the reference track; otherwise it is judged whether the training count satisfies the constraint condition. If the training count is less than $M$, the AUV performs the next iteration; otherwise the iteration ends, the training of the value network and the policy network is terminated, the value network and policy network parameter values at termination are saved, and the policy output by the policy network realizes the path following control of the AUV.
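The claims describe the control errors and the reward only in prose, so the following minimal Python sketch of the errors of claim 2 and the reward of claim 3 is illustrative rather than normative: the coefficient names `k1..k4`, the defaults for `u_d` and `dt`, and the error sign conventions are assumptions introduced here, not values disclosed by the patent.

```python
import numpy as np

def tracking_errors(x, y, psi, u, x_d, y_d, psi_d, u_d=1.0):
    """Errors of claim 2 w.r.t. the track reference point (x_d, y_d)
    with reference yaw psi_d: cross-track error in the path frame,
    wrapped heading error, and surge-speed error."""
    e_y = -(x - x_d) * np.sin(psi_d) + (y - y_d) * np.cos(psi_d)
    e_psi = (psi - psi_d + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    e_u = u - u_d
    return e_y, e_psi, e_u

def path_following_reward(e_y, e_psi, e_u, delta_s,
                          u_d=1.0, dt=0.1, k=(1.0, 0.5, 0.2, 1.0)):
    """Reward of claim 3: penalize the cross-track, yaw and speed errors
    with negative coefficients, and reward the arc-length progress
    delta_s relative to the distance u_d * dt covered at the desired
    cruising speed. Coefficient values are illustrative assumptions."""
    k1, k2, k3, k4 = k
    return (-k1 * abs(e_y) - k2 * abs(e_psi) - k3 * abs(e_u)
            + k4 * delta_s / (u_d * dt))
```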
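The risk-sensitive action selection of claims 5 and 6 could look like the sketch below. The conjunction used in `is_risk_sensitive`, the annealing schedule in `select_action`, and the use of the action bounds as a stand-in for the risk-sensitive action space are all assumptions, since the claims specify these elements only qualitatively.

```python
import numpy as np

def is_risk_sensitive(z_quantiles, avg_reward, traj_reward, sigma_th=0.5):
    """Claim 5: treat a state as risk-sensitive when the standard
    deviation of its value distribution exceeds a preset threshold and
    the trajectory reward falls below the average over all trajectories
    (the conjunction of the two tests is our assumption)."""
    return z_quantiles.std() > sigma_th and traj_reward < avg_reward

def select_action(policy, s, z_quantiles, avg_reward, traj_reward,
                  step, epsilon, action_low, action_high, rng):
    """Claim 6: early in training (epsilon == 0) the stochastic policy
    acts alone; later, risk-sensitive states are periodically answered
    with a uniform sample from the (approximated) risk-sensitive
    action space."""
    if (epsilon > 0 and step % epsilon == 0
            and is_risk_sensitive(z_quantiles, avg_reward, traj_reward)):
        return rng.uniform(action_low, action_high)  # uniform exploration
    return policy(s)  # `policy` is a hypothetical callable s -> action
```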
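Next, a skeleton of the training procedure of claims 4 and 6, with hyperparameter names mirroring the claimed quantities (episodes $M$, steps per episode $T$, batch size $N$, buffer capacity $C$). `env` and `agent` are hypothetical placeholders, and plain random sampling stands in for the priority sampling mentioned in claim 6.

```python
from collections import deque
import random

def train(env, agent, M=1000, T=500, N=256, C=100_000):
    """Claims 4 and 6: M iteration episodes, T steps per episode,
    batches of N samples from a replay buffer of capacity C."""
    replay = deque(maxlen=C)          # oldest sample dropped when full
    for episode in range(M):
        s = env.reset()               # random initial AUV state
        for t in range(T):
            a = agent.act(s, t)       # risk-sensitive action selection
            s_next, r, done = env.step(a)
            replay.append((s, a, r, s_next))
            # all samples while the buffer is small, otherwise N of them
            batch = (list(replay) if len(replay) <= N
                     else random.sample(replay, N))  # stand-in for priority sampling
            agent.update(batch)       # value-distribution + policy updates
            s = s_next
            if done:
                break
```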
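Claim 7's soft policy iteration over a value distribution is close in spirit to distributional soft actor-critic. The PyTorch sketch below assumes the return distribution $Z(s,a)$ is represented by quantiles, so that $Q(s,a)$ is their mean; the quantile representation, the `policy.sample` interface, and all layer sizes are assumptions of this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class QuantileValueNet(nn.Module):
    """Value network of claims 5 and 7: maps a (state, action) pair to
    a set of quantiles of the random cumulative return Z(s, a)."""
    def __init__(self, s_dim, a_dim, n_quantiles=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))  # quantiles of Z(s, a)

def policy_objective(z_net, policy, s, alpha):
    """Soft policy improvement of claim 7: maximize
    E[Q(s, a) - alpha * log pi(a|s)], with Q = E[Z] = mean of quantiles."""
    a, log_pi = policy.sample(s)       # reparameterized action + log-prob (assumed API)
    q = z_net(s, a).mean(dim=-1)       # expectation over the return distribution
    return (q - alpha * log_pi).mean() # ascend this (or descend its negative)
```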
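Last, a sketch of claim 8's dynamic entropy coefficient adjustment and rate-$\tau$ target network update. Training $\log\alpha$ rather than $\alpha$ and using an Adam optimizer are implementation choices of this sketch, with the Adam learning rate playing the role of $\lambda_\alpha$.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)      # alpha = exp(log_alpha) > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)  # lr acts as lambda_alpha

def update_alpha(log_pi, target_entropy):
    """Claim 8: alpha <- alpha - lambda_alpha * grad of
    E[-alpha * log pi(a|s) - alpha * target_entropy]."""
    alpha = log_alpha.exp()
    loss = -(alpha * (log_pi.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()

@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    """Claim 8: target parameters track the online ones at rate tau."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.mul_(1 - tau).add_(tau * p)
```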
Description
Autonomous underwater vehicle path following control method based on value distribution reinforcement learning

Technical Field

The invention belongs to the field of deep reinforcement learning and intelligent control, and relates to an autonomous underwater vehicle path following control method based on value distribution reinforcement learning.

Background

With the development of ocean exploration, there is a growing need to conduct ocean scientific investigation, ocean resource exploration, ocean engineering construction, and the like with unmanned underwater vehicles, and autonomous underwater vehicles (Autonomous Underwater Vehicle, AUV) are now widely applied to ocean exploration. Underwater path following of an AUV is critical to accomplishing these complex tasks and has recently been intensively investigated in the field of intelligent control. However, the motion control of current AUVs still faces many challenges. First, an AUV has a highly nonlinear dynamics model with time-varying hydrodynamic coefficients and is a strongly coupled nonlinear multi-input multi-output system, so an accurate AUV model is difficult to obtain; when facing complex path following situations, the tracking accuracy and stability of the AUV degrade noticeably, and achieving stable high-accuracy control remains a challenge. The quality of the automatic control method directly determines whether the underwater vehicle can successfully complete its task and affects its safety. Therefore, improving the accuracy and stability of AUV path tracking is of great significance to applications and exploration in the ocean field.

In recent years, scholars at home and abroad have conducted extensive research on AUV path tracking control; the main methods fall into two categories, linear control and intelligent control. Linear control methods rely on accurate mathematical or physical models; while they provide theoretically accurate control, they are difficult to adapt to practically complex nonlinear systems. Lacking self-learning and self-adaptation capability, a linear controller may suffer a large drop in control performance when the underwater robot operates over a wide envelope, which limits its wide application. With the advent of the intelligent information age, advances in artificial intelligence and data processing capabilities have driven the rapid development of intelligent control methods. By training and learning an agent in a simulation system, such methods reduce dependence on an accurate mathematical model, and the resulting controller has better adaptability and transferability. In particular, with deep learning techniques, a deep neural network is constructed as a nonlinear function approximator that can learn the mapping between complex environment states and values; for a large and complex environment state space, state features can be extracted automatically by the deep neural network. These techniques markedly improve the decision-making capability of the agent, reduce learning and training time, and push AUV path control toward intelligent and efficient development.
However, in practical engineering applications, the huge and complex state space of an underwater environment may result in insufficient policy exploration capability and unstable policy updates, thereby reducing the tracking accuracy of deep-reinforcement-learning-based autonomous underwater vehicle path following control.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an autonomous underwater vehicle path following control method based on value distribution reinforcement learning. The method introduces a policy exploration mechanism by constructing a risk-sensitive state set, so that the agent further adapts to the complex underwater environment and the stability of global learning and training of the model is improved. The technical scheme adopted by the invention to achieve this purpose is as follows: an autonomous underwater vehicle path following control method based on value distribution reinforcement learning, comprising the following steps: S1, defining the autonomous underwater vehicle (AUV) path following problem, where the AUV path following control problem comprises three parts, namely determining the AUV system input, determining the AUV system output, and defining the path following control error; S2, establishing a Markov decision model of the AUV path following problem, performing Markov decision process modeling on the AUV path following problem of step S1, wherein the decision model of the AUV path following problem