
CN-122015876-A - Underwater glider dynamic path planning training method based on value distribution reinforcement learning

CN 122015876 A

Abstract

The application provides an underwater glider dynamic path planning training method based on value distribution reinforcement learning, which can be applied to the technical field of underwater gliders. The method comprises: inputting training data; obtaining, by using at least one policy network, the output action of each policy network at time t and at time t+1; obtaining a target distribution at time t by using a target evaluation network, where the target distribution at time t represents a quantile-based predicted distribution of the cumulative return of the underwater glider after it executes an action given by the policy network in the sample state at time t; obtaining a sample evaluation distribution at time t by using an evaluation network; updating the evaluation network according to the sample evaluation distribution at time t and the target distribution at time t to obtain an updated evaluation network; obtaining an evaluation distribution at time t by using the updated evaluation network; and obtaining a target policy network according to the evaluation distribution at time t, so as to control the underwater glider to perform path planning based on the target policy network.
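The application contains no source code; the following Python/PyTorch sketch only illustrates the kind of update step the abstract describes. The function name, tensor shapes, the quantile-Huber distance, and the plain averaging over policies and target-critic outputs are assumptions made for illustration, not details disclosed in the patent.

```python
import torch

def train_step(actors, critic, target_critic, batch, gamma=0.99, kappa=1.0):
    """One hypothetical update step of the value-distribution training loop.

    actors        : list of policy networks, state -> heading-angle adjustment
    critic        : evaluation network, (state, action) -> N return quantiles
    target_critic : slowly updated copy of the evaluation network
    batch         : dict of tensors 's_t', 'a_t', 'r_t', 's_t1'
    """
    s_t, a_t, r_t, s_t1 = batch["s_t"], batch["a_t"], batch["r_t"], batch["s_t1"]

    # Target distribution at time t: quantiles predicted by the target critic
    # for the t+1 state and each policy's t+1 output action, averaged over the
    # policies and shifted/scaled by the reward and the discount factor.
    with torch.no_grad():
        z_next = torch.stack([target_critic(s_t1, pi(s_t1)) for pi in actors])
        target = r_t.unsqueeze(-1) + gamma * z_next.mean(dim=0)   # (batch, N)

    # Sample evaluation distribution at time t for the stored sample action.
    z_pred = critic(s_t, a_t)                                     # (batch, N)

    # Quantile-Huber regression loss between predicted and target quantiles
    # (a common distributional-RL choice; the patent only speaks of a distance).
    n = z_pred.shape[-1]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n
    diff = target.unsqueeze(-1) - z_pred.unsqueeze(-2)            # (batch, N, N)
    huber = torch.where(diff.abs() <= kappa,
                        0.5 * diff.pow(2),
                        kappa * (diff.abs() - 0.5 * kappa))
    critic_loss = (torch.abs(taus - (diff.detach() < 0).float()) * huber).mean()

    # Policy update: each policy maximizes the mean of the critic's quantiles
    # for its own time-t output action; losses are averaged over the policies.
    actor_loss = torch.stack([-critic(s_t, pi(s_t)).mean() for pi in actors]).mean()
    return critic_loss, actor_loss
```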

Inventors

  • GAO ZHONGKE
  • JUAN RONGSHUN
  • WANG TIANSHU

Assignees

  • Tianjin University (天津大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. An underwater glider dynamic path planning training method based on value distribution reinforcement learning, characterized by comprising the following steps: inputting training data, wherein the training data comprise a sample state of an underwater glider at time t, a sample action at time t, and a sample state at time t+1 obtained after the sample action at time t is executed, t is a positive integer, the sample state of the underwater glider at each time is determined in a pre-constructed sample state space, the sample state space comprises topographic data, ocean current data, heading angle data and position data of the underwater glider after it surfaces, the sample action of the underwater glider at each time is determined in a pre-constructed sample action space, and the sample action space represents the heading angle adjustment amount of the underwater glider; obtaining, by using at least one policy network, an output action of each policy network at time t in the sample state at time t and an output action of each policy network at time t+1 in the sample state at time t+1; obtaining, by using a target evaluation network, a target distribution at time t according to the sample state at time t+1 and the output action of each policy network at time t+1 in the sample state at time t+1, wherein the target distribution at time t represents a predicted distribution of the cumulative return of the underwater glider after it executes an action given by the policy network in the sample state at time t, and the cumulative return of the underwater glider is represented by quantiles and the probabilities to which the quantiles correspond; obtaining, by using an evaluation network, a sample evaluation distribution at time t according to the sample state at time t and the sample action at time t, wherein the sample evaluation distribution at time t characterizes the actual quantile distribution of the cumulative return of the underwater glider after the sample action at time t is executed in the sample state at time t; updating the evaluation network according to the sample evaluation distribution at time t and the target distribution at time t to obtain an updated evaluation network; obtaining, by using the updated evaluation network, an evaluation distribution at time t according to the sample state at time t and the output action of each policy network at time t in the sample state at time t; and updating parameters of each policy network according to the evaluation distribution at time t to obtain a target policy network, so as to control the underwater glider to perform path planning based on the target policy network.
  2. The method of claim 1, wherein the terrain data characterize the terrain within a first rectangular sea area centered on the underwater glider, the terrain data being represented by a terrain elevation matrix composed of a first sub-element and a second sub-element, the first sub-element representing an absolute value of a true terrain value in the first rectangular sea area that is less than or equal to a maximum diving depth of the underwater glider, and the second sub-element representing a true terrain value in the first rectangular sea area that is less than the maximum diving depth of the underwater glider; the ocean current data represent the ocean current within a second rectangular sea area centered on the underwater glider, the ocean current data comprise an eastward ocean current velocity component at each detection point in the inertial coordinate system and a northward ocean current velocity component at each detection point in the inertial coordinate system, and the boundaries of the second rectangular sea area are respectively parallel to the longitude and latitude directions; the heading angle data represent the deviation between the current heading angle and the desired heading angle of the underwater glider, the current heading angle and the desired heading angle being angles between the body direction of the underwater glider and the longitude direction, and the desired heading angle representing the target heading angle when the underwater glider performs path planning; and the position data characterize the distance between the current position of the underwater glider and the task end point of the path planning, as well as the relationship between the task start point and the task end point of the path planning (an illustrative sketch of this state and action representation follows the claims).
  3. The method according to claim 1, wherein the obtaining, by using a target evaluation network, a target distribution at time t according to the sample state at time t+1 and the output action of each policy network at time t+1 in the sample state at time t+1 comprises: obtaining, by using at least one target evaluation network, target output distributions at time t+1 corresponding to each policy network and each of the at least one target evaluation network, respectively, according to the sample state at time t+1 and the output action of each policy network at time t+1 in the sample state at time t+1; sorting, for each policy network, the target output distributions at time t+1 corresponding to that policy network and the at least one target evaluation network to obtain a mixed target output distribution at time t+1 corresponding to each policy network; and averaging the mixed target output distributions at time t+1 corresponding to the policy networks to obtain the target distribution at time t.
  4. The method according to claim 3, wherein the averaging the mixed target output distributions at time t+1 corresponding to the policy networks to obtain the target distribution at time t comprises: averaging the mixed target output distributions at time t+1 corresponding to the policy networks to obtain an average target output distribution at time t+1; and performing translation scaling on the average target output distribution at time t+1 to obtain the target distribution at time t.
  5. The method according to any one of claims 1 to 3, wherein the obtaining, by using the updated evaluation network, an evaluation distribution at time t according to the sample state at time t and the output action of each policy network at time t in the sample state at time t comprises: obtaining, by using at least one updated evaluation network, an evaluation output distribution at time t corresponding to each policy network according to the sample state at time t and the output action of each policy network at time t in the sample state at time t; sorting the evaluation output distributions at time t corresponding to each policy network to obtain a mixed evaluation output distribution at time t corresponding to each policy network; and averaging the mixed evaluation output distributions at time t corresponding to the policy networks to obtain the evaluation distribution at time t.
  6. The method according to any one of claims 1 to 3, wherein the training data are obtained by: randomly selecting a region with a preset area from the marine environment and determining the region as a task region; acquiring ocean current task data and terrain task data within the task region; determining a task start point and a task end point according to the longitude and latitude within the task region; acquiring the sample state of the underwater glider at time t according to the task start point, the ocean current task data and the terrain task data; obtaining the sample action at time t by using the policy network according to the sample state at time t; and obtaining the sample state of the underwater glider at time t+1 after the underwater glider executes the sample action at time t.
  7. The method according to any one of claims 1 to 3, wherein the updating parameters of each policy network according to the evaluation distribution at time t to obtain a target policy network comprises: determining a policy network loss value of each policy network according to the evaluation distribution at time t; determining an average policy network loss value according to the policy network loss values of the policy networks; updating the parameters of each policy network according to the average policy network loss value to obtain an intermediate policy network corresponding to each policy network; executing a test task by using the intermediate policy network corresponding to each policy network to obtain a reward corresponding to each intermediate policy network; and determining the target policy network according to the rewards corresponding to the intermediate policy networks.
  8. The method according to any one of claims 1 to 3, wherein the updating the evaluation network according to the sample evaluation distribution at time t and the target distribution at time t to obtain an updated evaluation network comprises: determining an evaluation network loss value of the evaluation network according to the distance between the sample evaluation distribution at time t and the target distribution at time t; and updating the evaluation network according to the evaluation network loss value to obtain the updated evaluation network.
  9. The method of claim 2, wherein the policy network comprises a policy ocean current feature convolution layer, a policy terrain feature convolution layer, a policy feature splicing layer and a policy fully connected layer; and the policy network obtains the output action in the sample state at each time by: processing the ocean current data by using the policy ocean current feature convolution layer to obtain policy ocean current features; processing the terrain data by using the policy terrain feature convolution layer to obtain policy terrain features; splicing the policy ocean current features, the policy terrain features, the heading angle data and the position data by using the policy feature splicing layer to obtain policy mixed features; and processing the policy mixed features by using the policy fully connected layer to obtain the output action in the sample state at each time.
  10. The method of claim 9, wherein the evaluation network comprises an evaluation ocean current feature convolution layer, an evaluation terrain feature convolution layer, an evaluation feature splicing layer and an evaluation fully connected layer; and the evaluation network obtains the sample distribution in the sample state at each time by: processing the ocean current data by using the evaluation ocean current feature convolution layer to obtain evaluation ocean current features; processing the terrain data by using the evaluation terrain feature convolution layer to obtain evaluation terrain features; splicing the evaluation ocean current features, the evaluation terrain features, the heading angle data, the position data and the corresponding sample action in the sample state at each time by using the evaluation feature splicing layer to obtain evaluation mixed features; and processing the evaluation mixed features by using the evaluation fully connected layer to obtain the sample distribution in the sample state at each time (an illustrative sketch of this network layout follows the claims).
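Claims 1 and 2 describe the sample state (terrain elevation matrix, ocean current components, heading-angle deviation, position information) and the sample action (a heading-angle adjustment). The following is a minimal sketch of how such a state and action might be held in code; every field name, array shape and the ±30° bound are hypothetical choices, not values taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GliderSampleState:
    terrain: np.ndarray        # elevation matrix of the first rectangular sea
                               # area centred on the glider (claim 2)
    current_east: np.ndarray   # eastward current component at each detection
                               # point, inertial frame (second rectangular area)
    current_north: np.ndarray  # northward current component, inertial frame
    heading_error: float       # current heading angle minus desired heading angle
    position: np.ndarray       # distance to the task end point plus the
                               # start-point / end-point relationship

# The sample action space is the heading-angle adjustment applied after the
# glider surfaces; a single scalar, here clipped to an assumed +/- 30 degrees.
def clip_heading_adjustment(delta_deg: float, limit_deg: float = 30.0) -> float:
    return float(np.clip(delta_deg, -limit_deg, limit_deg))
```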
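Claims 9 and 10 describe the layout of the policy and evaluation networks: an ocean-current convolution branch, a terrain convolution branch, a feature-splicing (concatenation) step, and a fully connected head. A minimal PyTorch sketch of that layout follows; the layer sizes, kernel sizes, activations and quantile count are placeholders, since the claims specify only the layer types.

```python
import torch
import torch.nn as nn

def _conv_branch(in_channels: int) -> nn.Sequential:
    """Shared shape of the ocean-current / terrain convolution branches."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 8, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten())   # -> 8 * 4 * 4 = 128 features

class GliderPolicyNet(nn.Module):
    """Claim 9 layout: current CNN + terrain CNN -> splice with scalars -> MLP."""
    def __init__(self):
        super().__init__()
        self.current_conv = _conv_branch(2)   # east / north current components
        self.terrain_conv = _conv_branch(1)   # terrain elevation matrix
        self.head = nn.Sequential(nn.Linear(128 + 128 + 4, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Tanh())  # heading adjustment

    def forward(self, currents, terrain, heading_and_position):
        features = torch.cat([self.current_conv(currents),
                              self.terrain_conv(terrain),
                              heading_and_position], dim=-1)
        return self.head(features)

class GliderEvalNet(nn.Module):
    """Claim 10 layout: same branches, the sample action joins the splice, and
    the head emits N quantiles of the cumulative return instead of an action."""
    def __init__(self, n_quantiles: int = 32):
        super().__init__()
        self.current_conv = _conv_branch(2)
        self.terrain_conv = _conv_branch(1)
        self.head = nn.Sequential(nn.Linear(128 + 128 + 4 + 1, 64), nn.ReLU(),
                                  nn.Linear(64, n_quantiles))

    def forward(self, currents, terrain, heading_and_position, action):
        features = torch.cat([self.current_conv(currents),
                              self.terrain_conv(terrain),
                              heading_and_position, action], dim=-1)
        return self.head(features)
```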

Description

Underwater glider dynamic path planning training method based on value distribution reinforcement learning

Technical Field

The application relates to the technical field of underwater gliders, and in particular to an underwater glider dynamic path planning training method based on value distribution reinforcement learning.

Background

By adjusting its net buoyancy and attitude angle, an underwater glider can glide through the ocean along a sawtooth trajectory under the combined action of its wings; each sawtooth motion (one dive and one ascent) of the underwater glider constitutes one profile. Observation and detection of the marine environment can be achieved through this sawtooth motion of the underwater glider in the ocean. For example, an underwater glider can achieve full-cycle tracking observation of information such as ocean temperature, salinity and depth. However, because ocean currents in the marine environment are changeable, marine organisms drift, and the submarine topography is complicated, reasonable path planning for the underwater glider is very important. Path planning algorithms for underwater gliders in the related art, however, are inefficient, require many iterations and compute slowly, so their real-time response is slow.

Disclosure of Invention

In view of this, the embodiments of the application provide an underwater glider dynamic path planning training method based on value distribution reinforcement learning. One aspect of the embodiments of the application provides a value distribution reinforcement learning-based underwater glider dynamic path planning training method, which comprises: inputting training data, wherein the training data comprise a sample state of an underwater glider at time t, a sample action at time t and a sample state at time t+1 obtained after the sample action at time t is executed, and t is a positive integer; obtaining, by using at least one policy network, an output action of each policy network at time t in the sample state at time t and an output action of each policy network at time t+1 in the sample state at time t+1; obtaining, by using a target evaluation network, a target distribution at time t according to the sample state at time t+1 and the output action of each policy network at time t+1 in the sample state at time t+1, wherein the target distribution at time t characterizes a quantile-based predicted distribution of the cumulative return of the underwater glider after it executes an action given by the policy network in the sample state at time t, and the cumulative return corresponds to the probabilities represented by the quantiles; obtaining, by using an evaluation network, a sample evaluation distribution at time t according to the sample state at time t and the sample action at time t; updating the evaluation network according to the sample evaluation distribution at time t and the target distribution at time t to obtain an updated evaluation network; obtaining, by using the updated evaluation network, an evaluation distribution at time t according to the sample state at time t and the output action of each policy network at time t in the sample state at time t; and updating parameters of each policy network according to the evaluation distribution at time t to obtain a target policy network, so as to control the underwater glider to perform path planning based on the target policy network.
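As a concrete illustration of the target-distribution construction summarized above (and detailed in claims 3 and 4), the sketch below pools the quantile outputs of several target evaluation networks for each policy, sorts them into one mixed distribution, averages over the policies, and finally applies the reward-and-discount translation scaling. The quantile count, discount factor and function signature are assumptions, not details disclosed in the application.

```python
import torch

def build_target_distribution(actors, target_critics, s_t1, r_t, gamma=0.99):
    """Hypothetical assembly of the time-t target distribution."""
    per_policy = []
    for pi in actors:
        a_t1 = pi(s_t1)                                    # t+1 output action
        # Quantile outputs of every target evaluation network for this policy.
        pooled = torch.cat([qc(s_t1, a_t1) for qc in target_critics], dim=-1)
        # Sort the pooled quantiles into one mixed target output distribution.
        per_policy.append(torch.sort(pooled, dim=-1).values)
    # Average the per-policy mixed distributions, then translation-scale with
    # the reward and the discount factor to obtain the target at time t.
    averaged = torch.stack(per_policy).mean(dim=0)
    return r_t.unsqueeze(-1) + gamma * averaged
```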
According to the embodiments of the application, the target evaluation network is used to obtain the target distribution at time t according to the sample state at time t+1 and the output action at time t+1, where the target distribution at time t characterizes the quantile-based predicted distribution of the cumulative return after the underwater glider executes an action given by the policy network in the sample state at time t; the evaluation network is used to obtain the sample evaluation distribution at time t, which characterizes the quantile-based distribution of the cumulative return after the underwater glider executes an action given by the policy network in the sample state at time t. The output of the evaluation network is therefore no longer merely the expectation of the cumulative return: the target distribution at time t obtained by the target evaluation network and the sample evaluation distribution at time t obtained by the evaluation network reflect the evaluation of the sample action at time t more directly, updating the evaluation network according to the sample evaluation distribution and the target distribution at time t allows the evaluation network to converge more quickly, and the parameters of the policy networks are updated based on the evaluation distribution at time t, so that the number of iterations required is reduced and the real-time responsiveness of path planning is improved.