CN-122018506-A - Unmanned surface vessel game countermeasure control method, device, program and storage medium based on LW-PPO
Abstract
The invention relates to an unmanned surface vessel (USV) game countermeasure control method, device, program and storage medium based on LW-PPO, belonging to the technical field of USV intelligent control and deep reinforcement learning. The method comprises: modeling the game environment as a Markov game process and initializing the state of each role; acquiring the hidden state vector of the previous moment and the state vector of the current moment; inputting them into a liquid neural network to extract time-sequence features and obtain the hidden state vector of the current moment; inputting this vector into a strategy network to generate a selected action; converting the selected action into acceleration and steering-rate control quantities through a PID (proportional-integral-derivative) controller; performing a kinematic update according to the control quantities and the current state to obtain an updated state vector; and interacting with the environment to judge whether the game has ended, executing the above steps cyclically if it has not. By introducing a liquid neural network, the method explicitly models the continuous-time dynamics of the USV and strengthens time-sequence feature extraction; by replacing the conventional KL-divergence constraint with a smooth regularization term, it improves the stability of strategy updates.
Inventors
- WANG XINGMEI
- LIU YUNCHENG
- LI GONG
- REN JUN
- XU JUNZHENG
- XU YUEZHU
Assignees
- Harbin Engineering University (哈尔滨工程大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-09
Claims (8)
- 1. An LW-PPO-based unmanned surface vessel game countermeasure control method, characterized by comprising the following steps: Step 1, modeling a Markov game process according to the game environment, setting a hidden state for each role, and initializing it; Step 2, acquiring the previous-moment hidden state vector containing the states of all game roles and the current-moment state vector containing the speed, position coordinates and heading angle of every game role at the current moment, and inputting them into a liquid neural network for time-sequence feature extraction to obtain the current-moment hidden state vector; Step 3, introducing a strategy network and inputting the current-moment hidden state vector into it to obtain a selected action; Step 4, converting the selected action through a controller to obtain an action control quantity tuple; Step 5, performing a kinematic update according to the action control quantity tuple combined with the current-moment state vector to obtain an updated state vector; Step 6, interacting the updated state vector with the environment to judge whether the game has ended; if it has ended, outputting the game result; otherwise, taking the updated state vector as the state vector of the next moment and returning to Step 2.
- 2. The LW-PPO-based unmanned surface vessel game countermeasure control method according to claim 1, wherein the current-moment hidden state vector h_t in Step 2 is calculated according to a continuous-time differential equation of the liquid neural network of the form dh/dt = -h/τ + f(h, x_t) and discretizing it, where h_{t-1} is the hidden state vector at the previous moment, dh/dt is the differential of the hidden state, τ is a learnable time-decay constant, and x_t is the current-moment state vector.
- 3. The LW-PPO-based unmanned surface vessel game countermeasure control method according to claim 1, wherein Step 3 comprises: generating action probabilities by the strategy network based on the time-sequence features extracted by the liquid neural network, and selecting the action with the highest probability as the selected action.
- 4. The LW-PPO-based unmanned surface vessel game countermeasure control method according to claim 1, wherein the conversion of the selected action by the controller in Step 4 specifically comprises computing, for each channel, u = K_p e_t + K_i ∫e dt + K_d (de/dt), where u is the motion control quantity, (a, r) is the action control quantity tuple, a is the acceleration control quantity, r is the steering-rate control quantity, e_t is the error value, ∫e dt is the integral term of the error, de/dt is the differential term of the error, and K_p, K_i, K_d are the proportional, integral and differential gain parameters respectively.
- 5. The LW-PPO-based unmanned surface vessel game countermeasure control method according to claim 4, wherein the kinematic update in Step 5 specifically includes a speed update, a position coordinate update and a heading angle update: v_{t+1} = v_t + aΔt, ψ_{t+1} = (ψ_t + rΔt) mod 2π, p_{t+1} = p_t + v_{t+1}(cos ψ_{t+1}, sin ψ_{t+1})Δt, where Δt is the sampling period, v_{t+1} is the updated speed, v_t is the current-moment speed, ψ_{t+1} is the updated heading angle, ψ_t is the heading angle at the current moment, mod is the modular arithmetic operation, p_{t+1} is the updated position coordinates, and p_t is the current-moment position coordinates; the updated speed, position coordinates and heading angle are combined into the updated state vector.
- 6. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1 to 5.
- 7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
- 8. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 5.
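The pipeline of claims 1 to 5 can be sketched end to end. Everything below is an illustrative reconstruction under stated assumptions: the integration step, network weights, PID gains, discrete action set and terminal test are invented for the sketch and are not the patent's values, and the liquid-network form dh/dt = -h/τ + f(h, x) follows the standard liquid-time-constant formulation rather than the patent's exact equation.

```python
import numpy as np

DT = 0.1  # integration step (assumed)

def lnn_step(h_prev, x_t, W_h, W_x, tau):
    """Claim 2: one explicit-Euler step of dh/dt = -h/tau + tanh(W_h h + W_x x)."""
    dh = -h_prev / tau + np.tanh(W_h @ h_prev + W_x @ x_t)
    return h_prev + DT * dh

def select_action(h_t, W_pi):
    """Claim 3: the strategy network emits action scores; take the argmax."""
    return int(np.argmax(W_pi @ h_t))

class PID:
    """Claim 4: u = Kp*e + Ki*integral(e) + Kd*d(e)/dt, discretised with step DT."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = self.e_prev = 0.0
    def step(self, e):
        self.integral += e * DT
        d = (e - self.e_prev) / DT
        self.e_prev = e
        return self.kp * e + self.ki * self.integral + self.kd * d

def kinematic_update(x_t, accel, steer_rate):
    """Claim 5: speed, heading (with modular wrap) and position updates."""
    v, px, py, psi = x_t
    v_new = v + accel * DT
    psi_new = (psi + steer_rate * DT) % (2 * np.pi)
    return np.array([v_new,
                     px + v_new * np.cos(psi_new) * DT,
                     py + v_new * np.sin(psi_new) * DT,
                     psi_new])

def run_episode(x0, steps=60, seed=0):
    """Claim 1, Steps 1-6: initialise, then loop until the terminal test fires."""
    rng = np.random.default_rng(seed)
    H = 8
    W_h = rng.normal(size=(H, H)) * 0.1
    W_x = rng.normal(size=(H, 4)) * 0.1
    W_pi = rng.normal(size=(3, H))
    tau = np.full(H, 2.0)                     # learnable time-decay constants
    setpoints = [(3.0, 0.0), (1.0, 1.5), (1.0, -1.5)]  # assumed action set
    accel_pid, steer_pid = PID(1.0, 0.1, 0.05), PID(2.0, 0.0, 0.1)
    h, x = np.zeros(H), x0                    # Step 1: initialise hidden state
    for _ in range(steps):
        h = lnn_step(h, x, W_h, W_x, tau)     # Step 2
        a = select_action(h, W_pi)            # Step 3
        v_ref, psi_ref = setpoints[a]         # Step 4: PID tracks the setpoints
        accel = accel_pid.step(v_ref - x[0])
        err = (psi_ref - x[3] + np.pi) % (2 * np.pi) - np.pi  # wrapped heading error
        rate = steer_pid.step(err)
        x = kinematic_update(x, accel, rate)  # Step 5
        if np.hypot(x[1], x[2]) > 4.0:        # Step 6: stand-in terminal condition
            break
    return x

final = run_episode(np.array([1.0, 0.0, 0.0, 0.0]))
```

Mapping the discrete policy output to (speed, heading) setpoints that the two PID channels track is one common way to realise the action-to-control-quantity conversion of claim 4; the patent may convert actions differently.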
Description
Unmanned surface vessel game countermeasure control method, device, program and storage medium based on LW-PPO

Technical Field

The invention relates to the technical field of USV intelligent control and deep reinforcement learning, in particular to an unmanned surface vessel game countermeasure control method, device, program and storage medium based on LW-PPO.

Background

As the scale and complexity of ocean activities continue to grow, unmanned surface vessels play an important role in tasks such as sea-area monitoring, emergency rescue and port inspection. When executing tasks, a USV must complete decisions such as autonomous navigation, avoidance and intervention in complex, changeable ocean environments and under potential countermeasure risks, which places high demands on the real-time performance, stability and robustness of its intelligent control algorithm. Traditional USV control methods mostly depend on manually designed rules and classical controllers such as PID to realize path tracking and obstacle avoidance. Such methods have the advantages of simple structure and easy implementation, but struggle to cope with complex game countermeasure scenarios in high-dimensional state spaces, and in particular lack adaptability and strategy-evolution capability when facing roles with competitive strategies. The development of deep reinforcement learning (Deep Reinforcement Learning, DRL) provides a new technological paradigm for USV autonomous decision making: by letting an agent perform a large amount of interactive training in a simulation environment, a DRL algorithm can automatically learn a near-optimal game strategy.
However, directly applying DRL to USV game countermeasure still faces the following problems: (1) Environmental non-stationarity: in a game countermeasure environment the opponent strategy keeps evolving during training, so from the perspective of a single agent the environment transition probability changes dynamically; the KL-divergence or importance-sampling constraints that conventional PPO relies on easily fail when the supports of the strategy distributions shift or even cease to overlap, causing strategy updates to oscillate or even diverge. (2) Static time-sequence modeling: the evolution of USV heading and speed has continuous-time characteristics, while the traditional Actor-Critic structure built on a multilayer perceptron (Multilayer Perceptron, MLP) is in essence a static mapping; it cannot explicitly model time-sequence dependence and physical response, and has difficulty accurately describing complex maneuvering behavior under delayed feedback. (3) Reward sparsity and slow strategy convergence: in complex game countermeasure scenarios, strong feedback is obtained only at terminal events such as defeating or being defeated, and a simple terminal reward design can hardly guide the agent to learn approach, avoidance and intervention strategies in stages. Therefore, there is a need for a USV game countermeasure control method with continuous-time dynamics modeling capability, strategy-update stability and a curriculum reward design, so as to improve the agent's convergence speed, training stability and practical robustness in a non-stationary countermeasure environment.
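The abstract states that a smooth regularization term based on the Wasserstein distance replaces the conventional KL-divergence constraint. As an illustration of why such a replacement helps with problem (1), the sketch below compares KL divergence against a squared 2-Wasserstein penalty for one-dimensional Gaussian policies. The choice of diagonal-Gaussian policies, the closed forms and the numeric values are assumptions made for this illustration; the patent's exact regularizer is not reproduced here.

```python
import numpy as np

def w2_sq(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between diagonal Gaussians (closed form)."""
    return float(np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2))

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for diagonal Gaussians."""
    return float(np.sum(np.log(sigma2 / sigma1)
                        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
                        - 0.5))

# an "old" policy and a sharply shifted "new" policy with almost disjoint support
mu_old, s_old = np.array([0.0]), np.array([1.0])
mu_new, s_new = np.array([5.0]), np.array([0.05])

w = w2_sq(mu_old, s_old, mu_new, s_new)      # grows only quadratically in the shift
kl = kl_gauss(mu_old, s_old, mu_new, s_new)  # blows up as the overlap vanishes
```

Because the Wasserstein distance is defined through transport cost rather than density ratios, it stays finite and smoothly differentiable even when the two policies barely overlap, which is precisely the regime where the KL constraint destabilises updates in a non-stationary game.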
Disclosure of Invention

The invention aims to solve problems of existing USV deep-reinforcement-learning methods in strongly adversarial, multi-stage game tasks, such as insufficient time-sequence modeling capability, unstable strategy updating and slow convergence. By combining a liquid neural network (LNN) with the Wasserstein distance, it provides an LW-PPO-based unmanned surface vessel game countermeasure control method, device, program and storage medium, together with a matching USV game countermeasure training environment and curriculum reward mechanism, thereby improving the autonomous game countermeasure capability of the USV in complex marine environments. The core technical scheme is as follows. An LW-PPO-based unmanned surface vessel game countermeasure control method comprises the following steps. Step 1, modeling a Markov game process according to the game environment, setting the hidden state of each role, and initializing it. Step 2, acquiring the previous-moment hidden state vector containing the states of all game roles and the current-moment state vector containing the speeds, position coordinates and heading angles of all game roles at the current moment, and inputting them into a liquid neural network for time-sequence feature extraction to obtain the current-moment hidden state vector. Step 3, introducing a strategy network, and inputting the hidden state vecto