CN-121995924-A - Unmanned ship cooperative trapping method with double-layer priority playback and self-supervision assistance
Abstract
The invention relates to the field of cooperative control and intelligent decision-making for unmanned systems, and in particular to an unmanned-ship cooperative trapping method assisted by double-layer priority playback and self-supervision. The method comprises the following steps: constructing a sea-area environment model in which a plurality of unmanned ships cooperatively capture a target; establishing a cooperative control strategy network comprising a centralized training stage and a distributed execution stage; introducing a double-layer priority experience playback mechanism in the centralized training stage; introducing a self-supervised auxiliary task module into the cooperative control strategy network; constructing a multidimensional reward function that comprehensively evaluates the target-approaching behavior, cooperative trapping state and safety constraints of the trapping unmanned boats; and defining trapping success conditions, calculating reward results, outputting control instructions for the trapping unmanned boats according to the reward results, and cooperatively trapping the target unmanned boat. Through double-layer priority experience playback and self-supervised auxiliary tasks, the invention remarkably improves the cooperative trapping efficiency, learning stability and strategy adaptability of multiple unmanned ships in a complex sea area.
Inventors
- HAN XINJIE
- MU DONGDONG
- WANG FEI
- FAN YUNSHENG
- LV DEYU
- MA CHENXU
- GUAN KAIWEN
Assignees
- 大连海事大学 (Dalian Maritime University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-12
Claims (9)
- 1. The unmanned ship cooperative trapping method assisted by double-layer priority playback and self-supervision is characterized by comprising the following steps: constructing a sea-area environment model for cooperative capture by a plurality of unmanned vessels, wherein the sea-area environment model comprises a two-dimensional planar sea area containing a plurality of trapping unmanned vessels, one escaping unmanned vessel and a plurality of static island-reef obstacles, a kinematic model is built for each unmanned vessel, and obstacle constraints are defined; establishing a cooperative control strategy network based on a multi-agent deep reinforcement learning framework, wherein the cooperative control strategy network comprises a centralized training stage and a distributed execution stage, the centralized training stage performs strategy optimization using global state information, and in the distributed execution stage each unmanned ship independently outputs control actions based on its own local observation information; introducing a double-layer priority experience playback mechanism in the centralized training stage, performing priority assessment on collected experience samples of state, action, reward and next state, and performing non-uniform sampling for training according to the result of the priority assessment; introducing a self-supervised auxiliary task module into the cooperative control strategy network, and enhancing the network's capability to represent environment dynamics and cooperative relationships through multi-task joint learning; constructing a multidimensional reward function that comprehensively evaluates the target-approaching behavior, cooperative trapping state and safety constraints of the trapping unmanned ships; and defining trapping success conditions, calculating reward results, outputting control instructions for the trapping unmanned ships according to the reward results, and cooperatively trapping the target unmanned ship.
- 2. The double-layer priority playback and self-supervision assisted unmanned-ship collaborative trapping method according to claim 1, wherein the motion state of the unmanned ship in the kinematic model comprises position coordinates, heading angle, linear speed and angular speed, and the local observation information comprises the boat's own position, speed and heading, the relative distance and azimuth angle to the target, and the relative distances and azimuth angles to teammates.
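The kinematic model of claim 2 can be sketched as a standard planar USV model over the state (x, y, heading, linear speed, angular speed). The claims do not fix the exact equations of motion, so the Euler-integration step, control inputs and time step below are illustrative assumptions only.

```python
import math

def step_usv(state, accel, ang_accel, dt=0.1):
    """Advance one USV state (x, y, heading psi, linear speed v,
    angular speed w) by one Euler step under commanded linear and
    angular accelerations. Illustrative planar model, not the
    patent's exact dynamics."""
    x, y, psi, v, w = state
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    psi = (psi + w * dt) % (2 * math.pi)
    v += accel * dt
    w += ang_accel * dt
    return (x, y, psi, v, w)
```

For example, a boat at the origin heading east at 1 m/s advances about 0.1 m along x in one 0.1 s step.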
- 3. The double-layer priority playback and self-supervision assisted unmanned-boat collaborative trapping method according to claim 1, wherein constructing the cooperative control strategy network based on a multi-agent deep deterministic policy gradient (MADDPG) framework comprises: configuring an Actor network and a centralized Critic network for each trapping unmanned boat; in the centralized training stage, storing global experience in a shared experience playback buffer, with the Critic network guiding the updates of all Actor networks; and in the distributed execution stage, each unmanned ship loading and using only its own Actor network, and independently generating and executing control actions based on local observation information.
- 4. The unmanned ship collaborative trapping method with double-layer priority playback and self-supervision assistance according to claim 3, wherein the input of the Actor network is the boat's own local observation and its output is a deterministic continuous action, and the input of the Critic network is the joint observation and joint action of all unmanned ships and its output is a Q value evaluating the value of the global state and action.
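The centralized-training / distributed-execution split of claims 3 and 4 can be sketched as follows. The linear "networks", dimensions and random weights are stand-in assumptions; the point is only the information flow: each Actor sees local observations only, while the centralized Critic scores the joint observation and joint action.

```python
import math
import random

class LinearActor:
    """Minimal stand-in for an Actor network: maps an agent's local
    observation to a deterministic continuous action via a random
    linear layer with tanh squashing (illustrative only)."""
    def __init__(self, obs_dim, act_dim, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
                  for _ in range(act_dim)]

    def act(self, obs):
        return [math.tanh(sum(wi * o for wi, o in zip(row, obs)))
                for row in self.w]

class LinearCritic:
    """Minimal stand-in for the centralized Critic: scores the joint
    observation and joint action of all boats with a single Q value."""
    def __init__(self, joint_dim, seed=1):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(joint_dim)]

    def q(self, joint_obs, joint_act):
        x = joint_obs + joint_act
        return sum(wi * xi for wi, xi in zip(self.w, x))

# Distributed execution uses only local observations; centralized
# training additionally sees the joint observation and joint action.
n_boats, obs_dim, act_dim = 3, 4, 2
actors = [LinearActor(obs_dim, act_dim, seed=i) for i in range(n_boats)]
critic = LinearCritic(n_boats * (obs_dim + act_dim))
local_obs = [[0.1 * i] * obs_dim for i in range(n_boats)]
actions = [a.act(o) for a, o in zip(actors, local_obs)]   # execution: local only
q_value = critic.q(sum(local_obs, []), sum(actions, []))  # training: global
```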
- 5. The double-layer priority playback and self-supervision assisted unmanned-boat collaborative trapping method according to claim 1, wherein the double-layer priority experience playback mechanism comprises: for each sample i in the experience playback buffer, calculating its global priority p_i^g and local priority p_i^l; the global priority is calculated as p_i^g = (|δ_i| + ε)^α, where δ_i is the temporal-difference error of the sample, ε is a very small positive number preventing zero priority, and α is a hyperparameter adjusting the priority sensitivity; the local priority is calculated as p_i^l = w_d·f_d(d_i) + w_p·f_p(e_i) + w_c·f_c(c_i), where d_i is the current capture-distance feature, e_i is the round-progress feature, c_i is the agent cooperation-degree feature, w_d, w_p and w_c are the weight coefficients of the capture-distance, round-progress and cooperation-degree features respectively, and f_d, f_p and f_c are the mapping functions corresponding to those features; fusing the global priority and the local priority by a weight parameter λ to obtain the comprehensive priority p_i of sample i: p_i = λ·p_i^g + (1 − λ)·p_i^l; calculating the sampling probability of each sample from its comprehensive priority: P(i) = p_i / Σ_j p_j; performing non-uniform sampling from the buffer according to the sampling probability P(i); and introducing importance-sampling weights w_i = (1 / (N·P(i)))^β to correct the sampling deviation, where N is the capacity of the buffer and β is a correction parameter that increases from 0 to 1 over time.
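The double-layer priority scheme of claim 5 can be sketched in a few lines. The claim leaves the feature mapping functions f_d, f_p, f_c unspecified, so here the local features are assumed to be already mapped to scalars in [0, 1]; the hyperparameter values (λ, ε, α, β) are illustrative defaults, not values fixed by the patent.

```python
import random

def comprehensive_priorities(td_errors, local_feats, weights,
                             lam=0.6, eps=1e-6, alpha=0.6):
    """Fuse a TD-error-based global priority with a task-feature-based
    local priority: p_i = lam * (|delta_i| + eps)**alpha
                        + (1 - lam) * (w_d*f_d + w_p*f_p + w_c*f_c).
    local_feats[i] = (distance_feat, progress_feat, cooperation_feat),
    assumed pre-mapped to [0, 1]; weights = (w_d, w_p, w_c)."""
    pri = []
    for delta, (fd, fp, fc) in zip(td_errors, local_feats):
        p_global = (abs(delta) + eps) ** alpha
        p_local = weights[0] * fd + weights[1] * fp + weights[2] * fc
        pri.append(lam * p_global + (1 - lam) * p_local)
    return pri

def sample_indices(pri, k, beta=0.4, rng=None):
    """Non-uniform sampling with P(i) = p_i / sum_j p_j, plus
    importance-sampling weights w_i = (1 / (N * P(i)))**beta
    to correct the resulting bias."""
    rng = rng or random.Random(0)
    total = sum(pri)
    probs = [p / total for p in pri]
    n = len(pri)
    idx = rng.choices(range(n), weights=probs, k=k)
    is_w = [(1.0 / (n * probs[i])) ** beta for i in idx]
    return idx, is_w
```

With equal local features, a sample with a larger TD error receives a strictly higher comprehensive priority and is therefore drawn more often.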
- 6. The double-layer priority playback and self-supervision assisted unmanned-ship collaborative trapping method according to claim 1, wherein the self-supervised auxiliary task module comprises: setting a shared feature encoder in the Actor-Critic network for extracting high-dimensional features from the input observation information; connecting the shared feature encoder in parallel to a main task output head and at least one auxiliary task prediction head, the main task output head outputting strategy actions or Q values, and the auxiliary task prediction head predicting inherent properties of the environment or the task; and performing multi-task learning by jointly optimizing a total loss function L_total = L_RL + Σ_k λ_k·L_k, where L_RL is the reinforcement-learning task loss, L_k is the loss of the k-th auxiliary task, and λ_k is its corresponding weight coefficient.
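The shared-encoder-with-parallel-heads structure and joint loss of claim 6 can be sketched as below. The encoder and heads are stand-in callables rather than neural networks, and the weight values are illustrative; only the wiring (one encoder, one main head, several auxiliary heads, weighted loss sum) follows the claim.

```python
class SharedEncoderModel:
    """Sketch of claim 6's structure: one shared feature encoder feeds
    a main head (policy action / Q value) and several auxiliary
    prediction heads in parallel."""
    def __init__(self, encoder, main_head, aux_heads):
        self.encoder = encoder
        self.main_head = main_head
        self.aux_heads = aux_heads

    def forward(self, obs):
        z = self.encoder(obs)                     # shared features
        return self.main_head(z), [h(z) for h in self.aux_heads]

def total_loss(rl_loss, aux_losses, aux_weights):
    """Joint multi-task objective: L_total = L_RL + sum_k lambda_k * L_k."""
    return rl_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))
```

Usage: with a doubling encoder and sum/min/max heads, `SharedEncoderModel(lambda o: [2 * x for x in o], sum, [min, max]).forward([1.0, 2.0])` routes the same features z = [2.0, 4.0] to every head.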
- 7. The unmanned ship collaborative trapping method with double-layer priority playback and self-supervision assistance according to claim 6, wherein the auxiliary tasks comprise a target position prediction task for predicting the coordinates of the target ship at the next time step, a trapping step-number prediction task for predicting the remaining number of time steps required from the current state to successful trapping, an encirclement-quality evaluation task for evaluating the surrounding effect of the current trapping-ship formation on the target, and an agent importance evaluation task for evaluating the contribution weights of the trapping ships in the current situation.
- 8. The method of claim 1, wherein the multidimensional reward function comprises a proximity reward for encouraging approach to the target and maintenance of a correct heading, a formation reward for encouraging an evenly angularly distributed enclosure, a partial-capture reward for rewarding effective intermediate capture states, a time reward for encouraging quick completion, and a collision penalty for enforcing the safety constraint.
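One possible composition of claim 8's reward terms is sketched below. The patent does not disclose weights or functional forms, so every coefficient, the uniform-spacing formation measure and the fixed capture bonus / collision penalty are assumptions chosen only to make the structure concrete.

```python
import math

def multidim_reward(dists, bearings, collided, done,
                    w=(1.0, 0.5, 0.2), capture_r=2.0):
    """Illustrative one-step reward combining claim 8's terms:
    approach (closer is better), formation (bearings around the target
    evenly spread), capture bonus, time penalty, collision penalty.
    dists/bearings are per trapping boat, relative to the target."""
    approach = -w[0] * sum(dists) / len(dists)
    # Formation: penalize deviation of adjacent bearing gaps from the
    # ideal uniform spacing 2*pi/n.
    n = len(bearings)
    b = sorted(th % (2 * math.pi) for th in bearings)
    gaps = [(b[(i + 1) % n] - b[i]) % (2 * math.pi) for i in range(n)]
    ideal = 2 * math.pi / n
    formation = -w[1] * sum(abs(g - ideal) for g in gaps) / n
    capture = 10.0 if done and all(d < capture_r for d in dists) else 0.0
    time_pen = -w[2]          # constant per-step cost rewards speed
    collision = -5.0 if collided else 0.0
    return approach + formation + capture + time_pen + collision
```

A tight, evenly spaced, collision-free final configuration thus scores far higher than a distant, clustered one that collided.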
- 9. The double-layer priority playback and self-supervision assisted unmanned-ship collaborative trapping method according to claim 1, wherein the trapping success condition requires that the following conditions are satisfied simultaneously: the distances between all trapping unmanned boats and the target unmanned boat are smaller than a preset capture distance; the azimuth-angle difference between any two adjacent trapping unmanned boats, relative to the target unmanned boat, is no more than a preset threshold; and throughout the trapping process no collision occurs between unmanned boats or between an unmanned boat and an obstacle.
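The three success conditions of claim 9 translate into a simple terminal check. The angular condition is interpreted here as "no adjacent bearing gap, as seen from the target, exceeds a threshold" (which prevents escape corridors); that reading and all threshold names are assumptions, since the claim's numeric symbol did not survive extraction.

```python
import math

def trapping_success(dists, bearings, capture_dist, max_gap, any_collision):
    """Check claim 9's conditions: every trapping boat within
    capture_dist of the target, the largest angular gap between
    adjacent boats (as seen from the target) at most max_gap, and
    no collision during the episode."""
    if any_collision or any(d >= capture_dist for d in dists):
        return False
    b = sorted(th % (2 * math.pi) for th in bearings)
    n = len(b)
    gaps = [(b[(i + 1) % n] - b[i]) % (2 * math.pi) for i in range(n)]
    return max(gaps) <= max_gap
```

Three boats spread 120° apart inside the capture radius succeed; the same boats bunched on one side of the target do not, because the far-side gap stays open.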
Description
Unmanned ship cooperative trapping method with double-layer priority playback and self-supervision assistance
Technical Field
The invention relates to the field of unmanned-system cooperative control and intelligent decision-making, and in particular to a double-layer priority playback and self-supervision assisted unmanned-ship cooperative capture method.
Background
With the growing demands of ocean resource development and offshore safety, unmanned surface vessels (USVs) play an increasingly important role as intelligent ocean equipment in tasks such as reconnaissance, monitoring and cooperative operations. The multi-USV cluster collaborative capture task, in which a plurality of USVs cooperatively capture one escaping target ship, is a key scenario for testing the intelligent cooperative capability of a cluster. The key to success is that, in a dynamic, uncertain and complex environment, the cluster must generate and execute in real time a joint strategy that coordinates individual behaviors, optimizes overall efficiency and simultaneously satisfies multiple constraints. Traditional cooperative control methods, such as PID control, sliding-mode control or optimization-based methods, often suffer from high computational complexity, poor adaptability and difficulty of online optimization when handling highly nonlinear, strongly coupled multi-agent dynamic games. In recent years, multi-agent deep reinforcement learning has provided a new approach to such problems. The multi-agent deep deterministic policy gradient algorithm and its variants have become research hot spots by virtue of their ability to handle continuous action spaces and to realize centralized training with distributed execution.
However, existing MADDPG-based unmanned-boat containment studies still have significant shortcomings. First, most studies do not fully consider the complex constraints and disturbances caused by dense island barriers in marine environments, and the algorithms strike an insufficient balance between obstacle avoidance and containment. The agent's observation space often lacks explicit, multi-directional perception information about obstacles, so the learned strategy has insufficient obstacle-avoidance capability in complex terrain and is prone to local deadlock or collision. Second, existing methods are generally based on simple distance-chase rewards. Such reward signals are sparse and single-dimensional, and they can hardly elicit simultaneously the multiple desired behaviors of "approaching the target", "forming an enclosure", "maintaining a safe distance", "avoiding collisions" and "efficiently exploiting terrain". The agent easily learns simple trailing rather than true collaborative enclosure, or learns inefficiently and converges with difficulty because of sparse rewards. Third, in sparse-reward and complex environments, the sample efficiency and training stability of the algorithms face challenges, and it is difficult for the agent to learn a refined collaborative strategy. Therefore, a control method is urgently needed that can effectively handle complex obstacle environments, provide dense and multidimensional reward guidance, improve sample efficiency and training stability, and ultimately realize efficient and robust multi-USV collaborative trapping.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a double-layer priority playback and self-supervision assisted unmanned-ship collaborative trapping method, which remarkably improves the collaborative trapping efficiency, learning speed and strategy robustness of an unmanned-ship cluster in a complex island sea area by improving the learning framework and training mechanism. The technical scheme adopted by the invention specifically comprises the following steps: constructing a sea-area environment model for cooperative capture by a plurality of unmanned vessels, wherein the sea-area environment model comprises a two-dimensional planar sea area containing a plurality of trapping unmanned vessels, one escaping unmanned vessel and a plurality of static island-reef obstacles, a kinematic model is built for each unmanned vessel, and obstacle constraints are defined; establishing a cooperative control strategy network based on a multi-agent deep reinforcement learning framework, wherein the cooperative control strategy network comprises a centralized training stage and a distributed execution stage, the centralized training stage performs strategy optimization using global state information, and in the distributed execution stage each unmanned ship independently outputs control actions based on its own local observation information; Introducing a double-la