CN-122021916-A - Pulse spacecraft cluster game decision method based on TD3-VR algorithm

CN122021916A

Abstract

The invention discloses a pulse spacecraft cluster game decision method based on a TD3-VR algorithm, belonging to the technical field of aerospace. The method addresses the difficulty of generating strategies for both players in orbital game scenarios characterized by multiple participating spacecraft, multiple tasks, multiple constraints, and non-fixed impulse intervals. First, a mathematical model of the spacecraft cluster game scenario is constructed for each task involved in the scenario. Then, a progressive training method for the attacking spacecraft's rapid trajectory-planning network is designed based on the TD3-VR algorithm, and a corresponding value network and policy network are trained for each task. All the neural networks are then merged through knowledge distillation and used for game decision making: task allocation and attack trajectory planning for the attacker, and threat assessment and escape trajectory planning for the defender. The invention features fast decision making, low computing-performance requirements, strong interpretability, high reliability, and good real-time performance.

Inventors

  • SHI PENG
  • GUO SIYUAN
  • XIE QINGCHAO
  • JIN LEI
  • WU DI

Assignees

  • Beihang University (北京航空航天大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-09

Claims (10)

  1. A pulse spacecraft cluster game decision method based on a TD3-VR algorithm, characterized by comprising the following steps: Step 1: establish an orbital dynamics model for the game scenario, and establish mathematical models of the task-completion conditions for the rendezvous, interference, and reconnaissance tasks respectively; Step 2: design a twin-delayed deep deterministic policy gradient algorithm with vectorized experience replay (TD3-VR), splitting the reward into several components stored in vector form, which are weighted and summed during training so that the reward can be adjusted dynamically by changing the weight vector; Step 3: design the state variables, action variables, constraints, and reward functions for reinforcement-learning training, taking an attacking spacecraft as the controlled object; Step 4: design an initial-state generation method and train the policy network and value network in stages through progressive training, including gradually increasing the complexity of the initial state, adjusting the evader's maneuver mode and magnitude, and adjusting the reward weights; Step 5: merge the policy networks and value networks trained for the different tasks into a single neural network through knowledge distillation, introducing a Dropout layer to evaluate output reliability; Step 6: allocate task targets to the attacking spacecraft based on the value-network output, and generate attack maneuver strategies through the policy network; Step 7: evaluate threat indices for the evading spacecraft based on the value-network output, and generate evasion maneuver strategies along the negative gradient direction of the value network; and Step 8: evaluate the confidence of the strategy, analyze its reliability, and execute it only after reliability is ensured.
  2. The method according to claim 1, wherein in step 1, the rendezvous task is complete when the relative distance and velocity between the attacking and evading parties meet set thresholds; the reconnaissance task is complete when the relative illumination angle and distance between the attacking and evading parties meet set thresholds; and the interference task is complete when the relative distance and line-of-sight angle between the attacking and evading parties meet set thresholds.
  3. The method according to claim 1, wherein in step 2, the vectorized experience replay stores the reward vector and the weight vector, and dynamically computes the scalar reward during training.
  4. The method according to claim 1, wherein in step 3, the state variables include spacecraft position, velocity, orbital elements, and task-related parameters; the action variables include impulsive maneuver components and time intervals; the constraints are enforced by transforming the policy-network output; and the reward function consists of regular rewards, task rewards, fuel rewards, and guidance rewards.
  5. The method according to claim 1, wherein in step 4, initial states of the training environment are generated by reverse orbit extrapolation, and the progressive training comprises: gradually widening the ranges of the semi-major axis and eccentricity of the reference orbit; gradually increasing the initial relative distance by adjusting the initial-state generation parameters; adding random maneuvers to the target and gradually increasing their magnitude; adding maneuvers along the negative gradient direction of the value network to the target and gradually increasing their magnitude; and setting the fuel-reward and guidance-reward weights to zero and training to convergence.
  6. The method according to claim 1, wherein in step 5, the neural network obtained by knowledge distillation comprises a shared basic input layer and action input layer, task-specific additional input and output layers, and a Dropout layer, and confidence is evaluated by Monte-Carlo Dropout.
  7. The method according to claim 1, wherein in step 6, task allocation is performed by computing a task-index matrix and selecting the combination that maximizes the total task index, and any attacker left without a target is allocated the target with its highest task index.
  8. The method according to claim 1, wherein in step 7, the threat index of an evading spacecraft is obtained by summing task indices; when the threat index exceeds a threshold, an evasion maneuver along the negative gradient direction of the value network is generated, and the maneuver magnitude is adjusted dynamically according to changes in the threat.
  9. An electronic device, comprising: one or more processors; and a memory storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the pulse spacecraft cluster game decision method based on the TD3-VR algorithm of any one of claims 1-8.
  10. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the pulse spacecraft cluster game decision method based on the TD3-VR algorithm of any one of claims 1-8.
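The vectorized experience replay of claims 1 and 3 can be illustrated with a minimal sketch: reward components are stored as vectors, and the training scalar is formed only at sample time by a dot product with the current weight vector, so changing the weights re-weights all stored experience. All class and parameter names below are illustrative, not taken from the patent:

```python
import numpy as np

class VectorizedReplayBuffer:
    """Illustrative replay buffer that stores per-component reward vectors
    instead of scalar rewards, so the scalar training reward can be
    recomputed with new weights at sample time (the TD3-VR idea)."""

    def __init__(self, capacity, state_dim, action_dim, reward_dim):
        self.capacity = capacity
        self.size = 0
        self.ptr = 0
        self.states = np.zeros((capacity, state_dim))
        self.actions = np.zeros((capacity, action_dim))
        self.rewards = np.zeros((capacity, reward_dim))  # vector rewards
        self.next_states = np.zeros((capacity, state_dim))
        self.dones = np.zeros(capacity)

    def add(self, s, a, r_vec, s2, done):
        i = self.ptr
        self.states[i], self.actions[i] = s, a
        self.rewards[i], self.next_states[i], self.dones[i] = r_vec, s2, done
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, weights, rng=np.random):
        idx = rng.choice(self.size, batch_size, replace=False)
        # The scalar reward is formed only now, so adjusting `weights`
        # between training stages re-weights all stored experience.
        r = self.rewards[idx] @ weights
        return (self.states[idx], self.actions[idx], r,
                self.next_states[idx], self.dones[idx])
```

In a progressive training schedule (claim 5), the same buffer contents can thus be reused when, e.g., the fuel-reward weight is raised from zero to a positive value.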

Description

Pulse spacecraft cluster game decision method based on TD3-VR algorithm

Technical Field

The invention belongs to the technical field of aerospace, and particularly relates to a pulse spacecraft cluster game decision method based on a TD3-VR algorithm.

Background

Orbital game theory combines spacecraft orbital dynamics with game theory. Early studies focused primarily on single-spacecraft maneuver trajectory planning or obstacle avoidance. When the maneuvers of both spacecraft are considered simultaneously, the problem becomes a two-sided continuous dynamic adversarial control problem, known as the spacecraft pursuit-evasion game. Early work on the on-orbit game problem was based mainly on the Clohessy-Wiltshire (CW) dynamics equations and differential game theory, and focused on close-range pursuit-evasion games about near-circular reference orbits. Such studies solve the problem by converting the original game into a two-sided optimal control problem, but are limited to game scenarios with only a few continuous-thrust spacecraft at close range: long distances, multi-spacecraft participation, or impulsive maneuvers significantly increase the complexity of the dynamics model, making solutions based on differential game theory difficult. Methods based on reachable-set theory can handle game problems with long distances, multiple spacecraft, and impulsive maneuvers, and can identify multiple potential Nash equilibria; however, their overall solution time is long, and they struggle to meet the real-time requirements of on-orbit autonomous decision making.
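The CW (Hill) relative-motion model mentioned above has a well-known closed-form solution. The sketch below propagates a relative state through the standard CW state-transition matrix and applies an impulsive maneuver, illustrating the kind of dynamics an impulsive orbital game builds on (function names are illustrative, not from the patent):

```python
import numpy as np

def cw_stm(n, t):
    """State-transition matrix of the Clohessy-Wiltshire (Hill) equations
    for state [x, y, z, vx, vy, vz] (x radial, y along-track, z cross-track),
    where n is the chief orbit's mean motion. Standard closed-form solution
    of: x'' - 2n y' - 3n^2 x = 0, y'' + 2n x' = 0, z'' + n^2 z = 0."""
    s, c = np.sin(n * t), np.cos(n * t)
    return np.array([
        [4 - 3*c,      0, 0,    s/n,          2*(1 - c)/n,      0],
        [6*(s - n*t),  1, 0,   -2*(1 - c)/n,  (4*s - 3*n*t)/n,  0],
        [0,            0, c,    0,            0,                s/n],
        [3*n*s,        0, 0,    c,            2*s,              0],
        [-6*n*(1 - c), 0, 0,   -2*s,          4*c - 3,          0],
        [0,            0, -n*s, 0,            0,                c],
    ])

def propagate_with_impulse(state, dv, n, dt):
    """Apply an impulsive velocity change, then coast for dt under CW dynamics."""
    state = np.asarray(state, dtype=float).copy()
    state[3:] += dv  # an impulse changes velocity only, not position
    return cw_stm(n, dt) @ state
```

As the Background notes, this linearized model is only valid for close-range motion about a near-circular reference orbit, which is precisely the limitation that motivates the learning-based approach of the invention.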
With improvements in the parallel computing performance of computers, deep reinforcement learning algorithms have attracted wide attention for their effectiveness on complex sequential decision problems. They handle orbital game problems with long distances and impulsive maneuvers well, and their computing cost is concentrated in the training stage, while the execution stage requires little computing power, making them suitable for real-time operation on an on-board computer. However, existing research focuses mainly on the short-range relative-motion stage; for cluster game problems with long initial distances, large numbers of spacecraft, complex constraints, and non-fixed impulse intervals, reliability problems caused by difficult training convergence or decision overfitting remain, so further research and improvement are needed.

Disclosure of Invention

To solve these technical problems, the invention provides a pulse spacecraft cluster game decision method based on a TD3-VR algorithm, which comprises a training-environment design and a progressive training method for deep reinforcement learning, takes into account various engineering requirements including neural-network confidence analysis and knowledge distillation, and enables situation assessment and autonomous generation of attack/evasion strategies.
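The confidence analysis mentioned above is, per claim 6, a Monte-Carlo Dropout evaluation: dropout is kept active at inference, the network is run many times on the same input, and the spread of the outputs serves as an uncertainty measure. A minimal NumPy sketch, in which a single dropout layer is a hypothetical stand-in for the distilled value/policy network:

```python
import numpy as np

def mc_dropout_confidence(forward, x, n_samples=200, seed=None):
    """Monte-Carlo Dropout confidence estimate: run a stochastic forward
    pass (with dropout left active) many times; the sample mean is the
    prediction and the sample standard deviation its uncertainty."""
    rng = np.random.default_rng(seed)
    samples = np.array([forward(x, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

def make_dropout_layer(w, b, p=0.1):
    """Single dense layer whose output units are randomly dropped with
    probability p even at inference time (the MC-Dropout trick), using
    inverted-dropout scaling so the expected output is unchanged."""
    def forward(x, rng):
        mask = rng.random(w.shape[0]) >= p
        return (w @ x) * mask / (1.0 - p) + b
    return forward
```

In the method of the invention, a strategy whose output spread exceeds a reliability threshold would be rejected rather than executed (step 8 of claim 1).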
To achieve the above purpose, the invention adopts the following technical scheme. A pulse spacecraft cluster game decision method based on a TD3-VR algorithm comprises the following steps: Step 1: establish an orbital dynamics model for the game scenario, and establish mathematical models of the task-completion conditions for the rendezvous, interference, and reconnaissance tasks respectively; Step 2: design a twin-delayed deep deterministic policy gradient algorithm with vectorized experience replay (TD3-VR), splitting the reward into several components stored in vector form, which are weighted and summed during training so that the reward can be adjusted dynamically by changing the weight vector; Step 3: design the state variables, action variables, constraints, and reward functions for reinforcement-learning training, taking an attacking spacecraft as the controlled object; Step 4: design an initial-state generation method and train the policy network and value network in stages through progressive training, including gradually increasing the complexity of the initial state, adjusting the evader's maneuver mode and magnitude, and adjusting the reward weights; Step 5: merge the policy networks and value networks trained for the different tasks into a single neural network through knowledge distillation, introducing a Dropout layer to evaluate output reliability; Step 6: allocate task targets to the attacking spacecraft based on the value-network output, and generate attack maneuver strategies through the policy network; Step 7: evaluate threat indices for the evading spacecraft based on the value-network output, and generate evasion maneuver strategies along the negative gradient direction of the value network; and Step 8: evaluate the confidence of the strategy, analyze its reliability, and execute it only after reliability is ensured.
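The task-allocation rule of step 6 (claim 7) can be sketched as a small assignment search: compute a task-index matrix (attackers by targets), then pick the assignment maximizing the total index. Brute force over permutations is adequate for the small cluster sizes in this setting; the `index_matrix` here is a hypothetical stand-in for the value-network outputs:

```python
import numpy as np
from itertools import permutations

def allocate_tasks(index_matrix):
    """Assign each attacking spacecraft (row) a distinct target (column) so
    that the total task index is maximized. Illustrative brute-force search;
    assumes no more attackers than targets."""
    m = np.asarray(index_matrix, dtype=float)
    n_att, n_tgt = m.shape
    best_total, best_perm = -np.inf, None
    for perm in permutations(range(n_tgt), n_att):
        total = sum(m[i, j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total
```

Per claim 7, any attacker left without a target after this step would simply be given the target with its individually highest task index; for larger clusters, a polynomial-time assignment solver (e.g. the Hungarian algorithm) would replace the brute-force loop.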