CN-121525779-B - Reinforcement learning satellite pursuit method and system based on orbit action prejudgment

CN121525779B

Abstract

The invention discloses a reinforcement learning satellite pursuit method and system based on orbit action prejudgment. The method comprises: constructing a SAC policy network and designing a suitable loss function for back propagation to train the model's initial policy; introducing a curiosity mechanism into the SAC algorithm, using the error between a fully connected network's prediction and the actual outcome as a reward signal to encourage the agent to explore the action space; constructing an input vector format and a data set for Transformer training, applying positional encoding to the input vector at input time, and designing a loss function for the Transformer's back propagation; and intercepting the Transformer's output as part of the SAC algorithm's input, reconstructing the SAC observation, so that the two form an integrated decision network. The invention solves the problem of the orbit-changing pursuit-evasion game between satellites using pulse maneuvers.

Inventors

  • WU XIANG
  • SHEN GANG
  • ZHANG BAOHENG
  • LIAO MINGRUI
  • CHEN HUANLE
  • ZHAO GAOPENG
  • BO YUMING
  • WANG CHAOCHEN

Assignees

  • Nanjing University of Science and Technology
  • Shanghai Institute of Aerospace System Engineering

Dates

Publication Date
2026-05-05
Application Date
2026-01-15

Claims (10)

  1. A reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment, characterized by comprising the following steps: S1, designing the state observations of both parties of the single-agent game and the reward, as the agent's input and the environment's feedback value; S2, constructing and training an initial SAC policy network with the state observation as input, constructing a loss function, and introducing a curiosity mechanism whose signal is fed into the reward as an additional signal; S3, constructing a Transformer model input vector based on the pursuer's observation and the evader's strategy, applying positional encoding to the vector, and generating sequences as the Transformer model's data set; S4, training the Transformer model with the constructed data set, and designing a loss function for back propagation; S5, taking the trained Transformer model's output as part of the SAC policy network's input observation, reconstructing the SAC policy network's input state observation, constructing an integrated decision network, training them jointly, and designing a loss function for back propagation to obtain an iteratively trained model; and S6, obtaining the orbital transfer quantity with the trained integrated decision network, and executing the orbit-change maneuver based on it so that the pursuing satellite chases the escaping satellite.
  2. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 1, wherein the state observation includes the position and velocity of a reference satellite in the Earth inertial coordinate system, a motion description of the pursuing satellite in the reference coordinate system, a motion description of the escaping satellite in the reference coordinate system, and the escape action of the escaping satellite.
  3. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 1, wherein the curiosity mechanism uses a four-layer multi-layer perceptron network to predict the next state feature from the current state feature and the action selected by the agent, computes the error between the predicted value and the true value, converts the prediction error into an intrinsic reward, and simultaneously uses the prediction error as the loss value to update the curiosity network.
  4. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 3, wherein the pursuer's reward with the curiosity mechanism added is: r = (d̄_t − d_t) + r_c + λ·r_i; wherein d_t is the inter-satellite distance after time t when the corresponding action is taken, d̄_t is the inter-satellite distance after time t when the action is not taken, r_c denotes the reward for the pursuer successfully catching the evader, r_i is the curiosity-mechanism reward, and λ denotes its weight; the evader's reward is the opposite of the pursuer's reward.
  5. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 4, wherein the capture reward r_c is computed as: .
  6. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 1, wherein the SAC policy network introduces entropy regularization into the Actor-Critic framework and adopts a double-Critic network and a target-network soft-update mechanism.
  7. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 6, wherein the loss function in step S2 comprises an Actor loss function and a Critic network loss function; the Actor loss function is: L_π = E_{s~D, a~π}[α·log π(a|s) − Q_θ(s, a)]; wherein s denotes a state sampled from the experience replay pool D, a denotes an action sampled from the current Actor policy π, Q_θ(s, a) denotes the action value of executing action a in state s, θ is the parameter of the Critic network, α is the temperature coefficient, and π(a|s) is the probability of selecting action a in state s; the Critic network loss function is: L_Q = E_{(s,a,r,s')~D}[(Q_θ(s, a) − y)²]; y = r + γ·E_{a'~π}[min(Q_{θ'_1}(s', a'), Q_{θ'_2}(s', a')) − α·log π(a'|s')]; wherein L_Q is the loss of the value function network Q_θ, θ is the parameter of the value function network, Q_θ(s, a) is the current Critic network's value prediction for executing action a in the current state s, r is the reward obtained after executing action a in state s, γ is the discount factor, a' is an action sampled under policy π in the next state s', min(·, ·) takes the minimum of the two target-network outputs Q_{θ'_1} and Q_{θ'_2}, −α·log π(a'|s') is the SAC entropy regularization term, π(a'|s') is the probability of selecting a' in state s', and α is the temperature coefficient.
  8. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 1, wherein the Transformer model shares the value projection parameter matrices of the two multi-head self-attention mechanisms in the encoder, the multi-head self-attention mechanism in the decoder, and the masked self-attention mechanism, and simultaneously sets a sliding window to eliminate, from the Transformer model's data set, old data that no longer conforms to the data distribution.
  9. The reinforcement learning satellite pursuit-evasion method based on orbit action prejudgment according to claim 8, wherein the loss function of the Transformer model is: L = (1/N)·Σᵢ (ŷᵢ − yᵢ)²; wherein ŷ is the Transformer's predicted result and y is the real trajectory data.
  10. A reinforcement learning satellite pursuit-evasion system implementing the method of any one of claims 1-9, comprising: a state observation and reward design unit, which designs the state observations of both parties of the single-agent game and the reward as the agent's input and the environment's feedback value; a SAC policy network training unit, which takes the state observation as input, constructs and trains an initial SAC policy network, constructs a loss function, introduces a curiosity mechanism, and feeds its signal into the reward as an additional signal; a data set construction unit, which constructs a Transformer model input vector based on the pursuer's observation and the evader's strategy, applies positional encoding to the vector, and generates sequences as the Transformer model's data set; a Transformer model training unit, which trains the Transformer model with the constructed data set and designs a loss function for back propagation; an integrated decision network construction unit, which takes the trained Transformer model's output as part of the SAC policy network's input observation, reconstructs the SAC input state observation, constructs an integrated decision network, performs joint training, designs a loss function for back propagation, and obtains an iteratively trained model; and an output unit, which obtains the orbital transfer quantity with the trained integrated decision network and executes the orbit-change maneuver based on it to pursue the satellite.
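As an illustration of the Actor and Critic losses described in claim 7, the following is a minimal NumPy sketch of the standard SAC objectives; the function names, default coefficients, and toy values are illustrative assumptions, not part of the patent:

```python
import numpy as np

def critic_target(r, q1_next, q2_next, log_pi_next, gamma=0.99, alpha=0.2):
    """Soft Bellman target: y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    return r + gamma * (np.minimum(q1_next, q2_next) - alpha * log_pi_next)

def critic_loss(q_pred, y):
    """Mean squared error between the Critic's prediction Q(s, a) and the target y."""
    return float(np.mean((q_pred - y) ** 2))

def actor_loss(log_pi, q_min, alpha=0.2):
    """Actor objective E[alpha * log pi(a|s) - Q(s, a)], minimised by gradient descent."""
    return float(np.mean(alpha * log_pi - q_min))

# Toy single transition: reward 1.0, target Critics output 2.0 and 3.0.
y = critic_target(r=1.0, q1_next=2.0, q2_next=3.0, log_pi_next=0.0,
                  gamma=0.5, alpha=0.2)   # 1.0 + 0.5 * min(2.0, 3.0) = 2.0
```

Taking the minimum of the two target Critic outputs is the double-Critic trick of claim 6, which curbs value overestimation; the entropy term weighted by α is what distinguishes SAC from a plain Actor-Critic.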

Description

Reinforcement learning satellite pursuit method and system based on orbit action prejudgment

Technical Field

The invention belongs to the field of artificial-intelligence satellite orbital pursuit applications, and particularly relates to a reinforcement learning satellite pursuit method and system based on orbit action prejudgment.

Background

Satellites operating in space are affected by the perturbations of various celestial bodies, so the simple two-body mathematical model describing satellite motion is no longer adequate. In addition to the gravitational effect of the Earth's oblateness, the gravitational influence of other celestial bodies must also be taken into account, which makes mathematical models of satellite motion extremely complex. In the field of satellite motion modeling, traditional methods build mathematical expressions based on a physical model and solve them by optimization. However, the traditional approach has two obvious defects: on one hand, it generally relies on heuristic algorithms such as particle swarm optimization, whose iterative solving process consumes considerable time and can hardly meet the high real-time requirements of satellite game scenarios; on the other hand, it is designed only around the two-body model, excluding the perturbation effects of other celestial bodies from the calculation, so the model's accuracy is limited. In recent years, deep learning, one of the core technologies of artificial intelligence, has developed rapidly. Its key mechanism is to simulate the operation of human brain neurons and accurately extract key features from complex data. Reinforcement learning, an important branch of deep learning, has gradually become a key component of task decision algorithm systems.
The core idea of this technology is that an agent continuously interacts with the environment by trial and error, and an optimal decision strategy is obtained through continuous iterative optimization. It does not need to rely on a specific mathematical model, can effectively handle the difficulty of complex satellite motion models, and can also meet the strict real-time requirements of satellite decision scenarios. In recent years, some researchers have applied reinforcement learning to orbital pursuit-evasion, but these methods do not consider that the satellite's strategy changes during strategy evolution, which leads to larger variance in the pursuer's strategy and makes convergence difficult.

Disclosure of Invention

The invention aims to provide a reinforcement learning satellite pursuit method and system based on orbit action prejudgment, which solve the problem of orbit-changing pursuit using pulse maneuvers between satellites, enable the tracking satellite to continuously approach the escaping satellite during the game, and make more accurate decisions.
The technical solution that realizes the purpose of the invention is as follows. A reinforcement learning satellite pursuit method based on orbit action prejudgment comprises the following steps: S1, designing the state observations of both parties of the single-agent game and the reward, as the agent's input and the environment's feedback value; S2, constructing and training an initial SAC policy network with the state observation as input, constructing a loss function, and introducing a curiosity mechanism whose signal is fed into the reward as an additional signal; S3, constructing a Transformer model input vector based on the pursuer's observation and the evader's strategy, applying positional encoding to the vector, and generating sequences as the Transformer model's data set; S4, training the Transformer model with the constructed data set, and designing a loss function for back propagation; S5, taking the trained Transformer model's output as part of the SAC policy network's input observation, reconstructing the SAC policy network's input state observation, constructing an integrated decision network, training them jointly, and designing a loss function for back propagation to obtain an iteratively trained model; and S6, obtaining the orbital transfer quantity with the trained integrated decision network, and executing the orbit-change maneuver based on it so that the pursuing satellite chases the escaping satellite. Further, the state observations include the position and velocity of the reference satellite in the Earth inertial frame, the motion description of the pursuing satellite in the reference frame, the motion description of the escaping satellite in the reference frame, and the escape action of the escaping satellite. Further, the curiosity mecha
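The curiosity mechanism described in claims 3-4 (a forward model that predicts the next state feature, with its prediction error serving both as the intrinsic reward and as the model's training loss) can be sketched as follows; the layer widths, activation function, and reward scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class ForwardModel:
    """Four-layer MLP predicting the next state feature from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        dims = [state_dim + action_dim, hidden, hidden, hidden, state_dim]
        self.W = [rng.normal(0.0, 0.1, size=(i, o))
                  for i, o in zip(dims[:-1], dims[1:])]

    def predict(self, state, action):
        x = np.concatenate([state, action])
        for W in self.W[:-1]:
            x = np.tanh(x @ W)          # hidden layers
        return x @ self.W[-1]           # linear output: predicted next state feature

def intrinsic_reward(model, state, action, next_state, scale=0.5):
    """Squared prediction error: used both as the curiosity reward r_i
    and as the loss value that updates the forward model."""
    err = model.predict(state, action) - next_state
    return scale * float(np.mean(err ** 2))

model = ForwardModel(state_dim=6, action_dim=3)
s, a, s_next = rng.normal(size=6), rng.normal(size=3), rng.normal(size=6)
r_i = intrinsic_reward(model, s, a, s_next)   # non-negative curiosity bonus
```

The bonus is largest where the model predicts poorly, i.e. in unfamiliar regions of the state-action space, which is what encourages the agent to explore; as the forward model improves, the bonus for well-visited transitions decays toward zero.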