CN-121982192-A - Predictive control method, medium and equipment based on world model and online fine tuning

CN121982192A

Abstract

The invention discloses a predictive control method, medium and equipment based on a world model and online fine-tuning, belonging to the technical field of control algorithms. The method comprises: acquiring multi-modal environment observation data, wherein the environment observation data comprises point cloud data, RGB images and IMU attitude-angle information; inputting the environment observation data into a trained variational auto-encoder network to predict a latent state; using a trained recurrent state-space model to predict, from the latent state and candidate action sequences, the latent-state sequences of multiple subsequent steps; decoding the predicted latent-state sequences into occupancy-grid maps; and, for the current action and the candidate action sequences, searching the action space for the optimal action sequence with minimization of the total cost function of each trajectory as the optimization objective. The invention improves the efficiency and accuracy of prediction.

Inventors

  • Chen Guilan
  • Zhao Guoru
  • Diao Yanan
  • Xie Yingrui
  • You Zijing
  • Wu Peixi
  • Jiang Xiantai
  • Zhou Yongqiang
  • Ning Yunkun

Assignees

  • Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)

Dates

Publication Date
2026-05-05
Application Date
2025-12-05

Claims (10)

  1. A predictive control method based on a world model and online fine-tuning, comprising the following steps: acquiring multi-modal environment observation data, wherein the environment observation data comprises point cloud data, RGB images and IMU attitude-angle information; inputting the environment observation data into a trained variational auto-encoder network to predict a latent state; for the predicted latent state and candidate action sequences, predicting the latent-state sequences of multiple subsequent steps using a recurrent state-space model; and, for the current action and the candidate action sequences, searching the action space for the optimal action sequence with minimization of the total cost function of each trajectory as the optimization objective.
  2. The method according to claim 1, characterized in that the overall loss function used to train the variational auto-encoder network is set to:

     $L_{total} = L_{recon} + \beta L_{KL} + \lambda_r L_{reward}$

     where $L_{recon}$ is the reconstruction loss of the multi-modal observations, $L_{KL}$ is the latent-space regularization loss, $L_{reward}$ is the prediction loss of the joined reward signal, $\beta$ is the coefficient of the KL-divergence term, and $\lambda_r$ is the reward-prediction weight.
  3. The method of claim 2, wherein the multi-modal observation reconstruction loss is set to:

     $L_{recon} = \lambda_1 L_{pc} + \lambda_2 L_{SSIM} + \lambda_3 L_{IMU}$

     wherein:

     $L_{pc} = \frac{1}{|P|}\sum_{p \in P} \min_{q \in \hat{P}} \|p - q\|_2^2 + \frac{1}{|\hat{P}|}\sum_{q \in \hat{P}} \min_{p \in P} \|q - p\|_2^2$

     $L_{SSIM} = 1 - \mathrm{SSIM}(\hat{I}, I)$

     $L_{IMU} = \frac{1}{T}\sum_{t=1}^{T}\left[(\theta_t - \hat{\theta}_t)^2 + (\phi_t - \hat{\phi}_t)^2 + (x_t - \hat{x}_t)^2 + (y_t - \hat{y}_t)^2\right]$

     where $L_{pc}$ is the point-cloud reconstruction loss, $L_{SSIM}$ is the structural-similarity loss, $L_{IMU}$ is the IMU pose and path reconstruction loss, $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weight coefficients of the corresponding terms, $\hat{P}$ is the reconstructed point cloud, $P$ is the real point cloud, $p$ denotes a three-dimensional point in the real point-cloud set $P$, $q$ denotes a three-dimensional point in the reconstructed point-cloud set $\hat{P}$, $\hat{I}$ is the reconstructed image, $I$ is the true image, $\theta_t$ and $\hat{\theta}_t$ are the true and predicted IMU pitch angles at time step $t$, $\phi_t$ and $\hat{\phi}_t$ are the true and predicted IMU roll angles at time step $t$, $T$ denotes the number of time steps, $x_t$ and $\hat{x}_t$ are the x-coordinates of the real and predicted path points at time step $t$, and $y_t$ and $\hat{y}_t$ are the y-coordinates of the real and predicted path points at time step $t$.
  4. The method of claim 2, wherein the latent-space regularization loss is set to:

     $L_{KL} = D_{KL}\big(q(z_t \mid o_t)\,\|\,\mathcal{N}(0, I)\big) = -\frac{1}{2}\sum_{i=1}^{d}\left(1 + \log \sigma_{t,i}^2 - \mu_{t,i}^2 - \sigma_{t,i}^2\right)$

     where $D_{KL}$ denotes the Kullback-Leibler divergence, $i$ denotes the index of a dimension of the latent vector, $d$ denotes the dimensionality of the latent space, $z_t$ denotes the latent variable in the latent space, $o_t$ denotes the observation input, $\mu_t$ denotes the mean vector of the latent state at time step $t$, and $\sigma_t$ denotes the standard-deviation vector of the latent state at time step $t$.
  5. The method of claim 2, wherein the prediction loss of the joined reward signal is set to:

     $L_{reward} = \left(r_{t+1} - \hat{r}_{t+1}\right)^2$

     where $r_{t+1}$ denotes the true reward signal at time step $t+1$ and $\hat{r}_{t+1}$ denotes the predicted reward signal at time step $t+1$.
  6. The method of claim 1, wherein the total cost function is set to:

     $J = \sum_{k=1}^{H}\Big[w_1\|\hat{p}_{t+k} - p^{ref}_{t+k}\|^2 + w_2\|a_{t+k-1}\|^2 + w_3\max\big(0,\, d_{safe} - d_{obs}(\hat{p}_{t+k})\big) + w_4\|a_{t+k} - a_{t+k-1}\|^2\Big] - c\,\exp\!\left(-\frac{\|\hat{p}_{t+H} - p^{ref}_{t+H}\|^2}{\sigma^2}\right)$

     where the first term is the path-tracking error, the second term is the control-effort cost, the third term is the obstacle-proximity penalty, the fourth term is the motion-smoothness term, and the last term is the terminal-state reward; $\hat{p}_{t+k}$ denotes the current position coordinates extracted from the predicted observation $\hat{o}_{t+k}$, $p^{ref}_{t+k}$ denotes the target position coordinates of the reference path at time step $t+k$, $d_{safe}$ denotes the safety distance, $d_{obs}$ denotes the distance between the predicted position and the nearest obstacle, $c$ is a preset constant, $\sigma$ denotes the width of the terminal-reward distribution, $a_{t+k-1}$ denotes the action at time step $t+k-1$, and $\hat{o}_{t+k}$ denotes the predicted observation at time step $t+k$.
  7. The method of claim 2, further comprising fine-tuning the variational auto-encoder network by an online fine-tuning and contrastive-learning mechanism, wherein the contrastive loss is set to:

     $L_{con} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\mathrm{sim}(z_t^i, z_{t+\delta}^i)/\tau\big)}{\exp\big(\mathrm{sim}(z_t^i, z_{t+\delta}^i)/\tau\big) + \sum_{j=1}^{N}\exp\big(\mathrm{sim}(z_t^i, z_j^-)/\tau\big)}$

     where temporally adjacent latent states $(z_t, z_{t+\delta})$ form positive sample pairs, states at non-adjacent moments randomly selected from the buffer form the negative-sample pool, $B$ denotes the batch size, $\tau$ denotes the temperature coefficient, $\mathrm{sim}(\cdot,\cdot)$ denotes a similarity metric over two states in the latent space, $i$ denotes the index of a positive sample pair in the current batch, $j$ denotes the index of a negative sample in the negative-sample pool, $N$ denotes the number of negative samples in the pool, and $\delta$ denotes the time-step offset used to construct positive sample pairs.
  8. The method of claim 7, wherein a parameter-importance penalty term is added to the overall loss function during the online fine-tuning and contrastive-learning mechanism:

     $L_{EWC} = \frac{\lambda}{2}\sum_{i}\Omega_i\,(\theta_i - \theta_i^{*})^2$

     wherein:

     $\Omega_i = \mathbb{E}_{(x,y)\sim D_{old}}\left[\left(\frac{\partial \log p(y \mid x;\,\theta)}{\partial \theta_i}\right)^2\right]$

     where $\lambda$ denotes the elasticity coefficient, $\theta_i$ denotes the $i$-th training parameter, $\theta_i^{*}$ denotes the value of the corresponding $i$-th parameter from the pre-training phase, $D_{old}$ denotes the old-task data set, $\mathbb{E}$ denotes the expectation, $p(y \mid x;\,\theta)$ denotes the prediction output given input $x$, $x$ denotes an input sample of the old task, $y$ denotes the corresponding old-task label, and $\Omega_i$ denotes the importance index of parameter $\theta_i$.
  9. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
  10. A computer device comprising a memory and a processor, the memory storing a computer program runnable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.
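The receding-horizon search of claim 1 can be sketched as a sampling-based optimizer over candidate action sequences. Below is a minimal cross-entropy-method planner; the `encode`, `rollout` and `cost_fn` interfaces are hypothetical stand-ins for the trained variational auto-encoder, the recurrent state-space model and the total cost function, not the patented implementation.

```python
import numpy as np

def plan_action(encode, rollout, cost_fn, obs, horizon=5, n_candidates=64,
                n_iters=3, n_elite=8, action_dim=2, rng=None):
    """Cross-entropy-method search for the action sequence minimising total
    trajectory cost.  `encode` maps an observation to a latent state,
    `rollout` predicts the latent trajectory of a candidate action sequence,
    and `cost_fn` scores a trajectory (all hypothetical interfaces)."""
    rng = rng or np.random.default_rng(0)
    z0 = encode(obs)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        acts = mean + std * rng.standard_normal((n_candidates, horizon, action_dim))
        costs = np.array([cost_fn(rollout(z0, a), a) for a in acts])
        # Refit the sampling distribution to the lowest-cost candidates.
        elite = acts[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action (receding horizon)
```

At each control step only the first action of the optimized sequence is executed, and the search restarts from the next observation.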
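The point-cloud term of the reconstruction loss in claim 3 is a symmetric Chamfer distance. A minimal NumPy sketch for small clouds (production systems would use k-d trees or GPU kernels):

```python
import numpy as np

def chamfer_distance(p_true, p_rec):
    """Symmetric Chamfer distance between point sets of shape (N,3) and (M,3):
    for each point, the squared distance to its nearest neighbour in the
    other set, averaged over both directions."""
    d2 = np.sum((p_true[:, None, :] - p_rec[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```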
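The latent-space regularization of claim 4 is the closed-form KL divergence between a diagonal-Gaussian posterior and a standard-normal prior; a direct NumPy transcription:

```python
import numpy as np

def kl_diag_gaussian(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    Equivalent to -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))
```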
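The five-term trajectory cost of claim 6 can be sketched as below; the weights `w`, the hinge form of the obstacle penalty and the Gaussian terminal reward are assumptions where the claim leaves the exact shapes open.

```python
import numpy as np

def trajectory_cost(positions, ref_path, actions, obstacle_dist, goal,
                    d_safe=0.5, w=(1.0, 0.1, 1.0, 0.1, 1.0), sigma_goal=1.0):
    """Total cost of one predicted trajectory, mirroring the five terms of
    claim 6.  positions: (H,2) predicted positions; ref_path: (H,2) reference
    positions; actions: (H,2); obstacle_dist: (H,) distance to nearest
    obstacle; goal: (2,) terminal target."""
    track = float(np.sum((positions - ref_path) ** 2))             # path-tracking error
    ctrl = float(np.sum(actions ** 2))                             # control effort
    prox = float(np.sum(np.maximum(0.0, d_safe - obstacle_dist)))  # obstacle proximity
    smooth = float(np.sum(np.diff(actions, axis=0) ** 2))          # action smoothness
    terminal = -np.exp(-float(np.sum((positions[-1] - goal) ** 2)) / sigma_goal ** 2)
    return w[0]*track + w[1]*ctrl + w[2]*prox + w[3]*smooth + w[4]*terminal
```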
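The contrastive loss of claim 7 follows the InfoNCE pattern: the temporally adjacent latent state is the positive, non-adjacent states drawn from the buffer are negatives. A single-anchor sketch, assuming cosine similarity as the metric:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss for one anchor latent state: pull the
    temporally adjacent positive close, push the negatives away."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```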
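The parameter-importance penalty of claim 8 matches the elastic-weight-consolidation form: each parameter is anchored to its pre-training value in proportion to an importance weight (in EWC, the diagonal Fisher information). A minimal sketch:

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Elastic-weight-consolidation penalty: quadratic pull of each parameter
    toward its pre-training value, weighted by its importance index."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_old) ** 2))
```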

Description

Predictive control method, medium and equipment based on world model and online fine tuning

Technical Field

The invention relates to the technical field of control algorithms, in particular to a predictive control method, medium and equipment based on a world model and online fine-tuning.

Background

Ageing and lower-limb dysfunction (motor degenerative diseases, hip or knee replacement, cerebral palsy, sarcopenia, etc.) increase the risk of falls in the elderly. For the elderly and people with lower-limb dysfunction, the design of a highly robust, highly intelligent adaptive control system has become an urgent clinical and community need.

Current predictive control algorithms for complex dynamic scenes mainly follow three schemes. 1) Traditional model predictive control (MPC) with simplified physical modelling: based on a dynamic model constructed offline (such as a two-degree-of-freedom vehicle model), control commands are generated by receding-horizon optimization. This scheme is suitable for trajectory tracking in a low-dimensional state space, but because it depends on an idealized model and linearization assumptions it is difficult to reconcile with high-dimensional sensor data (such as fused IMU and vision data), and the number of optimization variables surges when unstructured obstacles are handled, degrading real-time performance. 2) Hybrid deep reinforcement learning (DRL) and MPC architecture: a policy network trained end-to-end by DRL predicts avoidance paths or emergency-braking commands in dynamic environments and partially replaces the optimization solver in the MPC to accelerate computation. However, the DRL policy is highly sensitive to historical experience data, accumulated prediction errors readily appear in unseen scenarios such as sudden target-pose changes or illumination occlusion, and additional online computing resources are required. 3) Online adaptation: a dynamic parameter estimator (such as a Kalman filter) is introduced to correct model errors in real time and overlay updated control parameters onto the MPC; however, the estimator conflicts with the controller for hardware load, and the resulting delays readily degrade real-time performance and the responsiveness of the control system.

In the prior art, patent application CN118585813A discloses an agent control method based on a world model, comprising the following steps: environment observation data are obtained, and a world model is trained using a training data set. The world model comprises a variational self-encoding module, a sequence-modelling module, a hidden-state prediction module and an optimization module. The variational self-encoding module adds adaptive Gaussian noise to observation data sampled from a replay buffer and then encodes the data to generate latent vectors; the sequence-modelling module generates hidden states from the latent vectors and the action vectors produced by the agent; the hidden-state prediction module generates prediction results from the hidden states; and the optimization module optimizes model parameters according to the prediction results of the hidden-state prediction module. The world model is used to generate imagined trajectories, and the agent determines an optimal policy from the imagined trajectories. However, the imagined trajectories are generated based on the current policy, easily sink into local optima, and can hardly explore the globally optimal path.
Patent application CN119270885A discloses a control method, device and medium for an unmanned system based on a multi-level world model, which can be widely applied in the technical field of unmanned control. In this method, a cloud-side large world model is obtained by training a large world model to be trained with multi-modal data comprising image data, video data or business text data; an edge-side medium world model and an edge-side small world model are then obtained by distillation from the cloud-side large world model. A natural-language instruction from a target object and the agent state are input into the edge-side medium world model to predict a working instruction; the working instruction is input into the edge-side small world model to predict a working-state adjustment instruction for the unmanned system; and the unmanned system is then adjusted according to the working-state adjustment instruction, thereby reducing dependence on human experts. However, in this scheme, knowledge distillation of large to small mod