CN-121989256-A - Intelligent robot action planning method based on reinforcement learning
Abstract
The application provides an intelligent robot action planning method based on reinforcement learning. A policy network obtains scene features and historical actions from the current environment observation and the action at the previous moment. The scene features, the historical actions, and a small time offset are input into a world model that couples a neural radiance field with physical dynamics; the model processes the input and generates an output containing a physical state increment. The policy network then performs a physical-consistency state query over sampling points of the robot workspace: for each sampling point, the physical state increment generated by the world model is added to a static scene feature computed from the scene features and the sampling point position to obtain a predicted feature, and the predicted features are aggregated into a physical-consistency prediction state, from which the action at the current moment is generated. Because the world model produces a physical prediction of the imminent scene state as decision input, and physical-prediction consistency serves as a reward term for co-training the policy, the policy achieves strong zero-shot generalization in unseen real environments and bridges the semantic gap between simulation and reality.
Inventors
- XIA XIN
- LEI LIN
- SHENG LIJUN
- WANG JINGJING
Assignees
- Wuhan Institute of Shipbuilding Technology (武汉船舶职业技术学院)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-04-02
Claims (10)
- 1. An intelligent robot action planning method based on reinforcement learning, characterized by comprising the following steps: a policy network obtains scene features and historical actions from the current environment observation and the action at the previous moment; the scene features, the historical actions, and a small time offset are input into a world model, wherein the world model is a coupled model of a neural radiance field and physical dynamics, and the world model processes the input data to generate an output containing a physical state increment; the policy network generates a physical-consistency prediction state based on a physical-consistency state query over sampling points of the robot workspace, wherein the query comprises, for each sampling point, adding the physical state increment generated by the world model for the historical actions and the small time offset to a static scene feature computed from the scene features and the sampling point position to obtain a predicted feature of the sampling point, and aggregating the predicted features of all sampling points; and the policy network generates the action at the current moment according to the physical-consistency prediction state.
- 2. The method of claim 1, wherein the world model processing the input data to generate an output containing a physical state increment comprises: the world model generates the physical state increment through an internal differentiable neural network module that learns a mapping from the scene features, the spatial coordinates, and the historical actions to a predicted variation of local physical attributes of the scene.
- 3. The method of claim 1, wherein the query comprising, for each sampling point, adding the physical state increment generated by the world model for the historical actions and the small time offset to a static scene feature computed from the scene features and the sampling point position comprises: for each sampling point, computing the static scene feature from the position code of the sampling point and the scene features through a query network; and performing vector addition on the static scene feature and the physical state increment to generate the predicted feature.
- 4. The method of claim 1, wherein the policy network generating the action at the current moment according to the physical-consistency prediction state comprises: computing a physical-prediction-consistency reward value based on the action, the scene features, and the actual environment observation at the next moment; wherein computing the physical-prediction-consistency reward value comprises: generating a predicted next-moment image through differentiable rendering using the world model, the scene features, and the action at the current moment; and processing the predicted next-moment image through an image decoder, comparing the processing result with the real environment observation at the next moment, and taking the negative squared norm of their difference as the physical-prediction-consistency reward value.
- 5. The method of claim 4, further comprising end-to-end co-training: training and updating the policy network using a total reward containing the physical-prediction-consistency reward value, wherein gradients produced by updating the policy network are back-propagated through the physical-consistency prediction state to the parameters of the encoder that generates the scene features, the world model, and a query network.
- 6. The method of claim 5, further comprising, when the end-to-end co-training is performed in a simulation environment: according to the current task objective, and under the implicit guidance of the physical-prediction-consistency reward value, applying randomized perturbations to those environment parameters in the simulation environment that are closely related to the physical interaction of the task.
- 7. The method of claim 5, further comprising, after the end-to-end co-training is completed: fixing the backbone network parameters of the policy network and the world model; treating the scene latent code corresponding to the scene features as an optimizable variable; and iteratively updating the scene latent code by gradient descent using real-time observation data acquired by the real robot.
- 8. The method of claim 3, wherein computing the static scene feature from the position code of the sampling point and the scene features through a query network comprises: applying position coding to the spatial coordinates of the sampling point to obtain a position coding vector; concatenating the position coding vector and the scene features into a combined feature vector; and inputting the combined feature vector into a multi-layer perceptron, which performs a forward computation and outputs the static scene feature.
- 9. The method of claim 4, wherein processing the predicted next-moment image through an image decoder comprises: the image decoder is a lightweight convolutional neural network; and the image decoder performs downsampling and channel transformation on the predicted next-moment image to generate decoded features with the same dimensions as the real environment observation at the next moment.
- 10. The method of claim 2, wherein the world model learning, through an internal differentiable neural network module, a mapping from the scene features, the spatial coordinates, and the historical actions to the predicted variation of local physical attributes of the scene comprises: the differentiable neural network module receives as input the scene latent code corresponding to the scene features, the spatial coordinates, and the historical actions; and the differentiable neural network module transforms the input layer by layer through multiple fully connected layers and nonlinear activation functions, finally outputting a vector representing the predicted variation of local physical attributes of the scene as the physical state increment.
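Claims 3 and 8 together specify the per-sampling-point query: a position code of the point is concatenated with the scene features, passed through a multi-layer perceptron to get a static scene feature, added to the world model's physical state increment, and the results are aggregated. A minimal NumPy sketch; the sinusoidal position code, layer widths, mean aggregation, and the stubbed world-model increment are all illustrative assumptions not fixed by the claims:

```python
import numpy as np

rng = np.random.default_rng(0)

def position_code(xyz, num_freqs=4):
    # Sinusoidal position coding of a 3-D sampling point (claim 8);
    # the frequency count is an assumption.
    angles = xyz[:, None] * (2.0 ** np.arange(num_freqs))[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)]).ravel()  # 24-D

FEAT = 16                # scene feature width (assumption)
PE = 3 * 2 * 4           # position-code width for num_freqs=4
# Query network: a two-layer perceptron (claim 8); widths assumed.
W1 = rng.standard_normal((PE + FEAT, 32)) * 0.1
W2 = rng.standard_normal((32, FEAT)) * 0.1

def static_scene_feature(xyz, scene_feat):
    # Position code ++ scene features -> MLP -> static scene feature.
    h = np.maximum(0.0, np.concatenate([position_code(xyz), scene_feat]) @ W1)
    return h @ W2

def world_model_increment(xyz):
    # Stand-in for the world model's physical state increment at xyz
    # (the real module is the coupled NeRF-dynamics network).
    return 0.01 * np.sin(xyz.sum()) * np.ones(FEAT)

scene_feat = rng.standard_normal(FEAT)
points = rng.uniform(-1.0, 1.0, size=(64, 3))   # workspace sampling points

# Claim 3: vector addition per sampling point, then aggregate over all points.
predicted = [static_scene_feature(p, scene_feat) + world_model_increment(p)
             for p in points]
state = np.mean(predicted, axis=0)   # mean aggregation is an assumption
print(state.shape)                   # (16,)
```

The aggregated `state` is what the policy network would consume as its physical-consistency prediction state; any permutation-invariant pooling could replace the mean.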
Description
Intelligent robot action planning method based on reinforcement learning
Technical Field
The application belongs to the field of robot control, and particularly relates to an intelligent robot action planning method based on reinforcement learning.
Background
In the field of robotics, robot motion planning methods based on Deep Reinforcement Learning (DRL) enable a robot to perform complex motion planning through autonomous learning in unstructured, dynamic real-world environments. Their core idea is to train a policy in a high-fidelity simulation environment and then deploy the trained policy on a real robot, avoiding the safety risks, high cost, and hardware wear of large-scale trial-and-error training on real hardware. To alleviate the "simulation to reality" (Sim2Real) transfer problem, several approaches are commonly adopted: 1) domain randomization, which randomizes environment parameters (e.g., texture, illumination, mass, friction coefficient) during simulation training to broaden the training data distribution, in the hope that the policy covers real-world variation; 2) system identification and domain adaptation, which estimate system parameters or a dynamics model from data collected on the real robot and then calibrate the simulator or adapt the policy; 3) predictive control based on learned models, which learns a simplified dynamics model of the environment or the robot itself (e.g., joint angles, end-effector pose) and uses it for Model Predictive Control (MPC) or multi-step planning; 4) traditional methods based on explicit state estimation, in which perception, state estimation (e.g., SLAM, object recognition), and motion planning are executed in series as independent modules.
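Domain randomization, the first of the approaches above, can be as simple as resampling the physical parameters once per training episode. A minimal sketch; the parameter names and ranges are purely illustrative:

```python
import random

# Illustrative ranges for the kinds of parameters named above
# (mass, friction, illumination); real ranges depend on the task
# and the simulator.
PARAM_RANGES = {
    "mass_kg": (0.5, 2.0),
    "friction_coeff": (0.2, 1.2),
    "illumination": (0.3, 1.0),
}

def randomize_episode(rng):
    # Draw one random physical configuration for a training episode;
    # the simulator would be reset with these values before rollout.
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in PARAM_RANGES.items()}

rng = random.Random(0)
params = randomize_episode(rng)
print(sorted(params))
```

Each episode then sees a different physical configuration, which widens the training distribution the policy must cover.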
However, whether training in simulation and deploying in reality, or using the mitigation approaches above, the state on which the agent's decision is based remains an extraction of static, hand-designed features of the current observation, because these approaches either treat simulation and reality as two independent domains to be aligned, or treat perception, state understanding, physical reasoning, and decision-making as a relatively fragmented pipeline. Such a state is not forced to conform to an internal, coherent, generalizable physical world model. Therefore, when facing new physical situations outside the training data distribution, the policy lacks an internal mechanism for reasoning and generalizing based on physical common sense, which causes defects such as Sim2Real performance degradation and unstable behavior.
Disclosure of Invention
The application aims to overcome the above defects of the prior art and provide an intelligent robot action planning method based on reinforcement learning.
The application provides an intelligent robot action planning method based on reinforcement learning, comprising the following steps: a policy network obtains scene features and historical actions from the current environment observation and the action at the previous moment; the scene features, the historical actions, and a small time offset are input into a world model, wherein the world model is a coupled model of a neural radiance field and physical dynamics, and the world model processes the input data to generate an output containing a physical state increment; the policy network generates a physical-consistency prediction state based on a physical-consistency state query over sampling points of the robot workspace, wherein the query comprises, for each sampling point, adding the physical state increment generated by the world model for the historical actions and the small time offset to a static scene feature computed from the scene features and the sampling point position to obtain a predicted feature of the sampling point, and aggregating the predicted features of all sampling points; and the policy network generates the action at the current moment according to the physical-consistency prediction state. Optionally, the world model processing the input data to generate an output containing a physical state increment includes: the world model generates the physical state increment through an internal differentiable neural network module that learns a mapping from the scene features, the spatial coordinates, and the historical actions to a predicted variation of local physical attributes of the scene.
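The world model's mapping step above (fully connected layers with nonlinear activations taking the scene latent code, spatial coordinates, historical action, and the small time offset, and emitting the physical state increment) can be sketched as a small NumPy network; the layer widths, ReLU activation, 4-D action, and latent width are assumptions the text does not fix:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(sizes):
    # Random fully connected layers; real weights come from training.
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

FEAT = 16   # scene latent code width (assumption)
# Input: latent code ++ 3-D coordinates ++ 4-D action ++ small time offset.
layers = init_layers([FEAT + 3 + 4 + 1, 64, 64, FEAT])

def physical_state_increment(latent, xyz, action, dt):
    x = np.concatenate([latent, xyz, action, [dt]])
    for W, b in layers[:-1]:
        x = np.maximum(0.0, x @ W + b)   # fully connected + ReLU
    W, b = layers[-1]
    return x @ W + b                     # predicted local-attribute variation

delta = physical_state_increment(rng.standard_normal(FEAT),
                                 np.array([0.1, -0.2, 0.3]),
                                 rng.standard_normal(4),
                                 0.02)
print(delta.shape)   # (16,)
```

Because every operation here is differentiable, gradients from the policy loss can flow back through `delta` into the module's weights, which is what the end-to-end co-training of claim 5 relies on.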
Optionally, the query comprising, for each sampling point, adding the physical state increment generated by the world model for the historical actions and the small time offset to a static scene feature computed from the scene features and the sampling point position includes: for each sampling point, computing the static scene feature from the position code of the sampling point and the scene features through a query network to