CN-122021780-A - Hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning
Abstract
The invention discloses a hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning, belonging to the technical field of offline goal-conditioned reinforcement learning and hierarchical decision making. First, since high-level decision making in existing hierarchical offline goal-conditioned reinforcement learning relies on noise-sensitive value function estimates and is therefore markedly fragile in complex environments, a diffusion model based on flow matching is designed to model the sub-goal distribution. Then, at the inference stage, the classifier-free diffusion model realizes an implicit optimality bias and directly plans the optimal sub-goal. Finally, a low-level policy makes decisions according to the current state and the planned sub-goal. The method allows the high-level policy to bypass the unstable value function and directly generate reachable, goal-directed optimal sub-goals with a stable generative model, remarkably improving the robustness of long-horizon planning.
Inventors
- WANG XUESONG
- ZHANG HENGRUI
- CHENG YUHU
Assignees
- China University of Mining and Technology (中国矿业大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260119
Claims (10)
- 1. A hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning, characterized by comprising the following steps: respectively establishing a classifier-free diffusion model network, a low-level policy network, and a state value network, and initializing their parameters; establishing a target network corresponding to the state value network and initializing its parameters, the target network having the same structure as the state value network; training the classifier-free diffusion model, namely iteratively updating the classifier-free diffusion model parameters, the low-level policy parameters, the value network parameters, and the target network parameters using experience samples in an experience pool, so that the loss functions are minimized and the objective function is maximized; generating an optimal sub-goal with the trained classifier-free diffusion model; and having the low-level policy network execute actions according to the current state and the generated optimal sub-goal, thereby obtaining the optimal policy.
- 2. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 1, wherein the network parameters are initialized randomly, and the target network parameters are initialized by directly assigning the state value network parameters to them.
- 3. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 1, wherein the network parameters involve: the diffusion time step $t$; the sub-goal $sg^t$ at diffusion step $t$; the state $s_k$ at step $k$; the action $a_k$ at step $k$; the state $s_{k+1}$ at step $k+1$; the sub-goal $sg$; the goal $g$; the classifier-free diffusion model parameters $\theta$; the low-level policy network parameters $\phi$; and the state value network parameters $\psi$.
- 4. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 1, wherein the action $a_k$ at step $k$ at least comprises lateral and longitudinal vehicle control; the state $s_k$ at step $k$ comprises the vehicle's own information and the surrounding environment; the vehicle's own information at least comprises the vehicle position in a longitude-latitude coordinate system, the vehicle speed in the longitude-latitude coordinate system, and the vehicle heading; and the surrounding environment at least comprises the distance and moving speed of other objects relative to the vehicle, the road structure, and traffic lights.
- 5. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 1, wherein training the classifier-free diffusion model, iteratively updating the classifier-free diffusion model parameters, low-level policy parameters, value network parameters, and target network parameters using experience samples in the experience pool $D$, minimizing the loss functions and maximizing the objective function, comprises the following steps: step 1, using experience samples in the experience pool $D$ to update the classifier-free diffusion model parameters $\theta$, by minimizing the classifier-free diffusion model loss through gradient descent; step 2, using experience samples in $D$ to update the state value network parameters $\psi$, by minimizing the loss function of the state value network through gradient descent; step 3, using experience samples in $D$ to update the low-level policy parameters $\phi$, by maximizing the objective function of the low-level policy network through gradient ascent; step 4, updating the target state value network parameters $\bar{\psi}$; and repeating steps 1 to 4 to update each network parameter and the target network parameters, obtaining the trained classifier-free diffusion model after the set number of updates is reached.
- 6. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 5, wherein the classifier-free diffusion model loss in step 1 is expressed as $L(\theta) = \mathbb{E}_{t \sim U[0,1],\ (sg, s, g) \sim D,\ \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| v_\theta\left(sg^t, t, s, \bar{g}\right) - (sg - \epsilon) \right\|^2 \right]$ with $sg^t = (1 - t)\,\epsilon + t\, sg$; wherein $L(\theta)$ denotes the classifier-free diffusion model loss; $\mathbb{E}$ denotes expectation; $U[0,1]$ denotes the uniform distribution on the interval $[0,1]$; $\sim$ denotes sampling; $\mathcal{N}(0, I)$ denotes the standard Gaussian distribution; $D$ denotes the mixed goal distribution of the data set; $sg^t$ denotes the linear interpolation between noise and the real sub-goal; $sg$ denotes the real sub-goal sampled from the experience pool; $\bar{g}$ denotes the randomly masked goal; and $\epsilon$ denotes the noise sub-goal (an illustrative sketch of this loss is given after the claims list).
- 7. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 5, wherein in step 2 the loss function of the state value network is expressed as $L(\psi) = \mathbb{E}\left[ L_2^{\tau}\left( r(s_k, g) + \gamma\, V_{\bar{\psi}}(s_{k+1}, g) - V_{\psi}(s_k, g) \right) \right]$ with $L_2^{\tau}(u) = \left| \tau - \mathbb{1}(u < 0) \right| u^2$; wherein $L(\psi)$ denotes the loss function of the state value network; $L_2^{\tau}$ denotes the expectile regression loss; $\tau$ denotes the expectile; $\mathbb{1}$ denotes the indicator function; $r(s_k, g)$ denotes the reward corresponding to state $s_k$ and goal $g$; $V_{\bar{\psi}}(s_{k+1}, g)$ denotes the output of the target state value network for input $(s_{k+1}, g)$; $V_{\psi}(s_k, g)$ denotes the output of the state value network for input $(s_k, g)$; $\gamma$ is a hyperparameter denoting the discount factor; and $u$ denotes the input variable of the expectile regression loss (an illustrative sketch of this update is given after the claims list).
- 8. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 5, wherein in step 3 the objective function of the low-level policy network is expressed as $J(\phi) = \mathbb{E}\left[ e^{\beta \left( V_{\psi}(s_{k+1},\, sg) - V_{\psi}(s_k,\, sg) \right)} \log \pi_{\phi}(a_k \mid s_k, sg) \right]$; wherein $J(\phi)$ denotes the objective function of the low-level policy network; $e$ is the base of the exponential function; $\beta$ is a hyperparameter, the inverse temperature; $V_{\psi}(s_{k+1}, sg)$ denotes the output of the state value network for input $(s_{k+1}, sg)$; $V_{\psi}(s_k, sg)$ denotes the output of the state value network for input $(s_k, sg)$; and $\log \pi_{\phi}(a_k \mid s_k, sg)$ denotes the log-likelihood of the low-level policy outputting $a_k$ for input $(s_k, sg)$.
- 9. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 5, wherein in step 4 the target state value network parameters $\bar{\psi}$ are updated by computing $\alpha \psi + (1 - \alpha)\bar{\psi}$ and assigning the result to the target state value network parameters $\bar{\psi}$, wherein $\alpha$ denotes the target network update rate.
- 10. The hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning according to claim 1, wherein the optimal sub-goal is generated with the trained classifier-free diffusion model through the following guided sampling process: according to the current state and the final goal, the classifier-free diffusion model directly generates the optimal sub-goal using a weighted combination of unconditional and conditional sampling (an illustrative sketch of this sampling process is given after the claims list).
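The following is a minimal PyTorch sketch of the flow-matching loss in claim 6, not part of the claims. It assumes a velocity network `v_theta(sg_t, t, state, goal)` and represents the randomly masked goal by a zero vector; all names, shapes, and the masking probability `p_mask` are illustrative assumptions.

```python
import torch

def flow_matching_loss(v_theta, sg_real, state, goal, p_mask=0.1):
    """Sketch of the classifier-free flow-matching loss (claim 6).

    sg_real: real sub-goals sampled from the experience pool. The goal is
    randomly masked so that a single network learns both the conditional
    and the unconditional velocity field.
    """
    b = sg_real.shape[0]
    t = torch.rand(b, 1)                      # t ~ U[0, 1]
    eps = torch.randn_like(sg_real)           # noise sub-goal, eps ~ N(0, I)
    sg_t = (1.0 - t) * eps + t * sg_real      # linear interpolation noise -> data
    mask = (torch.rand(b, 1) < p_mask).float()
    goal_masked = (1.0 - mask) * goal         # masked goals become a zero token
    v_pred = v_theta(sg_t, t, state, goal_masked)
    # Regress onto the straight-path velocity (sg - eps); minimized by
    # gradient descent on the diffusion parameters theta (step 1 of claim 5).
    return ((v_pred - (sg_real - eps)) ** 2).mean()
```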
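Below is a hedged sketch of the value and low-level policy updates in claims 7-9. The arguments of the two value terms in the claim-8 weight are not fully recoverable from the text; `V(s_next, sg) - V(s, sg)` is an assumption consistent with advantage-weighted regression, and `V`, `V_bar`, `log_prob`, and all hyperparameter values are illustrative.

```python
import torch

def value_loss(V, V_bar, s, g, r, s_next, gamma=0.99, tau=0.7):
    """Claim 7: expectile regression of the state value network."""
    with torch.no_grad():
        target = r + gamma * V_bar(s_next, g)  # bootstrap via frozen target net
    u = target - V(s, g)
    weight = torch.abs(tau - (u < 0).float())  # |tau - 1(u < 0)|
    return (weight * u ** 2).mean()            # minimized by gradient descent

def policy_objective(log_prob, V, s, a, sg, s_next, beta=3.0):
    """Claim 8: advantage-weighted log-likelihood of the low-level policy."""
    with torch.no_grad():
        w = torch.exp(beta * (V(s_next, sg) - V(s, sg))).clamp(max=100.0)
    return (w * log_prob(a, s, sg)).mean()     # maximized by gradient ascent

def soft_update(V_bar, V, alpha=0.005):
    """Claim 9: target <- alpha * online + (1 - alpha) * target."""
    for p_bar, p in zip(V_bar.parameters(), V.parameters()):
        p_bar.data.mul_(1.0 - alpha).add_(alpha * p.data)
```

One outer iteration of the training loop in claim 5 would call these in order: `flow_matching_loss` (step 1), `value_loss` (step 2), `policy_objective` (step 3), then `soft_update` (step 4), repeating until the set number of updates is reached.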
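Finally, a sketch of the classifier-free guided sampling in claim 10, assuming straight-path (flow-matching) integration with a simple Euler scheme; the step count, guidance weight `w`, and the zero-vector null goal are assumptions rather than claimed specifics.

```python
import torch

@torch.no_grad()
def plan_subgoal(v_theta, state, goal, sg_dim, steps=20, w=1.5):
    """Claim 10: generate the optimal sub-goal from the current state and
    the final goal by weighting unconditional and conditional sampling."""
    b = state.shape[0]
    sg = torch.randn(b, sg_dim)               # start from a noise sub-goal
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((b, 1), i * dt)
        v_cond = v_theta(sg, t, state, goal)
        v_uncond = v_theta(sg, t, state, torch.zeros_like(goal))
        # w > 1 biases the sample toward goal-consistent sub-goals,
        # realizing the implicit optimality bias without a value function.
        sg = sg + dt * (v_uncond + w * (v_cond - v_uncond))
    return sg
```

The planned `sg` is then passed, together with the current state, to the low-level policy to select actions.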
Description
Hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning
Technical Field
The invention belongs to the technical field of offline goal-conditioned reinforcement learning and hierarchical decision making, and in particular relates to a hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning.
Background
Reinforcement learning can effectively solve decision problems through trial and error and has achieved remarkable success in fields such as games. However, reinforcement learning relies on online interaction with the environment, which limits its real-world application. In some scenarios, online interaction is inefficient and may pose safety risks. For example, training a robot to walk in a real-world environment may take weeks, with human intervention required to reset its state after each fall. In an autonomous driving task, a smart car may endanger surrounding pedestrians or other vehicles during learning. In contrast, offline reinforcement learning requires no real-time interaction with the environment: a policy can be learned from a single static experience data set, opening up more real-world scenarios for reinforcement learning. Meanwhile, by introducing the goal as a condition of the decision process, general, scalable policies can be learned directly from unlabeled data, making the training of generalist agents possible. To improve data utilization efficiency, offline goal-conditioned reinforcement learning has become a necessary means of training general policies in the autonomous driving field. Existing autonomous driving tasks typically come with goal-conditioned settings, where the task is completed only by reaching a specified goal, and are therefore inherently challenged by long-horizon planning and sparse rewards. Hierarchical decision making predicts future intermediate states as sub-goals through a high-level policy, while a low-level policy executes actions according to those sub-goals, thereby addressing the offline goal-conditioned reinforcement learning problem. The hierarchical structure improves the signal-to-noise ratio of the value function and thus provides a more accurate value guidance signal; it therefore has the potential to solve the long-horizon and sparse-reward problems while offering a path toward industrial application. However, although hierarchical decision making can improve decision accuracy to some extent, the sub-goal planning of the high-level policy still relies on guidance from the value function and thus necessarily queries values in long-horizon scenarios. As the horizon grows, the guidance signal of the value function becomes increasingly blurred, planning by the high-level policy becomes exceedingly difficult, and a wrong sub-goal eventually renders the actions output by the low-level policy ineffective as well. Improving the accuracy of high-level sub-goal planning in hierarchical decisions is therefore a critical challenge.
Disclosure of Invention
Aiming at the above problems, the invention provides a hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning, which first plans sub-goals with a classifier-free diffusion model and then executes actions according to the sub-goals with a low-level policy, thereby addressing the challenges of sparse rewards and long-horizon planning in the autonomous driving field. To achieve the aim of the invention, the adopted technical scheme, a hierarchical offline goal-conditioned reinforcement learning method based on diffusion-model sub-goal planning, comprises the following steps: respectively establishing a classifier-free diffusion model network, a low-level policy network, and a state value network, and initializing their parameters; using the classifier-free diffusion model as the high-level policy for sub-goal planning, and the low-level policy network as the low-level policy for action execution; establishing a target network corresponding to the state value network, the target network having the same structure as the corresponding original network; training the classifier-free diffusion model, modeling the sub-goal distribution through a flow matching process, and directly generating the optimal sub-goal through a classifier-free sampling process, wherein the flow matching process is realized through a flow matching loss. The method comprises the steps of constructing a velocity field by using