CN-121981299-A - Learning path generation method based on reinforcement learning
Abstract
The invention relates to the technical field of machine learning and discloses a learning path generation method based on reinforcement learning. The method comprises: obtaining state deviation data of a target object relative to a target value and constructing a state evolution track; extracting the first-order time derivative and the second-order time gradient of the state deviation to determine a response phase acceleration vector; judging that a response phase mutation has occurred, and locking the corresponding moment point, when the modular length of the acceleration vector exceeds a threshold value; triggering an asymmetric sampling space reconstruction of the experience pool, in which an exponential decay operator attenuates the sampling probability of historical samples generated before the moment point; and correcting the policy network parameters based on the reconstructed experience pool and outputting a control instruction.
Inventors
- Jiang Liheng
- Du Li
Assignees
- 广东树华科技咨询服务有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-20
Claims (10)
- 1. A learning path generation method based on reinforcement learning, characterized by comprising the following steps: step 101, obtaining algebraic deviation data representing a target object's current task execution state relative to a preset target value, and time-correlating the algebraic deviation data with a historical observation sequence to construct a state evolution track; step 102, based on the state evolution track, extracting the first-order time derivative of the algebraic deviation data within a preset sliding window period to determine a state evolution momentum vector, and further extracting the second-order time gradient of the algebraic deviation data within the preset sliding window period to determine a response phase acceleration vector; step 103, comparing the modular length of the response phase acceleration vector with a preset phase mutation critical threshold in real time, judging that the controlled logic has entered a response phase mutation period when the modular length exceeds the threshold, and locking the corresponding phase change moment; step 104, triggering an asymmetric sampling space reconstruction of the reinforcement learning experience pool during the response phase mutation period, and applying sampling probability attenuation, via an exponential decay operator based on a time decay factor, to historical samples in the experience pool that were generated before the phase change moment and whose reward values are higher than a preset average, so as to release the logic viscosity of the policy network with respect to the old response phase; and step 105, correcting the parameters of the policy network according to the policy gradient generated from the experience pool after the asymmetric sampling space reconstruction, and outputting a learning path control instruction for the target object under the current response phase (an illustrative sketch of steps 102 to 104 follows the claims).
- 2. The method according to claim 1, wherein in step 102 the determination of the response phase acceleration vector further comprises: performing linear extrapolation of the expected performance after action execution by using the state evolution momentum vector; computing the residual between the resulting predicted value and the algebraic deviation data produced by actual observation to generate a feedback deviation term; and applying a damping correction to the second-order time gradient of the state evolution track by using the feedback deviation term.
- 3. The method according to claim 1, wherein in step 104 the asymmetric sampling space reconstruction further comprises, while performing the sampling probability attenuation, synchronously raising the initial sampling weights, in prioritized experience replay, of the interaction samples generated after the phase change moment, so that the contraction of old-sample sampling probability and the expansion of new-sample weights jointly drive the policy network away from the locally optimal region of the old response phase.
- 4. The learning path generation method based on reinforcement learning according to claim 1, wherein step 101 further comprises obtaining an instruction response delay representing the interaction stability of the target object, forming a multidimensional state feedback closed loop in which the instruction response delay serves as an orthogonal correction index alongside the algebraic deviation data, and reducing the gain of the state evolution momentum vector through a damping function when the instruction response delay exceeds a preset destabilization interval.
- 5. The method according to claim 1, wherein outputting the learning path control instruction in step 105 comprises, in conjunction with a negative trajectory space mask mechanism, converting interaction sequences that remain continuously in the negative deviation interval into geometric constraints on the action search space, so as to prevent the policy network from searching in path regions already validated as inefficient.
- 6. The method according to claim 1, wherein step 103 comprises monitoring the transient characteristics of forward transitions of the algebraic deviation data, and dynamically lowering the phase mutation critical threshold when the decay rate of the algebraic deviation data after the response phase acceleration vector peaks falls below a predetermined damping coefficient.
- 7. The learning path generation method based on reinforcement learning according to claim 1, wherein the exponential decay operator used in step 104 follows the quantization rule P′ = P₀ · e^(−λ·Δt), wherein P′ is the sampling probability of the processed sample, P₀ is the original sampling weight, λ is a preset decay factor characterizing the remodeling rate of the response model, Δt is the time span between the generation moment of the historical sample and the phase change moment, and e is the exponential function based on the natural logarithm (see the sketch following the claims).
- 8. The learning path generation method based on reinforcement learning according to claim 1, wherein, when computing the response phase acceleration vector in step 102, the algebraic deviation data is smoothed by sliding-window polynomial fitting, so as to filter out non-trend noise, induced by random disturbances of the interaction behavior, that would otherwise interfere with the modular length calculation of the response phase acceleration vector.
- 9. The learning path generation method based on reinforcement learning according to claim 1, wherein in step 105, after the learning path control instruction is generated, the exploration-exploitation coefficient of the reinforcement learning is dynamically adjusted according to the policy distribution entropy of the updated policy network, and, after a response phase mutation is determined to have occurred, the exploration coefficient is raised within a preset recovery period to assist the policy network in searching for the optimal topological path under the new phase (a sketch follows the claims).
- 10. The learning path generation method based on reinforcement learning according to claim 1, further comprising performing divergence detection on the sample distribution in the experience pool after the asymmetric sampling space reconstruction of step 104, and limiting the parameter update step size of the policy network through a gradient truncation operator when the empirical distribution distance before and after the reconstruction, caused by the sampling probability attenuation, exceeds a preset safety margin.
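The following Python sketch is not part of the patent disclosure; it is a minimal illustration, under assumed names (`momentum_and_acceleration`, `detect_phase_mutation`, `reconstruct_sampling_weights`, `decay_lambda`, `boost`), of one plausible reading of steps 102 to 104: finite-difference estimation of the momentum and acceleration over a sliding window, threshold detection of the phase mutation, the exponential decay operator of claim 7 applied to pre-transition high-reward samples, and the synchronized weight boost for post-transition samples from claim 3.

```python
import numpy as np

def momentum_and_acceleration(deviations, dt=1.0):
    """Step 102 sketch: estimate the first-order time derivative (momentum)
    and second-order time gradient (phase acceleration) of the algebraic
    deviation series inside one sliding window, via finite differences."""
    d = np.asarray(deviations, dtype=float)
    momentum = np.gradient(d, dt)             # first-order time derivative
    acceleration = np.gradient(momentum, dt)  # second-order time gradient
    return momentum, acceleration

def detect_phase_mutation(acceleration, phase_threshold):
    """Step 103 sketch: return the first index whose acceleration modulus
    exceeds the critical threshold (the phase change moment), else None.
    For a scalar deviation the modular length is the absolute value."""
    over = np.abs(acceleration) > phase_threshold
    return int(np.argmax(over)) if over.any() else None

def reconstruct_sampling_weights(weights, rewards, timestamps, t_change,
                                 decay_lambda=0.1, boost=2.0):
    """Step 104 sketch: attenuate pre-transition samples whose reward
    exceeds the pool average using claim 7's rule P' = P0 * exp(-lambda*dt),
    and raise initial weights of post-transition samples (claim 3)."""
    w = np.asarray(weights, dtype=float).copy()
    rewards = np.asarray(rewards, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    old_hot = (timestamps < t_change) & (rewards > rewards.mean())
    w[old_hot] *= np.exp(-decay_lambda * (t_change - timestamps[old_hot]))
    w[timestamps >= t_change] *= boost
    return w / w.sum()   # renormalize into sampling probabilities
```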
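Claim 9's entropy-driven exploration adjustment could be sketched as follows. The scaling heuristic and all parameter names (`base_eps`, `recovery_boost`) are assumptions: the claim only states that the exploration coefficient is adjusted from the policy distribution entropy and temporarily raised during a recovery period after a mutation.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Entropy of a discrete policy distribution over actions."""
    p = np.asarray(probs, dtype=float) + eps
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def adjust_exploration(entropy, base_eps=0.05, k=0.5,
                       in_recovery_period=False, recovery_boost=0.3):
    """Claim 9 sketch: low policy entropy (an over-confident policy) raises
    the exploration coefficient; the coefficient is further boosted while a
    post-mutation recovery period is active. The mapping is an assumption."""
    eps = base_eps + k * max(0.0, 1.0 - entropy)
    if in_recovery_period:
        eps += recovery_boost
    return min(eps, 1.0)
```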
Description
Learning path generation method based on reinforcement learning
Technical Field
The invention relates to a learning path generation method based on reinforcement learning, and belongs to the technical field of machine learning.
Background
The evolution of a controlled system's state parameters exhibits dynamic, non-continuous phase change characteristics: when the target object grasps the core logic or a key pattern, the system response characteristics undergo a transient forward transition. Under a reinforcement learning framework, this transition manifests as non-stationary drift of the state transition probabilities or of the reward function in the Markov decision process. A large amount of previously accumulated successful interaction experience then degenerates into outdated noise because the cognitive phase has changed; if the system keeps revisiting old successful experience, the resulting policy gradient forcefully drags the policy network into the local optimum of the old cognitive phase, and the generated learning path ultimately exhibits logic viscosity in the new cognitive phase.
Chinese patent CN113094495B discloses a learning path demonstration method, device, equipment and medium for deep reinforcement learning, which demonstrates the learning path decision process by configuring a state space containing student user attributes and knowledge graph attributes. That scheme performs policy mapping based on accumulated historical conditions and does not clearly characterize the transient second-order dynamics of the controlled object's cognition. When facing a response phase migration condition, the experience pool is filled with a large number of outdated high-reward samples; the phase deviation between the observed data and the real internal state causes the historical experience to induce policy update hysteresis, and the system cannot drive the policy network out of the logic viscosity region of the old phase. Weighting mechanisms or dynamic learning rate adjustment schemes based on the time dimension can only perform linear fine-tuning of experience weights on a global scale; lacking a physical characterization of the phase deviation between observed data and the real internal state, they cannot accurately distinguish the experience space before and after the phase change point, and they implicitly treat the environment's state transition probabilities as stationary, so they fail to adapt when a cognitive phase change makes the state transition distribution drift. Therefore, how to overcome the policy inertia of the reinforcement learning model when the target object undergoes a cognitive phase change, and how to decouple the phase mismatch between the observed data and the real cognitive state, is the technical problem to be solved by the invention.
Disclosure of Invention
In order to solve the problems in the background art, the technical scheme of the invention is as follows. A learning path generation method based on reinforcement learning comprises the following steps: step 101, obtaining algebraic deviation data representing a target object's current task execution state relative to a preset target value, and time-correlating the algebraic deviation data with a historical observation sequence to construct a state evolution track; step 102, based on the state evolution track, extracting the first-order time derivative of the algebraic deviation data within a preset sliding window period to determine a state evolution momentum vector, and further extracting the second-order time gradient of the algebraic deviation data within the preset sliding window period to determine a response phase acceleration vector; step 103, comparing the modular length of the response phase acceleration vector with a preset phase mutation critical threshold in real time, judging that the controlled logic has entered a response phase mutation period when the modular length exceeds the threshold, and locking the corresponding phase change moment; step 104, triggering an asymmetric sampling space reconstruction of the reinforcement learning experience pool during the response phase mutation period, and applying sampling probability attenuation, via an exponential decay operator based on a time decay factor, to historical samples in the experience pool that were generated before the phase change moment and whose reward values are higher than a preset average, so as to release the logic viscosity of the policy network with respect to the old response phase; and step 105, correcting the parameters of the policy network according to the policy gradient generated from the experience pool after the asymmetric sampling space reconstruction, and outputting a learning path control instruction for the target object under the current response phase.
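Claim 10 adds a safeguard on the step-105 update: if the reconstruction shifts the sampling distribution too far, the parameter update step is clamped by gradient truncation. The sketch below is illustrative only and assumes a PyTorch policy network; the patent does not specify which empirical distribution distance is used, so KL divergence appears here purely as an example, and the names (`guarded_policy_update`, `safety_margin`, `max_grad_norm`) are hypothetical.

```python
import numpy as np
import torch

def kl_divergence(p, q, eps=1e-12):
    """Example distance between the sampling distributions before (p) and
    after (q) reconstruction; the patent leaves the metric unspecified."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def guarded_policy_update(policy_net, optimizer, loss,
                          probs_before, probs_after,
                          safety_margin=0.5, max_grad_norm=1.0):
    """Claim 10 sketch: when the distribution shift caused by the sampling
    probability attenuation exceeds the preset safety margin, truncate the
    gradient norm before applying the policy-gradient update (step 105)."""
    optimizer.zero_grad()
    loss.backward()
    if kl_divergence(probs_before, probs_after) > safety_margin:
        torch.nn.utils.clip_grad_norm_(policy_net.parameters(), max_grad_norm)
    optimizer.step()
```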