CN-121979392-A - Self-adaptive adjustment method based on reinforcement learning, computer equipment and medium
Abstract
The application relates to the technical field of computers and provides a self-adaptive adjustment method based on reinforcement learning, computer equipment, and a medium. The method comprises: obtaining an input state, a player action, and a reference action; and, through an adaptive adjustment module, selectively adjusting or maintaining a tolerance coefficient parameter based at least on the gap between the player action and the reference action, so that the gap between the player action and the reference action is stabilized within a preset range.
Inventors
- Request for anonymity
- Request for anonymity
Assignees
- 深圳市固胜智能科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-06
Claims (20)
- 1. An adaptive adjustment method based on reinforcement learning, characterized in that the adaptive adjustment method comprises the following steps: obtaining an input state, a player action, and a reference action, wherein the reference action is a guiding action result obtained by inputting the input state and a tolerance coefficient parameter into a reinforcement learning model under the constraint of soft boundary conditions of a reward function of the reinforcement learning model, the input state at least comprises a state related to the player action, and the reference action is used for teaching guidance; and selectively adjusting, by an adaptive adjustment module, the tolerance coefficient parameter or maintaining the tolerance coefficient parameter based at least on a gap between the player action and the reference action to change a decision constraint of the reinforcement learning model, such that the gap between the player action and the reference action is stabilized within a preset range, wherein the tolerance coefficient parameter is a dynamically adjustable control quantity for adjusting the soft boundary conditions in the reward function of the reinforcement learning model, the reward function being used to indicate the reward of the reinforcement learning model for taking a given action in a given state.
- 2. The adaptive adjustment method according to claim 1, wherein selectively adjusting, by the adaptive adjustment module, the tolerance coefficient parameter or maintaining the tolerance coefficient parameter based at least on the gap between the player action and the reference action so that the gap between the player action and the reference action is stabilized within the preset range comprises: evaluating respective behavior deviation values of the player action relative to the reference action in a plurality of dimensions, and synthesizing the respective behavior deviation values of the plurality of dimensions to obtain a comprehensive action deviation value of the player action relative to the reference action; maintaining the tolerance coefficient parameter when the comprehensive action deviation value is within the preset range; and adjusting the tolerance coefficient parameter when the comprehensive action deviation value exceeds the preset range.
- 3. The adaptive adjustment method according to claim 1, wherein the adaptive adjustment module is configured to adjust the tolerance coefficient parameter based on a plurality of preset adjustment levels, and wherein the gap between the player action and the reference action includes a distribution of absolute values of action deviations of the player action relative to the reference action.
- 4. The adaptive adjustment method according to claim 3, wherein the plurality of preset adjustment levels correspond to a plurality of level references from a highest level to a lowest level, in order from the lowest adjustment level to the highest adjustment level, and the adaptive adjustment module is configured to: increase the tolerance coefficient parameter when the mean of the absolute values of the action deviations is greater than a preset high threshold; and decrease the tolerance coefficient parameter when the mean of the absolute values of the action deviations is less than a preset low threshold, wherein the preset high threshold and the preset low threshold define the preset range.
- 5. The adaptive adjustment method of claim 1, wherein the reinforcement learning model is configured to provide an artificial intelligence strategy having an adjustable level of aggressiveness, and wherein the tolerance coefficient parameter is configured to dynamically adjust the adjustable level of aggressiveness of the artificial intelligence strategy to match the skill level associated with the player action.
- 6. The adaptive adjustment method of claim 1, wherein the adaptive adjustment module is further configured to determine an initial value of the tolerance coefficient parameter based on historical data associated with the player action.
- 7. The adaptive adjustment method of claim 6, wherein the adaptive adjustment module is further configured to selectively adjust or maintain the tolerance coefficient parameter based on the gap between the player action and the reference action and a result indicator associated with a task completion status.
- 8. The adaptive adjustment method according to claim 7, wherein when the task completion status does not satisfy a preset condition and the gap between the player action and the reference action falls outside the preset range, the adaptive adjustment module adjusts the tolerance coefficient parameter based on an action deviation statistical result to change the policy constraint strength of the reinforcement learning model.
- 9. The adaptive adjustment method of claim 1, wherein the adaptive adjustment module and the reinforcement learning model together form a closed-loop adjustment structure, wherein the gap between the player action and the reference action is used to adjust the tolerance coefficient parameter so as to affect the guided reference action generated by the reinforcement learning model under the soft boundary condition constraints.
- 10. The adaptive adjustment method of claim 1, wherein the adaptive adjustment module is further configured to adjust the tolerance coefficient parameter based on a player-customized setting.
- 11. The adaptive adjustment method of claim 1, wherein the method is applicable to teaching guidance in a simulated automobile driving scenario, wherein the input states are a vehicle state and a track state, the player action is a player automobile driving action, the reference action is a reference automobile driving action, and the soft boundary conditions include at least an upper limit on safe cornering speed, an upper limit on lateral acceleration of the vehicle, and a soft boundary on track width.
- 12. The adaptive adjustment method according to claim 1, wherein the method is applicable to teaching guidance in a simulated aircraft piloting scenario, wherein the input states are an aircraft state and an aircraft course state, the player action is a player aircraft piloting action, the reference action is a reference aircraft piloting action, and the soft boundary conditions include an aircraft speed, an aircraft steering speed, and an aircraft acceleration.
- 13. The adaptive adjustment method according to claim 1, wherein the method is applicable to teaching guidance in a simulated sports training scenario, wherein the input states are a player height, a player arm length, and a player pose, the player action is a player sports action, the reference action is a reference sports action, and the soft boundary conditions include at least a sports action strength, a sports action speed, and a sports action acceleration.
- 14. The adaptive adjustment method according to claim 1, wherein a hard boundary condition of the reward function is non-adjustable and is preset based on boundary limits of a scenario associated with the player action, the hard boundary condition of the reward function including a track width hard boundary when the scenario associated with the player action is a simulated automobile driving scenario.
- 15. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the operations implemented by the processor when executing the computer program comprise: obtaining an input state, a player action, and a reference action, wherein the reference action is obtained by inputting the input state and a tolerance coefficient parameter into a reinforcement learning model, the input state at least comprises a state associated with the player action, and the reference action is used for teaching guidance; and selectively adjusting, by an adaptive adjustment module, the tolerance coefficient parameter or maintaining the tolerance coefficient parameter based at least on a gap between the player action and the reference action such that the gap between the player action and the reference action is stabilized within a preset range, wherein the tolerance coefficient parameter is a dynamically adjustable control quantity for adjusting soft boundary conditions in a reward function of the reinforcement learning model, the reward function being used to indicate the reward of the reinforcement learning model for taking a given action in a given state.
- 16. The computer device of claim 15, wherein the adaptive adjustment module is configured to adjust the tolerance coefficient parameter based on a plurality of preset adjustment levels, and wherein the gap between the player action and the reference action comprises a distribution of absolute values of action deviations of the player action relative to the reference action.
- 17. The computer device of claim 15, wherein the adaptive adjustment module and the reinforcement learning model together form a closed-loop adjustment structure, wherein a gap between the player action and the reference action is used to adjust the tolerance coefficient parameter to affect a guided reference action generated by the reinforcement learning model under soft boundary condition constraints.
- 18. A computer-readable storage medium storing computer instructions that, when executed on a computer device, cause the computer device to perform operations comprising: obtaining an input state, a player action, and a reference action, wherein the reference action is obtained by inputting the input state and a tolerance coefficient parameter into a reinforcement learning model, the input state at least comprises a state associated with the player action, and the reference action is used for teaching guidance; and selectively adjusting, by an adaptive adjustment module, the tolerance coefficient parameter or maintaining the tolerance coefficient parameter based at least on a gap between the player action and the reference action such that the gap between the player action and the reference action is stabilized within a preset range, wherein the tolerance coefficient parameter is a dynamically adjustable control quantity for adjusting soft boundary conditions in a reward function of the reinforcement learning model, the reward function being used to indicate the reward of the reinforcement learning model for taking a given action in a given state.
- 19. The computer-readable storage medium of claim 18, wherein the adaptive adjustment module is configured to adjust the tolerance coefficient parameter based on a plurality of preset adjustment levels, and wherein the gap between the player action and the reference action comprises a distribution of absolute values of action deviations of the player action relative to the reference action.
- 20. The computer-readable storage medium of claim 18, wherein the adaptive adjustment module forms a closed-loop adjustment structure with the reinforcement learning model, wherein a gap between the player action and the reference action is used to adjust the tolerance coefficient parameter to affect a guided reference action generated by the reinforcement learning model under soft boundary condition constraints.
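The adjustment logic recited in claims 1 to 4 can be made concrete with a short sketch. The Python code below is a minimal, non-normative illustration rather than the patented implementation: the class name `AdaptiveToleranceAdjuster`, the five level values, the threshold constants, and the use of a weighted mean as the comprehensive action deviation value are all assumptions introduced for this example.

```python
import numpy as np

class AdaptiveToleranceAdjuster:
    """Minimal sketch of the adaptive adjustment module (claims 1-4).

    Keeps the gap between the player action and the reference action
    within a preset range by stepping a tolerance coefficient through
    a plurality of preset adjustment levels. All names and numbers
    here are illustrative assumptions, not values from the patent.
    """

    # A plurality of preset adjustment levels for the tolerance
    # coefficient parameter (claim 3); values are made up.
    LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

    def __init__(self, low_threshold=0.1, high_threshold=0.5, weights=None):
        self.low = low_threshold    # preset low threshold (claim 4)
        self.high = high_threshold  # preset high threshold (claim 4)
        self.weights = weights      # optional per-dimension weights (claim 2)
        self.level = 2              # start at a middle adjustment level

    @property
    def tolerance(self):
        return self.LEVELS[self.level]

    def update(self, player_action, reference_action):
        """Evaluate per-dimension deviations, synthesize them, and
        selectively adjust or maintain the tolerance coefficient."""
        player = np.asarray(player_action, dtype=float)
        reference = np.asarray(reference_action, dtype=float)

        # Behavior deviation value in each dimension (claim 2),
        # taken as absolute values (claim 3).
        abs_deviations = np.abs(player - reference)

        # Synthesize into a comprehensive action deviation value;
        # here the (weighted) mean of the absolute deviations (claim 4).
        mean_dev = np.average(abs_deviations, weights=self.weights)

        if mean_dev > self.high and self.level < len(self.LEVELS) - 1:
            self.level += 1  # gap too large: increase the tolerance coefficient
        elif mean_dev < self.low and self.level > 0:
            self.level -= 1  # gap very small: decrease the tolerance coefficient
        # Otherwise the deviation is within the preset range: maintain.
        return self.tolerance
```

In the closed-loop structure of claims 9, 17, and 20, the value returned by `update` would be fed back into the reinforcement learning model together with the next input state, so that the next guided reference action is generated under the adjusted soft boundary constraints.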
Description
Technical Field

The present application relates to the field of computer technologies, and in particular to a self-adaptive adjustment method based on reinforcement learning, a computer device, and a medium.

Background

With the development of fields such as artificial intelligence, virtual reality, and augmented reality, coaching systems are widely used to provide teaching guidance to players and help them improve training quality and skill level, for example artificial-intelligence-based teaching and training systems (AI Coach). However, different players differ in initial skill level and learning progress, so individual differences between players must be accommodated; training effect and player experience improve only when instruction is tailored to the individual.

In prior-art computer-assisted teaching and training schemes, the experience and skill level of the reference object used to train a player is set by the player, for example by choosing among levels such as an entry level, a skilled level, and an expert level. Such schemes rely on static adjustment and fixed grading, lack a personalized adjustment mechanism for the individual player, and struggle to keep pace with the player's improving skill, causing lag and a poor training experience.

Prior-art teaching and training schemes relying on artificial intelligence sometimes employ reinforcement learning (RL), in which an agent interacts with an environment and learns an optimal strategy by trial and error: the agent takes actions in the environment and receives rewards as feedback, which guides it to adjust its strategy so as to maximize the long-term cumulative reward. For example, the Chinese patent with grant publication number CN119417671B discloses collecting intelligent driving data in real time through vehicle-mounted sensors, inputting the data into a decision model to obtain a standard driving behavior sequence, comparing the driver's actual driving behavior sequence with the standard sequence to calculate a driving behavior deviation value, determining a teaching feedback strategy from that deviation value, and using the driver's feedback on the driving instruction information to update the training parameters of the current decision model. However, such reinforcement-learning-based teaching and training systems must frequently update training parameters according to driver feedback, which repeatedly interferes with the training process of the reinforcement learning model, makes it difficult to fully reuse a teaching model designed for the highest skill level, increases the adaptation cost of switching between different users, lowers automation efficiency, and hinders large-scale deployment.

The present application therefore provides a self-adaptive adjustment method based on reinforcement learning, a computer device, and a medium to solve the above technical problems in the prior art.
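To make the reward-function mechanism described above tangible, consider the simulated automobile driving scenario of claims 11 and 14. The sketch below is an assumption-laden example: the function name `soft_boundary_reward`, the linear penalty shape, and every numeric limit are invented for illustration and are not specified by the application.

```python
def soft_boundary_reward(progress, cornering_speed, lateral_accel,
                         offset_from_center, tolerance):
    """Illustrative reward with soft and hard boundaries (claims 11 and 14).

    `tolerance` is the dynamically adjustable tolerance coefficient
    parameter: a larger value relaxes the soft boundary penalties and a
    smaller value tightens them. All limits below are made-up constants.
    """
    SPEED_LIMIT = 40.0       # upper limit on safe cornering speed (m/s)
    LAT_ACCEL_LIMIT = 8.0    # upper limit on lateral acceleration (m/s^2)
    TRACK_HALF_WIDTH = 5.0   # soft boundary on track width (m)
    HARD_HALF_WIDTH = 6.0    # hard boundary: non-adjustable (claim 14)

    # Hard boundary: leaving the course entirely is always penalized,
    # independent of the tolerance coefficient.
    if abs(offset_from_center) > HARD_HALF_WIDTH:
        return -100.0

    # Soft boundary penalties scale inversely with the tolerance
    # coefficient: exceeding a soft limit costs less when tolerance is high.
    penalty = 0.0
    for value, limit in ((cornering_speed, SPEED_LIMIT),
                         (abs(lateral_accel), LAT_ACCEL_LIMIT),
                         (abs(offset_from_center), TRACK_HALF_WIDTH)):
        if value > limit:
            penalty += (value - limit) / max(tolerance, 1e-6)

    # Base reward for progress along the track, minus soft penalties.
    return progress - penalty
```

Under a reward of this shape, a higher tolerance coefficient lets the model's reference actions push closer to the soft limits, yielding a more aggressive teaching reference, while the hard course boundary of claim 14 remains fixed regardless of the tolerance setting.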
Disclosure of Invention

In a first aspect, the present application provides a reinforcement-learning-based adaptive adjustment method. The adaptive adjustment method comprises: obtaining an input state, a player action, and a reference action, wherein the reference action is a guiding action result obtained by inputting the input state and a tolerance coefficient parameter into a reinforcement learning model under the constraint of the soft boundary conditions of the reward function of the reinforcement learning model, the input state at least comprises a state related to the player action, and the reference action is used for teaching guidance; and selectively adjusting or maintaining the tolerance coefficient parameter, through an adaptive adjustment module, based at least on the gap between the player action and the reference action so as to change the decision constraint conditions of the reinforcement learning model, such that the gap between the player action and the reference action is stabilized within a preset range, wherein the tolerance coefficient parameter is a dynamically adjustable control quantity used to adjust the soft boundary conditions in the reward function of the reinforcement learning model, and the reward function is used to indicate the reward of the reinforcement learning model for taking a given action in a given state. According to the application, by the dynamic adjustment mechanism of the tolerance coefficient parameter, the reference action output to the outside and the teaching guidance based o