CN-122021784-A - Asynchronous reinforcement learning training method, device, equipment and medium
Abstract
An asynchronous reinforcement learning training method, device, equipment and medium relate to the technical field of artificial intelligence. The method is applied to a training engine and comprises: extracting, from a sample data area, sample data that includes a sample behavior policy probability; determining a behavior importance weight from a proximal behavior policy probability and the sample behavior policy probability, and applying an upper-bound constraint to the behavior importance weight; determining a policy gradient importance weight from a current behavior policy probability and the proximal behavior policy probability, and clipping the policy gradient importance weight; determining a loss value through a decoupled policy gradient loss function from the processed behavior importance weight, the policy gradient importance weight, the processed policy gradient importance weight, and the advantage function corresponding to the sample data; and performing back propagation according to the loss value to update the model parameters of the current behavior policy model. The application can thereby effectively handle off-policy deviation in asynchronous reinforcement learning and improve training stability and convergence accuracy.
Inventors
- PENG PENG
- GAO CHAO
Assignees
- 白杨时代(北京)科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-13
Claims (10)
- 1. An asynchronous reinforcement learning training method, applied to a training engine, the method comprising: extracting sample data from a sample data area, wherein the sample data is generated by an inference engine running asynchronously with the training engine and written into the sample data area, and the sample data comprises a sample behavior policy probability, which is the output probability assigned by the sample behavior policy model to the corresponding action in the sample data when the inference engine generated the sample data; determining a behavior importance weight according to a proximal behavior policy probability and the sample behavior policy probability, and performing upper-bound constraint processing on the behavior importance weight to obtain a processed behavior importance weight; determining a policy gradient importance weight according to a current behavior policy probability and the proximal behavior policy probability, and clipping the policy gradient importance weight to obtain a processed policy gradient importance weight; determining a loss value through a decoupled policy gradient loss function according to the processed behavior importance weight, the policy gradient importance weight, the processed policy gradient importance weight, and an advantage function corresponding to the sample data; and performing back propagation according to the loss value and updating the model parameters of the current behavior policy model. (An illustrative sketch of these steps follows the claims.)
- 2. The method of claim 1, wherein the determining of the behavior importance weight according to the proximal behavior policy probability and the sample behavior policy probability comprises: determining the behavior importance weight according to the ratio of the proximal behavior policy probability to the sample behavior policy probability.
- 3. The method of claim 1, wherein the performing of the upper-bound constraint on the behavior importance weight to obtain the processed behavior importance weight comprises: determining the smaller of the behavior importance weight and a preset upper threshold as the processed behavior importance weight.
- 4. The method of claim 1, wherein the determining of the policy gradient importance weight according to the current behavior policy probability and the proximal behavior policy probability comprises: determining the policy gradient importance weight according to the ratio of the current behavior policy probability to the proximal behavior policy probability.
- 5. The method of claim 1, wherein the clipping of the policy gradient importance weight to obtain the processed policy gradient importance weight comprises: limiting the policy gradient importance weight to a preset clipping interval to obtain the processed policy gradient importance weight.
- 6. The method of any one of claims 1-5, wherein the sample data further comprises version information of the sample behavior policy model, and the determining of the behavior importance weight according to the proximal behavior policy probability and the sample behavior policy probability comprises: determining the staleness of the sample data according to the version information of the sample behavior policy model and the version information of the current behavior policy model; and if the staleness is less than or equal to a preset staleness threshold, determining the behavior importance weight according to the proximal behavior policy probability and the sample behavior policy probability.
- 7. The method of any one of claims 1-5, wherein the sample data comprises a plurality of sub-data generated by different versions of the behavior policy model, and the determining of the loss value through the decoupled policy gradient loss function comprises: determining loss values for the plurality of sub-data respectively through the decoupled policy gradient loss function, and aggregating the loss values of the plurality of sub-data to obtain the loss value of the sample data.
- 8. An asynchronous reinforcement learning training device, applied to a training engine and comprising a data extraction module, a first determining module, a second determining module, a third determining module and a model updating module; the data extraction module is configured to extract sample data from a sample data area, wherein the sample data is generated by an inference engine running asynchronously with the training engine and written into the sample data area, and the sample data comprises a sample behavior policy probability, which is the output probability assigned by the sample behavior policy model to the corresponding action in the sample data when the inference engine generated the sample data; the first determining module is configured to determine a behavior importance weight according to a proximal behavior policy probability and the sample behavior policy probability, and perform upper-bound constraint processing on the behavior importance weight to obtain a processed behavior importance weight; the second determining module is configured to determine a policy gradient importance weight according to a current behavior policy probability and the proximal behavior policy probability, and clip the policy gradient importance weight to obtain a processed policy gradient importance weight; the third determining module is configured to determine a loss value through a decoupled policy gradient loss function according to the processed behavior importance weight, the policy gradient importance weight, the processed policy gradient importance weight, and an advantage function corresponding to the sample data; and the model updating module is configured to perform back propagation according to the loss value and update the model parameters of the current behavior policy model.
- 9. Asynchronous reinforcement learning training equipment, comprising a memory and a processor; the memory is configured to store a program; and the processor is configured to execute the program to implement the steps of the asynchronous reinforcement learning training method of any one of claims 1 to 7.
- 10. A computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the steps of the asynchronous reinforcement learning training method of any one of claims 1 to 7.
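To make claims 1-5 concrete, the following is a minimal PyTorch-style sketch of the loss computation. The patent gives no code or formulas, so every name, default hyperparameter, and the PPO-style pessimistic combination of the clipped and unclipped weights are illustrative assumptions, not patent specifics; the log-probabilities of the proximal and sample behavior policies are assumed precomputed (carrying no gradient).

```python
import torch

def decoupled_pg_loss(logp_current, logp_proximal, logp_sample,
                      advantage, behavior_cap=2.0, clip_eps=0.2):
    """Hedged sketch of the decoupled policy gradient loss of claims 1-5.

    logp_*: log-probabilities of the sampled actions under the current,
    proximal, and sample behavior policy models. All names and default
    hyperparameters are illustrative assumptions, not patent values.
    """
    # Claim 2: behavior importance weight = proximal prob / sample prob.
    w_behavior = torch.exp(logp_proximal - logp_sample)
    # Claim 3: upper-bound constraint -> min(weight, preset threshold);
    # detached so the off-policy correction carries no gradient.
    w_behavior = torch.clamp(w_behavior, max=behavior_cap).detach()

    # Claim 4: policy gradient importance weight = current prob / proximal prob.
    w_pg = torch.exp(logp_current - logp_proximal)
    # Claim 5: clip to a preset interval.
    w_pg_clipped = torch.clamp(w_pg, 1.0 - clip_eps, 1.0 + clip_eps)

    # Claim 1: combine the capped behavior weight with both the raw and the
    # clipped policy gradient weights and the advantage function.
    surrogate = torch.min(w_pg * advantage, w_pg_clipped * advantage)
    return -(w_behavior * surrogate).mean()
```

Calling `loss.backward()` on the returned value and then stepping an optimizer would realize the final step of claim 1, back propagation followed by an update of the current behavior policy model's parameters.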
Description
Asynchronous reinforcement learning training method, device, equipment and medium

Technical Field
The application relates to the technical field of artificial intelligence, and in particular to an asynchronous reinforcement learning training method, device, equipment and medium.

Background
With the development of artificial intelligence technology, reinforcement learning is widely applied to tasks such as mathematical reasoning, code generation, dialogue decision-making and automatic control. Current reinforcement learning training typically employs an asynchronous training paradigm: sample data is continuously generated by an inference engine and continuously, asynchronously consumed by a training engine for model training. In this paradigm, however, sample generation runs in parallel with model training, so the sample data used by the training engine is often generated by the inference engine from a historical behavior policy model rather than from the current behavior policy model. This readily introduces off-policy deviation, which in turn harms training stability and degrades model convergence.

Disclosure of Invention
Based on the above problems, the application provides an asynchronous reinforcement learning training method, device, equipment and medium that can effectively handle off-policy deviation in asynchronous reinforcement learning and improve training stability and convergence accuracy. The embodiments of the application disclose the following technical solutions.

In a first aspect, the application discloses an asynchronous reinforcement learning training method, applied to a training engine, the method comprising: extracting sample data from a sample data area, wherein the sample data is generated by an inference engine running asynchronously with the training engine and written into the sample data area, and the sample data comprises a sample behavior policy probability, which is the output probability assigned by the sample behavior policy model to the corresponding action in the sample data when the inference engine generated the sample data; determining a behavior importance weight according to a proximal behavior policy probability and the sample behavior policy probability, and performing upper-bound constraint processing on the behavior importance weight to obtain a processed behavior importance weight; determining a policy gradient importance weight according to a current behavior policy probability and the proximal behavior policy probability, and clipping the policy gradient importance weight to obtain a processed policy gradient importance weight; determining a loss value through a decoupled policy gradient loss function according to the processed behavior importance weight, the policy gradient importance weight, the processed policy gradient importance weight, and an advantage function corresponding to the sample data; and performing back propagation according to the loss value and updating the model parameters of the current behavior policy model.
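The patent states no explicit equation, so the following is a hedged reconstruction from the claim language only. Writing $\pi_\theta$ for the current behavior policy, $\pi_{\text{prox}}$ for the proximal behavior policy, $\pi_s$ for the sample behavior policy, $A$ for the advantage function, $c$ for the preset upper threshold, and $[1-\epsilon,\, 1+\epsilon]$ for the preset clipping interval, one plausible form of the decoupled policy gradient loss is

$$ w_b = \min\!\left(\frac{\pi_{\text{prox}}(a\mid s)}{\pi_s(a\mid s)},\, c\right), \qquad w_{pg} = \frac{\pi_\theta(a\mid s)}{\pi_{\text{prox}}(a\mid s)}, $$

$$ \mathcal{L}(\theta) = -\,\mathbb{E}\!\left[\, w_b \cdot \min\bigl( w_{pg}\, A,\ \operatorname{clip}(w_{pg},\, 1-\epsilon,\, 1+\epsilon)\, A \bigr) \right]. $$

On this reading, the decoupling is that the off-policy correction $w_b$ (capped, and held fixed with respect to $\theta$) is separated from the trust-region ratio $w_{pg}$ that carries the gradient.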
Optionally, the determining of the behavior importance weight according to the proximal behavior policy probability and the sample behavior policy probability includes: determining the behavior importance weight according to the ratio of the proximal behavior policy probability to the sample behavior policy probability.

Optionally, the performing of upper-bound constraint processing on the behavior importance weight to obtain a processed behavior importance weight includes: determining the smaller of the behavior importance weight and a preset upper threshold as the processed behavior importance weight.

Optionally, the determining of the policy gradient importance weight according to the current behavior policy probability and the proximal behavior policy probability includes: determining the policy gradient importance weight according to the ratio of the current behavior policy probability to the proximal behavior policy probability.

Optionally, the clipping of the policy gradient importance weight to obtain a processed policy gradient importance weight includes: limiting the policy gradient importance weight to a preset clipping interval to obtain the processed policy gradient importance weight.

Optionally, the sample data further includes version information of the sample behavior policy model, and the determining of the behavior importance weight according to the proximal behavior policy probability and the sample behavior policy probability includes: determining the staleness of the sample data according to the version information of the sample behavior policy model and the version information of the current behavior policy model; and if the staleness is less than or equal to a preset staleness threshold, determining the behavior importance weight according to the proximal behavior policy probability and the sample behavior policy probability.
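The staleness gate and the multi-version loss aggregation (claims 6 and 7) can be sketched as below. The patent does not specify how versions are encoded, what threshold is used, or which aggregation is applied, so the integer version numbering, the default threshold, and the mean aggregation are all assumptions for illustration.

```python
import torch

def staleness_ok(sample_version: int, current_version: int,
                 max_staleness: int = 4) -> bool:
    """Claim 6 sketch: staleness is taken as the version gap between the
    model that generated the sample and the current behavior policy model.
    Samples pass only if the gap is at or below a preset threshold."""
    return (current_version - sample_version) <= max_staleness

def aggregate_loss(sub_losses: list[torch.Tensor]) -> torch.Tensor:
    """Claim 7 sketch: per-sub-batch losses, one per generating model
    version, are aggregated into a single sample loss. A mean is one
    plausible aggregation; the patent does not fix the operator."""
    return torch.stack(sub_losses).mean()
```

In this sketch, a sample that fails `staleness_ok` would simply be skipped before the importance weights are computed, while `aggregate_loss` would combine the per-version losses produced by a function such as the `decoupled_pg_loss` sketch above before the single back-propagation step.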