CN-115293227-B - Model training method and related equipment
Abstract
A model training method relating to the field of artificial intelligence. The method comprises: processing first data through a first reinforcement learning model to obtain a first processing result; processing the first data through a first target neural network selected from a plurality of first neural networks to obtain a second processing result, wherein each first neural network is an iteration result obtained during iterative training of a first initial neural network; and updating the first reinforcement learning model according to the first processing result and the second processing result. In this method, interference with the target task is output using the historical training results of the adversarial agent (that is, the adversarial agents obtained in earlier iterations), so that more effective interference with the target task can be obtained across different scenarios, improving the training effect and generalization of the model.
Inventors
- HE XIU
- LI DONG
Assignees
- 华为技术有限公司
Dates
- Publication Date
- 20260421
- Application Date
- 20220621
- Priority Date
- 20220621
Claims (19)
- 1. A model training method, the method comprising: processing first data through a first reinforcement learning model to obtain a first processing result, wherein the first data indicates the state of a target object, and the first processing result is used as control information when a target task is executed on the target object; processing the first data through a first target neural network to obtain a second processing result, wherein the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by iteratively training a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model; wherein the target object is a robot, the target task is posture control of the robot, and the first processing result is posture control information of the robot; or the target object is a vehicle, the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
- 2. The method of claim 1, wherein the first target neural network is selected from the plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.
- 3. The method of claim 2, wherein the processing result obtained by processing the data by each first neural network is used as the interference when the target task is executed, and the first selection probability is positively correlated with the interference degree of the processing result output by the corresponding first neural network on the target task.
- 4. The method of claim 2, wherein updating the first reinforcement learning model according to the third processing result comprises: obtaining a reward value corresponding to the target task according to the third processing result; and updating the first reinforcement learning model according to the reward value; and wherein the method further comprises: updating the first selection probability corresponding to the first target neural network according to the reward value.
- 5. The method according to any one of claims 1 to 4, further comprising: processing the first data through a second target neural network to obtain a fourth processing result, wherein the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by iteratively training a second initial neural network; wherein executing the target task according to the first processing result and the second processing result to obtain a third processing result comprises: executing the target task according to the first processing result, the second processing result and the fourth processing result to obtain the third processing result.
- 6. The method of claim 5, wherein the interference types of the second processing result and the fourth processing result are different; or the interference objects of the second processing result and the fourth processing result are different; or the first target neural network is used for determining the second processing result from a first numerical range according to the first data, the second target neural network is used for determining the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.
- 7. The method according to any one of claims 1 to 4, further comprising: processing second data through a second reinforcement learning model to obtain a fifth processing result, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained by iteratively training an initial reinforcement learning model; processing the second data through a third target neural network to obtain a sixth processing result, wherein the third target neural network belongs to the plurality of first neural networks, and the sixth processing result is used as interference information when the target task is executed; executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result; and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
- 8. The method of claim 7, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model of the plurality of reinforcement learning models.
- 9. A model training apparatus, the apparatus comprising: a data processing module, configured to: process first data through a first reinforcement learning model to obtain a first processing result, wherein the first data indicates the state of a target object, and the first processing result is used as control information when a target task is executed on the target object; process the first data through a first target neural network to obtain a second processing result, wherein the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by iteratively training a first initial neural network; and execute the target task according to the first processing result and the second processing result to obtain a third processing result; and a model updating module, configured to update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model; wherein the target object is a robot, the target task is posture control of the robot, and the first processing result is posture control information of the robot; or the target object is a vehicle, the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
- 10. The apparatus of claim 9, wherein the first target neural network is selected from the plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.
- 11. The apparatus of claim 10, wherein the processing result of each first neural network processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree of interference of the processing result output by the corresponding first neural network on the target task.
- 12. The apparatus according to claim 10, wherein the model updating module is specifically configured to: obtain a reward value corresponding to the target task according to the third processing result; and update the first reinforcement learning model according to the reward value; and the model updating module is further configured to: update the first selection probability corresponding to the first target neural network according to the reward value.
- 13. The apparatus according to any one of claims 9 to 12, wherein the data processing module is further configured to: process the first data through a second target neural network to obtain a fourth processing result, wherein the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by iteratively training a second initial neural network; and the data processing module is specifically configured to: execute the target task according to the first processing result, the second processing result and the fourth processing result to obtain the third processing result.
- 14. The apparatus of claim 13, wherein the interference types of the second processing result and the fourth processing result are different; or the interference objects of the second processing result and the fourth processing result are different; or the first target neural network is used for determining the second processing result from a first numerical range according to the first data, the second target neural network is used for determining the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.
- 15. The apparatus according to any one of claims 9 to 12, wherein the data processing module is further configured to: process second data through a second reinforcement learning model to obtain a fifth processing result, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained by iteratively training an initial reinforcement learning model; process the second data through a third target neural network to obtain a sixth processing result, wherein the third target neural network belongs to the plurality of first neural networks, and the sixth processing result is used as interference information when the target task is executed; and execute the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result; and the model updating module is further configured to: update the third target neural network according to the seventh processing result to obtain an updated third target neural network.
- 16. The apparatus of claim 15, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model of the plurality of reinforcement learning models.
- 17. A model training apparatus, comprising a memory and a processor, wherein the memory stores code, and the processor is configured to obtain the code and perform the method according to any one of claims 1 to 8.
- 18. A computer readable storage medium comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 8.
- 19. A computer program product comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any of claims 1 to 8.
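The training loop of claims 1 to 4 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the scalar toy task, the single controller gain `k` standing in for the first reinforcement learning model, the list `adversary_gains` standing in for saved adversary snapshots (the "plurality of first neural networks"), the softmax over running scores used as the first selection probability, and the random-search step used in place of a real reinforcement learning update.

```python
import math
import random


def rollout(k, adversary_gain, steps=20):
    """Toy target task: hold a scalar state near 0. Returns a reward (higher is better)."""
    state, reward = 1.0, 0.0
    for _ in range(steps):
        control = -k * state                       # "first processing result" (control)
        direction = 1.0 if state >= 0 else -1.0
        disturbance = adversary_gain * direction   # "second processing result" (interference)
        state = state + control + disturbance      # execute the task with both outputs
        reward -= state * state                    # reward derived from the resulting state
    return reward


def softmax(scores):
    """Turn per-adversary scores into selection probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(adversary_gains, iterations=200, score_rate=0.1, seed=0):
    """Hypothetical loop: sample an adversary snapshot, run the task, update both sides."""
    rng = random.Random(seed)
    k = 0.0                                   # stand-in for the RL model's parameters
    scores = [0.0] * len(adversary_gains)     # one running score per adversary snapshot
    for _ in range(iterations):
        probs = softmax(scores)
        idx = rng.choices(range(len(adversary_gains)), weights=probs)[0]
        gain = adversary_gains[idx]
        reward = rollout(k, gain)
        # Stand-in for an RL update: keep a random perturbation only if it scores better.
        candidate = k + rng.gauss(0.0, 0.1)
        candidate_reward = rollout(candidate, gain)
        if candidate_reward > reward:
            k, reward = candidate, candidate_reward
        # Lower task reward means stronger interference, so the score (and with it
        # the selection probability) of this adversary goes up.
        scores[idx] = (1 - score_rate) * scores[idx] + score_rate * (-reward)
    return k, softmax(scores)
```

Calling `train([0.0, 0.2, 0.5])` returns the tuned gain together with the final selection probabilities. The score update is one simple way to keep the selection probability positively correlated with the degree of interference, as claim 3 requires; the patent text shown here does not prescribe this particular formula.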
Description
Model training method and related equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a model training method and related equipment.
Background
Artificial intelligence (AI) is the theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, obtain knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-machine interaction, recommendation and search, basic AI theory, and the like.
Reinforcement learning (RL) is an important machine learning method in the field of artificial intelligence, and has many applications in fields such as autonomous driving, intelligent robot control, and analysis and prediction. Specifically, the main problem that reinforcement learning solves is how an intelligent device, by interacting directly with the environment, learns the skills it uses when executing a specific task, so as to maximize the long-term reward for that task. Applying a reinforcement learning algorithm often requires interaction with an online environment to obtain data and train. It is common practice to model real-world scenes and generate a virtual, simulated online environment.
In this case, if there is even a slight difference between the training environment and the real environment in which the algorithm is deployed, the trained algorithm is likely to fail, resulting in poor performance in the real scene. This problem can be alleviated by improving the robustness of the reinforcement learning algorithm. One method is to introduce virtual interference into the virtual environment and train the reinforcement learning algorithm under that interference, which improves the algorithm's ability to cope with interference and enhances its robustness and generalization. That is, an adversarial agent can be set up for the reinforcement learning model to be trained: the data output by the adversarial agent and the data output by the reinforcement learning model jointly execute the task, and the data output by the adversarial agent serves as interference when the target task is executed. However, the difference between the training environment and the deployment environment is unpredictable, and in existing training methods the adversarial agent can output only one specific disturbance (for example, in robot control, a force within a specific range applied to a certain joint). When the change in the real environment is inconsistent with the assumed disturbance (that is, the disturbance output by the adversarial agent), the algorithm's performance degrades and its robustness is poor.
Disclosure of Invention
The application provides a model training method which can improve the training effect and generalization of a model.
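For contrast with the method summarized in this section, the conventional setup from the background, in which a single adversary applies a force within one fixed range alongside the controller's output, can be sketched as a single simulation step. The point-mass dynamics, the names `step` and `clip`, and the default `force_range` are illustrative assumptions and do not come from the patent.

```python
def clip(value, low, high):
    """Limit a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def step(position, velocity, control_force, adversary_force,
         force_range=(-0.5, 0.5), dt=0.1, mass=1.0):
    """Advance a hypothetical 1-D point mass by one time step.

    The controller's force and the adversary's force act together, but the
    adversary is restricted to one fixed range, as in the existing methods
    the background describes.
    """
    disturbance = clip(adversary_force, *force_range)
    acceleration = (control_force + disturbance) / mass
    velocity = velocity + acceleration * dt
    position = position + velocity * dt
    return position, velocity
```

A disturbance outside `force_range` can never be produced by this adversary, so a controller trained only against it has no exposure to larger or differently placed disturbances.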
The method comprises: processing first data through a first reinforcement learning model to obtain a first processing result, wherein the first data indicates the state of a target object, and the first processing result is used as control information when a target task is executed on the target object; processing the first data through a first target neural network to obtain a second processing result, wherein the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained during iterative training of a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model. In one possible implementation, the first reinforcement learning model may be an initialized model, or the output of one iteration of the model training process. It should be appreciated that reinforcement learning models in embodiments of the present application include, but are not limited to, deep neur