CN-122021363-A - End-to-end intelligent driving training system, method, equipment and storage medium
Abstract
The application discloses an end-to-end intelligent driving training system, method, device, and storage medium, relating to the technical field of intelligent driving. The system comprises a pre-trained end-to-end driving policy network and a policy optimization module for optimizing that network. The end-to-end driving policy network is configured to output driving actions according to the current policy parameters and the input environment state. The policy optimization module is configured to generate reward signals of a plurality of dimensions from the input environment state, calculate an advantage function value for the reward signals of each dimension separately, and perform weighted fusion of the calculated advantage function values to obtain a fused advantage value used to update the policy parameters. The scheme substantially improves the stability and convergence efficiency of the reinforcement learning training process, while also enhancing the modularity and interpretability of the reward model system.
Inventors
- REN SHAOQING
- WANG PENG
- CHEN KUNSHENG
- LIU GUOYI
- CHENG ZHENGXIN
- FU XIAOXIN
- YE CHAOQIANG
- SHE XIAOLI
Assignees
- 安徽蔚来智驾科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-14
Claims (10)
- 1. An end-to-end intelligent driving training system, characterized by comprising a pre-trained end-to-end driving policy network and a policy optimization module for optimizing the end-to-end driving policy network; the end-to-end driving policy network is configured to output driving actions according to the current policy parameters and the input environment state; the policy optimization module is configured to generate reward signals of a plurality of dimensions from the input environment state, calculate an advantage function value for the reward signals of each dimension separately, and perform weighted fusion of the calculated advantage function values to obtain a fused advantage value for updating the policy parameters.
- 2. The end-to-end intelligent driving training system of claim 1, wherein the policy optimization module comprises a reward combination module, an advantage calculation module, and an advantage fusion module; the reward combination module comprises a plurality of independent and extensible reward models, each reward model being used to output a reward signal according to the environment state; the advantage calculation module is used to calculate a corresponding advantage function value from the reward signal output by each reward model; the advantage fusion module is used to perform a weighted fusion over the plurality of advantage function values calculated by the advantage calculation module to obtain the fused advantage value.
- 3. The end-to-end intelligent driving training system according to claim 2, wherein the advantage calculation module comprises a plurality of advantage calculators in one-to-one correspondence with the reward models, each advantage calculator being used to estimate a state value for the reward signal output by the corresponding reward model and to calculate the advantage function value based on that state value.
- 4. The end-to-end intelligent driving training system of claim 2, wherein the reward models comprise at least two of a safety reward model, an efficiency reward model, and a comfort reward model.
- 5. The end-to-end intelligent driving training system of claim 4, wherein the inputs of the safety reward model comprise at least the relative positions, speeds, and bounding boxes of the ego vehicle and all surrounding obstacles; the inputs of the efficiency reward model comprise at least the speed of the ego vehicle and the maximum speed allowed on the current road section; and the inputs of the comfort reward model comprise at least the longitudinal and lateral accelerations of the ego vehicle.
- 6. The end-to-end intelligent driving training system of claim 1, further comprising a policy gradient estimation module; the policy gradient estimation module is configured to estimate a policy gradient from the fused advantage value obtained by the policy optimization module, the policy gradient being used to update the policy parameters of the end-to-end driving policy network.
- 7. The end-to-end intelligent driving training system according to claim 1 or 6, characterized in that the system further comprises an environment simulation module; the environment simulation module is configured to generate an environment state in a simulated driving environment and input it into the end-to-end driving policy network, to execute the driving action output by the end-to-end driving policy network and generate the next environment state, and to input the generated environment states into the policy optimization module.
- 8. An end-to-end intelligent driving training method, characterized in that it is applied to the end-to-end intelligent driving training system according to any one of claims 1 to 7, the method at least comprising: generating reward signals of a plurality of dimensions from the environment state; calculating an advantage function value for the reward signals of each dimension separately, and performing weighted fusion of the calculated advantage function values to obtain a fused advantage value; and updating the policy parameters of the end-to-end driving policy network based on the fused advantage value.
- 9. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program which, when executed by the at least one processor, implements the end-to-end intelligent driving training method of claim 8.
- 10. A computer-readable storage medium having stored therein program code adapted to be loaded and executed by a processor to perform the end-to-end intelligent driving training method of claim 8.
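As an illustrative sketch only (not part of the patent), claims 1-3 can be read as one advantage estimator per reward model followed by weighted fusion of the per-dimension advantages. The function names, the use of generalized advantage estimation (GAE), the per-dimension normalization, and the reward dimensions below are all assumptions:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one reward dimension
    (hypothetical estimator; the patent only requires *an* advantage
    function per dimension, not GAE specifically)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def fuse_advantages(per_dim_rewards, per_dim_values, weights):
    """One advantage calculator per reward model, then weighted fusion
    into a single fused advantage value per timestep."""
    fused = None
    for key, w in weights.items():
        adv = compute_gae(per_dim_rewards[key], per_dim_values[key])
        # Normalize each dimension so no single reward scale dominates
        # (an assumption; the claims only specify weighted fusion).
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        fused = w * adv if fused is None else fused + w * adv
    return fused
```

Fusing at the advantage level, rather than summing scalar rewards first, is what lets each reward dimension keep its own value baseline and scale before the weights are applied.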
Description
Technical Field

The application relates to the technical field of intelligent driving, and in particular to an end-to-end intelligent driving training system, method, device, and storage medium.

Background

In the field of intelligent driving, the end-to-end autonomous driving paradigm based on deep learning has become an important research direction. This paradigm aims to map raw sensor data (e.g., from cameras and lidars) directly to vehicle control commands (e.g., steering, throttle, braking) through a single neural network model, thereby simplifying the complexity of the traditional modular pipeline. To achieve safe, comfortable, and efficient driving behavior, a two-stage architecture of pre-training followed by reinforcement learning (RL) post-training is commonly used in the industry. The model is first pre-trained on large-scale real or simulated data to acquire basic driving capability, and is then fine-tuned through reinforcement learning in a simulation environment to further improve performance and meet complex multi-objective requirements. In the RL post-training stage, it is therefore important to design a reward function that comprehensively evaluates the multi-dimensional performance of driving behavior (e.g., safety, comfort, traffic efficiency, and compliance with traffic rules).

In the RL post-training phase, existing end-to-end intelligent driving models mainly adopt two approaches:

1. Weighted summation of scalar rewards: a number of sub-reward terms are predefined, manually set weights are assigned to them, and the weighted sub-rewards are summed into a single scalar reward signal.
An advantage function is then calculated from this scalar reward to guide the gradient update of the policy network.

2. Integrated reward model: a single neural network is trained as the reward model; it takes the driving state or trajectory as input and directly outputs an integrated scalar reward value, typically by imitating human driving data or by training on human preferences (e.g., RLHF).

These solutions to the multi-objective reward problem suffer from the following defects: gradient confusion and inefficiency in the policy optimization process, heavy reliance on heuristic experience for balancing multiple reward objectives, and insufficient system extensibility and interpretability.

Disclosure of Invention

To address the technical defects of the prior art, the application provides an end-to-end intelligent driving training system, method, device, and storage medium that fundamentally restructure the multi-objective reward optimization process, effectively solving the problems of the existing "pre-training plus reinforcement learning post-training" framework: an unstable training process, reward balancing that depends on extensive engineering experience (i.e., "parameter tuning"), and a system that is difficult to extend and maintain.
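For contrast, the conventional scalar-reward baseline described above can be sketched as follows; the metric keys and weight values are illustrative assumptions, not taken from the patent. Because the sub-rewards are collapsed into one scalar before any advantage is computed, the gradient contributions of the individual objectives become indistinguishable:

```python
# Conventional baseline: collapse sub-rewards into one scalar *before*
# computing the advantage, entangling the objectives' gradients.
# Keys such as "safety" are hypothetical examples.
def scalar_reward(sub_rewards, weights):
    return sum(weights[k] * sub_rewards[k] for k in weights)

# e.g. a near-collision penalty dominated by a hand-tuned safety weight:
r = scalar_reward(
    {"safety": -1.0, "efficiency": 0.5, "comfort": 0.2},
    {"safety": 2.0, "efficiency": 1.0, "comfort": 0.5},
)
```

A single value function and advantage are then fit to this scalar, so re-balancing any one objective requires re-tuning the weights and retraining, which is the brittleness the application's per-dimension advantage design is meant to avoid.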
In a first aspect, the present application provides an end-to-end intelligent driving training system comprising a pre-trained end-to-end driving policy network and a policy optimization module for optimizing the end-to-end driving policy network. The end-to-end driving policy network is configured to output driving actions according to the current policy parameters and the input environment state. The policy optimization module is configured to generate reward signals of a plurality of dimensions from the input environment state, calculate an advantage function value for the reward signals of each dimension separately, and perform weighted fusion of the calculated advantage function values to obtain a fused advantage value for updating the policy parameters.

In some embodiments, the policy optimization module includes a reward combination module, an advantage calculation module, and an advantage fusion module. The reward combination module comprises a plurality of independent and extensible reward models, each of which outputs a reward signal according to the environment state. The advantage calculation module calculates a corresponding advantage function value from the reward signal output by each reward model. The advantage fusion module performs a weighted fusion over the plurality of advantage function values calculated by the advantage calculation module to obtain the fused advantage value.

In some embodiments, the advantage calculation module includes a plurality of advantage calculators in one-to-one correspondence with the reward models, each advantage calculator being used to estimate a state value for the reward signal output by the corresponding reward model and to calculate the advantage function value based on that state value.