EP-4738199-A1 - AUTONOMOUS DRIVING MODEL BASED ON MULTIMODAL LARGE MODEL, TRAINING METHOD, AND AUTONOMOUS DRIVING METHOD
Abstract
The present disclosure provides an autonomous driving model based on a multimodal large model, a training method, and an autonomous driving method, and relates to the technical field of computers, in particular to the technical fields of autonomous driving and artificial intelligence. An implementation solution includes: obtaining a training corpus dataset including at least a visual-text aligned corpus and a spatial understanding training corpus for an autonomous driving scenario; encoding, using a vision encoder, visual data in the visual-text aligned corpus to obtain encoded data; mapping the encoded data using a mapping layer; separately processing, using a generation layer, the mapped encoded data, the text data, and the spatial understanding training corpus to obtain a first prediction result and a second prediction result of the autonomous driving model; and adjusting a parameter of the autonomous driving model based on at least the first prediction result and the second prediction result. An autonomous driving model trained according to embodiments of the present disclosure has both a multimodal information understanding capability and a reasoning capability in an autonomous driving scenario.
Inventors
- HUANG, JIZHOU
- ZENG, ZENGFENG
Assignees
- Apollo Intelligent Driving Technology (Beijing) Co., Ltd.
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2025-09-17
Claims (15)
- A method for training an autonomous driving model, wherein the autonomous driving model comprises a vision encoder, a mapping layer, and a generation layer, and the method comprises: obtaining (S202) a training corpus dataset, wherein the training corpus dataset comprises at least a visual-text aligned corpus and a spatial understanding training corpus for an autonomous driving scenario; encoding (S204), using the vision encoder, visual data in the visual-text aligned corpus to obtain encoded data; mapping (S206) the encoded data using the mapping layer; processing (S208), using the generation layer, the mapped encoded data and text data in the visual-text aligned corpus to obtain a first prediction result of the autonomous driving model; processing (S210), using the generation layer, the spatial understanding training corpus to obtain a second prediction result of the autonomous driving model; and adjusting (S212) a parameter of the autonomous driving model based on at least the first prediction result and the second prediction result.
- The method according to claim 1, wherein the visual-text aligned corpus comprises a visual-text aligned corpus for a general scenario and a visual-text aligned corpus for the autonomous driving scenario.
- The method according to any one of claims 1-2, wherein adjusting the parameter of the autonomous driving model based on at least the first prediction result and the second prediction result comprises: in a first stage of training, adjusting the parameter of the autonomous driving model based on the first prediction result; and in a second stage following the first stage, adjusting the parameter of the autonomous driving model based on the second prediction result.
- The method according to claim 3, wherein the training corpus dataset further comprises an autonomous driving domain-specific knowledge corpus, and the method further comprises: in the second stage, processing, using the generation layer, the autonomous driving domain-specific knowledge corpus to obtain a third prediction result of the autonomous driving model; and adjusting the parameter of the autonomous driving model based on the third prediction result.
- The method according to claim 3, wherein the training corpus dataset further comprises instruction training data, wherein the instruction training data comprises an input instruction and a sample response corresponding to the input instruction, and the method further comprises: in a third stage following the second stage, processing, using the generation layer, the input instruction to obtain a predicted response of the autonomous driving model; and adjusting the parameter of the autonomous driving model based on a difference between the predicted response and the sample response.
- The method according to claim 5, wherein the training corpus dataset further comprises chain-of-thought reasoning training data, wherein the chain-of-thought reasoning training data comprises a sample input image and a sample reasoning result corresponding to the sample input image, and the method further comprises: in the third stage, encoding the sample input image using the vision encoder to obtain an encoded sample input; mapping the encoded sample input using the mapping layer; processing, using the generation layer, the mapped encoded sample input to obtain a predicted reasoning result of the autonomous driving model, wherein the predicted reasoning result comprises at least one subtask prediction result; and adjusting the parameter of the autonomous driving model based on a difference between the predicted reasoning result and the sample reasoning result.
- The method according to claim 6, wherein the at least one subtask prediction result is output according to a predefined subtask order, and each subtask prediction result is output taking into consideration the content of at least one previously output subtask prediction result.
- The method according to claim 6, wherein the predicted reasoning result comprises code for calling an autonomous driving system.
- The method according to any one of claims 1 to 8, wherein the visual data in the visual-text aligned corpus comprises video data.
- The method according to claim 9, wherein the vision encoder comprises a video encoder, and the method further comprises: encoding the video data using the video encoder, to obtain encoded video data; mapping, using an embedding matrix, the encoded video data to obtain an embedding vector corresponding to the encoded video data; processing, using the generation layer, at least the embedding vector to obtain a predicted video vector; and decoding, using a video decoder, the predicted video vector to obtain a predicted video.
- The method according to claim 10, wherein processing, using the generation layer, at least the embedding vector to obtain the predicted video vector comprises: processing, using the generation layer, the embedding vector and at least one given driving action to obtain the predicted video vector.
- The method according to claim 11, further comprising: determining a generation probability of the predicted video vector as a reward value for the at least one given driving action; and performing direct preference optimization on the autonomous driving model based on the given driving action and the reward value.
- The method according to any one of claims 1 to 8, further comprising: obtaining at least one simulated driving action and simulated visual-text information of a driving scenario in a simulation environment; encoding, using the vision encoder, visual data in the simulated visual-text information to obtain encoded simulation data; mapping the encoded simulation data using the mapping layer; processing, using the generation layer, the mapped encoded simulation data and text data in the simulated visual-text information to obtain a simulated prediction result of the autonomous driving model; executing the simulated prediction result in the simulation environment; obtaining simulation feedback generated in the simulation environment for prediction of the simulated prediction result; and performing direct preference optimization on the autonomous driving model based on the simulated driving action and the simulation feedback.
- An autonomous driving method, comprising: obtaining (S702) visual data and text data of a current scenario; and inputting (S704) the visual data and the text data of the current scenario into an autonomous driving model to obtain a predicted driving decision output by the autonomous driving model, and controlling a vehicle to perform autonomous driving based on the predicted driving decision, wherein the autonomous driving model is trained using the method according to any one of claims 1 to 13.
- An autonomous driving apparatus, comprising: an obtaining unit configured to obtain visual data and text data of a current scenario; and an autonomous driving decision unit configured to input the visual data and the text data of the current scenario into an autonomous driving model to obtain a predicted driving decision output by the autonomous driving model, and control a vehicle to perform autonomous driving based on the predicted driving decision, wherein the autonomous driving model is trained using the method according to any one of claims 1 to 13.
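Independent claim 1 above can be read as a conventional multimodal training step: the vision encoder embeds the visual data, the mapping layer projects the embedding into the generation layer's input space, and the generation layer produces two prediction results whose losses jointly drive the parameter update. The following is a minimal toy sketch of that step; the linear layers, dimensions, mean-squared losses, and the choice to update only the generation layer are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Toy stand-in for each trainable component of the model."""
    def __init__(self, d_in, d_out):
        self.w = rng.standard_normal((d_in, d_out)) * 0.1

    def __call__(self, x):
        return x @ self.w

# The three components named in claim 1 (all dimensions are illustrative).
vision_encoder = Linear(32, 8)    # S204: encodes visual data
mapping_layer = Linear(8, 8)      # S206: maps the encoded data
generation_layer = Linear(8, 4)   # S208/S210: produces prediction results

def training_step(visual, text, spatial, target1, target2, lr=0.05):
    """One update following steps S204-S212 of claim 1 (toy MSE losses)."""
    mapped = mapping_layer(vision_encoder(visual))   # S204 + S206
    pred1 = generation_layer(mapped + text)          # S208: first prediction result
    pred2 = generation_layer(spatial)                # S210: second prediction result
    loss = np.mean((pred1 - target1) ** 2) + np.mean((pred2 - target2) ** 2)
    # S212: adjust a parameter based on both prediction results
    # (exact gradient of the combined MSE loss, restricted to the
    #  generation layer for brevity).
    g1 = (mapped + text).T @ (2 * (pred1 - target1)) / pred1.size
    g2 = spatial.T @ (2 * (pred2 - target2)) / pred2.size
    generation_layer.w -= lr * (g1 + g2)
    return loss
```

Repeating `training_step` on a fixed batch drives the combined loss down, which is all claim 1 requires: a single parameter adjustment informed by both the first and the second prediction result.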
Description
TECHNICAL FIELD
The present disclosure relates to the technical field of computers, in particular to the technical fields of autonomous driving and artificial intelligence, and specifically to an autonomous driving model based on a multimodal large model, a training method, an autonomous driving method, an apparatus, an electronic device, a computer-readable storage medium, a computer program product, and a vehicle.
BACKGROUND
Artificial intelligence is the discipline of making a computer simulate certain thinking processes and intelligent behaviors of a human (such as learning, reasoning, thinking, and planning), and involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include the following general directions: computer vision technologies, speech recognition technologies, natural language processing technologies, machine learning/deep learning, big data processing technologies, and knowledge graph technologies.
An autonomous driving model is used to generate driving decisions based on environmental information of a current driving scenario and driving data of a vehicle, thereby achieving vehicle control.
Methods described in this section are not necessarily methods that have been previously conceived or employed. Unless otherwise expressly indicated, it should not be assumed that any method described in this section is conventional technology merely because it is included in this section. Similarly, unless otherwise expressly indicated, problems mentioned in this section should not be assumed to be universally recognized in any conventional technology.
SUMMARY
The present disclosure provides an autonomous driving model based on a multimodal large model, a training method, an autonomous driving method, an apparatus, an electronic device, a computer-readable storage medium, a computer program product, and a vehicle.
According to an aspect of the present disclosure, a method for training an autonomous driving model is provided. The autonomous driving model includes a vision encoder, a mapping layer, and a generation layer, and the method includes: obtaining a training corpus dataset, where the training corpus dataset includes at least a visual-text aligned corpus and a spatial understanding training corpus for an autonomous driving scenario; encoding, using the vision encoder, visual data in the visual-text aligned corpus to obtain encoded data; mapping the encoded data using the mapping layer; processing, using the generation layer, the mapped encoded data and text data in the visual-text aligned corpus to obtain a first prediction result of the autonomous driving model; processing, using the generation layer, the spatial understanding training corpus to obtain a second prediction result of the autonomous driving model; and adjusting a parameter of the autonomous driving model based on at least the first prediction result and the second prediction result.
According to another aspect of the present disclosure, an autonomous driving method is provided. The method includes: obtaining visual data and text data of a current scenario; and inputting the visual data and the text data of the current scenario into an autonomous driving model to obtain a predicted driving decision output by the autonomous driving model, and controlling a vehicle to perform autonomous driving based on the predicted driving decision, where the autonomous driving model is trained using the method provided in the embodiments of the present disclosure.
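Claims 3 to 6 arrange the corpora above into a staged curriculum: visual-text alignment first, then spatial understanding together with domain-specific knowledge, then instruction following and chain-of-thought reasoning. A schematic stage scheduler is sketched below; the stage names, the `losses` dictionary, and the unit loss weights are illustrative assumptions, not details taken from the patent.

```python
# Stage schedule implied by claims 3-6: each training stage selects
# which prediction results contribute to the parameter update.
STAGES = {
    1: ["visual_text_alignment"],                      # first prediction result
    2: ["spatial_understanding", "domain_knowledge"],  # second and third results
    3: ["instruction_following", "chain_of_thought"],  # claims 5 and 6
}

def stage_loss(stage, losses):
    """Sum only the per-task losses active in the given training stage."""
    return sum(losses[name] for name in STAGES[stage])

# Illustrative per-task loss values for one batch.
example_losses = {
    "visual_text_alignment": 0.9,
    "spatial_understanding": 0.7,
    "domain_knowledge": 0.4,
    "instruction_following": 0.5,
    "chain_of_thought": 0.6,
}
```

Restricting each stage to its own subset of losses matches the claim language, in which the parameter is adjusted based on the first prediction result in the first stage and based on the later prediction results only in the subsequent stages.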
According to another aspect of the present disclosure, an apparatus for training an autonomous driving model is provided. The autonomous driving model includes a vision encoder, a mapping layer, and a generation layer, and the apparatus includes: an obtaining unit configured to obtain a training corpus dataset, where the training corpus dataset includes at least a visual-text aligned corpus and a spatial understanding training corpus for an autonomous driving scenario; an encoding unit configured to encode, using the vision encoder, visual data in the visual-text aligned corpus to obtain encoded data; a mapping unit configured to map the encoded data using the mapping layer; a generation unit configured to process, using the generation layer, the mapped encoded data and text data in the visual-text aligned corpus to obtain a first prediction result of the autonomous driving model, and process, using the generation layer, the spatial understanding training corpus to obtain a second prediction result of the autonomous driving model; and a parameter adjustment unit configured to adjust a parameter of the autonomous driving model based on at least the first prediction result and the second prediction result.
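Claims 12 and 13 further describe scoring candidate driving actions by the generation probability of the predicted future video and refining the model with direct preference optimization (DPO). The standard DPO pairwise loss over two candidate actions can be sketched as follows; the function name, the log-probability inputs, and the value of `beta` are illustrative, and only the reward-as-generation-probability framing comes from the claims.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO pairwise loss: prefer the driving action whose
    predicted future video the policy assigns higher generation
    probability, measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already favors
    # the chosen action more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Under this framing, the simulation feedback of claim 13 (or the generation-probability reward of claim 12) decides which of two simulated driving actions is "chosen" and which is "rejected", and minimizing the loss shifts probability mass toward the preferred action.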