US-20260126799-A1 - LANGUAGE-GROUNDED VEHICLE PATH PLANNING
Abstract
A device includes a memory configured to store images representing scenes associated with a vehicle. The device includes one or more processors configured to obtain a set of images representing a scene associated with the vehicle. The one or more processors are configured to generate, based on the set of images, language-grounded scene tokens. The one or more processors are configured to provide the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
Inventors
- Rajeev YASARLA
- Litian Liu
- Fatih Murat PORIKLI
- Deepti Balachandra HEGDE
- Shizhong Steve HAN
- Hong Cai
- Shweta Mahajan
- Apratim Bhattacharyya
- Risheek GARREPALLI
- Yunxiao SHI
- Manish Kumar Singh
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-01
Claims (20)
- 1 . A device comprising: a memory configured to store images that represent scenes associated with a vehicle; and one or more processors configured to: obtain a set of images representing a scene associated with the vehicle; generate, based on the set of images, language-grounded scene tokens; and provide the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
- 2 . The device of claim 1 , wherein the one or more processors are configured to generate vehicle control signals based on the path plan prediction.
- 3 . The device of claim 1 , wherein, to generate the language-grounded scene tokens, the one or more processors are configured to: provide the set of images as input to an image encoder to generate image features; provide the image features as input to a perception machine-learning model to generate map data representing objects within the scene; provide the image features, the map data, or both, as input to a prediction machine-learning model to generate motion prediction data representing trajectory predictions associated with the scene; and generate scene feature data based on the image features, the map data, the motion prediction data, or a combination thereof.
- 4 . The device of claim 3 , wherein the image encoder includes a language-grounded bird's eye view encoder.
- 5 . The device of claim 3 , wherein the one or more processors are configured to generate the language-grounded scene tokens based on the scene feature data.
- 6 . The device of claim 3 , wherein the prediction machine-learning model comprises a language-grounded motion transformer model.
- 7 . The device of claim 3 , wherein the perception machine-learning model comprises a language-grounded map transformer model.
- 8 . The device of claim 1 , further comprising a modem coupled to the one or more processors and configured to receive the images, to send the path plan prediction, or both.
- 9 . The device of claim 1 , further comprising one or more cameras coupled to the one or more processors and configured to capture the images.
- 10 . The device of claim 1 , further comprising one or more sensors configured to capture sensor data associated with the vehicle, wherein the one or more processors are configured to generate the path plan prediction based at least in part on the sensor data.
- 11 . The device of claim 1 , wherein the device is an automobile.
- 12 . The device of claim 1 , wherein the device is an aircraft.
- 13 . The device of claim 1 , wherein the device is a watercraft.
- 14 . A method comprising: obtaining a set of images representing a scene associated with a vehicle; generating, based on the set of images, language-grounded scene tokens; and providing the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
- 15 . The method of claim 14 , further comprising generating vehicle control signals based on the path plan prediction.
- 16 . The method of claim 14 , wherein generating the language-grounded scene tokens comprises: providing the set of images as input to an image encoder to generate image features; providing the image features as input to a perception machine-learning model to generate map data representing objects within the scene; providing the image features, the map data, or both, as input to a prediction machine-learning model to generate motion prediction data representing trajectory predictions associated with the scene; and generating scene feature data based on the image features, the map data, the motion prediction data, or a combination thereof.
- 17 . The method of claim 14 , further comprising: providing the language-grounded scene tokens and one or more text tokens as input to a large language model to generate language-grounded scene data including a scene description, a masked-scene prediction, a future scene prediction, a waypoint prediction, or a combination thereof.
- 18 . The method of claim 17 , further comprising: determining an error value based on the language-grounded scene data; and modifying parameters of a scene feature data model based on the error value to improve language grounding of the scene feature data model, wherein the scene feature data model is configured to generate language-grounded scene feature data used to generate the language-grounded scene tokens.
- 19 . The method of claim 14 , further comprising obtaining sensor data associated with the vehicle from one or more sensors, wherein the path plan prediction is based at least in part on the sensor data.
- 20 . A non-transitory computer-readable medium storing instructions executable to cause one or more processors to: obtain a set of images representing a scene associated with a vehicle; generate, based on the set of images, language-grounded scene tokens; and provide the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
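As an illustration only, the scene-token generation and planning pipeline recited in claims 1, 3-7, and 16 might be sketched as follows. The use of PyTorch, the module choices (convolutional image encoder, transformer-based perception, prediction, and planning blocks), the feature dimension, the number of scene tokens and waypoints, and the six-camera input are all assumptions of this sketch and are not taken from the application.

```python
# Hypothetical sketch of the scene-token / planning pipeline of claims 1, 3-7, and 16.
# All module names, dimensions, and token counts are illustrative assumptions.
import torch
import torch.nn as nn


class SceneTokenPlanner(nn.Module):
    def __init__(self, feat_dim: int = 256, num_scene_tokens: int = 32,
                 num_waypoints: int = 8):
        super().__init__()
        # Image encoder (e.g., a language-grounded bird's-eye-view encoder, claim 4).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Perception model producing map features from image features (claims 3, 7).
        self.perception = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Prediction model producing motion (trajectory) features (claims 3, 6).
        self.prediction = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Fusion of image, map, and motion features into scene feature data,
        # then projection to a fixed set of language-grounded scene tokens.
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)
        self.to_scene_tokens = nn.Linear(feat_dim, num_scene_tokens * feat_dim)
        # Planning transformer consuming the scene tokens (claim 1).
        self.planner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.waypoint_head = nn.Linear(feat_dim, 2)   # (x, y) per waypoint
        self.feat_dim = feat_dim
        self.num_scene_tokens = num_scene_tokens
        self.num_waypoints = num_waypoints

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, cameras, 3, H, W) -> per-camera image features.
        b, n, c, h, w = images.shape
        feats = self.image_encoder(images.flatten(0, 1))           # (b*n, D, 8, 8)
        feats = feats.flatten(2).transpose(1, 2)                    # (b*n, 64, D)
        feats = feats.reshape(b, n * feats.shape[1], self.feat_dim)  # (b, T, D)

        map_feats = self.perception(feats)         # map data (objects in the scene)
        motion_feats = self.prediction(map_feats)  # trajectory predictions

        # Scene feature data from image, map, and motion features (claim 3).
        pooled = torch.cat(
            [feats.mean(1), map_feats.mean(1), motion_feats.mean(1)], dim=-1)
        scene = self.fuse(pooled)                                   # (b, D)
        scene_tokens = self.to_scene_tokens(scene).reshape(
            b, self.num_scene_tokens, self.feat_dim)                # (b, K, D)

        # Planning transformer -> path plan prediction as future waypoints.
        planned = self.planner(scene_tokens)
        waypoints = self.waypoint_head(planned[:, : self.num_waypoints])
        return waypoints                                            # (b, W, 2)


if __name__ == "__main__":
    model = SceneTokenPlanner()
    cams = torch.randn(1, 6, 3, 224, 224)   # six surround-view cameras (assumed)
    print(model(cams).shape)                 # torch.Size([1, 8, 2])
```

In this sketch the image, map, and motion features are simply pooled and concatenated before being projected into a fixed number of scene tokens; the claims leave the exact fusion open ("the image features, the map data, the motion prediction data, or a combination thereof").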
Description
I. FIELD
The present disclosure is generally related to vehicle path planning for vehicle automation, and in particular to language-grounded vehicle path planning.
II. DESCRIPTION OF RELATED ART
Vehicle autonomy is sometimes described in terms of several tasks, including perception, prediction, planning, and control tasks. The perception task generally includes operations related to analyzing the environment around the vehicle, such as determining where the vehicle is relative to objects, other vehicles, or landmarks in the environment. The prediction task generally includes operations related to identifying expected or predicted future actions or relative positions of the objects, other vehicles, or landmarks. The planning task generally includes operations related to planning movements of the vehicle being controlled (commonly referred to as the "ego-vehicle") in view of the results of the perception task, the prediction task, and goals associated with the vehicle. The control task generally includes operations related to causing specific subsystems of the vehicle to implement some set of the planned movements.
Various approaches have been taken to use machine learning (ML) to perform some or all of these tasks. Nevertheless, there remain many challenges associated with ML-based vehicle autonomy. For example, perception and prediction tasks often rely on image data and/or sensor data to map an area around the vehicle and make predictions related to the vehicle's surroundings. Humans are generally more comfortable specifying the vehicle's goals via natural-language instructions, and it can be challenging to integrate image data and natural-language instructions in order to make planning decisions that are based on both.
III. SUMMARY
According to one implementation of the present disclosure, a device includes a memory configured to store images representing scenes associated with a vehicle. The device also includes one or more processors configured to obtain a set of images representing a scene associated with the vehicle. The one or more processors are configured to generate, based on the set of images, language-grounded scene tokens. The one or more processors are configured to provide the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
According to another implementation of the present disclosure, a method includes obtaining a set of images representing a scene associated with a vehicle. The method includes generating, based on the set of images, language-grounded scene tokens. The method includes providing the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
According to another implementation of the present disclosure, a non-transitory computer-readable medium stores instructions executable to cause one or more processors to obtain a set of images representing a scene associated with a vehicle. The instructions are executable to cause the one or more processors to generate, based on the set of images, language-grounded scene tokens. The instructions are executable to cause the one or more processors to provide the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
According to another implementation of the present disclosure, an apparatus includes means for obtaining a set of images representing a scene associated with a vehicle. The apparatus includes means for generating, based on the set of images, language-grounded scene tokens. The apparatus includes means for providing the language-grounded scene tokens to a planning transformer to generate a path plan prediction for the vehicle.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
IV. BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating aspects of a system for language-grounded vehicle path planning, in accordance with some examples of the present disclosure.
FIG. 2 is a diagram of illustrative aspects of operations associated with the system for language-grounded vehicle path planning of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 3 is another diagram of illustrative aspects of operations associated with the system for language-grounded vehicle path planning of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 4 is a diagram of illustrative aspects of operations associated with training the system for language-grounded vehicle path planning of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 5 is a diagram of illustrative aspects of operations associated with a language-grounded scene model of the system for language-grounded vehicle path planning of FIG. 1, in accordance with some examples of the present disclosure.
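As a further illustration, the language-grounding training described in claims 17 and 18 (and associated with the training operations of FIG. 4) might be sketched as follows. The stand-in decoder used in place of a large language model, the vocabulary size, the single cross-entropy error term, and the optimizer setup are assumptions of this sketch rather than details of the disclosed training procedure.

```python
# Hypothetical sketch of the language-grounding training step of claims 17-18.
# The stand-in "LLM", vocabulary size, and loss choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyGroundingLLM(nn.Module):
    """Stand-in decoder; a real system would use a pretrained large language model."""

    def __init__(self, feat_dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, scene_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate language-grounded scene tokens with embedded text tokens (claim 17).
        x = torch.cat([scene_tokens, self.embed(text_tokens)], dim=1)
        h = self.decoder(x)
        # Keep only the positions aligned with the text tokens.
        return self.head(h[:, scene_tokens.shape[1]:])             # (b, L, V)


def language_grounding_step(scene_model: nn.Module,
                            llm: nn.Module,
                            optimizer: torch.optim.Optimizer,
                            images: torch.Tensor,
                            text_tokens: torch.Tensor,
                            target_tokens: torch.Tensor) -> float:
    """One training step that improves language grounding of the scene model.

    scene_model   : maps images -> language-grounded scene tokens (b, K, D).
    llm           : produces language-grounded scene data as token logits, e.g. for a
                    scene description, masked-scene, future-scene, or waypoint task.
    optimizer     : assumed to hold (at least) the scene model's parameters.
    """
    scene_tokens = scene_model(images)                  # (b, K, D)
    logits = llm(scene_tokens, text_tokens)             # (b, L, V)

    # Error value based on the language-grounded scene data (claim 18).
    error = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())

    # Modify the scene feature data model's parameters based on the error value,
    # improving the language grounding of the scene tokens it produces.
    optimizer.zero_grad()
    error.backward()
    optimizer.step()
    return error.item()
```

Here, scene_model can be any module that produces language-grounded scene tokens of shape (batch, tokens, features), for example the token-producing portion of the earlier sketch; per claim 18, the error value computed from the language-grounded scene data is used to modify that model's parameters to improve its language grounding.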