US-12626519-B2 - Feature detection using language models
Abstract
In various examples, feature identification using language models for autonomous and semi-autonomous systems and applications is described herein. Systems and methods described herein may use a language model(s) to determine information associated with features, such as surface markings, within an environment. For example, sensor data may be used to generate one or more images or other sensor data representations corresponding to an environment. The image(s) may then be processed to generate input data (e.g., input tokens) that is applied to the language model(s). Based at least on processing the input data, the language model(s) may be trained to output data (e.g., output tokens) representing information associated with one or more features. Additionally, the output data may be used to determine the information associated with the feature(s) within the environment, where the information may then be used to update a map and/or navigate one or more machines within the environment.
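To make the pipeline summarized in the abstract easier to picture (sensor data to a representation, the representation to input tokens, the language model to output tokens, and output tokens to connected points that update a map), the following Python sketch outlines one possible shape of that flow. It is not taken from the patent; the names (`FeaturePoint`, `detect_feature`, and the encoder, model, and decoder callables) are illustrative assumptions.

```python
# Hypothetical end-to-end flow: sensor representation -> input tokens ->
# language model -> output tokens -> connected feature points -> map update.
# All names and data layouts here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class FeaturePoint:
    x: float             # location of the point within the representation
    y: float
    role: str            # e.g., "start", "intermediary", "end"
    marking_class: str   # e.g., "lane_line", "crosswalk_line"

def detect_feature(representation, encoder, language_model, decoder):
    """Sketch of the described pipeline for a single sensor-data representation."""
    # 1. Encode the representation (e.g., an intensity, color, or height image)
    #    into input tokens representing a feature such as a surface marking.
    input_tokens = encoder(representation)

    # 2. The language model processes the input tokens and emits output tokens,
    #    one set per point of the feature (location plus other attributes).
    output_tokens = language_model(input_tokens)

    # 3. Decode the output tokens into ordered points and connect each point to
    #    the previous one (e.g., second point to first point) to form the feature.
    points = [FeaturePoint(*attrs) for attrs in decoder(output_tokens)]
    segments = list(zip(points, points[1:]))

    # 4. The connected points may then be used to update a map and/or to help
    #    navigate an autonomous or semi-autonomous machine.
    return points, segments
```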
Inventors
- Karan Sapra
- Yu Zhang
- Yangdongfang Yang
- Yixuan LIN
- Ge Cong
- Andrew Tao
Assignees
- NVIDIA CORPORATION
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-02-15
Claims (20)
- 1 . A method comprising: generating, based at least on one or more machine learning models processing at least a representation corresponding to an environment, one or more input tokens representative of a feature located within the environment; generating, based at least on one or more language models processing the one or more input tokens, at least one or more first output tokens representative of a first location of a first point associated with the feature and one or more second output tokens representative of a second location of a second point associated with the feature; determining, based at least on the one or more first output tokens and the one or more second output tokens, at least the first location of the first point within the representation and the second location of the second point within the representation; determining, based at least on the first location and the second location, information associated with the feature within the environment by at least connecting the second point to the first point; and causing a map to be updated to indicate at least the information associated with the feature.
- 2 . The method of claim 1 , wherein: the feature includes a surface line represented by the representation; the one or more input tokens are representative of the surface line as represented by the representation; and the first point and the second point are associated with the surface line within the representation.
- 3 . The method of claim 1 , further comprising determining, based at least on the one or more first output tokens, at least one of: one or more classes associated with the first point; one or more types associated with the first point; one or more colors associated with the first point; or one or more shapes associated with the first point.
- 4 . The method of claim 1 , wherein: the first location includes first coordinates of the first point associated with the feature and within the representation; and the second location includes second coordinates of the second point associated with the feature and within the representation.
- 5 . The method of claim 1 , further comprising: determining, based at least on the one or more first output tokens, that the first point includes a starting point associated with the feature; and determining, based at least on the one or more second output tokens, that the second point includes at least one of an intermediary point or an ending point associated with the feature, wherein the connecting the second point to the first point is further based at least on the first point including the starting point and the second point including the at least one of the intermediary point or the ending point.
- 6 . The method of claim 1 , wherein the generating the one or more input tokens comprises: generating, based at least on the one or more machine learning models processing the representation, at least one of feature data associated with the representation or heatmap data associated with the representation; and generating, based at least on the at least one of the feature data or the heatmap data, the one or more input tokens representative of the feature located within the environment.
- 7 . The method of claim 1 , wherein the one or more representations comprises one or more of: an intensity image corresponding to the environment; a color image corresponding to the environment; a height image corresponding to the environment; or a point cloud corresponding to the environment.
- 8 . A system comprising: one or more processors to: generate one or more input tokens representing a feature located within an environment as represented by one or more representations; generate, based at least on one or more language models processing the one or more input tokens, sets of tokens associated with points corresponding to the feature, an individual set of tokens from the sets of tokens representing at least one or more attributes associated with an individual point from the points; generate, based at least on the sets of tokens, information associated with the feature; and perform one or more operations based at least on the information.
- 9 . The system of claim 8 , wherein the one or more processors are further to: generate, based at least on sensor data, the one or more representations of the feature located within the environment, wherein the one or more input tokens are generated based at least on one or more machine learning models processing the one or more representations.
- 10 . The system of claim 8 , wherein: the feature includes a traffic feature represented by the one or more representations; and the one or more input tokens represent at least the traffic feature as represented by the one or more representations.
- 11 . The system of claim 8 , wherein the one or more attributes include at least one of: one or more classes associated with the individual point; a location associated with the individual point; one or more types associated with the feature; one or more colors associated with the feature; or one or more shapes associated with the feature.
- 12 . The system of claim 8 , wherein: a first set of tokens from the sets of tokens represents at least a first location of a first point from the points corresponding to the feature; and a second set of tokens from the sets of tokens represents a second location of a second point from the points corresponding to the feature.
- 13 . The system of claim 12 , wherein the information is generated, at least, by: determining, based at least on a first set of tokens of the sets of tokens, a first location associated with a first point from the points within the environment; determining, based at least on a second set of tokens from the sets of tokens, a second location associated with a second point from the points within the environment; and determining the information by connecting the second point to the first point.
- 14 . The system of claim 13 , wherein the one or more processors are further to: determine, based at least on the first set of tokens, that the first point includes a starting point associated with the feature; and determine, based at least on the second set of tokens, that the second point includes at least one of an intermediary point or an ending point associated with the feature, wherein the connecting the second point to the first point is based at least on the first point including the starting point and the second point including the at least one of the intermediary point or the ending point.
- 15 . The system of claim 8 , wherein the one or more representations comprises one or more of: an intensity image corresponding to the environment; a color image corresponding to the environment; a height image corresponding to the environment; or a point cloud corresponding to the environment.
- 16 . The system of claim 8 , wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
- 17 . One or more processors comprising processing circuitry to: generate, based at least on one or more representations corresponding to an environment, one or more inputs representative of a feature located within the environment; generate, based at least on one or more language models processing the one or more inputs, at least one or more first outputs representative of a first location of a first point associated with the feature and one or more second outputs representative of a second location of a second point associated with the feature; determine, based at least on the one or more first outputs and the one or more second outputs, information associated with the feature by at least connecting the second location of the second point to the first location of the first point; and perform one or more operations based at least on the information associated with the feature.
- 18 . The one or more processors of claim 17 , wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
- 19 . The method of claim 1 , wherein the determining the information associated with the feature within the environment comprises determining, based at least on connecting the first point to the second point, a shape associated with the feature, the information including at least the shape.
- 20 . The method of claim 1 , wherein the one or more language models are trained using at least: one or more training input tokens representing one or more features located within one or more training representations; and ground truth data representing one or more locations of one or more points of the one or more features within the one or more training representations.
Description
BACKGROUND

For vehicles (e.g., autonomous vehicles, semi-autonomous vehicles, robots, etc.) to operate safely in environments, the vehicles must be capable of effectively performing a variety of vehicle maneuvers—such as lane keeping, lane changing, lane splits, turns, stopping and starting at intersections, crosswalks, and the like. For example, for a vehicle to navigate through surface streets (e.g., city streets, side streets, neighborhood streets, etc.) and on highways (e.g., multi-lane roads), the vehicle is required to navigate among one or more divisions or demarcations (e.g., lanes, intersections, crosswalks, boundaries, etc.) of a road that are often marked using road markings, such as road lines. As such, it is important that the vehicles are able to detect the road markings within the environments, such that the vehicles are able to determine how to navigate according to rules associated with the road markings.

To detect road markings, vehicles may, at least in part, use maps corresponding to the environments within which the vehicles are navigating. For example, the maps may indicate the locations of important features that the vehicles need to identify when navigating, such as road surfaces and road markings. Conventional approaches for determining the locations of road markings for these maps include using convolutional neural networks to process image data generated using image sensors of vehicles that have navigated within the environments. For instance, the image data may represent images depicting the road markings within the environments. As such, the systems are able to process the image data, such as by using one or more image processing techniques (e.g., object detection, object recognition, etc.) that use the convolutional neural networks, to detect the locations of the road markings within the images. The systems may then use the locations of the road markings from the images to determine the corresponding locations of the road markings within the maps.

While these systems are able to determine the locations of the road markings within the environments using convolutional neural networks, there may be room for improving the accuracy and precision of these systems. As such, techniques for increasing the accuracy and precision of the results for the locations of the road markings may provide for better maps for the vehicles, which may also improve the driving capabilities of the vehicles.

SUMMARY

Embodiments of the present disclosure relate to feature identification using language models for autonomous and semi-autonomous systems and applications. For instance, systems and methods described herein may use one or more language models—such as large language models (LLMs)—to determine information associated with features, such as road markings (e.g., lane lines, road boundary lines, crosswalk lines, yield lines, bike lane lines, etc.), within an environment. For example, sensor data (e.g., image data, LiDAR data, RADAR data, ultrasonic data, etc.) may be used to generate one or more images (or other sensor data representations, such as point clouds) corresponding to an environment, such as an intensity image, a color image, and/or a height image. The image(s) may then be processed to generate input data (e.g., a tokenized representation of feature information) that is applied to—e.g., processed by—the language model(s).
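As one way to picture the input-data step (cf. claim 6, in which feature data or heatmap data produced by a machine learning model is used to generate input tokens), the sketch below quantizes candidate feature locations from a heatmap into a discrete token sequence. The vocabulary layout, bin count, and function name are assumptions for illustration only, not details from the patent:

```python
# Minimal sketch (assumed, not from the patent) of turning per-pixel feature
# evidence into a discrete token sequence that a language model can consume.

import numpy as np

NUM_BINS = 1000                      # coordinate quantization resolution (assumption)
BOS, EOS = NUM_BINS, NUM_BINS + 1    # special begin/end tokens added to the vocabulary

def heatmap_to_input_tokens(heatmap: np.ndarray, threshold: float = 0.5) -> list:
    """Convert a feature heatmap (e.g., from a CNN backbone) into input tokens.

    Each candidate feature location is quantized into an (x, y) pair of integer
    bins so that the sequence can be processed like text by the language model.
    """
    h, w = heatmap.shape
    ys, xs = np.nonzero(heatmap > threshold)        # candidate feature pixels
    tokens = [BOS]
    for x, y in zip(xs, ys):
        tokens.append(int(x / w * (NUM_BINS - 1)))  # quantized x coordinate
        tokens.append(int(y / h * (NUM_BINS - 1)))  # quantized y coordinate
    tokens.append(EOS)
    return tokens
```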
Based at least on processing the input data, the language model(s) may be trained to output data (e.g., a tokenized representation of feature or attribute information corresponding to the input data) representing one or more attributes associated with one or more features. For instance, the output data may represent locations, colors, types, shapes, orientations, and/or any other attribute associated with the feature(s). As such, the output data may be used to determine the information associated with the feature(s) within the environment, where the information may then be used to update a map of the environment and/or to navigate one or more machines within the environment. Accordingly, the processes described herein may be used for offline map building or updates, and/or may be used in deployment to aid in navigation or control of one or more autonomous or semi-autonomous machines.

In contrast to conventional systems, the systems described herein, in some embodiments, are able to more precisely and accurately determine information associated with features, such as road markings, within environments. This is because the current systems may use a language model(s)—and more specifically a language model(s) that is trained to determine attributes associated with such features—to determine the information associated with the features, which may be more accurate than using a convolutional neural network (CNN) alone for image processing. For instance, the accuracy of the language model(s) may be increased based at least on training the language model(s) using specific types of inputs, suc
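To illustrate the output side (decoding the model's output tokens into attributed points and connecting them, as recited in the claims above), the following hedged sketch assumes a fixed per-point token layout; the layout, the role and class vocabularies, and the function names are illustrative assumptions rather than details from the patent:

```python
# Assumed layout per point: [x_bin, y_bin, role_id, class_id]. Decoded points
# are connected in order, starting from the point whose role is "start".

POINT_ROLES = {0: "start", 1: "intermediary", 2: "end"}
MARKING_CLASSES = {0: "lane_line", 1: "road_boundary", 2: "crosswalk_line"}

def decode_output_tokens(output_tokens, tokens_per_point=4):
    """Group output tokens into per-point attribute sets."""
    points = []
    for i in range(0, len(output_tokens), tokens_per_point):
        x_bin, y_bin, role_id, class_id = output_tokens[i:i + tokens_per_point]
        points.append({
            "x": x_bin,
            "y": y_bin,
            "role": POINT_ROLES.get(role_id, "intermediary"),
            "class": MARKING_CLASSES.get(class_id, "unknown"),
        })
    return points

def connect_points(points):
    """Connect each point to the previous one, placing the start point first."""
    ordered = sorted(points, key=lambda p: p["role"] != "start")
    return list(zip(ordered, ordered[1:]))

# Example: two decoded points forming a single lane-line segment that could be
# written into a map layer or consumed by a planner.
segments = connect_points(decode_output_tokens([120, 45, 0, 0, 300, 48, 2, 0]))
```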