US-20260126300-A1 - FEATURE DETECTION MODELS FOR AUTONOMOUS AND SEMI-AUTONOMOUS SYSTEMS AND APPLICATIONS


Abstract

In various examples, feature detection models for autonomous and/or semi-autonomous systems and applications are described herein. Systems and methods described herein may use one or more trained machine learning models to automatically generate representations of traffic features corresponding to a map, such as road markings and/or road edges. For instance, the model(s) may take, as input, an image representing at least a portion of a map that includes one or more traffic features along with one or more indications of one or more points associated with the traffic feature(s) as represented by the image. Based at least on processing the inputs, the model(s) may generate and/or output data representing additional points associated with the traffic feature(s) and/or a heatmap representing one or more lines representing the traffic feature(s). This output data may then be used to determine the representation(s) of the traffic feature(s) for annotating the map.
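
As a rough illustration of the final step described above, in which output points are turned into a line representation, the following minimal Python sketch joins the prompt points and the predicted points into a single polyline. It is a sketch under stated assumptions, not the disclosed implementation: the function name, the per-point confidence score, and the 0.5 threshold are all hypothetical, and the predicted points are assumed to arrive already ordered along the traffic feature.

```python
from math import hypot


def build_line_representation(prompt_points, predicted_points, scores, min_score=0.5):
    """Join prompt points and confident predicted points into one polyline.

    Assumptions (not from the source): each predicted point carries a
    confidence score in [0, 1], points are (x, y) pixel coordinates, and
    the predicted points are already ordered along the traffic feature.
    """
    if len(predicted_points) != len(scores):
        raise ValueError("one score per predicted point is required")

    # Drop predicted points whose score falls below the threshold.
    kept = [p for p, s in zip(predicted_points, scores) if s >= min_score]

    # The polyline starts at the user-supplied prompts and continues
    # through the retained predictions.
    polyline = list(prompt_points) + kept

    # Total length of the polyline, summed segment by segment.
    length = sum(
        hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(polyline, polyline[1:])
    )
    return {"points": polyline, "length_px": length}


line = build_line_representation(
    prompt_points=[(0, 0), (3, 4)],
    predicted_points=[(6, 8), (100, 100), (9, 12)],
    scores=[0.9, 0.1, 0.8],
)
```

In this toy call the outlier at (100, 100) is discarded because of its low score, leaving a four-point polyline. A real system could instead, or additionally, use the heatmap output to place the line.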

Inventors

Kezhao CHEN
Ruiqi Zhao
Yujian LI

Assignees

NVIDIA CORPORATION

Dates

Publication Date: 20260507
Application Date: 20241105
Priority Date: 20241101

Claims (20)

1. A method comprising: generating one or more input tokens representative of one or more first points associated with a road marking as depicted by an image associated with a map; generating one or more embeddings associated with the image; generating, using one or more machine learning models and based at least on the one or more input tokens and the one or more embeddings, one or more output tokens representative of one or more second points associated with the road marking; generating a line representation of the road marking based at least on the one or more second points; and updating, based at least on the line representation, the map to include a label associated with the road marking.
2. The method of claim 1, further comprising at least one of: receiving input data representative of the one or more first points associated with the road marking; or determining, based at least on analyzing at least one of the map or the image, the one or more first points associated with the road marking.
3. The method of claim 1, wherein the generating the one or more output tokens comprises: generating, using the one or more machine learning models and based at least on the one or more input tokens and the one or more embeddings, one or more first output tokens representative of a first portion of the one or more second points; and generating, using the one or more machine learning models and based at least on the one or more first output tokens, one or more second output tokens representative of a second portion of the one or more second points.
4. The method of claim 1, further comprising: generating, using the one or more machine learning models and based at least on the one or more input tokens and the one or more embeddings, one or more image tokens associated with the image, wherein the generating the line representation is further based at least on the one or more image tokens.
5. The method of claim 1, further comprising: appending the one or more input tokens to one or more learnable tokens to generate one or more appended input tokens, wherein the generating the one or more output tokens is based at least on the one or more appended input tokens and the one or more embeddings.
6. The method of claim 1, further comprising: determining, based at least on the one or more output tokens, one or more classifications associated with the one or more second points, wherein the generating the line representation is further based at least on the one or more classifications.
7. The method of claim 1, further comprising: generating, using one or more decoders and based at least on the one or more output tokens, one or more coordinates associated with the one or more second points within the image, wherein the generating the line representation is based at least on the one or more coordinates.
8. The method of claim 1, further comprising: generating, based at least on at least one of the one or more output tokens or one or more image tokens associated with the image, a heatmap associated with the road marking, wherein the generating the line representation is further based at least on the heatmap.
9. A data center comprising: one or more central processing units (CPUs); one or more graphics processing units (GPUs); one or more isolated trusted execution environments (TEEs); one or more interconnects for multi-GPU communication; one or more data processing units (DPUs); one or more network interface chips (NICs); wherein one or more components of the data center are to: determine one or more first points associated with a traffic feature from within a sensor data representation corresponding to a map; determine, using one or more machine learning models and based at least on input data associated with the one or more first points and the sensor data representation, one or more second points associated with the traffic feature; generate a representation of the traffic feature based at least on the one or more second points; and update, based at least on the representation, the map to include information associated with the traffic feature.
10. The data center of claim 9, wherein the one or more components are further to: generate one or more input tokens based at least on the one or more first points and one or more embeddings based at least on the sensor data representation, wherein the input data is associated with the one or more input tokens and the one or more embeddings.
11. The data center of claim 10, wherein the one or more components are further to: append the one or more input tokens to one or more learnable tokens to generate one or more appended input tokens, wherein the input data is associated with the one or more appended input tokens and the one or more embeddings.
12. The data center of claim 9, wherein the determination of the one or more second points associated with the traffic feature comprises: generating, using the one or more machine learning models and based at least on the input data, one or more output tokens; and determining, based at least on the one or more output tokens, the one or more second points associated with the traffic feature.
13. The data center of claim 9, wherein the one or more components are further to perform at least one of: receive one or more inputs representing the one or more first points associated with the traffic feature; or determine, based at least on analyzing at least one of the map or the sensor data representation, the one or more first points associated with the traffic feature.
14. The data center of claim 9, wherein the determination of the one or more second points associated with the traffic feature comprises: determining, using the one or more machine learning models and based at least on the input data, at least a first portion of the one or more second points; and determining, using the one or more machine learning models and based at least on second input data associated with the at least the first portion of the one or more second points, at least a second portion of the one or more second points.
15. The data center of claim 9, wherein the one or more components are further to: determine, using the one or more machine learning models and based at least on the input data, one or more classifications associated with the one or more second points, wherein the representation is further generated based at least on the one or more classifications.
16. The data center of claim 9, wherein the one or more components are further to: determine, using the one or more machine learning models and based at least on the input data, a heatmap associated with the traffic feature, wherein the representation is further generated based at least on the heatmap.
17. The data center of claim 9, wherein: the traffic feature includes a road marking as represented by the sensor data representation corresponding to the map; the one or more processors are further to determine, based at least on the sensor data representation, a type of marking associated with the road marking; and the map is further updated to indicate the type of marking.
18. The data center of claim 9, wherein the data center is comprised in or is used in conjunction with at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system that provides one or more cloud gaming applications; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multi-modal language models; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; systems implementing one or more multi-modal language models; systems using or deploying one or more inference microservices; systems that incorporate or deploy one or more machine learning models in a service or microservice along with an OS-level virtualization package (e.g., a container); a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
19. One or more processors comprising: processing circuitry to generate a line representation associated with a traffic feature as represented by a map, wherein the line representation is generated based at least on: one or more encoders of one or more machine learning models generating one or more input tokens associated with one or more first points of the traffic feature and one or more image embeddings associated with an image of the traffic feature; and one or more decoders of the one or more machine learning models processing the one or more input tokens and the one or more embeddings to determine one or more second points associated with the line representation.
20. The one or more processors of claim 19, wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system that provides one or more cloud gaming applications; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multi-modal language models; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; systems implementing one or more multi-modal language models; systems using or deploying one or more inference microservices; systems that incorporate or deploy one or more machine learning models in a service or microservice along with an OS-level virtualization package (e.g., a container); a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
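
The staged generation recited in claims 3 and 14, where a first portion of the second points is produced and then used as input for a second portion, can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: `extend_marking` and `linear_stub` are hypothetical names, and `linear_stub` is a toy extrapolator that stands in for the trained machine learning model(s), which the claims do not specify at this level of detail.

```python
def extend_marking(first_points, predict_fn, rounds=2):
    """Grow a point sequence in stages.

    predict_fn(points) -> list of new (x, y) points. It is a hypothetical
    stand-in for the trained model(s); each round is conditioned on the
    points produced by the previous round.
    """
    points = list(first_points)
    prompts = list(first_points)
    for _ in range(rounds):
        new_points = predict_fn(prompts)
        if not new_points:
            break  # nothing more to add, stop early
        points.extend(new_points)
        prompts = new_points  # the next round uses the latest outputs
    return points


def linear_stub(prompts, count=3):
    """Toy predictor (assumption): continue the direction of the last two points."""
    if len(prompts) < 2:
        return []
    (x0, y0), (x1, y1) = prompts[-2], prompts[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, count + 1)]


marking = extend_marking([(0, 0), (1, 2)], linear_stub, rounds=2)
```

With two rounds of three points each, the two prompt points grow into an eight-point sequence. Swapping `linear_stub` for a learned predictor would leave the loop unchanged, which is the point of passing the predictor in as an argument.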

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of Chinese Patent Application No. 2024115541102, filed Nov. 1, 2024, which is incorporated herein by reference in its entirety.

BACKGROUND

For vehicles (e.g., autonomous vehicles, semi-autonomous vehicles, robots, etc.) to operate safely in environments, the vehicles must be capable of effectively performing vehicle maneuvers, such as lane keeping, lane changing, lane splits, turns, stopping and starting at intersections, crosswalks, and the like, and/or other vehicle or machine maneuvers. For example, for a vehicle to navigate through surface streets (e.g., city streets, side streets, neighborhood streets, etc.) and on highways (e.g., multi-lane roads), the vehicle is required to navigate among one or more divisions or demarcations (e.g., lanes, intersections, crosswalks, boundaries, etc.) of a road that are often marked using traffic features, such as road markings that include arrows, text, graphics, and/or other types of content. As such, it is important that the vehicles are able to detect the traffic features within the environments, such that the vehicles are able to determine how to navigate according to rules associated with the traffic features.

To detect traffic features, vehicles may, at least in part, use maps corresponding to the environments in which the vehicles are navigating. For example, the maps may be annotated to indicate the locations of important traffic features that the vehicles need to identify when navigating, such as road edges, road markings, traffic signs, and/or so forth. Some conventional approaches for annotating such maps include users viewing various portions of the map in order to manually input the labels for the traffic features.
For example, a user may manually indicate the location of a road marking by selecting a number of points that are located along the road marking (such as hundreds and/or thousands of points) for a given length of the road marking. However, causing users to manually indicate the locations of traffic features as represented by maps may be time consuming, be prone to user error, and/or require a large amount of computing resources (e.g., different user devices).

As such, and more specifically for road markings, other conventional approaches may use curve fitting functions to connect existing road markings that are already annotated on maps. For example, if users have already annotated a first portion of a road marking and a separate, second portion of the road marking, then these conventional approaches will simply join the two portions of the road marking using a curve fitting function. However, by merely using curve fitting functions to connect existing road markings, these conventional approaches may be accurate for straight road markings, but inaccurate for road markings that include one or more curves. Additionally, since these conventional approaches operate on an entirety of a map, the generated annotations for the road markings may not align when the map is segmented into sub-sections (e.g., images), such as for providing the map to vehicles for navigating.

SUMMARY

Embodiments of the present disclosure relate to feature detection models for autonomous and/or semi-autonomous systems and applications. Systems and methods described herein may use one or more trained machine learning models (the model(s)) to automatically generate representations of traffic features corresponding to a map, such as road markings and/or road edges.
For instance, the model(s) may take, as input, an image representing at least a portion of a map that includes one or more traffic features along with one or more indications of one or more points (e.g., one or more prompts) associated with the traffic feature(s) as represented by the image. Based at least on processing the inputs, the model(s) may generate and/or output data representing additional points associated with the traffic feature(s) and/or a heatmap representing one or more lines corresponding to the traffic feature(s). In some examples, the model(s) and/or another postprocessing component may then determine one or more final representations for the traffic feature(s) using the outputs, such as line representations for road markings and/or road edges, which may then be used to annotate the map.

In contrast to conventional systems, the systems of the present disclosure, in some embodiments, are able to use the prompts and/or the input images to automatically determine the locations of traffic features as represented by maps. This way, the systems of the present disclosure do not require users to manually input all of the points for the traffic features (such as hundreds and/or thousands of points) when annotating the maps. Additionally, and as described in more detail herein, the model(s) may be trained to determine a number of points associated with the traffic features, such as up to one hundred or more points, that are then us