US-20260127823-A1 - SURROUND VIEW VISUALIZATION USING VISION LANGUAGE MODELS
Abstract
In various examples, a vision language model may be prompted to select a supported environment visualization pipeline (e.g., a bowl visualization pipeline that models the surrounding environment as a 3D bowl, a surface topology visualization pipeline that models the surrounding environment as a detected 3D surface topology), one or more parameters of a supported environment visualization pipeline (e.g., for a bowl visualization pipeline, a parametrization of the shape of the 3D bowl model, stitching parameters such as seam placement, blend width, or blend area, etc.), and/or a rendering viewport (e.g., a virtual camera position and orientation). As such, the selected and/or configured technique may be used to visualize an environment around an ego-machine, such as a vehicle, robot, and/or other type of object, in systems such as parking visualization systems, Surround View Systems, and/or others.
Inventors
- Nuri Murat ARAR
- Niral Lalit Pathak
- Niranjan Avadhanam
- Rajath Bellipady Shetty
- Orazio Gallo
Assignees
- NVIDIA CORPORATION
Dates
- Publication Date
- 20260507
- Application Date
- 20241106
Claims (20)
- 1 . One or more processors comprising processing circuitry to: prompt a vision-language model (VLM) to generate, based at least on image data generated using one or more cameras of an ego-machine in an environment, one or more responses indicating a selection of at least one environment visualization technique from a plurality of supported environment visualization techniques; generate a visualization of at least a portion of the environment using the at least one environment visualization technique to process the image data based at least on the selection by the VLM; and cause presentation of the visualization of at least the portion of the environment on a display associated with the ego-machine.
- 2 . The one or more processors of claim 1 , wherein at least one of the plurality of supported environment visualization techniques comprises modeling at least a portion of the environment, wherein the selection by the VLM represents a determination to generate the visualization based at least on modeling the environment using a detected 3D surface topology in the environment.
- 3 . The one or more processors of claim 1 , wherein at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM represents a determination to configure a size parameter, of the one or more parameters, of one or more voxels of a signed distance function modeling the environment.
- 4 . The one or more processors of claim 1 , wherein at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM represents a determination to configure a threshold parameter, of the one or more parameters, that designates one or more threshold distances encoded by a signed distance function modeling the environment.
- 5 . The one or more processors of claim 1 , wherein the selection by the VLM represents a determination to generate the visualization based at least on modeling the environment using a 3D bowl.
- 6 . The one or more processors of claim 1 , wherein the selection by the VLM designates one or more parameters of a 3D bowl modeling the environment.
- 7 . The one or more processors of claim 1 , wherein at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM represents a configuration of one or more stitching parameters, of the one or more parameters, associated with stitching overlapping frames of the image data applied to the VLM.
- 8 . The one or more processors of claim 1 , wherein the at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM comprises a configuration that represents at least one of a position or orientation of a virtual camera associated with rendering the visualization.
- 9 . The one or more processors of claim 1 , wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for performing remote operations; a system for performing real-time streaming; a system for generating or presenting one or more of augmented reality content, virtual reality content, or mixed reality content; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system implementing one or more multi-model language models; a system implementing one or more vision language models (VLMs); a system implementing one or more multi-modal language models; a system for generating synthetic data; a system for generating synthetic data using AI; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
- 10 . A method comprising: prompting a vision-language model (VLM) to generate one or more responses indicating a selection of at least one environment visualization technique of a plurality of supported environment visualization techniques; and generating, using the at least one environment visualization technique corresponding to the selection, a visualization of at least a portion of an environment of an ego-machine.
- 11 . The method of claim 10 , wherein at least one of the plurality of supported environment visualization techniques comprises modeling at least a portion of the environment, wherein the selection by the VLM represents a determination to generate the visualization based at least on modeling the environment using a detected 3D surface topology in the environment.
- 12 . The method of claim 10 , wherein at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM comprises configuring a size parameter, of the one or more parameters, corresponding to one or more voxels of a signed distance function modeling the environment.
- 13 . The method of claim 10 , wherein at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM comprises configuring a threshold parameter, of the one or more parameters, corresponding to one or more threshold distances encoded by a signed distance function modeling the environment.
- 14 . The method of claim 10 , wherein at least one of the plurality of supported environment visualization techniques comprises modeling at least a portion of the environment, wherein the selection by the VLM represents a determination to generate the visualization based at least on modeling the environment using a 3D bowl.
- 15 . The method of claim 10 , wherein at least one of the plurality of supported environment visualization techniques comprises modeling at least a portion of the environment, wherein the selection by the VLM designates one or more parameters of a 3D bowl modeling the environment.
- 16 . The method of claim 10 , wherein at least one of the plurality of supported environment visualization techniques comprises a configuration of one or more parameters for the environment visualization technique, wherein the selection by the VLM designates one or more stitching parameters, of the one or more parameters, associated with stitching overlapping frames of image data.
- 17 . The method of claim 10 , wherein the method is performed by at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for performing remote operations; a system for performing real-time streaming; a system for generating or presenting one or more of augmented reality content, virtual reality content, or mixed reality content; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system implementing one or more multi-model language models; a system implementing one or more vision language models (VLMs); a system implementing one or more multi-modal language models; a system for generating synthetic data; a system for generating synthetic data using AI; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
- 18 . A system comprising: one or more processors to generate, within a simulation rendered using one or more light transport simulation algorithms, a visualization of at least a portion of a simulated environment around a simulated ego-machine using an environment visualization pipeline selected based at least on prompting a vision-language model (VLM).
- 19 . The system of claim 18 , wherein the simulation is generated, at least in part, using one or more content creation applications of a three-dimensional (3D) content collaboration platform for 3D assets.
- 20 . The system of claim 19 , wherein the simulated environment is represented in at least one content creation application of the one or more content creation applications using an OpenUSD format.
Description
BACKGROUND Vehicle Surround View Systems provide occupants of a vehicle with a visualization of the area surrounding the vehicle. Surround view visualizations can help the driver to see the surrounding environment - including blind spots where the driver's line of sight may be occluded by parts of the vehicle or other objects in the environment - without requiring the driver to reposition (e.g., turn their head, get off the driver's seat, lean a certain direction, etc.). As such, surround view visualizations may assist and facilitate a variety of driving maneuvers, such as smoothly entering or exiting a parking spot without colliding with vulnerable road users (e.g., pedestrians) or objects (e.g., a road curb or other vehicles). More and more vehicles, especially those of luxury brands or newer models, are equipped with Surround View Systems. There are a variety of techniques to generate a surround view visualization. Existing Surround View Systems often use fisheye cameras—typically mounted at the front, left, rear, and right sides of the vehicle body—to perceive the surrounding area from multiple directions. Frames generated using the individual cameras may be aligned using corresponding camera parameters, seams may be placed in overlapping regions, overlapping image data may be blended at the seams to create a stitched image (e.g., a 360° surround view visualization of the environment surrounding the vehicle), and the stitched image may be projected onto a 3D model of the surrounding environment. The 3D shape used to model the surrounding environment may rely on various assumptions about the geometry of the surrounding environment. 
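The seam placement and blending step described above can be illustrated with a minimal sketch. The following example feather-blends two aligned, same-size frames across a vertical seam with a configurable blend width; the function name, the linear alpha ramp, and all parameter values are illustrative assumptions, not details taken from the disclosure, and a production stitcher would also handle alignment, lens undistortion, and color correction.

```python
import numpy as np

def feather_blend(left: np.ndarray, right: np.ndarray,
                  seam_col: int, blend_width: int) -> np.ndarray:
    """Blend two aligned, same-size frames across a vertical seam.

    `left` supplies pixels to the left of the seam and `right` to the
    right; the blend weight ramps linearly over `blend_width` columns
    centered on the seam, a common feathering scheme in stitching
    pipelines (hypothetical sketch, not the disclosed implementation).
    """
    h, w, c = left.shape
    alpha = np.ones(w, dtype=np.float32)           # weight of `left`
    lo = max(seam_col - blend_width // 2, 0)
    hi = min(seam_col + blend_width // 2, w)
    alpha[lo:hi] = np.linspace(1.0, 0.0, hi - lo)  # ramp down across seam
    alpha[hi:] = 0.0                               # right frame dominates
    a = alpha[None, :, None]                       # broadcast over rows/channels
    return (a * left + (1.0 - a) * right).astype(left.dtype)

# Two synthetic "camera" frames that overlap everywhere.
img_a = np.full((4, 10, 3), 200, dtype=np.uint8)
img_b = np.full((4, 10, 3), 100, dtype=np.uint8)
stitched = feather_blend(img_a, img_b, seam_col=5, blend_width=4)
```

In this sketch, pixels well left of the seam keep the value of the first frame, pixels well right of it keep the second, and the columns inside the blend width transition smoothly between the two, which is the mechanism that suppresses visible seams in the stitched 360° image.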
Some existing techniques model the surrounding environment as a 3D bowl, which typically comprises a flat, circular ground plane for the inner portion of the bowl connected to an outer bowl represented as a curved surface rising from the ground plane to a height or with a slope that increases proportionally to the distance from the bowl center. As such, some Surround View Systems project (e.g., stitched) images onto a 3D bowl, render a view of the resulting textured 3D bowl from the perspective of a virtual camera, and present the rendered view on a monitor visible to occupants or an operator (e.g., driver) of the vehicle. However, the projection and/or stitching processes can introduce a variety of artifacts, including geometric distortions (e.g., size or shape misalignments), texture distortions (e.g., blur, ghosting, object disappearance, object distortions), and color distortions. Since these artifacts may obscure or omit useful visual information and are often distracting to the driver, the artifacts can interfere with the safe operation of the vehicle in certain scenarios. As a result, there is a need for improved visualization techniques that reduce visual artifacts, better represent useful visual information, and/or otherwise improve the visual quality of resulting images. SUMMARY Embodiments of the present disclosure relate to surround view visualization using vision language models (VLMs). Systems and methods are disclosed that generate a visualization of the environment surrounding an ego-machine using an environment visualization technique selected and/or configured using one or more VLMs.
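As a concrete illustration of the bowl geometry described above (a flat inner disk joined to an outer surface that rises with distance from the bowl center), the following sketch computes the height of a hypothetical bowl surface as a function of radial distance. The function name, the quadratic rise, and the `flat_radius`/`rise_rate` parameters are illustrative assumptions; they are exactly the kind of shape parametrization that, per this disclosure, a VLM could be prompted to configure.

```python
import math

def bowl_height(x: float, y: float,
                flat_radius: float = 5.0, rise_rate: float = 0.4) -> float:
    """Height of a hypothetical 3D bowl surface at ground point (x, y).

    Inside `flat_radius` the bowl is a flat ground plane (z = 0);
    beyond it, the surface rises quadratically with distance from the
    bowl center, mirroring the flat-disk-plus-curved-rim shape
    described above. Parameter values are illustrative only.
    """
    r = math.hypot(x, y)  # radial distance from the bowl center
    if r <= flat_radius:
        return 0.0
    return rise_rate * (r - flat_radius) ** 2
```

Points on the inner disk map to zero height, while height grows monotonically outside it; stitched camera imagery would then be texture-mapped onto this surface before rendering from a virtual camera.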
In contrast to conventional systems, a VLM may be prompted to select a supported environment visualization pipeline (e.g., a bowl visualization pipeline that models the surrounding environment as a 3D bowl, a surface topology visualization pipeline that models the surrounding environment as a detected 3D surface topology), one or more parameters of a supported environment visualization pipeline (e.g., for a bowl visualization pipeline, a parametrization of the shape of the 3D bowl model, stitching parameters such as seam placement, blend width, or blend area, etc.), and/or a rendering viewport (e.g., a virtual camera position and orientation). As such, the VLM may be used to provide situational awareness of the surrounding environment, for example, by evaluating sensor (e.g., image) data and deciding which environment visualization technique (and/or which one or more parameters of the environment visualization technique) is best to use based on the scene represented in the sensor (e.g., image) data. As such, the selected and/or configured technique may be used to visualize an environment around an ego-machine, such as a vehicle, robot, and/or other type of object, in systems such as parking visualization systems, Surround View Systems, and/or others. BRIEF DESCRIPTION OF THE DRAWINGS The present systems and methods for surround view visualization using vision language models are described in detail below with reference to the attached drawing figures, wherein: FIG. 1 is a diagram illustrating an example data flow through an example Surround View System, in accordance with some embodiments of the present disclosure; FIG. 2 is a diagram illustrating an example technique for sel