
US-12626326-B2 - Image stitching with an adaptive three-dimensional bowl model of the surrounding environment for surround view visualization

US 12626326 B2

Abstract

In various examples, an environment surrounding an ego-object is visualized using an adaptive 3D bowl that models the environment with a shape that changes based on distance (and direction) to one or more representative point(s) on detected objects. Distance (and direction) to detected objects may be determined using 3D object detection or a top-down 2D or 3D occupancy grid, and used to adapt the shape of the adaptive 3D bowl in various ways (e.g., by sizing its ground plane to fit within the distance to the closest detected object, or by fitting a shape using an optimization algorithm). The adaptive 3D bowl may be enabled or disabled during each time slice (e.g., based on ego-speed), and the 3D bowl for each time slice may be used to render a visualization of the environment (e.g., a top-down projection image, a textured 3D bowl, and/or a rendered view thereof).
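By way of illustration only, the ground-plane sizing described above can be sketched as a short Python snippet. The helper names, default values, margin, and the quadratic wall profile are assumptions for the sketch and are not taken from the patent.

```python
import numpy as np

def adaptive_bowl_radius(object_points_xy, default_radius=10.0,
                         min_radius=2.0, margin=0.5):
    """Size the bowl's flat ground plane so it fits within the distance to the
    closest detected object (illustrative helper, not the patent's method).

    object_points_xy: (N, 2) array of representative points (e.g., bounding-shape
    corners or occupied grid cells) in the ego-object's ground frame.
    """
    if object_points_xy is None or len(object_points_xy) == 0:
        return default_radius  # no detections: fall back to a fixed bowl
    distances = np.linalg.norm(np.asarray(object_points_xy, dtype=float), axis=1)
    closest = float(distances.min())
    # Keep the ground plane inside the closest detection, with a small margin,
    # clamped so the bowl never collapses onto the ego-object.
    return float(np.clip(closest - margin, min_radius, default_radius))

def bowl_height(r, ground_radius, rim_radius, rim_height=3.0):
    """Simple bowl profile: flat out to ground_radius, then a quadratic wall."""
    if r <= ground_radius:
        return 0.0
    t = (r - ground_radius) / max(rim_radius - ground_radius, 1e-6)
    return rim_height * t * t
```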

Inventors

  • Hairong Jiang
  • Nuri Murat ARAR
  • Orazio Gallo
  • Jan Kautz
  • Ronan Letoquin

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-12
Application Date
2023-02-23

Claims (20)

  1. A method comprising: determining, using one or more machine learning models, one or more distances and one or more directions from an ego-object to one or more closest detected objects in an environment; generating a three-dimensional (3D) bowl that adaptively models the environment with a shape that is based at least on the one or more distances and the one or more directions to the one or more closest detected objects; and generating a visualization of the environment based at least on the 3D bowl.
  2. The method of claim 1, further comprising determining the one or more distances and the one or more directions based on at least one of performing 3D object detection or projecting sensor data into a top-down two-dimensional (2D) or 3D occupancy grid.
  3. The method of claim 1, wherein the generating of the 3D bowl comprises identifying a cluster of detected objects within a range of distances from the ego-object corresponding to a designated separation between an inner plane of the 3D bowl and an outer edge of the 3D bowl, and fitting a circumference of the 3D bowl to the cluster of detected objects within the range of distances from the ego-object.
  4. The method of claim 1, further comprising identifying the one or more closest detected objects within a variable range threshold that varies based at least on whether there are any detected objects within an initial range threshold.
  5. The method of claim 1, wherein the generating of the 3D bowl comprises fitting a circumference of the 3D bowl to one or more representative points of the one or more closest detected objects.
  6. The method of claim 1, further comprising determining the one or more distances and the one or more directions to the one or more closest detected objects based at least on combining multiple types of distance or depth estimation.
  7. The method of claim 1, further comprising determining the one or more distances and the one or more directions from the ego-object to at least one of: 1) one or more corners, edges, or center points of one or more detected bounding shapes of the one or more closest detected objects, or 2) one or more occupied cells of an occupancy grid.
  8. The method of claim 1, wherein the generating of the visualization comprises at least one of orthographically projecting image data representing the environment using the 3D bowl or mapping image data representing the environment onto the 3D bowl.
  9. The method of claim 1, further comprising selecting, based at least on speed of ego-motion of the ego-object, between enabling and disabling the 3D bowl that adaptively models the environment.
  10. The method of claim 1, wherein the generating of the 3D bowl comprises identifying and omitting, from the one or more closest detected objects, a set of the one or more closest detected objects that are not within a field of view, of a virtual camera, represented by the visualization of the 3D bowl.
  11. A processor comprising: one or more processing units to: determine, using one or more machine learning models, one or more distances and one or more directions from an ego-object to one or more closest detected objects in an environment; generate a three-dimensional (3D) bowl that adaptively models the environment with a shape that is based at least on the one or more distances and the one or more directions to the one or more closest detected objects; and generate a visualization of the environment based at least on the 3D bowl.
  12. The processor of claim 11, the one or more processing units further to determine the one or more distances and the one or more directions to the one or more closest detected objects based on at least one of performing 3D object detection or projecting sensor data into a top-down two-dimensional (2D) or 3D occupancy grid.
  13. The processor of claim 11, the one or more processing units further to generate the 3D bowl based at least on identifying a cluster of detected objects within a range of distances from the ego-object corresponding to a designated separation between an inner plane of the 3D bowl and an outer edge of the 3D bowl, and fitting a circumference of the 3D bowl to the cluster of detected objects within the range of distances from the ego-object.
  14. The processor of claim 11, the one or more processing units further to identify the one or more closest detected objects within a variable range threshold that varies based at least on whether there are any detected objects within an initial range threshold.
  15. The processor of claim 11, wherein the processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for performing real-time streaming; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
  16. A system comprising: one or more processing units to: determine, using one or more machine learning models, one or more distances and one or more directions from an ego-object to one or more closest detected objects in an environment; generate a three-dimensional (3D) bowl that models the environment using a shape that is based at least on the one or more distances and the one or more directions to the one or more closest detected objects; and generate a visualization of the environment based at least on the 3D bowl.
  17. The system of claim 16, the one or more processing units further to generate the 3D bowl based at least on fitting a circumference of the 3D bowl to one or more representative points of the one or more closest detected objects.
  18. The system of claim 16, the one or more processing units further to determine the one or more distances and the one or more directions to the one or more closest detected objects based at least on combining multiple types of distance or depth estimation.
  19. The system of claim 16, the one or more processing units further to determine the one or more distances and the one or more directions from the ego-object to at least one of: 1) one or more corners, edges, or center points of one or more detected bounding shapes of the one or more closest detected objects, or 2) one or more occupied cells of an occupancy grid.
  20. The system of claim 16, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing deep learning operations; a system for performing real-time streaming; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for generating synthetic data; or a system implemented at least partially using cloud computing resources.
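By way of illustration only (not part of the claimed subject matter), the following Python sketch shows one way the variable range threshold of claims 4 and 14, the circumference fitting to representative points of claims 5 and 17, and the ego-speed gating of claim 9 might be realized. All function names, thresholds, bin counts, and defaults are assumptions made for this sketch.

```python
import numpy as np

def select_closest_objects(points_xy, initial_range=10.0, extended_range=30.0):
    """Variable range threshold (cf. claims 4 and 14, illustrative only):
    use the initial range if it contains any detections; otherwise widen it."""
    points_xy = np.asarray(points_xy, dtype=float)
    d = np.linalg.norm(points_xy, axis=1)
    mask = d <= initial_range
    if not mask.any():
        mask = d <= extended_range
    return points_xy[mask]

def fit_circumference(points_xy, num_bins=36, default_radius=15.0):
    """Fit a per-direction bowl circumference to representative points
    (cf. claims 5 and 17): each angular bin takes its closest detection,
    falling back to a default radius where nothing was detected."""
    points_xy = np.asarray(points_xy, dtype=float)
    radii = np.full(num_bins, default_radius)
    if len(points_xy) == 0:
        return radii
    angles = np.arctan2(points_xy[:, 1], points_xy[:, 0])
    bins = ((angles + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    dists = np.linalg.norm(points_xy, axis=1)
    for b, r in zip(bins, dists):
        radii[b] = min(radii[b], r)
    return radii  # bowl-wall radius per angular bin around the ego-object

def use_adaptive_bowl(ego_speed_mps, speed_limit_mps=5.0):
    """Enable the adaptive bowl only at low ego-speed (cf. claim 9)."""
    return ego_speed_mps <= speed_limit_mps
```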

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/326,724, filed on Apr. 1, 2022, the contents of which are incorporated by reference in their entirety. This application is related to U.S. application Ser. No. 18/173,589, entitled “Image Stitching with Dynamic Seam Placement based on Object Saliency for Surround View Visualization,” filed on Feb. 23, 2023; U.S. application Ser. No. 18/173,603, entitled “Image Stitching with Dynamic Seam Placement based on Ego-Vehicle State for Surround View Visualization,” filed on Feb. 23, 2023; U.S. application Ser. No. 18/173,615, entitled “Under Vehicle Reconstruction for Vehicle Environment Visualization,” filed on Feb. 23, 2023; and U.S. application Ser. No. 18/173,630, entitled “Optimized Visualization Streaming for Vehicle Environment Visualization,” filed on Feb. 23, 2023.

BACKGROUND

Vehicle Surround View Systems provide occupants of a vehicle with a visualization of the area surrounding the vehicle. For drivers, Surround View Systems provide the ability to view the surrounding area, including blind spots where the driver's line of sight is occluded by parts of the vehicle or other objects in the environment, without the need to reposition (e.g., turn their head, get out of the driver's seat, or lean in a certain direction). This visualization may assist and facilitate a variety of driving maneuvers, such as smoothly entering or exiting a parking spot without hitting vulnerable road users, such as pedestrians, or objects such as a road curb or other vehicles. More and more vehicles, especially luxury brands and new models, are being produced with Surround View Systems equipped.

Existing vehicle Surround View Systems usually use fisheye cameras, typically mounted at the front, left, rear, and right sides of the vehicle body, to perceive the surrounding area from multiple directions. Additional cameras may be included in special cases, such as for long trucks or vehicles with trailers. Frames from the individual cameras are stitched together, using camera parameters to align the frames and blending techniques to combine overlapping regions, to provide a horizontal 360° surround view visualization. Due to noise or differing white balance configurations, a noticeable seam may appear where two images are stitched together. Although various mitigation measures may be used to smooth the transition of pixel values from one image to another (e.g., assigning each pixel a weight proportional to its distance to the image edge, multiresolution-based blending, or neural network based blending), a noticeable seam is often still visible in a stitched image. Some conventional techniques attempt to avoid placing seams on top of objects detected using ultrasonic sensors. However, ultrasonic sensors typically operate over a very short range. As a result, conventional techniques only consider very close objects when placing seams, effectively ignoring potentially important objects outside the ultrasonic sensing range or prioritizing less important objects within it, and may place seams over regions of a stitched image that are potentially important for a driver to safely maneuver a vehicle.

Unfortunately, the process of stitching images may introduce a variety of artifacts, including geometric distortions (e.g., misalignments), texture distortions (e.g., blur, ghosting, object disappearance, object distortions), and color distortions.
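As an illustrative aside, the distance-proportional pixel weighting mentioned above can be sketched as generic feathered blending in the overlap between two aligned images. The use of SciPy's Euclidean distance transform and the linear weight profile are assumptions of this sketch, not the patent's method.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def feather_blend(img_a, img_b, valid_mask_a, valid_mask_b):
    """Blend two aligned images by weighting each pixel proportionally to its
    distance from the border of its source image's valid region (generic
    feathering sketch; farther from the edge means a higher weight)."""
    w_a = distance_transform_edt(valid_mask_a).astype(np.float32)
    w_b = distance_transform_edt(valid_mask_b).astype(np.float32)
    total = w_a + w_b
    total[total == 0] = 1.0  # avoid division by zero where neither image is valid

    w_a = (w_a / total)[..., None]
    w_b = (w_b / total)[..., None]
    return (img_a.astype(np.float32) * w_a +
            img_b.astype(np.float32) * w_b).astype(np.uint8)
```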
Distortion in a stitched image may be caused by a variety of issues, including the parallax effect, lens distortion, moving objects, and differences in exposure or illumination. For example, in some cases, capturing multiple images of a moving object using multiple cameras may in effect capture the object from different perspectives, such that different images capture the object with different orientations or poses. Because the different representations of the object may not perfectly align when stitching images, an overlapping region may effectively combine images of the object and the background, creating a ghost-like effect in which the object appears in two locations. Some objects may even disappear from the stitched image. Because these stitching artifacts may obscure useful visual information and can be distracting to the driver, they can interfere with the safe operation of the vehicle. As a result, there is a need for improved stitching techniques that reduce stitching artifacts, better represent useful visual information in a stitched image, and/or otherwise improve the visual quality of stitched images.

Moreover, in some existing Surround View Systems, two-dimensional (2D) images are used to approximate a three-dimensional (3D) visual representation of the environment surrounding the vehicle. For a given fisheye image, for example, each pixel captures a ray emitted from a surrounding 3D point projecting into the center of the fisheye camera and imaging