
DE-102025144719-A1 - UNSUPERVISED JOINT TRAINING OF MULTIMODAL PERCEPTION MODELS USING CURVATURE-BASED SAMPLING


Abstract

A method includes obtaining light detection and ranging (LiDAR) feature embeddings and obtaining camera feature embeddings. The method includes generating fusion feature embeddings by combining the LiDAR feature embeddings and the camera feature embeddings. The method includes determining sample weights for multiple points in a three-dimensional (3D) scene. The method includes selecting a subset of the multiple points based on the sample weights. The method includes determining a rendering loss by performing differentiable rendering on the selected subset of points. The method includes determining a prototype learning loss by comparing the LiDAR feature embeddings and the camera feature embeddings to a set of learnable prototypes representing portions of the 3D scene in a shared feature space. The method includes jointly training a LiDAR encoder, a camera encoder, and a fusion encoder based on the rendering loss and the prototype learning loss.
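As a rough, non-authoritative illustration of the training flow the abstract describes, the following PyTorch-style sketch shows one joint update step. The encoder modules, the curvature-weighted sampler, the differentiable renderer, and the prototype-loss head are all assumed placeholders (names such as `sample_high_curvature` are hypothetical), not the patent's implementation.

```python
# Minimal sketch of one joint training step, assuming placeholder modules for the
# LiDAR encoder, camera encoder, fusion encoder, curvature-weighted sampler,
# differentiable renderer, and prototype-loss head. Illustrative only.
import torch

def joint_training_step(lidar_points, images,
                        lidar_encoder, camera_encoder, fusion_encoder,
                        sample_high_curvature, differentiable_renderer,
                        prototype_loss_fn, optimizer):
    # Per-modality feature embeddings.
    lidar_feats = lidar_encoder(lidar_points)
    camera_feats = camera_encoder(images)

    # Fused embeddings for the 3D scene.
    fused_feats = fusion_encoder(lidar_feats, camera_feats)

    # Curvature-weighted selection of a subset of scene points.
    point_idx = sample_high_curvature(lidar_points, fused_feats)

    # Rendering loss: differentiable rendering on the selected points,
    # reconstructing LiDAR depth and/or camera color.
    render_loss = differentiable_renderer(
        fused_feats[point_idx], lidar_points[point_idx], lidar_points, images)

    # Prototype learning loss: both modalities compared against shared prototypes.
    proto_loss = prototype_loss_fn(lidar_feats, camera_feats)

    # Jointly update the LiDAR, camera, and fusion encoders.
    loss = render_loss + proto_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```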

Inventors

  • Runjian Chen
  • Hang Zhang
  • Avinash Aghoram Ravichandran

Assignees

  • GM CRUISE HOLDINGS LLC

Dates

Publication Date
2026-05-13
Application Date
2025-10-31
Priority Date
2025-09-12

Claims (10)

  1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: obtaining LiDAR feature embeddings based on LiDAR data for a three-dimensional (3D) scene from a light detection and ranging (LiDAR) encoder; obtaining camera feature embeddings based on camera image data for the 3D scene from a camera encoder; generating fusion feature embeddings using a fusion encoder by combining the LiDAR feature embeddings and the camera feature embeddings; determining sample weights for multiple points in the 3D scene based on surface curvature estimated from the fusion feature embeddings, wherein points with higher surface curvature are assigned larger sample weights; selecting a subset of the multiple points based on the sample weights; determining a rendering loss by performing differentiable rendering on the selected subset of points to reconstruct at least one of the LiDAR data or the camera image data; determining a prototype learning loss by comparing the LiDAR feature embeddings and the camera feature embeddings with a set of learnable prototypes representing portions of the 3D scene in a shared feature space; and jointly training the LiDAR encoder, the camera encoder, and the fusion encoder based on the rendering loss and the prototype learning loss.
  2. The method of claim 1, wherein determining the sample weights comprises: estimating a signed distance function (SDF) for the 3D scene based on the fusion feature embeddings; and determining the surface curvature based on a derivative of the SDF.
  3. The method of claim 1, wherein the prototype learning loss comprises a swapped prediction loss that models an interaction between the LiDAR data and the camera image data.
  4. The method of claim 3, wherein the operations further comprise: determining a first similarity score between the LiDAR feature embeddings and the set of learnable prototypes; determining a second similarity score between the camera feature embeddings and the set of learnable prototypes; and performing a cross-modal prediction using the first similarity score and the second similarity score.
  5. The method of claim 1, wherein the prototype learning loss comprises a Gram matrix regularization loss that prevents collapse of the set of learnable prototypes by promoting diversity within the set of learnable prototypes.
  6. The method of claim 5, wherein the operations further comprise determining the Gram matrix regularization loss by minimizing off-diagonal elements of a Gram matrix determined from the set of learnable prototypes.
  7. The method of claim 1, wherein the operations further comprise: after the joint training, deploying a 3D perception model for a vehicle, wherein the 3D perception model comprises the LiDAR encoder, the camera encoder, and the fusion encoder, and wherein the 3D perception model, when deployed for the vehicle, is configured to cause the vehicle to: process real-time sensor data from one or more sensors of the vehicle; and control a maneuver of the vehicle based on the processing of the real-time sensor data.
  8. The method of claim 7, wherein controlling the maneuver of the vehicle comprises generating a control signal to actuate at least one of a steering system, a braking system, or an acceleration system of the vehicle.
  9. The method of claim 1, wherein the operations further comprise, prior to determining the prototype learning loss, projecting the LiDAR feature embeddings and the camera feature embeddings into the shared feature space using one or more projection heads.
  10. The method of claim 1, wherein the rendering loss comprises at least one of: a distance prediction loss for the LiDAR data; a color prediction loss for the camera image data; or a surface signed distance function (SDF) loss.
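Claims 1 and 2 describe weighting points by surface curvature derived from a signed distance function. The sketch below is one possible reading of that step, assuming a small SDF head over fused per-point features and a curvature proxy based on the SDF's second derivatives (its Laplacian, obtained via automatic differentiation); all names and the specific curvature proxy are assumptions, not the patented method.

```python
# Hypothetical sketch of curvature-weighted point sampling (claims 1-2). The SDF
# head, the Laplacian-based curvature proxy, and all names are assumptions.
import torch
import torch.nn as nn


class SDFHead(nn.Module):
    """Small MLP mapping a fused per-point feature plus a 3D position to an SDF value."""

    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, fused_feats: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([fused_feats, points], dim=-1)).squeeze(-1)


def sample_by_curvature(points: torch.Tensor, fused_feats: torch.Tensor,
                        sdf_head: SDFHead, num_samples: int) -> torch.Tensor:
    """Weight each point by an SDF-derived curvature proxy and sample a subset."""
    points = points.detach().requires_grad_(True)
    sdf = sdf_head(fused_feats, points)                                   # (N,)
    # First derivative of the SDF with respect to the 3D coordinates.
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]   # (N, 3)
    # Curvature proxy: the SDF Laplacian (trace of the Hessian), one coordinate at a time.
    lap = torch.zeros_like(sdf)
    for i in range(3):
        lap = lap + torch.autograd.grad(grad[:, i].sum(), points,
                                        retain_graph=True)[0][:, i]
    # Higher curvature -> larger sample weight.
    weights = lap.abs() + 1e-6
    return torch.multinomial((weights / weights.sum()).detach(),
                             num_samples, replacement=False)


# Example with random stand-in data (num_samples must not exceed the point count):
# pts, feats = torch.randn(4096, 3), torch.randn(4096, 128)
# idx = sample_by_curvature(pts, feats, SDFHead(feat_dim=128), num_samples=1024)
```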

Description

CROSS-REFERENCE TO RELATED APPLICATION

This U.S. patent application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 63/720,113, filed on November 13, 2024. The disclosure of this earlier application is considered part of the disclosure of this application and is hereby incorporated by reference in its entirety.

INTRODUCTION

The information provided in this section is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure. This disclosure relates generally to computer-implemented systems for three-dimensional (3D) perception and, more specifically, to training machine learning models for sensor fusion applications. Vehicles and other autonomous systems are often equipped with a set of sensors to perceive their surrounding environment. These sensors may include cameras, which capture two-dimensional (2D) images rich in color and texture, and light detection and ranging (LiDAR) sensors, which generate 3D point clouds providing accurate spatial and geometric information. Perception systems may use machine learning models, such as deep neural networks, to process the data from these different sensor modalities. In some applications, data from both the cameras and the LiDAR sensors are processed together to generate a comprehensive representation of the 3D scene. Training such models often involves learning to extract salient features from both image data and point cloud data to facilitate downstream perception tasks such as object detection and scene understanding.

SUMMARY

One aspect of the disclosure provides a computer-implemented method that executes on data processing hardware and causes the data processing hardware to perform operations. The operations include obtaining LiDAR feature embeddings based on LiDAR data for a three-dimensional (3D) scene from a light detection and ranging (LiDAR) encoder. The operations include obtaining camera feature embeddings based on camera image data for the 3D scene from a camera encoder. The operations include generating fusion feature embeddings using a fusion encoder by combining the LiDAR feature embeddings and the camera feature embeddings. The operations include determining sample weights for multiple points in the 3D scene based on surface curvature estimated from the fusion feature embeddings, with points of higher curvature assigned larger sample weights. The operations include selecting a subset of the multiple points based on the sample weights. The operations include determining a rendering loss by performing differentiable rendering on the selected subset of points to reconstruct at least one of the LiDAR data or the camera image data. The operations include determining a prototype learning loss by comparing the LiDAR feature embeddings and the camera feature embeddings to a set of learnable prototypes representing portions of the 3D scene in a shared feature space. The operations include jointly training the LiDAR encoder, the camera encoder, and the fusion encoder based on the rendering loss and the prototype learning loss. Implementations of the disclosure may include one or more of the following optional features.
According to some implementations, determining the sample weights includes estimating a signed distance function (SDF) for the 3D scene based on the fusion feature embeddings and determining the surface curvature based on a derivative of the SDF. The prototype learning loss may include a swapped prediction loss that models an interaction between the LiDAR data and the camera image data. Here, the operations may further include determining a first similarity score between the LiDAR feature embeddings and the set of learnable prototypes, determining a second similarity score between the camera feature embeddings and the set of learnable prototypes, and performing a cross-modal prediction using the first similarity score and the second similarity score. According to some examples, the prototype learning loss includes a Gram matrix regularization loss that prevents collapse of the set of learnable prototypes by promoting diversity within that set. According to these examples, the operations may further include determining the Gram matrix regularization loss by minimizing the off-diagonal elements of a Gram matrix determined from the set of learnable prototypes. The operations may also include, after the joint training, deploying a 3D perception model for a vehicle, where the 3D perception model includes the LiDAR encoder, the camera encoder, and the fusion encoder. When deployed for the vehicle, the 3D perception model is configured to cause the vehicle to process real-time sensor data from one or more sensors of the vehicle and to control a maneuver of the vehicle based on the processing of the real-time sensor data.
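For the prototype learning loss described above, the following is a minimal sketch of a swapped prediction term plus a Gram matrix regularizer, assuming a simple softmax assignment in place of the Sinkhorn step used in SwAV-style methods; the class name, number of prototypes, and temperature are illustrative assumptions rather than values from the disclosure.

```python
# Hypothetical sketch of the prototype losses: a swapped prediction term in which
# each modality's soft prototype assignment supervises the other, plus a Gram
# matrix regularizer that keeps the prototypes diverse. Names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeLoss(nn.Module):
    def __init__(self, feat_dim: int, num_prototypes: int = 256, temperature: float = 0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.temperature = temperature

    def forward(self, lidar_feats: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        # Normalize embeddings and prototypes before computing similarity scores.
        protos = F.normalize(self.prototypes, dim=-1)
        z_l = F.normalize(lidar_feats, dim=-1)
        z_c = F.normalize(cam_feats, dim=-1)

        # Similarity scores of each modality against the shared prototypes.
        scores_l = z_l @ protos.t() / self.temperature
        scores_c = z_c @ protos.t() / self.temperature

        # Swapped prediction: each modality's (detached) soft assignment is the
        # target for the other modality's scores (simplified, no Sinkhorn step).
        targets_l = F.softmax(scores_l, dim=-1).detach()
        targets_c = F.softmax(scores_c, dim=-1).detach()
        swap_loss = (F.cross_entropy(scores_l, targets_c)
                     + F.cross_entropy(scores_c, targets_l)) / 2

        # Gram matrix regularization: push off-diagonal prototype similarities
        # toward zero so the prototype set stays diverse and does not collapse.
        gram = protos @ protos.t()
        off_diag = gram - torch.diag(torch.diag(gram))
        gram_loss = (off_diag ** 2).mean()

        return swap_loss + gram_loss
```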