US-12626394-B2 - Fast self-supervised single image to categorical 3D objects machine learning model training
Abstract
Systems and methods are provided for implementing a multi-stage ML model training process for autonomous or semi-autonomous driving. The multi-stage ML model training process comprises (1) 2D and 3D supervised losses during ML model training on synthetic data, (2) 2D supervised losses on real-world data, and (3) 3D self-supervised losses on real-world data. The improved ML training process may not rely on real-world 3D labeled data for 3D object recognition. Once the ML model is trained, in some examples, the trained ML model can implement an inference process that predicts the 3D shape, size, and 6D pose of objects within a single image, operates at a category level, and eliminates the need for computer-aided design (CAD) models during inference.
Inventors
- Mayank Lunayach
- Sergey Zakharov
- Dian Chen
- Rares Ambrus
- Zsolt Kira
- Muhammad Zubair Irshad
Assignees
- Toyota Research Institute, Inc.
- Georgia Tech Research Corporation
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-02-23
Claims (20)
- 1 . A computer-implemented method for training a machine learning (ML) model to recognize objects encountered during autonomous or semi-autonomous operations of a vehicle, the method comprising: receiving an image corresponding with the autonomous or semi-autonomous operations of the vehicle; initiating a multi-step training of the machine learning model on the image comprising: initiating a first stage pre-training process based on synthetic data using two-dimensional (2D) supervised machine learning (ML) model training and three-dimensional (3D) self-supervised machine learning (ML) model training; following the first stage pre-training process, initiating a second stage mixed-training process based on a combination of the synthetic data and real-world data on the 2D supervised ML model training; following the second stage mixed-training process, initiating a third stage fine-tuning process based on the real-world data without the synthetic data on the 3D self-supervised ML model training; and extracting and fusing features from the image, using a backbone network, by exposing the image to a heatmap head, a segmentation head, a pose head, and a shape head.
- 2 . The method of claim 1 , wherein the image is a first image, and the method further comprising: upon training the machine learning model on the first image, the machine learning model is provided with an inference process on a set of objects in a second image.
- 3 . The method of claim 2 , the inference process on the set of objects in the second image comprising: detecting 2D locations of the set of objects in the second image; predicting a 3D shape of an object in the set of objects; predicting a pose of the object in the set of objects; predicting a size of the object in the set of objects; and adjusting operation of the vehicle based on the inference.
- 4 . The method of claim 1 , wherein a loss is calculated at each of the first stage pre-training process, the second stage mixed-training process, and the third stage fine-tuning process of the multi-step training of the machine learning model, associated with a two-dimensional (2D) data loss and a three-dimensional (3D) data loss.
- 5 . The method of claim 4 , wherein the loss uses a chamfer loss aggregated with the 2D data loss and the 3D data loss.
- 6 . The method of claim 4 , wherein the shape head uses a chamfer loss aggregated with the 2D data loss and the 3D data loss.
- 7 . The method of claim 1 , wherein during the second stage mixed-training process and the third stage fine-tuning process, 2D labels from the real-world data are employed without 3D labels.
- 8 . The method of claim 1 , wherein a ratio of synthetic data in the first stage pre-training process to real-world data during the second stage mixed-training process is adjustable and pre-determined.
- 9 . The method of claim 1 , wherein the synthetic data is determined using a learned continuous Signed Distance Function (SDF) representing shapes of different categories.
- 10 . The computer-implemented method of claim 1 , wherein the image is an RGB-D image.
- 11 . A system for training a machine learning (ML) model to recognize objects in an image, the system comprising: a memory; and a processor that is configured to execute machine readable instructions stored in the memory for causing the processor to: receive the image; initiate a multi-step training of the machine learning model on the image comprising: initiating a first stage pre-training process based on synthetic data using two-dimensional (2D) supervised machine learning (ML) model training and three-dimensional (3D) self-supervised machine learning (ML) model training; following the first stage pre-training process, initiating a second stage mixed-training process based on a combination of the synthetic data and real-world data on the 2D supervised ML model training; following the second stage mixed-training process, initiating a third stage fine-tuning process based on the real-world data without the synthetic data on the 3D self-supervised ML model training; and extracting and fusing features from the image by exposing the image to a heatmap head, a segmentation head, a pose head, and a shape head.
- 12 . The system of claim 11 , wherein the image is a first image, and the instructions stored in the memory further cause the processor to: upon training the machine learning model on the first image, the machine learning model is provided with an inference process on a set of objects in a second image.
- 13 . The system of claim 12 , the inference process on the set of objects in the second image comprising: detecting 2D locations of the set of objects in the second image; predicting a 3D shape of an object in the set of objects; predicting a pose of the object in the set of objects; predicting a size of the object in the set of objects; and adjusting operation of the vehicle based on the inference.
- 14 . The system of claim 11 , wherein a loss is calculated at each of the first stage pre-training process, the second stage mixed-training process, and the third stage fine-tuning process of the multi-step training of the machine learning model, associated with a two-dimensional (2D) data loss and a three-dimensional (3D) data loss.
- 15 . The system of claim 14 , wherein the loss uses a chamfer loss aggregated with the 2D data loss and the 3D data loss.
- 16 . The system of claim 14 , wherein the shape head uses a chamfer loss aggregated with the 2D data loss and the 3D data loss.
- 17 . The system of claim 11 , wherein during the second stage mixed-training process and the third stage fine-tuning process, 2D labels from the real-world data are employed without 3D labels.
- 18 . The system of claim 11 , wherein a ratio of synthetic data in the first stage pre-training process to real-world data during the second stage mixed-training process is adjustable and pre-determined.
- 19 . A non-transitory computer-readable storage medium storing a plurality of instructions executable by a processor, the plurality of instructions when executed by the processor cause the processor to: receive an image corresponding with autonomous or semi-autonomous operations of a device; initiate a multi-step training of a machine learning model to recognize objects encountered while the device is operating and depicted in the image, the multi-step training of the machine learning model comprising: initiating a first stage pre-training process based on synthetic data using two-dimensional (2D) supervised machine learning (ML) model training and three-dimensional (3D) self-supervised machine learning (ML) model training; following the first stage pre-training process, initiating a second stage mixed-training process based on a combination of the synthetic data and real-world data on the 2D supervised ML model training; following the second stage mixed-training process, initiating a third stage fine-tuning process based on the real-world data without the synthetic data on the 3D self-supervised ML model training; and extracting and fusing features from the image by exposing the image to a heatmap head, a segmentation head, a pose head, and a shape head.
- 20 . The non-transitory computer-readable storage medium of claim 19 , wherein the image is a first image, and the processor is further caused to: upon training the machine learning model on the first image, the machine learning model is provided with an inference process on a set of objects in a second image.
Description
CROSS-REFERENCE TO RELATED APPLICATION

This patent application is co-pending with U.S. patent application Ser. No. 17/895,224, filed Aug. 25, 2022, which is incorporated by reference herein.

TECHNICAL FIELD

The present disclosure relates generally to shape reconstruction and pose and size estimation and, more particularly, to multi-object three-dimensional (3D) shape reconstruction and six-dimensional (6D) pose and size estimation from a single image for machine learning and robotics automation.

DESCRIPTION OF RELATED ART

Automated driving systems and robotics systems leverage 3D object recognition to help understand the surrounding environment. The goal is to enable a machine to perceive and interpret the 3D spatial information of objects in its vicinity, such as other vehicles, pedestrians, cyclists, and obstacles, to help make informed operating decisions and ensure safe navigation. 3D object reconstruction enables these systems to obtain a fine-grained understanding of local geometry, which may be useful in scenarios such as robotic grasping. Furthermore, a system that is able to perform 6D pose estimation in real time can enable fast-feedback control.

BRIEF SUMMARY OF THE DISCLOSURE

According to various examples of the disclosed technology, a method is provided for training a machine learning (ML) model to recognize objects. The objects may be encountered while an autonomous or semi-autonomous vehicle is operating in an environment, or in other instances of object recognition implemented by a machine learning model. The method may comprise, for example, receiving an image corresponding with the autonomous or semi-autonomous operations of the vehicle and initiating a multi-step training of the machine learning model on the image. The multi-step training of the machine learning model may comprise initiating a first stage pre-training process based on synthetic data using two-dimensional (2D) supervised ML model training and three-dimensional (3D) self-supervised ML model training; following the first stage pre-training process, initiating a second stage mixed-training process based on a combination of the synthetic data and real-world data on the 2D supervised ML model training; following the second stage mixed-training process, initiating a third stage fine-tuning process based on the real-world data without the synthetic data on the 3D self-supervised ML model training; and extracting and fusing features from the image, using a backbone network, by exposing the image to a heatmap head, a segmentation head, a pose head, and a shape head.

In some examples, the image is a first image and, upon training the machine learning model on the first image, the machine learning model is provided with an inference process on a set of objects in a second image. The inference process may comprise detecting 2D locations of the set of objects in the second image, predicting a 3D shape of an object in the set of objects, predicting a pose of the object, predicting a size of the object, and adjusting operation of the vehicle based on the inference.

In some examples, a loss is calculated at each of the first stage pre-training process, the second stage mixed-training process, and the third stage fine-tuning process of the multi-step training, associated with a two-dimensional (2D) data loss and a three-dimensional (3D) data loss. The loss may use a chamfer loss aggregated with the 2D data loss and the 3D data loss; the shape head, for example, may use a chamfer loss aggregated with the 2D data loss and the 3D data loss. In some examples, during the second stage mixed-training process and the third stage fine-tuning process, 2D labels from the real-world data may be employed without 3D labels. In some examples, a ratio of synthetic data in the first stage pre-training process to real-world data during the second stage mixed-training process is adjustable and pre-determined. In some examples, the synthetic data is determined using a learned continuous Signed Distance Function (SDF) representing shapes of different categories. In some examples, the image is an RGB-D image. The RGB-D image may include multiple objects for detecting, reconstructing, and initiating an action by the system or vehicle described herein.
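For concreteness, the three training stages summarized above can be outlined in code. The following PyTorch-style sketch is illustrative only: the convolutional model, the single-batch loader, and the loss callables are toy stand-ins introduced for this sketch, not the backbone-plus-heads network or the loss formulations of this disclosure.

```python
import torch
import torch.nn as nn

def run_stage(model, optimizer, loader, loss_2d=None, loss_3d=None):
    """Run one epoch of a single training stage.

    loss_2d: 2D supervised term (uses 2D labels such as heatmaps or masks), or None.
    loss_3d: 3D self-supervised term (requires no 3D labels), or None.
    """
    model.train()
    for batch in loader:
        # The backbone and its heads (heatmap, segmentation, pose, shape)
        # would produce all predictions from the input image in one pass.
        outputs = model(batch["image"])
        loss = torch.zeros((), device=batch["image"].device)
        if loss_2d is not None:
            loss = loss + loss_2d(outputs, batch)
        if loss_3d is not None:
            loss = loss + loss_3d(outputs, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy stand-ins so the schedule below actually runs; a real system would use
# the backbone-plus-heads network and its actual loss terms instead.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [{"image": torch.rand(2, 3, 64, 64), "labels_2d": torch.rand(2, 8, 64, 64)}]
supervised_2d_loss = lambda out, batch: nn.functional.mse_loss(out, batch["labels_2d"])
selfsup_3d_loss = lambda out, batch: out.abs().mean()  # placeholder self-supervised term

# Stage 1: pre-train on synthetic data with 2D supervised and 3D self-supervised losses.
run_stage(model, optimizer, loader, loss_2d=supervised_2d_loss, loss_3d=selfsup_3d_loss)
# Stage 2: mixed synthetic and real batches, 2D supervised loss only.
run_stage(model, optimizer, loader, loss_2d=supervised_2d_loss)
# Stage 3: fine-tune on real data only, 3D self-supervised loss only.
run_stage(model, optimizer, loader, loss_3d=selfsup_3d_loss)
```

In a fuller version, separate synthetic, mixed, and real-world loaders would be used, with the synthetic-to-real mix in the second stage following the adjustable, pre-determined ratio noted above.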
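The chamfer term referenced above measures how closely a predicted point set matches a reference point set. A minimal, generic chamfer distance, given here as an assumption and not necessarily the exact formulation used with the shape head in this disclosure, might look like the following:

```python
import torch

def chamfer_distance(pred_points: torch.Tensor, ref_points: torch.Tensor) -> torch.Tensor:
    """Symmetric chamfer distance between two 3D point sets.

    pred_points: (N, 3) points decoded from the predicted shape.
    ref_points:  (M, 3) reference points, e.g. back-projected depth or a synthetic shape.
    """
    # Pairwise squared distances between every predicted and reference point.
    d = torch.cdist(pred_points, ref_points) ** 2  # (N, M)
    # Average distance from each point to its nearest neighbour in the other set.
    pred_to_ref = d.min(dim=1).values.mean()
    ref_to_pred = d.min(dim=0).values.mean()
    return pred_to_ref + ref_to_pred

# The chamfer term can then be aggregated with weighted 2D and 3D loss terms.
pred = torch.rand(1024, 3, requires_grad=True)
ref = torch.rand(2048, 3)
chamfer_distance(pred, ref).backward()
```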
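Similarly, a learned continuous SDF over category-level shapes is often realized as a small network that maps a per-object latent code and a 3D query point to a signed distance. The decoder below is a generic sketch under that assumption, not the specific network of this disclosure; querying it on a regular grid recovers the object surface where the predicted distance crosses zero.

```python
import torch
import torch.nn as nn

class CategorySDF(nn.Module):
    """Toy MLP mapping a shape latent code plus a 3D query point to a signed distance."""

    def __init__(self, latent_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, latent: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # latent: (B, latent_dim) per-object shape code, e.g. produced by a shape head.
        # points: (B, P, 3) query points in the object's canonical frame.
        expanded = latent.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([expanded, points], dim=-1)).squeeze(-1)  # (B, P)

# Query the decoder on a coarse regular grid of 3D points; the zero level set of
# the returned distances approximates the object's surface.
sdf = CategorySDF()
latent = torch.randn(1, 64)
axes = [torch.linspace(-1.0, 1.0, 16)] * 3
grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(1, -1, 3)
distances = sdf(latent, grid)  # (1, 4096) signed distances
```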
Other features and aspects of the disclosed technology will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the features in accordance with examples of the disclosed technology. The summary is not intended to limit the scope of any inventions described herein, which are defined solely by the claims attached hereto.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure, in accordance with one or more various examples, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only.