EP-4742063-A2 - ESTIMATING OBJECT PROPERTIES USING VISUAL IMAGE DATA
Abstract
A system comprises one or more processors coupled to memory. The one or more processors are configured to receive image data based on an image captured using a camera of a vehicle and to utilize the image data as a basis of an input to a trained machine learning model to at least in part identify a distance of an object from the vehicle. The trained machine learning model has been trained using a training image and a correlated output of an emitting distance sensor.
Inventors
- MUSK, JAMES ANTHONY
- SAHAI, SWUPNIL KUMAR
- ELLUSWAMY, ASHOK KUMAR
Assignees
- Tesla, Inc.
Dates
- Publication Date: 2026-05-13
- Application Date: 2020-02-07
Claims (15)
- A system, comprising: one or more processors configured to: receive sensor data based on an image of at least one object captured using a vision sensor of a vehicle, the vision sensor capturing an environment of the vehicle; provide the sensor data as an input to a trained machine learning model to cause the trained machine learning model to generate an output representing at least one property of the at least one object in the environment, the at least one property comprising a velocity vector corresponding to the at least one object in the environment of the vehicle, wherein the trained machine learning model was trained using a training image and a correlated output of an emitting distance sensor; and determine a predicted path of the at least one object in the environment of the vehicle based on the velocity vector.
- The system of claim 1, wherein the output representing the at least one property of the at least one object in the environment further comprises at least one of a distance of the at least one object relative to the vision sensor or a direction of the at least one object relative to the environment.
- The system of any one of the preceding claims, wherein the at least one object comprises a pedestrian or a second vehicle that is moving relative to the vehicle, wherein the one or more processors are further configured to: provide the sensor data as the input to the trained machine learning model to cause the trained machine learning model to generate the output representing the at least one property of the pedestrian or the second vehicle in the environment.
- The system of any one of the preceding claims, wherein the trained machine learning model is trained by: receiving a time series of images captured using a camera of a training vehicle; receiving a time series of distance data from an emitting distance sensor of the training vehicle; tracking at least one object across the time series of images; and correlating the at least one object tracked across the time series of images with the time series of distance data to determine a plurality of distance estimates for the at least one object tracked across the time series of images, the plurality of distance estimates being used as ground-truth distance labels associated with corresponding ones of the images in the time series.
- The system of any one of the preceding claims, wherein the one or more processors receive the sensor data based on generation of the sensor data by: at least one camera or at least one fisheye camera.
- The system of any one of the preceding claims, wherein the one or more processors are further configured to: normalize the sensor data, wherein the one or more processors are configured to provide the normalized sensor data as the input to the trained machine learning model.
- The system of any one of the preceding claims, wherein the one or more processors are further configured to: cause a vehicle control module to control operation of the vehicle based on the at least one property of the at least one object or the at least one property of at least one agent in the environment.
- A method comprising: receiving, by at least one processor, sensor data based on an image of at least one object captured using a vision sensor of a vehicle, the vision sensor capturing an environment of the vehicle; providing, by the at least one processor, the sensor data as an input to a trained machine learning model to cause the trained machine learning model to generate an output representing at least one property of the at least one object in the environment, the at least one property comprising a velocity vector corresponding to the at least one object in the environment of the vehicle, wherein the trained machine learning model was trained using a training image and a correlated output of an emitting distance sensor; and determining, by the at least one processor, a predicted path of the at least one object in the environment of the vehicle based on the velocity vector.
- The method of claim 8, wherein the output representing the at least one property of the at least one object in the environment further comprises at least one of a distance of the at least one object relative to the vision sensor or a direction of the at least one object relative to the environment.
- The method of claim 8 or 9, wherein the at least one object comprises a pedestrian or a second vehicle that is moving relative to the vehicle, wherein the method further comprises: providing, by the at least one processor, the sensor data as the input to the trained machine learning model to cause the trained machine learning model to generate the output representing the at least one property of the pedestrian or the second vehicle in the environment.
- The method of any one of claims 8-10, wherein the trained machine learning model is trained by: receiving a time series of images captured using a camera of a training vehicle; receiving a time series of distance data from an emitting distance sensor of the training vehicle; tracking at least one object across the time series of images; and correlating the at least one object tracked across the time series of images with the time series of distance data to determine a plurality of distance estimates for the at least one object tracked across the time series of images, the plurality of distance estimates being used as ground-truth distance labels associated with corresponding ones of the images in the time series.
- The method of any one of claims 8-11, further comprising: receiving, by the at least one processor, the sensor data based on generation of the sensor data by: at least one camera or at least one fisheye camera.
- The method of any one of claims 8-12, further comprising: normalizing, by the at least one processor, the sensor data; and providing, by the at least one processor, the sensor data as the input to the trained machine learning model based on normalizing the sensor data.
- The method of any one of claims 8-13, further comprising: causing a vehicle control module to control operation of the vehicle based on the at least one property of the at least one object or the at least one property of at least one agent in the environment.
- A non-transitory computer storage medium storing instructions that, when executed by a system of one or more processors, cause the one or more processors to perform operations implementing the method of any one of claims 8-14 and/or implementing the system of any one of claims 1-7.
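As an informal illustration of claim 1's final step, a predicted path can be derived from a velocity vector by extrapolation. The sketch below (Python) assumes a constant-velocity motion model and illustrative names such as `ObjectState` and `predict_path`; none of these appear in the patent, and a production system would use a more sophisticated motion model.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectState:
    position: Tuple[float, float]  # metres, in the vehicle's frame of reference
    velocity: Tuple[float, float]  # metres/second, the claimed velocity vector

def predict_path(state: ObjectState, horizon_s: float = 3.0,
                 step_s: float = 0.5) -> List[Tuple[float, float]]:
    """Extrapolate a straight-line path from the object's velocity vector,
    returning waypoints at fixed time steps out to the prediction horizon."""
    path = []
    t = step_s
    while t <= horizon_s + 1e-9:
        path.append((state.position[0] + state.velocity[0] * t,
                     state.position[1] + state.velocity[1] * t))
        t += step_s
    return path

# A pedestrian 10 m ahead of the vehicle, crossing laterally at 1.5 m/s:
waypoints = predict_path(ObjectState(position=(10.0, 0.0), velocity=(0.0, 1.5)))
```

A downstream vehicle control module (as in claims 7 and 14) could then test such waypoints against the vehicle's own planned trajectory.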
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of, and claims priority to, U.S. Patent App. No. 16/279,657, titled "ESTIMATING OBJECT PROPERTIES USING VISUAL IMAGE DATA" and filed on February 19, 2019, the disclosure of which is hereby incorporated herein by reference in its entirety.
BACKGROUND OF THE INVENTION
Autonomous driving systems typically rely on mounting numerous sensors, including a collection of vision and emitting distance sensors (e.g., radar, lidar, ultrasonic sensors, etc.), on a vehicle. The data captured by each sensor is then gathered to help understand the vehicle's surrounding environment and to determine how to control the vehicle. Vision sensors can be used to identify objects from captured image data, and emitting distance sensors can be used to determine the distances of the detected objects. Steering and speed adjustments can be applied based on detected obstacles and clear drivable paths. But as the number and types of sensors increase, so do the complexity and cost of the system. For example, emitting distance sensors such as lidar are often costly to include in a mass market vehicle. Moreover, each additional sensor increases the input bandwidth requirements for the autonomous driving system. Therefore, there exists a need for an optimal configuration of sensors on a vehicle: one that limits the total number of sensors without limiting the amount and type of data captured to accurately describe the surrounding environment and safely control the vehicle.
SUMMARY
One embodiment includes a system.
The system comprises one or more processors configured to: receive image data based on an image captured using a camera of a vehicle; and utilize the image data as a basis of an input to a trained machine learning model to at least in part identify a distance of an object from the vehicle; wherein the trained machine learning model has been trained using a training image and a correlated output of an emitting distance sensor; and a memory coupled to the one or more processors.
Another embodiment includes a computer program product, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions. The computer instructions are for receiving image data based on an image captured using a camera of a vehicle; and utilizing the image data as a basis of an input to a trained machine learning model to at least in part identify a distance of an object from the vehicle, wherein the trained machine learning model has been trained using a training image and a correlated output of an emitting distance sensor.
Yet another embodiment includes a method. The method comprises receiving a selected image based on an image captured using a camera of a vehicle; receiving distance data based on an emitting distance sensor of the vehicle; identifying an object using the selected image as an input to a trained machine learning model; extracting a distance estimate of the identified object from the received distance data; creating a training image by annotating the selected image with the extracted distance estimate; training a second machine learning model to predict a distance measurement using a training data set that includes the training image; and providing the trained second machine learning model to a second vehicle equipped with a second camera.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
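Central to the summarized training method is correlating camera frames with readings from an emitting distance sensor to produce ground-truth distance labels. A minimal sketch of that correlation step follows (Python); the function name `correlate_labels`, the timestamp-matching approach, and the skew tolerance are illustrative assumptions, not details taken from the patent.

```python
import bisect
from typing import List, Tuple

def correlate_labels(
    image_times: List[float],                     # capture times of frames with a tracked object
    distance_series: List[Tuple[float, float]],   # (time, distance) readings, sorted by time
    max_skew_s: float = 0.05,
) -> List[Tuple[int, float]]:
    """For each image, find the nearest-in-time distance reading and, if it is
    within the skew tolerance, emit (image_index, distance) as a training label."""
    times = [t for t, _ in distance_series]
    labels = []
    for i, t_img in enumerate(image_times):
        j = bisect.bisect_left(times, t_img)
        candidates = [k for k in (j - 1, j) if 0 <= k < len(times)]
        if not candidates:
            continue
        k = min(candidates, key=lambda k: abs(times[k] - t_img))
        if abs(times[k] - t_img) <= max_skew_s:
            labels.append((i, distance_series[k][1]))
    return labels
```

Each labeled image could then be added to the training data set for the second machine learning model, which learns to predict distance from image data alone.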
Figure 1 is a block diagram illustrating an embodiment of a deep learning system for autonomous driving.
Figure 2 is a flow diagram illustrating an embodiment of a process for creating training data for predicting object properties.
Figure 3 is a flow diagram illustrating an embodiment of a process for training and applying a machine learning model for autonomous driving.
Figure 4 is a flow diagram illustrating an embodiment of a process for training and applying a machine learning model for autonomous driving.
Figure 5 is a diagram illustrating an example of capturing auxiliary sensor data for training a machine learning network.
Figure 6 is a diagram illustrating an example of predicting object properties.
DETAILED DESCRIPTION
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to