US-12618976-B2 - Annotation cross-labeling for autonomous control systems
Abstract
An annotation system uses annotations for a first set of sensor measurements from a first sensor to identify annotations for a second set of sensor measurements from a second sensor. The annotation system identifies reference annotations in the first set of sensor measurements that indicate a location of a characteristic object in two-dimensional space. The annotation system determines a spatial region in the three-dimensional space of the second set of sensor measurements that corresponds to a portion of the scene represented in the annotation of the first set of sensor measurements. The annotation system then determines annotations within the spatial region of the second set of sensor measurements that indicate a location of the characteristic object in three-dimensional space.
Inventors
- Anting Shen
Assignees
- TESLA, INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-12-08
Claims (20)
- 1. A method, comprising: obtaining, by a processor, an image of a real-world scene captured by a first sensor, wherein at least a portion of an object is represented by the image; annotating, by the processor, a portion of the image with a first annotation, the first annotation indicating a location of the object in the image; obtaining, by the processor, sensor measurements of the real-world scene captured by a second sensor, the sensor measurements representing the real-world scene in three-dimensional space; determining, by the processor, a spatial region associated with the sensor measurements that at least partially corresponds with the portion of the image annotated with the first annotation; searching, by the processor, within the spatial region to identify a portion of the spatial region that includes the object; and annotating, by the processor, the sensor measurements with a second annotation, the second annotation indicating a location of the object in the portion of the spatial region.
- 2. The method of claim 1, wherein obtaining the sensor measurements captured by the second sensor comprises obtaining sensor measurements that are arranged as a point cloud that models the real-world scene with respect to a three-dimensional coordinate system.
- 3. The method of claim 1, wherein obtaining the sensor measurements captured by the second sensor comprises obtaining sensor measurements that are arranged as a depth map, the depth map comprising depth measurements that indicate distances to objects in the real-world scene from the first sensor.
- 4. The method of claim 1, wherein searching within the spatial region to identify the portion of the spatial region that includes the object comprises: determining a filtered subset of sensor measurements contained in the spatial region, and applying an annotation model to the filtered subset of sensor measurements to identify the portion of the spatial region that includes the object.
- 5. The method of claim 1, further comprising: generating training data based on the sensor measurements and the second annotation; and training a computer model using the training data.
- 6. The method of claim 1, wherein obtaining the image of the real-world scene captured by the first sensor comprises obtaining the image from a camera, and wherein obtaining the sensor measurements of the real-world scene captured by the second sensor comprises: obtaining the sensor measurements from a light detection and ranging (LIDAR) sensor or a radio detection and ranging (RADAR) sensor.
- 7. The method of claim 1, wherein obtaining the image of the real-world scene captured by the first sensor comprises obtaining the image from a camera, and wherein obtaining the sensor measurements of the real-world scene captured by the second sensor comprises: obtaining the sensor measurements from an infrared (IR) sensor.
- 8. The method of claim 1, wherein obtaining the sensor measurements of the real-world scene captured by the second sensor comprises obtaining the sensor measurements of the real-world scene, where the sensor measurements are captured with respect to a viewpoint of the second sensor which is different from a viewpoint of the first sensor.
- 9. The method of claim 1, wherein annotating the sensor measurements with the second annotation comprises annotating the sensor measurements based on a bounding box that surrounds at least a portion of the object in the three-dimensional space.
- 10. The method of claim 1, wherein annotating the sensor measurements with the second annotation comprises assigning a label to a subset of the sensor measurements, the label identifying the object.
- 11. A system, comprising: at least one computer processor for executing computer program instructions; and a non-transitory computer-readable storage medium storing computer program instructions executable by the at least one computer processor to perform operations comprising: obtaining an image of a real-world scene captured by a first sensor, wherein at least a portion of an object is represented by the image; annotating a portion of the image with a first annotation, the first annotation indicating a location of the object in the image; obtaining sensor measurements of the real-world scene captured by a second sensor, the sensor measurements representing the real-world scene in three-dimensional space; determining a spatial region associated with the sensor measurements that at least partially corresponds with the portion of the image annotated with the first annotation; searching within the spatial region to identify a portion of the spatial region that includes the object; and annotating the sensor measurements with a second annotation, the second annotation indicating a location of the object in the portion of the spatial region.
- 12. The system of claim 11, wherein the instructions that cause the at least one computer processor to obtain the sensor measurements captured by the second sensor cause the at least one computer processor to obtain sensor measurements that are arranged as a point cloud that models the real-world scene with respect to a three-dimensional coordinate system.
- 13. The system of claim 11, wherein the instructions that cause the at least one computer processor to obtain the sensor measurements captured by the second sensor cause the at least one computer processor to obtain the sensor measurements, wherein the sensor measurements are arranged as a depth map, the depth map comprising depth measurements that indicate distances to objects in the real-world scene from the first sensor.
- 14. The system of claim 11, wherein the instructions that cause the at least one computer processor to search within the spatial region to identify the portion of the spatial region that includes the object cause the at least one computer processor to: determine a filtered subset of sensor measurements contained in the spatial region; and apply an annotation model to the filtered subset of sensor measurements to identify the portion of the spatial region that includes the object.
- 15. The system of claim 11, wherein the instructions further cause the at least one computer processor to: generate training data based on the sensor measurements and the second annotation; and train a computer model using the training data.
- 16. The system of claim 11, wherein the instructions that cause the at least one computer processor to obtain the image of the real-world scene captured by the first sensor cause the at least one computer processor to obtain the image from a camera, and wherein the instructions that cause the at least one computer processor to obtain the sensor measurements of the real-world scene captured by the second sensor cause the at least one computer processor to: obtain the sensor measurements from a light detection and ranging (LIDAR) sensor or a radio detection and ranging (RADAR) sensor.
- 17. The system of claim 11, wherein the instructions that cause the at least one computer processor to obtain the image of the real-world scene captured by the first sensor cause the at least one computer processor to obtain the image from a camera, and wherein the instructions that cause the at least one computer processor to obtain the sensor measurements of the real-world scene captured by the second sensor cause the at least one computer processor to obtain the sensor measurements from an infrared (IR) sensor.
- 18. The system of claim 11, wherein the instructions that cause the at least one computer processor to obtain the sensor measurements of the real-world scene captured by the second sensor cause the at least one computer processor to obtain the sensor measurements of the real-world scene, wherein the sensor measurements are captured with respect to a viewpoint of the second sensor which is different from a viewpoint of the first sensor.
- 19. The system of claim 11, wherein the instructions that cause the at least one computer processor to annotate the sensor measurements with the second annotation cause the at least one computer processor to annotate the sensor measurements based on a bounding box that surrounds at least a portion of the object in the three-dimensional space.
- 20. The system of claim 11, wherein the instructions that cause the at least one computer processor to annotate the sensor measurements with the second annotation cause the at least one computer processor to assign a label to a subset of the sensor measurements, the label identifying the object.
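Claims 5 and 15 cover turning the cross-labeled measurements into training data for a computer model. The sketch below shows one way such pairs might be consumed, using a deliberately toy PyTorch model; the 7-parameter box encoding (center, size, yaw), the model architecture, and all names here are hypothetical stand-ins, not the patent's implementation.

```python
import torch
from torch import nn

class PointBoxRegressor(nn.Module):
    """Toy model: pool per-point features and regress a 7-parameter 3D box
    (x, y, z, length, width, height, yaw) for the annotated object."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, 7)

    def forward(self, points):             # points: (B, N, 3)
        feats = self.point_mlp(points)     # (B, N, 128) per-point features
        pooled = feats.max(dim=1).values   # order-invariant max pooling
        return self.head(pooled)           # (B, 7) predicted box

model = PointBoxRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(points, target_box):
    """One update on a (filtered point set, cross-labeled 3D box) pair."""
    optimizer.zero_grad()
    loss = nn.functional.smooth_l1_loss(model(points), target_box)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Each training pair would come directly from the cross-labeling pipeline of claim 1: the points are the filtered subset inside the spatial region, and the regression target is the second annotation.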
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/806,358, filed Jun. 10, 2022, which is a continuation of U.S. patent application Ser. No. 16/514,721, now U.S. Pat. No. 11,361,457, filed Jul. 17, 2019, which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/701,441, filed Jul. 20, 2018, all of which are incorporated by reference herein in their entirety.

BACKGROUND

This invention relates generally to autonomous control systems, and more particularly to training computer models for autonomous control systems.

Autonomous control systems are systems that guide vehicles (e.g., automobiles, trucks, vans) without direct guidance by human operators. Autonomous control systems analyze the surrounding physical environment in various ways to guide vehicles in a safe manner. For example, an autonomous control system may detect and/or track objects in the physical environment and, responsive to a detected object, guide the vehicle away from the object such that collision with the object can be avoided. As another example, an autonomous control system may detect boundaries of lanes on the road such that the vehicle can be guided within the appropriate lane with the flow of traffic. Typically, the autonomous control system includes sensors that capture the surrounding environment as a set of sensor measurements in the form of images, videos, point cloud data, and the like.

Autonomous control systems often use computer models to analyze the surrounding environment and perform detection and control operations. The computer models are trained using training data that resemble potential environments the autonomous control system would encounter during operation. The training data may correspond to the type of sensor data generated by the sensors of the autonomous control system. In preparation for the training process, portions of the training data are annotated to label various objects of interest, and computer models learn representations of the objects through these annotations. For example, annotations for an image of a street from a camera may be regions of the image containing pedestrians, on which computer models can be trained to learn representations of people on the street.

Typically, annotations for training data are generated by human operators who manually label the regions of interest, or by annotation models that allow human operators to simply verify the annotations and relabel only those that are inaccurate. While fairly accurate labels can be generated easily and conveniently for certain types of sensor measurements, other types can be difficult to annotate due to the format, size, or complexity of the data. For example, light detection and ranging (LIDAR) sensors generate sensor measurements in three-dimensional (3D) space that are difficult for human operators to label compared to a two-dimensional (2D) image. And although annotation models can be used to generate these annotations instead, this too is difficult due to the significant amount of data that must be processed and the sensor measurements that are missing as a result of the particular sensing mechanism.

SUMMARY

An annotation system uses annotations for a first set of sensor measurements from a first sensor to identify annotations for a second set of sensor measurements from a second sensor.
Annotations for the first set of sensor measurements may be generated relatively easily and conveniently, while annotations for the second set may be more difficult to generate due to the sensing characteristics of the second sensor. In one embodiment, the first set of sensor measurements is from a camera that represents a scene in two-dimensional (2D) space, and the second set of sensor measurements is from an active sensor, such as a light detection and ranging (LIDAR) sensor, that represents the scene in three-dimensional (3D) space.

Specifically, the annotation system identifies reference annotations in the first set of sensor measurements that indicate a location of a characteristic object in the 2D space. The annotation system determines a spatial region in the 3D space of the second set of sensor measurements that corresponds to a portion of the scene represented in the annotation of the first set of sensor measurements. The spatial region is determined using at least a viewpoint of the first sensor and the location of the first annotation in the 2D space. In one embodiment, the spatial region is represented as a viewing frustum: a pyramid of vision containing the region of space that may appear in the reference annotation in the 2D image. In one instance, the spatial region may be shaped as a rectangular pyramid. The annotation system then determines annotations within the spatial region of the second set of sensor measurements that indicate a location of the characteristic object in the 3D space.
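The frustum construction described above lends itself to a compact implementation. Below is a minimal sketch, assuming a pinhole camera with intrinsic matrix `K`, a rigid LIDAR-to-camera transform `T_cam_from_lidar`, and a 2D annotation given as pixel bounds; the function names, array layouts, and near/far depth limits are illustrative assumptions, not the patent's implementation. The key observation is that a point lies inside the rectangular-pyramid frustum exactly when its depth is positive and its projection falls inside the 2D annotation.

```python
import numpy as np

def points_in_bbox_frustum(points_lidar, bbox, K, T_cam_from_lidar,
                           near=0.5, far=100.0):
    """Return the LIDAR points inside the viewing frustum cast by a
    2D bounding-box annotation (a rectangular pyramid from the camera).

    points_lidar: (N, 3) points in the LIDAR frame.
    bbox: (u_min, v_min, u_max, v_max) pixel bounds of the 2D annotation.
    K: (3, 3) pinhole camera intrinsic matrix.
    T_cam_from_lidar: (4, 4) rigid transform from LIDAR to camera frame.
    """
    # Move the points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep points in front of the camera and within the depth limits;
    # the near/far planes truncate the pyramid of vision.
    z = pts_cam[:, 2]
    in_depth = (z > near) & (z < far)

    # Project with the pinhole model; suppress warnings for points at
    # zero depth, which the depth mask discards anyway.
    with np.errstate(divide="ignore", invalid="ignore"):
        uv = (K @ pts_cam.T).T
        u, v = uv[:, 0] / z, uv[:, 1] / z

    # A point is inside the frustum iff its projection lands in the bbox.
    u_min, v_min, u_max, v_max = bbox
    in_box = (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return points_lidar[in_depth & in_box]

def fit_axis_aligned_box(points):
    """Toy stand-in for the annotation step: wrap the filtered points in
    an axis-aligned 3D box, returned as (center_xyz, size_xyz)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (lo + hi) / 2.0, hi - lo
```

In practice the second annotation would come from an annotation model applied to the filtered subset (for example, clustering away ground points before fitting an oriented box), but restricting the search to the frustum is what makes that model tractable, since it only processes a small fraction of the full point cloud.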