US-20260127811-A1 - THREE-DIMENSIONAL (3D) POINT CLOUD PERCEPTION

Abstract

Systems and techniques are described herein for processing three-dimensional (3D) data. For example, a computing device can process a plurality of voxels to generate a plurality of tokens associated with the 3D data; process, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjust, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and process the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.
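As one illustrative, non-limiting sketch, the flow summarized above can be read as a gated token-mixing block: a linear layer acts on the embedding dimension of each token, a second layer combines each token with its neighbors in the sequence, and the two results are multiplied element-wise. The function name `gated_token_block`, the weight shapes, and the use of a depthwise 1D convolution with zero padding are assumptions made for this example only. In particular, the convolution below mixes neighboring tokens rather than literally permuting them, and how the disclosed layer adjusts token order is not specified in this excerpt.

```python
import numpy as np


def gated_token_block(tokens, w_in, conv_kernel):
    """Hypothetical sketch of a gated token-mixing step.

    tokens:      (N, C) array, N tokens in 1D sequence order, C channels.
    w_in:        (C, C) weights of an assumed bias-free linear layer.
    conv_kernel: (K, C) depthwise kernel, K odd (assumed).
    """
    tokens = np.asarray(tokens, dtype=float)
    conv_kernel = np.asarray(conv_kernel, dtype=float)

    # Linear layer applied along the embedding (channel) dimension.
    modified = tokens @ np.asarray(w_in, dtype=float)  # (N, C)

    # Depthwise 1D convolution along the token axis with zero padding,
    # so the output keeps N tokens (cross-correlation form, as is
    # conventional in deep-learning libraries).
    k = conv_kernel.shape[0]
    pad = k // 2
    padded = np.pad(modified, ((pad, pad), (0, 0)))
    mixed = np.zeros_like(modified)
    for i in range(k):
        mixed += padded[i:i + modified.shape[0]] * conv_kernel[i]

    # Element-wise multiplication of the two branches.
    return modified * mixed
```

With an identity `w_in` and an all-ones kernel of width 3, each output token equals the input token multiplied by the sum of itself and its immediate neighbors, which makes the local mixing easy to inspect by hand.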

Inventors

Shizhong Steve HAN
Hong Cai
Hsin-Pai CHENG
Soyeb Noormohammed NAGORI
Jihad Masri
Fatih Murat PORIKLI

Assignees

QUALCOMM INCORPORATED

Dates

Publication Date: 20260507
Application Date: 20241106

Claims (20)

1. An apparatus for processing three-dimensional (3D) data, the apparatus comprising: one or more memories configured to store the 3D data; and one or more processors coupled to the one or more memories and configured to: process a plurality of voxels to generate a plurality of tokens associated with the 3D data; process, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjust, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and process the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.
2. The apparatus of claim 1, wherein the layer of the encoder is a convolution layer that is part of a convolutional neural network (CNN).
3. The apparatus of claim 1, wherein the one or more processors are configured to process the embedding dimension and the rearranged plurality of tokens using element-wise multiplication of the embedding dimension and the rearranged plurality of tokens.
4. The apparatus of claim 1, wherein the plurality of tokens is a plurality of one-dimensional (1D) sequential tokens.
5. The apparatus of claim 1, wherein the rearranged plurality of tokens is the modified plurality of tokens arranged in raster order.
6. The apparatus of claim 1, wherein the one or more processors are configured to adjust, using the layer of the encoder, the order of the modified plurality of tokens based on a local proximity relationship of tokens of the modified plurality of tokens.
7. The apparatus of claim 1, wherein the one or more processors are configured to: detect an object based on the relationships between the input features of the embedding dimension and the plurality of tokens.
8. The apparatus of claim 1, wherein the one or more processors are configured to: generate an aerial view representation of the plurality of voxels based on the relationships between the input features of the embedding dimension and the plurality of tokens.
9. The apparatus of claim 1, wherein the one or more processors are configured to process the plurality of voxels to generate the plurality of tokens using partitions of a voxel from the plurality of voxels, wherein the plurality of tokens is associated with values associated with x-y coordinates of the partitions, and wherein tokens of the plurality of tokens are arranged in order based on the x-y coordinates.
10. The apparatus of claim 1, further comprising a sensor configured to capture the 3D data.
11. A method for processing three-dimensional (3D) data, the method comprising: processing a plurality of voxels to generate a plurality of tokens associated with the 3D data; processing, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjusting, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and processing the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.
12. The method of claim 11, wherein the layer of the encoder is a convolution layer that is part of a convolutional neural network (CNN).
13. The method of claim 11, further comprising: processing the embedding dimension and the rearranged plurality of tokens using element-wise multiplication of the embedding dimension and the rearranged plurality of tokens.
14. The method of claim 11, wherein the plurality of tokens is a plurality of one-dimensional (1D) sequential tokens.
15. The method of claim 11, wherein the rearranged plurality of tokens is the modified plurality of tokens arranged in raster order.
16. The method of claim 11, further comprising: adjusting, using the layer of the encoder, the order of the modified plurality of tokens based on a local proximity relationship of tokens of the modified plurality of tokens.
17. The method of claim 11, further comprising: detecting an object based on the relationships between the input features of the embedding dimension and the plurality of tokens.
18. The method of claim 11, further comprising: generating an aerial view representation of the plurality of voxels based on the relationships between the input features of the embedding dimension and the plurality of tokens.
19. The method of claim 11, further comprising: processing the plurality of voxels to generate the plurality of tokens using partitions of a voxel from the plurality of voxels, wherein the plurality of tokens is associated with values associated with x-y coordinates of the partitions, and wherein tokens of the plurality of tokens are arranged in order based on the x-y coordinates.
20. A non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to: process a plurality of voxels to generate a plurality of tokens associated with 3D data; process, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjust, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and process the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.
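As an illustrative, non-limiting sketch of the token ordering recited in claims 4, 5, 9, 14, 15, and 19, tokens that carry integer x-y coordinates can be flattened into a 1D sequence by sorting on those coordinates. The function name `tokens_in_raster_order` and the reading of "raster order" as row-major (increasing y, then increasing x within a row) are assumptions made for this example; this excerpt does not define the order further.

```python
import numpy as np


def tokens_in_raster_order(coords_xy, features):
    """Hypothetical helper: sort tokens into an assumed row-major order.

    coords_xy: (N, 2) integer array of (x, y) coordinates, one per token.
    features:  (N, C) array of token features.
    Returns the coordinates and features as a 1D sequence of N tokens.
    """
    coords_xy = np.asarray(coords_xy)
    features = np.asarray(features)

    # np.lexsort treats the LAST key as the primary key, so this sorts
    # by y first and breaks ties by x (row-major raster order).
    order = np.lexsort((coords_xy[:, 0], coords_xy[:, 1]))
    return coords_xy[order], features[order]


coords = np.array([[1, 1], [0, 0], [1, 0], [0, 1]])
feats = np.array([[10.0], [20.0], [30.0], [40.0]])
sorted_coords, sorted_feats = tokens_in_raster_order(coords, feats)
```

Sorting once up front gives every token a fixed position in the sequence, so a later layer that operates along the token axis sees spatial neighbors within a row at adjacent positions.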

Description

TECHNICAL FIELD

The present disclosure generally relates to processing three-dimensional (3D) data. For example, aspects of the present disclosure relate to systems and methods for 3D point cloud perception.

BACKGROUND

Three-dimensional (3D) perception based on point cloud data has critical applications in the fields of autonomous driving, robotics, and augmented reality. The decreasing cost of light detection and ranging (LIDAR) devices has made real-time autonomous driving increasingly feasible, utilizing both camera and LIDAR sensors. LIDAR sensors can provide accurate 3D location information in varying illumination and weather conditions. LIDAR sensors can allow for data to be captured including ranging information, allowing devices to have a depth measurement (e.g., range measurement) associated with detected objects. The depth measurements can be represented as volumetric pixels (e.g., voxels).

SUMMARY

The following presents a simplified summary relating to one or more aspects disclosed herein. Thus, the following summary should not be considered an extensive overview relating to all contemplated aspects, nor should the following summary be considered to identify key or critical elements relating to all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the following summary has the sole purpose to present certain concepts relating to one or more aspects relating to the mechanisms disclosed herein in a simplified form to precede the detailed description presented below.

In some aspects, an apparatus for processing three-dimensional (3D) data is provided.
The apparatus can include at least one memory and at least one processor coupled to the at least one memory and configured to: process a plurality of voxels to generate a plurality of tokens associated with the 3D data; process, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjust, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and process the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.

In some aspects, a method for processing three-dimensional (3D) data is provided. The method can include: processing a plurality of voxels to generate a plurality of tokens associated with the 3D data; processing, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjusting, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and processing the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.

In some aspects, a non-transitory computer-readable medium is provided having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to: process a plurality of voxels to generate a plurality of tokens associated with the 3D data; process, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; adjust, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and process the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.

In some aspects, an apparatus for processing three-dimensional (3D) data is provided. The apparatus includes: means for processing a plurality of voxels to generate a plurality of tokens associated with the 3D data; means for processing, using a linear layer of an encoder, an embedding dimension of the plurality of tokens to modify the plurality of tokens; means for adjusting, using a layer of the encoder, an order of the modified plurality of tokens to generate a rearranged plurality of tokens; and means for processing the modified plurality of tokens and the rearranged plurality of tokens to determine relationships between input features of the plurality of tokens.

The foregoing has outlined rather broadly the features and technical advantages of examples according to the disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter. The conception and specific examples disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Such equivalent constructions do not depart from the scope of the appended claims. Characteristics of the concepts disclosed herein, both their organization and method of operation, together with associated advantages will be better understood from the following descripti