US-12626478-B2 - Method for omnidirectional dense regression for machine perception tasks via distortion-free CNN and spherical self-attention

US 12626478 B2

Abstract

A method and device for performing a perception task are disclosed. The method and device incorporate a dense regression model. The dense regression model advantageously incorporates a distortion-free convolution technique that is designed to accommodate and appropriately handle the varying levels of distortion present in different regions of omnidirectional images. In addition to distortion-free convolution, the dense regression model utilizes a transformer that incorporates a spherical self-attention, which uses distortion-free image embeddings to compute an appearance attention and uses spherical distance to compute a positional attention.

Inventors

  • Yuliang Guo
  • Zhixin Yan
  • Yuyan Li
  • Xinyu Huang
  • Liu Ren

Assignees

  • ROBERT BOSCH GMBH

Dates

Publication Date
2026-05-12
Application Date
2021-12-13

Claims (20)

  1. A method for operating a device to perform a perception task, the method comprising: receiving, with a processor of the device, an omnidirectional image of an environment; generating, with the processor of the device, first encoded features based on the omnidirectional image using a convolutional neural network encoder; generating, with the processor, second encoded features based on the first encoded features using a transformer neural network; and generating, with the processor, final perception outputs based on the second encoded features using a convolutional neural network decoder.
  2. The method of claim 1, wherein (i) the omnidirectional image has a format that includes varying levels of image distortion across different regions of the omnidirectional image and (ii) the convolutional neural network encoder and the convolutional neural network decoder each perform convolution operations that take into account the varying levels of image distortion across the different regions of the omnidirectional image.
  3. The method of claim 1, wherein the transformer neural network incorporates a self-attention matrix that combines (i) a feature-similarity-based self-attention and (ii) a spherical-distance-based self-attention.
  4. The method of claim 1, the generating the final perception outputs further comprising: generating the final perception outputs based on the second encoded features and based on intermediate encoded features from the convolutional neural network encoder that are provided to the convolutional neural network decoder via skip connections.
  5. The method of claim 1, wherein: the generating the first encoded features includes performing a first sequence of convolution operations on the omnidirectional image using the convolutional neural network encoder; and the generating the final perception outputs includes performing a second sequence of convolution operations on the second encoded features using the convolutional neural network decoder.
  6. The method of claim 5, wherein the performing of at least one respective convolution operation in at least one of the first sequence of convolution operations and the second sequence of convolution operations comprises: generating a plurality of perspective projection input feature patches based on omnidirectional input features provided as input to the at least one respective convolution operation; generating a plurality of perspective projection output feature patches by performing a convolution operation on the plurality of perspective projection input feature patches; and generating omnidirectional output features as an output of the at least one respective convolution operation based on the plurality of perspective projection output feature patches.
  7. The method of claim 6, the generating the plurality of perspective projection input feature patches further comprising: defining a plurality of virtual cameras each having a defined field of view and a defined camera pose; and generating each respective perspective projection input feature patch in the plurality of perspective projection input feature patches by projecting features of the omnidirectional input features onto a respective image plane depending on a respective virtual camera of the plurality of virtual cameras.
  8. The method of claim 7, wherein viewing frustums of the plurality of virtual cameras overlap such that the plurality of perspective projection input feature patches each have a padding.
  9. The method of claim 7, the generating the omnidirectional output features further comprising: projecting each respective perspective projection output feature patch of the plurality of perspective projection output feature patches into an omnidirectional image domain using the respective virtual camera of the plurality of virtual cameras.
  10. The method of claim 5, the generating the first encoded features further comprising: performing a respective pooling operation using the convolutional neural network encoder after performing at least one respective convolution operation in the first sequence of convolution operations.
  11. The method of claim 5, the generating the final perception outputs further comprising: performing a respective upsampling operation using the convolutional neural network decoder after performing at least one respective convolution operation in the second sequence of convolution operations.
  12. The method of claim 5, the generating the final perception outputs further comprising: receiving, at the convolutional neural network decoder, intermediate encoded features from the convolutional neural network encoder via a skip connection; and concatenating the intermediate encoded features with intermediate decoded features of the convolutional neural network decoder.
  13. The method of claim 5, the generating the final perception outputs further comprising: reshaping the second encoded features before performing the second sequence of convolution operations on the second encoded features.
  14. The method of claim 1, the generating the second encoded features further comprising: generating a plurality of feature vectors each having a predetermined length based on the first encoded features; and generating, using the transformer neural network, the second encoded features based on the plurality of feature vectors.
  15. The method of claim 14, the generating the second encoded features further comprising: generating, using the transformer neural network, a first self-attention matrix based on a comparison of each feature vector in the plurality of feature vectors with each other feature vector in the plurality of feature vectors; generating, using the transformer neural network, a second self-attention matrix based on a spherical distance between each feature vector in the plurality of feature vectors and each other feature vector in the plurality of feature vectors; and generating, using the transformer neural network, the second encoded features based on the first self-attention matrix and the second self-attention matrix.
  16. The method of claim 15, the generating the first self-attention matrix further comprising: forming an input feature matrix with the plurality of feature vectors; generating a first intermediate feature matrix by performing a first convolution operation on the input feature matrix; transposing the first intermediate feature matrix; generating a second intermediate feature matrix by performing a second convolution operation on the input feature matrix; and generating the first self-attention matrix based on a product of the transposed first intermediate feature matrix and the second intermediate feature matrix.
  17. The method of claim 16, the generating the second encoded features further comprising: generating a third intermediate feature matrix by performing a third convolution operation on the input feature matrix; generating a third self-attention matrix by summing the first self-attention matrix and the second self-attention matrix; determining a fourth intermediate feature matrix by determining a product of the third self-attention matrix and the third intermediate feature matrix; and determining an output feature matrix by performing a fourth convolution operation on the fourth intermediate feature matrix, the second encoded features being determined based on the output feature matrix.
  18. The method of claim 1, wherein the perception task includes at least one of depth estimation and semantic segmentation, and the final perception outputs for the omnidirectional image include at least one of depth estimations and semantic segmentation labels for the omnidirectional image.
  19. The method of claim 1, further comprising: capturing, with a 360-camera sensor of the device, the omnidirectional image of the environment.
  20. A device for performing a perception task, the device comprising: a 360-camera sensor configured to capture an omnidirectional image of an environment; a memory configured to store a neural network model including a convolutional neural network encoder, a transformer neural network, and a convolutional neural network decoder; and a processor operably connected to the 360-camera sensor and the memory, the processor being configured to: generate first encoded features based on the omnidirectional image using the convolutional neural network encoder; generate second encoded features based on the first encoded features using the transformer neural network; and generate final perception outputs based on the second encoded features using the convolutional neural network decoder.
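
To make the claimed architecture concrete, the following is a minimal sketch of the encoder-transformer-decoder pipeline recited in claims 1 and 20, assuming PyTorch. The patent discloses no source code, so every module name, layer size, and the plain Conv2d/TransformerEncoder building blocks below are illustrative assumptions standing in for the distortion-aware operations detailed in claims 6-9 and 15-17.

```python
# Minimal sketch of the claim-1 / claim-20 pipeline (PyTorch assumed).
# The distortion-free convolution and spherical self-attention of the
# other claims are replaced here by plain building blocks for brevity.
import torch
import torch.nn as nn

class DenseRegressionModel(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, out_ch=1):
        super().__init__()
        # Convolutional encoder: first encoded features (claim 1),
        # with a pooling operation after a convolution (claim 10).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Transformer: second encoded features (claim 1).
        layer = nn.TransformerEncoderLayer(d_model=feat_ch, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Convolutional decoder: final perception outputs (claim 1),
        # with an upsampling operation (claim 11).
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(feat_ch, out_ch, 3, padding=1),
        )

    def forward(self, omni_image):
        f1 = self.encoder(omni_image)                # first encoded features
        b, c, h, w = f1.shape
        tokens = f1.flatten(2).transpose(1, 2)       # feature vectors (claim 14)
        f2 = self.transformer(tokens)                # second encoded features
        f2 = f2.transpose(1, 2).reshape(b, c, h, w)  # reshaping (claim 13)
        return self.decoder(f2)                      # final perception outputs
```

For example, DenseRegressionModel()(torch.randn(1, 3, 64, 128)) returns a (1, 1, 64, 128) dense map of the kind claim 18 describes for depth estimation; a segmentation variant would set out_ch to the number of classes.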
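Claims 6-9 describe a distortion-free convolution in which omnidirectional input features are resampled onto perspective patches defined by virtual cameras before an ordinary convolution is applied. The sketch below, again assuming PyTorch and an equirectangular feature layout, illustrates the forward projection step via a standard inverse gnomonic mapping; the back-projection of output patches into the omnidirectional domain (claim 9) and longitude wrap-around are omitted for brevity, and all function names and parameter values are hypothetical.

```python
# Hypothetical sketch of the distortion-free convolution of claims 6-9
# (PyTorch assumed): equirectangular features are resampled onto
# tangent-plane (perspective) patches, then convolved as usual.
import math
import torch
import torch.nn.functional as F

def gnomonic_grid(lon0, lat0, fov_deg, size):
    """Sampling grid pulling a perspective patch (tangent-plane view of a
    virtual camera at (lon0, lat0)) out of an equirectangular map, using
    the inverse gnomonic projection."""
    half = math.tan(math.radians(fov_deg) / 2.0)
    xs = torch.linspace(-half, half, size)
    y, x = torch.meshgrid(xs, xs, indexing="ij")
    rho = torch.sqrt(x**2 + y**2).clamp(min=1e-8)
    c = torch.atan(rho)
    lat = torch.asin(torch.cos(c) * math.sin(lat0)
                     + y * torch.sin(c) * math.cos(lat0) / rho)
    lon = lon0 + torch.atan2(
        x * torch.sin(c),
        rho * math.cos(lat0) * torch.cos(c) - y * math.sin(lat0) * torch.sin(c))
    # Normalized equirectangular coords in [-1, 1]; row 0 is assumed to be
    # +90 degrees latitude (sign conventions depend on the image layout).
    u = lon / math.pi
    v = -2.0 * lat / math.pi
    return torch.stack([u, v], dim=-1).unsqueeze(0)  # (1, size, size, 2)

def perspective_patches(omni_feats, cameras, fov_deg=60.0, size=32):
    """Claim 7: one perspective projection input feature patch per virtual
    camera. Overlapping frustums (claim 8) provide the patch padding."""
    patches = []
    for lon0, lat0 in cameras:
        grid = gnomonic_grid(lon0, lat0, fov_deg, size).to(omni_feats)
        patches.append(F.grid_sample(omni_feats, grid, align_corners=True))
    return torch.cat(patches, dim=0)

# Usage: extract patches, then run a standard convolution on them, which
# is now approximately distortion-free within each patch (claim 6).
omni = torch.randn(1, 64, 256, 512)                  # equirectangular features
cams = [(lon, 0.0) for lon in (-math.pi / 2, 0.0, math.pi / 2)]
patches = perspective_patches(omni, cams)            # (3, 64, 32, 32)
out_patches = F.conv2d(patches, torch.randn(64, 64, 3, 3), padding=1)
```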
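Claims 15-17 specify a spherical self-attention that sums a feature-comparison attention matrix with a spherical-distance attention matrix. The sketch below (PyTorch assumed) uses 1x1 convolutions for the four convolution operations of claims 16-17 and the great-circle formula d = arccos(sin(lat1)sin(lat2) + cos(lat1)cos(lat2)cos(lon1 - lon2)) for the spherical distance; negating the distance and normalizing with a softmax are assumptions of this sketch, not details recited in the claims.

```python
# Hypothetical sketch of the spherical self-attention of claims 15-17
# (PyTorch assumed). Names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphericalSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions stand in for the four convolution operations
        # recited in claims 16-17.
        self.q = nn.Conv1d(channels, channels, 1)    # first conv (claim 16)
        self.k = nn.Conv1d(channels, channels, 1)    # second conv (claim 16)
        self.v = nn.Conv1d(channels, channels, 1)    # third conv (claim 17)
        self.out = nn.Conv1d(channels, channels, 1)  # fourth conv (claim 17)

    @staticmethod
    def great_circle(lonlat):
        """Pairwise great-circle distance between the N token positions."""
        lon, lat = lonlat[:, 0], lonlat[:, 1]
        cosd = (torch.sin(lat)[:, None] * torch.sin(lat)[None, :]
                + torch.cos(lat)[:, None] * torch.cos(lat)[None, :]
                * torch.cos(lon[:, None] - lon[None, :]))
        return torch.acos(cosd.clamp(-1.0, 1.0))     # (N, N)

    def forward(self, feats, lonlat):
        # feats: (B, C, N) input feature matrix; lonlat: (N, 2) positions.
        q = self.q(feats).transpose(1, 2)            # transposed (claim 16)
        k = self.k(feats)
        appearance = torch.bmm(q, k)                 # first matrix (claim 16)
        position = -self.great_circle(lonlat)        # second matrix (claim 15)
        attn = F.softmax(appearance + position, -1)  # summed (claim 17)
        v = self.v(feats)                            # third intermediate matrix
        out = torch.bmm(v, attn.transpose(1, 2))     # fourth intermediate matrix
        return self.out(out)                         # output feature matrix
```

How the appearance and positional terms are weighted relative to one another before normalization is a design choice the claims leave open.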

Description

FIELD

The system and method disclosed in this document relate to machine perception and, more particularly, to omnidirectional dense regression for machine perception tasks via distortion-free CNN and spherical self-attention.

BACKGROUND

Unless otherwise indicated herein, the materials described in this section are not admitted to be prior art by inclusion in this section.

Omnidirectional images, also called 360 images or panoramic images, are among the most popular image types for applications such as virtual reality, autonomous driving, and robotics. Additionally, omnidirectional dense regression problems are critical to the operation of three-dimensional or omnidirectional measurement tools, especially when visual interaction with human beings or the production of interpretable outputs is desired. Typical dense regression problems include depth estimation and semantic segmentation, where both local feature encoding and global feature encoding are required for high levels of performance.

Previous attempts at solving these dense regression problems were based on a deep structure of local encoding layers, such as a Fully Convolutional Network (FCN). However, FCNs have limitations both with respect to dense regression problems and with respect to processing omnidirectional images. Firstly, FCNs lack the global context that is critical for determining the physical scale for depth estimation or for inferring the overall layout of a semantically meaningful scene. Secondly, FCNs have significant drawbacks when applied to omnidirectional images because omnidirectional images include different levels of image distortion within different regions of the image, yet a conventional FCN processes each region of the image equivalently.

Recently, an emerging technique for handling global encoding is the self-attention module, which is integrated as a core part of Transformer architectures. The self-attention module is highly suitable for dense regression problems, such as depth estimation or semantic segmentation, because it explicitly utilizes long-range contextual information from different regions. However, the typical design of the self-attention module is not suitable for omnidirectional images for at least two reasons. Firstly, different regions of an omnidirectional image include different levels of image distortion, such that the hidden features of different regions are not directly comparable to one another. Secondly, the position embedding utilized in Transformer architectures is not compatible with omnidirectional space, such that the position embedding is not effective.

Accordingly, what is needed is a technique for processing omnidirectional images for dense regression problems that incorporates both local feature encoding and global feature encoding, and that takes into account the varying levels of distortion present in omnidirectional image formats.

SUMMARY

A method for operating a device to perform a perception task is disclosed. The method comprises receiving, with a processor of the device, an omnidirectional image of an environment. The method further comprises generating, with the processor of the device, first encoded features based on the omnidirectional image using a convolutional neural network encoder. The method further comprises generating, with the processor, second encoded features based on the first encoded features using a transformer neural network. The method further comprises generating, with the processor, final perception outputs based on the second encoded features using a convolutional neural network decoder.

A device for performing a perception task is disclosed. The device comprises a 360-camera sensor configured to capture an omnidirectional image of an environment. The device further comprises a memory configured to store a neural network model including a convolutional neural network encoder, a transformer neural network, and a convolutional neural network decoder. The device further comprises a processor operably connected to the 360-camera sensor and the memory. The processor is configured to generate first encoded features based on the omnidirectional image using the convolutional neural network encoder. The processor is further configured to generate second encoded features based on the first encoded features using the transformer neural network. The processor is further configured to generate final perception outputs based on the second encoded features using the convolutional neural network decoder.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing aspects and other features of the method and system are explained in the following description, taken in connection with the accompanying drawings.

FIG. 1 summarizes a dense regression model for performing a machine perception task in the omnidirectional image domain.

FIG. 2 shows an exemplary end-user device that incorporates the dense regression model of FIG. 1 to perform the machine perception task.

FIG. 3 shows a m