EP-4057183-B1 - METHODS AND SYSTEMS FOR OBJECT DETECTION


Inventors

SU, YU
SCHOELER, MARKUS

Dates

Publication Date: 20260506
Application Date: 20210310

Claims (13)

Computer implemented method for object detection, the method comprising the following steps carried out by computer hardware components:
- determining an output (408) of a first pooling layer (424) based on input data (402);
- determining an output (412) of a dilated convolution layer (428), wherein the input of the dilated convolution layer (428) is the output (408) of the first pooling layer (424);
- determining an output (414) of a second pooling layer (430), wherein the input of the second pooling layer (430) is the output (412) of the dilated convolution layer (428);
- determining an output of a further dilated convolution layer (506), wherein the input of the further dilated convolution layer (506) is the output of the second pooling layer (430);
- determining an output of a further pooling layer (508), wherein the input of the further pooling layer (508) is the output of the further dilated convolution layer (506);
- up-sampling the output of the dilated convolution layer (428) and the output of the further dilated convolution layer (506) to a pre-determined resolution;
- concatenating (528) the output of the dilated convolution layer (428) and the output of the further dilated convolution layer (506); and
- carrying out the object detection based on the concatenated output (542, 544).
The method of claim 1, wherein a dilation rate of a kernel of the dilated convolution layer (428) is different from a dilation rate of a kernel of the further dilated convolution layer (506).
The method of at least one of claims 1 to 2, wherein each of the first pooling layer (424) and the second pooling layer (430) comprises either mean pooling or max pooling.
The method of at least one of claims 1 to 3, wherein a respective kernel of each of the first pooling layer (424) and the second pooling layer (430) is of size 2x2, 3x3, 4x4 or 5x5.
The method of at least one of claims 1 to 4, wherein a respective kernel of each of the first pooling layer (424) and the second pooling layer (430) depends on a size of a kernel of the dilated convolution layer (428).
The method of at least one of claims 1 to 5, wherein the input data (402) comprises a 2D grid with channels.
The method of at least one of claims 1 to 6, wherein the input data (402) is determined based on data from a radar system.
The method of at least one of claims 1 to 7, wherein the input data (402) is determined based on at least one of data from a LIDAR system or data from a camera system.
The method of at least one of claims 1 to 8, wherein the object detection is carried out using a detection head (546, 548).
The method of claim 9, wherein the detection head (546, 548) comprises an artificial neural network.
Computer system (700), the computer system (700) comprising a plurality of computer hardware components configured to carry out steps of the computer implemented method of at least one of claims 1 to 10.
Vehicle, comprising the computer system (700) of claim 11 and a sensor (708, 710), wherein the input data (402) is determined based on an output of the sensor (708, 710).
Non-transitory computer readable medium comprising instructions for carrying out the computer implemented method of at least one of claims 1 to 10.
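For illustration of the dilation rates referred to in claim 2, the following minimal sketch computes how many input cells one side of a dilated kernel spans. The relation used is a general property of dilated convolutions and is not stated in the claims; the function name `effective_kernel_extent` is hypothetical.

```python
def effective_kernel_extent(kernel_size: int, dilation_rate: int) -> int:
    """Number of input cells spanned by one side of a dilated kernel.

    A dilation rate of 1 corresponds to an ordinary (dense) convolution.
    """
    return dilation_rate * (kernel_size - 1) + 1


# A 3x3 kernel with different dilation rates (rates chosen for illustration).
for rate in (1, 2, 3):
    print(rate, effective_kernel_extent(3, rate))
```

Under this relation, two layers with the same kernel size but different dilation rates cover differently sized neighbourhoods of their respective inputs.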

Description

FIELD
The present disclosure relates to methods and systems for object detection.

BACKGROUND
Various sensors, such as cameras, radar sensors or LIDAR sensors, may be used in automotive applications to monitor the environment of a vehicle. Driver assistance systems may make use of data captured from the sensors, for example by analyzing the data to detect objects. For object detection, convolutional neural networks (CNN) may be used. However, object detection may be a cumbersome task.
Accordingly, there is a need to provide methods and systems for object detection that lead to efficient and accurate results.
Jamie Sherrah: "Fully Convolutional Networks for Dense Semantic Labelling of High-Resolution Aerial Imagery", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 June 2016, discloses a method for labelling high-resolution aerial imagery, where fine boundary detail is important. Therefore, a full-resolution labelling is inferred using a deep fully convolutional network with no downsampling, using the atrous method and obviating the need for deconvolution or interpolation.
US 2020/175313 A1 discloses a neural network method and a neural network apparatus with dilated convolution. The neural network repeatedly performs convolution operations and sub-sampling operations through several layers.
In CN 112 464 930 A, a target detection method using a convolutional neural network is disclosed.
Zhao Hengshuang et al: "Pyramid Scene Parsing Network", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 21 July 2017, pages 6230-6239, ISSN: 1063-6919, DOI: 10.1109/CVPR.2017.660, discloses an evaluation of global context information by different-region-based context aggregation through a pyramid pooling module together with a pyramid scene parsing network (PSPNet).
In US 2018/260956 A1, a system and method for semantic segmentation using hybrid dilated convolution (HDC) is disclosed.

SUMMARY
The present disclosure provides a computer implemented method, a computer system, a vehicle and a non-transitory computer readable medium according to the independent claims. Embodiments are given in the subclaims, the description and the drawings.
In one aspect, the present disclosure is directed at a computer implemented method for object detection, the method comprising the following steps performed (in other words: carried out) by computer hardware components:
- determining an output of a first pooling layer based on input data;
- determining an output of a dilated convolution layer, wherein the input of the dilated convolution layer is the output of the first pooling layer;
- determining an output of a second pooling layer, wherein the input of the second pooling layer is the output of the dilated convolution layer.
The method further comprises the following steps performed (in other words: carried out) by the computer hardware components:
- determining an output of a further dilated convolution layer, wherein the input of the further dilated convolution layer is the output of the second pooling layer;
- determining an output of a further pooling layer, wherein the input of the further pooling layer is the output of the further dilated convolution layer;
- up-sampling the output of the dilated convolution layer and the output of the further dilated convolution layer to a pre-determined resolution;
- concatenating the output of the dilated convolution layer and the output of the further dilated convolution layer; and
- carrying out the object detection based on the concatenated output.
In other words, pooling operations in the first pooling layer are based on the input data to determine the output of the first pooling layer. The input data may be subjected to further layers (for example a further dilated convolution layer) before the pooling operations are carried out. The output of the first pooling layer is the input of the dilated convolution layer. The dilated convolution layer directly follows the first pooling layer (in other words: no further layer is provided between the first pooling layer and the dilated convolution layer). Dilated convolution operations in the dilated convolution layer determine the output of the dilated convolution layer. The output of the dilated convolution layer is the input of the second pooling layer. The second pooling layer directly follows the dilated convolution layer (in other words: no further layer is provided between the dilated convolution layer and the second pooling layer). Conversely, immediately before the second pooling layer, there is the dilated convolution layer, and immediately before the dilated convolution layer, there is the first pooling layer. The object detection is based on at least the output of the dilated convolution layer or the output of the second pooling layer. It will be understood that one or more layers may be provided between the input data and the first pooling layer.
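By way of illustration only, the sequence of layers described above may be sketched in plain Python as follows. The sketch is not taken from the disclosure and rests on several assumptions: a single-channel 2D grid as input, 2x2 max pooling with stride 2, a 3x3 kernel with zero padding, dilation rates of 2 and 3, nearest-neighbour up-sampling, and a pre-determined resolution of 8x8. All function names (`max_pool`, `dilated_conv`, `upsample_nearest`, `backbone`) are hypothetical, and the detection head is omitted.

```python
def max_pool(grid, k=2):
    """Non-overlapping k x k max pooling (stride k); incomplete windows are dropped."""
    h, w = len(grid) // k, len(grid[0]) // k
    return [[max(grid[i * k + a][j * k + b] for a in range(k) for b in range(k))
             for j in range(w)] for i in range(h)]


def dilated_conv(grid, kernel, rate):
    """Same-size 2D dilated convolution (cross-correlation) with zero padding."""
    h, w, k = len(grid), len(grid[0]), len(kernel)
    c = k // 2  # index of the kernel centre
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    # Kernel taps are spaced `rate` cells apart.
                    y, x = i + (a - c) * rate, j + (b - c) * rate
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * grid[y][x]
            out[i][j] = acc
    return out


def upsample_nearest(grid, height, width):
    """Nearest-neighbour resampling to height x width (an assumed method)."""
    h, w = len(grid), len(grid[0])
    return [[grid[i * h // height][j * w // width] for j in range(width)]
            for i in range(height)]


def backbone(input_grid, kernel, rate_1=2, rate_2=3, out_size=(8, 8)):
    p1 = max_pool(input_grid)              # first pooling layer
    d1 = dilated_conv(p1, kernel, rate_1)  # dilated convolution, directly after p1
    p2 = max_pool(d1)                      # second pooling layer, directly after d1
    d2 = dilated_conv(p2, kernel, rate_2)  # further dilated convolution layer
    p3 = max_pool(d2)                      # further pooling layer
    # Bring both dilated-convolution outputs to the same resolution, then
    # concatenate them along a channel axis (here: a list of 2D grids).
    features = [upsample_nearest(d1, *out_size), upsample_nearest(d2, *out_size)]
    return features, p3


grid = [[float((r * 16 + c) % 7) for c in range(16)] for r in range(16)]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
features, pooled = backbone(grid, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # 2 8 8
```

In this sketch the concatenated `features` would be passed to a detection head, while the output of the further pooling layer is returned separately so that additional stages could consume it.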