EP-4738283-A1 - MID-LEVEL ENSEMBLE FOR SENSOR FUSION
Abstract
Computer-implemented methods and related aspects for generating prediction output for an Automated Driving System of a vehicle are disclosed. The computer-implemented method comprises obtaining, by one or more processors, a first sensor dataset originating from a first sensor, where the first sensor dataset comprises information about a portion of a surrounding environment of the vehicle. The computer-implemented method further comprises obtaining, by one or more processors, a second sensor dataset originating from a second sensor different from the first sensor, where the second sensor dataset comprises information about the portion of the surrounding environment of the vehicle. Further, the computer-implemented method comprises processing, by one or more processors, the first sensor dataset using a first ensemble of encoder networks, where the first ensemble of encoder networks comprises two or more encoder networks, each trained to output a first set of encoded features based on the first sensor dataset. The computer-implemented method further comprises processing, by one or more processors, the second sensor dataset using a second ensemble of encoder networks, where the second ensemble of encoder networks comprises two or more encoder networks, each trained to output a second set of encoded features based on the second sensor dataset. The computer-implemented method further comprises fusing, by one or more processors, one or more sets of the first sets of encoded features with one or more sets of the second sets of encoded features using a fusion algorithm configured to fuse encoded features from two or more different datasets and to output a set of fused encoded features, and generating, by one or more processors, a prediction output based on the set of fused encoded features using a decoder network trained to solve a perception task or a planning task for the vehicle based on encoded sensor data features.
Inventors
- JOHNANDER, Joakim
- FATEMI DEZFOULI, MARYAM
- LINDSTRÖM, Carl
Assignees
- Zenseact AB
Dates
- Publication Date
- 20260506
- Application Date
- 20241029
Claims (15)
- A computer-implemented method (S100) for generating prediction output for an Automated Driving System of a vehicle, the computer-implemented method comprising: obtaining (S101), by one or more processors, a first sensor dataset originating from a first sensor, the first sensor dataset comprising information about a portion of a surrounding environment of the vehicle; obtaining (S102), by one or more processors, a second sensor dataset originating from a second sensor different from the first sensor, the second sensor dataset comprising information about the portion of the surrounding environment of the vehicle; processing (S103), by one or more processors, the first sensor dataset using a first ensemble of encoder networks, wherein the first ensemble of encoder networks comprises two or more encoder networks, each trained to output a first set of encoded features based on the first sensor dataset; processing (S104), by one or more processors, the second sensor dataset using a second ensemble of encoder networks, wherein the second ensemble of encoder networks comprises two or more encoder networks, each trained to output a second set of encoded features based on the second sensor dataset; fusing (S107), by one or more processors, one or more sets of the first sets of encoded features with one or more sets of the second sets of encoded features using a fusion algorithm configured to fuse encoded features from two or more different datasets and to output a set of fused encoded features; and generating (S108), by one or more processors, a prediction output based on the set of fused encoded features using a decoder network trained to solve a perception task or a planning task for the vehicle based on encoded sensor data features.
- The computer-implemented method (S100) according to claim 1, wherein the first sensor is of a different sensor modality as compared to the second sensor.
- The computer-implemented method (S100) according to claim 1 or 2, further comprising: transmitting (S109), by one or more processors, the generated prediction output to one or more downstream functions of the Automated Driving System configured to control the vehicle based on the generated prediction output.
- The computer-implemented method (S100) according to any one of claims 1-3, further comprising: storing (S110), by one or more processors, the first sensor dataset, the second sensor dataset, and the prediction output, wherein the prediction output forms pseudo-labels for the first sensor dataset and the second sensor dataset.
- The computer-implemented method (S100) according to any one of claims 1-4, further comprising: selecting (S105a), by one or more processors, one first set of encoded features from a plurality of first sets of encoded features output from the first ensemble of encoder networks; selecting (S105b), by one or more processors, one second set of encoded features from a plurality of second sets of encoded features output from the second ensemble of encoder networks; and wherein the fusing (S107) comprises fusing, by one or more processors, the selected first set of encoded features with the selected second set of encoded features using the fusion algorithm.
- The computer-implemented method (S100) according to any one of claims 1-4, further comprising: averaging (S106a), by one or more processors, a plurality of first sets of encoded features output from the first ensemble of encoder networks; averaging (S106b), by one or more processors, a plurality of second sets of encoded features output from the second ensemble of encoder networks; wherein the fusing (S107) comprises fusing, by one or more processors, the averaged first sets of encoded features with the averaged second sets of encoded features using the fusion algorithm.
- The computer-implemented method (S100) according to any one of claims 1-6, wherein each encoder network of the first ensemble of encoder networks is trained on different training datasets as compared to the other encoder network(s) of the first ensemble of encoder networks; or wherein each encoder network of the second ensemble of encoder networks is trained on different training datasets as compared to the other encoder network(s) of the second ensemble of encoder networks.
- A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the computer-implemented method (S100) according to any one of claims 1-7.
- A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the computer-implemented method (S100) according to any one of claims 1-7.
- A system (10) for generating prediction output for an Automated Driving System (310) of a vehicle (1), the system comprising: a first ensemble (202a) of encoder networks, wherein the first ensemble of encoder networks comprises two or more encoder networks (203a), each trained to output a set of encoded features based on a sensor dataset output from a first sensor (324a); a second ensemble (202b) of encoder networks, wherein the second ensemble of encoder networks comprises two or more encoder networks (203b), each trained to output a set of encoded features based on a sensor dataset output from a second sensor (324b), wherein the first sensor (324a) is different from the second sensor (324b); a fusion algorithm (206) configured to fuse encoded features from two or more different datasets and to output a set of fused encoded features; a decoder network (207) trained to solve a perception task or a planning task for the vehicle based on encoded sensor data features; one or more processors (11) and one or more memory storage areas (12) comprising program code, the one or more memory storage areas and the program code being configured to, with the one or more processors, cause the system (10) to at least: process a first sensor dataset using the first ensemble (202a) of encoder networks in order to obtain a first set of encoded features, the first sensor dataset comprising information about a portion of a surrounding environment of the vehicle; process a second sensor dataset using the second ensemble (202b) of encoder networks in order to obtain a second set of encoded features, the second sensor dataset comprising information about the portion of the surrounding environment of the vehicle; fuse one or more sets of the first sets of encoded features with one or more sets of the second sets of encoded features using the fusion algorithm (206); and generate a prediction output based on the set of fused encoded features using the decoder network (207).
- The system (10) according to claim 10, wherein the first sensor (324a) is of a different sensor modality as compared to the second sensor (324b).
- The system (10) according to claim 10 or 11, wherein the one or more memory storage areas and the program code are further configured to, with the one or more processors, cause the system (10) to at least: transmit the generated prediction output to one or more downstream functions (312, 316, 318) of the Automated Driving System configured to control the vehicle (1) based on the generated prediction output.
- The system (10) according to any one of claims 10-12, wherein the one or more memory storage areas and the program code are further configured to, with the one or more processors, cause the system to at least: store the first sensor dataset, the second sensor dataset, and the prediction output, wherein the prediction output forms pseudo-labels for the first sensor dataset and the second sensor dataset.
- The system (10) according to any one of claims 10-13, wherein each encoder network (203a) of the first ensemble (202a) of encoder networks is trained on different training datasets as compared to the other encoder network(s) of the first ensemble of encoder networks; or wherein each encoder network (203b) of the second ensemble (202b) of encoder networks is trained on different training datasets as compared to the other encoder network(s) of the second ensemble of encoder networks.
- A vehicle (1) comprising: a first sensor (324a) and a second sensor (324b), wherein the first sensor (324a) is different from the second sensor (324b); a system (10) for generating prediction output for an Automated Driving System of the vehicle, the system comprising: a first ensemble (202a) of encoder networks, wherein the first ensemble of encoder networks comprises two or more encoder networks (203a), each trained to output a set of encoded features based on a sensor dataset output from the first sensor (324a); a second ensemble (202b) of encoder networks, wherein the second ensemble of encoder networks comprises two or more encoder networks (203b), each trained to output a set of encoded features based on a sensor dataset output from the second sensor (324b); a fusion algorithm (206) configured to fuse encoded features from two or more different datasets and to output a set of fused encoded features; a decoder network (207) trained to solve a perception task or a planning task for the vehicle based on encoded sensor data features; one or more processors (11) and one or more memory storage areas (12) comprising program code, the one or more memory storage areas and the program code being configured to, with the one or more processors, cause the system to at least: process a first sensor dataset using the first ensemble (202a) of encoder networks in order to obtain a first set of encoded features, the first sensor dataset comprising information about a portion of a surrounding environment of the vehicle; process a second sensor dataset using the second ensemble (202b) of encoder networks in order to obtain a second set of encoded features, the second sensor dataset comprising information about the portion of the surrounding environment of the vehicle; fuse one or more sets of the first sets of encoded features with one or more sets of the second sets of encoded features using the fusion algorithm (206); and generate a prediction output based on the set of fused encoded features using the decoder network (207).
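The claimed processing pipeline can be sketched in plain Python as follows. This is a minimal, hypothetical illustration, not the disclosed implementation: the encoder, fusion and decoder functions are stand-ins represented as simple callables over feature lists, and element-wise averaging of the per-modality ensemble outputs (the strategy of claim 6) is assumed as the combination step.

```python
from typing import Callable, List

Features = List[float]
Encoder = Callable[[List[float]], Features]

def mid_level_ensemble_predict(
    first_dataset: List[float],
    second_dataset: List[float],
    first_ensemble: List[Encoder],
    second_ensemble: List[Encoder],
    fuse: Callable[[Features, Features], Features],
    decode: Callable[[Features], float],
) -> float:
    # Each encoder in an ensemble outputs its own set of encoded features
    # for the same sensor dataset (claims 1 and 10).
    first_sets = [encoder(first_dataset) for encoder in first_ensemble]
    second_sets = [encoder(second_dataset) for encoder in second_ensemble]

    # Combine the per-modality feature sets by element-wise averaging;
    # selecting a single set (claim 5) is the alternative strategy.
    def average(sets: List[Features]) -> Features:
        return [sum(values) / len(sets) for values in zip(*sets)]

    # Fuse the two modalities at the feature ("mid") level, then run the
    # single shared decoder once to obtain the prediction output.
    fused = fuse(average(first_sets), average(second_sets))
    return decode(fused)
```

With toy stand-ins, for example list concatenation as the fusion algorithm and a sum as the decoder, the decoder runs exactly once regardless of how many encoders each ensemble contains, which is the point of ensembling at the encoded-feature level rather than at the output level.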
Description
TECHNICAL FIELD The disclosed technology relates to methods and systems for generating prediction output for an Automated Driving System (ADS) of a vehicle. In particular, but not exclusively, the disclosed technology relates to an architecture for a perception functionality of an ADS utilizing a mid-level ensemble of artificial neural networks. BACKGROUND Deep neural networks (DNNs) are today used in many different fields of technology. The ability of DNNs to identify and analyse complex relationships in data has made them suitable for automation of different tasks. In this capacity, DNNs have for instance found many useful functions within the field of computer vision, such as object detection and classification tasks. More specifically, DNNs can be used to allow computers to obtain a high-level understanding from digital images or video in order to form their perception of the world around them. An example of such an application is within the field of autonomous driving. Today, there is ongoing research and development within a number of technical areas associated with both the Advanced Driver Assistance Systems (ADAS) field and the Autonomous Driving (AD) field. ADAS and AD will herein be referred to under the common term Automated Driving System (ADS), corresponding to all of the different levels of automation as for example defined by the SAE J3016 levels (1-5) of driving automation, and in particular levels 4 and 5. ADS solutions have already found their way into a majority of the new cars on the market, with prospects of utilization only rising in the not too distant future. An ADS may be construed as a complex combination of various components that can be defined as systems where perception, decision making, and operation of the vehicle are performed by electronics and machinery instead of, or in tandem with, a human driver, and as an introduction of automation into road traffic. This includes handling of the vehicle, the destination, as well as awareness of the surroundings.
While the automated system has control over the vehicle, it allows the human operator to leave all or at least some responsibilities to the system. An ADS commonly combines a variety of sensors to perceive the vehicle's surroundings, such as, for example, radar, lidar, sonar, cameras, a navigation system (e.g. GPS), odometers and/or inertial measurement units (IMUs), upon which advanced control systems may interpret sensory information to identify appropriate navigation paths, as well as obstacles, free-space areas, and/or relevant signage. While improved accuracy and robustness of DNNs are constantly sought after, a trade-off between complexity (e.g. in terms of size and network architecture) and computational efficiency (e.g. in terms of execution time, memory and processing power requirements) has to be made. One solution addressing the former aspect is so-called ensemble networks. Ensemble networks utilize an ensemble of different DNNs to obtain improved accuracy and robustness. More specifically, instead of just using a single DNN for a specific task, the input is fed through the ensemble of DNNs, and a combined output is formed from the individual outputs of the DNNs. However, this way of implementing ensemble networks naturally leads to longer execution times and requires more computational power. Thus, it may make them unsuitable for applications where the DNNs are to be run on a continuous feed of input data in real time, and for being run on resource-limited hardware. There is therefore a need for new and improved solutions for performing perception tasks, in particular for automated driving systems. SUMMARY The herein disclosed technology seeks to mitigate, alleviate or eliminate one or more of the above-identified deficiencies and disadvantages in the prior art to address various problems relating to accuracy and computational needs for solving perception tasks in automated driving systems.
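The conventional output-level ensembling discussed in the background, where every member network runs end to end and only the final outputs are combined, can be sketched as follows. This is a hypothetical plain-Python illustration; the averaging combiner and the callable-based model representation are assumptions made for brevity, not part of the disclosure.

```python
from typing import Callable, List

def output_level_ensemble(
    x: List[float],
    models: List[Callable[[List[float]], float]],
) -> float:
    # Every full network in the ensemble processes the input end to end,
    # so inference time and compute grow linearly with the ensemble size.
    outputs = [model(x) for model in models]
    # Form a combined output from the individual outputs, here by averaging.
    return sum(outputs) / len(outputs)
```

Because each member runs in full, this scheme multiplies execution time and memory roughly by the number of ensemble members. That cost is the drawback motivating the mid-level alternative, in which only the encoders are ensembled and a single fusion step and a single decoder are shared.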
Various aspects and embodiments of the disclosed technology are defined below and in the accompanying independent and dependent claims. A first aspect of the disclosed technology comprises a computer-implemented method for generating prediction output for an Automated Driving System of a vehicle. The computer-implemented method comprises obtaining, by one or more processors, a first sensor dataset originating from a first sensor, where the first sensor dataset comprises information about a portion of a surrounding environment of the vehicle. The computer-implemented method further comprises obtaining, by one or more processors, a second sensor dataset originating from a second sensor different from the first sensor, where the second sensor dataset comprises information about the portion of the surrounding environment of the vehicle. Further, the computer-implemented method comprises processing, by one or more processors, the first sensor dataset using a first ensemble of encoder networks, where the first ensemble of encoder networks comprises