EP-4738284-A1 - A METHOD FOR CLASSIFYING A TRAFFIC SIGN IN AN IMAGE

EP 4738284 A1

Abstract

The herein disclosed technology relates to a computer-implemented method (100) for classifying a traffic sign in an image, as well as a computing device and vehicle thereof. The method (100) comprises: obtaining (S102) the image, captured by a camera of a vehicle, depicting at least a portion of a surrounding environment of the vehicle; identifying (S104) a region in the image corresponding to a traffic sign, by processing the image through a first machine learning model configured to output detections of traffic signs in input images; extracting (S106), from the image, a crop corresponding to the identified region, wherein the crop has a native resolution based on a size of the identified region in relation to the obtained image; and determining (S108) classification data of the traffic sign by processing the crop, at the native resolution, through a second machine learning model, wherein the second machine learning model is an attention-based neural network, trained to process input images of traffic signs of varying resolution and to generate corresponding classification data, wherein the second machine learning model applies attention on a pixel-level of the input images.
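A central point of the abstract is that the crop handed to the classifier keeps its native resolution, i.e. its size follows the detected region in the full image with no resampling. The following minimal Python sketch illustrates this step; the function name, frame size, and detection boxes are hypothetical and not part of the disclosure:

```python
import numpy as np

def extract_native_crop(image: np.ndarray, box: tuple) -> np.ndarray:
    """Extract a crop at native resolution: the crop's size follows the
    detected region's size in the full image, with no resampling."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

# Hypothetical detections from a first-stage detector: two signs at
# different distances yield crops of different native resolutions.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
near = extract_native_crop(frame, (100, 200, 196, 296))  # 96x96 crop
far = extract_native_crop(frame, (900, 500, 924, 524))   # 24x24 crop
print(near.shape, far.shape)  # both are passed to the classifier as-is
```

Because no up- or down-sampling is applied, the second-stage model must itself cope with inputs of varying size, which is what the pixel-level attention mechanism provides.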

Inventors

  • VERBEKE, Willem
  • MÅNSSON, Olle

Assignees

  • Zenseact AB

Dates

Publication Date
2026-05-06
Application Date
2024-10-31

Claims (15)

  1. A computer-implemented method (100) for classifying a traffic sign in an image, the method (100) comprising: obtaining (S102) the image, captured by a camera of a vehicle, depicting at least a portion of a surrounding environment of the vehicle; identifying (S104) a region in the image corresponding to a traffic sign, by processing the image through a first machine learning model configured to output detections of traffic signs in input images; extracting (S106), from the image, a crop corresponding to the identified region, wherein the crop has a native resolution based on a size of the identified region in relation to the obtained image; and determining (S108) classification data of the traffic sign by processing the crop, at the native resolution, through a second machine learning model, wherein the second machine learning model is an attention-based neural network, trained to process input images of traffic signs of varying resolution and to generate corresponding classification data, wherein the second machine learning model applies attention on a pixel-level of the input images.
  2. The method (100) according to claim 1, wherein the second machine learning model has been trained using a training dataset comprising a plurality of images of a variety of different resolutions, and wherein each image of the plurality of images has associated annotation data.
  3. The method (100) according to claim 1 or 2, wherein the second machine learning model has a transformer-based architecture.
  4. The method (100) according to any one of the claims 1 to 3, wherein the second machine learning model comprises at least one cross-attention module and at least one self-attention module.
  5. The method (100) according to any one of the claims 1 to 4, wherein processing the crop through the second machine learning model comprises: flattening the crop into a numerical input data array; obtaining a latent array having a set of initial values; updating the latent array by alternatingly processing the input data array and a latent array through the cross-attention module and the self-attention module for a number of iterations, thereby generating an updated latent array; and predicting classification data for the traffic sign, based on the updated latent array.
  6. The method (100) according to any one of the claims 1 to 5, wherein the first machine learning model is a traffic sign detection model, and wherein the second machine learning model is a traffic sign classification model.
  7. The method (100) according to any one of the claims 1 to 6, wherein the classification data is indicative of a type of the traffic sign.
  8. The method (100) according to any one of the claims 1 to 7, further comprising determining (S110) vehicle control data based on the determined classification data.
  9. The method (100) according to claim 8, further comprising transmitting (S112) the vehicle control data to a control system of the vehicle.
  10. The method (100) according to any one of the claims 1 to 9, further comprising displaying (S114) the classification data on a display device of the vehicle, by rendering the classification data as a graphical representation on the display device.
  11. A computer program product comprising instructions, which when the program is executed by a computing device, causes the computing device to carry out the method (100) according to any one of the claims 1 to 10.
  12. A computing device (200) for classifying a traffic sign in an image, the computing device (200) comprising control circuitry (202) configured to: obtain the image, captured by a camera of a vehicle, depicting at least a portion of a surrounding environment of the vehicle; identify a region in the image corresponding to a traffic sign, by processing the image through a first machine learning model configured to output detections of traffic signs in input images; extract, from the image, a crop corresponding to the identified region, wherein the crop has a native resolution based on a size of the identified region in relation to the obtained image; and determine classification data of the traffic sign by processing the crop, at the native resolution, through a second machine learning model, wherein the second machine learning model is an attention-based neural network, trained to process input images of traffic signs of varying resolution and to generate corresponding classification data, wherein the second machine learning model applies attention on a pixel-level of the input images.
  13. The computing device (200) according to claim 12, wherein the control circuitry (202) is further configured to determine vehicle control data based on the determined classification data.
  14. The computing device (200) according to claim 12 or 13, wherein the control circuitry (202) is further configured to display the classification data on a display device of the vehicle, by rendering the classification data as a graphical representation on the display device.
  15. A vehicle (300) comprising a camera, and the computing device (200) according to any one of the claims 12 to 14.
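The alternating cross-/self-attention procedure of claim 5 resembles a Perceiver-style architecture: the crop is flattened into one token per pixel, and a fixed-size latent array repeatedly reads from those tokens. The sketch below is a minimal numpy illustration under simplifying assumptions (single-head attention, no learned projections, no classification head; all names are hypothetical):

```python
import numpy as np

def attend(q: np.ndarray, kv: np.ndarray) -> np.ndarray:
    # Scaled dot-product attention (single head, no learned projections;
    # a simplification of the attention modules recited in claim 5).
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def classify_crop(crop: np.ndarray, latents: np.ndarray, n_iter: int = 4) -> np.ndarray:
    # Flatten the HxWxC crop into an (H*W)xC input data array: one token
    # per pixel, so the loop is indifferent to the crop's resolution.
    tokens = crop.reshape(-1, crop.shape[-1]).astype(np.float64)
    z = latents
    for _ in range(n_iter):
        z = attend(z, tokens)  # cross-attention: latents read the pixels
        z = attend(z, z)       # self-attention among the latents
    # Pooled latent; a trained head would project this to class logits.
    return z.mean(axis=0)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 3))     # latent array with initial values
small = rng.normal(size=(24, 24, 3))  # crops of arbitrary resolution
large = rng.normal(size=(96, 96, 3))
print(classify_crop(small, latents).shape, classify_crop(large, latents).shape)
```

Note that the latent array has a fixed size regardless of the crop, which is what allows a single model to process input images of varying resolution.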

Description

TECHNICAL FIELD

The herein disclosed technology relates to the field of automated driving systems. In particular, it relates to methods and devices for traffic sign recognition.

BACKGROUND

Traffic sign recognition (TSR) systems are an integral part of advanced driver assistance systems (ADAS) and autonomous driving (AD) technologies. These systems are designed to automatically detect and interpret traffic signs in real time, using cameras or other onboard sensors, either to provide the driver with information about speed limits and other traffic regulations, or to serve the automated driving system as a basis for decision-making and control of the automated operations of the vehicle.

Early TSR systems used basic image processing techniques to detect distinctive sign shapes and colors. However, such systems are limited in their ability to adapt to varying environmental conditions such as changing lighting, weather, and obscured or worn-out signs. Although effective in standard conditions, these early systems may sometimes lack performance in more complex driving environments. For instance, they may struggle to recognize signs that are faded, partially obscured, or positioned at unconventional angles. In addition, variations in sign designs across different countries and regions present further challenges to these systems.

Recent advancements in deep learning and artificial intelligence have improved the accuracy of TSR systems by enabling models to learn from large datasets of traffic signs and road environments. These systems typically use convolutional neural networks (CNNs), sometimes together with other machine learning techniques, to identify traffic signs with high precision, even in adverse conditions. However, while these approaches have shown promise, there remains a need to improve the performance of TSR systems, e.g. in terms of reducing false positives and ensuring real-time performance and robustness across a broader range of driving environments. Such improvements could enhance the capability of automated driving systems, where accurate traffic sign interpretation is crucial for ensuring compliance with road regulations.

SUMMARY

The herein disclosed technology seeks to mitigate, alleviate or eliminate one or more of the above-identified deficiencies and disadvantages in the prior art, to address various problems relating to traffic sign recognition (TSR) systems. More specifically, the inventors have realized that the performance of a CNN-based approach to TSR is limited by the fact that it requires the input to be of a certain size (i.e. the input image to have a certain resolution). In reality, every traffic sign captured by a camera will be of a different size, depending e.g. on the distance to the camera at the point of capture, or on the fact that different types of traffic signs have different shapes and sizes. This means that in CNN-based approaches, the image fed to the traffic sign classifier has to be either up-sampled or down-sampled. The aim of the disclosed technology is to address this issue by introducing an attention-based neural network approach to a two-stage traffic sign recognition pipeline. More specifically, the new and improved way of performing traffic sign recognition applies cross-attention directly on the image pixels, enabling it to work with images of different resolutions. Various aspects and embodiments of the disclosed technology are defined below and in the accompanying independent and dependent claims. According to a first aspect, there is provided a computer-implemented method for classifying a traffic sign in an image. The method comprises obtaining the image, captured by a camera of a vehicle, depicting at least a portion of a surrounding environment of the vehicle.
The method further comprises identifying a region in the image corresponding to a traffic sign, by processing the image through a first machine learning model configured to output detections of traffic signs in input images. The method further comprises extracting, from the image, a crop corresponding to the identified region. The crop having a native resolution based on a size of the identified region in relation to the obtained image. The method further comprises determining classification data of the traffic sign by processing the crop, at the native resolution, through a second machine learning model. The second machine learning model being an attention-based neural network, trained to process input images of traffic signs of varying resolution and to generate corresponding classification data, and wherein the second machine learning model applies attention on a pixel-level of the input images. With this aspect of the disclosed technology, similar advantages and preferred features are present as in the other aspects. According to a second aspect, there is provided a computer program product comprising instructions which when the program is executed by a computing de