US-12618954-B2 - Camera soiling detection using attention-guided camera depth and LIDAR range consistency gating
Abstract
A method includes receiving a plurality of images, wherein a first image of the plurality of images comprises a range image and a second image comprises a camera image, and filtering the first image to generate a filtered first image. The method also includes generating a plurality of depth estimates based on the second image and generating an attention map by combining the filtered first image and the plurality of depth estimates. Additionally, the method includes generating a consistency score indicative of a consistency of depth estimates between the first image and the second image based on the attention map, modulating one or more features extracted from the second image based on the consistency score using a gating mechanism to generate modulated one or more features, and generating a classification of one or more soiled regions in the second image based on the modulated one or more features.
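To make the final step of the abstract concrete, here is a minimal sketch of a per-pixel classification head over the gated features. The linear head, the class set, and every name below are illustrative assumptions rather than the patented implementation; only the softmax normalization itself is recited (in claim 11 below).

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Normalize logits into a probability distribution (cf. claim 11)."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classify_soiling(gated_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Classify each pixel from gated camera features.

    gated_features: (H, W, C) features after the gating mechanism.
    weights: (C, num_classes) linear head; classes such as clean vs.
    soiled are assumed labels for illustration.
    """
    logits = gated_features @ weights   # per-pixel linear projection
    return softmax(logits)              # per-pixel class probabilities
```

A soiled-region mask could then be obtained with `probs.argmax(axis=-1)`, feeding downstream ADAS logic such as that in claim 12.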
Inventors
- Varun Ravi Kumar
- Senthil Kumar Yogamani
- Shivansh Rao
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-09-11
Claims (20)
- 1. A method comprising: receiving a plurality of images, wherein a first image of the plurality of images comprises a range image and a second image comprises a camera image; filtering, by one or more processors, the first image to generate a filtered first image, wherein the filtering includes a filter configured to fill in one or more sparse regions in the first image; generating, by the one or more processors, a plurality of depth estimates based on the second image; generating, by the one or more processors, an attention map by combining the filtered first image and the plurality of depth estimates; combining, by the one or more processors, the attention map with a generated Structural Similarity Index Matrix (SSIM) score to generate a consistency score indicative of a consistency of depth estimates between the first image and the second image based on the attention map; modulating, by the one or more processors, one or more features extracted from the second image based on the consistency score using a gating mechanism to generate modulated one or more features; and generating, by the one or more processors, a classification of one or more soiled regions in the second image based on the modulated one or more features.
- 2. The method of claim 1, wherein the first image is generated in real time by a Light Detection and Ranging (LIDAR) sensor.
- 3. The method of claim 1, further comprising extracting one or more features from the first image prior to the filtering of the first image.
- 4. The method of claim 1, wherein the filter comprises a morphological filter having a binary mask of pixels that expands one or more regions of the first image, wherein the binary mask is associated with a radius, and wherein the radius of the binary mask determines how far the morphological filter expands the one or more regions of the first image (see the dilation sketch following the claims).
- 5. The method of claim 1, further comprising generating the plurality of depth estimates using a depth decoder.
- 6. The method of claim 1, wherein generating the attention map further comprises dividing the second image into a plurality of layers based on a distance from a camera used to acquire the second image.
- 7. The method of claim 6, wherein dividing the second image into the plurality of layers further comprises grouping one or more projected points based on the distance from the camera used to acquire the second image.
- 8. The method of claim 7, further comprising calculating the distance from the camera based on a plurality of pixel coordinates in a plane of the second image and a center of the camera in a camera coordinate system.
- 9. The method of claim 1, wherein the SSIM score indicates consistency of the plurality of depth estimates.
- 10. The method of claim 1, further comprising identifying the one or more soiled regions based on the consistency score.
- 11. The method of claim 1, wherein generating the classification further comprises generating the classification of the one or more soiled regions using a softmax function configured to normalize a probability distribution.
- 12. The method of claim 1, further comprising operating an Advanced Driver Assistance System (ADAS) based on the classification of the one or more soiled regions.
- 13. An apparatus for camera soiling detection, the apparatus comprising: a memory for storing a plurality of images; and processing circuitry in communication with the memory, wherein the processing circuitry is configured to: receive the plurality of images, wherein a first image of the plurality of images comprises a range image and a second image comprises a camera image; filter the first image to generate a filtered first image, wherein the filtering includes a filter configured to fill in one or more sparse regions in the first image; generate a plurality of depth estimates based on the second image; generate an attention map by combining the filtered first image and the plurality of depth estimates; combine the attention map with a generated Structural Similarity Index Matrix (SSIM) score to generate a consistency score indicative of a consistency of depth estimates between the first image and the second image based on the attention map; modulate one or more features extracted from the second image based on the consistency score using a gating mechanism to generate modulated one or more features; and generate a classification of one or more soiled regions in the second image based on the modulated one or more features.
- 14. The apparatus of claim 13, wherein the first image is generated in real time by a Light Detection and Ranging (LIDAR) sensor.
- 15. The apparatus of claim 13, wherein the processing circuitry is further configured to extract one or more features from the first image prior to the filtering of the first image.
- 16. The apparatus of claim 13, wherein the filter comprises a morphological filter having a binary mask of pixels that expands one or more regions of the first image, wherein the binary mask is associated with a radius, and wherein the radius of the binary mask determines how far the morphological filter expands the one or more regions of the first image.
- 17. The apparatus of claim 13, wherein the processing circuitry is further configured to generate the plurality of depth estimates using a depth decoder.
- 18. The apparatus of claim 13, wherein the processing circuitry configured to generate the attention map is further configured to divide the second image into a plurality of layers based on a distance from a camera used to acquire the second image.
- 19. The apparatus of claim 18, wherein the processing circuitry configured to divide the second image into the plurality of layers is further configured to group one or more projected points based on the distance from the camera used to acquire the second image.
- 20. The apparatus of claim 19, wherein the processing circuitry is further configured to calculate the distance from the camera based on a plurality of pixel coordinates in a plane of the second image and a center of the camera in a camera coordinate system.
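To illustrate the morphological filter of claims 4 and 16 (referenced from claim 4 above), here is a minimal sketch using OpenCV dilation with a disk-shaped binary mask. Treating zero-valued pixels as missing LIDAR returns and the elliptical kernel shape are assumptions made for the example, not requirements of the claims.

```python
import cv2
import numpy as np

def fill_sparse_range_image(range_img: np.ndarray, radius: int = 3) -> np.ndarray:
    """Expand valid LIDAR returns into neighboring empty pixels.

    range_img: (H, W) float32 range image where 0 marks missing returns.
    The binary mask (structuring element) has the given radius; a larger
    radius lets the filter expand regions of the range image farther,
    mirroring the radius recited in claim 4.
    """
    size = 2 * radius + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    dilated = cv2.dilate(range_img, kernel)  # grayscale dilation (local max)
    # Keep original measurements where they exist; fill only empty pixels.
    return np.where(range_img > 0, range_img, dilated)
```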
Description
TECHNICAL FIELD
This disclosure relates to cognitive neural networks.
BACKGROUND
Autonomous vehicles and semi-autonomous vehicles may use artificial intelligence and machine learning—specifically deep neural networks (DNNs)—for performing any number of operations for operating, piloting, and navigating the vehicle. For example, DNNs may be used for object detection, lane and road boundary detection, safety analysis, drivable free-space analysis, control generation during vehicle maneuvers, and/or other operations. Before any autonomous or semi-autonomous vehicle can safely navigate on the road, the DNNs and other software that enable the vehicle to drive itself are generally tested to verify and validate that they perform safely. More specifically, DNN-powered autonomous and semi-autonomous vehicles should be able to respond properly to an incredibly diverse set of situations, including interactions with emergency vehicles, pedestrians, animals, and a virtually infinite number of other obstacles. For autonomous vehicles to achieve autonomous driving levels 3-5 (e.g., conditional automation (Level 3), high automation (Level 4), and full automation (Level 5)), the autonomous vehicles should be capable of operating safely in all environments, without the requirement for human intervention when potentially unsafe situations present themselves.
Advanced Driver Assistance Systems (ADAS) use sensors and software to help vehicles avoid hazardous situations and to ensure safety and reliability. Cameras are an essential part of the sensor suite for achieving Level 3 autonomous driving because they provide a high-resolution view of the surrounding environment. Cameras allow autonomous vehicles to see other vehicles, pedestrians, and obstacles, and to make decisions about how to navigate safely. Surround-view cameras are a type of camera mounted on the outside of the autonomous vehicle. Surround-view cameras provide a 360-degree view of the vehicle's surroundings, which is essential for autonomous driving. However, surround-view cameras are also directly exposed to the external environment, which means that these cameras may get soiled by rain, fog, or snow.
SUMMARY
In general, this disclosure describes techniques for camera soiling detection. Surround-view cameras are directly exposed to the external environment, which means that they are vulnerable to getting soiled by rain, fog, snow, dust, and mud. Soiling may significantly impact the performance of the cameras, making it difficult for autonomous vehicles to see other vehicles, pedestrians, and obstacles. For example, rain may cause water droplets to form on the camera lens, which may blur the image. Fog may also cause the image to be blurred, and snow may block the view entirely. Dust and mud may also interfere with the camera's ability to see clearly. This disclosure describes example techniques for more accurately detecting soiling on the cameras. That is, in one or more examples, the example techniques facilitate soiling detection using range images, such as range images produced by Light Detection and Ranging (LIDAR) sensors. A LIDAR sensor uses laser light to measure the distance to objects in the environment. This allows the LIDAR sensor to create a three-dimensional (3D) map of the surrounding area, which may be used for a variety of purposes, including autonomous driving. A range image may be obtained from a LIDAR sensor. The range image is a two-dimensional (2D) image that represents the distance to objects in the scene. The range image may be obtained by using a LIDAR sensor to measure the time it takes for laser pulses to travel to objects and back.
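As an illustration of how such a range image might be formed, the sketch below projects LIDAR points (assumed to be already transformed into the camera frame) through a pinhole model into a sparse 2D image. The intrinsic matrix K and the convention that zero marks pixels with no return are assumptions made for the example, not details from the patent.

```python
import numpy as np

def lidar_to_range_image(points: np.ndarray, K: np.ndarray, h: int, w: int) -> np.ndarray:
    """Project LIDAR points (N, 3), given in the camera frame, into a sparse
    2D range image where each pixel stores distance to the sensor (0 = no return)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    front = z > 0                                         # keep points in front of the camera
    x, y, z = x[front], y[front], z[front]
    u = np.round(K[0, 0] * x / z + K[0, 2]).astype(int)   # pixel column
    v = np.round(K[1, 1] * y / z + K[1, 2]).astype(int)   # pixel row
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    rng = np.sqrt(x ** 2 + y ** 2 + z ** 2)               # Euclidean range per point
    img = np.zeros((h, w), dtype=np.float32)
    img[v[ok], u[ok]] = rng[ok]                           # sparse: most pixels remain 0
    return img
```

Because the projected points rarely cover every pixel, the result is sparse, which is what motivates the fill-in filter recited in the claims.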
Depth estimates may be obtained from the camera's depth decoder. An encoder may be used to extract features from the camera image. These features may then be used to train a model that may estimate the depth of objects in the scene. The extracted features may then be used in the feature gating module to determine which features should be used to estimate the depth of the scene. The depth estimates from the camera and the LIDAR sensor may then be combined to produce a single depth estimate for the scene. Next, a distance-based segmentation technique may generate an attention map that highlights regions where the camera is likely to be soiled. Advantageously, the disclosed real-time soiling detection techniques improve the safety of autonomous driving systems. As yet another non-limiting advantage, the disclosed machine learning techniques are computationally efficient.
In one example, a method includes receiving a plurality of images, wherein a first image of the plurality of images comprises a range image and a second image comprises a camera image, and filtering, by one or more processors, the first image to generate a filtered first image. The filtering includes a filter configured to fill in one or more sparse regions in the first image. The method also includes generating, by the one or more processors, a plurality of depth estimates based on the second image; generating an attention map by combining the filtered first image and the plurality of depth estimates; combining the attention map with a generated Structural Similarity Index Matrix (SSIM) score to generate a consistency score indicative of a consistency of depth estimates between the first image and the second image; modulating one or more features extracted from the second image based on the consistency score using a gating mechanism to generate modulated one or more features; and generating a classification of one or more soiled regions in the second image based on the modulated one or more features.
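A minimal sketch of the consistency scoring and feature gating summarized above, assuming scikit-image's SSIM and a simple elementwise gate; the agreement-based attention term and the product used to combine it with the SSIM map are illustrative assumptions, since the passage above does not spell out that combination.

```python
import numpy as np
from skimage.metrics import structural_similarity

def gate_features(lidar_depth: np.ndarray, cam_depth: np.ndarray,
                  features: np.ndarray) -> np.ndarray:
    """lidar_depth, cam_depth: (H, W) depth maps; features: (H, W, C) camera features."""
    # Attention proxy: high where the two depth sources agree (assumed form).
    attention = 1.0 / (1.0 + np.abs(lidar_depth - cam_depth))
    # Per-pixel SSIM map between the two depth images.
    _, ssim_map = structural_similarity(
        lidar_depth, cam_depth, full=True,
        data_range=float(max(lidar_depth.max(), cam_depth.max())))
    consistency = attention * ssim_map   # combine attention with the SSIM score
    # Gate: suppress camera features where depth is inconsistent (likely soiled
    # regions); pass them through where the two sensors agree.
    return features * consistency[..., None]
```

Regions where the gate is near zero survive only as weak gated features, which the classification head (see the sketch after the Abstract) can then label as soiled.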