US-12619679-B2 - Method for generating a detailed visualization of machine learning model behavior
Abstract
A method is provided for generating a visualization for explaining a behavior of a machine learning (ML) model. In the method, an image is input to the ML model for an inference operation. The input image has an increased resolution compared to an image resolution the ML model was intended to receive as an input. A resolution of a plurality of resolution-independent convolutional layers of a neural network of the ML model is adjusted because of the increased resolution of the input image. A resolution-independent convolutional layer of the neural network is selected. The selected resolution-independent convolutional layer is used to generate a plurality of activation maps. The plurality of activation maps is used in a visualization method to show which features of the image were important for the ML model to derive an inference conclusion. The method may be implemented in a computer program having instructions executable by a processor.
Inventors
- Brian Ermans
- Peter Doliwa
- Gerardus Antonius Franciscus Derks
- Wilhelmus Petrus Adrianus Johannus Michiels
- Frederik Dirk Schalij
Assignees
- NXP B.V.
Dates
- Publication Date
- 20260505
- Application Date
- 20210809
Claims (20)
- 1 . A method for generating a visualization for explaining a behavior of a machine learning (ML) model having a neural network, the method comprising: selecting an image for input to the ML model for an inference operation, wherein the image has an increased resolution compared to an image resolution the ML model was intended to receive as an input; increasing a resolution of a plurality of resolution-independent convolutional layers of the neural network because of the increased resolution of the input image that is selected for the inference operation; selecting a resolution-independent convolutional layer of the neural network; inputting the input image into the ML model for the inference operation; using the selected resolution-independent convolutional layer to generate a plurality of activation maps; using the plurality of activation maps in a visualization method to generate the visualization to show a user which features of the image were important for the ML model to derive an inference conclusion; and presenting results of application of the visualization method to the user on a display for analysis.
- 2 . The method of claim 1 , wherein selecting the resolution-independent convolutional layer further comprises selecting a final convolutional layer of the plurality of resolution-independent convolutional layers.
- 3 . The method of claim 1 , wherein the visualization method is a Grad-CAM (gradient-weighted class activation mapping) visualization method.
- 4 . The method of claim 1 , wherein selecting the image for input to the ML model for an inference operation further comprises upscaling the image to provide the increased resolution.
- 5 . The method of claim 1 , further comprising generating a plurality of heat maps from the plurality of activation maps to use in the visualization method.
- 6 . The method of claim 1 , wherein the neural network is used for one of image classification, object detection, semantic segmentation, or instance segmentation.
- 7 . The method of claim 1 , further comprising adding a layer after a final resolution-independent convolutional layer of the plurality of resolution-independent convolutional layers to adjust for a mismatch between an output size of the final resolution-independent convolutional layer and an input of a first resolution-dependent layer.
- 8 . The method of claim 7 , wherein the added layer comprises one of an average pooling layer, a max pooling layer, a global average pooling layer, or a global max pooling layer.
- 9 . The method of claim 1 , further comprising: adding a fully connected layer after the plurality of resolution-independent convolutional layers; and training only the added fully connected layer.
- 10 . The method of claim 1 , further comprising computing an average gradient for each activation map of the plurality of activation maps.
- 11 . A computer program comprising instructions executable by a processor, for executing a method for generating a visualization for explaining a behavior of a machine learning (ML) model having a neural network, the executable instructions comprising: instructions for selecting an image for input to the ML model for an inference operation, wherein the image has an increased resolution compared to an image resolution the ML model was intended to receive as an input; instructions for increasing a resolution of a plurality of resolution-independent convolutional layers of the neural network because of the increased resolution of the input image; instructions for selecting a resolution-independent convolutional layer of the neural network; instructions for inputting the input image into the ML model for the inference operation; instructions for using the selected resolution-independent convolutional layer to generate a plurality of activation maps; instructions for using the plurality of activation maps in a visualization method to generate the visualization to show a user which features of the image were important for the ML model to derive an inference conclusion; and instructions for presenting results of application of the visualization method to the user on a display for analysis.
- 12 . The computer program of claim 11 , wherein the instructions for selecting the resolution-independent convolutional layer further comprise instructions for selecting a final convolutional layer of the plurality of resolution-independent convolutional layers.
- 13 . The computer program of claim 11 , wherein the visualization method is a Grad-CAM (gradient-weighted class activation mapping) visualization method.
- 14 . The computer program of claim 11 , wherein the instructions for selecting the image for input to the ML model for an inference operation further comprise instructions for upscaling the image to provide the increased resolution.
- 15 . The computer program of claim 11 , further comprising instructions for generating a plurality of heat maps from the plurality of activation maps to use in the visualization method.
- 16 . The computer program of claim 11 , wherein the neural network is used for one of image classification, object detection, semantic segmentation, or instance segmentation.
- 17 . The computer program of claim 11 , further comprising instructions for adding a layer after a final resolution-independent convolutional layer of the plurality of resolution-independent convolutional layers to adjust for a mismatch between an output size of the final resolution-independent convolutional layer and an input of a first resolution-dependent layer.
- 18 . The computer program of claim 17 , wherein the added layer comprises one of an average pooling layer, a max pooling layer, a global average pooling layer, or a global max pooling layer.
- 19 . The computer program of claim 11 , further comprising: instructions for adding a fully connected layer after the plurality of resolution-independent convolutional layers; and instructions for training only the added fully connected layer.
- 20 . The computer program of claim 11 , further comprising instructions for computing an average gradient for each activation map of the plurality of activation maps.
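Claims 7, 8, 17, and 18 describe adding a pooling layer to reconcile the larger output of the final resolution-independent convolutional layer with the fixed input size expected by the first resolution-dependent layer. As an illustrative sketch only (not the claimed implementation), a global average pooling layer collapses each activation map to a single value, so the downstream fully connected layer sees the same input size regardless of the map resolution:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each H x W activation map to one scalar, so the output
    size depends only on the number of maps, not their resolution."""
    # feature_maps: array of shape (num_maps, H, W)
    return feature_maps.mean(axis=(1, 2))

# Activation maps from a 224x224 input (7x7) and from an upscaled
# 448x448 input (14x14) pool to the same fixed-size vector, so the
# same classifier head can follow either one.
small_maps = np.ones((1280, 7, 7))
large_maps = np.ones((1280, 14, 14))
print(global_average_pool(small_maps).shape)  # (1280,)
print(global_average_pool(large_maps).shape)  # (1280,)
```

The map count (1280) is chosen here only for illustration; the same idea applies with an average or max pooling layer whose window is sized to remove the mismatch.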
Description
BACKGROUND

Field

This disclosure relates generally to machine learning, and more particularly, to a method for generating a more detailed visualization of machine learning (ML) model behavior.

Related Art

Machine learning (ML) is becoming more widely used in many of today's applications, such as applications involving forecasting and classification. In ML, improving human interpretability and explainability of results is important. A lack of understanding about how an ML model derives its conclusions makes it difficult to verify that the ML model is working as expected and that no significant flaws of the model are overlooked. This lack of understanding can cause mistrust and security concerns that hinder the use of ML for important tasks. Many different approaches exist to generate visualizations that show the user which parts of the input were the most important for the model to derive its conclusion. When used on a model for image classification, for example, these visualizations show the influence of each individual input pixel, or of groups of pixels, on the classification result. Similar visualizations can also be applied to models used for object detection. All existing approaches have limitations that restrict their usefulness for explaining model behavior. Specifically, for convolutional neural networks (CNNs), several variants of visualization methods have been developed. For example, Grad-CAM (gradient-weighted class activation mapping) and Ablation-CAM generate heatmaps showing the most influential areas of the input for a target classification, based on activation maps generated from a selected convolutional layer of the CNN. The current visualization methods are considered to generate good explanations in general and are relatively computationally inexpensive, but their ability to explain model behavior may be limited by their relatively low resolution.
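For context, the Grad-CAM method mentioned above weights each activation map from the selected convolutional layer by the spatial average of the gradient of the target class score, then combines the weighted maps into a heatmap. The following is a minimal numpy sketch of that weighting step, assuming the activations and gradients have already been extracted from the network; it is an illustration of the published Grad-CAM formulation, not of this patent's claimed method:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: weight each activation map by its spatially averaged
    gradient, sum the weighted maps, and keep positive evidence (ReLU)."""
    # activations, gradients: shape (num_maps, H, W) from the selected layer
    weights = gradients.mean(axis=(1, 2))             # one weight per map
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU
    if cam.max() > 0:
        cam /= cam.max()                              # normalize to [0, 1]
    return cam
```

The resulting (H, W) array is upsampled and overlaid on the input image as a heatmap; its resolution is fixed by the selected layer, which is the limitation the background discusses.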
The low resolution is a direct result of the trade-off made when selecting a convolutional layer for the visualization. Heatmaps are noisier and generally less semantically meaningful toward the input of the CNN, while the resolution is reduced toward the output of the CNN. This means that, in order to generate the most meaningful visualizations, the selected layer is typically close to the output of the network, which results in a very low resolution. For example, some neural network architectures, like MobileNetV2, require input images having a specific resolution, such as, for example, 224×224 pixels. By the time the processing of the image through the CNN reaches the last convolutional layers, the resolution of the generated visualizations may be reduced to only 7×7 pixels. This low resolution makes the visualizations hard to interpret in many cases, especially if smaller objects are involved or the classification decision of the model depends on finer details of the input. Similar constraints apply when using visualization methods like Grad-CAM on CNNs used for object detection, semantic segmentation, instance segmentation, and other related tasks. Single-shot object detectors also have this problem because they typically use a single set of activation maps for classifying multiple different objects of different sizes. The generated activation maps cover the full input image, whereas object detectors typically detect objects that are only a small portion of the input image in size.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.

FIG. 1 illustrates a simplified system for training and using an ML model in accordance with an embodiment.

FIG. 2 illustrates a flowchart of a method for generating a visualization for explaining behavior of an ML model in accordance with an embodiment.

FIG. 3 illustrates a diagram of layers of a neural network for generating a higher resolution visualization in accordance with an embodiment.

FIG. 4 illustrates a diagram of layers of a neural network for generating a higher resolution visualization in accordance with another embodiment.

FIG. 5 illustrates a data processing system useful for implementing an embodiment of the present invention.

DETAILED DESCRIPTION

Generally, there is provided a method for providing a more detailed visualization for explaining the behavior of an ML model. The method includes inputting an image into the ML model for an inference operation. A resolution of the input image is increased compared to a resolution the ML model was intended to receive as an input. The ML model includes a plurality of resolution-independent convolutional layers. Most layers of a CNN are resolution independent. In this disclosure, the term “resolution-independent” means that the resolution-independent layers are not sensitive to the changes in the resolution or the number o
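The resolution trade-off described in the background can be made concrete with a little arithmetic. Assuming, purely for illustration, a MobileNetV2-style backbone whose spatial size is halved five times by stride-2 convolutions (exact layer parameters vary by architecture), the final activation maps shrink from 224×224 to 7×7, while doubling the input resolution doubles the final map size:

```python
def conv_out_size(size, kernel=3, stride=2, pad=1):
    """Spatial output size of a single stride-2, 3x3, pad-1 convolution."""
    return (size + 2 * pad - kernel) // stride + 1

def final_map_size(input_size, num_stride2_stages=5):
    """Size of the last layer's activation maps after repeated downsampling."""
    size = input_size
    for _ in range(num_stride2_stages):
        size = conv_out_size(size)
    return size

print(final_map_size(224))  # 7  -> coarse 7x7 activation maps and heatmaps
print(final_map_size(448))  # 14 -> 14x14 maps from an upscaled input
```

Because the convolutional layers are resolution independent, the larger input changes only the spatial size of the intermediate and final maps, not the layer weights, which is what allows the method above to produce higher-resolution visualizations from the same trained model.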