This week, our small deep learning study group decided to focus on visualising layers, since both of us wanted to see what a model looks at in an image when performing a classification task. In particular, I read through “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization”.

Computer vision tasks have diversified beyond simple labelling, branching into 1) image classification (traditional single- or multi-label tasks); 2) object detection (localisation of an object); 3) semantic segmentation (pixel-wise localisation); 4) image captioning; and 5) visual question answering (VQA).

Example of Semantic segmentation (3)

image of semantic segmentation

Source: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRCAKpr3EHEEMooof6NzhnTvy-6aMND5rHHk53ymA-bOY478I9XkA

Example of Image captioning (4)

image of dense captioning

Source: Johnson et al., 2015, Figure 3: “Example captions generated and localized by our model on test images.” (from top-right corner)

Example of Visual question answering (5)

image of visual question answering

Source: Ren et al., 2015, Figure 3: “Sample questions and responses from our system.”

To effectively design, implement, and deploy these models in real life, the authors of this paper argue that it is important to focus on the interpretability and transparency of the models:

  1. to identify failure modes (when the model performs worse than humans on a particular task);
  2. to establish appropriate trust and confidence in users (when the model’s performance is on par with humans); and
  3. to teach humans how to make better decisions (when the model’s performance is stronger than humans).

In other words, the interpretability and transparency of the models can tell us why the models predict what they predict.

What would you consider a good visual explanation?

The authors defined two characteristics that make a good visual explanation: it should be class-discriminative and high-resolution.

  • Class-discriminative means that the visual explanation localises the object of interest in the image, distinguishing it from objects of other classes.

  • High-resolution means that the visual explanation captures fine-grained detail.

In this post, we will compare different visualisation techniques against these two criteria to evaluate the extent of their explanatory power.

Why do we need Grad-CAM over other visualisation techniques?

In the past, visualisation was done through Guided Backpropagation (Springenberg et al., 2014) or Deconvolution (Zeiler and Fergus, 2014). These techniques tend to provide high-resolution visualisations, but they cannot localise the object(s) of interest.

Grad-CAM is based on Class Activation Mapping (CAM) by Zhou et al. (2015), which identifies the discriminative image regions a CNN uses for a particular class.
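For context, CAM computes the map for class $c$ as a weighted sum of the feature maps $A^{k}$ of the last convolutional layer, where the weights $w^{c}_{k}$ come from the fully-connected softmax layer that follows global average pooling (notation adapted here to match the Grad-CAM equations below):

$$L^{c}_{\mathrm{CAM}} = \sum_{k} w^{c}_{k} A^{k}$$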

Grad-CAM complements and generalises CAM: CAM only applies to a restricted class of CNNs whose feature maps feed directly into global average pooling and the softmax, whilst Grad-CAM can visualise regions from a much wider range of CNN-based models, such as CNNs with fully-connected layers (e.g. VGG), CNNs used for structured outputs (e.g. captioning), and CNNs used in tasks with multi-modal inputs (VQA) or reinforcement learning.

In addition, Guided Grad-CAM, which combines Guided Backpropagation and Grad-CAM, provides both high resolution and localisation through point-wise multiplication of the two outputs.

Comparison between Guided Backpropagation, Grad-CAM, and Guided Grad-CAM

image of comparison

Source: Selvaraju et al., 2016, Figure 1.

As you can see above, Guided Backpropagation (b) does not distinguish the object of interest from other objects in the image (i.e. the dog and the cat are visualised together).

Grad-CAM (c) focuses on the cat for the ‘cat’ classification, but it does not provide any texture or features of the cat (i.e. it only localises the cat).

Guided Grad-CAM (d) displays only the cat, while also capturing the cat’s stripe texture at the same time.

What does Grad-CAM look like?

Grad-CAM produces $L^{c}_{\mathrm{Grad\text{-}CAM}} \in \mathbb{R}^{u \times v}$, “the class-discriminative localisation map”, with width $u$ and height $v$ for any class $c$.

There are two steps involved in calculating Grad-CAM.

Step 1:

$$\alpha^{c}_{k} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}}$$

where $\alpha^{c}_{k}$ represents a partial linearisation of the deep network downstream from $A$, and captures the importance of feature map $k$ for a target class $c$.

  • In Step 1, the gradient of the score $y^{c}$ for class $c$ (before the softmax) is computed with respect to the feature map activations $A^{k}$ of a convolutional layer.

  • These gradients are then global-average-pooled to obtain the neuron importance weight $\alpha^{c}_{k}$.

In Step 2, to get Grad-CAM, a ReLU is applied to the weighted combination of the feature maps, where each feature map $A^{k}$ is multiplied by its neuron importance weight $\alpha^{c}_{k}$. The authors applied the ReLU to this linear combination because they are only interested in features with a positive influence on the score $y^{c}$ for the class of interest (i.e. pixels whose intensity should be increased in order to increase $y^{c}$).

Step 2:

$$L^{c}_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha^{c}_{k} A^{k}\right)$$

Source: Selvaraju et al., 2016, pp. 3-4
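Putting the two steps together, here is a minimal PyTorch sketch (my own illustration, not code from the paper; the names `feature_maps` and `class_score` are hypothetical):

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, class_score):
    """Compute L^c_Grad-CAM from Steps 1 and 2.

    feature_maps: activations A^k of shape (k, u, v), requiring gradients
    class_score:  the scalar score y^c for class c, before the softmax
    """
    # Step 1: gradients of y^c w.r.t. A^k, global-average-pooled over (i, j)
    grads = torch.autograd.grad(class_score, feature_maps)[0]   # (k, u, v)
    alpha = grads.mean(dim=(1, 2))                              # (k,) neuron importance weights

    # Step 2: ReLU over the alpha-weighted combination of feature maps
    return F.relu((alpha[:, None, None] * feature_maps).sum(dim=0))  # (u, v)
```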

Implementation of Grad-CAM through the fast.ai library

The implementation code for Grad-CAM is adapted from Lesson 6 of the fast.ai course. The image dataset was collected from the web.

The full notebook can be found here.

Credit: A large part of this code is based on code from a fast.ai MOOC that will be publicly available in Jan 2019.

image of Grad-CAM on insect dataset

As you can see from the notebook, the heatmap indicates the object of interest within the image.
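For readers who want to reproduce the heatmap without the notebook, here is a minimal sketch in plain PyTorch (the names `model`, `target_layer`, `image_batch`, and `class_idx` are hypothetical placeholders; the notebook itself uses fastai’s hook utilities):

```python
import torch

# Hypothetical placeholders: a trained CNN `model` and its last conv layer `target_layer`
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations['A'] = out.detach()         # feature maps A^k

def bwd_hook(module, grad_in, grad_out):
    gradients['dA'] = grad_out[0].detach()  # dy^c / dA^k

h1 = target_layer.register_forward_hook(fwd_hook)
h2 = target_layer.register_backward_hook(bwd_hook)

logits = model(image_batch)          # forward pass stores A
logits[0, class_idx].backward()      # backward pass stores the gradients

h1.remove(); h2.remove()

# Steps 1 and 2 from the paper
alpha = gradients['dA'][0].mean(dim=(1, 2))                            # (k,)
cam = torch.relu((alpha[:, None, None] * activations['A'][0]).sum(0))  # (u, v)
```

Upsampled to the input resolution, `cam` is the heatmap overlaid on the image above.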

Edit: I tried to implement Guided Grad-CAM on a Monarch butterfly (Danaus plexippus) image, based on this GitHub repo.

The original image of a Monarch butterfly was as follows:

image of original monarch butterfly

Image: Original image of a Monarch butterfly

Guided Backpropagation provides the texture and general outline of the Monarch butterfly.

image of Guided Back Prop on monarch butterfly

Image: Guided Backpropagation on the image of a Monarch butterfly
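For reference, Guided Backpropagation is commonly implemented by modifying the backward pass of every ReLU so that only positive gradients flow back. A minimal PyTorch sketch (again with hypothetical `model`, `image`, and `class_idx` placeholders, not the repo’s exact code):

```python
import torch
import torch.nn as nn

def guided_relu_hook(module, grad_in, grad_out):
    # Keep only positive incoming gradients; the ReLU's own backward
    # pass has already zeroed gradients at negative activations.
    return (torch.clamp(grad_in[0], min=0.0),)

# Register the guided hook on every ReLU in the model
handles = [m.register_backward_hook(guided_relu_hook)
           for m in model.modules() if isinstance(m, nn.ReLU)]

image.requires_grad_()
model(image)[0, class_idx].backward()
saliency = image.grad.detach()       # the Guided Backpropagation visualisation

for h in handles:
    h.remove()
```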

Grad-CAM identifies where the model focuses in order to classify the image.

image of Grad-CAM on monarch butterfly

Image: Grad-CAM on the image of a Monarch butterfly

Because there were no other objects in the image, the output of Guided Grad-CAM appears quite similar to that of Guided Backpropagation, but it does display slightly more detail.

image of Guided Grad-CAM on monarch butterfly

Image: Guided Grad-CAM on the image of a Monarch butterfly
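The point-wise multiplication itself is a one-liner once the Grad-CAM map has been upsampled to the input resolution; a sketch using the `cam` and `saliency` tensors from the snippets above:

```python
import torch.nn.functional as F

# Upsample the (u, v) Grad-CAM map to the (H, W) input resolution
cam_up = F.interpolate(cam[None, None], size=saliency.shape[-2:],
                       mode='bilinear', align_corners=False)[0, 0]

# Point-wise multiplication; the map broadcasts over the colour channels
guided_grad_cam = saliency * cam_up
```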

Lessons learnt and future to-do list

Guided Backpropagation and Grad-CAM explain where the model focuses when labelling an image. However, Guided Backpropagation lacks localisation, while Grad-CAM lacks high resolution.

Next week, I would like to implement Guided Grad-CAM to see both localisation and high resolution in the visualised layers.