Notes on VGG
While our small deep learning study group was playing with layer-visualisation techniques, we found that many methods were still based on VGG-like architectures. This week, therefore, we decided to read through "Very Deep Convolutional Networks for Large-Scale Image Recognition" by Simonyan and Zisserman (2014) to study the VGG models.
Many studies around this time focused on improving the hyperparameters of convolutional networks (ConvNets). This paper investigated one of the most important of them: the effect of ConvNet depth.
Why is this paper important and interesting?
- VGG won first place in the localisation challenge and second place in the classification challenge at ILSVRC 2014.
- The paper isolates the effect of ConvNet depth by comparing otherwise similar ConvNets with different numbers of layers.
What do the VGG models look like?
ConvNet layer configuration was inspired by Ciresan et al. (2011) and Krizhevsky et al. (2012).
The base VGG has a similar configuration to AlexNet, but with more layers (11 weight layers) and smaller filters (3 x 3). The filters were sized 3 x 3 because this is the smallest size that can capture the notion of left/right, up/down, and centre.
To study the effect of depth, the paper compares the six models below:
- VGG model A: 8 conv layers, 3 FC layers
- VGG model A-LRN: 8 conv layers (but with Local Response Normalisation from AlexNet), 3 FC layers
- VGG model B: 10 conv layers, 3 FC layers
- VGG model C: 13 conv layers (including three 1 x 1 conv layers, i.e. linear transformations followed by a ReLU), 3 FC layers
- VGG model D: 13 conv layers (all 3 x 3), 3 FC layers
- VGG model E: 16 conv layers (all 3 x 3), 3 FC layers
Source: Simonyan and Zisserman (2014), p. 3, Table 1
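The configurations in Table 1 can be written down compactly. Below is a minimal PyTorch sketch of the convolutional trunk, following the common convention of listing output channels with "M" marking a 2 x 2 max-pool (this is my own illustration, not the authors' code; models A-LRN and C are omitted for brevity):

```python
import torch
import torch.nn as nn

# Output channels of 3x3 conv layers per configuration; "M" = 2x2 max-pool.
cfgs = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M",
          512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def make_features(cfg):
    """Build the convolutional trunk from a configuration list."""
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

features = make_features(cfgs["D"])  # VGG-16 trunk: 13 conv layers
out = features(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 512, 7, 7])
```

The three FC layers (4096, 4096, 1000) would then sit on top of the 512 x 7 x 7 feature map.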
An interesting point I want to highlight here is that the receptive field of a stack of 3 x 3 filters in VGG is equivalent to that of the larger filters (e.g. 11 x 11 and 7 x 7) used in AlexNet and ZFNet. As illustrated below, a stack of two 3 x 3 conv layers has an effective receptive field of 5 x 5, and a stack of three such layers is equivalent to one 7 x 7 layer.
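This equivalence is simple arithmetic: with stride 1, each extra conv layer grows the receptive field by (kernel size - 1). A quick sketch:

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of num_layers stacked stride-1 conv
    layers: each extra layer grows the field by (kernel_size - 1)."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

print(stacked_receptive_field(3, 2))  # 5 -> equivalent to one 5x5 filter
print(stacked_receptive_field(3, 3))  # 7 -> equivalent to one 7x7 filter
```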
So why do we need a stack of 3 x 3 filters instead of one larger filter?
A greater number of layers adds more non-linearities, which makes the decision function more discriminative, and at the same time decreases the number of parameters. As you can see below, the largest model, VGG E (19 layers), has around 144 million parameters, similar in size to shallower models with larger filters.
Source: Simonyan and Zisserman (2014), p. 3, Table 2
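The parameter saving is easy to verify. Assuming C input and output channels, as in the paper's argument (and ignoring biases), a stack of three 3 x 3 layers costs 27C² weights against 49C² for a single 7 x 7 layer with the same receptive field:

```python
C = 512  # input and output channels, assumed equal as in the paper's argument

stack_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 conv layers: 27 * C^2
single_7x7 = 7 * 7 * C * C       # one 7x7 conv layer: 49 * C^2

print(stack_3x3, single_7x7)   # 7077888 12845056
print(stack_3x3 / single_7x7)  # ~0.55, i.e. roughly 45% fewer weights
```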
How did the authors train VGG?
Training methods
VGG used a mini-batch size of 256, momentum of 0.9, a weight-decay penalty multiplier of 5e-4, and dropout with probability 0.5 on the first two FC layers. As usual, the learning rate started at 1e-2 and was decreased by a factor of 10 when the validation accuracy stopped improving.
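These hyperparameters map directly onto PyTorch's SGD; a minimal sketch (the stand-in model and the plateau-based scheduler are my assumptions, not the paper's code, and the dropout layers would sit after the first two FC layers of the real net):

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for a VGG net; any nn.Module works here.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(64, 1000))

optimizer = optim.SGD(model.parameters(), lr=1e-2,
                      momentum=0.9, weight_decay=5e-4)
# The paper decreased the learning rate by a factor of 10 when validation
# accuracy stopped improving (three times over ~74 epochs); a plateau
# scheduler approximates that manual schedule.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
```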
Another interesting point in VGG is the initialisation method. Because this model predates the adoption of batch normalisation, the shallow 11-layer configuration (VGG A) was trained first, as it is stable enough to train from random initialisation. When training the deeper configurations, the first four conv layers and the last three FC layers were initialised with the layers of the trained model A, while the intermediate layers were initialised randomly with zero mean and a variance of 1e-2.
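The random part of this scheme is a one-liner in PyTorch; a sketch (the toy `deep_model` is just an illustration, and copying the trained model-A weights is left as a comment):

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Random initialisation as in the paper: weights from a normal
    distribution with zero mean and 1e-2 variance (std = 0.1),
    biases set to zero."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)
        nn.init.zeros_(module.bias)

deep_model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Flatten(),
                           nn.Linear(64 * 8 * 8, 10))
deep_model.apply(init_weights)
# For the deeper configurations, the first four conv layers and the three
# FC layers would then be overwritten with the trained weights of model A.
```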
Training image size

Single-scale training: two fixed training scales, S = 256 and S = 384, were compared (the crop size is always 224 x 224).

Multi-scale training: each training image was rescaled to a random scale S sampled between S_min (256) and S_max (512). This can be seen as training-set augmentation by scale jittering.
The key difference between the two methods is that the second trains the model to classify objects over a broader range of scales.
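The jittering itself is just a resize rule: sample S uniformly from [S_min, S_max] and resize the image so its smaller side equals S, before taking the 224 x 224 crop. A small helper (my own sketch, names hypothetical) that computes the target size:

```python
import random

S_MIN, S_MAX = 256, 512

def jittered_size(w, h, s_min=S_MIN, s_max=S_MAX, rng=random):
    """Sample a training scale S uniformly and return the (width, height)
    to resize to so that the image's smaller side equals S."""
    S = rng.randint(s_min, s_max)
    if w <= h:
        return S, round(h * S / w)
    return round(w * S / h), S
```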
The models were trained on 4 NVIDIA Titan Black GPUs, taking about 2-3 weeks for a single net (depending on the architecture).
What were the results of VGG?
The dataset used was ILSVRC-2012 (1,000 classes; 1.3M training, 50K validation, and 100K test images).
Top-1 error (the proportion of incorrectly classified images) and top-5 error (the proportion of images whose ground-truth category is outside the top-5 predicted categories) were used to compare the results of VGG.
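Both metrics are straightforward to compute from the model's logits; a small PyTorch sketch (my own helper, not from the paper):

```python
import torch

def topk_error(logits, targets, k):
    """Fraction of samples whose ground-truth class is outside the
    top-k predicted classes."""
    topk = logits.topk(k, dim=1).indices             # (N, k)
    hit = (topk == targets.unsqueeze(1)).any(dim=1)  # (N,)
    return 1.0 - hit.float().mean().item()

logits = torch.tensor([[0.1, 0.9, 0.0],
                       [0.8, 0.15, 0.05]])
targets = torch.tensor([1, 2])
print(topk_error(logits, targets, 1))  # 0.5 (second sample misclassified)
print(topk_error(logits, targets, 3))  # 0.0 (top-3 always contains the label)
```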
Single-scale evaluation
- It turned out that A-LRN (11 layers with Local Response Normalisation) did not perform better than model A (without LRN), so the authors decided not to use LRN in the deeper configurations.
- The classification error decreases with increased ConvNet depth; for example, two 3 x 3 conv layers perform better than one 5 x 5.
- As seen below, adding non-linearity to the deeper model helps: model C (with 1 x 1 conv layers followed by ReLU) outperforms model B. But capturing more spatial context matters too: model D, which uses 3 x 3 filters in place of the 1 x 1 linear transformations, performs better than model C. Overall, model D (deep, 3 x 3 filters) beats model C (deep, 1 x 1 filters), which in turn beats the shallower model B.
 Scale jittering at training time also helps to improve the performance.
Source: Simonyan and Zisserman (2014), p. 6, Table 4
Multi-scale evaluation
- Scale jittering at test time also turned out to be helpful: evaluating over several scales Q in {S_min, 0.5(S_min + S_max), S_max} performed better than using a single fixed scale at test time.
Source: Simonyan and Zisserman (2014), p. 7, Table 5
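Multi-scale evaluation simply averages the softmax class posteriors over the test scales Q; a sketch, where `make_input` is a hypothetical helper returning a batch of images rescaled so the smaller side equals Q:

```python
import torch

def multiscale_predict(model, make_input, scales=(256, 384, 512)):
    """Average softmax class posteriors over the test scales Q.
    `make_input(Q)` is a hypothetical helper returning a batch of
    images rescaled so that the smaller side equals Q."""
    with torch.no_grad():
        probs = [torch.softmax(model(make_input(Q)), dim=1) for Q in scales]
    return torch.stack(probs).mean(dim=0)
```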
Comparison with other models
When compared to other well-performing models, an ensemble of VGG's two best multi-scale models (model D with 16 layers and model E with 19 layers) outperformed previous state-of-the-art models, other than GoogLeNet (the other ILSVRC-2014 winner). Its top-5 validation and test error on the classification challenge was 6.8%.
Source: Simonyan and Zisserman (2014), p. 8, Table 7
Lessons learnt and future todolist
This paper is where the famous VGG-16 and VGG-19 were born. Even now in 2018, many papers use VGG-like models to explore previously unexplored research areas. Unfortunately, this week I was not able to implement VGG.
- I want to implement VGG in PyTorch later.