3. VGGNet

Table of Contents

  1. Introduction
  2. The Landscape Before VGG
  3. Overview of VGG Architecture
  4. Innovations and Key Features
  5. Detailed Layer-wise Architecture (VGG16)
  6. Training Details
  7. Key Experimental Insights
  8. Performance and Results
  9. Impact on the Deep Learning Field
  10. Criticisms and Limitations
  11. Conclusion
  12. Further Reading

1. Introduction

VGGNet, developed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (VGG) at the University of Oxford, was a milestone in the evolution of deep convolutional neural networks (CNNs). Introduced in 2014, VGGNet emphasized simplicity and depth, demonstrating that stacking small convolution filters could lead to powerful results.

The architecture was submitted to ILSVRC 2014, where it took second place in classification and first place in localization.


2. The Landscape Before VGG

Before VGG:
- AlexNet (2012) showed that deeper models could outperform traditional methods.
- Models like ZFNet used large filters and visualizations but were not very deep.
- The key open problem was how to train deep networks effectively.

VGGNet changed the paradigm: depth + small filters = better performance.


3. Overview of VGG Architecture

VGGNet introduced two main variants:
- VGG16: 16 weight layers (13 convolutional + 3 fully connected)
- VGG19: 19 weight layers (16 convolutional + 3 fully connected)

Design Philosophy

  • Use only 3×3 convolution filters (stride 1)
  • Use 2×2 max pooling (stride 2)
  • Double the number of filters after each pooling stage
  • End with 3 fully connected layers and a softmax

This modular design allows for easy extension, replication, and understanding.
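
A minimal sketch of one such block in PyTorch (the helper name `vgg_block` is ours; channel widths follow the table in Section 5):

```python
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, num_convs: int) -> nn.Sequential:
    """One VGG stage: `num_convs` 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))  # padding=1 keeps H and W
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))    # halves H and W
    return nn.Sequential(*layers)

# The filter count doubles after each pooling stage: 64 -> 128 -> 256 -> 512.
block = vgg_block(3, 64, num_convs=2)
print(block(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```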


4. Innovations and Key Features

✅ Small Filters, Deep Networks

  • Stacked multiple 3×3 convolutions instead of single large ones
  • Effective receptive field is equivalent to larger filters (e.g., two 3×3 = one 5×5) with fewer parameters; see the quick check below
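
A quick back-of-the-envelope check (plain Python; C is the channel width, assumed equal for input and output, biases ignored):

```python
C = 512  # channel width, e.g. the deepest VGG stages

two_3x3 = 2 * (3 * 3 * C * C)    # two stacked 3x3 convolutions
one_5x5 = 5 * 5 * C * C          # one 5x5 convolution, same receptive field
print(two_3x3, one_5x5)          # 4718592 vs 6553600: ~28% fewer weights

three_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 convs cover a 7x7 field
one_7x7 = 7 * 7 * C * C
print(three_3x3, one_7x7)        # 7077888 vs 12845056: ~45% fewer weights
```

The stacked version also interleaves extra ReLU non-linearities, which the paper cites as a second advantage.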

✅ Uniform Architecture

  • Repeating the same block structure made the network easy to implement and analyze

✅ Deeper is Better (up to a point)

  • Performance improved consistently as depth increased (up to 19 layers)

✅ No Local Response Normalization (LRN)

  • Unlike in AlexNet, LRN did not help and was excluded from the deeper variants

5. Detailed Layer-wise Architecture (VGG16)

| Layer   | Type            | Output Size | Filters / Units |
|---------|-----------------|-------------|-----------------|
| Input   | -               | 224×224×3   | -               |
| Conv1_1 | Convolution     | 224×224×64  | 3×3, stride 1   |
| Conv1_2 | Convolution     | 224×224×64  | 3×3             |
| Pool1   | Max Pooling     | 112×112×64  | 2×2             |
| Conv2_1 | Convolution     | 112×112×128 | 3×3             |
| Conv2_2 | Convolution     | 112×112×128 | 3×3             |
| Pool2   | Max Pooling     | 56×56×128   | 2×2             |
| Conv3_1 | Convolution     | 56×56×256   | 3×3             |
| Conv3_2 | Convolution     | 56×56×256   | 3×3             |
| Conv3_3 | Convolution     | 56×56×256   | 3×3             |
| Pool3   | Max Pooling     | 28×28×256   | 2×2             |
| Conv4_1 | Convolution     | 28×28×512   | 3×3             |
| Conv4_2 | Convolution     | 28×28×512   | 3×3             |
| Conv4_3 | Convolution     | 28×28×512   | 3×3             |
| Pool4   | Max Pooling     | 14×14×512   | 2×2             |
| Conv5_1 | Convolution     | 14×14×512   | 3×3             |
| Conv5_2 | Convolution     | 14×14×512   | 3×3             |
| Conv5_3 | Convolution     | 14×14×512   | 3×3             |
| Pool5   | Max Pooling     | 7×7×512     | 2×2             |
| FC6     | Fully Connected | 4096        | -               |
| FC7     | Fully Connected | 4096        | -               |
| FC8     | Fully Connected | 1000        | -               |
| Softmax | Classification  | 1000        | -               |
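
The table maps directly onto code. A compact sketch in PyTorch (the config-list style mirrors common open-source VGG implementations; 'M' marks a 2×2 max-pooling layer, and softmax is left to the loss function):

```python
import torch
import torch.nn as nn

# VGG16 ("configuration D"): integers are output channels, 'M' is max pooling.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

class VGG16(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        layers, in_ch = [], 3
        for v in VGG16_CFG:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = v
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # FC6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),         # FC7
            nn.Linear(4096, num_classes),                                          # FC8
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                   # (N, 512, 7, 7) for 224x224 input
        return self.classifier(torch.flatten(x, 1))

model = VGG16()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```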

6. Training Details

  • Dataset: ImageNet (~1.2M training images, 1000 categories)
  • Optimizer: SGD with momentum = 0.9
  • Weight decay: 5e-4
  • Learning rate: 0.01 initially, divided by 10 when validation accuracy stopped improving
  • Batch size: 256
  • Regularization: dropout (0.5) in the first two FC layers
  • Data augmentation:
      • Random cropping
      • Horizontal flipping
      • Scale jittering (explained below)
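
These settings translate into a short setup. A sketch in PyTorch, reusing the `VGG16` class from the Section 5 sketch; `train_loader` is an assumed DataLoader of (image, label) batches, and the paper's "divide the LR by 10 when validation accuracy plateaus" rule is approximated with `ReduceLROnPlateau`:

```python
import torch
import torch.nn as nn

model = VGG16()  # the sketch from Section 5 (or torchvision's vgg16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                       factor=0.1)  # LR / 10
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

for images, labels in train_loader:  # batches of 256 (assumed DataLoader)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

# Once per epoch: scheduler.step(val_accuracy) triggers the LR reduction.
```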

7. Key Experimental Insights

The authors conducted many experiments. Here are the most important takeaways:

🔸 Local Response Normalization (LRN) Does Not Help

  • Adding LRN (as in AlexNet) in configuration A (A-LRN) did not improve performance.
  • LRN was dropped in the deeper configurations (B–E).

🔸 Depth Helps Significantly

  • Increasing depth from 11 layers (A) to 19 layers (E) consistently reduced classification error.
  • Performance saturated at 19 layers, though deeper models may still help on larger datasets.

🔸 1×1 vs. 3×3 Filters

  • Configuration C used 1×1 convolutions, while D used all 3×3.
  • Even at the same depth, D outperformed C.
  • Conclusion: extra non-linearity helps, but capturing spatial context via 3×3 filters matters more.

🔸 Shallow vs. Deep with the Same Receptive Field

  • A shallow variant of B replaced each pair of 3×3 layers with one 5×5 layer (same receptive field).
  • The shallow net's top-1 error was 7% higher than B's.
  • This shows that deep networks with small filters learn better than shallow networks with large filters.

🔸 Scale Jittering Boosts Accuracy

  • Instead of training at a fixed scale (e.g., S = 256), the smallest image side S was jittered in [256, 512].
  • Even when testing at a single scale, jittered training led to significantly better performance.
  • This confirms the power of multi-scale data augmentation.
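
Scale jittering is straightforward to reproduce with standard transforms. A sketch using torchvision (the `ScaleJitter` class is ours; it rescales the smallest side to a random S per image before the usual 224×224 crop):

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as F

class ScaleJitter:
    """Rescale the smallest image side to a random S in [smin, smax]."""
    def __init__(self, smin: int = 256, smax: int = 512):
        self.smin, self.smax = smin, smax

    def __call__(self, img):
        return F.resize(img, random.randint(self.smin, self.smax))

train_transform = transforms.Compose([
    ScaleJitter(256, 512),             # multi-scale training ("scale jittering")
    transforms.RandomCrop(224),        # fixed 224x224 network input
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```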

8. Performance and Results

VGG16 and VGG19 performed exceptionally well on ILSVRC 2014.

| Model   | Top-5 Error Rate |
|---------|------------------|
| VGG16   | 7.3%             |
| VGG19   | 7.5%             |
| AlexNet | 15.3%            |

VGGNet also performed well in localization and transferred effectively to other tasks (like object detection and segmentation).


9. Impact on the Deep Learning Field

VGGNet had a major influence:
- Became the go-to feature extractor for transfer learning (a minimal example follows this list)
- Inspired modular CNN architectures
- Used in Fast R-CNN, Style Transfer, and more
- Set a new standard for depth and simplicity
- Still used as a baseline in academic papers
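
A minimal example of that transfer-learning use, assuming a recent torchvision:

```python
import torch
from torchvision import models

# Load ImageNet-pretrained VGG16 and freeze it as a fixed feature extractor.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in vgg.parameters():
    p.requires_grad = False

# Swap the final layer (FC8) for a new task, e.g. 10 classes; train only this layer.
vgg.classifier[6] = torch.nn.Linear(4096, 10)

conv_features = vgg.features(torch.randn(1, 3, 224, 224))  # (1, 512, 7, 7)
```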


10. Criticisms and Limitations

| Limitation              | Detail                                   |
|-------------------------|------------------------------------------|
| Large model size        | ~138M parameters for VGG16               |
| High memory requirement | Not ideal for edge or mobile deployment  |
| No BatchNorm            | Later models such as ResNet added this   |
| Slow training           | Training takes weeks on multi-GPU setups |
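
The ~138M figure is easy to verify from the Section 5 table; nearly 90% of the parameters sit in the three fully connected layers (plain Python):

```python
# (in_channels, out_channels) for the 13 convolutional layers of VGG16.
convs = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256),
         (256, 256), (256, 512), (512, 512), (512, 512), (512, 512),
         (512, 512), (512, 512)]
conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in convs)  # + biases
fc_params = ((512 * 7 * 7) * 4096 + 4096   # FC6
             + 4096 * 4096 + 4096          # FC7
             + 4096 * 1000 + 1000)         # FC8
print(conv_params, fc_params, conv_params + fc_params)
# 14714688 123642856 138357544  (~138.4M total)
```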

11. Conclusion

VGGNet proved that simplicity (uniform architecture) and depth (more layers) are powerful ingredients in CNN design. Though newer architectures are more efficient, VGG remains a classic due to its elegance and effectiveness. Its influence is still seen in modern vision models today.


12. Further Reading