
VGGNet: Simplicity and Depth in CNN Design

Table of Contents

  1. Introduction
  2. The Landscape Before VGG
  3. Overview of VGG Architecture
  4. Innovations and Key Features
  5. Detailed Layer-wise Architecture (VGG16)
  6. Training Details
  7. Key Experimental Insights
  8. Performance and Results
  9. Impact on the Deep Learning Field
  10. Criticisms and Limitations
  11. Conclusion

1. Introduction

VGGNet, developed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (VGG) at the University of Oxford, was a milestone in the evolution of deep convolutional neural networks (CNNs). Introduced in 2014, VGGNet emphasized simplicity and depth, demonstrating that stacking small convolution filters could lead to powerful results.

The architecture was submitted to ILSVRC 2014, where it finished second in the classification task and first in the localization task.


2. The Landscape Before VGG

Before VGG:

  • AlexNet (2012) showed that deep CNNs could decisively outperform traditional computer-vision methods.
  • Models like ZFNet used large filters and visualizations but were not very deep.
  • The key limitation was training deep networks effectively.

VGGNet changed the paradigm: depth + small filters = better performance.


3. Overview of VGG Architecture

VGGNet introduced two main variants:

  • VGG16: 16 weight layers (13 convolutional + 3 fully connected)
  • VGG19: 19 weight layers (16 convolutional + 3 fully connected)

Design Philosophy

  • Use only 3×3 convolution filters
  • Use 2×2 max pooling
  • Double the number of channels after each pooling stage (64 → 128 → 256 → 512)
  • End with 3 fully connected layers and softmax

This modular design allows for easy extension, replication, and understanding.
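
A minimal sketch of this philosophy in PyTorch (my illustration, not the authors' original code): each block stacks 3×3 convolutions, halves the spatial resolution with 2×2 max pooling, and doubles the channel count.

```python
import torch.nn as nn

def vgg_block(in_channels: int, out_channels: int, num_convs: int) -> nn.Sequential:
    """One VGG-style block: 3x3 convs (padding 1 preserves size), then 2x2 pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves height and width
    return nn.Sequential(*layers)

# Doubling channels while halving resolution:
stage1 = vgg_block(3, 64, num_convs=2)    # 224x224x3  -> 112x112x64
stage2 = vgg_block(64, 128, num_convs=2)  # 112x112x64 -> 56x56x128
```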


4. Innovations and Key Features

✅ Small Filters, Deep Networks

  • Stacked multiple 3×3 convolutions instead of single large ones
  • Effective receptive field is equivalent to larger filters (e.g., two 3×3 = one 5×5; see the sketch below)
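
This equivalence is easy to check empirically. The toy PyTorch snippet below (my illustration) pushes a 5×5 patch through two unpadded 3×3 convolutions and through one 5×5 convolution; both collapse it to a single output value, i.e., both see the full patch.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)  # one 5x5 single-channel patch

# Two stacked 3x3 convs without padding: 5x5 -> 3x3 -> 1x1
stacked = nn.Sequential(nn.Conv2d(1, 1, kernel_size=3),
                        nn.Conv2d(1, 1, kernel_size=3))
# One 5x5 conv without padding: 5x5 -> 1x1
single = nn.Conv2d(1, 1, kernel_size=5)

print(stacked(x).shape)  # torch.Size([1, 1, 1, 1])
print(single(x).shape)   # torch.Size([1, 1, 1, 1])
```

Both paths cover the same 5×5 region, but the stacked version inserts an extra non-linearity and uses fewer weights.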

✅ Uniform Architecture

  • Repeating the same block structure made the network easy to implement and analyze

✅ Deeper is Better (up to a point)

  • Performance improved consistently as depth increased (up to 19 layers)

✅ No Local Response Normalization (LRN)

  • Unlike in AlexNet, LRN did not improve results and was excluded from the deeper configurations

5. Detailed Layer-wise Architecture (VGG16)

| Layer   | Type            | Output Size | Filters / Units |
|---------|-----------------|-------------|-----------------|
| Input   | -               | 224×224×3   | -               |
| Conv1_1 | Convolution     | 224×224×64  | 3×3, stride 1   |
| Conv1_2 | Convolution     | 224×224×64  | 3×3             |
| Pool1   | Max Pooling     | 112×112×64  | 2×2             |
| Conv2_1 | Convolution     | 112×112×128 | 3×3             |
| Conv2_2 | Convolution     | 112×112×128 | 3×3             |
| Pool2   | Max Pooling     | 56×56×128   | 2×2             |
| Conv3_1 | Convolution     | 56×56×256   | 3×3             |
| Conv3_2 | Convolution     | 56×56×256   | 3×3             |
| Conv3_3 | Convolution     | 56×56×256   | 3×3             |
| Pool3   | Max Pooling     | 28×28×256   | 2×2             |
| Conv4_1 | Convolution     | 28×28×512   | 3×3             |
| Conv4_2 | Convolution     | 28×28×512   | 3×3             |
| Conv4_3 | Convolution     | 28×28×512   | 3×3             |
| Pool4   | Max Pooling     | 14×14×512   | 2×2             |
| Conv5_1 | Convolution     | 14×14×512   | 3×3             |
| Conv5_2 | Convolution     | 14×14×512   | 3×3             |
| Conv5_3 | Convolution     | 14×14×512   | 3×3             |
| Pool5   | Max Pooling     | 7×7×512     | 2×2             |
| FC6     | Fully Connected | 4096        | -               |
| FC7     | Fully Connected | 4096        | -               |
| FC8     | Fully Connected | 1000        | -               |
| Softmax | Classification  | 1000        | -               |
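
The entire table can be generated from a short configuration list, in the spirit of (but simplified from) torchvision's implementation; the final print is a sanity check of the famous ~138M parameter count:

```python
import torch.nn as nn

# 'M' marks a 2x2 max-pooling layer; numbers are output channels of 3x3 convs.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

model = nn.Sequential(
    make_features(VGG16_CFG),
    nn.Flatten(),                                   # 7x7x512 -> 25088
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),   # class logits; softmax is folded into the loss
)
print(sum(p.numel() for p in model.parameters()))   # 138,357,544 ≈ 138M
```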

6. Training Details

  • Dataset: ImageNet (ILSVRC), ~1.3M training images across 1,000 categories
  • Optimizer: SGD, momentum = 0.9
  • Weight Decay: 5e-4
  • Learning Rate: 0.01 initially, divided by 10 when validation accuracy stopped improving
  • Batch Size: 256
  • Regularization: Dropout (ratio 0.5) on the first two FC layers
  • Data Augmentation:
    • Random cropping
    • Horizontal flipping
    • Scale jittering (explained below)
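
Assembled as code, the recipe might look like this (a sketch in PyTorch; the `nn.Linear` stand-in and the `ReduceLROnPlateau` scheduler are my substitutions, since the paper reduced the rate manually):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in; substitute the full VGG16 model

# Hyperparameters as reported in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

# The authors divided the rate by 10 when validation accuracy plateaued;
# ReduceLROnPlateau is a close modern equivalent to that manual schedule.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

# Per epoch: train, evaluate, then call scheduler.step(val_accuracy).
```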

7. Key Experimental Insights

The authors conducted many experiments. Here are the most important takeaways:

🔸 Local Response Normalization (LRN) Does Not Help

  • Adding LRN (as in AlexNet) in configuration A (A-LRN) did not improve performance.
  • LRN was dropped in deeper configurations (B–E).

🔸 Depth Helps Significantly

  • Increasing depth from 11 layers (A) to 19 layers (E) consistently reduced classification error.
  • Performance saturates at 19 layers; deeper models may still help on larger datasets.

🔸 1×1 vs. 3×3 Filters

  • Configuration C had 1×1 convolutions, while D used all 3×3.
  • Even with the same depth, D outperformed C.
  • Conclusion: Non-linearity helps, but capturing spatial context via 3×3 filters is more important.

🔸 Shallow vs. Deep with Same Receptive Field

  • A shallow variant of B replaced two 3×3 layers with one 5×5 layer (same receptive field).
  • The shallow net had 7% higher top-1 error.
  • Shows deep networks with small filters learn better than shallow networks with large filters.
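
Part of the paper's explanation is parameter count: for C input and output channels, two 3×3 layers cost 18C² weights while one 5×5 layer costs 25C², yet both cover the same receptive field. A quick illustrative check in Python:

```python
C = 512  # channels in, e.g., a conv5-stage layer
two_3x3 = 2 * (3 * 3 * C * C)  # 4,718,592 weights, with two ReLUs
one_5x5 = 5 * 5 * C * C        # 6,553,600 weights, with one ReLU
print(one_5x5 / two_3x3)       # ~1.39: ~39% more weights for the same field
```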

🔸 Scale Jittering Boosts Accuracy

  • Instead of training on fixed image sizes (e.g., S=256), they used jittered S ∈ [256, 512].
  • Even when testing on a single scale, jittered training led to significantly better performance.
  • Confirms the power of multi-scale data augmentation.
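
A possible implementation of that augmentation with torchvision (my sketch of the paper's procedure; the `ScaleJitter` helper is hypothetical, not a library class):

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as F

class ScaleJitter:
    """Rescale the image so its shorter side is a random S in [smin, smax]."""
    def __init__(self, smin=256, smax=512):
        self.smin, self.smax = smin, smax

    def __call__(self, img):
        return F.resize(img, random.randint(self.smin, self.smax))

train_transform = transforms.Compose([
    ScaleJitter(),                      # random training scale S
    transforms.RandomCrop(224),         # fixed-size crop for the network
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```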

8. Performance and Results

VGG16 and VGG19 performed exceptionally well on ILSVRC 2014.

| Model   | Top-5 Error Rate |
|---------|------------------|
| VGG16   | 7.3%             |
| VGG19   | 7.5%             |
| AlexNet | 15.3%            |

VGGNet also performed well in localization and transferred effectively to other tasks (like object detection and segmentation).


9. Impact on the Deep Learning Field

VGGNet had a major influence:

  • Became the go-to feature extractor for transfer learning
  • Inspired modular CNN architectures
  • Used in Fast R-CNN, Style Transfer, and more
  • Set a new standard for depth and simplicity
  • Still used as a baseline in academic papers
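
A typical transfer-learning pattern looks like this (a sketch assuming torchvision ≥ 0.13 for the weights enum; the 10-class head is a placeholder for whatever downstream task you have):

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained VGG16 and freeze its convolutional features.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False

# Replace the final 1000-way FC layer with a head for a new 10-class task.
model.classifier[6] = nn.Linear(4096, 10)
```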

10. Criticisms and Limitations

| Limitation              | Detail                                    |
|-------------------------|-------------------------------------------|
| Large Model Size        | ~138M parameters for VGG16                |
| High Memory Requirement | Not ideal for edge or mobile deployment   |
| No BatchNorm            | Later models like ResNet added this       |
| Slow Training           | Training takes weeks on multi-GPU setups  |

11. Conclusion

VGGNet proved that simplicity (uniform architecture) and depth (more layers) are powerful ingredients in CNN design. Though newer architectures are more efficient, VGG remains a classic due to its elegance and effectiveness. Its influence is still seen in modern vision models today.


This post is licensed under CC BY 4.0 by the author.