VGGNet: Simplicity and Depth in CNN Design
Table of Contents
- Introduction
- The Landscape Before VGG
- Overview of VGG Architecture
- Innovations and Key Features
- Detailed Layer-wise Architecture (VGG16)
- Training Details
- Key Experimental Insights
- Performance and Results
- Impact on the Deep Learning Field
- Criticisms and Limitations
- Conclusion
1. Introduction
VGGNet, developed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (VGG) at the University of Oxford, was a milestone in the evolution of deep convolutional neural networks (CNNs). Introduced in 2014, VGGNet emphasized simplicity and depth, demonstrating that stacking small convolution filters could lead to powerful results.
The architecture was submitted to ILSVRC 2014, where it took second place in the classification task and first place in localization.
2. The Landscape Before VGG
Before VGG:
- AlexNet (2012) showed that deeper models could outperform traditional methods.
- Models like ZFNet (2013) refined AlexNet with visualization-guided tuning but still used relatively large filters (7×7) and remained shallow.
- The key limitation was training deep networks effectively.
VGGNet changed the paradigm: depth + small filters = better performance.
3. Overview of VGG Architecture
VGGNet introduced two main variants:
- VGG16: 16 weight layers
- VGG19: 19 weight layers
Design Philosophy
- Use only 3×3 convolution filters
- Use 2×2 max pooling
- Double the number of filters after each pooling stage (64 → 128 → 256 → 512)
- End with 3 fully connected layers and softmax
This modular design allows for easy extension, replication, and understanding.
4. Innovations and Key Features
✅ Small Filters, Deep Networks
- Stacked multiple 3×3 convolutions instead of a single large filter
- The effective receptive field matches that of larger filters (two 3×3 ≈ one 5×5, three 3×3 ≈ one 7×7) while using fewer parameters and adding extra non-linearities, as the sketch below shows
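A quick back-of-the-envelope comparison in Python (illustrative only; `C` is the channel count, assumed equal for input and output, biases ignored):

```python
# Parameter count of a stack of square convolutions with C input
# and C output channels (weights only, no biases).
def conv_params(kernel, channels, layers=1):
    return layers * (kernel * kernel * channels * channels)

C = 256
print(conv_params(3, C, layers=2))  # two 3x3 convs:   18*C^2 = 1,179,648
print(conv_params(5, C, layers=1))  # one 5x5 conv:    25*C^2 = 1,638,400
print(conv_params(3, C, layers=3))  # three 3x3 convs: 27*C^2 = 1,769,472
print(conv_params(7, C, layers=1))  # one 7x7 conv:    49*C^2 = 3,211,264
```

Three stacked 3×3 layers thus cover a 7×7 receptive field with roughly 45% fewer parameters, while inserting two extra ReLU non-linearities along the way.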
✅ Uniform Architecture
- Repeating the same block structure made the network easy to implement and analyze
✅ Deeper is Better (up to a point)
- Performance improved consistently as depth increased (up to 19 layers)
✅ No Local Response Normalization (LRN)
- Unlike AlexNet, VGG found that LRN did not improve results, so it was excluded from the deeper variants
5. Detailed Layer-wise Architecture (VGG16)
All convolutions use 3×3 kernels with stride 1 and padding 1 (which preserves spatial size between poolings); all max-pooling layers use 2×2 windows with stride 2.

| Layer | Type | Output Size | Filters / Units |
| --- | --- | --- | --- |
| Input | - | 224×224×3 | - |
| Conv1_1 | Convolution | 224×224×64 | 64 filters, 3×3 |
| Conv1_2 | Convolution | 224×224×64 | 64 filters, 3×3 |
| Pool1 | Max Pooling | 112×112×64 | 2×2, stride 2 |
| Conv2_1 | Convolution | 112×112×128 | 128 filters, 3×3 |
| Conv2_2 | Convolution | 112×112×128 | 128 filters, 3×3 |
| Pool2 | Max Pooling | 56×56×128 | 2×2, stride 2 |
| Conv3_1 | Convolution | 56×56×256 | 256 filters, 3×3 |
| Conv3_2 | Convolution | 56×56×256 | 256 filters, 3×3 |
| Conv3_3 | Convolution | 56×56×256 | 256 filters, 3×3 |
| Pool3 | Max Pooling | 28×28×256 | 2×2, stride 2 |
| Conv4_1 | Convolution | 28×28×512 | 512 filters, 3×3 |
| Conv4_2 | Convolution | 28×28×512 | 512 filters, 3×3 |
| Conv4_3 | Convolution | 28×28×512 | 512 filters, 3×3 |
| Pool4 | Max Pooling | 14×14×512 | 2×2, stride 2 |
| Conv5_1 | Convolution | 14×14×512 | 512 filters, 3×3 |
| Conv5_2 | Convolution | 14×14×512 | 512 filters, 3×3 |
| Conv5_3 | Convolution | 14×14×512 | 512 filters, 3×3 |
| Pool5 | Max Pooling | 7×7×512 | 2×2, stride 2 |
| FC6 | Fully Connected | 4096 | 4096 units |
| FC7 | Fully Connected | 4096 | 4096 units |
| FC8 | Fully Connected | 1000 | 1000 units |
| Softmax | Classification | 1000 | - |
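For concreteness, here is a minimal PyTorch sketch of this architecture (an illustration, not the authors' original implementation; padding of 1 is what keeps the spatial size unchanged between poolings):

```python
import torch
import torch.nn as nn

# VGG16 configuration: numbers are output channels, 'M' is 2x2 max pooling.
CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        layers, in_ch = [], 3
        for v in CFG:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = v
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)  # -> (N, 512, 7, 7) for a 224x224 input
        return self.classifier(torch.flatten(x, 1))

model = VGG16()
print(sum(p.numel() for p in model.parameters()))  # ~138M parameters
```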
6. Training Details
- Dataset: ImageNet 1.2M images, 1000 categories
- Optimizer: SGD, momentum = 0.9
- Weight Decay: 5e-4
- Learning Rate: 0.01 initially, divided by 10 when validation accuracy plateaued
- Batch Size: 256
- Regularization: Dropout in FC layers
- Data Augmentation:
- Random cropping
- Horizontal flipping
- Scale jittering (explained below)
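These hyperparameters translate directly into a modern framework. A hedged PyTorch sketch (the original was trained in Caffe; `model` refers to the VGG16 listing above):

```python
import torch

# SGD with the paper's settings: momentum 0.9, weight decay 5e-4,
# initial learning rate 0.01, divided by 10 when validation accuracy plateaued.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

# The paper's manual schedule can be approximated with a plateau scheduler;
# mode='max' because validation accuracy is monitored (scheduler.step(val_acc)).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=2)
```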
7. Key Experimental Insights
The authors conducted many experiments. Here are the most important takeaways:
🔸 Local Response Normalization (LRN) Does Not Help
- Adding LRN (as in AlexNet) in configuration A (A-LRN) did not improve performance.
- LRN was dropped in deeper configurations (B–E).
🔸 Depth Helps Significantly
- Increasing depth from 11 layers (A) to 19 layers (E) consistently reduced classification error.
- Performance saturated at 19 layers in these experiments; the authors note that deeper models may still help on larger datasets.
🔸 1×1 vs. 3×3 Filters
- Configuration C added 1×1 convolutions in some of its extra layers, while D used 3×3 throughout.
- Even at the same depth, D outperformed C.
- Conclusion: the added non-linearity helps, but capturing spatial context via 3×3 filters matters more.
🔸 Shallow vs. Deep with Same Receptive Field
- A shallow variant of B replaced each pair of 3×3 layers with a single 5×5 layer (preserving the receptive field).
- The shallow net's top-1 error was 7% higher.
- Shows deep networks with small filters learn better than shallow networks with large filters.
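The receptive-field equivalence is easy to verify with a few lines of arithmetic (an illustrative helper, not code from the paper):

```python
# Effective receptive field of a stack of (kernel, stride) conv layers.
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # stride compounds across layers
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5 -> same as one 5x5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7 -> same as one 7x7
print(receptive_field([(5, 1)]))                  # 5
```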
🔸 Scale Jittering Boosts Accuracy
- Instead of training at a fixed scale (e.g., S = 256, where S is the shorter side of the rescaled image), they sampled S ∈ [256, 512] per training image (sketched after this list).
- Even when testing on a single scale, jittered training led to significantly better performance.
- Confirms the power of multi-scale data augmentation.
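One way to render this augmentation with torchvision (a sketch of the paper's recipe: rescale so the shorter side equals a randomly drawn S, then take a random 224×224 crop; the class name here is our own):

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class RandomShorterSideResize:
    """Resize so the shorter side equals a random S drawn per image."""
    def __init__(self, s_min=256, s_max=512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)
        return TF.resize(img, s)  # int size -> shorter side becomes s

train_transform = transforms.Compose([
    RandomShorterSideResize(256, 512),  # scale jittering: S ~ U[256, 512]
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```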
8. Performance and Results
VGG16 and VGG19 performed exceptionally well on ILSVRC 2014.
| Model | Top-5 Error Rate |
| --- | --- |
| VGG16 | 7.3% |
| VGG19 | 7.5% |
| AlexNet | 15.3% |
VGGNet also performed well in localization and transferred effectively to other tasks (like object detection and segmentation).
9. Impact on the Deep Learning Field
VGGNet had a major influence:
- Became the go-to feature extractor for transfer learning
- Inspired modular CNN architectures
- Served as the backbone for Fast R-CNN, neural style transfer, and more
- Set a new standard for depth and simplicity
- Still used as a baseline in academic papers
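As an example of the transfer-learning pattern, a sketch using torchvision's pretrained weights (the `VGG16_Weights` enum assumes a recent torchvision, ≥ 0.13; the 10-class head is a hypothetical target task):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained VGG16 and reuse its convolutional features.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the feature extractor and swap the final classifier layer.
for p in vgg.features.parameters():
    p.requires_grad = False
vgg.classifier[6] = nn.Linear(4096, 10)  # hypothetical 10-class task

x = torch.randn(1, 3, 224, 224)
print(vgg(x).shape)  # torch.Size([1, 10])
```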
10. Criticisms and Limitations
| Limitation | Detail |
| --- | --- |
| Large model size | ~138M parameters for VGG16 |
| High memory requirement | Not well suited to edge or mobile deployment |
| No BatchNorm | Later models like ResNet added this |
| Slow training | Training took weeks even on multi-GPU setups |
11. Conclusion
VGGNet proved that simplicity (uniform architecture) and depth (more layers) are powerful ingredients in CNN design. Though newer architectures are more efficient, VGG remains a classic due to its elegance and effectiveness. Its influence is still seen in modern vision models today.