
4. GoogLeNet / InceptionNet

Table of Contents

  1. Introduction
  2. Historical Context
  3. Motivation Behind Inception Architecture
  4. GoogLeNet Architecture Overview
  5. The Inception Module
  6. Layer-wise Architecture of GoogLeNet
  7. Design Innovations and Rationale
  8. Training Details
  9. Key Takeaways and Performance
  10. Limitations and Challenges
  11. Conclusion
  12. Further Reading

1. Introduction

GoogLeNet, introduced in 2014 by Szegedy et al. in the paper “Going Deeper with Convolutions”, was the winning model of ILSVRC 2014 (ImageNet Large Scale Visual Recognition Challenge). It marked a leap forward in convolutional network design by introducing the Inception module, which allowed the network to go "deeper" and "wider" without significantly increasing computational cost.

It achieved a Top-5 error rate of 6.67%, beating architectures like VGGNet and showcasing efficient utilization of compute.


2. Historical Context

  • Deep learning was gaining momentum after AlexNet (2012) and ZFNet (2013).
  • VGGNet in 2014 demonstrated performance through depth and simplicity.
  • However, very deep networks led to:
      • Overfitting
      • High memory/computation cost

GoogLeNet’s innovation was in efficient deepening of networks using a modular, multi-path design — the Inception module.


3. Motivation Behind Inception Architecture

The paper was driven by the question:

“What is the optimal local sparse structure in a convolutional vision network, and how can it be approximated efficiently?”

Key motivations:
- Use multiple filter sizes at the same layer to capture features at different scales.
- Keep the overall computational budget roughly constant even as depth and width grow.
- Replace naive stacking of layers with carefully designed modules.


4. GoogLeNet Architecture Overview

  • Total depth: 22 layers with learnable parameters (27 if pooling layers are counted)
  • Parameters: ~5 million (roughly 27× fewer than VGG-16’s 138M)
  • Core building block: the Inception module
  • Includes auxiliary classifiers to mitigate vanishing gradients

Architecture Highlights:

  • Alternates between Inception modules and MaxPooling
  • Ends with global average pooling instead of fully connected layers
  • Includes 2 auxiliary classifiers acting as regularizers and intermediate supervision

5. The Inception Module

Each Inception module runs four parallel convolution and pooling paths, whose outputs are concatenated along the channel dimension:

Path | Operation
-----|-----------------------------------
1    | 1×1 convolution
2    | 1×1 convolution → 3×3 convolution
3    | 1×1 convolution → 5×5 convolution
4    | 3×3 max pooling → 1×1 convolution

Why 1×1 Convolutions?

  • Used for dimension reduction (bottleneck layers)
  • Reduce number of input channels before expensive 3×3 or 5×5 filters
  • Add non-linearity and depth

→ Efficient deep feature extraction with low cost
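
The savings are easy to quantify: a direct 5×5 convolution from 192 to 32 channels needs 5·5·192·32 ≈ 154K weights, while inserting a 1×1 bottleneck down to 16 channels first needs only 1·1·192·16 + 5·5·16·32 ≈ 16K — roughly a 10× reduction. Below is a minimal PyTorch sketch of the module; the channel counts follow Inception 3a from the paper, while the class name and layer layout are our own:

```python
# Minimal sketch of an Inception module (channel sizes match Inception 3a).
import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    def __init__(self, in_ch, ch1x1, ch3x3_red, ch3x3, ch5x5_red, ch5x5, pool_proj):
        super().__init__()
        # Path 1: 1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, ch1x1, kernel_size=1), nn.ReLU(inplace=True))
        # Path 2: 1x1 bottleneck -> 3x3 convolution
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, ch3x3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3_red, ch3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        # Path 3: 1x1 bottleneck -> 5x5 convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, ch5x5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5_red, ch5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        # Path 4: 3x3 max pooling -> 1x1 projection
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate all four paths along the channel dimension
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)


# Inception 3a: 192 -> 64 + 128 + 32 + 32 = 256 channels, spatial size unchanged
x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192, 64, 96, 128, 16, 32, 32)(x).shape)  # [1, 256, 28, 28]
```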


6. Layer-wise Architecture of GoogLeNet

Stage             | Layer                      | Output Size
------------------|----------------------------|--------------------------
Input             | —                          | 224×224×3
Conv1             | 7×7 conv, stride 2         | 112×112×64
MaxPool1          | 3×3, stride 2              | 56×56×64
Conv2             | 1×1 conv → 3×3 conv        | 56×56×192
MaxPool2          | 3×3, stride 2              | 28×28×192
Inception (3a–3b) | Multiple parallel filters  | 28×28×256 → 28×28×480
MaxPool3          | 3×3, stride 2              | 14×14×480
Inception (4a–4e) | Deeper modules             | up to 14×14×832
MaxPool4          | 3×3, stride 2              | 7×7×832
Inception (5a–5b) | Final Inception blocks     | 7×7×1024
GlobalAvgPool     | Avg pool over 7×7          | 1×1×1024
Dropout           | p = 0.4 (40%)              | 1×1×1024
Linear            | Fully connected → Softmax  | 1000 classes

📌 Auxiliary Classifiers are added after Inception 4a and 4d
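
To see where the early output sizes in the table come from, here is a short PyTorch sketch of the stem (Conv1 through MaxPool2). The padding and ceil_mode choices are assumptions picked to reproduce the paper’s numbers:

```python
# Sketch of the GoogLeNet stem, tracing the output sizes from the table above.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # Conv1    -> 112×112×64
    nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # MaxPool1 -> 56×56×64
    nn.Conv2d(64, 64, kernel_size=1),                       # Conv2 reduce (1×1)
    nn.Conv2d(64, 192, kernel_size=3, padding=1),           # Conv2    -> 56×56×192
    nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # MaxPool2 -> 28×28×192
)

x = torch.randn(1, 3, 224, 224)
for layer in stem:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
```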


7. Design Innovations and Rationale

🔹 Inception Modules

  • Multi-scale processing in parallel
  • Efficient parameter usage via 1×1 conv
  • Inspired by the Network-in-Network approach (Lin et al., 2013)

🔹 Global Average Pooling

  • Reduces risk of overfitting from fully connected layers
  • Encourages feature-to-class correspondence
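
A minimal sketch of the classifier head built this way — the 40% dropout rate is from the paper, and the layer sizes follow the table in Section 6:

```python
# Global average pooling collapses each 7×7 map to one value, so the final
# classifier is just 1024 -> 1000 (~1M weights) instead of flattening
# 7×7×1024 = 50,176 inputs into a large fully connected stack.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # 7×7×1024 -> 1×1×1024
    nn.Flatten(),
    nn.Dropout(p=0.4),         # 40% dropout, as in the paper
    nn.Linear(1024, 1000),     # softmax is applied inside the loss
)
print(head(torch.randn(2, 1024, 7, 7)).shape)  # torch.Size([2, 1000])
```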

🔹 Auxiliary Classifiers

  • Help mitigate vanishing gradients
  • Provide regularization
  • Only used during training, not inference
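
A hedged sketch of how the three losses combine during training — the 0.3 weight on the auxiliary heads is the value reported in the paper, while the function and argument names here are hypothetical:

```python
# Combined GoogLeNet training loss; auxiliary heads are dropped at inference.
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets):
    loss_main = F.cross_entropy(main_logits, targets)
    loss_aux1 = F.cross_entropy(aux1_logits, targets)
    loss_aux2 = F.cross_entropy(aux2_logits, targets)
    # Auxiliary classifiers contribute with weight 0.3 (per the paper)
    return loss_main + 0.3 * (loss_aux1 + loss_aux2)
```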

🔹 Fewer Parameters

  • ~5M compared to VGG-16’s 138M
  • Efficient yet accurate

8. Training Details

  • Dataset: ImageNet (ILSVRC 2014)
  • Data Augmentation:
      • Random crops (224×224)
      • Random horizontal flips
      • Photometric distortions
  • Optimizer: SGD with momentum (0.9)
  • Loss Function: Softmax cross-entropy + weighted auxiliary classifier losses
  • Regularization:
      • Dropout (40%)
      • L2 weight decay
  • Batch Size: ~32–128 depending on GPU memory
  • Training Time: Several days on multiple GPUs
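
Putting these details together, a rough training configuration might look like the following; the exact learning rate, weight decay, and jitter strengths are assumptions, and the model is pulled from torchvision for convenience:

```python
# Illustrative training setup; hyperparameter values here are assumptions,
# not the paper's exact schedule.
import torch
from torchvision import models, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random 224×224 crops
    transforms.RandomHorizontalFlip(),       # random horizontal flips
    transforms.ColorJitter(0.4, 0.4, 0.4),   # photometric distortions
    transforms.ToTensor(),
])

# torchvision's GoogLeNet keeps both auxiliary heads when aux_logits=True
model = models.googlenet(weights=None, aux_logits=True, init_weights=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,      # assumed LR
                            momentum=0.9, weight_decay=1e-4)  # L2 weight decay
```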

9. Key Takeaways and Performance

Feature                      | Impact
-----------------------------|------------------------------------------
✅ Inception modules         | Efficient deep computation
✅ Auxiliary classifiers     | Improved gradient flow
✅ Global average pooling    | Reduced overfitting
✅ Smart filter design       | Multi-scale feature extraction
✅ State-of-the-art accuracy | 6.67% Top-5 error (ILSVRC 2014)

10. Limitations and Challenges

Issue                       | Explanation
----------------------------|-----------------------------------------------------------------
❌ Complex architecture     | The Inception module is harder to design and tune by hand
❌ Handcrafted filter paths | Later refined in Inception-v2/v3/v4 and automated by NAS (e.g., NASNet)
❌ Not fully modular        | Still makes specific assumptions about input size and filter types
❌ Gradient flow            | Depth still requires auxiliary classifiers to train well

11. Conclusion

GoogLeNet / InceptionNet brought a new way of thinking: not just stacking layers deeper, but designing smarter modules.

With the Inception module, it offered:
- Depth
- Width
- Multi-scale feature learning
- Parameter efficiency

GoogLeNet laid the foundation for further architectures like Inception-v3, v4, Xception, and NASNet. It was an early proof that carefully designed networks can outperform deeper or wider brute-force models.

🎯 “Going deeper with convolutions” wasn’t just a paper title — it was a revolution in CNN design.


12. Further Reading