
ResNet: Solving the Degradation Problem with Skip Connections

Table of Contents

  1. Introduction
  2. The Problem of Depth in Neural Networks
  3. What is Residual Learning?
  4. The ResNet Architecture
  5. Residual Blocks Explained
  6. ResNet Variants and Depths
  7. Design Principles and Insights
  8. Training Details
  9. Performance and Impact
  10. Limitations and Later Evolutions
  11. Conclusion

1. Introduction

ResNet, short for Residual Network, was introduced by Microsoft Research in the landmark paper “Deep Residual Learning for Image Recognition” (He et al., 2015). It solved a long-standing problem in deep learning — training very deep networks.

With architectures like ResNet-50, ResNet-101, and ResNet-152, it achieved unprecedented depth and accuracy on ImageNet, while still being easy to optimize.

Key Innovation: Skip connections (or residual connections) that allow gradients to flow directly across layers.


2. The Problem of Depth in Neural Networks

As researchers attempted to build deeper convolutional neural networks, they ran into issues like:

  • Vanishing gradients during backpropagation
  • Degradation problem: Accuracy got worse as layers increased beyond a point, even with batch normalization

This was surprising: a deeper network should be at least as expressive as a shallower one, since its extra layers could simply learn identity mappings. Notably, the degradation showed up in training error as well, so the issue was optimization difficulty, not overfitting; deeper vanilla CNNs were simply harder to train effectively.

🧐 Observation: Simply adding more layers degraded performance rather than improved it.


3. What is Residual Learning?

Instead of having each stack of layers learn an entirely new mapping, ResNet has it learn a residual function relative to its input.

Formally:

If the desired underlying mapping is $H(x)$, we let the stacked layers approximate:

$F(x) = H(x) - x \quad \Rightarrow \quad H(x) = F(x) + x$

This is implemented by adding the input $x$ to the output of a few stacked layers:

Output = F(x) + x

This simple change allows:

  • Better gradient flow
  • Easier optimization
  • Use of identity mappings to preserve information
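
In code, this pattern is just an addition in the forward pass. Below is a minimal PyTorch sketch; the Residual wrapper name and usage are illustrative, not the paper's implementation, and it assumes the wrapped sub-network preserves the input shape.

import torch
import torch.nn as nn

# Hypothetical wrapper: adds the input back onto the output of any
# sub-network F whose output shape matches its input shape.
class Residual(nn.Module):
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn  # the residual function F

    def forward(self, x):
        return self.fn(x) + x  # H(x) = F(x) + x

# Usage: wrap two 3x3 convs that preserve channel count and spatial size.
block = Residual(nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
))
y = block(torch.randn(1, 64, 56, 56))  # output shape equals input shape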

4. The ResNet Architecture

ResNet is built using Residual Blocks stacked together. The standard version for ImageNet classification includes:

  • Initial 7x7 Conv + MaxPooling
  • 4 Stages of Residual Blocks (conv2_x to conv5_x)
  • Global Average Pooling
  • Fully Connected Layer (1000-way softmax)

Example: ResNet-50

| Stage   | Layers                  | Output Size  |
| ------- | ----------------------- | ------------ |
| Conv1   | 7x7 conv, 64, stride 2  | 112x112x64   |
| Pool1   | 3x3 max pool, stride 2  | 56x56x64     |
| Conv2_x | [1x1, 3x3, 1x1] x 3     | 56x56x256    |
| Conv3_x | [1x1, 3x3, 1x1] x 4     | 28x28x512    |
| Conv4_x | [1x1, 3x3, 1x1] x 6     | 14x14x1024   |
| Conv5_x | [1x1, 3x3, 1x1] x 3     | 7x7x2048     |
| Pool    | Global average pool     | 1x1x2048     |
| FC      | Fully connected         | 1000 classes |
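
These shapes can be checked against torchvision's reference implementation. A small sketch, assuming a recent torchvision (0.13+, which accepts weights=None); layer1 through layer4 correspond to conv2_x through conv5_x above.

import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # architecture only, no pretrained weights
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for name, module in model.named_children():
        if name == "fc":
            x = torch.flatten(x, 1)     # torchvision flattens before the final FC layer
        x = module(x)
        print(f"{name:8s} -> {tuple(x.shape)}")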

5. Residual Blocks Explained

There are two main types:

a. Basic Block (used in ResNet-18/34)

Input --> Conv (3x3) --> BN --> ReLU --> Conv (3x3) --> BN --> (+) --> ReLU --> Output
  |                                                             ^
  |_____________________________________________________________|
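
A minimal PyTorch sketch of this block, assuming the input and output shapes match so the identity shortcut needs no projection (the class name is illustrative):

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # skip path
        out = self.relu(self.bn1(self.conv1(x)))  # Conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))           # Conv -> BN
        return self.relu(out + identity)          # add, then final ReLU

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))    # shape preserved: (1, 64, 56, 56)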

b. Bottleneck Block (used in ResNet-50/101/152)

Input --> Conv (1x1) --> BN --> ReLU --> Conv (3x3) --> BN --> ReLU --> Conv (1x1) --> BN --> (+) --> ReLU --> Output
  |                                                                                            ^
  |____________________________________________________________________________________________|

Why bottlenecks?

  • Reduce computational cost (via 1x1 conv)
  • Enable training deeper models

If input and output shapes differ, a projection (1x1 conv) is used in the skip path.
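
Here is a sketch of a bottleneck block with that optional projection, loosely following torchvision's layout; the class and argument names are illustrative, not the library's API.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4  # output channels = mid_channels * 4

    def __init__(self, in_channels, mid_channels, stride=1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.reduce = nn.Sequential(            # 1x1: shrink channels
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(              # 3x3: spatial processing
            nn.Conv2d(mid_channels, mid_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(            # 1x1: restore channels
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        # Projection shortcut when the shape changes (stride > 1 or channel mismatch).
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.expand(self.conv(self.reduce(x)))
        return self.relu(out + identity)

# First block of conv3_x in ResNet-50: 256 -> 512 channels, spatial 56 -> 28.
y = Bottleneck(256, 128, stride=2)(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 512, 28, 28])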


6. ResNet Variants and Depths

| Model      | Depth | Parameters | Top-5 Error (ILSVRC) |
| ---------- | ----- | ---------- | -------------------- |
| ResNet-18  | 18    | ~11M       | ~7.5%                |
| ResNet-34  | 34    | ~21M       | ~7.0%                |
| ResNet-50  | 50    | ~25M       | ~6.8%                |
| ResNet-101 | 101   | ~44M       | ~6.5%                |
| ResNet-152 | 152   | ~60M       | ~6.2%                |
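
The approximate parameter counts can be checked against torchvision's reference models (again assuming a torchvision version that accepts weights=None):

import torchvision.models as models

for name in ["resnet18", "resnet34", "resnet50", "resnet101", "resnet152"]:
    model = getattr(models, name)(weights=None)          # untrained architecture
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")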

7. Design Principles and Insights

✅ Identity Shortcuts:

  • Let the network preserve signal across layers
  • Require no extra parameters if dimensions match

✅ Deep Supervision (implicitly):

  • Gradients can flow directly through the shortcut connections (see the short derivation below)
  • Improves convergence
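
To see why, differentiate a single residual unit $y = x + F(x)$ with respect to its input:

$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)$

The identity term $I$ guarantees that part of the gradient reaches $x$ unscaled, no matter how small $\partial F / \partial x$ becomes, so stacking many blocks does not starve early layers of gradient.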

✅ Feature Reuse:

  • Encourages layers to refine, not relearn, features

✅ Avoids Overfitting:

  • Despite its much greater depth, ResNet generalizes better than shallower plain networks rather than overfitting

8. Training Details

  • Dataset: ImageNet (1.28M images)
  • Data Augmentation:

    • Random resized crop to 224x224
    • Horizontal flips
  • Optimization:

    • SGD with momentum (0.9)
    • Weight decay: 1e-4
    • Batch size: 256
    • LR schedule: Step decay (e.g., divide by 10 every 30 epochs)
  • Epochs: 90–120
  • Initialization: MSRA (He et al., 2015) weight init for ReLU
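
A sketch of this recipe in PyTorch, using a torchvision ResNet-50 as a stand-in model; the ImageNet data loader is omitted, and the normalization statistics below are the commonly used ImageNet values, not something stated in the paper.

import torch
import torchvision.models as models
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224),                    # random resized crop to 224x224
    T.RandomHorizontalFlip(),                    # horizontal flips
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],      # standard ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=None)            # torchvision already applies He init to convs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Step decay: divide the learning rate by 10 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)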

9. Performance and Impact

  • Won 1st place in ILSVRC 2015 (Image classification)
  • Enabled very deep CNNs to be trained effectively (up to 1000+ layers)
  • Inspired many successor models:

    • DenseNet (feature concatenation instead of addition)
    • ResNeXt (grouped convolutions)
    • EfficientNet (compound scaling)
    • Hybrid Vision Transformers, which use a ResNet backbone to embed image patches

10. Limitations and Later Evolutions

| Issue                           | Solution in later models                         |
| ------------------------------- | ------------------------------------------------ |
| Computationally heavy           | ResNeXt, MobileNet, ShuffleNet                   |
| Static, manually designed       | NASNet, EfficientNet                             |
| Only additive identity          | DenseNet uses concatenation                      |
| Diminishing returns at depth    | Efficient architectures (e.g., Swin Transformer) |

11. Conclusion

ResNet revolutionized the way we design deep networks. It solved the degradation problem using a simple, elegant idea: learning residuals instead of direct mappings.

Its skip connections became a blueprint for almost all modern CNNs, and its legacy continues in hybrid models and transformers alike.

✨ “Sometimes, the best way forward is to remember where you started.”


This post is licensed under CC BY 4.0 by the author.