XceptionNet
Table of Contents
- Introduction
- Historical Context and Motivation
- Architectural Philosophy
- Understanding Depthwise Separable Convolutions
- Detailed Architecture of XceptionNet
- Training Details
- Performance and Benchmarks
- Comparison with Other Architectures
- Limitations and Critiques
- Real-world Applications and Use Cases
- Conclusion
1. Introduction
XceptionNet (Extreme Inception), introduced by François Chollet in 2017, is a deep convolutional neural network architecture that builds on the ideas behind Inception modules by replacing them with depthwise separable convolutions. With roughly the same number of parameters as Inception V3, it achieves slightly higher accuracy on large-scale image classification benchmarks such as ImageNet, pointing to a more efficient use of model capacity.
The core insight of XceptionNet is that spatial feature extraction and channel-wise correlation can be completely decoupled, offering a more efficient alternative to traditional convolution operations.
2. Historical Context and Motivation
Before XceptionNet:
- Inception architectures (GoogLeNet, Inception-v3) were dominant due to their use of parallel convolutions for efficient feature extraction.
- However, Inception modules were complex to design and implement.
François Chollet proposed that if we go to the “extreme” of Inception’s idea—completely decoupling cross-channel and spatial correlations—we arrive at depthwise separable convolutions.
XceptionNet can be thought of as a “linear stack” of depthwise separable convolution layers with residual connections.
Depthwise separable convolutions themselves were not new: they had been explored in earlier work, and MobileNet, a contemporaneous architecture, is also built on them but targets mobile and embedded devices.
3. Architectural Philosophy
The key hypothesis:
- Conventional convolutions perform spatial filtering and channel mixing (feature combining) in a single step.
Xception proposes doing this in two separate operations:
- Depthwise convolution: Spatial filtering
- Pointwise convolution: Feature combining (1x1 convolution)
This not only improves performance but also simplifies the design: the network is built from a single repeated module, unlike Inception, whose modules required careful manual design.
Xception = Extreme Inception = Inception with maximal factorization.
4. Understanding Depthwise Separable Convolutions
A standard convolution performs filtering and combining simultaneously:
- Given an input of shape (H, W, C)
- A kernel of shape (k, k, C, N)
- Output: (H', W', N)
This is computationally expensive: k² × C × N multiplications per output position.
In Depthwise Separable Convolutions:
Depthwise Convolution:
- Apply a single k×k filter per input channel (no combining across channels).
- Cost: k² × C
Pointwise Convolution:
- 1×1 convolution to linearly combine all C channels into N channels.
- Cost: C × N
Total cost: k²×C + C×N per output position → significantly lower than k²×C×N. The reduction factor is (k²C + CN) / (k²CN) = 1/N + 1/k², so with a 3×3 kernel and a few hundred output channels the saving is roughly 8-9×.
This factorization reduces computation while largely preserving representational power.
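As a minimal sketch in Keras (assuming TensorFlow 2.x), the snippet below compares the parameter count of a standard convolution with that of its depthwise separable counterpart. The sizes k=3, C=N=256 are illustrative and not taken from Xception itself.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

k, C, N = 3, 256, 256  # kernel size, input channels, output channels (illustrative)
inputs = tf.keras.Input(shape=(32, 32, C))

# Standard convolution: a single (k, k, C, N) kernel filters and combines in one step.
standard = Model(inputs, layers.Conv2D(N, k, padding="same", use_bias=False)(inputs))

# Depthwise separable: one k x k filter per channel, then a 1x1 convolution to mix channels.
x = layers.DepthwiseConv2D(k, padding="same", use_bias=False)(inputs)
separable = Model(inputs, layers.Conv2D(N, 1, padding="same", use_bias=False)(x))

print(standard.count_params())   # k*k*C*N       = 589,824
print(separable.count_params())  # k*k*C + C*N   =  67,840  (~8.7x fewer)
```

Keras also provides `layers.SeparableConv2D`, which fuses the depthwise and pointwise steps into a single layer.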
5. Detailed Architecture of XceptionNet
XceptionNet consists of 36 convolutional layers, structured into three flows:
Entry Flow
- Starts with two conventional 3×3 convolutions (the first with stride 2)
- Followed by 3 blocks of depthwise separable convolutions with ReLU and batch normalization
- Each block ends with a residual connection
Middle Flow
8 identical modules:
- Each with three depthwise separable convolution layers
- Residual connection around the block
Designed to capture complex features at a deeper level
Exit Flow
- Further depthwise separable convolutions with an increasing number of filters
- Global average pooling
- Fully connected layer with softmax for classification
| Flow | Components |
|---|---|
| Entry Flow | Conv layers + 3 blocks with depthwise separables |
| Middle Flow | 8 identical residual modules |
| Exit Flow | Final depthwise separable convolutions + GAP + Softmax |
Activation: ReLU around the separable convolution layers; notably, the paper found that inserting a non-linearity between the depthwise and pointwise steps hurt performance, so none is used there
Normalization: BatchNorm after each convolution
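As an illustrative sketch (not the reference implementation), one middle-flow-style module can be written with the Keras functional API roughly as follows; the 728-filter, 3×3 sizes follow the paper's description, while the 19×19 input shape is only an example.

```python
import tensorflow as tf
from tensorflow.keras import layers

def middle_flow_module(x, filters=728):
    """One Xception-style middle-flow module (illustrative sketch):
    three separable convolutions wrapped in a residual connection."""
    residual = x
    for _ in range(3):
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
    return layers.Add()([x, residual])

# Example: stacking 8 such modules on a middle-flow-sized feature map.
inputs = tf.keras.Input(shape=(19, 19, 728))
x = inputs
for _ in range(8):
    x = middle_flow_module(x)
model = tf.keras.Model(inputs, x)
```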
6. Training Details
- Dataset: ImageNet (1.2M training images)
- Optimizer: RMSprop or SGD with momentum
- Learning Rate Schedule: Step decay or cosine annealing
- Weight Initialization: Glorot Uniform or He initialization
- Batch Size: 128 to 256
Regularization:
- Dropout (optional)
- L2 weight decay
Because it is a larger model, Xception trains more slowly than MobileNet, but it reaches higher accuracy.
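A hedged sketch of one possible training configuration in Keras is shown below; the learning-rate schedule values and steps_per_epoch are illustrative placeholders, not prescribed settings.

```python
import tensorflow as tf

# Illustrative setup only; steps_per_epoch depends on your dataset and batch size.
steps_per_epoch = 5000

model = tf.keras.applications.Xception(weights=None, classes=1000)

# SGD with momentum and a step-decay learning-rate schedule (one plausible configuration).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.045,
    decay_steps=2 * steps_per_epoch,   # decay every 2 epochs
    decay_rate=0.94,
    staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # data pipeline omitted
```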
7. Performance and Benchmarks
| Model | Top-1 Accuracy | Top-5 Accuracy | Parameters | FLOPs |
|---|---|---|---|---|
| VGG16 | 71.5% | 89.8% | 138M | ~15.3B |
| InceptionV3 | 78.0% | 93.9% | 23M | ~5.7B |
| Xception | 79.0% | 94.5% | 22.8M | ~8.4B |
Xception offers a strong trade-off between accuracy and model complexity.
8. Comparison with Other Architectures
Xception vs InceptionV3
- Xception simplifies Inception modules using depthwise separable convolutions.
- No manual design of filter concatenations.
- Slightly better accuracy.
Xception vs MobileNet
- Both use depthwise separable convolutions.
- Xception is deeper and more accurate but heavier.
- MobileNet is optimized for speed and deployment.
Xception vs ResNet
- Xception uses residual connections like ResNet.
- However, ResNet uses standard convolutions.
- Xception achieves similar or better performance with fewer parameters.
9. Limitations and Critiques
| Limitation | Description |
|---|---|
| 🚫 Heavier than MobileNet | Not well suited to edge devices |
| 🚫 Depthwise separables can underfit | If model depth/capacity is insufficient |
| 🚫 Training instability | More sensitive to initialization and hyperparameters |
| 🚫 Less interpretability | Intermediate representations are hard to visualize |
10. Real-world Applications and Use Cases
- Image classification tasks (ImageNet, CIFAR)
- Feature extraction for transfer learning
- Used in Keras as a pretrained model backbone (see the transfer-learning sketch after this list)
- Medical imaging, satellite imagery, and automated retail checkout systems
- Deployed in cloud-based inference pipelines
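As a brief example of the transfer-learning use case, the sketch below loads the pretrained Keras Xception backbone and attaches a small, hypothetical 5-class head; the dataset, class count, and head are placeholders, not part of Xception.

```python
import tensorflow as tf

# Pretrained Xception backbone with the ImageNet classification head removed.
base = tf.keras.applications.Xception(weights="imagenet",
                                      include_top=False,
                                      input_shape=(299, 299, 3),
                                      pooling="avg")
base.trainable = False  # freeze the backbone for feature extraction

# Hypothetical 5-class downstream task; the head below is ours, not part of Xception.
inputs = tf.keras.Input(shape=(299, 299, 3))
x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)  # Xception expects [-1, 1]
x = base(x, training=False)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # data pipeline omitted
```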
11. Conclusion
XceptionNet demonstrates that by completely factorizing standard convolutions into depthwise and pointwise operations, we can build highly efficient and powerful deep learning models. It blends the best of Inception (modularity), ResNet (residual learning), and MobileNet (depthwise separability).
For practitioners, Xception offers a balanced approach: modular design, high accuracy, and reasonable computational load.
It has influenced architectures like EfficientNet, NASNet, and the general trend toward lightweight, high-performance CNNs.