From sliding kernels to Vision Transformers: why convolutions work, how they evolved, and when to use them
Before CNNs, people used fully connected layers on images. The problem? They don't scale.
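To make the parameter explosion concrete with a standard back-of-the-envelope example: a 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. Connect that to just 1,000 hidden neurons and you already need ~150 million weights, before the network has done anything useful.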
Beyond the parameter explosion, fully connected layers destroy spatial structure.
"What do you mean by 'destroy spatial structure'?"
When you flatten an image, you turn a 2D grid into a 1D vector:
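```
p1  p2  p3  p4  p5
p6  p7  p8  p9  p10
p11 p12 p13 p14 p15    →    [p1, p2, p3, ..., p25]
p16 p17 p18 p19 p20
p21 p22 p23 p24 p25
```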
The network has no idea that p1 and p2 are neighbours while p1 and p25 are at opposite corners. It's just 25 numbers with 25 separate weights. The spatial layout is destroyed the moment you flatten.
"Why can't we just add positional information back in?"
You can! And people do: Vision Transformers add positional embeddings to image patches. But there's a catch: adding spatial info lets the network learn spatial relationships, but doesn't force it to. The network still has to figure out from data that "nearby pixels matter more."
Convolutions bake in that assumption. The kernel can physically only see a local region. No learning is required to know that neighbours are related.
Instead of connecting every pixel to every neuron, slide a small kernel across the image. The same kernel weights are applied at every position.
Think about what a "vertical edge detector" is:
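One classic hand-crafted version looks like this (a trained network learns its own values):

```
+1   0  -1
+1   0  -1
+1   0  -1
```

Bright-to-dark transitions from left to right produce a large positive response; flat regions produce zero.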
"Why is weight sharing important beyond just reducing parameters?"
Should the network learn a separate vertical edge detector for the top-left corner vs the bottom-right corner? No: a vertical edge is a vertical edge regardless of position.
Weight sharing means: learn the pattern once, apply it everywhere.
Without weight sharing, the network might learn "vertical edge at position (10,10)" and "vertical edge at position (50,50)" as separate things. Needs way more data to generalise.
Two properties make convolutions powerful: local connectivity (each output depends only on a small neighbourhood of pixels) and weight sharing (the same kernel is reused at every position, giving translation equivariance).
A kernel (small matrix) slides across the image. At each position: multiply the kernel weights elementwise with the pixels underneath, sum the results, and write that single number into the output feature map.
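A minimal sketch of this using PyTorch's functional API, with the edge-detector kernel from above:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 5, 5)  # (batch, channels, height, width)
kernel = torch.tensor([[[[1., 0., -1.],
                         [1., 0., -1.],
                         [1., 0., -1.]]]])  # shape (out_ch, in_ch, 3, 3)

out = F.conv2d(image, kernel)  # no padding, stride 1
print(out.shape)  # torch.Size([1, 1, 3, 3]): a 5x5 input shrinks to 3x3
```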
How big is the output after convolution?
| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 5×5 | 3×3 | 0 | 1 | 3×3 |
| 28×28 | 3×3 | 1 | 1 | 28×28 |
| 28×28 | 3×3 | 1 | 2 | 14×14 |
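In general, per spatial dimension: output = floor((input + 2 × padding - kernel_size) / stride) + 1. Checking the last row: floor((28 + 2 - 3) / 2) + 1 = 13 + 1 = 14.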
padding = kernel_size // 2 keeps the spatial size the same (for stride=1). It adds zeros around the input so the kernel can be centred on edge pixels.
"What's the point of stride > 1? Why not always stride 1?"
Stride controls how much the kernel moves each step. stride=2 means skip every other position, roughly halving the spatial dimensions.
It's used for downsampling: making feature maps smaller as you go deeper. Smaller feature maps = less computation, and deeper layers need to "see" larger regions of the original image.
Real images are RGB: 3 channels. The kernel becomes 3D to handle this: shape (3, k, k), one k×k slice per input channel, with the three per-channel responses summed into a single output number.
"Why do we sum all 3 channel outputs into one number?"
Because we want to detect patterns that involve all channels together.
Think about detecting "orange" in an image: high red, medium green, low blue. A single-channel kernel can't see this; it only looks at one colour. But a 3-channel kernel can learn weights that respond when all three conditions are met.
Same for skin detection: skin has a specific RGB signature. The kernel needs to see red AND green AND blue together to recognise it.
One kernel spans ALL input channels and collapses them into one output channel.
Want to detect multiple patterns? Use multiple kernels.
In PyTorch:
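A minimal layer matching the description above (3 input channels, 16 kernels):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# 16 kernels, each spanning all 3 input channels:
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])

x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)      # torch.Size([1, 16, 28, 28]): one output channel per kernel
```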
"So each of the 16 kernels applies on the RGB channels and sums up the output to get one channel?"
Yes, exactly.
"What's the computational cost of a conv layer?"
At each output position, the kernel does k × k × in_channels multiply-adds.
Total cost: k × k × in_channels × out_channels × H' × W' multiply-adds, where H' × W' is the output size.
"Why multiply by H' ร W'?"
The kernel has to slide across every position in the output. Each position requires those k² × in_channels operations. More positions = more computation.
Conceptually it's not one big matrix multiplication: it's literally "slide to position, compute, slide to next position, compute..." (in practice, libraries often rewrite this as a matrix multiply via im2col, but the cost is the same).
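A worked example with illustrative numbers: a 3×3 conv from 64 to 128 channels producing a 56×56 output costs 3 × 3 × 64 × 128 × 56 × 56 ≈ 231 million multiply-adds, for a single layer.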
Early layers learn simple patterns: oriented edges, corners, and colour blobs.
These aren't hand-designed: the network learns them via backprop. But if you visualise learned kernels from layer 1, they look exactly like classic edge detectors.
The network builds a hierarchy: simple → complex. Each layer combines the previous layer's features into something more abstract. That's why depth works.
Both shrink spatial size. What's the difference?
"Why is max pooling bad for gradients? Won't the non-max pixels never update?"
Yes, same issue as ReLU: gradient flows only to the max position, the others get zero. But in practice it's rarely a problem: the max location changes from input to input, so across a dataset every upstream neuron still receives gradient.
"Why did people move away from pooling?"
Two reasons: pooling is a fixed operation with no parameters, so it can't learn what to keep; and it sends gradient only to the max position. A stride 2 conv fixes both:
| Property | Max Pooling | Stride 2 Conv |
|---|---|---|
| Parameters | None | Learned |
| Gradient flow | Only to max position | To all positions |
| Flexibility | Fixed operation | Learns what to keep |
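Both produce the same output shape; a quick sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # fixed operation, no parameters
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # learned downsampling

print(pool(x).shape)  # torch.Size([1, 64, 16, 16])
print(down(x).shape)  # torch.Size([1, 64, 16, 16])
```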
"If each conv only sees 3ร3, how does the network see the whole image?"
Stacking layers expands the receptive field.
For stride 1 and kernel size k: each layer adds (k - 1) to the receptive field, so after L layers it is 1 + L × (k - 1). Two stacked 3×3 convs see 5×5; three see 7×7.
With stride > 1, the receptive field grows faster: each layer adds (k - 1) × stride_product instead, as the helper below shows.
"What's stride_product?"
Cumulative product of all strides from previous layers. After stride 2, the feature map is half the size โ so one pixel in that feature map corresponds to 2 pixels in the previous layer.
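A small helper that applies this recurrence (a sketch; layers is a list of (kernel_size, stride) pairs chosen for illustration):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), in network order."""
    rf, stride_product = 1, 1
    for k, s in layers:
        rf += (k - 1) * stride_product  # each layer adds (k - 1) x stride_product
        stride_product *= s
    return rf

print(receptive_field([(3, 1)] * 3))  # 7: three stride-1 3x3 convs see a 7x7 region
print(receptive_field([(3, 2)] * 3))  # 15: strides compound, so the field grows much faster
```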
LeNet (1998) is one of the first CNNs, and simple enough to understand fully:
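A sketch of the classic layout in PyTorch. One caveat: the 1998 original used sigmoid-like activations and average pooling; this version uses the modern ReLU and max pool equivalents.

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet()(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```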
"What happens for variable image sizes? The FC part would break, right?"
Exactly. Conv layers don't care about image size: the kernel just slides however many times it can. But the FC layer has a fixed input size.
Solutions: resize or crop inputs to a fixed size; replace the flatten with global average pooling so the FC input is fixed regardless of H × W (sketched below); or go fully convolutional with no FC at all.
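A sketch of the global-average-pooling fix (the layer sizes here are arbitrary examples):

```python
import torch
import torch.nn as nn

# Global average pooling makes the head size-independent:
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # (B, C, H, W) -> (B, C, 1, 1) for any H, W
    nn.Flatten(),             # -> (B, C)
    nn.Linear(64, 10),
)

print(head(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 10])
print(head(torch.randn(1, 64, 57, 91)).shape)  # torch.Size([1, 10]): any input size works
```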
First working CNN. Conv → Pool → FC architecture for digit recognition. Proved that neural networks could learn hierarchical features from raw pixels.
Problem: Limited by compute of the era. Couldn't scale to larger images or deeper networks.
AlexNet (2012): LeNet scaled up massively. Won ImageNet by a huge margin. Key additions: ReLU (instead of sigmoid), Dropout for regularisation, GPU training with model parallelism.
Problem: Still just "stack more layers", with no principled way to go deeper. 8 layers was already pushing limits.
"Just use 3ร3 convs everywhere, stack them deep." VGG-16 and VGG-19. Key insight: two 3ร3 convs have same receptive field as one 5ร5, but fewer parameters and more non-linearity.
Problem: 138M parameters. Very slow. Going beyond 19 layers caused degradation: deeper models performed worse.
"Why choose one kernel size? Use multiple in parallel." Inception module: 1ร1, 3ร3, 5ร5 convs side by side, concatenate results. Introduced 1ร1 convolutions for cheap channel reduction before expensive operations.
Problem: Complex architecture, hard to modify. Still couldn't go arbitrarily deep.
ResNet (2015): Solved the degradation problem (why deeper networks performed worse). Key innovation: skip connections. Instead of learning H(x), learn the residual F(x) = H(x) - x, so output = F(x) + x.
Fix: The +x provides a gradient highway. Even if F(x) gradients vanish, gradient still flows through identity. Enabled 152 layers (vs VGG's 19). Still the backbone of most vision models today.
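A minimal residual block sketch (simplified: real ResNet blocks also include batch norm, and use a 1×1 conv on the skip path when the shape changes):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x), the residual
        return self.relu(out + x)                   # F(x) + x: the skip connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])
```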
MobileNet (2017): Key trick: depthwise separable convolutions. Split spatial filtering (depthwise) from channel mixing (pointwise 1×1). Nearly the same effective operation, ~8× fewer parameters and FLOPs.
Fix: Made CNNs practical for mobile devices and edge deployment. Spawned MobileNetV2, V3 with further optimisations.
"Let the network learn which channels are important." Global average pool โ FC โ sigmoid โ reweight channels. Adds channel attention with minimal overhead.
Fix: Can be added to any architecture. Won ImageNet 2017. Channel attention became a standard component.
"Stop hand-designing, let NAS find optimal scaling." Compound scaling: scale depth, width, and resolution together with fixed ratios. EfficientNet-B0 to B7 family.
Fix: State-of-the-art accuracy/efficiency tradeoff. Showed that scaling all dimensions together beats scaling just one.
ViT, the Vision Transformer (2020): Throw away convolutions entirely. Chop the image into 16×16 patches, flatten, add positional embeddings, feed to a standard transformer encoder. Direct application of the NLP transformer architecture to vision.
Problem: Needs massive data (ImageNet-21k or JFT-300M) to match CNNs. Without enough data, lacks the inductive bias that convolutions provide for free.
Swin Transformer (2021): Make ViT efficient and hierarchical like CNNs. Key innovations: shifted windows for local attention (not full image), hierarchical feature maps (like the CNN pyramid), linear complexity in image size.
Fix: The best of both worlds, transformer flexibility with CNN-like efficiency. State-of-the-art on many vision benchmarks.
"What if we modernise CNNs with all the tricks learned from transformers?" Pure CNN with: larger kernels (7ร7), fewer activations, LayerNorm instead of BatchNorm, inverted bottlenecks, separate downsampling layers.
Result: Pure CNN matching Swin Transformer performance. Proved that architecture differences matter less than training recipes. CNNs aren't dead.
Every innovation attacks one of three constraints: compute (LeNet → AlexNet → MobileNet → EfficientNet), gradient flow in deep networks (ReLU, skip connections), or how fast the receptive field reaches global context (dilated convs, attention).
"But doesn't receptive field handle global context for CNNs?"
Yes, but only at the end of the network.
The network can eventually see both corners. But information has to travel through 30 layers; by the time distant pixels "meet", the information has been transformed and potentially lost.
Transformers: Self-attention compares every patch to every other patch directly in layer 1. Global context immediately.
| Property | CNN | Transformer |
|---|---|---|
| Local patterns | Great (built-in bias) | Has to learn it |
| Global patterns | Slow (needs depth) | Instant (attention) |
| Data needed | Less (strong bias helps) | More (no bias) |
| Compute | Cheaper | O(n²) in patches |
"But I thought the point of CNNs was local position info vs flattened networks? Aren't transformers just fancy flattened networks?"
Good catch: there's a tension here.
CNNs bake in "local matters" as a hard constraint. Transformers say "learn what matters": attention can look everywhere, but learns to focus locally when useful.
What actually happens in trained ViT: early layers have mostly local attention (learns CNN-like behaviour), later layers become more global. The transformer essentially learns to act like a CNN early on, then goes beyond.
ConvNeXt (2022) showed: if you train CNNs with transformer-style tricks, they match transformers. Architecture may matter less than training recipes.
"A 1ร1 kernel? What can it even do?"
No spatial mixing, just channel mixing. It's a learned weighted combination across channels at each pixel.
"For a single input channel, a 1ร1 conv would just result in scaled input?"
Yes, exactly. 1×1 on a single channel = scaling = pointless. The power comes from channel mixing when you have multiple channels.
Use cases: cheap channel reduction before expensive 3×3 or 5×5 convs (the Inception trick), the pointwise step in depthwise separable convolutions, and adjusting channel counts so skip connections can add up.
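For example, Inception-style channel reduction (sizes are illustrative):

```python
import torch
import torch.nn as nn

reduce = nn.Conv2d(256, 64, kernel_size=1)  # channel mixing only, no spatial mixing

x = torch.randn(1, 256, 28, 28)
print(reduce(x).shape)  # torch.Size([1, 64, 28, 28]): same H and W, 4x fewer channels
```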
Normal conv does spatial filtering AND channel mixing in one operation. Depthwise separable splits them:
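A sketch in PyTorch; the key detail is groups=in_channels, which gives each input channel its own spatial filter:

```python
import torch.nn as nn

in_ch, out_ch = 64, 128

# Standard conv: spatial filtering and channel mixing in one step
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable: split into two cheaper steps
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)  # spatial only
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)                          # channel mixing only

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard))                       # 73856
print(params(depthwise) + params(pointwise))  # 8960: roughly 8x fewer
```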
Used in MobileNet and EfficientNet: anywhere efficiency matters.
Expand receptive field without pooling by spacing out the kernel:
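For a 3×3 kernel with dilation 2, the footprint looks like this (x = kernel weight, · = skipped pixel):

```
x · x · x
· · · · ·
x · x · x
· · · · ·
x · x · x
```

Same 9 weights, now covering a 5×5 region. In PyTorch this is just the dilation argument: nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2).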
"What are the gaps?"
The kernel skips pixels. The · positions are just ignored; the kernel doesn't look at them. Same 9 weights, but applied to pixels that are spaced apart.
Used in DeepLab (segmentation) and WaveNet (audio): anywhere you need global context without losing resolution.
Transposed convolution goes the other way: upsampling, from small to big. Used in segmentation and GANs.
"Why did transposed conv get a bad rep?"
Checkerboard artifacts. When stride > 1, the "stamps" overlap unevenly: some pixels get more contributions than others, creating visible grid patterns.
Modern solution: Nearest neighbour upsample + regular conv. Cleaner results.
"Why conv after upsample?"
Upsampling alone doesn't add any new information: it just repeats or interpolates. A conv after lets the network learn how to fill in details.
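A sketch of the modern pattern (channel counts are illustrative):

```python
import torch
import torch.nn as nn

upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),  # 2x bigger, but no new information
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # learns how to fill in the details
)

print(upsample_block(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 32, 32])
```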
U-Net is the classic example: a segmentation architecture built as an encoder-decoder with skip connections (a sketch follows below).
Skip connections from encoder to decoder preserve fine spatial details (edges, precise boundaries) that get lost during downsampling.
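A minimal two-level sketch to show the shape of the idea (real U-Nets stack several of these levels and use more convs per level):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One downsample, one upsample, one skip connection."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # downsample
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(32, 16, 3, padding=1),
        )
        self.dec = nn.Conv2d(32, 1, 3, padding=1)  # 32 = 16 (upsampled) + 16 (skip)

    def forward(self, x):
        e = self.enc(x)                             # fine details live here
        d = self.up(torch.relu(self.down(e)))       # detail lost going down, coarse context back up
        return self.dec(torch.cat([d, e], dim=1))   # skip connection restores the detail

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```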
CNN activations have shape: (batch, channels, H, W)
BatchNorm2d normalises across batch AND spatial dimensions, but separately per channel:
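A sketch reproducing what nn.BatchNorm2d computes (before its learnable scale and shift):

```python
import torch

x = torch.randn(8, 16, 32, 32)  # (batch, channels, H, W)

# One mean and variance per channel, computed over batch + spatial dims:
mean = x.mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 16, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

bn = torch.nn.BatchNorm2d(16)  # training mode: uses batch statistics
print(torch.allclose(bn(x), x_hat, atol=1e-4))  # True: the affine part starts as identity
```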
The problem with deep networks: by the chain rule, the gradient reaching the early layers is a product of per-layer terms. With 50 layers, ∂L/∂x1 = ∂L/∂x50 × ∂x50/∂x49 × ... × ∂x2/∂x1. That's 49 multiplications. If each term < 1, the gradient vanishes.
With a skip connection, the block computes y = F(x) + x, so ∂y/∂x = ∂F(x)/∂x + 1.
That +1 is the key. Even if ∂F(x)/∂x vanishes, the gradient is still at least 1. The gradient has a "highway": it can skip directly through the + without being multiplied by small weights 50 times.
That's why ResNet could train 152 layers when VGG struggled with 19.
"Convolutions work because images have spatial structure โ nearby pixels are related. Weight sharing means we learn local patterns once and apply everywhere, giving translation equivariance. Receptive field grows with depth: early layers detect edges, deeper layers combine them into objects. Modern architectures use stride 2 conv instead of pooling (better gradients), skip connections (enables depth), and may even drop convolutions entirely for attention (ViT) when data is plentiful."