Computer Vision

CNNs and Convolutions

From sliding kernels to Vision Transformers — why convolutions work, how they evolved, and when to use them


Part I

Why Convolutions?

1 The Problem with Fully Connected Layers

Before CNNs, people used fully connected layers on images. The problem? It doesn't scale.

28×28 grayscale image = 784 pixels (flattened)
FC layer: 784 → 512
Weight matrix W: 784 × 512 = 401,408 parameters

224×224 RGB image = 224 × 224 × 3 = 150,528 pixels (flattened)
FC layer: 150,528 → 512
Weight matrix W: 150,528 × 512 = 77 million parameters
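A quick sanity check of those numbers in PyTorch (just a sketch that counts the parameters of the two FC layers from the example):

import torch.nn as nn

fc_small = nn.Linear(28 * 28, 512)         # 784 -> 512
fc_large = nn.Linear(224 * 224 * 3, 512)   # 150,528 -> 512

print(sum(p.numel() for p in fc_small.parameters()))   # 401,920 (401,408 weights + 512 biases)
print(sum(p.numel() for p in fc_large.parameters()))   # 77,070,848 (~77 million)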

Beyond the parameter explosion, fully connected layers destroy spatial structure.

🤔 Question

"What do you mean by 'destroy spatial structure'?"

✓ Answer

When you flatten an image, you turn a 2D grid into a 1D vector:

5×5 image (each p_i is a pixel):

p1  p2  p3  p4  p5
p6  p7  p8  p9  p10
p11 p12 p13 p14 p15
p16 p17 p18 p19 p20
p21 p22 p23 p24 p25

Flattened into a vector:

[p1, p2, p3, p4, p5, p6, p7, ... p25]

The network has no idea that p1 and p2 are neighbours while p1 and p25 sit at opposite corners. It's just 25 numbers with 25 separate weights. The spatial layout is destroyed the moment you flatten.

🤔 Question

"Why can't we just add positional information back in?"

✓ Answer

You can! And people do — Vision Transformers add positional embeddings to image patches. But there's a catch: adding spatial info lets the network learn spatial relationships, but doesn't force it to. The network still has to figure out from data that "nearby pixels matter more."

Convolutions bake in that assumption. The kernel physically cannot see beyond a local region. No learning is required to know that neighbours are related.

2 The Core Idea: Weight Sharing

Instead of connecting every pixel to every neuron, slide a small kernel across the image. The same kernel weights are applied at every position.

Think about what a "vertical edge detector" is:

Kernel:
[-1 0 1]
[-1 0 1]
[-1 0 1]
🤔 Question

"Why is weight sharing important beyond just reducing parameters?"

✓ Answer

Should the network learn a separate vertical edge detector for the top-left corner vs the bottom-right corner? No — a vertical edge is a vertical edge regardless of position.

Weight sharing means: learn the pattern once, apply it everywhere.

Without weight sharing, the network might learn "vertical edge at position (10,10)" and "vertical edge at position (50,50)" as separate things, and would need far more data to generalise.

Key Insight

Two properties make convolutions powerful:

  1. Local connectivity: each output value only looks at a small neighbourhood, matching the fact that nearby pixels are related.
  2. Weight sharing: the same kernel is applied at every position, so a pattern learned in one place is recognised everywhere (translation equivariance).

Part II

The Convolution Operation

3 Sliding the Kernel

A kernel (small matrix) slides across the image. At each position:

  1. Place the kernel over a patch of the input
  2. Element-wise multiply kernel and patch
  3. Sum to get one output value
  4. Slide to next position, repeat
Input (5×5):          Kernel (3×3):
1 2 3 4 5             1 0 1
2 3 4 5 6      *      0 1 0
3 4 5 6 7             1 0 1
4 5 6 7 8
5 6 7 8 9

Position (0,0) — kernel placed over top-left 3×3:

[1 2 3]     [1 0 1]
[2 3 4]  ×  [0 1 0]
[3 4 5]     [1 0 1]

Output = 1×1 + 2×0 + 3×1 + 2×0 + 3×1 + 4×0 + 3×1 + 4×0 + 5×1
       = 1 + 3 + 3 + 3 + 5
       = 15
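Here's the same computation as a minimal NumPy sketch, looping over every valid position (the numbers are the example above):

import numpy as np

image = np.array([[1, 2, 3, 4, 5],
                  [2, 3, 4, 5, 6],
                  [3, 4, 5, 6, 7],
                  [4, 5, 6, 7, 8],
                  [5, 6, 7, 8, 9]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

out = np.zeros((3, 3))                      # 5 - 3 + 1 = 3 valid positions per axis
for i in range(3):
    for j in range(3):
        patch = image[i:i+3, j:j+3]         # place kernel over a 3x3 patch
        out[i, j] = (patch * kernel).sum()  # element-wise multiply, then sum

print(out[0, 0])   # 15.0, matching the worked example at position (0,0)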

4 Output Size, Padding, Stride

How big is the output after convolution?

$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$
Input    Kernel   Padding   Stride   Output
5×5      3×3      0         1        3×3
28×28    3×3      1         1        28×28
28×28    3×3      1         2        14×14

Padding

padding = kernel_size // 2 keeps spatial size the same (for stride=1). Adds zeros around the input so the kernel can be centred on edge pixels.
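The formula is easy to sanity-check in code. A small helper (a sketch) that reproduces the table above:

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # floor((input - kernel + 2*padding) / stride) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))                        # 3
print(conv_output_size(28, 3, padding=1))            # 28  (padding = 3 // 2 keeps the size)
print(conv_output_size(28, 3, padding=1, stride=2))  # 14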

Stride

🤔 Question

"What's the point of stride > 1? Why not always stride 1?"

✓ Answer

Stride controls how much the kernel moves each step. stride=2 means skip every other position, roughly halving spatial dimensions.

It's used for downsampling — making feature maps smaller as you go deeper. Smaller feature maps = less computation, and deeper layers need to "see" larger regions of the original image.

5 Channels — RGB and Beyond

Real images are RGB — 3 channels. The kernel becomes 3D to handle this:

RGB image: H × W × 3

Red channel     Green channel     Blue channel
[  H × W  ]     [  H × W  ]       [  H × W  ]

Kernel: 3 × 3 × 3 (one slice per channel)

Red slice     Green slice     Blue slice
w w w         w w w           w w w
w w w         w w w           w w w
w w w         w w w           w w w
🤔 Question

"Why do we sum all 3 channel outputs into one number?"

✓ Answer

Because we want to detect patterns that involve all channels together.

Think about detecting "orange" in an image: high red, medium green, low blue. A single-channel kernel can't see this — it only looks at one colour. But a 3-channel kernel can learn weights that respond when all three conditions are met.

Same for skin detection — skin has a specific RGB signature. The kernel needs to see red AND green AND blue together to recognise it.

Key Insight

One kernel spans ALL input channels and collapses them into one output channel.

Multiple Output Channels

Want to detect multiple patterns? Use multiple kernels.

Input: H × W × 3

Kernel 1 (3×3×3): detects vertical edges     → Output channel 1
Kernel 2 (3×3×3): detects horizontal edges   → Output channel 2
Kernel 3 (3×3×3): detects blobs              → Output channel 3
...
Kernel 16 (3×3×3): detects ??? (learned)     → Output channel 16

Stack all outputs: H' × W' × 16

In PyTorch:

nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
# Creates 16 kernels, each of size 3×3×3
# Total weights: 16 × 3 × 3 × 3 = 432 (plus 16 biases)
🤔 Question

"So each of the 16 kernels applies on the RGB channels and sums up the output to get one channel?"

✓ Answer

Yes, exactly.

Kernel 1 (3×3×3):
  - red slice applies to red channel     → number
  - green slice applies to green channel → number
  - blue slice applies to blue channel   → number
  - sum → one value at that position
  - slide across whole image → Output channel 1 (H' × W')

Kernel 2 (3×3×3):
  - same process, different weights → Output channel 2 (H' × W')
...
Stack all 16 → Output (H' × W' × 16)
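You can confirm the shapes and the parameter count in PyTorch (a quick sketch with a made-up 224×224 input):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)                     # (batch, channels, H, W)

print(conv(x).shape)                                # torch.Size([1, 16, 224, 224])
print(conv.weight.shape)                            # torch.Size([16, 3, 3, 3]): 16 kernels, each 3x3x3
print(sum(p.numel() for p in conv.parameters()))    # 448 = 432 weights + 16 biases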

6 Computational Cost

🤔 Question

"What's the computational cost of a conv layer?"

✓ Answer

At each output position, the kernel does k × k × in_channels multiply-adds.

Total cost:

$$k^2 \times \text{in\_channels} \times \text{out\_channels} \times H' \times W'$$
🤔 Follow-up

"Why multiply by H' × W'?"

✓ Answer

The kernel has to slide across every position in the output. Each position requires those k² × in_channels operations. More positions = more computation.

It's not matrix multiplication — it's literally "slide to position, compute, slide to next position, compute..."
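As a rough sketch, the cost formula in code (the 64 → 128 channels and 56×56 output are made-up numbers for illustration):

def conv_macs(k, in_channels, out_channels, out_h, out_w):
    # multiply-accumulate operations for one conv layer
    return k * k * in_channels * out_channels * out_h * out_w

print(conv_macs(3, 64, 128, 56, 56))   # 231,211,008 (~231M MACs)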

Part III

Building CNNs

7 What Kernels Learn

Early layers learn simple patterns:

Vertical edge:     Horizontal edge:     Diagonal:
[-1 0 1]           [-1 -1 -1]           [ 1 0 -1]
[-1 0 1]           [ 0  0  0]           [ 0 1  0]
[-1 0 1]           [ 1  1  1]           [-1 0  1]

These aren't hand-designed — the network learns them via backprop. But if you visualise learned kernels from layer 1, they look exactly like classic edge detectors.

Hierarchical Features

Layer 1: pixels → edges
Layer 2: edges → corners, curves, textures
Layer 3: textures → parts (eyes, wheels, handles)
Layer 4+: parts → objects (faces, cars, cups)
Key Insight

The network builds a hierarchy: simple → complex. Each layer combines the previous layer's features into something more abstract. That's why depth works.

8 Pooling vs Stride

Both shrink spatial size. What's the difference?

Max Pooling

Input (4×4):              Max pool (2×2, stride 2):
1 3 | 2 1
4 2 | 6 5                 4 6
----+----        →        8 9
7 8 | 1 2
3 5 | 9 4                 Output (2×2)

Top-left:     max(1, 3, 4, 2) = 4
Top-right:    max(2, 1, 6, 5) = 6
Bottom-left:  max(7, 8, 3, 5) = 8
Bottom-right: max(1, 2, 9, 4) = 9
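The same example as a quick PyTorch check (a sketch, with a batch and channel dimension of 1 added):

import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 6., 5.],
                  [7., 8., 1., 2.],
                  [3., 5., 9., 4.]]).reshape(1, 1, 4, 4)   # (batch, channel, H, W)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())
# tensor([[4., 6.],
#         [8., 9.]])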
🤔 Question

"Why is max pooling bad for gradients? Won't the non-max pixels never update?"

✓ Answer

Yes — same issue as ReLU. Gradient flows only to the max position; the others get zero for that window. In practice it's rarely fatal: the max position changes from input to input, so across a dataset every path still receives gradient, and the windows are small enough that little is blocked at once.

🤔 Question

"Why did people move away from pooling?"

✓ Answer

Two reasons:

  1. Gradient flow: Stride 2 conv gives gradients to all positions (weighted by learned kernel). Smoother learning.
  2. Information loss: Max pooling actively throws away 3 of 4 values per window. Stride 2 conv doesn't discard pixels — it just visits fewer positions. The learned kernel decides what to keep.
Property         Max Pooling             Stride 2 Conv
Parameters       None                    Learned
Gradient flow    Only to max position    To all positions
Flexibility      Fixed operation         Learns what to keep

9 Receptive Field

🤔 Question

"If each conv only sees 3×3, how does the network see the whole image?"

✓ Answer

Stacking layers expands the receptive field.

Layer 1: one output position sees 3×3 of the input
Layer 2: sees 3×3 of Layer 1
         each Layer 1 position saw 3×3 of the input
         combined: 5×5 of the original input
Layer 3: sees 3×3 of Layer 2
         combined: 7×7 of the original input

The Formula

For stride 1 and kernel size k:

$$\text{RF after } n \text{ layers} = 1 + n \times (k - 1)$$

With stride > 1, receptive field grows faster:

$$\text{RF}_{new} = \text{RF}_{old} + (k - 1) \times \text{stride\_product}$$
🤔 Question

"What's stride_product?"

✓ Answer

Cumulative product of all strides from previous layers. After stride 2, the feature map is half the size — so one pixel in that feature map corresponds to 2 pixels in the previous layer.

Layer 1: Conv 3×3, stride 1 → RF = 3,                stride_product = 1
Layer 2: Conv 3×3, stride 2 → RF = 5,                stride_product = 2
Layer 3: Conv 3×3, stride 1 → RF = 5 + (3-1)×2 = 9,  stride_product = 2
Layer 4: Conv 3×3, stride 2 → RF = 9 + (3-1)×2 = 13, stride_product = 4

After stride 2, each kernel "covers" 2× more of the original input.
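The same bookkeeping as a small helper function (a sketch, following the formula above):

def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, in order
    rf, stride_product = 1, 1
    for k, stride in layers:
        rf += (k - 1) * stride_product   # RF_new = RF_old + (k - 1) x stride_product
        stride_product *= stride
    return rf

print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 2)]))   # 13, matching the example above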

10 Full Architecture: LeNet

LeNet (1998) — one of the first CNNs. Simple enough to understand fully:

Input: (1, 32, 32)   ← grayscale 32×32 image
  ↓
Conv2d(1, 6, kernel=5)   → (6, 28, 28)
ReLU
MaxPool2d(2, stride=2)   → (6, 14, 14)
  ↓
Conv2d(6, 16, kernel=5)  → (16, 10, 10)
ReLU
MaxPool2d(2, stride=2)   → (16, 5, 5)
  ↓
Flatten → (400,)
  ↓
Linear(400, 120) → ReLU → (120,)
Linear(120, 84)  → ReLU → (84,)
Linear(84, 10)           → (10,)
  ↓
Softmax → prediction

Parameter Count

Conv1:  6 × 1 × 5 × 5 + 6 bias   =    156
Conv2: 16 × 6 × 5 × 5 + 16 bias  =  2,416
FC1:   400 × 120 + 120           = 48,120
FC2:   120 × 84 + 84             = 10,164
FC3:   84 × 10 + 10              =    850
                                  ───────
Total:                            ~61,700 parameters

Notice: FC layers have way more params than conv layers. This is why modern architectures minimise FC layers.
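Here's the same architecture as a minimal PyTorch sketch (modernised with ReLU, as in the diagram above), so you can verify the count:

import torch
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2, stride=2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)        # (batch, 16, 5, 5)
        x = x.flatten(1)            # (batch, 400)
        return self.classifier(x)   # logits; softmax is applied at prediction time

model = LeNet()
print(sum(p.numel() for p in model.parameters()))   # 61,706
print(model(torch.randn(1, 1, 32, 32)).shape)       # torch.Size([1, 10])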
🤔 Question

"What happens for variable image sizes? The FC part would break, right?"

✓ Answer

Exactly. Conv layers don't care about image size — the kernel just slides however many times it can. But the FC layer has a fixed input size.

Solutions:

  1. Resize or crop every input to a fixed size (the most common approach).
  2. Global average pooling (or adaptive pooling) before the FC layer, so each channel collapses to one value regardless of spatial size.
  3. Go fully convolutional and drop the FC layers entirely.

Part IV

Evolution of Architectures

11 From LeNet to Vision Transformers

1998
LeNet
Gradient-Based Learning — LeCun et al.

First working CNN. Conv → Pool → FC architecture for digit recognition. Proved that neural networks could learn hierarchical features from raw pixels.

Problem: Limited by compute of the era. Couldn't scale to larger images or deeper networks.

2012
AlexNet — The Deep Learning Revolution
ImageNet Classification with Deep CNNs — Krizhevsky et al.

LeNet scaled up massively. Won ImageNet by a huge margin. Key additions: ReLU (instead of sigmoid), Dropout for regularisation, GPU training with model parallelism.

Problem: Still just "stack more layers" — no principled way to go deeper. 8 layers was already pushing limits.

2014
VGGNet — Deeper with 3×3s
Very Deep Convolutional Networks — Simonyan & Zisserman

"Just use 3×3 convs everywhere, stack them deep." VGG-16 and VGG-19. Key insight: two 3×3 convs have the same receptive field as one 5×5, but fewer parameters and more non-linearity.

Problem: 138M parameters. Very slow. Going beyond 19 layers caused degradation — deeper models performed worse.

2014
GoogLeNet / Inception — Multiple Kernel Sizes
Going Deeper with Convolutions — Szegedy et al.

"Why choose one kernel size? Use multiple in parallel." Inception module: 1×1, 3×3, 5×5 convs side by side, concatenate results. Introduced 1×1 convolutions for cheap channel reduction before expensive operations.

Problem: Complex architecture, hard to modify. Still couldn't go arbitrarily deep.

2015
ResNet — The Skip Connection Breakthrough ⭐
Deep Residual Learning — He et al.

Solved the degradation problem — why deeper networks performed worse. Key innovation: skip connections. Instead of learning H(x), learn the residual F(x) = H(x) - x, so output = F(x) + x.

Fix: The +x provides a gradient highway. Even if F(x) gradients vanish, gradient still flows through the identity. Enabled 152 layers (vs VGG's 19). Still the backbone of most vision models today.

2017
MobileNet — CNNs for Phones
MobileNets — Howard et al.

Key trick: depthwise separable convolutions. Split spatial filtering (depthwise) from channel mixing (pointwise 1×1). Same effective operation, ~8× fewer parameters and FLOPs.

Fix: Made CNNs practical for mobile devices and edge deployment. Spawned MobileNetV2, V3 with further optimisations.

2017
Squeeze-and-Excitation Networks
SE-Net — Hu et al.

"Let the network learn which channels are important." Global average pool → FC → sigmoid → reweight channels. Adds channel attention with minimal overhead.

Fix: Can be added to any architecture. Won ImageNet 2017. Channel attention became a standard component.

2019
EfficientNet — Neural Architecture Search
EfficientNet — Tan & Le

"Stop hand-designing, let NAS find optimal scaling." Compound scaling: scale depth, width, and resolution together with fixed ratios. EfficientNet-B0 to B7 family.

Fix: State-of-the-art accuracy/efficiency tradeoff. Showed that scaling all dimensions together beats scaling just one.

2020
Vision Transformer (ViT) — Attention Is All You Need (for Vision)
An Image is Worth 16x16 Words — Dosovitskiy et al.

Throw away convolutions entirely. Chop image into 16×16 patches, flatten, add positional embeddings, feed to standard transformer encoder. Direct application of NLP transformer architecture to vision.

Problem: Needs massive data (ImageNet-21k or JFT-300M) to match CNNs. Without enough data, lacks the inductive bias that convolutions provide for free.

2021
Swin Transformer — Efficient Vision Transformer
Swin Transformer — Liu et al.

Make ViT efficient and hierarchical like CNNs. Key innovations: shifted windows for local attention (not full image), hierarchical feature maps (like CNN pyramid), linear complexity in image size.

Fix: Best of both worlds — transformer flexibility with CNN-like efficiency. State-of-the-art on many vision benchmarks.

2022
ConvNeXt — CNNs Strike Back
A ConvNet for the 2020s — Liu et al.

"What if we modernise CNNs with all the tricks learned from transformers?" Pure CNN with: larger kernels (7×7), fewer activations, LayerNorm instead of BatchNorm, inverted bottlenecks, separate downsampling layers.

Result: Pure CNN matching Swin Transformer performance. Proved that architecture differences matter less than training recipes. CNNs aren't dead.

The Through-Line

Every innovation attacks one of three constraints:

  1. Compute and parameters: do the same job with fewer weights and FLOPs (VGG's 3×3s, Inception's 1×1s, MobileNet, EfficientNet).
  2. Trainable depth: keep gradients flowing so networks can actually go deeper (ReLU, BatchNorm, ResNet's skip connections).
  3. Context: see more of the image, sooner (dilations, attention, ViT, Swin).

12 Why Vision Transformers?

🤔 Question

"But doesn't receptive field handle global context for CNNs?"

✓ Answer

Yes, but only at the end of the network.

Layer 1:  RF = 3×3     → can only compare nearby pixels
Layer 5:  RF = 11×11   → still local
Layer 10: RF = 21×21   → still not full image
...
Layer 30: RF = 224×224 → finally sees the whole image (with stride-2 downsampling along the way)

The network can eventually see both corners. But information has to travel through 30 layers — by the time distant pixels "meet", the information has been transformed and potentially lost.

Transformers: Self-attention compares every patch to every other patch directly in layer 1. Global context immediately.

Property          CNN                        Transformer
Local patterns    Great (built-in bias)      Has to learn it
Global patterns   Slow (needs depth)         Instant (attention)
Data needed       Less (strong bias helps)   More (no bias)
Compute           Cheaper                    O(n²) in patches
🤔 Question

"But I thought the point of CNNs was local position info vs flattened networks? Aren't transformers just fancy flattened networks?"

✓ Answer

Good catch — there's a tension here.

CNNs bake in "local matters" as a hard constraint. Transformers say "learn what matters" — attention can look everywhere, but learns to focus locally when useful.

What actually happens in trained ViT: early layers have mostly local attention (learns CNN-like behaviour), later layers become more global. The transformer essentially learns to act like a CNN early on, then goes beyond.

ConvNeXt (2022) showed: if you train CNNs with transformer-style tricks, they match transformers. Architecture may matter less than training recipes.

Part V

Advanced Topics

13 1×1 Convolutions

🤔 Question

"A 1×1 kernel? What can it even do?"

✓ Answer

No spatial mixing — just channel mixing. It's a learned weighted combination across channels at each pixel.

Input at one pixel: [c1, c2, c3, c4]  (4 channels)
1×1 kernel:         [w1, w2, w3, w4]

Output = w1×c1 + w2×c2 + w3×c3 + w4×c4

That's a dot product — combining information from all channels.
🤔 Question

"For a single input channel, a 1×1 conv would just result in scaled input?"

✓ Answer

Yes, exactly. 1×1 on a single channel = scaling = pointless. The power comes from channel mixing when you have multiple channels.

Use cases (sketched in code below):

  - Channel reduction before expensive 3×3/5×5 convs (Inception modules, ResNet bottlenecks)
  - Channel expansion back up after a bottleneck
  - The pointwise step in depthwise separable convolutions (next section)
  - Adding extra non-linearity cheaply: 1×1 conv + ReLU without touching spatial structure
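A rough sketch of the bottleneck idea (the channel counts are made up, but the pattern is the ResNet/Inception one):

import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)

# Option A: 3x3 conv directly on 256 channels
direct = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Option B: squeeze channels with a 1x1, do the 3x3 on fewer channels, expand back
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # channel mixing only, no spatial mixing
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 256, kernel_size=1),
)

print(sum(p.numel() for p in direct.parameters()))       # 590,080
print(sum(p.numel() for p in bottleneck.parameters()))   # 70,016
print(bottleneck(x).shape)                                # torch.Size([1, 256, 56, 56])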

14 Depthwise Separable Convolutions

Normal conv does spatial filtering AND channel mixing in one operation. Depthwise separable splits them:

Step 1 — Depthwise: one 3×3 kernel per channel (spatial only)
  Input: H × W × 64 → 64 separate 3×3 kernels → H × W × 64
  Cost: 3 × 3 × 64 = 576 parameters

Step 2 — Pointwise: 1×1 conv (channel mixing only)
  Input: H × W × 64 → 128 filters of 1×1×64 → H × W × 128
  Cost: 1 × 1 × 64 × 128 = 8,192 parameters

Total: 576 + 8,192 = 8,768 parameters
Normal conv: 3 × 3 × 64 × 128 = 73,728 parameters
~8× cheaper!

Used in MobileNet, EfficientNet — anywhere efficiency matters.
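In PyTorch the depthwise step is just a grouped convolution. A sketch reproducing the numbers above (bias omitted so the parameter counts match):

import torch
import torch.nn as nn

# Normal conv: spatial filtering and channel mixing in one operation
normal = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Depthwise separable: split into two steps
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),  # depthwise: one 3x3 kernel per channel
    nn.Conv2d(64, 128, kernel_size=1, bias=False),                       # pointwise: 1x1 channel mixing
)

print(sum(p.numel() for p in normal.parameters()))      # 73,728
print(sum(p.numel() for p in separable.parameters()))   # 8,768
print(separable(torch.randn(1, 64, 32, 32)).shape)      # torch.Size([1, 128, 32, 32])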

15 Dilated (Atrous) Convolutions

Expand receptive field without pooling by spacing out the kernel:

Normal 3×3:          Dilated 3×3 (dilation=2):
■ ■ ■                ■ · ■ · ■
■ ■ ■                · · · · ·
■ ■ ■                ■ · ■ · ■
                     · · · · ·
Sees 3×3             ■ · ■ · ■

                     Sees 5×5 area, same 9 parameters
🤔 Question

"What are the gaps?"

✓ Answer

The kernel skips pixels. The · positions are just ignored — the kernel doesn't look at them. Same 9 weights, but applied to pixels that are spaced apart.

Used in DeepLab (segmentation), WaveNet (audio) — anywhere you need global context without losing resolution.
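In PyTorch this is just the dilation argument. A quick sketch (the input size is arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

normal  = nn.Conv2d(1, 1, kernel_size=3, padding=1)               # sees a 3x3 area
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)   # sees a 5x5 area, still 9 weights

print(normal.weight.numel(), dilated.weight.numel())   # 9 9
print(normal(x).shape, dilated(x).shape)               # both torch.Size([1, 1, 32, 32])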

16 Transposed Convolutions & U-Net

Transposed Convolution

For upsampling — going from small to big. Used in segmentation, GANs.

🤔 Question

"Why did transposed conv get a bad rep?"

✓ Answer

Checkerboard artifacts. When stride > 1, the "stamps" overlap unevenly — some pixels get more contributions than others, creating visible grid patterns.

Modern solution: Nearest neighbour upsample + regular conv. Cleaner results.

Transposed conv: tries to upsize AND learn in one step
  → uneven "stamps" create grid patterns

Upsample + conv:
  Step 1: Upsample (dumb resize — repeat or interpolate)
  Step 2: Conv (learned refinement)
  → cleaner, no artifacts
🤔 Question

"Why conv after upsample?"

✓ Answer

Upsampling alone doesn't add any new information — it just repeats or interpolates. Conv after lets the network learn how to fill in details.
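A sketch of the two options side by side (channel counts are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 14, 14)

# Transposed conv: upsamples and learns in one step (prone to checkerboard artifacts)
transposed = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)

# Upsample + conv: dumb resize first, then a learned refinement
upsample_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)

print(transposed(x).shape)      # torch.Size([1, 32, 28, 28])
print(upsample_conv(x).shape)   # torch.Size([1, 32, 28, 28])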

U-Net

Segmentation architecture with encoder-decoder + skip connections:

Encoder (downsample):                         Decoder (upsample):
224×224 ────────────────────────────────────► concat → 224×224
  ↓                                                       ↑
112×112 ──────────────────────────────────► concat → 112×112
  ↓                                                       ↑
 56×56 ─────────────────────────────────► concat → 56×56
  ↓                                                       ↑
 28×28 ───────────────────────────────► concat → 28×28
  ↓                                                       ↑
14×14 (bottleneck) ───────────────────────────────────────

The U shape:
  Down          Up
    \          /
     \        /
      \      /
       \    /
     bottleneck

Skip connections from encoder to decoder preserve fine spatial details (edges, precise boundaries) that get lost during downsampling.

17 BatchNorm & Skip Connections

BatchNorm in CNNs

CNN activations have shape: (batch, channels, H, W)

BatchNorm2d normalises across batch AND spatial dimensions, but separately per channel:

For channel 0:
  Collect all values: 32 batches × 14 × 14 positions = 6,272 values
  Compute mean and variance
  Normalise

Repeat for each of 64 channels.

Why per channel? Each channel detects different features with different statistics.
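A quick check in PyTorch (a sketch using the 32 × 64 × 14 × 14 shape from above):

import torch
import torch.nn as nn

x = torch.randn(32, 64, 14, 14)            # (batch, channels, H, W)
bn = nn.BatchNorm2d(64)                    # one gamma/beta pair and running stats per channel
y = bn(x)

print(bn.weight.shape, bn.bias.shape)      # torch.Size([64]) torch.Size([64])

# Each channel is normalised over batch x H x W = 32 x 14 x 14 = 6,272 values:
print(round(y[:, 0].mean().item(), 3), round(y[:, 0].std().item(), 3))   # ~0.0, ~1.0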

Skip Connections — Why Gradients Flow

The problem with deep networks:

$$\frac{\partial \text{Loss}}{\partial \text{Layer}_1} = \frac{\partial \text{Loss}}{\partial \text{Layer}_{50}} \times \frac{\partial \text{Layer}_{50}}{\partial \text{Layer}_{49}} \times \cdots$$

That's 49 multiplications. If each term < 1, gradient vanishes.

With skip connection:

$$y = F(x) + x$$ $$\frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$$
Key Insight

That +1 is the key. Even if ∂F(x)/∂x vanishes, the identity term still contributes 1. The gradient has a "highway" — it can skip directly through the + without being multiplied by small weights 50 times.

That's why ResNet could train 152 layers when VGG struggled with 19.
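Here's what that looks like as a minimal residual block in PyTorch (a sketch of the pattern, not ResNet's exact block):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # F(x) + x: the identity path carries gradient even if F's does not

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])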

TL;DR
🎯 Interview Tip

"Convolutions work because images have spatial structure โ€” nearby pixels are related. Weight sharing means we learn local patterns once and apply everywhere, giving translation equivariance. Receptive field grows with depth: early layers detect edges, deeper layers combine them into objects. Modern architectures use stride 2 conv instead of pooling (better gradients), skip connections (enables depth), and may even drop convolutions entirely for attention (ViT) when data is plentiful."


Part of the Computer Vision series