From sliding kernels to Vision Transformers: why convolutions work, how they evolved, and when to use them
Before CNNs, people used fully connected layers on images. The problem? They don't scale.
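To make the parameter explosion concrete with a standard back-of-the-envelope example: a 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. Connect that to just 1,000 hidden neurons and you already need ~150 million weights, before the network has done anything useful.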
Beyond the parameter explosion, fully connected layers destroy spatial structure.
"What do you mean by 'destroy spatial structure'?"
When you flatten an image, you turn a 2D grid into a 1D vector:
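```
p1  p2  p3  p4  p5
p6  p7  p8  p9  p10
p11 p12 p13 p14 p15    →    [p1, p2, p3, ..., p25]
p16 p17 p18 p19 p20
p21 p22 p23 p24 p25
```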
The network has no idea that p1 and p2 are neighbours while p1 and p25 are at opposite corners. It's just 25 numbers with 25 separate weights. The spatial layout is destroyed the moment you flatten.
"Why can't we just add positional information back in?"
You can! And people do: Vision Transformers add positional embeddings to image patches. But there's a catch: adding spatial info lets the network learn spatial relationships, but doesn't force it to. The network still has to figure out from data that "nearby pixels matter more."
Convolutions bake in that assumption. The kernel can physically only see a local region. No learning is required to know that neighbours are related.
Instead of connecting every pixel to every neuron, slide a small kernel across the image. The same kernel weights are applied at every position.
Think about what a "vertical edge detector" is:
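One classic hand-crafted version looks like this (a trained network learns its own values):

```
+1   0  -1
+1   0  -1
+1   0  -1
```

Bright-to-dark transitions from left to right produce a large positive response; flat regions produce zero.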
"Why is weight sharing important beyond just reducing parameters?"
Should the network learn a separate vertical edge detector for the top-left corner vs the bottom-right corner? No: a vertical edge is a vertical edge regardless of position.
Weight sharing means: learn the pattern once, apply it everywhere.
Without weight sharing, the network might learn "vertical edge at position (10,10)" and "vertical edge at position (50,50)" as separate things. Needs way more data to generalise.
Two properties make convolutions powerful: local connectivity (each output depends only on a small neighbourhood of pixels) and weight sharing (the same kernel is reused at every position, giving translation equivariance).
A kernel (small matrix) slides across the image. At each position: multiply the kernel weights elementwise with the pixels underneath, sum the results, and write that single number into the output feature map.
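A minimal sketch of this using PyTorch's functional API, with the edge-detector kernel from above:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 5, 5)  # (batch, channels, height, width)
kernel = torch.tensor([[[[1., 0., -1.],
                         [1., 0., -1.],
                         [1., 0., -1.]]]])  # shape (out_ch, in_ch, 3, 3)

out = F.conv2d(image, kernel)  # no padding, stride 1
print(out.shape)  # torch.Size([1, 1, 3, 3]): a 5x5 input shrinks to 3x3
```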
How big is the output after convolution?
| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 5×5 | 3×3 | 0 | 1 | 3×3 |
| 28×28 | 3×3 | 1 | 1 | 28×28 |
| 28×28 | 3×3 | 1 | 2 | 14×14 |
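In general, per spatial dimension: output = floor((input + 2 × padding - kernel_size) / stride) + 1. Checking the last row: floor((28 + 2 - 3) / 2) + 1 = 13 + 1 = 14.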
padding = kernel_size // 2 keeps the spatial size the same (for stride=1). It adds zeros around the input so the kernel can be centred on edge pixels.
"What's the point of stride > 1? Why not always stride 1?"
Stride controls how much the kernel moves each step. stride=2 means skip every other position, roughly halving the spatial dimensions.
It's used for downsampling: making feature maps smaller as you go deeper. Smaller feature maps = less computation, and deeper layers need to "see" larger regions of the original image.
Real images are RGB: 3 channels. The kernel becomes 3D to handle this: shape (3, k, k), one k×k slice per input channel, with the three per-channel responses summed into a single output number.
"Why do we sum all 3 channel outputs into one number?"
Because we want to detect patterns that involve all channels together.
Think about detecting "orange" in an image: high red, medium green, low blue. A single-channel kernel can't see this; it only looks at one colour. But a 3-channel kernel can learn weights that respond when all three conditions are met.
Same for skin detection: skin has a specific RGB signature. The kernel needs to see red AND green AND blue together to recognise it.
One kernel spans ALL input channels and collapses them into one output channel.
Want to detect multiple patterns? Use multiple kernels.
In PyTorch:
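A minimal layer matching the description above (3 input channels, 16 kernels):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# 16 kernels, each spanning all 3 input channels:
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])

x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)      # torch.Size([1, 16, 28, 28]): one output channel per kernel
```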
"So each of the 16 kernels applies on the RGB channels and sums up the output to get one channel?"
Yes, exactly.
"What's the computational cost of a conv layer?"
At each output position, the kernel does k × k × in_channels multiply-adds.
Total cost: k × k × in_channels × out_channels × H' × W' multiply-adds, where H' × W' is the output size.
"Why multiply by H' ร W'?"
The kernel has to slide across every position in the output. Each position requires those k² × in_channels operations. More positions = more computation.
Conceptually it's not one big matrix multiplication: it's literally "slide to position, compute, slide to next position, compute..." (in practice, libraries often rewrite this as a matrix multiply via im2col, but the cost is the same).
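A worked example with illustrative numbers: a 3×3 conv from 64 to 128 channels producing a 56×56 output costs 3 × 3 × 64 × 128 × 56 × 56 ≈ 231 million multiply-adds, for a single layer.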
Early layers learn simple patterns: oriented edges, corners, and colour blobs.
These aren't hand-designed: the network learns them via backprop. But if you visualise learned kernels from layer 1, they look exactly like classic edge detectors.
The network builds a hierarchy: simple → complex. Each layer combines the previous layer's features into something more abstract. That's why depth works.
Both shrink spatial size. What's the difference?
"Why is max pooling bad for gradients? Won't the non-max pixels never update?"
Yes, same issue as ReLU: gradient flows only to the max position, the others get zero. But in practice it's rarely a problem: the max location changes from input to input, so across a dataset every upstream neuron still receives gradient.
"Why did people move away from pooling?"
Two reasons: pooling is a fixed operation with no parameters, so it can't learn what to keep; and it sends gradient only to the max position. A stride 2 conv fixes both:
| Property | Max Pooling | Stride 2 Conv |
|---|---|---|
| Parameters | None | Learned |
| Gradient flow | Only to max position | To all positions |
| Flexibility | Fixed operation | Learns what to keep |
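Both produce the same output shape; a quick sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # fixed operation, no parameters
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # learned downsampling

print(pool(x).shape)  # torch.Size([1, 64, 16, 16])
print(down(x).shape)  # torch.Size([1, 64, 16, 16])
```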
"If each conv only sees 3ร3, how does the network see the whole image?"
Stacking layers expands the receptive field.
For stride 1 and kernel size k: each layer adds (k - 1) to the receptive field, so after L layers it is 1 + L × (k - 1). Two stacked 3×3 convs see 5×5; three see 7×7.
With stride > 1, the receptive field grows faster: each layer adds (k - 1) × stride_product instead, as the helper below shows.
"What's stride_product?"
Cumulative product of all strides from previous layers. After stride 2, the feature map is half the size โ so one pixel in that feature map corresponds to 2 pixels in the previous layer.
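A small helper that applies this recurrence (a sketch; layers is a list of (kernel_size, stride) pairs chosen for illustration):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), in network order."""
    rf, stride_product = 1, 1
    for k, s in layers:
        rf += (k - 1) * stride_product  # each layer adds (k - 1) x stride_product
        stride_product *= s
    return rf

print(receptive_field([(3, 1)] * 3))  # 7: three stride-1 3x3 convs see a 7x7 region
print(receptive_field([(3, 2)] * 3))  # 15: strides compound, so the field grows much faster
```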
LeNet (1998) is one of the first CNNs, and simple enough to understand fully:
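A sketch of the classic layout in PyTorch. One caveat: the 1998 original used sigmoid-like activations and average pooling; this version uses the modern ReLU and max pool equivalents.

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet()(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```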
"What happens for variable image sizes? The FC part would break, right?"
Exactly. Conv layers don't care about image size: the kernel just slides however many times it can. But the FC layer has a fixed input size.
Solutions: resize or crop inputs to a fixed size; replace the flatten with global average pooling so the FC input is fixed regardless of H × W (sketched below); or go fully convolutional with no FC at all.
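A sketch of the global-average-pooling fix (the layer sizes here are arbitrary examples):

```python
import torch
import torch.nn as nn

# Global average pooling makes the head size-independent:
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # (B, C, H, W) -> (B, C, 1, 1) for any H, W
    nn.Flatten(),             # -> (B, C)
    nn.Linear(64, 10),
)

print(head(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 10])
print(head(torch.randn(1, 64, 57, 91)).shape)  # torch.Size([1, 10]): any input size works
```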
First working CNN. Conv → Pool → FC architecture for digit recognition. Proved that neural networks could learn hierarchical features from raw pixels.
Problem: Limited by compute of the era. Couldn't scale to larger images or deeper networks.
AlexNet (2012): LeNet scaled up massively. Won ImageNet by a huge margin. Key additions: ReLU (instead of sigmoid), Dropout for regularisation, GPU training with model parallelism.
Problem: Still just "stack more layers", with no principled way to go deeper. 8 layers was already pushing limits.
"Just use 3ร3 convs everywhere, stack them deep." VGG-16 and VGG-19. Key insight: two 3ร3 convs have same receptive field as one 5ร5, but fewer parameters and more non-linearity.
Problem: 138M parameters. Very slow. Going beyond 19 layers caused degradation: deeper models performed worse.
"Why choose one kernel size? Use multiple in parallel." Inception module: 1ร1, 3ร3, 5ร5 convs side by side, concatenate results. Introduced 1ร1 convolutions for cheap channel reduction before expensive operations.
Problem: Complex architecture, hard to modify. Still couldn't go arbitrarily deep.
ResNet (2015): Solved the degradation problem (why deeper networks performed worse). Key innovation: skip connections. Instead of learning H(x), learn the residual F(x) = H(x) - x, so output = F(x) + x.
Fix: The +x provides a gradient highway. Even if F(x) gradients vanish, gradient still flows through identity. Enabled 152 layers (vs VGG's 19). Still the backbone of most vision models today.
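A minimal residual block sketch (simplified: real ResNet blocks also include batch norm, and use a 1×1 conv on the skip path when the shape changes):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x), the residual
        return self.relu(out + x)                   # F(x) + x: the skip connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])
```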
MobileNet (2017): Key trick: depthwise separable convolutions. Split spatial filtering (depthwise) from channel mixing (pointwise 1×1). Nearly the same effective operation, ~8× fewer parameters and FLOPs.
Fix: Made CNNs practical for mobile devices and edge deployment. Spawned MobileNetV2, V3 with further optimisations.
"Let the network learn which channels are important." Global average pool โ FC โ sigmoid โ reweight channels. Adds channel attention with minimal overhead.
Fix: Can be added to any architecture. Won ImageNet 2017. Channel attention became a standard component.
"Stop hand-designing, let NAS find optimal scaling." Compound scaling: scale depth, width, and resolution together with fixed ratios. EfficientNet-B0 to B7 family.
Fix: State-of-the-art accuracy/efficiency tradeoff. Showed that scaling all dimensions together beats scaling just one.
ViT, the Vision Transformer (2020): Throw away convolutions entirely. Chop the image into 16×16 patches, flatten, add positional embeddings, feed to a standard transformer encoder. Direct application of the NLP transformer architecture to vision.
Problem: Needs massive data (ImageNet-21k or JFT-300M) to match CNNs. Without enough data, lacks the inductive bias that convolutions provide for free.
Swin Transformer (2021): Make ViT efficient and hierarchical like CNNs. Key innovations: shifted windows for local attention (not full image), hierarchical feature maps (like the CNN pyramid), linear complexity in image size.
Fix: The best of both worlds, transformer flexibility with CNN-like efficiency. State-of-the-art on many vision benchmarks.
"What if we modernise CNNs with all the tricks learned from transformers?" Pure CNN with: larger kernels (7ร7), fewer activations, LayerNorm instead of BatchNorm, inverted bottlenecks, separate downsampling layers.
Result: Pure CNN matching Swin Transformer performance. Proved that architecture differences matter less than training recipes. CNNs aren't dead.
Every innovation attacks one of three constraints: compute (LeNet → AlexNet → MobileNet → EfficientNet), gradient flow in deep networks (ReLU, skip connections), or how fast the receptive field reaches global context (dilated convs, attention).
"But doesn't receptive field handle global context for CNNs?"
Yes, but only at the end of the network.
The network can eventually see both corners. But information has to travel through 30 layers; by the time distant pixels "meet", the information has been transformed and potentially lost.
Transformers: Self-attention compares every patch to every other patch directly in layer 1. Global context immediately.
| Property | CNN | Transformer |
|---|---|---|
| Local patterns | Great (built-in bias) | Has to learn it |
| Global patterns | Slow (needs depth) | Instant (attention) |
| Data needed | Less (strong bias helps) | More (no bias) |
| Compute | Cheaper | O(n²) in patches |
"But I thought the point of CNNs was local position info vs flattened networks? Aren't transformers just fancy flattened networks?"
Good catch: there's a tension here.
CNNs bake in "local matters" as a hard constraint. Transformers say "learn what matters": attention can look everywhere, but learns to focus locally when useful.
What actually happens in trained ViT: early layers have mostly local attention (learns CNN-like behaviour), later layers become more global. The transformer essentially learns to act like a CNN early on, then goes beyond.
ConvNeXt (2022) showed: if you train CNNs with transformer-style tricks, they match transformers. Architecture may matter less than training recipes.
"A 1ร1 kernel? What can it even do?"
No spatial mixing, just channel mixing. It's a learned weighted combination across channels at each pixel.
"For a single input channel, a 1ร1 conv would just result in scaled input?"
Yes, exactly. 1×1 on a single channel = scaling = pointless. The power comes from channel mixing when you have multiple channels.
Use cases: cheap channel reduction before expensive 3×3 or 5×5 convs (the Inception trick), the pointwise step in depthwise separable convolutions, and adjusting channel counts so skip connections can add up.
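For example, Inception-style channel reduction (sizes are illustrative):

```python
import torch
import torch.nn as nn

reduce = nn.Conv2d(256, 64, kernel_size=1)  # channel mixing only, no spatial mixing

x = torch.randn(1, 256, 28, 28)
print(reduce(x).shape)  # torch.Size([1, 64, 28, 28]): same H and W, 4x fewer channels
```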
Normal conv does spatial filtering AND channel mixing in one operation. Depthwise separable splits them:
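A sketch in PyTorch; the key detail is groups=in_channels, which gives each input channel its own spatial filter:

```python
import torch.nn as nn

in_ch, out_ch = 64, 128

# Standard conv: spatial filtering and channel mixing in one step
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable: split into two cheaper steps
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)  # spatial only
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)                          # channel mixing only

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard))                       # 73856
print(params(depthwise) + params(pointwise))  # 8960: roughly 8x fewer
```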
Used in MobileNet and EfficientNet: anywhere efficiency matters.
Expand receptive field without pooling by spacing out the kernel:
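For a 3×3 kernel with dilation 2, the footprint looks like this (x = kernel weight, · = skipped pixel):

```
x · x · x
· · · · ·
x · x · x
· · · · ·
x · x · x
```

Same 9 weights, now covering a 5×5 region. In PyTorch this is just the dilation argument: nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2).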
"What are the gaps?"
The kernel skips pixels. The · positions are just ignored; the kernel doesn't look at them. Same 9 weights, but applied to pixels that are spaced apart.
Used in DeepLab (segmentation) and WaveNet (audio): anywhere you need global context without losing resolution.
Transposed convolution goes the other way: upsampling, from small to big. Used in segmentation and GANs.
"Why did transposed conv get a bad rep?"
Checkerboard artifacts. When stride > 1, the "stamps" overlap unevenly: some pixels get more contributions than others, creating visible grid patterns.
Modern solution: Nearest neighbour upsample + regular conv. Cleaner results.
"Why conv after upsample?"
Upsampling alone doesn't add any new information: it just repeats or interpolates. A conv after lets the network learn how to fill in details.
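A sketch of the modern pattern (channel counts are illustrative):

```python
import torch
import torch.nn as nn

upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),  # 2x bigger, but no new information
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # learns how to fill in the details
)

print(upsample_block(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 32, 32])
```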
U-Net is the classic example: a segmentation architecture built as an encoder-decoder with skip connections (a sketch follows below).
Skip connections from encoder to decoder preserve fine spatial details (edges, precise boundaries) that get lost during downsampling.
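A minimal two-level sketch to show the shape of the idea (real U-Nets stack several of these levels and use more convs per level):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One downsample, one upsample, one skip connection."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # downsample
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(32, 16, 3, padding=1),
        )
        self.dec = nn.Conv2d(32, 1, 3, padding=1)  # 32 = 16 (upsampled) + 16 (skip)

    def forward(self, x):
        e = self.enc(x)                             # fine details live here
        d = self.up(torch.relu(self.down(e)))       # detail lost going down, coarse context back up
        return self.dec(torch.cat([d, e], dim=1))   # skip connection restores the detail

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```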
CNN activations have shape: (batch, channels, H, W)
BatchNorm2d normalises across batch AND spatial dimensions, but separately per channel:
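A sketch reproducing what nn.BatchNorm2d computes (before its learnable scale and shift):

```python
import torch

x = torch.randn(8, 16, 32, 32)  # (batch, channels, H, W)

# One mean and variance per channel, computed over batch + spatial dims:
mean = x.mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 16, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

bn = torch.nn.BatchNorm2d(16)  # training mode: uses batch statistics
print(torch.allclose(bn(x), x_hat, atol=1e-4))  # True: the affine part starts as identity
```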
The problem with deep networks: by the chain rule, the gradient reaching the early layers is a product of per-layer terms. With 50 layers, ∂L/∂x1 = ∂L/∂x50 × ∂x50/∂x49 × ... × ∂x2/∂x1. That's 49 multiplications. If each term < 1, the gradient vanishes.
With a skip connection, the block computes y = F(x) + x, so ∂y/∂x = ∂F(x)/∂x + 1.
That +1 is the key. Even if ∂F(x)/∂x vanishes, the gradient is still at least 1. The gradient has a "highway": it can skip directly through the + without being multiplied by small weights 50 times.
That's why ResNet could train 152 layers when VGG struggled with 19.
"Convolutions work because images have spatial structure โ nearby pixels are related. Weight sharing means we learn local patterns once and apply everywhere, giving translation equivariance. Receptive field grows with depth: early layers detect edges, deeper layers combine them into objects. Modern architectures use stride 2 conv instead of pooling (better gradients), skip connections (enables depth), and may even drop convolutions entirely for attention (ViT) when data is plentiful."