Can we get both? Sharp outputs AND stable training?
2 The Diffusion Insight: Gradual Refinement
The Problem with One-Shot Generation
GAN:
random noise z ──→ [Generator] ──→ full image
One network must learn the ENTIRE mapping from noise to image.
Huge leap. Easy to mess up. Unstable.
The Diffusion Solution
Diffusion:
pure noise ──→ slightly cleaner ──→ less noisy ──→ ... ──→ almost clean ──→ clean image
Many small steps. Each step is a tiny correction.
Much easier to learn. Stable.
Key Insight
Instead of generating an image in one shot, gradually refine from noise. Each step only
needs to make a small correction — much easier than generating everything at once.
An Analogy
Sculpting marble:
GAN approach:
"Here's a block of marble. Carve a perfect statue. One shot. Go."
Diffusion approach:
"Here's a rough shape. Make it slightly more statue-like."
"Good. Now slightly more."
"More..."
"More..."
"Done."
Which is easier to learn?
Part II
Setup & Nomenclature
3 The Full Notation (Read This First!)
Before anything else, let's nail down every symbol.
Core Variables
| Symbol | Name | What It Is |
|---|---|---|
| $x_0$ | Clean image | The original, noise-free image from our dataset |
| $x_t$ | Noisy image at step $t$ | The image after adding noise for $t$ steps |
| $x_T$ | Pure noise | After $T$ steps, the image is pure Gaussian noise |
| $T$ | Total timesteps | Number of noise steps (typically 1000) |
| $t$ | Current timestep | Which step we're at ($t=0$: clean, $t=T$: noise) |
| $\epsilon$ | The noise | Gaussian noise $\sim \mathcal{N}(0, I)$ added to the image |
Noise Schedule Parameters
| Symbol | Name | What It Is |
|---|---|---|
| $\beta_t$ | Noise variance at step $t$ | How much noise to add at step $t$ (small, e.g., 0.0001 to 0.02) |
| $\alpha_t$ | Signal retention | $\alpha_t = 1 - \beta_t$: how much signal survives step $t$ |
| $\bar{\alpha}_t$ | Cumulative signal retention | $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$: total signal remaining after $t$ steps |
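The schedule above takes a few lines to compute. A minimal numpy sketch, using the conventional DDPM values ($T = 1000$, $\beta$ linear from 0.0001 to 0.02 — typical choices, not requirements):

```python
import numpy as np

# Linear noise schedule with the typical DDPM values.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_t: noise variance added at step t
alphas = 1.0 - betas                 # alpha_t: fraction of signal kept at step t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t: total signal left after t steps

# Early on nearly all signal survives; by t = T almost none does.
print(round(alpha_bars[0], 4))    # 0.9999
print(alpha_bars[-1] < 1e-3)      # True
```

Note that `alpha_bars` decays smoothly toward zero, which is exactly what makes $x_T$ indistinguishable from pure noise.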
Processes
| Symbol | Name | What It Is |
|---|---|---|
| $q(x_t \mid x_{t-1})$ | Forward process (one step) | Add noise: go from $x_{t-1}$ to $x_t$ |
| $q(x_t \mid x_0)$ | Forward process (direct) | Jump directly from clean $x_0$ to noisy $x_t$ |
| $p_\theta(x_{t-1} \mid x_t)$ | Reverse process | Remove noise: go from $x_t$ to $x_{t-1}$ (LEARNED) |
| $\epsilon_\theta(x_t, t)$ | Noise predictor | Neural network that predicts the noise in $x_t$ |
The Big Picture
FORWARD (fixed, not learned):
Add noise step by step
x₀ ───→ x₁ ───→ x₂ ───→ ... ───→ x_T
clean   bit     more             pure
image   noisy   noisy            noise
REVERSE (learned):
Remove noise step by step
x_T ───→ x_{T-1} ───→ ... ───→ x₁ ───→ x₀
pure     slightly              almost  clean
noise    less noisy            clean   image!
4 The Forward Process: Destroying an Image
The forward process is fixed (not learned). We just add noise.
One Step of Noise
To go from $x_{t-1}$ to $x_t$, we add a little Gaussian noise:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$$

Chaining these steps collapses into a direct jump from the clean image to any timestep:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Here, $\epsilon \sim \mathcal{N}(0, I)$ is one sample of standard Gaussian noise. It's the "total noise" that, properly scaled, takes us from $x_0$ to $x_t$.
Think of it this way:
x₀ = clean image
ε = one random noise image (same size as x₀)
x_t = (shrink x₀ a bit) + (scale up ε)
= √ᾱₜ · x₀ + √(1-ᾱₜ) · ε
At t=0: √ᾱ₀ ≈ 1, √(1-ᾱ₀) ≈ 0 → x₀ ≈ x₀ (clean)
At t=T: √ᾱ_T ≈ 0, √(1-ᾱ_T) ≈ 1 → x_T ≈ ε (pure noise)
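The direct-jump formula in code. A minimal numpy sketch — the 8×8 array is a stand-in for a real image, and the schedule values are the conventional DDPM choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conventional DDPM schedule.
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def q_sample(x0, t, eps):
    """Jump straight from clean x0 to noisy x_t: sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = rng.standard_normal((8, 8))   # stand-in for a clean image
eps = rng.standard_normal((8, 8))  # the noise the network must later recover

x_early = q_sample(x0, 0, eps)     # t small -> almost exactly x0
x_late = q_sample(x0, T - 1, eps)  # t = T-1 -> almost pure noise eps
```

At `t=0` the output is essentially the clean image; at `t=T-1` it is essentially the noise sample, matching the two limiting cases above.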
The network's job: given $x_t$ and $t$, figure out what $\epsilon$ was used.
"Wait, we're using MSE loss. Didn't MSE cause VAE's blurriness? Why doesn't it here?"
Great question! The difference is what we're applying MSE to.
VAE's Problem
VAE predicts: THE FINAL IMAGE directly
Input: latent z
Output: full image x̂
Loss: ||x - x̂||²
When uncertain about whisker position:
"Could be at pixel 50 or 51..."
MSE says: output 0.5 at both → BLUR
Diffusion's Solution
Diffusion predicts: THE NOISE
Input: noisy image x_t
Output: predicted noise ε̂
Loss: ||ε - ε̂||²
What is noise? Gaussian. Smooth. No fine details.
No "whisker position" to hedge about.
Much easier target.
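One training example, end to end. A sketch of the noise-prediction MSE objective — `toy_eps_model` is a hypothetical untrained stand-in for the real network $\epsilon_\theta(x_t, t)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_eps_model(x_t, t):
    """Stand-in for the real noise predictor eps_theta(x_t, t) (untrained)."""
    return np.zeros_like(x_t)

# Conventional DDPM schedule.
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

# One training example: pick a random timestep, noise the image, ask the
# model for the noise, take MSE against the true noise.
x0 = rng.standard_normal((8, 8))       # clean "image" from the dataset
t = int(rng.integers(T))               # random timestep
eps = rng.standard_normal(x0.shape)    # the noise we add (and the target)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

loss = np.mean((eps - toy_eps_model(x_t, t)) ** 2)   # ||eps - eps_hat||^2
```

Note the target is `eps`, not `x0` — the whole point of the section above.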
The Key Difference
| | VAE | Diffusion |
|---|---|---|
| Predicts | Final image (complex) | Noise (simple Gaussian) |
| Uncertainty | About semantic content | About which noise sample |
| Hedging | Blurs meaningful details | Blurs noise (who cares?) |
Another Reason: Iterative Refinement
Even if one step is slightly wrong, there are many later steps to correct it:
Step 500: Model slightly mispredicts noise
Step 499: Corrects a bit
Step 498: Corrects more
...
Step 0: Fine details emerge correctly
Errors don't compound — they get corrected over many steps.
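The whole reverse loop can be sketched in a few lines. This is DDPM ancestral sampling with an untrained placeholder model; the update formula is the standard one, the 8×8 shape is a toy stand-in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    """Placeholder for the trained noise predictor."""
    return np.zeros_like(x_t)

# DDPM ancestral sampling: start from pure noise and denoise step by step.
x = rng.standard_normal((8, 8))              # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    eps_hat = eps_model(x, t)
    # Mean of p(x_{t-1} | x_t): remove the predicted noise, rescale.
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                                # fresh noise on all but the last step
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

Each loop iteration is one small correction; with a trained `eps_model`, 1000 of them turn noise into an image.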
Part V
Architecture & Extensions
12 The U-Net Architecture
The noise predictor $\epsilon_\theta(x_t, t)$ is typically a U-Net: a convolutional encoder-decoder with skip connections. The downsampling path builds up context, the upsampling path restores resolution, and skip connections carry fine detail across. The timestep $t$ enters through an embedding injected into each block.
Latent Diffusion
Running diffusion on full-resolution pixels is expensive. Latent diffusion moves the whole process into a compressed space:
1. Train a VAE to compress images to small latents (and decompress)
2. Train diffusion in the latent space (64×64 instead of 512×512)
3. Generate: diffuse in latent space → decode to image
Key Insight
Latent diffusion gets 64× speedup (going from 512×512 to 64×64 is 64× fewer pixels). The
VAE handles low-level details; diffusion handles high-level structure.
This is what Stable Diffusion does.
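The data flow can be sketched with placeholders. Every function below is hypothetical — in Stable Diffusion the encoder, decoder, and denoiser are trained networks — only the shapes and the pipeline structure are the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(image):                     # 512x512x3 image -> 64x64x4 latent
    return rng.standard_normal((64, 64, 4))

def vae_decode(latent):                    # 64x64x4 latent -> 512x512x3 image
    return rng.standard_normal((512, 512, 3))

def latent_denoise(z, steps=50):
    for _ in range(steps):                 # the diffusion loop runs on latents
        z = 0.99 * z                       # stand-in for one real denoising step
    return z

# Generation: noise in latent space -> denoise -> decode to pixels.
z_T = rng.standard_normal((64, 64, 4))
image = vae_decode(latent_denoise(z_T))

# The speedup claim: a 64x64 latent has 64x fewer pixels than a 512x512 image.
print((512 * 512) // (64 * 64))            # 64
```

The expensive iterative loop never touches pixel space; only one VAE decode does.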
15 Diffusion Transformers (DiT)
U-Net vs Transformer
U-Net was the original architecture. But transformers scale better.
U-Net:
- Convolutional
- Good for images
- Limited scaling
DiT (Diffusion Transformer):
- Patchify image (like ViT)
- Process with transformer blocks
- Scales to massive models
DiT Architecture
Input: noisy latent x_t
1. Patchify: split into patches (like ViT)
64×64 → 256 patches of 4×4
2. Linear embed each patch → tokens
3. Add positional embedding
4. Add time embedding (adaptive layer norm)
5. Process with transformer blocks
(self-attention + FFN)
6. Unpatchify → predicted noise ε̂
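Steps 1–2 (patchify and flatten) can be sketched with pure numpy reshapes — the 64×64×4 latent and patch size 4 match the example figures above:

```python
import numpy as np

def patchify(x, p=4):
    """Split an HxWxC latent into flattened p*p patch tokens (steps 1-2 above)."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)                    # gather each patch together
    return x.reshape((H // p) * (W // p), p * p * C)  # (num_tokens, token_dim)

tokens = patchify(np.zeros((64, 64, 4)))   # 64x64 latent with 4 channels
print(tokens.shape)                        # (256, 64): 256 tokens of 4*4*4 dims
```

After this, the tokens go through a standard transformer; unpatchify is the same reshape run backwards.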
Why Transformers Win
Scaling: Transformers scale better with compute
Attention: Global attention from the start (not just in middle blocks)
Unified architecture: Same architecture for text, images, video
DiT powers Sora, Stable Diffusion 3, and other cutting-edge models.
16 Final Comparison: VAE vs GAN vs Diffusion
| Aspect | VAE | GAN | Diffusion |
|---|---|---|---|
| Training | Stable ✓ | Unstable ✗ | Stable ✓ |
| Output quality | Blurry ✗ | Sharp ✓ | Sharp ✓ |
| Mode coverage | Good ✓ | Collapse risk ✗ | Excellent ✓ |
| Likelihood | ELBO ✓ | None ✗ | Yes ✓ |
| Sampling speed | Fast ✓ | Fast ✓ | Slow ✗ (many steps) |
| Loss | MSE + KL | Adversarial | Simple MSE on noise |
The Diffusion Advantage
Diffusion gets the best of both worlds:
- Stable training like VAE (no adversarial game)
- Sharp outputs like GAN (iterative refinement)
- Better mode coverage than both (no collapse, no hedging)
The Tradeoff
Diffusion is slow. Generating one image needs ~50-1000 forward passes.
Fixes:
- DDIM: deterministic sampling, fewer steps (~50)
- Latent diffusion: work in a smaller latent space
- Consistency models: distill to single-step generation
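The DDIM idea in a sketch: visit only ~50 of the 1000 timesteps and jump deterministically between them. The placeholder `eps_model` stands in for a trained network; the update is the standard η=0 DDIM step:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_model(x_t, t):
    """Placeholder for the trained noise predictor."""
    return np.zeros_like(x_t)

# Deterministic DDIM (eta = 0): a short subsequence of timesteps, no fresh noise.
steps = np.linspace(T - 1, 0, 50, dtype=int)
x = rng.standard_normal((8, 8))                      # start from pure noise
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps_hat = eps_model(x, t)
    # Estimate x0 from the current x_t, then re-noise to the earlier timestep.
    x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    x = np.sqrt(alpha_bars[t_prev]) * x0_hat + np.sqrt(1 - alpha_bars[t_prev]) * eps_hat
```

Because no noise is injected, the same $x_T$ always maps to the same output — 20× fewer network calls than the 1000-step DDPM loop.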
🎯 Interview Narrative
"VAE has blurry outputs because MSE on images encourages hedging. GAN fixes blur with adversarial
training but is unstable and mode-collapses.
Diffusion takes a different approach: gradually add noise to images (forward process), then train a
network to predict and remove that noise (reverse process). Each step is a small correction, much easier
than generating an image in one shot.
The key insight: predict the noise ε, not the image directly. Noise is simple (Gaussian), so MSE works
fine — no hedging about semantic content.
Training is simple: sample image, add noise at random timestep, predict the noise, MSE loss. No
adversarial game.
Sampling is slow (many steps), but latent diffusion (Stable Diffusion) fixes this by working in
compressed VAE latent space. Modern models use transformer backbones (DiT) instead of U-Net for better
scaling."