Part 3 of 3 • Generative Models Series

Diffusion Models: The Complete Story

From Noise to Images — Sharp Like GANs, Stable Like VAEs

Contents

Part I

Why Diffusion?

1 VAE vs GAN: The Tradeoff We Want to Break

| Model | Training | Output Quality | The Problem |
|-------|----------|----------------|-------------|
| VAE | Stable ✓ | Blurry ✗ | MSE loss encourages hedging |
| GAN | Unstable ✗ | Sharp ✓ | Adversarial game is fragile |

Can we get both? Sharp outputs AND stable training?

2 The Diffusion Insight: Gradual Refinement

The Problem with One-Shot Generation

GAN:

    random noise z ──→ [Generator] ──→ full image

One network must learn the ENTIRE mapping from noise to image. Huge leap. Easy to mess up. Unstable.

The Diffusion Solution

Diffusion:

    pure noise ──→ slightly cleaner ──→ less noisy ──→ ... ──→ almost clean ──→ clean image

Many small steps. Each step is a tiny correction. Much easier to learn. Stable.
Key Insight

Instead of generating an image in one shot, gradually refine from noise. Each step only needs to make a small correction — much easier than generating everything at once.

An Analogy

Sculpting marble.

GAN approach: "Here's a block of marble. Carve a perfect statue. One shot. Go."

Diffusion approach: "Here's a rough shape. Make it slightly more statue-like." "Good. Now slightly more." "More..." "More..." "Done."

Which is easier to learn?
Part II

Setup & Nomenclature

3 The Full Notation (Read This First!)

Before anything else, let's nail down every symbol.

Core Variables

| Symbol | Name | What It Is |
|--------|------|------------|
| $x_0$ | Clean image | The original, noise-free image from our dataset |
| $x_t$ | Noisy image at step t | The image after adding noise for t steps |
| $x_T$ | Pure noise | After T steps, the image is essentially pure Gaussian noise |
| $T$ | Total timesteps | Number of noise steps (typically 1000) |
| $t$ | Current timestep | Which step we're at (0 = clean, T = noise) |
| $\epsilon$ | The noise | Gaussian noise $\sim \mathcal{N}(0, I)$ added to the image |

Noise Schedule Parameters

| Symbol | Name | What It Is |
|--------|------|------------|
| $\beta_t$ | Noise variance at step t | How much noise to add at step t (small, e.g., 0.0001 to 0.02) |
| $\alpha_t$ | Signal retention | $\alpha_t = 1 - \beta_t$ — how much signal survives step t |
| $\bar{\alpha}_t$ | Cumulative signal retention | $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ — total signal remaining after t steps |

Processes

| Symbol | Name | What It Is |
|--------|------|------------|
| $q(x_t \mid x_{t-1})$ | Forward process (one step) | Add noise: go from $x_{t-1}$ to $x_t$ |
| $q(x_t \mid x_0)$ | Forward process (direct) | Jump directly from clean $x_0$ to noisy $x_t$ |
| $p_\theta(x_{t-1} \mid x_t)$ | Reverse process | Remove noise: go from $x_t$ to $x_{t-1}$ (LEARNED) |
| $\epsilon_\theta(x_t, t)$ | Noise predictor | Neural network that predicts the noise in $x_t$ |

The Big Picture

FORWARD (fixed, not learned): add noise step by step

    x₀ ───→ x₁ ───→ x₂ ───→ ... ───→ x_T
    clean    bit     more            pure
    image   noisy   noisy           noise

REVERSE (learned): remove noise step by step

    x_T ───→ x_{T-1} ───→ ... ───→ x₁ ───→ x₀
    pure     slightly              almost  clean
    noise    less noisy            clean   image!

4 The Forward Process: Destroying an Image

The forward process is fixed (not learned). We just add noise.

One Step of Noise

To go from $x_{t-1}$ to $x_t$, we add a little Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} \, x_{t-1}, \beta_t I)$$

Let's unpack this:

  - The mean is $\sqrt{1-\beta_t} \, x_{t-1}$: the previous image, scaled down slightly.
  - The variance is $\beta_t I$: a small amount of fresh Gaussian noise.
  - Since $\beta_t$ is tiny, each step changes the image only a little.

In sampling form:

$$x_t = \sqrt{1-\beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_t$$

where $\epsilon_t \sim \mathcal{N}(0, I)$ is fresh noise at each step.

What This Looks Like

    t=0:    🐱 (clean cat)
    t=100:  🐱 + slight grain
    t=300:  blurry cat shape + noise
    t=500:  vague blob + lots of noise
    t=800:  mostly noise, hint of something
    t=1000: pure static (can't tell it was a cat)

The Magic: Jumping Directly to Any Step

We don't have to iterate through all steps. Using properties of Gaussians:

Define cumulative terms
$$\alpha_t = 1 - \beta_t$$ $$\bar{\alpha}_t = \alpha_1 \cdot \alpha_2 \cdot ... \cdot \alpha_t = \prod_{s=1}^{t} \alpha_s$$
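In code, the whole schedule is a few lines. A sketch assuming the linear β schedule from 0.0001 to 0.02 mentioned above (cosine schedules are also common):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # β_t: linear noise schedule
alphas = 1.0 - betas                        # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏ α_s

print(alpha_bars[0].item())    # ≈ 0.9999: almost all signal survives step 1
print(alpha_bars[-1].item())   # ≈ 4e-5: essentially no signal left at t = T
```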

Then we can go directly from $x_0$ to $x_t$:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1-\bar{\alpha}_t) I)$$

In sampling form:

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ is one single sample of noise.

Key Insight

This formula says: $x_t$ is a weighted combination of the original image $x_0$ and some noise $\epsilon$.

As $t$ increases, $\bar{\alpha}_t \rightarrow 0$, so $x_t \rightarrow$ pure noise.

Why Is This Useful?

For training, we need noisy images at random timesteps. Instead of iterating:

    Slow way: x₀ → x₁ → x₂ → ... → x₅₀₀ (500 iterations)
    Fast way: x₅₀₀ = √ᾱ₅₀₀ · x₀ + √(1-ᾱ₅₀₀) · ε (one step!)
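One way to convince yourself the direct formula is correct: iterate the one-step recursion symbolically on its coefficients and check that they match the direct formula's coefficients (a quick sketch):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)
alphas = 1.0 - betas

# Iterating x_t = √α_t · x_{t-1} + √β_t · ε_t, the signal coefficient
# multiplies by √α_t, and the noise variance v follows v ← α_t·v + β_t.
signal, var = 1.0, 0.0
for a, b in zip(alphas.tolist(), betas.tolist()):
    signal *= a ** 0.5
    var = a * var + b

alpha_bar_T = torch.prod(alphas).item()
print(abs(signal ** 2 - alpha_bar_T) < 1e-10)   # signal² matches ᾱ_T: True
print(abs(var - (1 - alpha_bar_T)) < 1e-10)     # variance matches 1 - ᾱ_T: True
```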

5 The Reverse Process: What We Want to Learn

The reverse process goes from noisy to clean. This is what we train.

The Goal

Learn a distribution:

$$p_\theta(x_{t-1} | x_t)$$

"Given a noisy image $x_t$, what's the slightly-less-noisy image $x_{t-1}$?"

Model It as Gaussian

We model this as a Gaussian (just like VAE's decoder):

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

The neural network outputs:

  - $\mu_\theta(x_t, t)$ — the mean of the slightly-less-noisy image
  - $\Sigma_\theta(x_t, t)$ — the variance (in practice often fixed to $\sigma_t^2 I$ rather than learned)

🤔 Question

"So the network directly predicts the mean $\mu$? The less-noisy image?"

✓ Resolution

It could! But there's a better way. Instead of predicting $\mu$ directly, we predict the noise $\epsilon$. This works better in practice.

Let's see why predicting noise is the key insight...
Part III

The Training Objective

6 What Should the Network Predict?

We have options for what the network outputs:

| Option | Network Predicts | Pros/Cons |
|--------|------------------|-----------|
| A | $\mu_\theta(x_t, t)$ — the mean directly | Intuitive but less stable |
| B | $x_0$ — the clean image | Works, but noisy for high t |
| C | $\epsilon$ — the noise that was added | Works best! (this is what we do) |

Let's understand why Option C is best.

7 Why Predict Noise? The Key Insight

Recall: How We Made $x_t$

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon$$

This is just a linear combination. We know $x_t$ and $\bar{\alpha}_t$. If we knew $\epsilon$, we could solve for $x_0$:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon}{\sqrt{\bar{\alpha}_t}}$$
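This inversion is easy to check numerically. A sketch with a random tensor standing in for an image:

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

x_0 = torch.rand(3, 8, 8, dtype=torch.float64)   # stand-in "image"
eps = torch.randn_like(x_0)                      # the noise that was added
ab = alpha_bars[499]                             # ᾱ_t at t = 500 (0-indexed)

x_t = ab.sqrt() * x_0 + (1 - ab).sqrt() * eps          # forward jump
x_0_rec = (x_t - (1 - ab).sqrt() * eps) / ab.sqrt()    # solve for x_0

print(torch.allclose(x_0_rec, x_0))   # True: knowing ε recovers x_0 exactly
```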

The Insight

If we can predict the noise $\epsilon$, we can recover the clean image $x_0$.

And from $x_0$, we can compute anything we need (like $\mu$ for the reverse step).

Why Is Predicting Noise Easier?

Predicting $x_0$

"Look at this noisy mess. What's the original cat?"

Hard when $t$ is large (pure noise).

Output space: all possible images.

Predicting $\epsilon$

"Look at this noisy mess. What noise was added?"

Always the same difficulty.

Output space: Gaussian noise (simpler).

Noise is "simple" — it's always Gaussian with known statistics. The network just needs to figure out which particular Gaussian sample was added.

What Is ε Exactly?

🤔 Question

"What is ε? Is it the noise at timestep 0? The total noise?"

✓ Resolution

$\epsilon$ is the single noise sample used to create $x_t$ from $x_0$ in one jump.

Remember the direct formula:

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon$$

Here, $\epsilon \sim \mathcal{N}(0, I)$ is one sample of standard Gaussian noise. It's the "total noise" that, properly scaled, takes us from $x_0$ to $x_t$.

Think of it this way:

    x₀  = clean image
    ε   = one random noise image (same size as x₀)
    x_t = (shrink x₀ a bit) + (scale up ε)
        = √ᾱₜ · x₀ + √(1-ᾱₜ) · ε

    At t=0: √ᾱ₀ ≈ 1, √(1-ᾱ₀) ≈ 0 → x₀ ≈ x₀ (clean)
    At t=T: √ᾱ_T ≈ 0, √(1-ᾱ_T) ≈ 1 → x_T ≈ ε (pure noise)

The network's job: given $x_t$ and $t$, figure out what $\epsilon$ was used.

8 The Loss Function Explained

The Simple Loss

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

Let's parse every piece:

Loss Function Breakdown

| Symbol | Meaning |
|--------|---------|
| $\mathbb{E}_{t, x_0, \epsilon}$ | Average over: random timesteps $t$, training images $x_0$, noise samples $\epsilon$ |
| $\epsilon$ | The actual noise we added to make $x_t$ |
| $\epsilon_\theta(x_t, t)$ | Network's prediction of what noise was added |
| $\lVert \cdot \rVert^2$ | MSE — squared difference |

In words: "Predict the noise that was added. Minimize prediction error."

The Training Procedure

Step 1: Sample a real image $x_0$ from dataset
Step 2: Sample a random timestep $t \sim \text{Uniform}(1, T)$
Step 3: Sample noise $\epsilon \sim \mathcal{N}(0, I)$
Step 4: Create noisy image: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$
Step 5: Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
Step 6: Compute loss: $\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$
Step 7: Backprop and update network

Why This Works

We train on all timesteps simultaneously: the same single network learns to denoise at every noise level, from nearly clean ($t$ small) to nearly pure noise ($t$ large).

Part IV

Making It Work

9 The Training Algorithm

```python
import torch
import torch.nn.functional as F

def train_step(model, x_0):
    # 1. Random timestep in 1..T
    t = torch.randint(1, T + 1, (1,)).item()
    # 2. Sample noise
    epsilon = torch.randn_like(x_0)
    # 3. Create noisy image (one-step formula)
    alpha_bar_t = alpha_bar[t]
    x_t = alpha_bar_t ** 0.5 * x_0 + (1 - alpha_bar_t) ** 0.5 * epsilon
    # 4. Predict the noise
    epsilon_pred = model(x_t, t)
    # 5. Simple MSE loss
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
```

That's it. No adversarial training. No KL divergence. Just "predict the noise."

10 The Sampling Algorithm

To generate a new image, we reverse the process:

```python
import torch

def sample(model):
    # Start with pure noise
    x = torch.randn(image_shape)  # x_T
    # Gradually denoise
    for t in reversed(range(1, T + 1)):
        # Predict noise at this step
        epsilon_pred = model(x, t)
        # Remove predicted noise (simplified formula):
        # x_{t-1} = (x_t - noise_term) / scale + optional_noise
        x = denoise_step(x, epsilon_pred, t)
    return x  # x_0 = clean image
```

The Denoising Step (Simplified)

The full formula for one reverse step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

where $z \sim \mathcal{N}(0, I)$ is fresh noise (adds stochasticity).

In words:

  1. Predict what noise is in $x_t$
  2. Subtract a scaled version of it
  3. Rescale
  4. Optionally add a bit of fresh noise (keeps the distribution correct)
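A minimal sketch of `denoise_step` from the sampler above, implementing this formula. Assumptions: precomputed schedule tensors at module level, and $\sigma_t^2 = \beta_t$, one common choice for the variance:

```python
import torch

# Assumed precomputed schedules (as in the training code)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def denoise_step(x, epsilon_pred, t):
    # One reverse step x_t -> x_{t-1}; schedules are 0-indexed, t runs 1..T
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    alpha_bar_t = alpha_bars[t - 1]
    # Subtract scaled predicted noise, then rescale
    mean = (x - beta_t / (1 - alpha_bar_t).sqrt() * epsilon_pred) / alpha_t.sqrt()
    if t > 1:
        sigma_t = beta_t.sqrt()   # σ_t² = β_t (a common choice)
        return mean + sigma_t * torch.randn_like(x)   # + σ_t·z, fresh noise
    return mean                   # final step: no fresh noise added
```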

The Generation Process Visualized

    t=1000: [static noise]
       ↓ denoise
    t=900:  [noise with vague shapes]
       ↓ denoise
    t=700:  [blurry something]
       ↓ denoise
    t=500:  [recognizable blob, maybe a face?]
       ↓ denoise
    t=300:  [clearly a face, details emerging]
       ↓ denoise
    t=100:  [face with features, slightly soft]
       ↓ denoise
    t=0:    [sharp, detailed face] ✓

11 Why Doesn't MSE Cause Blur Here?

🤔 Question

"Wait, we're using MSE loss. Didn't MSE cause VAE's blurriness? Why doesn't it here?"

Great question! The difference is what we're applying MSE to.

VAE's Problem

VAE predicts: THE FINAL IMAGE directly

    Input:  latent z
    Output: full image x̂
    Loss:   ||x - x̂||²

When uncertain about whisker position: "Could be at pixel 50 or 51..." MSE says: output 0.5 at both → BLUR

Diffusion's Solution

Diffusion predicts: THE NOISE

    Input:  noisy image x_t
    Output: predicted noise ε̂
    Loss:   ||ε - ε̂||²

What is noise? Gaussian. Smooth. No fine details. No "whisker position" to hedge about. Much easier target.

The Key Difference

| | VAE | Diffusion |
|---|-----|-----------|
| Predicts | Final image (complex) | Noise (simple Gaussian) |
| Uncertainty | About semantic content | About which noise sample |
| Hedging | Blurs meaningful details | Blurs noise (who cares?) |

Another Reason: Iterative Refinement

Even if one step is slightly wrong, we have 1000 steps to correct it:

    Step 500: Model slightly mispredicts noise
    Step 499: Corrects a bit
    Step 498: Corrects more
    ...
    Step 0:   Fine details emerge correctly

Errors don't compound — they get corrected over many steps.

Part V

Architecture & Extensions

12 The U-Net Architecture

The noise predictor $\epsilon_\theta(x_t, t)$ is typically a U-Net:

    Input: x_t (noisy image)
            ↓
    ┌─────────────────────────────┐
    │ Downsample blocks           │ (encode)
    │ 64 → 128 → 256 → 512        │
    └─────────────────────────────┘
            ↓
    ┌─────────────────────────────┐
    │ Middle block                │ (process)
    │ (self-attention here)       │
    └─────────────────────────────┘
            ↓
    ┌─────────────────────────────┐
    │ Upsample blocks             │ (decode)
    │ 512 → 256 → 128 → 64        │
    │ + skip connections from     │
    │   downsample blocks         │
    └─────────────────────────────┘
            ↓
    Output: ε̂ (predicted noise)

Key Additions for Diffusion

1. Time Embedding

The network needs to know which timestep $t$ we're at. We embed $t$ (like positional encoding in transformers):

```python
import math
import torch

def time_embedding(t, dim):
    # Sinusoidal embedding (like transformer positional encoding)
    freqs = torch.exp(-math.log(10000) * torch.arange(dim // 2) / dim)
    args = t * freqs
    return torch.cat([torch.cos(args), torch.sin(args)])
```

This embedding is added to each block so the network knows the noise level.

2. Attention Layers

Self-attention in the middle blocks helps with global coherence (e.g., both eyes should match).

13 Classifier-Free Guidance

How do we control what gets generated? (e.g., "generate a cat" vs "generate a dog")

Conditional Generation

Add a condition $c$ (text, class label, etc.) to the network:

$$\epsilon_\theta(x_t, t, c)$$

The Guidance Trick

During sampling, we blend conditional and unconditional predictions:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t, \emptyset) + s \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$$

where $s > 1$ is the guidance scale.

In words: "Push harder in the direction that matches the condition."

    s = 1:  Normal conditional generation
    s = 3:  Strongly follow the condition (more "cat-like")
    s = 10: Very strongly follow (might get artifacts)

Higher guidance = more adherence to prompt, but potentially lower diversity/quality.
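The blend itself is one line. A sketch where `model` is any noise predictor that accepts a condition (a hypothetical signature) and `None` stands for the empty condition $\emptyset$:

```python
import torch

def guided_noise(model, x_t, t, cond, s=3.0):
    """Classifier-free guidance: push the prediction toward the condition."""
    eps_uncond = model(x_t, t, None)   # ε_θ(x_t, t, ∅)
    eps_cond = model(x_t, t, cond)     # ε_θ(x_t, t, c)
    return eps_uncond + s * (eps_cond - eps_uncond)

# Toy check with a fake "model": conditional prediction 2x, unconditional x
fake = lambda x, t, c: x if c is None else 2 * x
x = torch.ones(2, 2)
print(guided_noise(fake, x, 0, "cat", s=3.0))   # x + 3·(2x − x) = 4x
```

At `s = 1` this reduces to the plain conditional prediction, matching the scale table above.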

14 Latent Diffusion (Stable Diffusion)

The Problem: Images Are Big

A 512×512 image has 262,144 pixels (786,432 values across 3 RGB channels). Running diffusion in pixel space is expensive.

The Solution: Diffuse in Latent Space

Standard Diffusion:

    noise ←→ diffusion ←→ image (512×512×3)
    (expensive: huge tensors)

Latent Diffusion:

    image → VAE encoder → latent (64×64×4)
                 ↓
           diffusion here! (cheap)
                 ↓
           VAE decoder → image

How It Works

  1. Train a VAE to compress images to small latents (and decompress)
  2. Train diffusion in the latent space (64×64 instead of 512×512)
  3. Generate: Diffuse in latent space → decode to image

Key Insight

Latent diffusion gets 64× speedup (going from 512×512 to 64×64 is 64× fewer pixels). The VAE handles low-level details; diffusion handles high-level structure.

This is what Stable Diffusion does.
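The bookkeeping behind that speedup, using the shapes quoted above (the 64× counts spatial positions; counting channel values too, the ratio is a bit lower):

```python
# Spatial positions: a 512×512 pixel grid vs a 64×64 latent grid
print((512 * 512) // (64 * 64))          # 64 → the "64× fewer" positions per step
# Total values including channels (3 RGB channels vs 4 latent channels)
print((512 * 512 * 3) / (64 * 64 * 4))   # 48.0
```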

15 Diffusion Transformers (DiT)

U-Net vs Transformer

U-Net was the original architecture. But transformers scale better.

U-Net:
  - Convolutional
  - Good for images
  - Limited scaling

DiT (Diffusion Transformer):
  - Patchify image (like ViT)
  - Process with transformer blocks
  - Scales to massive models

DiT Architecture

Input: noisy latent x_t

  1. Patchify: split into patches (like ViT): 64×64 → 256 patches of 4×4
  2. Linear embed each patch → tokens
  3. Add positional embedding
  4. Add time embedding (adaptive layer norm)
  5. Process with transformer blocks (self-attention + FFN)
  6. Unpatchify → predicted noise ε̂
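Steps 1 and 6 of that pipeline are just reshapes. A sketch with the shapes quoted above (64×64 latent, 4 channels, 4×4 patches; single example, batch dimension omitted):

```python
import torch

def patchify(x, p=4):
    """Split a (C, H, W) latent into (H/p · W/p, C·p·p) patch tokens (ViT-style)."""
    C, H, W = x.shape
    x = x.reshape(C, H // p, p, W // p, p)
    x = x.permute(1, 3, 0, 2, 4)                      # (H/p, W/p, C, p, p)
    return x.reshape((H // p) * (W // p), C * p * p)

def unpatchify(tokens, C=4, H=64, W=64, p=4):
    """Inverse of patchify: tokens back to a (C, H, W) latent."""
    x = tokens.reshape(H // p, W // p, C, p, p)
    return x.permute(2, 0, 3, 1, 4).reshape(C, H, W)

latent = torch.randn(4, 64, 64)
tokens = patchify(latent)
print(tokens.shape)                                # torch.Size([256, 64])
print(torch.allclose(unpatchify(tokens), latent))  # True: lossless round trip
```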

Why Transformers Win

DiT powers Sora, Stable Diffusion 3, and other cutting-edge models.

16 Final Comparison: VAE vs GAN vs Diffusion

| Aspect | VAE | GAN | Diffusion |
|--------|-----|-----|-----------|
| Training | Stable ✓ | Unstable ✗ | Stable ✓ |
| Output quality | Blurry ✗ | Sharp ✓ | Sharp ✓ |
| Mode coverage | Good ✓ | Collapse risk ✗ | Excellent ✓ |
| Likelihood | ELBO ✓ | None ✗ | Yes ✓ |
| Sampling speed | Fast ✓ | Fast ✓ | Slow ✗ (many steps) |
| Loss | MSE + KL | Adversarial | Simple MSE on noise |

The Diffusion Advantage

Diffusion gets the best of both worlds: stable training (like VAE) and sharp outputs (like GAN).

The Tradeoff

Diffusion is slow. Generating one image needs ~50-1000 forward passes.

Fixes:

  - Fewer-step samplers (e.g., DDIM): ~20-50 steps instead of 1000
  - Latent diffusion: each step runs on a much smaller tensor
  - Distillation: train a student to match many teacher steps in one

🎯 Interview Narrative

"VAE has blurry outputs because MSE on images encourages hedging. GAN fixes blur with adversarial training but is unstable and mode-collapses.

Diffusion takes a different approach: gradually add noise to images (forward process), then train a network to predict and remove that noise (reverse process). Each step is a small correction, much easier than generating an image in one shot.

The key insight: predict the noise ε, not the image directly. Noise is simple (Gaussian), so MSE works fine — no hedging about semantic content.

Training is simple: sample image, add noise at random timestep, predict the noise, MSE loss. No adversarial game.

Sampling is slow (many steps), but latent diffusion (Stable Diffusion) fixes this by working in compressed VAE latent space. Modern models use transformer backbones (DiT) instead of U-Net for better scaling."

TL;DR Formulas

| What | Formula |
|------|---------|
| Forward (direct) | $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ |
| Loss | $\mathcal{L} = \lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ |
| $\epsilon$ | Gaussian noise $\sim \mathcal{N}(0, I)$ used to create $x_t$ |
| $\bar{\alpha}_t$ | Cumulative signal retention $= \prod_{s=1}^t (1-\beta_s)$ |

You now understand VAE → GAN → Diffusion. The complete generative model journey.