Part 3 of 3 • Generative Models Series

Diffusion Models: The Complete Story

From Noise to Images — Sharp Like GANs, Stable Like VAEs

Contents

Part I

Why Diffusion?

1 VAE vs GAN: The Tradeoff We Want to Break

| Model | Training | Output Quality | The Problem |
|-------|----------|----------------|-------------|
| VAE | Stable ✓ | Blurry ✗ | MSE loss encourages hedging |
| GAN | Unstable ✗ | Sharp ✓ | Adversarial game is fragile |

Can we get both? Sharp outputs AND stable training?

2 The Diffusion Insight: Gradual Refinement

The Problem with One-Shot Generation

GAN:

    random noise z ──→ [Generator] ──→ full image

One network must learn the ENTIRE mapping from noise to image. Huge leap. Easy to mess up. Unstable.

The Diffusion Solution

Diffusion:

    pure noise ──→ slightly cleaner ──→ less noisy ──→ ... ──→ almost clean ──→ clean image

Many small steps. Each step is a tiny correction. Much easier to learn. Stable.
Key Insight

Instead of generating an image in one shot, gradually refine from noise. Each step only needs to make a small correction — much easier than generating everything at once.

An Analogy

Sculpting marble.

GAN approach: "Here's a block of marble. Carve a perfect statue. One shot. Go."

Diffusion approach: "Here's a rough shape. Make it slightly more statue-like." "Good. Now slightly more." "More..." "More..." "Done."

Which is easier to learn?
Part II

Setup & Nomenclature

3 The Full Notation (Read This First!)

Before anything else, let's nail down every symbol.

Core Variables

| Symbol | Name | What It Is |
|--------|------|------------|
| $x_0$ | Clean image | The original, noise-free image from our dataset |
| $x_t$ | Noisy image at step t | The image after adding noise for t steps |
| $x_T$ | Pure noise | After T steps, the image is essentially pure Gaussian noise |
| $T$ | Total timesteps | Number of noise steps (typically 1000) |
| $t$ | Current timestep | Which step we're at (0 = clean, T = noise) |
| $\epsilon$ | The noise | Gaussian noise $\sim \mathcal{N}(0, I)$ added to the image |

Noise Schedule Parameters

| Symbol | Name | What It Is |
|--------|------|------------|
| $\beta_t$ | Noise variance at step t | How much noise to add at step t (small, e.g., 0.0001 to 0.02) |
| $\alpha_t$ | Signal retention | $\alpha_t = 1 - \beta_t$ — how much signal survives step t |
| $\bar{\alpha}_t$ | Cumulative signal retention | $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ — total signal remaining after t steps |

Processes

| Symbol | Name | What It Is |
|--------|------|------------|
| $q(x_t \mid x_{t-1})$ | Forward process (one step) | Add noise: go from $x_{t-1}$ to $x_t$ |
| $q(x_t \mid x_0)$ | Forward process (direct) | Jump directly from clean $x_0$ to noisy $x_t$ |
| $p_\theta(x_{t-1} \mid x_t)$ | Reverse process | Remove noise: go from $x_t$ to $x_{t-1}$ (LEARNED) |
| $\epsilon_\theta(x_t, t)$ | Noise predictor | Neural network that predicts the noise in $x_t$ |

The Big Picture

FORWARD (fixed, not learned): add noise step by step

    x₀ ───→ x₁ ───→ x₂ ───→ ... ───→ x_T
    clean    bit     more            pure
    image   noisy   noisy           noise

REVERSE (learned): remove noise step by step

    x_T ───→ x_{T-1} ───→ ... ───→ x₁ ───→ x₀
    pure     slightly              almost  clean
    noise    less noisy            clean   image!

4 The Forward Process: Destroying an Image

The forward process is fixed (not learned). We just add noise.

One Step of Noise

To go from $x_{t-1}$ to $x_t$, we add a little Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} \, x_{t-1}, \beta_t I)$$

Let's unpack this:

  - The mean is $\sqrt{1-\beta_t} \, x_{t-1}$: the previous image, scaled down slightly.
  - The variance is $\beta_t I$: a small amount of fresh Gaussian noise.
  - Since $\beta_t$ is tiny, each step changes the image only a little.

In sampling form:

$$x_t = \sqrt{1-\beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_t$$

where $\epsilon_t \sim \mathcal{N}(0, I)$ is fresh noise at each step.

What This Looks Like

    t=0:    🐱 (clean cat)
    t=100:  🐱 + slight grain
    t=300:  blurry cat shape + noise
    t=500:  vague blob + lots of noise
    t=800:  mostly noise, hint of something
    t=1000: pure static (can't tell it was a cat)

The Magic: Jumping Directly to Any Step

We don't have to iterate through all steps. Using properties of Gaussians:

Define cumulative terms
$$\alpha_t = 1 - \beta_t$$ $$\bar{\alpha}_t = \alpha_1 \cdot \alpha_2 \cdot ... \cdot \alpha_t = \prod_{s=1}^{t} \alpha_s$$
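In code, the whole schedule is a few lines. A sketch assuming the linear β schedule from 0.0001 to 0.02 mentioned above (cosine schedules are also common):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # β_t: linear noise schedule
alphas = 1.0 - betas                        # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏ α_s

print(alpha_bars[0].item())    # ≈ 0.9999: almost all signal survives step 1
print(alpha_bars[-1].item())   # ≈ 4e-5: essentially no signal left at t = T
```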

Then we can go directly from $x_0$ to $x_t$:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1-\bar{\alpha}_t) I)$$

In sampling form:

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ is one single sample of noise.

Key Insight

This formula says: $x_t$ is a weighted combination of the original image $x_0$ and some noise $\epsilon$.

As $t$ increases, $\bar{\alpha}_t \rightarrow 0$, so $x_t \rightarrow$ pure noise.

Why Is This Useful?

For training, we need noisy images at random timesteps. Instead of iterating:

    Slow way: x₀ → x₁ → x₂ → ... → x₅₀₀ (500 iterations)
    Fast way: x₅₀₀ = √ᾱ₅₀₀ · x₀ + √(1-ᾱ₅₀₀) · ε (one step!)
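One way to convince yourself the direct formula is correct: iterate the one-step recursion symbolically on its coefficients and check that they match the direct formula's coefficients (a quick sketch):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)
alphas = 1.0 - betas

# Iterating x_t = √α_t · x_{t-1} + √β_t · ε_t, the signal coefficient
# multiplies by √α_t, and the noise variance v follows v ← α_t·v + β_t.
signal, var = 1.0, 0.0
for a, b in zip(alphas.tolist(), betas.tolist()):
    signal *= a ** 0.5
    var = a * var + b

alpha_bar_T = torch.prod(alphas).item()
print(abs(signal ** 2 - alpha_bar_T) < 1e-10)   # signal² matches ᾱ_T: True
print(abs(var - (1 - alpha_bar_T)) < 1e-10)     # variance matches 1 - ᾱ_T: True
```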

5 The Reverse Process: What We Want to Learn

The reverse process goes from noisy to clean. This is what we train.

The Goal

Learn a distribution:

$$p_\theta(x_{t-1} | x_t)$$

"Given a noisy image $x_t$, what's the slightly-less-noisy image $x_{t-1}$?"

Model It as Gaussian

We model this as a Gaussian (just like VAE's decoder):

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

The neural network outputs:

  - $\mu_\theta(x_t, t)$ — the mean of the slightly-less-noisy image
  - $\Sigma_\theta(x_t, t)$ — the variance (in practice often fixed to $\sigma_t^2 I$ rather than learned)

🤔 Question

"So the network directly predicts the mean $\mu$? The less-noisy image?"

✓ Resolution

It could! But there's a better way. Instead of predicting $\mu$ directly, we predict the noise $\epsilon$. This works better in practice.

Let's see why predicting noise is the key insight...
Part III

The Training Objective

6 What Should the Network Predict?

We have options for what the network outputs:

| Option | Network Predicts | Pros/Cons |
|--------|------------------|-----------|
| A | $\mu_\theta(x_t, t)$ — the mean directly | Intuitive but less stable |
| B | $x_0$ — the clean image | Works, but noisy for high t |
| C | $\epsilon$ — the noise that was added | Works best! (this is what we do) |

Let's understand why Option C is best.

7 Why Predict Noise? The Key Insight

Recall: How We Made $x_t$

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon$$

This is just a linear combination. We know $x_t$ and $\bar{\alpha}_t$. If we knew $\epsilon$, we could solve for $x_0$:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon}{\sqrt{\bar{\alpha}_t}}$$
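This inversion is easy to check numerically. A sketch with a random tensor standing in for an image:

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

x_0 = torch.rand(3, 8, 8, dtype=torch.float64)   # stand-in "image"
eps = torch.randn_like(x_0)                      # the noise that was added
ab = alpha_bars[499]                             # ᾱ_t at t = 500 (0-indexed)

x_t = ab.sqrt() * x_0 + (1 - ab).sqrt() * eps          # forward jump
x_0_rec = (x_t - (1 - ab).sqrt() * eps) / ab.sqrt()    # solve for x_0

print(torch.allclose(x_0_rec, x_0))   # True: knowing ε recovers x_0 exactly
```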

The Insight

If we can predict the noise $\epsilon$, we can recover the clean image $x_0$.

And from $x_0$, we can compute anything we need (like $\mu$ for the reverse step).

Why Is Predicting Noise Easier?

Predicting $x_0$

"Look at this noisy mess. What's the original cat?"

Hard when $t$ is large (pure noise).

Output space: all possible images.

Predicting $\epsilon$

"Look at this noisy mess. What noise was added?"

Always the same difficulty.

Output space: Gaussian noise (simpler).

Noise is "simple" — it's always Gaussian with known statistics. The network just needs to figure out which particular Gaussian sample was added.

What Is ε Exactly?

🤔 Question

"What is ε? Is it the noise at timestep 0? The total noise?"

✓ Resolution

$\epsilon$ is the single noise sample used to create $x_t$ from $x_0$ in one jump.

Remember the direct formula:

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon$$

Here, $\epsilon \sim \mathcal{N}(0, I)$ is one sample of standard Gaussian noise. It's the "total noise" that, properly scaled, takes us from $x_0$ to $x_t$.

Think of it this way:

    x₀  = clean image
    ε   = one random noise image (same size as x₀)
    x_t = (shrink x₀ a bit) + (scale up ε)
        = √ᾱₜ · x₀ + √(1-ᾱₜ) · ε

    At t=0: √ᾱ₀ ≈ 1, √(1-ᾱ₀) ≈ 0 → x₀ ≈ x₀ (clean)
    At t=T: √ᾱ_T ≈ 0, √(1-ᾱ_T) ≈ 1 → x_T ≈ ε (pure noise)

The network's job: given $x_t$ and $t$, figure out what $\epsilon$ was used.

8 The Loss Function Explained

The Simple Loss

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

Let's parse every piece:

Loss Function Breakdown

| Symbol | Meaning |
|--------|---------|
| $\mathbb{E}_{t, x_0, \epsilon}$ | Average over: random timesteps $t$, training images $x_0$, noise samples $\epsilon$ |
| $\epsilon$ | The actual noise we added to make $x_t$ |
| $\epsilon_\theta(x_t, t)$ | Network's prediction of what noise was added |
| $\lVert \cdot \rVert^2$ | MSE — squared difference |

In words: "Predict the noise that was added. Minimize prediction error."

The Training Procedure

Step 1: Sample a real image $x_0$ from dataset
Step 2: Sample a random timestep $t \sim \text{Uniform}(1, T)$
Step 3: Sample noise $\epsilon \sim \mathcal{N}(0, I)$
Step 4: Create noisy image: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$
Step 5: Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
Step 6: Compute loss: $\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$
Step 7: Backprop and update network

Why This Works

We train on all timesteps simultaneously: the same single network learns to denoise at every noise level, from nearly clean ($t$ small) to nearly pure noise ($t$ large).

Part IV

Making It Work

9 The Training Algorithm

```python
import torch
import torch.nn.functional as F

def train_step(model, x_0):
    # 1. Random timestep in 1..T
    t = torch.randint(1, T + 1, (1,)).item()
    # 2. Sample noise
    epsilon = torch.randn_like(x_0)
    # 3. Create noisy image (one-step formula)
    alpha_bar_t = alpha_bar[t]
    x_t = alpha_bar_t ** 0.5 * x_0 + (1 - alpha_bar_t) ** 0.5 * epsilon
    # 4. Predict the noise
    epsilon_pred = model(x_t, t)
    # 5. Simple MSE loss
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
```

That's it. No adversarial training. No KL divergence. Just "predict the noise."

10 The Sampling Algorithm

To generate a new image, we reverse the process:

```python
import torch

def sample(model):
    # Start with pure noise
    x = torch.randn(image_shape)  # x_T
    # Gradually denoise
    for t in reversed(range(1, T + 1)):
        # Predict noise at this step
        epsilon_pred = model(x, t)
        # Remove predicted noise (simplified formula):
        # x_{t-1} = (x_t - noise_term) / scale + optional_noise
        x = denoise_step(x, epsilon_pred, t)
    return x  # x_0 = clean image
```

The Denoising Step (Simplified)

The full formula for one reverse step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

where $z \sim \mathcal{N}(0, I)$ is fresh noise (adds stochasticity).

In words:

  1. Predict what noise is in $x_t$
  2. Subtract a scaled version of it
  3. Rescale
  4. Optionally add a bit of fresh noise (keeps the distribution correct)
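A minimal sketch of `denoise_step` from the sampler above, implementing this formula. Assumptions: precomputed schedule tensors at module level, and $\sigma_t^2 = \beta_t$, one common choice for the variance:

```python
import torch

# Assumed precomputed schedules (as in the training code)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def denoise_step(x, epsilon_pred, t):
    # One reverse step x_t -> x_{t-1}; schedules are 0-indexed, t runs 1..T
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    alpha_bar_t = alpha_bars[t - 1]
    # Subtract scaled predicted noise, then rescale
    mean = (x - beta_t / (1 - alpha_bar_t).sqrt() * epsilon_pred) / alpha_t.sqrt()
    if t > 1:
        sigma_t = beta_t.sqrt()   # σ_t² = β_t (a common choice)
        return mean + sigma_t * torch.randn_like(x)   # + σ_t·z, fresh noise
    return mean                   # final step: no fresh noise added
```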

The Generation Process Visualized

    t=1000: [static noise]
       ↓ denoise
    t=900:  [noise with vague shapes]
       ↓ denoise
    t=700:  [blurry something]
       ↓ denoise
    t=500:  [recognizable blob, maybe a face?]
       ↓ denoise
    t=300:  [clearly a face, details emerging]
       ↓ denoise
    t=100:  [face with features, slightly soft]
       ↓ denoise
    t=0:    [sharp, detailed face] ✓

11 Why Doesn't MSE Cause Blur Here?

🤔 Question

"Wait, we're using MSE loss. Didn't MSE cause VAE's blurriness? Why doesn't it here?"

Great question! The difference is what we're applying MSE to.

VAE's Problem

VAE predicts: THE FINAL IMAGE directly

    Input:  latent z
    Output: full image x̂
    Loss:   ||x - x̂||²

When uncertain about whisker position: "Could be at pixel 50 or 51..." MSE says: output 0.5 at both → BLUR

Diffusion's Solution

Diffusion predicts: THE NOISE

    Input:  noisy image x_t
    Output: predicted noise ε̂
    Loss:   ||ε - ε̂||²

What is noise? Gaussian. Smooth. No fine details. No "whisker position" to hedge about. Much easier target.

The Key Difference

| | VAE | Diffusion |
|---|-----|-----------|
| Predicts | Final image (complex) | Noise (simple Gaussian) |
| Uncertainty | About semantic content | About which noise sample |
| Hedging | Blurs meaningful details | Blurs noise (who cares?) |

Another Reason: Iterative Refinement

Even if one step is slightly wrong, we have 1000 steps to correct it:

    Step 500: Model slightly mispredicts noise
    Step 499: Corrects a bit
    Step 498: Corrects more
    ...
    Step 0:   Fine details emerge correctly

Errors don't compound — they get corrected over many steps.

Part V

Architecture & Extensions

12 The U-Net Architecture

The noise predictor $\epsilon_\theta(x_t, t)$ is typically a U-Net:

    Input: x_t (noisy image)
            ↓
    ┌─────────────────────────────┐
    │ Downsample blocks           │ (encode)
    │ 64 → 128 → 256 → 512        │
    └─────────────────────────────┘
            ↓
    ┌─────────────────────────────┐
    │ Middle block                │ (process)
    │ (self-attention here)       │
    └─────────────────────────────┘
            ↓
    ┌─────────────────────────────┐
    │ Upsample blocks             │ (decode)
    │ 512 → 256 → 128 → 64        │
    │ + skip connections from     │
    │   downsample blocks         │
    └─────────────────────────────┘
            ↓
    Output: ε̂ (predicted noise)

Key Additions for Diffusion

1. Time Embedding

The network needs to know which timestep $t$ we're at. We embed $t$ (like positional encoding in transformers):

```python
import math
import torch

def time_embedding(t, dim):
    # Sinusoidal embedding (like transformer positional encoding)
    freqs = torch.exp(-math.log(10000) * torch.arange(dim // 2) / dim)
    args = t * freqs
    return torch.cat([torch.cos(args), torch.sin(args)])
```

This embedding is added to each block so the network knows the noise level.

2. Attention Layers

Self-attention in the middle blocks helps with global coherence (e.g., both eyes should match).

13 Classifier-Free Guidance

How do we control what gets generated? (e.g., "generate a cat" vs "generate a dog")

Conditional Generation

Add a condition $c$ (text, class label, etc.) to the network:

$$\epsilon_\theta(x_t, t, c)$$

The Guidance Trick

During sampling, we blend conditional and unconditional predictions:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t, \emptyset) + s \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$$

where $s > 1$ is the guidance scale.

In words: "Push harder in the direction that matches the condition."

    s = 1:  Normal conditional generation
    s = 3:  Strongly follow the condition (more "cat-like")
    s = 10: Very strongly follow (might get artifacts)

Higher guidance = more adherence to prompt, but potentially lower diversity/quality.
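The blend itself is one line. A sketch where `model` is any noise predictor that accepts a condition (a hypothetical signature) and `None` stands for the empty condition $\emptyset$:

```python
import torch

def guided_noise(model, x_t, t, cond, s=3.0):
    """Classifier-free guidance: push the prediction toward the condition."""
    eps_uncond = model(x_t, t, None)   # ε_θ(x_t, t, ∅)
    eps_cond = model(x_t, t, cond)     # ε_θ(x_t, t, c)
    return eps_uncond + s * (eps_cond - eps_uncond)

# Toy check with a fake "model": conditional prediction 2x, unconditional x
fake = lambda x, t, c: x if c is None else 2 * x
x = torch.ones(2, 2)
print(guided_noise(fake, x, 0, "cat", s=3.0))   # x + 3·(2x − x) = 4x
```

At `s = 1` this reduces to the plain conditional prediction, matching the scale table above.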

14 Latent Diffusion (Stable Diffusion)

The Problem: Images Are Big

A 512×512 image has 262,144 pixels (786,432 values across 3 RGB channels). Running diffusion in pixel space is expensive.

The Solution: Diffuse in Latent Space

Standard Diffusion:

    noise ←→ diffusion ←→ image (512×512×3)
    (expensive: huge tensors)

Latent Diffusion:

    image → VAE encoder → latent (64×64×4)
                 ↓
           diffusion here! (cheap)
                 ↓
           VAE decoder → image

How It Works

  1. Train a VAE to compress images to small latents (and decompress)
  2. Train diffusion in the latent space (64×64 instead of 512×512)
  3. Generate: Diffuse in latent space → decode to image

Key Insight

Latent diffusion gets 64× speedup (going from 512×512 to 64×64 is 64× fewer pixels). The VAE handles low-level details; diffusion handles high-level structure.

This is what Stable Diffusion does.
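The bookkeeping behind that speedup, using the shapes quoted above (the 64× counts spatial positions; counting channel values too, the ratio is a bit lower):

```python
# Spatial positions: a 512×512 pixel grid vs a 64×64 latent grid
print((512 * 512) // (64 * 64))          # 64 → the "64× fewer" positions per step
# Total values including channels (3 RGB channels vs 4 latent channels)
print((512 * 512 * 3) / (64 * 64 * 4))   # 48.0
```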

15 Diffusion Transformers (DiT)

U-Net vs Transformer

U-Net was the original architecture. But transformers scale better.

U-Net:
  - Convolutional
  - Good for images
  - Limited scaling

DiT (Diffusion Transformer):
  - Patchify image (like ViT)
  - Process with transformer blocks
  - Scales to massive models

DiT Architecture

Input: noisy latent x_t

  1. Patchify: split into patches (like ViT): 64×64 → 256 patches of 4×4
  2. Linear embed each patch → tokens
  3. Add positional embedding
  4. Add time embedding (adaptive layer norm)
  5. Process with transformer blocks (self-attention + FFN)
  6. Unpatchify → predicted noise ε̂
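Steps 1 and 6 of that pipeline are just reshapes. A sketch with the shapes quoted above (64×64 latent, 4 channels, 4×4 patches; single example, batch dimension omitted):

```python
import torch

def patchify(x, p=4):
    """Split a (C, H, W) latent into (H/p · W/p, C·p·p) patch tokens (ViT-style)."""
    C, H, W = x.shape
    x = x.reshape(C, H // p, p, W // p, p)
    x = x.permute(1, 3, 0, 2, 4)                      # (H/p, W/p, C, p, p)
    return x.reshape((H // p) * (W // p), C * p * p)

def unpatchify(tokens, C=4, H=64, W=64, p=4):
    """Inverse of patchify: tokens back to a (C, H, W) latent."""
    x = tokens.reshape(H // p, W // p, C, p, p)
    return x.permute(2, 0, 3, 1, 4).reshape(C, H, W)

latent = torch.randn(4, 64, 64)
tokens = patchify(latent)
print(tokens.shape)                                # torch.Size([256, 64])
print(torch.allclose(unpatchify(tokens), latent))  # True: lossless round trip
```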

Why Transformers Win

DiT powers Sora, Stable Diffusion 3, and other cutting-edge models.

16 Final Comparison: VAE vs GAN vs Diffusion

| Aspect | VAE | GAN | Diffusion |
|--------|-----|-----|-----------|
| Training | Stable ✓ | Unstable ✗ | Stable ✓ |
| Output quality | Blurry ✗ | Sharp ✓ | Sharp ✓ |
| Mode coverage | Good ✓ | Collapse risk ✗ | Excellent ✓ |
| Likelihood | ELBO ✓ | None ✗ | Yes ✓ |
| Sampling speed | Fast ✓ | Fast ✓ | Slow ✗ (many steps) |
| Loss | MSE + KL | Adversarial | Simple MSE on noise |

The Diffusion Advantage

Diffusion gets the best of both worlds: stable training (like VAE) and sharp outputs (like GAN).

The Tradeoff

Diffusion is slow. Generating one image needs ~50-1000 forward passes.

Fixes:

  - Fewer-step samplers (e.g., DDIM): ~20-50 steps instead of 1000
  - Latent diffusion: each step runs on a much smaller tensor
  - Distillation: train a student to match many teacher steps in one

🎯 Interview Narrative

"VAE has blurry outputs because MSE on images encourages hedging. GAN fixes blur with adversarial training but is unstable and mode-collapses.

Diffusion takes a different approach: gradually add noise to images (forward process), then train a network to predict and remove that noise (reverse process). Each step is a small correction, much easier than generating an image in one shot.

The key insight: predict the noise ε, not the image directly. Noise is simple (Gaussian), so MSE works fine — no hedging about semantic content.

Training is simple: sample image, add noise at random timestep, predict the noise, MSE loss. No adversarial game.

Sampling is slow (many steps), but latent diffusion (Stable Diffusion) fixes this by working in compressed VAE latent space. Modern models use transformer backbones (DiT) instead of U-Net for better scaling."

TL;DR Formulas

| What | Formula |
|------|---------|
| Forward (direct) | $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ |
| Loss | $\mathcal{L} = \lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ |
| $\epsilon$ | Gaussian noise $\sim \mathcal{N}(0, I)$ used to create $x_t$ |
| $\bar{\alpha}_t$ | Cumulative signal retention $= \prod_{s=1}^t (1-\beta_s)$ |

You now understand VAE → GAN → Diffusion. The complete generative model journey.