From Autoencoders to ELBO — Including Every "Wait, But Why?" Along the Way
We have a dataset of images (faces, cats, molecules). We want to learn a model of that data good enough to generate new, realistic examples.
$p(x)$ is the probability distribution over data. It answers: "How likely is this particular image?"
$p(x)$ is the "true" distribution that generated our training data. We never know it exactly — we only have samples from it (our dataset).
"Why does maximizing p(x) help anyone? That's just the probability of the original data, right?"
We're not computing p(x) for existing data — we're learning a model that assigns high probability to realistic images.
Think of it this way:
If we can model $p(x)$ well, we can sample brand-new realistic images from it and score how plausible any given image is.
"Maximize p(x)" = "Train a model that thinks our training data is likely." If the model thinks training images are likely, it has learned what realistic images look like.
Before VAE, let's understand regular autoencoders.
| Symbol | Name | What It Is |
|---|---|---|
| $x$ | Input | Original data (e.g., a 784-dim image) |
| $z$ | Latent code | Compressed representation (e.g., 32-dim) |
| $\hat{x}$ | Reconstruction | Decoder's attempt to recreate x |
| Encoder $f_\phi$ | $x \rightarrow z$ | Neural net that compresses |
| Decoder $g_\theta$ | $z \rightarrow \hat{x}$ | Neural net that reconstructs |
Simple: minimize reconstruction error. The bottleneck forces compression — the encoder must keep only essential features.
The encoder maps each input to an arbitrary point in latent space. There's no structure: points are scattered wherever makes reconstruction easiest.
To generate new images, you'd have to pick some z yourself and run it through the decoder. But if you pick z = [10, 10, 10], the decoder has never seen this region. It outputs garbage.
Instead of encoding to a point, encode to a distribution.
The encoder outputs two things: μ and σ. Then we sample z from $\mathcal{N}(\mu, \sigma^2)$.
VAE uses probabilistic language. Here's what each term means:
| Symbol | Name | What It Is | In VAE |
|---|---|---|---|
| $x$ | Data | Observable (the image) | Input image |
| $z$ | Latent | Hidden variable | Compressed code |
| $p(z)$ | Prior | Our belief about z before seeing any data | $\mathcal{N}(0, I)$ — standard Gaussian |
| $p(x|z)$ | Likelihood | Probability of data given latent | Decoder: given z, how likely is x? |
| $p(z|x)$ | True Posterior | Belief about z after seeing data x | Intractable (can't compute) |
| $q(z|x)$ | Approximate Posterior | Our approximation to the true posterior | Encoder: $\mathcal{N}(\mu(x), \sigma(x)^2)$ |
This is a choice. We pick $\mathcal{N}(0, I)$ because it's easy to sample from and its KL divergence against a Gaussian encoder has a closed form.
The prior $p(z) = \mathcal{N}(0, I)$ is what we'll sample from during generation. The KL term forces the encoder's output (approximate posterior) to stay close to this prior.
We want our model to assign high probability to training data:

$$\max_\theta \sum_i \log p_\theta(x_i)$$
Why log? Logs turn products into sums (easier math), and maximizing log(p) is the same as maximizing p.
To compute $p(x)$, we'd need to integrate over all possible latents:

$$p(x) = \int p(x|z)\, p(z)\, dz$$
This says: "Sum up the probability of x for every possible z, weighted by how likely each z is."
This integral is impossible to compute — z is continuous and high-dimensional.
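To see what this integral is asking for, here is a minimal numpy sketch of the naive Monte Carlo estimator in a toy 1-D model where the answer is known in closed form (the model and all names here are my own, purely for illustration). In 1-D it works; with a high-dimensional z, almost every sample contributes essentially zero and the estimator becomes hopeless.

```python
import numpy as np

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 0.5^2).
# Then p(x) = N(0, 1 + 0.25) analytically, so we can check the estimate.
rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.7
z = rng.standard_normal(100_000)             # z ~ p(z)
p_x_mc = gauss_pdf(x, z, 0.5).mean()         # (1/N) sum_i p(x|z_i)
p_x_true = gauss_pdf(x, 0.0, np.sqrt(1.25))  # closed form for comparison

print(p_x_mc, p_x_true)  # close in 1-D; useless once z is high-dimensional
```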
We can't compute $\log p(x)$, but we can find something smaller that we can compute, and maximize that instead.
Evidence Lower Bound:

$$\log p(x) \;\geq\; \text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x)\,\|\,p(z)\big)$$

Since ELBO ≤ log p(x), maximizing the ELBO pushes up on log p(x).
This is where signs get confusing. Let's be very explicit.
We want to maximize this. But deep learning frameworks minimize losses, so we flip the sign:

$$\mathcal{L} = -\text{ELBO} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}\big(q(z|x)\,\|\,p(z)\big)$$

Now we minimize $\mathcal{L}$.
Starting with the first term: $-\mathbb{E}_{q(z|x)}[\log p(x|z)]$

If we model the decoder as Gaussian (details in the next chapter), then $\log p(x|z) = -\tfrac{1}{2}\|x - \hat{x}\|^2 + \text{const}$.

So:

$$-\mathbb{E}_{q(z|x)}[\log p(x|z)] = \tfrac{1}{2}\,\mathbb{E}_{q(z|x)}\big[\|x - \hat{x}\|^2\big] + \text{const}$$

This is the reconstruction (MSE) term. Minimize this.
| Expression | What to Do |
|---|---|
| log p(x) | Maximize |
| ELBO = E[log p(x|z)] - KL | Maximize (it's a lower bound) |
| -ELBO = -E[log p(x|z)] + KL | Minimize (flip the sign for a loss) |
| Loss = MSE + KL | Minimize (both terms positive, both push down) |
The ELBO has a minus sign in front of KL. When we negate to make a loss, that minus becomes plus. So in the final loss, both MSE and KL are added (both should be small).
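Putting the two terms together, here is a minimal numpy sketch of the final loss for a single example, assuming the standard closed-form Gaussian KL (derived later in this section); the function name and `logvar` parameterization are my own conventions, not a fixed API.

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO for one example: reconstruction MSE + KL(q || N(0, I)).

    mu, logvar parameterize q(z|x) = N(mu, exp(logvar));
    predicting log-variance keeps sigma^2 positive by construction.
    """
    mse = np.sum((x - x_hat) ** 2)
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over dims:
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
    return mse + kl

# Sanity check: perfect reconstruction + q equal to the prior gives loss 0.
x = np.array([0.2, 0.8])
print(vae_loss(x, x, mu=np.zeros(3), logvar=np.zeros(3)))  # 0.0
```

Both terms are non-negative, so the loss can only reach zero when reconstruction is perfect *and* the encoder matches the prior, which is exactly the tension the rest of this section explores.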
Let's decode this notation piece by piece.
The subscript tells you where the samples come from:

$$\mathbb{E}_{q(z|x)}[f(z)] = \int q(z|x)\, f(z)\, dz \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f(z_i), \quad z_i \sim q(z|x)$$

In words: sample z from $q(z|x)$, compute $f(z)$, average over many samples.
"So $\mathbb{E}_{q(z|x)}[\log p(x|z)]$ means we average over z obtained from q... but by using an x? Then aren't we also averaging over x indirectly?"
No — x is fixed here. It's one specific image.
To average over x too, you'd need an outer expectation:

$$\mathbb{E}_{x \sim \text{data}}\Big[\mathbb{E}_{q(z|x)}[\log p(x|z)]\Big]$$
The outer loop is your dataloader; the inner is the z sampling.
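That two-loop structure can be written out directly. The sketch below uses a toy hand-made "encoder" and "decoder" (both my own placeholders, not real networks) just to make the nesting concrete: outer loop over x, inner Monte Carlo loop over z.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = [np.array([0.1, 0.9]), np.array([0.5, 0.5])]  # stand-in dataloader

def encoder(x):
    # Toy placeholder for q(z|x): returns (mu, sigma)
    return x.mean() * np.ones(2), np.ones(2)

def log_p_x_given_z(x, z):
    # Toy Gaussian decoder with x_hat = z, sigma = 1, constants dropped
    return -0.5 * np.sum((x - z) ** 2)

N = 1000
outer = 0.0
for x in dataset:                        # outer expectation: x ~ data
    mu, sigma = encoder(x)
    inner = 0.0
    for _ in range(N):                   # inner expectation: z ~ q(z|x)
        z = mu + sigma * rng.standard_normal(2)
        inner += log_p_x_given_z(x, z)
    outer += inner / N
outer /= len(dataset)
print(outer)  # Monte Carlo estimate of E_x[ E_{q(z|x)}[log p(x|z)] ]
```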
Step 1: Model the decoder as a Gaussian

$$p(x|z) = \mathcal{N}\big(x;\ \hat{x},\ \sigma^2 I\big)$$

where $\hat{x} = \text{decoder}(z)$.

Step 2: Write out the log probability of a Gaussian

$$\log p(x|z) = -\frac{\|x - \hat{x}\|^2}{2\sigma^2} - \frac{d}{2}\log(2\pi\sigma^2)$$

Step 3: With σ fixed (usually σ = 1), drop the constants

$$\log p(x|z) = -\tfrac{1}{2}\|x - \hat{x}\|^2 + \text{const}$$

Step 4: So maximizing this log-likelihood = minimizing MSE
We need to sample z from $q(z|x) = \mathcal{N}(\mu, \sigma^2)$, but sampling is random — how do we backpropagate?
The randomness blocks gradients.
Rewrite sampling as a deterministic function plus external noise:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
"How is this sampling from a Gaussian?"
This is sampling from a Gaussian. It's mathematically equivalent: if $\epsilon \sim \mathcal{N}(0, 1)$, then $\mu + \sigma\epsilon \sim \mathcal{N}(\mu, \sigma^2)$.
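A quick empirical sanity check of that equivalence (a throwaway numpy sketch, values chosen arbitrarily): shift and scale standard-normal noise, and the result has exactly the target mean and standard deviation, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

eps = rng.standard_normal(200_000)  # external noise: no learnable parameters here
z = mu + sigma * eps                # deterministic transform, so gradients
                                    # can flow through mu and sigma

print(z.mean(), z.std())  # ~2.0, ~0.5: same distribution as N(mu, sigma^2)
```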
"But how do you sample from a Gaussian? If it's N(0,1), doesn't that just mean any value between -1 and 1 would suffice? Even a uniform dist can do that."
Gaussian ≠ uniform. They have completely different shapes.
Libraries use algorithms like Box-Muller to convert uniform → Gaussian: draw $u_1, u_2 \sim \text{Uniform}(0, 1)$ and set $z = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2)$; then $z \sim \mathcal{N}(0, 1)$.
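Here is a minimal Box-Muller implementation in numpy (real libraries use faster variants such as the ziggurat algorithm, so treat this as an illustration, not what `np.random` actually runs):

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n):
    """Turn pairs of Uniform(0,1) draws into standard Gaussian draws."""
    u1 = 1.0 - rng.random(n)       # shift to (0, 1] so log(u1) is finite
    u2 = rng.random(n)
    r = np.sqrt(-2.0 * np.log(u1)) # radius from one uniform
    theta = 2.0 * np.pi * u2       # angle from the other
    return r * np.cos(theta)       # (r * sin(theta) is a second independent draw)

z = box_muller(200_000)
print(z.mean(), z.std())  # ~0.0, ~1.0: Gaussian-shaped, not uniform
```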
"What does it mean to minimize KL divergence? Like if we reduce MSE it means we're reducing distance between things... so for KL?"
KL divergence measures how different two probability distributions are:

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z)}{p(z)}\right]$$

In words: "Sample from q, ask how much more likely this sample is under q than under p, and average."
| Condition | KL Value |
|---|---|
| q = p (identical) | KL = 0 |
| q puts mass where p doesn't | KL → ∞ |
| q ≈ p (similar) | KL > 0 (small) |
Minimizing KL = "Make the encoder's output distribution q(z|x) look like the prior p(z) = N(0, I)."
In VAE:
This pushes the encoder to output μ ≈ 0 and σ ≈ 1.
"I didn't get why it's pushed close to 0, 1...?"
For two Gaussians, KL has a closed form. With $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$, per dimension:

$$D_{KL}(q \,\|\, p) = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$

Let's see when this equals zero:
| Condition | Which Term | Effect |
|---|---|---|
| μ ≠ 0 | μ² grows | Penalty → pushes μ toward 0 |
| σ > 1 | σ² grows | Penalty → pushes σ down |
| σ < 1 | -log(σ²) grows | Penalty → pushes σ up |
| μ = 0, σ = 1 | All cancel | KL = 0 |
The formula literally contains μ². That's a direct quadratic penalty on μ. The encoder learns to output small μ because large μ = large loss.
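The closed form is one line of numpy, and plugging in a few values makes the table concrete (the μ = 9 case echoes the "we're screwed" worry addressed later in this section):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), per dimension."""
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

print(kl_to_standard_normal(0.0, 1.0))  # 0.0: exactly matches the prior
print(kl_to_standard_normal(9.0, 1.0))  # 40.5: large mu, huge quadratic penalty
print(kl_to_standard_normal(0.0, 0.1))  # ~1.81: sigma too small is penalized too
```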
"Why this asymmetric way though... MSE is symmetric. Why is KL made like this?"
The two directions of KL mean different things:

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z)}{p(z)}\right] \qquad\qquad D_{KL}(p \,\|\, q) = \mathbb{E}_{z \sim p}\left[\log \frac{p(z)}{q(z)}\right]$$

Notice: in the first direction, we sample from q.
| Direction | Penalizes | Ignores |
|---|---|---|
| $D_{KL}(q \| p)$ | q has mass where p doesn't | p has mass where q doesn't |
| $D_{KL}(p \| q)$ | p has mass where q doesn't | q has mass where p doesn't |
"But why wouldn't you want both... penalty for when q doesn't overlap with p?"
In VAE, we want the asymmetry.
Think about it:
Should the encoder for a single cat image cover the entire prior? No! That would be useless — one image mapping to everywhere.
We want q to be smaller than p. We just don't want q to go outside p (where we'd never sample during generation).
$D_{KL}(q \| p)$ enforces exactly this: "Don't put mass outside the prior."
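The asymmetry shows up numerically. Using the standard closed form for KL between two 1-D Gaussians (the function below is my own helper), a q that is narrower than the prior is cheap under $D_{KL}(q \| p)$ but expensive under the reverse direction, which would demand that q cover all of p:

```python
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for 1-D Gaussians, closed form."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2 * s_p**2) - 0.5

# q = N(0, 0.3^2): concentrated well inside the prior p = N(0, 1)
print(kl_gauss(0.0, 0.3, 0.0, 1.0))  # ~0.75: KL(q||p) barely objects
print(kl_gauss(0.0, 1.0, 0.0, 0.3))  # ~3.85: KL(p||q) punishes q for not covering p
```

This is exactly the behavior the VAE wants: each image's q may stay small, as long as it stays inside the prior.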
"Ok so all we did is the same autoencoder but constrained the latent space... but how is this creating organization? People talk about VAEs having overlapping clusters of classes, but why would same-class images be pushed to the same cluster?"
The "beautiful organized clusters" thing is somewhat oversold.
Only this: Everything lives in a compact region near $\mathcal{N}(0, I)$, and the decoder has seen samples from that whole region.
This comes from reconstruction loss, not VAE magic.
Images that need similar features to reconstruct will share similar latents. Two cat images both need "pointy ears", "whiskers", etc. Similar inputs → similar features → similar z.
This would happen in a regular autoencoder too!
Reconstruction creates organization. KL just compresses it into a compact, sampleable region. The real VAE benefit is interpolation — no dead zones.
"What's the guarantee VAE samples sensible μ and σ? If some x produces μ=9, σ=1000, we're screwed."
If the encoder outputs μ=9, σ=1000, that gets penalized by KL. It's not a failure case we handle — it's a case that doesn't happen because the loss prevents it.
"What happens if I focus a lot on KL and ignore reconstruction?"
The encoder thinks: "I get punished for deviating from N(0,I). Why encode anything?"
Result: For ALL inputs, output μ=0, σ=1. The latent carries zero information.
This is called posterior collapse.

β-VAE makes this trade-off explicit by weighting the KL term: $\mathcal{L} = \text{MSE} + \beta \cdot D_{KL}$.
| β Value | What You Get |
|---|---|
| β ≪ 1 | Almost an autoencoder: good reconstruction, messy latent |
| β = 1 | Standard VAE |
| β ≫ 1 | Risk of collapse, blurry outputs |
β-VAE claimed high β leads to disentangled representations (each dimension = one factor).
"Why would high β force disentanglement? I'd just output 0,1 for everything."
High β alone doesn't force disentanglement. It forces collapse.
"And why would independent factors be 'efficient'? If I entangle 10 factors I can compress to 2 dimensions. Independence requires 10 dimensions."
Entanglement is MORE efficient for compression, not less.
β-VAE disentanglement is shaky:
"How is expectation over 100 samples the same as one sample? That erases the expectation."
It doesn't erase it — it approximates it.
With N=1:

$$\mathbb{E}_{q(z|x)}[\log p(x|z)] \approx \log p(x|z_1), \quad z_1 \sim q(z|x)$$
Terrible estimate for one image. But you're not training on one image.
With batch size 64 and 1 z per image = 64 samples per gradient step. Over 1000 batches = 64,000 samples. Noise averages out.
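A toy numpy check of that averaging argument (the target expectation here, $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[z^2] = 1$, is my own stand-in for the reconstruction term): individual one-sample estimates are wildly noisy, but the average of 64,000 of them lands on the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-sample estimates of E[z^2] = 1 are all over the place...
one_sample = rng.standard_normal(10) ** 2
print(one_sample)                 # individual estimates range widely

# ...but 64 per batch x 1000 batches = 64,000 samples averages the noise out:
many = rng.standard_normal(64_000) ** 2
print(many.mean())                # ~1.0
```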
"By that logic, SGD with one sample is the same as GD."
SGD works because: unbiased gradient + many iterations = convergence.
"But SGD on one sample is shit — not enough variance. But too large batch is also bad?"
| Batch Size | Quality | Problem |
|---|---|---|
| 1 | Super noisy | Unstable |
| 32-256 | Good | Sweet spot ✓ |
| Full dataset | Perfect | Expensive, poor generalization |
VAE follows the same logic: 1 z sample per image × batch of images = enough signal.
"Autoencoders compress to a latent space but it's unstructured — you can't sample from it.
VAEs encode to a distribution (μ, σ) instead of a point, then force these distributions toward a prior N(0,I) using KL divergence. This ensures the latent space is compact and sampleable.
The loss comes from ELBO — a lower bound on log p(x). We can't compute log p(x) directly, so we maximize this bound instead. Negating ELBO gives us: MSE (reconstruction) + KL (stay near prior).
The reparameterization trick makes sampling differentiable by writing z = μ + σε where ε ~ N(0,1).
Organization comes from reconstruction (similar images need similar features), not KL. KL just compresses everything into the sampleable region.
One z sample works because batches + SGD handle noise — 64 images × 1 sample = enough gradient signal."
| Symbol | Name | What It Is |
|---|---|---|
| $p(z)$ | Prior | N(0, I) — what we sample from during generation |
| $p(x|z)$ | Likelihood / Decoder | Given z, how likely is x? |
| $q(z|x)$ | Approx. Posterior / Encoder | Given x, output distribution over z |
| ELBO | Lower bound | E[log p(x|z)] - KL(q||p) |
| Loss | -ELBO | MSE + KL (minimize this) |
Next up: GANs, Diffusion Models, and how Transformers connect to all of this.