Part 2 of 3 • Generative Models Series

GANs: The Complete Story

From VAE's Blur Problem to Wasserstein Distance — Including Every "Wait, But Why?"

Contents

Part I

Why GANs? The VAE Problem

1 VAE's Blurry Outputs: Why MSE Sucks

VAE works, but outputs are blurry. Why?

The Culprit: Pixel-wise Reconstruction Loss

VAE's reconstruction term is MSE:

$$\mathcal{L}_{recon} = \|x - \hat{x}\|^2 = \sum_i (x_i - \hat{x}_i)^2$$

This penalizes each pixel independently. The problem:

- Ground truth: a sharp whisker at pixel (50, 73)
- Decoder uncertainty: "the whisker could be at (50, 73) or (50, 74)... not sure"
- What MSE rewards: outputting 0.5 gray at BOTH pixels, because that minimizes expected squared error!
- Result: a blurry whisker

When uncertain about exact positions, MSE rewards hedging — outputting the average. Averages are blurry.
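A tiny NumPy check of this claim, using a hypothetical 5-pixel 1-D "image": when the decoder is unsure between two whisker positions, the blurry average beats either sharp guess under expected MSE.

```python
import numpy as np

# Two equally likely ground truths: a "whisker" at pixel 3 or at pixel 4
t1 = np.array([0., 0., 0., 1., 0.])
t2 = np.array([0., 0., 0., 0., 1.])

def expected_mse(pred):
    # Expected squared error over the two equally likely targets
    return 0.5 * np.sum((pred - t1) ** 2) + 0.5 * np.sum((pred - t2) ** 2)

sharp  = t1.copy()         # commit to one whisker position
blurry = 0.5 * (t1 + t2)   # hedge: 0.5 gray at BOTH pixels

print(expected_mse(sharp))   # 1.0
print(expected_mse(blurry))  # 0.5 -> the blur wins under MSE
```

Committing to either position risks being fully wrong half the time; the gray hedge halves the expected loss, which is exactly why MSE-trained decoders blur.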

MSE Treats All Errors Equally

- Error type 1: whisker shifted 1 pixel left. MSE: LARGE (many pixels wrong). Human: "Looks fine, still a cat."
- Error type 2: whisker in the right place but plastic-looking texture. MSE: small (few pixels different). Human: "Looks fake, uncanny."

MSE is completely misaligned with human perception.
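To make the misalignment concrete, here is a small NumPy sketch: shifting a one-pixel "whisker" by a single position scores worse under MSE than erasing it entirely.

```python
import numpy as np

img = np.zeros(100)
img[50] = 1.0                 # a one-pixel "whisker" at position 50

shifted = np.roll(img, 1)     # same whisker, 1 pixel over: identical to a human eye
deleted = np.zeros(100)       # whisker erased: clearly worse to a human

mse_shift  = np.mean((img - shifted) ** 2)   # 0.02
mse_delete = np.mean((img - deleted) ** 2)   # 0.01

# MSE rates the imperceptible 1-pixel shift as TWICE as bad as deletion.
```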

2 The IQA Problem: No Good Metric Exists

What We Want

Image Quality Assessment (IQA): A function that tells us "how good/realistic is this image?"

Ideally it would:

- agree with human judgments of realism,
- be differentiable, so a generator could be trained against it,
- have no blind spots a generator could exploit.

Why This Is Hard

Human perception is incredibly complex. We instantly know "that face looks off" but can't write down why mathematically.

| Metric | What It Measures | Problem |
|---|---|---|
| MSE / L2 | Pixel-wise difference | Shift by 1 pixel = huge error |
| SSIM | Structural similarity | Still pixel-based, misses semantics |
| PSNR | Signal-to-noise ratio | Derived from MSE, same issues |
| Perceptual loss | Feature-space difference | Better, but still hand-designed |

Every hand-crafted metric has blind spots.

🤔 The Core Insight

"There's no ideal IQA metric. MSE is shit — it's just Euclidean distance of images, which doesn't align with human understanding. So instead, we train a discriminator to be the oracle in place of humans."

3 The GAN Insight: Learn the Metric

Key Insight

We can't write down the perfect IQA metric, but we can train a neural network to approximate human judgment.

```
Traditional approach:
    Hand-craft a metric (MSE, SSIM, etc.)
        ↓
    Always imperfect, has blind spots

GAN approach:
    Train a network (Discriminator) to distinguish real from fake
        ↓
    D learns what "real" means from data
        ↓
    D's output becomes the loss function
```

The Discriminator as Learned Oracle

D sees thousands of real images. It learns the statistics of "realness": sharp edges, plausible textures, coherent global structure.

When G produces something fake, D spots the deviation from these learned statistics.

D is a learned oracle that approximates human judgment.

VAE vs GAN Philosophy

VAE

"Minimize pixel difference from ground truth"

Hand-crafted loss (MSE)

→ Blurry outputs

GAN

"Fool a critic that's trying to spot fakes"

Learned loss (Discriminator)

→ Sharp outputs

Part II

GAN Setup & Training

4 Architecture & Nomenclature

Two networks playing a game:

GAN Nomenclature

| Symbol | Name | Input | Output | Goal / Role |
|---|---|---|---|---|
| $G(z)$ | Generator | Random noise $z$ | Fake image | Fool D |
| $D(x)$ | Discriminator | Image (real or fake) | Probability [0, 1] | Spot fakes |
| $z$ | Latent noise | – | – | Seed for generation |
| $p_z$ | Noise prior | – | – | Usually $N(0, I)$ |
| $p_{data}$ | Real data distribution | – | – | What we're trying to match |
| $p_g$ | Generator distribution | – | – | What G produces |

```
              ┌─────────────┐
z ~ N(0,I) ──→│  Generator  │──→ fake image ─────────┐
              │    G(z)     │                        │
              └─────────────┘                        ↓
                                             ┌──────────────┐
real image ─────────────────────────────────→│Discriminator │──→ P(real) ∈ [0,1]
                                             │     D(x)     │
                                             └──────────────┘
```

The Game

Discriminator D: "I'll output 1 for real images, 0 for fakes"

Generator G: "I'll make fakes so good that D outputs 1 for them"

They compete and improve each other.

5 The Objective Function

What D Wants

D wants to:

- output 1 (confidently "real") for real images,
- output 0 for G's fakes.

This is binary classification. D maximizes:

$$\mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Let's parse this:

- First term: over real images, reward D for pushing $D(x)$ toward 1 (so $\log D(x)$ toward 0).
- Second term: over fakes $G(z)$, reward D for pushing $D(G(z))$ toward 0 (so $\log(1 - D(G(z)))$ toward 0).

What G Wants

G wants D to think fakes are real: $D(G(z)) \rightarrow 1$

G minimizes (opposite of D):

$$\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

When $D(G(z)) \to 1$, this becomes $\log(0) \to -\infty$. So G drives its loss toward $-\infty$ by making $D(G(z))$ large.

The Minimax Game

$$\min_G \max_D \quad \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Why Logs?

🤔 Question

"Why $\log D(x)$ and not just $D(x)$?"

✓ Resolution

This comes from binary cross-entropy loss.

For classification with true label $y$ and predicted probability $p$:

$$BCE = -[y \log p + (1-y) \log(1-p)]$$
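Plugging in the labels makes the connection explicit (a pure-Python sketch): with $y = 1$ for real images and $y = 0$ for fakes, BCE collapses to exactly the two log terms in the GAN objective, so maximizing the objective is just minimizing D's classification loss.

```python
import math

def bce(y, p):
    # Binary cross-entropy for true label y and predicted probability p
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.8  # some discriminator output

# Real image (y = 1): BCE reduces to -log D(x)
assert abs(bce(1, p) - (-math.log(p))) < 1e-12

# Fake image (y = 0): BCE reduces to -log(1 - D(G(z)))
assert abs(bce(0, p) - (-math.log(1 - p))) < 1e-12
```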

6 Training: Why Alternate Updates?

The Training Loop

```python
# Each training step:

# Step 1: Train D
real_images = next(dataloader)
z = torch.randn(batch_size, latent_dim)
fake_images = G(z).detach()        # detach: don't update G here

optimizer_D.zero_grad()
loss_D = -torch.mean(torch.log(D(real_images))) \
         - torch.mean(torch.log(1 - D(fake_images)))
loss_D.backward()
optimizer_D.step()

# Step 2: Train G
z = torch.randn(batch_size, latent_dim)
fake_images = G(z)

optimizer_G.zero_grad()
loss_G = -torch.mean(torch.log(D(fake_images)))   # non-saturating loss: fool D
loss_G.backward()
optimizer_G.step()
```
🤔 Question

"Why alternate? Why not update both simultaneously?"

Reason 1: Opposite Goals

D wants to maximize the objective. G wants to minimize it.

If you compute gradients simultaneously:

- D's gradient: "push this direction to classify better"
- G's gradient: "push the opposite direction to fool D"

They're fighting over the same computation graph!

Reason 2: Moving Target

G's loss depends on current D. If D changes simultaneously:

```
Alternating:
  Step 1: Fix G, update D for the current fakes
  Step 2: Fix D, update G for the current D
  Each sees a stable target.

Simultaneous:
  G computes its gradient assuming D(fake) = 0.3
  D updates; now D(fake) = 0.5
  G's gradient was computed for the wrong D!
```

Reason 3: Game Theory

This is a two-player game, not joint optimization.

Joint optimization minimizes a single loss, and gradient descent heads for a minimum. A minimax game instead targets a saddle point: a minimum along G's directions but a maximum along D's. Naive simultaneous gradient steps tend to spiral around such a saddle rather than converge to it.

Alternating updates approximate finding the saddle point better than simultaneous.

Part III

The Theory

7 Optimal Discriminator

Given a fixed G, what's the best D?

Setup

D maximizes:

$$V(D) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$$

Rewrite as integral:

$$V(D) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$$

Solving for Optimal D

For each x, we maximize:

$$f(D) = a \log D + b \log(1-D)$$

where $a = p_{data}(x)$ and $b = p_g(x)$.

Take derivative, set to zero:

$$\frac{a}{D} - \frac{b}{1-D} = 0$$

Solve:

$$\frac{a}{D} = \frac{b}{1-D} \;\Rightarrow\; a(1-D) = bD \;\Rightarrow\; D^*(x) = \frac{a}{a+b} = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$
Key Insight

The optimal discriminator outputs the probability that x came from real data vs. the generator.
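You can sanity-check the formula numerically: pick arbitrary (hypothetical) density values $a = p_{data}(x)$ and $b = p_g(x)$ at some fixed $x$, then brute-force $a \log D + b \log(1-D)$ over a grid of candidate outputs.

```python
import numpy as np

a, b = 0.3, 0.1   # hypothetical p_data(x) and p_g(x) at some fixed x

D = np.linspace(1e-4, 1 - 1e-4, 200_001)   # candidate discriminator outputs
f = a * np.log(D) + b * np.log(1 - D)      # the per-x objective from the text

best = D[np.argmax(f)]
print(best, a / (a + b))   # both ≈ 0.75
```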

8 What G Actually Minimizes: JS Divergence

Plug the optimal $D^*$ back into the objective. After algebra:

$$V(G, D^*) = -\log 4 + 2 \cdot D_{JS}(p_{data} \| p_g)$$

Where $D_{JS}$ is the Jensen-Shannon divergence:

$$D_{JS}(p \| q) = \frac{1}{2}D_{KL}(p \| m) + \frac{1}{2}D_{KL}(q \| m)$$

where $m = \frac{1}{2}(p + q)$ is the mixture.

Key Insight

G is minimizing the JS divergence between real and generated distributions!

At the Global Optimum

When is JS divergence zero?

$$D_{JS}(p_{data} \| p_g) = 0 \iff p_g = p_{data}$$

At optimum, G's distribution equals the true data distribution.

And D outputs:

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{data}(x)} = \frac{1}{2}$$

D can't tell real from fake — outputs 0.5 for everything. Perfect equilibrium.

9 Why JS? Why Not Symmetric KL?

🤔 Question

"I don't know anything about Jensen-Shannon. Why not just use symmetric KL?"

What's Symmetric KL?

$$D_{KL}^{sym}(p \| q) = D_{KL}(p \| q) + D_{KL}(q \| p)$$

The Problem: KL Explodes

Remember KL:

$$D_{KL}(p \| q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

What happens when q(x) = 0 but p(x) > 0?

$$\log \frac{p(x)}{0} = \log \infty = \infty$$

KL explodes when distributions don't overlap.

```
p_data:  ████████
                        (no overlap)
p_g:                ████████

KL(p_data || p_g) = ∞
KL(p_g || p_data) = ∞
Symmetric KL = ∞
```

Early in training, G produces garbage with no overlap with real data. Symmetric KL = infinity. Useless!

Why JS Doesn't Explode

JS uses the mixture $m = \frac{1}{2}(p + q)$:

$$D_{JS}(p \| q) = \frac{1}{2}D_{KL}(p \| m) + \frac{1}{2}D_{KL}(q \| m)$$

Since $m$ contains both $p$ and $q$, wherever $p(x) > 0$ we have $m(x) \geq \frac{1}{2}p(x) > 0$ (and likewise for $q$).

No division by zero!

```
p_data:       ████████
p_g:                      ████████
m (mixture):  ████████    ████████   (covers both!)

KL(p_data || m) = finite
KL(p_g || m)    = finite
JS = finite (specifically, log 2 ≈ 0.69)
```

JS Is Bounded

$$0 \leq D_{JS}(p \| q) \leq \log 2$$
| Divergence | Non-overlapping value | Bounded? |
|---|---|---|
| KL(p ‖ q) | ∞ | No |
| Symmetric KL | ∞ | No |
| JS | log 2 ≈ 0.69 | Yes, [0, log 2] |
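These bounds are easy to verify for discrete distributions. A small NumPy sketch (the discrete analogue of the continuous case above):

```python
import numpy as np

def kl(p, q):
    mask = p > 0                      # convention: 0 * log(0/q) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)                 # the mixture always covers both supports
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])    # "real data" on the left
q = np.array([0.0, 0.0, 0.5, 0.5])    # "generator" on the right, zero overlap

print(js(p, q))   # log 2 ≈ 0.693: finite even with no overlap
print(js(p, p))   # 0.0: identical distributions
```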

Did the Authors Choose JS?

No! JS emerges from the GAN objective.

The authors defined the game (train D to classify, train G to fool). Then proved: solving this game = G minimizing JS.

JS wasn't chosen — it fell out of the math.

Part IV

GAN Problems

10 Mode Collapse

G finds ONE thing that fools D and keeps making it.

```
Training step 1000:  G outputs a decent cat. D is fooled.
Training step 2000:  G still outputs the SAME cat.
                     Why learn anything else? It works!
Training step 5000:  G outputs slight variations of one cat.
                     No dogs, no birds, no variety.
```

The generator "collapses" to a few modes instead of covering the full distribution.

Why It Happens

G's objective: fool D. If one image fools D, G has no incentive to diversify.

JS divergence doesn't explicitly penalize this — G producing realistic images (even if not diverse) can have low JS.

11 Training Instability

The game can go wrong in many ways:

D Too Strong

```
D perfectly classifies everything:
  D(real) = 1.0
  D(fake) = 0.0

G's gradient:
  ∂/∂G log(1 - D(G(z))) = ∂/∂G log(1) = 0

G can't learn! The loss is flat.
```

G Too Strong

```
G perfectly fools D:
  D(fake) = 1.0

D's gradient for fakes:
  ∂/∂D log(1 - 1) = ∂/∂D log(0) → undefined

D can't learn from fakes!
```

Oscillation

```
Step 100: D learns to catch G's fakes
Step 200: G adapts, makes new fakes D can't catch
Step 300: D adapts to the new fakes
Step 400: G adapts again
...
Neither converges; they just oscillate forever.
```

12 The JS Gradient Problem

Remember: JS is bounded, which seems good. But it's also a problem.

When Distributions Don't Overlap

$$D_{JS}(p_{data} \| p_g) = \log 2 \quad \text{(constant)}$$

Constant = zero gradient!

```
G produces garbage at a position "far away":

p_data:  ████████
                        (big gap)
p_g:                              ████████

JS = log 2 (constant)  →  gradient = 0

G:  "Which direction should I move?"
JS: "I dunno. You're just... different."
G:  *moves randomly, learns nothing*
```

Early in training, G produces random noise that doesn't overlap with real images. JS gives no signal about which direction to improve.

The Core Problem: JS is bounded (good: no explosion), but constant when distributions don't overlap (bad: no gradient).
Part V

Wasserstein to the Rescue

13 Earth Mover's Distance: The Intuition

Let's forget math and think about sand.

Two Piles of Sand

```
Pile A (real data): located at position 0

████
────────────────────
0         5        10

Pile B (generator): located at position 10

                 ████
────────────────────
0         5        10
```

How "different" are these piles?

What KL and JS Say

```
KL: "Do they overlap? NO. → ∞"
JS: "Do they overlap? NO. → log 2 ≈ 0.69"
```

Neither tells us how far apart they are.

An Obvious Problem

```
Case 1: A at 0, B at 10          Case 2: A at 0, B at 100

████      ████                   ████                       ████
──────────────                   ───────────────────────────
0    5   10                      0           50          100

KL: ∞                            KL: ∞      (same!)
JS: 0.69                         JS: 0.69   (same!)
```

KL and JS say these are equally different. But B at 100 is obviously farther!

Wasserstein: Measure the Actual Work

Wasserstein distance (Earth Mover's Distance):

"How much work to move one pile of sand to match the other?"

Work = amount of sand × distance moved.

```
Case 1: move the pile from 10 to 0
  Work = (amount) × (distance) = 1 × 10 = 10

Case 2: move the pile from 100 to 0
  Work = 1 × 100 = 100
```

Wasserstein says: Case 2 is 10× farther. Makes sense!
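For equal-size 1-D samples, W₁ has a closed form: sort both samples and average the pointwise distances (optimal transport in 1-D just matches sorted points to sorted points). A quick sketch reproducing the sand-pile numbers:

```python
import numpy as np

def w1(xs, ys):
    # 1-D Wasserstein-1 between equal-size empirical distributions
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

pile_a = np.zeros(1000)           # all sand at position 0
b_near = np.full(1000, 10.0)      # Case 1: generator pile at 10
b_far  = np.full(1000, 100.0)     # Case 2: generator pile at 100

print(w1(pile_a, b_near))   # 10.0
print(w1(pile_a, b_far))    # 100.0 -> 10x farther, as intuition demands
```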

14 Why Wasserstein Gives Better Gradients

The JS Problem

```
G produces fakes at position 10. Real data at position 0. No overlap.

JS = constant = 0.69
Gradient of JS w.r.t. G's position = 0

G doesn't know which way to move!
```

Wasserstein Gives Direction

```
G produces fakes at position 10. Real data at position 0.

W = 10
Gradient of W w.r.t. G's position = -1  (move left!)

G knows: move toward 0.
```

Even when distributions don't overlap, Wasserstein tells G: "Move this direction, you'll get closer."

Key Insight

Wasserstein measures actual distance, not just overlap. This gives meaningful gradients everywhere.

Intuition: G Trying to Find Real Data

```
Real data at position 0. G starts at position 100.

With JS:
  G:  "Where do I go?"
  JS: "You're different from real data."
  G:  "But which direction??"
  JS: "Just... different. 0.69 different."
  G:  *moves randomly, makes no progress*

With Wasserstein:
  G: "Where do I go?"
  W: "You're 100 units away. Move LEFT."
  G: *moves left*
  W: "Now you're 90 units away. Keep going."
  G: *eventually reaches real data*
```

15 WGAN: Making It Practical

The Formal Definition

$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$

Translation: consider every "transport plan" $\gamma$, a joint distribution whose marginals are $p$ and $q$, i.e. a recipe for how much mass to move from each $x$ to each $y$. The Wasserstein distance is the cost of the cheapest plan: the expected distance moved under the best possible $\gamma$.

This is expensive to compute directly. But there's a trick!

Kantorovich-Rubinstein Duality

A theorem says:

$$W(p, q) = \sup_{\|f\|_L \leq 1} \left[ \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right]$$

Instead of solving a transport problem, find a function $f$ that maximizes the difference in expected values — but $f$ must be 1-Lipschitz.

What's 1-Lipschitz?

A function is 1-Lipschitz if it can't change too fast:

$$|f(x) - f(y)| \leq |x - y|$$

Slope ≤ 1 everywhere.

```
1-Lipschitz:                 Not 1-Lipschitz:

      ╱                            │
     ╱                            ╱
    ╱                            ╱

Slope ≤ 1 everywhere         Slope > 1 somewhere (too steep)
```
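A quick empirical check of the definition over sampled pairs (NumPy sketch; tanh is a standard 1-Lipschitz example, and 3x clearly is not):

```python
import numpy as np

def worst_slope(f, xs):
    # max |f(x) - f(y)| / |x - y| over all sampled pairs
    x, y = np.meshgrid(xs, xs)
    diff = np.abs(f(x) - f(y))
    dist = np.abs(x - y)
    keep = dist > 0
    return float(np.max(diff[keep] / dist[keep]))

xs = np.linspace(-3, 3, 301)

print(worst_slope(np.tanh, xs))          # ≤ 1: tanh is 1-Lipschitz
print(worst_slope(lambda x: 3 * x, xs))  # 3: violates the constraint
```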

WGAN: Use a Neural Network

Replace $f$ with discriminator D (now called "critic"):

$$W(p_{data}, p_g) \approx \max_D \left[ \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \right]$$

Subject to: D is 1-Lipschitz.

WGAN Objective

Critic (D): Maximize $\mathbb{E}[D(\text{real})] - \mathbb{E}[D(\text{fake})]$

(Output high for real, low for fake. But stay smooth.)

Generator (G): Maximize $\mathbb{E}[D(\text{fake})]$

(Make D output high for fakes.)

Key Differences from Original GAN

Original GAN WGAN
D output Probability [0, 1] Any real number
D loss Binary cross-entropy Linear difference
Constraint None 1-Lipschitz
Divergence JS Wasserstein

Enforcing Lipschitz

Method 1: Weight Clipping (Original WGAN)

```python
# After each D update, clamp every weight into a small box:
for param in D.parameters():
    param.data.clamp_(-0.01, 0.01)
```

Crude but works.

Method 2: Gradient Penalty (WGAN-GP)

$$\mathcal{L}_D = \mathbb{E}[D(\text{fake})] - \mathbb{E}[D(\text{real})] + \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]$$

Directly penalize the critic wherever its gradient norm deviates from 1; here $\hat{x}$ is sampled along straight lines between real and fake images.

Part VI

Conditional Generation & Latent Space

16 Conditional GANs: Controlling What You Generate

🤔 Question

"VAE can take an input x and generate something similar. But GAN just generates random stuff from noise. How do I control what it generates?"

The Problem with Vanilla GAN

```
Vanilla GAN:
  z ~ N(0, I) ──→ G ──→ random image

You get whatever G decides to generate.
No control over "give me a cat" vs "give me a dog".
```

Conditional GAN (cGAN)

Add a condition $c$ to both G and D:

```
Conditional GAN:
  z ~ N(0, I) ──┐
                ├──→ G(z, c) ──→ fake image of class c
  condition c ──┘

D now checks: "Is this a REAL [condition]?"
  D(image, c) → "Is this a real cat?"  (not just "is this real?")
```

The Objective

$$\min_G \max_D \quad \mathbb{E}_{x, c}[\log D(x, c)] + \mathbb{E}_{z, c}[\log(1 - D(G(z, c), c))]$$

The condition $c$ can be a class label, a text description, or another image.

How to Inject the Condition

For G:

```
Option 1: concatenate c to z
  [z ; c] ──→ G ──→ image

Option 2: embed c and add it to intermediate layers
  z ──→ G_layer1 ──→ (+ embed(c)) ──→ G_layer2 ──→ ...
```

For D:

```
Option 1: concatenate c to the image (as extra channels)
  [image ; c_map] ──→ D ──→ real/fake

Option 2: projection discriminator
  D(image) · embed(c) ──→ real/fake score
```
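The concatenation option for G is just array stacking. A shape-level NumPy sketch with hypothetical sizes (batch 8, 100-dim noise, 10 classes):

```python
import numpy as np

batch, latent_dim, n_classes = 8, 100, 10

rng = np.random.default_rng(0)
z = rng.standard_normal((batch, latent_dim))      # noise
labels = rng.integers(0, n_classes, size=batch)   # desired classes
c = np.eye(n_classes)[labels]                     # one-hot condition

g_input = np.concatenate([z, c], axis=1)          # [z ; c], fed to G
print(g_input.shape)   # (8, 110)
```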

Examples of Conditional GANs

| Model | Condition | Task |
|---|---|---|
| cGAN | Class label | Generate a specific class |
| Pix2Pix | Input image | Image-to-image translation |
| CycleGAN | Domain | Unpaired translation (horse↔zebra) |
| StyleGAN | Style vectors | Control attributes (age, pose) |
| GAN + CLIP | Text | Text-to-image |
Key Insight

Conditional GAN lets you control generation by telling both G and D what you want. G learns to generate that condition, D learns to verify "is this a real example of that condition?"

17 GAN Latent Space: No Guarantees Like VAE

🤔 Question

"Do GANs guarantee smooth interpolation like VAE? What's to say the latent noise space doesn't have dead zones?"

✓ Short Answer

You're right to be suspicious. GAN provides no theoretical guarantee of a smooth, hole-free latent space.

Why VAE's Latent Space Works

```
VAE training:
1. Encoder maps images → distributions near N(0, I)  (the KL term forces this)
2. Decoder sees samples from the ENTIRE N(0, I) region  (because the encoder covers it)
3. At generation: sample z ~ N(0, I)
   → the decoder has seen this region → no dead zones ✓
4. Interpolation: any path between two z's
   → the decoder knows the intermediate points → smooth morphing ✓
```

Why GAN Has No Such Guarantee

```
GAN training:
1. G receives z ~ N(0, I)
2. G learns to map z → realistic image
3. BUT: G might only "use" certain regions of z space!

   z space:  ████████████████████
               ↑    ↑      ↑   ↑
             G uses these spots, ignores the rest

4. At generation: sample z ~ N(0, I)
   → might land in a region G never learned
   → could produce garbage (a dead zone)
5. Interpolation: a path between two z's
   → might pass through unused regions
   → no guarantee of smooth morphing
```

Why Mode Collapse = Dead Zones

Mode collapse is exactly this problem:

```
Healthy G: uses the full z space, maps different z to different images
  z₁ → cat, z₂ → dog, z₃ → bird, ...

Mode-collapsed G: maps ALL z to the same few images
  z₁ → cat, z₂ → cat, z₃ → cat, ...
  Almost the entire z space is "mapped to cat".
  No diversity; dead zones everywhere.
```

Empirically, It Often Works (Kinda)

In practice, well-trained GANs (especially StyleGAN) do show reasonable interpolation. Why? G is a continuous network, so nearby z's map to similar images, and training draws z densely from N(0, I), so most of the prior's mass receives gradient signal.

StyleGAN's Solution: Mapping Network

StyleGAN doesn't sample z directly. It maps z → w through a learned network:

```
StyleGAN:
  z ~ N(0, I) ──→ Mapping Network ──→ w ──→ G ──→ image
                  (8 FC layers)

Why this helps:
- w space is more "disentangled" (each dimension ≈ one factor of variation)
- the mapping network can warp z space to be smoother
- interpolation in w space works better than in raw z
```
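One concrete reason naive interpolation can wander off the prior: in high dimensions, samples from N(0, I) concentrate on a shell of radius about √d, while the linear midpoint of two independent samples has norm about √(d/2), roughly 29% shorter, so it lands in a region the generator may rarely have seen. A NumPy sketch:

```python
import numpy as np

d = 512                                  # StyleGAN-scale latent dimension
rng = np.random.default_rng(0)
z0 = rng.standard_normal(d)
z1 = rng.standard_normal(d)

mid = 0.5 * (z0 + z1)                    # naive linear-interpolation midpoint

print(np.linalg.norm(z0) / np.sqrt(d))   # ≈ 1.0 (on the typical shell)
print(np.linalg.norm(mid) / np.sqrt(d))  # ≈ 0.71 (off the shell)
```

This is one reason spherical interpolation (slerp) is often preferred over lerp for GAN latents.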

VAE vs GAN: Latent Space Comparison

| Property | VAE | GAN |
|---|---|---|
| Smooth interpolation | Guaranteed (KL forces coverage) | No guarantee (empirically ok) |
| Dead zones | Unlikely (decoder sees full prior) | Possible (mode collapse) |
| Inference | Can encode x → z (encoder exists) | No encoder (needs extra work) |
| Disentanglement | Encouraged by KL (not guaranteed) | None (StyleGAN adds a mapping net) |
🤔 Follow-up

"Wait, GAN has no encoder? How do I find z for a given image?"

✓ Resolution

Correct — vanilla GAN has no encoder. To find z for a given image, you need either:

- optimization: start from a random z and run gradient descent on $\|G(z) - x\|$ (often called GAN inversion), or
- a separately trained encoder that maps images back to z.

This is a real practical disadvantage of GANs vs VAEs.

18 VAE vs GAN: Final Comparison

| Aspect | VAE | GAN |
|---|---|---|
| Loss | MSE + KL (hand-crafted) | Learned by D |
| Training | Stable (just minimize) | Unstable (adversarial game) |
| Output quality | Blurry | Sharp |
| Mode coverage | Good | Poor (collapse risk) |
| Likelihood | Yes (ELBO) | No |
| Encoder | Yes (x → z) | No (needs optimization or a separate encoder) |
| Latent space | Guaranteed smooth (KL forces coverage) | No guarantee (empirically often ok) |
| Interpolation | Works (decoder sees full prior) | Usually works (no guarantee) |
| Conditional generation | Encode a similar x, or add a condition to the decoder | cGAN: add the condition to G and D |
| Theory | ELBO ≤ log p(x) | Minimizes JS/Wasserstein |
🎯 The Narrative

"VAE uses MSE which is pixel-wise and doesn't match human perception — it rewards blurry averages. GANs replace this with a learned loss: the discriminator learns what 'real' looks like from data, then guides the generator.

The minimax game makes G minimize JS divergence between real and generated distributions. But JS has a problem: it's constant when distributions don't overlap, giving zero gradient early in training.

Wasserstein distance fixes this by measuring actual transport cost between distributions. Even non-overlapping distributions get meaningful gradients. WGAN uses a Lipschitz-constrained critic to approximate Wasserstein distance.

Conditional GANs add a condition (class, text, image) to both G and D, letting you control what gets generated. D learns to verify 'is this a real example of that condition?'

Unlike VAE, GAN has no encoder and no guarantee of smooth latent space. The KL term in VAE forces coverage of N(0,I); GAN has no such constraint, making mode collapse and dead zones possible. Empirically, well-trained GANs (especially StyleGAN) do interpolate reasonably, but there's no theoretical guarantee."

Divergence Comparison
| Divergence | Non-overlapping value | Gradient | Used in |
|---|---|---|---|
| KL | ∞ | Undefined | VAE |
| Symmetric KL | ∞ | Undefined | – |
| JS | log 2 (constant) | Zero | Original GAN |
| Wasserstein | Actual distance | Useful! | WGAN |

Next up: Diffusion Models — Sharp images like GANs, stable training like VAEs.