Part 2 of 3 • Generative Models Series

GANs: The Complete Story

From VAE's Blur Problem to Wasserstein Distance — Including Every "Wait, But Why?"

Contents

Part I

Why GANs? The VAE Problem

1 VAE's Blurry Outputs: Why MSE Sucks

VAE works, but outputs are blurry. Why?

The Culprit: Pixel-wise Reconstruction Loss

VAE's reconstruction term is MSE:

$$\mathcal{L}_{recon} = \|x - \hat{x}\|^2 = \sum_i (x_i - \hat{x}_i)^2$$

This penalizes each pixel independently. The problem:

- Ground truth: a sharp whisker at pixel (50, 73)
- Decoder uncertainty: "the whisker could be at (50, 73) or (50, 74)... not sure"
- What MSE rewards: outputting 0.5 gray at BOTH pixels, because that minimizes expected squared error!
- Result: a blurry whisker

When uncertain about exact positions, MSE rewards hedging — outputting the average. Averages are blurry.
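A tiny NumPy check of this claim, using a hypothetical 5-pixel 1-D "image": when the decoder is unsure between two whisker positions, the blurry average beats either sharp guess under expected MSE.

```python
import numpy as np

# Two equally likely ground truths: a "whisker" at pixel 3 or at pixel 4
t1 = np.array([0., 0., 0., 1., 0.])
t2 = np.array([0., 0., 0., 0., 1.])

def expected_mse(pred):
    # Expected squared error over the two equally likely targets
    return 0.5 * np.sum((pred - t1) ** 2) + 0.5 * np.sum((pred - t2) ** 2)

sharp  = t1.copy()         # commit to one whisker position
blurry = 0.5 * (t1 + t2)   # hedge: 0.5 gray at BOTH pixels

print(expected_mse(sharp))   # 1.0
print(expected_mse(blurry))  # 0.5 -> the blur wins under MSE
```

Committing to either position risks being fully wrong half the time; the gray hedge halves the expected loss, which is exactly why MSE-trained decoders blur.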

MSE Treats All Errors Equally

- Error type 1: whisker shifted 1 pixel left. MSE: LARGE (many pixels wrong). Human: "Looks fine, still a cat."
- Error type 2: whisker in the right place but plastic-looking texture. MSE: small (few pixels different). Human: "Looks fake, uncanny."

MSE is completely misaligned with human perception.
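To make the misalignment concrete, here is a small NumPy sketch: shifting a one-pixel "whisker" by a single position scores worse under MSE than erasing it entirely.

```python
import numpy as np

img = np.zeros(100)
img[50] = 1.0                 # a one-pixel "whisker" at position 50

shifted = np.roll(img, 1)     # same whisker, 1 pixel over: identical to a human eye
deleted = np.zeros(100)       # whisker erased: clearly worse to a human

mse_shift  = np.mean((img - shifted) ** 2)   # 0.02
mse_delete = np.mean((img - deleted) ** 2)   # 0.01

# MSE rates the imperceptible 1-pixel shift as TWICE as bad as deletion.
```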

2 The IQA Problem: No Good Metric Exists

What We Want

Image Quality Assessment (IQA): A function that tells us "how good/realistic is this image?"

Ideally it would:

- agree with human judgments of realism,
- be differentiable, so a generator could be trained against it,
- have no blind spots a generator could exploit.

Why This Is Hard

Human perception is incredibly complex. We instantly know "that face looks off" but can't write down why mathematically.

| Metric | What It Measures | Problem |
|---|---|---|
| MSE / L2 | Pixel-wise difference | Shift by 1 pixel = huge error |
| SSIM | Structural similarity | Still pixel-based, misses semantics |
| PSNR | Signal-to-noise ratio | Derived from MSE, same issues |
| Perceptual loss | Feature-space difference | Better, but still hand-designed |

Every hand-crafted metric has blind spots.

🤔 The Core Insight

"There's no ideal IQA metric. MSE is shit — it's just Euclidean distance of images, which doesn't align with human understanding. So instead, we train a discriminator to be the oracle in place of humans."

3 The GAN Insight: Learn the Metric

Key Insight

We can't write down the perfect IQA metric, but we can train a neural network to approximate human judgment.

```
Traditional approach:
    Hand-craft a metric (MSE, SSIM, etc.)
        ↓
    Always imperfect, has blind spots

GAN approach:
    Train a network (Discriminator) to distinguish real from fake
        ↓
    D learns what "real" means from data
        ↓
    D's output becomes the loss function
```

The Discriminator as Learned Oracle

D sees thousands of real images. It learns the statistics of "realness": sharp edges, plausible textures, coherent global structure.

When G produces something fake, D spots the deviation from these learned statistics.

D is a learned oracle that approximates human judgment.

VAE vs GAN Philosophy

VAE

"Minimize pixel difference from ground truth"

Hand-crafted loss (MSE)

→ Blurry outputs

GAN

"Fool a critic that's trying to spot fakes"

Learned loss (Discriminator)

→ Sharp outputs

Part II

GAN Setup & Training

4 Architecture & Nomenclature

Two networks playing a game:

GAN Nomenclature

| Symbol | Name | Input | Output | Goal / Role |
|---|---|---|---|---|
| $G(z)$ | Generator | Random noise $z$ | Fake image | Fool D |
| $D(x)$ | Discriminator | Image (real or fake) | Probability [0, 1] | Spot fakes |
| $z$ | Latent noise | – | – | Seed for generation |
| $p_z$ | Noise prior | – | – | Usually $N(0, I)$ |
| $p_{data}$ | Real data distribution | – | – | What we're trying to match |
| $p_g$ | Generator distribution | – | – | What G produces |

```
              ┌─────────────┐
z ~ N(0,I) ──→│  Generator  │──→ fake image ─────────┐
              │    G(z)     │                        │
              └─────────────┘                        ↓
                                             ┌──────────────┐
real image ─────────────────────────────────→│Discriminator │──→ P(real) ∈ [0,1]
                                             │     D(x)     │
                                             └──────────────┘
```

The Game

Discriminator D: "I'll output 1 for real images, 0 for fakes"

Generator G: "I'll make fakes so good that D outputs 1 for them"

They compete and improve each other.

5 The Objective Function

What D Wants

D wants to:

- output 1 (confidently "real") for real images,
- output 0 for G's fakes.

This is binary classification. D maximizes:

$$\mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Let's parse this:

- First term: over real images, reward D for pushing $D(x)$ toward 1 (so $\log D(x)$ toward 0).
- Second term: over fakes $G(z)$, reward D for pushing $D(G(z))$ toward 0 (so $\log(1 - D(G(z)))$ toward 0).

What G Wants

G wants D to think fakes are real: $D(G(z)) \rightarrow 1$

G minimizes (opposite of D):

$$\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

When $D(G(z)) \to 1$, this becomes $\log(0) \to -\infty$. So G drives its loss toward $-\infty$ by making $D(G(z))$ large.

The Minimax Game

$$\min_G \max_D \quad \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Why Logs?

🤔 Question

"Why $\log D(x)$ and not just $D(x)$?"

✓ Resolution

This comes from binary cross-entropy loss.

For classification with true label $y$ and predicted probability $p$:

$$BCE = -[y \log p + (1-y) \log(1-p)]$$
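Plugging in the labels makes the connection explicit (a pure-Python sketch): with $y = 1$ for real images and $y = 0$ for fakes, BCE collapses to exactly the two log terms in the GAN objective, so maximizing the objective is just minimizing D's classification loss.

```python
import math

def bce(y, p):
    # Binary cross-entropy for true label y and predicted probability p
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.8  # some discriminator output

# Real image (y = 1): BCE reduces to -log D(x)
assert abs(bce(1, p) - (-math.log(p))) < 1e-12

# Fake image (y = 0): BCE reduces to -log(1 - D(G(z)))
assert abs(bce(0, p) - (-math.log(1 - p))) < 1e-12
```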

6 Training: Why Alternate Updates?

The Training Loop

```python
# Each training step:

# Step 1: Train D
real_images = next(dataloader)
z = torch.randn(batch_size, latent_dim)
fake_images = G(z).detach()        # detach: don't update G here

optimizer_D.zero_grad()
loss_D = -torch.mean(torch.log(D(real_images))) \
         - torch.mean(torch.log(1 - D(fake_images)))
loss_D.backward()
optimizer_D.step()

# Step 2: Train G
z = torch.randn(batch_size, latent_dim)
fake_images = G(z)

optimizer_G.zero_grad()
loss_G = -torch.mean(torch.log(D(fake_images)))   # non-saturating loss: fool D
loss_G.backward()
optimizer_G.step()
```
🤔 Question

"Why alternate? Why not update both simultaneously?"

Reason 1: Opposite Goals

D wants to maximize the objective. G wants to minimize it.

If you compute gradients simultaneously:

- D's gradient: "push this direction to classify better"
- G's gradient: "push the opposite direction to fool D"

They're fighting over the same computation graph!

Reason 2: Moving Target

G's loss depends on current D. If D changes simultaneously:

```
Alternating:
  Step 1: Fix G, update D for the current fakes
  Step 2: Fix D, update G for the current D
  Each sees a stable target.

Simultaneous:
  G computes its gradient assuming D(fake) = 0.3
  D updates; now D(fake) = 0.5
  G's gradient was computed for the wrong D!
```

Reason 3: Game Theory

This is a two-player game, not joint optimization.

Joint optimization minimizes a single loss, and gradient descent heads for a minimum. A minimax game instead targets a saddle point: a minimum along G's directions but a maximum along D's. Naive simultaneous gradient steps tend to spiral around such a saddle rather than converge to it.

Alternating updates approximate finding the saddle point better than simultaneous.

Part III

The Theory

7 Optimal Discriminator

Given a fixed G, what's the best D?

Setup

D maximizes:

$$V(D) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$$

Rewrite as integral:

$$V(D) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$$

Solving for Optimal D

For each x, we maximize:

$$f(D) = a \log D + b \log(1-D)$$

where $a = p_{data}(x)$ and $b = p_g(x)$.

Take derivative, set to zero:

$$\frac{a}{D} - \frac{b}{1-D} = 0$$

Solve:

$$\frac{a}{D} = \frac{b}{1-D} \;\Rightarrow\; a(1-D) = bD \;\Rightarrow\; D^*(x) = \frac{a}{a+b} = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$
Key Insight

The optimal discriminator outputs the probability that x came from real data vs. the generator.
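You can sanity-check the formula numerically: pick arbitrary (hypothetical) density values $a = p_{data}(x)$ and $b = p_g(x)$ at some fixed $x$, then brute-force $a \log D + b \log(1-D)$ over a grid of candidate outputs.

```python
import numpy as np

a, b = 0.3, 0.1   # hypothetical p_data(x) and p_g(x) at some fixed x

D = np.linspace(1e-4, 1 - 1e-4, 200_001)   # candidate discriminator outputs
f = a * np.log(D) + b * np.log(1 - D)      # the per-x objective from the text

best = D[np.argmax(f)]
print(best, a / (a + b))   # both ≈ 0.75
```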

8 What G Actually Minimizes: JS Divergence

Plug the optimal $D^*$ back into the objective. After algebra:

$$V(G, D^*) = -\log 4 + 2 \cdot D_{JS}(p_{data} \| p_g)$$

Where $D_{JS}$ is the Jensen-Shannon divergence:

$$D_{JS}(p \| q) = \frac{1}{2}D_{KL}(p \| m) + \frac{1}{2}D_{KL}(q \| m)$$

where $m = \frac{1}{2}(p + q)$ is the mixture.

Key Insight

G is minimizing the JS divergence between real and generated distributions!

At the Global Optimum

When is JS divergence zero?

$$D_{JS}(p_{data} \| p_g) = 0 \iff p_g = p_{data}$$

At optimum, G's distribution equals the true data distribution.

And D outputs:

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{data}(x)} = \frac{1}{2}$$

D can't tell real from fake — outputs 0.5 for everything. Perfect equilibrium.

9 Why JS? Why Not Symmetric KL?

🤔 Question

"I don't know anything about Jensen-Shannon. Why not just use symmetric KL?"

What's Symmetric KL?

$$D_{KL}^{sym}(p \| q) = D_{KL}(p \| q) + D_{KL}(q \| p)$$

The Problem: KL Explodes

Remember KL:

$$D_{KL}(p \| q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

What happens when q(x) = 0 but p(x) > 0?

$$\log \frac{p(x)}{0} = \log \infty = \infty$$

KL explodes when distributions don't overlap.

```
p_data:  ████████
                        (no overlap)
p_g:                ████████

KL(p_data || p_g) = ∞
KL(p_g || p_data) = ∞
Symmetric KL = ∞
```

Early in training, G produces garbage with no overlap with real data. Symmetric KL = infinity. Useless!

Why JS Doesn't Explode

JS uses the mixture $m = \frac{1}{2}(p + q)$:

$$D_{JS}(p \| q) = \frac{1}{2}D_{KL}(p \| m) + \frac{1}{2}D_{KL}(q \| m)$$

Since $m$ contains both $p$ and $q$, wherever $p(x) > 0$ we have $m(x) \geq \frac{1}{2}p(x) > 0$ (and likewise for $q$).

No division by zero!

```
p_data:       ████████
p_g:                      ████████
m (mixture):  ████████    ████████   (covers both!)

KL(p_data || m) = finite
KL(p_g || m)    = finite
JS = finite (specifically, log 2 ≈ 0.69)
```

JS Is Bounded

$$0 \leq D_{JS}(p \| q) \leq \log 2$$
| Divergence | Non-overlapping value | Bounded? |
|---|---|---|
| KL(p ‖ q) | ∞ | No |
| Symmetric KL | ∞ | No |
| JS | log 2 ≈ 0.69 | Yes, [0, log 2] |
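These bounds are easy to verify for discrete distributions. A small NumPy sketch (the discrete analogue of the continuous case above):

```python
import numpy as np

def kl(p, q):
    mask = p > 0                      # convention: 0 * log(0/q) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)                 # the mixture always covers both supports
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])    # "real data" on the left
q = np.array([0.0, 0.0, 0.5, 0.5])    # "generator" on the right, zero overlap

print(js(p, q))   # log 2 ≈ 0.693: finite even with no overlap
print(js(p, p))   # 0.0: identical distributions
```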

Did the Authors Choose JS?

No! JS emerges from the GAN objective.

The authors defined the game (train D to classify, train G to fool). Then proved: solving this game = G minimizing JS.

JS wasn't chosen — it fell out of the math.

Part IV

GAN Problems

10 Mode Collapse

G finds ONE thing that fools D and keeps making it.

```
Training step 1000:  G outputs a decent cat. D is fooled.
Training step 2000:  G still outputs the SAME cat.
                     Why learn anything else? It works!
Training step 5000:  G outputs slight variations of one cat.
                     No dogs, no birds, no variety.
```

The generator "collapses" to a few modes instead of covering the full distribution.

Why It Happens

G's objective: fool D. If one image fools D, G has no incentive to diversify.

JS divergence doesn't explicitly penalize this — G producing realistic images (even if not diverse) can have low JS.

11 Training Instability

The game can go wrong in many ways:

D Too Strong

```
D perfectly classifies everything:
  D(real) = 1.0
  D(fake) = 0.0

G's gradient:
  ∂/∂G log(1 - D(G(z))) = ∂/∂G log(1) = 0

G can't learn! The loss is flat.
```

G Too Strong

```
G perfectly fools D:
  D(fake) = 1.0

D's gradient for fakes:
  ∂/∂D log(1 - 1) = ∂/∂D log(0) → undefined

D can't learn from fakes!
```

Oscillation

```
Step 100: D learns to catch G's fakes
Step 200: G adapts, makes new fakes D can't catch
Step 300: D adapts to the new fakes
Step 400: G adapts again
...
Neither converges; they just oscillate forever.
```

12 The JS Gradient Problem

Remember: JS is bounded, which seems good. But it's also a problem.

When Distributions Don't Overlap

$$D_{JS}(p_{data} \| p_g) = \log 2 \quad \text{(constant)}$$

Constant = zero gradient!

```
G produces garbage at a position "far away":

p_data:  ████████
                        (big gap)
p_g:                              ████████

JS = log 2 (constant)  →  gradient = 0

G:  "Which direction should I move?"
JS: "I dunno. You're just... different."
G:  *moves randomly, learns nothing*
```

Early in training, G produces random noise that doesn't overlap with real images. JS gives no signal about which direction to improve.

The Core Problem: JS is bounded (good: no explosion), but constant when distributions don't overlap (bad: no gradient).
Part V

Wasserstein to the Rescue

13 Earth Mover's Distance: The Intuition

Let's forget math and think about sand.

Two Piles of Sand

```
Pile A (real data): located at position 0

████
────────────────────
0         5        10

Pile B (generator): located at position 10

                 ████
────────────────────
0         5        10
```

How "different" are these piles?

What KL and JS Say

```
KL: "Do they overlap? NO. → ∞"
JS: "Do they overlap? NO. → log 2 ≈ 0.69"
```

Neither tells us how far apart they are.

An Obvious Problem

```
Case 1: A at 0, B at 10          Case 2: A at 0, B at 100

████      ████                   ████                       ████
──────────────                   ───────────────────────────
0    5   10                      0           50          100

KL: ∞                            KL: ∞      (same!)
JS: 0.69                         JS: 0.69   (same!)
```

KL and JS say these are equally different. But B at 100 is obviously farther!

Wasserstein: Measure the Actual Work

Wasserstein distance (Earth Mover's Distance):

"How much work to move one pile of sand to match the other?"

Work = amount of sand × distance moved.

```
Case 1: move the pile from 10 to 0
  Work = (amount) × (distance) = 1 × 10 = 10

Case 2: move the pile from 100 to 0
  Work = 1 × 100 = 100
```

Wasserstein says: Case 2 is 10× farther. Makes sense!
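For equal-size 1-D samples, W₁ has a closed form: sort both samples and average the pointwise distances (optimal transport in 1-D just matches sorted points to sorted points). A quick sketch reproducing the sand-pile numbers:

```python
import numpy as np

def w1(xs, ys):
    # 1-D Wasserstein-1 between equal-size empirical distributions
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

pile_a = np.zeros(1000)           # all sand at position 0
b_near = np.full(1000, 10.0)      # Case 1: generator pile at 10
b_far  = np.full(1000, 100.0)     # Case 2: generator pile at 100

print(w1(pile_a, b_near))   # 10.0
print(w1(pile_a, b_far))    # 100.0 -> 10x farther, as intuition demands
```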

14 Why Wasserstein Gives Better Gradients

The JS Problem

```
G produces fakes at position 10. Real data at position 0. No overlap.

JS = constant = 0.69
Gradient of JS w.r.t. G's position = 0

G doesn't know which way to move!
```

Wasserstein Gives Direction

```
G produces fakes at position 10. Real data at position 0.

W = 10
Gradient of W w.r.t. G's position = -1  (move left!)

G knows: move toward 0.
```

Even when distributions don't overlap, Wasserstein tells G: "Move this direction, you'll get closer."

Key Insight

Wasserstein measures actual distance, not just overlap. This gives meaningful gradients everywhere.

Intuition: G Trying to Find Real Data

```
Real data at position 0. G starts at position 100.

With JS:
  G:  "Where do I go?"
  JS: "You're different from real data."
  G:  "But which direction??"
  JS: "Just... different. 0.69 different."
  G:  *moves randomly, makes no progress*

With Wasserstein:
  G: "Where do I go?"
  W: "You're 100 units away. Move LEFT."
  G: *moves left*
  W: "Now you're 90 units away. Keep going."
  G: *eventually reaches real data*
```

15 WGAN: Making It Practical

The Formal Definition

$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$

Translation: consider every "transport plan" $\gamma$, a joint distribution whose marginals are $p$ and $q$, i.e. a recipe for how much mass to move from each $x$ to each $y$. The Wasserstein distance is the cost of the cheapest plan: the expected distance moved under the best possible $\gamma$.

This is expensive to compute directly. But there's a trick!

Kantorovich-Rubinstein Duality

A theorem says:

$$W(p, q) = \sup_{\|f\|_L \leq 1} \left[ \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right]$$

Instead of solving a transport problem, find a function $f$ that maximizes the difference in expected values — but $f$ must be 1-Lipschitz.

What's 1-Lipschitz?

A function is 1-Lipschitz if it can't change too fast:

$$|f(x) - f(y)| \leq |x - y|$$

Slope ≤ 1 everywhere.

```
1-Lipschitz:                 Not 1-Lipschitz:

      ╱                            │
     ╱                            ╱
    ╱                            ╱

Slope ≤ 1 everywhere         Slope > 1 somewhere (too steep)
```
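A quick empirical check of the definition over sampled pairs (NumPy sketch; tanh is a standard 1-Lipschitz example, and 3x clearly is not):

```python
import numpy as np

def worst_slope(f, xs):
    # max |f(x) - f(y)| / |x - y| over all sampled pairs
    x, y = np.meshgrid(xs, xs)
    diff = np.abs(f(x) - f(y))
    dist = np.abs(x - y)
    keep = dist > 0
    return float(np.max(diff[keep] / dist[keep]))

xs = np.linspace(-3, 3, 301)

print(worst_slope(np.tanh, xs))          # ≤ 1: tanh is 1-Lipschitz
print(worst_slope(lambda x: 3 * x, xs))  # 3: violates the constraint
```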

WGAN: Use a Neural Network

Replace $f$ with discriminator D (now called "critic"):

$$W(p_{data}, p_g) \approx \max_D \left[ \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \right]$$

Subject to: D is 1-Lipschitz.

WGAN Objective

Critic (D): Maximize $\mathbb{E}[D(\text{real})] - \mathbb{E}[D(\text{fake})]$

(Output high for real, low for fake. But stay smooth.)

Generator (G): Maximize $\mathbb{E}[D(\text{fake})]$

(Make D output high for fakes.)

Key Differences from Original GAN

Original GAN WGAN
D output Probability [0, 1] Any real number
D loss Binary cross-entropy Linear difference
Constraint None 1-Lipschitz
Divergence JS Wasserstein

Enforcing Lipschitz

Method 1: Weight Clipping (Original WGAN)

```python
# After each D update, clamp every weight into a small box:
for param in D.parameters():
    param.data.clamp_(-0.01, 0.01)
```

Crude but works.

Method 2: Gradient Penalty (WGAN-GP)

$$\mathcal{L}_D = \mathbb{E}[D(\text{fake})] - \mathbb{E}[D(\text{real})] + \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]$$

Directly penalize the critic wherever its gradient norm deviates from 1; here $\hat{x}$ is sampled along straight lines between real and fake images.

Part VI

Conditional Generation & Latent Space

16 Conditional GANs: Controlling What You Generate

🤔 Question

"VAE can take an input x and generate something similar. But GAN just generates random stuff from noise. How do I control what it generates?"

The Problem with Vanilla GAN

```
Vanilla GAN:
  z ~ N(0, I) ──→ G ──→ random image

You get whatever G decides to generate.
No control over "give me a cat" vs "give me a dog".
```

Conditional GAN (cGAN)

Add a condition $c$ to both G and D:

```
Conditional GAN:
  z ~ N(0, I) ──┐
                ├──→ G(z, c) ──→ fake image of class c
  condition c ──┘

D now checks: "Is this a REAL [condition]?"
  D(image, c) → "Is this a real cat?"  (not just "is this real?")
```

The Objective

$$\min_G \max_D \quad \mathbb{E}_{x, c}[\log D(x, c)] + \mathbb{E}_{z, c}[\log(1 - D(G(z, c), c))]$$

The condition $c$ can be a class label, a text description, or another image.

How to Inject the Condition

For G:

```
Option 1: concatenate c to z
  [z ; c] ──→ G ──→ image

Option 2: embed c and add it to intermediate layers
  z ──→ G_layer1 ──→ (+ embed(c)) ──→ G_layer2 ──→ ...
```

For D:

```
Option 1: concatenate c to the image (as extra channels)
  [image ; c_map] ──→ D ──→ real/fake

Option 2: projection discriminator
  D(image) · embed(c) ──→ real/fake score
```
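The concatenation option for G is just array stacking. A shape-level NumPy sketch with hypothetical sizes (batch 8, 100-dim noise, 10 classes):

```python
import numpy as np

batch, latent_dim, n_classes = 8, 100, 10

rng = np.random.default_rng(0)
z = rng.standard_normal((batch, latent_dim))      # noise
labels = rng.integers(0, n_classes, size=batch)   # desired classes
c = np.eye(n_classes)[labels]                     # one-hot condition

g_input = np.concatenate([z, c], axis=1)          # [z ; c], fed to G
print(g_input.shape)   # (8, 110)
```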

Examples of Conditional GANs

| Model | Condition | Task |
|---|---|---|
| cGAN | Class label | Generate a specific class |
| Pix2Pix | Input image | Image-to-image translation |
| CycleGAN | Domain | Unpaired translation (horse↔zebra) |
| StyleGAN | Style vectors | Control attributes (age, pose) |
| GAN + CLIP | Text | Text-to-image |
Key Insight

Conditional GAN lets you control generation by telling both G and D what you want. G learns to generate that condition, D learns to verify "is this a real example of that condition?"

17 GAN Latent Space: No Guarantees Like VAE

🤔 Question

"Do GANs guarantee smooth interpolation like VAE? What's to say the latent noise space doesn't have dead zones?"

✓ Short Answer

You're right to be suspicious. GAN provides no theoretical guarantee of a smooth, hole-free latent space.

Why VAE's Latent Space Works

```
VAE training:
1. Encoder maps images → distributions near N(0, I)  (the KL term forces this)
2. Decoder sees samples from the ENTIRE N(0, I) region  (because the encoder covers it)
3. At generation: sample z ~ N(0, I)
   → the decoder has seen this region → no dead zones ✓
4. Interpolation: any path between two z's
   → the decoder knows the intermediate points → smooth morphing ✓
```

Why GAN Has No Such Guarantee

```
GAN training:
1. G receives z ~ N(0, I)
2. G learns to map z → realistic image
3. BUT: G might only "use" certain regions of z space!

   z space:  ████████████████████
               ↑    ↑      ↑   ↑
             G uses these spots, ignores the rest

4. At generation: sample z ~ N(0, I)
   → might land in a region G never learned
   → could produce garbage (a dead zone)
5. Interpolation: a path between two z's
   → might pass through unused regions
   → no guarantee of smooth morphing
```

Why Mode Collapse = Dead Zones

Mode collapse is exactly this problem:

```
Healthy G: uses the full z space, maps different z to different images
  z₁ → cat, z₂ → dog, z₃ → bird, ...

Mode-collapsed G: maps ALL z to the same few images
  z₁ → cat, z₂ → cat, z₃ → cat, ...
  Almost the entire z space is "mapped to cat".
  No diversity; dead zones everywhere.
```

Empirically, It Often Works (Kinda)

In practice, well-trained GANs (especially StyleGAN) do show reasonable interpolation. Why? G is a continuous network, so nearby z's map to similar images, and training draws z densely from N(0, I), so most of the prior's mass receives gradient signal.

StyleGAN's Solution: Mapping Network

StyleGAN doesn't sample z directly. It maps z → w through a learned network:

```
StyleGAN:
  z ~ N(0, I) ──→ Mapping Network ──→ w ──→ G ──→ image
                  (8 FC layers)

Why this helps:
- w space is more "disentangled" (each dimension ≈ one factor of variation)
- the mapping network can warp z space to be smoother
- interpolation in w space works better than in raw z
```
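One concrete reason naive interpolation can wander off the prior: in high dimensions, samples from N(0, I) concentrate on a shell of radius about √d, while the linear midpoint of two independent samples has norm about √(d/2), roughly 29% shorter, so it lands in a region the generator may rarely have seen. A NumPy sketch:

```python
import numpy as np

d = 512                                  # StyleGAN-scale latent dimension
rng = np.random.default_rng(0)
z0 = rng.standard_normal(d)
z1 = rng.standard_normal(d)

mid = 0.5 * (z0 + z1)                    # naive linear-interpolation midpoint

print(np.linalg.norm(z0) / np.sqrt(d))   # ≈ 1.0 (on the typical shell)
print(np.linalg.norm(mid) / np.sqrt(d))  # ≈ 0.71 (off the shell)
```

This is one reason spherical interpolation (slerp) is often preferred over lerp for GAN latents.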

VAE vs GAN: Latent Space Comparison

| Property | VAE | GAN |
|---|---|---|
| Smooth interpolation | Guaranteed (KL forces coverage) | No guarantee (empirically ok) |
| Dead zones | Unlikely (decoder sees full prior) | Possible (mode collapse) |
| Inference | Can encode x → z (encoder exists) | No encoder (needs extra work) |
| Disentanglement | Encouraged by KL (not guaranteed) | None (StyleGAN adds a mapping net) |
🤔 Follow-up

"Wait, GAN has no encoder? How do I find z for a given image?"

✓ Resolution

Correct — vanilla GAN has no encoder. To find z for a given image, you need either:

- optimization: start from a random z and run gradient descent on $\|G(z) - x\|$ (often called GAN inversion), or
- a separately trained encoder that maps images back to z.

This is a real practical disadvantage of GANs vs VAEs.

18 VAE vs GAN: Final Comparison

| Aspect | VAE | GAN |
|---|---|---|
| Loss | MSE + KL (hand-crafted) | Learned by D |
| Training | Stable (just minimize) | Unstable (adversarial game) |
| Output quality | Blurry | Sharp |
| Mode coverage | Good | Poor (collapse risk) |
| Likelihood | Yes (ELBO) | No |
| Encoder | Yes (x → z) | No (needs optimization or a separate encoder) |
| Latent space | Guaranteed smooth (KL forces coverage) | No guarantee (empirically often ok) |
| Interpolation | Works (decoder sees full prior) | Usually works (no guarantee) |
| Conditional generation | Encode a similar x, or add a condition to the decoder | cGAN: add the condition to G and D |
| Theory | ELBO ≤ log p(x) | Minimizes JS/Wasserstein |
🎯 The Narrative

"VAE uses MSE which is pixel-wise and doesn't match human perception — it rewards blurry averages. GANs replace this with a learned loss: the discriminator learns what 'real' looks like from data, then guides the generator.

The minimax game makes G minimize JS divergence between real and generated distributions. But JS has a problem: it's constant when distributions don't overlap, giving zero gradient early in training.

Wasserstein distance fixes this by measuring actual transport cost between distributions. Even non-overlapping distributions get meaningful gradients. WGAN uses a Lipschitz-constrained critic to approximate Wasserstein distance.

Conditional GANs add a condition (class, text, image) to both G and D, letting you control what gets generated. D learns to verify 'is this a real example of that condition?'

Unlike VAE, GAN has no encoder and no guarantee of smooth latent space. The KL term in VAE forces coverage of N(0,I); GAN has no such constraint, making mode collapse and dead zones possible. Empirically, well-trained GANs (especially StyleGAN) do interpolate reasonably, but there's no theoretical guarantee."

Divergence Comparison
| Divergence | Non-overlapping value | Gradient | Used in |
|---|---|---|---|
| KL | ∞ | Undefined | VAE |
| Symmetric KL | ∞ | Undefined | – |
| JS | log 2 (constant) | Zero | Original GAN |
| Wasserstein | Actual distance | Useful! | WGAN |

Next up: Diffusion Models — Sharp images like GANs, stable training like VAEs.