This penalizes each pixel independently. The problem:
Ground truth: Sharp whisker at pixel (50, 73)
Decoder uncertainty: "Whisker could be at (50, 73) or (50, 74)... not sure"
What MSE rewards: Output 0.5 gray at BOTH pixels
This minimizes expected squared error!
Result: Blurry whisker
When uncertain about exact positions, MSE rewards hedging — outputting the average. Averages
are blurry.
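A tiny numpy sketch of that hedging effect (the two-pixel "whisker" setup is made up for illustration):

```python
import numpy as np

# Toy version of the whisker example: the true image is one of two
# one-pixel-apart whiskers, each equally likely.
t1 = np.array([1.0, 0.0])  # whisker at pixel 73
t2 = np.array([0.0, 1.0])  # whisker at pixel 74

def expected_mse(output):
    """Expected MSE when each ground truth occurs half the time."""
    return 0.5 * np.mean((output - t1) ** 2) + 0.5 * np.mean((output - t2) ** 2)

committed = np.array([1.0, 0.0])  # a sharp guess: whisker at 73
hedged = np.array([0.5, 0.5])     # gray smear over both pixels

print(expected_mse(committed))  # 0.5
print(expected_mse(hedged))     # 0.25 -- the blur wins under MSE
```

The hedged (blurry) output halves the expected loss, so MSE actively prefers it.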
MSE Treats All Errors Equally
Error type 1: Whisker shifted 1 pixel left
MSE: LARGE (many pixels wrong)
Human: "Looks fine, still a cat"
Error type 2: Whisker in right place but plastic-looking texture
MSE: small (few pixels different)
Human: "Looks fake, uncanny"
MSE is completely misaligned with human perception.
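A quick numpy illustration of the mismatch (image size and noise level are arbitrary):

```python
import numpy as np

# A "whisker": a single bright column in an otherwise dark image.
img = np.zeros((8, 8))
img[:, 3] = 1.0

# Error type 1: same whisker, shifted one pixel.
shifted = np.roll(img, 1, axis=1)
# Error type 2: whisker in place, but texture slightly off.
textured = img + 0.05 * np.random.default_rng(0).standard_normal(img.shape)

mse_shift = np.mean((img - shifted) ** 2)    # every whisker pixel disagrees -> large
mse_texture = np.mean((img - textured) ** 2) # tiny per-pixel noise -> small
print(mse_shift, mse_texture)  # the perceptually harmless error scores far worse
```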
2 The IQA Problem: No Good Metric Exists
What We Want
Image Quality Assessment (IQA): A function that tells us "how good/realistic is this image?"
Human perception is incredibly complex. We instantly know "that face looks off" but can't write down why
mathematically.
| Metric | What It Measures | Problem |
|---|---|---|
| MSE / L2 | Pixel-wise difference | Shift by 1 pixel = huge error |
| SSIM | Structural similarity | Still pixel-based, misses semantics |
| PSNR | Signal-to-noise ratio | Derived from MSE, same issues |
| Perceptual loss | Feature-space difference | Better, but still hand-designed |
Every hand-crafted metric has blind spots.
🤔 The Core Insight
"There's no ideal IQA metric. MSE is shit — it's just Euclidean distance of images, which doesn't align
with human understanding. So instead, we train a discriminator to be the oracle in place of humans."
3 The GAN Insight: Learn the Metric
Key Insight
We can't write down the perfect IQA metric, but we can train a neural network to
approximate human judgment.
Traditional approach:
Hand-craft metric (MSE, SSIM, etc.)
↓
Always imperfect, has blind spots
GAN approach:
Train a network (Discriminator) to distinguish real from fake
↓
D learns what "real" means from data
↓
D's output becomes the loss function
The Discriminator as Learned Oracle
D sees thousands of real images. It learns the statistics of "realness":
- Real images have sharp edges
- Real faces have certain proportions
- Real textures have certain patterns
When G produces something fake, D spots the deviation from these learned statistics.
D is a learned oracle that approximates human judgment.
# Each training step:
# Step 1: Train D (hold G fixed)
optimizer_D.zero_grad()
real_images = next(data_iter)        # data_iter = iter(dataloader)
z = torch.randn(batch_size, latent_dim)
fake_images = G(z).detach()          # detach: don't update G here
eps = 1e-8                           # keep log() away from log(0)
loss_D = -torch.mean(torch.log(D(real_images) + eps)) \
         - torch.mean(torch.log(1 - D(fake_images) + eps))
loss_D.backward()
optimizer_D.step()

# Step 2: Train G (hold D fixed)
optimizer_G.zero_grad()
z = torch.randn(batch_size, latent_dim)
fake_images = G(z)
loss_G = -torch.mean(torch.log(D(fake_images) + eps))  # non-saturating: fool D
loss_G.backward()
optimizer_G.step()
🤔 Question
"Why alternate? Why not update both simultaneously?"
Reason 1: Opposite Goals
D wants to maximize the objective. G wants to minimize it.
If you compute gradients simultaneously:
D's gradient: "Push this direction to classify better"
G's gradient: "Push opposite direction to fool D"
They're fighting over the same computation graph!
Reason 2: Moving Target
G's loss depends on current D. If D changes simultaneously:
Alternating:
Step 1: Fix G, update D for current fakes
Step 2: Fix D, update G for current D
Each sees a stable target.
Simultaneous:
G computes gradient assuming D(fake) = 0.3
D updates, now D(fake) = 0.5
G's gradient was computed for wrong D!
Reason 3: Game Theory
This is a two-player game, not joint optimization.
Joint optimization:          Game (minimax):

      loss                         loss
    ╲      ╱                     ╱   │   ╲
     ╲    ╱                     ╱    ×    ╲  ← saddle point
      •                        ╱           ╲
   minimum

 Gradient descent             Gradient descent
 finds minimum ✓              spirals around saddle ✗
Alternating updates approximate finding the saddle point better than simultaneous.
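A toy example of this: on the classic saddle game min_x max_y xy, simultaneous gradient descent-ascent spirals away from the saddle at (0, 0), while alternating updates stay bounded. The learning rate and step count here are arbitrary:

```python
import math

lr = 0.1

# Simultaneous: both players update from the SAME old (x, y).
x, y = 1.0, 1.0
for _ in range(200):
    x, y = x - lr * y, y + lr * x   # RHS evaluated with old values
sim_norm = math.hypot(x, y)

# Alternating: x moves first, then y reacts to the NEW x.
x, y = 1.0, 1.0
for _ in range(200):
    x = x - lr * y
    y = y + lr * x
alt_norm = math.hypot(x, y)

print(sim_norm)  # grows: spirals away from the saddle
print(alt_norm)  # stays bounded: circles near the saddle
```

Each simultaneous step multiplies the squared distance from the saddle by (1 + lr²), so it can only diverge; the alternating scheme keeps it bounded.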
With the mixture m = ½(p_data + p_g):

p_data:  ████████
                          (no overlap)
p_g:                  ████████

m:       ████████     ████████   ← covers both!

KL(p_data || m) = finite (log 2)
KL(p_g || m)    = finite (log 2)
JS = finite (specifically, log 2 ≈ 0.69)
JS Is Bounded
$$0 \leq D_{JS}(p \| q) \leq \log 2$$
| Divergence | Non-overlapping | Bounded? |
|---|---|---|
| KL(p ‖ q) | ∞ | No |
| Symmetric KL | ∞ | No |
| JS | log 2 ≈ 0.69 | Yes, [0, log 2] |
Did the Authors Choose JS?
No! JS emerges from the GAN objective.
The authors defined the game (train D to classify, train G to fool). Then proved: solving this game = G
minimizing JS.
JS wasn't chosen — it fell out of the math.
Part IV
GAN Problems
10 Mode Collapse
G finds ONE thing that fools D and keeps making it.
Training step 1000: G outputs a decent cat
D is fooled
Training step 2000: G still outputs the SAME cat
Why learn anything else? It works!
Training step 5000: G outputs slight variations of one cat
No dogs, no birds, no variety
The generator "collapses" to a few modes instead of covering the full distribution.
Why It Happens
G's objective is to fool D, one sample at a time. If one image fools D, G has no incentive to diversify: nothing in the loss rewards covering all of p_data, so a G stuck on a few realistic modes can still score well.
11 Training Instability
The game can go wrong in many ways:
D Too Strong
D perfectly classifies everything:
D(real) = 1.0
D(fake) = 0.0
G's loss term: log(1 − D(G(z))) = log(1 − 0) = log(1) = 0, and its gradient w.r.t. G's parameters saturates to ≈ 0
G can't learn! The loss surface is flat.
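This saturation is easy to see with autograd. The sketch below compares the saturating loss log(1 − D(fake)) with the non-saturating −log D(fake) trick, at a logit where D confidently rejects the fake (the logit value is arbitrary):

```python
import torch

# D outputs sigmoid(logit). When D confidently rejects a fake,
# the logit is very negative, so D(fake) ≈ 0.
logit = torch.tensor(-10.0, requires_grad=True)

# Saturating G loss from the minimax objective: log(1 - D(fake)).
loss_sat = torch.log(1 - torch.sigmoid(logit))
grad_sat, = torch.autograd.grad(loss_sat, logit)

# Non-saturating trick: -log D(fake).
logit2 = torch.tensor(-10.0, requires_grad=True)
loss_ns = -torch.log(torch.sigmoid(logit2))
grad_ns, = torch.autograd.grad(loss_ns, logit2)

print(grad_sat.item())  # tiny: almost no learning signal
print(grad_ns.item())   # ≈ -1: strong signal even when D is winning
```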
G Too Strong
G perfectly fools D:
D(fake) = 1.0
D's gradient for fakes: ∂/∂D log(1 - 1) = ∂/∂D log(0) = undefined
D can't learn from fakes!
Oscillation
Step 100: D learns to catch G's fakes
Step 200: G adapts, makes new fakes D can't catch
Step 300: D adapts to new fakes
Step 400: G adapts again
...
Neither converges, just oscillates forever
12 The JS Gradient Problem
Remember: JS is bounded, which seems good. But it's also a problem.
G produces garbage at position "far away":
p_data: ████████
(big gap)
p_g: ████████
JS = log 2 (constant)
Gradient = 0
G: "Which direction should I move?"
JS: "I dunno. You're just... different."
G: *moves randomly, learns nothing*
Early in training, G produces random noise that doesn't overlap with real images. JS gives no signal about
which direction to improve.
The Core Problem: JS is bounded (good: no explosion), but constant when distributions don't
overlap (bad: no gradient).
Part V
Wasserstein to the Rescue
13 Earth Mover's Distance: The Intuition
Let's forget math and think about sand.
Two Piles of Sand
Pile A (real data): Located at position 0
████
────────────────────
0 5 10
Pile B (generator): Located at position 10
████
────────────────────
0 5 10
How "different" are these piles?
What KL and JS Say
KL: "Do they overlap? NO. → ∞"
JS: "Do they overlap? NO. → log(2) = 0.69"
Neither tells us how far apart they are.
An Obvious Problem
Case 1: A at 0, B at 10        Case 2: A at 0, B at 100

████          ████             ████                    ████
──────────────────             ─────────────────────────────
0      5     10                0          50          100

KL: ∞                          KL: ∞      (same!)
JS: 0.69                       JS: 0.69   (same!)
KL and JS say these are equally different. But B at 100 is obviously farther!
Wasserstein: Measure the Actual Work
Wasserstein distance (Earth Mover's Distance):
"How much work to move one pile of sand to match the other?"
Work = amount of sand × distance moved.
Case 1: Move pile from 10 to 0
Work = (amount) × (distance) = 1 × 10 = 10
Case 2: Move pile from 100 to 0
Work = 1 × 100 = 100
Wasserstein says: Case 2 is 10× farther. Makes sense!
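The same numbers fall out of SciPy's 1-D implementation, where each pile is represented by its sample positions:

```python
from scipy.stats import wasserstein_distance

# Each "pile of sand" is an empirical distribution over positions.
pile_A = [0.0]      # real data at position 0
pile_B1 = [10.0]    # generator at position 10
pile_B2 = [100.0]   # generator at position 100

print(wasserstein_distance(pile_A, pile_B1))   # 10.0
print(wasserstein_distance(pile_A, pile_B2))   # 100.0  (10x farther)
```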
14 Why Wasserstein Gives Better Gradients
The JS Problem
G produces fakes at position 10.
Real data at position 0.
No overlap.
JS = constant = 0.69
Gradient of JS w.r.t. G's position = 0
G doesn't know which way to move!
Wasserstein Gives Direction
G produces fakes at position 10.
Real data at position 0.
W = 10
Gradient of W w.r.t. G's position = -1 (move left!)
G knows: move toward 0.
Even when distributions don't overlap, Wasserstein tells G: "Move this direction, you'll get
closer."
Key Insight
Wasserstein measures actual distance, not just overlap. This gives meaningful gradients
everywhere.
Intuition: G Trying to Find Real Data
Real data at position 0.
G starts at position 100.
With JS:
G: "Where do I go?"
JS: "You're different from real data."
G: "But which direction??"
JS: "Just... different. 0.69 different."
G: *moves randomly, makes no progress*
With Wasserstein:
G: "Where do I go?"
W: "You're 100 units away. Move LEFT."
G: *moves left*
W: "Now you're 90 units away. Keep going."
G: *eventually reaches real data*
Gradient penalty (WGAN-GP): rather than clipping weights to enforce the 1-Lipschitz constraint, directly penalize critic gradients that aren't magnitude 1.
Part VI
Conditional Generation & Latent Space
16 Conditional GANs: Controlling What You Generate
🤔 Question
"VAE can take an input x and generate something similar. But GAN just generates random stuff from noise.
How do I control what it generates?"
The Problem with Vanilla GAN
Vanilla GAN:
z ~ N(0, I) ──→ G ──→ random image
You get whatever G decides to generate.
No control over "give me a cat" vs "give me a dog".
Conditional GAN (cGAN)
Add a condition $c$ to both G and D:
Conditional GAN:
z ~ N(0, I) ──┐
├──→ G(z, c) ──→ fake image of class c
condition c ──┘
D now checks: "Is this a REAL [condition]?"
D(image, c) → "Is this a real cat?" (not just "is this real?")
Option 1: Concatenate to z
[z ; c] ──→ G ──→ image
Option 2: Embed c and add to intermediate layers
z ──→ G_layer1 ──→ (+embed(c)) ──→ G_layer2 ──→ ...
For D:
Option 1: Concatenate c to image (as extra channels)
[image ; c_map] ──→ D ──→ real/fake
Option 2: Projection discriminator
D(image) · embed(c) ──→ real/fake score
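Option 1 for G can be sketched in a few lines of PyTorch (all sizes and the `ConditionalG` name are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, n_classes, img_dim = 16, 10, 28 * 28

class ConditionalG(nn.Module):
    """Minimal cGAN generator: concatenate a one-hot class label to z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_classes, 128),
            nn.ReLU(),
            nn.Linear(128, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        c = F.one_hot(labels, n_classes).float()   # condition as one-hot
        return self.net(torch.cat([z, c], dim=1))  # [z ; c] -> image

G = ConditionalG()
z = torch.randn(4, latent_dim)
labels = torch.tensor([3, 3, 7, 7])   # "give me two 3s and two 7s"
imgs = G(z, labels)
print(imgs.shape)   # torch.Size([4, 784])
```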
Examples of Conditional GANs
| Model | Condition | Task |
|---|---|---|
| cGAN | Class label | Generate specific class |
| Pix2Pix | Input image | Image-to-image translation |
| CycleGAN | Domain | Unpaired translation (horse↔zebra) |
| StyleGAN | Style vectors | Control attributes (age, pose) |
| GAN + CLIP | Text | Text-to-image |
Key Insight
Conditional GAN lets you control generation by telling both G and D what you want. G learns to generate
that condition, D learns to verify "is this a real example of that condition?"
17 GAN Latent Space: No Guarantees Like VAE
🤔 Question
"Do GANs guarantee smooth interpolation like VAE? What's to say the latent noise space doesn't have dead
zones?"
✓ Short Answer
You're right to be suspicious. GAN provides no theoretical guarantee of a smooth,
hole-free latent space.
Why VAE's Latent Space Works
VAE training:
1. Encoder maps images → distributions near N(0, I)
(KL term forces this)
2. Decoder sees samples from ENTIRE N(0, I) region
(because encoder covers it)
3. At generation: sample z ~ N(0, I)
→ Decoder has seen this region
→ No dead zones ✓
4. Interpolation: any path between two z's
→ Decoder knows intermediate points
→ Smooth morphing ✓
Why GAN Has No Such Guarantee
GAN training:
1. G receives z ~ N(0, I)
2. G learns to map z → realistic image
3. BUT: G might only "use" certain regions of z space!
z space:
████████████████████
↑ ↑ ↑ ↑
G uses these spots, ignores the rest
4. At generation: sample z ~ N(0, I)
→ Might land in region G never learned
→ Could produce garbage (dead zone)
5. Interpolation: path between two z's
→ Might pass through unused regions
→ No guarantee of smooth morphing
Why Mode Collapse = Dead Zones
Mode collapse is exactly this problem:
Healthy G:
Uses full z space, maps different z to different images
z₁ → cat, z₂ → dog, z₃ → bird, ...
Mode-collapsed G:
Maps ALL z to same few images
z₁ → cat, z₂ → cat, z₃ → cat, ...
Almost entire z space is "mapped to cat"
No diversity, dead zones everywhere
Empirically, It Often Works (Kinda)
In practice, well-trained GANs (especially StyleGAN) do show reasonable interpolation. Why?
- Continuous function: G is a neural net, inherently smooth
- Regularization: some GAN variants add regularization that helps
- Enough capacity: a big G with lots of data tends to use more of z space
- Luck: sometimes it just works; there's no guarantee
StyleGAN's Solution: Mapping Network
StyleGAN doesn't sample z directly. It maps z → w through a learned network:
StyleGAN:
z ~ N(0, I) ──→ Mapping Network ──→ w ──→ G ──→ image
(8 FC layers)
Why this helps:
- w space is "disentangled" (each dimension = one factor)
- Mapping network can warp z space to be smoother
- Interpolation in w space works better than raw z
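Latent interpolation itself is just a straight line in z (or w) space; whether the decoded frames morph smoothly is entirely up to G. A sketch with a stand-in generator (any trained G with the same signature works the same way):

```python
import torch

def interpolate(G, z0, z1, steps=8):
    """Decode evenly spaced points on the line between z0 and z1."""
    frames = []
    for t in torch.linspace(0, 1, steps):
        z_t = (1 - t) * z0 + t * z1
        frames.append(G(z_t))
    return torch.stack(frames)

# Stand-in generator, for illustration only:
G = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh())
z0, z1 = torch.randn(16), torch.randn(16)
frames = interpolate(G, z0, z1)
print(frames.shape)   # torch.Size([8, 64])
```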
VAE vs GAN: Latent Space Comparison
| Property | VAE | GAN |
|---|---|---|
| Smooth interpolation | Guaranteed (KL forces coverage) | No guarantee (empirically ok) |
| Dead zones | Unlikely (decoder sees full prior) | Possible (mode collapse) |
| Inference | Can encode x → z (encoder exists) | No encoder (needs extra work) |
| Disentanglement | Encouraged by KL (not guaranteed) | None (StyleGAN adds mapping net) |
🤔 Follow-up
"Wait, GAN has no encoder? How do I find z for a given image?"
✓ Resolution
Correct — vanilla GAN has no encoder. To find z for an image, you need:
1. Optimization: start from a random z and minimize $\|G(z) - x\|^2$
2. Train an encoder: separately train E such that E(G(z)) ≈ z (e.g., BiGAN, ALI)
3. Hybrid: use VAE-GAN (VAE encoder + GAN discriminator)
This is a real practical disadvantage of GANs vs VAEs.
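The optimization route (option 1) can be sketched as follows, using a stand-in G and arbitrary sizes:

```python
import torch

# Recover z for a target image x by gradient descent on ||G(z) - x||^2.
# G here is a frozen stand-in; a trained generator works the same way.
G = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Tanh())
for p in G.parameters():
    p.requires_grad_(False)   # freeze G; only z is optimized

x = G(torch.randn(8))                    # a target G can actually produce
z = torch.randn(8, requires_grad=True)   # start from a random latent
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    loss = ((G(z) - x) ** 2).mean()
    loss.backward()
    opt.step()

print(loss.item())   # near zero: G(z) ≈ x
```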
18 VAE vs GAN: Final Comparison
| Aspect | VAE | GAN |
|---|---|---|
| Loss | MSE + KL (hand-crafted) | Learned by D |
| Training | Stable (just minimize) | Unstable (adversarial game) |
| Output quality | Blurry | Sharp |
| Mode coverage | Good | Poor (collapse risk) |
| Likelihood | Yes (ELBO) | No |
| Encoder | Yes (x → z) | No (needs optimization or a separate encoder) |
| Latent space | Guaranteed smooth (KL forces coverage) | No guarantee (empirically often ok) |
| Interpolation | Works (decoder sees full prior) | Usually works (no guarantee) |
| Conditional generation | Encode a similar x, or add a condition to the decoder | cGAN: add the condition to G and D |
| Theory | ELBO ≤ log p(x) | Minimizes JS/Wasserstein |
🎯 The Narrative
"VAE uses MSE which is pixel-wise and doesn't match human perception — it rewards blurry averages. GANs
replace this with a learned loss: the discriminator learns what 'real' looks like from data, then guides
the generator.
The minimax game makes G minimize JS divergence between real and generated distributions. But JS has a
problem: it's constant when distributions don't overlap, giving zero gradient early in training.
Wasserstein distance fixes this by measuring actual transport cost between distributions. Even
non-overlapping distributions get meaningful gradients. WGAN uses a Lipschitz-constrained critic to
approximate Wasserstein distance.
Conditional GANs add a condition (class, text, image) to both G and D, letting you control what gets
generated. D learns to verify 'is this a real example of that condition?'
Unlike VAE, GAN has no encoder and no guarantee of smooth latent space. The KL term in VAE forces
coverage of N(0,I); GAN has no such constraint, making mode collapse and dead zones possible.
Empirically, well-trained GANs (especially StyleGAN) do interpolate reasonably, but there's no
theoretical guarantee."
Divergence Comparison
| Divergence | Non-overlapping | Gradient | Used in |
|---|---|---|---|
| KL | ∞ | Undefined | VAE |
| Symmetric KL | ∞ | Undefined | — |
| JS | log 2 (constant) | Zero | Original GAN |
| Wasserstein | Actual distance | Useful! | WGAN |
Next up: Diffusion Models — Sharp images like GANs, stable training like VAEs.