From Autoencoders to ELBO — Including Every "Wait, But Why?" Along the Way
We have a dataset of images (faces, cats, molecules). We want to learn a model of that data good enough to generate new, realistic examples.
$p(x)$ is the probability distribution over data. It answers: "How likely is this particular image?"
$p(x)$ is the "true" distribution that generated our training data. We never know it exactly — we only have samples from it (our dataset).
"Why does maximizing p(x) help anyone? That's just the probability of the original data, right?"
We're not computing p(x) for existing data — we're learning a model that assigns high probability to realistic images.
Think of it this way:
If we can model $p(x)$ well, we can sample brand-new realistic images from it and score how plausible any given image is.
"Maximize p(x)" = "Train a model that thinks our training data is likely." If the model thinks training images are likely, it has learned what realistic images look like.
Before VAE, let's understand regular autoencoders.
| Symbol | Name | What It Is |
|---|---|---|
| $x$ | Input | Original data (e.g., a 784-dim image) |
| $z$ | Latent code | Compressed representation (e.g., 32-dim) |
| $\hat{x}$ | Reconstruction | Decoder's attempt to recreate x |
| Encoder $f_\phi$ | $x \rightarrow z$ | Neural net that compresses |
| Decoder $g_\theta$ | $z \rightarrow \hat{x}$ | Neural net that reconstructs |
Simple: minimize reconstruction error. The bottleneck forces compression — the encoder must keep only essential features.
The encoder maps each input to an arbitrary point in latent space. There's no structure: points are scattered wherever makes reconstruction easiest.
To generate new images, you'd have to pick some z yourself and run it through the decoder. But if you pick z = [10, 10, 10], the decoder has never seen this region. It outputs garbage.
Instead of encoding to a point, encode to a distribution.
The encoder outputs two things: μ and σ. Then we sample z from $\mathcal{N}(\mu, \sigma^2)$.
VAE uses probabilistic language. Here's what each term means:
| Symbol | Name | What It Is | In VAE |
|---|---|---|---|
| $x$ | Data | Observable (the image) | Input image |
| $z$ | Latent | Hidden variable | Compressed code |
| $p(z)$ | Prior | Our belief about z before seeing any data | $\mathcal{N}(0, I)$ — standard Gaussian |
| $p(x|z)$ | Likelihood | Probability of data given latent | Decoder: given z, how likely is x? |
| $p(z|x)$ | True Posterior | Belief about z after seeing data x | Intractable (can't compute) |
| $q(z|x)$ | Approximate Posterior | Our approximation to the true posterior | Encoder: $\mathcal{N}(\mu(x), \sigma(x)^2)$ |
This is a choice. We pick $\mathcal{N}(0, I)$ because it's easy to sample from and its KL divergence against a Gaussian encoder has a closed form.
The prior $p(z) = \mathcal{N}(0, I)$ is what we'll sample from during generation. The KL term forces the encoder's output (approximate posterior) to stay close to this prior.
We want our model to assign high probability to training data:

$$\max_\theta \sum_i \log p_\theta(x_i)$$
Why log? Logs turn products into sums (easier math), and maximizing log(p) is the same as maximizing p.
To compute $p(x)$, we'd need to integrate over all possible latents:

$$p(x) = \int p(x|z)\, p(z)\, dz$$
This says: "Sum up the probability of x for every possible z, weighted by how likely each z is."
This integral is impossible to compute — z is continuous and high-dimensional.
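To see what this integral is asking for, here is a minimal numpy sketch of the naive Monte Carlo estimator in a toy 1-D model where the answer is known in closed form (the model and all names here are my own, purely for illustration). In 1-D it works; with a high-dimensional z, almost every sample contributes essentially zero and the estimator becomes hopeless.

```python
import numpy as np

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 0.5^2).
# Then p(x) = N(0, 1 + 0.25) analytically, so we can check the estimate.
rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.7
z = rng.standard_normal(100_000)             # z ~ p(z)
p_x_mc = gauss_pdf(x, z, 0.5).mean()         # (1/N) sum_i p(x|z_i)
p_x_true = gauss_pdf(x, 0.0, np.sqrt(1.25))  # closed form for comparison

print(p_x_mc, p_x_true)  # close in 1-D; useless once z is high-dimensional
```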
We can't compute $\log p(x)$, but we can find something smaller that we can compute, and maximize that instead.
Evidence Lower Bound:

$$\log p(x) \;\geq\; \text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x)\,\|\,p(z)\big)$$

Since ELBO ≤ log p(x), maximizing the ELBO pushes up on log p(x).
This is where signs get confusing. Let's be very explicit.
We want to maximize this. But deep learning frameworks minimize losses, so we flip the sign:

$$\mathcal{L} = -\text{ELBO} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}\big(q(z|x)\,\|\,p(z)\big)$$

Now we minimize $\mathcal{L}$.
Starting with the first term: $-\mathbb{E}_{q(z|x)}[\log p(x|z)]$

If we model the decoder as Gaussian (details in the next chapter), then $\log p(x|z) = -\tfrac{1}{2}\|x - \hat{x}\|^2 + \text{const}$.

So:

$$-\mathbb{E}_{q(z|x)}[\log p(x|z)] = \tfrac{1}{2}\,\mathbb{E}_{q(z|x)}\big[\|x - \hat{x}\|^2\big] + \text{const}$$

This is the reconstruction (MSE) term. Minimize this.
| Expression | What to Do |
|---|---|
| log p(x) | Maximize |
| ELBO = E[log p(x|z)] - KL | Maximize (it's a lower bound) |
| -ELBO = -E[log p(x|z)] + KL | Minimize (flip the sign for a loss) |
| Loss = MSE + KL | Minimize (both terms positive, both push down) |
The ELBO has a minus sign in front of KL. When we negate to make a loss, that minus becomes plus. So in the final loss, both MSE and KL are added (both should be small).
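Putting the two terms together, here is a minimal numpy sketch of the final loss for a single example, assuming the standard closed-form Gaussian KL (derived later in this section); the function name and `logvar` parameterization are my own conventions, not a fixed API.

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO for one example: reconstruction MSE + KL(q || N(0, I)).

    mu, logvar parameterize q(z|x) = N(mu, exp(logvar));
    predicting log-variance keeps sigma^2 positive by construction.
    """
    mse = np.sum((x - x_hat) ** 2)
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over dims:
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
    return mse + kl

# Sanity check: perfect reconstruction + q equal to the prior gives loss 0.
x = np.array([0.2, 0.8])
print(vae_loss(x, x, mu=np.zeros(3), logvar=np.zeros(3)))  # 0.0
```

Both terms are non-negative, so the loss can only reach zero when reconstruction is perfect *and* the encoder matches the prior, which is exactly the tension the rest of this section explores.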
Let's decode this notation piece by piece.
The subscript tells you where the samples come from:

$$\mathbb{E}_{q(z|x)}[f(z)] = \int q(z|x)\, f(z)\, dz \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f(z_i), \quad z_i \sim q(z|x)$$

In words: sample z from $q(z|x)$, compute $f(z)$, average over many samples.
"So $\mathbb{E}_{q(z|x)}[\log p(x|z)]$ means we average over z obtained from q... but by using an x? Then aren't we also averaging over x indirectly?"
No — x is fixed here. It's one specific image.
To average over x too, you'd need an outer expectation:

$$\mathbb{E}_{x \sim \text{data}}\Big[\mathbb{E}_{q(z|x)}[\log p(x|z)]\Big]$$
The outer loop is your dataloader; the inner is the z sampling.
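That two-loop structure can be written out directly. The sketch below uses a toy hand-made "encoder" and "decoder" (both my own placeholders, not real networks) just to make the nesting concrete: outer loop over x, inner Monte Carlo loop over z.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = [np.array([0.1, 0.9]), np.array([0.5, 0.5])]  # stand-in dataloader

def encoder(x):
    # Toy placeholder for q(z|x): returns (mu, sigma)
    return x.mean() * np.ones(2), np.ones(2)

def log_p_x_given_z(x, z):
    # Toy Gaussian decoder with x_hat = z, sigma = 1, constants dropped
    return -0.5 * np.sum((x - z) ** 2)

N = 1000
outer = 0.0
for x in dataset:                        # outer expectation: x ~ data
    mu, sigma = encoder(x)
    inner = 0.0
    for _ in range(N):                   # inner expectation: z ~ q(z|x)
        z = mu + sigma * rng.standard_normal(2)
        inner += log_p_x_given_z(x, z)
    outer += inner / N
outer /= len(dataset)
print(outer)  # Monte Carlo estimate of E_x[ E_{q(z|x)}[log p(x|z)] ]
```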
Step 1: Model the decoder as a Gaussian

$$p(x|z) = \mathcal{N}\big(x;\ \hat{x},\ \sigma^2 I\big)$$

where $\hat{x} = \text{decoder}(z)$.

Step 2: Write out the log probability of a Gaussian

$$\log p(x|z) = -\frac{\|x - \hat{x}\|^2}{2\sigma^2} - \frac{d}{2}\log(2\pi\sigma^2)$$

Step 3: With σ fixed (usually σ = 1), drop the constants

$$\log p(x|z) = -\tfrac{1}{2}\|x - \hat{x}\|^2 + \text{const}$$

Step 4: So maximizing this log-likelihood = minimizing MSE
We need to sample z from $q(z|x) = \mathcal{N}(\mu, \sigma^2)$, but sampling is random — how do we backpropagate?
The randomness blocks gradients.
Rewrite sampling as a deterministic function plus external noise:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
"How is this sampling from a Gaussian?"
This is sampling from a Gaussian. It's mathematically equivalent: if $\epsilon \sim \mathcal{N}(0, 1)$, then $\mu + \sigma\epsilon \sim \mathcal{N}(\mu, \sigma^2)$.
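A quick empirical sanity check of that equivalence (a throwaway numpy sketch, values chosen arbitrarily): shift and scale standard-normal noise, and the result has exactly the target mean and standard deviation, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

eps = rng.standard_normal(200_000)  # external noise: no learnable parameters here
z = mu + sigma * eps                # deterministic transform, so gradients
                                    # can flow through mu and sigma

print(z.mean(), z.std())  # ~2.0, ~0.5: same distribution as N(mu, sigma^2)
```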
"But how do you sample from a Gaussian? If it's N(0,1), doesn't that just mean any value between -1 and 1 would suffice? Even a uniform dist can do that."
Gaussian ≠ uniform. They have completely different shapes.
Libraries use algorithms like Box-Muller to convert uniform → Gaussian: draw $u_1, u_2 \sim \text{Uniform}(0, 1)$ and set $z = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2)$; then $z \sim \mathcal{N}(0, 1)$.
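Here is a minimal Box-Muller implementation in numpy (real libraries use faster variants such as the ziggurat algorithm, so treat this as an illustration, not what `np.random` actually runs):

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n):
    """Turn pairs of Uniform(0,1) draws into standard Gaussian draws."""
    u1 = 1.0 - rng.random(n)       # shift to (0, 1] so log(u1) is finite
    u2 = rng.random(n)
    r = np.sqrt(-2.0 * np.log(u1)) # radius from one uniform
    theta = 2.0 * np.pi * u2       # angle from the other
    return r * np.cos(theta)       # (r * sin(theta) is a second independent draw)

z = box_muller(200_000)
print(z.mean(), z.std())  # ~0.0, ~1.0: Gaussian-shaped, not uniform
```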
"What does it mean to minimize KL divergence? Like if we reduce MSE it means we're reducing distance between things... so for KL?"
KL divergence measures how different two probability distributions are:

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z)}{p(z)}\right]$$

In words: "Sample from q, ask how much more likely this sample is under q than under p, and average."
| Condition | KL Value |
|---|---|
| q = p (identical) | KL = 0 |
| q puts mass where p doesn't | KL → ∞ |
| q ≈ p (similar) | KL > 0 (small) |
Minimizing KL = "Make the encoder's output distribution q(z|x) look like the prior p(z) = N(0, I)."
In VAE:
This pushes the encoder to output μ ≈ 0 and σ ≈ 1.
"I didn't get why it's pushed close to 0, 1...?"
For two Gaussians, KL has a closed form. With $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$, per dimension:

$$D_{KL}(q \,\|\, p) = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$

Let's see when this equals zero:
| Condition | Which Term | Effect |
|---|---|---|
| μ ≠ 0 | μ² grows | Penalty → pushes μ toward 0 |
| σ > 1 | σ² grows | Penalty → pushes σ down |
| σ < 1 | -log(σ²) grows | Penalty → pushes σ up |
| μ = 0, σ = 1 | All cancel | KL = 0 |
The formula literally contains μ². That's a direct quadratic penalty on μ. The encoder learns to output small μ because large μ = large loss.
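The closed form is one line of numpy, and plugging in a few values makes the table concrete (the μ = 9 case echoes the "we're screwed" worry addressed later in this section):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), per dimension."""
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

print(kl_to_standard_normal(0.0, 1.0))  # 0.0: exactly matches the prior
print(kl_to_standard_normal(9.0, 1.0))  # 40.5: large mu, huge quadratic penalty
print(kl_to_standard_normal(0.0, 0.1))  # ~1.81: sigma too small is penalized too
```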
"Why this asymmetric way though... MSE is symmetric. Why is KL made like this?"
The two directions of KL mean different things:

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z)}{p(z)}\right] \qquad\qquad D_{KL}(p \,\|\, q) = \mathbb{E}_{z \sim p}\left[\log \frac{p(z)}{q(z)}\right]$$

Notice: in the first direction, we sample from q.
| Direction | Penalizes | Ignores |
|---|---|---|
| $D_{KL}(q \| p)$ | q has mass where p doesn't | p has mass where q doesn't |
| $D_{KL}(p \| q)$ | p has mass where q doesn't | q has mass where p doesn't |
"But why wouldn't you want both... penalty for when q doesn't overlap with p?"
In VAE, we want the asymmetry.
Think about it:
Should the encoder for a single cat image cover the entire prior? No! That would be useless — one image mapping to everywhere.
We want q to be smaller than p. We just don't want q to go outside p (where we'd never sample during generation).
$D_{KL}(q \| p)$ enforces exactly this: "Don't put mass outside the prior."
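The asymmetry shows up numerically. Using the standard closed form for KL between two 1-D Gaussians (the function below is my own helper), a q that is narrower than the prior is cheap under $D_{KL}(q \| p)$ but expensive under the reverse direction, which would demand that q cover all of p:

```python
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for 1-D Gaussians, closed form."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2 * s_p**2) - 0.5

# q = N(0, 0.3^2): concentrated well inside the prior p = N(0, 1)
print(kl_gauss(0.0, 0.3, 0.0, 1.0))  # ~0.75: KL(q||p) barely objects
print(kl_gauss(0.0, 1.0, 0.0, 0.3))  # ~3.85: KL(p||q) punishes q for not covering p
```

This is exactly the behavior the VAE wants: each image's q may stay small, as long as it stays inside the prior.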
"Ok so all we did is the same autoencoder but constrained the latent space... but how is this creating organization? People talk about VAEs having overlapping clusters of classes, but why would same-class images be pushed to the same cluster?"
The "beautiful organized clusters" thing is somewhat oversold.
Only this: Everything lives in a compact region near $\mathcal{N}(0, I)$, and the decoder has seen samples from that whole region.
This comes from reconstruction loss, not VAE magic.
Images that need similar features to reconstruct will share similar latents. Two cat images both need "pointy ears", "whiskers", etc. Similar inputs → similar features → similar z.
This would happen in a regular autoencoder too!
Reconstruction creates organization. KL just compresses it into a compact, sampleable region. The real VAE benefit is interpolation — no dead zones.
"What's the guarantee VAE samples sensible μ and σ? If some x produces μ=9, σ=1000, we're screwed."
If the encoder outputs μ=9, σ=1000, that gets penalized by KL. It's not a failure case we handle — it's a case that doesn't happen because the loss prevents it.
"What happens if I focus a lot on KL and ignore reconstruction?"
The encoder thinks: "I get punished for deviating from N(0,I). Why encode anything?"
Result: For ALL inputs, output μ=0, σ=1. The latent carries zero information.
This is called posterior collapse.

β-VAE makes this trade-off explicit by weighting the KL term: $\mathcal{L} = \text{MSE} + \beta \cdot D_{KL}$.
| β Value | What You Get |
|---|---|
| β ≪ 1 | Almost an autoencoder: good reconstruction, messy latent |
| β = 1 | Standard VAE |
| β ≫ 1 | Risk of collapse, blurry outputs |
β-VAE claimed high β leads to disentangled representations (each dimension = one factor).
"Why would high β force disentanglement? I'd just output 0,1 for everything."
High β alone doesn't force disentanglement. It forces collapse.
"And why would independent factors be 'efficient'? If I entangle 10 factors I can compress to 2 dimensions. Independence requires 10 dimensions."
Entanglement is MORE efficient for compression, not less.
β-VAE disentanglement is shaky:
"How is expectation over 100 samples the same as one sample? That erases the expectation."
It doesn't erase it — it approximates it.
With N=1:

$$\mathbb{E}_{q(z|x)}[\log p(x|z)] \approx \log p(x|z_1), \quad z_1 \sim q(z|x)$$
Terrible estimate for one image. But you're not training on one image.
With batch size 64 and 1 z per image = 64 samples per gradient step. Over 1000 batches = 64,000 samples. Noise averages out.
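A toy numpy check of that averaging argument (the target expectation here, $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[z^2] = 1$, is my own stand-in for the reconstruction term): individual one-sample estimates are wildly noisy, but the average of 64,000 of them lands on the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-sample estimates of E[z^2] = 1 are all over the place...
one_sample = rng.standard_normal(10) ** 2
print(one_sample)                 # individual estimates range widely

# ...but 64 per batch x 1000 batches = 64,000 samples averages the noise out:
many = rng.standard_normal(64_000) ** 2
print(many.mean())                # ~1.0
```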
"By that logic, SGD with one sample is the same as GD."
SGD works because: unbiased gradient + many iterations = convergence.
"But SGD on one sample is shit — not enough variance. But too large batch is also bad?"
| Batch Size | Quality | Problem |
|---|---|---|
| 1 | Super noisy | Unstable |
| 32-256 | Good | Sweet spot ✓ |
| Full dataset | Perfect | Expensive, poor generalization |
VAE follows the same logic: 1 z sample per image × batch of images = enough signal.
"Autoencoders compress to a latent space but it's unstructured — you can't sample from it.
VAEs encode to a distribution (μ, σ) instead of a point, then force these distributions toward a prior N(0,I) using KL divergence. This ensures the latent space is compact and sampleable.
The loss comes from ELBO — a lower bound on log p(x). We can't compute log p(x) directly, so we maximize this bound instead. Negating ELBO gives us: MSE (reconstruction) + KL (stay near prior).
The reparameterization trick makes sampling differentiable by writing z = μ + σε where ε ~ N(0,1).
Organization comes from reconstruction (similar images need similar features), not KL. KL just compresses everything into the sampleable region.
One z sample works because batches + SGD handle noise — 64 images × 1 sample = enough gradient signal."
| Symbol | Name | What It Is |
|---|---|---|
| $p(z)$ | Prior | N(0, I) — what we sample from during generation |
| $p(x|z)$ | Likelihood / Decoder | Given z, how likely is x? |
| $q(z|x)$ | Approx. Posterior / Encoder | Given x, output distribution over z |
| ELBO | Lower bound | E[log p(x|z)] - KL(q||p) |
| Loss | -ELBO | MSE + KL (minimize this) |
Next up: GANs, Diffusion Models, and how Transformers connect to all of this.