Chapter 25 of 25

Why the ELBO Isn't a Random Formula

Created Jun 7, 2026 Updated Jun 7, 2026

The ELBO shows up in variational inference, VAEs, the variational view of diffusion, and EM, always as the same intimidating line:

log p(x) ≥ E_q[ log p(x, z) ] − E_q[ log q(z) ]

It reads like something you'd have to be a genius to guess. You wouldn't — and that's the point. The ELBO is not a clever invention; it's the only thing you can write down once you accept one constraint. Follow the constraint and the formula appears on its own.

The problem. You have a latent-variable model p(x, z) = p(x | z) · p(z). To do maximum likelihood you need the marginal likelihood — the evidence:

p(x) = ∫ p(x, z) dz

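This integral sums over every possible latent z. For anything interesting it's intractable: there is generally no closed form, and brute-force numerical integration scales exponentially with the dimension of z. You can't evaluate p(x), so you can't even compute your training objective, let alone maximize it.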
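To make the cost concrete, here is a minimal sketch under assumed toy choices that are not from the text above: D binary latents with a uniform prior, and a unit-variance Gaussian likelihood whose mean is a weighted sum of the latents. The names `evidence_by_enumeration` and `weights` are hypothetical. The exact evidence needs one term per latent configuration, so the work doubles with every extra latent.

```python
import itertools
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def evidence_by_enumeration(x, weights):
    """Exact p(x) = sum_z p(x | z) p(z) for binary latents z in {0, 1}^D.

    Assumed toy model (illustrative only): p(z) is uniform over the 2^D
    configurations and p(x | z) = N(x; sum_i weights[i] * z[i], 1).
    Returns (p(x), number of terms summed).
    """
    d = len(weights)
    prior = 0.5 ** d  # every configuration has the same prior mass
    total, n_terms = 0.0, 0
    for z in itertools.product((0, 1), repeat=d):
        mean = sum(w * zi for w, zi in zip(weights, z))
        total += normal_pdf(x, mean, 1.0) * prior
        n_terms += 1
    return total, n_terms


if __name__ == "__main__":
    for d in (2, 8, 16):
        p_x, n_terms = evidence_by_enumeration(1.0, [0.1] * d)
        print(f"D={d:2d}  terms={n_terms:6d}  p(x)={p_x:.6f}")
```

At D = 16 the loop already visits 65,536 configurations; a continuous or much larger latent space rules out this kind of exhaustive sum entirely.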

The one move. Introduce any distribution q(z) over the latents and multiply inside the integral by q(z) / q(z):

log p(x) = log ∫ p(x, z) dz
         = log ∫ q(z) · [ p(x, z) / q(z) ] dz
         = log  E_q[ p(x, z) / q(z) ]

That's the whole trick — the only trick. You've turned the intractable integral into the log of an expectation under a distribution you control. (For now read q(z) as an auxiliary distribution chosen freely for one fixed x. In amortized variational inference and the VAE a network emits it as qφ(z | x), sharing parameters across all x — the derivation below is identical, just with q conditioned on the input.)
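The identity p(x) = E_q[ p(x, z) / q(z) ] can be checked numerically. The sketch below assumes a toy conjugate model chosen only because its evidence has a closed form to compare against: p(z) = N(0, 1) and p(x | z) = N(z, 1), which gives p(x) = N(0, 2). The function name `importance_estimate_of_evidence` and the particular q are hypothetical choices for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def importance_estimate_of_evidence(x, q_mean, q_var, n_samples, rng):
    """Estimate p(x) = E_q[ p(x, z) / q(z) ] by averaging weights over z ~ q.

    Assumed toy model (illustrative only): p(z) = N(0, 1), p(x | z) = N(z, 1).
    """
    q_std = math.sqrt(q_var)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, q_std)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        log_q = log_normal_pdf(z, q_mean, q_var)
        total += math.exp(log_joint - log_q)  # the weight p(x, z) / q(z)
    return total / n_samples


if __name__ == "__main__":
    x = 1.0
    rng = random.Random(0)
    estimate = importance_estimate_of_evidence(
        x, q_mean=0.5, q_var=1.0, n_samples=20000, rng=rng
    )
    exact = math.exp(log_normal_pdf(x, 0.0, 2.0))  # closed form for this toy model
    print(f"importance estimate: {estimate:.5f}   exact p(x): {exact:.5f}")
```

The average of the weights approaches p(x) for any q that covers the posterior; how quickly it does so depends on how well q matches it.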

Apply Jensen. log is concave, so by Jensen's inequality log E[·] ≥ E[log ·]:

log p(x) ≥ E_q[ log( p(x, z) / q(z) ) ]
         = E_q[ log p(x, z) ] − E_q[ log q(z) ]   =   ELBO(q)

Done. The "evidence lower bound" is just what you get when you push the log inside the expectation. Jensen guarantees that this can only make the value smaller, which is exactly why it's a lower bound. There was no creativity required; concavity did it.
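A small exact check of the bound, as a sketch: with a discrete latent the expectations are finite sums, so ELBO(q) can be computed without sampling and compared to log p(x). The joint table below is a made-up example for one fixed x, and `elbo` and `random_distribution` are hypothetical helper names.

```python
import math
import random


def elbo(joint, q):
    """ELBO(q) = sum_z q(z) * [log p(x, z) - log q(z)] for a discrete latent."""
    return sum(
        qz * (math.log(pz) - math.log(qz))
        for pz, qz in zip(joint, q)
        if qz > 0.0
    )


def random_distribution(k, rng):
    """A random strictly positive probability vector of length k."""
    raw = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    # Hypothetical joint p(x, z) for one fixed x and z in {0, 1, 2}.
    joint = [0.10, 0.25, 0.05]
    log_evidence = math.log(sum(joint))
    rng = random.Random(0)
    for _ in range(5):
        q = random_distribution(len(joint), rng)
        print(f"ELBO(q) = {elbo(joint, q):.4f}  <=  log p(x) = {log_evidence:.4f}")
```

Whatever q is drawn, the ELBO never exceeds log p(x), which is Jensen's inequality at work on a concrete table.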

What the gap is — and why this is the whole reason VI works. The inequality is actually an exact equation in disguise:

log p(x) = ELBO(q) + KL( q(z) ‖ p(z | x) )

(This holds as long as q puts no mass where the posterior has none; if q places mass on a region the posterior rules out, that KL is infinite and the ELBO is −∞. That is the one regularity condition the bound quietly assumes.) Since KL ≥ 0, the ELBO sits below log p(x), touching it iff q equals the true posterior p(z | x). So maximizing the ELBO does two jobs at once:

it pushes up a bound on the evidence (approximate maximum likelihood), and
it pulls q toward the true posterior p(z | x).
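The decomposition log p(x) = ELBO(q) + KL( q(z) ‖ p(z | x) ) can be verified in closed form on a model where everything is tractable. This is a sketch under an assumed toy setup, not a general implementation: p(z) = N(0, 1), p(x | z) = N(z, 1), and q(z) = N(m, s2), for which the exact posterior is N(x / 2, 1 / 2). The function names are hypothetical.

```python
import math


def log_evidence(x):
    """log p(x) for the assumed model: marginally, p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s2):
    """Closed-form ELBO for q(z) = N(m, s2) under the assumed model."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m * m + s2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)  # equals -E_q[log q(z)]
    return expected_log_lik + expected_log_prior + entropy


def kl_to_posterior(x, m, s2):
    """KL( N(m, s2) || p(z | x) ), with exact posterior N(x / 2, 1 / 2)."""
    post_mean, post_var = x / 2.0, 0.5
    return (
        0.5 * math.log(post_var / s2)
        + (s2 + (m - post_mean) ** 2) / (2.0 * post_var)
        - 0.5
    )


if __name__ == "__main__":
    x = 1.0
    for m, s2 in [(0.0, 1.0), (2.0, 0.1), (0.5, 0.5)]:
        gap = log_evidence(x) - elbo(x, m, s2)
        print(f"m={m:4.1f} s2={s2:3.1f}  gap={gap:.6f}  KL={kl_to_posterior(x, m, s2):.6f}")
```

In this toy model the gap and the KL agree for every (m, s2), and both vanish at m = x / 2, s2 = 1 / 2, where q is the exact posterior.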

This is the escape hatch. You can't minimize KL(q ‖ p(z | x)) directly — the posterior needs the same intractable p(x). But maximizing a bound you can actually estimate — by Monte Carlo sampling from q, even though p(x) itself stays out of reach — implicitly minimizes that KL. The ELBO is how you do inference on a posterior you can't even write down.
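Here is a hedged sketch of that escape hatch in action. It estimates the ELBO by sampling from q and picks the best mean from a grid, using only the prior and the likelihood and never p(x). The model is again an assumed toy (p(z) = N(0, 1), p(x | z) = N(z, 1)) chosen so the answer can be checked against the known posterior mean x / 2. Reusing the same noise draws for every candidate mean is a choice made here only to keep the comparison stable, and the grid search stands in for the gradient-based optimizer a real system would use. All names are hypothetical.

```python
import math
import random


def log_joint(x, z):
    """log p(x, z) for the assumed model p(z) = N(0, 1), p(x | z) = N(z, 1)."""
    return -math.log(2.0 * math.pi) - 0.5 * z * z - 0.5 * (x - z) ** 2


def elbo_estimate(x, m, s, noise):
    """Monte Carlo ELBO for q(z) = N(m, s^2), using samples z = m + s * eps."""
    total = 0.0
    for eps in noise:
        z = m + s * eps
        # log q(z) evaluated at z = m + s * eps
        log_q = -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * eps * eps
        total += log_joint(x, z) - log_q
    return total / len(noise)


def best_mean_on_grid(x, s, noise, grid):
    """Return the candidate mean with the highest estimated ELBO."""
    return max(grid, key=lambda m: elbo_estimate(x, m, s, noise))


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    grid = [-1.0 + 0.05 * i for i in range(61)]  # candidate means from -1 to 2
    x = 1.0
    m_star = best_mean_on_grid(x, s=0.7, noise=noise, grid=grid)
    print(f"ELBO-maximizing mean: {m_star:.2f}   exact posterior mean: {x / 2:.2f}")
```

The maximizer lands next to the true posterior mean even though the posterior itself was never evaluated.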

Read the two terms and the VAE appears. Regroup the bound using p(x, z) = p(x | z) · p(z), so that E_q[ log p(z) ] pairs with −E_q[ log q(z) ] to form a KL:

ELBO = E_q[ log p(x | z) ]  −  KL( q(z) ‖ p(z) )
       └─ reconstruction ┘     └─ regularizer ─┘

First term: sample latents from q, decode, measure how well they explain x — reconstruction quality.
Second term: keep the latent distribution q(z) (in a VAE, the encoder's qφ(z | x)) close to the prior p(z).

One sign convention that trips people up: you maximize the ELBO, so frameworks minimize its negative. The loss you actually code — reconstruction_loss + KL — is −ELBO, where the reconstruction loss is −E_q[log p(x | z)], a negative log-likelihood. Same two terms, flipped sign.
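As a sketch of what that coded loss can look like, the block below computes −ELBO for one data point under common but assumed modeling choices that the text does not fix: a diagonal Gaussian q with parameters `mu` and `log_var`, a standard normal prior (so the KL has a closed form), and a Bernoulli decoder whose output probabilities `decoder_probs` are assumed to come from decoding a single z sampled from q. All function and argument names are hypothetical, and no particular framework's API is implied.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def bernoulli_nll(x, probs, eps=1e-7):
    """-log p(x | z) for binary x under independent Bernoulli outputs."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def negative_elbo(x, decoder_probs, mu, log_var):
    """Single-sample estimate of -ELBO = reconstruction_loss + KL."""
    reconstruction_loss = bernoulli_nll(x, decoder_probs)
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return reconstruction_loss + kl


if __name__ == "__main__":
    x = [1.0, 0.0, 1.0]
    decoder_probs = [0.9, 0.2, 0.7]        # hypothetical decoder output for one z ~ q
    mu, log_var = [0.3, -0.1], [-0.5, 0.2]  # hypothetical encoder output
    print(f"loss (-ELBO estimate): {negative_elbo(x, decoder_probs, mu, log_var):.4f}")
```

When `mu` is all zeros and `log_var` is all zeros, q equals the prior, the KL term vanishes, and the loss reduces to the reconstruction term alone.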

That is the VAE loss, term for term. A VAE is not "an autoencoder with a KL penalty someone bolted on for regularization." It is the ELBO. The KL term isn't a heuristic — it's the second half of the only lower bound you could have written, and that's why it has the precise form it does and not some other penalty.

So the ELBO is simply what maximum likelihood becomes when the evidence integral is out of reach: introduce q, multiply by q/q, apply Jensen, read off two terms. And the same variational-bound pattern reappears well beyond the VAE — classical variational inference, the E-step/M-step of EM, and (with more structure layered on) the variational view of diffusion. Each adapts the bound to its own setting rather than copying the line verbatim, but the move that generates it never changes: re-derived by necessity, not invented by inspiration.

The KL sitting in that gap — and why the same divergence surfaces across so many other training objectives — is its own short: Why KL Divergence Is Everywhere in ML.