Chapter 22 of 25

Why MLE Becomes Cross-Entropy

Created Jun 7, 2026 Updated Jun 7, 2026

Cross-entropy is the default classification loss in practically every deep learning framework. Ask why that one and most answers stop at "it works well" or "the gradients are nice." Both are true and both miss the point. Cross-entropy is not a loss someone picked for convenience: it is what maximum likelihood turns into when your model outputs a probability distribution. Once you see the derivation, the loss stops being a hyperparameter and becomes a consequence.

The setup. A model with parameters θ outputs a distribution over the target: p_θ(y | x). For a classifier, that's the softmax vector — a probability for each class. You have data {(xᵢ, yᵢ)}, assumed i.i.d.

Maximum likelihood. The likelihood of the data under the model is the product of the per-example probabilities:

L(θ) = Πᵢ p_θ(yᵢ | xᵢ)

We want the θ that makes the observed data most probable. Products of many small numbers underflow and are awful to differentiate, so take the log (monotonic — same argmax) and flip the sign to get something to minimize:

maximize   Σᵢ log p_θ(yᵢ | xᵢ)
⇔ minimize  −Σᵢ log p_θ(yᵢ | xᵢ)      ← negative log-likelihood (NLL)

That last line is the whole game. Everything else is plugging in a distribution.
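To make the underflow point concrete, here is a minimal sketch using only the standard library. The per-example probabilities are made up for illustration (2000 examples, each given probability 0.5 by a hypothetical model); they are not from any real dataset.

```python
import math

# Hypothetical per-example probabilities p_theta(y_i | x_i):
# 2000 examples, each assigned probability 0.5 by the model.
probs = [0.5] * 2000

# Direct product: 0.5**2000 = 2**-2000 is far below the smallest
# positive double (about 5e-324), so the product underflows to 0.0.
likelihood = math.prod(probs)

# Sum of logs: the same quantity on a log scale, easily representable.
log_likelihood = sum(math.log(p) for p in probs)
nll = -log_likelihood  # negative log-likelihood, the thing we minimize

print(likelihood)  # 0.0
print(nll)         # about 1386.29, i.e. 2000 * ln(2)
```

The product carries no usable information once it hits zero, while the NLL stays a perfectly ordinary number that can be compared across parameter settings.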

Plug in a categorical → cross-entropy. For K-class classification the target is categorical, and p_θ(yᵢ | xᵢ) is just the softmax probability the model put on the correct class. Write the empirical target distribution as p (for a single hard label, the one-hot vector for the true class) and the model's predicted distribution as q = p_θ(· | x). Cross-entropy between them is:

H(p, q) = −Σₖ pₖ log qₖ

Because p is one-hot, every term is zero except the true class, so H(p, q) = −log q_true — exactly the NLL of one example. Summed over the dataset, NLL and cross-entropy are literally the same number. Cross-entropy isn't like maximum likelihood for a classifier; it is maximum likelihood for a classifier.
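A quick numerical check of that identity, as a sketch: the logits and the true class index below are arbitrary illustrative values, and `softmax` and `cross_entropy` are small helpers written for this example rather than any library's API.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k; terms with p_k == 0 contribute nothing.
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

logits = [2.0, 0.5, -1.0]   # illustrative model outputs for one example
true_class = 0              # illustrative label

q = softmax(logits)                                            # model distribution
p = [1.0 if k == true_class else 0.0 for k in range(len(q))]   # one-hot target

ce = cross_entropy(p, q)          # full cross-entropy formula
nll = -math.log(q[true_class])    # negative log-likelihood of the label

print(ce, nll)  # the two values agree
```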

The collapse to −log q_true is a special case, not the definition. When the target is soft — label smoothing, distillation, mixup, human-disagreement labels — p is no longer one-hot, and cross-entropy stays the full −Σₖ pₖ log qₖ: the expectation of −log q under the target distribution p. Same formula — only the one-hot shortcut drops away.
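As a sketch of the soft-target case, the example below uses label smoothing, assuming the common convention of mixing the one-hot vector with a uniform distribution using a weight eps. The predicted distribution `q` and the value of `eps` are illustrative choices.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k, the expectation of -log q under p.
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

q = [0.7, 0.2, 0.1]        # illustrative predicted distribution
p_hard = [1.0, 0.0, 0.0]   # one-hot target, true class 0

# Label smoothing (assumed convention): (1 - eps) * one_hot + eps / K.
eps = 0.1
K = len(p_hard)
p_soft = [(1 - eps) * ph + eps / K for ph in p_hard]

ce_hard = cross_entropy(p_hard, q)   # collapses to -log q[0]
ce_soft = cross_entropy(p_soft, q)   # every class contributes a term

print(ce_hard, ce_soft)
```

With the soft target, the probability the model puts on the wrong classes also enters the loss, so the one-hot shortcut no longer applies even though the formula is unchanged.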

Why it earns the name "cross-entropy." From information theory, H(p, q) is the expected number of bits (with base-2 logs; nats with the natural log) needed to encode samples drawn from the true distribution p using a code optimized for your predicted q. And there's an exact decomposition:

H(p, q) = H(p) + KL(p ‖ q)

H(p) is the target distribution's own entropy: fixed, nothing to do with θ. So minimizing cross-entropy is minimizing KL(p ‖ q): you are pulling the model's distribution q toward the empirical target distribution p. Three names, one operation: maximum likelihood = minimum cross-entropy = minimizing the KL from the empirical target distribution to the model. That same divergence is the quiet objective behind a surprising amount of ML.
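The decomposition can be checked numerically. In this sketch the two distributions are arbitrary illustrative values, and the helper functions are written for the example (natural logs, so the units are nats).

```python
import math

def entropy(p):
    # H(p) = -sum_k p_k log p_k
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    # KL(p || q) = sum_k p_k log(p_k / q_k)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.6, 0.3, 0.1]     # illustrative target distribution
q = [0.5, 0.25, 0.25]   # illustrative model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

print(lhs, rhs)  # equal up to floating-point rounding
```

Because `entropy(p)` does not depend on `q`, any change to `q` that lowers the cross-entropy lowers the KL term by the same amount.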

The same machinery gives every other loss. Change the assumed output distribution, turn the crank, read off the loss:

target distribution        NLL becomes…
─────────────────────      ─────────────────────────────
Categorical (softmax)      cross-entropy
Bernoulli (sigmoid)        binary cross-entropy / log-loss
Gaussian  N(ŷ, σ²)         mean squared error (MSE)
Laplace                    mean absolute error (MAE)
Poisson                    Poisson / count loss

Each equivalence holds up to additive constants and scale factors that don't move the argmin — the Gaussian→MSE and Laplace→MAE rows both assume the scale (variance / spread) is fixed, not learned.
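The Bernoulli row is the easiest one to verify directly. The sketch below compares the negative log of the Bernoulli probability mass function with the usual binary cross-entropy formula; the predicted probabilities are illustrative, and both function names are made up for this example.

```python
import math

def bernoulli_nll(y, q):
    # Negative log of the Bernoulli pmf q**y * (1 - q)**(1 - y), y in {0, 1}.
    return -math.log(q ** y * (1 - q) ** (1 - y))

def binary_cross_entropy(y, q):
    # The familiar log-loss formula.
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

# Illustrative (label, predicted probability of class 1) pairs.
examples = [(1, 0.9), (0, 0.9), (1, 0.2), (0, 0.35)]

for y, q in examples:
    print(y, q, bernoulli_nll(y, q), binary_cross_entropy(y, q))
```

The two columns match for every pair, since taking the log of the pmf just turns the exponents into the weights of the log-loss.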

MSE is the one worth checking by hand. Assume y ~ N(ŷ, σ²) and write the NLL:

−log p(yᵢ | xᵢ) = (1 / (2σ²))·(yᵢ − ŷᵢ)² + log(σ√(2π))

The constant drops out of the argmin, the 1/(2σ²) is just a positive scale, and what's left to minimize is Σᵢ(yᵢ − ŷᵢ)², which is MSE up to a factor of 1/n. So MSE is maximum likelihood under Gaussian noise with fixed σ. Regression with MSE and classification with cross-entropy aren't two different philosophies. They're the same principle applied to two different noise models.
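A small sketch of the same check in code: fit a single constant prediction to a toy dataset by grid search, once under the Gaussian NLL and once under MSE. The data, the fixed value of sigma, and the grid are all illustrative assumptions.

```python
import math

ys = [1.0, 2.0, 4.0, 5.0]   # toy targets
sigma = 1.5                 # noise scale, assumed fixed (not learned)

def gaussian_nll(y_hat):
    # Sum over examples of -log N(y | y_hat, sigma^2).
    const = math.log(sigma * math.sqrt(2 * math.pi))
    return sum((y - y_hat) ** 2 / (2 * sigma ** 2) + const for y in ys)

def mse(y_hat):
    return sum((y - y_hat) ** 2 for y in ys) / len(ys)

# Candidate constant predictions 0.00, 0.01, ..., 6.00.
candidates = [i / 100 for i in range(0, 601)]

best_nll = min(candidates, key=gaussian_nll)
best_mse = min(candidates, key=mse)

print(best_nll, best_mse)  # both 3.0, the sample mean
```

The two objectives take different values at every candidate, but they rank the candidates identically, which is all the argmin cares about.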

The payoff. The loss function is not a free knob. It's a statement about how you believe the target is distributed around the model's output. Modelling low-count or skewed count data? Be careful with MSE: it assumes Gaussian noise, which can be a poor fit there; a Poisson or negative-binomial likelihood is usually the better starting point. Heavy-tailed errors? MSE's Gaussian assumption lets outliers dominate, because their error enters squared; a Laplace (MAE) assumption penalizes them only linearly and is far less sensitive to them. Pick the distribution your target actually follows, and the "right" loss is already chosen for you.
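The outlier point can be illustrated with a toy dataset. This sketch fits one constant prediction by grid search under squared error (the Gaussian assumption) and under absolute error (the Laplace assumption); the data, including the single outlier at 100, are made up for illustration.

```python
ys = [1.0, 2.0, 3.0, 4.0, 100.0]   # toy targets with one large outlier

def sum_squared_error(y_hat):
    # Objective implied by Gaussian noise with fixed scale.
    return sum((y - y_hat) ** 2 for y in ys)

def sum_absolute_error(y_hat):
    # Objective implied by Laplace noise with fixed scale.
    return sum(abs(y - y_hat) for y in ys)

# Candidate constant predictions 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]

best_sq = min(candidates, key=sum_squared_error)
best_abs = min(candidates, key=sum_absolute_error)

print(best_sq)   # 22.0, the mean, dragged toward the outlier
print(best_abs)  # 3.0, the median, unaffected by how far out the outlier is
```

Replacing 100.0 with 1000.0 would move the squared-error fit much further while leaving the absolute-error fit where it is.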

Where this exact identity does real work in practice: LLM pretraining is nothing but cross-entropy minimization over next tokens — and what that does and pointedly does not guarantee about truth is the subject of The Physics of Hallucination.