Chapter 1 of 7

Endogeneity

Created Apr 28, 2026 Updated Jun 7, 2026

Imagine a regression of churn on help-button clicks shows a strong negative coefficient: "users who click help churn 30% less". The temptation is to read that as causal — surface the help button more aggressively, churn drops. But the data alone cannot tell you that. Maybe both behaviours are driven by something underneath them (engaged users click help and stay around). Maybe the causation runs the other way (users who survived long enough are the ones who started needing help). Maybe the help-clicking population just differs from the rest on things you never measured.

That gap — where a regression coefficient looks like a causal effect but isn't one — is what econometricians call endogeneity. It is not a finance- or economics-specific quirk. It shows up any time you try to draw a causal conclusion from observational data, and ML practitioners hit it constantly: features that predict well in offline evaluation but do not move the metric when shipped, churn scores that are accurate but unactionable, demand models that suggest prices the business has already tried.

The unpleasant property is that the problem is fundamental. More data does not fix it; the bias does not shrink with sample size; you cannot detect it from the model's residuals alone. To get an unbiased causal estimate, you either need a corner of the data where the causation runs in only one direction, or you need extra information from outside the model that breaks the ambiguity.

The rest of this note formalises that intuition — what endogeneity is precisely, where it comes from, how to detect it, why OLS breaks under it — and the rest of the Econometrics track covers the standard tools for breaking the ambiguity in real datasets.

Definition

In the linear model y = Xβ + ε the standard exogeneity assumption is that the regressor X and the error term ε are unrelated. There are actually two flavours of this with different consequences, and they are routinely conflated in introductory treatments:

Zero conditional mean / strict exogeneity, E[ε | X] = 0 — the strongest version (the precise name depends on the setting: "zero conditional mean" in cross-sectional regression, "strict exogeneity" in panel and time-series contexts where ε at each time has to be uncorrelated with regressors from all periods). It is what makes OLS unbiased in finite samples: E[β̂] = β.
Orthogonality, E[X'ε] = 0 (equivalently Cov(X, ε) = 0 in a centred model) — the weaker version. It is what makes OLS consistent: β̂ → β as N → ∞.

Endogeneity is the technical term for the violation of these conditions: X is correlated with ε. Under endogeneity, the estimator is biased (does not reach the true β on average) and, more catastrophically, inconsistent (the bias does not vanish even with infinite data). That second property is what makes endogeneity so destructive for causal interpretation — more data does not save you.

Why OLS is biased under endogeneity

The OLS estimator is β̂ = (X'X)⁻¹ X'y. Substituting y = Xβ + ε:

β̂ = β + (X'X)⁻¹ X'ε

The second term is the estimation error, and its expectation (or probability limit) is the bias. Its behaviour depends on which exogeneity condition holds:

If E[ε | X] = 0, the bias term has zero mean conditional on X, so OLS is unbiased in any finite sample.
If only the weaker E[X'ε] = 0 holds (orthogonality without the zero conditional mean), then (X'X)⁻¹ X'ε → 0 in probability as N → ∞, by the law of large numbers applied to X'X/N and X'ε/N. OLS is then consistent but not necessarily unbiased in finite samples.
If E[X'ε] ≠ 0, the term does not vanish even asymptotically — OLS is inconsistent, and no amount of additional data fixes it.

The third case is what defines endogeneity, and it is genuinely fundamental: a problem that more data cannot solve. That property is what separates endogeneity from ordinary sampling noise: the standard error of β̂ shrinks at rate 1/√N, while the endogeneity bias does not shrink at all.

Both panels are Monte-Carlo sampling distributions of β̂. Left: exogenous DGP, where the distribution narrows around the true β = 1 as N grows (classic 1/√N shrinkage). Right: endogenous DGP, where the distribution narrows just as fast, but around the wrong value. More data tightens the noise; it does not move the bias.
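The same contrast can be reproduced in a few lines. The sketch below is a minimal Monte-Carlo under an assumed toy DGP (a hidden confounder u, with a strength parameter omega that is not taken from the figure), not the code behind the panels. With omega = 1 the probability limit of the slope is β + Cov(X, ε)/Var(X) = 1 + 1/2 = 1.5.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of a simple regression of y on x (intercept included)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def simulate_slopes(n, omega, reps=300, beta=1.0, seed=0):
    """Sampling distribution of the OLS slope under an assumed toy DGP."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(reps)
    for r in range(reps):
        u = rng.normal(size=n)                # unobserved confounder
        x = u + rng.normal(size=n)            # Var(x) = 2, Cov(x, u) = 1
        eps = omega * u + rng.normal(size=n)  # Cov(x, eps) = omega
        y = beta * x + eps
        slopes[r] = ols_slope(x, y)
    return slopes

results = {}
for n in (100, 1_000, 10_000):
    exo = simulate_slopes(n, omega=0.0)   # exogenous: Cov(x, eps) = 0
    endo = simulate_slopes(n, omega=1.0)  # endogenous: Cov(x, eps) = 1
    results[n] = (exo.mean(), exo.std(), endo.mean(), endo.std())
    print(f"N={n:>6}  exogenous: mean={exo.mean():.3f} sd={exo.std():.3f}  "
          f"endogenous: mean={endo.mean():.3f} sd={endo.std():.3f}")
```

In both columns the standard deviation falls as N grows. Only the exogenous mean sits near 1; the endogenous mean stays near 1.5 at every sample size.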

Three sources of endogeneity

Endogeneity arises from three main mechanisms, each of which calls for a different corrective technique. Recognising which source is present is the first step to choosing the right tool.

Omitted variable bias (OVB)

OVB arises when an important variable X₂ is excluded from the model but is correlated with an included variable X₁ and also affects Y. The effect of the omitted variable is absorbed by the error term, which creates a correlation between X₁ and that error term.

The bias has a clean closed form. If the true model is y = β₁X₁ + β₂X₂ + ε and we estimate the misspecified y = β₁X₁ + u, then asymptotically:

plim β̂₁ = β₁ + β₂ × Cov(X₁, X₂) / Var(X₁)

The sign of the bias is the sign of β₂ × Cov(X₁, X₂), which means substantive knowledge about the omitted variable lets you predict the direction of the bias before fitting the model. The classic example is estimating the effect of education on wages with ability omitted: ability raises wages directly (β₂ > 0) and is positively correlated with education (Cov(X₁, X₂) > 0), so the OLS estimate of the education coefficient is biased upward — OLS attributes ability's effect to education itself.

The simulation generates data from y = β₁X₁ + β₂X₂ + ε with β₁ = 1, then fits both the long regression (with X₂) and the misspecified short one (without). The violet tick is the closed-form prediction β₁ + β₂·ρ, which matches the formula above when X₁ and X₂ have equal variances, so that Cov(X₁, X₂) / Var(X₁) = ρ; the orange bar is the actual β̂ from the short OLS. Press ×10 sample size and watch the bars get tighter while the orange one stays planted on the wrong value: the punchline of "biased and inconsistent" made tactile.
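The closed-form OVB expression can also be checked numerically. This is a minimal sketch with assumed parameter values (β₂ = 2, correlation 0.5, unit variances, all chosen for illustration only), comparing the short-regression slope with β₁ + β₂ × Cov(X₁, X₂) / Var(X₁).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
beta1, beta2, rho = 1.0, 2.0, 0.5  # illustrative values, not from the text

x1 = rng.normal(size=n)
# x2 has unit variance and correlation rho with x1
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

const = np.ones(n)
X_long = np.column_stack([const, x1, x2])  # correctly specified
X_short = np.column_stack([const, x1])     # x2 omitted

b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Closed-form prediction for the short-regression slope
predicted = beta1 + beta2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(f"long  regression slope on x1: {b_long[1]:.3f}")
print(f"short regression slope on x1: {b_short[1]:.3f}")
print(f"closed-form prediction:       {predicted:.3f}")
```

Under these assumptions the short slope lands near 1 + 2 × 0.5 = 2, twice the true coefficient, while the long regression recovers a value near 1.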

Solutions: include the missing controls when they can be measured; use fixed effects on panel data when the omitted variable is constant within an entity (individual, firm, region) over time; use instrumental variables when no measurable control or natural fixed-effect structure is available.

Simultaneity / reverse causality

Simultaneity arises when X and Y mutually determine each other — neither variable explicitly causes the other in the unidirectional sense that regression assumes. The classical example is supply and demand: an observed (price, quantity) point is the intersection of the supply and demand curves, and a regression of quantity on price mixes both directions of causation rather than identifying either.

A specific business case is pricing. A revenue manager sets prices based on expected demand: high expected demand → high price. But separately, high price → lower demand. Observed data conflate the two effects, and a simple OLS of demand on price recovers a combined estimate, not a clean demand elasticity — this is exactly the bias the pricing-elasticity note walks through end-to-end.

Stationary demand and supply curves (violet, blue), random shocks to each, equilibrium points (orange dots), OLS line through the cloud (dashed orange). When both curves shift, the OLS slope is some hybrid of the two structural slopes — interpretable as neither. Switch to shift only supply: equilibria slide along a fixed demand curve and OLS recovers the true demand slope. That is the cost-shifter / IV intuition.

Solutions: instrumental variables that shift one side of the system but not the other (cost shifters as instruments for price are the canonical demand-curve identification strategy); the control function approach, where an explicit first-stage residual absorbs the simultaneity in a single estimating equation.
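A small simulation illustrates the cost-shifter idea. The sketch assumes linear supply and demand curves with unit slopes and intercepts normalised to zero, plus a hypothetical cost shifter z that enters supply only. None of these values come from the text. Under this setup the true demand slope is −1 and the OLS slope converges to −1/3.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
b_d, b_s, g = 1.0, 1.0, 1.0  # demand slope is -b_d, supply slope is +b_s

z = rng.normal(size=n)    # cost shifter: moves supply only
u_d = rng.normal(size=n)  # demand shock
u_s = rng.normal(size=n)  # supply shock

# Demand: q = -b_d * p + u_d
# Supply: q =  b_s * p - g * z + u_s
# Equating the two gives the equilibrium price:
p = (g * z + u_d - u_s) / (b_d + b_s)
q = -b_d * p + u_d

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

ols = cov(q, p) / cov(p, p)  # mixes supply and demand slopes
iv = cov(q, z) / cov(p, z)   # uses only price variation driven by z

print(f"true demand slope: {-b_d:.3f}")
print(f"OLS slope:         {ols:.3f}")
print(f"IV slope:          {iv:.3f}")
```

The IV ratio works here because z is correlated with price but, by construction, uncorrelated with the demand shock. In real data that exclusion restriction is an argument to be made, not something the code can verify.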

Measurement error

Measurement error arises when a variable is observed with noise: the true X_true is unobservable, and we observe X_measured = X_true + noise. There are two sub-cases that often get conflated, and the distinction matters:

Error on the regressor X. In the simple univariate classical-measurement-error case (random, mean-zero noise, uncorrelated with X_true and ε), OLS is biased toward zero — the attenuation bias. Asymptotically:
```
plim β̂ = β × Var(X_true) / (Var(X_true) + Var(noise))
```
The ratio Var(X_true) / (Var(X_true) + Var(noise)) is the reliability ratio — when noise is large relative to the true variance, the OLS coefficient is pulled strongly toward zero. The classical example is self-reported income: people don't remember exactly or shade their answers, and the true relationship between income and (say) health gets systematically understated. In multivariate settings the picture is messier — measurement error on one regressor can bias the other coefficients in either direction depending on the correlation structure, so "biased toward zero" is a reliable summary only in the univariate classical case.
Error on the outcome Y. Qualitatively different: random measurement error on Y that is uncorrelated with X inflates the standard errors of β̂ but does not bias the coefficient itself. OLS remains unbiased and consistent; the estimate is just less precise. Mixing up these two cases is one of the most common interpretation mistakes around measurement error: it is the regressor side that gives you attenuation.

The scatter is the simulated (X_measured, y) cloud with the OLS line (green) and the true line β = 1 (dashed). The right-hand bar reads off β̂ next to β. Pull the noise slider and the green slope flattens, exactly by the reliability ratio Var(X)/(Var(X)+Var(ν)) shown in the readout. Flip to noise on outcome Y: same amount of noise, but the slope is unmoved and only the scatter inflates. This is the point at which most people confuse the two.

Solutions: find a less noisy measurement when one exists; use instrumental variables that correlate with X_true but not with the measurement noise — a second independent measurement of the same underlying quantity is a natural instrument.
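The two sub-cases and the second-measurement fix can be compared side by side. The sketch below assumes classical measurement error with unit variances throughout, so the reliability ratio is 0.5. The variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta, noise_sd = 1.0, 1.0

x_true = rng.normal(size=n)                       # Var(x_true) = 1
x_meas = x_true + noise_sd * rng.normal(size=n)   # noisy regressor
x_meas2 = x_true + noise_sd * rng.normal(size=n)  # second, independent measurement
y = beta * x_true + rng.normal(size=n)
y_noisy = y + noise_sd * rng.normal(size=n)       # noise on the outcome instead

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

reliability = 1.0 / (1.0 + noise_sd**2)           # Var(x_true) / (Var(x_true) + Var(noise))

b_clean = cov(y, x_true) / cov(x_true, x_true)    # benchmark, not feasible in practice
b_atten = cov(y, x_meas) / cov(x_meas, x_meas)    # attenuated towards zero
b_ynoise = cov(y_noisy, x_true) / cov(x_true, x_true)  # unbiased, just noisier
b_iv = cov(y, x_meas2) / cov(x_meas, x_meas2)     # second measurement as instrument

print(f"reliability ratio:        {reliability:.3f}")
print(f"OLS on true X:            {b_clean:.3f}")
print(f"OLS on noisy X:           {b_atten:.3f}")
print(f"OLS with noisy Y:         {b_ynoise:.3f}")
print(f"IV (second measurement):  {b_iv:.3f}")
```

The instrument works because the two measurement errors are drawn independently, so x_meas2 is correlated with x_meas only through x_true.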

A fourth source, often grouped alongside the three above, is self-selection: when whether a unit appears in the sample, or in the treatment group, is itself driven by unobserved factors that also affect the outcome. The classic example is observational studies of training programmes — workers who choose to enrol differ from those who do not in ways that are correlated with future earnings, so a naive comparison of enrolled vs. non-enrolled overstates (or understates) the programme's true effect.

Self-selection is technically a special case of OVB (the unobserved drivers are the omitted variable) or of simultaneity (selection and outcome are jointly determined), but it is common enough — and the corrective tools distinctive enough — that it usually gets treated separately. The available tools split along the same line as the underlying problem: propensity-score methods (matching, weighting, stratification) are the right tool when selection is explainable by observed covariates ("selection on observables"); when selection is driven by unobservables, stronger structure is needed — Heckman selection models, IV, or an explicit research design (RD, natural experiments). A dedicated note on selection methods is planned.

Toy training-programme DGP. The unobserved u drives both enrolment (treatment T) and the outcome y; the true ATE is fixed at 2. The scatter shows y vs the unobserved confounder, coloured by treatment; the dashed lines are the group means. The right bar compares the naive difference of means against the true ATE. Press simulate RCT (π = 0) to break the selection link — naive comparison hits the truth. Anything else and the bias is precisely the selection-on-unobservables story.
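A rough version of this toy DGP fits in a few lines. The threshold-style enrolment rule and the parameter name pi_sel are assumptions made for this sketch. Only the true ATE of 2 and the role of the unobserved u are taken from the description above.

```python
import numpy as np

def naive_difference(pi_sel, n=200_000, ate=2.0, seed=3):
    """Naive treated-minus-untreated difference in mean outcomes.

    pi_sel controls how strongly the unobserved u drives enrolment;
    pi_sel = 0 mimics a randomised experiment.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                          # unobserved driver
    enrol = pi_sel * u + rng.normal(size=n) > 0     # self-selection rule (assumed)
    y = ate * enrol + u + rng.normal(size=n)        # u also raises the outcome
    return float(y[enrol].mean() - y[~enrol].mean())

rct = naive_difference(pi_sel=0.0)
selected = naive_difference(pi_sel=1.0)

print(f"true ATE:                   2.000")
print(f"naive estimate, pi_sel = 0: {rct:.3f}")
print(f"naive estimate, pi_sel = 1: {selected:.3f}")
```

With pi_sel = 0 the naive comparison sits close to 2. With pi_sel = 1 it overstates the effect, because enrolled units have systematically higher u.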

How to detect endogeneity

Detecting endogeneity is harder than it sounds because there is no direct test — the error term ε is unobservable by definition. The available tools fall into three groups, none of them fully self-contained.

Hausman test

The Hausman specification test compares an OLS estimate with an IV estimate of the same coefficient. The null hypothesis is that X is exogenous, so that both estimators are consistent and the two estimates differ only by sampling noise. A statistically significant difference is evidence of endogeneity, provided the instrument itself is valid.

The catch is that running the test requires already having a valid instrument, and a valid instrument is exactly what you would need to solve the endogeneity problem in the first place. You cannot test for endogeneity before you have found a way to deal with it. Once you have a credible IV, the main work is already the identification argument behind it; the Hausman test is secondary to that design decision. IV is also typically less efficient than OLS, so when there is no real evidence of endogeneity, OLS may still be the better practical choice, but that is an efficiency call, not a substitute for thinking about identification.

Two stacked histograms over Monte-Carlo replications, both with the true β = 1 marked. With ω = 0 (no endogeneity) both estimators centre on the truth; OLS is tighter — IV is paying for consistency with variance. Raise ω and OLS drifts off; IV stays put. Drop π toward zero and IV becomes useless: the histogram fans out and the Hausman p-value loses its meaning — the weak-instrument failure mode that breaks the test as a practical tool.
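One common way to run the comparison is the regression-based (Durbin-Wu-Hausman) form: add the first-stage residual to the OLS regression and test its coefficient. The sketch below uses plain NumPy, conventional homoskedastic standard errors, and a normal approximation for the p-value. The DGP and the names omega and pi_fs are assumptions for illustration, loosely mirroring the ω and π in the caption above.

```python
import math
import numpy as np

def ols(X, y):
    """Coefficients and conventional (homoskedastic) standard errors."""
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(s2 * np.diag(xtx_inv))

def dwh_test(y, x, z):
    """Regression-based Hausman test for one regressor and one instrument."""
    const = np.ones(len(y))
    Z = np.column_stack([const, z])
    gamma, _ = ols(Z, x)
    v_hat = x - Z @ gamma                       # first-stage residual
    b, se = ols(np.column_stack([const, x, v_hat]), y)
    t_stat = b[2] / se[2]                       # H0: coefficient on v_hat is zero
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # two-sided, normal approx.
    return b[1], t_stat, p_value                # b[1] equals the IV estimate

def simulate(omega, pi_fs=1.0, n=20_000, beta=1.0, seed=11):
    """Assumed DGP: u confounds x and y when omega != 0; z is a valid instrument."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)
    x = pi_fs * z + u + rng.normal(size=n)
    y = beta * x + omega * u + rng.normal(size=n)
    return y, x, z

iv_endo, t_endo, p_endo = dwh_test(*simulate(omega=1.0))
iv_exo, t_exo, p_exo = dwh_test(*simulate(omega=0.0))
print(f"omega=1: IV slope={iv_endo:.3f}, t={t_endo:.2f}, p={p_endo:.4f}")
print(f"omega=0: IV slope={iv_exo:.3f}, t={t_exo:.2f}, p={p_exo:.4f}")
```

The test inherits every weakness of the instrument: with pi_fs close to zero the first stage carries almost no information and the statistic is not reliable.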

Theoretical reasoning

In modern empirical practice this is the dominant tool, and it is not a degenerate case of "no formal test available". The strongest evidence for endogeneity is institutional knowledge: how the data are actually generated, which variables are simultaneously determined, which decisions are made by whom and based on what information. A revenue manager setting prices based on forecast demand makes simultaneity unavoidable — no formal test is needed to see the issue. Design-based identification arguments (natural experiments, regression discontinuity, IV grounded in institutional structure) have largely replaced formal endogeneity tests in modern econometrics, and that shift is a feature rather than a methodological retreat.

Residual analysis

There is an important caveat here that catches people: OLS residuals are mechanically orthogonal to the included regressors in-sample (the estimator is constructed precisely so that the first-order condition X'(y − Xβ̂) = 0 holds), so simply checking the residual–regressor correlation does not detect endogeneity. The model has zero in-sample residual correlation with X by construction, whether or not endogeneity is present.

Residual analysis can still surface misspecification indirectly. The useful patterns are: structure over time that suggests a missing dynamic variable; nonlinearity (curved residual–fitted plots) that suggests the wrong functional form; correlation between residuals and variables not included in the model; group-level shifts in residual means that suggest a missing group-level effect; and residual patterns that line up with substantive institutional knowledge about how the data are generated. None of these are formal endogeneity tests — they are diagnostics that point at where the model might be incomplete.

Hidden-confounder DGP: y = β·X + ω·u + ε, X correlated with the unobserved u. Slide ω up to crank endogeneity. The green view ("OLS residual vs X") is what diagnostics actually plot — and it stays a flat cloud with Cov(X, ε̂) ≈ 0 for every value of ω, because OLS makes it so by construction. Flip to peek at true error ε and the orange view shows the structural error you cannot actually compute: it tilts visibly with X. The thing residuals can't see is precisely the thing that matters.
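The by-construction orthogonality is easy to confirm numerically. The sketch below uses an assumed version of the hidden-confounder DGP (ω = 1, unit-variance shocks) and contrasts the OLS residual with the structural error, which is available here only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
beta, omega = 1.0, 1.0  # illustrative values

u = rng.normal(size=n)                    # unobserved confounder
x = u + rng.normal(size=n)                # X is correlated with u
eps = omega * u + rng.normal(size=n)      # structural error, unobservable in practice
y = beta * x + eps

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b                         # what diagnostics can actually see

corr_resid = np.corrcoef(x, resid)[0, 1]  # zero up to floating-point error
corr_true = np.corrcoef(x, eps)[0, 1]     # clearly non-zero

print(f"OLS slope:                 {b[1]:.3f}  (true beta = {beta})")
print(f"corr(X, OLS residual):     {corr_resid:.2e}")
print(f"corr(X, structural error): {corr_true:.3f}")
```

The slope is badly off (close to 1.5 under these assumptions), yet the residual correlation gives no hint of it.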

Why endogeneity matters in practice

Endogeneity is not an academic problem. It has direct business consequences.

A pricing example

In a typical pricing project, a simple OLS estimate of demand elasticity can easily be biased toward zero in economically meaningful ways: demand "looks" less sensitive to price than it actually is, because the simultaneity between observed price and demand collapses both directions into one number. Based on the biased elasticity, the optimiser sets the wrong "optimal" price (too high, when the elasticity is understated), off by several percent, which translates into substantial revenue loss in aggregate.

For a company with revenues in the hundreds of millions of dollars, a few percent of error is real money. Correcting the estimate via IV or a control function is, in those cases, direct financial optimisation — see the pricing-elasticity note for the full worked example.

Constant-elasticity demand Q = A·P^(−η), profit π(P) = (P−c)·Q(P), markup-rule optimum P* = η·c/(η−1). Slide the reliability ratio λ to set the biased estimate η̂ = η·λ; the chart shades the gap between the true optimum (green) and the biased optimum (orange) on the underlying profit curve. The annual-loss card converts that gap into dollars at the chosen revenue scale — the "few percent of hundreds of millions" the chapter mentions, made arithmetic.
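The arithmetic behind that gap can be written out directly from the markup rule. The numbers below (η = 3, c = 10, λ = 0.8) are hypothetical and chosen only to make the calculation concrete.

```python
def optimal_price(eta, c):
    """Markup-rule optimum P* = eta * c / (eta - 1), valid for eta > 1."""
    if eta <= 1:
        raise ValueError("markup rule needs elasticity eta > 1")
    return eta * c / (eta - 1)

def profit(p, eta, c, a=1.0):
    """Profit (P - c) * Q(P) under constant-elasticity demand Q = a * P**(-eta)."""
    return (p - c) * a * p ** (-eta)

eta_true, c = 3.0, 10.0     # hypothetical true elasticity and unit cost
lam = 0.8                   # hypothetical attenuation factor
eta_hat = eta_true * lam    # biased elasticity estimate

p_true = optimal_price(eta_true, c)
p_biased = optimal_price(eta_hat, c)

# Both prices are evaluated on the TRUE demand curve
loss = 1 - profit(p_biased, eta_true, c) / profit(p_true, eta_true, c)

print(f"true optimum price:   {p_true:.2f}")
print(f"biased optimum price: {p_biased:.2f}")
print(f"share of profit lost: {loss:.1%}")
```

With these inputs the understated elasticity pushes the price from 15 to about 17.1 and gives up roughly 4% of the attainable profit. The loss grows quickly as the biased elasticity approaches 1, where the markup rule breaks down.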

Nobel-level significance

The 2021 Nobel Prize in Economics was awarded to Joshua Angrist, Guido Imbens, and David Card for contributions to empirical methods in causal inference, including research designs that address endogeneity and identification problems (natural experiments, LATE, IV-style identification arguments). This is mainstream importance, not academic curiosity: much of modern empirical economics is built around ways of identifying causal effects in the presence of endogeneity, and those methods are precisely what the rest of the Econometrics track covers: instrumental variables, regression discontinuity, control function, panel data, and causal ML.