Chapter 2 of 7

Instrumental Variables (2SLS)

Created Apr 28, 2026 Updated Jun 7, 2026

You want to know how a feature affects an outcome — how price affects demand, how training affects wages, how an algorithmic change affects user behaviour. You have observational data but cannot just regress outcome on feature: as the endogeneity note covers, the feature is correlated with the error term, and OLS produces a biased causal estimate.

The IV trick is to find a third variable, call it Z, that moves the feature but has no other path to the outcome. If such a Z exists, you can use the variation in the feature that comes from Z — that variation is, in a precise sense, "as if randomised" — and discard the endogenous part. The price is noisier estimates (you are using only some of the variation in the feature); the benefit is that what remains identifies the causal effect rather than a confounded association.

The canonical example is supply and demand: cost shocks (weather, fuel prices) shift the supply curve without directly affecting demand, so they trace out movements along the demand curve. The same shape shows up in product and ML practice whenever there is an exogenous nudge to a feature you cannot randomise directly: pricing experiments where the intent to treat is randomised but the actual price paid is endogenous; recommender-system changes where the algorithm switch is exogenous but exposure is endogenous; designs around discontinuities in eligibility, deadlines, or quotas.

The rest of this note formalises that intuition, walks through the standard 2SLS implementation, and covers what an IV estimate actually means — which, in modern theory, is not the average effect over the whole population, but the effect on the subset of units whose treatment was actually moved by the instrument.

The core idea

We cannot disentangle the correlation between X and ε directly. But if we can find a Z that moves X and only affects Y through X, then we can isolate the part of X that is generated by Z — that part is, by construction, uncorrelated with ε — and use it instead of the contaminated original. Z then plays the role of a "natural experiment": it generates variation in X that is exogenous, even when the data itself was collected observationally.
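A small simulation can make the idea concrete. The sketch below is an illustration under assumed values, not part of the original analysis: the data-generating process, the coefficients, and the helper names `simulate` and `cov` are all hypothetical. It uses only the Python standard library, and relies on the fact that with one instrument and no controls the IV estimate reduces to the ratio Cov(Z, Y) / Cov(Z, X).

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate(n, beta, pi, seed):
    """Hypothetical DGP: U confounds X and Y, while Z shifts X only."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                   # unobserved confounder, part of the error in Y
        zi = rng.gauss(0, 1)                  # instrument, drawn independently of u
        xi = pi * zi + u + rng.gauss(0, 1)    # X is moved by both Z and the confounder
        yi = beta * xi + u + rng.gauss(0, 1)  # Z reaches Y only through X
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate(n=20000, beta=1.0, pi=1.0, seed=0)
ols_slope = cov(x, y) / cov(x, x)  # contaminated by the confounder
iv_slope = cov(z, y) / cov(z, x)   # uses only the Z-driven variation in X
print(f"OLS: {ols_slope:.3f}  IV: {iv_slope:.3f}  (true beta = 1.0)")
```

In this assumed setup the OLS slope converges to the true 1.0 plus Cov(X, ε) / Var(X) = 1/3, while the IV ratio centres on 1.0.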

The rest of IV is bookkeeping around this idea: how to find such Z in practice, how to estimate cleanly using it, what the resulting estimate actually identifies, and how to defend the choice when challenged.

Conditions on a valid instrument

A variable Z qualifies as an instrument if it satisfies four conditions. The first three are about consistency of the IV estimator; the fourth is about what the IV estimator means in causal terms — and the fourth was not part of the original Wright / Wald treatment, but came later from the modern Imbens & Angrist (1994) LATE framework.

Relevance

Z must be correlated with X: Cov(Z, X) ≠ 0. If Z does not move X, no variation in Y attributable to Z can be used to learn about the X→Y effect — the instrument is uninformative.

The standard diagnostic is the F-statistic from the first-stage regression of X on Z (and the controls W). The historical rule of thumb is F > 10, but it is now widely considered too lax — the modern weak-instruments literature is covered in the diagnostics section below.

The reason weak instruments are not just "imprecise" but actively dangerous is that under weak Z, IV's median bias can move toward OLS's bias rather than toward zero. Weak IV can be both more biased and less precise than OLS — sometimes staying with OLS is the better practical choice when no strong instrument is available.
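The pull toward OLS can be seen in a small Monte Carlo. This is a sketch under assumed values (sample size, first-stage coefficients, and the helper names `iv_estimate` and `median_iv` are hypothetical), using only the standard library. It reports the median IV estimate over repeated samples, since the mean of a just-identified IV estimator is not well behaved.

```python
import random
import statistics


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def iv_estimate(n, beta, pi, rng):
    """One simulated sample from a confounded DGP, returning Cov(Z, Y) / Cov(Z, X)."""
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                   # unobserved confounder
        zi = rng.gauss(0, 1)                  # valid instrument
        xi = pi * zi + u + rng.gauss(0, 1)    # pi controls instrument strength
        yi = beta * xi + u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return cov(z, y) / cov(z, x)


def median_iv(pi, reps=600, n=200, beta=1.0, seed=1):
    """Median IV estimate across repeated samples of size n."""
    rng = random.Random(seed)
    return statistics.median(iv_estimate(n, beta, pi, rng) for _ in range(reps))


strong = median_iv(pi=1.0)   # strong first stage
weak = median_iv(pi=0.02)    # nearly irrelevant instrument
print(f"median IV, strong instrument: {strong:.2f}")
print(f"median IV, weak instrument:   {weak:.2f}")
print("true beta = 1.0, population OLS slope is about 1.5 when pi = 0.02")
```

The instrument is valid in both runs; only its strength differs. With the weak instrument the median estimate sits far from the true value and near the OLS limit, which is the failure mode described above.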

Exclusion restriction

Z must affect Y only through X — no direct effect, no indirect path through other variables. The exclusion restriction itself is a causal assumption about the data-generating process, not a statistical one: it says Z has no path to Y other than via X. In the linear IV model, exclusion together with conditional independence implies the moment condition Cov(Z, ε | W) = 0, and that moment condition is what 2SLS actually exploits. In applied work the moment condition is often presented as if it were the exclusion restriction itself; strictly speaking the moment condition is the testable-looking consequence of the exclusion argument, not the assumption itself.

This is the most critical and the trickiest condition, and it cannot be tested formally by data alone. With a single instrument the data are silent on whether exclusion holds — it must be argued through institutional or theoretical reasoning. With more than one instrument an overidentification test (below) gives partial leverage, but it can only reject joint validity, not pinpoint which instrument is bad.

The classical violation is the instrument correlating with an omitted variable that itself affects Y. If "rainy days" are used as an instrument for voter turnout but rain also affects local economic activity (which separately affects voting), then rain is not a clean instrument. The same pattern shows up routinely in instrument choices that look clean on paper.

Open up a direct path from the instrument to the outcome with the exclusion leak λ slider — the violation the exclusion restriction forbids. The IV estimate drifts off the true β by precisely λ/π: the leak scaled by the instrument's strength, so a weaker instrument makes the same leak far worse. The point to sit with is that the first-stage F stays large the whole time — a strong instrument says nothing about exclusion, and with a single instrument there's no over-identification test. Only the argument for exclusion stands between you and a confidently wrong causal number.
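The λ/π drift follows from a short covariance calculation. The derivation below assumes a single instrument, no controls, a linear first stage with coefficient π, and a direct leak λZ in the outcome equation; the first-stage error v is notation introduced here for illustration.

```latex
\begin{aligned}
X &= \pi Z + v, \qquad Y = \beta X + \lambda Z + \varepsilon,
  \qquad \operatorname{Cov}(Z, v) = \operatorname{Cov}(Z, \varepsilon) = 0, \\
\operatorname{Cov}(Z, X) &= \pi \operatorname{Var}(Z), \\
\operatorname{Cov}(Z, Y) &= \beta \operatorname{Cov}(Z, X) + \lambda \operatorname{Var}(Z)
  = (\beta \pi + \lambda) \operatorname{Var}(Z), \\
\hat{\beta}_{IV} &\xrightarrow{\;p\;} \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, X)}
  = \beta + \frac{\lambda}{\pi}.
\end{aligned}
```

The first-stage coefficient π appears only in the denominator of the bias term, which is why halving the instrument's strength doubles the damage from the same leak.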

Independence

Z is independent of the structural error ε conditional on the controls W; in modern causal-inference notation, Z is independent of the potential outcomes Y(z, x) and the potential treatment X(z).

In simple settings this is implied by exclusion plus an as-if-randomised Z, and the two conditions are often discussed together. The modern framework states them separately because they are conceptually distinct: independence is about how Z was generated; exclusion is about how Z propagates through the model.

Monotonicity

Required for a causal interpretation of the IV estimator. Z must move X in only one direction across the population: either Z = 1 (versus Z = 0) increases X for some units and leaves it unchanged for the rest, or it decreases X for some and leaves it unchanged for the rest. There can be no "defiers" — units who would take a higher X at Z = 0 than at Z = 1.

Without monotonicity, the IV estimator is still consistent for something (a weighted contrast across units), but that something has no clean causal interpretation as the average effect on any well-defined subgroup. With monotonicity, IV identifies the Local Average Treatment Effect (LATE) — the average effect on the units whose X actually moves in response to Z. This is the modern interpretation of IV (Imbens & Angrist 1994), and the next section unpacks why it matters.

What IV actually identifies: LATE

In a world without effect heterogeneity — every unit has the same X→Y causal effect — IV identifies that effect, period. But real populations are heterogeneous, and the modern result is that IV identifies a specific local average:

Under relevance, exclusion, independence, and monotonicity, the IV estimator equals the Local Average Treatment Effect (LATE) — the average treatment effect on the compliers, the units whose treatment status responds to the instrument.

The other groups in the population do not contribute to the IV estimate, because they have no Z-induced variation in X for IV to use: always-takers have X = 1 regardless of Z, and never-takers have X = 0 regardless of Z. Defiers, the remaining logical type, are ruled out by monotonicity.

This matters in three concrete ways:

LATE is not ATE. The average effect over the whole population can differ substantially from LATE if the complier group is unusual. In Angrist & Krueger's schooling-and-quarter-of-birth design, the compliers are people whose schooling was actually shifted by the interaction of birth date with compulsory-schooling laws — a particular subset of the population, not a representative sample.
Different valid instruments can give different LATEs. If two instruments move different complier subpopulations, they estimate different causal quantities, even when both are valid IVs. This is the LATE framework working as designed, not a contradiction.
The complier group is partly observable through the first stage. In binary-treatment / binary-instrument settings, the first stage is closely related to the share of compliers in the population. More generally, it tells us how much treatment variation the instrument actually generates — a weak first stage signals a small or local complier set, and a LATE that may not generalise far beyond it.
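The three points above can be checked in a binary-instrument simulation. The sketch below is illustrative only: the type shares, the effect sizes, and the helper names `simulate` and `wald` are assumptions. It computes the Wald estimator (reduced form divided by first stage) and compares it with the population ATE implied by the assumed effects.

```python
import random

COMPLIER_EFFECT = 2.0  # hypothetical effect of X on Y for compliers
OTHER_EFFECT = 0.5     # hypothetical effect for always-takers and never-takers
SHARES = {"complier": 0.5, "always": 0.25, "never": 0.25}


def simulate(n, seed):
    """Draw (z, x, y) rows; no defiers, so monotonicity holds by construction."""
    rng = random.Random(seed)
    kinds = list(SHARES)
    weights = list(SHARES.values())
    rows = []
    for _ in range(n):
        kind = rng.choices(kinds, weights=weights)[0]
        z = rng.randint(0, 1)  # randomised binary instrument
        if kind == "complier":
            x = z              # treatment follows the instrument
        elif kind == "always":
            x = 1
        else:
            x = 0
        effect = COMPLIER_EFFECT if kind == "complier" else OTHER_EFFECT
        y = rng.gauss(0, 1) + effect * x
        rows.append((z, x, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def wald(rows):
    """Reduced form divided by first stage, both as differences in means by Z."""
    first_stage = mean(x for z, x, _ in rows if z == 1) - mean(x for z, x, _ in rows if z == 0)
    reduced_form = mean(y for z, _, y in rows if z == 1) - mean(y for z, _, y in rows if z == 0)
    return reduced_form / first_stage, first_stage


rows = simulate(n=40000, seed=2)
late_hat, first_stage = wald(rows)
ate = SHARES["complier"] * COMPLIER_EFFECT + (1 - SHARES["complier"]) * OTHER_EFFECT
print(f"Wald estimate: {late_hat:.2f}  (complier effect = {COMPLIER_EFFECT})")
print(f"First stage:   {first_stage:.2f}  (complier share = {SHARES['complier']})")
print(f"Population ATE implied by the assumed effects: {ate:.2f}")
```

Under these assumptions the Wald estimate tracks the complier effect rather than the ATE, and the first stage recovers the complier share.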

This perspective, that IV identifies a local effect on a specific sub-population, was central to the contributions recognised by the 2021 Nobel Prize. It is also why "does this study generalise?" is a sharper question for IV than for randomised experiments: in an RCT with full compliance every randomised unit is a complier, whereas an IV's complier group is a smaller subset whose members usually cannot be identified individually.

The population splits into compliers (whom the instrument moves), always-takers and never-takers (whom it doesn't), and — once you switch monotonicity off — defiers. IV's denominator is exactly the complier share, so the estimate is the complier effect (the LATE), and it is structurally blind to everyone else. Set the everyone-else effect away from τ_c and watch LATE separate from ATE: IV is still correct, just answering a question about a sub-population, not the whole one. Then uncheck monotonicity — defiers enter the first stage with the opposite sign, the estimate stops being any clean subgroup's effect, and it diverges as the defier share climbs toward the complier share.

The 2SLS procedure

The 2SLS estimator is named after its mechanics: two ordinary least squares regressions performed in sequence.

Stage 1

Regress the endogenous regressor X on the instruments Z and the exogenous controls W:

X = α + γZ + δW + u

Take the fitted values:

X̂ = α̂ + γ̂Z + δ̂W

X̂ is the part of X explained by Z and W. Under the instrument validity assumptions, this part of X is orthogonal to the structural error ε in the equation for Y — it is the exogenous component of X. The mechanical step of regressing X on Z does not by itself produce a clean X̂: a bad Z gives a contaminated X̂. The construction works only when Z is valid.

Stage 2

Regress Y on the fitted values X̂ (not the original X) plus the same exogenous controls:

Y = β₀ + β₁X̂ + β₂W + ε₂

The coefficient β̂₁ is the 2SLS estimator, consistent under valid Z. The crucial detail is that stage 2 uses X̂, not the original X — the endogenous variation has been discarded, only the exogenous part driven by Z and W survives, and that variation is what identifies the causal effect.
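The two stages can be written out by hand to see the mechanics. This is a minimal sketch, assuming a simulated dataset with one instrument Z and one control W; the coefficient values and the small `ols` helper (normal equations solved by Gauss-Jordan elimination, standard library only) are illustrative choices. It yields point estimates only, and in practice a dedicated IV routine is preferable.

```python
import random


def ols(columns, y):
    """Least-squares coefficients of y on the given columns via the normal equations."""
    k = len(columns)
    # Augmented matrix [X'X | X'y]
    m = [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(a * b for a, b in zip(columns[i], y))]
        for i in range(k)
    ]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


rng = random.Random(4)
n, beta = 5000, 1.0
ones, z, w, x, y = [1.0] * n, [], [], [], []
for _ in range(n):
    u = rng.gauss(0, 1)                              # unobserved confounder
    zi, wi = rng.gauss(0, 1), rng.gauss(0, 1)        # instrument and exogenous control
    xi = 1.0 * zi + 0.5 * wi + u + rng.gauss(0, 1)
    yi = beta * xi + 0.7 * wi + u + rng.gauss(0, 1)
    z.append(zi)
    w.append(wi)
    x.append(xi)
    y.append(yi)

# Stage 1: regress X on Z and W, keep the fitted values.
a_hat, g_hat, d_hat = ols([ones, z, w], x)
x_hat = [a_hat + g_hat * zi + d_hat * wi for zi, wi in zip(z, w)]

# Stage 2: regress Y on the fitted values and the same control.
_, beta_2sls, _ = ols([ones, x_hat, w], y)

# For comparison: OLS of Y on the original X and W.
_, beta_ols, _ = ols([ones, x, w], y)
print(f"2SLS: {beta_2sls:.3f}  OLS: {beta_ols:.3f}  (true beta = {beta})")
```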

Standard errors

Naive OLS standard errors from stage 2 are wrong and should not be reported. The second-stage OLS treats X̂ as if it were observed data and computes its residuals as Y minus the fit on X̂, whereas the structural residual ε is Y minus the fit on the original X. The naive residual variance is therefore not an estimate of Var(ε), and the resulting standard errors can be too small or too large depending on the sign of β₁ and on the correlation between the first-stage and structural errors. Standard econometric packages (linearmodels in Python, ivreg / AER in R, Stata's ivregress) implement the correction internally; doing 2SLS by hand as two separate OLS calls and reporting the second OLS's standard errors is one of the most common applied-IV mistakes.
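The gap between the naive and the corrected standard error can be sketched for the simplest case: one instrument, no controls, homoskedastic errors. The data-generating process and the helper name `two_sls_with_se` are assumptions for illustration. The only difference between the two formulas is which regressor enters the residuals: the naive version uses X̂, the corrected version uses the original X.

```python
import math
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate(n, beta, seed):
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                   # unobserved confounder
        zi = rng.gauss(0, 1)
        xi = zi + u + rng.gauss(0, 1)
        yi = beta * xi + u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def two_sls_with_se(z, x, y):
    """Return (beta_hat, naive SE, corrected SE) under homoskedasticity."""
    n = len(y)
    mean_z, mean_x, mean_y = sum(z) / n, sum(x) / n, sum(y) / n
    gamma = cov(z, x) / cov(z, z)                      # first-stage slope
    x_hat = [mean_x + gamma * (zi - mean_z) for zi in z]
    beta_hat = cov(x_hat, y) / cov(x_hat, x_hat)       # second-stage slope
    alpha_hat = mean_y - beta_hat * mean_x             # x_hat and x share the same mean
    ss_x_hat = sum((xh - mean_x) ** 2 for xh in x_hat)
    # Naive: residuals computed from the fitted regressor, as a second OLS call would do.
    rss_naive = sum((yi - alpha_hat - beta_hat * xh) ** 2 for yi, xh in zip(y, x_hat))
    # Corrected: residuals computed from the original regressor.
    rss_correct = sum((yi - alpha_hat - beta_hat * xi) ** 2 for yi, xi in zip(y, x))
    se_naive = math.sqrt(rss_naive / (n - 2) / ss_x_hat)
    se_correct = math.sqrt(rss_correct / (n - 2) / ss_x_hat)
    return beta_hat, se_naive, se_correct


_, naive_pos, correct_pos = two_sls_with_se(*simulate(n=20000, beta=1.0, seed=5))
_, naive_neg, correct_neg = two_sls_with_se(*simulate(n=20000, beta=-0.5, seed=5))
print(f"beta =  1.0: naive SE / corrected SE = {naive_pos / correct_pos:.2f}")
print(f"beta = -0.5: naive SE / corrected SE = {naive_neg / correct_neg:.2f}")
```

In this simulated setup the naive standard error overstates the uncertainty when the true slope is 1.0 and understates it when the true slope is -0.5, so the naive number is unreliable in either direction.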

Why this works

X has, conceptually, two components: an exogenous part driven by Z and W, and an endogenous part driven by factors that also affect ε. Stage 1 extracts the exogenous part as X̂; stage 2 uses only that, effectively discarding the endogenous contamination.

The price is efficiency: only part of the variation in X is used, so 2SLS has wider confidence intervals than OLS. The benefit is consistency — for causal inference that trade-off is usually worth taking, but only when the instruments are strong enough that "wider CIs" does not collapse into "useless".

The left panel is stage 1 — X regressed on the instrument Z, giving the fitted X̂ that strips out the confounder. The right panel is the structural (X, y) plane: the OLS slope is dragged off the true β = 1 by the hidden confounder, while the 2SLS slope — built only from X̂ — lands on it. Pull the instrument π slider toward zero and watch the first-stage F collapse: the 2SLS estimate starts swinging wildly on every resample, the weak-instrument failure mode where IV becomes both more biased and noisier than OLS.

In linear models with a single endogenous regressor, the control function approach gives numerically identical point estimates to 2SLS, and is sometimes a more flexible framing for nonlinear extensions. The two are different parameterisations of the same identification idea, not competing methods.

Choosing instruments in practice

Instrument choice is a theoretical exercise, not an algorithmic one. Two illustrative examples follow, both from the pricing setting, where the main endogeneity is simultaneity (prices are set with expected demand in mind); see pricing elasticity for the full worked version.

Example 1: lagged competitor prices

competitor_price_lagged — the prices charged by a competitor in the previous period (yesterday's, last week's). The relevance argument is straightforward: revenue managers react to competitors, so the competitor's lagged price moves our current price. The exclusion argument is that our customers see only our price, not the competitor's lag — so the lagged competitor price affects our demand only through our own pricing reaction.

Two concerns commonly bite:

Common demand shocks. A region-wide event (holiday, weather, news) can move both competitors' prices and our demand. If the competitor reacted to the same shock that is moving our customers, the competitor's price is correlated with our demand through that common cause, breaking exclusion. Mitigation: control for observable demand drivers (holiday flags, event indicators) and rely on residual variation in the instrument.
Persistence in demand shocks. More subtle. If demand shocks are autocorrelated — and they almost always are — then yesterday's competitor price can still be correlated with today's demand through the persistent component of yesterday's shock that is still alive today. A one-period lag is rarely long enough to break this; longer lags break the persistence link but also weaken relevance, which moves the instrument back into the weak-IV danger zone.

Lagged competitor prices are a plausible instrument that can survive scrutiny, but only with explicit work on both channels.
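The persistence channel can be illustrated with a toy time series. Everything in the sketch below is assumed for illustration: the AR(1) demand shock, the pricing rules, the true elasticity of -1.0, and the helper names `simulate_pricing` and `iv_slope`. The instrument is the competitor's previous-period price.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate_pricing(n, rho, beta, seed, burn_in=200):
    """Return (lagged competitor price, our price, our quantity) per period."""
    rng = random.Random(seed)
    demand_prev, competitor_prev = 0.0, 0.0
    z, price, quantity = [], [], []
    for t in range(n + burn_in):
        demand = rho * demand_prev + rng.gauss(0, 1)   # demand shock with persistence rho
        competitor = demand + rng.gauss(0, 1)          # competitor reacts to demand and own costs
        # We react to the competitor's lagged price and to current demand (simultaneity).
        p = 0.5 * competitor_prev + demand + rng.gauss(0, 1)
        q = beta * p + demand + rng.gauss(0, 1)
        if t >= burn_in:                               # drop the start-up periods
            z.append(competitor_prev)
            price.append(p)
            quantity.append(q)
        demand_prev, competitor_prev = demand, competitor
    return z, price, quantity


def iv_slope(rho):
    z, price, quantity = simulate_pricing(n=20000, rho=rho, beta=-1.0, seed=3)
    return cov(z, quantity) / cov(z, price)


no_persistence = iv_slope(rho=0.0)
persistent = iv_slope(rho=0.8)
print(f"IV slope, rho = 0.0: {no_persistence:.2f}  (true beta = -1.0)")
print(f"IV slope, rho = 0.8: {persistent:.2f}")
```

With independent demand shocks the lagged competitor price is a clean instrument in this toy model. With persistent shocks it carries yesterday's demand into today's error term, and the IV slope is pulled away from the true elasticity.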

Example 2: capacity constraints

A near-full-capacity indicator. Relevance: yield-management discipline says revenue managers raise prices as capacity fills up, so capacity-near-full directly moves price. Exclusion would require that capacity affects the customer-facing booking decision only through the price the customer is quoted.

Capacity-based instruments are tempting in pricing but should be approached carefully — several mechanisms can break exclusion in ways that are easy to miss:

Mechanical capping of observed sales. When capacity is nearly exhausted, quantity sold is constrained by inventory rather than by the demand curve. The IV's variation in price is then correlated with a hard cap on the outcome, not with movement along the demand curve.
Scarcity signalling. "Only 2 rooms left" UI affects conversion separately from price; if scarcity messages are correlated even partially with the capacity instrument, exclusion fails.
Capacity itself endogenous to anticipated demand. If managers open more inventory when expecting high demand, capacity is itself driven by demand expectations, putting simultaneity back one level up.

Capacity is plausible as an instrument only when none of these channels are active — when it affects the customer-facing decision variable (price) without directly limiting or stimulating the measured outcome (booking probability or quantity). That is a stronger condition than the textbook framing "capacity moves price, customer doesn't see capacity, done" suggests, and it usually requires explicit checks against each channel rather than a one-line argument.

Diagnostic tests

A proper IV analysis is never just "I ran 2SLS and got a result". The diagnostic tests below check the assumptions in the data; together with theoretical arguments about exclusion and monotonicity, they are how IV gets defended.

Weak-instruments diagnostic

The diagnostic is the first-stage F-statistic for the joint significance of the instruments. The Stock & Yogo (2005) tables gave critical values for various tolerable bias levels and instrument counts; "F > 10" is the most-cited threshold but is now widely considered too lax. Modern recommendations (Montiel Olea & Pflueger 2013, Lee–McCrary–Moreira–Porter 2022) require substantially larger F for weak-IV-robust inference: F values into the dozens or hundreds, depending on the heteroscedasticity structure and the desired coverage. When the first-stage strength is borderline, reporting weak-IV-robust confidence intervals (Anderson–Rubin, conditional likelihood ratio) is good practice.
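For a single instrument with no controls, the classical first-stage F is the squared t-statistic on the instrument. The sketch below computes that homoskedastic version only; it is not one of the robust statistics cited above, and the data-generating process and the helper name `first_stage_f` are assumptions for illustration.

```python
import math
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def first_stage_f(z, x):
    """Classical F for H0: gamma = 0 in X = alpha + gamma * Z + u (equals t squared)."""
    n = len(x)
    mean_z, mean_x = sum(z) / n, sum(x) / n
    gamma = cov(z, x) / cov(z, z)
    rss = sum((xi - mean_x - gamma * (zi - mean_z)) ** 2 for zi, xi in zip(z, x))
    ss_z = sum((zi - mean_z) ** 2 for zi in z)
    se_gamma = math.sqrt(rss / (n - 2) / ss_z)
    return (gamma / se_gamma) ** 2


def simulate_first_stage(n, pi, seed):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [pi * zi + rng.gauss(0, 1) + rng.gauss(0, 1) for zi in z]  # two noise terms, variance 2
    return z, x


f_strong = first_stage_f(*simulate_first_stage(n=2000, pi=1.0, seed=6))
f_weak = first_stage_f(*simulate_first_stage(n=2000, pi=0.01, seed=6))
print(f"first-stage F, pi = 1.0:  {f_strong:.1f}")
print(f"first-stage F, pi = 0.01: {f_weak:.1f}")
```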

Sargan-Hansen J-test (overidentification)

When there are more instruments than endogenous regressors, the J-test asks whether the instruments give consistent estimates: under the null that all are valid, and under a constant treatment effect, different subsets should produce the same answer up to sampling error. A significant J-statistic rejects that joint null and suggests at least one instrument is invalid, but the test cannot identify which one. Because different valid instruments can identify different LATEs when effects are heterogeneous, a rejection can also reflect effect heterogeneity rather than a bad instrument. With one instrument per endogenous regressor (the just-identified case) the test is not applicable.
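The homoskedastic (Sargan) form of the statistic is n times the R² from regressing the 2SLS residuals on the instruments. The sketch below assumes two instruments, one endogenous regressor, and a constant treatment effect; the leak size, the data-generating process, and the helpers `ols` and `sargan_j` are illustrative. Under the null the statistic is approximately chi-squared with one degree of freedom, for which the 5% critical value is about 3.84.

```python
import random


def ols(columns, y):
    """Least-squares coefficients of y on the given columns via the normal equations."""
    k = len(columns)
    m = [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(a * b for a, b in zip(columns[i], y))]
        for i in range(k)
    ]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def sargan_j(z1, z2, x, y):
    n = len(y)
    ones = [1.0] * n
    a, g1, g2 = ols([ones, z1, z2], x)                       # stage 1
    x_hat = [a + g1 * u + g2 * v for u, v in zip(z1, z2)]
    b0, b1 = ols([ones, x_hat], y)                           # stage 2
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]      # residuals use the original X
    c0, c1, c2 = ols([ones, z1, z2], resid)                  # auxiliary regression
    fitted = [c0 + c1 * u + c2 * v for u, v in zip(z1, z2)]
    mean_r = sum(resid) / n
    explained = sum((f - mean_r) ** 2 for f in fitted)
    total = sum((r - mean_r) ** 2 for r in resid)
    return n * explained / total                             # n * R^2


def simulate(n, leak, seed):
    """leak is a hypothetical direct effect of the second instrument on Y."""
    rng = random.Random(seed)
    z1, z2, x, y = [], [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        xi = a + b + u + rng.gauss(0, 1)
        yi = 1.0 * xi + leak * b + u + rng.gauss(0, 1)
        z1.append(a)
        z2.append(b)
        x.append(xi)
        y.append(yi)
    return z1, z2, x, y


j_valid = sargan_j(*simulate(n=5000, leak=0.0, seed=7))
j_invalid = sargan_j(*simulate(n=5000, leak=0.5, seed=7))
print(f"J with both instruments valid: {j_valid:.2f}")
print(f"J with a leaking instrument:   {j_invalid:.2f}")
```

The statistic flags the leak but, as noted above, gives no indication of which of the two instruments is responsible.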

Endogeneity test (Hausman / Wu)

The test compares OLS and IV estimates, taking the validity of the instruments as given: if they do not differ significantly, OLS may be sufficient and is more efficient; if they differ significantly, endogeneity is present and IV is needed. It is useful for justifying IV to a sceptical reader, with the same caveat as in the endogeneity note: once you have a credible IV, the main work is the identification argument behind it; the Hausman test is secondary to that design decision.
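In the single-regressor case the Hausman statistic is the squared gap between the two estimates divided by the difference of their variances. The sketch below assumes homoskedastic errors and uses the IV residual variance for both variance terms, one common variant that keeps the denominator positive; the data-generating process and the helper name `hausman` are assumptions. Under the null of exogeneity the statistic is approximately chi-squared with one degree of freedom.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def hausman(z, x, y):
    n = len(y)
    mean_z, mean_x, mean_y = sum(z) / n, sum(x) / n, sum(y) / n
    b_ols = cov(x, y) / cov(x, x)
    b_iv = cov(z, y) / cov(z, x)
    gamma = cov(z, x) / cov(z, z)                         # first-stage slope
    # Error variance estimated from the IV residuals (consistent under both hypotheses).
    sigma2 = sum((yi - mean_y - b_iv * (xi - mean_x)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_x_hat = gamma ** 2 * sum((zi - mean_z) ** 2 for zi in z)
    var_ols = sigma2 / ss_x
    var_iv = sigma2 / ss_x_hat                            # ss_x_hat <= ss_x, so var_iv >= var_ols
    return (b_iv - b_ols) ** 2 / (var_iv - var_ols)


def simulate(n, confounded, seed):
    """confounded=True lets the unobserved u enter X as well as Y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        zi = rng.gauss(0, 1)
        xi = zi + (u if confounded else 0.0) + rng.gauss(0, 1)
        yi = 1.0 * xi + u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


h_endogenous = hausman(*simulate(n=5000, confounded=True, seed=8))
h_exogenous = hausman(*simulate(n=5000, confounded=False, seed=8))
print(f"Hausman statistic, X endogenous: {h_endogenous:.1f}")
print(f"Hausman statistic, X exogenous:  {h_exogenous:.1f}")
```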

When IV pays for itself in practice

IV carries a substantial cost: finding a credible instrument takes real subject-matter work, the resulting estimates have wider confidence intervals than OLS, and the LATE interpretation forces explicit thinking about which sub-population the estimate generalises to. The trade is usually worth it in three characteristic situations:

The decision is based on the elasticity, not the prediction. Pricing optimisation, dose-response, capacity sizing — situations where the causal slope is what enters the optimisation. A biased OLS slope translates directly into wrong optimal decisions, and the cost of bias is measured in the same units as the business KPI. The pricing-elasticity note is the worked-out version.
A credible exogenous nudge already exists in the data. Discontinuities in eligibility (cutoffs, deadlines, lottery quotas), policy changes that affect different units differently, intent-to-treat experiments with imperfect compliance — all natural sources of valid Z. When the institutional structure of the data hands you an instrument, the cost of IV drops sharply.
A naive A/B test is impossible or unethical. You cannot randomise training, education, geography, or many other treatment-of-interest variables. IV is often the only option for clean causal inference in these cases.

When none of these apply — when the decision needs only predictive accuracy, when no credible instrument is available, or when randomised experimentation is feasible — IV is overkill. The right tools are then prediction-grade ML or RCTs, not 2SLS forced into a problem it does not fit.

The dual to this is that a non-trivial fraction of "IV studies" in the wild use weak or implausible instruments and would be more honest as descriptive correlational work. A credible IV is a substantial achievement and deserves to be treated as such; "we ran 2SLS so we have causal identification" is not, on its own, a defence.

Historical development

The IV idea is almost a century old, but its mainstream status in causal inference is much more recent.

Philip Wright (1928) — first application of instrumental variables, in the context of identifying supply and demand from observational price-quantity data.
Angrist & Krueger (1991) — used "quarter of birth" as an instrument for years of schooling, exploiting the interaction between birth date and compulsory-schooling laws. A landmark example of creative instrument choice and one of the foundational papers of the "credibility revolution" in empirical economics.
Imbens & Angrist (1994) — formalised the LATE framework: under monotonicity, the IV estimator identifies the average treatment effect on compliers, not the overall ATE. The modern interpretation of IV builds on this.
2021 Nobel Prize in Economics — awarded to Joshua Angrist, Guido Imbens, and David Card for contributions to empirical methods in causal inference, including IV-based research designs and the LATE framework. Their work shifted empirical economics from "structural" to "design-based" approaches and turned IV from an obscure technique into a mainstream tool.