Chapter 3 of 7
Regression Discontinuity
Created Apr 28, 2026 Updated Jun 7, 2026
Suppose your product enforces some rule-based gate. Users with score ≥ 80 get a discount, customers with order value ≥ $100 get free shipping, students above a threshold get a scholarship. You want to know what the gate actually does — does the discount drive retention, does free shipping drive larger orders, does the scholarship raise college completion?
Comparing users above versus below the cutoff is hopeless: those who score 95 are different from those who score 25 in many ways unrelated to the gate. But the comparison is much cleaner near the cutoff. A user with score 79 and a user with score 81 may still differ a little, but if everything else about users varies smoothly with the score, there is no reason for their other characteristics to jump exactly at 80; the only thing that changes discontinuously at that point is the treatment rule itself. A difference in their outcomes can therefore plausibly be attributed to the gate.
Regression discontinuity formalises that intuition. If treatment is assigned by a threshold on some running variable (score, age, eligibility metric, distance, time), and people cannot precisely manipulate which side of the threshold they fall on, then units just above and just below the cutoff are as if randomised. RD uses this local-experiment structure to estimate the causal effect of the treatment at the threshold.
The cost is that the answer is genuinely local — RD speaks only about the units near the cutoff, not the whole population. The benefit is that, when the design holds, the causal interpretation is unusually clean: it does not depend on a model of the rest of the world, only on the assumption that everything except the treatment is continuous at the cutoff.
The RD idea
The core observation: when a treatment is assigned by X ≥ cutoff, units just above and just below the threshold are nearly identical in everything that matters, but receive completely different treatment. A scholarship awarded at score 80 splits a pair of 79/81 students who differ only in random test-day variation, and yet only one gets the scholarship. Their outcomes (graduation, earnings) can be compared, and the difference is attributable to the scholarship.
The observation is local, not global. Students with scores 50 and 80 are very different in many ways, and RD says nothing about the average treatment effect across that gap. It says something defensible about the effect right at the cutoff — and only there.
Formal model
The standard RD specification:
Y = α + β × D + f(X) + ε
where:
D = 1 if X ≥ c, otherwise 0 (treatment indicator)
X = running variable
c = cutoff
f(X) = smooth function of X
The outcome Y is modelled as a smooth function of the running variable X plus a jump of size β at the cutoff. The coefficient β is the treatment effect at the threshold — the part of the discontinuity in Y that cannot be explained by f(X) alone. In practice f(X) is usually allowed to have different slopes on the two sides of the cutoff (the local-linear specification with D × (X − c) interaction shown in the estimation section below); the single-f form here is just for clarity.
The identifying assumption, formalised by Hahn, Todd & van der Klaauw (2001), is continuity of potential outcomes at the cutoff: both E[Y(0) | X] and E[Y(1) | X] are continuous functions of X at X = c. This is the formal version of "everything except the treatment is continuous through the threshold". If some other factor jumps at the cutoff that is not caused by the treatment, RD attributes that jump to β and gives the wrong answer.
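The continuity assumption can be written out to show why the jump in observed means identifies the effect at the cutoff. The following is a short sketch for the sharp design; the label τ_RD is introduced here for the estimand and corresponds to β in the specification above.

```latex
\begin{aligned}
\tau_{\mathrm{RD}}
  &= \lim_{x \downarrow c} E[Y \mid X = x] \;-\; \lim_{x \uparrow c} E[Y \mid X = x] \\
  &= \lim_{x \downarrow c} E[Y(1) \mid X = x] \;-\; \lim_{x \uparrow c} E[Y(0) \mid X = x] \\
  &= E[Y(1) \mid X = c] \;-\; E[Y(0) \mid X = c] \\
  &= E[Y(1) - Y(0) \mid X = c]
\end{aligned}
```

The second line uses the sharp assignment rule (units above c reveal Y(1), units below reveal Y(0)); the third line is where continuity of both conditional means at c is needed.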
The whole method lives in one picture. Fit a line to the points just below the cutoff and another to the points just above, extend both to the threshold, and the vertical gap where they meet is the RD estimate β̂ — the part of the discontinuity that the smooth trend f(X) can't explain. Push the trend slope hard in either direction: a steeply sloped f(X) doesn't fool the estimate, because RD only ever compares the two sides at the cutoff. The bold dots are binned means, the standard rdplot visual.
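The picture can be reproduced numerically. The sketch below simulates data from the specification above with a linear f(X), then compares a naive above-versus-below difference in means with the gap between two fitted lines at the cutoff. All numbers (cutoff 80, true jump 2, slope 0.05, window of 10 points) are illustrative assumptions, and the helper names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rd_gap(x, y, cutoff, h):
    """Gap between the two fitted lines at the cutoff, using points within h of it."""
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - h <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff <= xi < cutoff + h]
    a_left, _ = fit_line(*zip(*left))    # intercept = fitted value at the cutoff
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left

# Simulated data: Y = 0.05 * X + 2 * D + noise, with D = 1 if X >= 80.
rng = random.Random(0)
CUTOFF, TRUE_BETA = 80.0, 2.0
x = [rng.uniform(0, 100) for _ in range(20000)]
y = [0.05 * xi + TRUE_BETA * (xi >= CUTOFF) + rng.gauss(0, 1) for xi in x]

above = [yi for xi, yi in zip(x, y) if xi >= CUTOFF]
below = [yi for xi, yi in zip(x, y) if xi < CUTOFF]
naive = sum(above) / len(above) - sum(below) / len(below)
beta_hat = rd_gap(x, y, CUTOFF, h=10.0)

print(f"naive difference in means: {naive:.2f}")
print(f"RD gap at the cutoff:      {beta_hat:.2f}  (true jump {TRUE_BETA})")
```

The naive difference absorbs the trend in X because the two groups sit at very different average scores; the gap at the cutoff does not, which is the point of the design.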
Identification
The RD design works because of a specific identification argument: even when units are not randomly assigned overall, those near the cutoff are as if randomly assigned to either side, provided they cannot precisely manipulate their own running-variable value.
Lee (2008) made this formal: if individuals have any noise in the realised value of the running variable (test-day variance, measurement error, slight randomness in the bureaucratic process), and they cannot precisely control which side of the threshold they end up on, then conditional on being near the cutoff, treatment behaves as if it were locally randomised in a limiting sense. Continuity of pre-treatment characteristics at the cutoff is then an implication that should follow approximately and that we can partly check in the data — not a separate assumption layered on top.
This connects RD to the LATE framework from IV: RD identifies a local average treatment effect at the cutoff — the average treatment effect for the population of units who happen to be near the threshold. It is not generally the average treatment effect across the whole population. A student with a score of 50 may have a very different treatment effect than one at the cutoff, and RD says nothing about the former. The local-effect caveat is the main limitation of RD generalisability and is the right thing to keep in mind whenever an RD result is being used to justify a policy that affects units far from the original cutoff.
Two types of RD
Sharp RD
Treatment assignment is a deterministic function of the running variable: every unit with X ≥ c receives treatment, every unit with X < c does not. Estimation reduces to fitting the discontinuity directly — local linear regression on either side of the cutoff with the treatment dummy D.
Classical examples are a strictly applied scholarship rule, mandatory enrolment by age cutoff, or any other deterministic eligibility rule.
Fuzzy RD
The probability of receiving treatment jumps at the cutoff but does not go from 0 to 1. Some eligible students decline the scholarship, some technically-ineligible students still get it through other paths, and the treatment probability might jump from, say, 0.3 below the cutoff to 0.8 above it.
Fuzzy RD is estimated by instrumental variables, with the cutoff dummy D used as an instrument for actual treatment receipt. The resulting estimate is a LATE on the compliers near the cutoff — the average treatment effect on the units whose treatment status was actually moved by being on one side of the threshold versus the other. Standard errors and weak-instrument concerns from the IV note carry over directly to fuzzy RD.
Two panels: the first stage is the jump in treatment probability at the cutoff, the reduced form is the jump in the outcome, and the RD estimate is their ratio. In sharp RD the probability snaps from 0 to 1, the denominator is 1, and the outcome jump simply is the effect. Switch to fuzzy and only some units comply, so you divide the outcome jump by a partial probability jump — exactly IV with the cutoff dummy as instrument. Shrink the gap between P below and P above and the estimate goes noisy: a small probability jump is a weak first stage, with all the weak-instrument fragility that implies.
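The ratio described above can be sketched on simulated data. The example assumes a running variable already centred at the cutoff, treatment probabilities of 0.3 below and 0.8 above (the figures used earlier), and a constant treatment effect of 2 so that the complier effect is known. The helper names are hypothetical, and the sketch does not compute standard errors.

```python
import random

def boundary_fit(us, vs, h):
    """Triangular-kernel local-linear fit of v on u; returns the fitted value at u = 0."""
    ws = [1.0 - abs(u) / h for u in us]
    sw = sum(ws)
    mu = sum(w * u for w, u in zip(ws, us)) / sw
    mv = sum(w * v for w, v in zip(ws, vs)) / sw
    suu = sum(w * (u - mu) ** 2 for w, u in zip(ws, us))
    suv = sum(w * (u - mu) * (v - mv) for w, u, v in zip(ws, us, vs))
    return mv - (suv / suu) * mu

def jump_at_zero(u, v, h):
    """Right limit minus left limit of E[v | u] at u = 0, within bandwidth h."""
    lu = [ui for ui in u if -h < ui < 0]
    lv = [vi for ui, vi in zip(u, v) if -h < ui < 0]
    ru = [ui for ui in u if 0 <= ui < h]
    rv = [vi for ui, vi in zip(u, v) if 0 <= ui < h]
    return boundary_fit(ru, rv, h) - boundary_fit(lu, lv, h)

# Simulated fuzzy design (all numbers are illustrative assumptions).
rng = random.Random(1)
TRUE_EFFECT = 2.0
u, t, y = [], [], []
for _ in range(40000):
    ui = rng.uniform(-1, 1)                      # running variable, centred at the cutoff
    p_treat = 0.8 if ui >= 0 else 0.3            # take-up jumps by 0.5 at the cutoff
    ti = 1.0 if rng.random() < p_treat else 0.0  # actual treatment receipt
    u.append(ui)
    t.append(ti)
    y.append(1.0 + 0.5 * ui + TRUE_EFFECT * ti + rng.gauss(0, 1))

first_stage = jump_at_zero(u, t, h=0.5)    # jump in treatment probability
reduced_form = jump_at_zero(u, y, h=0.5)   # jump in the outcome
fuzzy_rd = reduced_form / first_stage      # ratio of the two jumps

print(f"first stage {first_stage:.2f}, reduced form {reduced_form:.2f}, ratio {fuzzy_rd:.2f}")
```

Because the estimate is a ratio, any noise in the reduced form is divided by the first-stage jump, so shrinking the gap between the two take-up probabilities inflates the variance of the result.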
Choosing the bandwidth and the local polynomial
Two practical choices dominate the implementation of RD: the bandwidth (how wide a window around the cutoff to use) and the order of the local polynomial (how to model the smooth f(X) inside that window).
Bandwidth
RD estimates are typically run on data within a window [c − h, c + h]. The bandwidth h trades off bias and variance. A narrow h keeps observations close to the cutoff and more comparable, but there are fewer of them, so estimates are noisy. A wide h brings in more observations and lower variance, but units further from the cutoff may not be comparable, biasing the estimate.
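A small Monte Carlo makes the trade-off concrete. The sketch below assumes a curved trend f(X) = 1.5 × X³ with a true jump of 1, fits an unweighted line on each side, and repeats the exercise over 100 simulated samples for three hand-picked bandwidths. It is not a bandwidth selector; the function names and all numbers are assumptions made for illustration.

```python
import random
import statistics

def line_intercept(us, ys):
    """OLS intercept of y on u, i.e. the fitted value at u = 0."""
    n = len(us)
    mu = sum(us) / n
    my = sum(ys) / n
    suu = sum((u - mu) ** 2 for u in us)
    suy = sum((u - mu) * (y - my) for u, y in zip(us, ys))
    return my - (suy / suu) * mu

def rd_uniform(u, y, h):
    """Local-linear RD estimate with a uniform kernel on [-h, h]."""
    left = [(ui, yi) for ui, yi in zip(u, y) if -h <= ui < 0]
    right = [(ui, yi) for ui, yi in zip(u, y) if 0 <= ui <= h]
    lu, ly = zip(*left)
    ru, ry = zip(*right)
    return line_intercept(ru, ry) - line_intercept(lu, ly)

rng = random.Random(3)
TRUE_BETA = 1.0
BANDWIDTHS = [0.1, 0.3, 1.0]
draws = {h: [] for h in BANDWIDTHS}

for _ in range(100):                               # 100 simulated samples
    u = [rng.uniform(-1, 1) for _ in range(1000)]  # running variable centred at 0
    y = [1.5 * ui ** 3 + TRUE_BETA * (ui >= 0) + rng.gauss(0, 1) for ui in u]
    for h in BANDWIDTHS:
        draws[h].append(rd_uniform(u, y, h))

# Mean and standard deviation of the estimate across samples, per bandwidth.
summary = {h: (statistics.mean(d), statistics.stdev(d)) for h, d in draws.items()}
for h, (mean_est, sd_est) in summary.items():
    print(f"h = {h:<4} mean estimate {mean_est:5.2f}   sd across samples {sd_est:.2f}")
```

With this trend the widest window should centre well away from the true jump while varying little from sample to sample, and the narrowest window should centre near the truth but vary a lot.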
The modern standard is Calonico–Cattaneo–Titiunik (CCT, 2014) robust bandwidth selection, which chooses h to minimise asymptotic mean squared error and reports robust bias-corrected confidence intervals, which correct for the smoothing bias that an MSE-optimal bandwidth leaves in the point estimate. The rdrobust package (R, Stata, Python) implements CCT directly and is the practical entry point for most applied RD work in 2026.
Local polynomial order
Inside the bandwidth window, the smooth function f(X) is approximated by a local polynomial — usually local linear, sometimes local quadratic, almost never higher order.
Gelman & Imbens (2019) argued strongly against high-order polynomials in RD: cubic or quartic fits are sensitive to data points far from the cutoff and produce noisy treatment-effect estimates that depend heavily on functional-form choices made by the analyst. Local linear or local quadratic with bandwidth-selected windows is the mainstream recommendation, and the rdrobust defaults reflect this.
Local linear regression with kernel
The standard estimating equation, with the running variable centred at the cutoff:
Y = α + β × D + γ × (X − c) + δ × D × (X − c) + ε
The intercept α is the expected outcome just below the cutoff, and β, the coefficient on the treatment indicator, is the level shift at the cutoff; γ is the slope on the running variable below the cutoff and δ is the change in that slope above it. Observations are weighted by a kernel: the triangular kernel (linear decay from 1 at the cutoff to 0 at the bandwidth boundary) is the most common choice; uniform kernels are also used, and CCT recommend specific kernel-bandwidth combinations. The coefficient β̂ is the RD estimate of the treatment effect at the cutoff.
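A minimal sketch of this estimator, using only the standard library. Because the specification is fully interacted with D, the weighted regression is equivalent to fitting one kernel-weighted line on each side of the cutoff: α and γ come from the left fit, and β and δ are the differences in intercept and slope between the two fits. The function names, the simulated coefficients, and the bandwidth are assumptions chosen for illustration; no standard errors or bias correction are computed, which is what a package such as rdrobust adds.

```python
import random

def wls_line(xs, ys, ws):
    """Weighted least-squares fit y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def local_linear_rd(x, y, cutoff, h):
    """Return (alpha, beta, gamma, delta) of the kernel-weighted interacted fit."""
    sides = {False: ([], [], []), True: ([], [], [])}
    for xi, yi in zip(x, y):
        u = xi - cutoff                 # centre the running variable at the cutoff
        w = 1.0 - abs(u) / h            # triangular kernel: 1 at the cutoff, 0 at distance h
        if w > 0:
            us, ys, ws = sides[u >= 0]
            us.append(u)
            ys.append(yi)
            ws.append(w)
    a0, b0 = wls_line(*sides[False])    # below the cutoff
    a1, b1 = wls_line(*sides[True])     # at or above the cutoff
    return a0, a1 - a0, b0, b1 - b0

# Simulated data with alpha = 1, beta = 2, gamma = 0.05, delta = 0.1 (assumed values).
rng = random.Random(4)
CUTOFF = 80.0
x = [rng.uniform(0, 100) for _ in range(20000)]
y = []
for xi in x:
    d = 1.0 if xi >= CUTOFF else 0.0
    u = xi - CUTOFF
    y.append(1.0 + 2.0 * d + 0.05 * u + 0.1 * d * u + rng.gauss(0, 1))

alpha, beta, gamma, delta = local_linear_rd(x, y, CUTOFF, h=10.0)
print(f"alpha {alpha:.2f}  beta {beta:.2f}  gamma {gamma:.3f}  delta {delta:.3f}")
```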
The trend f(X) here is genuinely curved, which is what makes these two knobs matter. Widen the bandwidth with a local-linear fit and the bend leaks into β̂ as bias; narrow it and the bias vanishes but the estimate leans on few points and gets noisy (resample to feel the variance) — the bias–variance trade-off in one slider. A local-quadratic absorbs the curvature at a wider window; a cubic is flexible enough to chase noise and wobble on every resample, which is precisely the Gelman–Imbens case against high-order polynomials. Points outside the bandwidth are greyed out and unused; the rest are triangular-kernel weighted.
A cautionary practical example: time-based RD
In applied work, RD designs are sometimes proposed where the running variable is time — for example, "the day a price changed" or "the moment a feature launched". These designs are tempting because rule-based time cutoffs are easy to find in operational data, but they are noticeably more fragile than the textbook RD setting and are worth being explicit about.
The textbook RD case has a fixed running variable (test score, age, eligibility metric) and a discrete treatment cutoff applied to that variable. Time-based RD has time as the running variable, and the treatment is whatever event happened at t = 0. The continuity assumption — that everything other than the treatment is smooth at the cutoff — is dramatically harder to defend when the running variable is time, because many things change with time independently of the event being studied:
- Day-of-week, week-of-month, holiday, and seasonality effects.
- Demand shocks correlated with the timing of the event.
- Concurrent changes (the price change might coincide with a marketing push, a UI change, or a competitor move).
- Anticipation effects (if the event was even partially predictable, behaviour shifts before t = 0 rather than at it, smearing the discontinuity).
Time-based "RD" is much closer to an event study than to RD in the formal sense. It can still be useful — observing what happens around an unanticipated event is informative — but the causal claim is only as strong as the assumption that nothing else changed at the same time, and that assumption almost always needs explicit defence rather than the one-line RD argument.
A time-based design is plausible only when the event is genuinely unanticipated, other concurrent changes can be ruled out through institutional knowledge or controls, and the running variable (time) is fine-grained enough that day-of-week and seasonality patterns can be absorbed. The cleanest RD designs are still based on truly fixed cutoffs in non-time running variables — test-score thresholds, age cutoffs, rainfall thresholds for relief programmes, vote shares around close-election margins, lottery cutoffs. When such a design exists in the data, it is far more defensible than a time-based one. See the pricing-elasticity note for how time-based RD is sometimes used (carefully) in pricing problems.
Diagnostic tests
A proper RD analysis is never just "I fit a local linear regression at the cutoff and got a number". Two diagnostic checks are standard, and in serious applied RD work both are expected.
McCrary density test
The McCrary (2008) test checks whether the density of the running variable is continuous at the cutoff. Under valid RD, units cannot precisely manipulate which side of the threshold they fall on, so the density should be smooth across the cutoff. A statistically significant discontinuity in the density — typically an excess of observations just on the favourable side — suggests manipulation: people gaming the system to land where the treatment is more attractive. Manipulation breaks the as-if-random argument and invalidates the RD design.
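The implementation most applied work uses is rddensity (paired with rdrobust), which runs a closely related manipulation test based on local-polynomial density estimation (Cattaneo, Jansson & Ma) rather than McCrary's original binned procedure.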
This test ignores the outcome entirely and looks only at the density of the running variable. At 0% manipulation the histogram glides through the cutoff — units can't control which side they land on, which is exactly the assumption RD rests on. Turn the manipulation slider up and units sitting just below the threshold sort to just above it: a dip opens on the left, a spike piles up on the right. That density discontinuity is the McCrary fingerprint of gaming — and once it appears, the as-if-random comparison near the cutoff is no longer credible.
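The logic of a density check can be illustrated with a deliberately crude stand-in: count observations in a narrow window on each side of the cutoff and ask whether the split is compatible with 50/50. This is not McCrary's estimator and not what rddensity computes (both fit the density on each side rather than comparing raw counts), and it assumes the density is roughly flat within the window. The function name, window width, and manipulation rate are all hypothetical.

```python
import math
import random

def density_jump_z(x, cutoff, h):
    """z-score for the hypothesis that half of the window observations lie above the cutoff."""
    n_below = sum(1 for xi in x if cutoff - h <= xi < cutoff)
    n_above = sum(1 for xi in x if cutoff <= xi < cutoff + h)
    n = n_below + n_above
    # Under a 50/50 split, n_above is Binomial(n, 0.5): mean n/2, variance n/4.
    return (n_above - n / 2) / math.sqrt(n / 4)

rng = random.Random(2)
clean = [rng.uniform(-1, 1) for _ in range(20000)]

# Manipulation: half of the units within 0.05 below the cutoff move to just above it.
gamed = [-xi if (-0.05 <= xi < 0 and rng.random() < 0.5) else xi for xi in clean]

z_clean = density_jump_z(clean, cutoff=0.0, h=0.05)
z_gamed = density_jump_z(gamed, cutoff=0.0, h=0.05)
print(f"z without manipulation: {z_clean:.1f}")
print(f"z with manipulation:    {z_gamed:.1f}")
```

A large positive z-score means an excess of observations just above the cutoff, the pattern associated with sorting onto the favourable side.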
Placebo tests on covariates
Pre-determined covariates — variables fixed before treatment was assigned — should not jump at the cutoff. If they do, something other than the treatment is changing at the threshold, and the RD identification assumption is suspect.
The standard practice is to fit the same RD specification (same bandwidth, same polynomial order) with a covariate as the outcome, and report the estimated discontinuity. Covariates with significant discontinuities are diagnostic red flags; the more thoroughly the placebo set has been checked, the more credible the main RD result.
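As a sketch of that practice, the example below runs one local-linear RD routine twice with the same bandwidth: once on the outcome and once on a pre-determined covariate. The data are simulated so that the covariate varies smoothly with the running variable and only the outcome jumps. The covariate name `age`, the helper names, and every number are assumptions, and no standard errors are computed.

```python
import random

def boundary_fit(us, vs, h):
    """Triangular-kernel local-linear fit of v on u; returns the fitted value at u = 0."""
    ws = [1.0 - abs(u) / h for u in us]
    sw = sum(ws)
    mu = sum(w * u for w, u in zip(ws, us)) / sw
    mv = sum(w * v for w, v in zip(ws, vs)) / sw
    suu = sum(w * (u - mu) ** 2 for w, u in zip(ws, us))
    suv = sum(w * (u - mu) * (v - mv) for w, u, v in zip(ws, us, vs))
    return mv - (suv / suu) * mu

def rd_jump(u, v, h):
    """Estimated discontinuity in E[v | u] at u = 0 within bandwidth h."""
    lu = [ui for ui in u if -h < ui < 0]
    lv = [vi for ui, vi in zip(u, v) if -h < ui < 0]
    ru = [ui for ui in u if 0 <= ui < h]
    rv = [vi for ui, vi in zip(u, v) if 0 <= ui < h]
    return boundary_fit(ru, rv, h) - boundary_fit(lu, lv, h)

rng = random.Random(5)
u, age, y = [], [], []
for _ in range(20000):
    ui = rng.uniform(-1, 1)                     # running variable, centred at the cutoff
    ai = 30.0 + 5.0 * ui + rng.gauss(0, 1)      # covariate: smooth in u, no jump
    d = 1.0 if ui >= 0 else 0.0
    u.append(ui)
    age.append(ai)
    y.append(1.0 + 0.5 * ui + 2.0 * d + 0.1 * (ai - 30.0) + rng.gauss(0, 1))

H = 0.3                                         # same bandwidth for both fits
outcome_jump = rd_jump(u, y, H)
placebo_jump = rd_jump(u, age, H)
print(f"jump in outcome:   {outcome_jump:.2f}")
print(f"jump in covariate: {placebo_jump:.2f}  (should be near zero)")
```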
Sensitivity checks
Beyond the two core diagnostics, applied RD work usually reports a few sensitivity checks: re-estimating the effect under narrower and wider bandwidths to see how the estimate moves; running placebo cutoffs at points where no treatment change actually occurs (the discontinuity should be near zero there); and, if manipulation or heaping right at the cutoff is suspected, a "donut RD" that drops observations in a small neighbourhood around the cutoff and re-estimates on the remaining data. A binned scatter / RD plot of the outcome against the running variable is the visual sanity check that usually accompanies all of this — a clear visual jump at the cutoff and smooth behaviour elsewhere is what a credible RD looks like before any regression is run. None of these checks proves validity, but they show whether the result is driven by one arbitrary modelling choice.
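These checks are mostly re-runs of one estimator under different settings. The sketch below assumes simulated data with a true jump of 2 at zero, and shows a bandwidth sweep, a placebo cutoff estimated on untreated observations only, and a donut that drops points within 0.02 of the cutoff. The function `rd_estimate` and its `donut` argument are hypothetical helpers, not part of any package.

```python
import random

def wls_intercept(us, ys, ws):
    """Weighted least-squares intercept of y on u (fitted value at u = 0)."""
    sw = sum(ws)
    mu = sum(w * u for w, u in zip(ws, us)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    suu = sum(w * (u - mu) ** 2 for w, u in zip(ws, us))
    suy = sum(w * (u - mu) * (y - my) for w, u, y in zip(ws, us, ys))
    return my - (suy / suu) * mu

def rd_estimate(x, y, cutoff, h, donut=0.0):
    """Triangular-kernel local-linear RD; drops points with |x - cutoff| < donut."""
    left, right = ([], [], []), ([], [], [])
    for xi, yi in zip(x, y):
        u = xi - cutoff
        if abs(u) >= h or abs(u) < donut:
            continue
        side = right if u >= 0 else left
        side[0].append(u)
        side[1].append(yi)
        side[2].append(1.0 - abs(u) / h)
    return wls_intercept(*right) - wls_intercept(*left)

rng = random.Random(6)
x = [rng.uniform(-1, 1) for _ in range(20000)]
y = [1.0 + 0.5 * xi + 2.0 * (xi >= 0) + rng.gauss(0, 1) for xi in x]

# 1. Bandwidth sensitivity: the estimate should not swing with h.
by_bandwidth = {h: rd_estimate(x, y, 0.0, h) for h in (0.15, 0.25, 0.4)}

# 2. Placebo cutoff at -0.5, using untreated observations only.
x0 = [xi for xi in x if xi < 0]
y0 = [yi for xi, yi in zip(x, y) if xi < 0]
placebo = rd_estimate(x0, y0, -0.5, 0.25)

# 3. Donut RD: exclude the immediate neighbourhood of the cutoff.
donut = rd_estimate(x, y, 0.0, 0.25, donut=0.02)

for h, est in by_bandwidth.items():
    print(f"h = {h}: {est:.2f}")
print(f"placebo cutoff: {placebo:.2f}   donut: {donut:.2f}")
```

Restricting the placebo run to one side of the real cutoff keeps the true discontinuity out of the placebo window.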
When RD works in practice
RD is the right reach when the data has a credible threshold and the threshold is the dominant source of variation in treatment status near the cutoff. The cleanest applications share a few features:
- A truly fixed, non-manipulable running variable. Test scores graded by an external party, age cutoffs, eligibility based on date of birth, vote-share margins in close elections — variables where individuals cannot precisely control where they land.
- A discrete treatment rule applied at the threshold. The rule should be applied consistently; partial enforcement leads to fuzzy RD (which is still useful, but estimated as IV).
- A local question. RD answers "what is the treatment effect at the cutoff", which is sometimes exactly the policy question (move the cutoff up by 5 points, what happens?) and sometimes a different question than the one the analyst really wants to answer.
When these conditions do not hold — running variable is time, treatment is partly anticipated, the cutoff is one of many simultaneous changes, or the question is about effects far from the cutoff — RD is the wrong tool, even if a discontinuity in the data looks visually compelling. A visual discontinuity in an event-study plot and a clean RD identification are not the same thing.
The summary worth keeping in mind: RD gives a very credible answer to a very narrow question. That is the source of its strength when the design fits, and the source of the most common applied mistake — treating that narrow local answer as if it were automatically a global policy effect.
Historical and Nobel context
- Thistlethwaite & Campbell (1960) — first formalised RD in educational psychology research. The paper was largely forgotten for decades.
- Hahn, Todd & van der Klaauw (2001) — formal identification result for RD under continuity of conditional means; the modern theoretical foundation.
- Lee (2008) — "as-if random near the cutoff" identification argument when individuals cannot precisely manipulate the running variable; brought RD to the centre of applied econometrics.
- Imbens & Lemieux (2008) — practical guide for applied researchers with examples, validation tests, and bandwidth selection guidance.
- Calonico, Cattaneo & Titiunik (2014) — robust bandwidth selection and inference, now the practical default through rdrobust.
- Gelman & Imbens (2019) — argument against high-order polynomials in RD.
- 2021 Nobel Prize in Economics — David Card received one half for empirical contributions to labour economics, and Joshua Angrist and Guido Imbens shared the other half for methodological contributions to the analysis of causal relationships. The prize recognised the broader credibility-revolution programme that placed quasi-experimental methods, RD among them, at the centre of empirical economics.
Modern RD is an active research area, and rdrobust plus the bandwidth-selection literature has standardised what was once a somewhat ad-hoc methodology. RD's strength, when the design clearly applies, is that the causal interpretation rests on a single transparent assumption — continuity through the threshold — that can be argued from institutional structure and partly checked in the data.