lenatriestounderstand

Chapter 4 of 7

Control Function Approach

Created Apr 28, 2026 Updated Jun 7, 2026

You have an endogenous variable X and a valid instrument Z, but you do not want to run a vanilla 2SLS. Maybe the relationship between Z and X is strongly nonlinear — interactions, thresholds, calendar effects — and a linear first stage would throw away most of the information in your instrument. Maybe X is a count or a binary outcome where 2SLS feels awkward. Maybe you are working in a pipeline where "fit a flexible model, take residuals, regress" is the standard shape and you want a causal-inference recipe that plugs into it.

The control function approach is that recipe. Instead of replacing X with its instrument-driven prediction X̂ and regressing Y on X̂ (which is what 2SLS does), you keep the original X and add the residual v̂ = X − X̂ to the second-stage regression as an extra control. The residual captures the part of X that is not explained by the instruments, which under the CF assumptions contains the endogenous part, and including it as a covariate strips that part out of the X coefficient.

The trade is conceptual: 2SLS does its work by replacing X; CF does the same work by conditioning on the endogenous part. In the linear single-endogenous-regressor case the two give numerically identical point estimates. CF earns its keep when the first stage is nonlinear, when the outcome is not continuous, or when you want to plug ML-style flexible models into a causal pipeline — which is exactly the territory of modern double / debiased ML.

The rest of this note works through the mechanics, the structural assumption that does the actual identification, when CF really helps relative to 2SLS, and where the modern DML extensions take the idea.


The idea

In 2SLS the endogenous X gets replaced with X̂ (the part of X explained by the instruments) and the endogenous part is thrown away. In CF, X stays in the regression and a second variable, the residual v̂ = X − X̂, is added as a control. Under the CF assumptions, once v̂ is held fixed the coefficient on X reflects the variation in X that is orthogonal to the residual, i.e. the variation driven by the instruments, and this is the variation that identifies the causal effect.

Conceptually it is the same identification idea as 2SLS, packaged differently. 2SLS does it by replacement; CF does it by conditioning. The implementations look superficially different but the underlying logic is the same.


The two-step procedure

Stage 1

Predict X from the instruments Z and the controls W:

X = f(Z, W) + v

Save the residuals v̂ = X − f̂(Z, W). In principle, f(·) can be estimated flexibly — linear regression, regularised regression, random forest, gradient boosting, neural network — but flexible first stages need extra care for inference and overfitting, which is why DML becomes important later in this note. In classical 2SLS the equivalent first stage is a linear projection; CF lifts that restriction.

Under the CF model, the first-stage residual carries the component of X that is correlated with the structural error in the outcome equation. It is not "endogeneity itself" mechanically: it also contains noise, measurement error, and other unexplained variation that may be harmless. The point of the model is that, under its assumptions, conditioning on v̂ is what controls for the correlated part.

Stage 2

Regress Y on X, the controls W, and the first-stage residuals v̂:

Y = β₀ + β₁X + β₂W + ρ·v̂ + η

The coefficient β̂₁ is the CF estimate of the causal effect; the coefficient ρ̂ on v̂ absorbs the endogenous variation in X.

The structural model implied by this specification is:

Y = β₀ + β₁X + β₂W + ρ·v + η,    where η ⊥ v

i.e. the endogenous error in the original Y equation is decomposed as ρ·v + η, with η independent of the first-stage residual. This is the linear control function assumption: the endogenous component of X enters the structural equation linearly through v. CF identification rests on this assumption being true or an acceptable approximation; without it, the procedure does not have a clean causal interpretation.
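The two stages can be sketched on simulated data. This is a minimal illustration, not a reference implementation: the data-generating process, the coefficient values, and the helper name `ols` are all assumptions made up for the example. It fits the linear first stage, builds v̂, runs the CF second stage, and compares the result with 2SLS and with naive OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
beta_true = 2.0  # assumed true causal effect (illustrative)

# Simulated data; every coefficient below is an illustrative assumption.
Z = rng.normal(size=n)        # instrument
W = rng.normal(size=n)        # exogenous control
v = rng.normal(size=n)        # first-stage error
eta = rng.normal(size=n)      # outcome noise, independent of v
X = 1.0 + 0.8 * Z + 0.5 * W + v
Y = 0.5 + beta_true * X + 1.0 * W + 1.5 * v + eta   # rho = 1.5


def ols(design, target):
    """Least-squares coefficients of target on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


ones = np.ones(n)

# Stage 1: linear projection of X on (1, Z, W), then residuals.
first_stage = np.column_stack([ones, Z, W])
X_hat = first_stage @ ols(first_stage, X)
v_hat = X - X_hat

# Stage 2 (CF): keep X, add v_hat as an extra control.
cf_coef = ols(np.column_stack([ones, X, W, v_hat]), Y)
beta_cf, rho_hat = cf_coef[1], cf_coef[3]

# 2SLS point estimate: replace X with X_hat.
beta_2sls = ols(np.column_stack([ones, X_hat, W]), Y)[1]

# Naive OLS for comparison: ignores endogeneity.
beta_ols = ols(np.column_stack([ones, X, W]), Y)[1]

print(f"CF: {beta_cf:.4f}  2SLS: {beta_2sls:.4f}  OLS: {beta_ols:.4f}")
print(f"rho_hat: {rho_hat:.4f}")
```

In this linear setup the CF and 2SLS point estimates agree up to floating-point error, while the OLS estimate is pulled away from the assumed β₁ by the 1.5·v term.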

ρ̂ as a Hausman-style endogeneity test

The coefficient ρ̂ on v̂ is more than an adjustment: it is itself a useful diagnostic. Under the maintained CF specification, testing ρ = 0 is a test for whether the residual component of X is associated with the structural outcome error. If it is not, the linear-in-v̂ correction is unnecessary and OLS of Y on X may have been adequate; if it is, endogeneity (in the CF sense) is present and the correction is doing real work.

In a well-specified CF model this is essentially the Hausman test reformulated in CF language, and it is a free byproduct of the procedure rather than a separate run. As with Hausman, the diagnostic only carries weight to the extent that the underlying model is correctly specified — in a misspecified second stage it can fail to detect endogeneity that is actually present, or flag adjustment that has nothing to do with endogeneity.
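A sketch of the diagnostic, under stated assumptions: the simulated model, the confounding strength `delta`, and the function names are hypothetical. The t-statistic uses the ordinary homoskedastic OLS variance formula, which is adequate for testing ρ = 0 because under that null the generated-regressor problem vanishes; it should not be read as a valid standard error for β̂₁ in general.

```python
import numpy as np


def rho_t_stat(Y, X, Z):
    """t-statistic on the first-stage residual in the CF second stage."""
    n = len(Y)
    ones = np.ones(n)
    first = np.column_stack([ones, Z])
    v_hat = X - first @ np.linalg.lstsq(first, X, rcond=None)[0]

    design = np.column_stack([ones, X, v_hat])
    coef = np.linalg.lstsq(design, Y, rcond=None)[0]
    resid = Y - design @ coef

    # Classical OLS covariance: sigma^2 * (D'D)^{-1}.
    sigma2 = resid @ resid / (n - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[2] / np.sqrt(cov[2, 2])


def simulate(delta, rng, n=5_000):
    """Illustrative DGP: delta controls how strongly v enters the Y equation."""
    Z = rng.normal(size=n)
    v = rng.normal(size=n)
    X = 0.8 * Z + v
    Y = 2.0 * X + delta * v + rng.normal(size=n)
    return Y, X, Z


rng = np.random.default_rng(1)
t_confounded = rho_t_stat(*simulate(1.5, rng))   # endogeneity present
t_clean = rho_t_stat(*simulate(0.0, rng))        # no endogeneity

print(f"t-stat with confounding:    {t_confounded:.2f}")
print(f"t-stat without confounding: {t_clean:.2f}")
```

With confounding switched on, the t-statistic is far from zero; with `delta = 0` it behaves like a draw from a standard normal.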

Watch the two steps work. Stage 1 leaves the residual v̂ = X − X̂; stage 2 regresses y on X and v̂. The OLS line is dragged off the truth by the confounding δ, while the CF coefficient on X lands on β. The card row makes the headline concrete: β̂_CF and the 2SLS estimate are identical to numerical precision — replacement and conditioning are two routes to one number. The coefficient ρ̂ on v̂ is the free Hausman-style test: dial δ to zero and ρ̂ collapses, telling you the correction was unnecessary.


Why it works

X has, conceptually, two parts: the part predicted by the instruments and controls, f(Z, W), and the residual v. Under the CF model, the residual captures the part of X that is correlated with the structural outcome error, and the second stage controls for that component. When both X and v̂ are in the regression, the coefficient on X is identified from variation in X that is orthogonal to v̂, which, under the instrument-validity and CF assumptions, is the identifying variation driven by Z and W. The coefficient on v̂ absorbs the variation correlated with the structural error.

In the linear case with a single endogenous regressor and a linear first stage, CF and 2SLS produce numerically identical point estimates of β₁. They are different parameterisations of the same identification. The difference shows up when the linear-single-endogenous-regressor setting is left behind:

  • Nonlinear first stage. Standard linear 2SLS uses a linear projection on the included instruments and controls. Nonlinearities can be added manually through transformations, interactions, or basis expansions of Z and W, but CF makes a flexible first-stage model a more natural part of the workflow — and a strict generalisation when f(·) is genuinely nonlinear.
  • Multiple endogenous regressors. CF and 2SLS can give different finite-sample estimates depending on how the residuals are constructed.
  • Nonlinear or non-continuous Y. CF generalises more cleanly to probit / logit / Poisson outcomes via the Wooldridge framework; 2SLS adapts less gracefully.

The shared ground in the linear case — same point estimates — is also why CF is sometimes presented as "just IV by another name". That undersells the gains from the flexibility of f(·) in nonlinear settings, which is where CF actually pays for itself.
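The linear-case equivalence can be seen from a short reparameterisation argument. A sketch, assuming the first stage is an OLS regression of X on a constant, Z, and W, so that v̂ is orthogonal in-sample to the constant, to W, and to X̂:

```latex
\begin{aligned}
X &= \hat{X} + \hat{v},
  \qquad \hat{v} \perp 1,\quad \hat{v} \perp W,\quad \hat{v} \perp \hat{X}
  \quad \text{(in sample, by the OLS normal equations)} \\[4pt]
\beta_0 + \beta_1 X + \beta_2 W + \rho\,\hat{v}
  &= \beta_0 + \beta_1 \hat{X} + \beta_2 W + (\beta_1 + \rho)\,\hat{v}
\end{aligned}
```

The CF regression of Y on (1, X, W, v̂) therefore spans the same columns as a regression of Y on (1, X̂, W, v̂), and the coefficient on X in the first equals the coefficient on X̂ in the second. Because v̂ is orthogonal to the other three regressors, dropping it leaves their coefficients unchanged, and what remains is the regression of Y on (1, X̂, W), which is the 2SLS second stage. Hence β̂₁ from CF equals β̂₁ from 2SLS.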


A flexible first stage: what helps and what does not

The headline practical advantage of CF is that the first stage can be any predictive model. Random forests and gradient boosting capture interactions and threshold effects without explicit feature engineering; neural networks fit high-dimensional Z and W; cross-validated regularised regressions trade some bias for variance in high-dimensional settings.

A typical pricing example: predict price from day_of_week, holiday flags, capacity, lagged competitor prices. A rule like "on holidays with near-full capacity, prices spike sharply" is a nonlinear interaction; a linear first stage misses it, and the resulting v̂ is contaminated. A random forest or XGBoost first stage captures the rule and produces cleaner residuals (see pricing elasticity for a worked-out version of this kind of pipeline).

Here the true first stage genuinely bends with Z. A linear fit can only lay a straight line through it, so the curvature spills into the residual — the right panel shows v̂ still tracing a U-shape rather than flat noise. That contaminated fit is a weak instrument: it throws away most of what Z knows about X, so β̂ stays valid but wildly imprecise — resample and watch the linear estimate jump. Flip to a flexible first stage: the curve is captured, the residual flattens to noise, the instrument turns strong, and β̂ holds tight on β = 2. Turn the nonlinearity down to zero and the two modes coincide — flexibility only pays when the first stage was actually nonlinear.
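The contrast between a linear and a flexible first stage can be sketched with simulated data. Everything here is an assumption for illustration: the U-shaped first stage X = Z² + v, the coefficient values, and the use of a cubic polynomial in Z as a stand-in for a "flexible" learner (a random forest or boosted model would play the same role in a real pipeline).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000
beta_true = 2.0  # assumed true effect (illustrative)

Z = rng.normal(size=n)
v = rng.normal(size=n)
X = Z**2 + v                                   # first stage is U-shaped in Z
Y = beta_true * X + 1.5 * v + rng.normal(size=n)


def fit_predict(design, target):
    """In-sample least-squares fitted values."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


def cf_estimate(first_stage_design):
    """CF estimate of beta and first-stage R^2 for a given first-stage design."""
    v_hat = X - fit_predict(first_stage_design, X)
    second = np.column_stack([np.ones(n), X, v_hat])
    coef, *_ = np.linalg.lstsq(second, Y, rcond=None)
    r2 = 1.0 - v_hat.var() / X.var()
    return coef[1], r2


ones = np.ones(n)
# Linear first stage: cannot see the U-shape, so Z explains almost nothing.
beta_linear, r2_linear = cf_estimate(np.column_stack([ones, Z]))
# Polynomial first stage: captures the curvature.
beta_flex, r2_flex = cf_estimate(np.column_stack([ones, Z, Z**2, Z**3]))

print(f"linear first stage:   R2 = {r2_linear:.3f}, beta = {beta_linear:.3f}")
print(f"flexible first stage: R2 = {r2_flex:.3f}, beta = {beta_flex:.3f}")
```

The linear first stage has an R² close to zero in this setup, so its β̂ is very noisy across resamples; the polynomial first stage recovers most of the signal in Z and gives a much tighter estimate.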

The tempting next step is to read this as "use ML on the first stage, get better identification". The reality is more nuanced:

  • Instrument validity is unchanged. A flexible first stage does not relax the exclusion, independence, or relevance assumptions. If Z is a bad instrument, no amount of XGBoost on the first stage can fix it — CF inherits the same identification dependence as 2SLS.
  • Second-stage misspecification is unchanged. CF assumes the structural Y equation has the linear-in-v̂ form. If the true relationship between X, W, v, and Y is nonlinear in ways that the linear-in-v̂ form cannot absorb, the CF estimate is biased even with a perfect first stage.
  • Inference becomes harder. Naive second-stage standard errors are wrong (they ignore first-stage estimation noise), and analytical formulas are not available for most ML models. Bootstrap is one workaround; DML, below, is the principled one.

The flexible first stage helps with one of three problems — fitting f(Z, W). It does nothing for the other two. To get a method that handles flexible models on both the first stage and the structural equation, the modern answer is double / debiased machine learning.


Generated residuals and cross-fitting

A practical warning that gets overlooked when CF is run with a flexible first stage: residuals should usually be produced out-of-sample.

If the same observations are used both to fit f̂ and to construct the residuals fed into the second stage, an overfitted first stage will make v̂ artificially small: the model has memorised the training data and there is little "residual variation" left, even when the true v is substantial. Plugging those shrunk residuals into the CF second stage distorts the estimate of β₁ and the inference around it: standard errors look tighter than they should, and bias from the first stage leaks into the second.

The clean solution is cross-fitting (sometimes called sample splitting). Split the data into K folds; for each fold, fit f̂ on the other K−1 folds and use it to predict the held-out fold; concatenate the held-out predictions and subtract them from X to construct v̂. Each observation's residual is then produced by a model that has not seen it, and the overfitting channel is closed. With a linear first stage on a moderate sample this is rarely a binding concern; with random forests or boosted trees on a richer feature set it can change the answer materially.

Cross-fitting is not just a CF practice — it is the same habit that DML formalises and combines with orthogonal moments to deliver valid inference, which is why CF with cross-fitting and DML are very close cousins in practice.
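The memorisation problem and the cross-fitting fix can be made visible with a deliberately overfit learner. A minimal sketch, with assumptions flagged: the data are simulated, the function names are hypothetical, and the first stage is a 1-nearest-neighbour predictor, chosen only because it memorises its training data perfectly (it is not a recommended first-stage model).

```python
import numpy as np


def one_nn_predict(z_train, x_train, z_test):
    """Predict each test point with the X value of its nearest training Z."""
    nearest = np.abs(z_test[:, None] - z_train[None, :]).argmin(axis=1)
    return x_train[nearest]


def cross_fit_residuals(Z, X, k, rng):
    """Out-of-fold residuals: each v_hat[i] comes from a model that never saw i."""
    n = len(X)
    folds = np.array_split(rng.permutation(n), k)
    v_hat = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        pred = one_nn_predict(Z[train], X[train], Z[held_out])
        v_hat[held_out] = X[held_out] - pred
    return v_hat


rng = np.random.default_rng(3)
n = 1_000
Z = rng.uniform(-2, 2, size=n)
v = rng.normal(size=n)            # true first-stage error, variance 1
X = np.sin(2 * Z) + v

# In-sample: every point is its own nearest neighbour, so residuals vanish.
v_in_sample = X - one_nn_predict(Z, X, Z)
# Cross-fitted: residuals come from held-out predictions.
v_cross_fit = cross_fit_residuals(Z, X, k=5, rng=rng)

print(f"in-sample residual variance:    {v_in_sample.var():.3f}")
print(f"cross-fitted residual variance: {v_cross_fit.var():.3f}")
```

The in-sample residuals are identically zero even though the true v has unit variance, which is the extreme form of the shrinkage described above. The cross-fitted residuals keep real variation; with a learner this crude they are noisy, so the example shows only the mechanics of the split, not a good first stage.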


Double / Debiased ML — taking the CF idea further

Double / Debiased Machine Learning (DML), introduced by Chernozhukov et al. (2018), is the modern generalisation of the CF approach. The setup looks similar — fit nuisance functions with flexible ML, then estimate the parameter of interest from a second-stage moment condition — but DML adds two ingredients that CF on its own does not have:

  • Neyman-orthogonal moments. The estimating equation for β is constructed so that small errors in the nuisance estimates (the ML-fitted first-stage and conditional-mean functions) do not propagate into β̂ at first order. This is what makes valid inference possible despite the slow convergence rates of ML estimators.
  • Cross-fitting (sample splitting). Nuisance functions are estimated on all folds but one and applied to the held-out fold, rotating until every fold has been held out. This breaks the dependence between the residuals used for estimation and the data those residuals were learned from, and avoids the over-fitting bias that naive ML+plug-in can suffer.

The combination delivers what naive CF-with-ML cannot: a √n-consistent, asymptotically normal estimate of the causal parameter even when the nuisance functions are estimated by random forests, gradient boosting, or neural networks. The standard CF procedure with bootstrapped standard errors is roughly the pre-DML version of this idea; DML is the principled treatment.
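The two ingredients can be sketched by hand for a partially linear IV model, using the partialling-out score from Chernozhukov et al. (2018): residualise Y, X, and Z on W with cross-fitting, then solve the moment condition E[(Ỹ − βX̃)·Z̃] = 0. This is an illustrative sketch, not the API of any DML library. The simulated data, the function names, and the polynomial ridge learner standing in for an ML nuisance model are all assumptions.

```python
import numpy as np


def poly_features(w, degree=5):
    """Polynomial basis in a scalar control (stand-in for a flexible learner)."""
    return np.column_stack([w**d for d in range(degree + 1)])


def ridge_fit_predict(w_train, t_train, w_test, alpha=1.0):
    """Ridge regression on the polynomial basis; intercept left unpenalised."""
    A, B = poly_features(w_train), poly_features(w_test)
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    coef = np.linalg.solve(A.T @ A + penalty, A.T @ t_train)
    return B @ coef


def dml_pliv(Y, X, Z, W, k=5, seed=0):
    """Cross-fitted partialling-out estimate of beta with a plug-in std. error."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    Y_res, X_res, Z_res = np.empty(n), np.empty(n), np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for target, out in ((Y, Y_res), (X, X_res), (Z, Z_res)):
            pred = ridge_fit_predict(W[train], target[train], W[held_out])
            out[held_out] = target[held_out] - pred   # out-of-fold residual
    jac = np.mean(Z_res * X_res)
    beta = np.mean(Z_res * Y_res) / jac               # orthogonal moment solved
    psi = (Y_res - beta * X_res) * Z_res              # score at the estimate
    se = np.sqrt(np.mean(psi**2) / jac**2 / n)
    return beta, se


# Illustrative DGP: nonlinear effects of W everywhere, endogeneity through v.
rng = np.random.default_rng(5)
n = 4_000
W = rng.uniform(-2, 2, size=n)
Z = np.sin(W) + rng.normal(size=n)
v = rng.normal(size=n)
X = 0.8 * Z + W**2 + v
Y = 2.0 * X + np.cos(W) + 1.5 * v + rng.normal(size=n)

beta_dml, se_dml = dml_pliv(Y, X, Z, W)
print(f"beta = {beta_dml:.3f}, se = {se_dml:.3f}")
```

For real work, the maintained libraries handle learner choice, repeated cross-fitting, and inference; the point of the sketch is only to show how little machinery the core idea needs.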

For applied work in 2026, DML has become the default whenever the first stage benefits from flexible ML. Implementations are mature: DoubleML (Python and R), econml (Python, Microsoft), and grf (R) cover the main use cases. The broader picture of causal ML methods is covered in causal ML.


Limitations and inference

Standard errors

How standard errors are computed in CF depends on what the first stage looks like.

  • Linear first stage. Analytical standard errors exist and follow the same logic as 2SLS — a variance correction that accounts for the two-step structure. The standard reference is Wooldridge's Econometric Analysis of Cross Section and Panel Data (2010). Analytical SEs in this case are one of the reasons 2SLS remains the workhorse in linear settings.
  • ML first stage. Analytical formulas are not available for most ML models, and naive second-stage OLS standard errors are wrong because they ignore first-stage estimation noise. The pre-DML solution was the bootstrap — resample the data, re-run the two-step procedure hundreds or thousands of times, and compute empirical standard errors. This works but is computationally expensive and offers no protection against the slow-convergence issues that DML's orthogonal moments address. Modern practice is to use DML directly when ML enters the first stage.
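The bootstrap workaround can be sketched as follows. The key detail is that both stages are re-run inside every resample, so first-stage estimation noise is carried into the spread of β̂. The simulated data, the linear first stage, and the function names are assumptions for illustration; a real ML first stage would slot into `cf_beta` in place of the least-squares fit, at a much higher computational cost.

```python
import numpy as np


def cf_beta(Y, X, Z):
    """Two-step CF estimate of beta with a linear first stage."""
    n = len(Y)
    ones = np.ones(n)
    first = np.column_stack([ones, Z])
    v_hat = X - first @ np.linalg.lstsq(first, X, rcond=None)[0]
    second = np.column_stack([ones, X, v_hat])
    return np.linalg.lstsq(second, Y, rcond=None)[0][1]


def bootstrap_se(Y, X, Z, n_boot=500, seed=0):
    """Pairs bootstrap: resample rows, redo BOTH stages, take the std of beta."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample rows with replacement
        draws[b] = cf_beta(Y[idx], X[idx], Z[idx])
    return draws.std(ddof=1)


# Illustrative data with endogeneity through v.
rng = np.random.default_rng(4)
n = 2_000
Z = rng.normal(size=n)
v = rng.normal(size=n)
X = 0.8 * Z + v
Y = 2.0 * X + 1.5 * v + rng.normal(size=n)

beta_hat = cf_beta(Y, X, Z)
se_boot = bootstrap_se(Y, X, Z)
print(f"beta = {beta_hat:.3f}, bootstrap se = {se_boot:.3f}")
```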

Binary or count outcomes

When Y is binary (probit / logit), a count (Poisson), or otherwise non-continuous, the linear-in-v̂ second stage is no longer the right structural model. The Wooldridge (2015) framework, Control Function Methods in Applied Econometrics, extends CF to these cases: the residual still goes into the second stage, but the functional form changes to match the outcome model. Implementations exist in standard econometric packages; rolling your own should be a fallback rather than a default.
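For the binary probit case the change of functional form can be written out. A sketch of the standard two-step setup under joint normality, with notation adapted to this note: the latent error u, its unit-variance normalisation, the remainder e, and the correlation τ are symbols introduced here for illustration.

```latex
\begin{aligned}
Y &= \mathbf{1}\!\left[\beta_0 + \beta_1 X + \beta_2 W + u > 0\right],
  \qquad X = f(Z, W) + v, \\[4pt]
u &= \rho\, v + e,
  \qquad e \mid Z, W, v \sim \mathcal{N}\!\left(0,\; 1 - \tau^2\right),
  \qquad \tau = \operatorname{Corr}(u, v),\quad \operatorname{Var}(u) = 1, \\[4pt]
\Pr(Y = 1 \mid X, W, v)
  &= \Phi\!\left(\frac{\beta_0 + \beta_1 X + \beta_2 W + \rho\, v}{\sqrt{1 - \tau^2}}\right)
\end{aligned}
```

Under these assumptions, a probit of Y on (1, X, W, v̂) estimates the coefficients scaled by 1/√(1 − τ²) rather than β itself, so quantities such as average partial effects are usually computed by averaging the fitted response over v̂ rather than read off the raw coefficients. The t-test on the coefficient of v̂ still serves as the endogeneity diagnostic.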

Instrument validity is unchanged

Restating the obvious because it gets overlooked: CF flexibility in the first stage does not relax exclusion, independence, or relevance. If the instrument is bad, CF inherits the problem from 2SLS without exception.


CF vs 2SLS — when to use which

A concise practical rule:

  • Linear first stage, single endogenous regressor, continuous outcome → 2SLS is simpler, has analytical standard errors, and is the standard methodology with decades of applied practice. CF gives the same point estimate.
  • Nonlinear or high-dimensional first stage, ML-flavoured pipeline → CF with a flexible first stage outperforms a linear 2SLS if the linearity in the first stage was actually a binding restriction. For valid inference in this regime, prefer DML over hand-rolled CF plus bootstrap.
  • Binary or count outcome → CF (Wooldridge form) is more natural than 2SLS, which adapts awkwardly to non-continuous Y.

For classical econometric problems the historical workhorse is 2SLS — easier to explain, more standardised, more reviewer-friendly. The modern reach for CF is in applied-ML and causal-ML pipelines where flexible nuisance estimation matters. DML is the version of that reach that comes with valid inference built in, and is where most new applied work in this area sits.