Chapter 6 of 7
Causal ML Beyond Econometrics
Created Apr 28, 2026 Updated Jun 7, 2026
Standard predictive ML answers "what will Y be given X?". Causal ML asks "what would Y be if we changed X?". These are different questions and the second cannot be reliably answered through the first.
The textbook example. An insurance company observes that houses with smoke alarms have 30% lower fire damage. A predictive model captures the correlation cleanly: house has alarm → expect lower damage. But the causal question is different: if we install a smoke alarm in a house that did not have one, will damage drop 30%? Probably not — the people who already installed alarms are also more cautious about fire safety in general, and the alarm itself contributes only part of the gap. Distinguishing the predictive correlation from the causal effect is the whole game.
In product and ML practice this comes up constantly. A model that predicts "customers who saw an email convert at higher rates" is predictive. The causal question is whether sending the email causes conversion, or whether emails were already being sent to customers who would have converted anyway. Confounding (we send emails to engaged customers) breaks the predictive interpretation as a basis for the decision "should we send the email at all".
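The gap between the predictive and the causal reading of the email example can be made concrete with a little arithmetic. The sketch below uses made-up numbers (two customer segments, a targeting rule that favours engaged customers, and a true lift of five percentage points in both segments); none of these figures come from real data.

```python
# Hypothetical numbers, chosen only to illustrate confounding by engagement.
segments = {
    # name: (population share, P(email sent), P(convert | no email), P(convert | email))
    "engaged":     (0.5, 0.8, 0.30, 0.35),
    "not_engaged": (0.5, 0.2, 0.05, 0.10),
}

def naive_difference(segments):
    """P(convert | emailed) - P(convert | not emailed), pooled over segments."""
    emailed = sum(share * p_send for share, p_send, _, _ in segments.values())
    emailed_conv = sum(share * p_send * p1 for share, p_send, _, p1 in segments.values())
    skipped = sum(share * (1 - p_send) for share, p_send, _, _ in segments.values())
    skipped_conv = sum(share * (1 - p_send) * p0 for share, p_send, p0, _ in segments.values())
    return emailed_conv / emailed - skipped_conv / skipped

def true_effect(segments):
    """Population-average causal effect: share-weighted within-segment lift."""
    return sum(share * (p1 - p0) for share, _, p0, p1 in segments.values())

naive = naive_difference(segments)
causal = true_effect(segments)
print(f"naive difference: {naive:.2f}, causal effect: {causal:.2f}")
```

With these assumed inputs the pooled comparison shows a 20-point gap while the email itself adds only 5 points; the remainder is the engagement confounder.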
The classical methods covered in the rest of the Econometrics track — instrumental variables, regression discontinuity, panel-data fixed effects, control function — are the foundations of causal inference. Causal ML is the family of methods that combines that thinking with ML techniques: flexible nuisance estimation for valid causal estimates, heterogeneous-treatment-effect estimation at scale, and counterfactual-style prediction. The rest of this note covers the main approaches and where each fits.
This note is a map rather than a full technical treatment: the goal is to show what each causal-ML family is for and how it connects to the classical identification tools. Each method gets a short orientation rather than a full derivation; the linked notes elsewhere in the track cover the deep dives.
Identification still comes first
Before any of the methods below, one point worth making explicitly: causal ML does not remove the need for identification. Randomisation, unconfoundedness, a valid instrument, RD continuity, parallel trends, common-support / overlap — every causal-inference setup depends on at least one such assumption, and that assumption has to come from the design of the study, not from the model.
ML helps with what comes after identification is settled: estimating nuisance functions flexibly, discovering treatment-effect heterogeneity, scaling the analysis to high-dimensional confounders, evaluating policies on logged data. None of that turns observational correlations into interventions on its own. A causal ML pipeline applied to data with no identifying assumption is a sophisticated way of producing a confounded estimate, not a way of escaping confounding. The first question on any causal-ML project is still "what is the source of identifying variation in this data?" — and only after that has a defensible answer does the choice between DML, causal forests, X-learners, and the rest start to matter.
ATE and CATE
Causal ML estimands come in two main flavours:
- ATE — Average Treatment Effect. The average causal effect of a treatment across the population: "if everyone got the treatment, how much would the average outcome change?". A randomised A/B test gives ATE directly — randomisation removes the confounding that observational ATE estimates have to fight through.
- CATE — Conditional Average Treatment Effect. The same average effect, but conditional on covariates: "if everyone with characteristics X got the treatment, how much would the average outcome change for them?". Different sub-populations can have very different treatment effects, and CATE captures that heterogeneity.
CATE is the more useful quantity for personalisation: it tells you which subgroups respond most strongly to the treatment, which lets you target — promote where it works, withhold where it does not, treat differently where the effect differs. Estimating CATE requires more sophisticated methods than estimating ATE, because effectively you are estimating a function τ(X) rather than a scalar. Uplift modelling and causal forests, below, are the main toolkit.
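In potential-outcomes notation, writing Y(1) and Y(0) for the outcome with and without treatment, the two estimands and the link between them can be written as follows. This is the standard textbook formulation rather than anything specific to this note.

```latex
\begin{aligned}
\text{ATE} &= \mathbb{E}\big[\,Y(1) - Y(0)\,\big], \\
\tau(x)    &= \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big], \\
\text{ATE} &= \mathbb{E}\big[\,\tau(X)\,\big].
\end{aligned}
```

The last line is why the ATE is silent about heterogeneity: many different functions τ(x) average to the same scalar.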
Hold the ATE fixed and increase the heterogeneity: the average never moves, but the CATE curve τ(X) tilts and a growing slice of the population falls below zero, meaning units the treatment actively harms. Two campaigns with an identical headline ATE can mean "helps everyone a little" or "helps half a lot, harms the other half," and the ATE can't tell them apart. Surfacing that hidden τ(X) is the entire point of CATE and uplift estimation, and it is why targeting can beat treating everyone.
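A minimal sketch of that point, assuming a hypothetical linear CATE curve over a single covariate spread evenly on [0, 1]. The function names and the numbers (ATE of 0.02, slope of 0.20) are illustrative assumptions only.

```python
def cate(x, ate, slope):
    """Hypothetical linear CATE curve tau(x) on [0, 1], centred so its mean equals `ate`."""
    return ate + slope * (x - 0.5)

def summarise(ate, slope, n_grid=1001):
    """Return (average effect, share of units with a negative effect) on an even grid."""
    xs = [i / (n_grid - 1) for i in range(n_grid)]  # stand-in for the population
    taus = [cate(x, ate, slope) for x in xs]
    average = sum(taus) / len(taus)
    share_harmed = sum(t < 0 for t in taus) / len(taus)
    return average, share_harmed

flat_ate, flat_harmed = summarise(ate=0.02, slope=0.0)
tilted_ate, tilted_harmed = summarise(ate=0.02, slope=0.20)
print(f"flat:   ATE={flat_ate:.3f}, harmed share={flat_harmed:.2f}")
print(f"tilted: ATE={tilted_ate:.3f}, harmed share={tilted_harmed:.2f}")
```

Both scenarios report the same ATE, yet in the tilted one roughly 40% of units sit below zero.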
Uplift modelling
Uplift modelling is the product- and marketing-side specialisation of CATE estimation: typically binary treatment (campaign on / off, discount / no discount) and often binary outcome (converts / does not), aimed at deciding whom to target with a treatment. The general CATE problem is broader (continuous treatments, continuous outcomes, multi-valued treatments), but the uplift framing covers a large fraction of practical use cases and has its own established vocabulary in marketing.
The four behavioural quadrants
The classical mental model splits the population into four behavioural quadrants based on whether they would convert with treatment versus without:
- Persuadables. Would convert if treated, would not convert if not treated. Treatment causes conversion. These are the people to target.
- Sure things. Would convert either way. Treatment is wasted on them — the conversion was happening regardless.
- Lost causes. Would not convert either way. Treatment is wasted.
- Sleeping dogs (do-not-disturb). Would convert if not treated, would not convert if treated. Treatment is harmful. Avoid.
A standard predictive "who is likely to convert?" model targets sure things plus persuadables, which over-spends on sure things and may even target sleeping dogs. An uplift model targets only persuadables and avoids the other three.
Worth being explicit about what the four quadrants are not: this is a potential-outcomes framing. We never observe both Y(treated) and Y(untreated) for the same person — only one of the two is realised in any actual experiment, the other is the counterfactual. So the uplift model does not assign people to quadrants with certainty; it estimates the conditional treatment-effect distribution from patterns in the data, and the four quadrants are a useful conceptual classification rather than a labelling task. The framing is also specific to binary outcomes and binary treatments; for continuous outcomes or treatments the persuadables-vs-sure-things intuition still applies, but the formal object is a continuous function τ(X).
The population splits into the four quadrants by what each person would do treated vs untreated. A predictive "who is likely to convert" model chases everyone with a real chance of converting (persuadables, sure things and sleeping dogs), so it wastes treatments on sure things (who convert for free) and actively destroys conversions among sleeping dogs. An uplift model targets only persuadables, capturing the same incremental conversions with far fewer treatments and no self-harm. As the sleeping-dog share rises, the predictive strategy's net effect collapses and can even turn negative.
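The accounting behind that comparison can be sketched with an oracle that knows each person's quadrant. In practice the quadrants are never observed, so this is an idealised illustration; the population shares and the names `campaign_outcome`, `PREDICTIVE`, and `UPLIFT` are hypothetical.

```python
# Effect of treatment on conversion for one person in each quadrant.
EFFECT = {"persuadable": 1, "sure_thing": 0, "lost_cause": 0, "sleeping_dog": -1}

def campaign_outcome(shares, targeted, population=100_000):
    """Return (people treated, net incremental conversions) for a targeting rule."""
    treated = population * sum(shares[q] for q in targeted)
    incremental = population * sum(shares[q] * EFFECT[q] for q in targeted)
    return treated, incremental

# Hypothetical population mixes.
base = {"persuadable": 0.10, "sure_thing": 0.20, "lost_cause": 0.65, "sleeping_dog": 0.05}
more_dogs = {"persuadable": 0.10, "sure_thing": 0.20, "lost_cause": 0.55, "sleeping_dog": 0.15}

PREDICTIVE = ["persuadable", "sure_thing", "sleeping_dog"]  # everyone likely to convert
UPLIFT = ["persuadable"]                                    # only those the treatment moves

for name, shares in [("base", base), ("more sleeping dogs", more_dogs)]:
    for strategy, targeted in [("predictive", PREDICTIVE), ("uplift", UPLIFT)]:
        treated, incremental = campaign_outcome(shares, targeted)
        print(f"{name:>18} | {strategy:<10} | treated {treated:>8.0f} | incremental {incremental:>+8.0f}")
```

Under these assumed shares the predictive rule treats 3.5 times as many people for half the net gain, and its net effect turns negative once sleeping dogs outnumber persuadables.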
Standard methods
- T-learner (two-model). Train one model on treated data, one on control; uplift = model_treated(x) − model_control(x). Simple, but each model uses only half the data and the variance is high.
- S-learner (single-model). One model with treatment as a feature; uplift = model(x, T=1) − model(x, T=0). Convenient, but model regularisation can shrink the treatment-effect signal toward zero.
- X-learner. Combines the T-learner with a refinement step that uses the propensity score; more robust when treatment groups are imbalanced.
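- R-learner. Residualises both the outcome and the treatment on the covariates (Robinson's decomposition) and fits the treatment-effect function directly to those residuals rather than modelling the outcome surfaces. It relies on the same orthogonality idea as DML, so moderate errors in the nuisance models have limited impact on the CATE estimate.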
- Doubly robust learners. Combine an outcome model with a propensity-score model. Consistent if either of the two is correctly specified — hence "doubly robust".
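As a minimal sketch of the simplest entry in the list, here is a T-learner on toy data from a randomised experiment. The "model" is deliberately trivial (a per-segment mean) so the mechanics stay visible; a real T-learner would plug in any regressor or classifier. The segment names and counts are invented for illustration.

```python
from collections import defaultdict

def fit_segment_means(rows):
    """Stand-in model: mean outcome per segment, from (segment, outcome) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for segment, outcome in rows:
        totals[segment] += outcome
        counts[segment] += 1
    return {s: totals[s] / counts[s] for s in totals}

def t_learner_uplift(data):
    """data: list of (segment, treated, outcome). Fit one model per arm, then subtract."""
    model_treated = fit_segment_means((s, y) for s, t, y in data if t == 1)
    model_control = fit_segment_means((s, y) for s, t, y in data if t == 0)
    return {s: model_treated[s] - model_control[s] for s in model_treated if s in model_control}

def expand(segment, treated, conversions, n):
    """Build n rows for one segment and arm, `conversions` of which converted."""
    return [(segment, treated, 1)] * conversions + [(segment, treated, 0)] * (n - conversions)

# Hypothetical experiment: 100 treated and 100 control users per segment.
data = (
    expand("new_user", 1, 30, 100) + expand("new_user", 0, 10, 100)
    + expand("loyal", 1, 50, 100) + expand("loyal", 0, 50, 100)
    + expand("discount_averse", 1, 5, 100) + expand("discount_averse", 0, 15, 100)
)
uplift = t_learner_uplift(data)
for segment, value in uplift.items():
    print(f"{segment:>16}: estimated uplift {value:+.2f}")
```

In this toy data the loyal segment has the highest conversion rate but zero uplift, which is the sure-things pattern a purely predictive model would target anyway.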
Tools
- EconML (Microsoft) — Python, broad coverage of causal-ML methods (DML, X-learners, causal forests, heterogeneous-treatment-effect estimation). Often used together with DoWhy in the Microsoft causal-ML ecosystem.
- CausalML (Uber) — Python, originally focused on uplift in advertising; similar coverage.
- DoWhy (Microsoft): causal-reasoning framework rooted in Pearl's DAGs and do-calculus. The user writes down the causal graph; the framework identifies the target estimand from it, estimates it with the estimation method the user selects, and runs refutation and sensitivity checks on the result.
- GRF (R) — generalised random forests, including the original Wager–Athey causal forest implementation.
Typical applications
- Marketing campaign targeting. Send the promotion to persuadables, skip the sure things, avoid the sleeping dogs.
- Customer retention. Discounts and special offers help some customers stay and harm others (annoyance, anchoring on the discount); uplift identifies whom to target.
- Personalised pricing. Estimate per-customer price sensitivity, set prices accordingly. The pricing-elasticity note covers the broader pricing context.
- Personalised medicine. A treatment that works on average may harm specific subgroups; CATE / uplift identifies whom to treat.
Double Machine Learning
DML, formalised by Chernozhukov et al. (2018), is the modern way to estimate causal effects in the presence of high-dimensional confounders while still getting valid statistical inference. The full mechanics are covered in the control-function note — DML is the principled extension of the CF idea — so this section is a brief summary.
The workflow:
- Predict the outcome Y from confounders W using a flexible ML model. Take residuals.
- Predict the treatment T from confounders W using a flexible ML model. Take residuals.
- Regress outcome residuals on treatment residuals. The coefficient is the treatment effect.
The two technical ingredients that make DML work are Neyman-orthogonal moment conditions (so that small errors in the ML nuisance estimates do not propagate at first order into the causal estimate) and cross-fitting (predict each observation with a model that was not trained on it, to break the dependence between residuals and the data they were learned from). The combination delivers √n-consistent, asymptotically normal estimates of the treatment effect even when the nuisance functions are estimated by random forests, gradient boosting, or neural networks.
The workflow above is the partially linear DML case — the structural equation Y = θ·T + g(W) + ε with a scalar treatment effect θ. Other DML variants use different orthogonal scores (interactive DML for fully heterogeneous outcomes, DML-IV for instrumented treatments, etc.), but the residualisation intuition is the same in each: predict, residualise, exploit the orthogonality.
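The partially linear workflow with cross-fitting can be sketched in plain Python. This is an illustration under stated assumptions, not a substitute for a library implementation: the data are simulated with a known θ = 2, the nuisance learner is a crude binned mean standing in for a flexible ML model, and helper names such as `bin_mean_model` and `dml_plr` are hypothetical.

```python
import random

def bin_mean_model(ws, targets, n_bins=20):
    """Stand-in nuisance learner: mean of the target within equal-width bins of W in [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for w, v in zip(ws, targets):
        b = min(int(w * n_bins), n_bins - 1)
        sums[b] += v
        counts[b] += 1
    overall = sum(targets) / len(targets)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda w: means[min(int(w * n_bins), n_bins - 1)]

def dml_plr(ws, ts, ys, n_folds=2):
    """Cross-fitted partially linear DML: residualise Y and T on W, regress residual on residual."""
    n = len(ws)
    res_t, res_y = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        predict_y = bin_mean_model([ws[i] for i in train], [ys[i] for i in train])
        predict_t = bin_mean_model([ws[i] for i in train], [ts[i] for i in train])
        for i in held_out:  # each residual uses a model that never saw observation i
            res_y[i] = ys[i] - predict_y(ws[i])
            res_t[i] = ts[i] - predict_t(ws[i])
    return sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)

def ols_slope(xs, ys):
    """Slope of a simple regression of ys on xs (ignores the confounder entirely)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Simulated data: Y = theta * T + g(W) + eps, with W driving both T and Y.
rng = random.Random(0)
theta_true = 2.0
ws = [rng.random() for _ in range(4000)]
ts = [2.0 * w + rng.gauss(0, 1) for w in ws]
ys = [theta_true * t + 3.0 * w ** 2 + rng.gauss(0, 1) for t, w in zip(ts, ws)]

theta_naive = ols_slope(ts, ys)
theta_dml = dml_plr(ws, ts, ys)
print(f"naive slope: {theta_naive:.3f}, DML estimate: {theta_dml:.3f}, truth: {theta_true}")
```

The naive slope absorbs the part of g(W) that correlates with T, while the residual-on-residual estimate should land close to the simulated θ. Standard errors and confidence intervals are omitted here; libraries such as DoubleML compute them from the orthogonal score.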
DML is the right tool when the treatment effect is the parameter of interest (not the full outcome surface), confounders are high-dimensional or the relationships are nonlinear, and valid p-values / confidence intervals are required. Implementations: DoubleML (Python and R), econml (Microsoft), grf (R).
Causal trees and causal forests
Tree-based methods specifically for CATE estimation.
A causal tree is structurally like a regression tree, but splits are chosen to maximise heterogeneity in the treatment effect between leaves rather than heterogeneity in the outcome. Each leaf of a fitted causal tree carries an estimated treatment effect for individuals falling in that leaf.
A single tree is too noisy for production CATE estimation, which is why the practical method is the causal forest (Wager & Athey, 2018): an ensemble of causal trees with randomised splits and subsamples of data and features — the same recipe as Random Forest but for treatment effects rather than outcomes. Averaging across trees produces stable CATE estimates and, under conditions, valid pointwise confidence intervals — a property that few ML methods have.
The valid-inference result rests on two technical ingredients in Wager–Athey's construction: honesty (each subsample is split into two parts — one used to choose the tree splits, the other used to estimate the leaf treatment effects, so the same data does not both pick the structure and fill in the numbers) and sample splitting / subsampling (each tree sees only a random subsample of the data). Together these break the dependence that would otherwise invalidate confidence intervals built from a tree fit on the same observations the splits were chosen on. Implementations in econml (Python) and grf (R) handle the honesty mechanics internally.
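The honesty idea can be illustrated with a single split on simulated experimental data: one half of the sample chooses the threshold, the other half estimates the leaf effects. The splitting score below is a simplified heterogeneity criterion chosen for readability, not the exact rule used in grf or econml, and the data-generating process (effect of 1 only when x > 0.5) is an assumption of the sketch.

```python
import random

def effect(rows):
    """Difference in mean outcome, treated minus control, for rows of (x, t, y)."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def best_split(rows, candidates):
    """Pick the threshold on x that maximises treatment-effect heterogeneity between leaves."""
    def score(c):
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        share_left = len(left) / len(rows)
        return share_left * (1 - share_left) * (effect(left) - effect(right)) ** 2
    return max(candidates, key=score)

# Simulated randomised experiment: the treatment only works when x > 0.5.
rng = random.Random(1)
rows = []
for _ in range(4000):
    x, t = rng.random(), rng.randint(0, 1)
    y = 1.0 * t * (x > 0.5) + rng.gauss(0, 1)
    rows.append((x, t, y))

# Honesty: structure from one half, leaf estimates from the other half.
split_half, estimate_half = rows[:2000], rows[2000:]
threshold = best_split(split_half, [i / 10 for i in range(1, 10)])
tau_left = effect([r for r in estimate_half if r[0] <= threshold])
tau_right = effect([r for r in estimate_half if r[0] > threshold])
print(f"threshold={threshold}, tau_left={tau_left:.2f}, tau_right={tau_right:.2f}")
```

Because the estimation half played no part in choosing the threshold, the leaf effects are not inflated by the search over splits, which is the dependence the honesty construction is meant to break.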
Causal forests are particularly suited to settings where some interpretability of the heterogeneity is wanted: which features drive treatment-effect heterogeneity, what the tree splits look like.
Counterfactual prediction
Counterfactual prediction asks "what would have happened if…" for individual cases — different from the average treatment effect, which is over a population. Several methods build the counterfactual differently:
- Synthetic control (Abadie, Diamond & Hainmueller, 2010) — construct a "synthetic version" of a treated unit as a weighted combination of untreated comparison units chosen to match the treated unit's pre-treatment trajectory. The counterfactual is the synthetic unit's outcome after treatment. This is the canonical method for case-study-style policy evaluation in economics, and it is panel-data territory — see panel data for the panel-structure context. The augmented synthetic control method (Ben-Michael, Feller & Rothstein, 2021) is the modern extension that addresses bias when the pre-treatment match is imperfect.
- Matching. For each treated unit find a similar untreated unit; use the latter's outcome as the counterfactual. Propensity-score matching is the classical version; matching on covariates directly is also common. Matching's implicit assumption is selection on observables.
- Direct prediction. Fit a predictive model on outcomes, predict for both treated and untreated scenarios, and take the difference as the per-unit causal effect estimate. Simple, but inherits all the assumptions of the predictive model — overfitting and misspecification translate directly into bias.
- Causal feature attribution. Decompose treatment effects into per-feature contributions in a causal framework. Approaches include Asymmetric Shapley Values (Frye et al., 2020) and Causal Shapley (Heskes et al., 2020). Worth flagging that "causal SHAP" is not a single well-defined method — multiple incompatible proposals exist, and the choice matters for what the per-feature attributions actually mean.
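A toy version of the synthetic-control idea from the list above, assuming only two donor units so that the convex weight can be found by a simple grid search. Real applications use many donors and a constrained optimiser; the panel values and the function name `synthetic_control_weight` are hypothetical.

```python
def synthetic_control_weight(treated_pre, donor_a_pre, donor_b_pre, steps=20):
    """Grid-search the weight on donor A (donor B gets 1 - w) that best fits the pre-period."""
    def loss(w):
        return sum((y - (w * a + (1 - w) * b)) ** 2
                   for y, a, b in zip(treated_pre, donor_a_pre, donor_b_pre))
    return min((i / steps for i in range(steps + 1)), key=loss)

# Hypothetical panel: four pre-treatment periods, two post-treatment periods.
donor_a_pre, donor_a_post = [10.0, 12.0, 14.0, 16.0], [18.0, 20.0]
donor_b_pre, donor_b_post = [20.0, 21.0, 22.0, 23.0], [24.0, 25.0]
treated_pre, treated_post = [14.0, 15.6, 17.2, 18.8], [23.4, 25.0]

w = synthetic_control_weight(treated_pre, donor_a_pre, donor_b_pre)

# Counterfactual: what the weighted donors did after the treatment date.
synthetic_post = [w * a + (1 - w) * b for a, b in zip(donor_a_post, donor_b_post)]
effects = [y - s for y, s in zip(treated_post, synthetic_post)]
print(f"weight on donor A: {w:.2f}, estimated effects by period: {[round(e, 2) for e in effects]}")
```

The pre-period fit is exact here by construction, which is the favourable case; the augmented method mentioned above targets the situation where it is not.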
Practical applications include pricing optimisation ("what if we change the price?"), recommendation evaluation ("would the user have clicked if we had recommended differently?"), and policy decisions about feature rollouts.
Off-policy evaluation
A specific case in reinforcement learning and contextual bandits: logged data from an old policy (a previous algorithm's decisions) is available, and the goal is to evaluate a new policy's expected performance without actually deploying it. The standard estimators:
- Importance sampling. Reweight logged data so the distribution matches what the new policy would have produced; estimate expected reward under the new policy.
- Self-normalised importance sampling. Variance-reduction trick that normalises the importance weights to sum to the sample size. Slightly biased but lower-variance, often a better practical estimator.
- Direct method. Fit a model of reward as a function of (context, action), use it to estimate the new policy's expected reward. Simple but inherits the model's misspecification.
- Doubly robust estimators. Combine importance sampling with the direct method. Consistent if either piece is correctly specified.
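The four estimators above can be compared on a simulated, context-free bandit log. Everything below is an assumption of the sketch: three actions, known logging probabilities, Bernoulli rewards, and a deliberately crude reward model (one constant for all actions) so that the direct method is visibly biased while the doubly robust correction repairs it.

```python
import random

ACTIONS = [0, 1, 2]
TRUE_REWARD = [0.1, 0.3, 0.6]   # hypothetical P(reward = 1 | action)
PI_OLD = [0.6, 0.3, 0.1]        # logging policy
PI_NEW = [0.1, 0.3, 0.6]        # candidate policy to evaluate

rng = random.Random(0)
logs = []
for _ in range(50_000):
    a = rng.choices(ACTIONS, weights=PI_OLD)[0]
    r = 1.0 if rng.random() < TRUE_REWARD[a] else 0.0
    logs.append((a, r))

weights = [PI_NEW[a] / PI_OLD[a] for a, _ in logs]

# Importance sampling: reweight each logged reward by pi_new / pi_old.
ips = sum(w * r for w, (_, r) in zip(weights, logs)) / len(logs)

# Self-normalised IS: divide by the sum of weights instead of the sample size.
snips = sum(w * r for w, (_, r) in zip(weights, logs)) / sum(weights)

# Direct method with a misspecified model: the same mean reward for every action.
mean_reward = sum(r for _, r in logs) / len(logs)
reward_model = [mean_reward] * len(ACTIONS)
dm = sum(PI_NEW[k] * reward_model[k] for k in ACTIONS)

# Doubly robust: direct-method baseline plus an importance-weighted residual correction.
dr = dm + sum(w * (r - reward_model[a]) for w, (a, r) in zip(weights, logs)) / len(logs)

true_value = sum(PI_NEW[k] * TRUE_REWARD[k] for k in ACTIONS)
print(f"truth={true_value:.3f} ips={ips:.3f} snips={snips:.3f} dm={dm:.3f} dr={dr:.3f}")
```

Here the propensities are correct and the reward model is wrong, so the direct method misses badly while the importance-weighted and doubly robust estimates should sit near the true value. With a wrong propensity model and a good reward model the roles would reverse.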
The single most important practical concern across all of these is overlap (also called common support): the logged policy must have actually taken, with non-trivial probability, the actions that the new policy would take. If the new policy wants to recommend item X in context C but the old policy almost never did so, there is essentially no data about what reward results — and no estimator can recover it. Importance weights become huge or undefined; direct methods extrapolate beyond their training distribution; doubly-robust estimators combine two unreliable sources. The first thing to check before running any OPE is the overlap between the action distributions of the old and new policies in the relevant contexts; without it, the data simply does not support the evaluation.
Picture a logging policy that mostly tried the low-reward actions and a new policy that favours the high-reward ones. Importance sampling reweights the log by π_new/π_old to estimate the new policy's value without deploying it. As the two policies diverge, π_new piles onto actions π_old almost never took: a few records pick up enormous weights, the effective sample size collapses, and the estimate lurches on every resample. Where the old policy gathered no data, no estimator can recover the answer, so checking action-distribution overlap is the first step of any off-policy evaluation.
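One simple overlap diagnostic is the effective sample size of the importance weights. For a context-free policy pair the large-sample ratio ESS/n equals 1 / E_old[w²], which can be computed directly from the two action distributions. The sketch below assumes three actions and hypothetical probabilities; `effective_sample_fraction` is an illustrative name, not a library function.

```python
def effective_sample_fraction(pi_old, pi_new):
    """Large-sample ESS / n for weights w = pi_new / pi_old, i.e. 1 / E_old[w^2]."""
    if any(q > 0 and p == 0 for p, q in zip(pi_old, pi_new)):
        return 0.0  # the new policy uses an action the log never contains: no support
    second_moment = sum(q * q / p for p, q in zip(pi_old, pi_new) if p > 0)
    return 1.0 / second_moment

pi_old = [0.90, 0.09, 0.01]
scenarios = [
    ("identical", [0.90, 0.09, 0.01]),
    ("moderate shift", [0.50, 0.30, 0.20]),
    ("reversed", [0.01, 0.09, 0.90]),
]
for name, pi_new in scenarios:
    fraction = effective_sample_fraction(pi_old, pi_new)
    print(f"{name:>15}: effective sample fraction = {fraction:.3f}")
```

In the reversed scenario a log of one million records carries roughly the information of twelve thousand, which is the collapse described above. With contexts, the same check would be applied within each relevant context or on the empirical weights.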
Modern tools: the Open Bandit Pipeline (obp, Saito et al.) is a widely used Python framework with reference implementations of most off-policy estimators; vowpal_wabbit includes contextual-bandit support with off-policy capabilities. Off-policy evaluation is standard at companies running large-scale recommendation, ranking, or advertising systems where deploying every candidate policy is infeasible.
Heterogeneous-treatment-effect discovery
Once an experiment has run, the natural follow-up question is "where did the treatment work and where did it not?". HTE discovery is the family of methods that find this without requiring subgroups to be specified up front:
- Causal forests with feature importance. Fit a causal forest, look at which features were used most heavily for splits — those are the features driving heterogeneity.
- Subgroup discovery algorithms. Algorithms that search for rules describing subgroups with significantly different treatment effects (e.g. "users in country X with engagement above Y see a 2× larger effect"). Care is needed for multiple testing — without it, you will reliably find spurious heterogeneity.
- CATE plots. Visualise the estimated CATE function as a function of one or two key features, similar to partial-dependence plots in standard ML interpretability.
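The multiple-testing caveat in the subgroup-discovery item can be made concrete with two standard corrections. The p-values below are invented, standing in for ten segment-level tests of "the effect in this segment differs from the rest".

```python
def bonferroni(p_values, alpha=0.05):
    """Indices rejected when each test is held to alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule (false discovery rate control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes its threshold
    return sorted(order[:cutoff])

# Hypothetical p-values from ten segment comparisons.
p_values = [0.001, 0.008, 0.012, 0.040, 0.200, 0.500, 0.700, 0.900, 0.950, 0.990]

uncorrected = [i for i, p in enumerate(p_values) if p <= 0.05]
print("uncorrected:", uncorrected)
print("Bonferroni: ", bonferroni(p_values))
print("BH:         ", benjamini_hochberg(p_values))
```

Four segments look significant without correction, three survive Benjamini-Hochberg, and only one survives Bonferroni. Which correction fits depends on whether the analysis is exploratory or will drive a decision directly.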
The standard pipeline in product work: run the experiment, estimate ATE, then estimate HTE on key segmentations to surface where the effect concentrates. Especially valuable when the headline ATE is small but a sub-population has a large effect — that pattern is invisible in the headline number.
Common mistakes
A handful of recurring mistakes when ML practitioners reach for predictive tools to answer causal questions:
- Treating a predictive model as causal. Building a predictive model and reading feature importance as causal contribution. A feature can be a strong predictor without being a causal driver — the smoke-alarm example from the opener is the canonical case.
- Confounding in observational data. Without randomisation, association and causation are not the same thing. The classical-econometrics tools — propensity scores, IV, control function — exist precisely to handle this and should be reached for instead of (or in addition to) predictive ML. See endogeneity for the foundational discussion.
- Reverse causality. "Workers who attend training sessions earn more": does training raise earnings, or do higher earners self-select into training? Direction matters and the data alone cannot resolve it.
- Selection bias. Observational samples are rarely random. Models trained on biased samples produce biased causal estimates regardless of how well they predict in-sample.
- Treatment heterogeneity assumed away. Two treatments with the same ATE can have radically different distributional effects — one may help everyone a little, the other may help half a lot and harm the other half. ATE alone is silent about this; CATE / uplift estimation surfaces it.
Where this is applied
- Tech companies. Uplift modelling for advertising, recommendations, pricing; dedicated experimentation / causal-inference teams are common at large tech companies such as Booking.com, Microsoft, Uber, Airbnb, Amazon, Netflix.
- Finance. Causal effects of trading strategies, risk modelling, credit scoring (effect of credit limits, not just default prediction).
- Healthcare. Treatment-effect estimation, personalised medicine, observational studies where RCTs are infeasible.
- Public policy. Effects of programmes (job training, education interventions), often using observational methods because RCTs are not always feasible.
- Marketing. "Whom does this campaign actually convert" rather than "who is likely to convert".
Practical recommendations
- Start with an experiment when possible. A randomised A/B test gives a clean ATE without causal-ML complications. Causal ML is mostly necessary when an RCT is not possible or not sufficient — when CATE is needed, when a new policy must be evaluated without deployment, or when only observational data is available.
- For CATE estimation. EconML and CausalML cover the main methods. Start with T-learner, S-learner, or X-learner as baselines; reach for R-learner, doubly-robust learners, or causal forests when the baseline is unstable or heterogeneity is the main target rather than a side question.
- For valid inference under high-dimensional confounding. DML is the right default; DoubleML and econml.dml are the standard implementations. Background detail in the control-function note.
- For policy questions where assumptions matter. Frame the problem in DoWhy or a similar causal-graph framework: the explicit DAG forces assumptions to be visible and testable.
- Don't overuse. Most ML problems are predictive, not causal. Causal-ML tools come with assumptions and complexity that should only be paid for when there is a real causal question.
Where this connects to the rest of the track
Causal ML builds directly on the classical-econometrics methods covered elsewhere in this track:
- Endogeneity — the foundational problem causal ML and classical econometrics both try to solve.
- Instrumental variables — the classical fix when randomisation is not available; causal ML's high-dimensional cousins (DML with IV, deep IV) build on this.
- Regression discontinuity — quasi-experimental design at thresholds.
- Control function — the residual-based alternative that is the direct ancestor of DML.
- Panel data — fixed effects and DiD, the natural setting for synthetic-control methods.
- Pricing elasticity — a worked-out application that uses many of these tools together.
Causal ML is a sophisticated extension of the standard ML toolkit, not a replacement. It is essential for applications where the question is decision-driven rather than prediction-driven — policy choices, pricing, personalisation, treatment recommendations — and ignorable for ordinary predictive use cases. Knowing when each is the right framing matters more than mastering any specific algorithm.