Chapter 5 of 7

Panel Data (Fixed Effects)

Created Apr 28, 2026. Updated Jun 7, 2026.

Most useful data has the same units observed many times: users across days, stores across months, sensors across timestamps, experiments across runs. You want to know how some feature X affects some outcome Y — how a price change moves demand, how a UI tweak moves engagement — and you suspect the units differ in stable ways that also correlate with X. Some stores are in better locations and also charge higher prices. Some users are more engaged in general and also see different content.

A pooled regression of Y on X conflates two questions: do high-X units have high Y, and within a single unit, does Y move with X? The first is contaminated by everything that varies across units. The second is usually closer to the causal question people care about, but it earns the label "causal" only when the remaining assumptions discussed below also hold (no time-varying confounders, no reverse causality, no anticipation, no dynamic feedback).

Fixed effects isolates the within-unit comparison. It gives each unit its own intercept (its own baseline level of Y) and identifies β, the effect of X on Y, from variation in X around that baseline. Stable per-unit characteristics that don't move over time (location, type, brand, persistent user traits) are absorbed by the intercept and can no longer bias the answer, even if you can't measure them or don't know what they are.

The cost is that FE uses only within-unit variation: a unit whose X never changes contributes nothing to the estimate. The benefit is the cleanest available control for stable, unobserved per-unit confounders.

What panel data is

Panel data has the structure of N units observed across T time periods, with each observation indexed by the pair (i, t). Examples are everywhere: 100 stores observed daily for a year (100 × 365 = 36 500 rows), 1 000 patients across 10 visits, 50 currencies across 1 000 trading days, every user of a product across the days they were active.

This structure gives more identification leverage than either a pure cross-section (one observation per unit) or a pure time series (one unit observed many times). With both dimensions, you can ask within-unit and between-unit questions and use one to control for the other.

The fixed-effects idea

FE controls for time-invariant unobserved characteristics of each unit — features of a unit that are stable over the observation window but unobserved or unmeasured by the analyst. A store's location, layout, and brand. A patient's chronic conditions and underlying biology. A currency's institutional regime. A user's persistent preferences. None of these need to be in the dataset for FE to absorb their effect.

If these stable characteristics correlate with X — popular stores charging higher prices, engaged users getting different content, healthy patients self-selecting into less treatment — then ignoring them produces classic omitted-variable bias. FE absorbs them by construction, so they can no longer bias β through time-invariant differences across units. (Time-varying confounders are a different problem and are not handled by FE — see the limitations section below.)

The model and the within-transformation

The model

The basic FE specification:

y_it = α_i + β × X_it + ε_it

α_i is a unit-specific intercept that absorbs all time-invariant characteristics of unit i. The crucial allowance is that α_i is permitted to correlate with X_it — and that is exactly the situation that makes pooled OLS biased. Popular units (high α_i) tend to also have higher X (the popular spot can charge more); pooled OLS attributes the unit's popularity to X and overstates the effect of X. FE allows the correlation explicitly and that is why it can break the bias.

If we ignore α_i and run OLS, the unobserved unit-specific intercept becomes an omitted variable correlated with X, and the estimate β̂_OLS mixes the effect of X with the effect of unit popularity. FE separates them.

The within-transformation

The mechanism is the within-transformation (also called demeaning). For each unit, average the equation over time:

ȳ_i = α_i + β × X̄_i + ε̄_i

The time-mean of α_i over t is just α_i itself, since it does not vary in t. Subtracting from the original equation:

y_it − ȳ_i = β × (X_it − X̄_i) + (ε_it − ε̄_i)

The unit intercept cancels: α_i − α_i = 0. What's left is a regression of deviations from the unit mean of Y on deviations from the unit mean of X. OLS on these demeaned variables gives the FE estimator β̂_FE, consistent for β under the assumptions discussed below.
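The derivation above can be checked numerically. The following is a minimal sketch on simulated data, assuming a balanced panel stored as a units × periods array; the data-generating process, the seed, and names such as `ols_slope` are illustrative choices, not part of the original text. It compares the pooled OLS slope with the slope obtained after demeaning by unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods, beta_true = 200, 10, 1.0

# Unit intercepts alpha_i, and an X whose level tracks alpha_i (the confounding).
alpha = rng.normal(0.0, 2.0, size=n_units)
x = alpha[:, None] + rng.normal(size=(n_units, n_periods))
y = alpha[:, None] + beta_true * x + rng.normal(size=(n_units, n_periods))


def ols_slope(x_flat, y_flat):
    """Slope from a one-regressor OLS with an intercept."""
    xc = x_flat - x_flat.mean()
    return float(xc @ (y_flat - y_flat.mean()) / (xc @ xc))


# Pooled OLS ignores alpha_i, so the slope also picks up between-unit confounding.
beta_pooled = ols_slope(x.ravel(), y.ravel())

# Within-transformation: subtract each unit's own time mean from X and from Y.
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
beta_fe = ols_slope(x_within.ravel(), y_within.ravel())

print(f"pooled OLS: {beta_pooled:.3f}   FE (within): {beta_fe:.3f}   true: {beta_true}")
```

In this setup the pooled slope should land well above the true value, because high-α units also have high X, while the within slope should sit close to it.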

The implementation is one line in most econometrics packages: PanelOLS in Python's linearmodels, plm in R, xtreg with the fe option in Stata. For high-cardinality unit FE, see the practical-implementation section below.

What β represents under FE

β under FE is identified from within-unit variation only — how a single unit's X moves around its own mean tracks how its Y moves around its own mean. Cross-sectional differences between units (how high-X units differ from low-X units) are explicitly not used; that variation is absorbed by α_i.

This is a substantive change in what β means. Pooled OLS would answer "across all units and times, does higher X go with higher Y?". FE answers "within a single unit's history, does X moving up coincide with Y moving up?". The second question is usually what people mean when they say causal effect, and the difference between the two answers is the bias FE removes.

Interactive figure: each colour is one unit observed over time. Inside every cluster the slope is the true β, but the clusters are stacked along a different line because each unit's hidden intercept α tracks its X-level. The dashed pooled OLS line chases that between-unit arrangement and gets β wrong; with the confounding parameter φ negative enough, it points the opposite way (Simpson's paradox). The demean option applies the within-transformation: subtracting each unit's own mean cancels α, every cluster slides onto the origin, and the slope of the pooled-within cloud is the fixed-effects estimate, back on β.

FE as least-squares dummy variables

There are two equivalent ways to compute FE that produce the same β̂:

Within-transformation. Demean every variable by its unit mean and run OLS on the demeaned data. This is what packages like linearmodels actually do internally for efficiency.
Least Squares Dummy Variables (LSDV). Include N indicator variables (one per unit) in an ordinary OLS regression of Y on X and the dummies. The coefficients on the dummies are the estimated α̂_i, and the coefficient on X is the FE estimate.

The two are numerically identical. The LSDV view is sometimes useful as a mental model — it makes clear that FE is "OLS plus a lot of dummies" and explains why robust standard errors, multi-way FE, and similar extensions are not exotic; they are just the corresponding tools for an OLS-with-dummies regression. With small N the LSDV approach is fine; with large N (millions of users), the within-transformation or specialised high-dimensional algorithms are needed (see below).
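The equivalence is easy to confirm on a small panel. This is a sketch under simulated data with illustrative names (`unit`, `unit_mean`, and the coefficient 0.5 are assumptions): it fits the dummy-variable regression with `numpy.linalg.lstsq` and compares the coefficient on X with the within estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 30, 8
unit = np.repeat(np.arange(n_units), n_periods)   # unit id for each row (long format)

alpha = rng.normal(size=n_units)
x = alpha[unit] + rng.normal(size=unit.size)
y = alpha[unit] + 0.5 * x + rng.normal(size=unit.size)

# LSDV: regress y on x plus one indicator column per unit (no separate constant).
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
design = np.column_stack([x, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_lsdv, alpha_hat = coef[0], coef[1:]


def unit_mean(v):
    """Per-unit mean of v, broadcast back to one value per row."""
    return (np.bincount(unit, weights=v) / np.bincount(unit))[unit]


# Within: demean x and y by unit, then run OLS without a constant.
x_w, y_w = x - unit_mean(x), y - unit_mean(y)
beta_within = float(x_w @ y_w / (x_w @ x_w))

print(beta_lsdv, beta_within)   # the two agree up to floating-point error
```

The dummy matrix here has one column per unit, which is exactly what stops scaling when N reaches the millions.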

Strict exogeneity

FE consistency requires strict exogeneity: E[ε_it | X_i1, X_i2, …, X_iT, α_i] = 0 for every t. The error at any time period must be uncorrelated not only with current X but with X at all past and future time periods. This is stronger than the cross-sectional exogeneity assumption discussed in the endogeneity note, and it rules out:

Feedback from past Y to future X. If past outcomes drive future treatment (a manager raises prices after a high-demand period), then X_i,t+1 depends on ε_it, breaking strict exogeneity. This is the classical "dynamic panel" problem.
Lagged dependent variables on the right-hand side. A regression of y_it on y_i,t−1 and X under FE is biased (Nickell bias); the bias is of order 1/T, so it is most severe in short panels. Specialised methods like Arellano–Bond GMM are needed.
Anticipation effects. If units adjust X in anticipation of future shocks, strict exogeneity fails.

For static problems with non-feedback X, strict exogeneity is a reasonable assumption. For dynamic settings it is the assumption that breaks first, and the answer is usually a different identification strategy — IV, dynamic-panel estimators, or one of the modern DiD methods covered below.
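To see how a strict-exogeneity failure shows up in practice, here is a small simulation of the lagged-dependent-variable case. It is a sketch with assumed settings (an AR(1) outcome with persistence `rho = 0.5`, six observed periods, no other regressors), not a general result: the within estimator is applied to a regression of y on its own lag.

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods, rho = 2000, 6, 0.5

alpha = rng.normal(size=n_units)
y = np.empty((n_units, n_periods))
# Start each unit from its stationary distribution given alpha_i.
y[:, 0] = alpha / (1 - rho) + rng.normal(size=n_units) / np.sqrt(1 - rho**2)
for t in range(1, n_periods):
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=n_units)

# Regress y_it on y_i,t-1 with unit fixed effects (within-transformation).
y_lag, y_cur = y[:, :-1], y[:, 1:]
y_lag_w = y_lag - y_lag.mean(axis=1, keepdims=True)
y_cur_w = y_cur - y_cur.mean(axis=1, keepdims=True)
rho_fe = float((y_lag_w * y_cur_w).sum() / (y_lag_w**2).sum())

print(f"true rho: {rho}   FE estimate: {rho_fe:.3f}")
```

Even with 2 000 units the estimate should come out well below the true persistence: adding units does not help, because the demeaned lag is mechanically correlated with the demeaned error when T is short.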

Standard errors and clustering

Standard errors in panel data deserve their own discussion because the default OLS formulas are almost always wrong here.

The reason is that observations within a unit are typically correlated over time — yesterday's demand at this store is informative about today's demand even after controlling for X. OLS standard errors assume independent observations and underestimate the true sampling variance when within-unit correlation is present. The result is overconfident standard errors, narrower confidence intervals than warranted, and inflated significance.

The standard fix is cluster-robust standard errors, clustered by the unit dimension. These compute variances that allow arbitrary correlation within each unit over time but assume independence between units. Implementations are built into most panel-data packages (PanelOLS(...).fit(cov_type='clustered', cluster_entity=True) in linearmodels, vcovHC with cluster = "group" in R's plm, the cluster() option in Stata).

Two-way clustering (by unit and by time period) is sometimes used when shocks at a given time are correlated across units. In DiD-style analyses, the modern recommendation is to cluster at the level at which treatment is assigned.

Without cluster-robust standard errors, FE results that look statistically significant frequently are not.
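The difference between the two formulas can be written out directly. The sketch below assumes a single regressor, a balanced panel, and AR(1) dependence in both X and the error (the parameter `phi` and the helper `ar1` are illustrative); it uses the basic sandwich form without the small-sample corrections that packages typically add.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods, phi, beta_true = 100, 20, 0.8, 0.0


def ar1(shape, phi, rng):
    """One AR(1) series per row, so each unit's values are correlated over time."""
    out = np.empty(shape)
    out[:, 0] = rng.normal(size=shape[0])
    for t in range(1, shape[1]):
        out[:, t] = phi * out[:, t - 1] + np.sqrt(1 - phi**2) * rng.normal(size=shape[0])
    return out


x = ar1((n_units, n_periods), phi, rng)
y = beta_true * x + ar1((n_units, n_periods), phi, rng)

# FE point estimate from unit-demeaned data.
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
sxx = (x_w**2).sum()
beta_fe = (x_w * y_w).sum() / sxx
resid = y_w - beta_fe * x_w

# Naive (iid) standard error, with degrees of freedom net of N unit means and 1 slope.
dof = n_units * n_periods - n_units - 1
se_naive = np.sqrt((resid**2).sum() / dof / sxx)

# Cluster-robust: sum the scores x*u within each unit first, then square and add up.
scores = (x_w * resid).sum(axis=1)
se_cluster = np.sqrt((scores**2).sum()) / sxx

print(f"naive SE: {se_naive:.4f}   cluster-robust SE: {se_cluster:.4f}")
```

Summing within the unit before squaring is what lets the cross-period products of X and the residual enter the variance; the naive formula drops them.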

Interactive figure: the true effect here is exactly zero, so an honest 5% test should reject only 5% of the time. As within-unit correlation increases, the naive confidence interval stays narrow while its false-rejection rate climbs far past 5%: observations inside a unit are not independent, so the naive formula behaves as if it had many more data points than it really does. The cluster-robust interval widens to match and stays close to the nominal 5%. This is why a "significant" FE coefficient without clustered standard errors so often evaporates.

Unit, time, and two-way FE

The setup so far has used unit fixed effects — one intercept per unit. There are two other common variants worth naming:

Time fixed effects (γ_t): one intercept per time period, absorbing shocks common to all units at the same time, such as seasonality, macro shocks, holidays, and platform-wide rollouts. With time FE alone, β is identified from cross-unit variation within each period, after subtracting that period's cross-sectional mean.
Two-way FE: both unit and time FE in the same regression. Absorbs both stable per-unit differences and shared time-varying shocks. β is then identified from the variation in X that is left after subtracting both the unit mean and the time mean (and, in a balanced panel, adding back the grand mean), the so-called "double-demeaned" or "within-within" variation.

Two-way FE is the standard panel-data specification when both kinds of confounders are plausible, and it is the natural setup for the DiD estimator below. The cost is that it eats more degrees of freedom and requires variation along both dimensions; the benefit is that it controls for everything that is either unit-specific or time-specific without requiring that those confounders be measured.
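For a balanced panel, the two-way transformation has a closed form. The following sketch assumes simulated data in which X is correlated with both the unit effect and the time effect; `double_demean` is a hypothetical helper name, and the closed form shown does not carry over to unbalanced panels.

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_periods, beta_true = 150, 12, -0.7

alpha = rng.normal(0, 2, size=(n_units, 1))     # stable unit differences
gamma = rng.normal(0, 2, size=(1, n_periods))   # shocks common to all units
x = alpha + gamma + rng.normal(size=(n_units, n_periods))   # X tracks both
y = alpha + gamma + beta_true * x + rng.normal(size=(n_units, n_periods))


def double_demean(v):
    """v_it - unit mean - time mean + grand mean (balanced panels only)."""
    return v - v.mean(axis=1, keepdims=True) - v.mean(axis=0, keepdims=True) + v.mean()


x_dd, y_dd = double_demean(x), double_demean(y)
beta_twfe = float((x_dd * y_dd).sum() / (x_dd**2).sum())

print(f"two-way FE estimate: {beta_twfe:.3f}   true: {beta_true}")
```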

Difference-in-differences and two-way FE

Difference-in-Differences (DiD) is a panel-data identification strategy closely related to FE. The classical setup has two groups (treated and control) observed at two time points (before and after); the DiD estimate is the difference between the change in the treated group's outcome and the change in the control group's. In the simplest two-group, two-period setup, this is exactly the coefficient on the treatment indicator in a regression with unit fixed effects and time fixed effects:

y_it = α_i + γ_t + β × Treated_it + ε_it

where Treated_it = 1 if unit i is in the treatment group and time t is after the policy. The unit FE α_i controls for level differences between groups; the time FE γ_t controls for shocks that affect everyone equally; the remaining variation that loads on Treated_it is the DiD estimate.

In its simplest two-period two-group form, DiD is essentially "two-way FE applied to a treatment indicator", and the identification rests on the parallel-trends assumption: in the absence of treatment, the treated group's outcome would have followed the same trend as the control group's. This is a strong assumption that should be argued and partly checked with pre-treatment data (an "event study" plot of the treated-vs-control gap before treatment is the standard pre-trends check).
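The equivalence between the four-means DiD and the two-way FE coefficient can be verified in a few lines. This is a sketch on simulated two-period data in which parallel trends holds by construction; the group sizes, the common shock of 1.0, and the effect of 2.0 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_per_group, effect = 500, 2.0

treated_group = np.repeat([0, 1], n_per_group)                   # group per unit
alpha = rng.normal(size=2 * n_per_group) + 3.0 * treated_group   # level difference
y_pre = alpha + rng.normal(size=alpha.size)
# Post period: common shock of 1.0 for everyone, plus the effect for the treated.
y_post = alpha + 1.0 + effect * treated_group + rng.normal(size=alpha.size)

# DiD from group means: change in treated minus change in control.
change = y_post - y_pre
did_means = change[treated_group == 1].mean() - change[treated_group == 0].mean()


def double_demean(v):
    """Remove unit and time means from a balanced units x periods array."""
    return v - v.mean(axis=1, keepdims=True) - v.mean(axis=0, keepdims=True) + v.mean()


# Same number from the two-way FE regression on Treated_it.
y = np.column_stack([y_pre, y_post])
d = np.column_stack([np.zeros(alpha.size), treated_group.astype(float)])
d_dd, y_dd = double_demean(d), double_demean(y)
did_twfe = float((d_dd * y_dd).sum() / (d_dd**2).sum())

print(f"DiD from means: {did_means:.4f}   two-way FE coefficient: {did_twfe:.4f}")
```

The two numbers coincide up to floating-point error; the 3.0 level gap between groups and the common shock drop out of both.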

Interactive figure: DiD reads the effect as the treated group's jump relative to the control group's change; the counterfactual is the control's path pinned to the treated group's starting level. With parallel trends (violation = 0) the groups move together before treatment and DiD recovers the true effect. With a pre-trend violation the lines fan apart before treatment ever happens: the assumption is false, and DiD counts the diverging trend as effect. Watching the pre-treatment gap is the event-study check that has to come before any DiD claim.

Two-way FE with staggered treatment — modern critique

The setup above is clean when the treatment turns on at the same time for everyone in the treated group. The picture gets much more complicated — and much more relevant in practice — when treatment is staggered: different units get treated at different times.

A standard staggered-DiD analysis writes a two-way FE regression with a 0/1 treatment indicator that switches on at the unit's treatment date:

y_it = α_i + γ_t + β × Treated_it + ε_it

This is what most applied work used as a default for years. Starting around 2020, a series of papers showed that the OLS coefficient β̂ from this regression is in general not the average treatment effect, even under parallel trends. The estimator is a weighted average of unit-by-time treatment effects with weights that can be negative when later-treated units are compared against already-treated units acting as controls. With heterogeneous treatment effects across cohorts or over time, β̂ can have the wrong sign.

The key references:

Goodman-Bacon (2021) — decomposed the two-way-FE estimator into pairwise comparisons and showed where the negative weights come from.
de Chaisemartin & D'Haultfœuille (2020) — gave conditions under which two-way FE is biased and proposed a robust estimator.
Callaway & Sant'Anna (2021) — group-time average treatment effects, estimated cohort by cohort.
Sun & Abraham (2021) — interaction-weighted event-study estimators.
Borusyak, Jaravel & Spiess (2024) — imputation-based estimator robust under heterogeneous effects.

Modern recommended practice for staggered DiD is to use one of these estimators rather than vanilla two-way FE. Implementations: did (Callaway–Sant'Anna, R/Python), csdid (Stata), did_imputation (Borusyak–Jaravel–Spiess), eventstudyinteract (Sun–Abraham). Reporting an event-study plot — the dynamic version of DiD that traces effects relative to treatment time — is now standard, and the modern estimators produce these natively.

For non-staggered treatment (single treatment date, common across all treated units), classical two-way FE / DiD is still fine. The fragility appears specifically when treatment timing varies.
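The sign problem can be reproduced without any noise at all. The sketch below is a deliberately stylised example, assuming two cohorts of one unit each, ten periods, no never-treated group, and an effect that grows by one unit per period since treatment; none of these numbers come from the papers cited above.

```python
import numpy as np

n_periods = 10
t = np.arange(n_periods)
first_treated = np.array([2, 6])   # early cohort, late cohort (one unit each)

# Treatment indicator, and an effect that grows by 1 per period since treatment.
d = (t[None, :] >= first_treated[:, None]).astype(float)
tau = d * (t[None, :] - first_treated[:, None] + 1)

# Noise-free outcomes with arbitrary unit and time effects: parallel trends holds exactly.
alpha = np.array([[5.0], [-3.0]])
gamma = 0.5 * t[None, :]
y = alpha + gamma + tau


def double_demean(v):
    """Remove unit and time means from a balanced units x periods array."""
    return v - v.mean(axis=1, keepdims=True) - v.mean(axis=0, keepdims=True) + v.mean()


d_dd, y_dd = double_demean(d), double_demean(y)
beta_twfe = float((d_dd * y_dd).sum() / (d_dd**2).sum())
att = float(tau[d == 1].mean())    # average effect over all treated unit-periods

print(f"TWFE beta: {beta_twfe:.3f}   true ATT: {att:.3f}")
# prints: TWFE beta: -0.167   true ATT: 3.833
```

Every treated cell has a strictly positive effect, yet the regression coefficient is negative, because the early cohort's late periods (where its effect is largest) receive negative weight.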

Interactive figure: two cohorts are treated at different times and the effect grows the longer a unit has been treated, so the true average effect on the treated stays firmly positive. In the shaded window, however, the late cohort is compared against the already-treated early cohort, which is still rising from its own treatment. TWFE subtracts that rise as if it were a control trend, so β̂ drops below the ATT. With strong enough effect dynamics, β̂ goes negative while every real effect is positive. This is the staggered-DiD failure that Goodman-Bacon diagnosed and that cohort-by-cohort estimators such as Callaway–Sant'Anna were designed to avoid.

Limitations of FE

FE is one tool in the panel-data arsenal, not a universal solution. The main limits are:

Time-invariant predictors are absorbed. Any variable that does not vary within a unit (store type, location, gender, brand) gets killed by the within-transformation along with α_i. FE controls for these but cannot estimate their effects. If the question is "what is the effect of location on sales?", FE is the wrong tool — use a model that keeps location as a covariate.
Time-varying confounders are not handled. FE removes only the time-invariant component of unobserved heterogeneity. A time-varying factor not in the model — a local event affecting both pricing and demand at the same time, for instance — is left in the error and biases β. For time-varying confounders the right tools are IV, control function, DiD with a credible control group, or synthetic-control methods.
Sufficient within-unit variation is required. β is identified from how X moves within a unit's own history. A unit whose X never changes contributes nothing to β̂. If most units have stable X, the FE estimate is driven by a small subset that did vary, and is correspondingly noisy.
Post-treatment controls leak the effect. Time-varying controls in an FE regression need to be chosen carefully: a control variable that is itself affected by X is a post-treatment variable, and including it absorbs part of the treatment effect rather than reducing confounding. This is a generic causal-inference issue, but it shows up especially often in panel work because there are usually many candidate time-varying controls available, and the temptation is to add them all. Controls should be ones that affect Y and are not affected by X within the panel.
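The first and third limits are mechanical consequences of demeaning and can be seen on a toy panel. In this sketch the variable names (`location_quality`, `price`) and the five-unit panel are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n_units, n_periods = 5, 4

# A time-invariant predictor: one value per unit, repeated in every period.
location_quality = rng.normal(size=(n_units, 1)) * np.ones((1, n_periods))
price = rng.normal(size=(n_units, n_periods))
price[0, :] = 9.99                 # unit 0 never changes its price


def within(v):
    """Subtract each unit's own time mean."""
    return v - v.mean(axis=1, keepdims=True)


# The time-invariant column is zero after demeaning, so no coefficient is estimable.
print(np.allclose(within(location_quality), 0.0))

# Each unit's share of the within-variation in price; unit 0 contributes nothing.
ss_by_unit = (within(price) ** 2).sum(axis=1)
share = ss_by_unit / ss_by_unit.sum()
print(np.round(share, 3))
```

Looking at these per-unit shares before fitting is a cheap way to see whether the FE estimate will rest on a handful of units.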

High-cardinality FE in practice

For panels with millions of units (every user of a product, every transaction, every device), the LSDV approach is infeasible — adding millions of dummies to a regression is computationally and numerically prohibitive. The within-transformation handles single-dimension unit FE efficiently, but two-way and higher-order FE require specialised algorithms.

The standard tools in 2026:

fixest (R) — handles arbitrary high-dimensional FE through partialling-out tricks, fast and reliable.
pyhdfe (Python) — high-dimensional FE absorption used internally by linearmodels and econml.
FixedEffectModels.jl (Julia) — similar capabilities, very fast on large data.
Stata's reghdfe — same family, popular in applied economics.

These libraries make FE on tens of millions of observations with multiple high-cardinality fixed-effect groupings routine, which is what makes panel methods practical at internet-scale data.
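One simple way to absorb two sets of fixed effects without building any dummies is to alternate between demeaning by unit and demeaning by time until nothing changes (the method of alternating projections). The sketch below illustrates that idea on an unbalanced simulated panel; it is not a description of how any of the libraries above are implemented, and `absorb_two_way`, its tolerance, and the 30% row-dropping are assumptions made for the example.

```python
import numpy as np


def absorb_two_way(v, unit, time, tol=1e-10, max_iter=1000):
    """Remove unit and time means from v by alternating demeaning until convergence."""
    v = np.array(v, dtype=float)
    n_unit = np.maximum(np.bincount(unit), 1)   # guard against ids with no rows
    n_time = np.maximum(np.bincount(time), 1)
    for _ in range(max_iter):
        before = v.copy()
        v -= (np.bincount(unit, weights=v) / n_unit)[unit]   # sweep out unit means
        v -= (np.bincount(time, weights=v) / n_time)[time]   # sweep out time means
        if np.max(np.abs(v - before)) < tol:
            break
    return v


rng = np.random.default_rng(7)
n_units, n_periods, beta_true = 300, 15, 1.5
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

# Drop roughly 30% of rows at random so the panel is unbalanced.
keep = rng.random(unit.size) > 0.3
unit, time = unit[keep], time[keep]

alpha, gamma = rng.normal(size=n_units), rng.normal(size=n_periods)
x = alpha[unit] + gamma[time] + rng.normal(size=unit.size)
y = alpha[unit] + gamma[time] + beta_true * x + rng.normal(size=unit.size)

x_res, y_res = absorb_two_way(x, unit, time), absorb_two_way(y, unit, time)
beta_hat = float(x_res @ y_res / (x_res @ x_res))

print(f"two-way FE estimate: {beta_hat:.3f}   true: {beta_true}")
```

Memory use is linear in the number of rows, since only group sums are stored; the dedicated libraries add acceleration, better convergence checks, and correct degrees-of-freedom accounting on top of this basic idea.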

Random effects

Random effects (RE) is the alternative panel-data approach. RE assumes α_i is uncorrelated with X_it and is drawn from a common distribution; under that assumption, RE uses both within- and between-unit variation and is more efficient than FE.

In modern applied causal-inference work RE is rarely the right default. The "α_i uncorrelated with X" assumption is exactly what FE was invented to avoid relying on, and it is almost always violated in causal applications. The Hausman test formally compares FE vs RE — significant difference rejects the RE assumption — but most applied work skips the test and defaults to FE, because the RE assumption is implausible on its face for most causal questions.

RE remains useful in mixed-effects modelling for prediction or in random-slopes settings where partial pooling is the point. For causal panel work, FE is the safer default.
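For reference, the relationship between RE, FE, and pooled OLS can be written compactly. The block below is the textbook form for a balanced panel with T periods per unit; the symbols μ (the common intercept), θ, σ_α², σ_ε², and k (the number of time-varying regressors) are notation introduced here, not taken from the text above.

```latex
% Random effects as quasi-demeaning: subtract a fraction theta of each unit mean.
\[
  y_{it} - \theta\,\bar{y}_i
    = (1-\theta)\,\mu + \beta\,\bigl(X_{it} - \theta\,\bar{X}_i\bigr) + \text{error},
  \qquad
  \theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\,\sigma_\alpha^2}} .
\]
% theta = 1 gives the within (FE) transformation; theta = 0 gives pooled OLS.

% Hausman statistic comparing the two estimates of beta:
\[
  H = \bigl(\hat\beta_{FE} - \hat\beta_{RE}\bigr)^{\top}
      \Bigl[\widehat{\mathrm{Var}}(\hat\beta_{FE}) - \widehat{\mathrm{Var}}(\hat\beta_{RE})\Bigr]^{-1}
      \bigl(\hat\beta_{FE} - \hat\beta_{RE}\bigr)
  \;\sim\; \chi^2_k \quad \text{under the RE assumption.}
\]
```

As T or σ_α² grows, θ approaches 1 and the RE estimate converges to the FE estimate, which is one reason the choice matters most in short panels.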

When to use FE

FE is the right tool for panel data when:

The same units are observed across multiple time periods.
Time-invariant unobserved characteristics correlated with X are the suspected confounders.
X varies within units (otherwise there is nothing to estimate from).
The effect of time-invariant variables is not the primary interest.

FE is the wrong tool when:

There is no panel structure (only cross-section, or only time series).
The time-invariant variables are themselves the treatment of interest.
The dominant confounding is time-varying — use IV, RD, DiD with a credible parallel-trends argument, or synthetic controls.
Within-unit variation is very small for most units, leaving the estimate driven by a few outliers.

In applied econometrics FE remains one of the most widely used tools and a sensible default for panel structures. The two practical things to keep in mind in 2026 are cluster-robust standard errors (almost always needed) and the staggered-DiD literature (when treatment timing varies across units), both covered above. For ML pipelines that combine FE structure with flexible nuisance estimation, see causal ML and the DML extensions of control function.