lenatriestounderstand

Chapter 7 of 25

Interpolating a time series

Created May 27, 2026 Updated May 27, 2026

Interpolation fills in values that are missing from a time series, including the points between samples when the series is unevenly sampled. The interesting part is that every method makes a different assumption about what the signal was doing between known measurements. Picking the wrong method silently changes the data that downstream models see.

A note on terminology: in practice people use "interpolation" loosely. Strictly, interpolation estimates values between known time points using only the points themselves (forward-fill, linear, spline, PCHIP). Imputation is broader: it also covers model-based and KNN methods that draw on additional structure (other series, learned generators). Both fill in missing values; the assumptions and risks are different.

The common methods and what each assumes:

  • Forward-fill (and backward-fill). Assumes the value didn't change since the last observation. Right for step-like signals — sensor readings of a setpoint, a configuration value, a category that holds until a new event. Wrong for smoothly-varying signals where the gap likely had real movement.
  • Linear interpolation. Assumes a constant rate of change between the two known points. Right for signals that drift roughly linearly at the time scale of the gap. Wrong for cyclic or volatile signals where a straight line oversmooths.
  • Spline / cubic interpolation. Assumes the underlying signal is smooth and curved. Useful for natural physical signals (temperature, pressure) sampled densely enough that curvature is real. Dangerous when gaps are long enough that the spline starts inventing structure.
  • PCHIP / shape-preserving interpolation. Assumes smooth-ish movement but avoids the overshoot that cubic splines can produce. Useful for bounded or monotonic signals (probabilities, cumulative counters, physical quantities with hard floors and ceilings) where a cubic spline might invent impossible values.
  • Seasonal / model-based imputation. Assumes a specific generative process — daily seasonality, a fitted ARIMA, a learned imputation model. Right when the gap is short and the model is good; wrong when the model is misspecified, because errors get baked into the data instead of left visible.
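  • KNN-based imputation. Looks at nearby observations across multiple correlated series and borrows their pattern. Useful for panel-like data where related entities are observed during the gap. Less useful when the related series are missing over the same gap, since there is then nothing to borrow.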
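The overshoot difference between a cubic spline and PCHIP is easy to see on a bounded signal. The sketch below is an illustration under assumptions: the six sample points are made up, and it uses SciPy's `CubicSpline` and `PchipInterpolator` with their default settings as stand-ins for whichever spline and shape-preserving routines a real pipeline uses.

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

# Hypothetical bounded signal: a fraction in [0, 1] that rises once.
t_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_known = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Dense grid standing in for the timestamps we want to fill.
t_fill = np.linspace(0.0, 5.0, 501)

cubic = CubicSpline(t_known, y_known)(t_fill)
pchip = PchipInterpolator(t_known, y_known)(t_fill)


def out_of_bounds(values, lo=0.0, hi=1.0, tol=1e-12):
    """Count filled values that fall outside the valid range [lo, hi]."""
    return int(np.sum((values < lo - tol) | (values > hi + tol)))


cubic_violations = out_of_bounds(cubic)
pchip_violations = out_of_bounds(pchip)

print("cubic spline out-of-range points:", cubic_violations)
print("PCHIP out-of-range points:", pchip_violations)
print("cubic spline min / max:", cubic.min(), cubic.max())
```

On this data the cubic spline dips below 0 and rises above 1 near the step, while PCHIP stays inside the range of the observed values. A range check like `out_of_bounds` is a cheap guard to run after any smooth interpolation of a bounded quantity.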

The thing to watch for: the choice of interpolation is a modeling choice, not a preprocessing step. Forward-fill on a smoothly-varying signal creates artificial flat plateaus. Linear interpolation on a step-like signal creates artificial ramps. Both will look "fine" downstream because everything is filled in — but the model is now learning from values you invented under wrong assumptions.
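A small experiment makes the mismatch concrete. This is a sketch on invented data, assuming pandas: one event-driven signal that holds its value until the next recorded change, one steadily drifting signal, and the same three-hour gap removed from both. The names `step_truth`, `ramp_truth`, and `fill_errors` are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2026-01-01", periods=12, freq=pd.Timedelta(hours=1))

# Two hypothetical ground-truth signals on the same hourly grid.
step_truth = pd.Series([20.0] * 6 + [25.0] * 6, index=idx)  # setpoint changes at hour 6
ramp_truth = pd.Series(np.arange(12, dtype=float), index=idx)  # steady drift of 1 per hour

# Hours 3, 4 and 5 are missing; the next observation arrives at hour 6.
gap = idx[3:6]


def fill_errors(truth):
    """Mean absolute error of each fill method, measured on the gap only."""
    observed = truth.copy()
    observed.loc[gap] = np.nan
    filled = {
        "ffill": observed.ffill(),
        "linear": observed.interpolate(method="linear"),
    }
    return {
        name: float((s.loc[gap] - truth.loc[gap]).abs().mean())
        for name, s in filled.items()
    }


step_err = fill_errors(step_truth)
ramp_err = fill_errors(ramp_truth)

print("step signal:", step_err)  # forward-fill is exact, linear invents a ramp
print("ramp signal:", ramp_err)  # linear is exact, forward-fill invents a plateau
```

Neither filled series contains a NaN, so both look equally clean to whatever consumes them. The error only shows up because the ground truth is known here, which is rarely the case in practice.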

Watch for time leakage in forecasting pipelines. Linear, spline, and PCHIP interpolation all use the future endpoint of the gap to fill in the missing values, and backward-fill copies that future value outright. That is fine for offline reconstruction of historical covariates, but not safe for features that have to be available at prediction time: at inference the future endpoint doesn't exist yet. Forward-fill is leakage-safe by construction; the others need a different pipeline for online use (one-sided estimators, or last-known-value substitution at the serving boundary).
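One way to check a fill method for leakage is to compute the same feature twice: once on the full history, and once on only the rows that existed at prediction time. If the two values differ, the offline version used information from the future. The sketch below assumes pandas and an invented hourly series; `t_pred` is a hypothetical prediction timestamp.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2026-01-01", periods=8, freq=pd.Timedelta(hours=1))
full = pd.Series([10.0, 11.0, np.nan, np.nan, np.nan, 30.0, 31.0, 32.0], index=idx)

# Pretend a feature must be built at hour 3: later rows do not exist yet.
t_pred = idx[3]
known_at_t = full.loc[:t_pred]

# Offline: the whole series is available, including the far end of the gap.
offline_linear = full.interpolate(method="linear").loc[t_pred]
offline_ffill = full.ffill().loc[t_pred]

# Online: same call, but only on data available at prediction time.
# With no right-hand endpoint, pandas carries the last valid value forward.
online_linear = known_at_t.interpolate(method="linear").loc[t_pred]
online_ffill = known_at_t.ffill().loc[t_pred]

print("linear  offline vs online:", offline_linear, online_linear)
print("ffill   offline vs online:", offline_ffill, online_ffill)
```

Forward-fill gives the same answer in both settings. Linear interpolation gives 20.5 offline because it has already seen the 30.0 at hour 5, a value the serving system could not have known at hour 3.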

Treat target and feature missingness differently. If the missing value is the target y, consider masking it out of the loss rather than inventing a label — the model otherwise learns to predict a fabricated value. If it's an input feature, imputation plus an is_imputed flag is usually safer, because the model can learn to discount fabricated cells if it has the flag.
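Both halves of that advice fit in a few lines of NumPy. This is a minimal sketch on made-up arrays, not a full training loop: `masked_mse` is a hypothetical helper that ignores rows with no observed target, and the feature side uses forward-fill purely as an example imputer, assuming the first feature value is observed.

```python
import numpy as np

# --- Target side: skip missing labels instead of inventing them ---
y_true = np.array([1.0, np.nan, 3.0, np.nan, 5.0])  # NaN = target never observed
y_pred = np.array([1.5, 2.0, 2.0, 9.0, 5.0])


def masked_mse(y_true, y_pred):
    """Mean squared error over rows where the target was actually observed."""
    observed = ~np.isnan(y_true)
    if not observed.any():
        return float("nan")
    return float(np.mean((y_pred[observed] - y_true[observed]) ** 2))


loss = masked_mse(y_true, y_pred)

# --- Feature side: impute, but keep a flag saying which cells were filled ---
x = np.array([0.2, np.nan, np.nan, 0.8, 0.9])
is_imputed = np.isnan(x)

# Forward-fill: for each position, the index of the most recent observed value.
# Assumes x[0] is observed; a leading NaN would need separate handling.
last_seen = np.maximum.accumulate(np.where(is_imputed, 0, np.arange(len(x))))
x_filled = x[last_seen]

# Two columns per row: the (possibly filled) value and the imputation flag.
features = np.column_stack([x_filled, is_imputed.astype(float)])

print("masked MSE:", loss)
print(features)
```

The prediction of 9.0 at the fourth row contributes nothing to the loss, because there is no real label to compare it against. On the feature side, the second column lets a model treat the two filled 0.2 values differently from the observed one.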

Three practical defaults that hold up:

  1. Categorize the signal first. Step-like / linear-drifting / smooth-cyclic / model-driven — pick the method that matches the category, not the method that comes first in the library.
  2. Track an is_imputed flag. Carry a parallel mask of which values were original vs filled. Downstream models can use it as a feature; debugging is easier when failures cluster on imputed regions.
  3. Set a maximum gap size. Interpolating 3 missing minutes is not the same as interpolating 3 missing weeks. Define the largest gap any method is allowed to bridge; beyond that, leave the value missing or fall back to a model-based estimate with its own uncertainty.
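Defaults 2 and 3 can be combined in one helper. The sketch below assumes pandas and a regularly spaced index, so that gap size can be counted in rows; `fill_short_gaps` and `max_gap` are hypothetical names. It measures each run of missing values first, because the `limit` argument of `Series.interpolate` caps how many consecutive values get filled and so would partly fill a long gap rather than leave it alone.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(series: pd.Series, max_gap: int) -> pd.DataFrame:
    """Linearly interpolate gaps of at most `max_gap` consecutive missing rows.

    Longer gaps, and gaps at either end of the series, are left as NaN.
    """
    missing = series.isna()
    # Give every run of consecutive equal mask values its own id, then size it.
    run_id = (missing != missing.shift()).cumsum()
    run_len = missing.groupby(run_id).transform("size")
    fillable = missing & (run_len <= max_gap)

    # limit_area="inside" prevents extrapolation past the first/last observation.
    candidate = series.interpolate(method="linear", limit_area="inside")
    filled = series.where(~fillable, candidate)

    return pd.DataFrame(
        {"value": filled, "is_imputed": fillable & filled.notna()}
    )


idx = pd.date_range("2026-01-01", periods=10, freq=pd.Timedelta(minutes=1))
s = pd.Series(
    [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0, 9.0, np.nan],
    index=idx,
)

out = fill_short_gaps(s, max_gap=2)
print(out)
```

With `max_gap=2`, the single missing minute is filled and flagged, the four-minute gap stays missing, and the trailing NaN is left alone because there is no later observation to interpolate toward. For irregular timestamps the run length would need to be measured in elapsed time rather than rows.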

Full breakdown — interpolation, bucketing, time-zone handling, missing-interval detection, and the order these steps should happen in: see Time Series Preprocessing: Interpolation, Bucketing, Time Zones, and Missing Intervals.