Chapter 8 of 25
Time series data preprocessing — the order that matters
Created May 27, 2026 Updated May 27, 2026
Time series preprocessing is a pipeline, not a checklist. Each step changes what the next step sees, so the order matters more than which exact methods get used. A typical order that holds up across most domains:
1. Index integrity. Confirm the timestamp column parses, is monotonic, and has the expected frequency. Detect and document duplicates, gaps, out-of-order rows. Skip this and every later step is computed on a silently broken index.
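A minimal sketch of such an index check, assuming pandas and a series that already has a datetime index. The helper name `check_index` and the sample data are illustrative, not part of any library; the function only reports problems and leaves the repair policy to the caller.

```python
import pandas as pd

def check_index(ts: pd.Series, freq: str) -> dict:
    """Report index problems without fixing them (hypothetical helper)."""
    idx = pd.DatetimeIndex(ts.index)
    dupes = idx[idx.duplicated()]  # second and later occurrences of a timestamp
    # Gaps are judged on the sorted, de-duplicated index against the expected grid.
    clean = idx.drop_duplicates().sort_values()
    expected = pd.date_range(clean.min(), clean.max(), freq=freq)
    missing = expected.difference(clean)
    return {
        "monotonic": bool(idx.is_monotonic_increasing),
        "duplicates": list(dupes),
        "missing": list(missing),
    }

# Illustrative data: one duplicate day, one missing day, one out-of-order row.
raw = pd.Series(
    [1.0, 2.0, 2.5, 4.0, 3.0],
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-05", "2024-01-04"]
    ),
)
report = check_index(raw, "D")
```

Returning a report rather than a repaired series keeps the "detect and document" part separate from whatever deduplication or reordering policy is chosen later.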
2. Time-zone normalisation. Normalise timestamps to a canonical representation, usually UTC, before any cross-source merge. But choose bucket boundaries in the time zone where the process is defined: daily sales, store activity, school schedules, and working-hour load curves all have their natural boundary in the local time zone, not UTC. Rule of thumb: store in UTC, aggregate in the business / event time zone. Daylight-saving transitions create phantom missing hours (spring forward) and double-counted hours (fall back) in local wall-clock time, and both mimic real data issues.
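The "store in UTC, aggregate locally" rule can be sketched with pandas as below. The zone `America/New_York` and the dates around the 2024 spring-forward transition are illustrative choices, and the hourly event series is synthetic.

```python
import pandas as pd

# Hourly events stored in UTC, covering three full New York days
# around the 2024-03-10 spring-forward transition.
utc_index = pd.date_range(
    "2024-03-09 05:00", "2024-03-12 03:00", freq="60min", tz="UTC"
)
events = pd.Series(1, index=utc_index)

# Buckets cut at UTC midnight split each local day across two buckets.
per_utc_day = events.resample("D").sum()

# Buckets cut at local midnight: the transition day has 23 hours.
local = events.tz_convert("America/New_York")
per_local_day = local.groupby(local.index.date).sum()
```

Here the local buckets hold 24, 23, and 24 events. The 23 is a property of the calendar, not a dropped observation, which is why a gap detector working in local time needs to know about the transition.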
3. Bucketing and resampling. Decide the target frequency and resample. Aggregations (mean, sum, last) are not interchangeable: sum for event counts and flows, last for stocks and levels, mean for rates. Resample before scaling because a series' magnitude and variance change with bucket size.
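As a rough illustration of matching the aggregation to the kind of quantity, assuming pandas; the column names and values are made up for the example.

```python
import pandas as pd

idx = pd.date_range("2024-01-01 00:00", periods=6, freq="10min")
df = pd.DataFrame(
    {
        "orders": [3, 0, 2, 1, 4, 0],                     # events per bucket -> sum
        "inventory": [100, 97, 97, 95, 94, 90],           # level at sample time -> last
        "cpu_pct": [20.0, 40.0, 60.0, 10.0, 10.0, 40.0],  # rate -> mean
    },
    index=idx,
)

# One aggregation per column, chosen by what the column measures.
half_hourly = df.resample("30min").agg(
    {"orders": "sum", "inventory": "last", "cpu_pct": "mean"}
)
```

Applying a single aggregation to every column would be shorter, but it would average an inventory level or sum a percentage, and neither result means anything.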
4. Missing-value semantics. Distinguish three causes that look identical in the data:
- Missing by absence — the event didn't happen, the value is genuinely zero.
- Missing by failure — the sensor or pipeline dropped the observation, the true value is unknown.
- Missing by structure — the entity wasn't being observed yet, the row shouldn't exist.
Each calls for a different fill strategy. Conflating them is one of the most common silent data bugs.
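One way to keep the three causes apart is to resolve them explicitly before any generic fill runs. The sketch below assumes pandas and assumes that the first-observation date and an outage log are available from metadata; both are hypothetical inputs here, as is the sales data.

```python
import numpy as np
import pandas as pd

grid = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([np.nan, np.nan, 5.0, np.nan, 7.0, np.nan], index=grid)

first_observed = pd.Timestamp("2024-01-03")     # store opened (hypothetical)
outage_days = pd.DatetimeIndex(["2024-01-06"])  # logged pipeline failure (hypothetical)

# Missing by structure: drop rows from before the entity existed.
sales = sales[sales.index >= first_observed]

# Missing by failure: leave NaN so a later step treats the value as unknown.
failed = sales.index.isin(outage_days)

# Missing by absence: any other NaN means no event happened, so it is a real zero.
sales = sales.mask(sales.isna() & ~failed, 0.0)
```

After this step the only NaN left is the logged outage day, so an interpolation pass can no longer overwrite a genuine zero.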
5. Interpolation. Fill in the genuinely missing values (the missing-by-failure case) using a method that matches what the signal does between samples (forward-fill for step-like signals, linear for drift, spline for smooth curves, model-based for a known generative process). Covered in Interpolating a time series.
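A small sketch of two of these choices with pandas, using made-up series: a thermostat setpoint that holds its value between changes, and a temperature that drifts.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="D")
setpoint = pd.Series([20.0, np.nan, np.nan, 22.0, np.nan], index=idx)     # step-like
temperature = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan], index=idx)  # drifting

# Step-like signal: the last known value stays valid until it changes.
setpoint_filled = setpoint.ffill()

# Drifting signal: interpolate linearly in time. limit_area="inside" fills only
# between known points and leaves the trailing gap as NaN instead of extrapolating.
temperature_filled = temperature.interpolate(method="time", limit_area="inside")
```

Leaving the trailing gap unfilled is a deliberate choice in this sketch: a value after the last observation is a forecast, not an interpolation.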
6. Stationarity check / differencing. Test for trends and unit roots if a classical model (ARIMA, SARIMA, SARIMAX) is downstream. Differencing changes the scale and the variance, so this has to happen before any scaling. Skip or rethink this step for tree-based models, global neural models, and foundation models — they don't require a stationary input unless the specific recipe explicitly expects a differenced target.
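The effect of differencing on scale can be seen on a synthetic trending series. This sketch assumes pandas and numpy and uses generated data; it shows only the differencing itself, while a formal unit-root test (for example an augmented Dickey-Fuller test) would normally inform whether to difference at all.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)
idx = pd.date_range("2024-01-01", periods=200, freq="D")

# Synthetic series: linear trend plus unit-variance noise (illustrative only).
y = pd.Series(0.5 * t + rng.normal(0.0, 1.0, size=200), index=idx)

# First difference: removes the linear trend, leaving slope plus differenced noise.
dy = y.diff().dropna()

level_std = y.std()   # dominated by the trend
diff_std = dy.std()   # dominated by the noise
```

The standard deviation of the level is many times that of the differenced series, so a scaler fitted on `y` would be badly mismatched if the model is then trained on `dy`.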
7. Scaling / normalisation. Standardise or normalise last, because every earlier step changes the distribution. Per-series scaling for global neural models; per-window scaling for short-window patches; no scaling at all for tree-based models.
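Per-series standardisation can be sketched with a pandas group-by. The long-format layout and the column names `series_id` and `y` are assumptions made for the example.

```python
import pandas as pd

long = pd.DataFrame(
    {
        "series_id": ["a"] * 4 + ["b"] * 4,
        "y": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
    }
)

# Statistics are computed per series, so series with very different
# magnitudes end up on a comparable scale.
stats = long.groupby("series_id")["y"].agg(["mean", "std"])
long = long.join(stats, on="series_id")
long["y_scaled"] = (long["y"] - long["mean"]) / long["std"]
```

Keeping `mean` and `std` as columns makes it straightforward to invert the transform on predictions. In a forecasting setup the statistics would typically be computed on the training window only.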
8. Feature engineering. Two sub-categories that don't share a slot:
- Calendar / Fourier features (day-of-week, month, holiday flags, harmonics) can be created once the correct local calendar is known — usually right after step 2.
- Lag and rolling features (lag-7, rolling mean over the last k buckets) need the target grid and the missing-value policy fixed first; build them last, on the cleaned signal.
Both inherit whatever's still wrong upstream.
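A brief sketch of the two feature families, assuming pandas and a daily series that is already on its final grid. The window length of 3 stands in for k and the data is synthetic.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"y": range(14)}, index=idx).astype(float)

# Calendar features depend only on the (local) timestamp.
df["dow"] = df.index.dayofweek   # Monday = 0
df["month"] = df.index.month

# Lag and rolling features depend on the cleaned target. shift(1) keeps the
# rolling window strictly in the past, so the current value does not leak in.
df["lag_7"] = df["y"].shift(7)
df["roll_mean_3"] = df["y"].shift(1).rolling(3).mean()
```

The first rows of the lag and rolling columns are NaN by construction. That is a structural gap, and it is usually handled by dropping those rows rather than by imputing them.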
The common traps:
- Scaling before missing-value handling. The scaler sees the raw distribution including the soon-to-be-filled gaps, so the scale gets pulled by whatever placeholder existed.
- Imputing before defining the target time grid. If bucket boundaries aren't fixed first, fills get smeared across intervals and aggregates are computed over invented values. (Resampling and interpolation are sometimes intertwined — bringing irregular sensor measurements to a 1-minute grid is one operation, not two — but in that case the grid is part of the resampling step, not something the imputer infers on its own.)
- Treating "missing by absence" as "missing by failure". Forward-filling carries a stale nonzero value into a bucket that should be zero, and imputation turns a non-event into a fake event.
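The first trap above, scaling before missing-value handling, is easy to reproduce. This sketch assumes pandas and a hypothetical upstream convention of writing -999 for "no reading".

```python
import numpy as np
import pandas as pd

# -999 is a hypothetical sentinel the upstream system writes for "no reading".
raw = pd.Series([10.0, 12.0, -999.0, 11.0, 13.0])

# Wrong order: statistics computed while the placeholder is still in the data.
mean_wrong, std_wrong = raw.mean(), raw.std()

# Right order: mark the gap, fill it, then compute the statistics.
filled = raw.replace(-999.0, np.nan).interpolate()
mean_right, std_right = filled.mean(), filled.std()
```

With the placeholder included, the mean is pulled far below every real reading and the standard deviation is inflated by orders of magnitude, so every scaled value is compressed into a narrow band.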
Full breakdown — interpolation specifics, bucketing edge cases, time-zone wrinkles, missing-interval detection patterns: see Time Series Preprocessing: Interpolation, Bucketing, Time Zones, and Missing Intervals.