Chapter 2 of 8
Time Series Preprocessing: Interpolation, Bucketing, Time Zones, and Missing Intervals
Created Apr 28, 2026 Updated May 3, 2026
Before a time series can be visualized, compared, or passed into most forecasting models, its time axis usually has to be made explicit and regular. Real data rarely arrives in the clean form that textbook methods expect: timestamps can be irregular, duplicated, missing, recorded in different time zones, or produced as event streams rather than as periodic measurements. Between the raw data and a model that can use it there is almost always a preprocessing layer, and the choices made in that layer often matter more than the choice of model at the end of the pipeline.
Three practical questions tend to drive that layer. Is the time axis regular, sorted and free of duplicates? What does an absent value actually mean — that nothing happened, that something happened and was not recorded, or that a measurement was lost in transit? And in which time zone are the buckets supposed to be defined, given that "a day" or "an hour" is not a globally unambiguous unit? This note walks through the operations that answer those questions in practice — preliminary checks on the time axis, interpolation of continuous measurements, bucket aggregation of event streams, zero-filling and the more honest forms of missing-value handling, dropping incomplete final buckets, UTC normalization, and building several consistent series on one shared time grid.
The important point is that none of this is cosmetic cleanup. Every preprocessing decision encodes an assumption about what missing time means. The sections below try to make those assumptions explicit rather than hide them behind defaults.
Regularity, sorting and duplicates
Before any interpolation or aggregation, preprocessing usually starts with a few mechanical checks on the time axis itself. They are not glamorous, but skipping them is the source of most "the model is doing something weird" surprises later on.
The first check is sorting. Almost every interpolation, rolling-statistic or aggregation routine assumes that timestamps come in monotonically increasing order, and most do not verify this — they happily produce nonsense if the order is wrong. A single sort by timestamp at the very start of the pipeline is the cheapest defensive measure available.
The second check is uniqueness. Two rows with the same timestamp can mean very different things, and the right reaction depends on the semantics. In raw event data, several events naturally occur in the same millisecond and should be aggregated by counting or summing. In sensor measurements, two values for the same sensor at the same timestamp usually indicate retry logic, late arrival, or a data-quality bug, and silently keeping both will distort downstream averages. Interpolation routines in particular treat duplicate x values as ill-defined and will either fail loudly or produce arbitrary output, depending on the library.
The third check is regularity. A regular series has a fixed sampling interval (every second, every five minutes, every day); an irregular series has timestamps at arbitrary distances. Most classical methods (ARIMA, frequency-domain analysis, simple rolling statistics) assume regularity, so an irregular series typically has to be projected onto a uniform grid first — exactly what the interpolation and bucket-aggregation sections below are about. Even when a downstream model could in principle handle irregular spacing (Transformers with continuous time embeddings, neural ODEs, point-process models), making the regularity assumption explicit at the top of the pipeline still pays off, because everything else — visualization, comparison across series, anomaly detection — is much easier on a regular grid.
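As a sketch of what these three checks look like in practice, assuming a pandas DataFrame df with columns ts and value (both names illustrative):

import pandas as pd

# Illustrative raw data: unsorted, with one duplicated timestamp.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2026-05-03 10:05", "2026-05-03 10:00",
                          "2026-05-03 10:05", "2026-05-03 10:15"]),
    "value": [3, 1, 2, 4],
})

# 1. Sorting: almost everything downstream assumes increasing timestamps.
df = df.sort_values("ts").reset_index(drop=True)

# 2. Uniqueness: decide explicitly what a duplicate timestamp means.
#    For event counts summing is natural; for sensor readings duplicates
#    usually deserve investigation rather than silent aggregation.
if df["ts"].duplicated().any():
    df = df.groupby("ts", as_index=False)["value"].sum()

# 3. Regularity: inspect the spacing between consecutive timestamps.
deltas = df["ts"].diff().dropna()
print(deltas.nunique() == 1)   # False here: 5- and 10-minute gaps coexist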
scipy.interpolate.interp1d
scipy.interpolate.interp1d performs one-dimensional interpolation of a function given by a set of points. The classical task: given points (x₁, y₁), (x₂, y₂), ..., you need to estimate y at an arbitrary point x that does not necessarily land on a node.
Interpolation is fundamentally different from extrapolation. Interpolation is the estimation of values inside the range of known points; it usually works reliably if the points themselves cover the range densely enough. Extrapolation is the estimation of values outside the range; far more dangerous, because we have no data about the function's behavior beyond the boundaries and have to rely on the model's assumptions. Below we examine three main methods of interpolation in interp1d, each with its own niche.
Linear Interpolation
Linear interpolation is the simplest and most widespread. The idea is elementary: between two known points the function is modeled by a straight-line segment, and the value at any intermediate point is computed from the equation of this line.
Formula between neighboring points (x₁, y₁), (x₂, y₂):
y(x) = y₁ + (y₂ - y₁) × (x - x₁) / (x₂ - x₁)
Properties:
- Simple, fast, robust.
- Does not overshoot — the result always lies between the known values.
- Continuous in value but discontinuous in derivative: the curve is piecewise-linear, with corners at the joints.
Linear interpolation has the important property of monotonicity preservation: if the known points are monotonically increasing, the interpolation is also increasing, and values never go outside the range of known y. This is called the absence of overshoot and is critical for physical quantities that have understandable bounds (for example, probabilities in [0, 1] or percentage indicators).
The price for this robustness is the discontinuous derivative at the joints: the curve comes out piecewise-linear, and the first derivative changes abruptly at every node. Visually this is seen as sharp corners on the graph, which for physical signals can look unnatural.
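A minimal numpy illustration of the segment formula above and of the no-overshoot property (the numbers are made up):

import numpy as np

x_known = np.array([0.0, 10.0, 20.0])
y_known = np.array([1.0, 3.0, 2.0])

# np.interp evaluates the piecewise-linear interpolant at arbitrary points.
print(np.interp([5.0, 12.5], x_known, y_known))    # [2.   2.75]

# The same value at x = 5 from the segment formula between (0, 1) and (10, 3):
print(1.0 + (3.0 - 1.0) * (5.0 - 0.0) / (10.0 - 0.0))   # 2.0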
When to use:
- Noisy data, where the "smoothness" of a floating spline is inappropriate.
- Monotone signals where it is important not to go beyond the bounds.
- Always as a fallback when you are not sure of more sophisticated methods or there is too little data for splines.
Cubic Spline
Cubic spline is a much smoother interpolation. Between each pair of neighboring points the function is modeled by a cubic polynomial, and the coefficients of these polynomials are chosen so that the curve is twice continuously differentiable at all nodes.
Idea: on each interval [xᵢ, xᵢ₊₁] the function is a cubic polynomial of the form aᵢ + bᵢx + cᵢx² + dᵢx³.
Conditions at the joints
At each interior point three conditions are satisfied:
- C⁰ continuity — the value matches (S_left(xᵢ) = S_right(xᵢ) = yᵢ).
- C¹ continuity — the first derivative matches (smooth slope).
- C² continuity — the second derivative matches (smooth curvature).
The physical meaning of the conditions is as follows. C⁰ continuity guarantees that the curve passes through all known points without breaks. C¹ continuity makes it smooth in slope — no corners, the first derivative is continuous. C² continuity — the strongest requirement — makes the second derivative continuous as well, that is, the curvature. Visually this means the curve looks like a trajectory along which a physical particle could move without jolts.
Boundary conditions
The smoothness conditions at the interior nodes are not enough for a unique solution — two more boundary conditions at the endpoints of the interval are needed.
- Natural spline — second derivative equals zero at the endpoints.
- Clamped spline — the first derivative at the endpoints is specified.
- Not-a-knot — the third derivative is continuous at the second and second-to-last nodes (scipy default).
A natural spline assumes that the curve straightens out at the endpoints (the second derivative vanishes), which is convenient for physical signals. A clamped spline specifies the first derivative at the endpoints explicitly — useful when you know the derivative from additional data. Not-a-knot is a more cunning condition used by scipy by default: it requires that the third derivative be continuous at the second and second-to-last nodes, that is, these nodes "are not real nodes" of the spline.
Computational complexity
Mathematically, for N points we end up with N−1 polynomials, each with four coefficients, that is, 4(N−1) unknowns in total. The conditions at the joints (three per interior node), passing through known points, and the boundary conditions yield exactly 4(N−1) equations.
This system has a tridiagonal structure — each equation links only neighboring intervals. Thanks to this it can be solved in O(N) operations by the Thomas algorithm instead of the O(N³) of ordinary Gaussian elimination.
Properties
- Smooth (C²) — visually pleasing for physical signals.
- Can overshoot — between points the values can exceed the known range.
- Requires at least 4 data points: a cubic spline needs enough knots to define its third-degree pieces together with the chosen boundary conditions.
Cubic splines have a dangerous feature — overshoot. A cubic polynomial passing through several points can have local extrema between them whose values fall outside the range of known y. For physical quantities this leads to absurd results: interpolating mass can yield a negative number, interpolating a probability — a value greater than one.
The four-point requirement of interp1d(kind="cubic") is structural rather than purely a question of numerical stability: a cubic piece is defined by four coefficients, and once you also have to satisfy continuity and boundary conditions across the spline as a whole, three or two points simply do not carry enough information to pin down a third-degree spline in the usual way. In production code it is common to fall back to linear interpolation automatically when fewer than four points are available — both to satisfy this requirement and to avoid the wild swings that a barely-constrained cubic can produce when it is forced through too few anchors.
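A minimal sketch of that fallback, using the legacy interp1d API discussed in this section (the helper name is made up):

import numpy as np
from scipy.interpolate import interp1d

def make_interpolator(x, y):
    # Cubic needs at least four points; otherwise fall back to linear.
    kind = "cubic" if len(x) >= 4 else "linear"
    return interp1d(x, y, kind=kind, bounds_error=False, fill_value=np.nan)

f = make_interpolator([0, 1, 2], [0.0, 1.0, 0.5])   # three points, so linear
print(f(0.5))   # 0.5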
When to use: physical signals (temperature, pressure, flows) — where we know that the physical quantity changes smoothly.
Nearest
Nearest interpolation is the crudest approach: for an arbitrary point we take the value of the closest known point along the X axis.
Formula:
y(x) = yᵢ, where i = arg minⱼ |x - xⱼ|
Nearest does not "interpolate" in the strict sense — it produces a step function that coincides with the known value within each Voronoi interval around the corresponding point. Visually the result looks like a series of horizontal plateaus with vertical jumps between them.
This is the right choice for discrete statuses — when the interpolated quantity cannot change smoothly. For example, sensor status "on/off", categorical labels, equipment operating mode flags. Trying to linearly interpolate such data is meaningless: an "average" between on and off has no physical meaning.
When to use: discrete statuses (on/off), step functions, categorical data.
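For example, interpolating an on/off flag with kind="nearest" (times and values are illustrative):

import numpy as np
from scipy.interpolate import interp1d

t = np.array([0, 60, 120, 180])          # seconds
status = np.array([0, 1, 1, 0])          # 0 = off, 1 = on

f = interp1d(t, status, kind="nearest")
print(f([10, 95, 170]))   # [0. 1. 0.], a step function rather than a fictitious 0.5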
Twenty-five samples drawn from a smooth synthetic series, with a configurable gap in the middle. Toggle each method to overlay its reconstruction; the dashed grey line is the unobserved truth, and the MAE next to each row is computed against the hidden values. Push the gap wide enough and the cubic spline starts overshooting the visible range — exactly the failure mode that makes it a poor default for bounded quantities. PCHIP is the shape-preserving alternative that stays in range.
Parameters of interp1d
scipy.interpolate.interp1d supports several parameters that control the behavior outside the range of known points:
- bounds_error=False — do not crash if the target x is outside the known range. The default is True, which raises an exception.
- fill_value='extrapolate' — extrapolate beyond the boundaries. By default NaN is returned.
- fill_value=(a, b) — specify left and right constants.
By default, if you request a value outside the range, the function raises an exception — this is defensive behavior but often inconvenient. The parameter bounds_error=False disables it, and then fill_value takes effect: you can return NaN, extrapolate, or specify constants.
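For example (a small sketch; the data points are made up):

import numpy as np
from scipy.interpolate import interp1d

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0, 1.0])

f_nan   = interp1d(x, y, bounds_error=False)                         # NaN outside the range
f_const = interp1d(x, y, bounds_error=False, fill_value=(0.0, 1.0))  # left/right constants
f_extra = interp1d(x, y, fill_value="extrapolate")                   # continue past the edges

print(f_nan(3.0), f_const(3.0), f_extra(3.0))   # nan 1.0 0.0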
Modern tools related to 1-D interpolation
interp1d is still widely used and supported, but in current SciPy documentation it is explicitly marked as a legacy API: the class is no longer being developed, removal is not currently planned, and for new code SciPy recommends switching to more specialized tools depending on which kind of interpolation is actually needed. The reason is partly historical (interp1d is a single class hiding several quite different algorithms behind a kind argument) and partly practical — the more specialized classes give better control over boundary conditions, extrapolation behavior and numerical conditioning.
The direct replacements depend on which kind you were using:
kind="linear"→numpy.interpfor plain evaluation, orscipy.interpolate.make_interp_spline(x, y, k=1)if you want a spline object you can reuse.kind="quadratic"or"cubic"→scipy.interpolate.make_interp_splinewithk=2ork=3.kind="nearest","previous","next"→ there is no specialized class; the standard recipe is to find the right index withnumpy.searchsortedand pull the correspondingyvalue yourself.
Beyond the strict drop-in replacements, a few related tools are worth knowing about for situations interp1d does not cover well:
- CubicSpline — a more modern cubic-spline API with explicit C² continuity and several boundary-condition options.
- BSpline — works in the B-spline basis directly, numerically more stable for large numbers of knots and the natural building block for smoothing splines.
- PchipInterpolator — Piecewise Cubic Hermite Interpolating Polynomial, a shape-preserving cubic interpolator. On monotone input data it preserves monotonicity and avoids the overshoot typical of ordinary cubic splines, which makes it a good default for physical quantities that have to stay non-negative or bounded (probabilities, masses, percentages).
- RegularGridInterpolator — for multidimensional interpolation on a regular grid in 2D/3D, which interp1d does not handle at all.
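A small comparison of the two most common picks, make_interp_spline and PchipInterpolator, on step-like monotone data (the values are made up):

import numpy as np
from scipy.interpolate import make_interp_spline, PchipInterpolator

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # monotone, bounded in [0, 1]

cubic = make_interp_spline(x, y, k=3)      # replacement for kind="cubic"
pchip = PchipInterpolator(x, y)            # shape-preserving alternative

xs = np.linspace(0, 4, 401)
print(cubic(xs).max())   # > 1: the plain cubic overshoots the data range near the step
print(pchip(xs).max())   # 1.0: PCHIP stays inside [0, 1] on monotone input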
Risks of extrapolation
A separately important topic is extrapolation, that is, estimating values outside the range of known points. Linear extrapolation is mechanically simple — you just continue the last segment beyond the boundary — but that simplicity should not be mistaken for safety: if the underlying process changes outside the observed range (and most real processes do), the extrapolated values can be arbitrarily wrong, just along a straight line instead of a curve. Cubic extrapolation is typically even riskier, because the polynomial tails of a third-degree spline can diverge to ±∞ very quickly once you leave the data, but the right mental model is not "linear safe, cubic dangerous" — it is "all extrapolation is a guess, and some guesses fail more spectacularly than others".
Typical engineering practice is to deliberately disable extrapolation in uncertain cases. Better to return NaN and explicitly show "we don't know" than to produce a fictitious value that may look plausible but be utterly wrong. This is the general principle: silence is more honest than confident lies.
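With numpy.interp, for instance, that principle is one argument away (sample values are made up):

import numpy as np

xp = np.array([0.0, 50.0, 100.0])
fp = np.array([10.0, 20.0, 15.0])

# Inside the range: interpolate. Outside: return NaN instead of pretending to know.
print(np.interp([25.0, 120.0], xp, fp, left=np.nan, right=np.nan))   # [15. nan]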
Eleven samples of a smooth function on x ∈ [0, 100]. The slider extends the x-axis past the data; linear and cubic extrapolations are drawn against the (unknown) truth. For tiny horizons both look plausible. Push the slider out and the cubic tail diverges quickly — a third-degree polynomial has nothing anchoring it past the boundary. Linear is tamer but still drifts off the truth: the right answer for “don’t know” is usually NaN, not a confident extrapolated number.
Bucket aggregation
Bucket aggregation is a classical technique for working with time series, in which a stream of events is converted into a regular series. Raw events occur at arbitrary moments in time, and to plot a graph or apply standard analysis methods you need to project them onto a uniform grid.
What it is: the projection of events onto a uniform grid — each event lands in a "bucket" of fixed length.
A bucket is a time interval of fixed length, for example five minutes. We assign each event to the bucket it belongs to, most often by rounding its timestamp down to the start of the bucket. If an event occurred at 14:37:12 with a bucket size of 5 minutes, it lands in the bucket [14:35:00, 14:40:00).
After such a projection we can compute aggregates over each bucket — how many events occurred, what the average latency was, how many unique users — and obtain a regular time series suitable for graphs and statistical analysis.
Typical set of intervals
In analytical dashboards a set of standard bucket sizes is usually supported: 5 minutes, 30 minutes, an hour, a day. The choice of a specific size depends on the length of the analyzed window — see the section on adaptive interval selection below.
Rounding the timestamp to the start of the bucket is done with simple arithmetic:
# for 5-minute buckets
bucket_start = dt.replace(
minute=dt.minute // 5 * 5,
second=0,
microsecond=0,
)
All events with the same bucket_start land in one bucket.
This replace-based version is fine for fixed sub-hourly buckets such as 5 or 30 minutes, where the bucket boundary is just a minute field rounded down. For hourly or daily buckets it needs a small extension (zeroing minute as well, plus paying attention to the timezone if the day boundary is meant to be local rather than UTC), and for arbitrary bucket sizes such as 7 minutes or 90 minutes the cleanest formulation is to compute the bucket start from epoch time — floor(timestamp_seconds / bucket_seconds) * bucket_seconds — or to use a library-level floor operation such as pandas' dt.floor("5min").
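A sketch of both variants, with made-up timestamps:

from datetime import datetime, timezone
import pandas as pd

dt = datetime(2026, 4, 28, 14, 37, 12, tzinfo=timezone.utc)

# Epoch-based flooring works for any bucket size, here 90 minutes:
# floor(timestamp_seconds / bucket_seconds) * bucket_seconds.
bucket_seconds = 90 * 60
ts = dt.timestamp()
bucket_start = datetime.fromtimestamp(ts - ts % bucket_seconds, tz=timezone.utc)
print(bucket_start)   # 2026-04-28 13:30:00+00:00

# The library-level equivalent for a whole pandas series:
s = pd.Series(pd.to_datetime(["2026-04-28 14:37:12"], utc=True))
print(s.dt.floor("5min"))   # 2026-04-28 14:35:00+00:00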
Adaptive interval selection
Bucket size is a critical visualization parameter. Too small over a long window gives noisy graphs in which the main trends get lost in random fluctuations; too large over a short window hides the details and makes the graph boringly flat. The right size depends on the length of the displayed window.
Typical heuristic:
- Window less than an hour → 5 minutes.
- Less than a day → 30 minutes.
- Less than a week → an hour.
- More than a week → a day.
Such a heuristic gives a stable range of points on the graph, usually 20–200, which is well perceived by the eye and performant for the frontend. The specific thresholds are chosen empirically and provide a good balance between detail and readability across different scales.
The logic is simple: if the data fits within an hour, 5-minute buckets give about 12 points — enough detail without overload. Up to a day — 30-minute buckets (48 points). Up to a week — hourly (168 points). More than a week — daily, and the graph is not overloaded.
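Written out as a small helper, with the thresholds exactly as above (the function name is illustrative):

from datetime import timedelta

def pick_bucket_size(window: timedelta) -> timedelta:
    if window < timedelta(hours=1):
        return timedelta(minutes=5)
    if window < timedelta(days=1):
        return timedelta(minutes=30)
    if window < timedelta(weeks=1):
        return timedelta(hours=1)
    return timedelta(days=1)

print(pick_bucket_size(timedelta(hours=6)))   # 0:30:00, i.e. 30-minute buckets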
A 60-minute event stream with two bursts and a quiet stretch, projected onto a uniform grid. The slider picks the bucket size; bars show the count per bucket and the orange rug at the bottom is the raw event stream. Drop the size to 30s and the burst at minute 21 fragments into noise; push it to 30m and the whole hour collapses into two bars. The 5–15m range hits the heuristic sweet spot.
Missing does not always mean zero
Before deciding how to fill empty intervals, it is worth being precise about what an empty interval actually means in the data at hand. A missing bucket can correspond to several quite different situations: no events occurred, a sensor or service was down, an upstream pipeline failed, a value exists but arrived late, or a value was filtered out by permissions or retention rules. Only the first of these should be treated as a true zero; the others usually want to stay as NaN, be marked with a data-quality flag, or be handled by a separate incident or anomaly mechanism.
This distinction is one of the most important preprocessing decisions in time series work, because the choice silently propagates everywhere. A bucket filled with zero and a bucket marked as missing produce identical-looking graphs but behave differently in every aggregation downstream — averages, rolling statistics, training labels. Treating "we did not record anything" as "nothing happened" is the kind of error that does not crash the pipeline and does not show up in unit tests; it just quietly biases every model trained on the data thereafter.
The next two sections cover the two most common safe defaults: zero-filling for genuine event-count gaps, and dropping the incomplete final bucket. The point is not that these are universal answers — they are the correct answers when the semantics of the missing interval match them, and the wrong answers otherwise.
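One way to keep the distinction explicit is to label each gap with information from outside the series before filling anything. In the sketch below, collector_up is an assumed external availability flag; the data is made up:

import numpy as np
import pandas as pd

idx = pd.date_range("2026-05-01", periods=6, freq="h", tz="UTC")
counts = pd.Series([12, np.nan, np.nan, 7, np.nan, 9], index=idx)
collector_up = pd.Series([True, True, False, True, True, True], index=idx)  # assumed external signal

filled = counts.copy()
filled[filled.isna() & collector_up] = 0   # collector was up, nothing arrived: a real zero
# Hours where the collector was down stay NaN: the truthful value is "unknown".
print(filled)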
Same source data, three fill policies. Two regions are missing in the source: one because a sensor was offline (truth unknown), one because there were genuinely no events. The 30-day mean drifts in opposite directions depending on which policy is applied universally — zero-fill biases it down by inventing zeros for the outage; interpolation biases it up by inventing activity for the quiet stretch. Only the third panel — labelling each gap before filling it — stays close to the truth.
Zero-filling (constant interpolation)
After bucket aggregation a question arises: what to do with periods when there were no events? They can either be left absent in the series (sparse representation) or explicitly set to zero (dense representation with zero-filling). For event-count metrics on a regular grid, the latter is usually more correct.
What we do: all buckets in the chosen time range are generated, and empty ones receive the value 0. Mathematically this is constant imputation — filling gaps with a constant.
For count metrics (number of events, number of requests, unique users), zero-filling is the right approach when the missing bucket means that no events occurred. If the bucket is missing because the data was not collected — the logger was down, the export job failed, the time window is outside the data-retention horizon — then the truthful value is not zero but unknown, and forcing zero into it would invent a fact rather than represent one. The check is semantic, not syntactic: "no events" and "no data" can produce identical-looking gaps and need to be told apart from outside the series.
Imagine 5-minute buckets where the bucket at 14:00 has 100 requests and the bucket at 14:10 has 200 requests, but the bucket at 14:05 is missing entirely because no events landed in it. Linear interpolation between the two existing buckets would invent a value of around 150 for the 14:05 slot — a smooth transition that looks plausible on the graph but corresponds to nothing in the underlying data. For an event-count metric where the gap really does mean "no traffic", zero-filling is the only way to express that honestly.
Zero-filling makes the statement "no events happened in this bucket" explicit, instead of letting the gap silently disappear into a smooth curve. It also removes a particular class of fake signals from downstream analysis: rolling averages over empty buckets stay near zero instead of drifting toward neighboring values, and dashboards do not invent activity in windows where none was recorded. (It does not, and should not, prevent anomaly detectors from raising alarms on a sudden drop to zero — sometimes that drop is itself the anomaly. The point is that the alarm is then triggered by real, explicit zeros rather than by the absence of data.)
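In pandas the pattern is: aggregate to buckets, then reindex onto the full display grid with fill_value=0 (the timestamps here are made up):

import pandas as pd

events = pd.Series(1, index=pd.to_datetime(
    ["2026-05-03 14:01", "2026-05-03 14:02", "2026-05-03 14:11"], utc=True))

counts = events.resample("5min").count()                 # covers 14:00..14:10 only
grid = pd.date_range("2026-05-03 14:00", "2026-05-03 14:25",
                     freq="5min", tz="UTC")
dense = counts.reindex(grid, fill_value=0)               # explicit zeros elsewhere
print(dense.tolist())   # [2, 0, 1, 0, 0, 0]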
Dropping the incomplete final bucket
One of the non-obvious bugs in bucket aggregation is the last bucket, which covers the period "up to now" but has not yet ended. If you do not handle it correctly, a false "dip" appears at the end of the graph.
A specific example. Data is available up to 14:37 and the bucket size is 5 minutes. The last bucket is [14:35:00, 14:40:00), but we have collected events only up to 14:37, that is, just 2 minutes out of 5. As a result, this bucket will have roughly 2.5 times fewer events than a full 5-minute interval would; on the graph this looks like a sudden drop in activity, although in reality there is no drop.
The solution is to think in terms of completed boundaries rather than in terms of which comparison operator to use. A bucket [start, end) is complete when its end is at or before the current cut-off time, so the rule is: generate buckets only up to the most recent complete boundary. The naive formulation while bucket_end < end_time is close to right and works almost always, but it has a small edge case — if end_time happens to land exactly on a bucket boundary, the previous bucket is already finished and should still be included, yet the strict < comparison would drop it. The cleaner formulation is to compute the last complete boundary first (round end_time down to the nearest multiple of the bucket size) and stop bucket generation there. The last partial bucket is then naturally excluded, exact-boundary timestamps are handled correctly, and the rule is one a future reader of the code can derive from the variable names without having to remember which way the inequality leans.
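A sketch of the boundary computation (the cut-off time is made up):

from datetime import datetime, timedelta, timezone

bucket = timedelta(minutes=5)
end_time = datetime(2026, 5, 3, 14, 37, tzinfo=timezone.utc)   # "now"

# Round the cut-off down to the nearest bucket boundary; every bucket whose
# end is at or before this boundary is complete and safe to show.
seconds = end_time.timestamp()
last_complete = datetime.fromtimestamp(
    seconds - seconds % bucket.total_seconds(), tz=timezone.utc)
print(last_complete)   # 2026-05-03 14:35:00+00:00, so the last shown bucket is [14:30, 14:35)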
The same pattern applies wherever a sliding window passes over a growing data stream: in real-time visualization, in computing rolling statistics, in comparing the current period with the past one.
Twelve consecutive 5-minute buckets ending at “now”. The slider controls how far past the boundary “now” sits inside the current bucket. With the toggle off, the unfinished bucket is shown as-is and looks like a dip in activity — the artefact the section warns about. Turn it on and the series ends on the last completed boundary instead, regardless of where the cursor is.
UTC normalization
Working with time zones is a source of countless bugs in time series. Events from different systems can come in different time zones, and without normalization this creates chaos in aggregates.
Imagine the situation: server A in Moscow (UTC+3) writes an event at 00:30 local time, server B in London (UTC+0) at 21:30 local time. Locally these are "different days", and if you aggregate by daily buckets in local time the events end up on different calendar dates, even though technically they happened in the same hour of universal time. Multiply this by hundreds of services and a long-running pipeline, and the resulting daily totals can drift by entire percentage points relative to reality.
The safe internal convention is to normalize every event timestamp to UTC as early as possible — ideally at the ingestion boundary, before the value reaches anything that aggregates or compares — and to keep that convention explicit throughout the rest of the pipeline. In modern pandas this usually means working with timezone-aware datetime64[ns, UTC]:
from datetime import timezone
dt = dt.astimezone(timezone.utc)
Some systems store UTC timestamps as naive datetime values for compatibility with libraries that handle timezone-aware values poorly — typically by following an astimezone(timezone.utc).replace(tzinfo=None) convention. That works in practice, but it is a step that quietly throws information away: a downstream reader has no machine-readable guarantee that the stored value is UTC, and may very well interpret it as local time. If you go that route, the convention has to live somewhere a reader cannot miss — in the column name (event_time_utc, not just event_time), in the schema documentation, or both.
One more subtlety: UTC is an excellent storage and ordering convention, but it is not always the right bucketing convention. The choice depends on what the buckets mean to whoever reads them. For low-level technical metrics — request rates, error rates, queue lengths — hourly UTC buckets are usually fine; nobody reads them with a calendar in mind, and consistency across services matters more than alignment with any single local clock. For daily business metrics — sales, active users, local-operations dashboards — the natural notion of "day" lives in the user's or business's local timezone, and computing daily buckets directly in UTC silently shifts every day boundary by a few hours and mis-attributes the activity that happens around local midnight. The fix is to convert from UTC into the relevant local timezone first, do the daily aggregation there, and then store and display the result with the timezone made explicit.
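In pandas this amounts to a tz_convert before the daily grouping; Europe/Moscow below is just an example of a business timezone, and the events are made up:

import pandas as pd

events = pd.DataFrame({
    "ts": pd.to_datetime(["2026-05-01 22:30", "2026-05-01 23:45",
                          "2026-05-02 01:10"], utc=True),
    "amount": [10, 20, 30],
})

local = events["ts"].dt.tz_convert("Europe/Moscow")      # UTC+3
daily = events["amount"].groupby(local.dt.date).sum()
print(daily)   # all three events land on 2026-05-02 in local time,
               # even though two of them are still 2026-05-01 in UTC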
A week of UTC events with activity bumps near UTC midnight every day. Pick a timezone and compare the daily totals when buckets are defined by UTC boundaries vs by the selected zone’s local boundaries. Totals stay equal — the same events are being counted — but per-day attribution drifts, and the further the offset from UTC the larger the redistribution at the day boundary. This is the silent error behind “daily” business metrics computed naively in UTC.
Multidimensional series on one time grid
When you need to display not one indicator but several related ones (total volume, breakdown by first-level category, further breakdown by second-level category), the task arises of building several consistent series on one time grid.
Approach: generate all buckets for the same set of timestamps for all levels. Sum per bucket over all labels gives the total volume, sum per bucket over first-level categories — the level-1 aggregate, and so on.
The key property is that all series use the same time grid. This gives two important advantages:
- The series can be overlaid on each other on a graph without additional recomputation.
- It is guaranteed that the sum of the lower-level breakdown is exactly equal to the value at the higher level. This is an invariant easily violated if you build each series separately with a different grid (for example, one series happens to have an empty bucket and skips it, while another does not).
A typical mistake in this area is to use groupby over time separately for each level and end up with series with different sets of buckets. The right path is to first fix the full set of buckets, then left-join the data of each level to it, and for missing buckets substitute zeros via the zero-filling already described.
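A compact pandas sketch of that recipe, with made-up events and a single level of breakdown:

import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2026-05-03 14:01", "2026-05-03 14:02",
                          "2026-05-03 14:11", "2026-05-03 14:12"], utc=True),
    "category": ["api", "api", "web", "api"],
})

# 1. Fix the full set of buckets first.
grid = pd.date_range("2026-05-03 14:00", "2026-05-03 14:20", freq="5min", tz="UTC")

# 2. Build every series on that same grid, zero-filling empty buckets.
bucket = df["ts"].dt.floor("5min")
total = df.groupby(bucket).size().reindex(grid, fill_value=0)
by_cat = (df.groupby([bucket, "category"]).size()
            .unstack(fill_value=0)
            .reindex(grid, fill_value=0))

# 3. The invariant holds by construction: the breakdown sums to the total.
assert (by_cat.sum(axis=1) == total).all()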