lenatriestounderstand

Chapter 9 of 10

Practical Training Recipes for Deep Learning Time Series

Created Apr 28, 2026 Updated May 6, 2026

This note collects the cross-cutting practical recipes for training deep learning time-series models — the things that show up regardless of whether the underlying network is an LSTM, a TCN, or one of the many custom hybrid assemblies people put together from these blocks (such as the one walked through separately). Initialization, regularization, optimizer choice, sliding windows, loss functions and the compile + fit workflow all live here, so that the architecture-specific notes can stay focused on architecture. The companion to this one is the interpretability and production-maintenance note, which covers what happens after training (SHAP, recalibration, drift monitoring).


Weight Initialization — Glorot Uniform (also known as Xavier)

Weight initialization is a critical aspect of training deep networks, often underestimated by beginners. Proper initialization can make the difference between a model that converges in an hour and one that doesn't train at all. LSTM layers usually combine two initialization choices: Glorot/Xavier for the input-to-hidden weights and orthogonal initialization for the recurrent hidden-to-hidden weights. Both are covered below.

The problem with naive initialization

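If the scale of the random weights is wrong, activations in deep networks either explode or decay, mirroring the exploding/vanishing gradient problem. Sampling from N(0, 1) is the classic too-large case: each pre-activation sums fan_in such terms, so its standard deviation grows by roughly √fan_in per layer.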

During the forward pass, each layer multiplies weights by the activations of the previous layer. If the weights are large, activations grow exponentially layer by layer, saturating the nonlinearities at extreme values (tanh → ±1, sigmoid → 0 or 1). If the weights are small, activations decay to zero, and the network stops distinguishing inputs. In both cases the gradients carry almost no signal and training fails.

This is the same problem as vanishing/exploding gradients, but arising already on the forward pass before training begins.

Glorot Uniform (Xavier)

Formula:

W ~ U(-√(6 / (fan_in + fan_out)), +√(6 / (fan_in + fan_out)))

Keeps the variance of activations constant across layers. Suitable for tanh, sigmoid.

Glorot Uniform, also known as Xavier initialization, was proposed by Xavier Glorot and Yoshua Bengio in 2010. The idea: choose the variance of weights so that the variance of activations remains constant across layers.

Mathematically this gives a formula with uniform distribution in the range [-√(6 / (fan_in + fan_out)), √(6 / (fan_in + fan_out))], where fan_in and fan_out are the number of input and output units, respectively. This specific range ensures invariant variance for networks with tanh or sigmoid nonlinearities.

Glorot/Xavier is the default choice for LSTM and other sigmoid/tanh-based architectures.
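The effect can be checked numerically. The sketch below is a NumPy illustration, not the Keras initializer itself: it pushes random inputs through ten tanh layers and compares naive N(0, 1) weights with Glorot uniform. The helper names, layer width, depth, batch size, and the 0.99 saturation threshold are all assumptions chosen for the demo.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix from U(-limit, +limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def naive_normal(fan_in, fan_out, rng):
    """Unscaled N(0, 1) weights, the 'naive' baseline."""
    return rng.standard_normal((fan_in, fan_out))

def last_layer_stats(init, depth=10, width=256, batch=512, seed=0):
    """Forward random inputs through `depth` tanh layers (no training).

    Returns (std of final activations, fraction with |activation| > 0.99).
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        h = np.tanh(h @ init(width, width, rng))
    return h.std(), np.mean(np.abs(h) > 0.99)

naive_std, naive_saturated = last_layer_stats(naive_normal)
glorot_std, glorot_saturated = last_layer_stats(glorot_uniform)
print(f"naive N(0,1): std={naive_std:.2f}, saturated={naive_saturated:.0%}")
print(f"Glorot:       std={glorot_std:.2f}, saturated={glorot_saturated:.0%}")
```

With these settings the naive run should show most outputs pinned near ±1, while the Glorot run keeps activations inside the responsive range of tanh.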

He initialization — analog for ReLU

W ~ N(0, 2 / fan_in)

For networks with ReLU nonlinearity, Glorot is not optimal — ReLU zeroes out negative values, which asymmetrically affects variance. He initialization by Kaiming He et al. (2015) adapts the formula for ReLU: a normal distribution with zero mean and variance 2 / fan_in.

He's variance is twice Glorot's when fan_in = fan_out (Glorot's variance, 2 / (fan_in + fan_out), then reduces to 1 / fan_in), which compensates for ReLU zeroing out roughly half of the values. For many ReLU-based feed-forward and convolutional networks, He initialization is a common default. (Transformer architectures use a mix: Xavier-style initialization with extra layer-wise scaling and careful interaction with LayerNorm and residual connections, so "use He for transformers" is a less safe blanket statement.) LSTM uses Glorot because LSTM gates are sigmoid/tanh.
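A small NumPy sketch of the same idea for ReLU, under assumed sizes (width 256, depth 10, batch 512). It checks that He-scaled weights keep the mean squared activation at roughly the same order of magnitude through a ReLU stack, and that He's variance is twice Glorot's for a square layer. The helper names are hypothetical.

```python
import numpy as np

def glorot_variance(fan_in, fan_out):
    """Weight variance implied by Glorot: 2 / (fan_in + fan_out)."""
    return 2.0 / (fan_in + fan_out)

def he_variance(fan_in):
    """Weight variance used by He initialization: 2 / fan_in."""
    return 2.0 / fan_in

def he_normal(fan_in, fan_out, rng):
    """Sample weights from a zero-mean normal with variance 2 / fan_in."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(he_variance(fan_in))

rng = np.random.default_rng(0)
width, depth = 256, 10
h = rng.standard_normal((512, width))
input_second_moment = np.mean(h ** 2)

for _ in range(depth):
    h = np.maximum(0.0, h @ he_normal(width, width, rng))  # one ReLU layer

output_second_moment = np.mean(h ** 2)
print("second moment in :", round(float(input_second_moment), 3))
print("second moment out:", round(float(output_second_moment), 3))
print("He / Glorot variance (square layer):",
      he_variance(width) / glorot_variance(width, width))
```

The preservation holds only in expectation, so a single random draw drifts somewhat from layer to layer; the point is that it neither explodes nor collapses by orders of magnitude.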

A real forward pass through a stack of dense layers, plotted layer by layer (no training, just the init). The y-axis is the standard deviation of activations on a log scale; the green band is the “healthy” zone where signals neither saturate nor collapse. With tanh at depth 10, naive N(0,1) often drives pre-activations into saturation — the post-activation std stays bounded near 1 (tanh is in [−1, 1] so it physically cannot grow further), but most outputs are pinned at ±1 where tanh' is almost zero, so gradients have little signal to use. The green Glorot line stays in the healthy band across depths — exactly what the initialization is designed to approximate under its assumptions: keeping activation and gradient variance from immediately exploding or collapsing. Switch the nonlinearity to ReLU — Glorot is no longer the matched scheme because ReLU zeroes out half the input, and the blue He curve takes over. The bottom histograms show where the activations actually land at the last layer: spikes at ±1 (saturation), a single column at 0 (collapse), or a smooth bell that the next training step can actually learn from.


Orthogonal Initialization — for recurrent weights

Recurrent weights (W_h, which are applied to the hidden state at each step) require even more careful initialization than ordinary weights. There's a special method for them, critically important for stable training of recurrent networks.

Why orthogonal

W_h is multiplied by the hidden state at every step. If repeated multiplication by the recurrent matrix amplifies or shrinks vector norms, signals and gradients can explode or vanish through time. Orthogonal matrices preserve vector norms because their singular values are all equal to 1 — this is the property that lets gradient norms ride through many time steps without exponential blow-up or decay.

The key to understanding lies in the fact that W_h is applied sequentially T times during the forward pass of one sequence of length T. During the backward pass, the gradient over time is multiplied by the corresponding Jacobians at each step.

More generally, what matters for gradient norms is how repeated multiplication changes vector lengths. Singular values describe this amplification or shrinkage directly, which is why orthogonal initialization is a natural starting point: by construction it has all singular values equal to 1, so at step 0 the iterated products W_h^t preserve gradient norms and training starts stably.

Practical implementation

Specifically: QR decomposition of a random matrix — Q turns out orthogonal. In Keras this is the Orthogonal(seed=42) initializer.

Building an orthogonal matrix numerically is a standard practice via QR decomposition: take a random matrix, decompose into Q × R, where Q is orthogonal, R is upper triangular. Keras does this internally via the Orthogonal initializer. The seed fixes randomness for reproducibility of training runs.

This initializer applies specifically to the recurrent_initializer of the LSTM layer, which is crucially important for training stability.
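The QR construction can be reproduced in a few lines of NumPy. This is a sketch of the idea, not the Keras source: the sign correction on Q's columns is a common convention, and the matrix size and number of steps are arbitrary assumptions.

```python
import numpy as np

def orthogonal(n, rng):
    """Build an n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs using R's diagonal so the result does not depend
    # on the sign convention of the QR routine.
    q = q * np.sign(np.diag(r))
    return q

rng = np.random.default_rng(42)
n, steps = 64, 100
W_orth = orthogonal(n, rng)
W_gauss = rng.standard_normal((n, n)) / np.sqrt(n)  # variance-scaled Gaussian, for contrast

v0 = rng.standard_normal(n)
v_orth, v_gauss = v0.copy(), v0.copy()
for _ in range(steps):
    v_orth = W_orth @ v_orth     # norm preserved at every step
    v_gauss = W_gauss @ v_gauss  # no such guarantee

print("initial norm          :", np.linalg.norm(v0))
print("orthogonal, 100 steps :", np.linalg.norm(v_orth))
print("scaled Gaussian       :", np.linalg.norm(v_gauss))
```

In Keras the equivalent choice is passing an Orthogonal initializer as the recurrent_initializer of the LSTM layer, as described above.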

Why this matters for forecasting. In time-series models, unstable training often shows up not as a NaN but as a model that quietly predicts a flat mean across the whole horizon, or that explodes on peaks, or that takes many epochs to start learning anything beyond the trivial baseline. Good initialization is one of the cheapest defenses: it does not guarantee a good model, but it dramatically lowers the chance that the model fails before it has had a chance to learn the temporal structure at all.

The y-axis is the norm of the iterated vector on a log scale; the dashed line at 1 marks “norm preserved”. The green orthogonal trajectories sit dead flat on that line for every t — that's the guarantee QR construction buys you, and it's why the article calls orthogonal init “the natural starting point” for recurrent weights. The orange naive Gaussian trajectories grow exponentially (the spectral radius of an unscaled n×n Gaussian is around √n, well above 1). A scaled Gaussian matrix can preserve variance in expectation for a single step, but repeated multiplication by the same sampled matrix still has no pointwise norm-preservation guarantee, so the blue trajectories drift up or down at random. In a vanilla RNN this is essentially the vanishing/exploding-gradient story plotted directly. In an LSTM the picture is softer — long-term gradients mostly travel along the cell-state highway, modulated by the forget gate, rather than through repeated multiplication by W_h — but the recurrent weights still appear inside the gate computations, so a stable starting point for them still matters.


Embedding Layers for categorical features

Embedding layers are a standard way of representing categorical features in neural networks. They were popularized in NLP as word vectors (word2vec) and have since been applied very broadly: for user IDs in recommender systems, for product categories in forecasting, for entity types in knowledge graphs.

For time series, they solve the important problem of representing calendar variables.

The problem with one-hot encoding

One-hot encoding of Day_Of_Week (0–6):

  • 7 binary features.
  • Sparse representation.
  • Doesn't account for similarity — Monday and Tuesday are "equally close" to Thursday.

The classical way to turn a categorical variable into numeric for neural networks is one-hot encoding: Day_Of_Week 0–6 becomes 7 binary features (Mon: 1,0,0,0,0,0,0; Tue: 0,1,0,0,0,0,0; ...). It works, but has two serious drawbacks:

  • Sparse representation — 6 out of 7 values are always zero, a waste of memory and compute.
  • Doesn't account for similarity: in one-hot space, Monday and Tuesday are exactly as far apart as Monday and Sunday. In many series Mon–Tue behave similarly and Mon–Sat behave differently, but the encoding carries no such prior, so the network has to learn whatever similarity the data actually has (and in some businesses that structure is quite different: Sunday may behave like a weekday for some operations, or like a weekend for others). Under one-hot encoding this typically requires more data and training time than under a learned embedding.

Embedding layer solves this

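from tensorflow.keras import layers

# Integer category IDs 0..6 (Monday = 0); the input shape here is an assumption
day_of_week_int = layers.Input(shape=(1,), dtype="int32")
day_emb = layers.Embedding(
    input_dim=7,       # number of categories (days of week)
    output_dim=4,      # dimensionality of learned embedding
)(day_of_week_int)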

The embedding layer solves the problem elegantly. Instead of a sparse one-hot, it uses a trainable lookup table: for each category (each value of input) there's a learnable dense vector of dimension output_dim.

When the input arrives as a category ID (e.g., 3 for Thursday), the layer simply extracts the corresponding vector from the table. The vector is fully trained together with the rest of the model via backpropagation — the network learns which values in the embedding space make predictions better.

The input here is an integer category ID, the output is a dense vector of fixed dimension.
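Stripped of the training machinery, the lookup is plain row indexing. The NumPy sketch below uses a random table as a stand-in for the trainable weights (the 0.05 init scale and the Monday = 0 convention are assumptions) and shows that indexing gives the same result as multiplying a one-hot matrix by the table.

```python
import numpy as np

rng = np.random.default_rng(0)
num_categories, emb_dim = 7, 4

# Stand-in for the trainable lookup table: one row per category.
table = rng.normal(scale=0.05, size=(num_categories, emb_dim))

day_ids = np.array([0, 3, 3, 6])   # Mon, Thu, Thu, Sun (Monday = 0)
looked_up = table[day_ids]         # embedding lookup = row indexing

# Equivalent but wasteful: build the sparse one-hot matrix and multiply.
one_hot = np.eye(num_categories)[day_ids]
via_matmul = one_hot @ table

print(looked_up.shape)                     # one dense vector per input ID
print(np.allclose(looked_up, via_matmul))
```

This is why an embedding is often described as a one-hot encoding followed by a linear layer, computed without ever materialising the one-hot vectors.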

Emergent semantic structure

The main thing is that the network spontaneously learns semantic structure. If the behavior of Monday and Tuesday is similar (both are workdays with growing traffic), their embeddings will converge to similar values. If Saturday and Sunday behave like a weekend, their embeddings will also be close.

This is an emergent property of training — no one explicitly tells the model "Mon and Tue similar", it learns it from the data. As a result, each category is characterized by a learned vector in a 4-dimensional space, where distance reflects behavioral similarity.

The seven dots are the actual learnable embedding vectors for Mon..Sun, plotted directly in 2D — because emb_dim = 2, the lookup table is the picture, no PCA / t-SNE projection needed. They start near the origin from small random init. Press Train with the weekend split pattern: Mon..Fri tend to drift toward each other and Sat/Sun toward each other, because the model needs to encode “workday vs weekend” to predict the target. Switch to progressive — the model often learns a direction in 2D space that separates earlier days from later ones (the absolute orientation can rotate or flip between runs; what's stable is that some direction in embedding space encodes the progression). Switch to random noise (no day signal) — the target is independent of day_of_week, so there's nothing for the embeddings to encode and they wander without finding meaningful structure. The pedagogical message: an embedding's geometry is not magic semantics from thin air; it is whatever structure the loss function rewards.

Choice of embedding dimensionality

Rule-of-thumb: output_dim ≈ min(50, num_categories // 2).

A popular rule of thumb from Jeremy Howard and the fast.ai community: half the number of categories, but no more than 50. For Day_Of_Year (366 categories), half would be 183, so the cap applies and the formula gives 50, which is still overkill here; a reasonable compromise is 16. For Day_Of_Week (7), the formula gives 3 (4 if you round up). For Month (12), it gives 6.

A typical choice of embedding dimensions for calendar variables:

  • Day_Of_Week → 4 dims. Just 7 values, 4 dim is enough to express 7 patterns.
  • Month → 6 dims. 12 values, slightly more expressive power.
  • Day_Of_Year → 16 dims. 366 values, potentially rich patterns (seasons, recurring annual patterns, event clusters), higher dimensionality allows capturing all of this. Holidays are a separate matter: many of them move between calendar dates from year to year (Easter, lunar-calendar holidays, Thanksgiving), so an embedding indexed by day-of-year cannot reliably represent them. Movable holidays are usually better encoded as explicit known-future flags, fed into a separate covariate path rather than left for the day-of-year embedding to discover.

The rule of thumb is a starting point, not a magic formula. In practice, it's worth trying several values and choosing the best by validation performance.
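Written out as code, the rule of thumb is a one-liner. The function name and the floor of 1 (so that tiny vocabularies still get a dimension) are assumptions added for the sketch; the cap of 50 and the halving come from the rule above.

```python
def embedding_dim(num_categories, cap=50):
    """Rule-of-thumb embedding size: half the categories, at least 1, at most `cap`."""
    return min(cap, max(1, num_categories // 2))

for name, n in [("Day_Of_Week", 7), ("Month", 12), ("Day_Of_Year", 366)]:
    print(f"{name}: {n} categories -> {embedding_dim(n)} dims")
```

For Day_Of_Year the cap is what limits the result, which is one reason the hand-picked 16 in the list above is a deliberate step down from the formula.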


Known-future covariates vs observed covariates

Calendar embeddings are usually safe inputs because calendar values — day-of-week, month, day-of-year — are known for any future timestamp. But not every feature can be used the same way, and treating them as if they could is one of the most common quiet sources of leakage in time-series training. Before any feature is fed into a forecasting model, it should be classified into one of three categories:

  • Known in advance (known-future). Calendar variables, scheduled holidays and promotions on a published plan, school terms, contracted prices, sensor identifiers — anything whose value at a future timestamp is genuinely available at the moment the forecast is produced. These are safe to feed into both the historical input window and into any future-decoder branch (the event branch in a hybrid architecture, the known-future inputs in TFT, etc.).
  • Observed only up to the forecast origin. Past target values, past weather observations, past sales of related items, anything sensor-measured. These are fine in the historical input window but cannot be supplied for the future horizon — at prediction time we simply don't have them yet.
  • Target-derived. Lag features, rolling means, rolling standard deviations, lookback aggregates of the target. These are safe as long as the rolling window only ever looks backward from the timestamp at which the feature is computed. A rolling statistic that quietly includes future values is a textbook leakage source, and it usually does not crash anything — the model just learns to use information it will not have at prediction time, and the validation score becomes dishonestly good.

When building the sliding-window dataset, this classification has to be decided up front. The same column name can be a known-future feature in one project (a planned promo flag scheduled by marketing) and an observed-only feature in another (an actual promotion uptake signal that is only measured after the fact). The discipline is to label each feature explicitly and to feed it only into the parts of the architecture that are entitled to see it.

Four rows split by the vertical now line at the forecast origin. The target is observed on the left, unknown on the right (that's the forecast). Past covariates exist only on the left. Future-known covariates — calendar, holidays, scheduled promotions, planned prices — exist on both sides, which is why they're safe to feed into the future-decoder branches. Future-unknown covariates (realised future weather, dynamically-set future prices, exogenous shocks) cannot be filled with their realised future values at prediction time — that would leak future information. Note the subtle case: a weather forecast available at time t is itself future-known and is fine to use; what's forbidden is the realised weather that only becomes observable later, or revised forecasts that arrive after the forecast origin.

For the target-derived category specifically — lags and rolling statistics — the leakage trap is a one-window-too-far error. The widget below makes that visible.

The same series with a configurable lag and a rolling-mean window. Drag the cursor through time and watch how each feature value at timestamp t is computed only from values at timestamps ≤ t — never from the future. A rolling window that quietly extends past t is the textbook leakage source: the model learns to use information it will not have at prediction time, and the validation score becomes dishonestly good.
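One cheap way to test a target-derived feature for leakage is to corrupt the future and check that past feature values do not move. The sketch below does that in NumPy for a lag and a trailing mean; the function names, the lag of 7, the window of 5, and the cut-off t = 12 are illustrative assumptions.

```python
import numpy as np

def lag_feature(series, lag):
    """Value observed `lag` steps before t; NaN where history is too short."""
    if lag < 1:
        raise ValueError("lag must be >= 1")
    out = np.full(len(series), np.nan)
    out[lag:] = series[:-lag]
    return out

def trailing_mean(series, window):
    """Mean of the `window` values ending at t (inclusive); never looks past t."""
    out = np.full(len(series), np.nan)
    for t in range(window - 1, len(series)):
        out[t] = series[t - window + 1 : t + 1].mean()
    return out

def build_features(series):
    return np.column_stack([lag_feature(series, 7), trailing_mean(series, 5)])

series = np.arange(20, dtype=float)
features_before = build_features(series)

# Leakage check: overwrite everything after t and recompute the features.
t = 12
corrupted = series.copy()
corrupted[t + 1 :] = 1e6
features_after = build_features(corrupted)

no_leak = np.allclose(features_before[: t + 1], features_after[: t + 1], equal_nan=True)
print("features up to t unaffected by future values:", no_leak)
```

A centered rolling window, or a rolling statistic computed before the train/validation split is applied, would fail this check.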


Dropout

Dropout is one of the most important regularization techniques in modern neural networks. It is simple, effective, and used in a large share of deep learning models. Introduced by Hinton and collaborators and popularized by the Srivastava et al. dropout paper, it made it much easier to train deep networks without overfitting.

How Dropout works

Regularization: during training, each activation unit is zeroed out with probability p. Prevents co-adaptation of features and overfitting.

The mechanics of Dropout are very simple:

  • During training, on each forward pass, for each activation unit a decision is made: with probability p (usually 0.2–0.5) its output is zeroed out, with probability 1-p it remains as is.
  • The remaining nonzero activations are scaled up by 1/(1-p) to preserve expected magnitude.
  • During inference, dropout is disabled — all activation units are used.

This gives two important properties:

  • Prevents co-adaptation: neurons can't rely on specific other neurons (which may have been dropped), so they have to learn more robust features.
  • Approximates ensembling — each training mini-batch effectively trains a different subnetwork; the final model is an ensemble of exponentially many sub-models, averaging their predictions.
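The three mechanical steps (random zeroing, rescaling by 1/(1-p), no-op at inference) fit in a few lines. This is a NumPy sketch of inverted dropout with a hypothetical function name, not the Keras layer.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each unit with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x                              # inference: dropout disabled
    keep_mask = rng.random(x.shape) >= p      # each unit kept with probability 1 - p
    return x * keep_mask / (1.0 - p)          # rescale to preserve the expected value

rng = np.random.default_rng(0)
x = np.ones((1000, 100))

train_out = dropout(x, p=0.2, training=True, rng=rng)
infer_out = dropout(x, p=0.2, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(train_out == 0.0))
print("mean activation in training:", train_out.mean())
print("mean activation at inference:", infer_out.mean())
```

Because of the rescaling, the mean activation is approximately the same in both modes, so the weights learned with dropout can be used unchanged at prediction time.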

Two dropouts in LSTM

Two dropouts in LSTM:

  • dropout=0.2 — on input-to-hidden weights (W_x).
  • recurrent_dropout=0.2 — on hidden-to-hidden (W_h). Rarely used, slow.

LSTM has two types of dropout, each with different effect:

  • dropout is applied to input-to-hidden connections (W_x), that is, to new inputs at each timestep. This is safe and fast, applicable everywhere.
  • recurrent_dropout is applied to hidden-to-hidden connections (W_h), that is, to the recurrent state passed between timesteps. It regularizes the recurrent path directly, but in practice it is rarely used. First, it is slower: in TensorFlow/Keras a non-zero recurrent_dropout makes the layer ineligible for the fused cuDNN kernel, so GPU training falls back to the generic implementation. Second, it often doesn't give noticeable improvements over regular dropout. Most practitioners use only the dropout parameter.

Usage example

from tensorflow.keras import layers

x = layers.LSTM(lstm_units, return_sequences=True, dropout=dropout_rate)(x)  # dropout on the layer inputs
x = layers.Dropout(dropout_rate)(x)   # standard dropout after the LSTM output

A typical pattern: dropout applied in several places in the network, providing multi-level regularization. First — inside the LSTM layer via the dropout parameter. Second — a standalone Dropout layer after the LSTM output.

A typical value dropout_rate = 0.1–0.3 — slightly lower than the default 0.5 from classic papers, because there are usually other regularization mechanisms in the architecture (residual connections, layer normalization, early stopping).

A small fully-connected network. Press Step in Training mode and roughly p of the 12 hidden neurons go dark on each forward pass — a different random subset every time. Each of those is a different sub-network the model is briefly training on, which is why dropout is often described as “an ensemble of exponentially many sub-models averaging their predictions”. The thumbnail strip below the diagram shows the most recent few; the counter tracks how many distinct masks you've already seen. Switch the mode pill to Inference — every neuron lights up, no dropping: that's the “dropout disabled at prediction time” behaviour the article warns is easy to get wrong. The line under the diagram in training mode also shows the rescaling factor 1/(1−p) applied to active outputs — it's what makes the same weights produce calibrated activations in both modes.


Early Stopping

Early Stopping — a classical regularization technique, working at the level of the training loop, not individual layers. It's so simple in idea and powerful in effect that it's used in practically every deep learning project.

Idea

Stop training when validation loss doesn't improve for N epochs. Prevents overfitting.

The typical training dynamics is this:

  • Training loss keeps decreasing (apart from mini-batch noise): the model fits training data better and better.
  • Validation loss first decreases, then reaches a minimum, and starts to grow — the model starts to overfit (fitting to training noise instead of generalizable patterns).

The optimal stopping point is at the moment of minimum validation loss. But it's not known in advance when this moment will come — it depends on the model, data, hyperparameters.

Early Stopping gives the automatic answer: monitor validation loss at each epoch, and when it doesn't improve for N epochs in a row (patience parameter), training stops. This is an elegant form of regularization — doesn't require additional components in the model, just manages training duration.

Usage example

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=5,                     # how many epochs to wait for improvement
    restore_best_weights=True,      # roll back to the best version
)

Keras provides a built-in callback EarlyStopping:

  • monitor="val_loss" — metric to track (usually validation loss, but can be validation MAE, accuracy, any other metric).
  • patience=5 — number of epochs in which the metric doesn't improve before training stops. A small patience (1–3) can stop training prematurely due to random fluctuations; a large one (10+) allows training to continue too long after the real minimum. 5–10 is the typical sweet spot.
  • restore_best_weights=True — a critically important parameter, more on it below.

Why restore_best_weights=True is mandatory

Without restore_best_weights=True, after stopping, the model remains with weights from the last epoch — which are already worse than they were at the moment of best validation loss (otherwise training wouldn't have stopped). That is, you lose all the benefit of early stopping.

With restore_best_weights=True, after stopping, Keras automatically rolls weights back to the checkpoint of best validation performance. This is a subtle detail that beginners often miss, and they get a surprising negative result when "early stopping didn't help".
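The patience logic and the cost of not restoring the best weights can be traced on a hand-made loss curve. The function below is a simplified stand-in for the callback (no min_delta, no mode, "improvement" means strictly lower loss), and the loss values are invented for illustration.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch validation losses."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch  # never triggered: ran all epochs

val_losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.64, 0.66, 0.70, 0.75]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)

print("stopped at epoch:", stop_epoch)
print("best epoch:", best_epoch)
print("val loss without restore:", val_losses[stop_epoch])
print("val loss with restore   :", val_losses[best_epoch])
```

The gap between the last two printed numbers is what restore_best_weights=True recovers.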

A small polynomial regressor is actually trained in your browser by gradient descent on a noisy synthetic dataset, and both losses are plotted epoch by epoch. Crank capacity up to 12 and watch the train loss keep falling while the val loss bottoms out and climbs — that's overfitting. The forest dot marks the best val loss the model ever achieved; the orange line marks where Early Stopping fires after waiting patience epochs without improvement. Toggle restore_best_weights off — the “val loss kept” stat jumps to a worse number, exactly the “early stopping didn't help” failure mode. Drop capacity to 4 and the model can't overfit, so Early Stopping never fires.


Cyclical Learning Rate (CLR)

Cyclical Learning Rate is a learning rate scheduling technique proposed by Leslie Smith (arXiv preprint 2015, published at WACV 2017). It challenges the conventional wisdom that the learning rate should only decrease over training: letting the LR oscillate can be more effective.

Idea

Instead of a monotonically decreasing LR — oscillate between base_lr and max_lr in a triangular form. Helps:

  • Escape local minima (large LR at the peak).
  • Converge accurately (small LR in the valley).
  • Reduce the need for a grid search over a single optimal LR.

The traditional approach is learning rate decay: start with a large LR, decrease it over time. This is standard, but has drawbacks. If you're stuck in a local minimum or saddle point, a low LR can't escape.

CLR proposes an alternative: periodically "speed up" to large LR to escape poor regions, then "slow down" to small ones for careful convergence. The shape is usually triangular (linear rise and fall) or cosine.

Advantages

  • Escape local minima thanks to periodic high LR phases.
  • Careful convergence thanks to low LR phases.
  • Less need for a grid search over a single optimal LR: if the base_lr/max_lr range is chosen sensibly, training sweeps through the useful values on every cycle.

Smith also proposed the LR range test — a way to quickly find a good max_lr via short training with exponentially growing LR.

Implementation example

A simplified callback sketch (omitting __init__, the iteration counter update self.iter += 1, and the base_lr / max_lr / step_size configuration for brevity):

class CyclicalLearningRate(tf.keras.callbacks.Callback):
    def on_train_batch_begin(self, batch, logs=None):
        cycle = np.floor(1 + self.iter / (2 * self.step_size))
        x = abs(self.iter / self.step_size - 2 * cycle + 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * max(0, (1 - x))
        self.model.optimizer.learning_rate.assign(lr)

The shape of the implementation is a Keras callback that updates the learning rate at every training batch. The formula is a triangle-wave generator. The parameter step_size is half the length of one full cycle (ramp up + ramp down). self.iter is a global iteration counter (accumulated across all epochs); a real implementation would increment it at the end of each batch.

The math in the formula translates iteration count into position in the current cycle and computes the appropriate LR value. The assignment self.model.optimizer.learning_rate.assign(lr) actually applies the new LR to the optimizer.

Triangular pattern

The formula gives a triangular pattern:

  • LR grows for step_size iterations from base_lr to max_lr.
  • Then falls back over the next step_size iterations.

Over the first step_size iterations, LR linearly increases — the model "speeds up". Over the next step_size, LR linearly decreases back to base_lr — the model "focuses". This triangle repeats throughout training.
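The triangle wave can be checked in isolation, without Keras. The function below reuses the same formula as the callback sketch; the function name and the values base_lr = 1e-4, max_lr = 1e-3, step_size = 100 are arbitrary assumptions.

```python
import math

def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate for a global iteration counter."""
    cycle = math.floor(1 + iteration / (2 * step_size))   # 1-based index of the current cycle
    x = abs(iteration / step_size - 2 * cycle + 1)        # 1 at cycle edges, 0 at the peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

schedule = [triangular_lr(i, base_lr=1e-4, max_lr=1e-3, step_size=100) for i in range(401)]

for i in (0, 50, 100, 200, 300, 400):
    print(i, schedule[i])
```

Iterations 0, 200 and 400 sit at base_lr, iterations 100 and 300 at max_lr, and iteration 50 is halfway up the first ramp.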

Typical choices:

  • step_size = 2–8 × iterations_per_epoch, so each half-cycle spans several epochs.
  • max_lr is typically 3–10x base_lr.

CLR can work well combined with early stopping — you can stop training at the moment of convergence, without thinking about an optimal LR schedule up front.

Learning rate plotted against iteration for four schedules at once. Drag step_size down to 80 and the triangular cycles bunch up — the model spends most of training swinging between extremes. Push it to 400 and you get a few long ramps. Switch on exponential decay for the contrast: decay only ever shrinks, while CLR keeps revisiting the high-LR regime, which can help the optimizer move out of shallow basins, saddle regions, or overly conservative low-LR updates. The vertical dotted lines mark the end of each full cycle (2 × step_size iterations). The cosine variant is the same cycle structure but smoothly rounded.


Adam Optimizer

Adam is the optimizer that has effectively become the default choice for training deep networks: it is the first optimizer most deep learning tutorials reach for. It was proposed by Diederik Kingma and Jimmy Ba in 2014 and has been the dominant choice in practice for about a decade.

Adaptive Moment Estimation

Kingma & Ba, 2014 — combines:

  • Momentum — exponential moving average of gradients.
  • RMSProp — exponential moving average of squared gradients.

The name Adam is short for Adaptive Moment Estimation. It combines two previously popular optimizers:

  • Momentum — accumulates exponential moving average of gradients, smoothing the trajectory and accelerating progress in consistent directions.
  • RMSProp — accumulates exponential moving average of squared gradients, adaptively adjusting learning rate per-parameter depending on the magnitude of gradients.

Each of them solved some problems of classical SGD, but had their own drawbacks. Adam took the best of both: momentum term for acceleration, RMSProp's adaptive LR for per-parameter scaling. The result is an optimizer that is robust to hyperparameter choices and converges fast.

Formula

m_t = β₁ × m_{t-1} + (1 - β₁) × g_t           # 1st moment (mean)
v_t = β₂ × v_{t-1} + (1 - β₂) × g_t²          # 2nd moment (uncentered variance)
m̂_t = m_t / (1 - β₁^t)                        # bias correction
v̂_t = v_t / (1 - β₂^t)
θ_t = θ_{t-1} - lr × m̂_t / (√v̂_t + ε)

The formulas are slightly intimidating, but the logic is transparent:

  • m_t: first moment, an exponential moving average of gradients. This is momentum, the "inertia" of direction.
  • v_t: second moment, an EMA of squared gradients. Strictly it is the uncentered second moment (it equals the variance only when the mean gradient is zero), so it tracks the typical magnitude of recent gradients for this parameter.
  • Bias correction (m̂_t, v̂_t): a technical adjustment. Because m_0 and v_0 are initialized to zero, the first few iterations systematically underestimate the true averages; dividing by (1 - β^t) compensates.
  • Update rule: the parameter is updated by -lr × m̂_t / (√v̂_t + ε). The numerator is the bias-corrected momentum (where to move). The denominator is the root of the second-moment estimate (how large recent gradients have been).

Parameters whose gradients are consistent in sign get close to a full step of size lr, because |m̂_t| ≈ √v̂_t; parameters whose gradients are noisy and keep flipping sign have |m̂_t| well below √v̂_t and take a smaller step. ε is a small number for numerical stability (it avoids division by zero).
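The update rule translates almost line for line into NumPy. The sketch below is an illustration on a toy quadratic, not the Keras optimizer: the function name, the learning rate 0.01, and the two gradient scales are assumptions picked to show the per-parameter normalisation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # 1st moment
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy loss f(theta) = sum(scale * theta**2): gradients differ by a factor of 10,000.
scale = np.array([100.0, 0.01])
theta = np.array([1.0, 1.0])
initial_loss = float(np.sum(scale * theta ** 2))
m, v = np.zeros_like(theta), np.zeros_like(theta)

# The very first step is about lr for both coordinates, whatever the gradient scale.
theta_1, _, _ = adam_step(theta, 2 * scale * theta, m, v, t=1, lr=0.01)
first_step = theta - theta_1
print("first step per coordinate:", first_step)

for t in range(1, 2001):
    grad = 2 * scale * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
final_loss = float(np.sum(scale * theta ** 2))
print("loss before:", initial_loss, "after:", final_loss)
```

Plain SGD with a single learning rate would either crawl along the flat coordinate or overshoot on the steep one; the division by √v̂_t is what evens out the two.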

Why Adam is the default

The reason why Adam became the default optimizer is that it's robust to hyperparameter choices.

  • SGD requires careful tuning of learning rate — wrong LR and the model doesn't learn.
  • Adam works reasonably with default hyperparameters (β₁=0.9, β₂=0.999, ε=1e-8 in the paper; Keras uses ε=1e-7 by default) on a very wide range of deep learning tasks.
  • Results are also less sensitive to the exact LR, thanks to per-parameter adaptive scaling.

For practitioners this means "when in doubt, use Adam with default parameters" — and usually it works.

Adam does have known weaknesses: it sometimes generalizes worse than SGD with momentum, most visibly in image classification, which motivated variants such as AdamW with decoupled weight decay. But for most projects Adam is a safe default.


Sliding Window / make_X_y

Sliding Window — a fundamental data preparation technique for time series neural networks. Formally simple, but crucial for properly training deep models on sequential data. Without it, time series cannot be fed into the standard supervised learning framework.

How it works

Data preparation for time series NN:

import numpy as np

# From a 1D series we make X (samples, window_size, features) and y (samples, horizon)
n_samples = len(series) - window_size - horizon + 1
X, y = np.empty((n_samples, window_size, 1)), np.empty((n_samples, horizon))
for i in range(n_samples):
    X[i, :, 0] = series[i : i+window_size]                # input window
    y[i] = series[i+window_size : i+window_size+horizon]  # target horizon

The task is to convert a time series (one long sequence) into a training dataset formatted as (X, y) pairs expected by neural networks.

The sliding window does this as follows:

  • Take a window of fixed size (e.g., 90 days) as input (X).
  • The next horizon days (e.g., 30) as target (y).
  • The window moves forward by one step, forming the next training example.
  • Continues until the end of the series.

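As a result, one time series of, say, 3 years (1095 days) gives 1095 − 90 − 30 + 1 = 976 training examples with the window and horizon above, each overlapping heavily with its neighbors. This looks like data augmentation "for free", but because of the overlap these samples carry far less independent information than 976 separate series would.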
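A vectorised version of the same windowing, as a sketch: it assumes a univariate series and NumPy 1.20 or newer (for sliding_window_view), and the function name make_X_y simply follows this section's title.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_X_y(series, window_size, horizon):
    """Turn a 1D series into X (samples, window_size, 1) and y (samples, horizon)."""
    series = np.asarray(series, dtype=float)
    # Each row holds one input window immediately followed by its target horizon.
    windows = sliding_window_view(series, window_size + horizon)
    X = windows[:, :window_size, np.newaxis].copy()
    y = windows[:, window_size:].copy()
    return X, y

series = np.arange(1095, dtype=float)   # three years of synthetic daily values
X, y = make_X_y(series, window_size=90, horizon=30)

print(X.shape, y.shape)
print("first target day:", y[0, 0], "last target day:", y[-1, -1])
```

Using the day index as the series value makes it easy to see which days land in each window: the first target starts right after the first 90-day input, and the last target ends on the final day of the series.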

Interactive figure: a blue input window of size W and an orange target horizon of size H slide across a 180-day synthetic series, one position at a time, and each step adds a new (X, y) row to the dataset. A series of length N yields N − W − H + 1 training examples. Raising horizon makes the task harder and lowers the example count; lowering window_size gives less context per example but more examples. That trade-off is the windowing decision in one picture.

Window_size vs horizon

Two key hyperparameters:

  • window_size — how many days of history we look at (context).
  • horizon — how many days we forecast.

window_size: how many past days the model sees for each prediction. A larger window gives more context, but also more compute per example, fewer training examples from a fixed series, and, for architectures that flatten the window into a dense layer, more parameters.

horizon: how many days ahead we forecast. The longer the horizon, the harder the task and the lower the accuracy, but the more useful the forecast is for decision-making.

The choice depends on practical needs: for inventory planning a horizon in weeks/months is needed, for real-time dashboards — hours/days.

Typical choice

A popular choice for daily forecasting is window_size = 90, horizon = 30: look back 3 months, forecast 1 month.

Three months of history covers roughly 12 to 13 weekly cycles, about three monthly cycles, and the recent trend. One month ahead is a practical business horizon for many planning tasks. If the horizon grows, window_size usually needs to grow with it to maintain forecast accuracy.

Important caveat for validation

Training examples are not independent — overlapping windows mean that neighboring examples contain shared information. This must be taken into account during validation splits: you cannot use random shuffle as in i.i.d. tabular data; a time-based split is needed (training examples from the past, validation — from the future).

This is a basic rule of working with time series, but because of sliding window it's easy to forget: even if the dataset itself "looks like i.i.d." (X and y are ordinary numpy arrays), internally they are linked by temporal structure.
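One way to make the time-based split concrete is to split by a cut-off row in the original series rather than by sample index. The sketch below is a hypothetical helper (`time_based_split` and the `cut` parameter are names chosen here, not from the original workflow): training windows must have all their targets before the cut, validation windows must have all their targets at or after it, and windows whose targets straddle the cut are dropped.

```python
import numpy as np

def time_based_split(n_samples, window_size, horizon, cut):
    """Return (train_idx, val_idx) for windows where sample i covers rows i .. i+window_size+horizon-1."""
    starts = np.arange(n_samples)
    target_start = starts + window_size      # first target row of each sample
    target_end = target_start + horizon      # one past the last target row
    train_idx = starts[target_end <= cut]    # every target row lies before the cut
    val_idx = starts[target_start >= cut]    # every target row lies at or after the cut
    return train_idx, val_idx

n_rows, window_size, horizon = 200, 20, 5
n_samples = n_rows - window_size - horizon + 1
train_idx, val_idx = time_based_split(n_samples, window_size, horizon, cut=150)
dropped = n_samples - len(train_idx) - len(val_idx)
print(len(train_idx), len(val_idx), dropped)
```

Validation inputs may still reach back before the cut, which is fine: at forecast time that history is genuinely known. What the split guarantees is that no validation target row was ever used as a training target.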

Why this matters for forecasting. The windowing strategy quietly defines what the model is even allowed to learn — what window of past it gets to see, what horizon it has to commit to, and how strict the train/validation boundary is. A model with a perfectly chosen architecture can still produce a meaningless validation score if windows are constructed poorly: a too-short lookback hides relevant seasonality, a too-long horizon makes the task harder than it needs to be, and a sloppy split lets validation windows share rows with training windows. The architecture deserves the credit when forecasting works; the windowing deserves the blame when it does not.

Loss Functions for time series

The loss function is the heart of training for any ML model. It determines what the model learns to optimize, and directly influences the outcome. For time series forecasting there's a set of classical losses with different properties, and choosing the right one is critically important.

Main variants

MSE (Mean Squared Error): mean((y - ŷ)²) — penalizes large errors more heavily, the standard.

MSE — the de facto standard for regression. Squared error heavily penalizes large mistakes: an error of 10 contributes 100 to the loss, an error of 1 contributes just 1. This property makes MSE sensitive to outliers — the model will prioritize reducing extreme errors even if it hurts average performance. Mathematically differentiable and smooth, which makes MSE friendly to gradient-based optimization.

MAE (Mean Absolute Error): mean(|y - ŷ|) — robust to outliers.

MAE — an alternative to MSE with linear penalty. An error of 10 contributes 10, an error of 1 contributes 1 — proportional. This makes MAE robust to outliers: single extreme values don't dominate the loss. Good for business metrics, where "average error" in original units is more important than squared-error magnitude. Disadvantage: mathematically less smooth (non-differentiable at 0), although modern optimizers handle this fine.

MAPE (Mean Absolute Percentage Error): mean(|y - ŷ| / |y|) × 100% — scale-invariant, but blows up when y ≈ 0.

MAPE: percentage error, expressing the error as a percentage of the true value. Scale-invariant: a 10% miss gives the same MAPE whether sales are $1K or $1M. This is intuitive for stakeholders ("10% average error"). But MAPE has a serious issue: it explodes when y ≈ 0. If the true value is zero, the percentage error is a division by zero; if it is close to zero, the percentage inflates dramatically. For bookings-like data with occasional zeros (weekends, holidays), MAPE can be unusable.

sMAPE (Symmetric MAPE): mean(2|y - ŷ| / (|y| + |ŷ|)) × 100% — bounded in [0, 200%], but with asymmetric bias.

sMAPE: a modification of MAPE designed to treat over- and under-prediction more symmetrically. The denominator is the average of |y| and |ŷ|, which avoids division by zero unless y and ŷ are both zero, and the result is always bounded in [0%, 200%]. sMAPE is popular in forecasting literature and was used in competitions such as M4. (M5 used a different family of metrics, RMSSE / WRMSSE, so do not assume sMAPE is the universal "competition metric".) But sMAPE is not truly symmetric: for the same absolute error, an under-prediction scores worse than an over-prediction (with y = 100, ŷ = 90 gives about 10.5% while ŷ = 110 gives about 9.5%), which makes it less "fair" than it appears.

RMSE: √MSE — in target units, human-readable.

RMSE is just the square root of MSE. It has the same optimum as MSE (minimizing one minimizes the other), but the gradients are not identical: the square root rescales the MSE gradient by a factor that depends on the current root-mean-squared error, which can change optimization dynamics. In practice this is why models are usually trained with MSE and only evaluated with RMSE: training is cleaner, and reporting in original units (RMSE = $100) is more readable for stakeholders than an MSE of 10,000 in squared dollars.
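The five losses above can be written in a few lines of NumPy. This is a sketch for evaluation purposes under the formulas given in this section, not a drop-in Keras loss; the function names are chosen here for illustration.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mape(y, y_hat):
    # Undefined when any y is exactly zero; inflates when y is close to zero.
    return np.mean(np.abs(y - y_hat) / np.abs(y)) * 100

def smape(y, y_hat):
    # Undefined only when y and y_hat are both zero for some element.
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100

y = np.array([100.0])
over, under = np.array([110.0]), np.array([90.0])
print("MAPE  over/under:", mape(y, over), mape(y, under))    # same absolute error, same MAPE
print("sMAPE over/under:", smape(y, over), smape(y, under))  # under-prediction scores worse

# One large error dominates MSE far more than MAE.
y_true = np.zeros(4)
y_pred = np.array([1.0, 1.0, 1.0, 10.0])
print("MSE:", mse(y_true, y_pred), "MAE:", mae(y_true, y_pred))

# MAPE near zero: a small absolute miss becomes a huge percentage.
print("MAPE near zero:", mape(np.array([0.01]), np.array([1.0])))
```

The single error of 10 accounts for 100 of the 103 units of summed squared error but only 10 of the 13 units of summed absolute error, which is the outlier sensitivity described above.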

Interactive figure: each loss plotted against the prediction error r, so every curve has its minimum at r = 0 (zero error, zero loss) and only the shapes differ. MSE's parabola, MAE's V and Huber's kinked curve don't depend on y at all. MAPE does: as y moves toward zero its curve balloons, because MAPE divides by |y|. sMAPE stays bounded but reveals its asymmetry between positive and negative residuals.

Specifics for non-negative data

For non-negative counts (bookings, events, user actions) with occasional zeros — two advanced approaches work well:

  • MSE on a log-transformed target. Apply the log(y+1) transformation, train on the log scale, and map predictions back with exp(prediction) - 1. This changes the geometric interpretation: the model optimizes approximately relative (ratio) errors rather than absolute ones.
  • Tweedie loss. A specialized loss for zero-inflated data that mixes a discrete point mass at 0 with continuous positive values. The Tweedie distribution (with power parameter between 1 and 2) models exactly this mix, and its negative log-likelihood is smooth in the prediction, so it gives a meaningful gradient signal even when many targets are zero.
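Both approaches can be sketched in NumPy. The block below assumes a power parameter p = 1.5 and uses the Tweedie negative log-likelihood with the terms that do not depend on the prediction dropped; `tweedie_loss` is a name chosen here, and the grid search over a constant prediction is only a stand-in for a real model.

```python
import numpy as np

def tweedie_loss(y, y_hat, p=1.5):
    """Tweedie negative log-likelihood for 1 < p < 2, up to terms independent of y_hat."""
    y_hat = np.maximum(y_hat, 1e-8)  # predictions must be strictly positive
    return np.mean(-y * y_hat ** (1 - p) / (1 - p) + y_hat ** (2 - p) / (2 - p))

y = np.array([0.0, 0.0, 3.0, 5.0, 0.0, 12.0])  # counts with occasional zeros

# Log-transform route: train on z, then invert predictions with expm1.
z = np.log1p(y)        # log(y + 1), well defined at y = 0
y_back = np.expm1(z)   # exp(z) - 1 recovers the original scale

# Tweedie route: zeros in y are fine, and the best constant prediction is the mean of y.
candidates = np.linspace(0.1, 10.0, 991)
losses = [tweedie_loss(y, np.full_like(y, c)) for c in candidates]
best = candidates[int(np.argmin(losses))]
print(f"best constant: {best:.2f}, mean of y: {y.mean():.2f}")
```

`np.log1p` and `np.expm1` are used instead of `np.log(y + 1)` and `np.exp(z) - 1` because they are more accurate for small values.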

Why this matters for forecasting. The loss function quietly defines what the model is told to care about: absolute errors, percentage errors, peak errors, the behavior near zero, scale-normalized accuracy across many series. Two models with identical architecture and identical data can converge to very different predictions depending on whether they were trained on MAE, MSE, MAPE, sMAPE, or quantile losses. The forecast that comes out is, in part, a consequence of the loss it was optimized against — picking the loss that matches what stakeholders actually care about is one of the highest-leverage choices in the pipeline.


Model.compile + fit workflow

Model.compile + fit — the standard Keras workflow for training. It's simple, but has several important nuances, especially for time series.

Basic example

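# Imports assume TensorFlow's bundled Keras.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss=custom_loss_fn,              # any callable loss(y_true, y_pred), defined elsewhere
    metrics=["mae", "mape"],          # logged only, not optimized
)

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),   # must come from a time-based split
    epochs=100,
    batch_size=64,
    # CyclicalLearningRate is not a built-in Keras callback; assumed to be defined elsewhere
    callbacks=[EarlyStopping(...), CyclicalLearningRate(...)],
    shuffle=False,    # see the note below — defaults are not always right
)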

The workflow consists of two steps:

  • compile defines the training configuration: which optimizer to apply (Adam, SGD), which loss to minimize (MSE, custom function), which metrics to additionally monitor (MAE, MAPE — informational, not trained directly). Loss is used for gradient computation; metrics — only for logging.
  • fit runs training: X_train, y_train are the training data; validation_data is a separate dataset for monitoring generalization; epochs is the maximum number of passes through the data; batch_size is the mini-batch size. Callbacks are extensible hooks that run during training (EarlyStopping, CyclicalLearningRate, ModelCheckpoint, TensorBoard).

shuffle=False and what it really means for time series

The Keras default is to shuffle training data before each epoch, which is great for i.i.d. tabular data. For time series the picture is more subtle than the slogan "always set shuffle=False" might suggest, and it is worth being precise about what must not be shuffled and when.

The unambiguous rule is: do not shuffle before the time-based split. Random shuffling of raw timestamps and then splitting train/val/test lets future information into the training set, which is the classic time-series leakage and is always wrong.

After a proper time-based split, however, the picture depends on the model:

  • Stateless models trained on precomputed sliding windows (the typical stateless LSTM, TCN, gradient-boosted lag-feature model). Each window is a self-contained training example, the temporal information is already baked into the window itself (or into the lag and rolling features), and the order in which mini-batches are presented does not have to follow chronology. Shuffling training windows is not leakage in this case, and it sometimes helps optimization just as it does in i.i.d. tabular learning. shuffle=True here is a defensible choice.
  • Stateful RNNs (where the hidden state is explicitly carried across batches). Shuffling here breaks the assumption that consecutive batches contain consecutive time steps — shuffle=False is essential.
  • Walk-forward evaluation logic that relies on processing time chronologically — shuffle=False again, by construction.
  • Reproducibility and easier debugging. Even when shuffling is technically safe, keeping shuffle=False makes training behavior more predictable and easier to inspect, which is why many time-series codebases default to it as a habit.

So shuffle=False is the right default for time-series training code, mostly for safety and reproducibility — but the categorical claim "shuffle=True means leakage" is too strong. The real test is whether shuffling can leak future information into the training set, and after a correct time-based split on a stateless model it usually cannot.
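The difference between the two orderings can be checked with plain index arithmetic. This is an illustrative sketch on sample indices only (no model involved); the `leaks` helper and the 80/20 split ratio are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, window_size, horizon = 300, 30, 7
n_samples = n_rows - window_size - horizon + 1
starts = np.arange(n_samples)
first_target_row = starts + window_size
last_target_row = starts + window_size + horizon - 1

def leaks(train_idx, val_idx):
    """True if any training target lies at or after the earliest validation target."""
    return bool(last_target_row[train_idx].max() >= first_target_row[val_idx].min())

n_train = int(0.8 * n_samples)

# Wrong order: shuffle all windows, then split. Future rows end up in training.
perm = rng.permutation(n_samples)
bad_train, bad_val = perm[:n_train], perm[n_train:]

# Right order: split chronologically (dropping horizon - 1 straddling windows),
# then shuffle only the training windows.
good_train = rng.permutation(starts[: n_train - (horizon - 1)])
good_val = starts[n_train:]

print("shuffle then split leaks:", leaks(bad_train, bad_val))
print("split then shuffle leaks:", leaks(good_train, good_val))
```

Shuffling after the split changes only the order in which training windows are visited, not which rows they contain, so the first check reports a leak and the second does not.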