Chapter 8 of 8
Playing with a Hybrid Architecture for Forecasting
Created Apr 28, 2026 Updated May 5, 2026
This note is a walkthrough of one possible hybrid forecasting architecture — me playing with how the building blocks from the rest of the series fit together when you actually try to assemble a custom forecaster. It is a particular composition of common neural forecasting blocks (TCN, LSTM, attention, decomposition heads, embeddings, future-covariate branches) that captures the shape of architectures that were strong in roughly the 2020–2022 period and that still show up in practice when teams want fine-grained control over what the model can and cannot represent.
This architecture is intentionally over-engineered as a learning and design exercise. Do not start here in a real project. Start with simpler baselines (statistical models, gradient boosting with lag and calendar features, a foundation model). Reach for a hybrid like this only when you have a clear reason for each branch.
What stays useful regardless of which model class is currently in fashion is the ability to read and assemble these blocks — to know, looking at someone else's hybrid forecaster, which components are doing what and why. That is what this note tries to show, by walking through one such assembly end to end.
The architecture builds on top of the LSTM/RNN recurrent core and the TCN convolutional block — those two are described separately and used here as known building blocks. Initialization, dropout, optimizer choice, sliding windows, loss functions and other cross-cutting concerns live in the practical training recipes note. The VSN block below is a simplified, TFT-inspired feature-selection layer (it skips the optional context-conditioning input from the original); the GRN block keeps the gated form from the published TFT design.
Overall model scheme
The full pipeline of a hybrid LSTM model can look like this:
Input (window_size × num_features)
↓
Optional: Embedding layers for categorical / calendar features
↓
Optional: Variable Selection Network (VSN) — TFT-inspired
↓
Optional: Feature attention — softmax weights over features
↓
Optional: TCN (Temporal Convolutional Network) — dilated causal conv + residual (here with GLU activation)
↓
LSTM (return_sequences=True) — recurrent extraction
↓
Dropout
↓
Optional: Multi-Head Self-Attention
↓
Optional: GLU Feed-Forward
↓
Split into 3 heads (N-BEATS-style decomposition):
├── Trend head → Dense → Dense(horizon)
├── Season head → Dense → Dense(horizon)
└── Residual head → Dense → Dense(horizon)
↓
Add → main_output (horizon-step forecast)
+ Optional: Seasonal head (calendar embedding + same-position lag + historical average for that calendar slot)
+ Optional: TCN direct path (skip connection)
+ Optional: Event branch (future-known features such as holidays or scheduled events)
All outputs are summed → final forecast
Each component solves its own specific task. Below is a detailed breakdown of each.
The full pipeline as a real architecture diagram, with every optional block toggleable. The dark green LSTM and the three N-BEATS heads are the always-on backbone; un-check VSN, TCN, multi-head attention, the side branches (Seasonal head / Event branch) or the TCN skip path and watch the wires rejoin around the disabled box. The point: most blocks of this hybrid architecture are optional, and what the model actually computes depends on which ones are turned on.
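Before diving into the individual blocks, here is a minimal sketch of just the always-on backbone (input → LSTM → dropout → three decomposition heads → sum), with every optional block omitted. The sizes (window_size, num_features, horizon) are illustrative placeholders, not recommendations; the optional branches discussed below would be spliced in between these lines.
import tensorflow as tf
from tensorflow.keras import layers, Model

window_size, num_features, horizon = 28, 8, 7     # example sizes only

inputs = layers.Input(shape=(window_size, num_features))
x = layers.LSTM(64, return_sequences=True)(inputs)   # recurrent extraction
x = layers.Dropout(0.2)(x)
x_last = layers.Lambda(lambda t: t[:, -1, :])(x)      # last hidden state

heads = []
for _ in range(3):                                    # trend / season / residual
    h = layers.Dense(32, activation="relu")(x_last)
    heads.append(layers.Dense(horizon)(h))
main_output = layers.Add()(heads)                     # horizon-step forecast

model = Model(inputs, main_output)
model.compile(optimizer="adam", loss="mse")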
Layer Normalization (Ba et al., 2016)
Layer Normalization is a critically important component of modern neural networks, especially recurrent and transformer architectures. Normalization techniques are needed to stabilize the training of deep networks, but classical Batch Normalization has serious limitations for sequence models, and LayerNorm avoids many of them.
Formula
LN(x) = γ * (x - μ) / σ + β
where μ, σ are the mean and std over features within one example (in contrast to BatchNorm, where they are computed over the batch); in practice a small ε is added under the square root for numerical stability.
LayerNorm computes mean and standard deviation per individual example, over the feature axis. The formula is simple: subtract mean, divide by std, get normalized representation with zero mean and unit variance. Then scale and shift via learnable parameters γ (gamma) and β (beta).
In Keras specifically, LayerNormalization(axis=-1) is the default and is what these architectures usually want: it normalizes over the feature dimension independently for each example and each time step, rather than mixing across the time axis. So for a tensor of shape (batch, time, features), the statistics are computed inside each (batch, time) slice over its feature vector, never across time.
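A quick way to convince yourself of this is to compare Keras LayerNormalization against manually computed per-(example, time-step) statistics. A small sketch with made-up shapes, not part of the architecture itself:
import numpy as np
import tensorflow as tf

x = np.random.randn(2, 5, 8).astype("float32")        # (batch, time, features)

ln = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6)
y = ln(x).numpy()                                      # gamma=1, beta=0 at init

mu = x.mean(axis=-1, keepdims=True)                    # stats per (batch, time) slice
sigma = x.std(axis=-1, keepdims=True)
manual = (x - mu) / np.sqrt(sigma**2 + 1e-6)

print(np.allclose(y, manual, atol=1e-4))               # True: normalized over features only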
The key difference from BatchNorm:
- BatchNorm computes statistics over the entire batch for each feature separately — how all examples in the batch look for feature 1, for feature 2, and so on.
- LayerNorm — the opposite: for each example it computes how its features are distributed.
Why LayerNorm in time series
Why is LayerNorm better than BatchNorm for time series and recurrent models?
BatchNorm works poorly with sequence models for several reasons:
- Different sequences in a batch may have different lengths — BatchNorm needs padding or masking, which is complicated.
- Recurrence makes statistics time-dependent — the same activation at different time steps has different meaning.
- With very small batches, BatchNorm statistics become noisy and often unstable; at inference time it can fall back on moving averages, but in sequence models LayerNorm is usually simpler and more robust because it does not depend on batch statistics at all.
LayerNorm normalizes element-wise, for each example and each time step separately — no dependence on the batch. It is stable for recurrent models and became the de facto normalization choice in Transformers.
General benefits: faster training (normalized activations have better gradient flow, the learning rate can be larger, convergence is faster).
γ and β — learnable parameters
After normalization to mean=0, variance=1, the layer applies a learnable affine transformation: γ * normalized + β. Parameters γ (scale) and β (shift) are trained together with the rest of the network.
They are important because normalization is a strong constraint that can remove useful information. γ and β do not literally restore the original per-example statistics; they give the model learnable scale and shift parameters so that normalization does not force every representation to stay at zero mean and unit variance. In practice the model learns whatever scale and shift are useful for the next layer — sometimes close to the unnormalized distribution, sometimes very different.
VSN — Variable Selection Network (TFT-inspired)
VSN — Variable Selection Network — one of the most useful components from Temporal Fusion Transformer (TFT). It's a mechanism that explicitly teaches the model to select relevant features for each prediction, instead of blanket-processing all features the same.
In time series tasks we often have dozens of features (lags, calendar, exogenous variables), and not all of them are equally important — VSN dynamically manages this. The implementation below is a simplified, VSN-inspired feature-selection block, not a line-by-line reproduction of the TFT Variable Selection Network — the spirit (per-variable transformation + softmax weighting) is preserved, but the original paper includes additional context-conditioning machinery that this simplified version skips.
Idea
Explicitly learn which features are important at each moment in time. Soft feature selection via learned feature weights.
The main problem with many features — "noise" from irrelevant features can suppress the signal from relevant ones. Classical ML handles this via feature selection preprocessing (remove irrelevant features before training). But for time series, relevance can change: on a normal day lag_1 is important, on a holiday — other features. Static selection is insufficient.
VSN does soft feature selection via a learned weighting mechanism: for each input the model computes importance weights for each feature, and features with low weight are effectively ignored. This is learned selection, adaptive to context. (It is attention-like in the technical sense — softmax weights summing to 1 — but as the dedicated section below explains, it is operating over input variables rather than over time steps, which is a different question from what temporal attention answers.)
Mechanics — five steps
The mechanics of VSN's operation is a five-step process:
- Pass each feature through a GRN (Gated Residual Network) — a mini-network for feature-specific transformation. Output shape per feature: (batch, time, hidden_units).
- GlobalAveragePooling1D + Dense(1) → one scalar logit per feature (summary), of shape (batch, 1).
- Concatenate the scalar logits + Softmax → importance weights of shape (batch, num_features), summing to 1 across features.
- Multiply each transformed feature by its scalar weight (broadcast across time and channels).
- Add → weighted sum across features, of shape (batch, time, hidden_units).
At step 1, each feature is processed via a separate GRN (more on it below). At step 2, GlobalAveragePooling1D reduces the temporal dimension and a Dense(1) collapses the per-feature pooled vector to a single scalar logit per feature — this is the bit that makes the resulting weights actually per-feature, rather than per-channel-of-each-feature.
At step 3, the scalar logits are concatenated into a (batch, num_features) tensor and Softmax is applied along the feature axis — so the weights sum to 1 across features, exactly one weight per input variable. At step 4, the transformed features (from step 1) are multiplied by their scalar weights, broadcast over time and channels. At step 5, the weighted features are summed (Add) into a single output.
Code example
# 1. Split input into per-feature slices: each (batch, time, 1)
feature_slices = [
    Lambda(lambda t, i=i: t[:, :, i:i+1])(x) for i in range(num_features)
]
# 2. GRN for each feature: each (batch, time, hidden_units)
grn_transformed = [
    grn(feat, units=hidden_units, dropout_rate=dropout_rate)
    for feat in feature_slices
]
# 3. One scalar logit per feature, then softmax across features.
#    Each Dense(1)(GlobalAveragePooling1D(...)) is shape (batch, 1).
feature_logits = [
    Dense(1)(GlobalAveragePooling1D()(f)) for f in grn_transformed
]
importance_logits = Concatenate(axis=1)(feature_logits)  # (batch, num_features)
importance_weights = Softmax(axis=1)(importance_logits)  # sums to 1 per example
# 4. Weighted sum across features. Each weight is broadcast over (time, channels).
weighted_features = []
for i, f in enumerate(grn_transformed):
    w = Lambda(lambda t, i=i: t[:, i:i+1])(importance_weights)  # (batch, 1)
    w = Reshape((1, 1))(w)                                      # (batch, 1, 1)
    weighted_features.append(Multiply()([f, w]))
x = Add()(weighted_features)  # (batch, time, hidden_units)
Two things to flag about this code. First, the Lambda(lambda t, i=i: ...) trick in step 1 (and again in step 4) is the standard workaround for the Python closure bug — see the Lambda section below for the full explanation. Second, the Dense(1) after pooling in step 3 is what guarantees that you get one weight per feature, not one weight per hidden channel of each feature. A naive Softmax(Concatenate([GlobalAveragePooling1D(f) for f in ...])) would distribute weight across num_features × hidden_units numbers, which is not what VSN is supposed to do.
The complexity of VSN is justified for projects with many features, where feature importance varies over time. For simpler cases, ordinary concatenation of all features may be sufficient.
Five features each get a learned scalar logit (drag the sliders). The widget shows the softmax across logits → per-feature importance weights → each feature's mini time-series scaled by its weight → the final weighted sum that downstream layers see. Push noisy feat. down and watch it disappear from the output; push lag-1 and lag-7 together and the output collapses onto their seasonal pattern. The weights always sum to 1 — giving more to one means taking from the others.
VSN vs temporal attention
How does VSN differ from a standard attention layer in a sequence model? The important difference is not simply "softmax" — most attention layers also use softmax weights that sum to 1. The difference is in what the weights are over.
- Temporal attention (the kind used inside Transformers and LSTM+attention hybrids) weighs time positions: at each query step, which past steps matter more?
- VSN weighs input variables (after first transforming each one through its own small per-variable network): for the current context, which features matter more?
The two answer different questions. They are complementary rather than competing — a full TFT uses both — and a VSN-style block is what you reach for when you specifically want the model to softly choose which input variables to lean on, rather than which time steps to attend to.
GRN — Gated Residual Network
In TFT, the Gated Residual Network (GRN) is the workhorse block used inside VSN and elsewhere. The full TFT GRN has four pieces: two dense layers with an ELU activation between them, optional context conditioning, a gating layer (GLU) on the output, and a residual connection followed by LayerNorm. The gate is what makes it a gated residual network: it lets the model learn to softly turn the entire block off when its contribution is not needed. The dense path could in principle learn to output near-zero on its own (especially with the residual shortcut available), but the explicit gate gives the network a much cleaner and more directly trainable way to suppress its transformed contribution.
A minimal GRN with the gate, in Keras-style pseudocode:
def grn(x, units, dropout_rate):
    x_in = x
    # Skip projection if the input has a different number of channels
    # than the block output (e.g. a single feature slice, channels=1):
    if x.shape[-1] != units:
        x_in = Dense(units)(x_in)
    x = Dense(units)(x)
    x = ELU()(x)                    # smooth alternative to ReLU
    x = Dense(units * 2)(x)         # 2× units for the GLU split
    x = Dropout(dropout_rate)(x)
    a, b = tf.split(x, 2, axis=-1)
    x = a * tf.sigmoid(b)           # GLU gate: learnable on/off
    x = Add()([x, x_in])            # residual connection
    x = LayerNormalization()(x)
    return x
This is the version of GRN that the rest of this note assumes. It omits the optional context-conditioning input from the original TFT paper (which lets a static covariate vector modulate the GRN), but otherwise matches the published design — the GLU gate is the part that makes this a gated residual block rather than a plain residual MLP. The Dense(units)(x_in) projection on the skip path is essential whenever the input has a different last-axis size than units (which is exactly what happens inside the VSN block below, where each per-feature slice has only 1 channel).
Components
- Dense → ELU → Dense — two linear transformations with a nonlinearity between. This gives modeling capacity — the ability to learn nonlinear relationships. The second Dense produces 2 × units channels because GLU will split them in half.
- Dropout — standard regularization on the inner activations.
- GLU (a * σ(b)) — the second half of the channels gates the first half through a sigmoid, giving the network a learnable on/off control over the residual contribution. If the gate learns to output close to zero, the block effectively passes the input through almost unchanged via the residual.
- Residual connection (Add()([x, x_in])) — direct gradient path; combined with the GLU gate, it makes the block "do nothing harmful" in the worst case, which is what makes deep stacks of GRNs trainable.
- LayerNormalization — stabilizes the output statistics.
ELU (Exponential Linear Unit) — an activation function, an alternative to ReLU. Formula: f(x) = x if x > 0, else α(e^x - 1), where α is a small positive constant (usually 1.0). For positive inputs ReLU and ELU are identical. For negative inputs ReLU zeroes out while ELU gives smooth negative values close to −α. The practical advantages are a continuous derivative and mean activations closer to zero, which together help optimization in deeper stacks; ELU is not strictly necessary here, ReLU also works.
Multi-Head Self-Attention (from Transformer)
Multi-Head Self-Attention — the mechanism that became the foundation of the Transformer architecture (Vaswani et al., 2017, "Attention Is All You Need"). It overturned ideas about sequence modeling and displaced RNNs in most NLP tasks. In this hybrid architecture, attention is used on top of the LSTM to provide direct long-range interactions.
Self-attention formula
Attention(Q, K, V) = softmax(Q K^T / √d_k) × V
where Q, K, V — Query, Key, Value — linear projections of the input.
The attention formula is one of the most important in modern neural processing. Q (Query), K (Key), V (Value) — three different linear projections of the same input (in self-attention) or different inputs (in cross-attention).
For each position in the sequence, Q is multiplied by all K (matrix product Q K^T), which gives a matrix of similarities between positions. Dividing by √d_k keeps the dot products from blowing up as the dimensionality grows. Softmax converts the similarities into probability weights, and these weights are applied to V to obtain the weighted sum.
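For a single head, the whole formula fits in a few lines of NumPy. A toy sketch (shapes and values made up) that mirrors the equation above:
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, d_k = 6, 4                          # sequence length, key dimension
Q = np.random.randn(T, d_k)
K = np.random.randn(T, d_k)
V = np.random.randn(T, d_k)

scores = Q @ K.T / np.sqrt(d_k)        # (T, T) similarities between positions
weights = softmax(scores, axis=-1)     # each row sums to 1
output = weights @ V                   # (T, d_k) weighted sums of values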
Intuition
For each position t, attention weights all positions by their similarity to t (via Q K^T), then takes the weighted sum of values.
Intuitively, attention is "looking up" information relevant to the current position. For each position t, the network creates a Query vector — "what am I looking for?". Each other position has a Key vector — "what do I have?". Similarity between Query and each Key determines how much the current position "pays attention to" each other. Positions with high similarity contribute a lot, with low similarity — little.
This is content-based interaction: related positions find each other regardless of distance in time.
Multi-head
We do num_heads parallel attentions with different projections, concatenate.
Multi-head attention — parallel execution of multiple attention operations. Each "head" has its own Q, K, V projections, which allows different heads to learn to attend to different aspects:
- One head may learn seasonal-looking alignments.
- Another may focus on short-term dependencies.
- A third may react to event-like positions.
The clean separation suggested by these labels is not guaranteed by the architecture — heads are not enforced to specialize, and in practice the division of labor between them is messy and often hard to interpret. What multi-head does guarantee is access to several independent attention subspaces, and that already gives the model richer representations than single-head attention. Outputs are concatenated and transformed via a final linear projection.
Code example
attn_output = layers.MultiHeadAttention(
    num_heads=4,
    key_dim=32,            # dimensionality of Q/K/V inside each head
    dropout=dropout_rate,
)(x, x)                    # self-attention: Q=K=V=x
The Keras API makes multi-head attention trivial. num_heads=4 — 4 parallel heads (typical values 4–16). key_dim=32 — dimensionality of Q/K/V in each head. dropout for regularization. (x, x) — self-attention, where source and target are the same sequence (Q=K=V=x). For cross-attention, different tensors would be passed: (query_seq, value_seq).
Why attention in a hybrid architecture
LSTM processes sequentially, but attention can directly link t=1 and t=30. For example, "what was last Monday" — attention will find this signal faster than LSTM will carry it through 7 steps.
In the hybrid architecture, LSTM and attention are complementary:
- LSTM captures local dependencies (recent past) well, but information from the distant past is "diluted" as it passes through time steps.
- Attention — direct connection: any two positions can "talk" to each other, regardless of distance.
For time series this is especially useful: "what was last Monday" — attention can do a direct match without waiting 7 steps through LSTM. Hybrid LSTM+attention combines the sequential inductive bias (good for local patterns) with direct long-range access (good for weekly/monthly seasonality).
A synthetic past window with a strong weekly pattern, plus four attention heads each hand-crafted to specialize in a different aspect: local momentum (head 0), weekly periodicity at t−7, t−14, t−21 (head 1), event-spike magnitude (head 2), and a near-uniform global summary (head 3). Drag the query position and watch each head highlight different past keys. The architecture only permits this kind of specialization — in a real Transformer it isn't enforced, and what each head ends up doing is shaped by data and loss.
A small but important caveat about leakage: in this architecture self-attention is applied only over the encoded historical window (the LSTM output for past time steps), so no causal mask is needed — every position the model attends to is already in the past relative to the forecast origin. If the same attention block were instead applied over a sequence that included future decoder steps (for example, in a seq2seq decoder that attends over its own future-known covariates as it emits the horizon), masking or careful query/key/value separation would be required to prevent each future step from peeking at later future steps.
N-BEATS Decomposition Heads (Trend / Seasonality / Residual)
N-BEATS decomposition — an idea from the N-BEATS paper, often adapted in hybrid architectures. Instead of predicting the target as one continuous quantity, the model splits the forecast into several components of different nature, each computed by its own head. This is the classical statistical approach (STL decomposition) carried over to a deep learning context.
Three components
N-BEATS (Oreshkin et al., 2020) — deep learning for time series with interpretable decomposition:
- Trend — long-term movement (trend line).
- Seasonality — periodic patterns (day of week, month).
- Residual — everything else.
The decomposition idea came from classical time series statistics: any series can be represented as a sum (or product) of three components.
- Trend — long-term direction (linear growth, slowing, monotonic change).
- Seasonality — repeating patterns with fixed period (weekly, monthly, yearly).
- Residual — everything not explained by trend and seasonality (noise, random events, short-term fluctuations).
The original N-BEATS had more complex basis functions for trend and seasonality (polynomials, Fourier series). In the simplified version, each head is just an MLP, learning its component implicitly.
Implementation example
Three parallel heads from the last LSTM hidden state:
x_last = Lambda(lambda t: t[:, -1, :])(x) # take the last step
trend_head = Dense(32, activation="relu")(x_last)
trend_out = Dense(horizon)(trend_head)
season_head = Dense(32, activation="relu")(x_last)
season_out = Dense(horizon)(season_head)
residual_head = Dense(32, activation="relu")(x_last)
residual_out = Dense(horizon)(residual_head)
main_output = Add()([trend_out, season_out, residual_out])
The implementation is very simple. Take the last hidden state of the LSTM (after all processing layers). Three parallel MLP heads — each Dense → Dense, each outputs in horizon dimensions (the number of future time steps that we forecast). One head for trend, one for seasonality, one for residual. The results are simply summed into the final main_output.
The three heads share the underlying representation (x_last), but learn different functions over it.
How heads "specialize" — and an important caveat
How exactly do the three heads "specialize" in different components? Not in any explicit way — we don't force the trend head to output only linear patterns. There is at most a soft inductive bias: depending on initialization, data patterns, and other architectural choices, one head may start to capture one aspect more than another.
The caveat is critical and easy to miss: without explicit constraints, these heads are not identifiable. If three free MLP heads are simply summed, the model is mathematically free to spread any signal across any combination of heads. There is nothing in the loss that forces the "trend head" to contain only trend, or the "seasonality head" to contain only seasonality. Calling them by those names is convenient labeling, not a guarantee about their content. Real interpretable decomposition needs something more — explicit basis functions (polynomials for trend, Fourier series for seasonality, as in the original interpretable N-BEATS variant), monotonicity constraints, auxiliary losses on the per-head outputs, or some other restriction that breaks the symmetry between heads.
What this design does still buy you, even without identifiability, is a useful architectural prior: three parallel paths often help the model generalize better than one monolithic output layer, and the modular structure makes the resulting code easier to reason about. It is a practical trick that frequently improves accuracy on time-series forecasting; it just is not a true interpretable decomposition unless you take the extra step.
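As an illustration of what "explicit basis functions" would look like, here is a hedged sketch of a polynomial-basis trend head in the spirit of the interpretable N-BEATS variant, reusing x_last and horizon from the snippet above: the head predicts a few polynomial coefficients, and the forecast is those coefficients times a fixed time basis. This shows the kind of constraint involved, not the exact N-BEATS formulation.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Fixed polynomial basis over the horizon: t^0, t^1, t^2 on normalized time.
degree = 2
t = np.linspace(0, 1, horizon)                          # horizon assumed defined above
basis = tf.constant(np.stack([t**p for p in range(degree + 1)]), dtype=tf.float32)  # (degree+1, horizon)

# The trend head predicts only (degree+1) coefficients instead of free horizon values,
# so its output is constrained to smooth polynomial shapes.
trend_coeffs = layers.Dense(degree + 1)(x_last)                          # (batch, degree+1)
trend_out = layers.Lambda(lambda c: tf.matmul(c, basis))(trend_coeffs)   # (batch, horizon)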
Three heads (trend / season / residual) over a 28-step horizon, each with its own amplitude slider, summed into the final forecast. The first three sliders rebalance the heads in the obvious way. The fourth slider — trend → season — moves the slope out of the trend head and into the season head; the per-head curves change, but the final forecast doesn't move at all. That's the non-identifiability the article warns about: without basis-function constraints or per-head losses, three free MLP heads can spread the same signal across each other in infinitely many ways.
Seasonal Head — separate branch
Seasonal head — an additional specialized output branch that complements the main N-BEATS decomposition heads. Its task is to explicitly handle a strong, known seasonality (weekly is the example used here, but the same shape works for hourly or yearly cycles) via direct access to the relevant calendar features. This is an architectural commitment: if we know in advance that a particular periodic pattern is critical for forecasting, we give the network a direct pathway for it instead of hoping it will rediscover that pattern from the raw input.
Structure
An additional head with explicit calendar features — to capture DoW seasonality directly. The inputs here are future-known (we know the day-of-week for every step in the horizon), so the per-future-step features have shape (batch, horizon, ...) and the head emits one seasonal contribution per future step:
# day_of_week_input has shape (batch, horizon)
day_embedding = Embedding(input_dim=7, output_dim=4)(day_of_week_input)
# lag_7_input, dow_avg_input have shape (batch, horizon, 1)
seasonal_features = Concatenate()([day_embedding, lag_7_input, dow_avg_input])
# Apply the MLP per future step, then collapse the trailing 1 dim:
seasonal_hidden = TimeDistributed(Dense(16, activation="relu"))(seasonal_features)
seasonal_output = TimeDistributed(Dense(1))(seasonal_hidden)
seasonal_output = Reshape((horizon,))(seasonal_output)
The pipeline is simple but targeted. Take the day_of_week input, pass through a trained embedding (4-dimensional representation of each day). Concatenate with two other relevant features: lag_7 (the value a week ago, for weekly recurrence) and dow_avg_input (historical average by day-of-week). Then apply a small MLP per future step — the TimeDistributed(Dense(...)) form makes that explicit, with Dense(1) collapsing each step to a single seasonal contribution. The final Reshape((horizon,)) gives (batch, horizon), ready to be added to the main forecast.
A subtle but important detail: a plain Dense(horizon) on a 3D tensor of shape (batch, horizon, features) would broadcast over the last axis and produce (batch, horizon, horizon) — not what we want here, and easy to miss. Either use TimeDistributed(Dense(1)) per step (as above), or first flatten the time dimension and then Dense(horizon) once on the resulting 2D tensor; both give a clean (batch, horizon) output.
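A quick shape check makes the pitfall concrete (the horizon value and feature count are purely illustrative):
from tensorflow.keras import layers

horizon = 28                                           # illustrative value

seq = layers.Input(shape=(horizon, 6))                 # (batch, horizon, 6) future-step features
wrong = layers.Dense(horizon)(seq)                     # (batch, horizon, horizon): broadcasts over the last axis
right = layers.Reshape((horizon,))(
    layers.TimeDistributed(layers.Dense(1))(seq)
)                                                      # (batch, horizon): one value per future step
print(wrong.shape, right.shape)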
Three input types
It uses:
- Day_Of_Week embedding (4-dimensional learned).
- lag_7 — value exactly a week ago.
- dow_avg — historical average by DoW.
The three input types here are carefully chosen for weekly seasonality capture:
- Day_Of_Week embedding gives a learned representation of each day — the network will automatically understand the similarity structure (weekdays vs weekends) through training.
- lag_7 — value exactly a week ago, a strong baseline for weekly recurrence.
- dow_avg — historical average, long-term stability by day-of-week, smoother signal than a single lag_7.
The combination of all three gives the network robust information for weekly pattern modeling.
Why inductive bias
This head is added to main_output — as inductive bias for weekly seasonality.
Why is a separate head needed if the main LSTM can learn weekly patterns itself? It's about inductive bias. The architecture explicitly "hints" to the model that weekly seasonality is important, and gives a direct pathway for its capture. This is prior knowledge encoded in the architecture.
Without this head, LSTM could learn the weekly pattern, but less efficiently — information would have to pass through many layers, get rescaled. An explicit seasonal head gives a shortcut and gives the network a direct path for important calendar features (whether the model actually leans on it is up to training; the architecture only makes the path available).
A typical pattern in modern architectures: combine general-purpose components (LSTM, attention) with domain-specific branches.
Event Branch
Event branch — a specialized part of the architecture, designed for incorporating known-future information into the forecast. In time series forecasting there are often features that are known in advance: holidays, scheduled events, price changes, promotions. The event branch specifically processes this information, separately from historical patterns.
Past features vs future-known features
It's important to understand the distinction:
- Past features — these are what we observed in the past (historical values, past weather, past sales). These features are available only for past time steps, not for future ones — we don't know future sales in advance.
- Future-known features — features that we know for future time steps (school vacations next month are already in the calendar; promotion next week is already planned by the marketing team).
A plain encoder-only LSTM processes only the historical window. To use future-known features the architecture needs an explicit path for them: a decoder input (in seq2seq), a future-covariate branch (as here), or a TFT-style known-future pathway. The event branch is one such explicit path.
Where each kind of information lives in time, drawn as four rows split by the now line. The target is observed on the left and unknown on the right (that's the forecast). Past covariates exist only on the left. Future-known covariates — calendar, holidays, scheduled promotions — exist on both sides, which is exactly why the hybrid architecture grows side branches that take horizon-shaped inputs. Future-unknown covariates (forecast weather, future prices) cannot be peeked at without leaking the label.
Implementation
event_input = Input(shape=(horizon, num_event_features))
event_branch = TimeDistributed(Dense(32, activation="relu"))(event_input)
event_branch = TimeDistributed(Dense(1))(event_branch)
event_branch = Reshape((horizon,))(event_branch)
The implementation makes direct use of future event features. Input shape (horizon, num_event_features) — a tensor with features for each future step. TimeDistributed Dense layers process each step independently, producing a single value per step. The final Reshape makes the output a 1D vector of length horizon, which is then added to the main forecast. The network learns the relationship between scheduled events and expected impact on the target.
TimeDistributed
TimeDistributed — applies the wrapped layer to each time step independently, preserving the time dimension. Conceptually: same weights, applied per step.
There is one important nuance for Dense specifically. In modern Keras, Dense applied to a 3D tensor of shape (batch, time, features) already broadcasts over the last axis and produces (batch, time, units) — so for Dense alone, TimeDistributed(Dense(32)) is essentially equivalent to Dense(32). The wrapper is still useful for clarity (it makes the per-step intent explicit), for compatibility with older code patterns, and for wrapping layers that genuinely expect one sample at a time and would otherwise need manual reshaping. It is necessary less often than tutorials sometimes suggest.
The Keras analog for a 2D-conv network would be Conv1D(kernel_size=1) — parameter-sharing across spatial positions.
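A quick check of the equivalence claim for Dense, as a sketch with illustrative shapes:
import numpy as np
from tensorflow.keras import layers

x = np.random.randn(2, 10, 8).astype("float32")        # (batch, time, features)

dense = layers.Dense(32)
td = layers.TimeDistributed(dense)                      # wraps the same Dense instance

y1 = dense(x)                                           # Dense already broadcasts over the time axis
y2 = td(x)
print(y1.shape, y2.shape)                               # both (2, 10, 32)
print(np.allclose(y1.numpy(), y2.numpy()))              # True: same weights, same result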
Event branch integration
The event branch output is combined with the trend/season/residual heads — the model knows future holidays and adjusts the forecast.
In this architecture the integration is done with an elementwise sum: final forecast = trend + seasonality + residual + event_impact. This is additive decomposition: each component contributes its own additive part on top of the others.
This works well when event effects are themselves roughly additive on the chosen target scale. For example, "next Friday is a holiday → forecast goes down by 20 units" is the kind of pattern an additive event branch can learn directly. But many real event effects are multiplicative — "holidays have +20% sales", or "promotions multiply demand by 1.5×". An additive head can only learn those well if the target itself is on a scale where the effect becomes additive: training on log(target) and exponentiating back, or predicting a relative uplift on top of a baseline forecast, or adding interaction terms between the event branch and the baseline output. Without one of those, an additive event branch will systematically misestimate multiplicative effects, especially at unusual baseline levels.
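A common way to let an additive head capture a multiplicative effect is to move to log space. A small numeric sketch of the idea (not part of the architecture above): train on a log-transformed target, add the event contribution there, and map predictions back by exponentiating.
import numpy as np

# Multiplicative effect: promotions multiply demand by ~1.5x.
baseline = np.array([100.0, 200.0, 400.0])
promo_uplift = 1.5

# In the original scale, the additive gap depends on the baseline level:
print(baseline * promo_uplift - baseline)               # [ 50. 100. 200.], not one constant

# In log space the same effect IS a constant shift, which an additive head can learn:
print(np.log(baseline * promo_uplift) - np.log(baseline))  # [0.405 0.405 0.405]

# So: train on y_log = np.log(y) (or np.log1p if the target can be zero),
# add the event contribution there, and map forecasts back with np.exp / np.expm1.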
TCN Direct Path — skip connection
TCN Direct Path — an optional component of the architecture, creating a skip connection from TCN output directly to the final forecast, bypassing LSTM and attention. It is a deep-supervision-like pattern (closer to a skip forecast path than to strict deep supervision, which would also add a separate auxiliary loss on the intermediate output).
What it does
tcn_last = Lambda(lambda t: t[:, -1, :])(tcn_output)
skip_output = Dense(horizon)(tcn_last)
We take the last timestep output of TCN (its representation after all dilated convolutions and GLU activations), pass it through one Dense layer for projection into horizon size, and add the result to main_output. This is an alternative forecast path — not through LSTM → attention → heads, but directly from TCN.
Deep supervision
The TCN output is added to main_output. This is related to deep supervision but is not the strict version with an auxiliary loss; here the TCN direct path acts as a skip forecast path: it contributes directly to the final output and gives the TCN layers a shorter gradient route from the main loss back to themselves.
The concept of strict deep supervision came from computer vision (Lee et al., 2014, "Deeply-Supervised Nets"). The original idea: in a very deep network the gradient signal reaching early layers is weak, so add auxiliary losses on intermediate outputs — each loss training its corresponding sub-network directly. The skip-forecast pattern used here is in the same family of ideas (give intermediate layers their own path to the loss), but only via a contribution to the shared final output, not a separate auxiliary loss term.
The TCN Direct Path serves a dual purpose:
- First, TCN itself can produce a reasonable forecast independently of LSTM, which gives an ensembling effect.
- Second, the backprop gradient from the final loss goes directly to TCN layers via the skip path, bypassing LSTM + attention — which means TCN gets a strong training signal even if gradients through the main path vanish.
If the LSTM path trains poorly for some reason, the TCN path can still contribute a useful shorter-path forecast signal.
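For contrast, strict deep supervision would register the TCN skip forecast as a second model output with its own down-weighted loss term. A hedged sketch of how that could look in Keras, assuming inputs, main_output and skip_output from the snippets above (the 0.3 loss weight is illustrative):
from tensorflow.keras import Model

# Strict deep supervision: the intermediate forecast gets its own auxiliary loss.
# (In the architecture above, skip_output is instead just added into main_output.)
model = Model(inputs=inputs, outputs=[main_output, skip_output])
model.compile(
    optimizer="adam",
    loss=["mse", "mse"],
    loss_weights=[1.0, 0.3],     # auxiliary loss down-weighted
)
# model.fit(X, [y, y], ...)      # the same target supervises both outputs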
tf.keras.layers.Lambda — inline operations
tf.keras.layers.Lambda — a utility wrapper in Keras that turns an arbitrary Python function into a full-fledged layer. This is a practical tool for quickly adding custom logic without creating a separate Layer subclass.
What it's used for
For custom logic without a separate layer class:
layers.Lambda(lambda t: t[:, -1, :])(x) # select last timestep
layers.Lambda(lambda t, idx=idx: tf.cast(t[:, :, idx], tf.int32))(inputs) # cast
Typical use cases:
- The first — selection of the last timestep from a sequence (common for getting the final hidden state).
- The second — type casting one slice of input.
Lambda integrates with Keras model-building like a regular layer: you can chain it with other layers, it's visible in the model summary, and it participates in the model's computation graph. Inside it, ordinary TensorFlow tensor operations work — you can use tf.reduce_mean, tf.concat, slicing, broadcasting.
When to use and when not to
Use Lambda: simple slicing, type casting, elementwise operations, broadcasting. Quick, readable, doesn't require subclass boilerplate.
Don't use Lambda: if you need a stateful layer (internal state that persists between calls), trainable weights (learnable parameters), or complex logic (conditional branches, loops). In those cases, write a proper Layer subclass — it takes more boilerplate, but it comes out cleaner, more maintainable, and supports everything Keras can do (serialization, weights tracking).
Pitfall — Python closure bug
Variable closure: lambda t, i=i: ... — the default argument i=i is important, otherwise all lambdas capture the last value of i (classic Python closure bug).
A tricky Python pitfall that often catches beginners. Typical bug: in a loop creating many Lambda layers, each using loop variable i. Without careful closure handling, all lambdas end up referencing the same i variable — after the loop it equals the last value. As a result, all lambdas behave identically.
Solution — default argument trick:
lambda t, i=i: ...
Creates a new local variable i in each lambda's scope, capturing value at creation time, not the current value of outer i. This is a Python feature (not Keras-specific), but constantly arises with Lambda layer usage. The workaround saves the value explicitly.
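A pure-Python demonstration of the bug and the fix, outside of Keras entirely:
# Buggy: every lambda closes over the same variable i.
fns_buggy = [lambda x: x + i for i in range(3)]
print([f(0) for f in fns_buggy])        # [2, 2, 2]: all use the final value of i

# Fixed: the default argument freezes i at creation time.
fns_fixed = [lambda x, i=i: x + i for i in range(3)]
print([f(0) for f in fns_fixed])        # [0, 1, 2]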
Concatenate vs Add
Concatenate and Add are the two main ways to combine tensors in neural networks. The choice between them is genuinely meaningful and often not obvious to beginners. Understanding the difference matters when designing network architectures.
Add
output = x + y — requires the same shape, preserves dimensionality. Residual connections.
Add — elementwise sum. Requires tensors with identical shape, and preserves shape.
The key use case is residual connections: output = F(x) + x for skip connections in ResNet, Transformer, etc. Semantically this is "additional signal": base representation x plus delta from transformation F(x). Doesn't increase dimensionality — output is the same size as inputs.
Good for combining same-type information: two feature vectors representing the same thing from different paths.
Concatenate
output = [x, y] — joins along an axis, increases dimensionality. Feature fusion (combining different types of features).
Concatenate — appending one tensor to another along a specified axis. Shapes can differ along the concat axis (the sizes add up), but must match along all other axes. The result has increased dimensionality: concatenating two 64-dim vectors gives a 128-dim result.
The key use case is feature fusion: combining different types of features (calendar embeddings + numerical lags + categorical one-hots), where each is important in itself, not redundant with others. Downstream layers learn how to use the combined information.
Also used for multi-task learning and multi-modal fusion (images + text).
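The shape behaviour in one small sketch (values are random and purely illustrative):
import numpy as np
from tensorflow.keras import layers

a = np.random.randn(4, 64).astype("float32")
b = np.random.randn(4, 64).astype("float32")

added = layers.Add()([a, b])                   # (4, 64): elementwise sum, shape preserved
fused = layers.Concatenate(axis=-1)([a, b])    # (4, 128): features appended

print(added.shape, fused.shape)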
Closing
The point of this architecture is not that every forecasting problem needs all of these branches. Most do not. The point is to show how the blocks compose: VSN for feature routing, TCN for bounded convolutional context, LSTM for recurrent dynamics, attention for direct long-range interactions, decomposition heads for output structure, and future-covariate branches for information known at prediction time.
In practice, this kind of hybrid model is worth considering only after simpler baselines have been tried — classical statistical models, gradient boosting with lag and calendar features, and possibly a time-series foundation model. But understanding the hybrid design is still useful, because it teaches what each neural forecasting block contributes and what kind of problem it is meant to solve. Reading a forecaster like this is half of being able to debug or extend one when the time comes.