One idea at a time

What is Grouped-Query Attention (GQA)?

GQA sits between full multi-head attention and MQA: query heads are partitioned into a small number of groups, and each group shares one K/V set. Most of the KV-cache savings of MQA, most of the head diversity of MHA — and a cheap conversion path from existing checkpoints.

llm
attention
gqa
kv-cache

The KV Cache in One Formula

Why long context costs memory, not flops. The KV-cache size formula has six factors — once you can read it, every attention-variant design choice falls into place.
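
As a rough illustration, the formula can be written as a one-line product. This is a minimal sketch: the factorization below (K and V, layers, KV heads, head dimension, sequence length, batch size, bytes per element) is one common way to write it and is an assumption here, as are the function name and the example model shape.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a decoder-only transformer."""
    # The leading 2 counts one tensor for K plus one for V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model: 32 layers, head_dim 128, fp16, one 4096-token sequence.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(mha / 2**30, gqa / 2**30)  # cache size in GiB
```

Every factor is linear, so cutting the number of KV heads by 4 cuts the cache by 4, and doubling the context doubles it.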

llm
kv-cache
inference
long-context

Why "strawberry" Has Three R's (and the Model Can't Count Them)

The most famous LLM bug, miscounting letters in a word, is not a reasoning failure. It is a representation failure. The model isn't fed letters; it's fed tokens. Once you know what a token actually is, the bug is the only outcome that could have happened.

llm
tokenization
bpe

Temperature, top_k, top_p: How LLMs Pick the Next Token

Three sampling knobs, three different operations on the same next-token distribution. Temperature reshapes; top_k truncates by count; top_p truncates by cumulative probability. They are not interchangeable.
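
The three operations can be sketched in plain Python on a list of logits or probabilities. The function names are illustrative, not taken from any particular library, and real inference stacks apply the same ideas to tensors.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Reshape: divide logits by the temperature, then softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Truncate by count: keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Truncate by mass: keep the smallest prefix whose cumulative prob >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)
```

On a peaked distribution top_p may keep a single token while top_k still keeps k of them, which is one way to see that the knobs are not interchangeable.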

llm
sampling
inference

HNSW: How Vector Search Stays Logarithmic

The graph that turned vector search from O(N) into O(log N). A hierarchy of small-world graphs, three tunable parameters, and the recall/latency trade-off underneath every production vector database.

embeddings
hnsw
ann
vector-databases

BM25 vs Dense Embeddings: Why Hybrid Retrieval Wins

Two retrieval philosophies — lexical evidence vs learned geometry — fail in exactly opposite ways. That's why production retrieval is rarely pure dense, and why Reciprocal Rank Fusion ended up everywhere.
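
Reciprocal Rank Fusion itself is short enough to sketch. The version below assumes each retriever returns an ordered list of document ids; the constant k=60 is the value commonly used in practice, and the document ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["d3", "d1", "d7"]   # hypothetical lexical results
dense_hits = ["d1", "d9", "d3"]  # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Only ranks enter the score, so the incompatible score scales of BM25 and cosine similarity never have to be calibrated against each other.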

embeddings
retrieval
bm25
hybrid

Bi-encoder vs Cross-encoder: Why Retrieval Is Two-Stage

A bi-encoder encodes query and document independently; a cross-encoder reads them together. One is cheap and slightly dumb, the other is expensive and much sharper. Modern retrieval uses both — and the two-stage cascade is the whole point.

embeddings
retrieval
reranking

Parquet vs CSV: Why Columns Beat Rows for Analytics

The same data, two layouts on disk. For analytical queries that touch a few columns out of many, Parquet reads tens of times less from disk than CSV — and that gap is structural, not a benchmark artifact.

storage
parquet
csv
columnar

Kafka Partitions and Consumer Groups: Where Parallelism Lives

Partition count is the ceiling on parallel consumption. Ordering is a per-partition guarantee, not a per-topic one. Two facts that decide most Kafka architecture questions.

kafka
streaming
distributed-systems

Walk-Forward Validation: Why k-Fold Leaks the Future

Random k-fold CV reshuffles time. On time series that means training on the future to predict the past — and quietly inflating every metric. The correct alternatives are expanding-window and rolling-window CV.
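
Both alternatives reduce to index arithmetic. The sketch below assumes samples are already sorted by time and uses hypothetical function names; libraries such as scikit-learn ship their own time-series splitters with different options.

```python
def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_idx, test_idx); the training window grows with each fold."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


def rolling_window_splits(n_samples, n_folds, test_size, train_size):
    """Same test blocks, but training keeps only the most recent train_size points."""
    for train_idx, test_idx in expanding_window_splits(n_samples, n_folds, test_size):
        yield train_idx[-train_size:], test_idx


for train, test in expanding_window_splits(n_samples=10, n_folds=3, test_size=2):
    print(train, test)
```

In every fold the largest training index is smaller than the smallest test index, which is exactly the property random k-fold breaks.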

time-series
cross-validation
evaluation

STL Decomposition: Trend + Seasonal + Residual

Three components, one additive identity. STL splits a series into trend, seasonality, and residual so you can see what's actually in it before modeling — and the same decomposition idea propagates into modern deep architectures like N-BEATS.

time-series
stl
decomposition
eda

Why MLE Becomes Cross-Entropy

Everyone trains classifiers with cross-entropy; far fewer can say why that loss and not another. It isn't a design choice: it falls out of one principle, namely to maximize the probability the model assigns to the data. Cross-entropy, log-loss, MSE, and KL are the same idea wearing four different hats.
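
For the classification case the derivation fits in a few lines. The notation is an assumption made for this sketch: $N$ i.i.d. pairs $(x_i, y_i)$, $C$ classes, and a model $p_\theta(c \mid x)$.

```latex
\begin{aligned}
\theta^{*}
  &= \arg\max_{\theta} \prod_{i=1}^{N} p_\theta(y_i \mid x_i)
   = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) \\
  &= \arg\min_{\theta} \; -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)
   = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
     \Bigl( -\sum_{c=1}^{C} \mathbf{1}[y_i = c] \, \log p_\theta(c \mid x_i) \Bigr)
\end{aligned}
```

The bracketed term is the cross-entropy between the one-hot label distribution and the model's predicted distribution, so minimizing average cross-entropy and maximizing likelihood pick the same $\theta$.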

probability
loss-functions
classification

Why KL Divergence Is Everywhere in ML

KL divergence turns up in cross-entropy training, VAEs and variational inference, knowledge distillation, and the RLHF leash. That isn't a coincidence — a huge fraction of ML is secretly the sentence 'make distribution q close to distribution p,' and KL is what that sentence reduces to.

probability
information-theory
kl-divergence

Attention Is a Kernel Method in Disguise

Everyone knows attention is softmax(QKᵀ/√d)V. Fewer notice it's structurally the same thing as kernel regression — Nadaraya-Watson smoothing from 1964, with a learned similarity. Seeing the kernel underneath tells you immediately how to make attention linear.
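
Written side by side, the correspondence looks like this. The symbols are assumptions for the sketch: a single query $q$, keys $k_i$, values $v_i$, head dimension $d$, and a feature map $\phi$ that is purely illustrative.

```latex
\begin{aligned}
\text{Nadaraya--Watson:}\quad
  \hat{y}(x) &= \frac{\sum_{i} K(x, x_i)\, y_i}{\sum_{j} K(x, x_j)} \\[4pt]
\text{Attention:}\quad
  \operatorname{out}(q) &= \frac{\sum_{i} \exp\!\bigl(q^{\top} k_i / \sqrt{d}\bigr)\, v_i}
                               {\sum_{j} \exp\!\bigl(q^{\top} k_j / \sqrt{d}\bigr)},
  \qquad K(q, k) = \exp\!\bigl(q^{\top} k / \sqrt{d}\bigr) \\[4pt]
\text{If } K(q, k) \approx \phi(q)^{\top} \phi(k):\quad
  \operatorname{out}(q)^{\top} &\approx
    \frac{\phi(q)^{\top} \sum_{i} \phi(k_i)\, v_i^{\top}}
         {\phi(q)^{\top} \sum_{j} \phi(k_j)}
\end{aligned}
```

In the last line both sums are independent of $q$, so they can be computed once and reused for every query, which is where the linear cost comes from.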

attention
transformers
kernel-methods

Why the ELBO Isn't a Random Formula

The evidence lower bound looks like a formula handed down from on high. It isn't invented — it's forced. When the evidence integral is intractable but you still want maximum likelihood, there is essentially one move available, and the ELBO is what it produces.
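
One way to read that move: introduce an arbitrary distribution $q(z)$ over the latent variable and apply Jensen's inequality. The notation ($x$ observed, $z$ latent) is assumed for this sketch.

```latex
\begin{aligned}
\log p(x)
  &= \log \int p(x, z)\, dz
   = \log \mathbb{E}_{q(z)}\!\left[ \frac{p(x, z)}{q(z)} \right] \\
  &\ge \mathbb{E}_{q(z)}\bigl[ \log p(x, z) - \log q(z) \bigr]
   \;=\; \mathrm{ELBO}(q) \\[4pt]
\log p(x) - \mathrm{ELBO}(q)
  &= \mathrm{KL}\bigl( q(z) \,\|\, p(z \mid x) \bigr) \;\ge\; 0
\end{aligned}
```

The gap is a KL divergence, so raising the ELBO over $q$ tightens the bound, and raising it over the model parameters pushes up a lower bound on the log-likelihood.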

probability
variational-inference
generative-models

Cosine, Dot Product, Euclidean: When Are They the Same Thing?

Three different similarity measures that, on normalized embeddings, all rank the same. Why cosine is the convention, why dot is the implementation, and when the choice actually matters.
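
A small check of the claim, as a sketch with made-up vectors: for unit-length a and b, the squared Euclidean distance equals 2 - 2(a·b), so sorting by any of the three gives the same order.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical query and document embeddings, all normalized.
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 1.9, 0.4], [-1.0, 0.2, 3.0], [0.5, 0.5, 0.5])]

by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))
print(by_dot, by_cos, by_l2)
```

Without the normalize step the dot product also rewards vector length, which is the case where the choice of measure starts to matter.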

embeddings
similarity
retrieval

May 28, 2026

What is a TCN?

TCN stands for Temporal Convolutional Network: a causal CNN with dilated convolutions whose receptive field grows exponentially with depth, so covering a long history costs only a linear number of layers.
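
The growth can be checked with a little arithmetic. This sketch assumes one causal convolution per layer and dilations that double each layer (1, 2, 4, ...); implementations that stack two convolutions per residual block see a larger field, and the function name is hypothetical.

```python
def tcn_receptive_field(kernel_size, n_layers, dilation_base=2):
    """Number of past-and-present time steps visible to one output step."""
    dilations = [dilation_base ** i for i in range(n_layers)]
    # Each layer extends the field by (kernel_size - 1) * dilation steps.
    return 1 + (kernel_size - 1) * sum(dilations)


for layers in (4, 8):
    print(layers, tcn_receptive_field(kernel_size=3, n_layers=layers))
```

Doubling the depth from 4 to 8 layers takes the field from 31 to 511 steps under these assumptions.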

time-series
tcn
deep-learning

Univariate vs multivariate time series

Univariate is one signal over time. Multivariate is many signals over time. The decision between them is about whether the extra signals actually carry information about the target.

time-series
fundamentals

ARIMA and SARIMA, on one page

ARIMA in three parts: AutoRegressive, Integrated, Moving Average. SARIMA adds a seasonal version of each. Together they handle series that are linear and stationary after differencing, which covers a surprising amount of forecasting.

time-series
arima
sarima
classical-forecasting

What is Multi-Query Attention?

Why production LLM decoders share K and V across attention heads — and the 8–32× memory cut that follows.

llm
attention
mqa
kv-cache

What is an LLM agent architecture?

An LLM agent is an LLM wrapped in a loop that gives it tools, state, and a stopping rule. The architecture is the shape of that loop.

llm
agents
architecture

What is Multi-head Latent Attention (MLA)?

MLA stores a low-rank latent representation for each token and uses learned projections to produce the K/V information needed by attention at decode time. A different way to attack the KV-cache problem than MQA or GQA: it trades a sharing bottleneck for a low-rank one.

llm
attention
mla
kv-cache

Interpolating a time series

Each interpolation method makes a different assumption about what was happening between known measurements. The right method depends on what the signal actually does between samples, not on what's the default.

time-series
interpolation
preprocessing

Time series data preprocessing — the order that matters

TS preprocessing is a pipeline, not a checklist. The order of steps changes the result — scaling before handling missing values is the common trap.

time-series
preprocessing
data-cleaning

Time series vs panel data — what stacking series actually changes

Time series is one entity over time. Panel data is many entities over time. The new dimension — unit-level heterogeneity — is what panel methods exist to handle.

time-series
panel-data
econometrics