lenatriestounderstand

Shorts

One idea at a time

Single-idea notes drawn from the longer pieces, 1–2 minutes each. Use the search to narrow by topic, tag, or track, or browse the full feed.

25 shorts

What is Grouped-Query Attention (GQA)?

GQA sits between full multi-head attention (MHA) and multi-query attention (MQA): query heads are partitioned into a small number of groups, and each group shares one K/V set. It keeps most of the KV-cache savings of MQA and most of the head diversity of MHA, and it offers a cheap conversion path from existing checkpoints.

  • llm
  • attention
  • gqa
  • kv-cache
Jun 7, 2026
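
The grouping described in this note can be sketched as a mapping from query heads to shared K/V sets. This is a minimal illustration, assuming heads are assigned to groups in contiguous blocks; the function names and head counts are hypothetical and not taken from the note.

```python
def kv_group_for_head(head: int, n_query_heads: int, n_kv_groups: int) -> int:
    """Return the index of the shared K/V set used by one query head.

    Assumption: heads are split into contiguous, equally sized groups.
    """
    if n_query_heads % n_kv_groups != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_groups")
    heads_per_group = n_query_heads // n_kv_groups
    return head // heads_per_group


def head_to_group_map(n_query_heads: int, n_kv_groups: int) -> list[int]:
    """List the K/V group index for every query head."""
    return [
        kv_group_for_head(h, n_query_heads, n_kv_groups)
        for h in range(n_query_heads)
    ]


# Hypothetical sizes: 8 query heads sharing 2 K/V sets.
mapping = head_to_group_map(8, 2)
print(mapping)
```

Under this sketch, `n_kv_groups=1` reproduces MQA (every head shares one K/V set) and `n_kv_groups=n_query_heads` reproduces full multi-head attention.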

The KV Cache in One Formula

Why long context costs memory, not flops. The KV-cache size formula has six factors — once you can read it, every attention-variant design choice falls into place.

  • llm
  • kv-cache
  • inference
  • long-context
Jun 7, 2026
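
For reference, a commonly cited form of the KV-cache size calculation is sketched below. How the note itself counts its six factors is not shown on this page, so the parameter names and the example model shape here are assumptions for illustration only.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # assumption: 16-bit values (fp16 / bf16)
) -> int:
    """Estimate KV-cache size in bytes.

    The leading 2 accounts for storing one tensor for keys and one for values.
    """
    return (
        2 * n_layers * n_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Hypothetical model shape: 32 layers, 32 K/V heads of dimension 128,
# a 4096-token context, batch size 1, 16-bit values.
full = kv_cache_bytes(32, 32, 128, 4096, 1)
# Same shape with only 8 K/V heads, as a grouped-query variant might use.
grouped = kv_cache_bytes(32, 8, 128, 4096, 1)
print(full / 2**30, grouped / 2**30)  # sizes in GiB
```

Every factor enters multiplicatively, which is why reducing the number of K/V heads, or the stored dimension per token, shrinks the cache in direct proportion.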

Why "strawberry" Has Three R's (and the Model Can't Count Them)

The most famous LLM bug — miscounting letters in a word — is not a reasoning failure. It is a representation failure. The model isn't fed letters, it's fed tokens. Once you know what a token actually is, the bug is the only outcome that could have happened.

  • llm
  • tokenization
  • bpe
Jun 7, 2026

Temperature, top_k, top_p: How LLMs Pick the Next Token

Three sampling knobs, three different operations on the same next-token distribution. Temperature reshapes; top_k truncates by count; top_p truncates by cumulative probability. They are not interchangeable.

  • llm
  • sampling
  • inference
Jun 7, 2026
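
The three operations named in this note can be sketched in plain Python. This is an illustrative version, not the implementation of any particular library; all function names are hypothetical, and real inference stacks typically apply these steps to logits on tensors rather than to Python lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature reshapes: divide logits by T (> 0) before normalizing."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every token not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """top_k truncates by count: keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """top_p truncates by mass: keep the smallest prefix whose sum reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index from the (possibly truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
sharp = softmax(logits, temperature=0.5)    # more peaked than T = 1
two_best = top_k_filter(softmax(logits), 2)  # third token gets probability 0
nucleus = top_p_filter([0.5, 0.3, 0.2], 0.7)
token = sample(nucleus)
```

The number of tokens kept by `top_k_filter` is fixed, while the number kept by `top_p_filter` depends on how peaked the distribution is, which is one way to see that the knobs are not interchangeable.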

HNSW: How Vector Search Stays Logarithmic

The graph that turned vector search from O(N) into O(log N). A hierarchy of small-world graphs, three tunable parameters, and the recall/latency trade-off underneath every production vector database.

  • embeddings
  • hnsw
  • ann
  • vector-databases
Jun 7, 2026

BM25 vs Dense Embeddings: Why Hybrid Retrieval Wins

Two retrieval philosophies — lexical evidence vs learned geometry — fail in exactly opposite ways. That's why production retrieval is rarely pure dense, and why Reciprocal Rank Fusion ended up everywhere.

  • embeddings
  • retrieval
  • bm25
  • hybrid
Jun 7, 2026
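
Reciprocal Rank Fusion, mentioned in this note, combines ranked lists using only rank positions, so lexical and dense scores never need to be put on a common scale. The sketch below is a minimal version; the smoothing constant `k=60` is a commonly used default and is an assumption here, as are the document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Documents missing from a list add nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a lexical and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives in the fused list.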

Bi-encoder vs Cross-encoder: Why Retrieval Is Two-Stage

A bi-encoder encodes query and document independently; a cross-encoder reads them together. One is cheap and slightly dumb, the other is expensive and much sharper. Modern retrieval uses both — and the two-stage cascade is the whole point.

  • embeddings
  • retrieval
  • reranking
Jun 7, 2026

Parquet vs CSV: Why Columns Beat Rows for Analytics

The same data, two layouts on disk. For analytical queries that touch a few columns out of many, Parquet reads tens of times less from disk than CSV — and that gap is structural, not a benchmark artifact.

  • storage
  • parquet
  • csv
  • columnar
Jun 7, 2026

Kafka Partitions and Consumer Groups: Where Parallelism Lives

Partition count is the ceiling on parallel consumption within a consumer group. Ordering is a per-partition guarantee, not a per-topic one. These two facts decide most Kafka architecture questions.

  • kafka
  • streaming
  • distributed-systems
Jun 7, 2026

Walk-Forward Validation: Why k-Fold Leaks the Future

Random k-fold CV reshuffles time. On time series that means training on the future to predict the past — and quietly inflating every metric. The correct alternatives are expanding-window and rolling-window CV.

  • time-series
  • cross-validation
  • evaluation
Jun 7, 2026
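
The two alternatives named in this note can be sketched as index generators. This is a minimal illustration with hypothetical function names and parameters; it assumes samples are already sorted by time and that test windows are equally sized and non-overlapping.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train, test) index lists; the training set grows each split."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


def rolling_window_splits(n_samples, n_splits, test_size, train_size):
    """Yield (train, test) index lists; the training window has fixed length."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start < train_size:
        raise ValueError("not enough samples for the requested windows")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(test_start - train_size, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


expanding = list(expanding_window_splits(10, n_splits=3, test_size=2))
rolling = list(rolling_window_splits(10, n_splits=3, test_size=2, train_size=4))
for train, test in expanding:
    print("train", train, "test", test)
```

In both schemes every training index precedes every test index, which is the property random k-fold breaks.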

STL Decomposition: Trend + Seasonal + Residual

Three components, one additive identity. STL splits a series into trend, seasonality, and residual so you can see what's actually in it before modeling — and the same decomposition idea propagates into modern deep architectures like N-BEATS.

  • time-series
  • stl
  • decomposition
  • eda
Jun 7, 2026

Why MLE Becomes Cross-Entropy

Everyone trains classifiers with cross-entropy; far fewer can say why that loss and not another. It isn't a design choice: it falls out of one principle, which is to maximize the probability the model assigns to the data. Cross-entropy, log-loss, MSE, and KL are the same idea wearing four different hats.

  • probability
  • loss-functions
  • classification
Jun 7, 2026
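
The step from maximum likelihood to cross-entropy can be written out in a few lines. This is a sketch for classification, assuming i.i.d. examples $(x_i, y_i)$ and writing $q_i$ for the one-hot distribution that puts all its mass on the observed label $y_i$; the notation is chosen here and is not taken from the note.

```latex
\begin{aligned}
\hat{\theta}
  &= \arg\max_{\theta} \prod_{i=1}^{N} p_{\theta}(y_i \mid x_i) \\
  &= \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i) \\
  &= \arg\min_{\theta} \; -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i) \\
  &= \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
     H\!\left(q_i,\, p_{\theta}(\cdot \mid x_i)\right),
\qquad
H(q, p) = -\sum_{c} q(c) \log p(c).
\end{aligned}
```

The last line holds because, with a one-hot $q_i$, the sum over classes in $H$ keeps only the term for the observed label, which is exactly $-\log p_{\theta}(y_i \mid x_i)$.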

Why KL Divergence Is Everywhere in ML

KL divergence turns up in cross-entropy training, VAEs and variational inference, knowledge distillation, and the RLHF leash. That isn't a coincidence — a huge fraction of ML is secretly the sentence 'make distribution q close to distribution p,' and KL is what that sentence reduces to.

  • probability
  • information-theory
  • kl-divergence
Jun 7, 2026

Attention Is a Kernel Method in Disguise

Everyone knows attention is softmax(QKᵀ/√d)V. Fewer notice it's structurally the same thing as kernel regression — Nadaraya-Watson smoothing from 1964, with a learned similarity. Seeing the kernel underneath tells you immediately how to make attention linear.

  • attention
  • transformers
  • kernel-methods
Jun 7, 2026
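
The correspondence can be laid out side by side. The block below is a sketch for a single query vector $q$ with keys $k_i$ and values $v_i$; the feature map $\phi$ in the last line is the standard kernel-approximation argument and is an assumption about what the note has in mind, not a quotation from it.

```latex
\begin{aligned}
\text{Nadaraya-Watson:}\quad
  \hat{f}(x) &= \frac{\sum_{i} K(x, x_i)\, y_i}{\sum_{j} K(x, x_j)} \\[4pt]
\text{Attention:}\quad
  \mathrm{out}(q) &= \frac{\sum_{i} \exp\!\left(q^{\top} k_i / \sqrt{d}\right) v_i}
                          {\sum_{j} \exp\!\left(q^{\top} k_j / \sqrt{d}\right)} \\[4pt]
\text{If } K(q, k) \approx \phi(q)^{\top} \phi(k):\quad
  \mathrm{out}(q)^{\top} &\approx
    \frac{\phi(q)^{\top} \sum_{i} \phi(k_i)\, v_i^{\top}}
         {\phi(q)^{\top} \sum_{j} \phi(k_j)}
\end{aligned}
```

Attention is the first line with $K(q, k) = \exp(q^{\top} k / \sqrt{d})$, queries in place of $x$, keys in place of $x_i$, and values in place of $y_i$. In the third line the two sums do not depend on $q$, so they can be computed once and reused for every query, which makes the cost linear in sequence length.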

Why the ELBO Isn't a Random Formula

The evidence lower bound looks like a formula handed down from on high. It isn't invented — it's forced. When the evidence integral is intractable but you still want maximum likelihood, there is essentially one move available, and the ELBO is what it produces.

  • probability
  • variational-inference
  • generative-models
Jun 7, 2026

Cosine, Dot Product, Euclidean: When Are They the Same Thing?

Three different similarity measures that, on normalized embeddings, all rank the same. Why cosine is the convention, why dot is the implementation, and when the choice actually matters.

  • embeddings
  • similarity
  • retrieval
May 28, 2026
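
The equivalence on normalized embeddings follows from the identity ‖a − b‖² = 2 − 2·(a · b) for unit vectors: a larger dot product always means a smaller distance. A small check in plain Python, using made-up 2-D vectors and hypothetical helper names:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up query and documents, normalized to unit length.
query = normalize([1.0, 0.5])
docs = [normalize(d) for d in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.2])]
ids = range(len(docs))

by_dot = sorted(ids, key=lambda i: dot(query, docs[i]), reverse=True)
by_cosine = sorted(ids, key=lambda i: cosine(query, docs[i]), reverse=True)
by_distance = sorted(ids, key=lambda i: euclidean(query, docs[i]))  # ascending

print(by_dot, by_cosine, by_distance)
```

If the vectors are not normalized, the dot product also rewards vector length, and the three rankings can diverge.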

What is a TCN?

TCN stands for Temporal Convolutional Network: a causal CNN with dilated convolutions that gives sequence models a receptive field that grows exponentially with depth, at the cost of only linearly many layers.

  • time-series
  • tcn
  • deep-learning
May 27, 2026
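
The exponential growth can be checked with a short calculation. This sketch assumes one causal convolution per layer and a dilation that doubles at each layer (1, 2, 4, ...); architectures that stack two convolutions per residual block will get a different count. The function name is hypothetical.

```python
def tcn_receptive_field(kernel_size: int, n_layers: int, dilation_base: int = 2) -> int:
    """Number of past-and-present time steps one output position can see.

    Each layer with dilation d extends the receptive field by
    (kernel_size - 1) * d steps.
    """
    dilations = [dilation_base ** i for i in range(n_layers)]
    return 1 + (kernel_size - 1) * sum(dilations)


for layers in (2, 4, 8):
    print(layers, tcn_receptive_field(kernel_size=2, n_layers=layers))
```

With kernel size 2, each added layer roughly doubles the receptive field, so covering a history of length T takes on the order of log2(T) layers.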

Univariate vs multivariate time series

Univariate is one signal over time. Multivariate is many signals over time. The decision between them is about whether the extra signals actually carry information about the target.

  • time-series
  • fundamentals
May 27, 2026

ARIMA and SARIMA, on one page

ARIMA in three parts: AutoRegressive, Integrated, Moving Average. SARIMA adds a seasonal version of each. Together they handle the linear case that is stationary after differencing, which covers a surprising amount of forecasting.

  • time-series
  • arima
  • sarima
  • classical-forecasting
May 27, 2026

What is Multi-Query Attention?

Why production LLM decoders share K and V across attention heads — and the 8–32× memory cut that follows.

  • llm
  • attention
  • mqa
  • kv-cache
May 27, 2026

What is an LLM agent architecture?

An LLM agent is an LLM wrapped in a loop that gives it tools, state, and a stopping rule. The architecture is the shape of that loop.

  • llm
  • agents
  • architecture
May 27, 2026
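
The loop described in this note can be sketched in a few lines. Everything here is hypothetical: the `policy` callable stands in for an LLM call, the action dictionary format is invented for illustration, and the single `add` tool exists only to make the example runnable.

```python
def run_agent(policy, tools, task, max_steps=5):
    """Run a tool-using loop until the policy finishes or the budget runs out."""
    state = {"task": task, "history": []}  # state carried between steps
    for _ in range(max_steps):  # stopping rule 1: step budget
        action = policy(state)
        if action["type"] == "finish":  # stopping rule 2: policy says done
            return action["answer"], state
        result = tools[action["tool"]](*action["args"])
        state["history"].append((action["tool"], action["args"], result))
    return None, state  # budget exhausted without an answer


def scripted_policy(state):
    """Stand-in for an LLM: call one tool, then return its result."""
    if not state["history"]:
        return {"type": "tool", "tool": "add", "args": (2, 3)}
    last_result = state["history"][-1][2]
    return {"type": "finish", "answer": last_result}


tools = {"add": lambda a, b: a + b}
answer, final_state = run_agent(scripted_policy, tools, "add 2 and 3")
print(answer, final_state["history"])
```

Different agent architectures mostly differ in what goes into `state`, which tools are exposed, and how the stopping rule is defined; the surrounding loop stays the same shape.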

What is Multi-head Latent Attention (MLA)?

MLA stores a low-rank latent representation for each token and uses learned projections to produce the K/V information needed by attention at decode time. It attacks the KV-cache problem differently from MQA or GQA: it trades a sharing bottleneck for a low-rank one.

  • llm
  • attention
  • mla
  • kv-cache
May 27, 2026

Interpolating a time series

Each interpolation method makes a different assumption about what was happening between known measurements. The right method depends on what the signal actually does between samples, not on what's the default.

  • time-series
  • interpolation
  • preprocessing
May 27, 2026

Time series data preprocessing — the order that matters

Time series preprocessing is a pipeline, not a checklist. The order of steps changes the result; scaling before handling missing values is the common trap.

  • time-series
  • preprocessing
  • data-cleaning
May 27, 2026

Time series vs panel data — what stacking series actually changes

Time series is one entity over time. Panel data is many entities over time. The new dimension — unit-level heterogeneity — is what panel methods exist to handle.

  • time-series
  • panel-data
  • econometrics
May 27, 2026