Section 3 of 5

LLM

Transformer architecture, the generation loop, prompt engineering, structured outputs, agent patterns.

10 chapters in this section.

How LLM Generation Works: Transformer, Sampling, Tokens, Batching, and Validation

What happens inside a transformer when you send a prompt, and how the practical knobs — temperature, max_tokens, structured outputs, batching strategy, retry-with-catch-up — fall out of that picture.

llm
transformers
attention
tokenization

Updated May 27, 2026

Attention Is All You Need — But Not All Attention Is the Same

Why modern LLMs are no longer just decoder-only transformers with standard multi-head attention. Attention has become a design space — MHA, MQA, GQA, MLA, sliding-window, sparse, linear, recurrent, hybrid — plus position encoding, attention sinks, and KV-cache compression. Each variant solves a different bottleneck.

attention
kv-cache
long-context

Prompt Engineering

What separates a working LLM prompt from a flaky one in 2026 — instruction hierarchy, in-context learning, chain-of-thought, structured outputs, reasoning-model specifics, and the prompt-injection trust boundary.

prompt-engineering

Updated May 8, 2026

The Physics of Hallucination

What hallucination looks like at the level of the transformer's internal computation — distributed representations, signal competition in the residual stream, the softmax bottleneck, the activation-output gap, and the architectural reasons there is no first-class epistemic channel.

hallucinations

Updated May 8, 2026

The Hindsight Corpus: Time in LLM Pretraining Data

Saying a model was 'trained on text written before T' invites a picture of human knowledge as of T. The actual corpus is volumetrically skewed toward recent years, dominated by retroactively edited sources like Wikipedia, missing reliable per-document timestamps, and survivor-biased for older periods. Covers the mechanisms, the failure modes that fall out, what's silently absent from datasheets, and what time-aware pretraining would have to do differently.

pretraining
training-data
temporal

Updated May 13, 2026

LLM Agent Architectures

Agent architecture is where LLM engineering stops being mostly about prompts and starts looking like distributed systems. Covers workflows vs agents, the classical loop, five paradigms (ReAct, Function Calling, Plan-and-Execute, Reflection, CodeAct), MCP as the protocol layer above per-vendor function calling, multi-agent patterns, computer use, memory and resumability, production failure modes including indirect prompt injection, tool security, cost levers, and observability.

agent-architectures
tool-use
mcp
multi-agent

Updated May 27, 2026

The Missing Now: Temporal Grounding in LLM Agents

A chat transcript preserves order but not elapsed time, world state, or whether earlier hypotheses have expired. For long-running agents, temporal grounding is a runtime problem, not a model problem. Covers what 'now' actually is, the failure modes that fall out when context gets treated as state, the primitives (clocks, event logs, state reducers, expectations, monitors) that close the gap, and how to measure whether it works.

agents
temporal-grounding
state-management

Updated May 13, 2026

Fine-Tuning LLMs: When the Weight Delta Is Worth It

Fine-tuning is not prompt repair. It is a decision to write a reusable parameter delta into an existing checkpoint. That delta changes future logits, defaults, and trade-offs. This note is about when that is worth doing: what fine-tuning actually buys, how to tell whether a gap belongs in the weights, and why data, evals, forgetting, and probability shape matter more than the slogan 'just fine-tune it'.

fine-tuning
residual-stream

Fine-Tuning LLMs: Post-Training Is a Pipeline, Not a Step

Post-training is not one fine-tuning method. It is a sequence of objective signals. Continued pretraining teaches substrate, SFT teaches examples and defaults, preference optimization teaches comparisons, RLVR teaches verifiable trajectories, and distillation transfers the resulting behavior. The important design question is not which acronym is fashionable, but which stage matches the behavior you are trying to install.

fine-tuning
post-training
sft
dpo

Fine-Tuning LLMs: Modern Post-Training Deep Dive

A reference-style deep dive into the modern knobs around post-training: preference optimization variants, LoRA and PEFT methods, memory-efficient full fine-tuning, model merging, distillation, tooling, and serving. Read this after the pipeline note, once you know which stage you actually need.

fine-tuning
dpo
lora
qlora
