Chapter 7 of 10
The Missing Now: Temporal Grounding in LLM Agents
Created May 13, 2026 Updated May 13, 2026
A user changes an nginx configuration, restarts the service, and the site still doesn't open. They ask the assistant for help, follow the suggested steps, and check again an hour later. Still broken. They write:
Still not working.
The assistant replies:
You just changed the configuration. Give DNS some time to propagate.
The model isn't wrong about nginx. It's wrong about when. Between "I changed the config" and "still not working" sit one hour of wall-clock time and zero tokens of context. To the model, those two messages are adjacent. To the world, they aren't.
I keep coming back to this example because it compresses the whole thing into one moment. A chat transcript preserves order. It doesn't preserve elapsed time, world state, or whether earlier hypotheses have since expired. A stateless LLM invocation receives a serialized context — not a process unfolding in time. When the task is short, the difference is invisible. When the task is long-running, that gap is where most of the failures I see actually live.
The angle I'm working from in this note: temporal grounding is a runtime problem, not a model problem. I don't think it gets fixed reliably at the model layer alone — not by longer context, better prompting, or scaling. What fixes it is giving the runtime the primitives the model doesn't have on its own: clocks, event logs, state reducers, expectations, monitors. The rest is me working through what "now" actually is, where I see it fail, what to build, what it costs, and how I'd measure whether it's working.
Stateless, more precisely
When people say LLMs are stateless they usually mean: the model doesn't remember between conversations. That's true but it's not precise enough for the design question.
The more precise version: a base LLM invocation maps input tokens to a next-token distribution, and after the invocation returns, no part of the model continues to exist as a process. It doesn't track running deployments. It doesn't notice that a user has been away. It doesn't carry expectations forward. The next invocation reconstructs everything from whatever text the runtime puts in the context window.
The runtime around the model can be stateful — chat apps store history, agent frameworks store tool outputs, workflow engines persist tasks. But these are properties of the surrounding system, not the model. The most common conflation, mine included, is treating a long context window as if it were persistent state. It really isn't, and most of the failures below come from blurring that distinction.
Putting it in a table:
| Component | What it actually is |
|---|---|
| LLM weights | Learned prior knowledge |
| Context window | Serialized input for one invocation |
| Chat history | Product-level storage of prior messages |
| Vector store / "memory" | Retrieval into context |
| Runtime state | Explicit representation of current task and world |
| Monitor / scheduler | Process that observes change between invocations |
Context describes the past; state represents the present. A transcript is evidence about what happened. State is a claim about what's true now. They're related but they're not the same object.
"Now" isn't a timestamp
The obvious first fix is to put timestamps in the conversation. It helps a little. It doesn't solve the problem, because "now" isn't a single piece of information.
When I decompose "now" operationally, I get at least four components:
- Wall-clock time and elapsed durations. What time is it; how long since each prior event.
- Recency of events. Not just an ordered history, but which events are fresh, which are stale, and how that's shifted since the last invocation.
- Current state of the world. A maintained representation of the task and environment — not a transcript of observations about it.
- Outstanding expectations. Deadlines, pending checks, hypotheses that haven't yet been verified or falsified.
For a human these blur into one thing — the sense of being inside a situation. An engineer just knows the config changed an hour ago, that the reload should have taken seconds, that the failure persists, that the wait-for-propagation hypothesis is now weak. None of that reasoning is explicit. It's a property of being a process embedded in time.
An LLM invocation doesn't have any of this by default. It can reconstruct parts of it from text if the relevant information is present, salient, and used during inference. Each of those conditions fails for me regularly, which is why timestamps alone are only a partial fix — they address one component out of four, and only when the model actually attends to them.
Mapping each component to the runtime primitive that has to own it:
| Component of "now" | Runtime primitive | How hard it seems |
|---|---|---|
| Wall-clock time, elapsed durations | Clock access, timestamp injection | Trivial |
| Recency of events | Append-only event log with freshness metadata | Straightforward |
| Current world state | State reducer over events | Moderate — needs a task model |
| Outstanding expectations | Expectation registry + scheduler | Harder — needs an async observation loop |
This is the table that ended up mattering most. Each row is something concrete to build. The rest of the note works through how I'd build each one and how I'd know if it worked.
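To make the first row concrete, here is a minimal sketch of timestamp injection that also renders the gaps between messages, so elapsed time is in the input as text, not left for the model to compute. The function names and the output format are my own assumptions, not any framework's API.

```python
from datetime import datetime, timedelta

def format_gap(delta: timedelta) -> str:
    """Render a duration coarsely: '30s', '10m', '1h 05m'."""
    seconds = int(delta.total_seconds())
    if seconds < 60:
        return f"{seconds}s"
    minutes = seconds // 60
    if minutes < 60:
        return f"{minutes}m"
    return f"{minutes // 60}h {minutes % 60:02d}m"

def annotate(messages: list[tuple[datetime, str, str]], now: datetime) -> str:
    """messages: (timestamp, role, text) tuples in chronological order."""
    lines = []
    prev = None
    for ts, role, text in messages:
        # Make the gap to the previous message explicit instead of implied.
        gap = "" if prev is None else f" (+{format_gap(ts - prev)} since previous)"
        lines.append(f"[{ts:%H:%M}{gap}] {role}: {text}")
        prev = ts
    if prev is not None:
        # Close with the distance between the last message and the current time.
        lines.append(f"[now {now:%H:%M}, {format_gap(now - prev)} since last message]")
    return "\n".join(lines)
```

This only covers the first row of the table: it puts durations in front of the model but does nothing about state or expectations.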
Five failure modes I keep running into
Once context and state are separated, the failure modes I see in long-running LLM systems get nameable. None are exotic — if I'm running an agent over hours-to-days horizons, I've seen all five. For each one I write down what the signature looks like in logs, because once I have a name and a detection heuristic, the failure mode stops being mysterious.
Temporal adjacency error
The model treats adjacent messages as if the underlying events happened close together.
10:00 — I deployed the change.
12:00 — Still failing.
Assistant: You just deployed it; give it a moment.
What I look for: time-sensitive words ("just", "recently", "currently") in assistant responses where the time since the cited event is more than ~10× the typical settling time of the relevant process. If my agent says "you just deployed" an hour later, this is what's happening.
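A rough version of that log heuristic, as a sketch. The word list and the 10× factor come from the description above; the function name and the idea of passing the settling time in per process are assumptions.

```python
import re
from datetime import timedelta

# Recency words that imply the cited event happened moments ago.
RECENCY_WORDS = re.compile(r"\b(just|recently|currently)\b", re.IGNORECASE)

def flags_adjacency_error(
    response: str,
    elapsed: timedelta,
    settling_time: timedelta,
    factor: float = 10.0,
) -> bool:
    """True when a response uses recency language although the time since
    the cited event exceeds `factor` times the process's settling time."""
    uses_recency_language = bool(RECENCY_WORDS.search(response))
    return uses_recency_language and elapsed > factor * settling_time
```

It will produce false positives ("just" has other meanings), so I'd treat a hit as a candidate for review, not as a verdict.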
Stale snapshot error
The model reasons from a state that was true when last observed but may no longer be.
10:00 — Deployment in progress.
10:45 — What should I do next?
Assistant: Wait for the deployment to finish.
A deployment that's in progress in the transcript may have completed, failed, or timed out. A snapshot without a freshness boundary is dangerous in a way I keep underestimating.
What I look for: assistant responses that reference world state by name (the deployment, the running job, your branch) without an accompanying observation step. If the model talks about state without checking state, it's assuming staleness doesn't exist.
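One way to give a snapshot a freshness boundary is to store the observation time and a maximum age next to the value, then render staleness into whatever the model sees. This is a sketch; the `max_age` field and the rendering format are hypothetical, and the right boundary depends on the state variable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Observation:
    name: str
    value: str
    observed_at: datetime
    max_age: timedelta  # hypothetical freshness boundary, chosen per state variable

def is_stale(obs: Observation, now: datetime) -> bool:
    """An observation is stale once it is older than its freshness boundary."""
    return now - obs.observed_at > obs.max_age

def render(obs: Observation, now: datetime) -> str:
    """Render the observation with its age and freshness for the model's context."""
    age_min = int((now - obs.observed_at).total_seconds() // 60)
    tag = "STALE, re-observe before relying on it" if is_stale(obs, now) else "fresh"
    return f"{obs.name}={obs.value} (observed {age_min}m ago; {tag})"
```

With the 10:00 deployment example and a five-minute boundary, a 10:45 invocation would see the deployment status explicitly marked stale instead of reading it as current.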
Missing expectation violation
This is the one I find most interesting. For a human engineer, "still broken after an hour" isn't just another message: it's the violation of an earlier expectation, and it should shift the system from waiting to diagnosing. A stateless invocation doesn't see a violation, only a continuation.
What I look for: every recommendation that involves the word "wait" (or "give it time", "propagate", "settle"). For each one I ask: was a deadline created somewhere? Is there any mechanism by which the system would behave differently if the deadline passed? When the answer to both is no, every wait recommendation is a silent open loop.
State reconstruction drift
In long conversations, the model reconstructs task state differently on different turns. At one point it assumes nginx was restarted; later it suggests restarting again; then assumes the right config was deployed; then forgets which failure class is in play.
What I look for: I instrument multi-turn runs to extract the model's implicit belief about completed steps at each turn. A simple way is to ask the model to summarize "what we've established so far" every Nth turn and diff the summaries. Drift shows up immediately. The first time I watched an agent litigate the same step twice across a long conversation, this was what I was watching, though I didn't have a name for it yet.
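The diffing step can be sketched as follows, assuming each periodic summary has already been parsed into a set of step names the model believes are complete. That parsing is the hard part and is not shown; `find_drift` is a hypothetical helper.

```python
def find_drift(snapshots: list[set[str]]) -> list[tuple[int, set[str]]]:
    """Compare consecutive belief snapshots.

    Returns (snapshot_index, dropped_steps) for every snapshot in which a
    step that was believed complete in the previous snapshot has vanished.
    """
    drift = []
    for i in range(1, len(snapshots)):
        dropped = snapshots[i - 1] - snapshots[i]
        if dropped:
            drift.append((i, dropped))
    return drift
```

A dropped step is not always an error (a second config change legitimately invalidates an earlier reload), so each hit needs checking against the event history.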
Unverified transition error
The model treats user-reported actions as completed state transitions.
Assistant: Restart nginx.
User: Done. Still not working.
"Done" is ambiguous. The restart may have succeeded, failed silently, hit the wrong server, or restarted a container while the host service kept running old workers. Everything downstream is now built on an unverified claim.
What I look for: how often the agent proceeds from a state transition to a downstream recommendation without an intervening tool observation. When my verification rate drops below something like 70% on critical transitions, accumulated state errors are getting shipped into production.
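A simplified way to compute that rate from a trace, assuming each step has already been labeled as a transition, an observation, or a recommendation. The labels, and the rule that any observation verifies all pending transitions, are simplifying assumptions of mine.

```python
def verification_rate(steps: list[str]) -> float:
    """steps: ordered labels 'transition', 'observation', or 'recommendation'.

    A transition counts as verified if an observation occurs before the
    next recommendation. Returns verified / total (1.0 when there are none).
    """
    total = 0
    verified = 0
    pending = 0  # transitions reported but not yet observed
    for step in steps:
        if step == "transition":
            total += 1
            pending += 1
        elif step == "observation":
            verified += pending
            pending = 0
        elif step == "recommendation":
            pending = 0  # the agent moved on without checking
    return verified / total if total else 1.0
```

For example, `["transition", "observation", "recommendation", "transition", "recommendation"]` has two transitions, of which only the first was observed before the agent moved on.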
Why scaling doesn't fix this
The objection I'd want to hear out: maybe this is a current-generation problem. Bigger models follow instructions better, longer contexts hold more history, reasoning models can think through elapsed time explicitly.
All true, and reasoning models in particular do help: given enough chain-of-thought tokens, they can write out "an hour has passed, the waiting hypothesis is now weak" and act on it. But none of this changes the underlying architecture.
A larger model still receives a serialized input. A longer context is still not maintained state. A prompt saying "consider elapsed time" still relies on the model extracting, computing, and acting on temporal features inside one invocation, every time, from scratch. Reasoning models do this self-prompting more reliably, but they're still recomputing it from a fresh context each time, not reading it from a runtime that already knows.
The bottleneck is representation, not intelligence. If elapsed time isn't in the input, the model can't use it. If staleness isn't marked, the model can't avoid it. If expectations aren't tracked, the system can't notice them being violated. If nothing runs between user messages, nothing observes between user messages.
This is why current LLM agents feel locally smart but temporally unreliable. They handle isolated tasks brilliantly and lose track of unfolding processes. I don't see how that gap closes by scaling — what closes it, in the systems I've watched work, is giving the runtime the primitives the model is missing.
What I'd build
The minimum architecture I keep coming back to has five components. None are exotic — the work is in composing them.
1. Timestamp every event. Not only user messages — tool calls, tool results, file writes, service restarts, deployments, observations, retries. Every event carries when it happened and how it was sourced (user-claimed, tool-observed, verified).
2. An append-only event log, separate from the transcript. The transcript is optimized for the model to read. The event log is optimized for state reconstruction. They're different artifacts. This is event sourcing applied to agent state: store the sequence of events, derive state from them.
3. A state reducer. Events go in, current task state comes out. The reducer is task-specific — for an nginx-debugging agent, it tracks config-changed, syntax-validated, service-reloaded, reload-verified, external-availability. The model receives the reduced state, not the raw events. This is what eliminates reconstruction drift — the model doesn't reconstruct anything, it reads the state object.
4. An expectation registry. Any recommendation involving waiting creates an expectation with a deadline and a violation action.
expectation:
  created_at: 10:04
  condition: site_responds_after_nginx_reload
  expected_by: 10:09
  on_violation: classify_failure_mode
  status: pending
When the deadline passes, the registry either fires the violation action automatically or, if the system is interactive, surfaces the violation to the next invocation. This is the only mechanism that turns "still not working" into something operationally meaningful.
5. Scheduled observations. When the agent says "check again in ten minutes", the runtime schedules an actual check. Without this, "check in ten minutes" is a string. With it, there's a closed loop.
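Components 4 and 5 can be sketched together: a registry that holds expectations shaped like the block above, and a `tick` that a scheduler would call to run the check and mark deadlines as passed. The class, the method names, and the `check` callback are hypothetical; a real runtime would persist this and drive `tick` from a timer or workflow engine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable

@dataclass
class Expectation:
    condition: str
    created_at: datetime
    expected_by: datetime
    on_violation: str
    status: str = "pending"  # pending | met | violated

class ExpectationRegistry:
    def __init__(self) -> None:
        self._items: list[Expectation] = []

    def expect(self, condition: str, created_at: datetime,
               expected_by: datetime, on_violation: str) -> Expectation:
        """Register an expectation whenever the agent recommends waiting."""
        exp = Expectation(condition, created_at, expected_by, on_violation)
        self._items.append(exp)
        return exp

    def tick(self, now: datetime, check: Callable[[str], bool]) -> list[Expectation]:
        """Scheduled observation: run `check` for each pending expectation.

        Marks it 'met' if the condition holds, 'violated' if the deadline
        has passed without it holding. Returns the newly violated ones.
        """
        newly_violated = []
        for exp in self._items:
            if exp.status != "pending":
                continue
            if check(exp.condition):
                exp.status = "met"
            elif now >= exp.expected_by:
                exp.status = "violated"
                newly_violated.append(exp)
        return newly_violated
```

The list returned by `tick` is what gets passed to the next invocation (or dispatched to the `on_violation` handler), so the model is told that a deadline passed and doesn't have to infer it.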
A weak architecture:
conversation history → LLM → next message
A stronger one:
event sources → event log → state reducer → expectation registry
↓
LLM ← (state, fresh events, violated expectations, actions)
↓
response / tool call / scheduled action
The LLM still does what it's good at — interpreting ambiguous input, proposing hypotheses, writing commands, communicating with the user. It just stops being asked to simulate a clock, a memory, and a state machine inside prose.
What this costs
None of this is free. The costs worth being explicit about:
Engineering surface area. An event-sourced system with a state reducer and an expectation registry is a stateful service alongside the LLM. Schema design, migration, debugging, replay tooling — it ends up looking like a small workflow engine. For agents handling one narrow task type that's the right investment. For an agent expected to handle arbitrary tasks, the reducer becomes either generic-and-shallow or task-specific-and-many, and I haven't figured out how to escape that tradeoff. Pretending the model will infer state from prose just relocates the cost into hidden state errors.
Latency and token budget. Injecting state, recent events, and violated expectations into every invocation adds tokens. For long-running tasks the marginal cost is dominated by tool calls and reasoning anyway, but for high-frequency interactive agents it matters. The mitigation I've been considering: state summaries with explicit freshness boundaries, and event log truncation with explicit "and N earlier events not shown" markers.
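The truncation marker is simple to sketch. The point is that the model is told something was omitted, so a cut-off log doesn't read as a complete one. The function name and marker wording are assumptions.

```python
def truncate_events(events: list[str], keep_last: int) -> list[str]:
    """Keep the most recent `keep_last` events, with an explicit marker
    standing in for whatever was dropped."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(events) <= keep_last:
        return list(events)
    hidden = len(events) - keep_last
    return [f"... and {hidden} earlier events not shown"] + events[-keep_last:]
```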
New failure modes. A stateful runtime can be wrong too — the reducer can mis-classify, the expectation registry can fire spurious violations, the scheduler can drift. These are debuggable in ways that hidden state errors aren't, but they exist. The trade is an opaque model-internal failure mode for an observable system-level one. Usually that's worth it, but it is a trade.
The alternative ("just put it all in the transcript") is cheaper to build and roughly free to operate, until it isn't. The accumulated cost of stale-snapshot recommendations and unverified transitions tends to show up as incidents, not as benchmark scores.
Where current frameworks land
As of May 2026 the ecosystem has moved closer to the runtime shape I'm describing than it was even a year ago. The right claim isn't that frameworks have no state or scheduling — several do. The gap I still see is narrower: temporal grounding is usually assembled from primitives, not exposed as one explicit expectation-driven abstraction.
LangGraph / LangSmith is closest to what I've been sketching. There's explicit graph state, durable execution, persistence, human-in-the-loop interrupts, memory, and scheduled graph runs via LangSmith cron jobs. That covers a lot of the substrate: state can persist, workflows can pause and resume, scheduled execution is available. What I'd still build myself is the semantic layer on top: an expectation registry that records "we expect X by T", marks it violated when T passes, and shifts the agent's policy accordingly. A cron job is a primitive for when to run; an expectation registry is the semantics of what we were waiting for and what changed when it didn't happen.
OpenAI has moved well beyond the old Assistants/Threads model. The Assistants API is deprecated (sunset August 2026); the current stack is the Responses and Conversations APIs, which store messages, tool calls, and tool outputs, with the Agents SDK giving application-owned orchestration, tools, approvals, sessions, and state. That's a much better substrate than a raw transcript. Temporal semantics still aren't automatic, though: a Conversation can persist a tool output, but it doesn't on its own decide that a two-hour-old deployment observation is stale or that a previous "wait for propagation" recommendation has expired.
Claude has similarly shifted toward managed agent infrastructure. Claude Managed Agents provide stateful sessions with persistent event history, environments, tools, and session events; the Agent SDK exposes a programmable agent loop with built-in context management. So "Claude tool use is just a stateless chat loop" is an outdated read. What I still don't see as the central abstraction is expectation tracking: deadlines, freshness boundaries, hypotheses that get invalidated by elapsed time.
AutoGen is harder to position now. The original project is in maintenance mode and Microsoft is directing new work toward Microsoft Agent Framework (1.0 shipped April 2026). AutoGen itself has memory/RAG, save/load state, an event-driven core; for current Microsoft agent work it's been superseded. I'd treat AutoGen as part of the lineage rather than the main current platform.
Vector-store memory is a different layer. It helps with recall — retrieving facts, preferences, prior observations. But recall isn't state tracking. A retrieved memory saying deployment started is still a fact about a past moment; it doesn't tell me whether that deployment is still running, succeeded, failed, or became irrelevant.
One thing worth noting: outside the LLM-agent world, durable workflow engines like Temporal and Inngest have had expectation-like semantics for years — timers, signals, conditional waits, deadline handlers are core primitives, not application-level patterns. Some of what I'm describing isn't unknown territory; it's a known pattern from workflow orchestration that hasn't fully migrated into LLM-agent frameworks as a first-class concept.
So the ecosystem isn't missing all the pieces — durable execution, persisted state, events, tools, sessions, scheduled runs, traces, memory are all available now. What I don't see consolidated is the composition: clock, event log, state reducer, freshness metadata, expectations, and scheduled observations treated as one temporal grounding layer.
That layer still looks like something I'd build explicitly for a long-running agent.
How I'd measure whether this works
The evaluation I see most often ("can the model talk about time?") doesn't capture the actual capability. The version that does:
Hold the conversation text constant. Vary only the elapsed time between messages. Does the recommendation change appropriately?
Counterfactual elapsed-time evaluation. The dataset would be conversations where:
- Surface text is identical or near-identical across variants
- The temporal gap between key messages varies (30 seconds / 10 minutes / 2 hours / 48 hours)
- The correct action is qualitatively different across variants (wait → diagnose → escalate)
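A sketch of how such variants could be generated from one pair of messages. The four gaps are the ones listed above; the thresholds that map a gap to wait, diagnose, or escalate are hypothetical parameters that would have to be set per task, and in a hand-built dataset the expected action would be a human label, not a formula.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GAPS = [
    timedelta(seconds=30),
    timedelta(minutes=10),
    timedelta(hours=2),
    timedelta(hours=48),
]

@dataclass(frozen=True)
class Variant:
    messages: tuple  # ((timestamp, text), (timestamp, text))
    gap: timedelta
    expected_action: str

def expected_action(gap: timedelta, settling: timedelta, escalate_after: timedelta) -> str:
    """Hypothetical labeling rule: within the settling window, waiting is fine;
    past it, diagnose; far past it, escalate."""
    if gap <= settling:
        return "wait"
    if gap < escalate_after:
        return "diagnose"
    return "escalate"

def make_variants(first: str, second: str, start: datetime,
                  settling: timedelta, escalate_after: timedelta) -> list[Variant]:
    """Same text in every variant; only the second message's timestamp moves."""
    return [
        Variant(
            messages=((start, first), (start + gap, second)),
            gap=gap,
            expected_action=expected_action(gap, settling, escalate_after),
        )
        for gap in GAPS
    ]
```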
Then compare configurations across this dataset:
- No timestamps (baseline)
- Timestamps in messages
- Explicit elapsed-time summary in system prompt
- Structured state object injected
- Structured state + expectation registry
- Full runtime with scheduled observations and verified transitions
Metrics, each one targeting one failure mode from the catalog above:
- False-wait rate. How often the model recommends "wait" or "give it time" when elapsed time has already exceeded the relevant settling window. Targets adjacency and missing-expectation.
- Temporal sensitivity. Variance of recommendation across temporal variants of the same text. High sensitivity means the system is using elapsed time as a decision variable. Low means it isn't.
- Stale-state recommendation rate. How often recommendations reference a state variable without an accompanying observation step.
- State consistency. Across a multi-turn run, does the model's belief about completed steps stay consistent? (Diff the "what we've established so far" summaries across turns.)
- Unverified-transition rate. How often state transitions are accepted from user reports without tool verification.
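Two of these metrics, sketched under the assumption that each model response has already been classified into a coarse action label such as "wait", "diagnose", or "escalate". Since actions are categorical, I use a stand-in for variance: the share of possible distinct actions actually observed across the variants. That proxy and the dict field names are my assumptions.

```python
def false_wait_rate(results: list[dict]) -> float:
    """results: dicts with 'action', 'elapsed_s', and 'settling_s'.

    Among cases where elapsed time exceeds the settling window, the
    fraction in which the model still recommended waiting.
    """
    past_window = [r for r in results if r["elapsed_s"] > r["settling_s"]]
    if not past_window:
        return 0.0
    return sum(r["action"] == "wait" for r in past_window) / len(past_window)

def temporal_sensitivity(actions: list[str]) -> float:
    """actions: one label per temporal variant of the same conversation text.

    0.0 means the same action for every gap; 1.0 means every variant got
    a different action. A proxy for variance over categorical outcomes.
    """
    if len(actions) < 2:
        return 0.0
    return (len(set(actions)) - 1) / (len(actions) - 1)
```

Sensitivity alone doesn't say the changes are correct, only that elapsed time is influencing the output; it needs to be read alongside accuracy against the expected action per variant.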
For a starting benchmark I'd hand-build maybe 100 conversations across 4–5 task categories (deployments, troubleshooting, monitoring, scheduling, long-form research), each with 3–4 temporal variants. Small enough to build in a week, large enough to discriminate between configurations. Running it across a few model + scaffold combinations would give me an actual measurement of where each architecture lands, which I currently don't have.
The nginx case, with a runtime
A weak assistant sees:
User changed config. User restarted service. User says still not working.
A temporally grounded runtime supplies:
current_time: 11:10
events:
  - 10:03 config_changed (user_report)
  - 10:04 service_restarted (user_report, unverified)
  - 11:09 external_check_failed (user_report)
derived_state:
  elapsed_since_restart_minutes: 66
  external_availability: failing
  waiting_hypothesis_valid: false
  reload_verified: false
violated_expectations:
  - site_should_recover_after_nginx_reload (expected by 10:09)
Given that, the response writes itself:
An hour after the reload, this isn't a propagation-delay scenario. Two things to check before going further: (1) whether nginx actually loaded the intended config (`nginx -T | grep server_name` and `journalctl -u nginx --since "1 hour ago"`), and (2) whether the failure is at DNS, TCP, TLS, nginx routing, or the upstream. `curl -v http://localhost` from the server and `curl -v http://your-domain.com` from outside; the difference between them localizes the failure.
The model didn't become smarter. The system gave it an operational now.
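The derived_state fields in that payload could come from a small reducer over the event list. This is a sketch for this one case: the `Event` shape, the five-minute settling window, and the rule for `waiting_hypothesis_valid` are assumptions, and a real reducer would track more fields (config changes, syntax validation, and so on).

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str               # e.g. "service_restarted"
    source: str             # "user_report" or "tool_observed"
    verified: bool = False  # True only if confirmed by a tool observation

def derive_state(events: list[Event], now: datetime, settle_minutes: int = 5) -> dict:
    """Fold the event log into the current task state."""
    state = {
        "elapsed_since_restart_minutes": None,
        "external_availability": "unknown",
        "reload_verified": False,
        "waiting_hypothesis_valid": False,
    }
    restart_at = None
    for ev in sorted(events, key=lambda e: e.at):
        if ev.kind == "service_restarted":
            restart_at = ev.at
            state["reload_verified"] = ev.verified
        elif ev.kind == "external_check_failed":
            state["external_availability"] = "failing"
        elif ev.kind == "external_check_passed":
            state["external_availability"] = "ok"
    if restart_at is not None:
        elapsed = int((now - restart_at).total_seconds() // 60)
        state["elapsed_since_restart_minutes"] = elapsed
        # "Give it time" only remains plausible inside the settling window.
        state["waiting_hypothesis_valid"] = elapsed <= settle_minutes
    return state
```

Fed the three events above with the same current time, this returns a failing external check, an unverified reload, and an invalid waiting hypothesis, with the elapsed minutes computed from the clock, not estimated by the model.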
Where I've landed
Short-horizon LLM use is mostly: given this snapshot, what's the best response? LLMs are excellent at that, and the temporal grounding problem doesn't really bite.
Long-running tasks ask a different question: given the current state of a changing world, what happened since the last observation, which expectations are now violated, which assumptions are stale, what action should happen next? That second question isn't a harder prompt. It's a different system design problem. Treating it as the first one — by trusting longer context, more memory, better instructions — gives me agents that are locally smart and temporally unreliable. They handle the next message well and the next hour poorly.
The framing I've landed on: context describes the past, state represents the present, expectations make the future operational. A stateless LLM gives me the first. The agents that hold up over hours are the ones where someone built the other two.
Whether that decomposition keeps holding as I see more systems — I don't know yet. It's been useful so far. The nginx case is small. The class of failures it points to doesn't seem to be.