OpenClaw’s sessions.reset Race Shows Why Agent Session Identity Has to Be More Than a Key String

The most dangerous session bugs are the ones that look like a healthy system. Slack is connected. The gateway is alive. The UI says there is no active run. Then the fresh OpenClaw session row is marked running, or worse, failed, by an error from a dead run that should no longer own anything.

That is the bug fixed in PR #88625. After sessions.reset rotated a channel session to a fresh sessionId, stale lifecycle start, end, or error events from the old in-flight run could still mutate the new row because the persistence path was keyed by sessionKey. The result is not just cosmetic state drift. It is a trust failure in the thing operators use to decide whether an agent lane is safe to continue.

The linked issue, #88538, was filed on May 31 against OpenClaw 2026.5.28 on macOS arm64, using Slack socket mode, OpenAI Responses, gpt-5.5, and a thinking default of xhigh. The reported sequence is exactly the sort of runtime edge case that hides in production: a Slack channel session hit context overflow, sessions.reset returned a new session id, the fresh row showed status="running" with hasActiveRun=false, and then an old run emitted a terminal error that overwrote the new row to status="failed".

The state snapshots included totalTokens: 19644, a stale runtimeMs: 297399, and an old context-overflow diagnostic referencing the previous session file. That is the smoking gun: the failure belonged to the old identity, but the platform projected it onto the new one.

A session key is a route, not an identity

The architectural lesson is simple enough to fit in a code review comment and important enough to deserve a post: sessionKey and sessionId are not interchangeable. A session key is a stable routing concept — “the Slack channel conversation,” “the Telegram DM,” “this logical lane.” A session id is identity — “this specific run-bound session row.” When reset, compaction, replay, restore, or migration rotates the identity under a stable route, late events that only know the route become ambiguous.
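The ambiguity is easy to reproduce in miniature. The sketch below is not OpenClaw code: the store layout, the reset and apply_by_key helpers, and the "slack:C123" key are all hypothetical, chosen only to show what happens when a late event knows the route but not the identity.

```python
from dataclasses import dataclass


@dataclass
class SessionRow:
    session_id: str
    status: str = "idle"


# Hypothetical store: exactly one current row per routing key.
rows: dict[str, SessionRow] = {
    "slack:C123": SessionRow(session_id="old-1", status="running"),
}


def reset(session_key: str, new_id: str) -> None:
    """Rotate the identity while the route stays the same."""
    rows[session_key] = SessionRow(session_id=new_id)


def apply_by_key(session_key: str, status: str) -> None:
    """Bug shape: the event only names the route, so it hits whichever row is current."""
    rows[session_key].status = status


reset("slack:C123", "new-2")
apply_by_key("slack:C123", "failed")  # late terminal error from the old run

print(rows["slack:C123"])
```

After the last call, the row with session_id "new-2" carries a failure that belongs to "old-1", which is the same shape as the overwrite described in the issue.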

In a normal web app, this might produce a stale UI update. In an agent runtime, it can poison the recovery path. Operators do not experience “a key/id mismatch.” They experience an assistant that says the lane failed even though the live gateway has no active run, or a reset that did not reset, or a Slack thread that looks wedged for reasons nobody can reproduce. State bugs are bad. Plausible state bugs are worse.

PR #88625 applies the correct invariant: lifecycle events now carry the owning run sessionId, and stale persistence is rejected when the event identity no longer matches the current row. Crucially, the fix also covers sessions.changed snapshot projection, not only durable writes. That distinction matters. Users and operators do not care whether the bad status came from the database or a broadcast if the UI still shows the wrong status.
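As a rough illustration of that invariant, here is a minimal sketch of a stale guard. It assumes a simple in-memory store; the class names, the phase-to-status mapping (including the "done" status), and the broadcasts list are illustrative inventions rather than OpenClaw's actual types. The point is that one identity check gates both the durable write and the snapshot that reaches the UI.

```python
from dataclasses import dataclass


@dataclass
class SessionRow:
    session_id: str
    status: str = "idle"


@dataclass(frozen=True)
class LifecycleEvent:
    session_key: str  # the route
    session_id: str   # identity of the session the emitting run belongs to
    phase: str        # "start", "end", or "error"


# Hypothetical mapping from lifecycle phase to displayed status.
PHASE_TO_STATUS = {"start": "running", "end": "done", "error": "failed"}


class SessionStore:
    def __init__(self) -> None:
        self.rows: dict[str, SessionRow] = {}
        self.broadcasts: list[tuple[str, str, str]] = []  # stand-in for snapshot pushes

    def reset(self, session_key: str, new_id: str) -> None:
        self.rows[session_key] = SessionRow(session_id=new_id)

    def is_stale(self, event: LifecycleEvent) -> bool:
        row = self.rows.get(event.session_key)
        return row is None or row.session_id != event.session_id

    def handle(self, event: LifecycleEvent) -> bool:
        """Apply the event only if its session id still owns the row."""
        if self.is_stale(event):
            return False  # neither a write nor a snapshot is produced
        row = self.rows[event.session_key]
        row.status = PHASE_TO_STATUS[event.phase]
        self.broadcasts.append((event.session_key, row.session_id, row.status))
        return True


store = SessionStore()
store.reset("slack:C123", "old-1")
accepted_start = store.handle(LifecycleEvent("slack:C123", "old-1", "start"))
store.reset("slack:C123", "new-2")
accepted_late_error = store.handle(LifecycleEvent("slack:C123", "old-1", "error"))
```

Routing the write and the broadcast through the same check is a deliberate choice in this sketch: if the two paths each had their own guard, one of them could be forgotten, which is the sibling-path problem the review surfaced.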

The review trail is useful because it shows why first-order fixes are rarely enough in agent runtimes. PR #88583 carried the first version of the fix, but maintainer review found a missing sibling path: sessions.changed could still broadcast stale state even when persistence was skipped. A second finding involved preflight compaction: because run context was registered before compaction, a run could be stamped with the pre-compaction id, and the new stale guard would then drop its legitimate events. The replacement PR refreshes follow-up run-context registration after preflight compaction so real rotations still work.
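The compaction ordering problem can be sketched in a few lines. Everything here is hypothetical: the registry class, the preflight_compact stand-in (which simply pretends compaction always rotates the id), and the run and session names. It only illustrates why registration has to be refreshed once compaction has had a chance to rotate the identity.

```python
class RunContextRegistry:
    """Hypothetical map from run id to the session id that run belongs to."""

    def __init__(self) -> None:
        self._by_run: dict[str, str] = {}

    def register(self, run_id: str, session_id: str) -> None:
        self._by_run[run_id] = session_id

    def session_id_for(self, run_id: str) -> str:
        return self._by_run[run_id]


def preflight_compact(session_id: str) -> str:
    """Stand-in for compaction; assumed here to always rotate the session id."""
    return session_id + "-c1"


def start_followup_run(registry: RunContextRegistry, run_id: str, session_id: str) -> str:
    registry.register(run_id, session_id)  # early registration stamps the pre-compaction id
    rotated = preflight_compact(session_id)
    if rotated != session_id:
        # Refresh so events from this run carry the id the current row actually has.
        registry.register(run_id, rotated)
    return rotated


registry = RunContextRegistry()
current_id = start_followup_run(registry, "run-7", "s1")
```

Without the refresh, "run-7" would stay stamped with "s1" while the row moved on to "s1-c1", and a stale guard would reject the run's own events.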

Reset and compaction need provenance, not hope

This is where agent-session governance gets concrete. If a platform supports reset, compaction, branch/restore, replay, migration, and follow-up runs, it needs provenance fields that survive all of those transitions. At minimum, logs should include sessionKey, sessionId, run id, lifecycle source, compaction id or generation where applicable, current-row identity, and whether a write or projection was accepted or rejected as stale.
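One way to make those fields concrete is a structured audit record emitted for every lifecycle decision. The sketch below is an assumption about shape, not a log format OpenClaw defines: the field names, the "lifecycle.error" source label, and the "stale_session_id" reason string are all made up for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LifecycleAudit:
    session_key: str
    session_id: str                       # id carried by the event
    run_id: str
    source: str                           # which lifecycle path produced the event
    compaction_generation: Optional[int]  # None when no compaction is involved
    current_row_session_id: str           # id the row had when the decision was made
    target: str                           # "write" or "projection"
    accepted: bool

    def to_json(self) -> str:
        record = asdict(self)
        record["reason"] = None if self.accepted else "stale_session_id"
        return json.dumps(record, sort_keys=True)


audit = LifecycleAudit(
    session_key="slack:C123",
    session_id="old-1",
    run_id="run-41",
    source="lifecycle.error",
    compaction_generation=None,
    current_row_session_id="new-2",
    target="projection",
    accepted=False,
)
line = audit.to_json()
```

With a record like this, the question "which run emitted the event that changed this row" becomes a log query instead of a timestamp comparison across session files.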

Without that, debugging turns into archaeology. You inspect JSON files, compare timestamps, search for old overflow diagnostics, and try to infer which run emitted which event. That is not observability. That is a séance with better indentation.

For practitioners running OpenClaw, the immediate action is to upgrade once this fix lands in your release channel and watch for mismatches between displayed session status and active-run state. If hasActiveRun=false but a lane shows running or failed immediately after reset, capture the session id before and after reset, the run id, and the terminal error text. Those fields are the difference between a useful bug report and another “Slack got stuck” anecdote.

For platform builders, this is a reminder to test late events against rotated state. Unit tests that only check the happy path will miss the real bug. You need fixtures where an old run emits start, progress, error, and end after a reset; where compaction rotates the id during preflight; where snapshot broadcasts are attempted after persistence is correctly skipped; and where a legitimate follow-up run should still be accepted after rotation.
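A few of those fixtures can be sketched as plain test functions. The runtime below is a deliberately tiny stand-in with a stale guard, not OpenClaw's implementation, and the helper names, the "lane" key, and the status strings are hypothetical. The compaction-during-preflight case is omitted because it needs a run-context registry in addition to the store.

```python
# Minimal stand-in for a runtime with a stale guard (hypothetical, not OpenClaw code).
rows: dict[str, dict[str, str]] = {}          # session_key -> current row
broadcasts: list[tuple[str, str, str]] = []   # snapshots that would reach the UI


def reset(key: str, new_id: str) -> None:
    rows[key] = {"session_id": new_id, "status": "idle"}


def emit(key: str, session_id: str, status: str) -> bool:
    """Apply an event only if it comes from the session that owns the row."""
    if rows[key]["session_id"] != session_id:
        return False
    rows[key]["status"] = status
    broadcasts.append((key, session_id, status))
    return True


def test_late_events_from_old_run_are_rejected() -> None:
    reset("lane", "old")
    reset("lane", "new")
    for status in ("running", "failed", "done"):
        assert emit("lane", "old", status) is False
    assert rows["lane"] == {"session_id": "new", "status": "idle"}


def test_no_snapshot_is_broadcast_for_stale_events() -> None:
    broadcasts.clear()
    reset("lane", "old")
    reset("lane", "new")
    emit("lane", "old", "failed")
    assert broadcasts == []


def test_followup_run_on_rotated_session_is_accepted() -> None:
    reset("lane", "old")
    reset("lane", "new")
    assert emit("lane", "new", "running") is True
    assert rows["lane"]["status"] == "running"
```

The third test matters as much as the first two: a guard that rejects everything after a reset would pass the stale-event tests while breaking every legitimate follow-up run.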

The deeper point is not OpenClaw-specific. Every agent platform that maps long-lived human channels onto mutable agent sessions will hit this class of bug. Humans think in conversations. Runtimes need identities. If the system cannot prove which run owns an event, it cannot safely reset, compact, recover, or tell the truth about what happened.

LGTM take: session governance is identity governance. A stable channel key is convenient, but if it becomes the authority for lifecycle writes after reset, the runtime has already lost the plot.

Sources: OpenClaw PR #88625, issue #88538, PR #88583, OpenClaw v2026.5.30-beta.1 release