OpenClaw's Session-Migration Bug Is the Kind of Data-Loss Edge Case Agents Cannot Afford

Anatoliy Kolodkin

10 Jun 2026 • 4 min read

There is a specific kind of bug that lives in the gap between "startup should recover gracefully" and "don't destroy evidence." It shows up in migration paths, recovery routines, and the code that runs exactly once, under pressure, when something has already gone wrong. OpenClaw's issue #88017 is one of those bugs — a P0 data-loss defect in the session migration path where a corrupt target file does not trigger a warning or a preserve-both strategy. It triggers an overwrite.

The mechanism is deterministic and surgical. OpenClaw maintains session state in JSON-backed files. When the system detects a legacy session format, it runs a migration: read the legacy store, read the target store, merge, write back. The bug lives in what happens when the target file exists but cannot be parsed because it is corrupt, truncated, or contains trailing garbage from a previous write that did not complete cleanly. In that case, the migration code swallows the parse error, treats the unreadable target as an empty store, merges in the legacy sessions, and atomically overwrites the corrupt target. The result: the salvageable corrupt bytes and any target-only session records are destroyed in the same write. Whatever was in that target file before the migration ran is gone.
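
The failure mode can be reproduced in a few lines. The sketch below is not OpenClaw's code; it is a minimal Python illustration of the pattern described above, with hypothetical names (load_store_lossy, migrate_lossy), strict JSON instead of whatever parser OpenClaw uses, and an assumed merge rule in which target entries win on key collisions.

```python
import json
import os
import tempfile


def load_store_lossy(path):
    """Buggy pattern: a file that cannot be parsed is treated like a missing file."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        # ValueError covers json.JSONDecodeError; the parse error is swallowed here.
        return {}


def migrate_lossy(legacy_path, target_path):
    """Read legacy, read target, merge, then atomically overwrite the target."""
    legacy = load_store_lossy(legacy_path)
    target = load_store_lossy(target_path)  # corrupt target silently becomes {}
    merged = {**legacy, **target}           # assumption: target wins on collisions
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target_path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    os.replace(tmp, target_path)            # atomic rename over the corrupt bytes
    return merged


def demo():
    """Run the lossy migration against a target with trailing garbage."""
    with tempfile.TemporaryDirectory() as d:
        legacy = os.path.join(d, "legacy.json")
        target = os.path.join(d, "sessions.json")
        with open(legacy, "w", encoding="utf-8") as f:
            json.dump({"a": {"turns": 3}}, f)
        with open(target, "w", encoding="utf-8") as f:
            f.write('{"b": {"turns": 9}} trailing-garbage')
        merged = migrate_lossy(legacy, target)
        with open(target, encoding="utf-8") as f:
            after = f.read()
        return merged, after
```

In the demo, session "b" was fully recoverable by hand from the damaged file, yet after the migration the target contains only session "a" and nothing signals that anything was lost.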

The aftermath is worse than the overwrite itself. Because the target file now holds a valid merged store, the data loss is silent: nothing fails to parse on the next startup, and no subsequent startup will retry the migration. The system reports success. The session history is simply shorter than it was five minutes ago.

PR #88018 fixed this with a small, principled change: when the target exists but cannot be parsed, skip saving, warn, and preserve both the corrupt target and the legacy source. The patch is 56 lines across two files. The policy it encodes is not complicated. But it represents the difference between a system that treats unreadable state as junk and a system that treats it as evidence.
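
The policy is easy to state in code. What follows is a minimal Python sketch of "skip saving, warn, preserve both", not the actual patch: the names UnreadableStore, load_store and migrate_preserving are hypothetical, the stores are assumed to be strict JSON objects, and the merge rule (target wins on collisions) is an assumption.

```python
import json
import logging
import os
import tempfile

log = logging.getLogger("session_migration_sketch")


class UnreadableStore(Exception):
    """Raised when a store file exists but cannot be parsed."""


def load_store(path):
    """Return {} only for a missing file; raise if the file exists but is unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # genuinely absent: safe to treat as empty
    except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError
        raise UnreadableStore(path) from exc


def migrate_preserving(legacy_path, target_path):
    """Merge legacy into target. Returns True only if the target was written."""
    legacy = load_store(legacy_path)
    try:
        target = load_store(target_path)
    except UnreadableStore:
        # Skip the save, warn, and leave both files exactly as they are.
        log.warning("target %s exists but is unreadable; skipping save", target_path)
        return False
    merged = {**legacy, **target}  # assumption: target wins on collisions
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target_path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    os.replace(tmp, target_path)
    return True
```

The key design choice is that "missing" and "unreadable" are distinct outcomes of the load step, so the caller cannot confuse them by accident.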

The real problem is not the JSON parser

It would be easy to read this as a bug about JSON5 parsing — an edge case in a file format library. That reading would be wrong. The real problem is the assumption baked into the migration path: that a file which cannot be read is equivalent to a file that does not exist. That assumption is understandable in the context where most migrations run — cold start, doctor checks, first run after an upgrade. In those contexts, you want to keep things moving. Treating an unreadable file as empty is a reasonable heuristic for a fresh install.

But session migration runs in a different context. It runs when there is already legacy state to migrate, which means there is already a history of sessions. That history may include state that exists nowhere else — compaction lineages, delivery receipts, replay metadata, tool-call records, and the breadcrumbs a human needs to understand what the agent was doing and why. For an agent platform, session files are not cache fluff. They are the operational record of a workload that may have taken hours to produce.

The heuristic that treats parse failure as emptiness works fine until it destroys the only copy of something you actually needed to keep.

Why the fix is architectural, not just a patch

The fix in #88018 encodes a simple invariant: if the target exists, it gets a chance to be read before it gets overwritten. That sounds trivial, but it runs counter to a lot of migration code that prioritizes forward progress over preservation. The change is not just "add a warning." It is "change the default so that a parse failure blocks the overwrite path and leaves both files intact." That is a different contract with the operator — one that says "I will not discard your data even if I cannot read it" rather than "I will try my best to keep things moving."

The ClawSweeper review on the PR caught an important nuance in the first pass. The initial patch preserved the target against overwrite but still reported "Migrated latest direct-chat session" as a change entry even when the skip path had not actually saved anything. That meant the operator might see a migration success message while the actual session file remained untouched. Commit fe6093be6a moved the change entry into the save-committed block, so the success message only appears when something was actually written. That distinction matters for post-incident review. If you are debugging why a session disappeared, the last thing you want to see is a misleading success confirmation.
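
The ordering issue is small enough to show directly. This is a hedged Python sketch of the corrected control flow, not the code from the commit: run_migration, its parameters, and the warning text are hypothetical, and an unparseable target is represented here as None.

```python
def run_migration(legacy, target, save):
    """Return (changes, warnings) for one migration attempt.

    legacy: dict of legacy sessions.
    target: dict of target sessions, or None if the target exists but is unparseable.
    save:   callable that persists the merged store.
    """
    changes, warnings = [], []
    if target is None:
        # Skip path: nothing is written, so nothing is reported as migrated.
        warnings.append("Target session store is unreadable; left untouched")
    else:
        save({**legacy, **target})
        # Recorded only after save() has returned, inside the save-committed block.
        changes.append("Migrated latest direct-chat session")
    return changes, warnings
```

Appending the change entry before the branch, or outside it, would reproduce the misleading report: a success line in the output with no corresponding write.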

What operators should actually do

The practical guidance is straightforward and not specific to OpenClaw. First, back up your session state directory before any major upgrade or doctor run. Session files live under agents/{id}/sessions/ and the migration path touches both legacy and target stores. If you run openclaw gateway doctor or upgrade across a version boundary, a quick cp -r of that directory is cheap insurance against a silent overwrite.
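
For operators who prefer a script to an ad hoc cp -r, a timestamped copy is enough. The Python sketch below assumes nothing about OpenClaw beyond the directory layout mentioned above; backup_sessions is a hypothetical helper, and the caller supplies the actual paths.

```python
import shutil
import time
from pathlib import Path


def backup_sessions(sessions_dir, backup_root):
    """Copy a sessions directory to a new timestamped directory and return its path."""
    src = Path(sessions_dir)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree creates missing parent directories and refuses to overwrite an
    # existing destination, so an earlier backup is never clobbered.
    shutil.copytree(src, dest)
    return dest
```

Run it against each agent's sessions directory before the upgrade or doctor run, and keep the copy until you have confirmed the expected sessions are still present afterwards.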

Second, treat parse failures in session stores as recoverable evidence, not junk. If OpenClaw's doctor path or startup logs report a session store as unreadable, stop and inspect before restarting repeatedly. On a version that still has this bug, a restart that reruns the migration can replace the unreadable but possibly salvageable file with a merge that treated it as empty. The issue was fixed in the June beta train, but operators on earlier releases should verify their current version.
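
A read-only scan can tell you which files would fail to parse before any migration touches them. The Python sketch below is an illustration under stated assumptions: find_unreadable is a hypothetical helper, the "*.json" pattern is a guess at the file naming, and Python's json module accepts strict JSON only, so a store that legitimately uses JSON5 features would be flagged as a false positive.

```python
import json
from pathlib import Path


def find_unreadable(sessions_dir, pattern="*.json"):
    """Return (path, error message) for each matching file that fails to parse.

    The scan only reads files; it never writes, renames, or deletes anything.
    """
    bad = []
    for path in sorted(Path(sessions_dir).rglob(pattern)):
        try:
            json.loads(path.read_bytes())
        except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError
            bad.append((path, str(exc)))
    return bad
```

Any file this reports is worth copying somewhere safe before the next restart, since the damaged bytes may still contain recoverable session records.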

Third, watch for the symptom pattern: sessions that were present before an upgrade are absent after, with no error message. Unlike a crash or a clear failure, this presents as normal operation — the system started, the agent is responsive, everything looks fine. Only when you go looking for a specific prior conversation do you discover it is gone.

The broader lesson for agent platforms

Agent runtimes create long-lived distributed state out of things that used to be one-shot operations. A conversation with a language model in 2022 was ephemeral — you got a response and that was the transaction. A conversation with an agent in 2026 is a workload: it compacts, it forks, it spawns subagents, it records tool calls, it persists memory across resets, and it produces session state that may be the only record of what actually happened during a multi-hour autonomous run. When that state lives in files that get migrated, compacted, and rotated as part of normal operation, every migration path becomes a data-loss risk that did not exist in the simpler era.

The fix for #88017 is small. The lesson is large: agent systems need to preserve damaged state until a human explicitly discards it, because damaged state may still be the only evidence left.

Sources: GitHub issue #88017, PR #88018, OpenClaw v2026.5.28-beta.2 release notes
