openclaw

OpenClaw's Session-Migration Bug Is the Kind of Data-Loss Edge Case Agents Cannot Afford

Anatoliy Kolodkin

29 May 2026 • 4 min read

Data-loss bugs in agent platforms deserve less forgiveness than ordinary migration bugs. A broken migration in a CRUD app can lose preferences or force a restore from backup. A broken migration in an agent runtime can erase the only useful record of what the agent did, which channel it was acting in, which tools it used, and what state a human might need to reconstruct a failed run. That is why OpenClaw issue #88017 is more important than its small patch surface suggests.

The report describes a P0 edge case in OpenClaw’s legacy session migration path. If the target agents/{id}/sessions/sessions.json file already exists but is corrupt, the migration reads it as an empty store, merges in readable legacy sessions, and atomically overwrites the corrupt target. That destroys two things at once: the corrupt-but-potentially-salvageable bytes, and any target-only session records that existed before the migration. In other words, startup recovery can become data destruction.

Parsing failure is not emptiness

The root cause is a classic optimistic-recovery mistake. OpenClaw’s readSessionStoreJson5 path can swallow a parse error and return a shape equivalent to { store: {}, ok: false }. The migration’s save gate was satisfied by the legacy source being readable, not by the target being readable. That meant a valid legacy store could authorize writing over an unreadable target. The code did not set out to delete data; it treated “I cannot parse this” too much like “there is nothing here.”

That distinction matters. A corrupt state file is not empty state. It is evidence. It may contain valid records before the bad trailing bytes. It may contain target-only sessions created after the legacy path diverged. It may contain the last visible trace of a session that failed during a crash, restart, interrupted write, or manual operator mistake. The first rule of recovery code should be: preserve evidence until a human or a clearly validated repair path discards it.

The deterministic reproduction in the issue is pure filesystem work: create a valid legacy sessions/sessions.json, create a corrupt target agents/{id}/sessions/sessions.json with target-only data plus trailing garbage, run the legacy state migration detection and migration functions, and inspect the target bytes. Before the fix, the target is overwritten. That is exactly the kind of bug automated tests should catch because there is no flaky external service involved. It is policy encoded as file behavior.

The patch is small because the policy is simple

PR #88018 adds the right guard: when the target exists but cannot be parsed, skip saving, warn, and preserve both the corrupt target and the legacy source. The initial patch touched src/infra/state-migrations.ts and src/infra/state-migrations.test.ts, adding a targetReadable gate and a regression test. ClawSweeper’s review reproduced the source failure and confirmed the fix preserved happy-path behavior, then requested a refinement because a “Migrated latest direct-chat session” change entry could still be reported even when the corrupt-target skip path did not actually save.

That review nit is not cosmetic. Recovery tools must not lie. If a migration warning says “skipped to preserve corrupt target,” but a change entry says “migrated latest session,” the operator receives contradictory evidence. The author addressed that in commit fe6093be6a by moving the change entry into the save-committed block. That is the kind of detail that separates a real recovery fix from a patch that merely avoids the worst overwrite.

Issue labels tell the story plainly: P0, impact:data-loss, impact:session-state, clawsweeper:source-repro, clawsweeper:linked-pr-open, and the project’s own high-signal “diamond lobster” rating. It was filed at 2026-05-29T12:41:16Z, and the linked PR appeared almost immediately. That speed is good. The more important lesson is why the severity is justified. Sessions are not disposable chat transcripts anymore; they are the audit log, control surface, and recovery substrate for agent work.

Operators should treat session files like operational state

For teams running OpenClaw, the practical guidance is boring and non-negotiable. Back up state before major upgrades. If startup or doctor output reports session migration warnings, stop and inspect the relevant files before repeatedly restarting. Do not assume a JSON parse failure means the file is worthless. Keep the corrupt bytes somewhere, because a later parser, manual repair, or partial extraction may recover the records you care about.

For platform builders, this is a test-design lesson. Every migration should be tested with each input corrupt independently: legacy unreadable, target unreadable, both unreadable, target absent, target valid with target-only keys, and target valid with overlapping keys. The dangerous case is usually not total failure. It is asymmetric success: one side parses, another side does not, and the code decides that the successful side is enough authority to write.

This bug also sits next to another same-morning recovery report, issue #88021, where recovered final replies are logged but not delivered to the original channel because deliver: false is hard-coded. Different failure, same product lesson. Recovery semantics are user-visible. If the runtime says it recovered but the user never receives the answer, the recovery path failed. If the runtime says it migrated but destroyed target-only state, the migration failed.

Agent systems need a more conservative default than ordinary apps because their state has more meanings. A session file may be memory, audit trail, pending work queue, replay input, delivery tracker, or legal evidence depending on the deployment. Treating damaged state as disposable cache is the wrong mental model. Damaged state is still state.

LGTM take: this is not “one JSON5 parser edge case.” It is a recovery-policy story. Agent systems need to preserve damaged state until a human explicitly discards it, because damaged state may be the only evidence left.

Sources: OpenClaw issue #88017, OpenClaw PR #88018, OpenClaw v2026.5.28-beta.2, OpenClaw issue #88021

Parsing failure is not emptiness

The patch is small because the policy is simple

Operators should treat session files like operational state

Sign up for more like this.