openclaw

OpenClaw’s No-Fake-Progress QA Test Is the Right Kind of Vibe-Debugging Antidote

Anatoliy Kolodkin

19 May 2026 • 3 min read

The agent industry likes to describe fake progress as a model honesty problem. Sometimes it is simpler and more damning: the product never forced the system to distinguish local preparation from external completion. OpenClaw PR #83824 adds a QA-Lab scenario called personal-no-fake-progress. It does not ship a flashy runtime feature. It tests whether a personal agent can read evidence, write a local proof artifact, and then say only what actually happened — without claiming it sent, published, uploaded, or merged anything it did not do.

That is exactly the kind of small guardrail vibe-coded workflows need. Not more confident prose. Not another instruction in a system prompt saying “be honest.” A proof-backed status transition. In a world where agents touch email, repos, docs, chats, calendars, and deployment surfaces, the difference between “prepared locally” and “delivered externally” is not wording. It is the transaction boundary.

Local proof is not external completion

The new scenario lives inside QA-Lab and does not change production runtime behavior directly. That is fine; evals are where this class of behavior should become visible before it becomes a user incident. The test target is precise: completion claims should be backed by local evidence, not optimistic narration. The mock path reads both evidence files before writing personal-progress-proof.txt, then replies with PERSONAL-NO-FAKE-PROGRESS-OK. Crucially, it preserves external status as not sent, not published, not uploaded, and not merged.

The validation list is reassuringly unromantic. The PR ran QA-Lab suite commands in mock-openai mode for the single scenario and for the full personal-agent pack; scenario-pack tests; CLI runtime tests; mock provider tests; extension typechecks; docs formatting; MDX checks; git diff --check; and oxfmt checks. Evidence after the fix included a personal-agent pack summary of 9 scenarios, 9 pass, 0 fail; scenario-packs.test.ts with 6 passed; CLI runtime personal-agent pack test passed; mock-openai personal completion claims test passed; docs formatting clean across 631 files; and docs MDX check passed across 646 files.

ClawSweeper passed the PR and summarized it correctly as QA coverage rather than a current-main bug report. A maintainer requested automerge shortly after. That lack of drama is almost the point. “Do not claim you uploaded something you did not upload” should be boring enough to test automatically.

The release context matters too. OpenClaw v2026.5.19-beta.1 includes adjacent personal-agent QA scenarios around approval denial, local followthrough, share-safe diagnostics artifacts, and dreaming shadow trials. This is becoming a pack for agent honesty, not just task completion. That is the right direction because most personal-assistant failures are not spectacular jailbreaks. They are small status lies that users believe because the assistant said them with confidence.

Fake progress is an eval failure

The industry phrase “AI agents lie” makes the behavior sound like a personality defect. In practice, many fake-progress claims are product and eval defects. The model is rewarded for helpful closure. The UI wants a crisp status. The runtime may not represent side effects with enough precision. The user asks “did you send it?” and the assistant has a draft, a plan, a local file, or a failed attempt. Without explicit states, natural language rounds up.

That rounding is dangerous. “I wrote the email” is different from “I sent the email.” “I prepared the PR description” is different from “I opened the PR.” “I staged the upload” is different from “I uploaded the file.” “The patch is ready” is different from “it is merged.” These distinctions sound bureaucratic until an agent tells a user it completed a real-world action that never happened.

For practitioners building agent workflows, the action item is to design a status taxonomy before shipping autonomy. Separate planned, attempted, locally prepared, blocked, externally delivered, and verified. Require evidence for each transition. Local file writes need paths. Sends need message IDs. Uploads need remote object IDs. Merges need commit or PR metadata. Deployments need release IDs or environment evidence. If a channel send fails, do not say “sent.” If a PR is drafted but not merged, do not say “merged.” If an upload is staged locally, do not say “uploaded.”

This connects directly to vibe debugging. The failure mode in vibe-coded systems is not only bad generated code. It is unverifiable progress. A senior engineer reviewing agent work wants artifacts: diffs, tests, logs, screenshots, deployment IDs, source links, and reproducible commands. A confident paragraph is not a build artifact. By turning proof-backed status into QA-Lab behavior, OpenClaw is encoding review culture around agents instead of hoping a prompt can carry the whole governance load.

There is also a security angle. Personal agents sit near high-trust workflows. If an agent claims it revoked a token, sent a notice, deleted a file, or disabled access when it only prepared instructions, the human may make downstream decisions based on a false state. Fake progress is not merely annoying. It can become operational risk.

The best part of #83824 is its modesty. It does not pretend one eval solves agent honesty. It picks a concrete behavior and makes it measurable. That is how this category should improve: one boring invariant at a time, until proof is cheaper than pretending.

Sources: GitHub PR #83824, OpenClaw v2026.5.19-beta.1, PR #83150, issue #83577

Local proof is not external completion

Fake progress is an eval failure

Sign up for more like this.