OpenClaw’s Pi/Codex Runtime RFC Says Agent Platforms Are Finally Treating Policy Drift Like a Real Systems Problem

The easiest way for an agent platform to lose credibility is not to crash. It is to behave differently depending on which runtime happened to answer the call. Same user prompt, same surface, same platform logo, subtly different tool behavior, auth forwarding, fallback handling, or delivery semantics. That is how “one platform” quietly becomes a bundle of adjacent products.

OpenClaw’s new Pi/Codex runtime RFC and accompanying contract suite are interesting because they treat that drift as a first-order systems problem instead of a cleanup chore. The headline artifact is PR #71096, opened April 24 with 34 files changed, nearly 4,000 added lines, and 44 commits. The linked RFC, #71004, says the quiet part plainly: the missing boundary in OpenClaw is not the harness registry itself, but ownership of runtime policy.

That distinction matters. The harness abstraction is not the villain here. The problem is that policy is scattered across too many seams: tools, auth and profile resolution, prompt overlays, schema normalization, transcript repair, delivery routing, fallback classification, transport parameters, and observability all live in different combinations of Pi runner code, Codex app-server glue, transport layers, plugins, and auth helpers. As Codex takes over more execution paths, it becomes easier for the system to preserve capability while drifting on behavior. That is exactly the sort of failure that only shows up after users start trusting the platform.

The RFC’s answer is not “rewrite everything.” In fact, one of the better parts of the document is its restraint. It explicitly says this is not a mandate to redesign the harness SPI or force a giant runtime rewrite. The first move is contract-first: lock expected behavior in tests, then decide whether to migrate shared policy into a prepared runtime plan later. That is disciplined engineering. It is also a refreshing contrast to a category that often tries to fix architectural ambiguity with another abstraction layer before behavior is even pinned down.
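Contract-first has a simple shape in practice: expected behavior is written down once, and then every runtime is run against it. The TypeScript sketch below shows that shape. Every name in it (AdapterUnderTest, Contract, runContracts, the two stub adapters) is hypothetical and does not come from OpenClaw's code. The delivery rule, that a bare NO_REPLY final message suppresses delivery, is an assumed stand-in for whatever the real contract specifies.

```typescript
// Hypothetical contract-test sketch: AdapterUnderTest, Contract and
// runContracts are illustrative names, not OpenClaw APIs.

// What the platform observes after a runtime finishes one turn.
interface TurnObservation {
  delivered: boolean;           // did a message reach the user's channel?
  deliveredText: string | null; // what was sent, or null if suppressed
}

interface AdapterUnderTest {
  name: string;
  // Given the model's final text, report what the platform did with it.
  finishTurn(finalText: string): TurnObservation;
}

// A contract pins expected behavior independently of any runtime.
interface Contract {
  name: string;
  finalText: string;
  expected: TurnObservation;
}

function sameObservation(a: TurnObservation, b: TurnObservation): boolean {
  return a.delivered === b.delivered && a.deliveredText === b.deliveredText;
}

// Runs every contract against every adapter and returns "contract @ adapter"
// for each mismatch.
function runContracts(adapters: AdapterUnderTest[], contracts: Contract[]): string[] {
  const failures: string[] = [];
  for (const contract of contracts) {
    for (const adapter of adapters) {
      const observed = adapter.finishTurn(contract.finalText);
      if (!sameObservation(observed, contract.expected)) {
        failures.push(`${contract.name} @ ${adapter.name}`);
      }
    }
  }
  return failures;
}

// Assumed delivery rule for this sketch: a bare NO_REPLY suppresses delivery.
const piStub: AdapterUnderTest = {
  name: "pi-stub",
  finishTurn: (text) =>
    text.trim() === "NO_REPLY"
      ? { delivered: false, deliveredText: null }
      : { delivered: true, deliveredText: text },
};

// A deliberately drifted stub that delivers everything, including NO_REPLY.
const codexStub: AdapterUnderTest = {
  name: "codex-stub",
  finishTurn: (text) => ({ delivered: true, deliveredText: text }),
};

const contracts: Contract[] = [
  {
    name: "NO_REPLY suppresses delivery",
    finalText: "NO_REPLY",
    expected: { delivered: false, deliveredText: null },
  },
  {
    name: "normal text is delivered verbatim",
    finalText: "Done.",
    expected: { delivered: true, deliveredText: "Done." },
  },
];

const failures = runContracts([piStub, codexStub], contracts);
// failures → ["NO_REPLY suppresses delivery @ codex-stub"]
```

The design choice worth copying is that the contract owns the expected value. No adapter gets to define “correct” for itself, so a drifted runtime surfaces as a named test failure rather than a user report.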

The scope of the contract suite tells you where the maintainers think the real risk is. The Phase 1 PRs cover eight domains: dynamic tools, auth and profile forwarding, outcome and fallback classification, delivery and NO_REPLY semantics, transcript repair, prompt overlays, schema normalization, and transport params. That list is basically a map of all the ways an agent platform can lie to its operator while still technically “working.”

Take auth as one example. If one runtime preserves openai-codex/* profile forwarding on startup and resume while another leaks or drops it, the user experiences random provider breakage even though the platform claims a shared auth model. Or take transcript repair. If one path preserves structured and media context while another path repairs only text turns, the model can appear inconsistent when the real inconsistency lives in platform preprocessing. Or take fallback classification. A planning-only or reasoning-only terminal state may trigger recovery logic on one runtime and surface as a final answer on another. That is not just polish. That is product truth diverging by backend.
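The fallback case shows why a single owner helps. Suppose one platform-owned classifier decides how a turn ended. Pi and Codex can then diverge only in how they fill in a normalized state, which is a much smaller surface to test. The sketch below assumes a simple policy in which only non-blank, user-visible text counts as a final answer. The RFC does not state this rule; it is here for illustration, and TerminalState, Outcome and classifyTerminal are hypothetical names.

```typescript
// Hypothetical sketch: terminal-state classification owned by the platform.
// TerminalState, Outcome and classifyTerminal are illustrative names only.

// Normalized summary of how a turn ended. Each runtime adapter would only be
// responsible for filling this in from its own native events.
interface TerminalState {
  visibleText: string;   // text the user would actually see
  reasoningText: string; // reasoning content, never delivered as an answer
  planText: string;      // a plan the model produced without acting on it
}

type Outcome = "final_answer" | "needs_recovery";

// Assumed policy for this sketch: only non-blank visible text is a final answer.
// Planning-only, reasoning-only and empty endings all go to recovery, no matter
// which runtime produced them.
function classifyTerminal(state: TerminalState): Outcome {
  return state.visibleText.trim().length > 0 ? "final_answer" : "needs_recovery";
}

const fixtures: Array<[string, TerminalState]> = [
  ["answer", { visibleText: "Here is the fix.", reasoningText: "", planText: "" }],
  ["reasoning-only", { visibleText: "", reasoningText: "Check the cache first.", planText: "" }],
  ["planning-only", { visibleText: "  ", reasoningText: "", planText: "1. Read file. 2. Patch." }],
];

// Same fixtures, same classifier: the outcome cannot depend on the backend.
const outcomes = fixtures.map(([label, state]) => `${label}=${classifyTerminal(state)}`);
// outcomes → ["answer=final_answer", "reasoning-only=needs_recovery", "planning-only=needs_recovery"]
```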

The RFC cites recent evidence for exactly this kind of drift. Dynamic tool hook preservation had to be patched in Codex mode. GPT-5.4-related fixes kept landing across outcome handling, tool params, schema normalization, auth aliases, orphan turn repair, and follow-up delivery. Harness observability improved, but observability alone does not stop divergence. Those are not isolated bugs. They are symptoms of policy that was never fully owned in one place.

That is why the proposed AgentRuntimePlan idea is more important than the name suggests. The architecture sketch is basically a declaration that OpenClaw should decide policy once per user turn, then let Pi or Codex implement model-loop specifics without reinventing platform behavior. In the RFC’s framing, OpenClaw should own tool catalog behavior, auth and profile resolution, overlays, repair, delivery, fallback, schema normalization, transport defaults, and observability. Pi should own Pi session mechanics. Codex should own app-server startup, thread lifecycle, and model-loop details. Adapter selection should not mutate platform truth.
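A hypothetical TypeScript shape for that idea follows. Only the name AgentRuntimePlan comes from the RFC. The fields, buildRuntimePlan, the RuntimeAdapter interface and the profile id are assumptions chosen for illustration. In this sketch the plan is resolved once per turn and frozen, so an adapter can read platform policy but cannot rewrite it.

```typescript
// Hypothetical sketch. Only the name AgentRuntimePlan comes from the RFC;
// the fields, buildRuntimePlan and RuntimeAdapter are illustrative assumptions.

interface AgentRuntimePlan {
  readonly authProfile: string | null;
  readonly tools: readonly string[];
  readonly promptOverlays: readonly string[];
  readonly suppressToken: string; // delivery policy
  readonly transport: Readonly<{ timeoutMs: number }>;
}

interface TurnRequest {
  requestedProfile?: string;
  enabledTools: string[];
}

// Platform-owned: policy is resolved exactly once per user turn, then frozen
// so no adapter can mutate it at runtime.
function buildRuntimePlan(req: TurnRequest): AgentRuntimePlan {
  return Object.freeze({
    authProfile: req.requestedProfile ?? null,
    tools: Object.freeze([...req.enabledTools].sort()),
    promptOverlays: Object.freeze(["platform-base"]),
    suppressToken: "NO_REPLY",
    transport: Object.freeze({ timeoutMs: 60000 }),
  });
}

// Adapters read the plan and own only their model-loop mechanics.
interface RuntimeAdapter {
  readonly name: string;
  describeStart(plan: AgentRuntimePlan): string;
}

const pi: RuntimeAdapter = {
  name: "pi",
  // Pi session mechanics would live here.
  describeStart: (p) => `pi session: profile=${p.authProfile}, tools=${p.tools.join(",")}`,
};

const codex: RuntimeAdapter = {
  name: "codex",
  // Codex app-server startup and thread lifecycle would live here.
  describeStart: (p) => `codex thread: profile=${p.authProfile}, tools=${p.tools.join(",")}`,
};

// One plan per turn, shared by whichever adapter is selected.
const plan = buildRuntimePlan({
  requestedProfile: "openai-codex/example", // hypothetical profile id
  enabledTools: ["search", "exec"],
});
const starts = [pi, codex].map((adapter) => adapter.describeStart(plan));
```

Note that Object.freeze is shallow, which is why the nested arrays and the transport object are frozen individually. Whichever adapter is selected, the profile and tool list it sees are identical.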

This is one of those ideas that sounds obvious once written down, which is exactly why it matters. Mature systems work is often the process of writing down what everyone thought was already true and then discovering it was only approximately true in five different places.

There is a bigger industry lesson here too. Multi-runtime agent platforms are approaching the same inflection point that browsers, cloud SDKs, and infrastructure control planes hit years ago. Once you have multiple execution backends, parity stops being a docs promise and becomes a contract management problem. If you do not formalize the boundary, runtime-specific patches accumulate until maintainers are effectively supporting separate behavioral dialects under one brand.

For practitioners building on top of agent frameworks, the action item is straightforward. Start testing runtime parity as its own category. Do not just compare model quality. Compare tool hook ordering, auth alias handling, prompt overlays, transcript repair, delivery suppression, fallback semantics, and transport defaults across each supported execution path. If those do not match, you are not running one platform. You are running a routing layer over several incompatible ones.
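One way to start is to run the same scenario on each execution path, capture a normalized trace of each dimension, and diff the traces. In the sketch below, the ParityTrace fields mirror the list above but are hypothetical. In practice they would be populated from whatever hooks or logs each runtime exposes, and the example values are made up.

```typescript
// Hypothetical parity-diff sketch. ParityTrace mirrors the dimensions named in
// the text; how each field is captured per runtime is left as an assumption.

interface ParityTrace {
  toolHookOrder: string[];
  resolvedAuthProfile: string | null;
  promptOverlays: string[];
  repairedTurnKinds: string[];
  deliverySuppressed: boolean;
  fallbackOutcome: string;
  transportParams: Record<string, string | number | boolean>;
}

const DIMENSIONS: (keyof ParityTrace)[] = [
  "toolHookOrder",
  "resolvedAuthProfile",
  "promptOverlays",
  "repairedTurnKinds",
  "deliverySuppressed",
  "fallbackOutcome",
  "transportParams",
];

// Serializes values with sorted object keys so key order is not reported as drift.
// Array order is kept, since hook and overlay ordering is part of the behavior.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonical(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Returns every dimension on which two execution paths disagree.
function diffParity(a: ParityTrace, b: ParityTrace): (keyof ParityTrace)[] {
  return DIMENSIONS.filter((dim) => canonical(a[dim]) !== canonical(b[dim]));
}

// Example traces for the same scenario on two paths (values are made up).
const piTrace: ParityTrace = {
  toolHookOrder: ["before_tool", "after_tool"],
  resolvedAuthProfile: "openai-codex/example",
  promptOverlays: ["platform-base"],
  repairedTurnKinds: ["text", "media"],
  deliverySuppressed: true,
  fallbackOutcome: "needs_recovery",
  transportParams: { timeoutMs: 60000, stream: true },
};

const codexTrace: ParityTrace = {
  ...piTrace,
  repairedTurnKinds: ["text"],                         // media context dropped
  transportParams: { stream: true, timeoutMs: 60000 }, // same values, new key order
};

const drift = diffParity(piTrace, codexTrace); // → ["repairedTurnKinds"]
```

Canonicalizing before comparing is the important design choice. Key order in transport parameters is noise, while a dropped media turn in transcript repair is exactly the kind of drift worth flagging.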

For maintainers, the lesson is harsher. Contract tests are not glamorous, but they are cheaper than reputation drift. The moment users start building automations that depend on background delivery, specific auth profiles, or tool middleware semantics, backend inconsistency becomes an outage with better wording.

My take: this RFC is one of the more adult pieces of agent-platform engineering to land this month. Not because it ships a new capability, but because it reframes a messy cluster of bugs as a boundary problem with a testable solution. The agent market does not need more vibes about autonomous coworkers. It needs more boring, explicit contracts between the platform and the runtimes it wraps. OpenClaw seems to know that now. That is progress.

Sources: OpenClaw PR #71096, OpenClaw RFC #71004, OpenClaw PR #71048, OpenClaw PR #71042