Subagent Thinking Regression Turns Model Governance Into a Multi-Agent Reliability Bug
Multi-agent systems do not only fail when the model gives a bad answer. They fail when the parent agent asks a child to do work and the runtime rejects the child before the task starts. That is the shape of OpenClaw issue #84880: subagent spawns in OpenClaw v2026.5.19 still reject non-off thinking levels for canonical OpenAI/Codex GPT-5 models, even though the main session can use high-reasoning modes through the normal /think path.
This is not a philosophical debate about whether agents should “think.” It is a governance plumbing bug. The parent session and the child-spawn path appear to resolve model reasoning capabilities differently. Once that happens, the same model can be valid in one lane and invalid in another. The user sees a failed delegation. The operator sees a policy system that cannot explain itself.
The repro is small, which is why it matters
The issue reports OpenClaw v2026.5.19, commit a185ca2, with the main session model set to openai-codex/gpt-5.5. In the main session, high reasoning works. But a minimal child spawn like sessions_spawn({ runtime: "subagent", mode: "run", model: "openai/gpt-5.4", thinking: "high", task: "Reply exactly OK" }) is rejected because the child path says openai/gpt-5.4 only supports off.
A second repro with openai-codex/gpt-5.5 and thinking: "high" fails the same way. Earlier attempts with thinking: "xhigh" produced the error "Thinking level xhigh is not supported for openai/gpt-5.5. Use one of: off." That is a strong hint that the child path is not using the same effective capability source as the main session picker.
The issue’s acceptance tests are exactly what you would want: verify openai/gpt-5.4 high, verify openai-codex/gpt-5.5 high, handle xhigh correctly, and make /think plus sessions_spawn share one reasoning-capability source. In other words, stop validating the same runtime fact twice through different maps.
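One way to picture "one reasoning-capability source" is a single resolver and a single validator that both entry points call. The sketch below is illustrative only: resolveCapability, validateThinking, setSessionThinking, spawnSubagent, and the catalog contents are assumed names and placeholder data, not OpenClaw internals.

```typescript
// Hypothetical sketch: every name here is illustrative, not an OpenClaw API.
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

interface ReasoningCapability {
  modelRef: string;
  supported: ThinkingLevel[];
}

// Assumed catalog entries. A real runtime would load these from its live
// catalog; the level lists below are placeholders for illustration.
const catalog: Record<string, ReasoningCapability> = {
  "openai/gpt-5.4": {
    modelRef: "openai/gpt-5.4",
    supported: ["off", "low", "medium", "high"],
  },
  "openai-codex/gpt-5.5": {
    modelRef: "openai-codex/gpt-5.5",
    supported: ["off", "low", "medium", "high", "xhigh"],
  },
};

// The one place that answers "what can this model do?".
function resolveCapability(modelRef: string): ReasoningCapability {
  const known = Object.prototype.hasOwnProperty.call(catalog, modelRef)
    ? catalog[modelRef]
    : undefined;
  if (known !== undefined) {
    return known;
  }
  // Unknown models get the most conservative answer.
  return { modelRef: modelRef, supported: ["off"] };
}

// The one validator. Error wording mirrors the message quoted in the issue.
function validateThinking(modelRef: string, level: ThinkingLevel): ThinkingLevel {
  const capability = resolveCapability(modelRef);
  if (capability.supported.indexOf(level) === -1) {
    throw new Error(
      `Thinking level ${level} is not supported for ${modelRef}. ` +
        `Use one of: ${capability.supported.join(", ")}`
    );
  }
  return level;
}

// Main-session path (a stand-in for /think).
function setSessionThinking(sessionModel: string, level: ThinkingLevel): ThinkingLevel {
  return validateThinking(sessionModel, level);
}

interface SpawnRequest {
  model: string;
  thinking: ThinkingLevel;
  task: string;
}

// Child-spawn path (a stand-in for sessions_spawn). It calls the same
// validator instead of consulting a second capability map.
function spawnSubagent(request: SpawnRequest): SpawnRequest {
  validateThinking(request.model, request.thinking);
  return request;
}
```

With this shape, a level that the main session accepts cannot be rejected at spawn time, because there is no second table to drift out of date.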
This is model governance showing up as reliability
Enterprises increasingly want reasoning levels governed like any other runtime cost and risk knob. Some workflows should use cheap fast models. Some should use high-reasoning models because code review, security triage, or architecture analysis benefits from the extra budget. Some providers have no reasoning parameter and should never receive one. Some organizations may disable higher reasoning for cost or latency control. That is all reasonable, provided the policy is resolved once and carried faithfully through the system.
Issue #84880 shows what happens when it is not. A parent session can be configured for high reasoning, but a child spawn canonicalizes or validates the model through a static capability map that only permits off. The parent agent plans a delegated workflow assuming a child can perform a high-reasoning review. The runtime refuses to create the child. The failure is not in the model. It is in the orchestration layer forgetting what the model can do at the boundary where delegation becomes real work.
The comments strengthen the diagnosis. Practitioner xieetudousi confirms the GPT-5/Codex path and points to a likely catalog-vs-static-capability mismatch: the main session uses the runtime catalog, while subagent spawn canonicalizes openai-codex/gpt-5.5 to openai/gpt-5.5 and checks a static map that only allows off. Another practitioner, SplyzerRB, adds Ollama Cloud evidence: explicit thinking: "off" and config defaults can still result in child crashes complaining that low or medium is unsupported for models with reasoning: false.
That second report is the canary. This is not merely “GPT-5’s capability entry is stale.” It is a policy propagation bug. If unsupported thinking params leak into providers that opted out, and supported thinking params are rejected for providers that support them, the runtime is losing the effective reasoning policy somewhere between configuration, catalog resolution, canonical model refs, and transport invocation.
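A rough sketch of what "not losing the policy" could look like: resolve an effective policy once, then let the transport layer read it without re-deriving or defaulting anything. The types, function names, and the example-local/plain-model ref are hypothetical, and the payload shape is an assumption rather than any provider's real wire format.

```typescript
// Hypothetical sketch: names and payload shape are assumptions.
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

interface ModelEntry {
  modelRef: string;
  reasoning: boolean; // false means the provider takes no reasoning parameter
}

// The policy resolved once, then passed unchanged to every later layer.
// thinking === null means "send no reasoning parameter at all".
interface EffectivePolicy {
  modelRef: string;
  thinking: ThinkingLevel | null;
}

function resolvePolicy(
  model: ModelEntry,
  explicit: ThinkingLevel | undefined,
  configDefault: ThinkingLevel
): EffectivePolicy {
  if (!model.reasoning) {
    // A non-off request for an opted-out model is a caller error, reported
    // here rather than as a crash inside the child run.
    if (explicit !== undefined && explicit !== "off") {
      throw new Error(`${model.modelRef} does not accept thinking level ${explicit}`);
    }
    // Config defaults are deliberately ignored for opted-out models.
    return { modelRef: model.modelRef, thinking: null };
  }
  // An explicit request wins over the config default.
  return {
    modelRef: model.modelRef,
    thinking: explicit !== undefined ? explicit : configDefault,
  };
}

// Transport layer: reads the policy, never re-derives or defaults it.
function buildPayload(policy: EffectivePolicy, prompt: string): Record<string, unknown> {
  const payload: Record<string, unknown> = { model: policy.modelRef, prompt: prompt };
  if (policy.thinking !== null) {
    payload["thinking"] = policy.thinking;
  }
  return payload;
}
```

The design choice worth noting is the null case: an opted-out provider is represented as "no parameter", which is different from thinking: "off" and cannot be overwritten by a default injected further down.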
The previous closure did not prove the path
The history matters. Earlier issue #84706 was closed as already fixed via PR #84626. But #84626 was a doctor --fix migration for stale compat.thinkingFormat config values. Useful, but not proof that sessions_spawn(...thinking) works. That distinction is a classic runtime-maintenance trap: a related config migration passes, so a different execution path is assumed fixed.
Good acceptance tests should now exercise the actual public contract. Spawn a child with a GPT-5 model and thinking: "high". Spawn a child with the Codex-prefixed model ref. Spawn a child for a provider with reasoning disabled and prove no unsupported reasoning parameter is sent. Run the same capability object through /think, sessions_spawn, provider adapters, and transport payload construction. If any layer reinterprets the policy, the test should fail before users rediscover it in production.
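Those cases lend themselves to a small table-driven harness that runs against whatever spawn function a runtime exposes. The sketch below is an assumption-laden illustration: SpawnFn, SpawnOutcome, the two stand-in runtimes, and the example-local/plain-model ref are invented for the example. The xhigh case is left out because the expected outcome for it is not stated here.

```typescript
// Hypothetical harness: types and stand-in runtimes are illustrative only.
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

interface SpawnRequest {
  model: string;
  thinking: ThinkingLevel;
  task: string;
}

interface SpawnOutcome {
  accepted: boolean;          // did the runtime create the child?
  sentThinkingParam: boolean; // did a reasoning parameter reach the provider?
}

type SpawnFn = (request: SpawnRequest) => SpawnOutcome;

interface AcceptanceCase {
  name: string;
  request: SpawnRequest;
  expected: SpawnOutcome;
}

const acceptanceCases: AcceptanceCase[] = [
  {
    name: "GPT-5 model with high",
    request: { model: "openai/gpt-5.4", thinking: "high", task: "Reply exactly OK" },
    expected: { accepted: true, sentThinkingParam: true },
  },
  {
    name: "Codex-prefixed ref with high",
    request: { model: "openai-codex/gpt-5.5", thinking: "high", task: "Reply exactly OK" },
    expected: { accepted: true, sentThinkingParam: true },
  },
  {
    name: "reasoning-disabled provider with off",
    request: { model: "example-local/plain-model", thinking: "off", task: "Reply exactly OK" },
    expected: { accepted: true, sentThinkingParam: false },
  },
];

// Returns the names of the cases whose outcome differs from expectations.
function runAcceptance(spawn: SpawnFn, cases: AcceptanceCase[]): string[] {
  const failures: string[] = [];
  for (const testCase of cases) {
    const actual = spawn(testCase.request);
    if (
      actual.accepted !== testCase.expected.accepted ||
      actual.sentThinkingParam !== testCase.expected.sentThinkingParam
    ) {
      failures.push(testCase.name);
    }
  }
  return failures;
}

// Stand-in for the reported behavior: only "off" is accepted for any model.
const regressedSpawn: SpawnFn = (request) => ({
  accepted: request.thinking === "off",
  sentThinkingParam: false,
});

// Stand-in for a runtime whose child path honors the model's capability.
const consistentSpawn: SpawnFn = (request) => {
  if (request.model === "example-local/plain-model") {
    return { accepted: request.thinking === "off", sentThinkingParam: false };
  }
  return { accepted: true, sentThinkingParam: request.thinking !== "off" };
};
```

Running runAcceptance(regressedSpawn, acceptanceCases) flags both high-reasoning cases, while the consistent stand-in passes all three. The regressed stand-in models only the off-only rejection, not the parameter leak reported for reasoning-disabled providers.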
For OpenClaw operators, the workarounds are not elegant: pin subagent thinking to off where necessary, avoid delegating high-reasoning work to child runs until a release proves the acceptance cases, or route that work through the main session. If you rely on subagents for code review, docs analysis, or multi-file engineering tasks, treat this as a reliability issue, not an optional model-setting nicety.
For platform builders, the lesson is broader. Model refs are not just strings, and “canonicalization” is not harmless. openai-codex/gpt-5.5 may map to an OpenAI model family, but it may also carry provider-specific runtime behavior, auth routing, compatibility settings, and reasoning capability. Normalizing a name while dropping the associated capability context is how good abstractions become production bugs.
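One possible shape for canonicalization that does not drop context is to return a resolved record instead of a bare string, so the canonical name travels together with the provider and capability of the ref the caller actually used. This is a sketch under assumptions: ResolvedModel, resolveModel, and the capability data are hypothetical, and only the openai-codex/gpt-5.5 to openai/gpt-5.5 mapping comes from the issue comments.

```typescript
// Hypothetical sketch: record shape and capability data are assumptions.
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

interface ResolvedModel {
  requestedRef: string; // what the caller asked for
  canonicalRef: string; // normalized name, for display or grouping
  provider: string;     // taken from the requested ref, not the canonical one
  supportedThinking: ThinkingLevel[];
}

// Own-property lookup so keys like "toString" never match by accident.
function lookup<T>(table: Record<string, T>, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Alias described in the issue comments.
const canonicalAliases: Record<string, string> = {
  "openai-codex/gpt-5.5": "openai/gpt-5.5",
};

// Assumed capability data, keyed by the ref the caller actually used.
const capabilityByRequestedRef: Record<string, ThinkingLevel[]> = {
  "openai-codex/gpt-5.5": ["off", "low", "medium", "high"],
};

function resolveModel(requestedRef: string): ResolvedModel {
  const slash = requestedRef.indexOf("/");
  const provider = slash === -1 ? requestedRef : requestedRef.slice(0, slash);
  const alias = lookup(canonicalAliases, requestedRef);
  // Capability is looked up before the name is normalized, so it cannot be
  // lost by the rename.
  const capability = lookup(capabilityByRequestedRef, requestedRef);
  return {
    requestedRef: requestedRef,
    canonicalRef: alias !== undefined ? alias : requestedRef,
    provider: provider,
    supportedThinking: capability !== undefined ? capability : ["off"],
  };
}

function supportsThinking(model: ResolvedModel, level: ThinkingLevel): boolean {
  return model.supportedThinking.indexOf(level) !== -1;
}
```

Downstream code that only receives canonicalRef as a string has to look capability up again, which is where a stale static map can slip in. Passing the whole record removes that second lookup.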
This is why “best coding agent for enterprise” comparisons need runtime semantics in the table. A model can support high reasoning and still be unusable in a multi-agent workflow if the orchestration layer forgets that at the child boundary. A local provider can reject reasoning parameters correctly and still crash if defaults are injected downstream. The model benchmark is not the system benchmark.
OpenClaw’s issue labels are appropriate: P1, impact:session-state, impact:auth-provider, clawsweeper:fix-shape-clear, and issue-rating: 🐚 platinum hermit. The fix shape is clear because the product contract is clear. Parent and child runs should agree on what a model supports. If reasoning is policy, it has to be policy all the way down.
Sources: OpenClaw issue #84880, OpenClaw issue #84706, OpenClaw issue #84646, OpenClaw PR #84626