OpenAI-Compatible Reasoning Streams Need Progress Signals, Not Visible Text

Reasoning models are forcing agent runtimes to admit something product UIs have been hiding: progress and output are not the same thing. A model can be working, spending reasoning tokens, and still produce no visible assistant text for several seconds. If the watchdog only understands text, it will kill a healthy stream and call that reliability. OpenClaw PR #89440 fixes exactly that failure in the OpenAI-compatible transport, and the lesson is bigger than one Vertex AI sidecar.

The bug was filed as issue #84384 against OpenClaw 2026.5.6. A user was running Gemini 2.5 Flash through a Vertex AI sidecar on 127.0.0.1:8787, exposing the model through an OpenAI-compatible streaming endpoint. The sidecar itself returned HTTP 200 in two to three seconds, but OpenClaw consistently hit its LLM idle timeout at roughly 28 seconds. The model was not necessarily idle. The runtime was blind to the kind of activity the model was emitting.

OpenAI-compatible is not stream-compatible

The root cause is subtle but increasingly common. Gemini 2.5 Flash emitted chunks containing completion_tokens_details.reasoning_tokens, even when thinking was not explicitly requested. Those chunks represented model activity, but they did not include visible assistant text or a tool event. OpenClaw’s OpenAI-compatible stream reader consumed them without yielding a stream event. The idle watchdog, seeing no user-visible delta, treated the stream as silent and terminated the turn.

PR #89440, created June 2 at 12:35 UTC and merged less than five minutes later, ports a fix that OpenClaw had already applied to its native Google transport. PR #76080 previously taught the native Google path that thoughtSignature-only SSE parts should refresh the watchdog. PR #89440 applies the same concept to OpenAI-compatible streaming: when a usage chunk reports positive reasoning tokens but carries no assistant event, OpenClaw emits a zero-length thinking_delta activity marker. That marker refreshes liveness without turning hidden reasoning into chat content.
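The rule described above can be sketched in a few lines. This is an illustration of the idea, not OpenClaw's actual code: the StreamChunk and StreamEvent types and the eventsForChunk function are hypothetical names, and only the completion_tokens_details.reasoning_tokens field and the thinking_delta event name come from the PR description. The sketch looks at one chunk at a time and makes no attempt to reproduce how the real patch suppresses trailing blank thinking blocks after visible output.

```typescript
// Hypothetical chunk shape: only the fields this sketch reads are modeled.
interface StreamChunk {
  choices?: Array<{ delta?: { content?: string | null } }>;
  usage?: {
    completion_tokens_details?: { reasoning_tokens?: number };
  };
}

// Hypothetical event union for the runtime's internal stream.
type StreamEvent =
  | { type: "text_delta"; delta: string }
  | { type: "thinking_delta"; delta: string };

function eventsForChunk(chunk: StreamChunk): StreamEvent[] {
  const events: StreamEvent[] = [];

  // Visible assistant text always becomes a normal text event.
  for (const choice of chunk.choices ?? []) {
    const text = choice.delta?.content;
    if (typeof text === "string" && text.length > 0) {
      events.push({ type: "text_delta", delta: text });
    }
  }

  // If the chunk carried no visible event but reports reasoning work,
  // emit a zero-length marker so the watchdog sees activity.
  const reasoningTokens =
    chunk.usage?.completion_tokens_details?.reasoning_tokens ?? 0;
  if (events.length === 0 && reasoningTokens > 0) {
    events.push({ type: "thinking_delta", delta: "" });
  }

  return events;
}
```

The marker's delta is an empty string on purpose: it gives the control plane an event to count while contributing nothing to whatever text is later assembled from the stream.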

That last clause matters. The fix does not leak thought text, invent empty user-facing messages, or add trailing blank thinking blocks after visible text or tool output. It creates an internal progress signal for the control plane. GitHub API metadata shows the patch at +162/-9 across two files, with verification covering openai-transport-stream.test.ts and llm-idle-timeout.test.ts. Maintainers reported 314 tests passing, oxfmt passing, pnpm check:test-types passing, a clean local autoreview, and a green check rollup at head 31a3e181d2a05894bac03bbc97f7bf61dfb2ff84.

The adapter shape is only half the contract

This is a useful example of why “OpenAI-compatible” should be treated as a starting point, not a guarantee. The request and response envelope can look OpenAI-shaped while the stream semantics still differ in ways that matter operationally. Local proxies, BYOK gateways, Vertex sidecars, llama.cpp adapters, vLLM deployments, and hosted compatibility layers all sit in this zone. They may expose the same API surface while making different choices about usage chunks, reasoning metadata, tool-call deltas, finish reasons, and keepalive behavior.

Reasoning tokens make the mismatch more visible. Older chat models mostly produced visible text or tool calls; watchdog logic could get away with treating those as the only signs of life. Reasoning models do meaningful work in spans that are intentionally not shown to the user. The runtime still needs to observe those spans for timeout, cancellation, telemetry, and user feedback. Otherwise the platform punishes the model for being quiet in exactly the way the provider designed it to be quiet.

The practitioner takeaway is straightforward: test reasoning-only intervals. Do not stop at “does the model eventually return an answer?” A production transport test should simulate chunks with positive reasoning-token usage and no visible assistant delta, long gaps before final text, tool calls after reasoning-only spans, and true hangs where no activity arrives at all. The watchdog must distinguish all four. If it cannot, your agent will either kill healthy reasoning or wait forever on dead streams.
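One way to exercise those cases without real timers is to replay a recorded timeline of stream activity through the idle rule. The sketch below is a simplified model rather than OpenClaw's watchdog: the Activity type, the idleTimeoutAt function, the two predicates, and the 10-second limit are all hypothetical and chosen for illustration. The first scenario covers both a reasoning-only interval and a long gap before final text.

```typescript
type ActivityKind = "text" | "tool_call" | "reasoning_marker";

interface Activity {
  atMs: number; // milliseconds since the request started
  kind: ActivityKind;
}

// Hypothetical idle limit for the simulation.
const IDLE_LIMIT_MS = 10_000;

// Returns the time at which the watchdog would fire, or null if the stream
// finishes without any gap longer than the limit. endMs === null models a
// stream that never completes.
function idleTimeoutAt(
  activity: Activity[],
  endMs: number | null,
  countsAsProgress: (a: Activity) => boolean,
  idleLimitMs: number = IDLE_LIMIT_MS,
): number | null {
  let lastProgressMs = 0;
  for (const item of activity) {
    if (!countsAsProgress(item)) continue;
    if (item.atMs - lastProgressMs > idleLimitMs) {
      return lastProgressMs + idleLimitMs;
    }
    lastProgressMs = item.atMs;
  }
  if (endMs === null || endMs - lastProgressMs > idleLimitMs) {
    return lastProgressMs + idleLimitMs;
  }
  return null;
}

// Two views of the same stream: one counts reasoning markers, one does not.
const anyActivity = (_a: Activity): boolean => true;
const visibleOnly = (a: Activity): boolean => a.kind !== "reasoning_marker";

// Reasoning-only interval with a long wait before the final text.
const quietButAlive: Activity[] = [
  { atMs: 8_000, kind: "reasoning_marker" },
  { atMs: 16_000, kind: "reasoning_marker" },
  { atMs: 24_000, kind: "reasoning_marker" },
  { atMs: 30_000, kind: "text" },
];

// Tool call arriving after a reasoning-only span.
const toolAfterReasoning: Activity[] = [
  { atMs: 4_000, kind: "reasoning_marker" },
  { atMs: 8_000, kind: "reasoning_marker" },
  { atMs: 9_000, kind: "tool_call" },
];

// True hang: one sign of life, then nothing, and the stream never ends.
const trueHang: Activity[] = [{ atMs: 1_000, kind: "reasoning_marker" }];

const healthyResult = idleTimeoutAt(quietButAlive, 31_000, anyActivity);
const textOnlyResult = idleTimeoutAt(quietButAlive, 31_000, visibleOnly);
const toolResult = idleTimeoutAt(toolAfterReasoning, 9_500, anyActivity);
const hangResult = idleTimeoutAt(trueHang, null, anyActivity);
```

Under these assumptions, the healthy stream survives only when reasoning markers count as progress, while the hang still times out in either view. That asymmetry is the property a transport test should pin down.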

There is also a transcript-hygiene angle. Internal progress markers must not pollute the conversation history. A zero-length thinking_delta is a workable compromise: it tells the runtime something happened without creating a fake assistant message. That is the right boundary: operational telemetry belongs to the control plane; user-visible content belongs to the transcript. Mixing them is how debugging artifacts become memory, replay state, or confusing UI debris.
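A minimal sketch of that boundary, assuming a hypothetical TurnEvent union and foldTurn helper (neither is taken from OpenClaw): every event counts toward liveness, but only visible text reaches the transcript.

```typescript
// Hypothetical event union, mirroring the event names used in this article.
type TurnEvent =
  | { type: "text_delta"; delta: string }
  | { type: "thinking_delta"; delta: string };

interface TurnSummary {
  transcriptText: string; // what is stored and shown to the user
  progressSignals: number; // what the control plane counts for liveness
}

function foldTurn(events: TurnEvent[]): TurnSummary {
  let transcriptText = "";
  let progressSignals = 0;
  for (const event of events) {
    // Every event, visible or not, is evidence the stream is alive.
    progressSignals += 1;
    // Only visible text is appended to the transcript; thinking markers
    // are dropped here so they never become conversation history.
    if (event.type === "text_delta") {
      transcriptText += event.delta;
    }
  }
  return { transcriptText, progressSignals };
}

const summary = foldTurn([
  { type: "thinking_delta", delta: "" },
  { type: "thinking_delta", delta: "" },
  { type: "text_delta", delta: "Done." },
]);
```

Here summary.progressSignals is 3 and summary.transcriptText is "Done.", so the two reasoning markers refreshed liveness without leaving any trace in the stored turn.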

For local and self-hosted coding-agent users, this fix is especially relevant. A lot of teams are experimenting with OpenAI-shaped adapters because they want privacy, lower cost, regional control, or model optionality without rewriting the entire agent stack. That only works if the runtime respects more than endpoint syntax. It has to preserve provider semantics well enough that the model can think, stream, call tools, and recover across turns without being misclassified as idle.

The editorial take: OpenAI-compatible APIs are the USB-C port of agent infrastructure: convenient, widespread, and full of devices that are not quite the same behind the connector. PR #89440 is a small but correct adaptation to that reality. Reasoning streams need progress signals, not visible text. If an agent watchdog cannot tell the difference, it is not guarding liveness; it is deleting work.

Sources: OpenClaw PR #89440, OpenClaw issue #84384, OpenClaw PR #76080, Google Cloud Vertex AI inference docs