OpenClaw's Streaming Usage Bug Is a Cost-Governance Problem, Not a Metrics Nit

OpenClaw's Streaming Usage Bug Is a Cost-Governance Problem, Not a Metrics Nit

Token accounting is one of those features that feels like dashboard garnish until the bill arrives, a rate limit trips, or a production agent quietly burns through a budget while looking innocent in the UI. That is why OpenClaw issue #90495 is more than a compatibility nit. It reports that the OpenClaw gateway does not honor stream_options: { include_usage: true } for streaming /v1/chat/completions requests. In plain English: the agent can stream the answer, but the gateway drops the receipt.

The report was opened June 5 at 00:57 UTC and passed the freshness gate as minutes old. The reproduction uses POST http://localhost:18789/v1/chat/completions with stream: true, stream_options: {"include_usage": true}, and model: "openclaw/main" against an OpenRouter-backed agent on OpenClaw gateway v2026.3.24. The expected behavior matches OpenAI’s documented streaming contract: before data: [DONE], the stream should include a final chunk with usage: { prompt_tokens, completion_tokens, total_tokens } and choices: []. The actual behavior is simpler and worse: no SSE chunk contains usage.

The issue author describes the downstream impact directly. Consumers relying on streaming token usage for billing, rate limiting, or cost tracking get no data. Their openclaw-agent stats pipeline is already implemented, but it produces zero token metrics because usage is always nil. That is not a broken chart. That is an observability contract failing at the exact point where agent workloads are hardest to estimate.

The expensive turns are usually the streaming turns

Non-streaming chat calls are easier to account for. The provider computes, returns a response, and attaches usage at the end. Streaming complicates that path. The application wants partial tokens immediately. The provider may only know final accounting after completion. The gateway has to preserve the live content stream and still forward a terminal accounting event without confusing clients that parse Server-Sent Events frame by frame.

That complexity is precisely why stream_options.include_usage exists. It gives clients a standard way to ask for the receipt in the streaming path. If OpenClaw exposes an OpenAI-compatible chat-completions surface, this option is not decorative. Compatibility is not just accepting the JSON key without throwing an error. Compatibility means preserving the semantics the caller depends on.

For coding agents, the stakes are higher than ordinary chat. A coding turn can trigger tool calls, subagents, retries, compaction, repository search, patch generation, review loops, test runs, and provider fallbacks. The turn that streams for several minutes is often the turn operators most need to measure. If that path loses usage, the dashboard undercounts precisely the work that dominates spend.

Budget enforcement becomes fragile too. Many teams start with provider-side hard limits, then add gateway-level accounting, then add project or agent budgets. That middle layer only works if the gateway receives real usage events. If streaming requests return content but no usage, a downstream budget system has three bad options: pretend usage was zero, estimate from text length, or reconcile later from provider invoices. The first is false. The second is imprecise. The third is too late to stop a runaway agent.

OpenClaw is seeing cost bugs from both directions

The related context makes this issue more important, not less. Issue #89709 reported the opposite failure mode after v2026.5.28: bounded daily usage showed historical cumulative data, including qwen3.5-flash at 12.6M dashboard tokens versus 2.38M cloud-console tokens — roughly 5x inflation. PR #90485 attempted to exclude untimestamped cached transcript rows from bounded dashboard ranges while keeping all-time summaries intact, but it was closed at capture time and had been marked as needing proof.

Put those together and the pattern is obvious. One path undercounts because streaming usage disappears. Another path overcounts because bounded ranges accidentally pull in historical cached rows. Both failures produce dashboards that look precise and are wrong. That is the most dangerous kind of observability bug because it trains operators to stop trusting the product. A cost chart that is visibly absent is annoying. A cost chart that is confidently wrong is worse.

The practitioner response should be concrete. Test your own gateway directly. Send a streaming request with include_usage, capture the raw SSE frames, and verify whether the terminal usage chunk arrives before [DONE]. Do this for each provider route you depend on — OpenAI, OpenRouter, Anthropic-compatible adapters, local OpenAI-compatible servers — because gateways often normalize the request but not the provider’s exact terminal accounting behavior.

If usage is absent, do not treat OpenClaw’s downstream budget dashboards as authoritative for streaming workloads. Enforce hard spend limits at the provider account level. Add request counters and rough token estimates as warning signals, not billing truth. If you operate agents for teams, expose the uncertainty: “streaming usage unavailable for this route” is better than showing a neat zero.

For OpenClaw itself, the fix should be framed as a contract test, not a one-off adapter patch. A gateway advertising OpenAI-compatible streaming should have tests that assert stream_options.include_usage survives request normalization, provider routing, SSE transformation, and final chunk delivery. It should also define behavior when a provider cannot supply usage: omit with an explicit capability marker, synthesize only if labeled estimated, or return a clear compatibility warning. Silent absence is the wrong default because silent absence looks like free work.

The bigger industry lesson is that agent cost governance is not a billing page. It is runtime evidence. You need prompt tokens, completion tokens, cached tokens, model route, provider, timestamp, session identity, retry count, and whether the numbers are measured or estimated. Lose any of those in the hot path and the control plane starts guessing.

OpenClaw’s streaming usage gap is small in code shape and large in product meaning. Token accounting is infrastructure, not dashboard garnish. If the streaming path drops usage, the agent platform loses the evidence needed for budgets, rate limits, and honest postmortems.

Sources: OpenClaw issue #90495, OpenAI API documentation, OpenClaw issue #89709, OpenClaw PR #90485