OpenAI’s WebSocket Push Is Really a Bid to Remove the Agent Tax

Most complaints about coding agents eventually collapse into one banal sentence: it takes too long. Not too long to think, necessarily, and not always too long to generate tokens. Too long to do the whole dance. Read files, call tools, wait for the API, rebuild context, validate the same state again, wait some more, then maybe continue. That accumulated dead time is the real “agent tax,” and OpenAI’s April 24 WebSockets post is notable because it finally treats the problem as infrastructure, not magic.

The headline number is clean. OpenAI says it made Responses API agent loops about 40% faster end to end, after earlier work already delivered nearly 45% better time to first token. The more interesting detail is how. The company did not just point to a faster model and call it a day. It reworked the transport layer so clients can keep a persistent WebSocket connection open, reuse prior response state in memory, and continue a conversation with previous_response_id instead of rebuilding the entire context on every follow-up request.
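To make the cost of rebuilding context concrete, here is a rough back-of-the-envelope sketch. It is not OpenAI's accounting: the function names are hypothetical, and it assumes each turn adds a fixed number of new items and that a stateless loop has to resend or reprocess everything from all earlier turns.

```python
def items_handled_full_rebuild(turns: int, items_per_turn: int = 1) -> int:
    """Total items touched when turn k re-handles everything from turns 1..k."""
    return sum(k * items_per_turn for k in range(1, turns + 1))


def items_handled_incremental(turns: int, items_per_turn: int = 1) -> int:
    """Total items touched when each turn only handles its own new items."""
    return turns * items_per_turn


# Assumed, illustrative loop length: 40 turns, one new item per turn.
TURNS = 40
print("full rebuild:", items_handled_full_rebuild(TURNS))
print("incremental: ", items_handled_incremental(TURNS))
```

Under those assumptions the stateless loop grows quadratically with the number of turns while the incremental loop grows linearly, which is why the gap widens on exactly the long sessions agents are used for.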

That sounds like plumbing because it is plumbing. It is also exactly the kind of plumbing that separates a flashy demo from a useful coding workflow. In OpenAI’s description of a typical Codex bug-fix task, the system scans a codebase, reads files, edits code, runs tests, and loops through dozens of back-and-forth API requests. The old architecture treated each of those turns as more independent than they really were. Even when most of the conversation history and tool setup had not changed, the system kept paying to reprocess them. As inference got faster, those repeated CPU-side costs became harder to hide.

OpenAI’s own numbers help explain why the transport work matters now. The post says earlier flagship models like GPT-5 and GPT-5.2 ran at roughly 65 tokens per second in the Responses API. For GPT-5.3-Codex-Spark, built on specialized Cerebras hardware, the goal was over 1,000 tokens per second, and the company says it later saw bursts up to 4,000 TPS in production. Once inference jumps that sharply, any leftover orchestration overhead becomes painfully visible to users. You stop waiting on the model and start waiting on everything wrapped around it.
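The arithmetic behind that shift can be sketched in a few lines. The token speeds come from the figures quoted above; the 200 output tokens per turn and the 0.5 seconds of fixed per-turn overhead are assumptions picked purely for illustration, not numbers from OpenAI.

```python
def overhead_share(tokens: int, tps: float, overhead_s: float) -> float:
    """Fraction of one turn's wall-clock time spent outside token generation."""
    inference_s = tokens / tps
    return overhead_s / (inference_s + overhead_s)


# Assumed values, for illustration only.
TOKENS_PER_TURN = 200
OVERHEAD_S = 0.5

for tps in (65, 1000, 4000):
    share = overhead_share(TOKENS_PER_TURN, tps, OVERHEAD_S)
    print(f"{tps:>5} TPS: overhead is {share:.0%} of the turn")
```

With the overhead held constant, its share of each turn climbs steeply as generation speeds up, which is the point of the paragraph above: the same fixed cost goes from a minor line item to the dominant one.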

The architectural fix is refreshingly unromantic. OpenAI considered different approaches, including gRPC bidirectional streaming, but settled on WebSockets because the protocol could preserve familiar request and response shapes. Instead of forcing developers to rewrite their integrations around a completely different interaction model, OpenAI kept the core response.create pattern and used a connection-scoped in-memory cache to store previous response objects, prior input and output items, tool definitions and namespaces, and reusable sampling artifacts such as rendered tokens. Follow-up requests only need to send the new information. The system handles the rest by reusing cached state.
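A toy model helps show what "connection-scoped" implies. The class below is a hypothetical sketch, not OpenAI's implementation or API: it keeps prior items in a plain dictionary keyed by response ID, lets a follow-up send only its new items, and loses everything when the object (standing in for the connection) goes away.

```python
import itertools


class ConnectionCache:
    """Toy in-memory store that lives only as long as one connection object."""

    def __init__(self):
        self._items_by_response = {}  # response_id -> full item history
        self._counter = itertools.count()

    def create(self, new_items, previous_response_id=None):
        """Return (response_id, full_context) after appending only new_items."""
        if previous_response_id is None:
            history = []
        else:
            # A KeyError here models a cache miss, e.g. after a reconnect.
            history = self._items_by_response[previous_response_id]
        full_context = history + list(new_items)
        response_id = f"resp_{next(self._counter)}"
        self._items_by_response[response_id] = full_context
        return response_id, full_context


cache = ConnectionCache()
first_id, _ = cache.create(["user: fix the failing test"])
second_id, context = cache.create(
    ["tool: test output"], previous_response_id=first_id
)
print(f"sent 1 new item, server-side context has {len(context)} items")
```

The trade-off the sketch makes visible is that the cache is tied to the connection. A client that reconnects starts with an empty store, so a real integration would presumably need a fallback path for that case.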

That design choice matters more than the protocol brand name. OpenAI is effectively admitting that agent performance is now a systems problem. Better models help, but once a coding tool becomes multi-step and tool-heavy, the user experience depends just as much on state management, validation strategy, and transport efficiency as on raw model intelligence. If you keep rebuilding the world between turns, you can have the fastest model in the market and still deliver a product that feels like it is dragging a trailer uphill.

The external validation OpenAI cited is directionally useful even if it deserves the usual caution. The company says Vercel’s AI SDK saw up to 40% lower latency, Cline reported 39% faster multi-file workflows, and Cursor saw OpenAI models become up to 30% faster. Those claims were linked through social posts that could not be independently verified here, so they should be treated as vendor-presented evidence rather than settled truth. Still, the pattern is plausible, and it lines up with what developers have been saying for months: the model is often not the only or even the main bottleneck anymore.

There is a broader market implication here for Codex and for everyone building agent loops on OpenAI’s stack. OpenAI launched GPT-5.5 one day earlier with a pitch centered on workflow quality, tool use, and lower token consumption. The WebSockets post is the missing half of that story. A model can be more capable on paper and still disappoint if the surrounding loop is clumsy. Conversely, transport and cache improvements can make an unchanged model feel dramatically more usable because fewer of the user’s seconds are being burned on repeated overhead. In other words, “smarter” and “feels faster” are separate axes, and choosing one over the other is no longer a viable trade. Mature products need both.

This is especially relevant for builders evaluating coding agents in practice. Teams often attribute sluggishness to the wrong layer. They blame the model for what is really a loop problem. They conclude that one vendor’s model is more productive when the underlying win may partly come from better state reuse, less chatty transport, or more aggressive caching around validation and routing. As agentic coding becomes a competitive category, vendors will market the brain because it is easier to sell. Buyers should inspect the pipes too.

The practical takeaway is not “everyone should use WebSockets because OpenAI says so.” It is that teams building serious agent workflows need to profile latency honestly. Measure how much time is spent in inference, how much in request validation, how much in client-side tool execution, and how much in rebuilding context that did not materially change. If your stack still treats every turn as a fresh conversation, you are probably paying an unnecessary tax in both latency and compute. That tax compounds worst in exactly the workflows people care most about: long refactors, multi-file debugging, test-fix loops, and background agent sessions.
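One minimal way to start that measurement is a per-phase timer wrapped around the agent loop. The sketch below is an assumption about how a team might instrument its own code: the LoopProfiler class and the phase names are hypothetical, and the sleep calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LoopProfiler:
    """Accumulates wall-clock time per named phase of an agent loop."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the phase raised an exception.
            self.totals[name] += self._clock() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}


profiler = LoopProfiler()
for _ in range(3):  # three simulated turns
    with profiler.phase("context_rebuild"):
        time.sleep(0.01)  # placeholder for re-serializing history
    with profiler.phase("inference"):
        time.sleep(0.02)  # placeholder for the model call
    with profiler.phase("tool_execution"):
        time.sleep(0.005)  # placeholder for running tests or edits

for name, share in sorted(profiler.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {share:.0%}")
```

The injectable clock is there so the profiler can be tested deterministically. In a real loop, the interesting number is how large the non-inference phases are once summed over dozens of turns.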

There is also a useful product lesson here for the broader AI tooling market. Early coding-agent competition was dominated by benchmark screenshots and personality tests disguised as technical evaluations. The next phase will be uglier and more consequential: transport choices, state models, retry semantics, approval flows, and runtime overhead. Those are not sexy launch-day headlines, but they are the reason one tool feels dependable and another feels like it is burning your afternoon one round trip at a time.

OpenAI’s WebSockets push is therefore best read not as a protocol announcement, but as a maturity signal. The company is starting to optimize the stack where frustration actually lives. That is what happens when a product moves from demo theater to daily use. The smartest model in the room still matters. But once agents are expected to carry real work, shaving the “agent tax” off the loop can matter just as much.

Sources: OpenAI, OpenAI Developer Docs, OpenAI GPT-5.5 announcement