OpenAI's WebSocket Push Is Really a Bid to Remove the Agent Tax

There is a specific kind of frustration that sophisticated users of AI coding tools feel but rarely articulate cleanly: the model is fast, but the agent loop is slow. The cause is not sluggish inference but overhead between turns that compounds. API handshakes, context reconstruction, state validation, and repeated authentication make up the tax of being an agent instead of a chatbot, and that tax shows up as dead time that makes even a fast model feel slow.

OpenAI published a post this week about WebSocket support in the Responses API that is really about attacking exactly that problem. The headline number — roughly 40% end-to-end latency reduction for agent loops — is compelling on its own. But the mechanism is more interesting than the headline, because it describes a structural change to how conversation state persists across turns. Instead of rebuilding full context from scratch on every request, OpenAI is now caching previous response state in memory, keeping a persistent WebSocket connection open, and reusing prior tool definitions, namespaces, and sampling artifacts across turns via a previous_response_id. That is not an inference optimization. It is a protocol optimization.
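
The state-reuse idea can be illustrated with a toy model. The sketch below is not OpenAI's implementation or API: ToyResponseServer, its create method, and the item strings are hypothetical, and only the previous_response_id name comes from the post. It counts how many conversation items a server has to process when the client resends the full history every turn, versus when it sends only the new items and points at cached state.

```python
class ToyResponseServer:
    """Hypothetical in-memory stand-in for a server that caches response state."""

    def __init__(self):
        self._cache = {}          # response id -> conversation items seen so far
        self._next_id = 0
        self.items_processed = 0  # proxy for per-turn context-rebuild work

    def create(self, new_items, previous_response_id=None):
        # Cached state is reused as-is; only the newly sent items cost work.
        prior = self._cache.get(previous_response_id, [])
        self.items_processed += len(new_items)
        response_id = f"resp_{self._next_id}"
        self._next_id += 1
        self._cache[response_id] = prior + list(new_items)
        return response_id


def run_stateless(server, turns):
    """Resend the whole history every turn, with no reference to prior state."""
    history = []
    for t in range(turns):
        history.append(f"tool_output_{t}")
        server.create(history)


def run_stateful(server, turns):
    """Send only the new item and chain turns through previous_response_id."""
    previous = None
    for t in range(turns):
        previous = server.create([f"tool_output_{t}"], previous_response_id=previous)
    return previous


stateless, stateful = ToyResponseServer(), ToyResponseServer()
run_stateless(stateless, turns=10)
last_id = run_stateful(stateful, turns=10)
print(stateless.items_processed, stateful.items_processed)  # 55 10
```

In this toy, the stateless client's work grows with the square of the turn count while the chained client's work grows linearly. Real savings depend on what the server actually caches, which the sketch does not model.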

The production numbers OpenAI cites are instructive. Earlier flagship models like GPT-5 and GPT-5.2 ran at roughly 65 tokens per second, which is respectable for single-turn work. GPT-5.3-Codex-Spark targeted 1,000+ tokens per second and saw bursts up to 4,000 TPS in production. That gap, from 65 to 4,000, is a raw throughput gain on paper, but users only feel it if the transport keeps up: once generation takes a fraction of a second, fixed per-turn overhead becomes the dominant share of each turn. The model improvements matter, but the protocol changes are what make the difference visible to users in agentic workflows.
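
One way to read these throughput numbers is to ask how much of a turn is spent on fixed overhead as generation gets faster. The sketch below is back-of-the-envelope arithmetic, not a measurement: the 200 tokens per turn and the 0.5 seconds of per-turn overhead are assumed values chosen for illustration, and only the 65, 1,000, and 4,000 TPS figures come from the post.

```python
def overhead_share(tokens_per_turn, tokens_per_second, overhead_s):
    """Fraction of one turn's wall-clock time spent on fixed per-turn overhead."""
    generation_s = tokens_per_turn / tokens_per_second
    return overhead_s / (generation_s + overhead_s)


# Assumed workload: 200 output tokens per turn, 0.5 s of fixed overhead.
for tps in (65, 1000, 4000):
    share = overhead_share(tokens_per_turn=200, tokens_per_second=tps, overhead_s=0.5)
    print(f"{tps:>5} TPS: overhead is {share:.0%} of the turn")
```

Under these assumptions the overhead share climbs from about 14% at 65 TPS to about 71% at 1,000 TPS and about 91% at 4,000 TPS, which is why a fixed per-turn cost that was tolerable on slower models starts to dominate on faster ones.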

The external adoption claims are harder to evaluate: OpenAI cites them and links to external sources, but those sources could not be independently verified for this piece. The internal framing, though, is consistent with what the market has been asking for. Vercel AI SDK, Cline, and Cursor all publicly work with OpenAI APIs, and all three would benefit from lower per-turn overhead in multi-file workflows. If the latency claims hold in real workloads rather than controlled benchmarks, they represent a real improvement in how coding agents feel during sustained work.

The broader implication is that OpenAI understands something important about where the next performance frontier lives. Base model improvements are still coming — the token throughput numbers bear that out — but they do not fully compound if the surrounding protocol reintroduces latency at every turn. Once a coding tool crosses from single-turn completion to multi-step workflow, transport and state reuse stop being back-end details. They become user-facing product quality. OpenAI is effectively acknowledging that the "agent tax" is real and that addressing it is worth significant engineering investment.

For Codex specifically, this matters in two ways. First, Codex CLI and any product built on the Responses API can now inherit the latency improvements if they adopt the WebSocket transport and state-reuse mechanism. Second, the latency improvement changes the economics of agentic workflows. Faster loops mean fewer idle seconds between tool calls, which means more work gets done per billing window, which means the cost-per-useful-output improves even without a model upgrade. That is the kind of compounding gain that matters for teams running agents at scale.
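
The economics can be made concrete with simple arithmetic. The sketch below assumes a hypothetical 5-second baseline turn and a one-hour window, and applies the roughly 40% end-to-end reduction from the post. The function names and the baseline figure are illustrative assumptions, not measured values.

```python
def reduced_latency_ms(turn_ms, reduction_pct):
    """Turn latency after cutting it by reduction_pct percent (integer math)."""
    return turn_ms * (100 - reduction_pct) // 100


def turns_per_window(window_ms, turn_ms):
    """How many back-to-back turns fit in a fixed time window."""
    return window_ms // turn_ms


WINDOW_MS = 3_600_000   # one hour, standing in for a billing window
baseline_ms = 5_000     # assumed end-to-end latency per agent turn

faster_ms = reduced_latency_ms(baseline_ms, 40)
before = turns_per_window(WINDOW_MS, baseline_ms)
after = turns_per_window(WINDOW_MS, faster_ms)
print(f"{before} -> {after} turns per hour ({after / before:.2f}x)")
```

Because throughput is the inverse of latency, a 40% cut in turn time yields about 1.67 times as many turns in the same window (720 versus 1,200 here), not 1.4 times. That nonlinearity is where the compounding comes from.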

The Hacker News thread for this post, as found through HN Algolia search, had very low engagement. That is the expected pattern for transport-layer work: it is important engineering plumbing, not a broad-discussion headline. But the people who care about this are exactly the ones building the agent loops that the protocol improvements are designed for. For them, this is not a minor optimization. It is a structural change in how OpenAI thinks about agent performance versus model performance.

The practical takeaway is to audit where latency actually lives in your own agent stacks. If your application rebuilds too much state on each turn, treats each interaction as isolated, or has multi-step workflows that feel sluggish despite using a capable model, the bottleneck is likely in your state management and transport layer — not in the inference. OpenAI's WebSocket work is a reminder that the next round of agent improvements will come as much from protocol design as from model improvements, and the winners in this category will be the ones who optimize the full loop, not just the brain.
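
A latency audit of this kind can start with very little tooling. The sketch below is one minimal approach using only the Python standard library: a hypothetical PhaseTimer accumulates wall-clock time per named phase of an agent loop. The phase names and the sleep calls are placeholders for whatever your own loop does.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase across many turns."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raises.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Each phase's fraction of total measured time."""
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}


timer = PhaseTimer()
for _ in range(3):                       # three simulated agent turns
    with timer.phase("build_context"):
        time.sleep(0.01)                 # placeholder for state reconstruction
    with timer.phase("inference"):
        time.sleep(0.005)                # placeholder for the model call
    with timer.phase("tool_call"):
        time.sleep(0.002)                # placeholder for running a tool

for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:<14} {share:.0%}")
```

If a phase like context building accounts for more of the loop than inference does, that points at state management and transport rather than the model as the place to optimize.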

Sources: OpenAI, OpenAI Developers