GPT-5.5 on Databricks Is a Reminder That Enterprise Agents Fail on PDFs Before They Fail on Philosophy

Anatoliy Kolodkin

16 May 2026 • 4 min read

Enterprise agents do not usually fail first on philosophy. They fail on PDFs.

They fail on scanned pages, legacy files, tables that lost their structure in 2014, invoices with one smudged digit, contracts with duplicated sections, permissions that hide the important appendix, and document sets where the answer depends on finding the right clause before reasoning about it. That is why OpenAI’s Databricks story around GPT-5.5 is more interesting than the average vendor benchmark victory lap.

OpenAI says GPT-5.5 set a new state of the art on Databricks’ OfficeQA Pro benchmark, a test focused on complex enterprise document-agent workflows: scanned PDFs, long-context files, retrieval, parsing, and grounded reasoning. Databricks reports that GPT-5.5 reached 50% accuracy and reduced errors by 46% versus GPT-5.4. Arnav Singhvi, a Databricks research engineer, is quoted saying, “Codex with 5.5 is now state-of-the-art amongst all the agents and models out there.”

That is a useful result. It is not a victory parade. The two numbers have to be read together: 46% fewer errors is meaningful, but 50% accuracy still means half the answers belong in an escalation queue.

The ugly-document problem is the enterprise-agent problem

Most AI demos choose friendly terrain. Clean prompts. Clean files. Clean APIs. Clean examples where the model can look smart without touching the sludge that makes enterprise software expensive.

Real enterprise document workflows are not like that. They involve old PDFs, scanned images, inconsistent formatting, deeply nested appendices, spreadsheet exports masquerading as reports, contracts assembled across templates, and internal taxonomies that only make sense if you have lived inside the company for five years. The hard part is not merely “answer a question about a document.” The hard part is extracting the right evidence, preserving the details, retrieving the right context, and not hallucinating confidence when one bad parse changes the entire trajectory.

Databricks’ OfficeQA Pro benchmark is pointed at that mess. OpenAI says the largest GPT-5.5 gains showed up in parsing-heavy workflows, where small extraction errors cascade into downstream failures. Singhvi notes that older models such as GPT-5.4 struggled to parse all digits correctly, while GPT-5.5 showed a step-function lift on older documents and scanned PDFs. He also says GPT-5.4 sometimes took “unnecessary search detours,” producing inefficient multi-step trajectories; GPT-5.5 improved orchestration across those tasks.

That phrase — search detours — is doing real work. Enterprise agents do not just answer. They plan, retrieve, parse, call tools, compare sources, and synthesize. A bad parse leads to a bad query. A bad query leads to the wrong document. The wrong document produces a plausible answer. A plausible answer gets copied into a workflow, and now the mistake is not a model error; it is an operational error with a ticket number.
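The cascade is easy to underestimate, so a rough back-of-the-envelope sketch may help. The per-stage success rates below are invented for illustration (they are not measurements from Databricks or OpenAI), and the calculation assumes stages fail independently, which real pipelines rarely do. Even so, six stages that each look fairly reliable multiply out to roughly 66% end to end.

```python
from math import prod

# Hypothetical per-stage success rates. These are illustrative assumptions,
# not figures reported for OfficeQA Pro or any real system.
stage_success = {
    "plan": 0.95,
    "retrieve": 0.90,
    "parse": 0.90,
    "tool_call": 0.95,
    "compare": 0.95,
    "synthesize": 0.95,
}


def end_to_end_success(rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(rates.values())


overall = end_to_end_success(stage_success)
print(f"end-to-end success: {overall:.1%}")
```

Independence is the simplifying assumption here. In practice a bad parse makes a bad retrieval more likely, so the stages are correlated and the multiplication is only a rough guide.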

Fifty percent is progress, not permission

The right way to read the 50% accuracy claim is with sober optimism. If OfficeQA Pro reflects real workflows, a 46% error reduction over GPT-5.4 is significant. It means the frontier moved in a place that matters: messy enterprise inputs, not just synthetic chat tasks. But 50% accuracy is not “turn it loose on finance, legal, healthcare, or support operations.” It is “this may meaningfully reduce human workload if the workflow has review gates, citations, confidence signals, and fallback paths.”

That distinction matters because enterprise AI adoption keeps getting distorted by two bad instincts. Vendors overstate autonomy because it sells. Skeptics dismiss partial progress because it is not autonomous enough. Builders need the middle path: use the model where it reduces toil, measure failure modes, and keep humans in the loop where the cost of being wrong is high.

For document agents, the evaluation unit should be the workflow, not the model response. If your system handles invoices, contracts, claims, compliance evidence, board materials, support histories, or procurement documents, test the whole chain: OCR or extraction, table parsing, retrieval, citation grounding, reasoning, tool calls, and final answer generation. Measure where errors originate. Separate extraction errors from retrieval errors, reasoning errors, and orchestration errors. “The model was wrong” is not a postmortem. It is a shrug in a hoodie.
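One way to make "measure where errors originate" concrete is to record a pass/fail check per stage for each eval case, then attribute each failed case to the earliest stage that broke. The sketch below assumes a simple ordered pipeline. The stage names follow the four error types above, but `CaseResult`, `first_failure`, and the sample case IDs are hypothetical, and how you decide each stage check passed is left to your own graders.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed stage order: an earlier failure is treated as the origin of the miss.
STAGES = ["extraction", "retrieval", "reasoning", "orchestration"]


@dataclass
class CaseResult:
    case_id: str
    stage_ok: dict  # stage name -> True if that stage's check passed


def first_failure(result):
    """Return the earliest failing stage, or None if every stage passed."""
    for stage in STAGES:
        if not result.stage_ok.get(stage, False):
            return stage
    return None


def failure_origins(results):
    """Count failed cases by the stage where the error first appeared."""
    origins = Counter()
    for result in results:
        stage = first_failure(result)
        if stage is not None:
            origins[stage] += 1
    return origins


ALL_OK = {stage: True for stage in STAGES}

# Made-up results for four eval cases.
results = [
    CaseResult("inv-001", dict(ALL_OK)),
    CaseResult("inv-002", {**ALL_OK, "extraction": False, "retrieval": False}),
    CaseResult("con-003", {**ALL_OK, "retrieval": False}),
    CaseResult("con-004", {**ALL_OK, "extraction": False}),
]

origins = failure_origins(results)
for stage in STAGES:
    print(f"{stage}: {origins[stage]}")
```

Attributing to the first failure is a deliberate simplification. It keeps downstream symptoms from being counted as separate causes, at the cost of hiding cases where two stages were independently wrong.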

Databricks is selling the platform layer, not just the model

The distribution path is also telling. Databricks is making GPT-5.5 available through AI Unity Gateway for workflows built with AgentBricks and the Agent Supervisor API. Its docs describe a stack that includes AI Playground prototyping, Knowledge Assistant for managed Q&A over documents, custom agents built with Agent Framework and MLflow, and support for third-party authoring libraries such as LangGraph, LangChain, OpenAI, LlamaIndex, and custom Python implementations.

That is the shape enterprise agents are taking: not one model in a chat box, but a governed platform where model choice, data access, evaluation, deployment, observability, and workflow supervision live close to enterprise data. The boring pieces are the point. If an agent is reading sensitive documents, the organization needs to know which data it accessed, what it cited, what tools it called, how it was evaluated, who approved the output, and whether the same workflow can be replayed after an incident.
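As a minimal sketch of what such a record might hold, the structure below captures the questions in the paragraph above: data accessed, citations, tool calls, evaluation, and approval. It is not a Databricks or MLflow schema. Every field name here is a hypothetical choice, and a real deployment would map these onto whatever its platform already logs.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentRunRecord:
    """Hypothetical audit record for one agent run (illustrative fields only)."""

    run_id: str
    model: str
    documents_accessed: list = field(default_factory=list)
    citations: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    eval_suite: Optional[str] = None   # which evaluation the workflow passed
    approved_by: Optional[str] = None  # None until a human signs off

    def to_json(self):
        """Serialize with stable key order so records can be diffed or replayed."""
        return json.dumps(asdict(self), sort_keys=True)


record = AgentRunRecord(
    run_id="run-0001",
    model="example-model",
    documents_accessed=["contracts/msa.pdf"],
    citations=[{"doc": "contracts/msa.pdf", "page": 12}],
    tool_calls=[{"name": "search", "args": {"query": "termination clause"}}],
    eval_suite="contracts-v1",
)

print(record.to_json())
```

Keeping the record plain and serializable is the point: an incident review needs to reload exactly what the agent saw and did, not reconstruct it from scattered logs.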

The OpenAI page labeling this under Codex is interesting too. This is not coding in the narrow “edit my repository” sense. It is agentic knowledge work over enterprise documents. That suggests Codex is continuing to stretch from a coding product into a broader agent execution and reasoning brand in partner workflows. Developers should expect the boundaries to blur: coding agents, document agents, business-operations agents, and tool-using enterprise agents will share models, runtime patterns, approval flows, telemetry, and governance controls.

There are caveats. The public benchmark details are thin. We do not get enough methodology, dataset composition, cost, latency, failure categories, or examples of hard misses to independently judge the claim. That does not invalidate the result, but it should keep teams from treating the headline number as procurement-grade truth. Benchmarks point at where the market is moving. Your eval determines whether your customers are safe.

The practical move is to build your own OfficeQA-style eval set from real, permission-safe documents. Include ugly scans, old formats, long PDFs, tables, signatures, and cases where one digit changes the business answer. Require citations. Track extraction correctness separately from final-answer correctness. Add human review gates for low-confidence parses and high-impact recommendations. Route expensive frontier models to the hardest cases and use cheaper deterministic or smaller-model paths where the task is well-bounded.
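The routing and review-gate rules in that paragraph can be sketched in a few lines. This is an illustration under assumptions rather than a recommended policy: the `Task` fields, the path labels, and the 0.9 confidence threshold are all hypothetical, and a real threshold would come out of your own eval set.

```python
from dataclasses import dataclass


@dataclass
class Task:
    well_bounded: bool       # e.g. a fixed-format field lookup
    parse_confidence: float  # 0.0 to 1.0, reported by the extraction step
    high_impact: bool        # a wrong answer would be costly


def choose_model_path(task):
    """Send well-bounded work to a cheaper path, everything else to the frontier model."""
    return "cheap_path" if task.well_bounded else "frontier_model"


def needs_human_review(task, min_confidence=0.9):
    """Gate on low-confidence parses and on high-impact recommendations."""
    return task.parse_confidence < min_confidence or task.high_impact


# Made-up tasks to show the two decisions are independent of each other.
tasks = {
    "clean invoice total": Task(True, 0.98, False),
    "smudged scan": Task(False, 0.62, False),
    "contract termination advice": Task(False, 0.97, True),
}

for name, task in tasks.items():
    print(f"{name}: {choose_model_path(task)}, review={needs_human_review(task)}")
```

The two functions are kept separate on purpose. Which model handles a task is a cost decision, while whether a human checks the output is a risk decision, and a high-confidence answer from the best model can still need sign-off.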

GPT-5.5’s Databricks result matters because it points at a less flashy, more consequential frontier: agents that can survive enterprise document sludge. That is where a lot of real money is. It is also where the margin for hand-wavy demos goes to die.

Sources: OpenAI, Databricks AgentBricks docs, OpenAI Codex changelog
