GPT-5.4 Thinking Reaches Human-Level Performance on OS Tasks

OpenAI's GPT-5.4 scores 75.0% on the OSWorld-Verified benchmark, surpassing the human expert baseline, and supports desktop navigation and browser control for autonomous AI agents.

OpenAI today announced GPT-5.4, the first AI model to officially surpass human performance on OSWorld-Verified, scoring 75.0% against the human expert baseline of 72.4%. This 27.7-percentage-point increase over GPT-5.2 represents a significant leap in the model's ability to navigate desktop environments through screenshots and keyboard and mouse actions.
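The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above. The implied GPT-5.2 score is derived from them rather than stated in the announcement, and the function names are illustrative.

```python
# Figures quoted in the article (percent of OSWorld-Verified tasks solved).
GPT_5_4_SCORE = 75.0
HUMAN_BASELINE = 72.4
GAIN_OVER_GPT_5_2 = 27.7  # percentage points, as stated in the article


def implied_previous_score(current: float, gain: float) -> float:
    """Score the earlier model must have had, given the stated gain."""
    return round(current - gain, 1)


def margin(score: float, baseline: float) -> float:
    """Difference between a score and a baseline, in percentage points."""
    return round(score - baseline, 1)


implied_gpt_5_2 = implied_previous_score(GPT_5_4_SCORE, GAIN_OVER_GPT_5_2)
margin_over_human = margin(GPT_5_4_SCORE, HUMAN_BASELINE)

print(f"Implied GPT-5.2 score: {implied_gpt_5_2}%")
print(f"Margin over human baseline: {margin_over_human} points")
```

Read this way, the lead over the human baseline (2.6 points) is much smaller than the jump from the previous model.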

The "Thinking" variant uses test-time compute to reason through complex problems before responding, enabling native OS-level file navigation, browser control, and terminal operation with minimal human intervention. Developers can now access these capabilities through the updated computer tool in the API.

On WebArena-Verified, the model reaches a 67.3% success rate, indicating strong browser automation capabilities. With GPT-5.3-Codex integrated, this is the first mainline model to include frontier coding capabilities while maintaining computer use abilities.

What makes this breakthrough significant is the practical implication: AI can now handle complex, multi-step workflows that previously required human oversight. The model's improved steerability combined with its 1M token context window creates the foundation for more sophisticated agentic applications that can truly replace human labor in knowledge work.

However, the real test will be whether OpenAI can translate this impressive benchmark performance into reliable product experiences. The gap between technical capabilities and user-level reliability has long plagued AI automation efforts. A 75.0% success rate means the system still fails on one task in four, which would be unacceptable in many enterprise scenarios.
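To make the reliability concern concrete, here is a rough sketch of what a 75% per-task success rate implies over a batch of tasks. It assumes every task succeeds independently with the same probability, which is a simplification introduced here and not a claim from the announcement. Benchmark tasks may also not resemble real enterprise workloads.

```python
# Per-task success rate taken from the article's OSWorld-Verified figure.
SUCCESS_RATE = 0.75


def expected_failures(num_tasks: int, success_rate: float) -> float:
    """Average number of failed tasks in a batch of num_tasks."""
    return num_tasks * (1 - success_rate)


def prob_all_succeed(num_tasks: int, success_rate: float) -> float:
    """Chance that every task in the batch succeeds.

    Assumption: tasks are independent and share one success rate.
    """
    return success_rate ** num_tasks


print(expected_failures(100, SUCCESS_RATE))  # 25.0 failures per 100 tasks
print(prob_all_succeed(5, SUCCESS_RATE))     # about 0.237 for 5 tasks in a row
```

Under these assumptions, a workflow that chains five such tasks would complete without any failure less than a quarter of the time.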

The competitive landscape is heating up rapidly. Anthropic's Claude Code has already shown strong performance on related benchmarks, while Google DeepMind has been making strides in robotics and multimodal systems. GPT-5.4's human-level performance raises the bar for what users should expect from AI agents.

As AI moves closer to autonomous computer control, the industry faces critical questions about reliability, safety, and the economic impact of automating knowledge work at scale. This is more than technical progress: it marks the beginning of fundamental changes in how humans interact with computers and automation.
