Vision2Web: The First Benchmark That Measures Whether Coding Agents Can Actually Build a Real Website End-to-End

Vibe coding's most common real-world use case — "build me a website" — has had no rigorous benchmark until now. Vision2Web, accepted at ICML and built by Zehai He et al. at Zhipu AI, fills that gap with 193 tasks, 918 prototype images, and 1,255 test cases drawn from real-world websites across 16 categories. For the first time, engineering teams and researchers have a standardized way to measure whether a coding agent can actually deliver on the promise of front-end and full-stack generation from visual specs.

The benchmark is structured as three escalating levels of complexity. Level 1 covers static UI-to-code generation: given a screenshot or wireframe, render it faithfully as HTML and CSS. Level 2 requires interactive multi-page frontend reproduction — client-side logic, routing, and state management. Level 3 pushes into long-horizon full-stack development with backend, database, API integration, and deployment. This hierarchy isn't just for difficulty progression; it's diagnostic. A model that passes Level 1 but fails Level 2 has a specific weakness in dynamic interaction handling, not in general code generation — which points directly at where training data or fine-tuning effort is needed.

Vision2Web's most novel contribution is its evaluation methodology. Rather than a single pass/fail metric, it combines a GUI agent verifier (does the site actually work?) with a VLM-based judge (does it look right?). This dual-verification approach reflects the dual nature of web development quality: functional correctness and visual fidelity are both necessary, and neither one is a proxy for the other. The results are sobering — all evaluated state-of-the-art models show substantial performance gaps at every level, with full-stack development remaining well beyond current agent capabilities.

Vision2Web's workflow-based verification paradigm, a GUI agent plus a VLM judge evaluating together, is also extractable as a standalone pattern: any team running coding agents on front-end tasks without a principled way to measure output quality can adopt it.
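To make the dual-verification idea concrete, here is a minimal sketch of how a team might combine the two signals. All names and thresholds are hypothetical illustrations, not the paper's actual scoring rules: it assumes the GUI agent's output is reduced to a test-case pass rate and the VLM judge's output to a fidelity score, both normalized to [0, 1].

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    # Hypothetical fields, not Vision2Web's actual schema:
    functional: float  # fraction of GUI-agent test cases passed (0..1)
    visual: float      # VLM-judge visual-fidelity score (0..1)

def dual_verify(result: EvalResult,
                func_threshold: float = 0.9,   # invented threshold
                vis_threshold: float = 0.8) -> bool:
    """A task passes only if BOTH checks clear their thresholds.

    The point of the pattern: neither score can compensate for the
    other, mirroring the claim that functional correctness and
    visual fidelity are each necessary and neither is a proxy.
    """
    return (result.functional >= func_threshold
            and result.visual >= vis_threshold)

# A site that works perfectly but looks wrong still fails:
print(dual_verify(EvalResult(functional=1.0, visual=0.5)))  # False
```

The conjunction (rather than, say, a weighted average) is the design choice worth copying: averaging would let a pixel-perfect but broken page score the same as a moderately good one, which is exactly the ambiguity the dual-verifier setup is meant to eliminate.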

Read the full paper on arXiv →