Harness Engineering: Leveraging Codex in an Agent-First World
OpenAI's Harness team set out to answer a pointed question: what does it actually look like to ship a production product without manually writing a single line of code? Over five months, three engineers merged roughly 1,500 pull requests, averaging 3.5 PRs per engineer per day, and produced a million-line codebase now in internal beta with daily users. The tool they used was Codex. The product is real, the users are real, and the lessons are worth studying carefully.
What had to change wasn't just the tooling — it was the mental model of what engineering work is. The Harness engineers stopped writing code and started designing environments. Their highest-leverage output became AGENTS.md files: documents that describe the repo, the conventions, the test expectations, and the intent behind the work. From there, they found that deterministic test gates were non-negotiable; without hard pass/fail criteria, errors compounded across PRs until the codebase drifted. Human review time, not model inference, became the true throughput bottleneck — and protecting it became a first-class engineering concern.
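The "deterministic test gate" idea described above can be sketched as a tiny merge-gate script: every check must exit zero, with no flaky retries and no soft overrides, or the PR is blocked. This is a minimal illustration under assumed conventions, not the Harness team's actual tooling; the `gate` function and the stand-in commands are hypothetical.

```python
# Minimal sketch of a deterministic merge gate (illustrative only).
# A PR merges only if every configured check exits with status 0.
import subprocess
import sys

def gate(checks: list[list[str]]) -> bool:
    """Run each check command; allow the merge only if all pass."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return False  # hard fail: no retries, no human override here
    return True

if __name__ == "__main__":
    checks = [
        [sys.executable, "-c", "print('lint ok')"],   # stand-in for a linter
        [sys.executable, "-c", "print('tests ok')"],  # stand-in for the test suite
    ]
    print("merge allowed" if gate(checks) else "merge blocked")
```

The point of keeping the gate this strict is the one the article makes: without a hard pass/fail boundary, small agent-introduced errors compound across PRs faster than human reviewers can catch them.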
The most striking finding: throughput actually increased as the team grew, suggesting the workflow scales in ways traditional development does not. The headline 10× speed claim is eye-catching, but the more durable insight is the concrete blueprint the experiment offers: what humans keep doing, what they hand off, and how to design the feedback loops that make agent-first development reliable enough to trust in production.