Your Agent's Harness Is a First-Class Optimization Problem — Machine-Discovered Harnesses Beat Hand-Engineered Baselines on TerminalBench-2

Your coding agent's performance ceiling might not be the model — it might be the harness. A new paper, Meta-Harness (Yoon Ho Lee et al.), demonstrates that the scaffolding wrapping an LLM — what information gets stored, retrieved, and presented — is itself a learnable optimization target, not a fixed engineering artifact. The system runs an outer loop that searches over harness code using an agentic proposer with access to the full execution traces of all prior candidates, far richer feedback than any prompt-only optimizer receives. On agentic coding tasks, machine-discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. On retrieval-augmented math reasoning, a single discovered harness improves accuracy by 4.7 points on average across five held-out models. And in online text classification, it beats a state-of-the-art context management system by 7.7 points while using four times fewer tokens.
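The outer loop can be sketched as follows. This is a toy illustration under loose assumptions, not the paper's implementation: here a harness is a bitstring and `propose` is a simple rule that reads failure traces, whereas in Meta-Harness the proposer is itself an LLM agent that rewrites real harness code. The names `Candidate`, `evaluate`, `propose`, and `search` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """A harness candidate: its code plus the full traces from evaluating it."""
    harness_code: str
    score: float
    traces: list = field(default_factory=list)

def evaluate(harness, tasks):
    # Toy stand-in for running the benchmark: the harness is a bitstring,
    # and task i "passes" iff bit i is "1". Real evaluation would execute
    # the harness end-to-end and log complete agent/tool transcripts.
    traces = [{"task": i, "passed": harness[i] == "1"} for i in tasks]
    score = sum(t["passed"] for t in traces) / len(tasks)
    return score, traces

def propose(history):
    # The proposer inspects full execution traces of all prior candidates
    # (not just their final scores) and patches the first observed failure
    # in the best candidate so far.
    best = max(history, key=lambda c: c.score)
    for t in best.traces:
        if not t["passed"]:
            bits = list(best.harness_code)
            bits[t["task"]] = "1"
            return "".join(bits)
    return best.harness_code

def search(initial, tasks, iterations=8):
    # Outer loop: evaluate, record traces, propose, repeat; return the best.
    score, traces = evaluate(initial, tasks)
    history = [Candidate(initial, score, traces)]
    for _ in range(iterations):
        cand = propose(history)
        score, traces = evaluate(cand, tasks)
        history.append(Candidate(cand, score, traces))
    return max(history, key=lambda c: c.score)
```

The design point the sketch preserves is that `propose` conditions on the trace history, so each iteration can target a concrete observed failure rather than mutating blindly.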

The practical implication is significant for every team currently hand-tuning their AGENTS.md, context injection logic, or tool permission architecture: those design choices are not optimal by default, and the "more context equals better results" assumption doesn't hold. Meta-Harness reframes harness design as an optimization problem with a concrete methodology — the agentic proposer finds improvements humans miss because it sees full failure traces, not just final outputs. That 4× token reduction with a quality gain is the clearest evidence yet that the harness, not the model, is often the real bottleneck in agentic coding systems.

Read the full article at arXiv →