Passing the Verifier Is Not Correctness: Agentic Spec Synthesis Has a False Confidence Problem
Automated formal specification synthesis has been advancing fast, with recent work reporting high verifier pass rates for LLM-generated JML specifications. A new paper asks the uncomfortable follow-up question: does passing the verifier actually mean the specification is correct and complete? Backed by a new evaluation framework called Spec-Harness, the answer is no, and the gap between verifier pass rate and genuine correctness is large enough to matter for any team using LLM-synthesized specifications in its testing pipeline.
Spec-Harness goes beyond verifier pass rates, using symbolic verification to measure whether a specification actually catches real bugs (correctness) and whether it covers the full range of important properties (completeness). A substantial fraction of verifier-accepted specifications, including those produced by optimized prompts, turns out to be incorrect or incomplete in ways the verifier cannot detect. The systematic failure mode is consistent: over-specification of trivial or obvious cases, combined with under-specification of the behaviors that actually matter for correctness.
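To make that failure mode concrete, here is a hypothetical example in the pattern the paper describes (the method, contract, and bug are invented for illustration, not taken from the paper): a JML contract that pins down trivial facts a verifier can easily discharge, while omitting the clauses that define what the method is actually for.

```java
// Hypothetical example of a verifier-acceptable but under-specified JML contract.
// The spec states trivial bounds on the result, but omits the property that
// matters: that a non-negative result actually indexes an occurrence of `key`.
public class Lookup {

    /*@ requires a != null;
      @ ensures \result >= -1 && \result < a.length;  // trivially true, easy to verify
      @ // MISSING: \result >= 0 ==> a[\result] == key;
      @ // MISSING: \result == -1 ==>
      @ //            (\forall int i; 0 <= i && i < a.length; a[i] != key);
      @*/
    public static int indexOf(int[] a, int key) {
        // Buggy implementation: the off-by-one bound skips the last element,
        // yet every return value still satisfies the weak postcondition above.
        for (int i = 0; i < a.length - 1; i++) {
            if (a[i] == key) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] a = {3, 5, 7};
        int r = indexOf(a, 7);
        // The weak spec holds on this run...
        System.out.println("spec holds: " + (r >= -1 && r < a.length));
        // ...but the real bug is missed: 7 is in the array, yet r == -1.
        System.out.println("result: " + r);  // prints "result: -1"
    }
}
```

The contract verifies, the pipeline reports green, and the one behavior the method exists to provide is never checked.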
Prompt optimization makes the problem more visible, not less. Pushing verifier pass rates higher reveals a clear performance ceiling, and at that ceiling, Spec-Harness shows that the remaining failures are systematic rather than random. Teams optimizing for the wrong metric are not just missing correctness; they are building confidence in specifications that will miss the very bugs they were designed to catch.
The implication for agentic coding pipelines that use LLM-generated specifications as test oracles, formal contracts, or property-based test inputs: the verifier is a necessary check, not a sufficient one. Spec-Harness provides a direct template for the additional evaluation layer needed to catch what verifier pass rates hide.
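One cheap approximation of that extra layer, short of full symbolic checking, is a mutation-style adequacy check: execute the postcondition as an oracle against seeded buggy variants and count how many it rejects. A spec that accepts a mutant's output has failed to pin down that behavior. This is a minimal sketch of the idea under that assumption, not Spec-Harness's actual mechanism; the class, method names, and example specs are all invented.

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical adequacy check: run a spec-as-oracle against seeded mutants.
public class SpecAdequacy {

    // Count the mutants rejected by the postcondition on at least one input.
    static int kills(BiPredicate<int[], Integer> post,        // (input, result) -> spec holds?
                     List<Function<int[], Integer>> mutants,  // seeded buggy variants
                     List<int[]> inputs) {
        int killed = 0;
        for (Function<int[], Integer> m : mutants) {
            boolean survived = true;
            for (int[] in : inputs) {
                if (!post.test(in, m.apply(in))) { survived = false; break; }
            }
            if (!survived) killed++;
        }
        return killed;
    }

    public static void main(String[] args) {
        // Two candidate specs for "index of a maximum element".
        BiPredicate<int[], Integer> weakSpec =   // verifier-friendly but trivial
            (a, r) -> r >= 0 && r < a.length;
        BiPredicate<int[], Integer> strongSpec = // states the property that matters
            (a, r) -> {
                if (r < 0 || r >= a.length) return false;
                for (int x : a) if (x > a[r]) return false;
                return true;
            };

        List<Function<int[], Integer>> mutants = List.of(
            a -> 0,                // always returns the first index
            a -> a.length - 1      // always returns the last index
        );
        List<int[]> inputs = List.of(new int[]{1, 9, 2}, new int[]{5, 3, 4});

        System.out.println("weak spec kills:   " + kills(weakSpec, mutants, inputs));   // 0
        System.out.println("strong spec kills: " + kills(strongSpec, mutants, inputs)); // 2
    }
}
```

The point of the sketch is the asymmetry: both specs pass a trivial in-bounds check, but only the one stating the real property has any bug-finding power, which is exactly the distinction verifier pass rates cannot see.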