Why Your Multi-Agent Review Pipeline's Gains Aren't What You Think — Decomposing What Actually Happens in the Second Pass
Multi-LLM revision pipelines — where a second model reviews and improves output from a first — are a standard pattern in agentic coding systems. The assumption baked into almost every implementation is that the second pass corrects errors: the reviewer catches what the generator missed. New research runs a controlled decomposition experiment to find out what actually drives the performance improvement, and the answer is more complicated than the assumption suggests.
The gains from two-pass pipelines break into three components: re-solving (the stronger second model would have produced the right answer even without the draft), scaffold (the structure of the draft helps even when its content is wrong), and content (the draft's actual information contributes to the revision). On code generation tasks, the structural scaffolding effect is surprisingly powerful: even semantically null drafts, whose reasoning is incorrect but whose shape is intact, give the reviewer enough structure to work from. But when the first model's reasoning is actively wrong rather than merely incomplete, forcing the reviewer to process that incorrect reasoning can degrade outcomes below what direct re-solving would have achieved.
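One way to make the decomposition concrete is as score differences between ablation conditions. The sketch below is illustrative only: the function names, the content-scrambling scheme, and the attribution arithmetic are assumptions for exposition, not the paper's exact protocol.

```python
def scramble_content(draft: str) -> str:
    """Null the draft's semantics while keeping its line-level structure,
    simulating a 'scaffold-only' condition (illustrative scheme, not the
    paper's actual method)."""
    return "\n".join(
        "# <scrambled>" if line.strip() else "" for line in draft.splitlines()
    )

def decompose_gain(score_base, score_direct, score_scaffold, score_full):
    """Attribute the total two-pass gain over the first model's baseline.

    score_base:     first (weaker) model alone
    score_direct:   stronger model re-solving with no draft
    score_scaffold: stronger model revising a content-scrambled draft
    score_full:     stronger model revising the real draft
    """
    return {
        "re_solving": score_direct - score_base,      # stronger model alone
        "scaffold": score_scaffold - score_direct,    # structure, no content
        "content": score_full - score_scaffold,       # real draft information
    }
```

For example, with pass rates of 50 (base), 60 (direct), 70 (scaffold), and 75 (full), only 5 of the 25-point gain would be attributable to the draft's content, while structure alone would account for 10.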
The practical consequence: routing logic matters. When first-pass confidence is low or the draft quality is poor, bypassing the review step and routing directly to the stronger model can outperform the two-pass pipeline. Always running two passes, regardless of first-pass quality, is not the optimal architecture — it's just the simpler one to implement.
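A minimal sketch of that routing policy, assuming a scalar confidence signal is available for the first-pass draft (the threshold, model interfaces, and `confidence_of` scorer are all hypothetical):

```python
CONFIDENCE_THRESHOLD = 0.6  # illustrative value; tune on held-out tasks

def solve(task, weak_model, strong_model, confidence_of):
    """Route low-confidence drafts straight to the stronger model
    instead of always running the two-pass review."""
    draft = weak_model(task)
    if confidence_of(draft) < CONFIDENCE_THRESHOLD:
        # Poor draft: re-solving directly avoids anchoring the
        # reviewer on incorrect reasoning.
        return strong_model(task)
    # Decent draft: run the standard review/revision pass.
    return strong_model(task, draft=draft)
```

The design choice is that the confidence check gates the *input* to the reviewer, rather than trying to repair a bad revision after the fact.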
For teams building multi-agent code review and revision workflows, this decomposition provides both a diagnostic framework and a concrete design direction: detect low-confidence first-pass output and route it differently, rather than assuming the reviewer will fix whatever the generator gets wrong.