Your Web-Augmented Coding Agent Is Being Misled by Bad Search Results — Sherlock Detects and Repairs It Automatically
Most teams running web-augmented coding agents — those that issue live search queries before generating code — have no systematic defense against bad search results quietly degrading output quality. A new paper from Guoqing Wang et al. documents exactly what happens when that defense is absent, and introduces Sherlock, an automated pipeline designed to fix it.
The researchers studied what they call Search-Induced Issues, or SII: failures where the external pages returned by a search API mislead the model into generating incorrect code. Testing across three commercial search APIs and six advanced LLMs, they found that all evaluated web-augmented models are vulnerable. The failures cluster into two types: misaligned specifications, where a retrieved page describes an API differently than its actual contract, and flawed code implementations, where a retrieved page contains buggy example code that the model faithfully reproduces. Both failure modes are difficult to detect through standard testing because the errors are plausible and the root cause is upstream, not in the model itself.
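One way a team might record suspected SII cases against this two-part taxonomy is sketched below. This is a minimal illustration, not the paper's data model: `SIIType`, `SIIRecord`, `count_by_type`, the field names, and the example URLs and notes are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class SIIType(Enum):
    """The two failure types named in the article (identifiers are our own)."""
    MISALIGNED_SPECIFICATION = "misaligned_specification"
    FLAWED_CODE_IMPLEMENTATION = "flawed_code_implementation"


@dataclass(frozen=True)
class SIIRecord:
    """Hypothetical log entry for one suspected search-induced issue."""
    query: str        # search query the agent issued
    page_url: str     # retrieved page suspected of misleading the model
    sii_type: SIIType
    note: str         # free-text explanation from whoever triaged the case


def count_by_type(records: Iterable[SIIRecord]) -> Counter:
    """Tally records per failure type, e.g. for a monitoring dashboard."""
    return Counter(r.sii_type for r in records)


# Illustrative, made-up records (example.invalid is a reserved test domain).
records = [
    SIIRecord("parse iso date", "https://example.invalid/a",
              SIIType.MISALIGNED_SPECIFICATION,
              "page describes a return type the real API does not have"),
    SIIRecord("retry with backoff", "https://example.invalid/b",
              SIIType.FLAWED_CODE_IMPLEMENTATION,
              "example loop never resets its delay"),
]
counts = count_by_type(records)
```

Keeping the failure type as an enum rather than free text makes it easy to track whether one category dominates over time.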
Sherlock operates as a continuous three-stage pipeline rather than a one-time audit. First, it detects potential SII instances by probing the live search surface. Second, it debugs those instances to identify which specific pages are error-inducing and why. Third, it repairs them by annotating misaligned content or replacing erroneous snippets with verified solutions from trusted sources. The reported results are strong: Sherlock identifies error-inducing pages with an F1 score of up to 95% and repairs 71–100% of affected generations across the tested models. The continuous design reflects the central insight: the search surface is a living, adversarial input that needs ongoing monitoring, not a fixed corpus you can validate once.
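The detect, debug, and repair stages can be sketched as a toy loop. This is not Sherlock's implementation: the leave-one-out ablation used for debugging, the `generate` and `passes` callables, and every name below are assumptions made for illustration. The `f1_score` helper shows how a page-identification F1 could be computed from predicted and actual culprit URLs.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Page:
    url: str
    text: str


# Hypothetical interfaces: a model call and a test oracle.
Generate = Callable[[str, List[Page]], str]  # (task, pages) -> generated code
Passes = Callable[[str], bool]               # generated code -> tests pass?


def detect(task: str, pages: List[Page], generate: Generate, passes: Passes) -> bool:
    """Stage 1 (assumed heuristic): fails with the pages, succeeds without them."""
    return not passes(generate(task, pages)) and passes(generate(task, []))


def debug(task: str, pages: List[Page], generate: Generate, passes: Passes) -> List[Page]:
    """Stage 2 (assumed): leave-one-out ablation to find error-inducing pages."""
    culprits = []
    for i, page in enumerate(pages):
        rest = pages[:i] + pages[i + 1:]
        if passes(generate(task, rest)):  # removing this page fixes the output
            culprits.append(page)
    return culprits


def repair(pages: List[Page], culprits: List[Page], trusted_text: str) -> List[Page]:
    """Stage 3 (assumed): swap culprit content for a verified snippet."""
    bad = {p.url for p in culprits}
    return [Page(p.url, trusted_text) if p.url in bad else p for p in pages]


def f1_score(predicted: Set[str], actual: Set[str]) -> float:
    """F1 over sets of URLs flagged as error-inducing."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins: the "model" copies a bad call if any page mentions it.
def fake_generate(task: str, pages: List[Page]) -> str:
    return "BAD" if any("deprecated_call" in p.text for p in pages) else "GOOD"


def fake_passes(code: str) -> bool:
    return code == "GOOD"


pages = [Page("a", "use new_call()"), Page("b", "use deprecated_call()")]
flagged = detect("demo task", pages, fake_generate, fake_passes)
culprits = debug("demo task", pages, fake_generate, fake_passes)
fixed = repair(pages, culprits, "use new_call()")
```

Leave-one-out ablation only isolates a culprit when a single page is responsible; if several pages carry the same error, removing one at a time will not flip the result, so a real system would need a stronger attribution method.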
For any team running MCP servers or tool integrations that fetch external content before code generation, the SII taxonomy and the Sherlock architecture represent a directly applicable quality layer that most current deployments are missing entirely.