ClickHouse’s Coding-Agent Lessons Are the Best Anti-Hype Argument for Using Agents

Anatoliy Kolodkin

25 May 2026 • 5 min read

The most useful coding-agent story right now is not the one where an AI writes an app from a napkin sketch. It is the one where a serious database company uses agents to grind down flaky tests, merge conflicts, boilerplate, review chores, and CI failures — then still insists the human has to think. ClickHouse’s account of a year with AI coding agents is valuable because it refuses both fashionable lies: agents are not useless, and they are not magic.

The New Stack’s piece is built around a simple adoption curve. Level 1 is chat copy/paste: ask the model a question, paste code, paste an answer back. Level 2 is where agents become operationally interesting: CLI or IDE tools that can read code, run commands, edit files, build, test, and commit. Level 3 is the autonomous loop: isolated environments, specs, feedback, and agents opening pull requests or finding edge cases with less direct supervision. That taxonomy is not revolutionary. It is useful because it maps to risk. Every level gives the model more agency, and every level needs more validation underneath it.

ClickHouse is a good test case because the codebase is not a toy TypeScript demo. It is a large C++ database project with enough complexity to punish shallow confidence. The author says early Claude Code was useful for JavaScript boilerplate and one-off Python scripts but got lost in ClickHouse’s C++ codebase. The named turning point was Claude Opus 4.5 in November 2025, after which agents became useful for daily work on the large C++ codebase. That is a more credible claim than “AI changed everything.” The boundary moved. That is how tools actually get adopted.

The agent works because the CI system has teeth

The number that matters most is not the model version. It is ClickHouse’s validation surface: 20 to 80 million test runs per day, spread across roughly 600 commits and 300 pull requests. In January and February 2026, with agent assistance, the author submitted roughly 700 pull requests fixing tests and CI infrastructure, reducing findings from about 200 per day to 3 to 5 per 10 million test runs. That is not prompt magic. That is an enormous feedback machine turning agent output into reviewable work.

This is the part many teams will miss. Coding agents are more valuable when your engineering practice is already disciplined. Good tests, fuzzing, randomized checks, logs, review norms, and fast CI all increase the safe surface area for agent work. Without that, the same agent becomes a confident patch generator with no reliable judge. The productivity delta is not “we use AI.” It is “we route agent output into systems that can reject bad ideas quickly.”

That makes ClickHouse’s story an anti-hype argument for using agents. The agent did not replace engineering judgment. It compressed repetitive investigation and patch production in categories where correctness could be checked. Flaky-test repair, merge-conflict resolution, log-driven bug investigation, boilerplate, and localized refactors are exactly the kinds of tasks agents can attack because the work is constrained and the result is inspectable. A human still owns the hypothesis, the review, and the blast radius.

The headline lesson for managers is uncomfortable but important: if your CI is weak, agents will expose that weakness faster. They will generate more plausible changes than your process can evaluate. If your tests are slow, flaky, or absent, you have not bought productivity; you have bought a louder queue. The headroom in agent-assisted work is in the validation system, not in the prompt file.

“Agent does, human reviews” is a workflow. “Agent approves” is a smell.

ClickHouse’s examples draw a useful boundary between assistance and abdication. Merge conflicts are a strong agent use case because the task is local and the reviewer can compare intent. Review assistance can also work when the bot catches resource leaks, races, missing edge cases, or inconsistencies across similar files. But the reviewer of record still has to be human, especially when the code under review was generated or heavily modified by another model.

There is a bad workflow hiding one step away: agent writes, agent reviews, agent opens PR, agent comments “LGTM.” That is not automation. That is a closed loop with a confident narrator. The ClickHouse account wisely frames Level 3 autonomous agents as this year’s work, not solved infrastructure. The project has two autonomous agents opening PRs and finding edge cases, but the hard part is not making a bot submit code. The hard part is defining the sandbox, feedback loop, ownership model, and stop conditions that keep autonomous work from becoming autonomous cleanup debt.
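The boundary described here, where an agent may write and comment but a human stays the reviewer of record, could be enforced as a merge gate. The following is a minimal sketch under stated assumptions: the account names and the `can_merge` helper are hypothetical illustrations, not part of ClickHouse's tooling or any real code-hosting API.

```python
# Hypothetical bot account names; a real setup would load these from config.
AGENT_ACCOUNTS = {"review-bot", "autofix-agent"}


def can_merge(author, approvers, agent_accounts=AGENT_ACCOUNTS):
    """Return True only if some approval comes from a human other than the author."""
    human_approvers = {
        login
        for login in approvers
        if login not in agent_accounts and login != author
    }
    return len(human_approvers) > 0


# An agent-written PR approved only by another agent is the closed loop to block.
print(can_merge("autofix-agent", ["review-bot"]))           # False
print(can_merge("autofix-agent", ["review-bot", "alice"]))  # True
```

The gate deliberately ignores agent approvals rather than forbidding them, so a review bot can still leave useful comments without ever counting as sign-off.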

The advice to keep short guidance in CLAUDE.md or AGENTS.md is also more nuanced than it sounds. Repo-specific instructions help, but only when they are short, current, and tested against actual agent behavior. Long instruction files become a second codebase nobody compiles. Overusing negative instructions can backfire because models often perform better with precise positive constraints: use this build command, inspect these directories first, prefer this test target, change generated files only by regenerating them from source, ask before touching migration scripts. Treat agent guidance like operational documentation, not a shrine.

The operator still matters

The most politically awkward point in the piece is also one of the most useful: agents amplify the operator. A senior engineer can use a coding agent to fan out hypotheses, summarize logs, draft a patch, and challenge assumptions. A less experienced developer can follow a plausible false lead because the model sounds certain. The tool does not erase skill gaps; it often makes them more visible.

That does not mean junior developers should be kept away from agents. It means teams need explicit review patterns. Ask for tests before trusting a fix. Require the agent to explain why a change addresses the failing condition, but do not confuse explanation with proof. Compare generated patches against adjacent code manually. Track reverts, review time, acceptance rate, CI failures, and categories where the agent wastes time. If a team cannot say where agents help and where they hurt, it is not managing adoption. It is vibes with a subscription.
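The tracking advice above can be made concrete with a small report that groups agent-authored pull requests by task category. This is only a sketch: the `AgentPR` record, its fields, and the category names are assumptions chosen for illustration, and a real team would fill them from its own code-hosting and CI data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentPR:
    """One agent-authored pull request (all fields are illustrative)."""
    category: str          # e.g. "flaky-test", "refactor"
    merged: bool           # accepted by a human reviewer
    reverted: bool         # reverted after being merged
    ci_failures: int       # failed CI runs before the final state
    review_minutes: float  # human review time spent


def summarize(prs):
    """Group PRs by category and compute simple adoption metrics."""
    groups = defaultdict(list)
    for pr in prs:
        groups[pr.category].append(pr)

    report = {}
    for category, items in groups.items():
        merged = [pr for pr in items if pr.merged]
        reverted = sum(1 for pr in merged if pr.reverted)
        report[category] = {
            "count": len(items),
            "acceptance_rate": len(merged) / len(items),
            # Revert rate is measured against merged PRs only.
            "revert_rate": reverted / len(merged) if merged else 0.0,
            "avg_ci_failures": sum(pr.ci_failures for pr in items) / len(items),
            "avg_review_minutes": sum(pr.review_minutes for pr in items) / len(items),
        }
    return report
```

A category with a low acceptance rate or high review time is a candidate for the list of work that should not be routed to an agent.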

Provider optionality is another practical lesson. ClickHouse recommends keeping at least two model providers available. That is not vendor-neutral theater; it is operational realism. Models regress, rate limits happen, CLIs break, safety filters change, and pricing moves. If agent workflows become part of daily engineering, the agent provider becomes production infrastructure. Treat it that way: fallbacks, budget alerts, version pinning where possible, and a rollback story when a new model gets worse at your codebase.
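One way to picture the fallback part of that advice is a thin wrapper that tries providers in order. The sketch below assumes each provider is wrapped in a plain callable that raises a hypothetical `ProviderError` on rate limits or outages; it does not use any real vendor SDK, and the names are illustrative.

```python
class ProviderError(Exception):
    """Raised by a provider wrapper on rate limits, outages, or unusable output."""


def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return (name, result) from the first success.

    `providers` is an ordered list of (name, callable) pairs. Each callable is
    assumed to wrap one vendor's client and to raise ProviderError on failure.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result makes it possible to log how often the fallback path is taken, which is the kind of signal a budget alert or rollback decision would need.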

For builders, the playbook is straightforward. Start with boring categories: boilerplate, generated config, obvious refactors, merge conflicts, failing-test triage, log summarization, documentation updates, and narrow bug fixes with reproducible failures. Keep humans in charge of architecture, security-sensitive code, migrations, concurrency primitives, and public API contracts until your validation system proves otherwise. Instrument the workflow: label agent-authored PRs, measure review outcomes, record accepted task types, and maintain a list of “do not route to agent” categories. Your goal is not to maximize agent usage. Your goal is to maximize reviewed, correct, maintainable output per engineer-hour.
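The routing rule in this playbook can be written down as a small function. This is a sketch under assumptions: the label names below are hypothetical shorthand for the categories listed above, and the conservative default (anything unrecognized goes to a human) is one possible design choice rather than something the source prescribes.

```python
# Hypothetical labels standing in for the "boring" categories listed above.
AGENT_OK = {
    "boilerplate", "generated-config", "refactor", "merge-conflict",
    "test-triage", "log-summary", "docs", "narrow-bugfix",
}

# Hypothetical labels for work that stays with humans for now.
HUMAN_ONLY = {
    "architecture", "security", "migration", "concurrency", "public-api",
}


def route(task_labels):
    """Return "agent" only when every label is on the allow list; otherwise "human"."""
    labels = set(task_labels)
    if labels & HUMAN_ONLY:
        return "human"
    if labels and labels <= AGENT_OK:
        return "agent"
    # Unlabeled or unrecognized work defaults to a human.
    return "human"
```

A task tagged with both an allowed and a human-only label goes to a human, which keeps a "refactor" that touches a migration from slipping through.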

ClickHouse’s year-one lesson is refreshingly unsentimental. Agents got better. Some work moved from “not worth trying” to “useful every day.” The winners will not be prompt magicians. They will be test-obsessed maintainers with ruthless task selection, strong CI, short instructions, multiple providers, and enough humility to remember that a model can be helpful without being in charge.

Sources: The New Stack, ClickHouse GitHub repository, Anthropic Claude Code docs, GitHub Copilot CLI changelog
