AI Commits Make Git Hygiene a Production Control, Not a Style Preference

Anatoliy Kolodkin

31 May 2026

AI did not invent bad commits. It made them cheap. That is why the HackerNoon piece on AI agents committing to repositories lands harder than its modest premise suggests. The visible artifact of agentic engineering maturity may not be your prompt library, model router, or eval dashboard. It may be the first page of git log.

Matias Denda’s argument is that AI is a great amplifier. Junior habits become faster junior habits; senior habits become faster senior habits. In Git history, that difference is immediately legible. One pattern is the agent-assisted swamp: “fix stuff”, “more changes”, “update”, “fix tests”, “wip”. The other is structured work: “feat(search): add cursor-based pagination for users endpoint”, “perf(search): replace LIKE '%q%' with full-text index”, “test(search): add edge cases for empty query and unicode input”. The tools come from the same category (Claude Code, Cursor, GitHub Copilot, Windsurf), yet the repos they leave behind have a radically different aftertaste.

The surface lesson is “write better commit messages.” The deeper lesson is that commit hygiene has become a production control.

The bottleneck moved from generation to comprehension

Before coding agents, a messy 1,000-line “fix stuff” commit was at least constrained by human typing, copying, and fatigue. A developer could still make a mess, but the mess had friction. Now an agent can produce a broad diff, revise it, add tests, and pass a happy path before anyone has seriously read the change. The failure mode is not that the code is obviously broken. The failure mode is that the code is plausible enough to merge and too poorly structured to understand later.

That matters because Git is not just a backup system. It is the operating manual for future debugging. git log -p lets reviewers inspect the patch introduced by each commit. git log --stat shows which files each commit touched and how many lines it added or removed. git bisect, per Git’s own documentation, uses binary search to find the commit that introduced a bug. All of those tools are only as useful as the commits they walk over, and bisect in particular treats every commit as a test point: it works best when each one builds, runs, and carries a single intent. If every AI-assisted change lands as one blob mixing refactor, behavior change, dependency upgrade, generated tests, and formatting, bisect becomes theater and revert becomes surgery.
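
A minimal sketch of the binary search behind git bisect shows why commit quality matters. It is written in Python for illustration only; the function name and the is_bad predicate (a stand-in for “check out this commit, build it, and run the failing test”) are hypothetical, and real git bisect also handles skipped commits and non-linear history, which this sketch ignores.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumes, as git bisect does, that commits[0] is known good, commits[-1]
    is known bad, and the history flips from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the bug is at mid or earlier
        else:
            lo = mid  # the bug arrived after mid
    return commits[hi]


if __name__ == "__main__":
    history = ["a1", "b2", "c3", "d4", "e5", "f6"]  # oldest to newest
    broken = {"d4", "e5", "f6"}                      # the bug landed in d4
    print(first_bad_commit(history, lambda c: c in broken))  # prints d4
```

With six commits the search needs two test runs to land on d4. But bisect can only ever return a whole commit: if d4 is a 1,000-line blend of refactor, behavior change, and formatting, the search ends exactly where the real debugging starts.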

The repo does not care whether the mess came from a rushed junior, a senior having a bad day, or an agent following a vague prompt. Six weeks later, the question is the same: can we tell what changed, why it changed, and whether we can safely undo it?

Commit structure should happen before the prompt

The strongest practical idea in the original piece is that senior AI-assisted developers do not ask for “the feature” and then clean up whatever comes back. They decompose the work before the model starts writing. Refactor the query builder first. Commit it. Add cursor pagination next. Commit it. Add edge-case tests. Commit them. Optimize the search path. Commit it. Each step gives the agent a narrower target and gives reviewers a smaller unit of intent.
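
That decomposition can be written down as an explicit commit plan before any prompt is sent. The Python below is a hypothetical representation (CommitStep and SEARCH_PLAN are illustrative names, not part of any agent framework); three of the subjects are the article’s own examples, and the wording of the refactor step is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitStep:
    kind: str         # exactly one Conventional Commit type per step
    scope: str        # the area of the codebase the step stays inside
    description: str  # imperative summary of the step's single intent

    def subject(self) -> str:
        return f"{self.kind}({self.scope}): {self.description}"


# The search work, decomposed before the agent writes any code.
# Hypothetical refactor wording; the other three subjects come from the article.
SEARCH_PLAN = [
    CommitStep("refactor", "search", "extract query builder"),
    CommitStep("feat", "search", "add cursor-based pagination for users endpoint"),
    CommitStep("test", "search", "add edge cases for empty query and unicode input"),
    CommitStep("perf", "search", "replace LIKE '%q%' with full-text index"),
]

if __name__ == "__main__":
    for step in SEARCH_PLAN:
        # In a real workflow each step is: prompt the agent, run tests, commit.
        print(step.subject())
```

Keeping one type per step is what lets a reviewer treat the refactor as mechanical and the feat as a contract change.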

This is not bureaucratic purity. It is how teams preserve review quality under higher generation speed. A reviewer can reason about a mechanical extraction differently from a behavior change. They can review tests for coverage without also mentally parsing a new API contract. They can revert a performance experiment without losing unrelated cleanup. The commit boundary becomes an engineering affordance, not a stylistic preference.

Teams should encode this into agent prompts and workflows. “Propose a commit plan before editing.” “Do not mix refactor and behavior change.” “Keep commits bisectable.” “Run tests after each logical unit.” “Use Conventional Commit subjects with scope.” “If the implementation requires touching more than N files, pause and ask for plan approval.” These instructions are not magic, but they push the agent toward reviewable work rather than impressive blobs.
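
Some of these rules can be checked mechanically instead of relying on the prompt alone. A minimal sketch, assuming a Conventional Commits header with a mandatory scope and a hypothetical threshold of 8 files standing in for the article’s N; the accepted type list is a common convention, not the complete specification.

```python
import re

# "type(scope): description", with an optional "!" for breaking changes.
SUBJECT_RE = re.compile(
    r"^(feat|fix|refactor|perf|test|docs|build|ci|chore|style|revert)"
    r"\([a-z0-9._-]+\)!?: \S.*$"
)
MAX_FILES_WITHOUT_APPROVAL = 8  # hypothetical value for the article's "N"


def check_planned_commit(subject: str, files_touched: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the planned commit passes."""
    problems = []
    if not SUBJECT_RE.match(subject):
        problems.append(f"subject is not 'type(scope): description': {subject!r}")
    if len(files_touched) > MAX_FILES_WITHOUT_APPROVAL:
        problems.append(
            f"{len(files_touched)} files touched; pause and ask for plan approval"
        )
    return problems
```

Wired into a commit-msg hook or a CI step, a non-empty result would stop the agent and return the decision to a human.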

Agent frameworks should help here instead of optimizing for demo velocity. A coding agent that opens a PR should expose a plan, show scope boundaries, and optionally stage commits separately. It should attach provenance: the original task, prompts or plan steps, files touched, tests run, tool calls, assumptions the agent made, and unresolved questions. If the tool can generate code but cannot explain the unit of change, it is exporting cognitive debt to reviewers.
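
One way to make that provenance concrete is a structured record rendered into the PR description. The Provenance class and its field names are hypothetical, mapped one-to-one from the list in the paragraph; no particular agent framework or PR template is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Provenance:
    """Hypothetical provenance record an agent could attach to a PR it opens."""
    task: str
    prompts_or_plan: list[str]
    files_touched: list[str]
    tests_run: list[str]
    tool_calls: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def to_pr_body(self) -> str:
        sections = [
            ("Plan", self.prompts_or_plan),
            ("Files touched", self.files_touched),
            ("Tests run", self.tests_run),
            ("Tool calls", self.tool_calls),
            ("Assumptions", self.assumptions),
            ("Unresolved questions", self.open_questions),
        ]
        lines = [f"Task: {self.task}", ""]
        for title, items in sections:
            lines.append(f"## {title}")
            if items:
                lines.extend(f"- {item}" for item in items)
            else:
                # An explicit "none" makes a missing test record visible to reviewers.
                lines.append("- none")
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"
```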

Measure the repo, not the vibes

Denda suggests concrete checks: count commits per merged PR with gh pr list --state merged --limit 100 --json commits --jq '.[] | .commits | length', measure average commit-subject length with git log --since="3 months ago" --pretty=%s | awk ... (the %s placeholder prints only the subject line), and inspect commit-size distribution with git log --since="3 months ago" --numstat. These checks are useful because AI adoption should leave operational fingerprints.
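
The same numbers can be computed in a script once the raw command output is captured. A minimal Python sketch, assuming the gh command is run without the --jq filter so that it emits the JSON array, and computing the subject-length average directly rather than reproducing the awk step; the function names are hypothetical.

```python
import json


def commits_per_pr(gh_json: str) -> list[int]:
    """Commit count per PR from `gh pr list --state merged --limit 100 --json commits`.

    Equivalent to the --jq filter '.[] | .commits | length'.
    """
    return [len(pr["commits"]) for pr in json.loads(gh_json)]


def average_subject_length(log_output: str) -> float:
    """Mean subject length from `git log --since="3 months ago" --pretty=%s` output."""
    subjects = [line for line in log_output.splitlines() if line.strip()]
    if not subjects:
        return 0.0
    return sum(len(s) for s in subjects) / len(subjects)
```

In practice the inputs would be the stdout of each command, for example captured with subprocess.run(..., capture_output=True, text=True).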

A healthy AI-assisted team may show more total commits, but it should not show larger, undifferentiated ones. PRs in the 2–6 commit range are often easier to review than one giant generated patch. Vague subject frequency should fall, not rise. Test-only commits should be visible. The revert rate should not spike from changes nobody understood. Bisect should still work. If the repository gets harder to operate after agents arrive, the productivity gain is probably being borrowed from future incident response.
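
Vague subjects and reverts can be approximated from subject lines alone, and the 2–6 range from per-PR commit counts. In this sketch the vague-subject rule (at most three words, at least one vague token) is an assumed heuristic, not a standard. Revert detection relies on the default 'Revert "<subject>"' message that git revert proposes, so hand-edited revert messages will be missed.

```python
import re

VAGUE_TOKENS = {"fix", "fixes", "update", "updates", "changes", "wip", "stuff"}


def is_vague(subject: str, max_words: int = 3) -> bool:
    """Heuristic: a short subject built around a vague token says little about intent."""
    words = re.findall(r"[a-z]+", subject.lower())
    return len(words) <= max_words and any(w in VAGUE_TOKENS for w in words)


def is_revert(subject: str) -> bool:
    # git revert proposes 'Revert "<original subject>"' as the default message.
    return subject.startswith('Revert "')


def repo_health(subjects: list[str], commits_per_pr: list[int]) -> dict[str, float]:
    n_subjects = max(len(subjects), 1)
    n_prs = max(len(commits_per_pr), 1)
    return {
        "vague_subject_rate": sum(map(is_vague, subjects)) / n_subjects,
        "revert_rate": sum(map(is_revert, subjects)) / n_subjects,
        "prs_with_2_to_6_commits": sum(2 <= c <= 6 for c in commits_per_pr) / n_prs,
    }
```

Tracked over time rather than read as absolute scores, these ratios show whether agent adoption is moving the history toward or away from legibility.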

Add lightweight repo-health checks to your engineering dashboard. Track median lines changed per commit, median files touched per commit, the share of subjects built on vague tokens like “fix,” “update,” “changes,” and “wip,” commits per PR, review cycle time, revert frequency, and whether AI-authored PRs include a clear test record. This is not about policing developers for using AI. It is about detecting whether agents are making the codebase more legible or merely increasing merge throughput.
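
The two size medians can be derived from git log --numstat output. A minimal sketch, assuming git’s default log format (each commit starts with a “commit <hash>” line and message lines are indented); commit_sizes and size_summary are hypothetical names.

```python
import re
from statistics import median

# Default `git log` output starts each commit with "commit <hash>"; message lines
# are indented, so they cannot match either pattern below.
COMMIT_RE = re.compile(r"^commit ([0-9a-f]{7,64})")
# --numstat lines are "<added>\t<deleted>\t<path>"; binary files show "-".
NUMSTAT_RE = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")


def commit_sizes(numstat_log: str) -> dict[str, tuple[int, int]]:
    """Map each commit hash to (files touched, lines added + deleted)."""
    sizes: dict[str, tuple[int, int]] = {}
    current = None
    for line in numstat_log.splitlines():
        if m := COMMIT_RE.match(line):
            current = m.group(1)
            sizes[current] = (0, 0)
        elif current is not None and (m := NUMSTAT_RE.match(line)):
            files, changed = sizes[current]
            # Binary files count as touched files but contribute no line counts.
            added, deleted = (0 if x == "-" else int(x) for x in m.group(1, 2))
            sizes[current] = (files + 1, changed + added + deleted)
    return sizes


def size_summary(sizes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Medians for the dashboard; zeros for an empty history."""
    if not sizes:
        return {"median_files_per_commit": 0.0, "median_lines_per_commit": 0.0}
    return {
        "median_files_per_commit": median(f for f, _ in sizes.values()),
        "median_lines_per_commit": median(n for _, n in sizes.values()),
    }
```

On a real repository, adding --no-merges to the git log call keeps merge commits (which print no numstat lines by default) from adding zero-size entries. The median alone can hide a few giant generated patches, so the tail of the distribution is worth plotting too.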

There is also a governance boundary worth keeping. GitHub’s Copilot code review leaves comments; it does not approve or request changes, and its feedback does not satisfy required human approvals. That is the right shape. AI can assist review, point out likely issues, and summarize diffs. Humans still own architecture, risk acceptance, rollout judgment, and merge authority. If one agent writes a blob and another agent rubber-stamps it, the team has automated the appearance of review rather than the substance.
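
The same boundary can be expressed as a merge-gate policy. This sketch assumes review records shaped like GitHub’s REST API pull-request reviews (a state string such as APPROVED, CHANGES_REQUESTED, COMMENTED, or DISMISSED, and a user object with login and type), listed in chronological order. It illustrates a stricter policy and is not a replacement for branch protection: bot accounts never count, and any outstanding human change request blocks the merge.

```python
def has_human_approval(reviews: list[dict]) -> bool:
    """True if a human approved and no human's latest verdict requests changes."""
    latest: dict[str, str] = {}
    for review in reviews:
        user, state = review["user"], review["state"]
        if user.get("type") == "Bot":
            # Agent or bot reviews can inform humans but never count toward merge.
            continue
        if state in ("APPROVED", "CHANGES_REQUESTED"):
            latest[user["login"]] = state
        elif state == "DISMISSED":
            latest.pop(user["login"], None)
        # COMMENTED reviews leave a reviewer's previous verdict unchanged.
    verdicts = set(latest.values())
    return "APPROVED" in verdicts and "CHANGES_REQUESTED" not in verdicts
```

Under this policy a comment-only AI review changes nothing, and an approval from another agent’s account is ignored, which is the rubber-stamp case the paragraph warns about.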

The forward-looking take: Git history is going to become one of the simplest ways to spot whether agentic engineering is working. Mature teams will use agents to produce smaller, cleaner, more explainable units of change. Immature teams will use agents to produce bigger piles faster and call the pile velocity. The repo will not be fooled. It never is.

Sources: HackerNoon, Dev.to mirror, Git documentation, git bisect docs, GitHub Copilot code review docs
