Copilot Code Review Adds Severity and Grouping Because AI Review Noise Is Now the Bottleneck

Anatoliy Kolodkin

13 May 2026 • 6 min read

GitHub’s latest Copilot code review change is not flashy. That is why it is worth paying attention to.

Copilot review comments now carry High, Medium, and Low severity labels. Similar suggestions can be grouped together instead of repeated across a large pull request. Users who have opted into GitHub’s new pull request experience also get an updated suggested-changes UI. On paper, this is a modest comment-experience release. In practice, it is GitHub acknowledging the next bottleneck in AI-assisted software development: not whether an agent can leave review comments, but whether humans can survive the volume.

Automated review has crossed the novelty line. GitHub says that Copilot code review has processed more than 60 million reviews, that it grew 10x in less than a year, and that more than one in five code reviews on GitHub now involve an agent. That is no longer a side feature. It is review infrastructure. And once AI review becomes infrastructure, noise is not an annoyance. Noise is a throughput tax.

The industry spent the first phase of coding agents asking, “Can this write code?” The second phase is messier: “Can this produce work that humans can responsibly review?” GitHub’s severity and grouping changes land squarely in that second phase.

A queue beats a confetti cannon

Severity labels are a primitive. A blunt one, but a necessary one.

A review tool that leaves ten comments with equal visual weight has not finished its job. It has merely moved prioritization back to the human. If one comment identifies a possible auth bypass, another recommends a clearer variable name, and a third suggests a slightly more idiomatic loop, the UI should not treat them as peers. Humans need a queue. Automated reviewers that cannot rank their own findings become confetti cannons: technically productive, operationally irritating.

High, Medium, and Low will not be perfect. They do not need to be. The first job of severity is not to replace human risk judgment; it is to make attention allocation less random. The senior engineer reviewing at 5:45 p.m. before a deploy should know where to look first. The junior engineer learning from review should see that not all comments carry the same architectural or security weight. The team lead trying to understand review load should have something better than a raw comment count.
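To make the queue idea concrete, here is a minimal sketch of ordering comments by severity label. The `ReviewComment` type, its fields, and the `SEVERITY_RANK` table are hypothetical names invented for illustration; they are not a GitHub data model or API.

```python
from dataclasses import dataclass

# Hypothetical ranking: a lower number means "read this first".
SEVERITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


@dataclass
class ReviewComment:
    path: str
    severity: str  # "High", "Medium", or "Low"
    body: str


def triage(comments):
    """Order comments High -> Medium -> Low; unknown labels sort last.

    sorted() is stable, so comments with the same label keep their
    original relative order.
    """
    return sorted(
        comments,
        key=lambda c: SEVERITY_RANK.get(c.severity, len(SEVERITY_RANK)),
    )


queue = triage([
    ReviewComment("utils.py", "Low", "Clearer variable name"),
    ReviewComment("auth.py", "High", "Possible auth bypass"),
    ReviewComment("loop.py", "Medium", "More idiomatic loop"),
])
for comment in queue:
    print(comment.severity, comment.path, comment.body)
```

The ordering only decides where a reviewer looks first. It does not say which comments are safe to ignore.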

Grouping duplicates matters for the same reason. Repetition is the fastest way to train engineers to ignore a tool. If Copilot finds the same naming issue in twelve places, twelve comments may be “accurate” and still be bad review behavior. Grouping turns “AI is noisy” into “AI found a pattern.” That is a much better object for discussion and remediation. Pattern-level feedback can often be fixed mechanically, tracked over time, and encoded into lint rules or repo instructions later.
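A rough sketch of what pattern-level grouping can look like on the consuming side. This is not how Copilot groups suggestions; matching on normalized message text is a deliberately naive assumption, and the `(path, line, message)` tuples are a made-up input shape.

```python
from collections import defaultdict


def group_by_pattern(comments):
    """Collapse comments with the same message into one entry per pattern.

    comments: iterable of (path, line, message) tuples.
    Returns {normalized_message: [(path, line), ...]}.
    """
    groups = defaultdict(list)
    for path, line, message in comments:
        # Naive normalization: trim whitespace and ignore case.
        groups[message.strip().lower()].append((path, line))
    return dict(groups)


# Twelve identical naming comments plus one unrelated finding.
comments = [(f"svc/handler_{i}.py", 10, "Use snake_case for this name") for i in range(12)]
comments.append(("svc/auth.py", 42, "Missing permission check"))

for pattern, locations in group_by_pattern(comments).items():
    print(f"{pattern}: {len(locations)} location(s)")
```

Thirteen comments become two patterns, and the pattern with twelve locations is an obvious candidate for a lint rule.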

This is the direction automated review has to go: fewer comments, better ranked, more pattern-aware, and easier to measure. The valuable reviewer is not the one that speaks most. It is the one that points humans toward the risk they would otherwise miss.

Severity is a hint, not a merge policy

The obvious failure mode is false authority. A High label does not make a finding correct. A Low label does not make it harmless. Automated severity is triage metadata, not governance.

Every serious team will need to calibrate Copilot’s labels against its own risk model. A missing permission check in a low-traffic admin route may be high severity in a company with strict customer-data controls. A naming inconsistency may be low priority in most services but meaningful in a codebase where identifiers map to billing semantics or security scopes. A suggested refactor may look medium-risk to an agent and be unacceptable during an incident freeze.

The right mental model is the same one teams use for static analysis and vulnerability scanners. The tool proposes severity. The team owns severity. Copilot can label the queue, but humans still own merge policy, release timing, and incident history.
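One way to express "the tool proposes, the team owns" is a small override table applied to the tool's label. The sketch below is an assumption about how a team might do this in its own scripts; `SEVERITY_FLOORS`, the path patterns, and `effective_severity` are all hypothetical, and nothing here is a Copilot feature.

```python
from fnmatch import fnmatch

ORDER = ["Low", "Medium", "High"]

# Hypothetical team policy: findings under these paths are never
# treated as less severe than the stated floor.
SEVERITY_FLOORS = [
    ("src/auth/*", "High"),
    ("src/billing/*", "Medium"),
]


def effective_severity(path, tool_severity):
    """Raise the tool's label to the team's floor for this path; never lower it."""
    level = ORDER.index(tool_severity)
    for pattern, floor in SEVERITY_FLOORS:
        # fnmatch's "*" also matches "/" so nested paths are covered.
        if fnmatch(path, pattern):
            level = max(level, ORDER.index(floor))
    return ORDER[level]


print(effective_severity("src/auth/admin/routes.py", "Low"))
print(effective_severity("docs/readme.md", "Low"))
```

The function only raises severity. A team that also wants to lower labels should probably do that through instructions and review, not through a silent mapping.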

GitHub’s docs reinforce that boundary in a subtle but important way. Copilot reviews usually take less than 30 seconds and always leave a “Comment” review, not “Approve” or “Request changes.” They do not count toward required approvals and do not block merges. That is the correct posture. AI review is a first pass, not a substitute for accountability. The moment an automated reviewer can satisfy a required approval gate, the organization needs a much sharper policy than “the bot seemed confident.”

There is another sharp edge. Copilot comments behave like human review comments in the UI (users can react, reply, resolve, and hide them), but replies are visible to humans, not to Copilot, and Copilot does not reply in the review thread. That means the agent is not participating in a persistent debate. If a comment is wrong, resolving it may not teach the reviewer anything in that thread. GitHub also warns that re-reviews may repeat the same comments even if they were resolved or downvoted. Severity and grouping reduce the pain, but they do not magically create institutional memory.

Repo instructions become review policy

The most practical lever teams have is custom instructions. GitHub supports repository-level review guidance through .github/copilot-instructions.md and path-specific files under .github/instructions/**/*.instructions.md. Code review reads instructions from the base branch, and it reads only the first 4,000 characters of any custom instruction file.

That limit is a feature disguised as a constraint. Do not dump a team manifesto into Copilot instructions. Write short rules that map to actual failures. Flag CI weakening. Require tests for bug fixes. Scrutinize permission branches. Watch for duplicate utilities. Pay special attention to workflow changes that interpolate untrusted pull request, issue, or branch content into prompts or shell commands. Call out service-specific invariants that a generic model would not know.

Instructions should be treated like code. They should be reviewed, versioned, and kept tight. If the file becomes a thousand-line folk constitution, the first 4,000 characters will become a lottery ticket and the reviewer’s behavior will be hard to reason about. The strongest instruction files look more like checklists than essays.
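If instructions are code, they can have a check. The sketch below flags instruction files that run past the 4,000-character limit mentioned above, using the two file locations named earlier. The function names are invented, and counting characters with Python's `len()` on UTF-8 decoded text is an assumption about how the limit is measured.

```python
from pathlib import Path

LIMIT = 4000  # characters, per the limit described in the text


def instruction_files(repo_root):
    """Find the repo-wide file and any path-specific instruction files."""
    github_dir = Path(repo_root) / ".github"
    files = []
    main = github_dir / "copilot-instructions.md"
    if main.is_file():
        files.append(main)
    instructions_dir = github_dir / "instructions"
    if instructions_dir.is_dir():
        files.extend(sorted(instructions_dir.rglob("*.instructions.md")))
    return files


def over_limit(repo_root, limit=LIMIT):
    """Return (relative_path, length) for each file longer than the limit."""
    results = []
    for path in instruction_files(repo_root):
        # Assumption: the limit counts decoded characters, not bytes.
        length = len(path.read_text(encoding="utf-8"))
        if length > limit:
            results.append((path.relative_to(repo_root).as_posix(), length))
    return results


if __name__ == "__main__":
    for name, length in over_limit("."):
        print(f"{name}: {length} characters, only the first {LIMIT} would be read")
```

Run as a CI step or pre-commit hook, a check like this makes the limit visible before the tail of a file is silently ignored.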

GitHub’s May 8 changelog also added Copilot code review comment types to the usage metrics API. That matters because teams need feedback loops. If Copilot leaves many low-severity comments that humans rarely accept, tighten instructions or narrow where review runs. If high-severity comments are accepted often, consider adding tests, linters, or security checks that prevent those issues earlier. If grouped duplicates keep appearing, turn the pattern into a rule. Automated review should feed the engineering system, not just decorate pull requests.
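The feedback loop above can start with very little arithmetic. This sketch computes an acceptance rate per severity label from records a team has already collected. The `(severity, accepted)` record shape is hypothetical and is not the schema of GitHub's usage metrics API.

```python
from collections import Counter


def acceptance_by_severity(records):
    """Compute the share of accepted comments for each severity label.

    records: iterable of (severity, accepted) pairs, where accepted is a bool.
    Returns {severity: accepted_count / total_count}.
    """
    total = Counter()
    accepted = Counter()
    for severity, was_accepted in records:
        total[severity] += 1
        if was_accepted:
            accepted[severity] += 1
    return {severity: accepted[severity] / total[severity] for severity in total}


# Illustrative data only, not real measurements.
records = [
    ("Low", False), ("Low", False), ("Low", True), ("Low", False),
    ("High", True), ("High", True),
]
for severity, rate in acceptance_by_severity(records).items():
    print(f"{severity}: {rate:.0%} accepted")
```

A low rate on Low comments argues for tighter instructions. A high rate on High comments argues for catching that class of issue earlier, with a test or a linter.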

The agent PR queue is coming

The biggest strategic implication is not the label in the top-right corner of a comment. It is the loop around the comment.

GitHub lets users accept suggested changes individually or as a group. It also lets users invoke Copilot cloud agent to implement suggestions, creating a new pull request against the branch. That is useful. It is also a volume multiplier. An automated reviewer finds an issue. An automated fixer opens another change. A human now reviews the fix. If the fix creates another comment cycle, the queue grows. Severity and grouping are not just UX polish; they are backpressure controls for a future where review inboxes contain human PRs, agent PRs, agent comments on human PRs, and agent fixes to agent comments.

This is where GitHub’s own “agent pull requests are everywhere” framing becomes important. The company cites a January 2026 study, “More Code, Less Reuse,” arguing that agent-generated code introduces more redundancy and technical debt per change than human-written code, while reviewers can feel better about approving it. That is the uncomfortable part. Agents can make a diff look coherent enough to pass a tired review while quietly adding duplication, weaker abstractions, or missing context. Automated review can help catch that, but only if it is tuned toward architectural risk, reuse, tests, and boundaries — not just surface-level style.

For teams, the playbook is straightforward. Treat Copilot review as the first pass. Let it catch mechanical issues before a human spends attention. Use severity labels to order work, not to skip judgment. Make human reviewers trace the critical path, inspect security boundaries, and look for reuse blindness. Track metrics: comments per PR, high-severity acceptance rate, repeated false positives, review latency, grouped-pattern frequency, and whether Copilot-suggested fixes reduce or increase follow-up review load.

Also decide where AI review should not run, or where its output should be advisory only. Security-sensitive changes, auth flows, migrations, release automation, billing logic, and production infrastructure deserve more than a generic agent pass. Copilot may still be useful there, but the human review standard should be explicit.

This is what mature AI dev tooling looks like: less magic, more queues, labels, metrics, and policy. Good. The problem was never that humans lacked enough comments. The problem was finding the few comments worth acting on before the merge button got bored.

Sources: GitHub Changelog, GitHub Docs, GitHub Blog, GitHub Changelog
