
GitHub's Availability Postmortem Is Really About How Agentic Workflows Broke Their Capacity Planning

Anatoliy Kolodkin

28 Apr 2026 • 4 min read

GitHub published an availability update on April 28 that is more technically honest than most outage postmortems. The short version: two recent incidents were caused by the sharp acceleration of agentic development workflows since the second half of December 2025, and GitHub had to redesign its capacity targets from 10X to 30X growth. The longer version is a detailed explanation of how autonomous coding agents, automated PR workflows, and AI-driven repository automation are creating load patterns that traditional per-seat-per-request models never had to handle.

This is not a story about GitHub having a bad week. It is a story about the operational foundation the AI coding market is building on, and about that foundation being retrofitted under load. That is worth paying attention to.

What the Post Actually Says

GitHub started a capacity expansion plan in October 2025 targeting a 10X improvement. By February 2026, they had to revise that target upward to 30X because agentic workflow acceleration was outrunning their models. Let that sink in: four months to go from 10X to 30X headroom. That is not organic growth. That is a step change in usage pattern, driven by the sudden shift in AI coding's default mode from autocomplete to agentic execution.

The compound-load description in the post is the most technically instructive part. "A pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic." That sentence is a circuit diagram of how one AI-assisted action becomes many distributed system events. When those actions are driven by autonomous agents that retry, watch CI, rebase, and file follow-up issues, the retry amplification and queue depth effects are exactly what you would predict from queuing theory.
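The retry-amplification and queue-depth effects can be made concrete with a small sketch. This is an illustration under textbook assumptions, not a model of GitHub's systems: it treats a service as a single M/M/1 queue and assumes each attempt fails independently with a fixed probability. The function names and all numbers are hypothetical.

```python
import math


def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected attempts per logical request.

    Assumes each attempt fails independently with probability failure_prob
    and the client retries up to max_retries times (a simplification: in
    real systems the failure rate usually rises with load).
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Geometric series: 1 + p + p^2 + ... + p^max_retries
    return sum(failure_prob ** k for k in range(max_retries + 1))


def mm1_mean_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean number of requests in an M/M/1 system: rho / (1 - rho)."""
    if service_rate <= 0:
        raise ValueError("service_rate must be positive")
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        return math.inf  # the queue grows without bound
    return rho / (1.0 - rho)


if __name__ == "__main__":
    logical_rate = 80.0   # hypothetical logical requests per second
    service_rate = 100.0  # hypothetical capacity in requests per second

    # Offered load once retries are counted as extra traffic.
    amplified_rate = logical_rate * expected_attempts(0.2, 3)

    print("no retries:  ", mm1_mean_in_system(logical_rate, service_rate))
    print("with retries:", mm1_mean_in_system(amplified_rate, service_rate))
```

With these illustrative numbers, retries raise the offered load from 80 to about 99.8 attempts per second, and the mean number of requests in the system goes from about 4 to several hundred. The point is the nonlinearity: near saturation, a modest increase in traffic produces a very large increase in queue depth.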

The April 23 incident involved a regression in merge queue operations that affected squash merges when a merge group contained more than one PR. Details of the second incident were not available in the portion of the post reviewed here, but the pattern is consistent: services that worked fine for human-driven PRs are breaking under the load profile that autonomous agents create.

The Engineering Response and What It Tells You

GitHub's short-term fixes are the right moves for the problem they are describing. Moving webhooks out of MySQL, redesigning session cache, reworking auth and authorization to reduce database load, leveraging Azure migration for more compute — these are all reasonable responses to a database-heavy workload that is growing faster than expected. The Ruby-to-Go migration of performance-sensitive code paths is the kind of unsexy infrastructure work that makes a real difference at scale.

Isolation is the other major theme. Separating Git and GitHub Actions from other workloads to minimize blast radius is correct — when your AI coding tools depend on Actions being fast and reliable, a noisy neighbor problem in shared infrastructure is not an acceptable failure mode. Multi-cloud architecture for resilience and low latency is a longer-term bet that reflects how serious this problem is inside GitHub's engineering org.

What this response also tells you is that GitHub was not designed for this load profile. Git is a version control system. GitHub is a platform built on top of Git. Neither was architected to be the control plane for autonomous coding agents that trigger merge queues, watch CI, rebase, and file issues. That is not a criticism — nobody designed for this in 2025. It is a description of the retrofit problem that every serious AI coding tool is now running into.

The Most Important Sentence in the Post

GitHub says: "We will have a separate blog post soon describing extensive work we have done and the new upcoming API design for greater efficiency and scale."

That is the sentence that matters most for anyone building agentic workflows today. GitHub is preparing an API change significant enough to warrant its own announcement. If you are building workflows that depend on current API behavior around the merge queue, PR operations, or Actions triggers, that future post is worth reading the day it ships. The current API is being redesigned under active load, which means behavioral changes are likely.

The Connection to the Billing Story

The timing of this post, one day after the Copilot billing announcement, is not coincidental. GitHub told users on April 27 that agentic usage was becoming the default. They told them on April 28 that agentic usage is the primary driver of availability problems. These are two halves of the same story: the AI coding market moved faster than the infrastructure underneath it, and both the cost model and the reliability model are being retrofitted simultaneously.

The community discussion on GitHub's own Agentic Workflows technical preview is worth reading alongside this. Practitioners are already engaging with the guardrails, permissions, and fail-closed questions that this post raises. One commenter asks whether agentic workflows should fail closed by default when an external authority cannot confirm execution is allowed. That question turns out to be directly relevant to the availability problem GitHub is describing: more autonomous agents touching more repository surfaces means more failure modes that cascade across services. The answer to "should this fail open or fail closed?" has direct implications for load, retry behavior, and queue depth.
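To make the fail-open versus fail-closed distinction concrete, here is a minimal sketch of a gate an agent might consult before acting. Everything in it is hypothetical: `check_authority` stands in for whatever external approval service a team uses, and nothing here reflects an actual GitHub API.

```python
from typing import Callable


def is_execution_allowed(
    check_authority: Callable[[], bool],
    fail_closed: bool = True,
) -> bool:
    """Decide whether an agent may proceed.

    check_authority is a hypothetical callable that returns True or False
    when the external authority answers, and raises TimeoutError or
    ConnectionError when it cannot be reached.
    """
    try:
        return check_authority()
    except (TimeoutError, ConnectionError):
        # The authority could not confirm either way.
        # Fail closed: refuse to act. Fail open: proceed anyway.
        return not fail_closed


def unreachable_authority() -> bool:
    """Simulates an approval service that is down."""
    raise TimeoutError("authority did not respond")


if __name__ == "__main__":
    print(is_execution_allowed(unreachable_authority, fail_closed=True))
    print(is_execution_allowed(unreachable_authority, fail_closed=False))
```

Note that this gate makes exactly one call to the authority and does not retry. A fail-closed gate that retried aggressively would add load during the very outage that made the authority unreachable, which is where this policy question meets the load question.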

What This Means for Teams Building on GitHub

If you run autonomous coding workflows that depend on GitHub Actions, merge queues, or PR automation, a few things follow. First, the infrastructure is under active stress. GitHub is doing the engineering work, but capacity expansion at 30X is not instantaneous. There will be rough periods. Second, the API behavior your workflows depend on is subject to change. The forthcoming API post is worth tracking closely. Third, the blast radius of failures in agentic workflows is larger than the blast radius of failures in human-driven workflows, because retries amplify and queues deepen in ways that human-driven PRs do not. Plan for that.
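One common way to plan for retry amplification on the client side is capped exponential backoff with jitter and a hard limit on attempts. The sketch below is a generic illustration, not something the GitHub post prescribes. `TransientError`, `call_with_backoff`, and the default values are hypothetical names and numbers chosen for the example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Run operation, retrying on TransientError with full jitter.

    Makes at most max_retries + 1 attempts. Before retry number n
    (counting from 0) it sleeps a random time between 0 and
    min(max_delay, base_delay * 2**n) seconds.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Randomizing the wait spreads out clients that failed together.
            sleep(rng.uniform(0.0, ceiling))
    raise AssertionError("unreachable")
```

The hard limit on attempts bounds how much extra traffic a single failing workflow can generate, and the jitter keeps many agents that failed at the same moment from retrying in lockstep.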

The broader lesson is one that the AI coding market needs to internalize: the infrastructure for autonomous agents is being built by retrofitting systems that were not designed for the load profile those agents create. GitHub is not unique in this. Every platform that hosts AI coding workflows is dealing with similar capacity and reliability challenges. The difference is that GitHub just published the clearest available explanation of what that problem looks like from the inside.

Sources: GitHub Blog
