The Four Layers of AI Agent Development That Actually Work

Anatoliy Kolodkin

02 May 2026 • 6 min read

Most developers who picked up AI coding tools in the past eighteen months have had the same experience. It works beautifully — until it doesn't. You start a project, you prompt, you iterate, you ship. Then the codebase grows. More services, more state, more API surface. The agent starts making decisions that contradict choices from three sessions ago. You spend more time correcting than building. The tool that was supposed to make you faster starts generating work you didn't ask for.

A developer named Jeffrey Reese recently published a framework on dev.to that articulates why this happens and what to do about it. The short version: most teams have the wrong architecture for how they work with AI agents. They have behavioral guardrails but no actual specification. And without a spec, the agent is improvising — which works until the project is complex enough that improvisation produces contradictions.

The four layers that separate systems that scale from chaos

Reese's framework — inspired by and building on Andrej Karpathy's "From Vibe Coding to Agentic Engineering" talk at Sequoia AI Ascent 2026 — identifies four distinct layers in a well-functioning AI agent setup. Most teams have only the top two.

Layer 1 is the spec: the architecture document, the API contracts, the data models, the user flows. This is what the agent builds from. If the spec is vague, the output is vague. Simple. Obvious. And rarely done well by developers who are in a hurry to start building.

Layer 2 is the workflows around the spec: the skills and processes that generate the spec, that reference it when building, that check for drift when architecture changes. The spec is the artifact; the workflows are the muscle that keeps it current and relevant.
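
One way to picture the drift-checking workflow is a plain set comparison between what the spec declares and what the repository actually contains. This is a minimal sketch, not Reese's tooling: `find_drift` and the module names are hypothetical, and a real check would compare far more than top-level module names.

```python
def find_drift(spec_modules: set[str], actual_modules: set[str]) -> dict[str, list[str]]:
    """Compare modules named in the spec with modules found in the repo."""
    return {
        # Present in the code but missing from the spec:
        # the agent has no guidance for these.
        "undocumented": sorted(actual_modules - spec_modules),
        # Present in the spec but gone from the code:
        # the agent is being told about things that no longer exist.
        "stale": sorted(spec_modules - actual_modules),
    }


# Hypothetical example: the spec predates a new "notifications" module
# and still lists a "legacy_reports" module that has been deleted.
drift = find_drift(
    spec_modules={"auth", "billing", "legacy_reports"},
    actual_modules={"auth", "billing", "notifications"},
)
print(drift)
```

Either list being non-empty is a prompt to update the spec before the next session, not after the agent has already built on the stale version.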

Layer 3 is behavioral config: what most people mean when they say "CLAUDE.md" or equivalent. This is code conventions, testing requirements, commit message format, tone. Guardrails. Important. But fundamentally different from the spec: behavioral config tells the agent how to act, not what to build.

Layer 4 is mechanical enforcement: hooks that fire automatically on events — before commit, after file edit, when a session starts. The difference between a suggestion and a gate.
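
To make the suggestion-versus-gate distinction concrete, here is a rough sketch of the kind of check a pre-commit hook could run. Everything in it is an assumption for illustration: `check_staged_files`, `GateResult`, and the rule itself (every staged file must live under a module the spec declares) are hypothetical, not part of Reese's framework or any specific tool.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class GateResult:
    allowed: bool
    violations: list[str]


def check_staged_files(staged: list[str], spec_modules: set[str]) -> GateResult:
    """Flag staged files whose top-level directory is not declared in the spec.

    Note: under this deliberately strict rule, a top-level file such as
    "README.md" also counts as a violation unless it is listed explicitly.
    """
    violations = []
    for path in staged:
        parts = PurePosixPath(path).parts
        top = parts[0] if parts else ""
        if top not in spec_modules:
            violations.append(path)
    return GateResult(allowed=not violations, violations=violations)


# Hypothetical staged files and spec modules.
result = check_staged_files(
    staged=["billing/api.py", "scratch/tmp.py"],
    spec_modules={"billing", "auth"},
)
print(result.allowed, result.violations)
```

Wired into a git pre-commit hook, a script like this would exit with a non-zero status when `result.allowed` is false, which is what aborts the commit. A rule in a config file can be ignored; a failing exit code cannot.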

The gap most teams have is layers 1 and 2. They write a CLAUDE.md, add some rules, maybe set up a pre-commit hook. They skip the spec entirely. And then they wonder why the agent makes assumptions that conflict with decisions made three sessions ago — because there's no document capturing those decisions, and the agent doesn't have a persistent memory of the project's architecture across sessions.

The prompting breakdown is a spec problem, not a prompting problem

Reese describes his own prompting breakdown clearly. He started a project by jumping straight into prompting. It worked for the first few features. Then the codebase grew and the agent started generating output that contradicted itself, and he spent more time correcting than building. The pattern is familiar enough that it's become a meme in certain corners of the developer community: the AI coding tool that makes you faster until it makes you slower.

The diagnosis most people reach for is "the prompting needs to be better." But Reese's insight — and Karpathy's too — is that prompting has a ceiling. Beyond a certain codebase complexity, no prompt is sufficient to prevent contradictions, because prompts don't persist across sessions. What you need is a document that persists: a spec that captures architectural decisions and gets referenced on every interaction.

Reese built a tool called Forge to generate these specs retroactively from an existing codebase. The workflow: analyze the existing codebase to produce a structured specification — not a summary, but a full inventory of modules, dependencies, API surface, and data models. Then give the agent that document as context before asking it to build anything. Once the spec was in place with guardrails and standards, Reese reports: "the agent stopped guessing. The output aligned with the real architecture. The correction loop that was eating my time nearly disappeared."

The claim is specific and falsifiable. The correction loop disappearing is a real outcome you can verify in your own workflow. Whether it happens depends on how well the generated spec captures the actual architecture versus the idealized architecture — the gap between "how the code actually works" and "how the README describes how the code works."

The token economics lesson nobody is internalizing yet

Reese makes a point about CLAUDE.md that's worth dwelling on in the current billing environment: "Every token in a rule pays rent on every API call." A rule that fires once per session but loads on every turn is bleeding context budget on every single interaction. The discipline he advocates — write rules when the need is demonstrated, not when you imagine the need might occur — is the difference between a config file that makes your agent smarter and one that just burns money.

This is newly relevant now that GitHub Copilot is moving to usage-based billing on June 1. Under usage-based pricing, every token in your CLAUDE.md or equivalent instructions file costs real money on each API call, not just opportunity cost. A 2,000-token config file that loads on every Copilot Chat interaction contributes twice as much to your monthly bill as a 1,000-token file does. If you're not auditing your config files for relevance and necessity, you're overpaying for every AI-assisted interaction you have.
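
The arithmetic is easy to sketch. The price and call volume below are made-up placeholders, not Copilot's or any provider's actual rates, and the calculation ignores prompt caching, which can lower the real figure. `monthly_config_cost` is a hypothetical helper.

```python
def monthly_config_cost(
    config_tokens: int,
    calls_per_month: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Cost attributable to the config file alone, assuming it is sent on every call."""
    total_tokens = config_tokens * calls_per_month
    return total_tokens * usd_per_million_input_tokens / 1_000_000


# Placeholder assumptions: 2,000 calls a month at 3 USD per million input tokens.
CALLS = 2_000
PRICE = 3.0

small = monthly_config_cost(1_000, CALLS, PRICE)
large = monthly_config_cost(2_000, CALLS, PRICE)
print(f"1,000-token config: ${small:.2f}/month")
print(f"2,000-token config: ${large:.2f}/month")
```

The absolute numbers depend entirely on the assumed rate and volume. The ratio does not: the config's share of the bill scales linearly with its length, whatever the price turns out to be.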

The spec is not a document you write once

The mental model shift that matters most in Reese's framework is this: a spec is not a one-time artifact. It's a living document that iterates with the agent. Corrections become part of the spec's history. When the agent proposes something that contradicts the spec, you correct it — and that correction gets incorporated into the spec, so the next session has the benefit of the learning.

Karpathy's framing in his Sequoia talk is precise: "People have to be in charge of this spec, this plan. Work with your agent to design a spec that is very detailed." The "with" is doing real work in that sentence. The spec isn't handed down from above. It's built collaboratively. The agent reveals ambiguities in the spec by proposing things that don't work; the human resolves those ambiguities; the resolution updates the spec. Over time, the spec becomes smarter than any individual session because it captures the accumulated decisions and lessons from every interaction.

This is a different mental model for human-AI collaboration than most teams are operating with. The typical pattern is: prompt, get output, correct output, prompt again. The spec-driven pattern is: establish spec, build to spec, catch drift from spec, update spec. The difference is that in the second pattern each correction compounds, because future sessions benefit from lessons learned in past sessions. In the first pattern each correction is isolated: the agent doesn't know it made the same mistake last week unless you explicitly tell it again.
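
The compounding effect can be shown with a toy model. This is only an illustration under simple assumptions: the `Spec` class, its methods, and the example decisions are hypothetical, and a real spec would be a structured document rather than a flat list of strings.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    decisions: list[str] = field(default_factory=list)

    def record_correction(self, decision: str) -> None:
        """Fold a correction into the spec so later sessions inherit it."""
        if decision not in self.decisions:
            self.decisions.append(decision)

    def as_context(self) -> str:
        """Render the spec as text to hand to the agent at session start."""
        return "\n".join(f"- {d}" for d in self.decisions)


spec = Spec(decisions=["Persistence layer is Postgres"])

# Session 1: the agent proposes per-service token validation; the human corrects it.
spec.record_correction("Auth tokens are validated at the gateway, not per service")

# Session 2: starts from the updated spec, so the earlier correction is already there.
session_two_context = spec.as_context()
print(session_two_context)
```

In the prompt-correct-prompt pattern, the correction from session 1 lives only in that session's transcript. Here it becomes part of what every later session reads first.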

What to actually do on Monday morning

Start with the codebase you have. Generate a structured analysis — not a summary, but a full inventory: what are the modules, what are the dependencies, what are the API surfaces, what are the data models. There are tools that do this (Reese built Forge; others exist). The mechanism matters less than the outcome: a document that captures how the code actually works, not how the original README described how it was supposed to work.
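
For a Python codebase, a first pass at that inventory can be built with the standard library's `ast` module. This is a minimal sketch and not how Forge works (the article does not describe its internals): `inventory_module` and `inventory_tree` are hypothetical names, and the output covers only imports and top-level definitions, not API contracts or data models.

```python
import ast
from pathlib import Path


def inventory_module(source: str) -> dict[str, list[str]]:
    """List a module's imports and its top-level functions and classes."""
    tree = ast.parse(source)
    imports: set[str] = set()
    functions: list[str] = []
    classes: list[str] = []

    # Imports can appear anywhere in the file, so walk the whole tree.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Bare relative imports ("from . import x") have no module name
            # and are skipped in this sketch.
            imports.add(node.module)

    # Only top-level definitions count as the module's surface here.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
        elif isinstance(node, ast.ClassDef):
            classes.append(node.name)

    return {"imports": sorted(imports), "functions": functions, "classes": classes}


def inventory_tree(root: Path) -> dict[str, dict[str, list[str]]]:
    """Inventory every .py file under root, keyed by path relative to root."""
    return {
        str(path.relative_to(root)): inventory_module(path.read_text(encoding="utf-8"))
        for path in sorted(root.rglob("*.py"))
    }


# Hypothetical module source used to demonstrate the output shape.
example = (
    "import os\n"
    "from collections import OrderedDict\n"
    "\n"
    "class Order:\n"
    "    pass\n"
    "\n"
    "def create_order():\n"
    "    pass\n"
)
print(inventory_module(example))
```

Because the inventory is read from the code itself, it describes what exists rather than what the README claims, which is the property that matters here.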

Give that document to your agent as context before any new task. Watch what changes. If the agent stops asking the clarifying questions it was asking before, that's the signal that the spec is working. If it's still generating output that contradicts the existing architecture, the spec isn't capturing something important.

Then add rules only when you've observed the problem they solve. Not before. The temptation to write rules preemptively is strong — the "what if the agent does X" instinct. Resist it. Every rule you write costs tokens on every API call. Write it when the agent actually does the thing you were worried about, not when you imagine it might.

The take

Karpathy's Sequoia talk described the shift from vibe coding to agentic engineering. Reese's framework shows what agentic engineering actually looks like in a developer's daily workflow. The four-layer model is useful because it names something many developers have been feeling but not articulating: the sense that AI agents work fine on small projects and badly on large ones, and that "better prompting" isn't the fix.

The fix is a spec. Not a better prompt — a shared understanding of what you're building, captured in a form the agent can reference, updated as the understanding evolves. That's not a new idea. Software teams have always had specs. What's new is the context: in the AI agent paradigm, the spec has to be good enough that the agent can use it without a human in the loop to explain the decisions. That's a higher bar than most human-readable specs clear.

If you're using AI coding tools and things are breaking down as your project scales, the question to ask isn't "how do I prompt better." It's "what does my agent know about my codebase that I haven't written down." That gap is where the work is.

Sources: Jeffrey Reese (dev.to); Andrej Karpathy, "From Vibe Coding to Agentic Engineering" (Sequoia AI Ascent 2026); The AI Opportunities; Dealroom