Thread-Based Engineering: The Mental Model That Changes How You Use Every AI Coding Tool

There's a moment every developer who picks up Claude Code or Cursor hits about two weeks in: you've got a session that started with "add auth" and somehow evolved into fixing a CSS bug, then a migration question, then a half-explored refactor, and now the agent is confused about what you actually want. The context has drifted. The model is working off a muddy combination of everything you asked in the last hour. Output quality degrades and you can't pin down why.

Thread-Based Engineering (TBE) is the answer to this. It's a workflow methodology for AI-assisted coding that treats each task as an isolated unit of work — a thread — rather than a running conversation that accumulates context indefinitely. Once you start thinking in threads, the way you use every agentic coding tool changes.

The Mental Model: What a Thread Actually Is

A thread is the atomic unit of agentic engineering: one unit of work, driven by you and your agent, with two mandatory moments where you show up.

PROMPT → [TOOL CALLS] → REVIEW
  You       Agent          You

You appear at the beginning (prompt or plan) and the end (review or validate). Everything in between — file reads, code writes, test runs, command executions — is the agent. You've delegated the execution layer entirely.

The key metric this surfaces: tool calls roughly equal impact. Before agentic tools, you were the tool calls. You opened files. You typed code. You ran commands. Now the agent does that. The engineer running more useful tool calls per day is outperforming the engineer running fewer — not because they're working harder, but because they've structured their work to maximize the agent's execution time and minimize their own bottleneck.

Andrej Karpathy said in early 2026: "I've never felt this much behind as a programmer." That's the correct response to agentic engineering. The old metrics — lines of code, PRs merged — don't capture what matters anymore. TBE gives you something better to measure: how well you're structuring your threads.

Why Isolation Matters: The Context Contamination Problem

The reason TBE exists is context contamination. When you pile multiple tasks into a single long session, earlier context doesn't disappear — it competes with current instructions. The model's attention distributes across everything in its window. Relevance decays but never clears. Ask about a database migration in a session where you earlier discussed frontend styling, and the model carries traces of both.

The symptoms are familiar:

  • The agent suggests solutions inconsistent with what you asked for
  • It references earlier decisions that are no longer relevant
  • Output quality degrades over a long session even though you're asking coherent questions
  • You spend time correcting course rather than reviewing work

The fix is mechanical: one task, one thread, fresh context. In Claude Code, that's /clear between tasks. In Cursor, it's opening a new composer window. In Codex CLI, it's a new session. The discipline of starting fresh is the foundation everything else builds on.

The Six Thread Types

Once you have the base concept, TBE extends into a taxonomy of thread patterns. These aren't theoretical — they map to concrete workflows developers are running today.

1. Base Thread

One prompt → agent executes → one review. The foundation. Everything else is this, scaled or combined.

Use it for: simple tasks, quick fixes, single-file changes, anything that fits in one context window with a clear deliverable.

The discipline: resist adding "while you're at it" mid-thread. Each new task is a new thread. The friction of opening a fresh session is intentional — it forces you to think about whether the new task is actually the same task.

2. P-Threads (Parallel)

Multiple threads running simultaneously. Boris Cherny — creator of Claude Code — runs five instances in his terminal (numbered tabs 1–5) plus 5–10 additional instances in the Claude Code web interface. That's 10–15 parallel threads. While one agent writes tests, another refactors an API endpoint, another explores an alternative architecture.

The math is direct: more threads running in parallel means more potential output per hour. The bottleneck shifts from agent execution to your own ability to review and redirect. You're no longer waiting — you're managing.

The practical split Cherny uses maps to two modes:

  • In-loop (terminal): Active threads where you're steering — the agent runs a few tool calls, reports back, you redirect
  • Out-of-loop (web/cloud): Fire-and-forget threads running autonomously while you work on something else

Tool support: Claude Code's web interface, Claude.ai Projects, background agent sessions. The infrastructure for 10+ parallel threads now exists natively.
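The P-thread pattern can be sketched in a few lines. Here `run_thread` is a stand-in for launching one isolated agent session (in practice, a headless CLI run with fresh context per task), and the task strings are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_thread(prompt: str) -> str:
    """Stand-in for one isolated agent session. In practice this would
    launch a headless CLI run with a fresh context window per task."""
    return f"result: {prompt}"

# Three independent tasks -> three parallel threads, reviewed together.
tasks = [
    "write unit tests for the auth module",
    "refactor the /users endpoint",
    "prototype an alternative caching layer",
]

with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(run_thread, tasks))

# Your job shifts from waiting on each task to reviewing all of them.
for task, result in zip(tasks, results):
    print(f"[review] {task} -> {result}")
```

The structural point is the `pool.map`: independent tasks launch together, and you show up once at the end to review the batch rather than once per task.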

3. C-Threads (Chained)

Multi-phase work with explicit human checkpoints between phases. This isn't the agent getting confused and asking for help — it's you choosing to verify before proceeding.

The canonical use case: a production deployment that touches database, API, and frontend. You don't want an agent running all three in one go if a database migration error would force you to unwind frontend changes. You chain: run migration → you review → run API updates → you review → run frontend → you review.

Claude Code's AskUserQuestion tool supports this natively — the agent can stop mid-workflow and request explicit approval before continuing. Text-to-speech hooks let you get notified without watching the terminal. The agent taps you on the shoulder at each checkpoint.

The trade-off: your time. C-threads require more human attention than base threads. Use them when the risk of not reviewing outweighs the cost of pausing.
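A minimal sketch of the chain, with `run_phase` and `human_review` as hypothetical stand-ins for the agent and the checkpoint (not any tool's actual API):

```python
def run_phase(name: str) -> str:
    # Stand-in for an agent executing one phase of the chain.
    return f"{name} complete"

def human_review(name: str, output: str) -> bool:
    # Stand-in for the explicit checkpoint: in a real workflow this is you,
    # approving or rejecting before the chain continues.
    return True

phases = ["database migration", "API updates", "frontend changes"]
completed = []
for phase in phases:
    output = run_phase(phase)
    if not human_review(phase, output):
        break  # chain halts; later phases never run on a bad foundation
    completed.append(phase)
```

The `break` is the whole point of the C-thread: a rejected migration stops the chain before the API and frontend phases ever start, so there is nothing downstream to unwind.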

4. F-Threads (Fusion)

Same prompt to multiple agents simultaneously, then aggregate the best results. This is "best-of-N" applied to entire engineering tasks.

The logic: if you send one agent to solve a problem, you get one attempt. If you send four agents, you get four attempts — and the probability that at least one of them is excellent is dramatically higher than the probability that any single attempt is excellent. You review all four and pick the winner, or cherry-pick the best elements across multiple results.
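The best-of-N arithmetic is easy to make concrete. Assuming, purely for illustration, that any single attempt is excellent with probability 0.3 and that attempts are independent:

```python
# Illustrative numbers: per-attempt success probability of 0.3,
# independent attempts. P(at least one success) = 1 - (1 - p)^n.
p_single = 0.3
for n in (1, 2, 4, 9):
    p_at_least_one = 1 - (1 - p_single) ** n
    print(f"{n} attempts -> P(at least one excellent) = {p_at_least_one:.2f}")
```

At four attempts the chance of at least one excellent result is already roughly 0.76; at nine attempts it is roughly 0.96. The per-attempt probability here is invented for the example, but the shape of the curve is why F-threads pay off.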

Extended to multi-model fusion: 3 Claude Code instances + 3 Cursor/Gemini instances + 3 Codex instances all attempting the same problem. Nine parallel attempts. One winner selected by you. The quality ceiling rises with the number of attempts — you're sampling the distribution of possible solutions rather than accepting the first one.

F-threads are particularly valuable for architecture decisions, security reviews, complex algorithm implementations — anywhere you'd normally want a second opinion from a senior engineer.

5. B-Threads (Big/Meta)

One thread that contains other threads inside it. An orchestrator agent fires off multiple worker agents; each worker runs its own thread; the orchestrator synthesizes results. From your perspective you still prompt once at the beginning and review at the end — but N threads ran underneath.

This is the foundation of Claude Code's multi-agent architecture. When you tell Claude Code to "use subagents to handle the frontend, backend, and tests separately," it spawns three isolated threads internally. Each subagent has its own context window — no contamination between them. The orchestrator knows the high-level plan; each worker knows only its scope.

The pattern extends: orchestrator writes prompts for worker agents, each worker executes, orchestrator synthesizes. You've multiplied throughput without multiplying your effort. Boris Cherny's team uses this for /team-build workflows: a single plan file spawns a frontend agent, backend agent, and quality engineer agent, each running in isolated context.

B-threads are where TBE intersects with multi-agent orchestration architecture — the orchestrator is itself a thread whose tool calls are the spawning of other threads.
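The orchestrator/worker shape reduces to a small sketch. `worker` stands in for a subagent with its own isolated context, and the scope names mirror the frontend/backend/tests split described above; none of this is a real tool's API:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(scoped_prompt: str) -> str:
    # Each worker sees only its own scoped prompt, never the other
    # workers' conversations: no context contamination between them.
    return f"[{scoped_prompt}] done"

def orchestrate(plan: str) -> str:
    # Orchestrator writes one scoped prompt per worker...
    prompts = [f"{scope} work for: {plan}"
               for scope in ("frontend", "backend", "tests")]
    # ...fires the workers as parallel inner threads...
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, prompts))
    # ...and synthesizes their results into one surface for your review.
    return "\n".join(results)

print(orchestrate("add OAuth login"))
```

From the outside this is one prompt in and one result out, which is exactly the B-thread claim: N threads ran underneath, but you only showed up twice.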

6. L-Threads (Long Duration)

A base thread stretched to its limit: instead of 10 tool calls over 5 minutes, it's 200+ tool calls over hours. Boris Cherny has run L-threads for over 26 hours on complex feature builds.

L-threads require three things the base thread doesn't:

  • Excellent prompts: A vague prompt produces acceptable output in a base thread; it produces hours of wrong direction in an L-thread. The specification has to be complete before the thread starts.
  • Robust verification: The agent must be able to verify its own work so it knows when it's done, rather than stopping prematurely or looping endlessly. Stop hooks — callbacks that run when the agent tries to stop and force re-verification — are the standard mechanism.
  • Checkpoint state: Long threads hit context limits. The work needs to be checkpointed in files (a PROGRESS.md, a state file, git commits) so a new context window can pick up where the old one left off.
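A stop hook can be sketched as a function that re-runs verification whenever the agent tries to finish. The dict-in/dict-out shape and the `decision`/`reason` keys below are assumptions about how such a hook might be wired, and `verification_passes` is a stand-in for your real test suite:

```python
def verification_passes() -> bool:
    # Stand-in: in a real hook this would run the actual test suite,
    # e.g. subprocess.run(["pytest", "-q"]).returncode == 0
    return True

def stop_hook(event: dict) -> dict:
    """Called when the agent tries to stop. The payload/response shape
    here is assumed for illustration; the mechanism is what matters:
    re-verify before allowing the thread to end."""
    if verification_passes():
        return {}  # empty response: stopping is allowed
    return {"decision": "block",
            "reason": "verification failed; keep iterating"}
```

The effect is that "done" becomes a verified state rather than the agent's own judgment, which is what makes hours-long unsupervised runs tolerable.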

The connection to autoresearch is direct: the Karpathy loop is an L-thread with a mechanical stop condition (metric stops improving) and automatic rollback on failure. It runs overnight because the verification is robust enough to not need human oversight.

Beyond the Six: Z-Threads (Zero-Touch)

The theoretical endpoint: no review node. The agent ships to production, observes real analytics, decides whether the change worked, iterates. The human's mandatory presence at the end disappears.

Most engineers aren't running Z-threads today. The guardrail requirements are significant: near-perfect verification, robust rollback, meaningful observability, high trust in the agent's judgment. But the direction is clear — every improvement to verification, every better stop hook, every more reliable test suite moves the system closer to the point where human review becomes optional rather than mandatory.

The Four Levers

Every thread optimization ultimately improves one of four variables:

  • Context: what the agent knows going in. Improve it with CLAUDE.md, AGENTS.md, task-specific context files, MCP servers, and fresh sessions.
  • Model: which model is executing. Route hard problems to Opus, fast iteration to Sonnet, and syntax tasks to smaller models.
  • Prompt: what you're asking. Improve specification quality, explicit scope, and clear success criteria.
  • Tools: what the agent can do. Expand with MCP servers, skills, hooks, filesystem access, and CI integration.

Better prompts extend threads — a higher-quality specification means the agent can run further before needing direction. Better context means more accurate work per tool call. Better tools expand what the agent can accomplish autonomously. Better model selection matches capability to task.
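The model lever can be as simple as a routing function. The keyword heuristic and the model-tier names below are illustrative stand-ins, not any vendor's API:

```python
def route_model(task: str) -> str:
    # Hypothetical routing heuristic: match task difficulty to model tier.
    # Tier names are illustrative labels for "strongest reasoning",
    # "fast default", and "cheap syntax-level work".
    hard = ("architecture", "security", "migration")
    trivial = ("rename", "format", "typo")
    if any(keyword in task for keyword in hard):
        return "opus"
    if any(keyword in task for keyword in trivial):
        return "haiku"
    return "sonnet"

print(route_model("review the security model"))  # hard -> strongest tier
print(route_model("rename a config variable"))   # trivial -> cheapest tier
```

Even a crude router like this beats the common default of sending everything to one model: the lever is matching capability to task, not maximizing capability everywhere.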

Implementing TBE: The Practical Rules

Rule 1: One task, one thread. The hardest discipline and the most important one. When a new question or task comes up mid-thread, write it down and open a new session. Don't append it to the current context.

Rule 2: Start threads with complete context. Before the agent starts executing, it needs: what to build, what files to look at, what not to touch, what done looks like. CLAUDE.md and task-specific instruction files loaded at thread start are worth more than mid-session corrections.

Rule 3: Default to P-threads. If two tasks are independent, run them in parallel rather than sequentially. There's no reason to run authentication work after the tests finish if they don't depend on each other. Open a new terminal tab, start both threads, review both when done.

Rule 4: Use C-threads for production work. Any thread touching infrastructure, data migrations, or shared services should have explicit review checkpoints. The cost of unwinding a single unchecked thread that went wrong across all three layers is higher than the cost of reviewing three sequential C-thread phases.

Rule 5: Invest in verification before scaling. L-threads and B-threads only work reliably if the agent can verify its own output. Build your test suite, your stop hooks, your CI checks before scaling to long-running or orchestrated threads. The verification infrastructure is what makes unsupervised execution safe.

What Changes When You Think in Threads

The shift isn't just tactical: it changes how you think about your own role. "Developer writes code" becomes "developer writes prompts, reviews outputs, and designs verification systems." The skills that matter most become specification quality (can you write a prompt the agent can execute for three hours without needing clarification?), verification design (can you build a test suite robust enough to catch what the agent gets wrong?), and orchestration judgment (which tasks should run in parallel, which need checkpoints, which are safe to run overnight?).

The engineers who are pulling ahead in 2026 aren't the ones who've memorized more API signatures or can type faster. They're the ones who've learned to think in threads — to decompose work into isolated units, specify them precisely, run them in parallel, and verify the outputs systematically. That's the skill the methodology is building toward.