Karpathy's autoresearch Is a Skill: How a 42,000-Star Repo Became a Claude Code Loop for Any Codebase

On March 7, 2026, Andrej Karpathy pushed a repo to GitHub with a README that opens like this:

One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone.

The repo picked up 42,000 GitHub stars and 8.6 million tweet impressions in days. But the more interesting story isn't the viral reception — it's what developers immediately recognized in the design and started building from it.

What autoresearch actually is

karpathy/autoresearch is a deliberately minimal system: give an AI coding agent a small but real LLM training setup and let it experiment autonomously overnight. The agent modifies the training code, runs a 5-minute training job, checks if validation loss improved, keeps or discards the change, and repeats. You wake up to a log of experiments and a better model.

The repo has exactly three files that matter:

  • prepare.py — fixed constants, data prep, evaluation harness. Read-only. The agent cannot touch this.
  • train.py — the single file the agent edits. Full GPT model, optimizer (Muon + AdamW), training loop. Everything is fair game: architecture, hyperparameters, batch size, model depth.
  • program.md — baseline instructions for one agent. The human iterates on this; the agent follows it.

The metric is val_bpb (validation bits per byte) — lower is better, vocab-size-independent so architectural changes are fairly compared. Every run is exactly 5 minutes of wall-clock training time regardless of what the agent changes. Every experiment is git-committed with a TSV log entry. Failed experiments auto-revert.
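Bits per byte re-expresses cross-entropy in base 2, normalized by bytes of raw text rather than tokens, which is why tokenizer or vocab changes can't game it. A sketch of the conversion (the exact computation in prepare.py may differ):

```python
import math

def val_bpb(mean_loss_nats_per_token: float, tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte.

    Total nats = loss * tokens; divide by ln(2) to get bits,
    then by the raw byte count of the validation text.
    """
    total_bits = mean_loss_nats_per_token * tokens / math.log(2)
    return total_bits / total_bytes

# A loss of ln(2) nats/token with one token per byte is exactly 1.0 bpb.
print(val_bpb(math.log(2), 1024, 1024))  # → 1.0
```

Because the denominator is bytes, a model with a bigger vocabulary (fewer tokens per byte, higher loss per token) lands on the same scale as a byte-level model.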

In two days of continuous running, the system executed 700 experiments and discovered 20 independent optimizations. At ~12 experiments per hour, that works out to roughly 100 experiments per night of sleep.

The key insight Karpathy buried in the README

About halfway through the setup instructions, Karpathy writes:

The program.md file is essentially a super lightweight "skill".

That's the sentence that sparked the wave of ports. Developers working with Claude Code and Codex already knew what skills were — SKILL.md files that give agents specialized playbooks for a category of task. Karpathy's program.md is exactly that: a structured instruction file describing a loop, a constraint, a metric, and a methodology. The agent reads it, sets up, and runs.

The autoresearch design has five core constraints that make it work, and they're all transferable:

  1. One file to modify. The agent only touches train.py. Scope is inviolable. This keeps diffs reviewable and attribution unambiguous.
  2. Fixed time budget. Every experiment runs for exactly 5 minutes. All results are directly comparable regardless of architecture changes — bigger model or smaller batch, same clock budget.
  3. Mechanical metric. The goal is a single number: val_bpb. No subjective evaluation, no "does this look good?" — just grep the log file.
  4. Automatic rollback. If the metric gets worse, git revert. The agent doesn't deliberate — it has a rule.
  5. Git is memory. Every experiment — kept or discarded — is committed. The agent reads git log and git diff before each iteration to understand what's been tried.

Together, these five constraints are what make an autonomous loop actually autonomous. Remove any one and human judgment has to step back in.

How developers are turning it into a Claude Code skill

The community response was immediate. Within two weeks of the repo going viral, multiple developers had generalized the pattern into installable Claude Code skills.

The most complete port is uditgoenka/autoresearch, which describes itself as: "constraint + mechanical metric + autonomous iteration = compounding gains, generalized to ANY domain."

The core loop in the skill is a direct translation of Karpathy's program.md:

LOOP (FOREVER or N times):
  1. Review current state + git history + results log
  2. Pick the next change (based on what worked, what failed, what's untried)
  3. Make ONE focused change
  4. Git commit (before verification)
  5. Run mechanical verification (tests, benchmarks, scores)
  6. If improved → keep. If worse → git revert. If crashed → fix or skip.
  7. Log the result
  8. Repeat. Never stop until you interrupt.
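The loop above is little more than a commit/measure/revert cycle around a metric command. A minimal runnable sketch of one iteration (function names and commands are illustrative, not the skill's actual implementation):

```python
import subprocess

def run(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command, capturing output."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def iterate(measure_cmd: str, best: float, make_change) -> float:
    """One iteration: change, commit, measure, keep or revert."""
    make_change()                                  # step 3: one focused change
    run("git add -A && git commit -m experiment")  # step 4: commit before verifying
    out = run(measure_cmd)                         # step 5: mechanical verification
    try:
        score = float(out.stdout.strip())
    except ValueError:
        run("git revert --no-edit HEAD")           # crashed → discard
        return best
    if score < best:                               # step 6: lower-is-better metric
        return score                               # improved → keep
    run("git revert --no-edit HEAD")               # worse → revert
    return best
```

Note that the revert is itself a commit, which is how discarded experiments still end up in git history for the agent to read later.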

But where Karpathy's original is hardcoded for ML training on a single GPU, the skill generalizes across domains. Any codebase. Any measurable target. Developers are running it against:

  • Test coverage percentages
  • JavaScript bundle size
  • Lighthouse performance scores
  • API response time benchmarks
  • SQL query execution time
  • TypeScript type error counts
  • ESLint violation counts

The install is a single slash command in Claude Code:

/plugin marketplace add uditgoenka/autoresearch

Then to kick off a loop:

/autoresearch

Claude reads your codebase, extracts or asks for a mechanical metric, defines scope (which files are modifiable, which are read-only), establishes a baseline, and begins iterating. The Iterations: N parameter caps it; without it, the loop runs until you interrupt it — exactly like the original.

The command surface that emerged

What's notable about how developers extended the skill is that they kept the core loop invariant and added domain-specific entry points on top of it:

  • /autoresearch — Core autonomous iteration loop, unlimited
  • /autoresearch:plan — Interactive wizard: goal → scope → metric → verify config
  • /autoresearch:debug — Autonomous bug-hunting loop: scientific method plus iterative investigation
  • /autoresearch:fix — Iteratively repair all errors (tests, types, lint) until zero remain
  • /autoresearch:security — Autonomous STRIDE + OWASP + red-team audit loop
  • /autoresearch:ship — Universal shipping workflow: code, content, docs, DevOps
  • /autoresearch:learn — Autonomous documentation engine: scout, generate, validate, fix loop

Each of these is the same underlying loop with a different metric and a different set of in-scope files. The pattern is composable because the constraints are consistent.

There's also a Guard: parameter — an optional verification command that must pass before any change is kept. Guard: npm test means the agent will never commit a change that breaks the test suite, regardless of whether the primary metric improved.

The design principle that makes it portable

The reason the autoresearch pattern generalizes so well is that it solves a specific problem with autonomous agent loops: they tend to drift. Without constraints, agents explore indefinitely, make changes that are hard to attribute, and produce outputs that are difficult to roll back. The Karpathy loop eliminates drift by design.

One change per iteration. One metric. One file scope. Automatic rollback. Git as the record. These aren't implementation details — they're what makes the loop safe to run unattended.

The parallel to the TDAD research from earlier this month is striking: both arrive at the same conclusion from different angles. Give the agent structural constraints and a mechanical metric, and it performs better than when given procedural instructions. Don't tell it how to research; tell it what to measure and what it's not allowed to touch.

What to do with this today

If you have a codebase with a measurable target — test coverage below where you want it, bundle size too large, too many TypeScript errors, slow queries — this loop is directly usable.

The minimal setup:

  1. Install the skill: /plugin marketplace add uditgoenka/autoresearch
  2. Add a GOAL.md to your project root — the equivalent of Karpathy's program.md. Define what the agent can modify, what it cannot touch, what command produces the metric, and what "improved" means.
  3. Run /autoresearch Iterations: 20 as a bounded first run to see what it does.
  4. Review the TSV log. Read the git history. Decide what to keep.
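A hedged example of what such a GOAL.md might contain — the field names here are illustrative, not a schema the skill mandates:

```
# GOAL.md

Goal: reduce TypeScript errors to zero.

Modifiable: src/**/*.ts
Read-only: src/generated/, package.json, tsconfig.json

Metric command: npx tsc --noEmit 2>&1 | grep -c "error TS"
Improved means: the error count goes down. Ties are reverted.

Guard: npm test
```

Everything the loop needs is in those few lines: scope, a command that prints one number, a direction, and a veto.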

The more interesting application is what Karpathy gestures at in the README's opening — the idea that you're not programming the training code, you're programming the program.md. The human's job shifts from writing code to writing better instructions about how to write better code. For teams thinking about long-horizon agentic workflows, that's the real architecture question.

The repo is at github.com/karpathy/autoresearch. The skill port is at github.com/uditgoenka/autoresearch. The design is simple enough that you can read both in an hour and adapt them to your stack by end of day.