CodeBurn Turns AI Coding Spend Into an Engineering Metric

The least glamorous question in agentic coding is becoming the most important one: where did the tokens go?

Not “which model is smartest,” not “which CLI has the best demo,” and definitely not “how many vibes did we ship this sprint.” The useful question is whether an AI coding session consumed budget in a way that produced working code, useful analysis, or reusable context — or whether it just re-read the same files, dragged a bloated instruction stack through every prompt, and turned a half-planned debugging session into a small bonfire of input tokens.

CodeBurn’s v0.9.8 release is worth watching because it treats AI coding spend as an engineering metric instead of a finance surprise. The release adds a codeburn models command, Crush provider support, multiple Claude config directories, per-day one-shot JSON metrics, and a batch of Cursor/project attribution fixes. That sounds like normal CLI plumbing. It is more than that. It is the observability layer that agentic coding needs before teams can have an honest conversation about cost, productivity, and waste.

A model table is only useful when it knows what the model was doing

The new codeburn models command breaks usage down across providers and models, including input tokens, output tokens, cache writes, cache reads, total tokens, cost, and a dominant task cell such as Coding (42%). It supports filters for period, provider, task, top models, minimum cost, and output formats including table, Markdown, JSON, and CSV.
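To make the "dominant task" idea concrete, here is a minimal Python sketch of how per-session records could roll up into one row per provider and model. The SessionUsage fields, the summarize_by_model helper, and the choice to weight the dominant task by cost are all assumptions for illustration; the release notes do not say how CodeBurn computes that cell.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionUsage:
    """One session's usage. Field names are hypothetical, not CodeBurn's schema."""
    provider: str
    model: str
    task: str
    input_tokens: int
    output_tokens: int
    cache_write_tokens: int
    cache_read_tokens: int
    cost_usd: float


def summarize_by_model(sessions):
    """Aggregate sessions into one row per (provider, model) with a dominant task."""
    totals = defaultdict(
        lambda: {"tokens": 0, "cost_usd": 0.0, "by_task": defaultdict(float)}
    )
    for s in sessions:
        row = totals[(s.provider, s.model)]
        row["tokens"] += (s.input_tokens + s.output_tokens
                          + s.cache_write_tokens + s.cache_read_tokens)
        row["cost_usd"] += s.cost_usd
        row["by_task"][s.task] += s.cost_usd

    summary = []
    for (provider, model), row in totals.items():
        # Assumption: "dominant" means the task with the largest share of cost.
        task, task_cost = max(row["by_task"].items(), key=lambda kv: kv[1])
        share = round(100 * task_cost / row["cost_usd"]) if row["cost_usd"] else 0
        summary.append({
            "provider": provider,
            "model": model,
            "total_tokens": row["tokens"],
            "cost_usd": round(row["cost_usd"], 2),
            "dominant_task": f"{task} ({share}%)",
        })
    # Most expensive models first, as a cost table usually reads.
    return sorted(summary, key=lambda r: r["cost_usd"], reverse=True)
```

The point of the sketch is the shape of the row: cost and tokens alone say how much, while the task share says what the model was being asked to do.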

That task context is the difference between telemetry and trivia. “Sonnet cost more than GPT this week” is a spreadsheet fact, not an engineering conclusion. A model may be expensive because it handled the gnarly migration sessions, or because developers keep using it for shallow refactors that a cheaper model could handle. A cheap model may be cheap because it was efficient, or because it failed early and pushed the real work somewhere else. Cost by model without task type, retry behavior, project attribution, cache behavior, and delivery outcome is how teams optimize for the wrong thing with confidence.

CodeBurn is trying to connect those dimensions. The README says it tracks token usage, cost, and performance across 19 AI coding tools, including Claude Code, Claude Desktop, Codex, Cursor, cursor-agent, Gemini CLI, GitHub Copilot, IBM Bob, Kiro, OpenCode, OpenClaw, Pi, OMP, Droid, Roo Code, KiloCode, Qwen, Goose, Antigravity, and Crush. The project had more than 6,100 GitHub stars, 472 forks, and fresh May 12 commits during the research window. npm showed 3,022 downloads on May 11, 13,640 over the prior week, and 52,221 over the prior month. That is not enterprise-procurement scale, but it is real practitioner pull.

The release also adds CLAUDE_CONFIG_DIRS, letting users scan multiple Claude data directories in one run, with POSIX : and Windows ; delimiters, tilde expansion, deduplication, and graceful skipping of unreadable directories. This is the kind of feature that looks boring to people who do not use these tools. Real developers have work accounts, personal accounts, remote boxes, containers, and half-remembered experiments across multiple profiles. Observability tools that assume one happy-path directory usually die the first week they meet a real workstation.
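The behavior described for CLAUDE_CONFIG_DIRS (platform delimiter, tilde expansion, deduplication, skipping unreadable directories) can be sketched in a few lines of Python. This illustrates the described semantics only and is not CodeBurn's code; resolve_config_dirs is a hypothetical helper, and Python's os.pathsep happens to match the delimiters named above (: on POSIX, ; on Windows).

```python
import os


def resolve_config_dirs(raw, sep=os.pathsep):
    """Split a delimiter-separated directory list into unique, readable directories."""
    seen = set()
    dirs = []
    for entry in raw.split(sep):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing delimiter
        # Expand "~" and normalize so the same directory is not scanned twice.
        path = os.path.realpath(os.path.expanduser(entry))
        if path in seen:
            continue
        seen.add(path)
        if os.path.isdir(path) and os.access(path, os.R_OK):
            dirs.append(path)
        # Missing or unreadable directories are skipped instead of raising.
    return dirs
```

A caller would pass in the environment variable, for example resolve_config_dirs(os.environ.get("CLAUDE_CONFIG_DIRS", "")). Deduplicating after normalization matters because work and personal profiles often reach the same directory through different spellings or symlinks.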

Project attribution is where dashboards become management tools

The Cursor fixes are more important than they sound. CodeBurn now breaks Cursor sessions down by workspace instead of dumping everything into one generic cursor row. It also improves model alias pricing for non-Auto Cursor variants, stops misclassifying phrases like “add error handling” as debugging, and discards stale cache versions.

Project attribution is the moment a personal dashboard starts answering team questions. A lead does not only need to know that Cursor burned tokens. They need to know which project consumed them, what kind of work triggered the spend, and whether the work shipped. If one service is absorbing a disproportionate share of agent usage because its test suite is noisy, its docs are stale, or its dependency graph forces agents into huge context windows, that is an engineering problem. Without project attribution, it remains a vague bill.

CodeBurn’s architecture is intentionally local-first. The docs describe one Node.js CLI plus macOS and GNOME clients that shell out to it. It reads local session artifacts (JSONL, SQLite, protobuf, and related formats), then aggregates through provider parsers, a daily cache, and output formatters. Caches live under ~/.cache/codeburn/, use atomic writes, and are written with mode 0o600.
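For readers unfamiliar with the pattern, an atomic cache write with owner-only permissions typically looks like the Python sketch below: write to a temporary file in the same directory, then rename it over the target. CodeBurn itself is a Node.js CLI, so this is a generic illustration of the technique rather than its implementation, and write_cache_atomically is a hypothetical name.

```python
import json
import os
import tempfile


def write_cache_atomically(path, payload):
    """Write JSON to `path` so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # mkstemp creates the file readable and writable only by the current user.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.chmod(tmp_path, 0o600)  # make the owner-only mode explicit
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The temporary file lives next to the target because a rename is only atomic within one filesystem. The restrictive mode is the part that matters for privacy: a cache derived from session logs can carry the same sensitive material the logs do.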

That local-read design is the right tradeoff for this category. A proxy can produce cleaner accounting, but it becomes another trust boundary and forces every agent call through a managed path. CodeBurn reads exhaust from tools that already write local traces. Adoption is easier, and privacy is saner, though not automatic. Local logs can still contain prompts, repo paths, code snippets, internal URLs, and occasionally secrets if an upstream tool logged something it should not have. Exports and reports should be treated as developer telemetry, not casual screenshots for the company chat.

The numbers will be approximate. The patterns are the point.

There is an accuracy caveat, and it matters. CodeBurn is not instrumenting every model call at the provider boundary. Some tools expose real token counts. Others require inference from transcript content, local databases, or model aliases. Cursor Auto may be costed with an estimate. Kiro may be labeled as kiro-auto. Copilot VS Code transcripts may not expose explicit token counts. That does not make the tool useless; it defines the correct use case.

Treat these numbers as operational telemetry, not accounting truth to the cent. Use them for trends, outliers, ratios, and before/after comparisons. If a dashboard says MCP schema overhead doubled after someone added five unused servers, that is useful even if the final dollar amount is only an estimate. If cache hit rate stays near zero, repeated file reads dominate sessions, or expensive agent runs produce no commits, you do not need perfect billing precision to know the workflow is sick.
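One reason ratios survive imperfect measurement is that a consistent estimation error cancels out of them. The Python sketch below shows two such signals. Both helpers are hypothetical, and the cache hit rate definition assumes the tool reports uncached input tokens and cache reads as separate counts, which will not hold for every provider.

```python
def cache_hit_rate(input_tokens, cache_read_tokens):
    """Share of prompt tokens served from cache; 0.0 when there were none."""
    total = input_tokens + cache_read_tokens
    return cache_read_tokens / total if total else 0.0


def relative_change(before, after):
    """Fractional change from `before` to `after`; None when there is no baseline."""
    if before == 0:
        return None
    return (after - before) / before


# If both measurements are off by the same factor, the relative change is
# unaffected: the factor appears in the numerator and denominator alike.
true_change = relative_change(1000, 2000)
skewed_change = relative_change(1000 * 1.2, 2000 * 1.2)
```

Here true_change and skewed_change agree up to floating-point rounding, which is the sense in which "overhead doubled" can be trusted even when the absolute figure cannot. The cancellation only holds when the error is consistent across both measurements, so comparisons across different tools or pricing estimates deserve more caution.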

The optimize detectors point in exactly that direction. CodeBurn currently includes 14 detectors covering junk reads, duplicate reads, MCP tool coverage, unused MCP servers, bloated CLAUDE.md, low read/edit ratio, cache bloat, ghost agents, ghost skills, ghost commands, bash bloat, low-worth sessions, context bloat, and session outliers. That list is a diagnosis of modern agent waste. The industry keeps talking about model prices, but plenty of waste is self-inflicted configuration drag.

Repeated file reads are usually a planning problem. Bloated instruction files are a governance problem. Unused MCP servers are an architecture problem. Uncapped bash output is a tooling problem. Low-worth expensive sessions are a workflow problem. Blaming the model for all of it is convenient, but it is not engineering.

The “yield” feature is directionally smart for the same reason. It correlates AI sessions with git commits by timestamp and buckets spend as productive, reverted, or abandoned. That is crude. It will miss work that informed a human edit later, and it may over-credit commits that happened near a session. Still, the instinct is correct: cost needs delivery context. A session that cost $3 and produced a merged migration is different from a session that cost $3, looped for an hour, and left no diff behind.
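The timestamp correlation described above can be sketched in Python as follows. Everything here is an assumption made for illustration: the Session and Commit shapes, the 30-minute attribution window, and the rule that a session counts as reverted only when every nearby commit was reverted. CodeBurn's actual matching rules may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Session:
    started_at: datetime
    ended_at: datetime
    cost_usd: float


@dataclass
class Commit:
    sha: str
    committed_at: datetime
    reverted: bool = False  # assumed precomputed, e.g. from later revert commits


def bucket_spend(sessions, commits, window=timedelta(minutes=30)):
    """Bucket session cost as productive, reverted, or abandoned by commit timing."""
    buckets = {"productive": 0.0, "reverted": 0.0, "abandoned": 0.0}
    for s in sessions:
        # A commit is attributed to a session if it landed during the session
        # or within `window` after the session ended.
        nearby = [c for c in commits
                  if s.started_at <= c.committed_at <= s.ended_at + window]
        if not nearby:
            buckets["abandoned"] += s.cost_usd
        elif all(c.reverted for c in nearby):
            buckets["reverted"] += s.cost_usd
        else:
            buckets["productive"] += s.cost_usd
    return buckets
```

Both failure modes named above are visible in the code. A commit that merely happens to land inside the window gets credited to the session, and a session whose output informed a commit days later is counted as abandoned.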

How to use this without turning it into a blame machine

The practical rollout is simple: pick one project, run local token observability for a week, and review the outliers the way you would review slow queries. Look at cache hit rate, read/edit ratio, retry rate, MCP overhead, abandoned sessions, model/task mismatch, and cost per committed change. Then fix the boring causes. Trim instruction files. Remove unused MCP servers. Split giant tasks. Route simple edits to cheaper models. Add better validation commands so agents stop thrashing. Improve session openers so the agent reads the right files once instead of the wrong files repeatedly.

For managers, the warning is equally simple: do not turn per-user token spend into a leaderboard of shame. That will train engineers to hide experimentation, avoid hard problems, or use unmanaged tools. Aggregate by project, workflow, model, and task category first. The goal is not to find the person who burned tokens. The goal is to find the defaults and habits that make everyone burn tokens.

CodeBurn’s v0.9.8 release is not flashy. Good. Agentic coding has had enough flashy. The next productivity win is knowing which sessions actually shipped code, which ones just heated the token furnace, and what to change Monday morning. Cost observability is not the end of agent operations, but it is where the adult conversation starts.

Sources: GitHub — getagentseal/codeburn v0.9.8 release, CodeBurn README, CodeBurn changelog, npm package codeburn, CodeBurn architecture docs