CodeBurn Turns AI Coding Bills Into an Engineering Signal Instead of a Finance Surprise

AI coding agents have reached the stage every developer tool eventually reaches: the demo is over, the bill has arrived, and now someone needs to explain why Tuesday’s refactor cost more than Friday’s release.

That is the useful frame for CodeBurn, a local terminal dashboard from getagentseal that tracks token usage, cost, and performance across AI coding tools. The project is not another agent, not another wrapper, and not another gateway asking to sit between your editor and the model. It reads the session data your tools already leave on disk, prices the calls using LiteLLM’s model data, and turns the result into the kind of operational view engineers understand: cost by project, model, task type, tool, session, provider, retry pattern, and delivery signal.

That sounds small until you remember how most teams currently manage AI coding spend: vibes, plan limits, surprise invoices, and someone posting a screenshot of a rate-limit error in Slack. CodeBurn is trying to replace that with observability. Good. The industry badly needs fewer “which agent is best?” arguments and more “which workflows actually produced maintainable code for the money?” questions.

The missing top command for agent work

The repository’s pitch is direct: “See where your AI coding tokens go.” At research time, the GitHub project showed roughly 6,025 stars, 464 forks, 47 open issues, an MIT license, and fresh activity on May 10. The npm package lists codeburn as a CLI/TUI for token usage, with version 0.9.7 in the current package history. Those numbers do not make it production infrastructure by themselves, but they do say something useful: developers are actively looking for a meter.

CodeBurn supports a long list of coding-agent surfaces: Claude Code, Claude Desktop local agent mode, Codex, Cursor, cursor-agent, Gemini CLI, GitHub Copilot, OpenCode, OpenClaw, Qwen, Kiro, Roo Code, KiloCode, Goose, Antigravity, Crush, and others. The README says it covers 18 tools or surfaces, and each provider doc explains where data lives and what quirks apply. Claude Code sessions come from JSONL files under ~/.claude/projects/<sanitized-path>/<session-id>.jsonl. Codex sessions are nested under ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl. Cursor is read from a local SQLite database. OpenClaw is read from local agent logs. That architecture is the product decision that matters.
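
As a rough illustration of what reading local artifacts means in practice, the two documented layouts can be enumerated with a few lines of Python. This is a minimal sketch, not CodeBurn's implementation: the function names are hypothetical, only the directory patterns come from the provider docs cited above, and nothing is assumed about the contents of the JSONL files.

```python
from pathlib import Path


def _glob_if_present(root: Path, pattern: str) -> list[Path]:
    """Return sorted matches under root, or an empty list if root is missing."""
    return sorted(root.glob(pattern)) if root.is_dir() else []


def find_session_files(home: Path) -> dict[str, list[Path]]:
    """Locate session logs for two tools using the layouts described above.

    Hypothetical helper for illustration only.
    """
    return {
        # ~/.claude/projects/<sanitized-path>/<session-id>.jsonl
        "claude": _glob_if_present(home / ".claude" / "projects", "*/*.jsonl"),
        # ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl
        "codex": _glob_if_present(
            home / ".codex" / "sessions", "*/*/*/rollout-*.jsonl"
        ),
    }
```

Calling find_session_files(Path.home()) returns a list of paths per tool. Nothing is sent anywhere and nothing sits in the request path, which is the whole appeal of the approach.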

A proxy-based system can produce cleaner accounting because every request passes through it. It also becomes a new trust boundary, a new failure mode, and another piece of infrastructure developers have to route through before they can get work done. CodeBurn takes the messier but more deployable path: inspect local artifacts, normalize what can be normalized, and clearly label estimates where tools do not expose enough data. Cursor Auto mode hides the exact model, so CodeBurn labels the result as an estimate. Copilot VS Code transcripts may lack explicit token counts, so CodeBurn estimates them from content length and tool-call prefixes. Kiro does not expose model identity, so its sessions are labeled accordingly and costed with a fallback. That kind of uncertainty is not a bug. It is honest instrumentation in an ecosystem that has not standardized its telemetry yet.
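
The pattern of pricing what is known and flagging what is guessed can be sketched as follows. Everything here is an assumption made for illustration: the model name, the per-token prices, the four-characters-per-token heuristic, and the fallback rule are placeholders, not CodeBurn's actual logic or LiteLLM's actual data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-token USD prices for a made-up model. A real tool would
# load these from a pricing table such as the one LiteLLM maintains.
PRICES = {"example-model": {"input": 3e-06, "output": 15e-06}}
FALLBACK_MODEL = "example-model"  # assumption: used when the model is unknown
CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


@dataclass
class CallCost:
    usd: float
    estimated: bool  # True if any input to the price was guessed


def price_call(
    model: Optional[str],
    input_tokens: Optional[int],
    output_tokens: int,
    text_chars: int = 0,
) -> CallCost:
    """Price one call, marking the result when model or tokens were guessed."""
    estimated = False
    if model not in PRICES:
        # Tool did not expose a known model: cost with the fallback.
        model, estimated = FALLBACK_MODEL, True
    if input_tokens is None:
        # Tool did not log token counts: approximate from content length.
        input_tokens, estimated = text_chars // CHARS_PER_TOKEN, True
    rates = PRICES[model]
    usd = input_tokens * rates["input"] + output_tokens * rates["output"]
    return CallCost(usd=usd, estimated=estimated)
```

The design point is the estimated flag. A dashboard that carries it through to the display can show a number without pretending the number is exact.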

Cost is now a developer-experience signal

The interesting part is not that CodeBurn calculates dollars. Finance teams can do that. The interesting part is that it translates spend into engineering language.

“Claude Code cost $400 this week” is an expense line. “Opus handled simple refactors that Sonnet could have done, cache hit rate stayed weak, three unused MCP servers were loaded into every session, and abandoned sessions burned more than merged work” is a diagnosis. Engineers can fix the second sentence. They can remove stale MCP servers, trim bloated CLAUDE.md files, cap noisy Bash output, stop loading ghost skills, route low-risk tasks to cheaper models, and teach teams which workflows have high retry rates.

CodeBurn’s optimizer points directly at those failure modes. It scans for repeated file reads, low Read:Edit ratios, uncapped shell output, unused MCP servers paying schema overhead, unused agents/skills/slash commands, bloated CLAUDE.md imports, cache creation overhead, junk directory reads, context-heavy sessions, and expensive low-worth sessions with no delivery signal. That list is quietly damning. It says the waste in agentic coding is not only model pricing. It is configuration entropy.
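
Two of those checks are simple enough to sketch. The example below assumes a hypothetical, already-normalized list of tool calls, each a dict with a "tool" name and a "path". That record shape and the threshold of three reads are invented for illustration and do not reflect CodeBurn's internal format or defaults.

```python
from collections import Counter
from typing import Optional


def repeated_reads(calls: list[dict], threshold: int = 3) -> dict[str, int]:
    """Return files read at least `threshold` times in one session."""
    reads = Counter(c["path"] for c in calls if c["tool"] == "Read")
    return {path: n for path, n in reads.items() if n >= threshold}


def read_edit_ratio(calls: list[dict]) -> Optional[float]:
    """Reads per edit; None when the session made no edits at all."""
    reads = sum(1 for c in calls if c["tool"] == "Read")
    edits = sum(1 for c in calls if c["tool"] == "Edit")
    return reads / edits if edits else None


# Hypothetical session: one file read three times, one edit.
session = [
    {"tool": "Read", "path": "src/app.py"},
    {"tool": "Read", "path": "src/app.py"},
    {"tool": "Read", "path": "src/app.py"},
    {"tool": "Read", "path": "README.md"},
    {"tool": "Edit", "path": "src/app.py"},
]
print(repeated_reads(session))   # files that were re-read repeatedly
print(read_edit_ratio(session))  # reads per edit for this session
```

Neither number is a verdict on its own. They are prompts for a closer look at a session, which is how the optimizer's findings are best treated.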

This will become a normal category of performance work. Teams already profile database queries, CI time, container image size, and frontend bundles. Agent stacks add a new bundle: instructions, memories, tool schemas, MCP descriptions, repo context, shell output, and previous-turn residue. Every token that enters the context window competes with the work you actually wanted the model to do. A sloppy agent configuration is now both slower and more expensive. Congratulations, we invented context bloat.

Yield beats cheapness

The feature to watch is yield analysis. CodeBurn tries to correlate AI sessions with git commits and classify outcomes as productive, reverted, or abandoned. That is crude, and it will miss plenty: useful review sessions that do not commit, architecture planning that prevents bad work, or debugging sessions that end with a human patch. But the direction is right.

Token accounting alone can optimize the wrong thing. The cheapest session is the one nobody runs. The cheapest model is not necessarily the one that gets the migration done. A team that saves 30 percent on tokens while doubling review churn has not optimized anything; it has moved the cost into human attention. The mature metric is cost per useful outcome: merged fix, passing migration, reduced incident time, reviewed PR, shipped feature, or avoided regression.
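
The arithmetic behind cost per useful outcome is short. This sketch assumes each session has already been given a dollar cost and one of the three outcome labels mentioned above. The record shape and the sample figures are hypothetical, and deciding what counts as productive is the hard part that no formula solves.

```python
from collections import defaultdict
from typing import Optional


def spend_by_outcome(sessions: list[dict]) -> dict[str, float]:
    """Total USD per outcome label (productive, reverted, abandoned)."""
    totals: dict[str, float] = defaultdict(float)
    for s in sessions:
        totals[s["outcome"]] += s["usd"]
    return dict(totals)


def cost_per_productive_session(sessions: list[dict]) -> Optional[float]:
    """All spend, including waste, divided by the productive session count."""
    productive = sum(1 for s in sessions if s["outcome"] == "productive")
    if productive == 0:
        return None  # no useful outcome to divide by
    return sum(s["usd"] for s in sessions) / productive


# Hypothetical week of sessions.
week = [
    {"usd": 4.0, "outcome": "productive"},
    {"usd": 2.0, "outcome": "abandoned"},
    {"usd": 2.0, "outcome": "productive"},
]
print(spend_by_outcome(week))
print(cost_per_productive_session(week))
```

Dividing total spend, not just productive spend, by the number of productive sessions is deliberate: abandoned and reverted work is part of what each useful result actually cost.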

That is also where the Claude Code versus Codex versus Cursor debate gets more practical. The winner is not always the tool with the best benchmark, the largest context window, or the most impressive launch post. The winner for a given team is the tool that produces reviewed, maintainable changes with acceptable latency, predictable quota behavior, and a cost profile the team can explain. CodeBurn does not answer that question perfectly. It gives teams the raw material to stop guessing.

There is a governance angle too. Local logs can contain sensitive prompts, file paths, repository names, snippets of code, and occasionally secrets if upstream tools were careless. CodeBurn running locally is the right privacy default, but reports and exports should still be treated as developer telemetry. Do not dump them into shared dashboards without thinking. And do not turn per-user cost leaderboards into blame theater. The fastest way to ruin useful observability is to make engineers hide experimentation because they fear being labeled expensive.

The sane rollout is boring: run it privately for a week, inspect the top five expensive sessions, compare model usage by task, look for unused MCP servers, trim repo instructions, cap shell output, and review cache behavior before changing policy. If you manage a team, aggregate by project or workflow first, not by person. The goal is not to find the developer who asked too many questions. The goal is to find the defaults that made every question more expensive than it needed to be.

CodeBurn matters because it treats AI coding agents like infrastructure instead of magic. Once agents are writing code, opening PRs, reading repositories, and consuming real budget, they need the same boring instruments every serious system gets: meters, dashboards, waste reports, and outcome checks. The agent era does not need more mysticism. It needs a good top command.

Sources: getagentseal/codeburn GitHub repository, codeburn npm package, Claude provider docs, Codex provider docs, LiteLLM repository