Tokscale v3.1.2 Turns Agent Spend Into a First-Class Engineering Metric

Anatoliy Kolodkin

10 Jun 2026 • 4 min read

Tokscale v3.1.2 is the kind of small release that points at a large bill. The feature list is modest: multi-device support with UUID device IDs and TUI aggregation, a musl detection fix under Bun and Alpine, and a CSRF origin correction for the frontend deployment. But the market signal is bigger than the diff. Coding-agent spend has escaped the single-tool era, and teams need instrumentation before the invoice becomes the observability system.

The release was published on June 10, 2026, at 00:29:24 UTC. At research time, the repository had about 3,634 stars, 236 forks, and 81 open issues, with a same-day push later that morning. Tokscale’s promise is not glamorous: read usage data from a long list of AI coding agents and normalize it into token and cost reports. That is exactly why it matters. The agent market is converging functionally while fragmenting operationally.

Tokscale supports more than 25 clients and surfaces, including OpenCode, Claude Code, OpenClaw, Codex CLI, GitHub Copilot CLI, Hermes Agent, Gemini CLI, Cursor, Amp, Codebuff, Factory Droid, Pi, Kimi, Qwen, Roo Code, Kilo, Mux, Goose, Google Antigravity, Trae, Grok Build, Zed, Kiro, Cline, Gajae-Code, and Synthetic. For OpenClaw specifically, it reads usage from ~/.openclaw/agents/ and legacy paths such as .clawdbot, .moltbot, and .moldbot. Pricing is calculated using LiteLLM pricing data, including tiered pricing and cache token discounts.

That list looks excessive until you look at how developers actually work now. A single engineer may use Claude Code for a refactor, Codex for an OpenAI-native app-server path, Cursor for IDE-local edits, OpenClaw for channel-connected automation, Antigravity for Google’s stack, and Qwen or Kimi for a local or cheaper path. The workflow is becoming multi-agent by default, not because architects drew it that way, but because every tool has one mode where it feels better than the others.

The hard part is attribution, not arithmetic

Counting tokens is easy compared with answering the engineering question: what did we spend them on? Aggregate monthly spend is useful to finance, but nearly useless for improving agent workflows. Engineering needs attribution by client, workspace, model, session, and sometimes child session. Tokscale’s grouping options are the important primitive: model; client,model; client,provider,model; workspace,model; session,model; and client,session,model.

Per-session JSON includes session ID, model, input tokens, output tokens, cache tokens, reasoning tokens, message count, and cost. That is the minimum viable audit trail for agent economics. Without it, teams optimize by superstition. Someone sees a scary monthly number and bans the expensive model, even if the real issue was a runaway background task, a bad retrieval loop, cache misses, or one workspace repeatedly asking an agent to re-read the same monorepo.
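As a minimal sketch of what that attribution looks like in practice, the Python below rolls per-session records up by a composite key such as client,model. The field names (session_id, client, cost_usd, and so on), the model and client labels, and the sample values are assumptions made for illustration; they are not Tokscale's actual JSON schema or output.

```python
from collections import defaultdict

# Hypothetical per-session records; field names and values are illustrative only.
SESSIONS = [
    {"session_id": "s1", "client": "claude-code", "workspace": "api", "model": "model-a",
     "input_tokens": 1200, "output_tokens": 300, "cache_tokens": 5000,
     "reasoning_tokens": 800, "message_count": 12, "cost_usd": 0.50},
    {"session_id": "s2", "client": "openclaw", "workspace": "api", "model": "model-a",
     "input_tokens": 400, "output_tokens": 100, "cache_tokens": 0,
     "reasoning_tokens": 0, "message_count": 3, "cost_usd": 0.25},
    {"session_id": "s3", "client": "claude-code", "workspace": "web", "model": "model-a",
     "input_tokens": 900, "output_tokens": 200, "cache_tokens": 1000,
     "reasoning_tokens": 100, "message_count": 7, "cost_usd": 0.75},
]

SUMMED_FIELDS = ("input_tokens", "output_tokens", "cache_tokens",
                 "reasoning_tokens", "message_count", "cost_usd")


def group_usage(sessions, group_by):
    """Sum usage fields per group; group_by is a comma-separated key such as 'client,model'."""
    keys = [k.strip() for k in group_by.split(",")]
    totals = defaultdict(lambda: {field: 0 for field in SUMMED_FIELDS})
    for session in sessions:
        bucket = totals[tuple(session[k] for k in keys)]
        for field in SUMMED_FIELDS:
            bucket[field] += session[field]
    return dict(totals)


by_client_model = group_usage(SESSIONS, "client,model")
for key, usage in sorted(by_client_model.items()):
    print(key, usage["cost_usd"], usage["reasoning_tokens"])
```

Keeping cache and reasoning tokens as separate columns is the point of the exercise: a group whose cost is dominated by reasoning tokens calls for a different fix than one dominated by cache misses.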

Reasoning tokens deserve special attention. Modern coding agents do not just produce visible output. They deliberate, plan, call tools, replay context, compact sessions, and sometimes spawn subagents. The cost center is not the final answer; it is the whole trajectory. If your reporting collapses all of that into “tokens used,” you cannot tell whether a model is expensive because it solved a hard problem or because it generated an elaborate trace and an expensive shrug.

Multi-device support makes this operational instead of hobby telemetry

The v3.1.2 multi-device support sounds mundane, but it tracks real developer behavior. Work no longer happens on one blessed laptop. There is the local machine, the remote Linux box, a Codespace, a CI worker, a Mac mini in the closet, a GPU workstation, and maybe an SSH session into something with enough RAM to hold the repo and the model. If your usage accounting only works on the machine where you remembered to run a CLI, it is not operational reporting. It is a diary.

UUID device IDs and multi-device TUI aggregation move Tokscale toward the shape teams actually need. Device-level accounting can separate local experimentation from remote automation, personal workflow from CI, and disposable test runs from production-like agent tasks. It also creates a path to answer uncomfortable but necessary questions: which machine generated the cost, under which workspace, with which client, using which model, for which session?
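To make the device dimension concrete, here is a rough sketch of merging usage reports from several machines, each identified by a UUID. The report shape, the cost_cents_by_model field, and the helper names are hypothetical; this shows the general idea of device-keyed aggregation, not how Tokscale implements it.

```python
import uuid
from collections import Counter


def new_device_id():
    """Generate a random UUID for one machine (a stand-in for however a real tool assigns IDs)."""
    return str(uuid.uuid4())


def merge_device_reports(reports):
    """Combine per-device cost totals into a per-device view and a fleet-wide per-model view."""
    per_device = {}
    per_model = Counter()
    for report in reports:
        costs = Counter(report["cost_cents_by_model"])
        device_id = report["device_id"]
        # Add to any earlier report from the same device instead of overwriting it.
        per_device[device_id] = per_device.get(device_id, Counter()) + costs
        per_model += costs
    return per_device, per_model


laptop, ci_worker = new_device_id(), new_device_id()
reports = [
    {"device_id": laptop, "cost_cents_by_model": {"model-a": 420}},
    {"device_id": ci_worker, "cost_cents_by_model": {"model-a": 1300, "model-b": 75}},
    {"device_id": laptop, "cost_cents_by_model": {"model-b": 60}},
]
per_device, per_model = merge_device_reports(reports)

for device_id, costs in per_device.items():
    print(device_id, sum(costs.values()), "cents")
```

Costs are kept in integer cents here so that totals from different machines add up exactly. In this made-up data the CI worker, not the laptop, accounts for most of the spend, which is the kind of split a single-machine report cannot show.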

The musl detection fix under Bun and Alpine also matters more than it looks. Alpine-based containers are common in lightweight automation environments. Bun keeps showing up in modern TypeScript tooling. If cost reporting breaks in exactly the places teams run background agents or small service containers, the numbers will be biased toward interactive desktop usage and miss the automation layer where runaway spend often hides.

Agent convergence needs a meter

The broader story is that coding-agent convergence has created a boring but expensive problem: the tools can increasingly do similar work, but they do not agree on accounting. Each client stores sessions differently. Each provider prices cache reads, cache writes, reasoning, input, and output differently. Each agent decides how much context to carry, when to compact, when to call tools, and whether to spawn child runs. Finance wants a bill. Engineering wants causality. Managers want to know whether the spend bought throughput or just novelty.

LiteLLM pricing data is a reasonable normalization layer, but teams should still treat any token-cost dashboard as an approximation that needs calibration. Provider pricing changes. Discounts differ. Enterprise contracts hide real rates. Cache semantics vary. Local models have hardware and electricity costs rather than per-token invoices. The value of Tokscale is not that it will produce a perfect ledger. It is that it can make the shape of usage visible enough to argue about with evidence.
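One way to picture the calibration step is a cost estimate that applies list prices first and a team-specific correction second. The sketch below assumes made-up per-million-token rates, a discounted cache-read rate, and a single calibration multiplier; none of these numbers or names come from LiteLLM or any provider's price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical USD prices per one million tokens; not real provider pricing."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float


def estimate_cost(rates, input_tokens, output_tokens, cache_read_tokens, calibration=1.0):
    """Estimate list-price cost, then scale it by a calibration factor
    (for example 0.8 to model an assumed 20% contract discount)."""
    list_price = (
        input_tokens * rates.input_per_m
        + output_tokens * rates.output_per_m
        + cache_read_tokens * rates.cache_read_per_m
    ) / 1_000_000
    return list_price * calibration


rates = Rates(input_per_m=3.0, output_per_m=15.0, cache_read_per_m=0.3)

# Same workload twice: once with no cache hits, once with 1.5M of the 2M input tokens cached.
uncached = estimate_cost(rates, input_tokens=2_000_000, output_tokens=100_000,
                         cache_read_tokens=0)
cached = estimate_cost(rates, input_tokens=500_000, output_tokens=100_000,
                       cache_read_tokens=1_500_000)
discounted = estimate_cost(rates, input_tokens=2_000_000, output_tokens=100_000,
                           cache_read_tokens=0, calibration=0.8)

print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, discounted ${discounted:.2f}")
```

With these invented rates, serving three quarters of the input from cache cuts the estimate from about $7.50 to about $3.45, a bigger swing than the assumed contract discount. That is why cache behavior and real negotiated rates both need checking before trusting a dashboard total.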

For OpenClaw users, the advice is simple: wire usage reporting before OpenClaw becomes invisible infrastructure. The dangerous moment is not day one, when everyone is watching the shiny new agent. It is week six, when cron jobs, Slack replies, background sessions, subagents, and coding tasks are running often enough that nobody remembers which automation was experimental. Establish a baseline. Group by client and session. Track cache tokens separately. Watch reasoning tokens. Compare cost against accepted work, not generated text.

Teams should also decide what “good spend” means. A $12 agent session that safely migrates a gnarly module may be a bargain. A $1 cron run that produces no action 300 times a month may be waste. A local model that saves API spend but burns engineering time through bad tool calls may be expensive in a different column. Cost governance is not “use the cheapest model.” It is “understand which model, tool, and workflow produced value at an acceptable cost.”

Tokscale is not the whole answer. It is part of the missing instrumentation layer. And in this phase of the agent market, instrumentation is leverage. The teams that can see per-session cost, cache behavior, model mix, and workflow attribution will tune their agent stacks. The teams that cannot will discover their architecture in the bill.

Sources: Tokscale v3.1.2 release, Tokscale repository, Hermes Atlas Tokscale project page, LiteLLM
