Agent Harnesses Need Observability Before They Get More Autonomy

The next serious feature in coding agents is not “more autonomy.” It is the right to earn more autonomy by leaving a trail.

That is the useful signal in the latest work on Everything Claude Code, the very large agent-harness repository that now describes itself as a performance optimization system for Claude Code, Codex, OpenCode, Cursor, and whatever comes next. A fresh May 11 commit added an observability readiness gate for ECC 2.0: live status, session traces, harness-audit scorecards, tool-activity logs, and a local risk ledger. The line in the documentation is the whole thesis: “ECC 2.0 should be observable before it becomes more autonomous.”

That should be printed on the inside of every agent runtime’s metaphorical hard hat.

For the last year, the AI coding conversation has been stuck on patch quality. Which model writes cleaner React? Which one gets fewer tests wrong? Which one uses the right ORM without wandering off into dependency fan fiction? Those questions still matter, but they are no longer enough. Once an agent can run loops, spawn subagents, call MCP tools, load skills, mutate worktrees, and preserve state across sessions, the core question changes from “did it generate a good diff?” to “can we prove what happened while it was generating that diff?”

“Observable before autonomous” is the right ordering

The GitHub metadata tells you why this is not just a tiny repo hygiene patch. affaan-m/everything-claude-code was created in January 2026, updated again late on May 11, and now shows roughly 179,500 stars, 27,700 forks, 899 subscribers, no open issues, and an MIT license. Those numbers are absurd enough that you should treat them with the usual GitHub-popularity caveats, but the signal is still clear: this repo is part of the practitioner surface area for people trying to make agent harnesses usable across tools.

The fresh commit, 8aa8c32d2a86, adds docs/architecture/observability-readiness.md, scripts/observability-readiness.js, and tests for the readiness script. The gate checks five operator signals: live loop status via scripts/loop-status.js, session inspection via scripts/session-inspect.js, a harness baseline via scripts/harness-audit.js, local tool-usage.jsonl activity events, and a Rust-backed risk ledger in ecc2/src/observability/mod.rs for scored tool calls and paginated review.
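To make the shape of such a gate concrete, here is a minimal Python sketch, not the repository's actual scripts/observability-readiness.js. It assumes that "ready" simply means the file behind each of the five signals exists; the signal names, the `Readiness` type, the `check_readiness` function, and the location of tool-usage.jsonl at the repo root are all hypothetical. The real gate may check much more than file presence.

```python
from dataclasses import dataclass
from pathlib import Path

# Relative paths as named in the article. The keys are hypothetical labels,
# and placing tool-usage.jsonl at the repo root is an assumption.
SIGNALS = {
    "loop_status": "scripts/loop-status.js",
    "session_inspect": "scripts/session-inspect.js",
    "harness_audit": "scripts/harness-audit.js",
    "tool_activity": "tool-usage.jsonl",
    "risk_ledger": "ecc2/src/observability/mod.rs",
}


@dataclass(frozen=True)
class Readiness:
    present: tuple
    missing: tuple

    @property
    def ready(self) -> bool:
        # The gate passes only when no signal is missing.
        return not self.missing


def check_readiness(repo_root: Path, signals: dict = SIGNALS) -> Readiness:
    """Report which operator signals have a backing file under repo_root."""
    present, missing = [], []
    for name, relative_path in signals.items():
        bucket = present if (repo_root / relative_path).is_file() else missing
        bucket.append(name)
    return Readiness(tuple(present), tuple(missing))
```

A CI job could call `check_readiness(Path("."))` and fail the build when `ready` is false, printing `missing` so the operator knows which surface to add first.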

That list is refreshingly unglamorous. No one is putting “file-backed tool-usage JSONL” on a keynote slide unless the keynote has already gone badly. But this is exactly the kind of boring surface that makes autonomy operationally survivable. Live status tells you whether a loop is active, idle, stuck, or thrashing. Session traces tell you what actually happened, not what the agent summarized after the fact. A harness audit creates a repeatable baseline for tool coverage, context efficiency, memory persistence, eval coverage, security guardrails, and cost efficiency. Tool logs and a risk ledger let a human reconstruct the difference between harmless context gathering and authority-bearing behavior.
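The distinction between context gathering and authority-bearing behavior can be sketched in a few lines of Python. This is an illustration under loud assumptions, not ECC's Rust ledger: the event shape (`tool`, `args`), the tool names, the weights in `RISK_BY_TOOL`, and the shell markers are all invented for the example.

```python
import json
from typing import Iterable, Iterator

# Hypothetical weights: reads score 0, writes and shell access score higher.
RISK_BY_TOOL = {"read_file": 0, "grep": 0, "write_file": 2, "shell": 3}
# Hypothetical substrings that raise the score of a shell command.
RISKY_SHELL_MARKERS = ("rm -rf", "git push", "curl ")


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one event per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def score_event(event: dict) -> int:
    """Score a tool call; unknown tools get a nonzero default of 1."""
    score = RISK_BY_TOOL.get(event.get("tool", ""), 1)
    command = str((event.get("args") or {}).get("command", ""))
    if any(marker in command for marker in RISKY_SHELL_MARKERS):
        score += 2
    return score


def risk_page(events: Iterable[dict], threshold: int, page: int, page_size: int) -> list:
    """Return one page of events whose score meets the threshold, in log order."""
    scored = [dict(event, risk=score_event(event)) for event in events]
    risky = [event for event in scored if event["risk"] >= threshold]
    start = page * page_size
    return risky[start:start + page_size]
```

Scoring unknown tools above zero is a deliberate choice in this sketch: a tool the ledger has never seen should show up in review rather than be silently treated as harmless.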

In other words: this is not observability as dashboard garnish. It is observability as a release gate.

The cross-harness promise needs a cross-harness audit trail

ECC’s v2.0 release-candidate docs frame the project as a reusable substrate across Claude Code, Codex, OpenCode, Cursor, Gemini, and related harnesses. That is the direction the whole market is moving. Skills, rules, MCP conventions, hooks, and operator workflows are becoming the new portability layer. The problem is that portability is not the same thing as equivalence.

A SKILL.md file is relatively portable because most agent runtimes can load markdown instructions and metadata. Hooks are a different story. Claude Code, OpenCode, Cursor, Codex, and other tools do not expose identical event models, permission semantics, or lifecycle hooks. A workflow that is enforceable as a pre-tool gate in one runtime may degrade into advisory text in another. An MCP call that is logged as a first-class event in one harness may become opaque adapter noise somewhere else. A subagent handoff might preserve trace context in one system and flatten it into “the other agent did stuff” in another.

That is why observability is not a nice extra on top of portability. It is how you find out whether portability is real. If your internal agent workflow works across Claude Code and Codex but only one of those paths leaves an auditable trace of shell calls, file writes, MCP invocations, and permission boundaries, then the workflow is not equally safe across both tools. It merely installs in both places. That is a much lower bar.
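One way to test that claim is to run the same workflow in each harness and compare which kinds of events each trace actually contains. The Python sketch below assumes every harness can export its trace as a list of events with a `kind` field; the kind names and the `trace_gaps` function are hypothetical. A missing kind only counts as a gap if the workflow really exercised that behavior in that harness.

```python
# Hypothetical event kinds, mirroring the four surfaces named in the text.
REQUIRED_EVENT_KINDS = frozenset(
    {"shell_call", "file_write", "mcp_invocation", "permission_decision"}
)


def trace_gaps(traces_by_harness: dict, required=REQUIRED_EVENT_KINDS) -> dict:
    """Map each harness to the required event kinds absent from its trace.

    Harnesses with full coverage are omitted, so an empty result means
    every trace recorded every required kind at least once.
    """
    gaps = {}
    for harness, events in traces_by_harness.items():
        seen = {event.get("kind") for event in events}
        missing = set(required) - seen
        if missing:
            gaps[harness] = missing
    return gaps
```

If the result is non-empty, the workflow installs in both places but is not equally auditable in both, which is exactly the lower bar described above.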

This is also where the agent ecosystem starts to resemble CI/CD infrastructure more than editor tooling. Nobody serious deploys a CI runner based only on whether it can execute a build script. They ask where logs go, how secrets are isolated, how artifacts are retained, what happens on partial failure, how permissions are scoped, and how to reconstruct an incident. Coding agents deserve the same suspicion, preferably before they are allowed to run weekend-long refactors in repositories with production deploy credentials nearby.

Local-first is not anti-enterprise; it is the sane bootstrap

The ECC readiness doc explicitly keeps the default setup opt-in, repo-owned, deterministic, CI-safe, and file-backed, and it sends no telemetry. That matters. Hosted telemetry will eventually be useful for teams operating fleets of agents across many repositories. But shipping a hosted dashboard before the event model is trustworthy is how you get observability theater: pretty graphs over incomplete truth.

File-backed local evidence is less exciting and more useful. You can run it in CI. You can attach it to a pull request. You can inspect it without asking a vendor what their retention policy is. You can diff it as the harness evolves. Most importantly, you can block increased autonomy until the local evidence exists. “The agent loop can run unattended for four hours” should be downstream of “we can inspect what tools it used, where it wrote, what risks it scored, and why it stopped.”
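That ordering can be expressed as a precondition. The Python sketch below treats the four questions (which tools, where it wrote, what risks it scored, why it stopped) as fields that must be present in a session's event log before a longer unattended run is allowed. The event shape (`kind`, `writes`, `risk`, `reason`) and both function names are assumptions made for illustration, not ECC's format.

```python
def unanswered_questions(events: list) -> list:
    """List which of the four operator questions the log cannot answer."""
    tool_calls = [event for event in events if event.get("kind") == "tool_call"]
    stops = [event for event in events if event.get("kind") == "stop"]
    gaps = []
    if not tool_calls:
        # No tool calls recorded at all: we cannot say what tools were used.
        gaps.append("tools_used")
    if any("writes" not in call for call in tool_calls):
        # An empty list is fine; a missing field means writes went unrecorded.
        gaps.append("paths_written")
    if any("risk" not in call for call in tool_calls):
        gaps.append("risk_scores")
    if not any(stop.get("reason") for stop in stops):
        gaps.append("stop_reason")
    return gaps


def may_run_unattended(previous_session_events: list) -> bool:
    """Allow a longer unattended run only if the last session was fully inspectable."""
    return not unanswered_questions(previous_session_events)
```

The check says nothing about whether the previous session behaved well. It only establishes that a human could find out, which is the precondition the paragraph above argues for.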

For engineering teams, the practical checklist is straightforward. Before you run larger autonomous batches, require loop status, session inspection, a harness audit, tool-activity logging, and reviewable records of risky calls. If a skill or plugin invokes scripts, make sure those script invocations land in the same audit stream as normal tool calls. If a workflow is converted across harnesses, test that traceability survives the conversion. If a task’s output is headed for a PR, attach the agent’s operational evidence, not just its confident final paragraph.

The caveat is obvious but worth saying: observability does not make an unsafe agent safe. It makes failure inspectable. A risk ledger will not undo a destructive write. A trace will not prevent every secret leak. A harness score can become theater if the team optimizes for the number instead of the failure mode. But without these surfaces, you are not running an agent runtime. You are letting a probabilistic contractor use your shell and hoping the transcript is honest.

ECC’s latest gate is a small implementation detail pointing at the correct industry direction. Autonomy should not be a slider teams drag upward because the model got better this week. It should be a privilege earned by runtime evidence: loops, sessions, tools, risks, and handoffs that humans can inspect after the model stops talking.

Sources: GitHub — affaan-m/everything-claude-code, observability readiness commit, ECC observability readiness docs, ECC cross-harness architecture, Claude Code hooks reference