A Tiny HN Launch Accidentally Captures the Real Agentic-Coding Standard: Harness First, Model Second

Anatoliy Kolodkin

31 May 2026 • 5 min read

The most useful agentic-coding release this morning is not a model launch, benchmark graph, or IDE demo. It is a tiny GitHub repository with two stars at collection time saying the part teams keep learning the expensive way: production agents are mostly harness, not magic.

AlexDuchDev/agentic-product-standard was created on May 30, pushed again on May 31, and surfaced on Hacker News as “A standard for building production AI agents (+ installable Claude Code skills).” The HN thread was basically silent, with 2 points and no comments when fetched, which is almost funny given how closely the repo matches where serious agentic coding is already going. The repo’s thesis is blunt: “An agent is not a prompt.” In its framing, an agent is a bounded execution unit with contracts, scoped context, permissions, durable state, verification, and traceability.

That sounds less exciting than “10x engineer in your terminal.” Good. Excitement is not the missing primitive in agentic coding. The missing primitive is a reviewable operating model.

The standard is small, but the checklist is not

The repo packages a product-level STANDARD.md, a single-agent AGENT_STANDARD.md, templates, examples, and Claude Code skills under two tracks: agentic-product-architect for product and multi-agent design, and agent-builder for single production-grade agents. It recommends installing those skills into Claude Code so the guidance appears inside the actual build loop rather than living as a forgotten architecture doc.

The five principles are the useful compression: “Determinism by default, agency by necessity,” “Architecture beats framework,” “Harness > model,” “Context engineering is the core discipline,” and “Eval-driven development is non-negotiable.” None of these are novel in isolation. Anthropic’s “Building effective agents,” OpenAI’s agent guide, LangChain’s context-engineering work, HumanLayer’s 12-factor framing, and the last year of coding-agent runtime fixes all point in the same direction. The value here is that the repo turns those instincts into something a team can argue over in a design review.

The autonomy ladder is the sharpest part. It defines L0 as a single LLM call, L1 as an augmented LLM with retrieval/tools/memory, L2 as a deterministic workflow, L3 as orchestrator-worker, and L4 as a full autonomous agent loop. The escalation rule is refreshingly hostile to demo-driven design: do not climb to the next level until the current one delivers at least a 90% pass rate on a curated eval set.
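As a rough illustration, the ladder and its escalation rule fit in a few lines of Python. The enum member names, `pass_rate`, `may_escalate`, and the choice to represent an eval run as a list of pass/fail booleans are assumptions made for this sketch, not code from the repo.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """The five levels described above; member names are illustrative."""
    L0_SINGLE_CALL = 0
    L1_AUGMENTED_LLM = 1
    L2_WORKFLOW = 2
    L3_ORCHESTRATOR_WORKER = 3
    L4_AUTONOMOUS_LOOP = 4


PASS_RATE_THRESHOLD = 0.90  # the 90% figure from the escalation rule


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed; an empty eval set counts as 0."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def may_escalate(current: Autonomy, results: list[bool]) -> bool:
    """Allow a move to the next level only if the current one clears the bar."""
    if current >= Autonomy.L4_AUTONOMOUS_LOOP:
        return False  # nothing above the full agent loop
    return pass_rate(results) >= PASS_RATE_THRESHOLD
```

Treating an empty eval set as a failure is a deliberate choice in this sketch: no evidence should never count as permission to add autonomy.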

That one rule would prevent a large percentage of bad “agent” projects. Most teams do not need an autonomous loop. They need a workflow with one or two bounded LLM steps, schema validation, and a boring retry policy. But “workflow” does not demo as well as “agent,” so the model gets control flow it did not earn. Then the team discovers that every extra degree of autonomy adds cost, latency, audit burden, permission design, and failure recovery.
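A minimal sketch of that boring alternative, assuming a hypothetical two-field output schema and an `llm` callable that takes a prompt and returns a string. The model fills in one step; ordinary code owns validation, retries, and the decision to give up.

```python
import json
from typing import Callable

# Hypothetical output schema for one bounded LLM step.
REQUIRED_FIELDS = {"summary": str, "risk": str}


def validate(raw: str) -> dict:
    """Parse the model output and check it against the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, kind in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), kind):
            raise ValueError(f"field {name!r} missing or not {kind.__name__}")
    return data


def run_step(llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model a bounded number of times; control flow stays in code."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return validate(llm(prompt))
        except ValueError as exc:
            last_error = exc  # retry on malformed or off-schema output
    raise RuntimeError(f"step failed after {max_attempts} attempts: {last_error}")
```

The model never decides whether to retry or what happens next. That is the difference between a workflow step and an agent loop.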

Claude Code skills are becoming project policy

The timing matters. Anthropic’s recent Claude Code releases made local .claude/skills more prominent, and teams are clearly beginning to treat skills as repo-local behavior packs. That is powerful because it moves agent practice from tribal prompt lore into versioned project assets. It is also dangerous because a skill can standardize bad assumptions just as easily as good ones.

This repo’s Claude Code skills should not be treated as a magic install. They should be treated like a proposed policy bundle. Review what the skills instruct the agent to do. Decide which teams may use them. Scope them per project when possible. Keep personal experiments out of shared repos. If your organization already reviews GitHub Actions because YAML can mutate production, it should review agent skills because prose can steer tools, permissions, and code generation.

The repo itself argues that permissions must be “enforced by code, never by prompt,” citing the 2025 Replit database-deletion incident where an agent ignored a “code freeze” instruction and wiped a production database holding records on more than 1,200 executives and nearly as many companies. That example has become the cautionary tale for a reason. Prompt-level restrictions are intent. Code-level restrictions are boundaries. A production agent needs the latter.
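To make the distinction concrete, here is a hedged sketch of a boundary enforced in code. `ToolGateway`, `PermissionDenied`, and the two toy tools are hypothetical names invented for this example. The point is only that the check runs in the harness, so no instruction in the prompt can widen it.

```python
from typing import Any, Callable


class PermissionDenied(Exception):
    """Raised when the model requests a tool outside its allowlist."""


class ToolGateway:
    """Every tool call from the model passes through here."""

    def __init__(self, tools: dict[str, Callable[..., Any]], allowed: set[str]) -> None:
        self._tools = tools
        self._allowed = frozenset(allowed)  # fixed at construction time

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._allowed:
            raise PermissionDenied(f"tool {name!r} is not permitted for this agent")
        return self._tools[name](**kwargs)


# Hypothetical tools, for illustration only.
def read_rows(table: str) -> str:
    return f"rows from {table}"


def drop_table(table: str) -> str:
    return f"dropped {table}"


# During a code freeze the agent is simply built without write tools.
gateway = ToolGateway(
    tools={"read_rows": read_rows, "drop_table": drop_table},
    allowed={"read_rows"},
)
```

With this setup, `gateway.call("drop_table", table="users")` raises `PermissionDenied` no matter what the model was told or decided.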

The harness is the product

The standard’s seven-layer harness is where the real work lives: agent loop; context and memory management; durable execution; guardrails; human-in-the-loop gates; evaluation; and observability/tracing. The repo says “98% of reliability” lives in the code around the LLM. Treat the number as a slogan, not a measurement, but the direction is correct. Model quality matters. Harness quality determines whether anyone can safely use the model after the demo.

For coding agents, the harness is where the engineering team encodes the things the model should not be trusted to remember: tool allowlists, side-effect classes, retry semantics, output schemas, trace IDs, approval gates, state storage, and stop conditions. It is also where cost governance lives. A full agent loop that repeatedly reads the wrong files is not just less reliable; it is more expensive in exactly the way finance will notice after the invoice lands.
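Stop conditions and cost governance can be sketched the same way. The `RunBudget` class below is a hypothetical example, with made-up limits, of a harness-side counter that ends a run once it exceeds a step or token budget.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised by the harness to stop a run; the model is not consulted."""


@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> None:
        """Record one loop iteration and stop the run if a limit is crossed."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
```

A loop that calls `budget.charge(...)` after every model turn has a hard ceiling on spend, which is easier to defend than a prompt asking the agent to be efficient.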

The repo’s context advice is practical: write durable state outside the context window, select only relevant context, compress old or low-value material, and isolate independent subtasks into separate windows. It recommends keeping context-window usage below roughly 40%, arguing that degradation beyond that point is non-linear. That threshold should be validated per model and task, not worshipped as physics. But the underlying lesson is right: stuffing a model with raw transcripts, unused tool descriptions, and half the repository is not “giving it context.” It is making retrieval the model’s problem because the harness failed to do its job.
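One way to picture the "select" step is a greedy pick under a budget. This sketch assumes each candidate already carries a token count and a relevance score from some ranker. `ContextItem`, `select_context`, and the default of 40 percent are illustrative, and the threshold is a parameter precisely because it should be tuned per model and task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    name: str
    tokens: int
    relevance: float  # hypothetical score from whatever ranker the harness uses


def select_context(
    items: list[ContextItem], window_tokens: int, max_percent: int = 40
) -> list[ContextItem]:
    """Greedily keep the most relevant items that fit under the budget."""
    budget = window_tokens * max_percent // 100
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if used + item.tokens <= budget:
            chosen.append(item)
            used += item.tokens
    return chosen
```

Anything that does not make the cut stays in durable storage, where the harness can fetch it later if a step actually needs it.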

There is one place where the repo is a little too confident: “MCP-first” and “do not write custom integrations where an MCP server already exists.” MCP is clearly becoming the agent-tool standard, and reusable servers are good engineering. But an MCP server is also a supply-chain and permission surface. A narrow internal tool exposing exactly one safe operation can be better than a broad third-party server exposing twenty almost-safe ones. The right rule is not “MCP always.” The right rule is “smallest auditable capability surface, preferably reusable when reuse does not widen risk.”

What teams should actually do Monday

Do not start by installing this repo and declaring the agent platform solved. Start by stealing its review questions. Before any coding agent touches a real repository, require an agent contract: mission, owner, non-owner responsibilities, inputs, required context, optional context, tools, forbidden actions, output schema, acceptance criteria, failure modes, escalation rules, and logging requirements.
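The contract can live as a reviewable data structure rather than a wiki page. The sketch below mirrors the fields listed above. The class name, the field types, and the `missing_sections` helper are assumptions for illustration, not the repo's template.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AgentContract:
    mission: str = ""
    owner: str = ""
    non_owner_responsibilities: list[str] = field(default_factory=list)
    inputs: list[str] = field(default_factory=list)
    required_context: list[str] = field(default_factory=list)
    optional_context: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    forbidden_actions: list[str] = field(default_factory=list)
    output_schema: dict = field(default_factory=dict)
    acceptance_criteria: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)
    escalation_rules: list[str] = field(default_factory=list)
    logging_requirements: list[str] = field(default_factory=list)


# Sections that may legitimately be empty (an assumption of this sketch).
MAY_BE_EMPTY = {"optional_context"}


def missing_sections(contract: AgentContract) -> list[str]:
    """Names of required sections that are still blank."""
    return [
        f.name
        for f in fields(contract)
        if f.name not in MAY_BE_EMPTY and not getattr(contract, f.name)
    ]
```

A pre-merge check that fails while `missing_sections(...)` is non-empty turns the review questions into something a pipeline can enforce.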

Then classify tools by side effect. Read-only file search is not the same as shell execution. Drafting a patch is not the same as applying it. Applying a patch is not the same as pushing to a remote branch. Any destructive action should require approval enforced outside the model. Any external write should leave a trace a human can reconstruct later.
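A small sketch of that classification, under stated assumptions: the tool names and the `TOOL_CLASSES` mapping are invented, shell execution is treated as the worst case because it can do anything, and unknown tools fail closed. The `approve` callable stands in for whatever human or policy check sits outside the model.

```python
from enum import Enum
from typing import Callable


class SideEffect(Enum):
    READ_ONLY = "read_only"
    LOCAL_WRITE = "local_write"
    EXTERNAL_WRITE = "external_write"
    DESTRUCTIVE = "destructive"


# Hypothetical tool names and classes, for illustration only.
TOOL_CLASSES = {
    "search_files": SideEffect.READ_ONLY,
    "draft_patch": SideEffect.READ_ONLY,
    "apply_patch": SideEffect.LOCAL_WRITE,
    "git_push": SideEffect.EXTERNAL_WRITE,
    "run_shell": SideEffect.DESTRUCTIVE,
    "delete_branch": SideEffect.DESTRUCTIVE,
}


def authorize(tool: str, approve: Callable[[str], bool], trace: list[str]) -> bool:
    """Decide in code whether a tool call may proceed, and record risky ones."""
    effect = TOOL_CLASSES.get(tool, SideEffect.DESTRUCTIVE)  # unknown: fail closed
    allowed = True
    if effect is SideEffect.DESTRUCTIVE:
        allowed = bool(approve(tool))  # approval happens outside the model
    if effect in (SideEffect.EXTERNAL_WRITE, SideEffect.DESTRUCTIVE):
        outcome = "allowed" if allowed else "denied"
        trace.append(f"{tool}:{effect.value}:{outcome}")
    return allowed
```

In a real harness the trace would go to durable storage with a run ID. A plain list is used here only to keep the example self-contained.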

Build evals from observed failures, not generic “helpfulness.” The repo recommends at least 50 examples per top-priority failure mode, calibrated binary LLM judges where appropriate, and CI gates that block regressions. Even before that infrastructure exists, read 20 to 50 real traces by hand. Teams want dashboards too early. The first observability tool is a senior engineer looking at what the agent actually did and asking, “Would I trust this pattern again?”
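A CI gate along those lines might look like the following sketch. The `gate` function, the per-mode baseline dictionary, and the failure-mode name used in the usage note are assumptions. The only number taken from the source is the minimum of 50 examples.

```python
MIN_EXAMPLES = 50  # per top-priority failure mode, per the repo's recommendation


def gate(baseline: dict[str, float], current: dict[str, list[bool]]) -> list[str]:
    """Return a list of problems; an empty list means the build may proceed."""
    problems: list[str] = []
    for mode, floor in baseline.items():
        results = current.get(mode, [])
        if len(results) < MIN_EXAMPLES:
            problems.append(f"{mode}: only {len(results)} examples, need {MIN_EXAMPLES}")
            continue
        rate = sum(results) / len(results)
        if rate < floor:
            problems.append(f"{mode}: pass rate {rate:.2f} below baseline {floor:.2f}")
    return problems
```

A CI job would call `gate({"wrong_file_edit": 0.90}, results)` with the judged outcomes for each failure mode and exit non-zero if the returned list is not empty.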

The bigger story is not that one small repo has become the standard. It has not. Two stars and a quiet HN thread are not consensus. The bigger story is that the agentic-coding conversation is converging on the same shape from multiple directions: less prompt mysticism, more contracts; fewer autonomous loops by default, more deterministic workflows; fewer screenshots of agents “working,” more trace review and permission design.

That is the right turn. The next serious coding-agent advantage will not come from who can write the longest prompt. It will come from who can make agent runs replayable, auditable, bounded, cheap enough to operate, and boring enough to trust. Harness first, model second. The repo gets that part exactly right.

Sources: GitHub — agentic-product-standard, STANDARD.md, AGENT_STANDARD.md, Anthropic — Building effective agents, OpenAI — A practical guide to building agents.
