
AgentDoG 1.5 Is a Small-Model Guardrail for the Part of Agents People Keep Pretending Is Safe: the Trajectory

Anatoliy Kolodkin

29 May 2026 • 3 min read

Most AI safety systems still inspect the part of an agent run that arrives after the damage is already done: the final answer. AgentDoG 1.5 is interesting because it moves the review point upstream, into the trajectory itself — the chain of observations, tool calls, approvals, command outputs, memory updates, and state changes where agent risk actually accumulates.

That distinction matters. A chatbot can say something unsafe. An agent can do something unsafe, then summarize it politely. If the guardrail only reads the summary, it is doing incident response with a lint rule.

The AgentDoG 1.5 paper extends the earlier ATBench safety framework into the operating surfaces that now matter for builders: Codex-style repository agents and OpenClaw-style multi-session tool agents. The taxonomy keeps three axes — Risk Source, Failure Mode, and Real-World Harm — then adds categories that sound painfully familiar to anyone shipping agent runtimes: session contamination, repository artifact injection, approval bypass, unsafe shell execution, destructive workspace mutation, and dependency or MCP supply-chain compromise.

The benchmark base contains 1,000 audited trajectories, almost perfectly split between 503 safe and 497 unsafe examples. Those traces include 2,084 available tools, 1,954 invoked tools, an average of 9.01 turns, and about 3.95k tokens per trajectory. That is still small by frontier-training standards, but it is large enough to make the core point: agent safety is not a prompt classification task. It is trace review.

The small-model result is useful, but the diagnosis result is the real story

The project page reports AgentDoG-1.5-Qwen3.5-4B at 92.2% R-Judge accuracy and 72.4% ATBench accuracy, while the unified 4B variant reaches 78.4% ATBench accuracy. GPT-5.4 is slightly better on R-Judge at 93.3% and scores 73.7% on ATBench, which puts it just ahead of AgentDoG-1.5-Qwen3.5-4B but behind the unified variant. That comparison should not be oversold as “a 4B model beats GPT.” Benchmark distributions are narrow, and safety classifiers are notoriously sensitive to how the examples are written.

The more interesting result is fine-grained attribution. AgentDoG-1.5-Qwen3.5-4B reports 75.2% Risk Source accuracy, 27.5% Failure Mode accuracy, and 62.9% Real-World Harm accuracy on ATBench. GPT-5.4 reports 33.6% / 13.5% / 30.2% on the same dimensions. Translation: a frontier model may often tell you that a run looks unsafe, but a specialized guard model is better at saying which kind of unsafe you are looking at.

That matters operationally. “Unsafe” is not a remediation plan. A repo-artifact injection needs different controls than an egress leak. Approval bypass needs different instrumentation than a hallucinated shell command. Destructive workspace mutation needs rollback and filesystem policy. A useful guardrail should not just block; it should produce a diagnosis that maps to runtime controls and regression tests.
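To make "a diagnosis that maps to runtime controls" concrete, here is a minimal sketch of a lookup from a diagnosis label to follow-up actions. The label strings, the action descriptions, and the `plan_remediation` helper are all assumptions made for illustration; they are not AgentDoG's output schema.

```python
from typing import List, Optional

# Hypothetical diagnosis labels and follow-up actions, loosely named after the
# categories mentioned in the article. None of this is AgentDoG's real schema.
REMEDIATIONS = {
    "repository_artifact_injection": [
        "quarantine untrusted repository files",
        "add an injection regression test",
    ],
    "approval_bypass": [
        "audit approval-gate logs",
        "require an explicit approval record per action",
    ],
    "unsafe_shell_execution": ["tighten the command allowlist"],
    "destructive_workspace_mutation": [
        "roll back the workspace snapshot",
        "enforce filesystem policy",
    ],
}


def plan_remediation(verdict: str, failure_label: Optional[str]) -> List[str]:
    """Turn a guard verdict plus a diagnosis label into concrete next steps."""
    if verdict != "unsafe":
        return []
    # Unknown or missing labels fall back to a human, not to a silent default.
    return REMEDIATIONS.get(failure_label, ["escalate to a human reviewer"])
```

The fallback branch is the important design choice: with Failure Mode accuracy still low, an unrecognized or missing label should route to a person rather than to a guessed control.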

The authors also claim their lightweight finite-state simulation environment reduces memory overhead and startup latency to about 1/100 of Docker-level environments and can support 10,000+ concurrent agentic environments on an 8-core machine. If that holds under real workloads, it points to the economics of agent safety: you cannot put every trace through an expensive frontier model and call that governance. The safety layer has to be cheap enough to run often, close enough to the runtime to see state, and structured enough to feed CI.

The practitioner move is straightforward: start treating trajectories as audit artifacts. Log user intent, retrieved context, tool manifests, tool calls, approvals, command outputs, file mutations, memory writes, retries, and final responses in one reconstructable object. If your system cannot replay the run, a trajectory-level guard model cannot help you. More importantly, your human reviewer cannot help you either.
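As a rough illustration of "one reconstructable object," the sketch below stores a run as a single serializable record. The class names, field names, and event kinds are assumptions for this example, not a format defined by the AgentDoG paper or any agent runtime.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TrajectoryEvent:
    # Hypothetical event kinds: "tool_call", "approval", "command_output",
    # "file_mutation", "memory_write", "retry".
    kind: str
    payload: dict[str, Any]


@dataclass
class Trajectory:
    run_id: str
    user_intent: str
    tool_manifest: list[str]
    retrieved_context: list[str] = field(default_factory=list)
    events: list[TrajectoryEvent] = field(default_factory=list)
    final_response: str = ""

    def record(self, kind: str, **payload: Any) -> None:
        """Append one event in the order it happened."""
        self.events.append(TrajectoryEvent(kind, payload))

    def to_json(self) -> str:
        """Serialize the whole run so it can be stored and replayed later."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Trajectory":
        """Rebuild the run, including typed events, from its stored form."""
        data = json.loads(raw)
        data["events"] = [TrajectoryEvent(**e) for e in data["events"]]
        return cls(**data)
```

A usage example under the same assumptions:

```python
run = Trajectory(run_id="r1", user_intent="fix failing test", tool_manifest=["shell"])
run.record("tool_call", tool="shell", command="pytest -q")
run.record("approval", granted=True)
run.final_response = "Tests pass."
restored = Trajectory.from_json(run.to_json())
```

Keeping everything in one ordered event list, rather than in separate logs per subsystem, is what lets a guard model or a human reviewer read the run in the order it actually happened.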

AgentDoG should not replace deterministic controls. The right stack still starts with least-privilege credentials, allowlisted tools, explicit approval gates, sandboxed repositories, egress limits, cost and time kill switches, and clean separation between sessions. A learned trajectory judge belongs above that layer as an auditor, regression signal, and postmortem assistant — not as the only thing standing between a model and production infrastructure.
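The deterministic layer can be very plain. The sketch below shows an allowlist plus an approval gate that runs before any tool call; `ToolPolicy`, `check_tool_call`, and the tool names are hypothetical and stand in for whatever a real runtime uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    # Hypothetical policy: which tools may run at all, and which of those
    # additionally need an explicit approval before they run.
    allowed_tools: frozenset
    needs_approval: frozenset


def check_tool_call(policy: ToolPolicy, tool: str, approved: bool) -> str:
    """Return "deny", "needs_approval", or "allow" without consulting any model."""
    if tool not in policy.allowed_tools:
        return "deny"
    if tool in policy.needs_approval and not approved:
        return "needs_approval"
    return "allow"


policy = ToolPolicy(
    allowed_tools=frozenset({"read_file", "run_tests", "git_push"}),
    needs_approval=frozenset({"git_push"}),
)
```

Nothing here is learned, which is the point: the check gives the same answer every time, and a trajectory judge can then audit whether the gate was actually consulted.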

There is also a benchmark caveat. AgentDoG 1.5 is unusually close to the current agent-runtime discourse, including Codex and OpenClaw-style categories. That makes it relevant, but it also means teams should test it against their own traces before trusting the numbers. The best Failure Mode accuracy cited here is still only 27.5%. That is useful enough for triage, not enough for fully automated incident classification.

The direction is right. Agent safety is finally shifting from “moderate the answer” to “review the run.” That is the same shift software teams made when they moved from checking release notes to reading logs, traces, diffs, and deploy metadata. Agents need the same discipline. The risky part is not always what the model says. It is what the model touched on the way there.

One more practical detail: run these evaluators on safe traces too, not only failures. A guardrail exercised only on incident-shaped examples will teach teams very little about false positives, reviewer fatigue, and the normal tool-use patterns that should remain unblocked. Safety tooling has to protect the workflow without making every useful agent action feel like suspicious behavior.
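One simple way to track this is to measure how often the guard flags runs a human has already labeled safe. The sketch below assumes verdicts are the plain strings "safe" and "unsafe"; the function name and label format are illustrative.

```python
from typing import Sequence


def false_positive_rate(labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Share of human-labeled safe traces that the guard flagged as unsafe."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    on_safe = [pred for label, pred in zip(labels, predictions) if label == "safe"]
    if not on_safe:
        raise ValueError("no safe traces in the evaluation set")
    return sum(pred == "unsafe" for pred in on_safe) / len(on_safe)
```

The explicit error on an evaluation set with no safe traces is deliberate: a set built only from incidents cannot produce this number at all.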

Sources: arXiv, AgentDoG project page, Hugging Face Papers, ATBench-Claw dataset
