Agent Security Harness 4.4.2 Says the Quiet Part Out Loud: Even Security Docs Now Need Threat Modeling

Anatoliy Kolodkin

24 May 2026 • 4 min read

Agent Security Harness 4.4.2 is a docs-only release, which is usually newsletter poison. No code changed. No tests changed. The test suite is still 470 tests across 32 modules. And yet this release is worth covering because it says something uncomfortable about the next phase of agent security: even defensive security documentation now needs threat modeling, since automated scanners read it as part of the artifact.

The release adds citation metadata, cleans up skill bundle metadata, addresses scanner findings, hardens MCP example documentation, replaces stale badges, and rewrites offensive-looking descriptions into defensive-probe language. That sounds like housekeeping. It is actually a preview of how AI-security projects will have to operate when LLM-driven scanners, package registries, and agent skill stores classify repositories by strings, examples, and intent signals, not just by executable behavior.

Defensive tools contain offensive vocabulary by design

The release note’s most useful sentence is blunt: “LLM-driven scanners read bundled docs as if they were code; string-density on offensive vocabulary defines the verdict more than test capability does.” That is the whole story. A repository that tests prompt injection, credential theft, capability escalation, MCP abuse, A2A downgrade paths, and adversarial protocol behavior will necessarily contain language that looks dangerous out of context. The problem is not that scanners are useless. The problem is that scanner interpretation becomes part of the release surface.

Agent Security Harness responds by rewriting a GTG-1002 capability table in docs/ADVANCED.md. The old framing used columns such as “Real GTG-1002 Activity” and “What We Test.” The new framing uses “Adversary behavior we probe for” and “Detection probes the harness sends.” That may look like wordsmithing, but it changes the intent a machine reader infers from the table. The project is not trying to hide the risky behavior it tests. It is trying to make clear that the repository contains controlled defensive probes, not operational attack instructions.
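To see why framing can flip a verdict, here is a deliberately naive string-density scorer. It is a minimal sketch and not how ClawScan, VirusTotal Code Insight, or any real LLM-driven scanner works. The word lists, the threshold, and the two sample rows are all invented for illustration. Only the column headers echo the ones quoted above.

```python
import re

# Invented vocabularies for illustration; real scanners do not publish lists like these.
OFFENSIVE = {"exploit", "steal", "escalate", "attack", "exfiltrate", "payload"}
DEFENSIVE = {"probe", "probes", "detection", "harness", "fixture", "defensive"}


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; hyphenated identifiers such as GTG-1002 stay whole."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def naive_verdict(text: str, threshold: float = 0.05) -> str:
    """Flag text whose offensive-term density, net of defensive framing, exceeds threshold."""
    toks = tokens(text)
    if not toks:
        return "benign"
    offensive = sum(t in OFFENSIVE for t in toks)
    defensive = sum(t in DEFENSIVE for t in toks)
    density = max(offensive - defensive, 0) / len(toks)
    return "suspicious" if density > threshold else "benign"


# Two invented phrasings of the same table row.
old_row = ("Real GTG-1002 activity: exploit the MCP server, steal credentials, "
           "escalate access. What we test: the same attack.")
new_row = ("Adversary behavior we probe for: credential exposure via the MCP server. "
           "Detection probes the harness sends: a fixture token the server must refuse to return.")

print(naive_verdict(old_row))  # suspicious
print(naive_verdict(new_row))  # benign
```

The same sketch also shows the weakness of string-density scoring. Defensive words offset offensive ones, so an author who wanted to game the score could do so cheaply. That is one more reason a scanner verdict should start a review rather than end it.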

This is not a new tension in security research. Good security docs have always needed to explain attacks without becoming a copy-paste abuse guide. What is new is that automated classifiers now sit between the project and adoption. A human reviewer can understand context, test boundaries, fixtures, and safety language. A scanner may see concentrated adversarial vocabulary and classify the repo as suspicious. If that scanner feeds a registry badge, install policy, or enterprise allowlist, phrasing becomes operational.

The scanner is part of the audience now

The 4.4.2 release adds CITATION.cff, letting GitHub render citation metadata and Zenodo ingest author information. The README citation section is anchored to ORCID 0009-0003-6736-1900 and Zenodo DOIs for methodology preprints. That matters more than it sounds because agent-security tooling is currently full of demos, one-off benchmarks, vendor claims, and “red team” scripts with unclear provenance. Citation metadata does not prove correctness, but it makes the work easier to inspect, cite, reproduce, and compare.

The release also adds SKILL.md with OpenClaw metadata and a Safety & Credentials section addressing prior ClawHub scan findings. It replaces a stale SafeSkill 85/100 badge with verified-clean badges: ClawScan Benign, Static Analysis Benign, and VirusTotal 0/92 Clean. Badge hygiene is not a substitute for review, but stale security badges are actively harmful. They tell users “someone checked this” without telling them when, against what, or whether the result still applies.
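A stale badge is a claim without a scope. One way to make the scope explicit is to record what was scanned and when. The badge then counts as current only while both facts still hold. This is a sketch under stated assumptions:

- The ScanClaim record, the 30-day window, and the commit hashes are hypothetical.
- None of the named scanners is actually queried.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ScanClaim:
    """Hypothetical record backing a badge: who scanned what, and when."""
    scanner: str       # e.g. "VirusTotal"
    result: str        # e.g. "0/92 Clean"
    checked_on: date   # date the scan ran
    commit: str        # revision that was actually scanned


def is_current(claim: ScanClaim, head_commit: str, today: date,
               max_age_days: int = 30) -> bool:
    """A claim applies only to the revision it scanned, and only while it is recent."""
    fresh = today - claim.checked_on <= timedelta(days=max_age_days)
    return fresh and claim.commit == head_commit


claim = ScanClaim("VirusTotal", "0/92 Clean", date(2026, 5, 20), "a1b2c3d")
print(is_current(claim, "a1b2c3d", date(2026, 5, 24)))  # True
print(is_current(claim, "e4f5a6b", date(2026, 5, 24)))  # False: the repo changed since the scan
print(is_current(claim, "a1b2c3d", date(2026, 8, 1)))   # False: the scan is too old
```

Pinning the claim to a commit is deliberately strict. Even a docs-only release like 4.4.2 invalidates it, which is the right behavior when scanners read docs as part of the artifact.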

The MCP documentation fixes are another good signal. The release says telemetry is opt-in and disabled by default. It corrects the MCP server example to use the real entry point, python -m mcp_server, with stdio as the default transport, and documents HTTP transport hardening. That is exactly the level of specificity security projects owe users. “Run the server” is not enough when agent tooling routinely bridges local processes, HTTP endpoints, credentials, and model-driven calls. The transport default is a security decision, even when it appears in a README.
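The transport point can be made concrete as a preflight check. This sketch assumes a hypothetical TransportConfig with stdio as the default and telemetry off unless enabled. The field names and the two HTTP rules (bind to loopback, require an auth token) are illustrative choices. They are not the harness's documented hardening steps or mcp_server's actual options.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Optional


@dataclass
class TransportConfig:
    """Hypothetical server settings; the defaults are the safe choices."""
    transport: str = "stdio"         # no network listener unless asked for
    host: str = "127.0.0.1"
    auth_token: Optional[str] = None
    telemetry: bool = False          # opt-in, disabled by default


def _is_loopback(host: str) -> bool:
    if host == "localhost":
        return True
    try:
        return ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as non-loopback


def preflight(cfg: TransportConfig) -> list[str]:
    """Return human-readable problems; an empty list means the config passes."""
    if cfg.transport == "stdio":
        return []
    if cfg.transport != "http":
        return [f"unknown transport {cfg.transport!r}"]
    problems = []
    if not _is_loopback(cfg.host):
        problems.append("HTTP transport bound to a non-loopback address")
    if not cfg.auth_token:
        problems.append("HTTP transport without an auth token")
    return problems


print(preflight(TransportConfig()))                                  # []
print(preflight(TransportConfig(transport="http", host="0.0.0.0")))  # both HTTP problems
```

Making the safe configuration the zero-argument default means a user who copies the README example gets stdio. Exposing a network listener then takes a deliberate change.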

Agent security is broader than prompt injection

The project describes itself as a harness with 470 executable security tests across 32 modules, covering MCP, A2A, L402, x402, decision governance, benchmark integrity, and skill supply chain. The README lists 18 MCP tests, 13 A2A tests, 85 L402/x402 tests, 25 cloud plus 20 enterprise platform tests, 17 GTG-1002 APT simulation tests, 50 jailbreak and over-refusal tests, and AIUC-1 mapping to 19 of 20 testable requirements. Those numbers should not be accepted uncritically, but they show the right mental model: agent security is not one eval category.

Prompt injection still matters. So does over-refusal. But production agent systems also fail through protocol confusion, replay, downgrade behavior, malformed identity assumptions, unsafe payment flows, credential exposure, tool-capability drift, benchmark contamination, and bad decision governance. The interesting part of a harness like this is not whether it can produce a scary demo. It is whether it can turn those failure modes into repeatable probes with clear expected behavior.
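Here is what a repeatable probe with clear expected behavior might look like for one failure mode from that list: replay. Everything in it is a toy. ToyPaymentEndpoint, the nonce scheme, and the 200/409 status codes are assumptions for illustration, not the harness's L402/x402 tests.

```python
from dataclasses import dataclass
from typing import Callable


class ToyPaymentEndpoint:
    """Stand-in target that should accept each nonce once (replay protection)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def handle(self, nonce: str) -> int:
        if nonce in self._seen:
            return 409  # replay rejected
        self._seen.add(nonce)
        return 200


class NoReplayCheck(ToyPaymentEndpoint):
    """Deliberately vulnerable variant that accepts every request."""

    def handle(self, nonce: str) -> int:
        return 200


@dataclass(frozen=True)
class Probe:
    name: str
    send: Callable[[ToyPaymentEndpoint], list[int]]
    expected: tuple[int, ...]


def replay(target: ToyPaymentEndpoint) -> list[int]:
    # Send the identical request twice; a safe target accepts once and refuses the replay.
    return [target.handle("nonce-1"), target.handle("nonce-1")]


def passes(probe: Probe, target: ToyPaymentEndpoint) -> bool:
    return tuple(probe.send(target)) == probe.expected


REPLAY_PROBE = Probe("payment-replay", replay, expected=(200, 409))

print(passes(REPLAY_PROBE, ToyPaymentEndpoint()))  # True
print(passes(REPLAY_PROBE, NoReplayCheck()))       # False
```

The expected tuple writes the pass condition down before the probe runs. The same probe therefore gives the same answer on every release, and nobody has to eyeball a scary transcript.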

That is also why the docs-only nature of 4.4.2 is defensible. If the tests did not change, the release is not claiming new technical coverage. It is improving the way the project presents its intent, provenance, and safety posture to humans and scanners. In a world where security tools are installed through agent skill stores and evaluated by automated policy gates, that presentation layer is not decorative. It is part of distribution.
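A docs-only claim is easy to check mechanically: compare per-category test counts between releases and expect an empty diff. The sketch below uses the category counts the README lists. They sum to 228 of the 470 tests, so the remaining modules are omitted. The dictionary keys are invented labels, and coverage_diff is a hypothetical helper, not part of the project.

```python
def coverage_diff(before: dict[str, int], after: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each category whose test count changed to its (before, after) counts."""
    changed = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name, 0), after.get(name, 0)
        if old != new:
            changed[name] = (old, new)
    return changed


# Category counts as listed in the README; the key names are invented labels.
readme_counts = {
    "mcp": 18, "a2a": 13, "l402_x402": 85, "cloud": 25,
    "enterprise": 20, "gtg_1002": 17, "jailbreak_over_refusal": 50,
}

print(sum(readme_counts.values()))                                 # 228
print(coverage_diff(readme_counts, dict(readme_counts)))           # {} for a docs-only release
print(coverage_diff(readme_counts, {**readme_counts, "mcp": 19}))  # {'mcp': (18, 19)}
```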

There is a practical lesson here for every team building or adopting agent security tools. When a scanner flags a defensive harness because its docs contain adversarial terms, do not blindly suppress it. Also do not blindly reject the harness. Inspect the project’s safety boundaries. Are credentials fixtures or real examples? Is telemetry opt-in? Are network transports hardened by default? Are offensive strings framed as controlled probes? Are test counts stable? Is there citation or methodology metadata? Are badges current? The answer should become a documented exception or a documented rejection, not tribal memory.
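Turning that checklist into a record is one way to keep the answer out of tribal memory. In this sketch the field names mirror the questions above. The split into required and advisory checks, and the decide helper itself, are assumptions; each team would set its own policy.

```python
from dataclasses import asdict, dataclass

# Hypothetical policy: failing any of these forces a rejection.
REQUIRED = {"credentials_are_fixtures", "telemetry_opt_in", "transports_hardened_by_default"}


@dataclass(frozen=True)
class HarnessReview:
    credentials_are_fixtures: bool
    telemetry_opt_in: bool
    transports_hardened_by_default: bool
    offensive_strings_framed_as_probes: bool
    test_counts_stable: bool
    has_methodology_metadata: bool
    badges_current: bool


def decide(review: HarnessReview, finding: str) -> dict:
    """Produce a documented exception or rejection for one scanner finding."""
    failed = sorted(name for name, ok in asdict(review).items() if not ok)
    blocking = [name for name in failed if name in REQUIRED]
    return {
        "finding": finding,
        "decision": "documented-rejection" if blocking else "documented-exception",
        "blocking": blocking,
        "advisory": [name for name in failed if name not in REQUIRED],
    }


review = HarnessReview(True, True, True, True, True, False, True)
print(decide(review, "offensive vocabulary in docs/ADVANCED.md"))
```

The record keeps the advisory failures even when the decision is an exception. The next reviewer can see what was accepted, not just that something was.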

The same applies to internal security repositories. If your red-team prompts, MCP abuse tests, or agent-governance checks live in a repo that will be scanned by LLM-based tools, write for two audiences: the human engineer and the automated reviewer. Use defensive framing. Mark fixtures clearly. Avoid unnecessary operational detail. Link methodology. Separate runnable exploit code from explanatory text. Make scope and intent machine-readable where possible. This is not sanitizing research. It is reducing false positives without hiding risk.
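One way to make intent machine-readable is a convention that every file in a probe or fixture directory opens with an intent header, enforced in CI. The header format, the allowed values, and the missing_intent helper below are hypothetical. This is a convention for your own gates and reviewers, not a format that any scanner named here is documented to honor.

```python
import re

# Hypothetical convention: the first line of each probe file declares its intent.
INTENT_HEADER = re.compile(r"^#\s*intent:\s*(defensive-probe|fixture|documentation)\s*$")


def missing_intent(files: dict[str, str]) -> list[str]:
    """Return the paths whose first line lacks a recognized intent header."""
    bad = []
    for path, text in files.items():
        lines = text.splitlines()
        first = lines[0] if lines else ""
        if not INTENT_HEADER.match(first):
            bad.append(path)
    return sorted(bad)


repo = {
    "probes/replay_probe.py": "# intent: defensive-probe\nSEND_TWICE = True\n",
    "fixtures/fake_token.txt": "# intent: fixture\nFIXTURE-not-a-real-token\n",
    "probes/legacy.py": "import socket\n",
}
print(missing_intent(repo))  # ['probes/legacy.py']
```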

Public community reaction is thin. Hacker News searches for the project and release returned no meaningful discussion, and the GitHub repo had 17 stars and 5 forks at research time. But the release references real tooling feedback loops: ClawHub and ClawScan findings, VirusTotal Code Insight behavior, badge updates, and OpenClaw skill metadata. In this niche, the “community” is increasingly the scanners, registries, and automated gates that decide whether a tool can be installed.

My take: Agent Security Harness 4.4.2 is useful because it names a maintenance burden the industry is about to rediscover at scale. If AI systems are going to review AI-security tools, those tools need to make defensive intent legible to machines without pretending the adversarial behavior is harmless. Security documentation is now part of the threat model. Welcome to the least glamorous but most predictable future.

Sources: Agent Security Harness v4.4.2 release, red-team-blue-team-agent-fabric repository, agent-security-harness on PyPI