Microsoft’s Agentic AI Taxonomy Update Says the Security Boundary Is Now the Agent Session
Agentic AI security has spent the last year stuck in an awkward phase: everyone agrees prompt injection is real, but too many defenses still look like chat-app bandages wrapped around systems that now browse, write code, invoke tools, remember state, delegate work, and ask humans to approve things they barely understand. Microsoft’s latest taxonomy update is useful because it stops pretending the model is the whole problem. The attack surface is the agent session.
Microsoft’s AI Red Team has updated its Taxonomy of Failure Modes in Agentic AI Systems after 12 months of red-team engagements against deployed agentic systems. The new v2.0 framework adds seven categories: agentic supply-chain compromise, goal hijacking, inter-agent trust escalation, computer-use-agent visual attack, session context contamination, MCP/plugin abuse, and capability/architecture disclosure. That list reads dry. It should. Dry is what happens when a toy becomes infrastructure.
The important shift is that Microsoft is no longer describing hypothetical jailbreaks against a chat box. It is describing failure modes that show up when agents consume plugin registries, MCP servers, prompt templates, tool descriptions, screenshots, web pages, persistent memory, and other agents’ messages. In other words: all the places software teams have been bolting capability onto models faster than they have been building controls around them.
The session is now a security boundary
The most interesting new category is “session context contamination.” Traditional application security thinks in fairly familiar units: requests, dependencies, files, services, users, tokens, networks. Agentic systems add something stranger: a long-running context that accumulates untrusted material over time and then uses that material to make later decisions. A malicious README, a poisoned MCP response, an adversarial web page, a misleading tool description, and a prior approval can all become part of the same reasoning soup.
That breaks a lot of comfortable security habits. A single input may not look malicious. A single action may not cross an approval threshold. But the sequence can still steer the agent into exfiltration, lateral movement, or destructive work. Microsoft says human-in-the-loop bypass was the most consistently exploited failure mode in its engagements, including zero-click chains that began with external input and ended in high-impact outcomes without meaningful human interaction beyond the initial invocation. That is the sentence security teams should print and tape next to every “we have approval prompts” slide.
Approval prompts are not automatically consent. If the agent constructs the description shown to the user, an attacker can launder a dangerous action through benign language. “Run tests” can hide a chained command. “Fetch documentation” can become credential exfiltration. Microsoft’s recommended mitigation — summarize approval prompts from the underlying tool calls, decompose compound actions, tier approvals by reversibility and blast radius, and monitor approval cadence — is not UX polish. It is the control surface.
Natural language is now supply-chain material
The taxonomy’s supply-chain section is where developers should feel the floor move a little. Microsoft explicitly includes plugins, MCP servers, prompt templates, and tool descriptions in the agentic supply chain. That is the right model, and it is uncomfortable because it means natural-language metadata is no longer harmless documentation.
A compromised npm package ships code. A compromised MCP tool description may ship instructions. A malicious plugin registry entry can tell an agent how to behave, when to ignore other tools, what to leak, or how to reinterpret a task. The binary scanner sees nothing. The dependency lockfile looks clean. The agent still ingests behavior-changing text and treats it as part of its operating environment.
Microsoft cites 99 CVEs in 2025 for MCP-related software and says tool poisoning has moved from theoretical risk to live attack surface. It also points to open-source agent framework incidents and marketplace abuse as evidence that the ecosystem matured faster than its guardrails. The specific numbers matter less than the direction: agent extensions are becoming a software supply chain, but many teams still manage them like browser bookmarks.
The practical answer is an agent SBOM. Not a normal SBOM with just packages and container layers. An agent SBOM should include MCP servers, plugins, prompt templates, tool schemas, natural-language tool descriptions, memory providers, browser/computer-use capabilities, approval policies, and version pins. If that sounds excessive, remember the agent may use all of that material to decide what command to run in your repository.
Multi-agent systems bring back the confused deputy
“Inter-agent trust escalation” is another useful label for an old bug wearing a new jacket. Multi-agent orchestration creates delegation chains: a coordinator asks a subagent to inspect code, a reviewer agent evaluates output, a fixer agent applies changes, a deployment agent handles environment work. If one agent can claim a role or permission level in natural language and the orchestrator believes it, you have a confused deputy problem with better prose.
Microsoft’s recommendation is cryptographic agent identity rather than positional trust. That may sound heavy for today’s coding assistants, but it is where serious deployments are headed. Once agents can spawn agents, relay instructions, and operate across different tools or accounts, “the message says it came from the security reviewer” is not enough. Identity, scope, and permissions need to be verifiable at handoff, not inferred from a chat transcript.
Computer-use-agent visual attacks extend the same principle to screens. If an agent can inspect images or operate a GUI, hidden text, off-viewport UI elements, adversarial screenshots, and instruction-bearing images become input channels. Humans learned to distrust phishing pages. Agents now need equivalent defenses for pixels they are asked to interpret.
What teams should do this quarter
The immediate move is not to ban agents. That would be theater, and probably ignored. The move is to threat-model them like systems that can take action.
First, inventory every agent extension surface: MCP servers, plugins, prompts, skills, memory stores, computer-use features, hosted web tools, and approval policies. Version them. Pin them. Diff them. Treat tool descriptions as reviewable artifacts, because they are now behavior-shaping inputs.
Second, test approval bypass with real workflows. Open an untrusted repository in a sandbox. Add hostile instructions to README files, comments, generated docs, hidden image text, and MCP tool descriptions. Try to get the agent to rephrase a dangerous command as a safe one. Verify the approval UI shows what will actually happen, not what the agent says will happen.
Third, shorten and segment high-risk sessions. Long contexts are useful, but they accumulate influence. Teams should track provenance inside the context: which content came from trusted system policy, which came from repository files, which came from external web pages, which came from tools, and which came from other agents. If your logs only show the final shell command, you are missing the part where the decision was shaped.
Finally, add the seven new categories to red-team coverage. Prompt injection alone is too vague now. Test goal hijacking, supply-chain poisoning, MCP/plugin abuse, visual attacks, session contamination, architecture disclosure, and inter-agent privilege claims as separate failure modes. The point of a taxonomy is not better vocabulary for the postmortem. It is better test design before the incident.
Microsoft’s update is valuable because it pushes the industry from “LLM safety vibes” toward endpoint and platform security. The model still matters. But the session, the tools, the registry, the memory, the approval UI, and the delegation graph are where the damage happens. Review them like code, or eventually explain them like breach evidence.
Sources: Microsoft Security Blog, Microsoft Agentic AI failure modes v2.0 whitepaper