Microsoft’s Agent Framework Build Slate Says the Quiet Part Out Loud: Agents Need Ops, Not Just Prompts

Microsoft’s most useful Build 2026 agent announcement is not a shiny model demo. It is the boring admission that agents are now production systems, and production systems need ops.

The company’s Microsoft Agent Framework Build slate reads like a checklist written by someone who has already watched an agent demo turn into an incident ticket: multi-agent orchestration, Microsoft Foundry integration, observability, evaluations, MCP, OpenTelemetry, A2A, hosted agents, governance, and production scaling. That is a more important signal than another dropdown full of model names. The industry has enough agents that can call a tool once in a conference demo. The shortage is agents that can be traced, paused, governed, restarted, evaluated, and killed before they do expensive or unsafe things at scale.

The post itself is framed as a guide to Microsoft’s Build sessions on June 2 and 3. But the session lineup is the product strategy. “From prototype to production: build and run agents at scale” pairs Microsoft Agent Framework with Foundry Agent Service. “Observe and control agents across any framework with open source tools” puts governance in the title instead of burying it in a compliance appendix. Another session promises to visualize decisions inside complex Agent Framework apps using open standards and open-source instrumentation, then evaluate behavior with an LLM-as-judge. There is also an open-source integration path involving MCP, skills, Playwright CLI, OpenTelemetry, the Responses API, and A2A before operationalizing the stack in Microsoft Foundry.

That combination matters because agents are not chatbots with better vibes. A serious agent workflow can touch repositories, browsers, shell commands, cloud functions, MCP servers, search indexes, hosted code interpreters, identity systems, and third-party APIs. The failure mode is not merely “the answer was wrong.” It is “the wrong process called the wrong tool with the wrong authority, nobody can reconstruct why, and the cost and security blast radius is unclear.” Microsoft’s Build agenda is notable because it names those failure modes directly instead of pretending that a longer system prompt counts as governance.

The agent runtime is starting to look like a distributed system

The Agent Framework repository describes MAF as an open, multi-language framework for production-grade AI agents and multi-agent workflows in .NET and Python. Its stated fit criteria are revealing: orchestration beyond a stateless chat loop, graph-based sequential/concurrent/handoff/group collaboration patterns, durability, restartability, observability, governance, human-in-the-loop control, and provider flexibility.

Read that list as software architecture, not AI marketing. Sequential, concurrent, handoff, and group workflows are distributed-work primitives. Checkpointing and restartability mean the agent can survive process failure, pause during approval, or resume long-running work without pretending every task fits in one chat turn. Human-in-the-loop control means approval becomes part of the runtime contract, not a Slack message someone hopes the model respects. OpenTelemetry traces and time-travel debugging mean teams can inspect how a decision happened after the fact.
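To make checkpointing and approval pauses concrete, here is a minimal sketch in plain Python. It is not Agent Framework's API: the step names, the `run` function, and the JSON checkpoint file are all assumptions made for illustration. The point is only that progress is written after every step, so a paused or crashed run resumes instead of starting over.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical three-step workflow; the step names are illustrative only.
STEPS = ["plan", "await_approval", "apply"]


def run(checkpoint: Path, approved: bool = False) -> str:
    """Run steps in order, persisting progress so a restart resumes mid-workflow."""
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {"done": []}
    for step in STEPS:
        if step in state["done"]:
            continue  # completed before an earlier pause or crash
        if step == "await_approval" and not approved:
            return "paused"  # nothing is lost; the checkpoint already records "plan"
        state["done"].append(step)
        checkpoint.write_text(json.dumps(state))  # durable after every step
    return "finished"


path = Path(tempfile.mkdtemp()) / "workflow.json"
first = run(path)                  # stops at the approval gate
second = run(path, approved=True)  # a fresh call skips "plan" and continues
print(first, second)               # paused finished
```

A real runtime would also checkpoint inputs, outputs, and the approval decision itself, but the contract is the same: the workflow's position lives outside the process.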

This is the right direction. If an agent writes code, opens a PR, calls an MCP server, queries a customer record, or changes cloud state, the organization needs evidence. Which model was used? Which prompt and tool definitions were active? Which identity executed the action? What data crossed the boundary? What approval was granted, and by whom? Which evals passed before this version shipped? If those answers are not available, the system is not production-ready. It is a clever intern with root access.
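One way to picture that evidence is as a record per action, chained by hash so that later edits are detectable. The sketch below is an assumption-heavy illustration, not a schema from Agent Framework or Foundry: the `ActionRecord` fields and the `append` and `verify` helpers are hypothetical names chosen to mirror the questions above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ActionRecord:
    # Illustrative fields only; a real schema would carry far more detail.
    model: str
    prompt_version: str
    identity: str
    tool: str
    approved_by: str
    prev_hash: str  # links each record to the one before it


def record_hash(record: ActionRecord) -> str:
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append(log: list[ActionRecord], **fields: str) -> None:
    prev = record_hash(log[-1]) if log else "genesis"
    log.append(ActionRecord(prev_hash=prev, **fields))


def verify(log: list[ActionRecord]) -> bool:
    """True only if every record still points at the hash of its predecessor."""
    expected = "genesis"
    for record in log:
        if record.prev_hash != expected:
            return False
        expected = record_hash(record)
    return True


log: list[ActionRecord] = []
append(log, model="model-a", prompt_version="v3", identity="session-42",
       tool="open_pr", approved_by="alice")
append(log, model="model-a", prompt_version="v3", identity="session-42",
       tool="merge_pr", approved_by="bob")
intact = verify(log)
log[0] = replace(log[0], approved_by="mallory")  # simulate after-the-fact editing
after_edit = verify(log)
print(intact, after_edit)  # True False
```

Note the limit of this sketch: the newest record is only protected if its hash is also stored somewhere the agent cannot write, which is the job of a real audit backend.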

Foundry is Microsoft’s strategic glue here. The sample paths point toward local development, .NET/Python app code, GitHub Copilot SDK surfaces, Azure Container Apps, Durable Task-style hosting patterns, hosted Foundry agents, observability, and evaluations all fitting into one deployment story. That is useful if your company already lives in Microsoft’s platform. It also creates gravity. Once your agent definitions, traces, eval datasets, toolboxes, identity evidence, and approval logs live in Foundry-adjacent systems, switching providers is no longer just swapping a model endpoint. Portability moves up the stack: can you move the policy? The trace history? The skill package? The incident record?

Prompt safety is not a control plane

The strongest supporting source is Microsoft’s Agent Governance Toolkit. Its README makes the architectural point more bluntly than most launch posts: prompt-level safety is “a polite request to a stochastic system.” The toolkit’s promise is to intercept tool calls, message sends, and delegation in deterministic application code before model intent reaches the wire.

That distinction should be the new baseline for agent review. A prompt saying “do not delete files” is not the same thing as a policy engine that denies destructive file operations without an approval token. A model instruction saying “only use trusted tools” is not the same thing as an allowlist enforced at runtime. A chat transcript is not the same thing as tamper-evident audit evidence. Production agents need policy gates in code, per-agent and per-session identity, sandboxing, approvals, kill switches, SLO monitoring, chaos testing, and logs that security and SRE teams can actually query.
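The difference between a prompt and a policy gate is easy to show in a few lines. This is a minimal sketch, not the Agent Governance Toolkit's API: `ToolCall`, `check`, `execute`, the tool names, and the token format are hypothetical. What matters is that the decision happens in deterministic code before any handler runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed policy for illustration: anything not listed here is denied.
ALLOWED_TOOLS = frozenset({"read_file", "delete_file"})
NEEDS_APPROVAL = frozenset({"delete_file"})


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str
    approval_token: Optional[str] = None


def check(call: ToolCall, valid_tokens: frozenset) -> tuple[bool, str]:
    """Deny by default; destructive tools also require a valid approval token."""
    if call.tool not in ALLOWED_TOOLS:
        return False, "tool not on allowlist"
    if call.tool in NEEDS_APPROVAL and call.approval_token not in valid_tokens:
        return False, "approval token missing or invalid"
    return True, "ok"


def execute(call: ToolCall, valid_tokens: frozenset,
            handlers: dict[str, Callable[[str], str]]) -> str:
    allowed, reason = check(call, valid_tokens)
    if not allowed:
        raise PermissionError(reason)  # the handler never runs
    return handlers[call.tool](call.argument)
```

Whatever the model was told in its system prompt, `execute` refuses a `delete_file` call that arrives without a token issued by the approval flow.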

The AGT repo advertises policy enforcement, identity, sandboxing, audit logs, kill switches, MCP security gateway specs, audit/compliance specs, and 992 conformance tests across formal specifications. GitHub commit activity on June 2 included a signed change making verify_evidence strict by default for a CVSS 8.1 attestation issue. That is a useful reminder: governance tooling is not just slideware around the agent. It becomes security-sensitive software with its own patch stream, CVEs, defaults, and operational risk.

There is a practitioner trap here. Teams often treat governance as something to add after the demo proves value. Agents invert that sequence. The demo’s value frequently comes from broad tool access, large context, and autonomy, which are exactly the properties that expand blast radius. If the governance layer arrives after users have built workflows around permissive access, every new control feels like a regression. Better to start with least privilege, explicit approval scopes, and trace capture, then relax policies where evidence supports it.

MCP and skills need dependency discipline

Microsoft’s Build material repeatedly references MCP, skills, Playwright CLI, hosted tools, and open-source integration. That is where the next set of incidents will come from. Agent tools are dependencies with runtime authority. They should be reviewed more like packages plus service accounts than like prompt snippets.

For every MCP server or skill, engineering teams should be able to answer basic questions before it reaches production: who maintains it, what secrets can it access, what network destinations can it reach, what writes can it perform, what data does it return to the model, how are calls logged, how are versions pinned, and how can it be disabled during an incident? “It is open source” is not a security model. “The model decides when to use it” is not an authorization model.
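Those questions can be turned into a review gate that runs before a tool is registered. The sketch below assumes a hypothetical `ToolManifest` that a team fills in per MCP server or skill; none of these fields come from the MCP specification, and the specific rules in `review` are illustrative rather than complete.

```python
from dataclasses import dataclass, field


@dataclass
class ToolManifest:
    # Hypothetical review fields, one per question in the paragraph above.
    name: str
    maintainer: str = ""
    pinned_version: str = ""
    secrets: list[str] = field(default_factory=list)
    network_destinations: list[str] = field(default_factory=list)
    can_write: bool = False
    calls_logged: bool = False
    disable_flag: str = ""  # how the tool is switched off during an incident


def review(manifest: ToolManifest) -> list[str]:
    """Return blocking findings; an empty list means 'reviewable', not 'safe'."""
    findings = []
    if not manifest.maintainer:
        findings.append("no named maintainer")
    if not manifest.pinned_version:
        findings.append("version is not pinned")
    if "*" in manifest.network_destinations:
        findings.append("unrestricted network egress")
    if not manifest.calls_logged:
        findings.append("tool calls are not logged")
    if not manifest.disable_flag:
        findings.append("no way to disable during an incident")
    return findings
```

A manifest with only a name fails four checks immediately, which is roughly the state of most tools copied from a README into a prototype.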

The same applies to multi-agent handoffs. Delegation is a capability boundary. If one agent can ask another agent to act, the receiving agent’s permissions, tools, identity, and audit trail need to remain visible. Otherwise teams create a distributed permission bypass wearing a product-management-friendly name. A2A and multi-agent workflows are promising, but they need explicit contracts: what can be delegated, what requires approval, what context is shared, and what evidence is retained.
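An explicit delegation contract might look like the following sketch. It does not use the A2A protocol or any Agent Framework type: `DelegationContract` and `delegate` are hypothetical, and checking the receiving agent's own permissions is one assumed way to keep a handoff from widening anyone's authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelegationContract:
    delegable: frozenset            # actions that may be handed off at all
    needs_approval: frozenset       # delegable actions that require a human
    shared_context_keys: frozenset  # the only context the receiver may see


def delegate(contract: DelegationContract, caller: str, receiver_permissions: frozenset,
             action: str, context: dict, approved: bool = False) -> dict:
    """Decide a handoff and always return evidence of what was asked and why."""
    evidence = {"caller": caller, "action": action, "approved": approved}
    if action not in contract.delegable:
        return {"status": "denied", "reason": "action is not delegable", "evidence": evidence}
    if action not in receiver_permissions:
        return {"status": "denied", "reason": "receiver lacks permission", "evidence": evidence}
    if action in contract.needs_approval and not approved:
        return {"status": "denied", "reason": "approval required", "evidence": evidence}
    # Share only the context keys the contract names, nothing else.
    shared = {k: v for k, v in context.items() if k in contract.shared_context_keys}
    return {"status": "accepted", "context": shared, "evidence": evidence}
```

Denials return evidence too, because a refused handoff is often the most interesting line in the audit trail.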

The important notes in Microsoft’s own Agent Framework README point in this direction. The README warns that third-party servers, agents, code, and non-Azure Direct models are third-party systems used at the builder’s risk. Teams are responsible for data that flows outside Azure compliance and geographic boundaries, and for permissions, approvals, and responsible-AI mitigations. That is not legal boilerplate to skip. It is the architecture review agenda.

What engineers should do this week

If your team is experimenting with Agent Framework, Foundry agents, Copilot SDK workflows, LangGraph deployments, MCP servers, or any comparable stack, the immediate task is not to pick the cleverest orchestration pattern. It is to write an agent readiness checklist before the first internal user treats the prototype as infrastructure.

Start with identity. Every agent session should have a scoped identity, not a generic service credential with whatever access was convenient during prototyping. Then define tool-call allowlists and deny-by-default policies for destructive or external actions. Add human approvals as runtime events with durable evidence, not out-of-band chat acknowledgments. Capture OpenTelemetry traces, tool-call logs, prompts, model versions, and policy decisions in a form that can be searched during an incident. Set token and cost budgets before long-running agents discover how expensive enthusiasm can be. Version prompts, skills, toolboxes, and eval datasets like release artifacts. Create a kill switch that someone outside the agent-building team knows how to use.
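Two items on that list, the budget and the kill switch, fit in a short sketch. The `RunGuard` class and `run_agent` loop below are hypothetical and stand in for whatever the real runtime provides; the token costs are made-up numbers. The design choice worth copying is that the guard is consulted before every step, not after the bill arrives.

```python
import threading


class RunGuard:
    """Token budget plus a kill switch that another thread or operator can set."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0
        self.kill_switch = threading.Event()

    def charge(self, tokens: int) -> str:
        """Return 'ok' and record usage, or a stop reason without recording it."""
        if self.kill_switch.is_set():
            return "kill switch"
        if self.used + tokens > self.max_tokens:
            return "budget exceeded"
        self.used += tokens
        return "ok"


def run_agent(guard: RunGuard, step_costs: list[int]) -> str:
    # Each entry stands in for one model or tool call and its estimated cost.
    for cost in step_costs:
        verdict = guard.charge(cost)
        if verdict != "ok":
            return f"stopped: {verdict}"
    return "completed"


guard = RunGuard(max_tokens=100)
print(run_agent(guard, [40, 40, 40]), guard.used)  # stopped: budget exceeded 80
```

Calling `guard.kill_switch.set()` from an operator console or an incident runbook stops the next step regardless of remaining budget, which is why someone outside the building team needs to know where that switch lives.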

Then build evals from real traces. Static prompt suites are useful, but agent failures often emerge from tool timing, stale context, bad handoffs, unexpected permissions, and multi-step drift. Trace-based evaluation is a stronger fit because it tests the system the way it actually behaves. If a workflow failed because an agent chose a risky tool, shared too much context, or looped through an expensive model path, the eval should preserve that scenario and prevent regression.
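A trace-based eval can start as a function that replays a recorded trace against a few invariants. The sketch below assumes a simplified trace format, a list of dicts with a `tool` name and an optional `approved` flag, which is not the OpenTelemetry span schema; the tool names and the loop threshold are also illustrative.

```python
from collections import Counter

# Assumed invariants for illustration.
RISKY_TOOLS = frozenset({"delete_file", "send_payment"})
MAX_CALLS_PER_TOOL = 3


def evaluate_trace(trace: list[dict]) -> list[str]:
    """Return invariant violations found in one recorded agent run."""
    violations = []
    for i, span in enumerate(trace):
        if span["tool"] in RISKY_TOOLS and not span.get("approved", False):
            violations.append(f"step {i}: {span['tool']} ran without approval")
    for tool, count in Counter(span["tool"] for span in trace).items():
        if count > MAX_CALLS_PER_TOOL:
            violations.append(f"{tool} called {count} times (possible loop)")
    return violations


# A trace preserved from a (hypothetical) incident becomes a regression fixture.
INCIDENT_TRACE = [
    {"tool": "search"}, {"tool": "search"}, {"tool": "search"}, {"tool": "search"},
    {"tool": "delete_file", "approved": False},
]
# The same scenario re-run after the fix should come back clean.
FIXED_TRACE = [{"tool": "search"}, {"tool": "delete_file", "approved": True}]

print(evaluate_trace(INCIDENT_TRACE))
print(evaluate_trace(FIXED_TRACE))  # []
```

In practice the fixture would be exported from the tracing backend and the re-run would come from the candidate build, so the eval fails the release if the old behavior returns.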

The broader read is simple: Microsoft is trying to make agents boring enough to operate. That is a compliment. The agent market does not need more magic demos nearly as much as it needs runtimes with state, identity, traces, approval gates, policy enforcement, replay, and incident controls. Agent Framework’s Build slate is worth paying attention to because it treats agents as software systems. That is where this category has to go if it wants to survive contact with production.

Sources: Microsoft Agent Framework Blog, Microsoft Agent Framework on GitHub, Microsoft Agent Framework docs, Microsoft Agent Governance Toolkit, Microsoft Build session BRK250