Hosted Agents Are Becoming Azure’s Agent Runtime, and the Boring Parts Are the Product
Agent platforms usually fail in the gap between the demo and the incident review. Microsoft’s latest Hosted Agents update is interesting because it spends most of its calories in that gap: isolation, identity, filesystem state, traces, versioning, evals, guardrails, voice transport, rollback, and deployment plumbing. In other words, the stuff that never looks impressive on a keynote slide and absolutely decides whether an agent can be trusted with real work.
The Build 2026 update moves Hosted Agents in Microsoft Foundry Agent Service closer to a production runtime rather than a dressed-up container host. Each hosted-agent session runs in a hypervisor-isolated sandbox with its own persistent filesystem, gets an automatically provisioned Microsoft Entra ID agent identity, and emits OpenTelemetry traces. That is the correct center of gravity. A production agent is not just a model call with tools; it is a stateful, privileged workflow that needs a bounded blast radius.
The most developer-friendly change is source-code deployment. Instead of forcing every team to build a container, push to a registry, wire versions, and babysit endpoint state before they can test the runtime, Microsoft now lets developers zip Python or .NET projects and deploy them with azd. The supported runtimes listed in Microsoft’s post are python_3_13, python_3_14, and dotnet_10. The azd ai agent init flow creates azure.yaml and agent.yaml; azd deploy handles zip packaging, SHA verification, upload, polling to an Active state, and RBAC configuration. Microsoft’s sample defaults to gpt-4.1-mini.
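To make the "zip packaging, SHA verification" step concrete, here is a minimal Python sketch of the general idea: zip a project directory, record a SHA-256 digest, and recheck it later. This is not azd's implementation. The function names package_source and verify_archive are hypothetical, and the choice of SHA-256 is an assumption, since Microsoft's post as summarized here only says "SHA verification".

```python
import hashlib
import zipfile
from pathlib import Path


def package_source(project_dir: str, archive_path: str) -> str:
    """Zip a project directory and return the archive's SHA-256 hex digest.

    Hypothetical helper for illustration; azd does its own packaging.
    """
    root = Path(project_dir)
    archive = Path(archive_path).resolve()
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sort entries for a stable order and skip the archive itself
        # in case it is written inside the project directory.
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.resolve() != archive:
                zf.write(path, path.relative_to(root).as_posix())
    return hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()


def verify_archive(archive_path: str, expected_sha256: str) -> bool:
    """Recompute the digest and compare it with the one recorded at packaging time."""
    actual = hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()
    return actual == expected_sha256
```

The point of the digest is lineage: if the bytes that were uploaded cannot be tied to the bytes that were built, rollback and incident review both get harder.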
That may sound like boring developer experience work. Good. Boring is the product here. Containers remain the right artifact for many mature systems, especially when teams need OS-level control, custom dependencies, or strict supply-chain rules. But making containers the price of admission for every early hosted-agent experiment turns evaluation into platform homework. Source upload lowers the activation energy while preserving a path to heavier packaging later. That split is the kind of pragmatic platform design teams actually adopt.
The runtime is where agent promises meet blast radius
Hosted Agents also now has built-in Content Safety guardrails in public preview across all hosted-agent regions. Prompts are evaluated before agent code runs, and responses are checked before users see them. This should not be mistaken for complete safety, but it is an important placement decision. The runtime is a better control point than a prompt footnote because it can see the flow of execution and enforce policy outside the model’s next-token machinery.
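The placement described above, check the prompt before agent code runs and check the response before the user sees it, can be sketched in a few lines of Python. This illustrates the control flow only and is not the Content Safety API. guarded_call, is_allowed, and BLOCKED_MESSAGE are hypothetical names, and the fail-closed behavior on checker errors is an assumption about what a team might want, not a documented property of the service.

```python
from typing import Callable

BLOCKED_MESSAGE = "Request blocked by policy."  # hypothetical placeholder text


def _passes(is_allowed: Callable[[str], bool], text: str) -> bool:
    """Fail closed: if the checker itself errors, treat the text as blocked."""
    try:
        return bool(is_allowed(text))
    except Exception:
        return False


def guarded_call(
    prompt: str,
    agent: Callable[[str], str],
    is_allowed: Callable[[str], bool],
) -> str:
    """Run the policy check outside the agent, on both input and output."""
    if not _passes(is_allowed, prompt):
        return BLOCKED_MESSAGE  # the agent code never runs
    response = agent(prompt)
    if not _passes(is_allowed, response):
        return BLOCKED_MESSAGE  # the user never sees the raw response
    return response
```

Because the check wraps the agent rather than living in its instructions, a prompt injection that rewrites the agent's behavior still has to get past the same two gates.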
The same logic applies to identity. If an agent uses a generic service credential, every incident starts with the same miserable questions: who was acting, for whom, with what authorization, and under which session state? Entra-backed agent identity gives teams a place to attach policy, audit, and revocation. That does not magically solve least privilege, but it makes least privilege possible. Without a distinct identity boundary, the “agent” is just a script with borrowed authority and a much larger imagination.
OpenTelemetry tracing is similarly unglamorous and essential. Agent failures rarely look like a single bad answer. They look like a chain: a user request, a retrieved document, a tool description, a model plan, a tool call, a filesystem write, a second model call, and finally an output that seems plausible until somebody asks how it happened. Traces give engineering, security, and support teams a shared record. If your agent can spend money, change data, email people, create tickets, or touch production systems, traceability is not a nice-to-have. It is table stakes.
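The "chain" framing is easier to see as data. The sketch below is a standard-library stand-in that records parent and child spans the way a trace does. It is deliberately not the OpenTelemetry API, and Recorder and Span here are hypothetical classes for illustration. A real agent would use the OpenTelemetry SDK and an exporter instead.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class Recorder:
    """Toy in-memory recorder: one trace, nested spans linked by parent_id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(name, self.trace_id, uuid.uuid4().hex[:16],
                 parent_id, attributes, start=time.time())
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.time()
            self._stack.pop()


if __name__ == "__main__":
    rec = Recorder()
    with rec.span("user_request"):
        with rec.span("tool_call", tool="create_ticket"):
            pass
        with rec.span("filesystem_write", path="notes.txt"):
            pass
    for s in rec.spans:
        print(s.name, "parent:", s.parent_id)
```

What matters in an incident review is the parent_id links: they are what let someone walk from a surprising output back to the tool call and the request that caused it.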
The preview also expands the transport surface. Hosted text agents can use Voice Live integration in public preview, while native speech-to-speech agents can use WebSocket or WebRTC through invocations_ws. Microsoft says the WebSocket protocol is currently limited to North Central US. That regional limitation matters architecturally, but the bigger point is behavioral: voice agents collapse the time between model behavior and user impact. A bad text answer can be re-read. A bad voice agent can interrupt, mislead, escalate, or create social pressure before the user has fully parsed what happened. Teams building real-time agents should pair this work with stricter evals, transcript logging, consent handling, latency budgets, and a much more conservative tool policy.
Agent Optimizer is useful, but treat it like code
The sharpest new piece is Agent Optimizer, currently in private preview. Microsoft describes a closed loop: evaluate the baseline, generate candidates, evaluate candidates, rank and recommend with token costs visible, then deploy the winning version. Targets include instructions, skills, model choice, and tool descriptions. That is exactly where agent quality work belongs: not in vibes, not in a Slack thread, and not in an unversioned prompt edit nobody can reconstruct.
But teams should treat optimizer output like a code change, not like magic dust. If an optimizer changes a tool description so the agent calls it more aggressively, that can change which permissions get exercised, along with cost, latency, and user trust. If it swaps the model for a cheaper candidate, it may improve margins while degrading edge-case reasoning. If it rewrites instructions to pass an eval suite, it may overfit to the suite’s blind spots. The safe workflow is boring and strict: run the optimizer against representative traces, inspect diffs, compare token cost and latency, require human approval for production promotion, and keep rollback one click away.
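One way to keep that workflow strict is to encode the promotion rule instead of remembering it. The Python sketch below is a hypothetical gate, not part of Agent Optimizer. RunStats, promotion_decision, and the 10 percent budgets are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RunStats:
    eval_score: float       # higher is better
    tokens_per_task: float  # average token cost per eval task
    p95_latency_s: float    # 95th percentile latency in seconds


def promotion_decision(
    baseline: RunStats,
    candidate: RunStats,
    approved_by: Optional[str] = None,
    max_cost_increase: float = 0.10,     # assumed budget: +10% tokens
    max_latency_increase: float = 0.10,  # assumed budget: +10% latency
) -> Tuple[bool, List[str]]:
    """Return (promote, reasons). Any listed reason blocks promotion."""
    reasons = []
    if candidate.eval_score < baseline.eval_score:
        reasons.append("eval score regressed")
    if candidate.tokens_per_task > baseline.tokens_per_task * (1 + max_cost_increase):
        reasons.append("token cost over budget")
    if candidate.p95_latency_s > baseline.p95_latency_s * (1 + max_latency_increase):
        reasons.append("latency over budget")
    if not approved_by:
        reasons.append("missing human approval")
    return (not reasons, reasons)
```

A candidate that scores higher but has no named approver still returns False here, which is the intended behavior: the optimizer recommends, a person promotes.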
Microsoft is also adding eval tooling around this loop. azd ai agent eval init can generate an eval suite from existing instructions; Microsoft’s sample creates 15 tasks and six weighted dimensions including policy_compliance, resolution_accuracy, and safety_boundaries. That is a useful starting point, not a finish line. Fifteen tasks will not prove a production agent is safe. But auto-generating a first eval harness from the agent’s intended behavior can push teams past the blank-page problem, which is often why evals never happen at all.
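For readers who have not worked with weighted eval dimensions, the arithmetic is a weighted average. The sketch below uses the three dimension names quoted from Microsoft's sample; the weights and scores are invented for illustration, the remaining three dimensions are not named in this article and are left out, and weighted_score is a hypothetical helper rather than anything azd generates.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores; every weighted dimension must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


if __name__ == "__main__":
    # Assumed weights and scores, for illustration only.
    weights = {"policy_compliance": 2, "resolution_accuracy": 2, "safety_boundaries": 1}
    scores = {"policy_compliance": 1.0, "resolution_accuracy": 0.5, "safety_boundaries": 1.0}
    print(weighted_score(scores, weights))
```

A single blended number is convenient for ranking candidates, but it can hide a failing dimension, so a team may also want a hard floor on something like safety_boundaries before the average is even considered.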
Hosted Agents supports Responses, Invocations over HTTP, Invocations over WebSocket, Activity protocol bridging for Teams and Microsoft 365, and A2A delegation. That breadth is strategically obvious: Microsoft wants agents to move across apps, channels, and other agents while staying inside a governed runtime. The risk is equally obvious. The more surfaces an agent can operate through, the more important it becomes to have consistent policy, identity, observability, and versioning across those surfaces. Otherwise the “same” agent behaves like a different system depending on the doorway.
Hosted Agents is in public preview across 20 Azure regions, and Microsoft says it expects to reach general availability by the end of June 2026. Roadmap items include Agent Optimizer public preview, private ACR in bring-your-own VNet, Managed VNet support, broader Voice Live and WebSocket regions, and durable long-running agents. The durable-agent item is worth watching. Long-running work is where failures get expensive: retries, partial state, approvals, external side effects, and abandoned sessions all need first-class handling.
The practitioner move is clear: evaluate Hosted Agents less like a model feature and more like runtime infrastructure. Ask whether the isolation boundary fits your data, whether agent identities map to your permission model, whether traces reach your observability stack, whether evals can replay real failures, whether guardrails fail closed, and whether version rollback is operationally boring. Do not ship an agent because the demo deployed. Ship it because the runtime gives you enough evidence and control to survive the first incident.
Microsoft’s useful contribution here is not another agent container. It is the argument that the runtime should own more of the chores teams keep rebuilding badly: sandboxing, identity, state, tracing, eval loops, guardrails, deployment lineage, and rollback. That is not glamorous. It just looks suspiciously like production engineering.
Sources: Microsoft Foundry Dev Blog, Microsoft Learn, Microsoft Foundry Agent Service Build update, Microsoft Build 2026 live blog