Agent Framework’s CodeAct and Hosted Agents Are Microsoft’s Answer to the Two Things That Kill Agent Demos: Tool Latency and Production State

Anatoliy Kolodkin

05 Jun 2026 • 6 min read

Most agent demos die the same way: not because the model cannot reason, but because the system spends too many turns asking itself which tool to call next, then collapses when someone asks where the state lives in production. Microsoft’s latest Agent Framework announcement is useful because it goes after both failure modes instead of pretending another orchestration diagram will fix them.

The sharpest new piece is CodeAct with Hyperlight. Microsoft says the feature lets a model write a short Python program that composes registered tools through call_tool(...), then runs that glue code inside a fresh Hyperlight micro-VM. In the company’s benchmark, using the same model, same five tools, same prompt, and same structured output schema, the traditional tool-calling path took 27.81 seconds and 6,890 tokens. CodeAct took 13.23 seconds and 2,489 tokens: a 52.4% latency reduction and a 63.9% token reduction.
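The percentages follow directly from the four raw figures Microsoft reported. A quick check, using only those numbers (the helper name is ours, not Microsoft's):

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


# Figures quoted from Microsoft's benchmark: seconds and tokens per run.
latency_cut = percent_reduction(27.81, 13.23)
token_cut = percent_reduction(6890, 2489)

print(f"latency: {latency_cut:.1f}%  tokens: {token_cut:.1f}%")
```

The same helper works for any before-and-after comparison a team runs on its own workloads, which is the number that actually matters.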

That is the kind of number agent teams should care about more than leaderboard theater. Agentic systems do not just burn tokens on the final answer. They burn them on planning turns, tool schemas, intermediate summaries, retries, and the model repeatedly re-reading its own breadcrumbs. The longer the loop, the more expensive the answer gets and the more opportunities the system has to drift. CodeAct is interesting because it compresses a multi-turn “think, call, observe, think, call” loop into a small executable plan.

The model was already writing the program in its head

A conventional tool-calling agent often behaves like a developer forced to do a database migration through a walkie-talkie. It asks for one record, waits, asks for another, waits, computes a tiny subtotal in natural language, asks for a third thing, waits again, then tries to reconstruct the state from chat history. That pattern is tolerable in a demo with two tools. It becomes painful in production workflows that need chained lookups, filtering, aggregation, transformation, and structured output.

CodeAct’s premise is refreshingly old-fashioned: if the task is procedural, let the system express the procedure as code. A short Python script is often a better representation of “fetch these records, group by account, calculate totals, ignore inactive users, return JSON” than five rounds of conversational tool calls. Code is inspectable. It can be logged. It can be diffed. It makes loops, conditionals, and intermediate variables explicit instead of smearing them across model messages. This is not magic; it is software remembering that software is good at deterministic bookkeeping.
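To make that concrete, here is a minimal sketch of what such glue code might look like. Everything in it is an assumption for illustration: the call_tool stub, its keyword-argument signature, the tool names list_users and get_orders, and the in-memory data. It is not the Agent Framework API, only the shape of the idea.

```python
import json
from collections import defaultdict

# Hypothetical in-memory data standing in for real backends.
_USERS = [
    {"id": 1, "account": "acme", "active": True},
    {"id": 2, "account": "acme", "active": False},
    {"id": 3, "account": "globex", "active": True},
]
_ORDERS = {1: [120.0, 30.0], 2: [999.0], 3: [45.5]}


def call_tool(name, **kwargs):
    """Stub dispatcher; the real call_tool signature is assumed, not documented here."""
    if name == "list_users":
        return _USERS
    if name == "get_orders":
        return _ORDERS.get(kwargs["user_id"], [])
    raise KeyError(f"unknown tool: {name}")


def summarize_accounts() -> str:
    """Fetch records, skip inactive users, total per account, return JSON."""
    totals = defaultdict(float)
    for user in call_tool("list_users"):
        if not user["active"]:
            continue  # ignore inactive users
        for amount in call_tool("get_orders", user_id=user["id"]):
            totals[user["account"]] += amount
    return json.dumps(dict(sorted(totals.items())))


result = summarize_accounts()
print(result)
```

The loop, the filter, and the running totals live in ordinary variables rather than in chat history, which is why a script like this can be logged and diffed.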

The Hyperlight boundary is what makes the idea credible. Microsoft says the generated code runs in a fresh guest per execute_code call, with no host filesystem access unless explicitly mounted and no network access unless domains are allow-listed. That matters because “let the model write and run code” is otherwise one of those phrases that should cause every security reviewer in the room to sit up straighter. The model-authored glue gets a sandbox. The reviewed tools remain in the host application runtime. That split is the architecture.

It is also the part teams should not misunderstand. Microsoft is explicit that registered tools still execute in the host application with whatever access that process has. Hyperlight contains the generated Python glue; it does not magically make a dangerous tool safe. If a tool can email customers, spend money, mutate production data, delete files, or change access policy, putting that tool behind call_tool(...) inside a generated script does not reduce its blast radius. It may make the blast radius harder to see.

The approval caveat is not fine print

The most important operational caveat in Microsoft’s post is approval granularity. Tools passed through the CodeAct provider are gated at the execute_code block level, not at each individual nested call_tool(...). In plain English: a human or policy layer may approve the generated-code execution as a unit, while several tool calls happen inside that unit.
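The difference in granularity is easy to see in a toy model. The sketch below is not Agent Framework code: ApprovalLog, run_as_block, run_per_call, and the lookup_price tool are all hypothetical names, and the approver simply says yes to everything so the approval counts can be compared.

```python
class ApprovalLog:
    """Records every approval request; this stub approves all of them."""

    def __init__(self):
        self.events = []

    def approve(self, subject: str) -> bool:
        self.events.append(subject)
        return True


def lookup_price(sku: str) -> float:
    """Hypothetical read-only tool."""
    return {"A1": 9.5, "B2": 20.0}[sku]


def run_as_block(log, calls):
    """Block-level gating: one approval covers every nested call."""
    if not log.approve("execute_code"):
        raise PermissionError("block rejected")
    return [tool(*args) for tool, args in calls]


def run_per_call(log, calls):
    """Per-call gating: each invocation raises its own approval event."""
    results = []
    for tool, args in calls:
        if not log.approve(tool.__name__):
            raise PermissionError(f"{tool.__name__} rejected")
        results.append(tool(*args))
    return results


calls = [(lookup_price, ("A1",)), (lookup_price, ("B2",))]
block_log, per_call_log = ApprovalLog(), ApprovalLog()
block_results = run_as_block(block_log, calls)
per_call_results = run_per_call(per_call_log, calls)

print(len(block_log.events), len(per_call_log.events))
```

Both paths return the same results. Only the audit trail differs: one approval event for the whole block versus one per tool invocation.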

That is perfectly reasonable for pure, read-only, low-cost tools: search an internal catalog, fetch non-sensitive metadata, normalize records, run calculations, generate a report. It is not acceptable for side-effecting tools where each call deserves its own approval event. Microsoft recommends keeping tools like email, spending, or production writes exposed directly with per-call approval. Builders should treat that recommendation as a design rule, not a blog-post footnote.

The practical implementation pattern is two lanes. Lane one is CodeAct-safe: read-only tools, deterministic transforms, narrow APIs, bounded data, strict timeouts, and outputs that can be validated. Lane two is approval-sensitive: anything external, destructive, financial, privileged, user-visible, or compliance-relevant. Those tools stay outside the generated-code composition layer and get explicit policy checks per invocation. If a tool would make you nervous in a loop, it does not belong behind a single block approval.
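One way to keep the two lanes honest is to encode the classification rather than leave it in a wiki page. This is a sketch under stated assumptions: ToolProfile, its flags, assign_lane, and the example tool names are invented here to mirror the criteria above, and the rule deliberately defaults to the stricter lane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical risk profile for a registered tool."""
    name: str
    read_only: bool
    external: bool = False
    destructive: bool = False
    financial: bool = False
    privileged: bool = False
    user_visible: bool = False
    compliance_relevant: bool = False


def assign_lane(tool: ToolProfile) -> str:
    """Return 'codeact' only for read-only tools with no sensitive flag set."""
    sensitive = (
        tool.external
        or tool.destructive
        or tool.financial
        or tool.privileged
        or tool.user_visible
        or tool.compliance_relevant
    )
    if tool.read_only and not sensitive:
        return "codeact"
    return "per_call_approval"


tools = [
    ToolProfile("search_catalog", read_only=True),
    ToolProfile("send_email", read_only=False, external=True, user_visible=True),
    ToolProfile("issue_refund", read_only=False, financial=True),
]
lanes = {tool.name: assign_lane(tool) for tool in tools}
print(lanes)
```

A read-only tool that still carries a sensitive flag, such as one that reads compliance-relevant records, lands in the per-call lane under this rule. Whether that is too strict is a policy decision, not a technical one.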

Hosted Agents are the other half of the story

The same announcement also ties Microsoft Agent Framework to Foundry Hosted Agents, and that matters because faster tool composition is useless if deployment remains a science project. Microsoft says Hosted Agents package developer-owned .NET or Python agent code as containers on Foundry-managed infrastructure, with managed identity, automatic scaling, versioning, persistent session state, observability, and per-session VM-isolated sandboxes. Idle compute deprovisions after 15 minutes, while unused sessions can persist for up to 30 days and resume with filesystem state intact.
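Those two windows imply three distinct states for an idle session. The sketch below models them using only the 15-minute and 30-day figures quoted above. The state names and the exact boundary handling are assumptions, not documented platform behavior.

```python
from datetime import timedelta

# Figures quoted from Microsoft's announcement.
IDLE_COMPUTE_LIMIT = timedelta(minutes=15)
SESSION_RETENTION = timedelta(days=30)


def session_state(idle_for: timedelta) -> str:
    """Classify an idle session; state names and boundaries are assumed."""
    if idle_for < IDLE_COMPUTE_LIMIT:
        return "compute_warm"
    if idle_for <= SESSION_RETENTION:
        return "compute_released_session_resumable"
    return "session_expired"


for idle in (timedelta(minutes=5), timedelta(hours=2), timedelta(days=31)):
    print(idle, session_state(idle))
```

The middle state is the one to test: compute is gone, but the session is expected to resume with its filesystem state intact.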

Those numbers are boring in exactly the right way. Agent platforms need boring numbers. How long does idle capacity live? What happens to filesystem state? Can a session resume tomorrow? Where do traces go? How does rollback work? Which identity calls Azure services? The answers decide whether a team can operate an agent after the launch demo ends.

Giving each hosted agent its own dedicated Microsoft Entra identity is especially important. An agent with its own workload identity can be granted least privilege, audited separately, disabled independently, and mapped to policy without borrowing a developer’s token. That is the difference between “the assistant did something with someone’s credentials” and “this named workload performed this action under this version and session.” The former is a shrug in an incident review. The latter is evidence.

Microsoft’s minimal local-to-hosted path is also a deliberate developer-experience move. In .NET, the sample shape is builder.Services.AddFoundryResponses(agent) and app.MapFoundryResponses(). In Python, it is ResponsesHostServer(agent).run(). The point is not that every production agent should be three lines. The point is that Microsoft wants the standard Agent Framework programming model to flow into hosted /responses services without every team inventing its own container contract, session store, identity handoff, and trace wiring.

What engineers should actually do with this

Do not start by moving your scariest workflow to CodeAct. Start by finding the workflows where your current agent spends most of its time chaining safe reads and doing light computation: report generation, catalog lookup, support triage, data enrichment, structured extraction, policy checks, and internal QA. Benchmark token use, latency, tool-call count, failure rate, and trace readability before and after. If CodeAct reduces cost but makes behavior harder to review, the win is not free.

Then classify tools before registering them with the CodeAct provider. Write down which tools are pure reads, which return sensitive data, which mutate state, which call external systems, which cost real money, and which require human approval. Put hard limits around execution time, output size, network access, mounted files, and allowed domains. Log the generated script, the tool calls it made, the inputs and outputs, the approval event, and the model/version that produced it. If your security team cannot reconstruct what happened, you have optimized the wrong thing.
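What that log might contain can be sketched as a single audit record per execution. The field names, the ExecutionAudit and ToolCallRecord classes, and the sample values are all hypothetical. The only real point is that the script, its tool calls, the approval, and the model version travel together.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCallRecord:
    """One nested tool call made by the generated script."""
    tool: str
    inputs: dict
    output_summary: str


@dataclass
class ExecutionAudit:
    """Hypothetical audit record for one generated-code execution."""
    model_version: str
    approved_by: str
    script: str
    tool_calls: list = field(default_factory=list)

    def to_json(self) -> str:
        record = asdict(self)  # also converts nested ToolCallRecord entries
        # A digest lets reviewers confirm the logged script was not altered.
        record["script_sha256"] = hashlib.sha256(
            self.script.encode("utf-8")
        ).hexdigest()
        return json.dumps(record, sort_keys=True)


audit = ExecutionAudit(
    model_version="example-model-v1",   # placeholder value
    approved_by="policy:read-only",     # placeholder value
    script='users = call_tool("list_users")',
)
audit.tool_calls.append(ToolCallRecord("list_users", {}, "3 records"))

audit_json = audit.to_json()
print(audit_json)
```

In practice the record would be written to whatever trace or log pipeline the team already uses. The structure matters more than the destination.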

For Hosted Agents, evaluate the runtime as infrastructure, not as a prettier deployment target. Verify session isolation, identity behavior, trace export to Application Insights or your OpenTelemetry pipeline, rollback mechanics, scale-to-zero latency, cost under idle and active traffic, and how state behaves across interruptions. Test failure paths: tool timeout, sandbox crash, model refusal, invalid structured output, expired session, revoked identity, and partial completion after approval. The first incident should not be the first time anyone learns how the runtime fails.

The broader take is simple: Microsoft is moving agents from prompt craft toward runtime engineering. CodeAct cuts token and latency waste when agents compose safe tools. Hyperlight gives generated code a boundary. Hosted Agents give long-running systems state, identity, versioning, and observability. None of this removes the need for evals, threat models, approval policy, or human review. It does make it harder to pretend that “agent” is a UI feature instead of a production system.

That is progress. Not flashy progress. Useful progress. The kind that looks good in a standup because someone can finally answer: what ran, where did it run, what did it call, what did it cost, and how do we roll it back?

Sources: Microsoft Agent Framework Blog, Microsoft CodeAct with Hyperlight, Microsoft Foundry Hosted Agents deployment guide, Microsoft Agent Framework on GitHub
