Microsoft Foundry’s May Dump Is Really an Agent-Ops Checklist

Anatoliy Kolodkin

01 Jun 2026 • 5 min read

Microsoft’s May Foundry roundup looks like a product manager emptied the changelog drawer: new catalog models, trace-based evals, Managed VNET GA, GPT-5 reinforcement fine-tuning in gated GA, Foundry Local updates, Content Understanding improvements, and preview skills/toolboxes in azure-ai-projects. Read it as a model announcement and it is forgettable. Read it as an agent-ops checklist and the shape is much more interesting.

The useful story is not that Foundry now has more names in the model dropdown. Enterprise AI teams are not blocked by a shortage of model names. They are blocked by the boring failures that happen after the demo: nobody knows which project burned the budget, the agent needs outbound access nobody approved, evals are built from toy prompts instead of production traces, local execution is an afterthought, and tool packaging turns into a supply-chain junk drawer. Microsoft is quietly filling in that control plane.

The model catalog is not the control plane

Yes, the May update adds more model surface area. Microsoft lists xAI Grok 4.3, DeepSeek V4 variants, Fireworks-hosted DeepSeek V4 Pro, and Kimi 2.6 among the catalog changes. It also notes that Grok 4.3 carries additional responsible-AI considerations, including higher safety and jailbreak risk than some Azure Direct models. That warning is the tell: model choice is becoming a policy decision, not a vibes decision.

GPT-5 Reinforcement Fine-Tuning moving to gated GA fits the same pattern. “Gated GA” is not launch confetti; it is Microsoft saying reinforcement-tuned production agents are real enough for enterprise compliance and SLA coverage, but risky enough that casual self-service would be malpractice. That is exactly the right posture. Fine-tuning and reinforcement workflows are powerful when the task, evaluation loop, and failure modes are understood. When those are not understood, they are expensive ways to make confident mistakes.

The stronger update is trace-based evaluation. Microsoft says Foundry can now grade production traces from Foundry, GCP, AWS, or any framework, reducing the need to hand-curate eval datasets. That matters because agent evaluation is moving away from static prompt exams and toward failure review on real tool calls, real retrieval paths, and real user sessions. If your eval set does not include the weird thing your agent did in production at 2:13 a.m., you are benchmarking theater.
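
To make "evaluate from production traces" concrete, here is a minimal sketch of pulling review candidates out of exported traces. The trace shape (`trace_id`, `tool_calls`, `status`, `retries`) and the `max_retries` threshold are assumptions for illustration, not Foundry's actual trace schema or API.

```python
from typing import Any

def select_eval_candidates(traces: list[dict[str, Any]], max_retries: int = 2) -> list[str]:
    """Return IDs of traces worth adding to an eval set.

    A trace qualifies if any tool call failed or if the total retry
    count across its tool calls exceeds max_retries.
    """
    selected = []
    for trace in traces:
        calls = trace.get("tool_calls", [])
        failed = any(call.get("status") == "error" for call in calls)
        retries = sum(call.get("retries", 0) for call in calls)
        if failed or retries > max_retries:
            selected.append(trace["trace_id"])
    return selected

# Hypothetical exported traces, not real Foundry output.
traces = [
    {"trace_id": "t1", "tool_calls": [{"status": "ok", "retries": 0}]},
    {"trace_id": "t2", "tool_calls": [{"status": "error", "retries": 1}]},
    {"trace_id": "t3", "tool_calls": [{"status": "ok", "retries": 3}]},
]
print(select_eval_candidates(traces))  # ['t2', 't3']
```

The filter is deliberately crude. The point is that the eval set is seeded by what production actually did, not by what someone imagined it would do.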

Cost attribution is the feature finance will actually use

Project-level cost attribution is not glamorous. It is also one of the most important items in the release. Shared AI environments turn subscription-level billing into archaeology: one team is prototyping, another is running evals, a third is testing model routing, and production is quietly doing the thing everyone forgot was expensive. When the bill spikes, “Azure AI was costly this month” is not an answer. It is a support ticket with better formatting.

Project-level attribution gives platform teams a review unit they can govern: product, project, team, environment, experiment, owner. That is the unit where budgets, chargeback, anomaly review, and shutdown policy become possible. It also lands at the same moment GitHub is moving Copilot from premium request units to token-based GitHub AI Credits on June 1, counting input, output, and cached tokens. Different surface, same physics: agentic AI costs real money because agents produce long traces, call tools, retry, summarize, inspect files, and sometimes do all of that with frontier models.
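
A rough sketch of what that review unit looks like once cost records carry project and environment labels. The record fields (`project`, `environment`, `cost_usd`) are assumed names for illustration, not the schema of any Azure billing export.

```python
from collections import defaultdict

def rollup_cost(records: list[dict], keys: tuple[str, ...] = ("project", "environment")) -> dict:
    """Sum cost_usd per combination of the given label keys."""
    totals: dict = defaultdict(float)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)

# Hypothetical cost records with illustrative labels.
records = [
    {"project": "support-agent", "environment": "prod", "cost_usd": 12.5},
    {"project": "support-agent", "environment": "prod", "cost_usd": 7.5},
    {"project": "support-agent", "environment": "eval", "cost_usd": 3.0},
]
print(rollup_cost(records))
```

Swap the `keys` tuple for `("team",)` or `("owner",)` and the same rollup answers a different accountability question.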

The practitioner move is simple: stop treating AI spend as one shared bucket. Create project boundaries that match accountability. Set budgets per environment. Alert on unusual token growth. Track eval runs separately from production. Decide when expensive models are allowed, and when the system should route to cheaper models or local execution. The future incident nobody wants is a runaway agent hidden inside “innovation spend.”
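
"Alert on unusual token growth" can start as something this small. The sketch below assumes you already have daily token counts per project and environment; the `factor` threshold and the naming scheme are illustrative choices, not a recommendation from Microsoft.

```python
from statistics import mean

def flag_token_spikes(daily_tokens: dict[str, list[int]], factor: float = 2.0) -> list[str]:
    """Flag series whose latest day exceeds factor times the mean of earlier days."""
    flagged = []
    for name, series in daily_tokens.items():
        if len(series) < 2:
            continue  # not enough history to form a baseline
        baseline = mean(series[:-1])
        if baseline > 0 and series[-1] > factor * baseline:
            flagged.append(name)
    return sorted(flagged)

# Hypothetical usage data, keyed by project/environment.
usage = {
    "support-agent/prod": [100_000, 110_000, 105_000, 400_000],
    "support-agent/eval": [50_000, 52_000, 49_000, 51_000],
}
print(flag_token_spikes(usage))  # ['support-agent/prod']
```

Keeping eval and production as separate series is what makes the alert meaningful: a large eval run should not mask, or be mistaken for, a runaway production agent.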

Managed VNET is where demos become architecture

Managed VNET reaching GA for Foundry projects is another unsexy feature with real production weight. Agents do not just call a model anymore. They retrieve documents, hit search indexes, call APIs, write files, emit telemetry, invoke tools, and occasionally wander toward external services that nobody modeled in the threat diagram. Network isolation is no longer a checkbox after the demo; it is part of the agent’s behavior.

The gotchas matter. Microsoft’s architecture notes say Managed VNET isolation mode is a creation-time decision. You cannot disable it later, and you cannot convert a custom VNET deployment in place. FQDN outbound rules can also create a managed Azure Firewall, which may add firewall charges even though Managed VNET itself is free. Translation: decide the network posture before the project becomes a dependency magnet. Retrofitting private egress after teams have built around public outbound access is not hardening. It is archaeology with downtime.

For regulated teams, the default question should be boring and early: does this agent need broad internet outbound, approved outbound only, private endpoints, service tags, or explicit FQDN rules? If the answer is “we will figure that out later,” the real answer is “we are building migration work into the launch.”
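
One way to force that question early is to write the approved outbound list down as data and check tool URLs against it in code review or CI. This is an application-side sketch only: the allowlist entries are made up, and it does not reflect how Managed VNET outbound rules are actually expressed or enforced.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical approved destinations for one agent.
ALLOWED_FQDNS = ["*.search.windows.net", "api.internal.example.com"]

def is_egress_allowed(url: str, allowed: list[str] = ALLOWED_FQDNS) -> bool:
    """Return True if the URL's host matches an approved FQDN pattern."""
    host = (urlparse(url).hostname or "").lower()
    return any(fnmatch(host, pattern) for pattern in allowed)

print(is_egress_allowed("https://contoso.search.windows.net/indexes"))  # True
print(is_egress_allowed("https://example.org/data"))                    # False
```

A check like this is not a security boundary. It is a forcing function that surfaces unapproved destinations before the network rules do it for you in production.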

Foundry Local makes hybrid execution practical, not ideological

Foundry Local deserves more attention than it will get. Version 1.1 added live audio transcription, text embeddings, Responses API support, Qwen 3.5 Vision, and WebGPU as a separate execution-provider plugin. Microsoft says it tested more than 50 ASR configurations, reduced the selected Nemotron streaming model from 2.47 GB to as little as 0.67 GB (roughly a 73% reduction), kept word-error rate within 1% absolute of the PyTorch baseline, and hit 8.20% average streaming WER with 0.56 seconds of algorithmic latency on CPU.

Those are useful numbers because “local AI” usually arrives wrapped in privacy slogans and benchmark fog. The real architecture is not local versus cloud. It is tiered execution. Run cheap, latency-sensitive, privacy-sensitive primitives locally when they are good enough. Use hosted Foundry models for heavier reasoning. Route explicitly, log the decision, and make sure the system does not spend frontier-model money to classify a routine transcript or embed a document chunk.
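
"Route explicitly, log the decision" might look like the sketch below. The task names, the `LOCAL_TASKS` set, and the two targets are assumptions for illustration. A real router would call Foundry Local or a hosted deployment where this one only returns a decision.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("router")

# Hypothetical set of primitives judged good enough to run locally.
LOCAL_TASKS = frozenset({"transcribe", "embed", "classify"})

@dataclass(frozen=True)
class RouteDecision:
    task: str
    target: str  # "local" or "hosted"
    reason: str

def route(task: str) -> RouteDecision:
    """Pick an execution tier for a task and log why."""
    if task in LOCAL_TASKS:
        decision = RouteDecision(task, "local", "cheap primitive, local model is good enough")
    else:
        decision = RouteDecision(task, "hosted", "needs heavier reasoning")
    logger.info("route task=%s target=%s reason=%s",
                decision.task, decision.target, decision.reason)
    return decision

print(route("embed").target)        # local
print(route("plan_refund").target)  # hosted
```

The design choice worth copying is that routing is a named, logged policy rather than a model name hardcoded wherever a call happens to be made.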

Version 1.2 pushes that direction further with cancellable model and execution-provider downloads, multilingual ASR, Linux ARM64 support, WinML 2.0, ONNX Runtime 1.26.0, GenAI 0.14.0, region-based downloads, and removal of the previous five-minute timeout cap on large models. None of that is headline candy. All of it makes local/cloud routing less brittle.

Toolboxes are useful. Treat them like dependencies.

The preview azure-ai-projects 2.2.0 skills and toolboxes surface may be the highest-leverage part of the update. Microsoft’s sample registers reusable design guidance as a project skill, bundles it into a toolbox, exposes that toolbox through an MCP endpoint, and attaches it to a GPT-5.4 prompt agent with image input. That is a cleaner packaging model for agent capabilities.

It is also a new governance object. Who wrote the skill? Who can edit it? Which tools does the toolbox expose? Are secrets isolated? Are tool calls logged? Can an agent reach an MCP capability the app owner did not intend? Can you roll back a skill version after a bad deploy? This is where “MCP security checklist” stops being a search term and becomes an incident-prevention practice.

The Microsoft Agent Framework docs draw a helpful line between code-owned Responses Agents and server-managed, versioned Foundry Agents. That distinction matters. Prototypes can keep instructions and tools close to code. Production agents need release management: versioned definitions, promotion paths, review, reproducibility, and rollback. If a support agent changes behavior because somebody edited a shared toolbox at 4:57 p.m., the postmortem should not start with “we think the prompt changed.”

So the checklist coming out of this release is practical. Define project boundaries before cost grows. Pick the network isolation mode before building dependencies. Evaluate from production traces, not demo prompts. Use Foundry Local where latency, privacy, or unit economics justify it. Treat skills and toolboxes like software dependencies: owner, version, permission scope, approval policy, audit log, rollback. Build model routing as policy, not personal preference.
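
The dependency framing can be enforced with a release gate as simple as this sketch. The `ToolboxRecord` fields mirror the checklist above; they are hypothetical names, not part of the azure-ai-projects SDK.

```python
from dataclasses import dataclass, field

@dataclass
class ToolboxRecord:
    """Governance metadata kept alongside a toolbox (hypothetical shape)."""
    name: str
    owner: str = ""
    version: str = ""
    permission_scope: list[str] = field(default_factory=list)
    approved_by: str = ""
    audit_log_enabled: bool = False
    rollback_version: str = ""

def governance_gaps(record: ToolboxRecord) -> list[str]:
    """List the checklist items this toolbox record is still missing."""
    checks = {
        "owner": bool(record.owner),
        "version": bool(record.version),
        "permission scope": bool(record.permission_scope),
        "approval": bool(record.approved_by),
        "audit log": record.audit_log_enabled,
        "rollback": bool(record.rollback_version),
    }
    return [name for name, ok in checks.items() if not ok]

draft = ToolboxRecord(name="design-guidance", owner="platform-team", version="1.3.0")
print(governance_gaps(draft))
# ['permission scope', 'approval', 'audit log', 'rollback']
```

Blocking promotion while `governance_gaps` returns anything is a blunt rule, but it turns the checklist into something a pipeline can refuse to skip.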

The market already has enough chat boxes and enough model dropdowns. What it needs is an operations layer for agents that can be observed, budgeted, isolated, evaluated, governed, and shut down when they go sideways. Microsoft’s May Foundry update is messy because production is messy. That is why it matters.

Sources: Microsoft Foundry Dev Blog, Foundry Local 1.1, Microsoft Tech Community, Microsoft Agent Framework docs, GitHub Copilot billing
