
Foundry Model Router Is Turning Model Choice Into Policy

Anatoliy Kolodkin

22 May 2026 • 4 min read

The least glamorous feature in Microsoft Foundry may be one of the most important: a router that decides which model should answer a prompt. That sounds like product-manager convenience until you run enough AI workloads in production. Then it becomes obvious that “let every developer pick a model” is not a strategy. It is a distributed cost leak with a dropdown.

Microsoft’s updated Foundry Model Router documentation describes a broader and more serious routing surface than the usual “smart model picker” pitch. The router is exposed as a single Foundry deployment, but behind it sits an approved pool of models: GPT-5 family variants, DeepSeek V3.1 and V3.2, gpt-oss-120b, Llama 4 Maverick, xAI Grok variants, and Claude Haiku/Sonnet/Opus models including claude-opus-4-7. Claude support has an extra operational step: teams must deploy Claude models from the catalog first, then allow the router to call those deployments when selected.

The product story is simple. Instead of asking every application team to choose the right model for every prompt, Foundry lets platform teams deploy one endpoint, choose a routing mode, optionally constrain the eligible model set, and observe which underlying model was selected in the response’s model field. The engineering story is sharper: model choice is becoming policy, and policy belongs in infrastructure rather than buried in prompt wrappers.

The router is not the policy. The subset is.

Microsoft lists three routing modes: Balanced, Cost, and Quality. Balanced considers models within a small quality range of the highest-quality model for that prompt (the docs give 1% to 2% as an example) and then chooses the most cost-effective option. Cost mode accepts a wider quality range, such as 5% to 6%, to save more money. Quality mode ignores cost and selects the highest-quality model. Those settings are useful, but the more important feature is model subsets.
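To make the three modes concrete, here is a minimal sketch of the selection rule as described above. It is not Microsoft's implementation: the per-prompt quality scores, the cost units, the model names, and the exact tolerances (taken from the upper end of the example ranges) are all assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed tolerances, borrowed from the example ranges quoted in the docs.
QUALITY_TOLERANCE = {"quality": 0.0, "balanced": 0.02, "cost": 0.06}


@dataclass(frozen=True)
class Candidate:
    name: str       # hypothetical model name
    quality: float  # assumed predicted quality for this prompt, 0..1
    cost: float     # assumed relative cost per request


def pick_model(candidates, mode):
    """Keep models within the mode's quality tolerance of the best one,
    then pick the cheapest (or the best, in quality mode)."""
    best = max(c.quality for c in candidates)
    floor = best * (1 - QUALITY_TOLERANCE[mode])
    eligible = [c for c in candidates if c.quality >= floor]
    if mode == "quality":
        return max(eligible, key=lambda c: c.quality)
    return min(eligible, key=lambda c: c.cost)


pool = [
    Candidate("model-large", quality=0.90, cost=10.0),
    Candidate("model-medium", quality=0.89, cost=3.0),
    Candidate("model-small", quality=0.85, cost=1.0),
]
```

With this pool, balanced mode settles on the medium model, cost mode drops to the small one, and quality mode stays on the large one. The point of the sketch is that a one-number tolerance change moves real traffic between models.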

Subsets let a team specify which underlying models are eligible. Just as important, new models are excluded by default until explicitly added. That is the right default. The latest router version listed in the docs is 2025-11-18, and Microsoft says it is actively maintained with new underlying models added over time without changing the version identifier. Without subsets, “same deployment” can quietly become “different model mix.” With subsets, platform teams get an approval boundary: these models are allowed for this workload, these are not, and newly available models do not enter production just because somebody updated a catalog.
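The approval boundary is easy to mirror in a platform team's own tooling. The sketch below assumes a hypothetical catalog listing and hypothetical model names; it only shows the default-deny idea, not how Foundry stores or enforces subsets.

```python
def split_catalog(catalog, approved):
    """Separate catalog models into those explicitly approved for a workload
    and those still waiting for review. Anything not approved is excluded."""
    eligible = [m for m in catalog if m in approved]
    pending_review = [m for m in catalog if m not in approved]
    return eligible, pending_review


# Hypothetical names: the catalog grew, the approved subset did not.
approved_subset = {"model-a", "model-b"}
catalog_today = ["model-a", "model-b", "model-c-new"]

eligible, pending_review = split_catalog(catalog_today, approved_subset)
```

A non-empty pending list is a review ticket, not a production change.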

That matters for regulated work, customer data, internal code, and anything where reproducibility is not optional. Auto-routing across the full supported pool may be fine for a prototype or a low-risk internal assistant. It is a poor default for workflows that summarize contracts, inspect source code, trigger actions, or generate customer-facing recommendations. The router should reduce operational toil, not launder model governance into a black box.

The other feature practitioners should underline is response visibility. Foundry exposes the selected underlying model in the response. That field is not trivia. It is how you debug regressions, attribute cost, build dashboards, correlate output quality with model choice, and explain why the same endpoint behaved differently across two classes of prompts. If your observability pipeline drops that field, you have adopted routing without adopting the evidence trail that makes routing safe.
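Keeping that evidence trail can be as small as the sketch below. It assumes the response has already been parsed into a dict with a top-level "model" key and an optional "usage" object, which is the shape of a typical chat-completions payload; the deployment name and logger name are hypothetical.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("router_audit")  # hypothetical logger name


def record_selected_model(response: dict, deployment: str, counts: Counter) -> str:
    """Log and count which underlying model answered a routed request."""
    model = response.get("model") or "unknown"
    counts[(deployment, model)] += 1
    logger.info(json.dumps({
        "deployment": deployment,
        "selected_model": model,
        "usage": response.get("usage", {}),
    }))
    return model


counts = Counter()
first = record_selected_model(
    {"model": "model-a", "usage": {"total_tokens": 12}}, "support-chat", counts
)
missing = record_selected_model({}, "support-chat", counts)
```

Counting "unknown" explicitly is deliberate: a rising unknown count means the pipeline is dropping the field somewhere.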

Context windows are where naive routing breaks

Microsoft’s docs include a warning that should be printed on every internal platform guide: the effective context window is constrained by the smallest underlying model unless teams use subsets to raise the floor. That is the kind of detail teams discover the hard way. A router pool that mixes small-context and large-context models can work well for short chats, classification, and simple transformations. It can behave poorly for long-document RAG, codebase reasoning, legal workflows, or agentic tasks where the prompt includes system instructions, tool definitions, conversation history, retrieved documents, and user input.
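The arithmetic behind that warning is a single min(). The sketch below uses made-up model names and token limits; real limits have to come from each model's documentation.

```python
# Hypothetical context limits in tokens, for illustration only.
CONTEXT_LIMITS = {
    "model-small": 32_000,
    "model-medium": 128_000,
    "model-large": 200_000,
}


def effective_context_window(subset, limits=CONTEXT_LIMITS):
    """The router can only promise what its smallest eligible model accepts."""
    subset = list(subset)
    if not subset:
        raise ValueError("subset must contain at least one model")
    return min(limits[m] for m in subset)


def subset_with_floor(required_tokens, limits=CONTEXT_LIMITS):
    """Raise the floor: keep only models that can hold the required prompt size."""
    return sorted(m for m, limit in limits.items() if limit >= required_tokens)


full_pool_window = effective_context_window(CONTEXT_LIMITS)
long_context_subset = subset_with_floor(100_000)
long_context_window = effective_context_window(long_context_subset)
```

Under these assumed numbers the full pool is capped at 32,000 tokens, while a subset built for 100,000-token prompts guarantees 128,000.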

The practical answer is segmentation. Do not create one mega-router and point everything at it. Create separate deployments by workload class: high-volume short prompts, internal support chat, long-context analysis, coding-agent planning, tool-heavy agent loops, and high-risk reasoning. Each deployment should have its own approved subset, routing mode, eval corpus, budget expectations, and monitoring threshold. If that sounds like more work than a single endpoint, welcome to production.
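One way to keep that segmentation honest is to write it down as data. The sketch below is a hypothetical internal registry, not a Foundry API: the deployment names, model names, budgets, and thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterDeployment:
    name: str                  # hypothetical deployment name
    mode: str                  # "balanced", "cost", or "quality"
    subset: tuple              # approved underlying models
    monthly_budget_usd: float  # budget expectation for this workload class
    max_mix_drift: float       # monitoring threshold for model-mix changes


DEPLOYMENTS = {
    "short-prompts": RouterDeployment(
        "router-short", "cost", ("model-small", "model-medium"), 2000.0, 0.20
    ),
    "long-context": RouterDeployment(
        "router-long", "quality", ("model-large",), 8000.0, 0.05
    ),
}


def deployment_for(workload: str) -> RouterDeployment:
    """Fail loudly for unregistered workloads instead of falling back
    to a catch-all router."""
    try:
        return DEPLOYMENTS[workload]
    except KeyError:
        raise LookupError(
            f"no approved router deployment for workload {workload!r}"
        ) from None
```

The missing fallback is the design choice: a workload nobody classified should not get a model by accident.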

The agent-service caveat reinforces the point. Microsoft notes that if Agent Service tools are used in flows, only OpenAI models are used for routing. That is not a footnote for teams building agents; it is architecture. The set of models available to a generic chat workload may not match the set available inside a tool-using agent flow. If you are counting on Claude or Grok or DeepSeek inside a particular agent workflow, verify the actual routing path before you sell the architecture internally.

There is also a serious eval requirement here. Microsoft points to a ModelRouter-Distribution repository for testing routing distributions against workload corpora. Teams should treat that as a preflight checklist, not optional homework. Collect representative prompts. Run them through Cost, Balanced, and Quality modes. Inspect model distribution, latency, cost, refusal patterns, formatting drift, and failure modes. Then do the unglamorous work after launch: monitor distribution in Azure Monitor and alert when the model mix changes outside expectations.
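The "model mix changed" alert needs a number to alert on. A simple candidate is total variation distance between a baseline mix and the observed mix, sketched below. This is an assumed metric, not something the Microsoft docs or the ModelRouter-Distribution repository prescribe, and the model names are hypothetical.

```python
from collections import Counter


def model_mix(selected_models):
    """Turn a list of selected model names into a share-per-model mapping."""
    counts = Counter(selected_models)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routed requests to summarize")
    return {model: n / total for model, n in counts.items()}


def mix_drift(baseline, observed):
    """Total variation distance between two mixes: 0.0 means identical,
    1.0 means no overlap at all."""
    models = set(baseline) | set(observed)
    return 0.5 * sum(
        abs(baseline.get(m, 0.0) - observed.get(m, 0.0)) for m in models
    )


baseline_mix = {"model-a": 0.5, "model-b": 0.5}
observed_mix = model_mix(["model-a", "model-a", "model-a", "model-b"])
drift = mix_drift(baseline_mix, observed_mix)
```

Here a quarter of the traffic moved from one model to the other, so the drift is 0.25. Whether that trips an alert depends on the threshold each deployment owner chose.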

The strongest use case may be coding agents, but only if teams are careful. Agentic coding is not one prompt. It is many task types stitched together: classify the issue, inspect files, plan, edit, run tests, read failures, patch again, summarize the diff. Some of those steps can route to cheaper models. Some should not. A model router can lower the average cost of agent work and improve availability, but the policy has to understand task risk. Use broad routing for low-risk classification and summarization. Pin or tightly subset models for security-sensitive edits, migration logic, auth code, concurrency, and public API changes.
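A task-risk policy for a coding agent can start as a very small function. The step names, path markers, and deployment names below are hypothetical, and the substring check on paths is a deliberately crude stand-in for whatever ownership or sensitivity metadata a real repository has.

```python
# Hypothetical step names considered safe for broad routing.
LOW_RISK_STEPS = {"classify_issue", "summarize_diff", "read_failures"}

# Hypothetical markers for paths that should never get broad routing.
SENSITIVE_MARKERS = ("auth", "migrations", "security")


def deployment_for_step(step, touched_paths=()):
    """Send low-risk steps to the broad router; pin everything else,
    including unknown steps and anything touching sensitive paths."""
    touches_sensitive = any(
        marker in path.lower()
        for path in touched_paths
        for marker in SENSITIVE_MARKERS
    )
    if step in LOW_RISK_STEPS and not touches_sensitive:
        return "router-broad"
    return "pinned-reviewed"
```

For example, deployment_for_step("classify_issue") goes to the broad router, while deployment_for_step("edit", ["src/auth/session.py"]) is pinned. Unknown steps fail closed, which is the cheap way to avoid routing a new agent capability before anyone has assessed it.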

The industry spent the last two years treating model choice like taste: this one feels better, that one is faster, this other one writes nicer tests. That phase is ending. At scale, model choice is cost control, compliance posture, reliability engineering, and auditability. Foundry Model Router is interesting because it pushes that decision into a governable layer. The feature is only good if platform teams use it that way. A router without subsets, evals, observability, and ownership is just randomness with an enterprise SKU.

Sources: Microsoft Learn, Microsoft Learn: Model Router concepts, Microsoft Learn: how routing works, ModelRouter-Distribution
