Toolboxes and Routines Are Microsoft’s Answer to Agent Tool Sprawl Before It Becomes the New Microservices Mess
Microsoft’s Foundry Toolboxes update looks like a grab bag if you read it as a feature list: Skills, Work IQ, Fabric IQ, Browser Automation, managed MCP servers, Tool Search, guardrails, and Routines. Read it as an operations story and it becomes much cleaner. Microsoft is trying to stop agent tool sprawl before it turns into the new microservices mess, except this time the thing choosing between 200 vaguely named interfaces is a probabilistic model with a corporate credit card.
The problem is real and wonderfully unglamorous. A toolbox starts with five tools. Then every integration, workflow, report, SaaS action, internal API, and helper script wants to join. Soon the model receives 50 or 200 tool definitions every turn. Microsoft names the consequences directly: sending every definition on every turn increases token cost, crowds out useful context, and makes tool selection worse. Good. Naming the failure mode is half the product.
Tool Search is retrieval for affordances
The cleanest idea in the batch is Tool Search. Instead of stuffing the full tool catalog into the model’s context, Foundry exposes two meta-tools: tool_search, which takes a description of the needed capability and returns the relevant tools, and call_tool, which invokes a discovered tool by name. Microsoft recommends it when a toolbox has more than 10–15 tools or when different tasks need different subsets of tools. Configuration uses the preview directive { "type": "toolbox_search_preview" }, and Tool Search itself does not appear in tools/list or count toward unnamed-tool-per-type limits.
This is not exotic. It is the obvious next layer once tool catalogs stop fitting inside a demo. We already learned this lesson with documents: do not paste the entire knowledge base into every prompt; retrieve the relevant chunks. Tool Search applies the same principle to affordances. The model does not need to see every possible action at all times. It needs to discover the right action, with enough context to call it safely.
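To make the search-then-call pattern concrete, here is a minimal local sketch. It is not Foundry's implementation: the catalog, the three tools, the stopword list, and the keyword-overlap ranking are all hypothetical stand-ins, and only the names tool_search and call_tool come from Microsoft's description. A real system would presumably use a stronger retriever than word overlap.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical catalog entry; not a Foundry type.
@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    handler: Callable[..., object]

# Hypothetical tools, for illustration only.
CATALOG = {
    t.name: t
    for t in [
        Tool("create_invoice", "create a new customer invoice in billing",
             lambda **kw: {"invoice": kw}),
        Tool("refund_payment", "refund a customer payment",
             lambda **kw: {"refund": kw}),
        Tool("lookup_weather", "get the weather forecast for a city",
             lambda **kw: {"weather": kw}),
    ]
}

STOPWORDS = {"a", "an", "the", "for", "in", "of", "to"}

def _words(text: str) -> set[str]:
    """Lowercase, split on whitespace and underscores, drop stopwords."""
    return set(text.lower().replace("_", " ").split()) - STOPWORDS

def tool_search(query: str, limit: int = 2) -> list[str]:
    """Return names of the tools whose name and description best overlap the query."""
    wanted = _words(query)
    scored = [
        (len(wanted & _words(f"{t.name} {t.description}")), t.name)
        for t in CATALOG.values()
    ]
    # Keep only tools with some overlap; best score first, ties by name.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]

def call_tool(name: str, **arguments):
    """Invoke a discovered tool by name; unknown names are rejected."""
    if name not in CATALOG:
        raise KeyError(f"unknown tool: {name}")
    return CATALOG[name].handler(**arguments)

matches = tool_search("refund a payment")
result = call_tool(matches[0], payment_id="p1")
```

The point of the design is that only the two meta-tools need to be in context up front. Everything else is fetched on demand, which is why the quality of names and descriptions matters so much.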
But retrieval is not authorization. Hiding 200 tool schemas from the first context window may reduce confusion, but it does not make dangerous tools safe. The controls that matter are still tool-level permissions, approval requirements, input and output guardrails, sensitive-data policy, audit logs, and traceability from search query to actual execution. If a tool can send money, delete records, email external recipients, rotate credentials, or read sensitive data, it should not merely be hard to discover. It should be governed.
Microsoft’s examples deserve the usual sample-code skepticism. The REST and Python examples show an MCP tool plus ToolboxSearchPreviewTool() and require_approval="never". That is fine for a doc snippet. It is not a production policy. The first checklist item for any team adopting Tool Search should be to review approval defaults by tool class: read-only, low-risk write, high-risk write, external communication, sensitive-data access, and administrative operation. “Never require approval” is how a convenience layer becomes an incident report.
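One way to run that review is to classify every tool and flag any whose configured approval is weaker than the class allows. The sketch below is an assumption-heavy illustration: the class names mirror the list above, but the policy (only read-only tools may skip approval), the "always" value, and the example tool names are hypothetical, not Microsoft defaults.

```python
from enum import Enum

class ToolClass(Enum):
    READ_ONLY = "read_only"
    LOW_RISK_WRITE = "low_risk_write"
    HIGH_RISK_WRITE = "high_risk_write"
    EXTERNAL_COMMUNICATION = "external_communication"
    SENSITIVE_DATA_ACCESS = "sensitive_data_access"
    ADMINISTRATIVE = "administrative"

# Assumed policy: only these classes may run without approval.
NO_APPROVAL_NEEDED = {ToolClass.READ_ONLY}

def approval_for(tool_class: ToolClass) -> str:
    """Return the approval setting this (assumed) policy requires for a class."""
    return "never" if tool_class in NO_APPROVAL_NEEDED else "always"

def audit(tools: dict[str, tuple[ToolClass, str]]) -> list[str]:
    """Return names of tools configured with 'never' where policy says 'always'."""
    return sorted(
        name
        for name, (tool_class, configured) in tools.items()
        if configured == "never" and approval_for(tool_class) == "always"
    )

# Hypothetical toolbox: tool name -> (class, configured require_approval value).
toolbox = {
    "search_docs": (ToolClass.READ_ONLY, "never"),
    "wire_transfer": (ToolClass.HIGH_RISK_WRITE, "never"),
    "send_email": (ToolClass.EXTERNAL_COMMUNICATION, "always"),
}
violations = audit(toolbox)
```

A check like this is cheap to run in CI against the toolbox definition, so a copied doc snippet does not quietly become the production default.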
Skills are a package system for behavior
Skills may be the more strategic primitive. Microsoft’s distinction is useful: tools tell an agent what it can do; skills tell it how to do it. Skills are versioned, immutable, project-scoped reusable procedures loaded through MCP resources at startup. That gives teams a way to encode shared procedures without copy-pasting prompt blocks into every agent.
It also creates a supply chain for behavior. A stale refund-policy skill, a badly reviewed escalation skill, or an overbroad data-handling skill can shape decisions before any tool call happens. Versioned and immutable is the right base. It is not enough by itself. Teams need owners, review gates, provenance, deprecation rules, compatibility notes, and monitoring for which agents load which skills. If that sounds like package governance, yes. That is the point. Agent behavior is becoming deployable software, even when the artifact is a markdown-like procedure instead of a binary.
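A rough sketch of what that governance could look like in code follows. Nothing here is a Foundry API: SkillVersion, SkillRegistry, and every field name are hypothetical, and the rules (no republishing a version, review by someone other than the owner, refusal to load past a deprecation date) are one possible policy among many.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical metadata record for one published skill version.
@dataclass(frozen=True)
class SkillVersion:
    name: str
    version: str
    owner: str
    reviewed_by: str
    deprecated_after: Optional[date] = None

class SkillRegistry:
    """Hypothetical in-memory registry enforcing immutability and review."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], SkillVersion] = {}

    def publish(self, skill: SkillVersion) -> None:
        key = (skill.name, skill.version)
        if key in self._versions:
            # Immutable: a published version is never overwritten.
            raise ValueError(f"{skill.name}@{skill.version} already published")
        if not skill.reviewed_by or skill.reviewed_by == skill.owner:
            # Review gate: someone other than the owner must sign off.
            raise ValueError("skill needs an independent reviewer")
        self._versions[key] = skill

    def load(self, name: str, version: str, today: date) -> SkillVersion:
        skill = self._versions[(name, version)]
        if skill.deprecated_after is not None and today > skill.deprecated_after:
            # Deprecation rule: stale procedures fail loudly at load time.
            raise LookupError(f"{name}@{version} is deprecated")
        return skill

registry = SkillRegistry()
refund_skill = SkillVersion("refund-policy", "1.0.0", "alice", "bob", date(2025, 12, 31))
registry.publish(refund_skill)
```

Failing at load time is deliberate in this sketch: an agent that cannot start is easier to notice than one that quietly follows last year's refund policy.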
Work IQ and Fabric IQ are the context side of the same bet. Work IQ gives agents Microsoft 365 workplace context grounded in existing permissions. Fabric IQ provides business context through ontology, Fabric data agents, and Power BI semantic models. Those are useful because agents need organizational context, not just internet facts. They are risky for the same reason. A tool catalog attached to business data and workplace context needs least privilege, DLP, and traces. Otherwise you have built a very polite data exfiltration interface.
Browser Automation adds another sharp edge. Microsoft is bringing MCP-native web automation to hosted agents using Playwright workspaces, with live visibility and control for edge cases. Browser agents are powerful because the web remains the universal integration layer. They are dangerous because browser state is messy: sessions, cookies, hidden form fields, popups, permission prompts, and UI changes. Live visibility is not a nice-to-have. It is the difference between supervised automation and letting a model poke a production admin console through a glass window.
Routines are the boring glue agents needed
Routines solve a different operational problem: scheduling. In preview, they support timer and recurring triggers, allow exactly one trigger and one action per routine, and invoke one Foundry prompt or hosted agent through the existing endpoint. Run history stores inputs, outputs, status, and links to agent response and trace details. That limitation (one trigger, one action, no branching) is healthy. Use Routines for “run this agent every weekday morning.” Use workflows when there is state, approval, branching, or multi-agent coordination. Not every recurring task needs a distributed-systems cosplay outfit.
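The routine-versus-workflow rule of thumb is simple enough to write down as a predicate. The sketch below is only an illustration of that rule: TaskShape and choose_runner are hypothetical names, and the fields are taken from the criteria in the paragraph above rather than from any Foundry schema.

```python
from dataclasses import dataclass

# Hypothetical description of a recurring task; not a Foundry schema.
@dataclass(frozen=True)
class TaskShape:
    triggers: int = 1
    actions: int = 1
    keeps_state: bool = False
    needs_approval: bool = False
    branches: bool = False
    coordinates_agents: bool = False

def choose_runner(task: TaskShape) -> str:
    """Return 'routine' only for one trigger, one action, and no orchestration needs."""
    needs_orchestration = (
        task.keeps_state
        or task.needs_approval
        or task.branches
        or task.coordinates_agents
    )
    fits_routine = task.triggers == 1 and task.actions == 1 and not needs_orchestration
    return "routine" if fits_routine else "workflow"

weekday_report = TaskShape()
refund_review = TaskShape(needs_approval=True, branches=True)
```

Here weekday_report maps to a routine and refund_review to a workflow, which matches the split described above.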
The practitioner move is to create a toolbox review process before the catalog gets large. Set a threshold for enabling Tool Search. Require useful descriptions and additional search text when internal tool names do not match user intent. Pin only truly universal tools. Review require_approval defaults. Test confusing pairs of tools. Log tool-search queries and the resulting tool calls. Track cost per turn as the catalog grows. If Tool Search is working, the cost curve should flatten and selection should improve. If not, the agent is searching too broadly or the catalog is poorly described.
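The cost argument is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses a deliberately simplified model in which every tool definition costs the same number of tokens. The 150-token figure, the counts of pinned and retrieved tools, and both function names are assumptions for illustration, not measurements or Foundry behavior.

```python
def full_catalog_tokens(tool_count: int, tokens_per_definition: int) -> int:
    """Tokens spent on tool definitions when every tool is sent each turn."""
    return tool_count * tokens_per_definition

def tool_search_tokens(
    retrieved_per_turn: int,
    pinned_tools: int,
    tokens_per_definition: int,
    meta_tools: int = 2,  # tool_search and call_tool
) -> int:
    """Tokens spent when only meta-tools, pinned tools, and search hits are in context."""
    return (meta_tools + pinned_tools + retrieved_per_turn) * tokens_per_definition

# Assumed inputs: 200 tools, roughly 150 tokens per definition,
# 3 pinned tools, and 5 tools retrieved per turn.
before = full_catalog_tokens(tool_count=200, tokens_per_definition=150)
after = tool_search_tokens(retrieved_per_turn=5, pinned_tools=3, tokens_per_definition=150)
savings_ratio = before / after
```

Under these assumptions the per-turn definition overhead drops from 30,000 tokens to 1,500. More importantly, the second number no longer grows with the catalog, which is the flattening to look for in real cost-per-turn data.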
Microsoft is not just shipping a bigger drawer of tools. It is admitting that agent platforms need catalog management, behavioral packaging, scheduled execution, data context, browser supervision, guardrails, and traces. That is less glamorous than a model benchmark. It is also much closer to what determines whether agents survive contact with production.
Sources: Microsoft Foundry Blog, Microsoft Learn: Tool Search, Microsoft Learn: Routines