Agent Optimizer Turns Prompt Tuning Into a Versioned Engineering Loop — With All the Usual Footguns

Anatoliy Kolodkin

04 Jun 2026 • 4 min read

Agent Optimizer is Microsoft’s attempt to professionalize the least professional part of agent development: the ritual of staring at failures, tweaking a system prompt, running three examples, and calling it a release. Every team says it is “iterating on behavior.” Too often, what it actually means is artisanal superstition with a YAML file.

The Foundry Agent Optimizer preview points in a better direction. Microsoft is packaging a closed-loop process for hosted agents: evaluate the baseline, generate candidate configurations, evaluate those candidates, rank them with score and token cost visible, then promote the winner as a new hosted-agent version. In Microsoft’s example, a candidate improves from a 0.60 score to 0.92, with pass rate moving from 71% to 100%, in 9 minutes over two iterations. No retraining. No code changes. Just changes to the behavioral configuration around the agent.

Those numbers make a good demo. The real product is the release discipline they imply.

Prompt tuning is becoming CI/CD

Agent Optimizer targets instructions and system prompts, skills, model choice, and tool descriptions. It can use synthetic data or historical traces plus evaluator signals to rewrite those pieces and compare candidates. Configuration lives under .agent_configs/baseline/: metadata.yaml and instructions.md, plus an optional skills/ directory and an optional tools.json in OpenAI function-calling format. Runtime integration uses load_config() from azure.ai.agentserver.optimization. During evaluation, optimized configurations are injected through an environment variable; when that variable is absent, production defaults apply.
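The override pattern described above is easy to picture with a stand-in. The sketch below is not the real load_config(); its signature and the actual environment variable are not given here, so the function name read_agent_config and the variable EXAMPLE_OPTIMIZED_CONFIG_DIR are hypothetical. It only illustrates the idea: read the baseline directory by default, and read a candidate directory when the variable is set.

```python
import json
import os
from pathlib import Path

# Hypothetical name: the variable the optimizer really uses is not stated in this article.
OVERRIDE_ENV_VAR = "EXAMPLE_OPTIMIZED_CONFIG_DIR"


def read_agent_config(default_dir=".agent_configs/baseline"):
    """Illustrative stand-in for load_config(), not the library function.

    Reads the directory named by OVERRIDE_ENV_VAR when it is set,
    otherwise falls back to the baseline directory.
    """
    config_dir = Path(os.environ.get(OVERRIDE_ENV_VAR) or default_dir)
    config = {
        "source": str(config_dir),
        "instructions": (config_dir / "instructions.md").read_text(encoding="utf-8"),
        "tools": [],
        "skills": [],
    }
    tools_path = config_dir / "tools.json"
    if tools_path.is_file():  # tools.json is optional
        config["tools"] = json.loads(tools_path.read_text(encoding="utf-8"))
    skills_dir = config_dir / "skills"
    if skills_dir.is_dir():  # skills/ is optional
        config["skills"] = sorted(p.name for p in skills_dir.iterdir())
    return config
```

The useful property is that the agent code never branches on "am I being evaluated." It reads one config, and the environment decides which one.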

That layout is pleasantly mundane, which is a compliment. If agent behavior is going to change, it should have files, versions, diffs, traces, scores, and rollback. The phrase “no code changes” is technically true and operationally dangerous if teams hear it as “not a production change.” Behavior changes are production changes. A rewritten instruction that makes an agent more assertive, a skill change that alters procedure, or a tool-description edit that changes tool selection can affect cost, latency, safety, permissions, and customer outcomes.

Microsoft’s evaluation stack does the necessary plumbing. azd ai agent eval init can generate a dataset and evaluation criteria from existing instructions. The sample creates 15 tasks and six weighted dimensions: policy_compliance, resolution_accuracy, troubleshooting_structure, communication_clarity, safety_boundaries, and general_quality. Foundry evaluation includes built-in evaluators such as Task Adherence, Coherence, Violence, groundedness, and safety, while AI-assisted evaluators can use a judge model deployment such as gpt-4o or gpt-4o-mini. Microsoft’s own docs suggest setting acceptance thresholds before release, for example an 85% task adherence passing rate.
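To make "weighted dimensions" and "passing rate" concrete, here is a minimal sketch of the arithmetic. The dimension names are the sample's; the weights, the per-task pass mark of 0.7, and the helper names are assumptions for illustration, not values or functions from Foundry.

```python
# Dimension names come from the sample; these weights are illustrative only.
WEIGHTS = {
    "policy_compliance": 0.25,
    "resolution_accuracy": 0.25,
    "troubleshooting_structure": 0.15,
    "communication_clarity": 0.10,
    "safety_boundaries": 0.15,
    "general_quality": 0.10,
}


def weighted_score(dimension_scores, weights=WEIGHTS):
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    total_weight = sum(weights.values())
    weighted_sum = sum(dimension_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


def pass_rate(task_scores, pass_mark):
    """Fraction of tasks whose score reaches pass_mark."""
    return sum(score >= pass_mark for score in task_scores) / len(task_scores)


def meets_threshold(task_scores, pass_mark=0.7, required_rate=0.85):
    """True when enough tasks pass to satisfy the acceptance threshold."""
    return pass_rate(task_scores, pass_mark) >= required_rate
```

With 15 tasks, an 85% threshold means at least 13 must pass: 13 of 15 is about 86.7%, while 12 of 15 is 80%. Small suites make thresholds coarse, which is one more reason to grow the dataset.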

This is the right direction, but it has the same old software truth hiding underneath: the optimizer can only optimize what the tests measure. If the eval suite is shallow, the optimizer will get very good at shallow. If the dataset overrepresents clean support tickets, the agent may improve on clean tickets while regressing on ambiguous inputs, adversarial prompts, tool outages, or messy real-world context. If general_quality outweighs safety_boundaries, the “better” candidate may simply be more charming while taking risks the product owner never intended.
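The weighting risk is easy to demonstrate with made-up numbers. In this sketch, the two candidates, their scores, and both weightings are invented for illustration; only the dimension names come from the sample. The same pair of candidates swaps rank depending on which dimension dominates.

```python
def weighted_score(scores, weights):
    """Weighted mean over the dimensions named in weights."""
    total = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total


def rank(candidates, weights):
    """Candidate names ordered from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Invented scores: one candidate is pleasant but loose, the other is stricter.
candidates = {
    "charming": {"general_quality": 0.95, "safety_boundaries": 0.60},
    "careful": {"general_quality": 0.75, "safety_boundaries": 0.95},
}
quality_heavy = {"general_quality": 0.8, "safety_boundaries": 0.2}
safety_heavy = {"general_quality": 0.3, "safety_boundaries": 0.7}

print(rank(candidates, quality_heavy))  # ['charming', 'careful']
print(rank(candidates, safety_heavy))   # ['careful', 'charming']
```

Nothing about the agents changed between the two rankings. Only the definition of "better" did.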

The tool-description footgun deserves review

The most under-discussed target is tool descriptions. Better descriptions can improve tool selection and reduce hallucinated calls. They can also change the system's effective permission behavior. If an optimizer rewrites an order-lookup tool description so it sounds broadly useful for “customer context,” the agent may call it more often than necessary. If it tightens required parameters, it may reduce waste and improve safety. Both outcomes are real behavior changes.

That means candidate diffs need human review. Treat optimized prompts, skills, and tool descriptions like code. Review what changed. Run targeted regression suites. Compare tool-call rates. Inspect traces. Monitor token cost and latency after promotion. Keep rollback easy. The correct workflow is not “click optimize, ship winner.” It is “run optimize, inspect candidate diffs, compare scores and costs, stage the winner, replay production-like traces, then promote.”
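Reviewing tool-description changes does not need special tooling. As a rough sketch, assuming tools.json holds a list of entries in the nested Chat Completions style ({"type": "function", "function": {"name": ..., "description": ...}}), Python's standard difflib can surface exactly what the optimizer rewrote. The helper name description_diffs is made up for this example.

```python
import difflib


def description_diffs(baseline_tools, candidate_tools):
    """Return {tool_name: unified diff lines} for tools whose description changed.

    A tool present on only one side is diffed against an empty description.
    """
    def by_name(tools):
        return {
            tool["function"]["name"]: tool["function"].get("description", "")
            for tool in tools
        }

    before, after = by_name(baseline_tools), by_name(candidate_tools)
    diffs = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name, ""), after.get(name, "")
        if old != new:
            diffs[name] = list(
                difflib.unified_diff(
                    old.splitlines(),
                    new.splitlines(),
                    fromfile=f"baseline/{name}",
                    tofile=f"candidate/{name}",
                    lineterm="",
                )
            )
    return diffs
```

A diff like "Look up an order by ID" becoming "Look up customer context" is a one-line change that deserves the same scrutiny as a one-line change to an authorization check.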

The model-choice target is where Agent Optimizer becomes a FinOps tool instead of just a prompt toy. Agentic workloads consume tokens in ways classic chat apps do not: long sessions, tool schemas, tool results, retries, trace replay, evaluator runs, and multimodal context all accumulate. If an optimizer can compare smaller and larger models against the same task suite and expose score/cost trade-offs, the economically rational answer will often be “smaller model plus better instructions,” not “largest model everywhere because it feels safer.”
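The score/cost trade-off can be stated as a selection rule: among candidates that clear the quality bar, take the cheapest. The sketch below uses invented candidates and treats mean tokens per task as a cost proxy, which is a simplification because real cost also depends on per-model pricing. The Candidate type and cheapest_acceptable helper are hypothetical, not part of Agent Optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float          # aggregate eval score in [0, 1]
    tokens_per_task: int  # mean tokens consumed per evaluated task


def cheapest_acceptable(candidates, min_score):
    """Cheapest candidate with score >= min_score, or None if none qualifies.

    Ties on token cost are broken in favor of the higher score.
    """
    acceptable = [c for c in candidates if c.score >= min_score]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: (c.tokens_per_task, -c.score))


# Invented numbers for illustration only.
results = [
    Candidate("large model, baseline prompt", 0.90, 5200),
    Candidate("small model, tuned prompt", 0.88, 2100),
    Candidate("small model, baseline prompt", 0.70, 1900),
]
winner = cheapest_acceptable(results, min_score=0.85)
print(winner.name)  # small model, tuned prompt
```

The highest scorer loses here on purpose. Once the bar is defined, paying more than double the tokens for two extra points is a choice someone should have to justify.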

That is especially important for coding agents and internal operations agents, where usage can spike quietly. A developer may launch a long-running repo task. An optimizer may replay dozens of traces. A support agent may call tools repeatedly because a tool description nudges it that way. Token cost is not a billing footnote; it is a runtime property. If Microsoft is putting token costs into the optimizer loop, teams should put budget thresholds into release policy.

The practical advice is simple: do not start with the optimizer. Start with the eval contract. Define representative tasks, refusal cases, tool-failure scenarios, data-boundary tests, latency budgets, and cost ceilings. Pull in production traces, but do not blindly trust them; production logs encode existing product gaps and user workarounds. Add adversarial and edge cases manually. Decide what score is good enough, what dimensions are non-negotiable, and what kinds of regressions block release.
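An eval contract can be small enough to live next to the config files. The following is one possible shape, with hypothetical field names and thresholds: an overall score floor, per-dimension floors for the non-negotiables, and ceilings for tokens and latency. It returns reasons rather than a boolean so a blocked release explains itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalContract:
    """Hypothetical release contract; field names are illustrative."""
    min_overall_score: float
    dimension_floors: dict = field(default_factory=dict)  # non-negotiable minimums
    max_tokens_per_task: int = 10_000
    max_p95_latency_s: float = 30.0


def release_blockers(contract, result):
    """Return human-readable reasons to block promotion; an empty list means pass."""
    blockers = []
    if result["overall_score"] < contract.min_overall_score:
        blockers.append(
            f"overall_score {result['overall_score']:.2f} < {contract.min_overall_score:.2f}"
        )
    for name, floor in contract.dimension_floors.items():
        value = result["dimensions"].get(name, 0.0)  # a missing dimension counts as failing
        if value < floor:
            blockers.append(f"{name} {value:.2f} < floor {floor:.2f}")
    if result["tokens_per_task"] > contract.max_tokens_per_task:
        blockers.append(
            f"tokens_per_task {result['tokens_per_task']} > {contract.max_tokens_per_task}"
        )
    if result["p95_latency_s"] > contract.max_p95_latency_s:
        blockers.append(
            f"p95_latency_s {result['p95_latency_s']:.1f} > {contract.max_p95_latency_s:.1f}"
        )
    return blockers
```

The per-dimension floors are the point. A candidate with a higher overall score and a lower safety_boundaries score is blocked regardless of how the weights average out.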

Then use Agent Optimizer as an engineering loop. It is promising precisely because it makes agent behavior more measurable. But measurable is not the same as safe. The tests, budgets, approvals, and rollback paths are what turn optimization from magic button to release process. The tool can generate candidates. Engineering still owns judgment.

Sources: Microsoft Foundry Blog, Microsoft Learn: evaluate agents, Microsoft Foundry Blog: Hosted Agents
