azure-ai

AI Coding Agents Do Not Read Your Docs Like Humans

Anatoliy Kolodkin

27 May 2026 • 5 min read

The uncomfortable truth in Microsoft’s latest Agent Experience essay is that most developer platforms are still optimized for a species that is no longer the only consumer: humans with patience, taste, and enough context to know when an SDK example smells stale.

AI coding agents do not read your docs that way. They do not browse like a senior engineer skimming release notes, connecting Slack lore, tribal knowledge, and muscle memory. They operate through a harness that assembles context, exposes tools, decides what fits into the window, routes tool calls, feeds back errors, and lets a model generate code from whatever survived that pipeline. If your technology is hard for that pipeline to understand, the result is not a bad demo. It is wrong code that looks plausible enough to merge.

Microsoft’s Waldek Mastykarz lays out the lifecycle in seven steps: harness context assembly, model interpretation, tool selection, tool invocation, response processing, code generation, and iteration. That sounds academic until you map it to a real failure. The harness may include the system prompt, OS details, working-directory path, workspace files, MCP tool descriptions, conversation history, repository instruction files like .github/copilot-instructions.md or AGENTS.md, and the developer’s prompt. Then it may summarize, rank, truncate, or drop some of that context before the model ever sees it.

That is the new front door for your SDK, API, CLI, docs, language server, and examples. Not your homepage. Not the lovingly maintained tutorial page. The context window.

Your documentation now has a machine reader with terrible judgment

The article’s strongest point is that agent failures often start before the agent writes a single line of code. If a developer asks for “authentication” and your MCP tool is described as “configure identity provider settings,” the model has to infer the connection semantically. Sometimes it will. Sometimes it will choose a different tool whose description is more obvious. Sometimes it will skip tools entirely because stale pretraining makes it feel confident enough to code from memory.

That last case is the expensive one. A model with no knowledge of your technology may ask for help. A model with outdated knowledge may confidently generate the 2024 auth flow against a 2026 SDK. The developer sees familiar-looking code, the tests fail later, and everyone blames “AI hallucination.” That is too convenient. If your current agent-facing surfaces do not override stale training data at inference time, the hallucination is partly your platform’s documentation strategy leaking into production.

Microsoft’s earlier AX-stack piece frames the controllable layer as agent extensions: skills, MCP servers, instruction files, and custom agents. The model is largely fixed. The harness is largely someone else’s product — Copilot, Claude Code, Cursor, VS Code, Windsurf, whatever your developers use. Extensions are where technology owners still have leverage. But leverage is not the same as volume. Shoving an encyclopedia into an MCP response is not “better context.” It is context-window denial-of-service with a nicer badge.

Mastykarz gives the concrete warning: returning 3,000 tokens of documentation when 200 would do can push other relevant context out of the window. That is not just a cost problem. It is a quality bug. Every excessive paragraph competes with the repo instructions, a prior test failure, a security constraint, or the actual file the agent needed to inspect. Good AX is not “make all docs available.” Good AX is “return the smallest current answer that lets the agent do the right next thing.”

The harness is where model fandom goes to die

This is why the endless “which coding model is best?” argument keeps producing mediocre operating advice. The VS Code team’s harness deep dive makes the split explicit: language models do not edit files, execute commands, or run tests by themselves. The harness assembles context, exposes tools, validates tool calls, executes edits and commands, captures output, manages loop limits, summarizes history, and decides when the agent is done.

In other words, the model is not the product. The runtime is.

That matters because the same model can behave differently across Copilot, Claude Code, Cursor, and other harnesses. Tool descriptions may be included, summarized, reordered, or omitted differently. Some tools require confirmation. Some are enabled only for specific models. Some harnesses are more eager to search or call tools; others try to answer from memory first. VS Code’s post notes that providers differ in tool calling, structured outputs, reasoning controls, prompt caching, context limits, and error behavior — enough that integrating a new model is not merely adding a dropdown option.

For practitioners, the implication is uncomfortable but useful: you cannot certify your platform as “agent-ready” by testing one clean prompt in one clean environment. You need a harness matrix. Test Copilot in VS Code, Copilot CLI if your teams use it, Claude Code, Cursor, and any internal harness that has real adoption. Then test with the messy extension set your developers actually run. An MCP server that works beautifully in isolation may fail when another tool offers a partial answer sooner, or when a router-style server hides the correct subtool one step deeper than the model is willing to go.

This composition problem is the part most agent demos politely avoid. Real developer environments are not empty labs. They contain linters, language servers, cloud CLIs, database tools, repo instructions, internal MCP servers, half-maintained extensions, and a few “temporary” scripts that have been load-bearing since Q3. Your tool is not competing against nothing. It is competing for attention, tokens, and trust.

Error messages are agent UX now

The iteration loop is where old-school developer experience suddenly becomes agent infrastructure. Agents run builds, observe terminal output, read diagnostics, and try again. If your CLI says Error: operation failed, the agent does not infer the missing environment variable from years of support tickets. It guesses. If your language server emits a precise deprecation warning with the replacement API, the agent may repair the code before the developer sees the failure.

This changes the ROI calculation for boring DX work. Clear error messages, structured diagnostics, stable exit codes, concise examples, and version-specific repair hints are no longer merely nice for humans. They are control signals for automated code generation. A test runner that says exactly which field violated which schema is an agent steering mechanism. A CLI that prints the command to fix a missing auth scope is a self-repair path. A docs page that distinguishes SDK v4 from v5 in machine-readable, task-shaped chunks is cheaper than reviewing yet another AI-generated diff that imported the wrong package.

The security angle is just as important. If agents consume tool output as context, tool output is an input boundary. Issue comments, scraped webpages, database rows, and third-party API responses can carry instruction-shaped text back into the model. That makes MCP governance, response sanitization, audit logs, tool permissions, and repository instructions part of the same story. AX without governance is just a faster way to produce confident mistakes.

So what should engineering teams do Monday morning? Start by inventorying every surface an agent can see: official docs, README files, code samples, MCP tools, skills, repo instructions, CLIs, SDK errors, language-server diagnostics, linter output, test failures, and generated examples. Rewrite tool descriptions in the vocabulary developers actually use. Return concise, current, structured responses instead of reference-manual dumps. Add explicit versioning to examples. Create evaluation tasks for stale API traps, auth setup, migrations, framework upgrades, and security-sensitive defaults. Measure not just whether a tool was called, but whether it improved the accepted diff rate, test pass rate, reviewer rework, token cost, and repair success.

Most importantly, treat agent experience as a product surface, not a prompt-engineering chore. The teams that win with coding agents will not be the teams with the cleverest magic incantation in AGENTS.md. They will be the teams whose systems are legible to machines without becoming unsafe for humans: small tool schemas, current examples, precise diagnostics, reviewable policies, and enough observability to know when the agent routed around the official path and started coding from vibes.

Microsoft is naming the work before the market has fully priced it. That is useful. “Agent Experience” sounds like branding, because of course it does, but the underlying point is solid: if agents are now users of your technology, your developer experience has acquired a second customer. Ignore that customer and your SDK will still have docs. The agent just will not use them correctly.

Sources: Microsoft Developer Blog, Microsoft AX Stack, Visual Studio Code Blog

Your documentation now has a machine reader with terrible judgment

The harness is where model fandom goes to die

Error messages are agent UX now

Sign up for more like this.