Gemini 3.1 Deep Think Is Google’s Agentic Coding Flex — But Bring Your Own Eval Harness

Anatoliy Kolodkin

14 May 2026 • 5 min read

Google’s Gemini 3.1 update is not subtle about the audience it wants: developers choosing which model gets to touch a terminal, a repo, a browser session, and eventually production-adjacent workflow state. The DeepMind model page now presents Gemini 3.1 Pro and Gemini 3.1 Deep Think less as chat upgrades and more as the reasoning layer for agentic software work — with benchmark rows for Terminal-Bench, SWE-Bench, MCP Atlas, BrowseComp, APEX-Agents, and long-context tasks doing the sales pitch.

That matters because the AI coding market has mostly outgrown the “which chatbot writes prettier snippets?” phase. The useful question in 2026 is whether a model can inspect a codebase, select tools without flailing, recover from bad intermediate steps, keep a long task coherent, and avoid turning the shell into a confetti cannon. Google is making the case that Gemini belongs in that conversation, not as an Android-adjacent assistant, but as a serious agent-platform candidate.

The benchmark table is aimed at coding-agent buyers

DeepMind describes Gemini 3.1 Deep Think as a significant upgrade to Gemini 3.1’s specialized reasoning mode for “complex technical problems,” available to Google AI Ultra subscribers. The broader Gemini 3 framing is familiar but important: Gemini 1 brought native multimodality and long context, Gemini 2 added thinking, reasoning, and tool use, and Gemini 3 is supposed to merge those pieces into something builders can actually ship against.

The numbers are the hook. Google claims Gemini 3.1 Pro Thinking reaches 68.5% on Terminal-Bench 2.0, 80.6% on SWE-Bench Verified, 54.2% on SWE-Bench Pro public, 69.2% on MCP Atlas, 85.9% on BrowseComp, 33.5% on APEX-Agents, and 2887 Elo on LiveCodeBench Pro. It also posts 44.4% on Humanity’s Last Exam without tools and 51.4% with Search plus Code, alongside 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond.

Those rows are not random trophy metrics. Terminal-Bench tests agentic terminal work. SWE-Bench asks whether the system can solve real software issues. MCP Atlas points toward multi-step workflows over the Model Context Protocol, which is quickly becoming the messy middle where tools, permissions, and integration quality live. BrowseComp measures grounded search and browsing. APEX-Agents gets closer to long-horizon professional tasks. Together, they say: do not evaluate Gemini as a writing assistant; evaluate it as a worker inside a harness.

Google’s own comparison shows meaningful gains over Gemini 3 Pro Thinking: Terminal-Bench rises from 56.9% to 68.5%, SWE-Bench Verified from 76.2% to 80.6%, SWE-Bench Pro from 43.3% to 54.2%, APEX-Agents from 18.4% to 33.5%, MCP Atlas from 54.1% to 69.2%, and BrowseComp from 59.2% to 85.9%. That is the kind of movement that deserves a real bake-off, not a shrug.
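To see which rows moved most, it helps to put the reported pairs side by side. The sketch below only does arithmetic on the figures quoted above; the `REPORTED` table and the `gains` helper are illustrative names, not anything Google publishes.

```python
# Scores in percent, as quoted in this article.
# Each pair is (Gemini 3 Pro Thinking, Gemini 3.1 Pro Thinking).
REPORTED = {
    "Terminal-Bench 2.0": (56.9, 68.5),
    "SWE-Bench Verified": (76.2, 80.6),
    "SWE-Bench Pro": (43.3, 54.2),
    "APEX-Agents": (18.4, 33.5),
    "MCP Atlas": (54.1, 69.2),
    "BrowseComp": (59.2, 85.9),
}


def gains(scores):
    """Return {benchmark: (gain in percentage points, relative gain in percent)}."""
    out = {}
    for name, (old, new) in scores.items():
        points = round(new - old, 1)
        relative = round((new - old) / old * 100, 1)
        out[name] = (points, relative)
    return out


if __name__ == "__main__":
    # Print benchmarks ordered by absolute gain, largest first.
    ordered = sorted(gains(REPORTED).items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (points, relative) in ordered:
        print(f"{name:20s} +{points:.1f} pts  (+{relative:.1f}% relative)")
```

The point gains and the relative gains rank the benchmarks differently, which is one more reason to read each row against the kind of work you actually run.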

Long context is useful. Lazy context is still expensive.

The Cloud model reference lists gemini-3.1-pro-preview with a 1,048,576-token maximum input window and a 65,536-token output limit. It accepts text, code, images, audio, video, and PDFs, and supports grounding with Google Search, code execution, system instructions, structured output, function calling, thinking, context caching, Vertex AI RAG Engine, and OpenAI-compatible chat completions. It is in public preview, globally available, with a January 2025 knowledge cutoff.

That 1M-token window is genuinely useful for code archaeology, large migration planning, incident retrospectives, technical due diligence, and multimodal workflows where the model needs the whole mess in view. But it should not become an excuse to stop engineering the context layer. If your agent architecture is “stuff the repo into the prompt and pray,” a bigger context window just makes the bill larger without making the design any better.

The pricing table is the part teams should read before letting demos set architecture. Standard Gemini 3.1 Pro Preview pricing is $2 per 1M input tokens up to 200K input tokens and $4 per 1M input tokens above 200K. Text output, including reasoning, is $12 per 1M tokens up to 200K input and $18 above that threshold. Cached input is $0.20 or $0.40 depending on context size; Flex/Batch halves the core rates to $1/$2 input and $6/$9 output.
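A rough per-request estimator makes those tiers concrete. This is a minimal sketch under stated assumptions: it uses only the rates quoted above, it assumes the higher tier applies to the whole request once total input (fresh plus cached) exceeds 200K tokens, and it ignores any cache storage fees. The function name `estimate_cost` and its parameters are hypothetical; check the current pricing page before budgeting against it.

```python
TIER_THRESHOLD = 200_000  # input tokens; rates step up above this size

# USD per 1M tokens as quoted in the article: (<= 200K input, > 200K input).
RATES = {
    "standard": {"input": (2.00, 4.00), "output": (12.00, 18.00), "cached": (0.20, 0.40)},
    # The article gives no cached-input rate for Flex/Batch, so none is listed here.
    "batch": {"input": (1.00, 2.00), "output": (6.00, 9.00)},
}


def estimate_cost(input_tokens, output_tokens, cached_tokens=0, mode="standard"):
    """Estimate the USD cost of one request.

    input_tokens counts fresh (uncached) input; cached_tokens counts input
    served from the context cache. ASSUMPTION: the tier is picked by total
    input size and applied to every token in the request.
    """
    rates = RATES[mode]
    if cached_tokens and "cached" not in rates:
        raise ValueError(f"no cached-input rate known for mode {mode!r}")
    tier = 1 if input_tokens + cached_tokens > TIER_THRESHOLD else 0
    total = input_tokens * rates["input"][tier] + output_tokens * rates["output"][tier]
    if cached_tokens:
        total += cached_tokens * rates["cached"][tier]
    return total / 1_000_000


if __name__ == "__main__":
    print(f"40K in, 8K out:            ${estimate_cost(40_000, 8_000):.3f}")
    print(f"900K in, 8K out:           ${estimate_cost(900_000, 8_000):.3f}")
    print(f"900K in, 8K out (batch):   ${estimate_cost(900_000, 8_000, mode='batch'):.3f}")
    print(f"20K fresh + 880K cached:   ${estimate_cost(20_000, 8_000, cached_tokens=880_000):.3f}")
```

Under these assumptions a single 900K-token prompt with 8K tokens of output costs about $3.74, against roughly $0.18 for a 40K-token prompt with the same output. Multiply by the number of agent turns per task and the architecture question answers itself.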

The practical read: retrieve first, summarize second, cache aggressively, and reserve maximum-context runs for tasks where missing context is more expensive than tokens. Long context is a capability. Treating every task like a deposition archive is a cost-control bug wearing a model-feature hat.
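"Retrieve first" can be as plain as ranking candidate files against the task and packing only what fits a token budget. The sketch below is a toy, not a retrieval system: the keyword-overlap score, the four-characters-per-token estimate, and the names `pack_context`, `score`, and `estimate_tokens` are all assumptions made for illustration.

```python
def estimate_tokens(text):
    """Very rough token estimate. ASSUMPTION: about 4 characters per token."""
    return max(1, len(text) // 4)


def score(task, text):
    """Toy relevance score: how many distinct task words occur in the text."""
    words = {w for w in task.lower().split() if len(w) > 2}
    body = text.lower()
    return sum(1 for w in words if w in body)


def pack_context(task, documents, budget_tokens):
    """Choose the most relevant documents that fit within budget_tokens.

    documents maps a name to its text. Returns (chosen names, tokens used).
    Documents with no overlap with the task are never included.
    """
    ranked = sorted(documents, key=lambda name: score(task, documents[name]), reverse=True)
    chosen, used = [], 0
    for name in ranked:
        if score(task, documents[name]) == 0:
            break  # everything after this point is irrelevant by this score
        cost = estimate_tokens(documents[name])
        if used + cost <= budget_tokens:
            chosen.append(name)
            used += cost
    return chosen, used


if __name__ == "__main__":
    docs = {
        "auth.py": "def login(user): check password token",
        "billing.py": "def invoice(total): charge card",
        "README.md": "project overview " * 50,
    }
    print(pack_context("fix login token expiry", docs, budget_tokens=500))
```

A production version would swap the scoring for embeddings or a code index, but the shape stays the same: the budget is an explicit parameter rather than whatever the context window happens to allow.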

The custom-tools endpoint is the real tell

Google Cloud also lists gemini-3.1-pro-preview-customtools, an endpoint optimized for agentic workflows that use bash and custom tools such as view_file or search_code. Google says it is better at prioritizing custom tools, while warning that quality may fluctuate in use cases that do not benefit from bash or custom-tool workflows. Pricing matches Gemini 3.1 Pro, but Provisioned Throughput is not supported for the custom-tools endpoint.
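For readers who have not built this layer, here is roughly what "custom tools" means on the harness side. The tool names view_file and search_code come from Google's description; everything else in this sketch (the signatures, the workspace check, the `dispatch` call shape) is a hypothetical local implementation, and no Google API is called.

```python
import os
import tempfile


def view_file(root, path, start=1, end=200):
    """Return lines start..end (1-indexed, inclusive) of a file inside root."""
    root_real = os.path.realpath(root)
    full = os.path.realpath(os.path.join(root, path))
    # Refuse paths that resolve outside the workspace (e.g. "../secrets").
    if os.path.commonpath([root_real, full]) != root_real:
        raise PermissionError(f"{path} escapes the workspace")
    with open(full, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    return "\n".join(f"{n}: {line}" for n, line in enumerate(lines[start - 1:end], start=start))


def search_code(root, needle, max_hits=20):
    """Return up to max_hits 'relative/path:line: text' hits for a literal substring."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            full = os.path.join(dirpath, name)
            try:
                with open(full, encoding="utf-8") as fh:
                    for n, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append(f"{os.path.relpath(full, root)}:{n}: {line.strip()}")
                            if len(hits) >= max_hits:
                                return hits
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return hits


TOOLS = {"view_file": view_file, "search_code": search_code}


def dispatch(root, call):
    """Run one tool call shaped like {"name": ..., "args": {...}} (hypothetical shape)."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"ok": False, "error": f"unknown tool {call['name']}"}
    try:
        return {"ok": True, "result": fn(root, **call["args"])}
    except Exception as exc:  # report failures to the model instead of crashing the loop
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        with open(os.path.join(workspace, "app.py"), "w", encoding="utf-8") as fh:
            fh.write("def handler():\n    return 42\n")
        print(dispatch(workspace, {"name": "search_code", "args": {"needle": "handler"}}))
        print(dispatch(workspace, {"name": "view_file", "args": {"path": "../etc/passwd"}}))
```

Returning errors as data rather than raising is a deliberate choice here: recovery after a failed call is one of the behaviors worth measuring, and the model can only recover from failures it is allowed to see.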

That is more interesting than another “better reasoning” claim. It acknowledges what every serious agent builder already knows: model quality and harness quality are now entangled. The same base intelligence can behave very differently depending on whether the runtime offers structured file tools, safe shell execution, patch application, search, browser access, memory, MCP servers, and stable output contracts. A model tuned to choose search_code instead of dumping shell guesses into a terminal is not universally smarter; it is better aligned to a particular work loop.

So the engineering move is not “switch everything to the custom-tools endpoint.” The move is to test both endpoints against your own tasks. Run repo navigation, bug fixes, refactors, flaky-test diagnosis, dependency updates, spreadsheet-style analysis, document synthesis, and multi-tool workflows separately. Measure success rate, latency, token cost, number of tool calls, refusal/overreach behavior, and recovery after failed commands. If you care about safety, include adversarial tests: malicious repo instructions, prompt-injected documentation, suspicious MCP servers, destructive shell suggestions, and credential-looking files.
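The measurement loop itself does not need to be elaborate. The sketch below covers four of the metrics named above (success rate, latency, token use, tool-call count); refusal and recovery behavior need task-specific graders and are left out. `RunResult`, `run_suite`, `summarize`, and the `stub_agent` stand-in are all hypothetical names, and the stub returns synthetic values only so the harness runs end to end. Swap in a real client per endpoint.

```python
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    success: bool
    tokens: int
    tool_calls: int
    seconds: float = 0.0


def run_suite(agent, tasks):
    """Run each task through agent(task) -> RunResult and record wall-clock time."""
    results = []
    for task in tasks:
        started = time.perf_counter()
        try:
            result = agent(task)
        except Exception:
            # A crash counts as a failed task, not a failed evaluation run.
            result = RunResult(success=False, tokens=0, tool_calls=0)
        result.seconds = time.perf_counter() - started
        results.append(result)
    return results


def summarize(results):
    """Aggregate per-task results into the headline numbers for one configuration."""
    return {
        "success_rate": sum(r.success for r in results) / len(results),
        "mean_seconds": mean(r.seconds for r in results),
        "mean_tokens": mean(r.tokens for r in results),
        "mean_tool_calls": mean(r.tool_calls for r in results),
    }


def stub_agent(task):
    """Placeholder agent with SYNTHETIC behavior; replace with a real endpoint call."""
    if task == "boom":
        raise RuntimeError("simulated harness failure")
    words = task.split()
    return RunResult(success=task.startswith("easy"), tokens=100 * len(words), tool_calls=len(words))


if __name__ == "__main__":
    tasks = ["easy: rename variable", "hard: flaky test", "easy: bump dependency", "boom"]
    print(summarize(run_suite(stub_agent, tasks)))
```

Run the same task list once per endpoint and per task category, and keep the categories separate in the report. An average across bug fixes and document synthesis hides exactly the differences the bake-off is supposed to expose.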

The competitive read should be equally sober. Google’s table puts Gemini 3.1 Pro in the serious tier, but it does not end the Claude-vs-Codex-vs-Gemini argument. SWE-Bench Pro still shows Codex-flavored results ahead in Google’s own table, and different models will continue to win different work shapes. Claude may remain the better long-running pair-programmer for some teams. Codex may fit background code-review and patch-generation loops. Gemini may win high-context multimodal and tool-heavy tasks. The correct answer is an eval harness, not a fandom.

For practitioners, the checklist is simple. Add Gemini 3.1 Pro to your coding-agent bake-off. Test the regular and custom-tools endpoints. Price long-context runs before building workflows around them. Verify structured output, tool-call reliability, repo trust boundaries, MCP behavior, shell safety, and failure recovery. Do not migrate because a benchmark table looks good; migrate because your own task suite says the model earns the chair.

My take: this is Google’s “prove it” moment for agentic coding. Gemini 3.1 is no longer being positioned as a clever multimodal model with a giant context window. It is being pitched as the reasoning substrate for agents that browse, code, call tools, and work through long tasks. That is a real shift. LGTM for evaluation. Request changes if anyone tries to turn leaderboard screenshots into procurement.

Sources: Google DeepMind, Google Cloud model documentation, Google Cloud pricing
