Agentic Coding: Models
Not all models are equal for coding tasks. Some excel at reasoning, others at speed, others at following instructions. Here's how to pick the right model for the job.
Last Updated: April 5, 2026
The 2026 Model Landscape
Four major families dominate agentic coding:
- Anthropic Claude: Claude Opus 4.5, Sonnet 4.5, Haiku 3.5
- OpenAI GPT: GPT-5, GPT-5 mini, o3 (reasoning)
- Google Gemini: Gemini 3 Pro, Ultra, Flash
- Open Source: Llama 4, Qwen 3, DeepSeek V4
Model Comparison Matrix
| Model | Strength | Best For | Context | Speed |
|---|---|---|---|---|
| Claude Opus 4.5 | Reasoning depth | Architecture, complex refactors | 1M tokens | Slow |
| Claude Sonnet 4.5 | Balance | General coding, most tasks | 200K tokens | Medium |
| GPT-5 | Instruction following | Clear specs, API integration | 128K tokens | Fast |
| o3 (reasoning) | Step-by-step thinking | Debugging, complex logic | 200K tokens | Slow |
| Gemini 3 Pro | Multimodal, context | Large codebases, docs | 1M tokens | Medium |
| Gemini 3 Flash | Speed, cost | Quick edits, autocomplete | 1M tokens | Fast |
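The matrix above can also be treated as data. A minimal sketch of filtering it programmatically; the model names and attributes mirror the table, while the lookup function itself is illustrative:

```python
# Illustrative lookup over the comparison matrix above. Names and
# attributes come from the table; the selection logic is a sketch.
MODELS = {
    "claude-opus-4.5":   {"context": 1_000_000, "speed": "slow",   "best_for": "architecture"},
    "claude-sonnet-4.5": {"context": 200_000,   "speed": "medium", "best_for": "general coding"},
    "gpt-5":             {"context": 128_000,   "speed": "fast",   "best_for": "clear specs"},
    "o3":                {"context": 200_000,   "speed": "slow",   "best_for": "debugging"},
    "gemini-3-pro":      {"context": 1_000_000, "speed": "medium", "best_for": "large codebases"},
    "gemini-3-flash":    {"context": 1_000_000, "speed": "fast",   "best_for": "quick edits"},
}

def models_fitting(context_needed: int, require_fast: bool = False) -> list[str]:
    """Return models whose window covers the task, optionally fast ones only."""
    return [
        name for name, m in MODELS.items()
        if m["context"] >= context_needed
        and (not require_fast or m["speed"] == "fast")
    ]
```

For a 500K-token task, for example, only the 1M-window models qualify, and only Gemini 3 Flash among them is fast.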
Benchmarks: SWE-bench Verified
SWE-bench Verified is the gold standard for measuring real-world coding ability. It tests models on actual GitHub issues from popular repositories.
Leaderboard (April 2026)
| Model/System | SWE-bench Verified | Notes |
|---|---|---|
| Claude Opus 4.5 (Max) | 80.9% | Current leader |
| Claude Sonnet 4.5 | 78.4% | Best value |
| Antigravity (multi-agent) | 76.2% | Free tier |
| GPT-5 + scaffolding | 74.1% | Strong with tools |
| Gemini 3 Pro | 72.8% | Rapid improvement |
| Cursor + GPT-5 | 71.2% | IDE integration helps |
Important Reality Check: SWE-bench Pro
SWE-bench Pro tests on harder, production-grade issues. Scores are much lower:
| Model | SWE-bench Pro |
|---|---|
| Claude Opus 4.5 | 23% |
| Best open source | 12% |
Translation: Even the best models fail roughly 3 out of 4 times on genuinely hard problems. Human oversight remains essential.
Reasoning vs Speed
When to Use Reasoning Models (o3, Opus)
- Architectural decisions with tradeoffs
- Complex debugging (root cause analysis)
- Refactoring across many files
- Algorithm design
- Security-sensitive code
Trade-off: 10-30 seconds of "thinking" before generating code. Worth it when direction matters more than velocity.
When to Use Fast Models (GPT-5, Flash)
- Quick edits and small fixes
- Autocomplete and inline suggestions
- Well-defined, scoped tasks
- Rapid prototyping
- High-volume, low-complexity work
Trade-off: Less depth, more speed. Use when iteration rate matters.
Context Window Strategies
Small Windows (32K-128K)
Models: GPT-5, older Claude versions
- Cheaper, faster
- Need careful context management
- Use @file references, not paste
- Good for focused, single-file tasks
Medium Windows (200K)
Models: Claude Sonnet 4.5, o3
- Sweet spot for most projects
- Can hold entire modules
- Still need selective context
Large Windows (1M tokens)
Models: Claude Opus 4.5, Gemini 3
- Can process entire codebases
- No chunking complexity
- More expensive per request
- Best for architecture-level work
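The tiers above can be put into a small helper. This sketch uses the rough 4-characters-per-token heuristic (an approximation, not a real tokenizer) to estimate context size and pick a tier:

```python
# Rough token estimate: ~4 characters per token is a common heuristic
# for English text and code, not an exact tokenizer count.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def window_tier(total_tokens: int) -> str:
    """Map an estimated context size to the window tiers described above."""
    if total_tokens <= 128_000:
        return "small"    # GPT-5 class: cheap, needs careful curation
    if total_tokens <= 200_000:
        return "medium"   # Sonnet 4.5 / o3: fits whole modules
    return "large"        # Opus 4.5 / Gemini 3: whole-codebase work
```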
Multi-Model Strategies
Tiered Approach
Many teams use different models for different phases:
- Planning: Opus 4.5 or o3 (reasoning)
- Implementation: Sonnet 4.5 or GPT-5 (balance)
- Testing: GPT-5 mini or Flash (speed)
- Review: Opus 4.5 (quality)
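The tiered approach above amounts to a phase-to-model table. A minimal sketch, with a hypothetical fallback to the balanced model:

```python
# Phase -> model routing table following the tiers above.
# The fallback choice is an assumption, not a fixed rule.
PHASE_MODELS = {
    "planning":       "claude-opus-4.5",    # reasoning depth
    "implementation": "claude-sonnet-4.5",  # balance
    "testing":        "gpt-5-mini",         # speed
    "review":         "claude-opus-4.5",    # quality gate
}

def model_for_phase(phase: str) -> str:
    # Unlisted phases fall back to the balanced general-purpose model.
    return PHASE_MODELS.get(phase, "claude-sonnet-4.5")
```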
Model Routing
Some tools (Cursor, Antigravity) can automatically route to appropriate models:
```python
if task.complexity > threshold:
    use_reasoning_model()
elif task.scope == "single_file":
    use_fast_model()
else:
    use_balanced_model()
```
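Fleshed out, that routing logic might look like the sketch below. The `Task` shape, the complexity threshold, and the model tier names are all hypothetical; real tools score complexity with their own heuristics or classifiers:

```python
from dataclasses import dataclass

@dataclass
class Task:
    complexity: int  # e.g. 1-10, scored by a heuristic or classifier
    scope: str       # "single_file" or "multi_file"

COMPLEXITY_THRESHOLD = 7  # hypothetical cutoff

def route(task: Task) -> str:
    """Route a task to a model tier, mirroring the pseudocode above."""
    if task.complexity > COMPLEXITY_THRESHOLD:
        return "reasoning"   # e.g. o3 / Opus 4.5
    if task.scope == "single_file":
        return "fast"        # e.g. GPT-5 / Flash
    return "balanced"        # e.g. Sonnet 4.5
```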
BYOM: Bring Your Own Model
Several tools support connecting custom or third-party models:
| Tool | BYOM Support | Notes |
|---|---|---|
| Cursor | ✅ Full | Any OpenAI-compatible API |
| Continue.dev | ✅ Full | Open source, flexible |
| Cline | ✅ Full | API costs only |
| Aider | ✅ Full | Local models supported |
| Claude Code | ❌ No | Anthropic only |
| OpenAI Codex | ❌ No | OpenAI only |
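"OpenAI-compatible API" in the table means the tool speaks the standard chat-completions wire format. A minimal sketch of the request body such a tool would send to a custom endpoint; the URL (Ollama's default port) and model name here are placeholders:

```python
# Build (but don't send) an OpenAI-compatible chat request: the wire
# format BYOM tools expect a custom endpoint to speak. The base URL
# and model name below are placeholders, not recommendations.
def byom_request(base_url: str, model: str, prompt: str) -> dict:
    return {
        "url": f"{base_url}/v1/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

req = byom_request("http://localhost:11434", "qwen3-coder", "Explain this diff")
```

Any server that answers this shape of request, whether a cloud provider or a local runtime, can back a tool with full BYOM support.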
Local Models
Running models locally (via Ollama, LM Studio, etc.) offers:
- Privacy: Code never leaves your machine
- No rate limits: Unlimited usage
- Cost: Hardware cost only
- Latency: No network round-trips
Reality check: Local models lag cloud models significantly on coding benchmarks. Best for:
- Autocomplete and simple suggestions
- Highly sensitive codebases
- Offline environments
- Experimentation
Cost Optimization
Token Management
- Clear context between tasks (/clear)
- Use @file references instead of paste
- Keep project docs (CLAUDE.md) lean
- Target sessions under 30K tokens
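The 30K-token guideline above is easy to track mechanically. A minimal sketch, again using the rough 4-characters-per-token estimate rather than a real tokenizer:

```python
# Minimal session-budget tracker for the "keep sessions under 30K
# tokens" guideline. The chars//4 estimate is a rough heuristic.
SESSION_BUDGET = 30_000

class Session:
    def __init__(self) -> None:
        self.tokens = 0

    def add(self, text: str) -> None:
        self.tokens += len(text) // 4

    def should_clear(self) -> bool:
        """True once the session exceeds the budget: time for /clear."""
        return self.tokens > SESSION_BUDGET
```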
Tier Selection
- Use Pro/standard tiers for routine work
- Reserve Max/Ultra for complex problems
- Monitor usage, right-size plans
Hybrid Approaches
- Copilot Pro ($10) for daily completions
- Cursor/Claude for complex work
- Save 40-50% vs. an all-premium approach
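The arithmetic behind the hybrid savings can be sketched directly. The $10 Copilot Pro figure comes from the text; the all-premium spend and the share of work routed to premium tools are illustrative assumptions:

```python
# Illustrative monthly cost comparison. $10 Copilot Pro is from the
# text; the all-premium spend and premium share are assumptions.
def hybrid_cost(premium_price: float, premium_share: float,
                copilot_price: float = 10.0) -> float:
    """Copilot for daily completions plus premium for a share of the work."""
    return copilot_price + premium_price * premium_share

all_premium = 100.0                     # assumed all-premium monthly spend
hybrid = hybrid_cost(all_premium, 0.4)  # premium reserved for ~40% of work
savings = 1 - hybrid / all_premium      # lands in the 40-50% band cited
```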
The Bottom Line
Start with: Claude Sonnet 4.5 or GPT-5. These hit the sweet spot of capability, speed, and cost for most tasks.
Upgrade to reasoning models when:
- Task requires tradeoff analysis
- Debugging complex failures
- Architecture decisions
- Quality > velocity
Use fast models when:
- Well-defined, scoped tasks
- Rapid iteration needed
- Volume > complexity
The model you choose matters less than how you use it. Clear prompts, good context, and tight feedback loops beat raw model capability.
Further Reading
- Agentic Coding: Tools — Tool-specific model support
- Agentic Coding: Best Practices — Prompting strategies