Agentic Coding: Models
Not all models are equal for coding tasks. Some excel at reasoning, others at speed, others at following instructions. Here's how to pick the right model for the job.
Last Updated: April 5, 2026
The 2026 Model Landscape
Four major families dominate agentic coding:
- Anthropic Claude: Claude Opus 4.5, Sonnet 4.5, Haiku 3.5
- OpenAI GPT: GPT-5, GPT-5 mini, o3 (reasoning)
- Google Gemini: Gemini 3 Pro, Ultra, Flash
- Open Source: Llama 4, Qwen 3, DeepSeek V4
Model Comparison Matrix
| Model | Strength | Best For | Context | Speed |
|---|---|---|---|---|
| Claude Opus 4.5 | Reasoning depth | Architecture, complex refactors | 1M tokens | Slow |
| Claude Sonnet 4.5 | Balance | General coding, most tasks | 200K tokens | Medium |
| GPT-5 | Instruction following | Clear specs, API integration | 128K tokens | Fast |
| o3 (reasoning) | Step-by-step thinking | Debugging, complex logic | 200K tokens | Slow |
| Gemini 3 Pro | Multimodal, context | Large codebases, docs | 1M tokens | Medium |
| Gemini 3 Flash | Speed, cost | Quick edits, autocomplete | 1M tokens | Fast |
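The matrix above can also be treated as data. A minimal sketch of filtering it programmatically; the model names and attributes mirror the table, while the lookup function itself is illustrative:

```python
# Illustrative lookup over the comparison matrix above. Names and
# attributes come from the table; the selection logic is a sketch.
MODELS = {
    "claude-opus-4.5":   {"context": 1_000_000, "speed": "slow",   "best_for": "architecture"},
    "claude-sonnet-4.5": {"context": 200_000,   "speed": "medium", "best_for": "general coding"},
    "gpt-5":             {"context": 128_000,   "speed": "fast",   "best_for": "clear specs"},
    "o3":                {"context": 200_000,   "speed": "slow",   "best_for": "debugging"},
    "gemini-3-pro":      {"context": 1_000_000, "speed": "medium", "best_for": "large codebases"},
    "gemini-3-flash":    {"context": 1_000_000, "speed": "fast",   "best_for": "quick edits"},
}

def models_fitting(context_needed: int, require_fast: bool = False) -> list[str]:
    """Return models whose window covers the task, optionally fast ones only."""
    return [
        name for name, m in MODELS.items()
        if m["context"] >= context_needed
        and (not require_fast or m["speed"] == "fast")
    ]
```

For a 500K-token task, for example, only the 1M-window models qualify, and only Gemini 3 Flash among them is fast.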
Benchmarks: SWE-bench Verified
SWE-bench Verified is the gold standard for measuring real-world coding ability. It tests models on actual GitHub issues from popular repositories.
Leaderboard (April 2026)
| Model/System | SWE-bench Verified | Notes |
|---|---|---|
| Claude Opus 4.5 (Max) | 80.9% | Current leader |
| Claude Sonnet 4.5 | 78.4% | Best value |
| Antigravity (multi-agent) | 76.2% | Free tier |
| GPT-5 + scaffolding | 74.1% | Strong with tools |
| Gemini 3 Pro | 72.8% | Rapid improvement |
| Cursor + GPT-5 | 71.2% | IDE integration helps |
Important Reality Check: SWE-bench Pro
SWE-bench Pro tests on harder, production-grade issues. Scores are much lower:
| Model | SWE-bench Pro |
|---|---|
| Claude Opus 4.5 | 23% |
| Best open source | 12% |
Translation: Even the best models fail roughly 3 out of 4 times on genuinely hard problems. Human oversight remains essential.
Reasoning vs Speed
When to Use Reasoning Models (o3, Opus)
- Architectural decisions with tradeoffs
- Complex debugging (root cause analysis)
- Refactoring across many files
- Algorithm design
- Security-sensitive code
Trade-off: 10-30 seconds of "thinking" before generating code. Worth it when direction matters more than velocity.
When to Use Fast Models (GPT-5, Flash)
- Quick edits and small fixes
- Autocomplete and inline suggestions
- Well-defined, scoped tasks
- Rapid prototyping
- High-volume, low-complexity work
Trade-off: Less depth, more speed. Use when iteration rate matters.
Context Window Strategies
Small Windows (32K-128K)
Models: GPT-5, older Claude versions
- Cheaper, faster
- Need careful context management
- Use @file references, not paste
- Good for focused, single-file tasks
Medium Windows (200K)
Models: Claude Sonnet 4.5, o3
- Sweet spot for most projects
- Can hold entire modules
- Still need selective context
Large Windows (1M tokens)
Models: Claude Opus 4.5, Gemini 3
- Can process entire codebases
- No chunking complexity
- More expensive per request
- Best for architecture-level work
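The tiers above can be put into a small helper. This sketch uses the rough 4-characters-per-token heuristic (an approximation, not a real tokenizer) to estimate context size and pick a tier:

```python
# Rough token estimate: ~4 characters per token is a common heuristic
# for English text and code, not an exact tokenizer count.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def window_tier(total_tokens: int) -> str:
    """Map an estimated context size to the window tiers described above."""
    if total_tokens <= 128_000:
        return "small"    # GPT-5 class: cheap, needs careful curation
    if total_tokens <= 200_000:
        return "medium"   # Sonnet 4.5 / o3: fits whole modules
    return "large"        # Opus 4.5 / Gemini 3: whole-codebase work
```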
Multi-Model Strategies
Tiered Approach
Many teams use different models for different phases:
- Planning: Opus 4.5 or o3 (reasoning)
- Implementation: Sonnet 4.5 or GPT-5 (balance)
- Testing: GPT-5 mini or Flash (speed)
- Review: Opus 4.5 (quality)
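The tiered approach above amounts to a phase-to-model table. A minimal sketch, with a hypothetical fallback to the balanced model:

```python
# Phase -> model routing table following the tiers above.
# The fallback choice is an assumption, not a fixed rule.
PHASE_MODELS = {
    "planning":       "claude-opus-4.5",    # reasoning depth
    "implementation": "claude-sonnet-4.5",  # balance
    "testing":        "gpt-5-mini",         # speed
    "review":         "claude-opus-4.5",    # quality gate
}

def model_for_phase(phase: str) -> str:
    # Unlisted phases fall back to the balanced general-purpose model.
    return PHASE_MODELS.get(phase, "claude-sonnet-4.5")
```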
Model Routing
Some tools (Cursor, Antigravity) can automatically route to appropriate models:
```python
if task.complexity > threshold:
    use_reasoning_model()
elif task.scope == "single_file":
    use_fast_model()
else:
    use_balanced_model()
```
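Fleshed out, that routing logic might look like the sketch below. The `Task` shape, the complexity threshold, and the model tier names are all hypothetical; real tools score complexity with their own heuristics or classifiers:

```python
from dataclasses import dataclass

@dataclass
class Task:
    complexity: int  # e.g. 1-10, scored by a heuristic or classifier
    scope: str       # "single_file" or "multi_file"

COMPLEXITY_THRESHOLD = 7  # hypothetical cutoff

def route(task: Task) -> str:
    """Route a task to a model tier, mirroring the pseudocode above."""
    if task.complexity > COMPLEXITY_THRESHOLD:
        return "reasoning"   # e.g. o3 / Opus 4.5
    if task.scope == "single_file":
        return "fast"        # e.g. GPT-5 / Flash
    return "balanced"        # e.g. Sonnet 4.5
```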
BYOM: Bring Your Own Model
Several tools support connecting custom or third-party models:
| Tool | BYOM Support | Notes |
|---|---|---|
| Cursor | ✅ Full | Any OpenAI-compatible API |
| Continue.dev | ✅ Full | Open source, flexible |
| Cline | ✅ Full | API costs only |
| Aider | ✅ Full | Local models supported |
| Claude Code | ❌ No | Anthropic only |
| OpenAI Codex | ❌ No | OpenAI only |
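"OpenAI-compatible API" in the table means the tool speaks the standard chat-completions wire format. A minimal sketch of the request body such a tool would send to a custom endpoint; the URL (Ollama's default port) and model name here are placeholders:

```python
# Build (but don't send) an OpenAI-compatible chat request: the wire
# format BYOM tools expect a custom endpoint to speak. The base URL
# and model name below are placeholders, not recommendations.
def byom_request(base_url: str, model: str, prompt: str) -> dict:
    return {
        "url": f"{base_url}/v1/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

req = byom_request("http://localhost:11434", "qwen3-coder", "Explain this diff")
```

Any server that answers this shape of request, whether a cloud provider or a local runtime, can back a tool with full BYOM support.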
Local Models
Running models locally (via Ollama, LM Studio, etc.) offers:
- Privacy: Code never leaves your machine
- No rate limits: Unlimited usage
- Cost: Hardware cost only
- Latency: No network round-trips
Reality check: Local models lag cloud models significantly on coding benchmarks. Best for:
- Autocomplete and simple suggestions
- Highly sensitive codebases
- Offline environments
- Experimentation
Cost Optimization
Token Management
- Clear context between tasks (/clear)
- Use @file references instead of paste
- Keep project docs (CLAUDE.md) lean
- Target sessions under 30K tokens
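The 30K-token guideline above is easy to track mechanically. A minimal sketch, again using the rough 4-characters-per-token estimate rather than a real tokenizer:

```python
# Minimal session-budget tracker for the "keep sessions under 30K
# tokens" guideline. The chars//4 estimate is a rough heuristic.
SESSION_BUDGET = 30_000

class Session:
    def __init__(self) -> None:
        self.tokens = 0

    def add(self, text: str) -> None:
        self.tokens += len(text) // 4

    def should_clear(self) -> bool:
        """True once the session exceeds the budget: time for /clear."""
        return self.tokens > SESSION_BUDGET
```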
Tier Selection
- Use Pro/standard tiers for routine work
- Reserve Max/Ultra for complex problems
- Monitor usage, right-size plans
Hybrid Approaches
- Copilot Pro ($10) for daily completions
- Cursor/Claude for complex work
- Save 40-50% vs. an all-premium approach
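The arithmetic behind the hybrid savings can be sketched directly. The $10 Copilot Pro figure comes from the text; the all-premium spend and the share of work routed to premium tools are illustrative assumptions:

```python
# Illustrative monthly cost comparison. $10 Copilot Pro is from the
# text; the all-premium spend and premium share are assumptions.
def hybrid_cost(premium_price: float, premium_share: float,
                copilot_price: float = 10.0) -> float:
    """Copilot for daily completions plus premium for a share of the work."""
    return copilot_price + premium_price * premium_share

all_premium = 100.0                     # assumed all-premium monthly spend
hybrid = hybrid_cost(all_premium, 0.4)  # premium reserved for ~40% of work
savings = 1 - hybrid / all_premium      # lands in the 40-50% band cited
```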
The Bottom Line
Start with: Claude Sonnet 4.5 or GPT-5. These hit the sweet spot of capability, speed, and cost for most tasks.
Upgrade to reasoning models when:
- Task requires tradeoff analysis
- Debugging complex failures
- Architecture decisions
- Quality > velocity
Use fast models when:
- Well-defined, scoped tasks
- Rapid iteration needed
- Volume > complexity
The model you choose matters less than how you use it. Clear prompts, good context, and tight feedback loops beat raw model capability.
Further Reading
- Agentic Coding: Tools — Tool-specific model support
- Agentic Coding: Best Practices — Prompting strategies