Agentic Coding: Models

Not all models are equal for coding tasks. Some excel at reasoning, others at speed, others at following instructions. Here's how to pick the right model for the job.

Last Updated: April 5, 2026

The 2026 Model Landscape

Four major families dominate agentic coding:

  • Anthropic Claude: Claude Opus 4.5, Sonnet 4.5, Haiku 3.5
  • OpenAI GPT: GPT-5, GPT-5 mini, o3 (reasoning)
  • Google Gemini: Gemini 3 Pro, Ultra, Flash
  • Open Source: Llama 4, Qwen 3, DeepSeek V4

Model Comparison Matrix

Model             | Strength              | Best For                        | Context     | Speed
------------------|-----------------------|---------------------------------|-------------|-------
Claude Opus 4.5   | Reasoning depth       | Architecture, complex refactors | 1M tokens   | Slow
Claude Sonnet 4.5 | Balance               | General coding, most tasks      | 200K tokens | Medium
GPT-5             | Instruction following | Clear specs, API integration    | 128K tokens | Fast
o3 (reasoning)    | Step-by-step thinking | Debugging, complex logic        | 200K tokens | Slow
Gemini 3 Pro      | Multimodal, context   | Large codebases, docs           | 1M tokens   | Medium
Gemini 3 Flash    | Speed, cost           | Quick edits, autocomplete       | 1M tokens   | Fast

Benchmarks: SWE-bench Verified

SWE-bench Verified is the gold standard for measuring real-world coding ability. It tests models on actual GitHub issues from popular repositories.

Leaderboard (April 2026)

Model/System              | SWE-bench Verified | Notes
--------------------------|--------------------|----------------------
Claude Opus 4.5 (Max)     | 80.9%              | Current leader
Claude Sonnet 4.5         | 78.4%              | Best value
Antigravity (multi-agent) | 76.2%              | Free tier
GPT-5 + scaffolding       | 74.1%              | Strong with tools
Gemini 3 Pro              | 72.8%              | Rapid improvement
Cursor + GPT-5            | 71.2%              | IDE integration helps

Important Reality Check: SWE-bench Pro

SWE-bench Pro tests on harder, production-grade issues. Scores are much lower:

Model            | SWE-bench Pro
-----------------|--------------
Claude Opus 4.5  | 23%
Best open source | 12%

Translation: even the best model solves fewer than 1 in 4 genuinely hard problems. Human oversight remains essential.

Reasoning vs Speed

When to Use Reasoning Models (o3, Opus)

  • Architectural decisions with tradeoffs
  • Complex debugging (root cause analysis)
  • Refactoring across many files
  • Algorithm design
  • Security-sensitive code

Trade-off: 10-30 seconds of "thinking" before generating code. Worth it when direction matters more than velocity.

When to Use Fast Models (GPT-5, Flash)

  • Quick edits and small fixes
  • Autocomplete and inline suggestions
  • Well-defined, scoped tasks
  • Rapid prototyping
  • High-volume, low-complexity work

Trade-off: Less depth, more speed. Use when iteration rate matters.

Context Window Strategies

Small Windows (32K-128K)

Models: GPT-5, older Claude versions

  • Cheaper, faster
  • Need careful context management
  • Use @file references, not paste
  • Good for focused, single-file tasks

Medium Windows (200K)

Models: Claude Sonnet 4.5, o3

  • Sweet spot for most projects
  • Can hold entire modules
  • Still need selective context

Large Windows (1M tokens)

Models: Claude Opus 4.5, Gemini 3

  • Can process entire codebases
  • No chunking complexity
  • More expensive per request
  • Best for architecture-level work

Multi-Model Strategies

Tiered Approach

Many teams use different models for different phases:

  1. Planning: Opus 4.5 or o3 (reasoning)
  2. Implementation: Sonnet 4.5 or GPT-5 (balance)
  3. Testing: GPT-5 mini or Flash (speed)
  4. Review: Opus 4.5 (quality)
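The tiered approach above can be sketched as a simple lookup table. A minimal sketch; the phase labels come from the list above, but the model identifier strings are illustrative, not official API model IDs:

```python
# Hypothetical phase-to-model table for a tiered workflow.
# Model names are illustrative placeholders, not real API model IDs.
PHASE_MODELS = {
    "planning": "claude-opus-4.5",          # deep reasoning for design
    "implementation": "claude-sonnet-4.5",  # balanced speed and quality
    "testing": "gpt-5-mini",                # fast, high-volume
    "review": "claude-opus-4.5",            # quality gate
}

def model_for(phase: str) -> str:
    """Look up the model for a workflow phase, defaulting to the balanced tier."""
    return PHASE_MODELS.get(phase, "claude-sonnet-4.5")
```

Centralizing the mapping in one table makes it easy to swap a tier (say, Flash for testing) without touching the pipeline code.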

Model Routing

Some tools (Cursor, Antigravity) can automatically route to appropriate models:

if task.complexity > threshold:
    use_reasoning_model()
elif task.scope == "single_file":
    use_fast_model()
else:
    use_balanced_model()
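The pseudocode above could be made concrete along these lines; the `Task` shape, the threshold value, and the returned tier names are all illustrative assumptions:

```python
from dataclasses import dataclass

COMPLEXITY_THRESHOLD = 0.7  # illustrative cutoff; tune per team

@dataclass
class Task:
    complexity: float  # 0.0 (trivial) to 1.0 (very hard)
    scope: str         # e.g. "single_file", "module", "repo"

def route(task: Task) -> str:
    # Same logic as the pseudocode: reasoning for complex work,
    # fast for scoped single-file edits, balanced otherwise.
    if task.complexity > COMPLEXITY_THRESHOLD:
        return "reasoning"  # e.g. o3 or Opus 4.5
    if task.scope == "single_file":
        return "fast"       # e.g. GPT-5 or Gemini Flash
    return "balanced"       # e.g. Sonnet 4.5
```

Note the ordering: complexity is checked first, so a hard single-file task still gets the reasoning tier.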

BYOM: Bring Your Own Model

Several tools support connecting custom or third-party models:

Tool BYOM Support Notes
Cursor ✅ Full Any OpenAI-compatible API
Continue.dev ✅ Full Open source, flexible
Cline ✅ Full API costs only
Aider ✅ Full Local models supported
Claude Code ❌ No Anthropic only
OpenAI Codex ❌ No OpenAI only
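For tools that accept any OpenAI-compatible API, BYOM usually comes down to a base URL plus a key. A minimal sketch, assuming a local Ollama server on its default port (swap in your provider's URL, key, and model name):

```python
# Minimal settings for pointing an OpenAI-compatible tool at your own model.
# The base_url assumes Ollama's local OpenAI-compatible endpoint; local
# servers typically ignore the API key but clients still require one.
byom_config = {
    "base_url": "http://localhost:11434/v1",  # Ollama default
    "api_key": "unused-for-local",
    "model": "qwen3",  # whatever model you serve locally
}

def client_kwargs(config: dict) -> dict:
    """Extract the arguments an OpenAI-style client constructor expects."""
    return {"base_url": config["base_url"], "api_key": config["api_key"]}
```

The same config shape works for hosted third-party providers; only the URL, key, and model name change.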

Local Models

Running models locally (via Ollama, LM Studio, etc.) offers:

  • Privacy: Code never leaves your machine
  • No rate limits: Unlimited usage
  • Cost: Hardware cost only
  • Latency: No network round-trips

Reality check: Local models lag cloud models significantly on coding benchmarks. Best for:

  • Autocomplete and simple suggestions
  • Highly sensitive codebases
  • Offline environments
  • Experimentation

Cost Optimization

Token Management

  • Clear context between tasks (/clear)
  • Use @file references instead of paste
  • Keep project docs (CLAUDE.md) lean
  • Target sessions under 30K tokens
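A rough way to enforce the 30K-token session target is a character-count heuristic. A sketch, assuming the common approximation of ~4 characters per token (not an exact tokenizer):

```python
SESSION_BUDGET = 30_000  # target from the checklist above

def estimate_tokens(text: str) -> int:
    # Crude heuristic: English prose and code average ~4 characters/token.
    return len(text) // 4

def over_budget(chunks: list[str], budget: int = SESSION_BUDGET) -> bool:
    """True when the accumulated context likely exceeds the session budget."""
    return sum(estimate_tokens(c) for c in chunks) > budget
```

When `over_budget` trips, that's the cue to `/clear` and restate only the context the next task needs.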

Tier Selection

  • Use Pro/standard tiers for routine work
  • Reserve Max/Ultra for complex problems
  • Monitor usage, right-size plans

Hybrid Approaches

  • Copilot Pro ($10) for daily completions
  • Cursor/Claude for complex work
  • Save 40-50% vs. all-premium approach
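The savings claim can be sanity-checked with back-of-the-envelope numbers; the monthly figures below are illustrative assumptions, not quoted plan prices:

```python
# Illustrative monthly costs (assumptions, not published pricing):
ALL_PREMIUM = 20 + 20  # premium IDE plan plus a premium model subscription
HYBRID = 10 + 12       # Copilot Pro for completions plus occasional API usage

def savings_pct(premium: float, hybrid: float) -> float:
    """Percent saved by the hybrid approach versus all-premium."""
    return round(100 * (1 - hybrid / premium), 1)
```

With these assumed numbers the hybrid setup saves 45%, consistent with the 40-50% range above; plug in your own plan prices to check your case.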

The Bottom Line

Start with: Claude Sonnet 4.5 or GPT-5. These hit the sweet spot of capability, speed, and cost for most tasks.

Upgrade to reasoning models when:

  • Task requires tradeoff analysis
  • Debugging complex failures
  • Architecture decisions
  • Quality > velocity

Use fast models when:

  • Well-defined, scoped tasks
  • Rapid iteration needed
  • Volume > complexity

The model you choose matters less than how you use it. Clear prompts, good context, and tight feedback loops beat raw model capability.

Further Reading