Model Routing: Cut AI Coding Costs 20-50%
Roughly 80-90% of your daily AI coding tasks don't need the most expensive model. Route file lookups to Haiku, daily coding to Sonnet, and save Opus for architecture decisions, all with five minutes of config.
What Is Model Routing
Model routing is the practice of directing different types of AI tasks to different models based on complexity, rather than using a single expensive model for everything. Think of it as triage: not every question needs a specialist, and not every task warrants the full power of Opus.
In a typical AI coding session, you issue dozens of requests. Roughly 40% are simple lookups ("where is this function defined?"), 40% are standard coding tasks ("add a parameter to this method"), and only 20% are genuinely complex ("design the auth system for this microservice"). Yet most developers fire all of them at the same top-tier model.
The economics are stark. By the figures used here, Opus costs roughly 5x more per input token than Sonnet and 15x more than Haiku. For output tokens, the Sonnet ratio stays at about 5x while the Haiku gap widens to about 25x. If you route 80% of your requests to cheaper models, your blended cost per session drops sharply, with no meaningful quality loss on simple tasks.
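To make the blended-cost claim concrete, here is a minimal sketch that combines the input-price ratios quoted above with the 40/40/20 request mix. It assumes every request class consumes the same number of tokens, which is a simplification; the function and variable names are illustrative only.

```python
# Relative input-token prices from the ratios quoted above (Opus = 1.0).
RELATIVE_INPUT_COST = {"opus": 1.0, "sonnet": 1 / 5, "haiku": 1 / 15}

# Request mix from the article: 40% lookups, 40% standard coding, 20% complex.
# Assumption: every request class consumes the same number of tokens.
REQUEST_MIX = {"haiku": 0.40, "sonnet": 0.40, "opus": 0.20}


def blended_cost(mix: dict, prices: dict) -> float:
    """Return the average relative cost per token for a given routing mix."""
    return sum(share * prices[model] for model, share in mix.items())


all_opus = blended_cost({"opus": 1.0}, RELATIVE_INPUT_COST)
routed = blended_cost(REQUEST_MIX, RELATIVE_INPUT_COST)
print(f"all-Opus: {all_opus:.3f}, routed: {routed:.3f}, saved: {1 - routed / all_opus:.0%}")
```

Under these assumptions the routed mix costs about 31% of the all-Opus baseline on input tokens. Real sessions will differ, since complex requests usually carry more context than lookups.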
This isn't about downgrading your tools. It's about matching the tool to the job. You wouldn't rent a server farm to host a static landing page. You shouldn't use Opus to grep for a function name.
The Three-Tier Routing Model
The most effective routing strategy uses three tiers, each mapped to a specific class of task:
- Haiku (Tier 1) — Exploration and Lookups: File search, finding function definitions, reading documentation, checking git history, explaining simple code. Haiku is fast, cheap, and surprisingly capable at information retrieval. At the ratios above, it processes context at roughly 4-7% of the cost of Opus (1/25 to 1/15).
- Sonnet (Tier 2) — Daily Coding: Writing functions, refactoring methods, fixing bugs, adding features, writing tests, code review. Sonnet delivers near-Opus code quality on standard development tasks at about 20-25% of the cost. For most working developers, Sonnet should be the default model.
- Opus (Tier 3) — Architecture and Complex Reasoning: System design, multi-file architecture decisions, security audits, complex debugging sessions, performance optimization. Opus shines when the problem requires deep reasoning across many interconnected concerns. Reserve it for the 10-20% of tasks that genuinely need it.
This three-tier approach is the foundation of every other token-saving method. Before you optimize prompts or tweak compaction thresholds, get your routing right. It's the single highest-leverage change you can make — five minutes of config for 20-50% savings.
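The three tiers can be sketched as a small keyword-based classifier. This is only an illustration of the idea, not how any of the tools below route requests: the keyword lists are hypothetical, loosely derived from the task types named in the list, and plain substring matching is deliberately crude.

```python
from enum import IntEnum


class Tier(IntEnum):
    HAIKU = 1   # exploration and lookups
    SONNET = 2  # daily coding
    OPUS = 3    # architecture and complex reasoning


# Hypothetical keyword lists, loosely derived from the task types named above.
TIER_KEYWORDS = {
    Tier.OPUS: ("design", "architecture", "security audit", "race condition", "optimiz"),
    Tier.HAIKU: ("find", "where is", "search for", "explain", "git history"),
}


def pick_tier(task: str) -> Tier:
    """Guess a tier from keywords; anything unmatched falls back to Sonnet."""
    text = task.lower()
    # Check Opus keywords first so "find and redesign X" is not under-routed.
    for tier in (Tier.OPUS, Tier.HAIKU):
        if any(keyword in text for keyword in TIER_KEYWORDS[tier]):
            return tier
    return Tier.SONNET


for example in (
    "where is this function defined?",
    "add a parameter to this method",
    "design the auth system for this microservice",
):
    print(f"{pick_tier(example).name:<6} <- {example}")
```

Falling back to Sonnet for anything unmatched mirrors the advice that Sonnet should be the default model.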
Model routing pairs especially well with strategic compaction: route cheap models to simple tasks, compact early to preserve context quality, and the savings compound. If manual configuration feels like too many settings, ECC automates routing + compaction + thinking token caps in one install.
How to Set It Up
Each AI coding tool has its own way of configuring model routing. Here are the exact configs for Claude Code, Cursor, and Codex — copy, paste, and adjust to your workflow.
Claude Code
Claude Code supports two independent model settings: a main model for your primary interactions and a subagent model for background tasks like file search and exploration. This is where the biggest savings live, because subagents account for a large share of per-session token consumption.
```json
{
  "model": "sonnet",
  "env": {
    "CLAUDE_CODE_SUBAGENT_MODEL": "haiku"
  }
}
```
With this config, every "find where X is defined" or "search for Y pattern" runs on Haiku, while your actual coding requests use Sonnet. Subagents typically account for 30-50% of session tokens, so routing them to Haiku is a substantial win. When you hit a genuinely hard problem, switch to Opus on the fly with /model opus; no config change is needed.
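If you already have a settings file, you will probably want to merge these two keys into it rather than overwrite it. The sketch below shows one way to do that on an in-memory dictionary. The `with_routing` helper and the sample `existing` settings are hypothetical; only the `model` and `env` keys and the `CLAUDE_CODE_SUBAGENT_MODEL` variable come from the config above.

```python
import json


def with_routing(settings: dict, main: str = "sonnet", subagent: str = "haiku") -> dict:
    """Return a copy of `settings` with the two routing keys set, other keys untouched."""
    merged = dict(settings)
    merged["model"] = main
    # Build a new env mapping so the caller's dict is not mutated.
    merged["env"] = {**settings.get("env", {}), "CLAUDE_CODE_SUBAGENT_MODEL": subagent}
    return merged


# Hypothetical existing settings, used only to show that other keys survive.
existing = {"env": {"MY_OTHER_VAR": "1"}, "someOtherSetting": True}
print(json.dumps(with_routing(existing), indent=2))
```

Reading and writing the actual file is left out on purpose, since the article does not say where your settings live.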
Cursor
Cursor lets you set different models for different features. The key split is between the inline editor (Ctrl/Cmd+K) and the chat panel (Ctrl/Cmd+L).
```text
# Chat / Composer:
Default model: claude-sonnet-4
# Inline Edit (Ctrl/Cmd+K):
Default model: claude-haiku-4
# Terminal / Agent (Ctrl/Cmd+I):
Default model: claude-sonnet-4
```
Inline edits are typically single-function changes, variable renames, or small refactors — perfect Haiku territory. Keep Sonnet for the chat panel where you describe multi-step tasks. For architecture planning sessions, manually switch chat to Opus from the model dropdown.
OpenAI Codex CLI
Codex CLI uses a model flag for each session. Create shell aliases so you never have to think about which flag to use.
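```shell
# `command` skips alias lookup, so codex-quick and codex-deep don't also
# pick up the --model flag from the plain `codex` alias.
alias codex-quick="command codex --model gpt-5-nano"
alias codex="command codex --model gpt-5"
alias codex-deep="command codex --model gpt-5.2"

# Usage:
codex-quick "find the auth middleware file"
codex "add rate limiting to the login endpoint"
codex-deep "design the caching strategy for our API layer"
```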
The alias approach removes decision fatigue. codex-quick for lookups, codex for daily work, codex-deep for architecture. It takes thirty seconds to set up and saves you from typing model flags on every command.
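The same tier-to-model mapping can also live in a script, for example when a wrapper tool launches Codex for you. A minimal sketch, assuming the `--model` flag and model names from the aliases above; the tier labels and the `codex_command` helper are this example's own.

```python
import shlex

# Model names copied from the aliases above; the tier labels are this sketch's own.
CODEX_MODELS = {"quick": "gpt-5-nano", "default": "gpt-5", "deep": "gpt-5.2"}


def codex_command(tier: str, prompt: str) -> list:
    """Build the argument list for one Codex CLI call at the given tier."""
    return ["codex", "--model", CODEX_MODELS[tier], prompt]


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(codex_command("quick", "find the auth middleware file")))
```

Passing the returned list to `subprocess.run` would execute the command without going through shell aliases at all.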
When to Use Which Model
Not sure where a task falls? Use this decision table. When in doubt, start one tier lower than you think — you can always escalate if the output isn't good enough.
| Task | Best Model | Why |
|---|---|---|
| Find a function definition | Haiku | Pure search, no generation needed. Haiku handles grep-style lookups instantly at 1/15th the cost. |
| Explain what a code block does | Haiku | Summarization of existing code. Haiku reads and explains just as well as larger models for single files. |
| Check git history for a change | Haiku | Information retrieval from structured data. No reasoning required beyond pattern matching. |
| Write a CRUD endpoint | Sonnet | Standard coding pattern. Sonnet generates clean, idiomatic code indistinguishable from Opus on routine tasks. |
| Fix a null pointer bug | Sonnet | Single-file debugging with clear error context. Sonnet traces logic paths reliably without Opus overhead. |
| Refactor a 200-line function | Sonnet | Medium complexity, single concern. Sonnet handles restructuring well when the problem is well-scoped. |
| Write unit tests for a module | Sonnet | Systematic but formulaic. Sonnet produces thorough test coverage at a fraction of Opus cost. |
| Design a multi-service auth flow | Opus | Cross-cutting architecture with security implications. The extra reasoning depth matters here. |
| Debug a race condition | Opus | Concurrency bugs require reasoning about interleaved execution. Opus consistently catches subtle timing issues that smaller models miss. |
| Performance profiling and optimization | Opus | Requires understanding the full system to identify bottlenecks. Opus's deeper reasoning across many components pays off. |
| Security audit of auth and permissions | Opus | Security-sensitive work demands the strongest reasoning. The cost difference is negligible compared to the cost of a vulnerability. |
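The "start one tier lower when in doubt" rule is easy to express in code. The sketch below is illustrative only: the `starting_model` helper and its `in_doubt` flag are hypothetical names, and the tier order simply follows the table from cheapest to most expensive.

```python
TIERS = ["haiku", "sonnet", "opus"]  # cheapest to most expensive


def starting_model(guess: str, in_doubt: bool) -> str:
    """Return the model to try first for a task.

    If you are unsure about your guess, step down one tier, but never
    below the cheapest tier.
    """
    index = TIERS.index(guess)
    return TIERS[max(index - 1, 0)] if in_doubt else guess


print(starting_model("opus", in_doubt=True))     # sonnet
print(starting_model("sonnet", in_doubt=False))  # sonnet
print(starting_model("haiku", in_doubt=True))    # haiku
```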
Before vs After: Real Numbers
Here's what a typical heavy-use day looks like before and after implementing model routing. These numbers come from a single developer working a full 8-hour day in Claude Code.
| Metric | Before (All Opus) | After (Routed) | Savings |
|---|---|---|---|
| Total tokens (input) | ~850,000 | ~290,000 | 66% |
| Total tokens (output) | ~95,000 | ~38,000 | 60% |
| Session cost | $14.20 | $4.35 | 69% |
| Monthly cost (22 workdays) | $312.40 | $95.70 | $216.70 (69%) |
| Tasks completed | 24 | 24 | Same |
| Avg. response time | ~4.2s | ~1.8s | 57% faster |
The critical detail: tasks completed stayed the same. Model routing didn't reduce capability; it reduced cost. Response times also improved, because Haiku and Sonnet generally respond faster than Opus. You get the same work done, cheaper and faster.
The monthly savings of $216.70 is for one developer. If you're a team lead with five developers, that's over $1,000 per month saved — enough to cover another SaaS tool, an extra CI/CD runner, or a team lunch every week. Multiply across a 50-person engineering org and model routing alone saves $130,000 per year.
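The monthly and team figures follow directly from the two session costs in the table. Here is the arithmetic as a short script, using `Decimal` so the dollar amounts stay exact; the variable names are illustrative.

```python
from decimal import Decimal

before_per_day = Decimal("14.20")  # all-Opus session cost from the table
after_per_day = Decimal("4.35")    # routed session cost from the table
workdays = 22

monthly_before = before_per_day * workdays
monthly_after = after_per_day * workdays
monthly_saving = monthly_before - monthly_after

print(monthly_before, monthly_after, monthly_saving)  # 312.40 95.70 216.70
print(monthly_saving * 5)        # five developers, per month
print(monthly_saving * 50 * 12)  # fifty developers, per year
```

The last two lines give $1,083.50 per month for a team of five and $130,020 per year for fifty developers, matching the rounded figures in the text.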
Common Pitfalls
Model routing is simple in theory but easy to get wrong in practice. Here are the three most common mistakes and how to avoid them.
Pitfall 1: Routing Everything to the Cheapest Model
The goal is not to minimize cost at all costs — it's to match model capability to task complexity. If you route everything to Haiku, you'll save tokens but produce lower-quality code, spend more time fixing Haiku's mistakes, and ultimately waste more tokens on rework than you saved. The pendulum swings both ways. Keep Sonnet as your default, not Haiku. Haiku is for exploration only.
Pitfall 2: Not Setting a Subagent Model
Many developers set their main model to Sonnet but leave the subagent model unset. In Claude Code, subagents inherit the main model if no override is provided — meaning your background file searches are still burning Sonnet tokens. Setting CLAUDE_CODE_SUBAGENT_MODEL=haiku is a one-line change that often doubles your savings. Do not skip it.
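A rough estimate of what the subagent override is worth, as a sketch rather than a measurement. It assumes subagent tokens move from Sonnet to Haiku, that the 5x and 15x input-price ratios quoted earlier hold (so Sonnet costs about 3x Haiku), and that only input cost is counted.

```python
# Sonnet costs about 3x Haiku on input, from the 5x and 15x ratios quoted earlier.
SONNET_TO_HAIKU_RATIO = 15 / 5


def subagent_saving(subagent_share: float) -> float:
    """Fraction of session input cost saved by moving subagent tokens from Sonnet to Haiku."""
    return subagent_share * (1 - 1 / SONNET_TO_HAIKU_RATIO)


# The article puts subagents at 30-50% of session tokens.
for share in (0.30, 0.50):
    print(f"{share:.0%} subagent share -> {subagent_saving(share):.0%} saved")
```

With subagents at 30-50% of session tokens, this works out to roughly 20-33% off the session's input cost on top of whatever the main-model choice already saves.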
Pitfall 3: Rigid Routing Rules
No routing heuristic is perfect. Sometimes a task that looks simple ("find the auth middleware") balloons into a complex investigation ("and also trace every caller, check permissions, and audit the token refresh flow"). Build the habit of escalating mid-session. If Haiku or Sonnet gives you an answer that feels shallow or incomplete, switch to the next tier and re-ask. Claude Code supports /model opus mid-session. Cursor has a model dropdown. Use them.
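The escalation habit described above amounts to a simple loop: ask at the current tier, and if the answer is not good enough, re-ask one tier up. The sketch below is hypothetical throughout. `ask` stands in for however you send a prompt to a given model, and `good_enough` stands in for your own judgement of the answer; neither is a real API.

```python
from typing import Callable, Tuple

TIERS = ["haiku", "sonnet", "opus"]  # cheapest to most expensive


def ask_with_escalation(
    prompt: str,
    start: str,
    ask: Callable[[str, str], str],       # hypothetical: (model, prompt) -> answer
    good_enough: Callable[[str], bool],   # hypothetical: your own quality check
) -> Tuple[str, str]:
    """Re-ask at the next tier until the answer is accepted or the tiers run out."""
    model, answer = start, ""
    for model in TIERS[TIERS.index(start):]:
        answer = ask(model, prompt)
        if good_enough(answer):
            break
    return model, answer


# Stand-in callables for demonstration: only the Opus "answer" is accepted.
fake_ask = lambda model, prompt: f"{model}: answer to {prompt!r}"
print(ask_with_escalation("trace every caller", "haiku", fake_ask,
                          lambda answer: answer.startswith("opus")))
```

If no tier produces an accepted answer, the function returns the top tier's answer rather than raising, which leaves the final call to you.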
FAQ
Do I have to pick a model for every request?
Not if you set sensible defaults. With Haiku as your subagent model and Sonnet as your main model, the routing happens automatically in the background. You don't manually choose a model for every request: subagents pick up Haiku, your chat uses Sonnet, and you only intervene when you need Opus (roughly once or twice per session). The cognitive overhead is near zero after the first day.
Is Haiku good enough for code exploration?
Yes, and in some ways it's better. Haiku is faster than Sonnet and Opus for the kind of pattern-matching work that dominates code exploration. It reads files, traces imports, finds definitions, and summarizes code blocks with high accuracy. Where Haiku falls short is generating new code that requires understanding complex, multi-file interactions. For pure information retrieval, Haiku is the right tool.
When should I escalate from Sonnet to Opus?
Escalate when the Sonnet response feels shallow: it explains what the code does but misses why it was designed that way, or it gives a technically correct solution that creates problems elsewhere. Opus shines at cross-cutting concerns: it catches downstream effects, identifies architectural tradeoffs, and asks clarifying questions before jumping to code. If your first thought after reading Sonnet's response is "that works but I'm not sure it's right," escalate. The token cost of one Opus query is cheaper than an hour of debugging a subtle bug Sonnet introduced.