Method 01

Model Routing: Cut AI Coding Costs 20-50%

Roughly 80% of your daily AI coding tasks don't need the most expensive model. Route file lookups to Haiku, daily coding to Sonnet, and save Opus for architecture decisions, all with five minutes of config.

Token Savings: 20-50%

What Is Model Routing

Model routing is the practice of directing different types of AI tasks to different models based on complexity, rather than using a single expensive model for everything. Think of it as triage: not every question needs a specialist, and not every task warrants the full power of Opus.

In a typical AI coding session, you issue dozens of requests. Roughly 40% are simple lookups ("where is this function defined?"), 40% are standard coding tasks ("add a parameter to this method"), and only 20% are genuinely complex ("design the auth system for this microservice"). Yet most developers fire all of them at the same top-tier model.

The economics are stark. For input tokens, Opus costs roughly 5x as much as Sonnet and 15x as much as Haiku. For output tokens, the Sonnet ratio holds at about 5x, but the Haiku gap widens to about 25x. If you route 80% of your requests to cheaper models, your blended cost per session drops sharply, with no quality loss on simple tasks.
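A minimal sketch of the blended-cost arithmetic, assuming the 40/40/20 request split described above, the input-price ratios quoted in this section, and roughly equal token counts per request (a simplification; real sessions vary). The names `RELATIVE_INPUT_PRICE`, `REQUEST_MIX`, and `blended_cost` are illustrative only.

```python
# Relative input-token prices, taken from the ratios quoted above
# (Opus = 1.0, Sonnet = 1/5 of Opus, Haiku = 1/15 of Opus).
RELATIVE_INPUT_PRICE = {"opus": 1.0, "sonnet": 1 / 5, "haiku": 1 / 15}

# Share of requests per tier, from the 40/40/20 split described earlier.
# Assumption: every request uses about the same number of tokens.
REQUEST_MIX = {"haiku": 0.40, "sonnet": 0.40, "opus": 0.20}


def blended_cost(mix: dict[str, float], prices: dict[str, float]) -> float:
    """Return the weighted average price per token, relative to all-Opus."""
    return sum(share * prices[model] for model, share in mix.items())


routed = blended_cost(REQUEST_MIX, RELATIVE_INPUT_PRICE)
savings = 1.0 - routed  # the all-Opus baseline costs 1.0 by construction
print(f"blended cost: {routed:.2f}x Opus, savings: {savings:.0%}")
```

Under these assumptions the routed mix costs about 0.31x the all-Opus baseline on input tokens. Requests that carry more tokens on the Opus tier would shrink that saving.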

This isn't about downgrading your tools. It's about matching the tool to the job. You wouldn't rent a server farm to host a static landing page. You shouldn't use Opus to grep for a function name.

The Three-Tier Routing Model

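The most effective routing strategy uses three tiers, each mapped to a specific class of task:

Haiku: lookups and exploration, such as finding definitions and searching files.
Sonnet: everyday coding, such as adding parameters, fixing bugs, and writing tests.
Opus: genuinely complex work, such as architecture and system design.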

This three-tier approach is the foundation of every other token-saving method. Before you optimize prompts or tweak compaction thresholds, get your routing right. It's the single highest-leverage change you can make — five minutes of config for 20-50% savings.
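The three tiers can be sketched as a tiny keyword router. This is a hypothetical heuristic for illustration, not how any of the tools below pick a model. The hint lists and the `pick_tier` function are assumptions to tune against your own prompts.

```python
# Hypothetical keyword lists; tune them to your own prompts.
LOOKUP_HINTS = ("find", "where is", "search for")
COMPLEX_HINTS = ("design", "architecture", "race condition", "security audit")


def pick_tier(prompt: str) -> str:
    """Map a prompt to 'haiku', 'sonnet', or 'opus' with simple keyword checks."""
    text = prompt.lower()
    # Check the complex hints first so "find the race condition" is not
    # misrouted to the cheapest tier.
    if any(hint in text for hint in COMPLEX_HINTS):
        return "opus"
    if any(hint in text for hint in LOOKUP_HINTS):
        return "haiku"
    return "sonnet"  # default tier for everyday coding


for prompt in (
    "where is this function defined?",
    "add a parameter to this method",
    "design the auth system for this microservice",
):
    print(f"{pick_tier(prompt):6} <- {prompt}")
```

Sonnet is the fallback on purpose: an unrecognized prompt lands on the middle tier rather than the cheapest one.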

Model routing pairs especially well with strategic compaction: route simple tasks to cheap models, compact early to preserve context quality, and the savings compound. If manual configuration feels like too many settings, ECC automates routing, compaction, and thinking-token caps in one install.

How to Set It Up

Each AI coding tool has its own way of configuring model routing. Here are the exact configs for Claude Code, Cursor, and Codex — copy, paste, and adjust to your workflow.

Claude Code

Claude Code supports two independent model settings: a main model for your primary interactions and a subagent model for background tasks like file search and exploration. This is where the biggest savings live, because subagents account for a large share of session tokens.

# ~/.claude/settings.json
{
  "model": "sonnet",
  "env": {
    "CLAUDE_CODE_SUBAGENT_MODEL": "haiku"
  }
}

With this config, every "find where X is defined" or "search for Y pattern" runs on Haiku, while your actual coding requests use Sonnet. Subagents typically account for 30-50% of session tokens, so routing them to Haiku is a massive win. When you hit a genuinely hard problem, switch to Opus on the fly with /model opus. No config change is needed.
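If you already have a settings file, pasting the snippet over it would discard your other keys. A minimal sketch of a merge that sets only the two routing values shown above. The helper `apply_routing` is hypothetical, and `MY_EXISTING_VAR` is a made-up placeholder rather than a real Claude Code setting.

```python
import json


def apply_routing(settings: dict, main: str = "sonnet", subagent: str = "haiku") -> dict:
    """Return a copy of settings with the routing keys set, other keys untouched."""
    merged = dict(settings)
    merged["model"] = main
    # Copy the env mapping so the caller's dict is not mutated.
    env = dict(merged.get("env", {}))
    env["CLAUDE_CODE_SUBAGENT_MODEL"] = subagent
    merged["env"] = env
    return merged


# Example: an existing settings dict with an unrelated env entry.
# MY_EXISTING_VAR is a made-up placeholder, not a real Claude Code setting.
existing = {"env": {"MY_EXISTING_VAR": "1"}}
print(json.dumps(apply_routing(existing), indent=2))
```

The function works on plain dicts and does not touch the filesystem. Loading and saving the file with `json.load` and `json.dump` is left to the caller.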

Cursor

Cursor lets you set different models for different features. The key split is between the inline editor (Ctrl+K) and the chat panel (Ctrl+L).

# Cursor Settings (Settings > Models)
# Chat / Composer:
Default model: claude-sonnet-4

# Inline Edit (Ctrl+K):
Default model: claude-haiku-4

# Terminal / Agent (Ctrl+I):
Default model: claude-sonnet-4

Inline edits are typically single-function changes, variable renames, or small refactors — perfect Haiku territory. Keep Sonnet for the chat panel where you describe multi-step tasks. For architecture planning sessions, manually switch chat to Opus from the model dropdown.

OpenAI Codex CLI

Codex CLI uses a model flag for each session. Create shell aliases so you never have to think about which flag to use.

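# ~/.bashrc or ~/.zshrc
# "command" bypasses alias lookup, so these aliases don't chain into
# each other and pass --model twice.
alias codex-quick="command codex --model gpt-5-nano"
alias codex="command codex --model gpt-5"
alias codex-deep="command codex --model gpt-5.2"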

# Usage:
codex-quick "find the auth middleware file"
codex "add rate limiting to the login endpoint"
codex-deep "design the caching strategy for our API layer"

The alias approach removes decision fatigue. codex-quick for lookups, codex for daily work, codex-deep for architecture. It takes thirty seconds to set up and saves you from typing model flags on every command.

When to Use Which Model

Not sure where a task falls? Use this decision table. When in doubt, start one tier lower than you think — you can always escalate if the output isn't good enough.

| Task | Best Model | Why |
|---|---|---|
| Find a function definition | Haiku | Pure search, no generation needed. Haiku handles grep-style lookups instantly at 1/15th the cost. |
| Explain what a code block does | Haiku | Summarization of existing code. Haiku reads and explains just as well as larger models for single files. |
| Check git history for a change | Haiku | Information retrieval from structured data. No reasoning required beyond pattern matching. |
| Write a CRUD endpoint | Sonnet | Standard coding pattern. Sonnet generates clean, idiomatic code indistinguishable from Opus on routine tasks. |
| Fix a null pointer bug | Sonnet | Single-file debugging with clear error context. Sonnet traces logic paths reliably without Opus overhead. |
| Refactor a 200-line function | Sonnet | Medium complexity, single concern. Sonnet handles restructuring well when the problem is well-scoped. |
| Write unit tests for a module | Sonnet | Systematic but formulaic. Sonnet produces thorough test coverage at a fraction of Opus cost. |
| Design a multi-service auth flow | Opus | Cross-cutting architecture with security implications. The extra reasoning depth matters here. |
| Debug a race condition | Opus | Concurrency bugs require reasoning about interleaved execution. Opus consistently catches subtle timing issues that smaller models miss. |
| Performance profiling and optimization | Opus | Requires understanding the full system to identify bottlenecks. Opus's deeper reasoning pays off. |
| Security audit of auth and permissions | Opus | Security-sensitive work demands the strongest reasoning. The cost difference is negligible compared to the cost of a vulnerability. |

Pro tip: Start with Sonnet as your default. If the output looks good, you're done. If it misses nuance or depth, escalate to Opus. You'll find that 80% of tasks never need escalation, and the 20% that do are obvious within the first response.
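The escalate-on-demand habit can be sketched as a small loop. Everything here is an assumption for illustration: `ask` stands in for whatever sends a prompt to a given model, `good_enough` stands in for your own judgment of the answer, and `fake_ask` is a stub so the sketch runs without any API.

```python
from typing import Callable

TIERS = ["haiku", "sonnet", "opus"]  # cheapest to most capable


def run_with_escalation(
    prompt: str,
    ask: Callable[[str, str], str],
    good_enough: Callable[[str], bool],
    start: str = "sonnet",
) -> tuple[str, str]:
    """Try tiers from `start` upward; return (tier_used, answer)."""
    answer = ""
    tier = start
    for tier in TIERS[TIERS.index(start):]:
        answer = ask(tier, prompt)
        if good_enough(answer):
            break
    return tier, answer


# Stand-in model call for demonstration: only "opus" gives a detailed answer.
def fake_ask(tier: str, prompt: str) -> str:
    return "detailed answer" if tier == "opus" else "shallow answer"


print(run_with_escalation("debug a race condition", fake_ask, lambda a: "detailed" in a))
```

If no tier satisfies `good_enough`, the function returns the top tier's answer rather than raising, which mirrors how you would stop at Opus in practice.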

Before vs After: Real Numbers

Here's what a typical heavy-use day looks like before and after implementing model routing. These numbers come from a single developer working a full 8-hour day in Claude Code.

| Metric | Before (All Opus) | After (Routed) | Savings |
|---|---|---|---|
| Total tokens (input) | ~850,000 | ~290,000 | 66% |
| Total tokens (output) | ~95,000 | ~38,000 | 60% |
| Session cost | $14.20 | $4.35 | 69% |
| Monthly cost (22 workdays) | $312.40 | $95.70 | $216.70 |
| Tasks completed | 24 | 24 | Same |
| Avg. response time | ~4.2s | ~1.8s | 57% faster |

The critical detail: tasks completed stayed the same. Model routing didn't reduce capability; it reduced cost. Response times also improved, because Haiku and Sonnet typically respond faster than Opus. You get the same work done, cheaper and faster.

The monthly savings of $216.70 is for one developer. If you're a team lead with five developers, that's over $1,000 per month saved: enough to cover another SaaS tool, an extra CI/CD runner, or a team lunch every week. Multiply across a 50-person engineering org and model routing alone saves roughly $130,000 per year.
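The team-level figures follow directly from the per-session costs in the table. A short sketch of that arithmetic, assuming one session per workday and that every developer matches the single-developer numbers. The helper `monthly_savings` is illustrative.

```python
from decimal import Decimal

COST_BEFORE = Decimal("14.20")  # all-Opus session cost from the table
COST_AFTER = Decimal("4.35")    # routed session cost from the table
WORKDAYS_PER_MONTH = 22


def monthly_savings(developers: int) -> Decimal:
    """Monthly savings in dollars for a team of the given size."""
    per_day = COST_BEFORE - COST_AFTER
    return per_day * WORKDAYS_PER_MONTH * developers


print(monthly_savings(1))        # one developer
print(monthly_savings(5))        # five-person team
print(monthly_savings(50) * 12)  # 50-person org, per year
```

`Decimal` keeps the dollar amounts exact, so the results line up with the table to the cent.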

Common Pitfalls

Model routing is simple in theory but easy to get wrong in practice. Here are the three most common mistakes and how to avoid them.

Pitfall 1: Routing Everything to the Cheapest Model

The goal is not to minimize spend at any price; it's to match model capability to task complexity. If you route everything to Haiku, you'll pay less per token but produce lower-quality code, spend more time fixing Haiku's mistakes, and ultimately burn more tokens on rework than you saved. Keep Sonnet as your default, not Haiku. Haiku is for exploration only.

Pitfall 2: Not Setting a Subagent Model

Many developers set their main model to Sonnet but leave the subagent model unset. In Claude Code, subagents inherit the main model if no override is provided — meaning your background file searches are still burning Sonnet tokens. Setting CLAUDE_CODE_SUBAGENT_MODEL=haiku is a one-line change that often doubles your savings. Do not skip it.
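The first two pitfalls are easy to check mechanically. A minimal sketch, assuming the settings layout shown in the Claude Code section and the inheritance behavior described above. The function `routing_warnings` is hypothetical and not part of any tool.

```python
def routing_warnings(settings: dict) -> list[str]:
    """Return human-readable warnings for the first two pitfalls."""
    warnings = []
    main = settings.get("model", "")
    subagent = settings.get("env", {}).get("CLAUDE_CODE_SUBAGENT_MODEL")
    if "haiku" in main:
        warnings.append("main model is Haiku; consider Sonnet as the default")
    if not subagent:
        warnings.append("no subagent model set; subagents will use the main model")
    return warnings


print(routing_warnings({"model": "sonnet"}))
print(routing_warnings({"model": "sonnet", "env": {"CLAUDE_CODE_SUBAGENT_MODEL": "haiku"}}))
```

An empty list means neither pitfall applies to the settings you passed in.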

Pitfall 3: Rigid Routing Rules

No routing heuristic is perfect. Sometimes a task that looks simple ("find the auth middleware") balloons into a complex investigation ("and also trace every caller, check permissions, and audit the token refresh flow"). Build the habit of escalating mid-session. If Haiku or Sonnet gives you an answer that feels shallow or incomplete, switch to the next tier and re-ask. Claude Code supports /model opus mid-session. Cursor has a model dropdown. Use them.

FAQ

Does model routing add overhead to my workflow?

Not if you set sensible defaults. With Haiku as your subagent model and Sonnet as your main model, the routing happens automatically in the background. You don't manually choose a model for every request: subagents pick up Haiku, your chat uses Sonnet, and you only intervene when you need Opus (roughly once or twice per session). The cognitive overhead is near zero after the first day.

Is Haiku good enough for code exploration?

Yes, and in some ways it's better. Haiku is faster than Sonnet and Opus for the kind of pattern-matching work that dominates code exploration. It reads files, traces imports, finds definitions, and summarizes code blocks with high accuracy. Where Haiku falls short is generating new code that requires understanding complex, multi-file interactions. For pure information retrieval, Haiku is the right tool.

When should I escalate from Sonnet to Opus?

Escalate when the Sonnet response feels shallow: it explains what the code does but misses why it was designed that way, or it gives a technically correct solution that creates problems elsewhere. Opus shines at cross-cutting concerns: it catches downstream effects, identifies architectural tradeoffs, and asks clarifying questions before jumping to code. If your first thought after reading Sonnet's response is "that works but I'm not sure it's right," escalate. One Opus query costs less than an hour spent debugging a subtle bug Sonnet introduced.
