Model Routing
Model routing means sending each incoming request to the most appropriate model: usually the cheapest, fastest one that can handle it, with only the hard cases escalated to a stronger model. The goal is to cut cost and latency without sacrificing output quality.
The routing decision rides on a signal: the task type, the input length, a lightweight classifier that predicts difficulty, or a confidence or validation check that escalates when a cheap first attempt looks wrong (a cascade). A common shape is a small language model as the default workhorse with a frontier model held in reserve: the usual cheap-versus-strong tiering decision, automated per request.
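As a rough illustration of the heuristic signals described above (task type and input length), a router can be sketched in a few lines. The tier names, the set of easy task types, and the length cutoff are all assumptions made up for this example, not values from any particular gateway or provider.

```python
from dataclasses import dataclass

# Hypothetical tier names; these are placeholders, not real model identifiers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Assumed set of task types that the small model handles well.
EASY_TASKS = {"classification", "extraction", "short_answer"}

# Assumed cutoff (in characters) above which a request is escalated.
MAX_EASY_CHARS = 2000


@dataclass
class Request:
    task_type: str
    text: str


def route(request: Request) -> str:
    """Pick a model tier from cheap heuristics: task type, then input length."""
    if request.task_type not in EASY_TASKS:
        return FRONTIER_MODEL
    if len(request.text) > MAX_EASY_CHARS:
        return FRONTIER_MODEL
    return SMALL_MODEL


if __name__ == "__main__":
    print(route(Request("classification", "Is this email spam?")))
    print(route(Request("code_generation", "Write a tokenizer in Rust.")))
```

Rules like these are predictable and cost nothing at request time, but they only approximate difficulty; the thresholds would need tuning against real traffic.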
The economics work because most production traffic is easy. If 80% of requests are simple classifications, extractions, or short answers, routing them to a model that costs a fraction as much cuts inference spend and latency for the bulk of traffic, while the hard 20% still gets the strong model. Gateways make this practical: a single API in front of many providers, where the router lives. See Calling Any Model: Gateways.
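To make the arithmetic concrete, here is a small worked example. The 80/20 split comes from the text; the 10x price gap between the two models is an assumption chosen purely for illustration, and costs are expressed relative to sending everything to the strong model.

```python
def blended_cost(easy_share: float, small_cost: float, strong_cost: float) -> float:
    """Average cost per request when the easy share goes to the small model."""
    return easy_share * small_cost + (1 - easy_share) * strong_cost


# Baseline: every request goes to the strong model (relative cost 1.0).
baseline = 1.0

# Assumption: the small model costs one tenth as much per request.
routed = blended_cost(easy_share=0.8, small_cost=0.1, strong_cost=1.0)
savings = 1 - routed / baseline

print(f"relative cost per request: {routed:.2f}")  # prints 0.28
print(f"savings vs. baseline: {savings:.0%}")      # prints 72%
```

This ignores the overhead of the router itself and, in a cascade, the cost of failed first attempts that end up paying for both models, so real savings will be somewhat lower.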
The caveat is the whole game: route too aggressively and you silently downgrade the cases that needed the strong model, degrading quality precisely where it counts. Gate every rule with an eval set that includes hard inputs, and pair routing with a provider fallback wrapper so an outage escalates rather than fails. Design the policy with a model router designer.
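One way to picture the two safeguards above is the sketch below: a gate that measures how many labeled hard inputs a routing rule would send to the small model, and a fallback wrapper that tries providers in order. The function names, the eval-set shape (text plus a `needs_strong` label), and the 5% threshold are hypothetical choices for this example; a real gate would typically also score answer quality, not just the routing decision.

```python
from typing import Callable, Sequence, Tuple

SMALL, STRONG = "small", "strong"  # hypothetical tier labels


def hard_case_miss_rate(
    route: Callable[[str], str],
    eval_set: Sequence[Tuple[str, bool]],
) -> float:
    """Fraction of inputs labeled needs_strong that the rule sends to SMALL."""
    hard = [text for text, needs_strong in eval_set if needs_strong]
    if not hard:
        raise ValueError("eval set must include hard inputs")
    missed = sum(1 for text in hard if route(text) == SMALL)
    return missed / len(hard)


def passes_gate(
    route: Callable[[str], str],
    eval_set: Sequence[Tuple[str, bool]],
    max_miss_rate: float = 0.05,  # assumed tolerance, tune per product
) -> bool:
    """Accept a routing rule only if it rarely downgrades hard inputs."""
    return hard_case_miss_rate(route, eval_set) <= max_miss_rate


def call_with_fallback(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider callable in order; raise only if all of them fail."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, ...
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Ordering the provider list from cheapest to strongest means an outage on the cheap tier escalates the request instead of failing it.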
Frequently asked questions
- How does a model router decide where to send a request?
  The routing signal can be cheap heuristics (task type, input length, keywords), a lightweight classifier trained to predict difficulty, or a confidence-based cascade: try the small model first, validate its answer, and escalate only when the check fails. Heuristics are predictable and free; classifiers and cascades adapt to actual difficulty but add their own latency and need their own evals.
- What's the main risk of model routing?
  Routing too aggressively to the weak model. The cases that genuinely needed the strong one get silently downgraded, so quality drops on exactly the hard inputs that matter most, and the failure is invisible unless you measure it. Gate every routing rule with an eval set that includes hard cases, and watch the escalation rate: if almost nothing escalates, your router is probably too greedy.
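The confidence-based cascade and the escalation-rate check from the answers above can be sketched together. Everything here is illustrative: the `Cascade` class, its `small`, `strong`, and `validate` callables, and the counters are hypothetical names, and the model calls are stand-ins for whatever client a real system uses.

```python
from typing import Callable


class Cascade:
    """Try the small model first; escalate when validation rejects its draft."""

    def __init__(
        self,
        small: Callable[[str], str],
        strong: Callable[[str], str],
        validate: Callable[[str, str], bool],
    ) -> None:
        self.small = small
        self.strong = strong
        self.validate = validate
        self.total = 0      # requests seen
        self.escalated = 0  # requests that needed the strong model

    def answer(self, prompt: str) -> str:
        self.total += 1
        draft = self.small(prompt)
        if self.validate(prompt, draft):
            return draft
        # The draft failed the check, so this request pays for both models.
        self.escalated += 1
        return self.strong(prompt)

    @property
    def escalation_rate(self) -> float:
        """Share of requests escalated so far (0.0 before any traffic)."""
        return self.escalated / self.total if self.total else 0.0
```

Tracking `escalation_rate` over time gives the signal described above: a rate near zero suggests the validator is accepting too much, while a rate near one means the small model is adding latency without saving much.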
Related
- Inference
  Inference is running a trained model to produce output; for LLMs, that means generating tokens one at a time. Its cost and latency define the economics of AI products.
- SLM (Small Language Model)
  A small language model is a compact LLM (roughly 1–15B parameters) that runs cheaply or locally, trading peak capability for speed and deployability.
- Model Router Designer
  Design a model router that sends each LLM request to the cheapest model that can handle it and escalates only the hard cases to the strongest, cutting cost and latency without tanking quality, gated by an eval set so the savings don't come from silently worse answers. Use when one expensive model serves all traffic (most of it easy), when LLM cost or latency is too high, or when balancing quality against spend across a range of request difficulty.
- Provider Fallback Wrapper
  Wrap LLM calls so a provider outage, rate limit, or timeout degrades gracefully, with multi-provider fallback, bounded retries with backoff, and timeouts. Use when an app depends on a single model/provider and needs production resilience.
- Calling Any Model: Unified LLM Gateways & SDKs in 2026
  Why teams put a unified layer in front of LLM providers, and how LiteLLM, OpenRouter, and the Vercel AI SDK compare for fallback and cost control.