Model routing with OpenRouter and DeepSeek to cut costs without losing quality
How I routed models in production with OpenRouter, using DeepSeek for the bulk of traffic and premium models only where it matters, cutting LLM costs without degrading quality.
Model routing means sending each request to the cheapest model that still solves the task at the quality you need, instead of sending everything to a single expensive model. In production I implemented it with OpenRouter as a single access layer: DeepSeek for the bulk of traffic (classification, extraction, summaries) and a premium model like Claude only for the cases where quality is critical. It’s the most direct way to cut AI costs without users noticing a difference, because most real requests don’t need the most expensive model.
TL;DR
- Send the bulk of traffic (classify, extract, summarize) to a cheap model like DeepSeek and reserve premium (Claude) only for the hard cases.
- Validate the cheap model's output and escalate to premium only when it fails. Clean the JSON (fences, trailing commas) before giving up, so you don't escalate unnecessarily.
- Measure everything by model and feature. With volume and tasks of uneven difficulty, savings land around 80-90%.
In this article:
- Fundamentals — What it is and why it cuts costs · How much you can save
- Implementation — DeepSeek vs Claude by task · Set up OpenRouter · Decide the model · Validation and fallback
- Operation — Observability · Prompt caching · Limitations · When NOT to use routing
What model routing is and why it cuts costs
The underlying idea is simple: not all LLM requests have the same difficulty. Classifying a text into one of three categories doesn’t require the same model as drafting a nuanced legal reply. When you send everything to the top model, you pay a premium price for tasks a cheaper model solves just as well. That’s where the money goes: not on volume, but on using expensive capacity for cheap work.
Routing breaks that uniformity. You define rules or a complexity signal, and based on that you pick the destination model. The price gap between model tiers is one to two orders of magnitude per million tokens (in the example below, roughly 55x on input and 140x on output), so moving even a fraction of the traffic to the cheap tier changes the bill entirely. That’s why routing is, in practice, one of the most effective ways to lower the cost of an AI project.
How much you can save
Let me start with what almost everyone wants to know before reading the rest: how much you save.
An example with public OpenRouter prices (June 2026): DeepSeek V4 Flash costs about $0.09 per million input tokens and $0.18 output, while a premium model like Claude Opus runs around $5 input and $25 output. For a workload of 50 million input tokens and 10 million output per month:
- Everything on premium: on the order of $500 a month.
- 90% on cheap, 10% on premium: around $55 a month.
Monthly cost (example: 50M tokens in / 10M out)
All premium ████████████████████████████ ~$500
With routing ███ ~$55
↓ ~89% less
That’s close to a 9x reduction just by separating the easy tasks from the hard ones. The exact numbers depend on your traffic mix, but the order of magnitude holds: when the expensive model goes from solving 100% to solving 10%, the bill drops almost proportionally.
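The arithmetic above can be reproduced in a few lines. This is only a sketch: `PRICES` and `monthly_cost` are illustrative names, and it assumes the cheap and premium traffic share the same input/output mix, which real traffic rarely does exactly.

```python
# Prices in USD per million tokens, taken from the example above.
PRICES = {
    "cheap": {"input": 0.09, "output": 0.18},
    "premium": {"input": 5.00, "output": 25.00},
}


def monthly_cost(input_m: float, output_m: float, cheap_share: float) -> float:
    """Monthly cost in USD; volumes are expressed in millions of tokens.

    Assumes both tiers see the same input/output mix.
    """
    total = 0.0
    for tier, share in (("cheap", cheap_share), ("premium", 1.0 - cheap_share)):
        price = PRICES[tier]
        total += share * (input_m * price["input"] + output_m * price["output"])
    return total


all_premium = monthly_cost(50, 10, cheap_share=0.0)
routed = monthly_cost(50, 10, cheap_share=0.9)

print(f"all premium:  ${all_premium:.2f}")            # $500.00
print(f"with routing: ${routed:.2f}")                 # $55.67
print(f"saving:       {1 - routed / all_premium:.0%}")  # 89%
```

Changing `cheap_share` is a quick way to see how sensitive the bill is to your own traffic mix before committing to the extra code.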
In one of my projects with this traffic pattern, the monthly bill dropped from roughly $430 to $60 after introducing routing, with about 90% of requests handled by the cheap model. These are rounded figures and depend on the project, but the order of magnitude holds as long as you have volume and a mix of tasks of uneven difficulty.
DeepSeek vs Claude: which tasks the cheap model can handle
The “without losing quality” promise only holds if you know which tasks the cheap model can take on and which it can’t. This is the part that really decides whether routing works: picking the right model for each type of task, not the most capable one for everything.
My rule, based on tests over real traffic, ended up like this:
| Task type | DeepSeek quality vs Claude | Needs premium? |
|---|---|---|
| Classification | Equivalent | No |
| Field extraction | Equivalent | No |
| Summaries | ~98% of premium | Rarely |
| User-facing writing | ~90% | Yes |
| Multi-step reasoning | ~70% | Yes |
These are approximate figures from my tests over real traffic, not a formal benchmark, but the pattern is clear: where the output follows a predictable, verifiable format (classify, extract, summarize), the cheap model performs very close to premium. Where you need open-ended reasoning, strict adherence to long instructions, or nuanced writing, premium does make a difference. That’s why DeepSeek isn’t “the model”, it’s the default model, with an escape route to something more capable.
How to set up OpenRouter
OpenRouter exposes dozens of models from different providers behind a single API compatible with the OpenAI SDK. Switching models is switching a string, which is exactly what routing needs. The setup is minimal: you change baseURL and the API key.
// lib/openrouter.ts
import OpenAI from "openai";

// OpenRouter speaks the same protocol as OpenAI, only the endpoint changes.
export const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
  // HTTP-Referer and X-Title are optional; OpenRouter uses them for attribution.
  defaultHeaders: {
    "HTTP-Referer": "https://ramonchancay.me",
    "X-Title": "Personal Site",
  },
});

# lib/openrouter.py
import os

from openai import OpenAI

# OpenRouter speaks the same protocol as OpenAI, only the endpoint changes.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    # HTTP-Referer and X-Title are optional; OpenRouter uses them for attribution.
    default_headers={
        "HTTP-Referer": "https://ramonchancay.me",
        "X-Title": "Personal Site",
    },
)

<?php
// lib/openrouter.php (openai-php/client package)
use OpenAI;

// OpenRouter speaks the same protocol as OpenAI, only the endpoint changes.
$client = OpenAI::factory()
    ->withBaseUri("openrouter.ai/api/v1")
    ->withApiKey(getenv("OPENROUTER_API_KEY"))
    // HTTP-Referer and X-Title are optional; OpenRouter uses them for attribution.
    ->withHttpHeader("HTTP-Referer", "https://ramonchancay.me")
    ->withHttpHeader("X-Title", "Personal Site")
    ->make();

Model IDs follow the provider/model format, for example deepseek/deepseek-v4-flash or anthropic/claude-opus-4.8. With this you already have access to every model; the interesting part isn’t the connection, it’s the logic that decides which one each request goes to.
How to decide which model handles each request
Here’s the heart of it. There are two approaches, and it’s worth understanding the trade-off before choosing.
The first is rule-based routing: you classify the task by its type and map each type to a model. It’s deterministic, debuggable, and free to run because there’s no extra call. The price you pay is maintenance: you define the rules by hand and have to adjust them when new cases show up.
The second is estimated-complexity routing: a small, cheap model evaluates the request and decides whether it needs the big model. It’s more flexible, but it adds latency and an extra call, and introduces one more point of failure. I started with rules because most of my tasks fell into clear categories, and predictability was worth more than flexibility.
The rule-based flow is straightforward:
Request
│
▼
Task type
│
├── classify ──────► DeepSeek
├── extract ───────► DeepSeek
├── summarize ─────► DeepSeek
├── draft ─────────► Claude
└── reason ────────► Claude
│
▼
response
Which in code is a single routing table:
// lib/model-router.ts
// Model tiers: the cheap one handles the bulk, premium is the exception.
const MODELS = {
  cheap: "deepseek/deepseek-v4-flash",
  premium: "anthropic/claude-opus-4.8",
} as const;

type TaskType = "classify" | "extract" | "summarize" | "draft" | "reason";

// The explicit mapping is the "routing policy": easy to read and audit.
const ROUTING: Record<TaskType, keyof typeof MODELS> = {
  classify: "cheap",
  extract: "cheap",
  summarize: "cheap",
  draft: "premium", // user-facing writing: here I do pay for quality
  reason: "premium", // multi-step reasoning: the cheap one falls short
};

export function pickModel(task: TaskType): string {
  return MODELS[ROUTING[task]];
}

# lib/model_router.py
# Model tiers: the cheap one handles the bulk, premium is the exception.
MODELS = {
    "cheap": "deepseek/deepseek-v4-flash",
    "premium": "anthropic/claude-opus-4.8",
}

# The explicit mapping is the "routing policy": easy to read and audit.
ROUTING = {
    "classify": "cheap",
    "extract": "cheap",
    "summarize": "cheap",
    "draft": "premium",  # user-facing writing: here I do pay for quality
    "reason": "premium",  # multi-step reasoning: the cheap one falls short
}


def pick_model(task: str) -> str:
    return MODELS[ROUTING[task]]

<?php
// lib/model_router.php
// Model tiers: the cheap one handles the bulk, premium is the exception.
const MODELS = [
    "cheap" => "deepseek/deepseek-v4-flash",
    "premium" => "anthropic/claude-opus-4.8",
];

// The explicit mapping is the "routing policy": easy to read and audit.
const ROUTING = [
    "classify" => "cheap",
    "extract" => "cheap",
    "summarize" => "cheap",
    "draft" => "premium", // user-facing writing: here I do pay for quality
    "reason" => "premium", // multi-step reasoning: the cheap one falls short
];

function pick_model(string $task): string {
    return MODELS[ROUTING[$task]];
}

The first version wasn’t like this: I had the routing scattered across several if statements throughout the code. It worked, but when the bill went up it was impossible to tell why a request had ended up on the expensive model. After a few weeks I moved it to this table, and the difference wasn’t performance but debuggability: the routing policy became an object you can read at a glance. When someone asks “why does this flow cost so much?”, the answer is in a single table, not buried in conditionals spread across the code.
With pickModel resolving the destination, the call itself is the usual OpenAI SDK one: you pass the model the router returns and, for structured tasks, a low temperature so the output is stable. It’s worth centralizing that call in a single function that returns, alongside the text, the model that was used. That field looks like a detail, but it’s what later lets you measure what percentage of traffic went to each tier. Without it, optimizing costs is flying blind.
The routing table covers the default case, but it’s worth leaving a manual escape hatch: an optional override (a flag or an explicit tier) on the call function itself, so the caller can force the model when it knows something the table doesn’t. For example, a classification that is normally cheap but needs more capability in one particular flow. The rule decides by default; the override is the escape for the cases the rule doesn’t see.
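A minimal sketch of such a centralized call, under a few assumptions: the function name `complete`, the `tier` override parameter and the `STRUCTURED_TASKS` set are illustrative, not part of any SDK, and the client is passed in as an argument so the function can be exercised without network access. The routing tables are repeated so the block is self-contained.

```python
from typing import Any, Optional

MODELS = {
    "cheap": "deepseek/deepseek-v4-flash",
    "premium": "anthropic/claude-opus-4.8",
}
ROUTING = {
    "classify": "cheap",
    "extract": "cheap",
    "summarize": "cheap",
    "draft": "premium",
    "reason": "premium",
}
# Hypothetical choice: task types whose output is structured and benefits
# from a low temperature.
STRUCTURED_TASKS = {"classify", "extract"}


def complete(client: Any, task: str, prompt: str, tier: Optional[str] = None) -> dict:
    """Run one request and report which model served it.

    `tier` is the manual escape hatch: when given, it overrides the table.
    """
    chosen_tier = tier or ROUTING[task]
    model = MODELS[chosen_tier]
    kwargs = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if task in STRUCTURED_TASKS:
        kwargs["temperature"] = 0.2
    response = client.chat.completions.create(**kwargs)
    text = response.choices[0].message.content or ""
    # Returning the model next to the text is what makes per-tier metrics possible.
    return {"text": text, "model": model, "tier": chosen_tier}
```

A caller that knows a particular classification is unusually hard would call `complete(client, "classify", prompt, tier="premium")`; everything else goes through the table.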
How to keep quality: validation and fallback
The risk with the cheap model isn’t that it’s bad, it’s that it fails differently. Sometimes it returns malformed JSON, sometimes it ignores a formatting instruction. The defense that worked for me is to validate the output and, if it doesn’t pass, retry with the premium model. That way the extra cost only shows up in the cases that genuinely need it.
Request
│
▼
DeepSeek
│
valid output?
│
├── Yes ──► return (cheap)
│
└── No ──► Claude ──► return (premium)
In code, that flow is:
// lib/extract-with-fallback.ts
import { client } from "./openrouter";

const CHEAP = "deepseek/deepseek-v4-flash";
const PREMIUM = "anthropic/claude-opus-4.8";

// Tolerant parsing: cleans the typical quirks of LLM output before giving up.
// Many "failures" of the cheap model are just formatting, not quality.
function parseJSON<T>(text: string): T {
  // Strip markdown fences: ```json ... ```
  const fence = text.match(/```(?:json)?\s*\n([\s\S]*?)\n\s*```/i);
  if (fence) text = fence[1];
  text = text.trim();
  // Remove trailing commas before } or ]
  text = text.replace(/,\s*([}\]])/g, "$1");
  return JSON.parse(text) as T;
}

// Validate the output is the JSON we expect before trusting it.
function isValid(raw: string): boolean {
  try {
    const data = parseJSON<{ category?: unknown; tags?: unknown }>(raw);
    return typeof data.category === "string" && Array.isArray(data.tags);
  } catch {
    return false;
  }
}

export async function extract(prompt: string) {
  // First attempt with the cheap model.
  const first = await client.chat.completions.create({
    model: CHEAP,
    messages: [{ role: "user", content: prompt }],
    temperature: 0.2,
    response_format: { type: "json_object" },
  });
  const cheapText = first.choices[0]?.message?.content ?? "";
  if (isValid(cheapText)) return { text: cheapText, model: CHEAP };

  // Only if the cheap one fails do we escalate to premium. The extra cost is the exception.
  const second = await client.chat.completions.create({
    model: PREMIUM,
    messages: [{ role: "user", content: prompt }],
    temperature: 0.2,
    response_format: { type: "json_object" },
  });
  return { text: second.choices[0]?.message?.content ?? "", model: PREMIUM };
}

# lib/extract_with_fallback.py
import json
import re

from openrouter import client

CHEAP = "deepseek/deepseek-v4-flash"
PREMIUM = "anthropic/claude-opus-4.8"


# Tolerant parsing: cleans the typical quirks of LLM output before giving up.
# Many "failures" of the cheap model are just formatting, not quality.
def parse_json(text: str):
    # Strip markdown fences: ```json ... ```
    fence = re.search(r"```(?:json)?\s*\n([\s\S]*?)\n\s*```", text, re.I)
    if fence:
        text = fence.group(1)
    text = text.strip()
    # Remove trailing commas before } or ]
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)


# Validate the output is the JSON we expect before trusting it.
def is_valid(raw: str) -> bool:
    try:
        data = parse_json(raw)
        return isinstance(data.get("category"), str) and isinstance(data.get("tags"), list)
    except Exception:
        return False


def extract(prompt: str) -> dict:
    # First attempt with the cheap model.
    first = client.chat.completions.create(
        model=CHEAP,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        response_format={"type": "json_object"},
    )
    cheap_text = first.choices[0].message.content or ""
    if is_valid(cheap_text):
        return {"text": cheap_text, "model": CHEAP}

    # Only if the cheap one fails do we escalate to premium. The extra cost is the exception.
    second = client.chat.completions.create(
        model=PREMIUM,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        response_format={"type": "json_object"},
    )
    return {"text": second.choices[0].message.content or "", "model": PREMIUM}

<?php
// lib/extract_with_fallback.php
require "openrouter.php"; // exposes $client

const CHEAP = "deepseek/deepseek-v4-flash";
const PREMIUM = "anthropic/claude-opus-4.8";

// Tolerant parsing: cleans the typical quirks of LLM output before giving up.
// Many "failures" of the cheap model are just formatting, not quality.
function parse_json(string $text): mixed {
    // Strip markdown fences: ```json ... ```
    if (preg_match('/```(?:json)?\s*\n([\s\S]*?)\n\s*```/i', $text, $m)) {
        $text = $m[1];
    }
    $text = trim($text);
    // Remove trailing commas before } or ]
    $text = preg_replace('/,\s*([}\]])/', '$1', $text);
    return json_decode($text, true, flags: JSON_THROW_ON_ERROR);
}

// Validate the output is the JSON we expect before trusting it.
function is_valid(string $raw): bool {
    try {
        $data = parse_json($raw);
        return is_string($data["category"] ?? null) && is_array($data["tags"] ?? null);
    } catch (\Throwable) {
        return false;
    }
}

function extract(string $prompt): array {
    global $client;
    // First attempt with the cheap model.
    $first = $client->chat()->create([
        "model" => CHEAP,
        "messages" => [["role" => "user", "content" => $prompt]],
        "temperature" => 0.2,
        "response_format" => ["type" => "json_object"],
    ]);
    $cheapText = $first->choices[0]->message->content ?? "";
    if (is_valid($cheapText)) {
        return ["text" => $cheapText, "model" => CHEAP];
    }

    // Only if the cheap one fails do we escalate to premium. The extra cost is the exception.
    $second = $client->chat()->create([
        "model" => PREMIUM,
        "messages" => [["role" => "user", "content" => $prompt]],
        "temperature" => 0.2,
        "response_format" => ["type" => "json_object"],
    ]);
    return ["text" => $second->choices[0]->message->content ?? "", "model" => PREMIUM];
}

This pattern (try cheap, validate, escalate if needed) gave me the best balance. The key is that validation must be cheap and objective: parse JSON, check required fields, verify length. If your validation requires another LLM call, you lose part of the savings.
The tolerant parseJSON above matters more than it seems. At first I was escalating to premium a lot because I used JSON.parse directly, and I assumed the cheap model wasn’t up to it. That wasn’t it: a good chunk of those fallbacks weren’t quality errors but formatting ones. The cheap model wrapped the response in a ```json block or left a trailing comma, and a bare JSON.parse marked it as failed. The day I added tolerant parsing, a good part of the fallbacks disappeared without touching the model or the prompt. Before blaming the cheap model, make sure you aren’t escalating over a stray code fence or a trailing comma.
A detail that costs money if you overlook it: response_format: { type: "json_object" } forces the model to respond in JSON, but it doesn’t protect against truncation. If the response is long and the model hits its max_tokens limit, the JSON gets cut in half and arrives malformed. Your validator catches it and escalates to premium, which is the right reaction to invalid output, but that fallback was avoidable: the cheap model’s quality didn’t fail, it just ran out of room to finish. So it’s worth setting a generous max_tokens on the cheap model, comfortably above the maximum output size you expect. It’s the difference between escalating to premium because it was truly needed and escalating because you cut off the response yourself.
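One way to tell the two situations apart is the `finish_reason` field that OpenAI-compatible responses carry on each choice: it is `"length"` when generation stopped at the token limit. The sketch below is a hypothetical policy, not something the original code does: the function name `next_step` and the idea of retrying once on the cheap model with a larger `max_tokens` are my assumptions.

```python
def next_step(finish_reason: str, output_is_valid: bool) -> str:
    """Decide what to do after the cheap model answers.

    Returns "accept", "retry_cheap" (same model, larger max_tokens) or "escalate".
    """
    if output_is_valid:
        return "accept"
    if finish_reason == "length":
        # The output was cut off by max_tokens: a budget problem, not a quality one.
        return "retry_cheap"
    # The model finished on its own and the output is still invalid.
    return "escalate"
```

In the extract flow this would sit between validation and the premium call, reading `first.choices[0].finish_reason`. Logging how often `"retry_cheap"` fires is also a cheap signal that `max_tokens` is set too low.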
Structured Outputs reduce fallbacks
If the model supports Structured Outputs (output forced against a JSON Schema), use it instead of validating by hand. The difference versus json_object is that it doesn’t only guarantee valid JSON, it guarantees your schema: the required fields, the types, and the enums. That removes a good chunk of validation code and, above all, lowers the fallback rate, because the cheap model stops failing by deviating from the format. Manual validation is still useful as a safety net, but it shifts to covering only semantic errors, not formatting ones.
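As an illustration, this is roughly what the request option looks like in the OpenAI-style `json_schema` response format, for the same `category` and `tags` fields validated earlier. The schema name and the enum values are hypothetical, and whether a given model honors this format depends on the model and provider, so treat it as a sketch to verify against your own mix.

```python
# Hypothetical enum values for the "category" field.
CATEGORIES = ["billing", "support", "sales"]

RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_extraction",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": CATEGORIES},
                "tags": {"type": "array", "items": {"type": "string"}},
            },
            # Strict mode expects every property to be listed as required
            # and additional properties to be disallowed.
            "required": ["category", "tags"],
            "additionalProperties": False,
        },
    },
}
```

It replaces `{"type": "json_object"}` in the `response_format` argument of the call. The manual validator stays in place for the models that ignore the schema.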
Trade-offs worth keeping in mind
- Fallback latency: when a request escalates to premium, the user waits for two calls instead of one. If latency matters in that flow, it’s worth measuring the 95th percentile, not just the average.
- Validation cost: if validating is complex, the savings erode. Keep validation in code, not in another model call.
- Silent quality drift: the cheap model can degrade without triggering your validation if it’s loose. It’s worth reviewing real samples now and then, not just trusting that the JSON parses.
- Dependence on an intermediary provider: OpenRouter is a single point in the path. In exchange for the convenience of one API, you accept that its availability is part of yours.
Route by response type, not just by difficulty
Task difficulty isn’t the only signal. The shape of the response also decides the tier and how you make the call. In practice I ended up with three distinct patterns:
- Short structured JSON (classify, extract, decide the routing itself): cheap tier, one-shot call, tolerant parsing. This is where DeepSeek shines: bounded, verifiable, high-volume output.
- Conversation / chat: cheap tier but with a streaming response, because the user expects to see the text appear. The required quality is medium and the volume high, so the cheap model with streaming is the sweet spot.
- Long user-facing generation (reports of several thousand tokens): premium tier with streaming. Here quality is the product, the cost per request is high but the volume is low, so paying for premium is justified.
It’s the same router, but the tier depends on four things: difficulty, format, whether it’s user-facing, and volume. A short structured call and a several-thousand-token report can’t go to the same model just because they belong to the same part of the product. Separating them by response pattern was as important as separating them by difficulty.
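The three patterns above can live in the same kind of table as the task routing. A minimal sketch, where `CallPlan`, the pattern names and the `tolerant_parse` flag are illustrative names of mine:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallPlan:
    tier: str             # "cheap" or "premium"
    stream: bool          # stream tokens to the user as they arrive
    tolerant_parse: bool  # run the tolerant JSON parser on the result


# One entry per response pattern described above.
PATTERNS = {
    "structured_json": CallPlan(tier="cheap", stream=False, tolerant_parse=True),
    "chat": CallPlan(tier="cheap", stream=True, tolerant_parse=False),
    "long_report": CallPlan(tier="premium", stream=True, tolerant_parse=False),
}


def plan_for(pattern: str) -> CallPlan:
    """Look up how a call with this response pattern should be made."""
    return PATTERNS[pattern]
```

Keeping the streaming decision next to the tier means a feature that mixes a short structured call with a long report declares two patterns instead of inheriting one model for both.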
Observability: without metrics, routing is faith
The most common mistake I see is implementing routing and then not knowing whether it actually saves money. Optimizing without measuring is guessing. For each request it’s worth logging, at a minimum:
- model used (cheap or premium)
- task type / endpoint
- tokens in and out
- estimated cost of the call
- duration (to watch fallback latency)
- whether there was a fallback and why
- user or tenant, if you need to attribute spend
In practice I compute the cost of each call from a per-model price map and the tokens the response returns. Keeping prices in a table, instead of a magic number, makes adding a new model a one-line change:
// lib/llm-cost.ts
// Price per million tokens (USD). Source: each provider's pricing pages.
const MODEL_COSTS: Record<string, { input: number; output: number }> = {
  "deepseek/deepseek-v4-flash": { input: 0.09, output: 0.18 },
  "anthropic/claude-opus-4.8": { input: 5.0, output: 25.0 },
};

const PER_MILLION = 1_000_000;

export function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const costs = MODEL_COSTS[model];
  if (!costs) return 0; // unknown model: log it and review the map
  return (inputTokens / PER_MILLION) * costs.input + (outputTokens / PER_MILLION) * costs.output;
}

# lib/llm_cost.py
# Price per million tokens (USD). Source: each provider's pricing pages.
MODEL_COSTS = {
    "deepseek/deepseek-v4-flash": {"input": 0.09, "output": 0.18},
    "anthropic/claude-opus-4.8": {"input": 5.0, "output": 25.0},
}

PER_MILLION = 1_000_000


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    costs = MODEL_COSTS.get(model)
    if not costs:
        return 0.0  # unknown model: log it and review the map
    return (input_tokens / PER_MILLION) * costs["input"] + (output_tokens / PER_MILLION) * costs["output"]

<?php
// lib/llm_cost.php
// Price per million tokens (USD). Source: each provider's pricing pages.
const MODEL_COSTS = [
    "deepseek/deepseek-v4-flash" => ["input" => 0.09, "output" => 0.18],
    "anthropic/claude-opus-4.8" => ["input" => 5.0, "output" => 25.0],
];

const PER_MILLION = 1_000_000;

function calculate_cost(string $model, int $inputTokens, int $outputTokens): float {
    $costs = MODEL_COSTS[$model] ?? null;
    if ($costs === null) {
        return 0.0; // unknown model: log it and review the map
    }
    return ($inputTokens / PER_MILLION) * $costs["input"] + ($outputTokens / PER_MILLION) * $costs["output"];
}

I log it fire-and-forget: the metric must not block or break the user’s response. If the insert fails, I log it and move on, never taking down the request over a telemetry problem:
// lib/track-usage.ts
import { calculateCost } from "./llm-cost";
import { db } from "./db"; // your database client (Supabase, Postgres, etc.)

// "feature" labels which part of the product the call came from, to break down
// spend by functionality and not just by model.
export function trackUsage(params: {
  model: string;
  feature: string;
  inputTokens: number;
  outputTokens: number;
  fellBack: boolean;
}) {
  const cost = calculateCost(params.model, params.inputTokens, params.outputTokens);
  // Fire-and-forget: no await on the critical path of the response.
  // Promise.resolve() normalizes clients whose insert() returns a thenable without .catch().
  void Promise.resolve(db.from("llm_usage").insert({ ...params, cost })).catch((err) => {
    console.error("track usage error", err);
  });
}

# lib/track_usage.py
import asyncio
import logging

from llm_cost import calculate_cost
from db import db  # your database client (Supabase, Postgres, etc.)

# The event loop keeps only weak references to tasks; hold them here until they finish.
_pending: set = set()


# "feature" labels which part of the product the call came from, to break down
# spend by functionality and not just by model.
def track_usage(model: str, feature: str, input_tokens: int, output_tokens: int, fell_back: bool) -> None:
    cost = calculate_cost(model, input_tokens, output_tokens)
    row = {
        "model": model, "feature": feature,
        "input_tokens": input_tokens, "output_tokens": output_tokens,
        "fell_back": fell_back, "cost": cost,
    }

    # Fire-and-forget: don't block the critical path of the response.
    async def _insert():
        try:
            await db.table("llm_usage").insert(row)
        except Exception as err:
            logging.error("track usage error: %s", err)

    # Must be called from code already running inside the event loop.
    task = asyncio.create_task(_insert())
    _pending.add(task)
    task.add_done_callback(_pending.discard)

<?php
// lib/track_usage.php
require "llm_cost.php";
require "db.php"; // your database client (Supabase, Postgres, etc.)

// "feature" labels which part of the product the call came from, to break down
// spend by functionality and not just by model.
function track_usage(string $model, string $feature, int $inputTokens, int $outputTokens, bool $fellBack): void {
    global $db;
    $cost = calculate_cost($model, $inputTokens, $outputTokens);
    $row = compact("model", "feature", "inputTokens", "outputTokens", "fellBack") + ["cost" => $cost];
    // Fire-and-forget: never break the response over a telemetry failure.
    try {
        $db->table("llm_usage")->insert($row);
    } catch (\Throwable $err) {
        error_log("track usage error: " . $err->getMessage());
    }
}

With those fields you can build a dashboard that answers the questions that matter: what percentage of traffic the cheap model resolves, which task types escalate to premium the most, and how much each feature costs. The metric that served me best was the percentage of traffic resolved by the cheap model. If it drops, something changed: either the requests got harder, or your validation got too strict and is escalating too much.
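Those dashboard numbers are simple aggregations over the logged rows. A sketch in plain Python, assuming each row is a dict with the `model`, `feature`, `fell_back` and `cost` keys used by the Python logger; in practice this would more likely be a SQL query against the `llm_usage` table, and the function names here are illustrative.

```python
from collections import defaultdict


def cheap_share(rows: list[dict], cheap_model: str) -> float:
    """Fraction of requests whose final answer came from the cheap model."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["model"] == cheap_model) / len(rows)


def fallback_rate_by_feature(rows: list[dict]) -> dict[str, float]:
    """Share of requests per feature that escalated to premium."""
    totals: dict[str, int] = defaultdict(int)
    fallbacks: dict[str, int] = defaultdict(int)
    for r in rows:
        totals[r["feature"]] += 1
        if r["fell_back"]:
            fallbacks[r["feature"]] += 1
    return {feature: fallbacks[feature] / totals[feature] for feature in totals}


def cost_by_feature(rows: list[dict]) -> dict[str, float]:
    """Total estimated spend per feature."""
    spend: dict[str, float] = defaultdict(float)
    for r in rows:
        spend[r["feature"]] += r["cost"]
    return dict(spend)
```

Tracking `cheap_share` over time, rather than as a single number, is what surfaces the drift described above.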
A useful complement is to also store an audit trail of the messages (system, user, assistant) truncated to a few hundred characters. Not to read them all, but to be able to reconstruct why a specific request escalated to premium when something looks off on the dashboard.
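A small helper is enough for that truncation. This is a sketch: the name `truncate_messages` and the 300-character default are placeholders, and it assumes messages are plain dicts with string `content`.

```python
def truncate_messages(messages: list[dict], limit: int = 300) -> list[dict]:
    """Return copies of the chat messages with content cut to `limit` characters."""
    trimmed = []
    for message in messages:
        content = message.get("content") or ""
        if len(content) > limit:
            # Mark the cut so a reader of the audit trail knows text is missing.
            content = content[:limit] + "..."
        trimmed.append({"role": message["role"], "content": content})
    return trimmed
```

The originals are left untouched, so the same list can still be sent to the model after being logged.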
Routing is not the only lever: prompt caching
Routing decides which model. Prompt caching decides how much you pay for the context you repeat. If your system prompt is large (instructions, examples, output schema) and repeats on every request, caching it makes the input dramatically cheaper: the cached portion is billed at a fraction of the normal price. In flows with a stable system prompt and many requests, caching moves the bill as much as routing does.
This has an architectural consequence worth understanding before adding providers: caching is usually provider-specific. If you fragment too much across different providers to save a few cents per token, you can lose caching and end up paying more. So it’s worth measuring the effective cost with cache per model, not the list price. Sometimes the “more expensive” model with a cached system prompt comes out cheaper than the “cheap” one without cache. OpenRouter passes caching through for the models that support it, so it’s worth checking which ones in your mix take advantage of it before deciding routing on nominal price alone.
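The effective price is easy to estimate once you know two numbers from your own logs and your provider's pricing: what share of input tokens hits the cache, and how much a cached token costs relative to a normal one. Both are placeholders in this sketch, and it deliberately ignores any different price for writing to the cache, which some providers charge.

```python
def effective_input_price(list_price: float, cached_fraction: float, cache_multiplier: float) -> float:
    """USD per million input tokens once part of the prompt is served from cache.

    cached_fraction: share of input tokens that hit the cache, between 0 and 1.
    cache_multiplier: price of a cached token relative to a normal one (e.g. 0.1).
    """
    uncached_fraction = 1.0 - cached_fraction
    return list_price * (uncached_fraction + cached_fraction * cache_multiplier)


# Hypothetical scenario: 90% of the input is a cached system prompt billed at 10%.
premium_effective = effective_input_price(5.00, cached_fraction=0.9, cache_multiplier=0.1)
print(f"${premium_effective:.2f} per million input tokens")
```

Running the same function for every model in the mix, with its real cache terms, gives the comparison that matters: effective price against effective price, not list price against list price.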
In one of my projects the dashboard showed that around 90% of the traffic was resolved by the cheap model and only about 6% of requests escalated to premium due to failed validation. Rounded figures, but the shape of the curve is what matters: the bulk is resolved cheaply and premium is the exception.
Limitations and where routing gets complicated
Routing isn’t free in complexity, and there are edges worth being clear about before adopting it:
- Different context windows: each model has its limit. A prompt that fits in premium may not fit in the cheap one (or vice versa). If you route dynamically, validate that the prompt fits in the destination model.
- Very long prompts: the more context, the more the capability gap between models shows, and the cheap one tends to lose the thread sooner. Long prompts are usually premium candidates.
- Tool calling and function calling: not all models handle tools equally well, and the call format can vary. If your flow depends on tools, test each model separately before routing traffic to it.
- Vision and images: if the task includes images, the set of valid models shrinks and routing by task type has to account for it.
- Same prompt, different response: two models respond differently to the same prompt. A prompt tuned for Claude may perform worse on DeepSeek without adjustments. Don’t assume the prompt is portable.
None of these is a reason to discard routing, but they are reasons not to treat it as a magic switch. Every model you add to the mix is one more model to test and maintain.
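The first limitation, making sure the prompt fits the destination model, can be guarded with a pre-flight check. A rough sketch: the limits in `CONTEXT_LIMITS` are placeholder values rather than the real windows of any model, and the four-characters-per-token estimate is a crude heuristic for English text that a real tokenizer should replace.

```python
# Placeholder context windows in tokens; look up the real values for your models.
CONTEXT_LIMITS = {
    "cheap": 64_000,
    "premium": 200_000,
}


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token."""
    return len(text) // 4 + 1


def fits(prompt: str, tier: str, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output fits in the tier's window."""
    return estimate_tokens(prompt) + max_output_tokens <= CONTEXT_LIMITS[tier]


def pick_tier_that_fits(prompt: str, preferred: str, max_output_tokens: int) -> str:
    """Keep the routed tier if the prompt fits, otherwise try the other tiers."""
    if fits(prompt, preferred, max_output_tokens):
        return preferred
    for tier in CONTEXT_LIMITS:
        if tier != preferred and fits(prompt, tier, max_output_tokens):
            return tier
    raise ValueError("prompt does not fit in any configured model")
```

Failing loudly when nothing fits is intentional here: silently truncating the prompt would turn a routing problem into a quality problem that no validator sees.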
When NOT to use model routing
Routing adds complexity, and it isn’t always worth it. I wouldn’t add it if:
- You make fewer than ~100 calls a day: the savings are a few dollars a month and don’t pay for the extra code you have to maintain.
- Almost all your calls are complex: if everything needs deep reasoning, there’s no “easy” traffic to move to the cheap model and premium is justified for almost everything.
- Your application always needs maximum quality: in domains where an error is costly (legal, medical, financial), the price difference matters less than the risk of degrading.
- Simplicity matters more than saving a few dollars: a single model is easier to reason about, debug, and maintain. Sometimes that simplicity is worth more than the bill.
Routing pays off when you have two things at once: volume and a mix of tasks of uneven difficulty. If you’re missing one, you probably don’t need it yet.
Frequently asked questions
Is model routing worth it for a small project?
It depends on volume. If you make few requests a day, the difference in the bill is marginal and the added complexity doesn’t pay off. Routing starts to pay when traffic is enough for the price gap between models to be visible on the monthly bill. Below that, using a single model is simpler and reasonable.
Is DeepSeek good enough for production?
For structured, verifiable tasks—classification, extraction, summaries—in my experience it performs very close to much more expensive models. Where it falls short is open-ended multi-step reasoning and strict adherence to long instructions. The strategy that works isn’t “DeepSeek for everything”, it’s “DeepSeek by default with an escape to premium when validation fails”.
Does OpenRouter add much latency vs calling the provider directly?
It adds a network hop because it acts as an intermediary, so there’s some extra latency. In practice it was small compared to the model’s own generation time. If latency is critical in a specific flow, it’s worth measuring it in your environment before deciding.
What if OpenRouter or the cheap model fails?
OpenRouter lets you specify fallback models natively: you pass an array in the models parameter (for example ["deepseek/deepseek-v4-flash", "anthropic/claude-opus-4.8"]) and if the first doesn’t respond, it tries the next. It’s worth having, but there’s a subtle distinction that matters: the native fallback only fires when the call itself fails, for example when the provider is down, rate-limited, or returns a 5xx error. It doesn’t look at the content of the response.
The code fallback I showed earlier covers the other case, which in practice is the most frequent: the request did respond with a 200, but the content is semantically incorrect or the JSON is malformed. There the native fallback doesn’t act because, as far as OpenRouter is concerned, the call was a success. That’s why I use both levels: the native array protects me from a provider being down, and my validation protects me from the cheap model returning a 200 whose content is garbage.
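With the OpenAI Python SDK, the native fallback list has to travel in `extra_body`, because the SDK's `create` method has no argument for it. The sketch below follows my reading of OpenRouter's documentation, where the extra field is called `models` and lists the models to try after the primary one; verify it against the current docs before relying on it. `request_kwargs` is an illustrative helper that only builds the arguments.

```python
CHEAP = "deepseek/deepseek-v4-flash"
PREMIUM = "anthropic/claude-opus-4.8"


def request_kwargs(prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": CHEAP,  # primary model
        "messages": [{"role": "user", "content": prompt}],
        # extra_body carries request fields the SDK does not model itself.
        # Assumption: OpenRouter tries these models if the primary call errors out.
        "extra_body": {"models": [PREMIUM]},
    }
```

The call then becomes `client.chat.completions.create(**request_kwargs(prompt))`. The `model` field of the response tells you which model actually answered, which is the value to log when this fallback fires.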
Rule-based routing or estimated-complexity routing?
Start with rules. It’s deterministic, adds no latency or extra call, and is trivial to debug. Estimated-complexity routing—using a small model to decide—only pays off when your tasks don’t fall into clear categories and you need flexibility. In most real cases, an explicit routing table covers the vast majority of the traffic.
Conclusion
The biggest mistake I see is assuming there’s a single ideal model for every task. In practice, different problems require different levels of capability. Separating the simple tasks from the complex ones through routing was one of the highest cost-impact optimizations I implemented, and it also left the architecture ready to bring in new models without changing the business logic. Start with rules, validate the cheap model’s output, measure everything, and let the numbers tell you how much DeepSeek can take on before escalating to Claude.