Quality / cost / latency: the routing triangle
You cannot optimize all three at once. What you actually do is pick which one is the constraint, which one is the objective, and which one is allowed to slip. Here's how that shows up in a route-switch config.
If you’ve been around classical systems work, you know “consistency, availability, partition tolerance — pick two” isn’t literally true, but it’s a useful way to talk about trade-offs. LLM serving has a similar triangle:
- Quality — whatever “the answer was good” means for your task. Could be exact-match against ground truth, could be a human-judged score, could just be “the user didn’t escalate to a human.”
- Cost — dollars per call, mostly driven by which model you used and how many tokens were in the prompt + completion.
- Latency — wall-clock time from request to response.
You cannot optimize all three at once. This is not a fact about LLMs; it’s a fact about how providers price their offerings. The best-quality models are the slowest and the most expensive. The cheapest models are the fastest and the worst. The middle of the market is exactly that — a middle.
What an LLM gateway is actually doing, philosophically, is making the trade-off explicit. Route-Switch makes you write it down as a config. Here’s how that looks for the three common postures.
Posture 1: Quality is the constraint
This is “the answer has to be right; cost is fine; latency can be within reason.”
Common cases: legal drafting, medical summarization, code generation for production systems, anything where being wrong is more expensive than the inference call.
Config shape:
gateway:
strategy: "weighted_round_robin"
combinations:
- id: "primary"
model: "claude-3-5-sonnet"
provider: "anthropic"
weight: 80
fallbacks: ["secondary"]
- id: "secondary"
model: "gpt-4o"
provider: "openai"
weight: 20
fallbacks: []
fallback_threshold: 0.92
The high fallback_threshold is the giveaway. You’re saying: if
either combination’s success rate drops below 92%, something is
wrong — promote the other, page someone, do not silently degrade. The
weighted strategy isn’t about balancing cost; it’s about A/B-ing
two strong models against your evaluator. You’ll learn which one is
actually winning on your dataset, which is rarely the same answer as the
public benchmarks.
Cost in this posture is a fact you accept, not a variable you optimize.
Posture 2: Cost is the constraint
This is “the cheaper the better; the answer being slightly worse is acceptable; latency is fine as long as it’s under a few seconds.”
Common cases: bulk classification, internal tooling, anything embedded in a free tier, log enrichment, search query rewriting.
Config shape:
gateway:
strategy: "performance_based"
combinations:
- id: "cheap-primary"
model: "gpt-4o-mini"
provider: "openai"
weight: 70
fallbacks: ["cheap-secondary", "premium-fallback"]
- id: "cheap-secondary"
model: "claude-3-5-haiku"
provider: "anthropic"
weight: 30
fallbacks: ["premium-fallback"]
- id: "premium-fallback"
model: "gpt-4o"
provider: "openai"
weight: 0
fallbacks: []
fallback_threshold: 0.75
Two things to notice. First, the primary combinations are both small
models — you’re routing across providers’ cheap tiers, not
between cheap and expensive. Second, the premium-fallback has weight 0:
it never gets traffic in steady state, only when both cheap combinations
drop below 0.75 success rate. That’s the cost guard. You’re
saying “I would rather pay more occasionally than be wrong all the
time.”
performance_based strategy here is doing real work: as soon as one of the
cheap models starts faltering on a class of queries, its weight drops and
the other one absorbs traffic. The optimizer (turned on) is doing more real
work, because the cheap models are exactly the ones where good prompts
matter most.
Posture 3: Latency is the constraint
This is “p95 latency must be under N ms; quality matters but we can live with the second-best model; cost is what it is.”
Common cases: chat UIs, autocomplete, anything where the user is staring at a spinner.
Config shape:
gateway:
strategy: "performance_based"
combinations:
- id: "fast-primary"
model: "gpt-4o-mini"
provider: "openai"
weight: 60
fallbacks: ["fast-local"]
- id: "fast-local"
model: "llama3.2"
provider: "ollama"
weight: 40
fallbacks: []
fallback_threshold: 0.80
The shape here is shaped by an honest constraint:
the gateway’s performance_based strategy in route-switch
balances on success rate and response time average. You don’t get
to write “p95 latency under 800 ms” in the config today. What
you can do is exclude any model whose typical latency busts your budget, and
include an Ollama-backed local model as a fallback (or as a primary, if your
deployment supports it).
If you absolutely need a p95 enforcement layer, that lives upstream of the gateway today — a client-side timeout or a sidecar. We’d like to add explicit latency-budget routing; we have not.
What changes when you turn on optimization
The MIPROv2 optimizer doesn’t change which corner of the triangle you’re aiming for; it changes how efficient you are at that corner.
A cost-constrained config with optimization on has a different shape over time than the same config with optimization off. The cheap-model combinations slowly get better at the specific task they’re serving, because the optimizer is rewriting their instructions and picking better few-shot demos. The success rate creeps up. The fallback to the premium model fires less often. Cost goes down.
This is the actual value proposition of pairing routing with optimization in one binary: the routing strategy keeps you on the cost frontier; the optimizer pushes the frontier outward.
A quality-constrained config gets the same dynamic in a different shape: the two strong-model combinations get more aligned to your task over time, and the gap between “chose Claude” and “chose GPT-4o” narrows. You spend less time arguing about which model is better and more time arguing about whether the score function is right.
The trade-off you can’t escape
Whatever posture you pick, you are spending one of the three corners. The
gateway makes that spend legible. /v1/system/analytics shows you the
aggregate cost, success rate, and average latency. /v1/prompts/{id}/stats
shows you the same broken down per template. The point of the analytics
isn’t to optimize; it’s to let you check that the trade-off you
thought you were making is the trade-off you’re actually making.
That’s usually where the surprise is. The team that thought it was cost-optimizing finds out it’s been latency-optimizing all along, because the primary combination has been quietly slow and the fallbacks have been absorbing 30% of traffic. The team that thought it was quality-optimizing finds out it’s been cost-optimizing, because the weighted strategy was set to 90/10 and the 90 was the cheaper model.
A note on what the triangle looks like per-prompt
The trilemma framing is most useful at the prompt level, not at the system level. Different templates inside the same product will sit at different corners. Your support reply prompt might be quality-constrained because escalation is expensive. Your log-summarization prompt is almost certainly cost-constrained because nobody is watching the output in real time. Your autocomplete prompt is latency-constrained because the user left the page if you took 900 ms.
Route-Switch lets you configure each combination independently. A single
gateway instance can hold the support combinations in a
quality-constrained posture (weighted_round_robin between two strong
models, high fallback threshold) and the log-summarization combinations
in a cost-constrained posture (performance_based between cheap models,
premium fallback at weight 0) at the same time. They share the
analytics store, so when you ask “what did we spend on LLM calls
last week,” the answer rolls up across both. They don’t share
strategies, so the cheap-model collapse won’t accidentally drag
your support reply into a fallback.
This is the part where the triangle stops being a marketing diagram and starts being a per-prompt knob. The dashboards aren’t reporting “your gateway’s posture”; they’re reporting each prompt’s posture, and you can argue with the people who own each one separately.
The triangle doesn’t go away. What changes is whether you’re choosing your corner or your corner is choosing you.