Multi-model routing as a feature, not a hack

Almost every team that runs more than one LLM in production starts the same way: an if model == "gpt-4": openai_client.call(...) block, plus another branch for Anthropic, plus a TODO labelled handle Anthropic-specific prompt shape. It works on day one. It is genuinely the right move on day one.

It becomes the wrong move somewhere around the third provider, the second incident, or the first time someone in product asks “why are we paying $0.03 per support reply when 90% of them are factual lookups?”

Route-Switch’s position is that routing across LLM providers should be a gateway concern, not an application concern. Not because gateways are fashionable, but because the moment routing lives outside the application, a set of things become possible that were structurally impossible before.

What “routing as a hack” actually costs you

The in-application approach — an if-tree around provider SDKs — has four hidden taxes that compound:

Prompts get welded to providers. The string that works on gpt-4o doesn’t work on claude-3-opus. So you end up with two prompts, then four, then a prompts/ directory that nobody refactors because nobody has the eval to back the change.
Cost data is per-client, not per-prompt. Your OpenAI bill is a single number per month. Your Anthropic bill is a different single number. The question “how much does the support-flow template cost us per resolved ticket?” has no answer without bespoke instrumentation.
Fallback is reactive, not policy. When OpenAI 5xx’s, somebody on call ships a code change to flip a feature flag. There is no “if success-rate drops below X, demote this combination” because there is no combination.
Switching providers is a deploy, not a config change. This is the one that bites longest. The cost-to-evaluate a cheaper or faster provider becomes high enough that nobody bothers, and you quietly overpay forever.

A gateway doesn’t magically solve any of this. But it makes the solutions cheap.

What changes when routing lives in the gateway

Route-Switch registers a thing called a combination: a tuple of (prompt template, model, provider) with a weight, optional fallback IDs, and metadata. The gateway holds a registry of these and a load balancer picks one per request.

gateway:
  strategy: "performance_based"
  combinations:
    - id: "support-primary"
      prompt: "You are a support agent. {context}\n\nUser: {question}"
      model: "gpt-4o"
      provider: "openai"
      weight: 70
      fallbacks: ["support-secondary", "support-cheap"]
    - id: "support-secondary"
      prompt: "Customer support reply.\nContext: {context}\nQuestion: {question}"
      model: "claude-3-5-sonnet"
      provider: "anthropic"
      weight: 30
      fallbacks: ["support-cheap"]
    - id: "support-cheap"
      prompt: "Reply briefly. Context: {context}\nQ: {question}"
      model: "gpt-3.5-turbo"
      provider: "openai"
      weight: 0
      fallbacks: []

Three things shift the moment this config exists:

The prompt becomes a first-class object. The string lives in the gateway, not in your code. You can register a new prompt without a deploy (route-switch template register --prompt-file ...). The optimizer can rewrite it without a deploy. Crucially, you have one place to look when you ask “what are we sending to the model?”

Fallback becomes policy. Set fallback_threshold: 0.8 and the gateway automatically reduces the weight of any combination whose success rate drops below 0.8 and promotes its declared fallbacks. There is no on-call PR. The strategy is performance_based, the data is the DuckDB analytics store, the mechanism is in the load balancer.

Per-combination economics become legible. Because every call writes to the per-prompt SQLite dataset and to the analytics store, you can answer “cost per call for support-primary” without writing a script. /v1/prompts/support-primary/stats returns it.

What the gateway is not doing

It is important to be honest about what this style of routing does not attempt. Route-Switch does not do semantic routing in the sense that “send Q&A queries to a small model, send creative writing queries to a large model” happens automatically. That kind of routing — where a classifier decides which combination to use per query — is something you can build on top by selecting a combination via request metadata:

{
  "model": "gpt-4o",
  "messages": [...],
  "metadata": { "combination_id": "support-cheap" }
}

You provide the classifier. The gateway provides the substrate. That separation is intentional. Embedding a classifier in the routing path couples your routing decisions to whichever model you used to classify, which is exactly the lock-in you were trying to escape.

The smallest interesting setup

You don’t need every feature to get value. The smallest config that demonstrates the point looks like this:

Two combinations: one primary (good model, good prompt), one fallback (cheap model, equivalent-ish prompt).
Strategy: performance_based.
Fallback threshold: 0.85.
Analytics: on.
Optimization: off (for now).

You point your existing OpenAI client at http://localhost:8080/v1. The client doesn’t know which model served the response. Your dashboards do, because the gateway logged it. When the primary provider hiccups, the fallback gets requests; when it recovers, weights rebalance.

That alone is worth the binary. Everything else — the optimizer, the portable packages, the multi-strategy evaluation — is icing on top of the structural change of moving routing out of your code.

When this is the wrong move

A gateway is overhead. There’s a process to run, a config to maintain, a SQLite file per prompt, a DuckDB file for analytics, and a load balancer that can mis-route a request if you mis-configure weights. If you have:

One LLM provider, no plans to add a second.
No interest in tracking per-prompt cost or success.
Latency budgets so tight that a localhost hop matters.

…then the if-else around a single provider client is genuinely the right answer. Don’t adopt complexity you can’t justify.

But the moment you cross into “we have two providers and an opinion about when to use each,” the if-tree starts losing. A gateway makes the losing trade visible and gives you somewhere to put the wins.

The boring operational wins, which are the biggest wins

The framing so far has been about features. The honest pitch for moving routing out of the application is more about operations than about features. Three boring wins stack up over a quarter:

Credentials live in one place. When OpenAI rotates a key, you update one config and restart one process. You don’t grep through six services hunting for OPENAI_API_KEY references and you don’t discover, two weeks later, that the batch job nobody owns was using the old key.

Rate limits become enforceable. Set rate_limit: 1200 under model_providers.openai and the gateway enforces 1200 req/min before the upstream returns 429. The application layer no longer has to do that math, and the failure mode is “your gateway told you to slow down” instead of “OpenAI quietly rate-limited a batch job and the queue backed up by four hours.”

Observability gets cheaper. Every call going through one process means one log format, one metric name for latency, one cardinality budget in your observability backend. Most teams underestimate how much engineering time goes into making per-service LLM call logging consistent enough to actually dashboard against. The gateway makes it free.

These aren’t the things people put on landing pages, because they don’t sound exciting. They’re the things that pay back the gateway’s operational cost the moment you turn it on.

That’s the case for routing as a feature, not a hack.