Intelligent LLM routing,
with prompts that keep improving.
Route-Switch is a Go gateway that fronts multiple LLM providers behind a single /v1/chat/completions endpoint, load-balances across prompt+model combinations, and reruns MIPROv2 on captured production traces to keep prompts honest.
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4",
      "messages": [{"role":"user","content":"Help me"}],
      "variables": {"customer_name":"Jordan"}
    }'
# request lands on a registered template,
# rendered, routed (round_robin | weighted |
# performance_based), logged to SQLite +
# DuckDB, and (optionally) the background
# optimizer rewrites the template next cycle.
One endpoint, many providers
OpenAI, Anthropic, Google, Ollama, Cohere, Mistral — wired through the gollm adapter. Drop-in for OpenAI clients.
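For callers who prefer Go over curl, the same request can be built with the standard library. This is a minimal sketch: the type and function names (Message, ChatRequest, newChatRequest) are hypothetical, and typing "variables" as a string-to-string map is an assumption based on the curl example, not a documented schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Message and ChatRequest mirror the JSON body from the curl example.
// The Go names are illustrative, not taken from Route-Switch's source.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
	// Assumption: template variables are plain string key/value pairs.
	Variables map[string]string `json:"variables,omitempty"`
}

// newChatRequest builds a POST to <baseURL>/v1/chat/completions with a JSON body.
func newChatRequest(baseURL string, body ChatRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:8080", ChatRequest{
		Model:     "gpt-4",
		Messages:  []Message{{Role: "user", Content: "Help me"}},
		Variables: map[string]string{"customer_name": "Jordan"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// Sending is left out so the sketch runs without a gateway:
	// resp, err := http.DefaultClient.Do(req)
}
```

The request is built but not sent, so the sketch compiles and runs without a gateway listening on localhost.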
Strategies, not vibes
Round-robin, weighted, or performance-based load balancing across prompt+model combinations. Fallbacks when success rate dips below a threshold.
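As a rough illustration of weighted routing with a success-rate fallback, here is a minimal sketch. The Combo type, its fields, and the helper functions are hypothetical and not taken from Route-Switch's source. Returning the full set when nothing clears the threshold is an assumed policy. Round-robin and performance-based selection are not shown.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Combo is a hypothetical prompt+model combination.
type Combo struct {
	Name        string
	Weight      int     // relative share of traffic for weighted routing
	SuccessRate float64 // rolling success rate in [0, 1]
}

// healthy keeps the combos at or above the threshold. If none qualify it
// returns the full set, so a request always has a target (assumed policy).
func healthy(combos []Combo, threshold float64) []Combo {
	var ok []Combo
	for _, c := range combos {
		if c.SuccessRate >= threshold {
			ok = append(ok, c)
		}
	}
	if len(ok) == 0 {
		return combos
	}
	return ok
}

// totalWeight sums the weights of all combos.
func totalWeight(combos []Combo) int {
	sum := 0
	for _, c := range combos {
		sum += c.Weight
	}
	return sum
}

// pickWeighted maps r, expected in [0, totalWeight), onto a combo so that
// each combo owns a slice of the range proportional to its weight.
func pickWeighted(combos []Combo, r int) Combo {
	for _, c := range combos {
		if r < c.Weight {
			return c
		}
		r -= c.Weight
	}
	return combos[len(combos)-1]
}

func main() {
	combos := []Combo{
		{Name: "support-v1/gpt-4", Weight: 3, SuccessRate: 0.99},
		{Name: "support-v1/local", Weight: 1, SuccessRate: 0.50},
		{Name: "support-v2/gpt-4", Weight: 1, SuccessRate: 0.95},
	}
	pool := healthy(combos, 0.9) // drops the combo below the threshold
	choice := pickWeighted(pool, rand.Intn(totalWeight(pool)))
	fmt.Println("routing to", choice.Name)
}
```

Passing the random draw in as an argument keeps pickWeighted deterministic, which makes the selection logic easy to test apart from the random source.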
MIPROv2 on production traces
Captured calls feed a Bayesian instruction search (goptuna). The optimizer proposes new prompts, replays the dataset, and ships the winner.
What “automatic prompt optimization” actually means
Route-Switch implements MIPROv2 — an instruction + few-shot search that runs against the per-prompt SQLite dataset the gateway has already collected. Concretely, the optimizer loop:
- Bootstraps a calibration sample from the prompt’s dataset.
- Generates instruction candidates by calling the configured provider.
- Drives a Bayesian search (goptuna) across instruction×demo combinations.
- Scores each candidate by replaying dataset rows under Similarity, ExactMatch, or KeywordMatch.
- Writes the winner back to the prompt registry.
It is not a generic “rewrite the user query” layer and it is not RLHF. It rewrites templates you registered, scored against data your traffic produced, with the evaluation strategy you configured. Honest about the inputs, honest about the loss function.
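The scoring and selection steps of that loop can be sketched as follows. This is an illustration under stated assumptions, not Route-Switch's implementation: the Row, Candidate, and Scorer types are hypothetical, the exactMatch and keywordMatch definitions are plausible guesses at what strategies with those names compute, and the goptuna-driven Bayesian search is replaced here by scoring every candidate exhaustively.

```go
package main

import (
	"fmt"
	"strings"
)

// Row is a hypothetical replayed dataset row.
type Row struct {
	Output   string // what the candidate prompt produced on replay
	Expected string // reference output captured from traffic
}

// Candidate pairs an instruction with its replay results.
type Candidate struct {
	Instruction string
	Replays     []Row
}

// Scorer maps one replayed row to a score in [0, 1].
type Scorer func(output, expected string) float64

// exactMatch scores 1 when the trimmed strings are identical, else 0.
func exactMatch(output, expected string) float64 {
	if strings.TrimSpace(output) == strings.TrimSpace(expected) {
		return 1
	}
	return 0
}

// keywordMatch returns the fraction of whitespace-separated words in
// expected that occur in output as substrings, ignoring case.
func keywordMatch(output, expected string) float64 {
	words := strings.Fields(strings.ToLower(expected))
	if len(words) == 0 {
		return 0
	}
	out := strings.ToLower(output)
	hits := 0
	for _, w := range words {
		if strings.Contains(out, w) {
			hits++
		}
	}
	return float64(hits) / float64(len(words))
}

// meanScore averages the scorer over all rows; an empty set scores 0.
func meanScore(rows []Row, s Scorer) float64 {
	if len(rows) == 0 {
		return 0
	}
	sum := 0.0
	for _, r := range rows {
		sum += s(r.Output, r.Expected)
	}
	return sum / float64(len(rows))
}

// pickWinner returns the candidate with the highest mean score.
// On ties the earliest candidate wins.
func pickWinner(cands []Candidate, s Scorer) (Candidate, float64) {
	var winner Candidate
	best := -1.0
	for _, c := range cands {
		if sc := meanScore(c.Replays, s); sc > best {
			winner, best = c, sc
		}
	}
	return winner, best
}

func main() {
	cands := []Candidate{
		{Instruction: "Answer briefly.", Replays: []Row{{"refund issued", "refund issued"}, {"no", "yes"}}},
		{Instruction: "Answer politely.", Replays: []Row{{"yes", "yes"}, {"ok", "ok"}}},
	}
	winner, score := pickWinner(cands, exactMatch)
	fmt.Printf("winner=%q score=%.2f\n", winner.Instruction, score)
}
```

Because the scorer is a plain function value, swapping exactMatch for keywordMatch changes the loss without touching the selection logic, which matches the idea that the evaluation strategy is something you configure.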
What it is honest about
Yes
- OpenAI-compatible /v1/chat/completions (streaming + non-streaming).
- Provider fan-out via gollm: OpenAI, Anthropic, Google, Ollama, Cohere, Mistral.
- Per-prompt SQLite dataset; DuckDB-backed analytics store.
- Background optimizer with configurable interval.
- Portable prompt “packages” (template + dataset snapshot + recent logs).
Not
- A semantic router that classifies a free-form user query and sends it to the “best” model in real time. Routing is across registered combinations, not arbitrary models.
- An evaluation harness with held-out eval sets. The evaluator scores against your captured traffic; bring your own ground truth if Similarity won’t cut it.
- A vector DB, a feature store, or a guardrails layer.
- A managed cloud. You run the binary.
Compare
The space is crowded. Two grounded comparisons with the tools people actually evaluate Route-Switch against:
- vs. OpenRouter — hosted aggregator vs. self-hosted gateway
- vs. Portkey — observability-first AI gateway vs. optimization-first
Recent notes
- 2026-06-02 Quality / cost / latency: the routing triangle
You cannot optimize all three at once. What you actually do is pick which one is the constraint, which one is the objective, and which one is allowed to slip. Here's how that shows up in a route-switch config.
- 2026-05-28 Per-model prompt optimization: what actually moves the score
MIPROv2 is two searches in a trench coat: an instruction search and a few-shot search. We walk through what the optimizer actually does to a prompt, what it doesn't do, and where it tends to disappoint.
- 2026-05-21 Multi-model routing as a feature, not a hack
Most multi-model setups in production are an if-else around a provider client. Why turning routing into a first-class gateway concern changes what you can ship — and what you can measure.