Notes
Short essays on the parts of multi-model LLM serving that bite you in production: routing strategy, prompt drift, and the quality, cost, and latency trade-offs you can’t optimize all at once.
-
Quality / cost / latency: the routing triangle
You cannot optimize all three at once. What you actually do is pick which one is the constraint, which one is the objective, and which one is allowed to slip. Here's how that shows up in a route-switch config.
-
Per-model prompt optimization: what actually moves the score
MIPROv2 is two searches in a trench coat: an instruction search and a few-shot search. We walk through what the optimizer actually does to a prompt, what it doesn't do, and where it tends to disappoint.
-
Multi-model routing as a feature, not a hack
Most multi-model setups in production are an if-else around a provider client. Turning routing into a first-class gateway concern changes what you can ship and what you can measure.