Per-model prompt optimization: what actually moves the score

“Automatic prompt optimization” is a phrase that has done a lot of work in LLM marketing. Sometimes it means a thesaurus pass. Sometimes it means an LLM rewriting your prompt with hand-wavy instructions about clarity. Occasionally it means something specific.

Route-Switch implements something specific: MIPROv2. This note explains, in plain terms, what MIPROv2 actually does to a prompt, what the score function is in our implementation, and the parts where the algorithm tends to disappoint you if you don’t set it up carefully.

The minimum viable model of MIPROv2

MIPROv2 is two searches glued together by a Bayesian optimizer. Each search operates on a different lever of the prompt:

The instruction — the natural-language part of the prompt (“You are a helpful support agent. Reply briefly and cite the FAQ when relevant.”).
The few-shot demonstrations — zero or more input/output pairs prepended to the user’s request to show the model what good looks like.

The instruction search proposes new wordings of the system prompt by asking a model (configurable; you supply credentials). The demo search picks a subset of rows from your captured dataset to serve as exemplars.
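To make the two levers concrete, here is a minimal sketch of how an instruction and a demo set might be combined into a single prompt. The Demo and Candidate types and the "Input:/Output:" layout are assumptions made for illustration; they are not taken from Route-Switch's source.

```go
package main

import (
	"fmt"
	"strings"
)

// Demo is one input/output pair taken from the captured dataset.
// The type and its field names are illustrative only.
type Demo struct {
	Input  string
	Output string
}

// Candidate is one point in the search space: an instruction plus
// the demonstrations that accompany it.
type Candidate struct {
	Instruction string
	Demos       []Demo
}

// Render writes the instruction, then each demo, then the live request.
// The "Input:/Output:" layout is an assumption made for this sketch.
func (c Candidate) Render(request string) string {
	var b strings.Builder
	b.WriteString(c.Instruction)
	b.WriteString("\n\n")
	for _, d := range c.Demos {
		fmt.Fprintf(&b, "Input: %s\nOutput: %s\n\n", d.Input, d.Output)
	}
	fmt.Fprintf(&b, "Input: %s\nOutput:", request)
	return b.String()
}

func main() {
	c := Candidate{
		Instruction: "You are a helpful support agent. Reply briefly and cite the FAQ when relevant.",
		Demos: []Demo{
			{Input: "How do I reset my password?", Output: "Use the reset link on the sign-in page (FAQ 2.1)."},
		},
	}
	fmt.Println(c.Render("Where can I download my invoice?"))
}
```

Changing either field of the Candidate gives a different prompt, which is why the two searches can be run side by side.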

The Bayesian optimizer treats the combination (instruction, demo set) as a point in a discrete search space and uses goptuna, a Go port of Optuna, to pick the next combination to evaluate. After num_trials evaluations, it returns the combination with the highest score.

That’s the whole shape. No reinforcement learning, no gradient descent, no synthetic training data. The search space is small, the score is computed directly against your captured data, and the optimizer is a tree-structured Parzen estimator under the hood.
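The loop itself can be sketched in a few lines. To stay dependency-free, the sketch below swaps in uniform random sampling for the tree-structured Parzen estimator, so it shows the shape of the search (propose a combination, score it, keep the best) and not the sampler Route-Switch uses. Search, Trial, and ScoreFunc are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Trial is one evaluated (instruction, demo set) combination.
type Trial struct {
	Instruction int   // index into the candidate instructions
	Demos       []int // dataset row indices used as demonstrations
	Score       float64
}

// ScoreFunc evaluates one combination against the calibration set.
type ScoreFunc func(instruction int, demos []int) float64

// Search evaluates numTrials combinations and returns the best one.
// It samples uniformly at random; a TPE sampler would instead bias later
// proposals toward regions that scored well in earlier trials.
// numInstructions must be at least 1.
func Search(seed int64, numInstructions, numRows, maxDemos, numTrials int, score ScoreFunc) Trial {
	rng := rand.New(rand.NewSource(seed))
	if maxDemos > numRows {
		maxDemos = numRows
	}
	var best Trial
	for i := 0; i < numTrials; i++ {
		instr := rng.Intn(numInstructions)
		k := rng.Intn(maxDemos + 1)    // zero demos is a valid choice
		demos := rng.Perm(numRows)[:k] // k distinct row indices
		s := score(instr, demos)
		if i == 0 || s > best.Score {
			best = Trial{Instruction: instr, Demos: demos, Score: s}
		}
	}
	return best
}

func main() {
	// Toy objective, invented for illustration: instruction 2 is the best
	// wording, and each demo adds a small bonus.
	toy := func(instr int, demos []int) float64 {
		s := 0.5
		if instr == 2 {
			s += 0.2
		}
		return s + 0.05*float64(len(demos))
	}
	best := Search(1, 4, 50, 3, 20, toy)
	fmt.Printf("winner: instruction #%d, demos %v, score %.2f\n",
		best.Instruction, best.Demos, best.Score)
}
```

The only thing a smarter sampler changes is how instr and demos are proposed; the evaluate-and-keep-the-best bookkeeping stays the same.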

What “the score” actually is

The score is what determines whether the “optimized” prompt is actually better or just different. Route-Switch ships three built-in evaluation strategies and lets you implement your own through the EvaluationStrategy Go interface.

The three built-ins:

Similarity — token-overlap + length heuristics. Default. Best for open-ended outputs where exact wording doesn’t matter. Default threshold is 0.7.
ExactMatch — whitespace-trimmed string equality. Binary pass/fail. Best for deterministic tasks — classification labels, structured extraction, anything where the answer is supposed to be one string.
KeywordMatch — checks for the presence of pre-declared keywords (or keywords derived from the expected output). Partial scoring. Useful when you care that certain concepts appear but don’t care how the model phrases them.

This is the part of the system you cannot afford to ignore. The optimizer will dutifully maximize whatever score function you point it at. If your score function is similarity-based and your task is “classify sentiment into one of three labels,” the optimizer will find a prompt that produces longer, more “similar” (in token overlap) responses — which is the wrong objective. Pick ExactMatch for that task.
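The mismatch is easy to reproduce. The sketch below is a simplified stand-in for the three strategies: the Strategy interface and its single Score method are assumptions (the real EvaluationStrategy method set may differ), and TokenOverlap is a plain Jaccard overlap, not the built-in Similarity heuristic.

```go
package main

import (
	"fmt"
	"strings"
)

// Strategy is a stand-in for an evaluation strategy. The single Score
// method is an assumption for this sketch; the real interface may differ.
type Strategy interface {
	Score(expected, actual string) float64
}

// ExactMatch scores 1 when the whitespace-trimmed strings are identical.
type ExactMatch struct{}

func (ExactMatch) Score(expected, actual string) float64 {
	if strings.TrimSpace(expected) == strings.TrimSpace(actual) {
		return 1
	}
	return 0
}

// TokenOverlap is a simplified similarity: Jaccard overlap of word sets.
type TokenOverlap struct{}

func (TokenOverlap) Score(expected, actual string) float64 {
	want, got := wordSet(expected), wordSet(actual)
	if len(want) == 0 && len(got) == 0 {
		return 1
	}
	shared := 0
	for w := range want {
		if got[w] {
			shared++
		}
	}
	return float64(shared) / float64(len(want)+len(got)-shared)
}

// KeywordMatch scores the fraction of declared keywords found in the output.
type KeywordMatch struct{ Keywords []string }

func (k KeywordMatch) Score(_, actual string) float64 {
	if len(k.Keywords) == 0 {
		return 0
	}
	lower := strings.ToLower(actual)
	hits := 0
	for _, kw := range k.Keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			hits++
		}
	}
	return float64(hits) / float64(len(k.Keywords))
}

// wordSet lowercases, splits on whitespace, and strips basic punctuation.
func wordSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		if w = strings.Trim(w, ".,!?"); w != "" {
			set[w] = true
		}
	}
	return set
}

func main() {
	expected := "negative"
	outputs := []string{
		"negative",
		"The sentiment is negative.",
		"Not negative at all, clearly positive.", // wrong label
	}
	strategies := []Strategy{ExactMatch{}, TokenOverlap{}}
	for _, out := range outputs {
		for _, s := range strategies {
			fmt.Printf("%-14T %.2f  %q\n", s, s.Score(expected, out), out)
		}
	}
}
```

With overlap scoring, the wrong label still earns partial credit because it mentions the expected word, and the right label wrapped in a sentence scores only slightly higher. The optimizer has no reason to prefer the bare label. Exact matching gives it one.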

What MIPROv2 moves on a real prompt

In our experience, three things consistently improve when MIPROv2 runs against a well-instrumented prompt:

Format adherence. If your captured dataset includes structured outputs (JSON, key-value pairs, single-label classification), MIPROv2 reliably finds instructions that pull the model toward producing the expected format. This is a big deal because format failures are often the only failures — the model knew the answer, it just emitted it wrong.

Demonstration selection. The few-shot search is doing work even when the instruction search isn’t. The demos MIPROv2 picks are not always the “hardest” or the “most representative” rows. They are the rows whose presence in the prompt happens to lift the score on the rest of the calibration set the most. Sometimes that’s a boring row. That’s fine: the search doesn’t care whether a row is boring, and you shouldn’t either.

Instruction specificity. Generic instructions (“You are a helpful assistant.”) get replaced with more specific ones (“You are a B2B SaaS support agent. Reply in 2-3 sentences. If the question references billing, ask for the account ID.”) roughly as often as you’d hope. This is the part that looks most like “magic” from the outside and is the most fragile in practice.

What it doesn’t move

Three things MIPROv2 will not fix for you, no matter how many trials you run:

A task the model can’t do. If the underlying model fundamentally can’t answer your domain (medical coding, niche legal analysis, a programming language with too little training data), the optimizer cannot manufacture capability through prompt engineering. You will see scores plateau low and stay there. Switch the model; don’t optimize harder.

A score function that doesn’t match the task. Covered above; it’s worth repeating because it’s the most common failure mode.

A dataset that doesn’t reflect production. MIPROv2 optimizes against your captured dataset. If your captured dataset is the first 200 calls from a beta cohort with a specific accent and you ship the optimized prompt to a global audience, the score gain will not generalize. Wait for representative data. The docs say at least 100 requests; in practice you want more like 500 before you trust the optimization.

Per-model, per-prompt, not per-query

The word “per-model” in the title is doing important work. MIPROv2 in Route-Switch runs per registered prompt+model+provider combination. It does not produce one universally optimal prompt; it produces a prompt that works on a specific model from a specific provider against a specific evaluation strategy.

If you run the same template against gpt-4o and claude-3-5-sonnet as two separate combinations, you will end up with two different optimized instructions. That is correct: the two models respond differently to the same wording. It is also a maintenance reality, since the optimizer refreshes each of the two prompts separately, on whatever interval you configure (gateway.optimization.interval_seconds).
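One way to picture the bookkeeping is a map keyed by the full combination. This is a sketch of the idea only: ComboKey, Store, the provider labels, and the optimized wordings are all hypothetical and say nothing about how Route-Switch stores its results.

```go
package main

import "fmt"

// ComboKey identifies one optimization target. Field names are illustrative.
type ComboKey struct {
	Prompt   string // the registered template
	Model    string
	Provider string
}

// Store maps each combination to its own optimized instruction.
type Store map[ComboKey]string

// Lookup returns the optimized instruction for a combination, falling back
// to the original template when that combination has not been optimized yet.
func (s Store) Lookup(k ComboKey) string {
	if v, ok := s[k]; ok {
		return v
	}
	return k.Prompt
}

func main() {
	tmpl := "Answer: {question}"
	gpt := ComboKey{Prompt: tmpl, Model: "gpt-4o", Provider: "openai"}
	claude := ComboKey{Prompt: tmpl, Model: "claude-3-5-sonnet", Provider: "anthropic"}

	// The optimized wordings below are invented placeholders.
	store := Store{
		gpt:    "Answer in one sentence: {question}",
		claude: "Give a short, direct answer. {question}",
	}

	fmt.Println(store.Lookup(gpt))
	fmt.Println(store.Lookup(claude))

	// Same template on an unregistered model: nothing optimized, so the
	// original template is used as-is.
	other := ComboKey{Prompt: tmpl, Model: "some-other-model", Provider: "openai"}
	fmt.Println(store.Lookup(other))
}
```

Because the model and provider are part of the key, improving the prompt for one model never overwrites the prompt for another.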

Setting it up so the score actually moves

A short, opinionated checklist:

Collect at least 200 real calls before you first optimize, and closer to 500 before you trust the result.
Pick the evaluation strategy that matches the task. If you can’t decide, default to Similarity and accept that you’ll need a custom evaluator later.
Set num_trials to at least 15. With fewer, the Bayesian search hasn’t explored enough.
Run optimization manually first. route-switch --optimize-prompt prints the candidates, the trials, and the winner. Read the output. Make sure the winner is a prompt you’d be willing to ship.
Then turn on background optimization with a long interval (an hour or more). Short intervals churn for no good reason.

Inspecting a winner before you ship it

A manual run prints output that looks roughly like this (numbers truncated for readability):

$ route-switch --config config.yaml \
    --prompt "Answer: {question}" \
    --model gpt-4o \
    --optimize-prompt

[bootstrap] 234 rows in dataset; sampling 50 for calibration
[instructions] generated 4 candidates from gpt-4o
[trials] running 20 Bayesian trials
  trial 03 score=0.74 instruction=#2 demos=[14, 87, 132]
  trial 08 score=0.81 instruction=#3 demos=[14, 132]
  trial 17 score=0.86 instruction=#3 demos=[14, 87]
[result] winner = instruction #3, demos [14, 87]
  baseline score: 0.71
  winner score:   0.86
  improvement:    +21%

Before you accept the winner, read the new instruction. Read the picked demos. Sometimes — particularly with smaller models — the winning instruction has internalized a quirk of the calibration set that won’t generalize. The lift might be real on the sample and modest on the holdout. If you have a separate holdout set, run it through the candidate before shipping.
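A holdout check does not need much machinery. The sketch below computes the relative lift the same way the CLI output above does, (0.86 - 0.71) / 0.71, and then compares it with the lift on a holdout set. The helper names are hypothetical, the per-row holdout scores are invented for illustration, and the 0.5 retention threshold is an arbitrary assumption, not a recommendation from the docs.

```go
package main

import "fmt"

// Mean averages per-row scores; an empty slice averages to 0.
func Mean(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

// RelativeLift returns the fractional change from baseline to candidate.
// A zero baseline has no meaningful relative lift, so it returns 0.
func RelativeLift(baseline, candidate float64) float64 {
	if baseline == 0 {
		return 0
	}
	return (candidate - baseline) / baseline
}

// HoldsUp reports whether the holdout lift retains at least minFraction
// of the lift seen on the calibration sample.
func HoldsUp(calibrationLift, holdoutLift, minFraction float64) bool {
	return holdoutLift >= minFraction*calibrationLift
}

func main() {
	// Calibration numbers from the CLI output above.
	calLift := RelativeLift(0.71, 0.86)
	fmt.Printf("calibration lift: %+.0f%%\n", calLift*100)

	// Per-row holdout scores, invented for illustration only.
	baseline := []float64{1, 0, 1, 1, 0, 1, 1, 0}
	candidate := []float64{1, 1, 1, 1, 0, 1, 1, 0}
	holdLift := RelativeLift(Mean(baseline), Mean(candidate))
	fmt.Printf("holdout lift:     %+.0f%%\n", holdLift*100)

	fmt.Println("holds up:", HoldsUp(calLift, holdLift, 0.5))
}
```

If the holdout lift is a small fraction of the calibration lift, treat the winner as overfit to the sample and keep the baseline until you have more data.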

The optimizer is not magic, and the marketing phrase “automatic prompt optimization” sells it short on what it actually does and oversells it on what to expect. What you get is a small, principled search over instruction + demos, scored against your data, with a winner you can inspect. That’s genuinely useful. It is also genuinely not “set it and forget it.”