Prompt optimization

std/llm/optimize provides a deterministic prompt-search loop for tuning an instruction against an eval set:

import { optimize_prompt } from "std/llm/optimize"

pipeline default() {
  let result = optimize_prompt({
    base_prompt: "Answer the question.",
    eval_set: [
      {id: "add", input: "2 + 2", expected: "4"},
      {id: "mul", input: "3 * 3", expected: "9"},
    ],
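    // Deterministic toy metric: 1.0 when the candidate prompt mentions
    // "calculate", otherwise 0.0. No LLM call is made.
    metric: { ctx ->
      if contains(lowercase(ctx.prompt), "calculate") {
        return 1.0
      }
      return 0.0
    },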
    trials: 4,
    instruction_proposals: [
      "Answer the question.",
      "Calculate carefully and answer only with the result.",
    ],
  })

  log(result.best_prompt)
  log(result.best_score)
}

The optimizer searches over a discrete space of (instruction, demos) candidates. Instruction proposals come from propose_instructions(...) in std/llm/refine; callers can supply an explicit instruction_proposals list, a proposal_fn, or LLM options for structured proposal generation. Eval scoring is delegated to parallel_judge(...) from std/llm/judge, so the eval cases for each candidate can run concurrently.
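To make the shape of the search space concrete, here is a minimal Python sketch of enumerating (instruction, demos) candidates as a Cartesian product. It is not the std/llm/optimize implementation: the helper name build_candidates, the demo_sets argument, and the "Q:/A:" rendering are assumptions made for illustration, and only the candidate fields (instruction, demos, prompt, index) are taken from this page.

```python
from itertools import product


def build_candidates(instructions, demo_sets):
    """Enumerate every (instruction, demos) pair as a candidate dict.

    Hypothetical helper: the real library may build and render
    candidates differently.
    """
    candidates = []
    pairs = product(instructions, demo_sets)
    for index, (instruction, demos) in enumerate(pairs):
        # Assumed rendering: demos first, then the instruction.
        demo_text = "".join(
            f"Q: {d['input']}\nA: {d['expected']}\n\n" for d in demos
        )
        candidates.append({
            "index": index,
            "instruction": instruction,
            "demos": list(demos),
            "prompt": demo_text + instruction,
        })
    return candidates


instructions = [
    "Answer the question.",
    "Calculate carefully and answer only with the result.",
]
# Two demo choices: no demos, or a single worked example.
demo_sets = [(), ({"input": "1 + 1", "expected": "2"},)]
candidates = build_candidates(instructions, demo_sets)
```

With two instructions and two demo choices the space has four candidates, which is why a small trials budget can cover it exhaustively.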

optimize_prompt(config) returns:

| Field | Description |
| --- | --- |
| best_prompt | Rendered prompt for the best observed candidate |
| best_score | Mean eval-set score for best_prompt |
| best_candidate | {instruction, demos, prompt, index} for the winning candidate |
| trace | Trial-by-trial observations, case scores, and acquisition metadata |
| ranked | Observed candidates sorted by score |
| candidates | Full discrete candidate space considered for search |

The acquisition loop is intentionally inspectable. It evaluates a seed candidate first, then selects unobserved candidates using an expected-improvement score derived from a small similarity-weighted surrogate over prior observations.
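The loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's code: the Jaccard word-overlap similarity, the uncertainty term, and the names similarity, surrogate, expected_improvement, and search are all hypothetical stand-ins. Only the overall pattern (seed first, then pick unobserved candidates by expected improvement over a similarity-weighted surrogate) comes from this page.

```python
import math


def similarity(a, b):
    """Jaccard overlap of lowercase word sets (assumed similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def surrogate(candidate, observed):
    """Similarity-weighted mean and spread of the observed scores."""
    sims = [similarity(candidate, prompt) for prompt, _ in observed]
    weights = [s + 1e-6 for s in sims]  # avoid an all-zero weight vector
    total = sum(weights)
    scores = [score for _, score in observed]
    mean = sum(w * s for w, s in zip(weights, scores)) / total
    var = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores)) / total
    # Assumed heuristic: more uncertainty far from every observation.
    std = math.sqrt(var) + 0.5 * (1.0 - max(sims))
    return mean, std


def expected_improvement(mean, std, best):
    """Standard expected improvement for maximization under a normal model."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


def search(candidates, score_fn, max_trials):
    """Evaluate the seed, then pick unobserved candidates by EI."""
    seed = candidates[0]
    observed = [(seed, score_fn(seed))]
    trace = [{"trial": 0, "prompt": seed, "score": observed[0][1], "ei": None}]
    while len(observed) < max_trials:
        seen = {prompt for prompt, _ in observed}
        pool = [c for c in candidates if c not in seen]
        if not pool:
            break  # candidate space exhausted
        best = max(score for _, score in observed)
        scored = [(expected_improvement(*surrogate(c, observed), best), c)
                  for c in pool]
        ei, pick = max(scored, key=lambda pair: pair[0])
        observed.append((pick, score_fn(pick)))
        trace.append({"trial": len(trace), "prompt": pick,
                      "score": observed[-1][1], "ei": ei})
    best_prompt, best_score = max(observed, key=lambda pair: pair[1])
    return best_prompt, best_score, trace


prompts = [
    "Answer the question.",
    "Calculate carefully and answer only with the result.",
    "Answer briefly.",
]
toy_metric = lambda p: 1.0 if "calculate" in p.lower() else 0.0
best_prompt, best_score, trace = search(prompts, toy_metric, max_trials=3)
```

Recording the acquisition value next to each observation, as the trace entries do here, is what makes a loop like this easy to inspect after the fact.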

budget.max_concurrent caps parallel eval cases. budget.max_trials and budget.max_evaluations cap the total search. For conformance and local tests, use explicit instruction_proposals and a deterministic metric closure to avoid real LLM calls.
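As a rough Python illustration of how these three caps might interact, the sketch below limits concurrency with a thread pool and stops once either the trial or the evaluation budget would be exceeded. The function names, the ctx dictionary shape beyond its prompt field, and the choice to count one evaluation per eval case are assumptions; the library's actual accounting may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidate(prompt, eval_set, metric, max_concurrent):
    """Mean metric over eval cases, at most max_concurrent at a time."""
    if not eval_set:
        return 0.0
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        scores = list(pool.map(
            lambda case: metric({"prompt": prompt, "case": case}),
            eval_set,
        ))
    return sum(scores) / len(scores)


def run_with_budget(prompts, eval_set, metric, budget):
    """Score prompts in order until a budget cap would be exceeded."""
    evaluations = 0
    results = []
    for trial, prompt in enumerate(prompts):
        if trial >= budget["max_trials"]:
            break
        # Assumption: each eval case counts as one evaluation.
        if evaluations + len(eval_set) > budget["max_evaluations"]:
            break
        score = score_candidate(
            prompt, eval_set, metric, budget["max_concurrent"]
        )
        results.append((prompt, score))
        evaluations += len(eval_set)
    return results, evaluations


eval_set = [
    {"id": "add", "input": "2 + 2", "expected": "4"},
    {"id": "mul", "input": "3 * 3", "expected": "9"},
]
metric = lambda ctx: 1.0 if "calculate" in ctx["prompt"].lower() else 0.0
budget = {"max_concurrent": 2, "max_trials": 4, "max_evaluations": 3}
results, evaluations = run_with_budget(
    ["Calculate the answer.", "Answer the question."],
    eval_set, metric, budget,
)
```

In this run only the first prompt is scored: a second trial would bring the total to four evaluations, which exceeds the cap of three even though max_trials would still allow it.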