LLM reranking

std/llm/rerank provides pairwise reranking helpers for cases where a single scalar score is brittle, plus a low-level confidence helper for models that expose token log probabilities.

import { pairwise_rerank, self_certainty } from "std/llm/rerank"

let candidates = [
  {id: "a", answer: "Use a bounded retry loop."},
  {id: "b", answer: "Retry forever until it works."},
  {id: "c", answer: "Fail fast on the first timeout."},
]

let result = pairwise_rerank(candidates, {
  task: "Choose the answer with the safest production behavior.",
  criteria: "Prefer bounded retries, clear failure modes, and low operational risk.",
  model_tier: "small",
})

log(result.ranked[0])
log(result.scores)

pairwise_rerank

pairwise_rerank(candidates, opts?) -> dict ranks a list with O(n log n) pairwise comparisons. It returns:

| Field | Type | Description |
| --- | --- | --- |
| ranked | list | Candidate values ordered best-first |
| scores | list | Per-original-candidate score records: {index, candidate, wins, losses, ties, comparisons, score, avg_confidence} |
| comparisons | list | Pairwise audit records: {left_index, right_index, winner, confidence, reasoning} |
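
The sorting strategy behind the O(n log n) bound is not spelled out here. As a language-neutral illustration, the Python sketch below shows one way to reach that bound: a merge sort that hands every ordering decision to a comparator and appends an audit record per call. The names rank_pairwise and longer_wins, the comparator signature, and the choice of merge sort are all assumptions made for this sketch, not a description of Harn's implementation.

```python
from typing import Any, Callable

# Hypothetical comparator type: returns "left", "right", or "tie".
Compare = Callable[[Any, Any], str]


def rank_pairwise(candidates: list, compare: Compare) -> dict:
    """Rank best-first with a merge sort driven by pairwise comparisons."""
    comparisons = []  # audit trail, one record per comparator call

    def merge_sort(indices: list) -> list:
        if len(indices) <= 1:
            return indices
        mid = len(indices) // 2
        left, right = merge_sort(indices[:mid]), merge_sort(indices[mid:])
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            winner = compare(candidates[left[i]], candidates[right[j]])
            comparisons.append(
                {"left_index": left[i], "right_index": right[j], "winner": winner}
            )
            # Ties keep the left element first, so the sort stays stable.
            if winner == "right":
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        return merged + left[i:] + right[j:]

    order = merge_sort(list(range(len(candidates))))
    return {"ranked": [candidates[k] for k in order], "comparisons": comparisons}


def longer_wins(a: str, b: str) -> str:
    """Toy deterministic comparator: the longer string wins."""
    if len(a) == len(b):
        return "tie"
    return "left" if len(a) > len(b) else "right"


result = rank_pairwise(["short", "much longer", "mid-size"], longer_wins)
```

Because each merge step compares only the heads of two already-sorted runs, the number of comparator calls grows as n log n rather than n squared, which matters when every call is a model request.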

By default, each comparison calls llm_call_structured with a small judge schema. The judge receives opts.task, opts.criteria, and the two candidate payloads. Generation options can be grouped under opts.llm_options; when that key is omitted, any standard llm_call options set directly on opts (such as model_tier in the first example) are used instead.

import { pairwise_rerank } from "std/llm/rerank"

let items = ["primary source", "unattributed summary", "official documentation"]
let ranked = pairwise_rerank(items, {
  task: "Pick the most relevant search result.",
  criteria: "Prefer direct answers from primary sources.",
  llm_options: {
    model_tier: "small",
    temperature: 0.0,
  },
})

For deterministic tests or application-specific scorers, supply a comparator as opts.compare. It is called as compare(left, right, ctx) and can return a dict, string, bool, or number:

import { pairwise_rerank } from "std/llm/rerank"

let ranked = pairwise_rerank(
  ["short", "much longer"],
  {
    compare: { left, right, ctx ->
      if len(left) >= len(right) {
        return {winner: "left", confidence: 1.0}
      }
      return {winner: "right", confidence: 1.0}
    },
  },
)

Accepted winners are left/right/tie and aliases such as A, B, first, second, and equal. Numeric comparators use positive for left, negative for right, and zero for a tie.
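
To make the mapping concrete, here is a minimal Python sketch of how such results could be normalized to one of left, right, or tie. The function name normalize_winner is hypothetical, case-insensitive matching is an assumption, and the alias table contains only the aliases named above (the real list may be longer). The text does not say how a bool result is interpreted, so the sketch rejects bools rather than guess.

```python
# Aliases named in the text above; the real set may include more.
ALIASES = {
    "left": "left", "a": "left", "first": "left",
    "right": "right", "b": "right", "second": "right",
    "tie": "tie", "equal": "tie",
}


def normalize_winner(value) -> str:
    """Map a comparator result to "left", "right", or "tie"."""
    if isinstance(value, dict):
        value = value["winner"]
    # bool is checked before numbers because Python's bool is an int subclass.
    # The bool mapping is unspecified in the text, so this sketch declines to guess.
    if isinstance(value, bool):
        raise ValueError("bool mapping is not specified in this sketch")
    if isinstance(value, (int, float)):
        if value > 0:
            return "left"
        return "right" if value < 0 else "tie"
    if isinstance(value, str):
        key = value.strip().lower()  # assumption: matching ignores case
        if key in ALIASES:
            return ALIASES[key]
    raise ValueError(f"unrecognized winner: {value!r}")
```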

self_certainty

self_certainty(text_or_result, model_opts?) -> float returns a length-normalized confidence score in [0.0, 1.0] from token log probabilities:

import { self_certainty } from "std/llm/rerank"

let score = self_certainty(
  "ignored when logprobs are supplied",
  {
    logprobs: [
      {token: "safe", logprob: -0.10},
      {token: " plan", logprob: -0.20},
    ],
  },
)
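
The exact formula behind the score is not given here. One common length-normalized choice, shown in the Python sketch below purely as an assumption, is the geometric mean of the token probabilities, that is exp of the mean log probability. Because log probabilities are never positive, the result stays within (0.0, 1.0] and does not shrink just because a response is longer. The function name length_normalized_confidence is hypothetical.

```python
import math


def length_normalized_confidence(logprobs: list[float]) -> float:
    """Geometric mean of token probabilities: exp(mean(logprob))."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(logprobs) / len(logprobs))


# Same two token log probabilities as in the example above.
score = length_normalized_confidence([-0.10, -0.20])
```

Under this assumed formula the two tokens above give exp(-0.15), roughly 0.86. The value Harn actually returns may differ if it uses another normalization.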

If a result dict already contains logprobs, pass it directly or as the second argument:

import { self_certainty } from "std/llm/rerank"

let response = llm_call("Write a short release note.", nil, {
  provider: "openai",
  logprobs: true,
  top_logprobs: 3,
  stream: false,
})

let confidence = self_certainty(response)

When no logprobs are supplied, self_certainty makes one extra model call with logprobs: true, asking the model to repeat text_or_result exactly. The call fails if the provider or model does not return token log probabilities.

Provider support depends on the transport. Harn normalizes OpenAI-compatible chat-completion logprobs and legacy completion logprobs when providers return them, including local OpenAI-compatible servers. The mock provider accepts llm_mock({text, logprobs: [...]}) for deterministic tests. Anthropic, Bedrock, Gemini/Vertex, and native Ollama routes currently do not expose a normalized live logprob surface through llm_call, so use supplied logprobs or an OpenAI-compatible route for self_certainty.

The score reflects the model's token-level certainty in generated text, not factual correctness. Use it as a calibration signal alongside normal validation, retrieval checks, or pairwise judging.