Tutorial: build an eval pipeline

This tutorial builds a small evaluation loop that runs a set of examples, records metrics, and produces an auditable summary. The goal is to make quality visible, not to build an elaborate benchmark harness.

Use the companion example as a baseline:

cargo run --bin harn -- run examples/data-pipeline.harn

1. Define the dataset inline

Start with a tiny set of representative inputs. Keep the examples small enough that you can inspect failures by eye:

pipeline main(task) {
  let cases = [
    {id: "case-1", input: "What is 2 + 2?", expected: "4"},
    {id: "case-2", input: "Capital of France?", expected: "Paris"},
    {id: "case-3", input: "Color of grass?", expected: "green"},
  ]

  log("Loaded ${cases.count} eval cases")
}

2. Run the cases in parallel

If the cases are independent of one another, use parallel each so the slow model calls overlap instead of running one after another:

pipeline main(task) {
  let cases = [
    {id: "case-1", input: "What is 2 + 2?", expected: "4"},
    {id: "case-2", input: "Capital of France?", expected: "Paris"},
    {id: "case-3", input: "Color of grass?", expected: "green"},
  ]

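  // Run every case concurrently; each iteration returns one result record.
  let results = parallel each cases { tc ->
    let answer = llm_call(tc.input, "Answer in one word or short phrase.", {
      temperature: 0.0,
      max_tokens: 64,
    })

    {
      id: tc.id,
      expected: tc.expected,
      actual: answer.text,
      // Substring match: passes when the expected text appears anywhere in the answer.
      correct: answer.text.contains(tc.expected),
    }
  }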

  log(json_stringify(results))
}

For a real eval suite, replace the inline cases list with a manifest or a dataset file that your pipeline reads with read_file().
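
Before moving cases into a file, it can help to check the file for missing fields and duplicate ids. The sketch below uses Python rather than Harn, purely for illustration. It assumes the dataset is a JSON array of objects with the same id, input, and expected fields as the inline cases; that file layout and the name validate_cases are assumptions, not something Harn prescribes.

```python
import json

# Field names mirror the inline cases in this tutorial.
REQUIRED_FIELDS = ("id", "input", "expected")

def validate_cases(raw: str) -> list[dict]:
    """Parse a JSON array of cases and reject missing fields or duplicate ids."""
    cases = json.loads(raw)
    if not isinstance(cases, list):
        raise ValueError("dataset must be a JSON array of cases")
    seen = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            raise ValueError(f"case {index} is not an object")
        missing = [f for f in REQUIRED_FIELDS if f not in case]
        if missing:
            raise ValueError(f"case {index} is missing fields: {missing}")
        if case["id"] in seen:
            raise ValueError(f"duplicate case id: {case['id']}")
        seen.add(case["id"])
    return cases

if __name__ == "__main__":
    sample = json.dumps([
        {"id": "case-1", "input": "What is 2 + 2?", "expected": "4"},
        {"id": "case-2", "input": "Capital of France?", "expected": "Paris"},
    ])
    print(len(validate_cases(sample)), "cases look well-formed")
```

Rejecting duplicate ids early matters because per-case metrics are keyed by case id; two cases with the same id would be indistinguishable in the metric trail.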

3. Record metrics

The most important output of an eval pipeline is the metric trail. Use eval_metric() to record one result per case and an aggregate result for the run.

pipeline main(task) {
  let cases = [
    {id: "case-1", input: "What is 2 + 2?", expected: "4"},
    {id: "case-2", input: "Capital of France?", expected: "Paris"},
  ]

  var passed = 0
  for tc in cases {
    let answer = llm_call(tc.input, "Answer in one word.", {temperature: 0.0})
    let correct = answer.text.contains(tc.expected)
    if correct {
      passed = passed + 1
    }
    eval_metric("case_correct", correct, {case_id: tc.id})
  }

  let accuracy = passed / cases.count
  eval_metric("accuracy", accuracy, {passed: passed, total: cases.count})
  eval_metric("run_id", uuid())
  eval_metric("generated_at", timestamp())
}
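
The substring check used for scoring is permissive: an expected value of "4" is also found inside "14". The sketch below shows one stricter rule, an exact match after light normalization, together with an accuracy computed as a floating-point fraction. It is written in Python rather than Harn for illustration only, and the helper names normalize, score_case, and accuracy are hypothetical.

```python
import string

def normalize(text: str) -> str:
    """Trim whitespace, strip surrounding punctuation, and lowercase."""
    return text.strip().strip(string.punctuation).strip().lower()

def score_case(expected: str, actual: str) -> bool:
    """Exact match after normalization, instead of a raw substring test."""
    return normalize(actual) == normalize(expected)

def accuracy(results: list[bool]) -> float:
    """Fraction of passing cases; 0.0 for an empty run to avoid dividing by zero."""
    if not results:
        return 0.0
    return sum(results) / len(results)

if __name__ == "__main__":
    outcomes = [
        score_case("4", "4."),         # trailing period is stripped before comparing
        score_case("Paris", "paris"),  # comparison ignores case
        score_case("4", "14"),         # contains "4" as a substring, but is not equal
    ]
    print(outcomes, accuracy(outcomes))
```

Exact matching trades false positives for false negatives: a verbose answer such as "The answer is 4" would fail under this rule, so the one-word system prompt carries more weight.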

4. Export a report

Once the metrics are recorded, write a compact report so a later run can diff the results.

pipeline main(task) {
  let summary = {
    run_id: uuid(),
    generated_at: timestamp(),
    accuracy: 0.83,
    notes: "Replace the fixed accuracy with real case scoring",
  }

  write_file("eval-summary.json", json_stringify(summary))
  log(json_stringify(summary))
}
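
One way to use the written report is to compare it with the summary from an earlier run. The Python sketch below is an assumption about how that comparison might look, not a Harn feature. It reads only the run_id and accuracy fields from the summary above; the tolerance parameter and the name compare_summaries are hypothetical.

```python
import json

def compare_summaries(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Compare two eval summaries and flag an accuracy drop larger than `tolerance`."""
    delta = candidate["accuracy"] - baseline["accuracy"]
    return {
        "baseline_run": baseline["run_id"],
        "candidate_run": candidate["run_id"],
        "accuracy_delta": round(delta, 4),
        "regressed": delta < -tolerance,
    }

if __name__ == "__main__":
    # Inline stand-ins for two eval-summary.json files from different runs.
    old = json.loads('{"run_id": "run-a", "accuracy": 0.83}')
    new = json.loads('{"run_id": "run-b", "accuracy": 0.75}')
    print(json.dumps(compare_summaries(old, new), indent=2))
```

Because the pipeline above always writes to eval-summary.json, keep a copy of the previous file before rerunning, or the baseline is overwritten.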

5. How to use it

Run the pipeline, inspect the metrics, then compare runs over time:

harn run examples/eval-workflow.harn
harn eval .harn-runs/<run-id>.json

A good eval pipeline answers three questions:

- Did the model improve?
- Did latency or token usage regress?
- Which cases failed, and why?
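
These questions map onto a per-case comparison between two runs. The Python sketch below assumes each run yields records with id and correct, as in the results built in step 2, plus a latency_ms field. That latency field is hypothetical, since the pipelines in this tutorial do not record it, and triage is an illustrative name.

```python
def triage(baseline: list[dict], candidate: list[dict]) -> dict:
    """Summarize accuracy change, mean latency change, and newly failing cases."""
    base_by_id = {r["id"]: r for r in baseline}

    def acc(rows: list[dict]) -> float:
        return sum(r["correct"] for r in rows) / len(rows) if rows else 0.0

    def mean_latency(rows: list[dict]) -> float:
        return sum(r["latency_ms"] for r in rows) / len(rows) if rows else 0.0

    # A case is "newly failing" if it passed in the baseline and fails now.
    newly_failing = [
        r["id"] for r in candidate
        if not r["correct"] and base_by_id.get(r["id"], {}).get("correct", False)
    ]
    return {
        "accuracy_delta": acc(candidate) - acc(baseline),
        "latency_delta_ms": mean_latency(candidate) - mean_latency(baseline),
        "newly_failing": newly_failing,
    }

if __name__ == "__main__":
    before = [
        {"id": "case-1", "correct": True, "latency_ms": 100},
        {"id": "case-2", "correct": True, "latency_ms": 200},
    ]
    after = [
        {"id": "case-1", "correct": True, "latency_ms": 100},
        {"id": "case-2", "correct": False, "latency_ms": 300},
    ]
    print(triage(before, after))
```

The sketch identifies which cases regressed; the "why" still requires reading the expected and actual values for each flagged case.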

Skill and guidance gates

Use harn eval skill-gate when the artifact under review is a skill or guidance edit. The manifest records contamination-safe held-out tasks, with-vs-without observations, the frontier score, context-cost inputs, and immutable grader checksums:

harn eval skill-gate examples/evals/skill-gate/smoke/manifest.json \
  --output .harn-runs/skill-gate/smoke

The command writes summary.json, per_case.jsonl, summary.md, and a machine-readable receipt.json using harn.skill_gate.receipt.v1. The gate excludes static public/pre-cutoff tasks, reports gap recovery per cluster, rejects regressions and context bloat, and fails closed when a protected grader file or directory hash changes.
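
Failing closed on a changed grader hash can be illustrated with a short Python sketch. This is not how harn eval skill-gate is implemented; the function names and the shape of the expected mapping (relative path to SHA-256 hex digest) are assumptions made for the example. The point is that a missing file or a mismatched digest blocks the check rather than being skipped.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def graders_unchanged(expected: dict[str, str], root: Path) -> bool:
    """Return True only if every protected file exists and matches its recorded digest."""
    for rel_path, digest in expected.items():
        target = root / rel_path
        if not target.is_file():
            return False  # missing file: fail closed
        if sha256_of(target) != digest:
            return False  # content changed: fail closed
    return True

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        grader = root / "grader.py"
        grader.write_text("def grade(x): return x == 4\n")
        expected = {"grader.py": sha256_of(grader)}
        print("before edit:", graders_unchanged(expected, root))
        # Editing the protected file after its digest was recorded must trip the check.
        grader.write_text("def grade(x): return True\n")
        print("after edit:", graders_unchanged(expected, root))
```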