Testing

Harn provides several layers of testing support: a conformance test runner, a standard library testing module, and host-mock helpers for isolating agent behavior from real host capabilities.

Conformance tests

Conformance tests are the primary executable specification for the Harn language and runtime. They live under conformance/tests/ as paired files:

  • test_name.harn — Harn source code
  • test_name.expected — exact expected stdout output

Tests are grouped by area into subdirectories. Run ls conformance/tests/ to see the current top-level layout; examples include language/, control_flow/, types/, collections/, concurrency/, stdlib/, templates/, modules/, agents/, scenarios/ (cross-feature compositions), reminders/, and runtime/. High-volume categories may have a second level: for example, stdlib/oauth/, stdlib/json/, stdlib/hitl/, stdlib/preset_hooks/, stdlib/tool_hooks/, and stdlib/project/ group the larger stdlib API surfaces. The runner discovers .harn files recursively, so a new test only needs to be dropped into the appropriate subdirectory.

Shared helpers live alongside the tests that use them: conformance/tests/modules/lib/ holds import targets for the modules/ tests, and conformance/tests/templates/fixtures/ holds prompt-template fixtures for the templates/ tests. The cross-cutting helper conformance/tests/_common.harn is imported as "../_common" from any direct subdirectory and "../../_common" from a second-level subdirectory.

Error tests live in two complementary homes:

  • conformance/errors/, subdivided by error class into syntax/, types/, semantic/, and runtime/ — for tests organized by where the error fires in the compilation pipeline.
  • conformance/tests/errors_by_feature/ — for error tests grouped by the feature that produces them (for example, agent_loop_*, defer_*, catch_*, finally_*).

Both homes share the .harn + .error (or .expected) sibling-file convention and are walked by the same runner.

Running tests

# Run the full conformance suite
harn test conformance

# Filter by name (substring match)
harn test conformance --filter workflow_runtime

# Filter by name or path
harn test conformance --filter agent

# Verbose output
harn test conformance --filter my_test -v

# Timing summary without verbose failure details
harn test conformance --timing --filter my_test

Writing a conformance test

Create a .harn file with a pipeline default(task) entry point and use log() to produce output:

// conformance/tests/<group>/my_feature.harn  (e.g. stdlib/, types/)
pipeline default(task) {
  let result = my_feature(42)
  log(result)
}

Then create a .expected file with the exact output:

[harn] 84

The std/testing module

Import std/testing in your Harn tests for higher-level test helpers:

import { mock_host_result, assert_host_called, clear_host_mocks } from "std/testing"

Host mock helpers

Function | Description
clear_host_mocks() | Remove all registered host mocks
mock_host_result(cap, op, result, params?) | Mock a host capability to return a value
mock_host_error(cap, op, message, params?) | Mock a host capability to return an error
mock_host_response(cap, op, config) | Mock with full response configuration

Host call assertions

Function | Description
host_calls() | Return all recorded host calls
host_calls_for(cap, op) | Return calls for a specific capability/operation
host_call_count() / host_call_count_for(cap, op) | Return recorded host call counts
assert_host_called(cap, op, params?) | Assert a host call was made
assert_host_call_count(expected_count, cap, op) | Assert exact call count
assert_no_host_calls() | Assert no host calls were made

Persona step assertions

Persona steel-thread tests can assert Harn orchestration boundaries without depending on Rust internals. step_assertions_begin(pattern?) installs PreStep / PostStep hooks for matching personas and records the hook payloads until step_assertions_end().

Helper | Description
step_assertions_begin(persona_pattern?) | Clear persona hooks and start recording matching step payloads
step_events() / step_events_clear() | Inspect or reset captured step payloads
assert_steps_ran(names) | Assert the exact ordered list of @step names
assert_step_received(step, predicate?) | Assert a PreStep payload matched a closure, dict subset, or value
assert_step_emitted(step, predicate?) | Assert a PostStep payload matched a closure, dict subset, or value
assert_handoff_emitted(source, kind, target?) | Assert a run record or handoff list contains a typed handoff
assert_receipt_field(receipt, pointer, value) | Assert an RFC 6901 JSON Pointer field in a receipt
assert_golden_transcript(expected, actual) | Structured subset matcher with <ms>, <uuid>, and <any> sentinels
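
A rough sketch of how the step helpers might fit together in one test. Everything specific here is an assumption made for illustration: run_persona_under_test() is a hypothetical placeholder for whatever starts the persona, the persona pattern, step names, and payload key are invented, and the sketch assumes assertions are made before step_assertions_end() is called.

```harn
import { step_assertions_begin, step_assertions_end, assert_steps_ran, assert_step_received } from "std/testing"

pipeline default(task) {
  // Start recording PreStep / PostStep payloads for personas matching the pattern
  step_assertions_begin("merge_captain")

  // Hypothetical placeholder: replace with the call that runs the persona under test
  run_persona_under_test()

  // Hypothetical step names; the list is compared against the exact execution order
  assert_steps_ran(["plan", "verify"])

  // Dict-subset form: the PreStep payload for "plan" must contain this key/value
  // (the key and value are illustrative, not a documented payload shape)
  assert_step_received("plan", {repo: "example/repo"})

  // Stop recording
  step_assertions_end()
}
```

The dict-subset predicate keeps the test focused on the fields that matter; a closure predicate would be the alternative when the check needs logic rather than equality.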

Example

import { mock_host_result, assert_host_called, clear_host_mocks } from "std/testing"

pipeline default(task) {
  clear_host_mocks()

  // Mock the workspace.read_text capability
  mock_host_result("workspace", "read_text", "file contents")

  // Code under test calls host_call("workspace.read_text", ...)
  let content = host_call("workspace.read_text", {path: "test.txt"})
  log(content)

  // Verify the call was made
  assert_host_called("workspace", "read_text")
}
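
The count-based helpers listed under "Host call assertions" can be used the same way. The following is a minimal sketch, assuming that a mock registered with mock_host_result answers the matching host_call and that each call lands in the log read by host_call_count_for and assert_host_call_count; the capability and path are reused from the example above purely for illustration.

```harn
import { clear_host_mocks, mock_host_result, host_call_count_for, assert_host_call_count } from "std/testing"

pipeline default(task) {
  clear_host_mocks()
  mock_host_result("workspace", "read_text", "file contents")

  // One call through the mocked capability
  let content = host_call("workspace.read_text", {path: "test.txt"})
  log(content)

  // Exact-count assertion: the expected count comes first, then capability and operation
  assert_host_call_count(1, "workspace", "read_text")

  // The same number is available as a value for custom checks
  assert_eq(host_call_count_for("workspace", "read_text"), 1)
}
```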

Scoped fixtures (with_host_mocks / with_llm_mocks / with_mocks)

Pipeline tests with many capabilities accumulate manual host_mock_clear() pairs around each test. A failing assertion can skip the clear step and leak mocks into the next test. Scoped fixtures handle that lifecycle for you and clean up reliably even when the body throws.

Helper | Description
with_host_mocks(mocks, body) | Push a fresh host-mock scope, register mocks, run body(), restore on exit
with_llm_mocks(mocks, body) | Same shape for LLM mocks (FIFO + match patterns)
with_mocks({host_mocks, llm_mocks}, body) | Combined scope for tests that exercise both surfaces
llm_calls() / llm_call_count() | Inspect the LLM call log captured inside the current scope

Each entry in the host-mock list is a dict shaped like the existing host_mock(...) config:

{capability: "runtime", operation: "pipeline_input", result: {}, params: {}}
{capability: "project", operation: "metadata_set", error: "denied"}

error (if non-nil) takes precedence over result, mirroring mock_host_error / mock_host_result.

import { with_host_mocks, assert_host_called } from "std/testing"

pipeline test_skill_registry() {
  with_host_mocks(
    [
      {capability: "runtime", operation: "pipeline_input", result: {}},
      {capability: "project", operation: "skills", result: []},
    ],
    { ->
      let registry = skill_registry_from_host()
      assert_eq(len(registry.skills), 0, "no skills registered")
      assert_host_called("project", "skills", nil, nil)
    },
  )
}

Key properties:

  • The body runs inside a fresh host-mock and host-call log; nothing inside leaks out, and nothing outside is visible inside.
  • The prior state is restored before the helper returns, including when the body throws — the thrown error is re-raised after cleanup.
  • Scopes nest: an inner with_host_mocks sees only its own mocks while active, then pops back to the outer scope on exit.
  • with_llm_mocks follows the same shape; entries are passed straight to llm_mock(...), so any field accepted by that builtin (including match / consume_match / error) is supported.
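
The nesting property can be illustrated with two scopes that mock the same operation. This is a sketch under the assumption that host_call inside a scope is answered by that scope's mocks; the "ws" capability and "read" operation are illustrative names only.

```harn
import { with_host_mocks } from "std/testing"

pipeline default(task) {
  with_host_mocks(
    [{capability: "ws", operation: "read", result: "outer"}],
    { ->
      with_host_mocks(
        [{capability: "ws", operation: "read", result: "inner"}],
        { ->
          // Inner scope is active: only the inner mock is visible here
          log(host_call("ws.read", {}))
        },
      )

      // Inner scope has been popped: the outer mock answers again
      log(host_call("ws.read", {}))
    },
  )
}
```

If the properties listed above hold, the first log should show the inner result and the second the outer one, with no explicit cleanup call in either scope.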

with_mocks(config, body) is the unified form for tests that need both:

with_mocks(
  {
    host_mocks: [{capability: "ws", operation: "read", result: "ok"}],
    llm_mocks: [{text: "agreed"}],
  },
  { ->
    run_pipeline_under_test()
  },
)
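
For LLM-only tests, with_llm_mocks can be combined with the call-log helpers. A minimal sketch, assuming llm_call accepts a plain prompt string and that llm_call_count() counts only calls made inside the current scope; the prompt and response text are illustrative.

```harn
import { with_llm_mocks, llm_call_count } from "std/testing"

pipeline default(task) {
  with_llm_mocks(
    [{text: "agreed"}],
    { ->
      // Assumption: a plain prompt string is a valid llm_call argument
      let reply = llm_call("Do you agree?")
      log(reply)

      // Exactly one call should appear in this scope's LLM call log
      assert_eq(llm_call_count(), 1, "one LLM call recorded")
    },
  )
}
```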

LLM mocking

For testing agent loops without real LLM calls, use llm_mock():

llm_mock({text: "The answer is 42"})

let result = llm_call("What is the answer?")
log(result)

This queues a canned response that the next LLM call consumes.

For end-to-end CLI runs, harn run and harn playground can preload the same mock infrastructure from a JSONL fixture file:

{"text":"PLAN: find the middleware module first","model":"fixture-model"}
{"match":"*hello*","text":"matched","model":"fixture-model"}
{"match":"*","error":{"category":"rate_limit","message":"fake rate limit"}}
{"match":"*retry*","error":{"status":503,"kind":"transient","reason":"upstream_unavailable"}}

Pass the fixture file with --llm-mock:

harn run script.harn --llm-mock fixtures.jsonl
harn playground --script pipeline.harn --llm-mock fixtures.jsonl

Fixture lines are resolved as follows:

  • A line without match is FIFO and is consumed on use.
  • A line with match is checked in file order as a glob against the request transcript text.
  • Add "consume_match": true when repeated matching prompts should advance through a scripted sequence instead of reusing the same line forever.
  • When no fixture matches, harn run --llm-mock ... and harn playground --llm-mock ... fail with the first prompt snippet so you can add the missing case directly.
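
The same rules can be tried from inside a script. The sketch below assumes that in-script llm_mock(...) entries follow the fixture-line rules described above (the text says the CLI preloads "the same mock infrastructure", but the equivalence is an assumption here), and that llm_call accepts a plain prompt string; the patterns and response texts are illustrative.

```harn
pipeline default(task) {
  // Two entries share one pattern; consume_match makes each usable once,
  // so repeated matching prompts advance through them in order
  llm_mock({match: "*status*", consume_match: true, text: "still running"})
  llm_mock({match: "*status*", consume_match: true, text: "done"})

  // Catch-all without consume_match: reused for any other prompt
  llm_mock({match: "*", text: "fallback"})

  log(llm_call("status?"))
  log(llm_call("status?"))
  log(llm_call("anything else"))
}
```

Without consume_match on the first entry, both status prompts would be expected to reuse that entry and the second scripted response would never be reached.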

To capture a replayable fixture from a run, record once and then replay the saved JSONL:

harn run script.harn --llm-mock-record fixtures.jsonl
harn run script.harn --llm-mock fixtures.jsonl

harn playground --script pipeline.harn --llm-mock-record fixtures.jsonl
harn playground --script pipeline.harn --llm-mock fixtures.jsonl

To import an external eval trace into the same fixture format:

harn trace import \
  --trace-file traces/generic.jsonl \
  --trace-id trace_123 \
  --output fixtures/imported.jsonl

The importer expects JSONL records shaped like {prompt, response, tool_calls} and passes through common metadata such as model, provider, and token counts when present.

Eval kinds

harn eval supports the default replay fixture flow plus an explicit clarifying-question kind for ambiguous tasks.

harn eval context <manifest> supports deterministic context-engineering fixtures for pack, projection, compaction, and tool-disclosure experiments. A manifest declares task fixtures and one or more context modes; the runner scores each task/mode pair without model calls and writes stable local artifacts: summary.json, per_run.jsonl, and summary.md. Use the builders in std/context/eval when authoring manifests from Harn code, and use spec/schemas/context-eval-report.v1.schema.json when ingesting harn.context_eval.report.v1 reports from hosted systems or downstream UIs.

harn eval context examples/evals/context-engineering-smoke.json \
  --output target/context-eval --json

harn eval scope_triage runs the opt-in pre-turn scope-classifier measurement harness. The default mode uses a deterministic reference classifier over the 100-case synthetic dataset; pass --live --model ollama:qwen3:1.7b to exercise the local small-model classifier. The report includes turn-cost reduction, coverage, false-positive rate, false-negative rate, and a keep-default-off / graduate decision.

harn eval scope_triage --output .harn-runs/scope-triage/latest

Eval packs

Portable eval packs live in harn.eval.toml or another TOML file listed in [package].evals in harn.toml. The same pack can be run locally and imported by hosted tooling because it contains only portable fixture references, rubrics, judge metadata, thresholds, and package metadata.

version = 1
id = "slack-connector"
name = "Slack connector evals"

[package]
name = "slack-connector"
version = "0.1.0"

[[fixtures]]
id = "url-verification-run"
kind = "run-record"
path = "fixtures/url-verification.run.json"

[[fixtures]]
id = "url-verification-replay"
kind = "replay-fixture"
path = "fixtures/url-verification.replay.json"

[[rubrics]]
id = "webhook-normalization"
kind = "deterministic"
description = "Webhook normalization keeps status and response shape stable."

[[rubrics.assertions]]
kind = "run-status"
expected = "completed"

[[cases]]
id = "url-verification"
name = "URL verification handshake"
run = "url-verification-run"
fixture = "url-verification-replay"
rubrics = ["webhook-normalization"]
severity = "blocking"

[cases.thresholds]
max-latency-ms = 500
max-cost-usd = 0.001

Run a single pack directly:

harn eval harn.eval.toml

Run the eval packs shipped by a package:

harn test package --evals

After harn install, this also includes eval packs declared by installed dependency packages under .harn/packages/<alias>/. Dependency eval packs are passive until this command or a root eval_pack://... trigger references them.

[package].evals is optional when the package root contains harn.eval.toml; otherwise declare one or more package-relative pack paths:

[package]
name = "slack-connector"
version = "0.1.0"
evals = ["evals/webhooks.toml", "evals/replay.toml"]

Fixture refs support these portable kind values:

Kind | Local behavior
run-record or recorded-run | Loads a persisted Harn run record JSON file
replay-fixture | Loads a replay fixture JSON file
friction-events | Loads repeated-friction event fixtures and evaluates generated context-pack suggestions
jsonl-trace | Reserved for imported trace fixture metadata
provider-events | Reserved for synthetic provider event streams
connector-payload | Reserved for connector payload samples

Local harn eval executes replay fixtures, baseline comparisons, deterministic assertions, HITL question assertions, repeated-friction context-pack suggestion assertions, and cost/latency/token/stage thresholds. llm-judge rubrics carry judge model, calibration, tie-break, and prompt-version metadata for hosted or explicit judge runners; a blocking llm-judge rubric fails locally rather than being silently skipped.

Eval packs can also include persona timeout ladders. A [[ladders]] entry runs the same persona fixture across every configured model-routes / timeout-tiers combination, writes per-tier JSONL transcripts, receipts, and summaries, and reports the first route/tier that completed correctly. Degraded and looping tiers remain in the machine-readable report so host CLIs and TUIs can render the same result without reimplementing the matrix runner.

[[ladders]]
id = "merge-captain-green-pr"
persona = "merge_captain"
artifact-root = ".harn-runs/merge-captain-timeout-ladder"

[ladders.backend]
kind = "replay"
path = "../../examples/personas/merge_captain/transcripts/green_pr.jsonl"

[[ladders.model-routes]]
id = "gemma-value"
route = "local/gemma-value"
provider = "llama.cpp"
model = "gemma"
profile = "value"

[[ladders.timeout-tiers]]
id = "balanced"
timeout-ms = 500
max-tool-calls = 4
max-model-calls = 1

Repeated-friction cases use friction_events = "<fixture-id-or-path>" and a rubric assertion such as:

[[rubrics.assertions]]
kind = "context-pack-suggestion"
contains = "incident"
expected = { min_suggestions = 1, recommended_artifact = "context_pack", required_capability = "splunk.search" }

Threshold severity controls gate behavior:

Severity | Local gate behavior
blocking | Failing case exits non-zero
warning | Failure is reported but does not fail the command
informational | Failure is reported as info only

Replay evals

Replay evals are the default. They compare a run's persisted status and stage outcomes against an embedded or explicit replay fixture.

Clarifying-question evals

Clarifying-question evals assert that the agent called ask_user(...) and asked the minimal question required to proceed. The run record persists ask_user prompts, and the fixture can require a single question plus term-level constraints:

{
  "_type": "replay_fixture",
  "eval_kind": "clarifying_question",
  "expected_status": "completed",
  "clarifying_question": {
    "required_terms": ["repository"],
    "forbidden_terms": ["branch"],
    "min_questions": 1,
    "max_questions": 1
  }
}

Use this when defaults would be unsafe and the right behavior is to ask the user before continuing.

Determinism harness

Use harn test --determinism to assert that a pipeline replays the same way on a second pass:

harn test --determinism tests/agent_loop.harn

The harness records once and replays once when no sibling <name>.llm-mock.jsonl exists. If a sibling fixture is already present, it replays both passes from that fixture. It compares stdout, provider response payloads from llm_transcript.jsonl, and persisted run-record structure to catch branching drift.

Built-in assertions

Harn provides assert, assert_eq, and assert_ne builtins for test pipelines:

assert(x > 0, "x must be positive")
assert_eq(actual, expected)
assert_ne(actual, unexpected)
assert_eq(len(items), 3)

Failed assertions throw an error with a descriptive message including the expected and actual values.

Use require for runtime invariants in normal pipelines. The linter warns if you use assert* outside test pipelines, and it suggests assert* instead of require inside test pipelines.

Cross-platform test coverage

Most workspace tests run on both Unix and Windows. A small set of test modules opts out of Windows via #![cfg(unix)] because they exercise POSIX-only semantics (bash-fixture process spawning, SIGTERM-driven graceful shutdown). The full inventory and its disposition are documented in Windows test coverage, and the "Windows nightly" GitHub Actions workflow runs the portable surface on windows-latest every night, so cross-platform regressions surface within 24 hours.