Testing

Harn provides several layers of testing support: a conformance test runner, a standard library testing module, and host-mock helpers for isolating agent behavior from real host capabilities.

Conformance tests

Conformance tests are the primary executable specification for the Harn language and runtime. They live under conformance/tests/ as paired files:

test_name.harn — Harn source code
test_name.expected — exact expected stdout output

Tests are grouped by area into subdirectories. ls conformance/tests/ gives the current top-level map (examples: language/, control_flow/, types/, collections/, concurrency/, stdlib/, templates/, modules/, agents/, scenarios/ (cross-feature compositions), reminders/, runtime/). High-volume categories may have a second level — for example, stdlib/oauth/, stdlib/json/, stdlib/hitl/, stdlib/preset_hooks/, stdlib/tool_hooks/, and stdlib/project/ group the larger stdlib API surfaces. The runner discovers .harn files recursively, so new tests just need to be dropped into the appropriate subdirectory.

Shared helpers live alongside the tests that use them: conformance/tests/modules/lib/ holds import targets for the modules/ tests, and conformance/tests/templates/fixtures/ holds prompt-template fixtures for the templates/ tests. The cross-cutting helper conformance/tests/_common.harn is imported as "../_common" from any direct subdirectory and "../../_common" from a second-level subdirectory.

Error tests live in two complementary homes:

conformance/errors/, subdivided by error class into syntax/, types/, semantic/, and runtime/ — for tests organized by where the error fires in the compilation pipeline.
conformance/tests/errors_by_feature/ — for error tests grouped by the feature that produces them (for example, agent_loop_*, defer_*, catch_*, finally_*).

Both homes share the .harn + .error (or .expected) sibling-file convention and are walked by the same runner.

Running tests

# Run the full conformance suite
harn test conformance

# Filter by name (substring match)
harn test conformance --filter workflow_runtime

# Filter by name or path
harn test conformance --filter agent

# Verbose output
harn test conformance --filter my_test -v

# Timing summary without verbose failure details
harn test conformance --timing --filter my_test

Writing a conformance test

Create a .harn file with a pipeline default(task) entry point and use log() to produce output:

// conformance/tests/<group>/my_feature.harn  (e.g. stdlib/, types/)
pipeline default(task) {
  let result = my_feature(42)
  log(result)
}

Then create a .expected file with the exact output:

[harn] 84

The `std/testing` module

Import std/testing in your Harn tests for higher-level test helpers:

import { mock_host_result, assert_host_called, clear_host_mocks } from "std/testing"

Host mock helpers

Function	Description
`clear_host_mocks()`	Remove all registered host mocks
`mock_host_result(cap, op, result, params?)`	Mock a host capability to return a value
`mock_host_error(cap, op, message, params?)`	Mock a host capability to return an error
`mock_host_response(cap, op, config)`	Mock with full response configuration
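
As a minimal sketch of the helpers above, the pipeline below registers one success mock and one error mock for the same operation. It assumes that the optional params argument narrows a mock to calls with matching parameters, which the table implies but does not spell out. The file names are illustrative only.

```harn
import { clear_host_mocks, mock_host_result, mock_host_error } from "std/testing"

pipeline default(task) {
  // Start from a clean slate so earlier mocks cannot interfere.
  clear_host_mocks()

  // Assumption: params scopes each mock to calls with matching parameters.
  mock_host_result("workspace", "read_text", "alpha", {path: "a.txt"})
  mock_host_error("workspace", "read_text", "file not found", {path: "missing.txt"})

  // This call matches the first mock and should receive the canned value.
  let content = host_call("workspace.read_text", {path: "a.txt"})
  log(content)
}
```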

Host call assertions

Function	Description
`host_calls()`	Return all recorded host calls
`host_calls_for(cap, op)`	Return calls for a specific capability/operation
`host_call_count()` / `host_call_count_for(cap, op)`	Return recorded host call counts
`assert_host_called(cap, op, params?)`	Assert a host call was made
`assert_host_call_count(cap, op, expected_count)`	Assert exact call count
`assert_no_host_calls()`	Assert no host calls were made
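
The assertion helpers can be combined as in the following sketch. It assumes the host-call log is empty when the pipeline starts and that the params argument of assert_host_called is compared against the recorded call parameters. Both are assumptions, not guarantees stated above.

```harn
import { clear_host_mocks, mock_host_result, host_calls_for, host_call_count_for, assert_host_called, assert_no_host_calls } from "std/testing"

pipeline default(task) {
  clear_host_mocks()
  mock_host_result("workspace", "read_text", "file contents")

  // Assumption: nothing has called the host yet in this pipeline.
  assert_no_host_calls()

  host_call("workspace.read_text", {path: "a.txt"})
  host_call("workspace.read_text", {path: "b.txt"})

  // Both calls should now be recorded for this capability/operation pair.
  assert_eq(host_call_count_for("workspace", "read_text"), 2)
  assert_host_called("workspace", "read_text", {path: "b.txt"})

  // Log the recorded calls to inspect their shape.
  log(host_calls_for("workspace", "read_text"))
}
```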

Persona step assertions

Persona steel-thread tests can assert Harn orchestration boundaries without depending on Rust internals. step_assertions_begin(pattern?) installs PreStep / PostStep hooks for matching personas and records the hook payloads until step_assertions_end().

Helper	Description
`step_assertions_begin(persona_pattern?)`	Clear persona hooks and start recording matching step payloads
`step_events()` / `step_events_clear()`	Inspect or reset captured step payloads
`assert_steps_ran(names)`	Assert the exact ordered list of `@step` names
`assert_step_received(step, predicate?)`	Assert a `PreStep` payload matched a closure, dict subset, or value
`assert_step_emitted(step, predicate?)`	Assert a `PostStep` payload matched a closure, dict subset, or value
`assert_handoff_emitted(source, kind, target?)`	Assert a run record or handoff list contains a typed handoff
`assert_receipt_field(receipt, pointer, value)`	Assert an RFC 6901 JSON Pointer field in a receipt
`assert_golden_transcript(expected, actual)`	Structured subset matcher with `<ms>`, `<uuid>`, and `<any>` sentinels
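
The two data-oriented helpers can be tried without running a persona. The sketch below uses hand-written dicts as stand-ins for a receipt and a transcript entry. Their field names are hypothetical, and it is an assumption that plain dicts are accepted in place of real run artifacts.

```harn
import { assert_receipt_field, assert_golden_transcript } from "std/testing"

pipeline default(task) {
  // Hypothetical receipt; real receipts come from a persona run.
  let receipt = {
    status: "completed",
    handoff: {kind: "review_request", target: "reviewer"},
  }

  // JSON Pointer paths start at the root and separate segments with "/".
  assert_receipt_field(receipt, "/status", "completed")
  assert_receipt_field(receipt, "/handoff/target", "reviewer")

  // Hypothetical transcript entry with a run-specific identifier.
  let actual = {
    persona: "merge_captain",
    run_id: "3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b",
    summary: "merged after checks passed",
  }

  // Sentinels stand in for values that vary between runs.
  assert_golden_transcript(
    {persona: "merge_captain", run_id: "<uuid>", summary: "<any>"},
    actual,
  )
}
```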

Example

import { mock_host_result, assert_host_called, clear_host_mocks } from "std/testing"

pipeline default(task) {
  clear_host_mocks()

  // Mock the workspace.read_text capability
  mock_host_result("workspace", "read_text", "file contents")

  // Code under test calls host_call("workspace.read_text", ...)
  let content = host_call("workspace.read_text", {path: "test.txt"})
  log(content)

  // Verify the call was made
  assert_host_called("workspace", "read_text")
}

Scoped fixtures (`with_host_mocks` / `with_llm_mocks` / `with_mocks`)

Pipeline tests that mock many capabilities tend to accumulate manual mock setup and host_mock_clear() teardown pairs around each test. A failing assertion can skip the clear step and leak mocks into the next test. Scoped fixtures manage that lifecycle for you and clean up reliably even when the body throws.

Helper	Description
`with_host_mocks(mocks, body)`	Push a fresh host-mock scope, register `mocks`, run `body()`, restore on exit
`with_llm_mocks(mocks, body)`	Same shape for LLM mocks (FIFO + `match` patterns)
`with_mocks({host_mocks, llm_mocks}, body)`	Combined scope for tests that exercise both surfaces
`llm_calls()` / `llm_call_count()`	Inspect the LLM call log captured inside the current scope

Each entry in the host-mock list is a dict shaped like the existing host_mock(...) config:

{capability: "runtime", operation: "pipeline_input", result: {}, params: {}}
{capability: "project", operation: "metadata_set", error: "denied"}

error (if non-nil) takes precedence over result, mirroring mock_host_error / mock_host_result.

import { with_host_mocks, assert_host_called } from "std/testing"

pipeline test_skill_registry() {
  with_host_mocks(
    [
      {capability: "runtime", operation: "pipeline_input", result: {}},
      {capability: "project", operation: "skills", result: []},
    ],
    { ->
      let registry = skill_registry_from_host()
      assert_eq(len(registry.skills), 0, "no skills registered")
      assert_host_called("project", "skills")
    },
  )
}

Key properties:

The body runs inside a fresh host-mock and host-call log; nothing inside leaks out, and nothing outside is visible inside.
The prior state is restored before the helper returns, including when the body throws — the thrown error is re-raised after cleanup.
Scopes nest: an inner with_host_mocks sees only its own mocks while active, then pops back to the outer scope on exit.
with_llm_mocks follows the same shape; entries are passed straight to llm_mock(...), so any field accepted by that builtin (including match / consume_match / error) is supported.
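
The nesting behavior described in the list above can be sketched as follows. The capability, operation, and path are illustrative; the expected values follow from the stated rule that an inner scope sees only its own mocks and that the outer scope is restored on exit.

```harn
import { with_host_mocks } from "std/testing"

pipeline test_nested_scopes() {
  with_host_mocks(
    [{capability: "workspace", operation: "read_text", result: "outer"}],
    { ->
      with_host_mocks(
        [{capability: "workspace", operation: "read_text", result: "inner"}],
        { ->
          // Only the inner scope's mock is visible here.
          let inner = host_call("workspace.read_text", {path: "a.txt"})
          assert_eq(inner, "inner")
        },
      )

      // The inner scope has been popped, so the outer mock answers again.
      let outer = host_call("workspace.read_text", {path: "a.txt"})
      assert_eq(outer, "outer")
    },
  )
}
```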

with_mocks(config, body) is the unified form for tests that need both:

with_mocks(
  {
    host_mocks: [{capability: "ws", operation: "read", result: "ok"}],
    llm_mocks: [{text: "agreed"}],
  },
  { ->
    run_pipeline_under_test()
  },
)
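
For LLM-only tests, with_llm_mocks can be paired with llm_call_count() to check how many calls the body made. This is a hedged sketch: the prompt text is illustrative, and calling llm_call with a message list is an assumption about its signature.

```harn
import { with_llm_mocks, llm_call_count } from "std/testing"

pipeline test_llm_scope() {
  with_llm_mocks(
    [{match: "*summarize*", text: "short summary"}],
    { ->
      // Assumption: llm_call accepts a list of role/content messages.
      let reply = llm_call([
        {role: "user", content: "Please summarize the diff"},
      ])
      log(reply)

      // The scoped call log should contain exactly the one call made above.
      assert_eq(llm_call_count(), 1)
    },
  )
}
```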

LLM mocking

For testing agent loops without real LLM calls, use llm_mock():

llm_mock({text: "The answer is 42"})

let result = llm_call([
  {role: "user", content: "What is the answer?"},
])
log(result)

This queues a canned response that the next LLM call consumes.

For end-to-end CLI runs, harn run and harn playground can preload the same mock infrastructure from a JSONL fixture file:

{"text":"PLAN: find the middleware module first","model":"fixture-model"}
{"match":"*hello*","text":"matched","model":"fixture-model"}
{"match":"*","error":{"category":"rate_limit","message":"fake rate limit"}}
{"match":"*retry*","error":{"status":503,"kind":"transient","reason":"upstream_unavailable"}}

harn run script.harn --llm-mock fixtures.jsonl
harn playground --script pipeline.harn --llm-mock fixtures.jsonl

A line without match is FIFO and is consumed on use.
A line with match is checked in file order as a glob against the request transcript text.
Add "consume_match": true when repeated matching prompts should advance through a scripted sequence instead of reusing the same line forever.
When no fixture matches, harn run --llm-mock ... and harn playground --llm-mock ... fail with the first prompt snippet so you can add the missing case directly.
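
Because llm_mock(...) accepts the same match and consume_match fields, a scripted sequence can also be registered from Harn code. The sketch below assumes that registration order plays the role of file order and that llm_call accepts a message list. Under those assumptions the first call should receive "first reply" and the second "second reply".

```harn
pipeline default(task) {
  // Two entries share a pattern; consume_match retires each one after use.
  llm_mock({match: "*deploy*", text: "first reply", consume_match: true})
  llm_mock({match: "*deploy*", text: "second reply", consume_match: true})

  // Assumption: each matching call advances to the next scripted entry.
  log(llm_call([{role: "user", content: "deploy the service"}]))
  log(llm_call([{role: "user", content: "deploy the service again"}]))
}
```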

To capture a replayable fixture from a run, record once and then replay the saved JSONL:

harn run script.harn --llm-mock-record fixtures.jsonl
harn run script.harn --llm-mock fixtures.jsonl

harn playground --script pipeline.harn --llm-mock-record fixtures.jsonl
harn playground --script pipeline.harn --llm-mock fixtures.jsonl

To import an external eval trace into the same fixture format:

harn trace import \
  --trace-file traces/generic.jsonl \
  --trace-id trace_123 \
  --output fixtures/imported.jsonl

The importer expects JSONL records shaped like {prompt, response, tool_calls} and passes through common metadata such as model, provider, and token counts when present.

Eval kinds

harn eval supports the default replay fixture flow plus an explicit clarifying-question kind for ambiguous tasks.

harn eval context <manifest> supports deterministic context-engineering fixtures for pack, projection, compaction, and tool-disclosure experiments. A manifest declares task fixtures and one or more context modes; the runner scores each task/mode pair without model calls and writes stable local artifacts: summary.json, per_run.jsonl, and summary.md. Use the builders in std/context/eval when authoring manifests from Harn code, and use spec/schemas/context-eval-report.v1.schema.json when ingesting harn.context_eval.report.v1 reports from hosted systems or downstream UIs.

harn eval context examples/evals/context-engineering-smoke.json \
  --output target/context-eval --json

harn eval scope_triage runs the opt-in pre-turn scope-classifier measurement harness. The default mode uses a deterministic reference classifier over the 100-case synthetic dataset; pass --live --model ollama:qwen3:1.7b to exercise the local small-model classifier. The report includes turn-cost reduction, coverage, false-positive rate, false-negative rate, and a keep-default-off / graduate decision.

harn eval scope_triage --output .harn-runs/scope-triage/latest

Eval packs

Portable eval packs live in harn.eval.toml or another TOML file listed in [package].evals in harn.toml. The same pack can be run locally and imported by hosted tooling because it contains only portable fixture references, rubrics, judge metadata, thresholds, and package metadata.

version = 1
id = "slack-connector"
name = "Slack connector evals"

[package]
name = "slack-connector"
version = "0.1.0"

[[fixtures]]
id = "url-verification-run"
kind = "run-record"
path = "fixtures/url-verification.run.json"

[[fixtures]]
id = "url-verification-replay"
kind = "replay-fixture"
path = "fixtures/url-verification.replay.json"

[[rubrics]]
id = "webhook-normalization"
kind = "deterministic"
description = "Webhook normalization keeps status and response shape stable."

[[rubrics.assertions]]
kind = "run-status"
expected = "completed"

[[cases]]
id = "url-verification"
name = "URL verification handshake"
run = "url-verification-run"
fixture = "url-verification-replay"
rubrics = ["webhook-normalization"]
severity = "blocking"

[cases.thresholds]
max-latency-ms = 500
max-cost-usd = 0.001

Run a single pack directly:

harn eval harn.eval.toml

Run the eval packs shipped by a package:

harn test package --evals

After harn install, this also includes eval packs declared by installed dependency packages under .harn/packages/<alias>/. Dependency eval packs are passive until this command or a root eval_pack://... trigger references them.

[package].evals is optional when the package root contains harn.eval.toml; otherwise declare one or more package-relative pack paths:

[package]
name = "slack-connector"
version = "0.1.0"
evals = ["evals/webhooks.toml", "evals/replay.toml"]

Fixture refs support these portable kind values:

Kind	Local behavior
`run-record` or `recorded-run`	Loads a persisted Harn run record JSON file
`replay-fixture`	Loads a replay fixture JSON file
`friction-events`	Loads repeated-friction event fixtures and evaluates generated context-pack suggestions
`jsonl-trace`	Reserved for imported trace fixture metadata
`provider-events`	Reserved for synthetic provider event streams
`connector-payload`	Reserved for connector payload samples

Local harn eval executes replay fixtures, baseline comparisons, deterministic assertions, HITL question assertions, repeated-friction context-pack suggestion assertions, and cost/latency/token/stage thresholds. llm-judge rubrics carry judge model, calibration, tie-break, and prompt-version metadata for hosted or explicit judge runners; a blocking llm-judge rubric fails locally rather than being silently skipped.

Eval packs can also include persona timeout ladders. A [[ladders]] entry runs the same persona fixture across every configured model-routes / timeout-tiers combination, writes per-tier JSONL transcripts, receipts, and summaries, and reports the first route/tier that completed correctly. Degraded and looping tiers remain in the machine-readable report so host CLIs and TUIs can render the same result without reimplementing the matrix runner.

[[ladders]]
id = "merge-captain-green-pr"
persona = "merge_captain"
artifact-root = ".harn-runs/merge-captain-timeout-ladder"

[ladders.backend]
kind = "replay"
path = "../../examples/personas/merge_captain/transcripts/green_pr.jsonl"

[[ladders.model-routes]]
id = "gemma-value"
route = "local/gemma-value"
provider = "llama.cpp"
model = "gemma"
profile = "value"

[[ladders.timeout-tiers]]
id = "balanced"
timeout-ms = 500
max-tool-calls = 4
max-model-calls = 1

Repeated-friction cases use friction_events = "<fixture-id-or-path>" and a rubric assertion such as:

[[rubrics.assertions]]
kind = "context-pack-suggestion"
contains = "incident"
expected = { min_suggestions = 1, recommended_artifact = "context_pack", required_capability = "splunk.search" }

Threshold severity controls gate behavior:

Severity	Local gate behavior
`blocking`	Failing case exits non-zero
`warning`	Failure is reported but does not fail the command
`informational`	Failure is reported as info only

Replay evals

Replay evals are the default. They compare a run's persisted status and stage outcomes against an embedded or explicit replay fixture.

Clarifying-question evals

Clarifying-question evals assert that the agent called ask_user(...) and asked the minimal question required to proceed. The run record persists ask_user prompts, and the fixture can require a single question plus term-level constraints:

{
  "_type": "replay_fixture",
  "eval_kind": "clarifying_question",
  "expected_status": "completed",
  "clarifying_question": {
    "required_terms": ["repository"],
    "forbidden_terms": ["branch"],
    "min_questions": 1,
    "max_questions": 1
  }
}

Use this when defaults would be unsafe and the right behavior is to ask the user before continuing.

Determinism harness

Use harn test --determinism to assert that a pipeline replays the same way on a second pass:

harn test --determinism tests/agent_loop.harn

The harness records once and replays once when no sibling <name>.llm-mock.jsonl exists. If a sibling fixture is already present, it replays both passes from that fixture. It compares stdout, provider response payloads from llm_transcript.jsonl, and persisted run-record structure to catch branching drift.

Built-in assertions

Harn provides assert, assert_eq, and assert_ne builtins for test pipelines:

assert(x > 0, "x must be positive")
assert_eq(actual, expected)
assert_ne(actual, unexpected)
assert_eq(len(items), 3)

Failed assertions throw an error with a descriptive message including the expected and actual values.

Use require for runtime invariants in normal pipelines. The linter warns if you use assert* outside test pipelines, and it suggests assert* instead of require inside test pipelines.

Cross-platform test coverage

Most workspace tests run on both Unix and Windows. A small set of test modules opts out of Windows via #![cfg(unix)] because they exercise POSIX-only semantics (bash-fixture process spawning, SIGTERM-driven graceful shutdown). The full inventory and disposition live at Windows test coverage, and the Windows nightly GitHub Actions workflow runs the portable surface on windows-latest so cross-platform regressions surface within 24 hours.