Testing
Harn provides several layers of testing support: a conformance test runner, a standard library testing module, and host-mock helpers for isolating agent behavior from real host capabilities.
Conformance tests
Conformance tests are the primary executable specification for the Harn
language and runtime. They live under conformance/tests/ as paired files:
- test_name.harn: Harn source code
- test_name.expected: exact expected stdout output
Tests are grouped by area into subdirectories. ls conformance/tests/ gives
the current top-level map (examples: language/, control_flow/, types/,
collections/, concurrency/, stdlib/, templates/, modules/,
agents/, scenarios/ (cross-feature compositions), reminders/,
runtime/). High-volume categories may have a second level — for example,
stdlib/oauth/, stdlib/json/, stdlib/hitl/, stdlib/preset_hooks/,
stdlib/tool_hooks/, and stdlib/project/ group the larger stdlib API
surfaces. The runner discovers .harn files recursively, so new tests just
need to be dropped into the appropriate subdirectory.
Shared helpers live alongside the tests that use them:
conformance/tests/modules/lib/ holds import targets for the modules/
tests, and conformance/tests/templates/fixtures/ holds prompt-template
fixtures for the templates/ tests. The cross-cutting helper
conformance/tests/_common.harn is imported as "../_common" from any
direct subdirectory and "../../_common" from a second-level subdirectory.
Error tests live in two complementary homes:
- conformance/errors/, subdivided by error class into syntax/, types/, semantic/, and runtime/, for tests organized by where the error fires in the compilation pipeline.
- conformance/tests/errors_by_feature/, for error tests grouped by the feature that produces them (for example, agent_loop_*, defer_*, catch_*, finally_*).
Both homes share the .harn + .error (or .expected) sibling-file
convention and are walked by the same runner.
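The sibling-file convention can be checked mechanically before a run. The following is a minimal Python sketch, not part of Harn, that walks a tree and reports .harn files with neither a .expected nor a .error sibling. The name find_unpaired is hypothetical, and the sketch assumes every .harn file is a test, so helpers such as _common.harn or files under modules/lib/ would need an exclusion list in practice.

```python
from pathlib import Path


def find_unpaired(root):
    """Return .harn files under root that lack a .expected or .error sibling.

    Assumption: every .harn file is a test. Helper modules are reported too,
    so a real check would need to exclude them.
    """
    unpaired = []
    for source in sorted(Path(root).rglob("*.harn")):
        has_expected = source.with_suffix(".expected").exists()
        has_error = source.with_suffix(".error").exists()
        if not (has_expected or has_error):
            unpaired.append(source)
    return unpaired
```

Calling find_unpaired("conformance/tests") from the repository root would list candidates to fix or exclude.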
Running tests
# Run the full conformance suite
harn test conformance
# Filter by name (substring match)
harn test conformance --filter workflow_runtime
# Filter by name or path
harn test conformance --filter agent
# Verbose output
harn test conformance --filter my_test -v
# Timing summary without verbose failure details
harn test conformance --timing --filter my_test
Writing a conformance test
Create a .harn file with a pipeline default(task) entry point and use
log() to produce output:
// conformance/tests/<group>/my_feature.harn (e.g. stdlib/, types/)
pipeline default(task) {
  let result = my_feature(42)
  log(result)
}
Then create a .expected file with the exact output:
[harn] 84
The std/testing module
Import std/testing in your Harn tests for higher-level test helpers:
import { mock_host_result, assert_host_called, clear_host_mocks } from "std/testing"
Host mock helpers
| Function | Description |
|---|---|
| clear_host_mocks() | Remove all registered host mocks |
| mock_host_result(cap, op, result, params?) | Mock a host capability to return a value |
| mock_host_error(cap, op, message, params?) | Mock a host capability to return an error |
| mock_host_response(cap, op, config) | Mock with full response configuration |
Host call assertions
| Function | Description |
|---|---|
| host_calls() | Return all recorded host calls |
| host_calls_for(cap, op) | Return calls for a specific capability/operation |
| host_call_count() / host_call_count_for(cap, op) | Return recorded host call counts |
| assert_host_called(cap, op, params?) | Assert a host call was made |
| assert_host_call_count(expected_count, cap, op) | Assert exact call count |
| assert_no_host_calls() | Assert no host calls were made |
Persona step assertions
Persona steel-thread tests can assert Harn orchestration boundaries without
depending on Rust internals. step_assertions_begin(pattern?) installs
PreStep / PostStep hooks for matching personas and records the hook
payloads until step_assertions_end().
| Helper | Description |
|---|---|
| step_assertions_begin(persona_pattern?) | Clear persona hooks and start recording matching step payloads |
| step_events() / step_events_clear() | Inspect or reset captured step payloads |
| assert_steps_ran(names) | Assert the exact ordered list of @step names |
| assert_step_received(step, predicate?) | Assert a PreStep payload matched a closure, dict subset, or value |
| assert_step_emitted(step, predicate?) | Assert a PostStep payload matched a closure, dict subset, or value |
| assert_handoff_emitted(source, kind, target?) | Assert a run record or handoff list contains a typed handoff |
| assert_receipt_field(receipt, pointer, value) | Assert an RFC 6901 JSON Pointer field in a receipt |
| assert_golden_transcript(expected, actual) | Structured subset matcher with <ms>, <uuid>, and <any> sentinels |
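Because assert_receipt_field addresses fields with RFC 6901 JSON Pointers, it helps to see how a pointer resolves. The Python sketch below follows the RFC's rules (leading "/", "~1" for "/", "~0" for "~", decimal array indexes); it is an illustration of the pointer syntax, not Harn's implementation, and the receipt shape in the usage note is made up.

```python
def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"pointer must be empty or start with '/': {pointer!r}")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape order matters: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            # Array indexes are ASCII digits without leading zeros.
            valid = token.isascii() and token.isdigit()
            if not valid or (len(token) > 1 and token[0] == "0"):
                raise KeyError(token)
            current = current[int(token)]
        elif isinstance(current, dict):
            current = current[token]
        else:
            raise KeyError(token)
    return current
```

For a hypothetical receipt {"steps": [{"name": "plan", "status": "ok"}]}, the pointer "/steps/0/status" resolves to "ok". A key that itself contains a slash, such as "a/b", is written "/a~1b".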
Example
import { mock_host_result, assert_host_called, clear_host_mocks } from "std/testing"
pipeline default(task) {
  clear_host_mocks()
  // Mock the workspace.read_text capability
  mock_host_result("workspace", "read_text", "file contents")
  // Code under test calls host_call("workspace.read_text", ...)
  let content = host_call("workspace.read_text", {path: "test.txt"})
  log(content)
  // Verify the call was made
  assert_host_called("workspace", "read_text")
}
Scoped fixtures (with_host_mocks / with_llm_mocks / with_mocks)
Pipeline tests that mock many capabilities tend to accumulate paired manual
host_mock_clear() calls around each test. A failing assertion can skip the
trailing clear and leak mocks into the next test. Scoped fixtures manage that
lifecycle for you and clean up reliably even when the body throws.
| Helper | Description |
|---|---|
| with_host_mocks(mocks, body) | Push a fresh host-mock scope, register mocks, run body(), restore on exit |
| with_llm_mocks(mocks, body) | Same shape for LLM mocks (FIFO + match patterns) |
| with_mocks({host_mocks, llm_mocks}, body) | Combined scope for tests that exercise both surfaces |
| llm_calls() / llm_call_count() | Inspect the LLM call log captured inside the current scope |
Each entry in the host-mock list is a dict shaped like the existing
host_mock(...) config:
{capability: "runtime", operation: "pipeline_input", result: {}, params: {}}
{capability: "project", operation: "metadata_set", error: "denied"}
error (if non-nil) takes precedence over result, mirroring
mock_host_error / mock_host_result.
import { with_host_mocks, assert_host_called } from "std/testing"
pipeline test_skill_registry() {
  with_host_mocks(
    [
      {capability: "runtime", operation: "pipeline_input", result: {}},
      {capability: "project", operation: "skills", result: []},
    ],
    { ->
      let registry = skill_registry_from_host()
      assert_eq(len(registry.skills), 0, "no skills registered")
      assert_host_called("project", "skills", nil, nil)
    },
  )
}
Key properties:
- The body runs inside a fresh host-mock and host-call log; nothing inside leaks out, and nothing outside is visible inside.
- The prior state is restored before the helper returns, including when the body throws — the thrown error is re-raised after cleanup.
- Scopes nest: an inner with_host_mocks sees only its own mocks while active, then pops back to the outer scope on exit.
- with_llm_mocks follows the same shape; entries are passed straight to llm_mock(...), so any field accepted by that builtin (including match / consume_match / error) is supported.
with_mocks(config, body) is the unified form for tests that need both:
with_mocks(
  {
    host_mocks: [{capability: "ws", operation: "read", result: "ok"}],
    llm_mocks: [{text: "agreed"}],
  },
  { ->
    run_pipeline_under_test()
  },
)
LLM mocking
For testing agent loops without real LLM calls, use llm_mock():
llm_mock({text: "The answer is 42"})
let result = llm_call([
{role: "user", content: "What is the answer?"},
].join("\n"))
log(result)
This queues a canned response that the next LLM call consumes.
For end-to-end CLI runs, harn run and harn playground can preload the same mock
infrastructure from a JSONL fixture file:
{"text":"PLAN: find the middleware module first","model":"fixture-model"}
{"match":"*hello*","text":"matched","model":"fixture-model"}
{"match":"*","error":{"category":"rate_limit","message":"fake rate limit"}}
{"match":"*retry*","error":{"status":503,"kind":"transient","reason":"upstream_unavailable"}}
harn run script.harn --llm-mock fixtures.jsonl
harn playground --script pipeline.harn --llm-mock fixtures.jsonl
- A line without match is FIFO and is consumed on use.
- A line with match is checked in file order as a glob against the request transcript text.
- Add "consume_match": true when repeated matching prompts should advance through a scripted sequence instead of reusing the same line forever.
- When no fixture matches, harn run --llm-mock ... and harn playground --llm-mock ... fail with the first prompt snippet so you can add the missing case directly.
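The selection rules above can be read as a small algorithm. The Python sketch below models them to make the interaction between FIFO lines, reusable matches, and consume_match concrete. It is not Harn's implementation: it assumes entries are scanned in file order with the first usable entry winning, and it uses Python's fnmatch as a stand-in for the glob matcher, which may differ in details. The names load_fixtures, next_response, and NoFixtureMatch are hypothetical.

```python
import json
from fnmatch import fnmatchcase


class NoFixtureMatch(Exception):
    """Raised when no fixture line can serve the request."""


def load_fixtures(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def next_response(fixtures, transcript):
    """Pick the fixture for one request, removing it from the list if consumed.

    Assumption: entries are scanned in file order and the first usable one wins.
    """
    for index, entry in enumerate(fixtures):
        pattern = entry.get("match")
        if pattern is None:
            return fixtures.pop(index)  # FIFO line: always consumed on use
        if fnmatchcase(transcript, pattern):
            if entry.get("consume_match"):
                return fixtures.pop(index)  # scripted sequence advances
            return entry  # reusable match stays available
    raise NoFixtureMatch(transcript[:80])  # surface a prompt snippet
```

Under this model, two lines with the same match pattern and "consume_match": true are served one after the other, while a single line without it is reused for every matching prompt.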
To capture a replayable fixture from a run, record once and then replay the saved JSONL:
harn run script.harn --llm-mock-record fixtures.jsonl
harn run script.harn --llm-mock fixtures.jsonl
harn playground --script pipeline.harn --llm-mock-record fixtures.jsonl
harn playground --script pipeline.harn --llm-mock fixtures.jsonl
To import an external eval trace into the same fixture format:
harn trace import \
--trace-file traces/generic.jsonl \
--trace-id trace_123 \
--output fixtures/imported.jsonl
The importer expects JSONL records shaped like
{prompt, response, tool_calls} and passes through common metadata
such as model, provider, and token counts when present.
Eval kinds
harn eval supports the default replay fixture flow plus an explicit
clarifying-question kind for ambiguous tasks.
harn eval context <manifest> supports deterministic context-engineering
fixtures for pack, projection, compaction, and tool-disclosure experiments. A
manifest declares task fixtures and one or more context modes; the runner scores
each task/mode pair without model calls and writes stable local artifacts:
summary.json, per_run.jsonl, and summary.md. Use the builders in
std/context/eval when authoring manifests from Harn code, and use
spec/schemas/context-eval-report.v1.schema.json when ingesting
harn.context_eval.report.v1 reports from hosted systems or downstream UIs.
harn eval context examples/evals/context-engineering-smoke.json \
--output target/context-eval --json
harn eval scope_triage runs the opt-in pre-turn scope-classifier measurement
harness. The default mode uses a deterministic reference classifier over the
100-case synthetic dataset; pass --live --model ollama:qwen3:1.7b to exercise
the local small-model classifier. The report includes turn-cost reduction,
coverage, false-positive rate, false-negative rate, and a keep-default-off /
graduate decision.
harn eval scope_triage --output .harn-runs/scope-triage/latest
Eval packs
Portable eval packs live in harn.eval.toml or another TOML file listed in
[package].evals in harn.toml. The same pack can be run locally and imported
by hosted tooling because it contains only portable fixture references, rubrics,
judge metadata, thresholds, and package metadata.
version = 1
id = "slack-connector"
name = "Slack connector evals"
[package]
name = "slack-connector"
version = "0.1.0"
[[fixtures]]
id = "url-verification-run"
kind = "run-record"
path = "fixtures/url-verification.run.json"
[[fixtures]]
id = "url-verification-replay"
kind = "replay-fixture"
path = "fixtures/url-verification.replay.json"
[[rubrics]]
id = "webhook-normalization"
kind = "deterministic"
description = "Webhook normalization keeps status and response shape stable."
[[rubrics.assertions]]
kind = "run-status"
expected = "completed"
[[cases]]
id = "url-verification"
name = "URL verification handshake"
run = "url-verification-run"
fixture = "url-verification-replay"
rubrics = ["webhook-normalization"]
severity = "blocking"
[cases.thresholds]
max-latency-ms = 500
max-cost-usd = 0.001
Run a single pack directly:
harn eval harn.eval.toml
Run the eval packs shipped by a package:
harn test package --evals
After harn install, this also includes eval packs declared by installed
dependency packages under .harn/packages/<alias>/. Dependency eval packs are
passive until this command or a root eval_pack://... trigger references them.
[package].evals is optional when the package root contains
harn.eval.toml; otherwise declare one or more package-relative pack paths:
[package]
name = "slack-connector"
version = "0.1.0"
evals = ["evals/webhooks.toml", "evals/replay.toml"]
Fixture refs support these portable kind values:
| Kind | Local behavior |
|---|---|
| run-record or recorded-run | Loads a persisted Harn run record JSON file |
| replay-fixture | Loads a replay fixture JSON file |
| friction-events | Loads repeated-friction event fixtures and evaluates generated context-pack suggestions |
| jsonl-trace | Reserved for imported trace fixture metadata |
| provider-events | Reserved for synthetic provider event streams |
| connector-payload | Reserved for connector payload samples |
Local harn eval executes replay fixtures, baseline comparisons,
deterministic assertions, HITL question assertions, repeated-friction
context-pack suggestion assertions, and cost/latency/token/stage thresholds.
llm-judge rubrics carry judge model, calibration, tie-break, and
prompt-version metadata for hosted or explicit judge runners; a blocking
llm-judge rubric fails locally rather than being silently skipped.
Eval packs can also include persona timeout ladders. A [[ladders]]
entry runs the same persona fixture across every configured
model-routes / timeout-tiers combination, writes per-tier JSONL
transcripts, receipts, and summaries, and reports the first route/tier
that completed correctly. Degraded and looping tiers remain in the
machine-readable report so host CLIs and TUIs can render the same
result without reimplementing the matrix runner.
[[ladders]]
id = "merge-captain-green-pr"
persona = "merge_captain"
artifact-root = ".harn-runs/merge-captain-timeout-ladder"
[ladders.backend]
kind = "replay"
path = "../../examples/personas/merge_captain/transcripts/green_pr.jsonl"
[[ladders.model-routes]]
id = "gemma-value"
route = "local/gemma-value"
provider = "llama.cpp"
model = "gemma"
profile = "value"
[[ladders.timeout-tiers]]
id = "balanced"
timeout-ms = 500
max-tool-calls = 4
max-model-calls = 1
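The ladder described above is a matrix run: every route is paired with every tier, all outcomes are kept, and the first correct completion is singled out. A rough Python model of that loop follows. It is a sketch under assumptions, not the Harn runner: first_passing and run_case are hypothetical names, and visiting pairs route by route in declaration order is a guess about ordering.

```python
from itertools import product


def first_passing(routes, tiers, run_case):
    """Run every route/tier pair and return (first_pass, full_report).

    Assumption: pairs are visited route-major, in declaration order.
    """
    report = []
    first = None
    for route, tier in product(routes, tiers):
        outcome = run_case(route, tier)  # e.g. "completed", "degraded", "looping"
        report.append({"route": route, "tier": tier, "outcome": outcome})
        if first is None and outcome == "completed":
            first = (route, tier)
    return first, report
```

The full report keeps degraded and looping entries alongside the winner, which mirrors why a host UI can render the result without rerunning the matrix.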
Repeated-friction cases use friction_events = "<fixture-id-or-path>" and a
rubric assertion such as:
[[rubrics.assertions]]
kind = "context-pack-suggestion"
contains = "incident"
expected = { min_suggestions = 1, recommended_artifact = "context_pack", required_capability = "splunk.search" }
Threshold severity controls gate behavior:
| Severity | Local gate behavior |
|---|---|
| blocking | Failing case exits non-zero |
| warning | Failure is reported but does not fail the command |
| informational | Failure is reported as info only |
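To make the gate concrete, here is a small Python model of how thresholds and severity could combine into an exit code. It is a sketch under assumptions, not Harn's logic: the function names are hypothetical, a "max-x" threshold is assumed to compare against a metric named "x", a value equal to the limit is assumed to pass, and the exit code 1 stands in for "non-zero".

```python
def threshold_failures(metrics, thresholds):
    """List the thresholds a case exceeded.

    Assumption: "max-x" is compared against metric "x", and a value equal
    to the limit still passes.
    """
    failures = []
    for key, limit in thresholds.items():
        metric = key[len("max-"):] if key.startswith("max-") else key
        value = metrics.get(metric)
        if value is not None and value > limit:
            failures.append(f"{metric}={value} exceeds {key}={limit}")
    return failures


def gate_exit_code(results):
    """Return 1 when any failing case is blocking, otherwise 0."""
    blocking_failed = any(
        r["failures"] and r["severity"] == "blocking" for r in results
    )
    return 1 if blocking_failed else 0
```

A case that runs for 750 ms against max-latency-ms = 500 fails its threshold, but only changes the exit code when its severity is blocking.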
Replay evals
Replay evals are the default. They compare a run's persisted status and stage outcomes against an embedded or explicit replay fixture.
Clarifying-question evals
Clarifying-question evals assert that the agent called ask_user(...)
and asked the minimal question required to proceed. The run record
persists ask_user prompts, and the fixture can require a single
question plus term-level constraints:
{
  "_type": "replay_fixture",
  "eval_kind": "clarifying_question",
  "expected_status": "completed",
  "clarifying_question": {
    "required_terms": ["repository"],
    "forbidden_terms": ["branch"],
    "min_questions": 1,
    "max_questions": 1
  }
}
Use this when defaults would be unsafe and the right behavior is to ask the user before continuing.
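The term-level constraints in the fixture above can be pictured as a simple check over the recorded ask_user prompts. The Python sketch below is an approximation for intuition, not the evaluator Harn ships: check_clarifying_questions is a hypothetical name, and it assumes case-insensitive substring matching, that each required term must appear in at least one question, and that no question may contain a forbidden term.

```python
def check_clarifying_questions(questions, spec):
    """Return a list of failure messages; an empty list means the case passes.

    Assumptions: terms match case-insensitively as substrings, each required
    term must appear in at least one question, and no question may contain
    a forbidden term.
    """
    failures = []
    count = len(questions)
    if count < spec.get("min_questions", 0):
        failures.append(f"asked {count} question(s), below min_questions")
    max_questions = spec.get("max_questions")
    if max_questions is not None and count > max_questions:
        failures.append(f"asked {count} question(s), above max_questions")
    lowered = [question.lower() for question in questions]
    for term in spec.get("required_terms", []):
        if not any(term.lower() in question for question in lowered):
            failures.append(f"missing required term {term!r}")
    for term in spec.get("forbidden_terms", []):
        if any(term.lower() in question for question in lowered):
            failures.append(f"found forbidden term {term!r}")
    return failures
```

With the constraints from the fixture, "Which repository should I use?" passes, while "Which branch should I use?" fails on both the missing required term and the forbidden one.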
Determinism harness
Use harn test --determinism to assert that a pipeline replays the same
way on a second pass:
harn test --determinism tests/agent_loop.harn
The harness records once and replays once when no sibling
<name>.llm-mock.jsonl exists. If a sibling fixture is already
present, it replays both passes from that fixture. It compares stdout,
provider response payloads from llm_transcript.jsonl, and persisted
run-record structure to catch branching drift.
Built-in assertions
Harn provides assert, assert_eq, and assert_ne builtins for test pipelines:
assert(x > 0, "x must be positive")
assert_eq(actual, expected)
assert_ne(actual, unexpected)
assert_eq(len(items), 3)
Failed assertions throw an error with a descriptive message including the expected and actual values.
Use require for runtime invariants in normal pipelines. The linter warns if
you use assert* outside test pipelines, and it suggests assert* instead of
require inside test pipelines.
Cross-platform test coverage
Most workspace tests run on both Unix and Windows. A small set of test
modules opts out of Windows via #![cfg(unix)] because they exercise
POSIX-only semantics (bash-fixture process spawning, SIGTERM-driven
graceful shutdown). The full inventory and disposition live in
Windows test coverage, and the Windows nightly GitHub Actions workflow
runs the portable surface on windows-latest so cross-platform
regressions surface within 24 hours.