Annotation tape format
An annotation file (<tape>.annotations.jsonl) is the durable form of
human judgment over a recorded testbench run. It pairs with a
unified event tape and lets humans (or agents
proxying for humans) attach structured correct / incorrect /
hypothesis / friction / crystallize_here records to specific
events. The same JSONL feeds the eval rubric, friction roll-up, and
crystallization candidate detector — no second labeling pipeline.
This is the substrate behind issue
#1474. The full
runtime lives at harn_vm::testbench::annotations.
File layout
run.tape # the unified event tape
run.tape.annotations.jsonl # annotation sidecar (this format)
run.tape.cas/ # tape's own content-addressed sidecar
The annotations file is line-delimited JSON. Empty lines and lines
starting with # are tolerated so authoring tools can group records
visually.
The first line is always a header:
{
"type": "header",
"schema_version": 1,
"tape_path": "run.tape",
"tape_content_hash": "<blake3>",
"harn_version": "0.8.6"
}
tape_content_hash is optional but recommended — when present, the
validator catches tape edits that invalidate event_id references.
Subsequent lines are annotations:
{
"type": "annotation",
"id": "ann_001",
"event_id": 42,
"kind": "hypothesis",
"evidence": "checkout incident — see runbook",
"author": {"id": "alice", "kind": "human", "surface": "burin-code"},
"timestamp": "2026-05-10T17:00:00Z",
"hypothesis_status": "active",
"span": {"start_event_id": 42, "end_event_id": 50},
"links": [{"label": "runbook", "url": "..."}],
"metadata": {}
}
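The layout rules above (header first, blank lines and `#` lines tolerated) can be sketched as a small reader. This is an illustrative Python sketch, not the loader in harn_vm::testbench::annotations. The function name `load_annotation_lines` is hypothetical, and it assumes each record occupies a single physical line on disk, even though the examples above are pretty-printed for readability.

```python
import json


def load_annotation_lines(text):
    """Parse annotation JSONL text into (header, annotations).

    Assumption: one JSON object per physical line.
    """
    records = []
    for raw in text.splitlines():
        # Blank lines and lines starting with '#' are tolerated by the format.
        if not raw.strip() or raw.startswith("#"):
            continue
        records.append(json.loads(raw))
    # The first record must be the header.
    if not records or records[0].get("type") != "header":
        raise ValueError("first record must be a header")
    annotations = [r for r in records[1:] if r.get("type") == "annotation"]
    return records[0], annotations


sample = "\n".join([
    '{"type": "header", "schema_version": 1, "tape_path": "run.tape"}',
    "",
    "# checkout incident",
    '{"type": "annotation", "id": "ann_001", "event_id": 42, "kind": "note"}',
])
header, annotations = load_annotation_lines(sample)
print(header["schema_version"], [a["id"] for a in annotations])  # 1 ['ann_001']
```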
Annotation kinds
| Kind | Required extras | Consumer |
|---|---|---|
| correct | none | persona eval rubric ground truth |
| incorrect | none | persona eval rubric ground truth |
| alternative | suggested_fix (recommended) | replay-for-teaching, presenter mode |
| note | none | free-text commentary |
| marker | span (often) | replay-for-teaching anchor |
| mute | none | suppress a flake on dashboards |
| hypothesis | hypothesis_status (active, verifying, confirmed, disproven, or stale) | human-prior-to-verify loop (harn-cloud#54) |
| friction | friction_kind (must match the friction taxonomy) | friction roll-up (harn#452), context-pack candidates |
| crystallize_here | span (recommended) | crystallization candidate detector (harn#451) |
Records carry a stable id (unique within the file), an event_id
matching a TapeRecord::seq, an optional span, an
author, and an evidence string. New kinds can be added without
breaking older readers — unknown kinds deserialize as
AnnotationKind::Unknown so a validator can still report on the rest
and the file loader does not refuse to open it.
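The forward-compatibility rule for kinds can be illustrated with a tolerant parser. The sketch below is in Python rather than the Rust runtime, so the enum and the `parse_kind` helper are hypothetical stand-ins for the behaviour described above, not the actual AnnotationKind definition.

```python
from enum import Enum


class AnnotationKind(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    ALTERNATIVE = "alternative"
    NOTE = "note"
    MARKER = "marker"
    MUTE = "mute"
    HYPOTHESIS = "hypothesis"
    FRICTION = "friction"
    CRYSTALLIZE_HERE = "crystallize_here"
    UNKNOWN = "unknown"  # catch-all for kinds added by newer writers


def parse_kind(raw):
    """Map a raw kind string to AnnotationKind, never raising on new kinds."""
    try:
        return AnnotationKind(raw)
    except ValueError:
        return AnnotationKind.UNKNOWN


print(parse_kind("friction"))     # AnnotationKind.FRICTION
print(parse_kind("future_kind"))  # AnnotationKind.UNKNOWN
```

Because the fallback happens per record, a reader can keep processing the records it does understand.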
Friction-kind taxonomy
friction_kind values must match the existing friction-event taxonomy
in harn_vm::orchestration::friction so a bag of friction
annotations and a bag of natively-emitted FrictionEvents are
interchangeable for downstream consumers:
- repeated_query
- repeated_clarification
- approval_stall
- missing_context
- manual_handoff
- tool_gap
- failed_assumption
- expensive_model_used_for_deterministic_step
- human_hypothesis
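A membership check against this taxonomy might look like the following Python sketch. The `check_friction_kind` helper is hypothetical. Treating a missing friction_kind the same way as an unrecognised one is an assumption, not something this document specifies.

```python
FRICTION_KINDS = frozenset({
    "repeated_query",
    "repeated_clarification",
    "approval_stall",
    "missing_context",
    "manual_handoff",
    "tool_gap",
    "failed_assumption",
    "expensive_model_used_for_deterministic_step",
    "human_hypothesis",
})


def check_friction_kind(annotation):
    """Return 'friction_kind_unknown' for a bad friction annotation, else None."""
    if annotation.get("kind") != "friction":
        return None  # the taxonomy only constrains friction annotations
    # Assumption: a missing friction_kind is reported like an unknown one.
    if annotation.get("friction_kind") not in FRICTION_KINDS:
        return "friction_kind_unknown"
    return None


print(check_friction_kind({"kind": "friction", "friction_kind": "tool_gap"}))  # None
print(check_friction_kind({"kind": "friction", "friction_kind": "typo_gap"}))  # friction_kind_unknown
```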
CLI surface
Surface annotations during replay
harn test-bench replay script.harn \
--process-tape run.process.json \
--annotations run.tape.annotations.jsonl
The runner loads and pre-validates the annotations sidecar, then prints
each annotation grouped by its referenced event in the run-summary
block. The exit code is 2 (CI-gateable) if validation surfaces any
problems.
Validate a sidecar against its tape
harn test-bench validate-annotations \
--tape run.tape \
--report validation.json \
run.tape.annotations.jsonl
The report enumerates unknown_event_id, hypothesis_status_missing,
friction_kind_unknown, invalid_span, duplicate_id, and
tape_digest_mismatch problems with stable code tags so CI can gate
on specific failure classes. The command exits non-zero (status 2)
when problems exist.
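To make the problem codes concrete, here is a rough Python sketch of four of the checks. It is not the validator's implementation. The `validate` function, the shape of each problem entry, and the reading of invalid_span as "start after end" are all assumptions made for illustration.

```python
def validate(annotations, tape_event_ids):
    """Return a list of {'code', 'id'} problems (hypothetical report shape)."""
    problems = []
    seen_ids = set()
    for ann in annotations:
        ann_id = ann.get("id")
        # Ids must be unique within the file.
        if ann_id in seen_ids:
            problems.append({"code": "duplicate_id", "id": ann_id})
        seen_ids.add(ann_id)
        # event_id must reference an event that exists on the tape.
        if ann.get("event_id") not in tape_event_ids:
            problems.append({"code": "unknown_event_id", "id": ann_id})
        # Hypothesis annotations need a status.
        if ann.get("kind") == "hypothesis" and not ann.get("hypothesis_status"):
            problems.append({"code": "hypothesis_status_missing", "id": ann_id})
        # Assumption: a span is invalid when it starts after it ends.
        span = ann.get("span")
        if span and span["start_event_id"] > span["end_event_id"]:
            problems.append({"code": "invalid_span", "id": ann_id})
    return problems


annotations = [
    {"id": "a1", "event_id": 1, "kind": "note"},
    {"id": "a1", "event_id": 99, "kind": "hypothesis"},
    {"id": "a2", "event_id": 2, "kind": "marker",
     "span": {"start_event_id": 5, "end_event_id": 2}},
]
problems = validate(annotations, tape_event_ids={1, 2, 3})
exit_code = 2 if problems else 0  # mirrors the CLI's status-2 convention
print([p["code"] for p in problems], exit_code)
```

Keeping the code tags as plain strings is what lets a CI job filter on a specific failure class rather than on the overall exit status.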
Export annotations by kind
# Re-emit hypothesis annotations as JSONL for the human-prior pipeline.
harn test-bench export-annotations run.tape.annotations.jsonl \
--kind hypothesis --format jsonl
# Re-emit friction annotations as `FrictionEvent`s ready to feed
# `orchestration::generate_context_pack_suggestions`.
harn test-bench export-annotations run.tape.annotations.jsonl \
--kind friction --format friction
--kind is repeatable and the union of matching annotations is
written. --format friction only emits records that successfully
adapt to a FrictionEvent (i.e. kind == friction with a recognised
friction_kind). --format jsonl (the default) emits the raw
annotation rows so the file is bundle-compatible with the v0.8.4
portable workflow bundle pattern.
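The union semantics of a repeated --kind can be sketched in a few lines of Python. `export_jsonl` is a hypothetical helper that mimics only the jsonl format. It does not reproduce the CLI's exact output bytes or the FrictionEvent adaptation.

```python
import json


def export_jsonl(annotations, kinds):
    """Return JSONL text for annotations whose kind is in any requested kind."""
    wanted = set(kinds)  # repeated --kind flags form a union
    rows = [a for a in annotations if a.get("kind") in wanted]
    return "\n".join(json.dumps(row) for row in rows)


annotations = [
    {"type": "annotation", "id": "ann_001", "kind": "hypothesis"},
    {"type": "annotation", "id": "ann_002", "kind": "note"},
    {"type": "annotation", "id": "ann_003", "kind": "friction"},
]
print(export_jsonl(annotations, ["hypothesis", "friction"]))
```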
Round-trip + bundling
Annotations survive serialize/deserialize byte-identically — the
AnnotationLine enum is the only producer of on-disk shapes, and
optional fields default to None so missing-field round-trips are
stable. Bundling a tape with its annotations is two files plus the
optional CAS sidecar:
run.tape
run.tape.cas/<blake3>...
run.tape.annotations.jsonl
The conformance suite exercises this round-trip via
conformance/tests/testbench/testbench_replay_fidelity.annotations.jsonl,
which the runner validates against the emitted tape after the fidelity
check.
Versioning contract
The header's schema_version integer gates compatibility:
- Loaders accept files with schema_version <= ANNOTATION_SCHEMA_VERSION and refuse newer files with a structured error.
- Adding a kind is non-breaking: older readers see it as AnnotationKind::Unknown and the validator emits a single unknown_kind problem per affected record.
- Renaming or repurposing a field is breaking and requires a version bump.
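The version gate in the first rule can be sketched as follows. This is a Python illustration only: the exception class and its fields are hypothetical, and the real loader's structured error may carry different information.

```python
ANNOTATION_SCHEMA_VERSION = 1  # highest version this reader understands


class UnsupportedSchemaVersion(Exception):
    """Structured error: carries both versions so callers can report them."""

    def __init__(self, found, supported):
        super().__init__(f"schema_version {found} is newer than supported {supported}")
        self.found = found
        self.supported = supported


def check_schema_version(header):
    """Accept headers at or below the supported version; refuse newer ones."""
    version = header["schema_version"]
    if version > ANNOTATION_SCHEMA_VERSION:
        raise UnsupportedSchemaVersion(version, ANNOTATION_SCHEMA_VERSION)
    return version


check_schema_version({"type": "header", "schema_version": 1})  # accepted
try:
    check_schema_version({"type": "header", "schema_version": 2})
except UnsupportedSchemaVersion as err:
    print(err.found, err.supported)  # 2 1
```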
When you bump the version, add a "Changes from v<previous>" section below.
Cross-references
- Tape format — the artifact this sidecar references.
- Testbench mode — how recording is activated.
- Issue #1474 — design rationale and acceptance criteria.
- Issue #451 / #452 — downstream consumers (crystallization, friction events).
- burin-code#599 — the IDE authoring surface that produces these files.