Annotation tape format

An annotation file (<tape>.annotations.jsonl) is the durable form of human judgment over a recorded testbench run. It pairs with a unified event tape and lets humans (or agents proxying for humans) attach structured correct / incorrect / hypothesis / friction / crystallize_here records to specific events. The same JSONL feeds the eval rubric, friction roll-up, and crystallization candidate detector — no second labeling pipeline.

This is the substrate behind issue #1474. The full runtime lives at harn_vm::testbench::annotations.

File layout

run.tape                    # the unified event tape
run.tape.annotations.jsonl  # annotation sidecar (this format)
run.tape.cas/               # tape's own content-addressed sidecar

The annotations file is line-delimited JSON. Empty lines and lines starting with # are tolerated so authoring tools can group records visually.

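The first line is always a header. Records are pretty-printed in this document for readability; on disk each one occupies a single line: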

{
  "type": "header",
  "schema_version": 1,
  "tape_path": "run.tape",
  "tape_content_hash": "<blake3>",
  "harn_version": "0.8.6"
}

tape_content_hash is optional but recommended: when present, the validator can detect tape edits that would invalidate event_id references (reported as tape_digest_mismatch).

Subsequent lines are annotations:

{
  "type": "annotation",
  "id": "ann_001",
  "event_id": 42,
  "kind": "hypothesis",
  "evidence": "checkout incident — see runbook",
  "author": {"id": "alice", "kind": "human", "surface": "burin-code"},
  "timestamp": "2026-05-10T17:00:00Z",
  "hypothesis_status": "active",
  "span": {"start_event_id": 42, "end_event_id": 50},
  "links": [{"label": "runbook", "url": "..."}],
  "metadata": {}
}
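
The layout rules above (one JSON record per line, blank and # lines tolerated, header first) can be sketched as a small loader. This is an illustrative Python sketch, not the harn_vm::testbench::annotations implementation; the function name load_annotation_lines is hypothetical, and it assumes "first line" means the first record once blank and comment lines are skipped.

```python
import json


def load_annotation_lines(text):
    """Parse annotation-sidecar text into (header, annotations).

    Blank lines and lines starting with '#' are skipped. The first
    remaining record must be a header (assumption: comments may precede it).
    """
    records = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if raw.strip() == "" or raw.startswith("#"):
            continue
        try:
            records.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON: {exc}") from exc
    first = records[0] if records else None
    if not isinstance(first, dict) or first.get("type") != "header":
        raise ValueError("first record must be a header")
    return first, records[1:]


sample = (
    '{"type": "header", "schema_version": 1, "tape_path": "run.tape"}\n'
    "\n"
    "# checkout incident\n"
    '{"type": "annotation", "id": "ann_001", "event_id": 42, "kind": "note"}\n'
)
header, annotations = load_annotation_lines(sample)
print(header["schema_version"], [a["id"] for a in annotations])
```

Skipping comment lines before parsing keeps authoring tools free to group records visually without the loader needing to understand the grouping.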

Annotation kinds

| Kind | Required extras | Consumer |
| --- | --- | --- |
| correct | (none) | persona eval rubric ground truth |
| incorrect | (none) | persona eval rubric ground truth |
| alternative | suggested_fix (recommended) | replay-for-teaching, presenter mode |
| note | (none) | free-text commentary |
| marker | span (often) | replay-for-teaching anchor |
| mute | (none) | suppress a flake on dashboards |
| hypothesis | hypothesis_status (active, verifying, confirmed, disproven, or stale) | human-prior-to-verify loop (harn-cloud#54) |
| friction | friction_kind (must match the friction taxonomy) | friction roll-up (harn#452), context-pack candidates |
| crystallize_here | span (recommended) | crystallization candidate detector (harn#451) |

Records carry a stable id (unique within the file), an event_id matching a TapeRecord::seq, an optional span, an author, and an evidence string. New kinds can be added without breaking older readers — unknown kinds deserialize as AnnotationKind::Unknown so a validator can still report on the rest and the file loader does not refuse to open it.
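
The forward-compatibility rule above can be illustrated with a reader that maps unrecognised kinds to a catch-all instead of failing. This Python sketch only mirrors the behaviour described for AnnotationKind::Unknown; the names KNOWN_KINDS, classify_kind, and partition_by_known_kind are hypothetical.

```python
# Kinds listed in the table above; anything else is treated as "unknown".
KNOWN_KINDS = frozenset({
    "correct", "incorrect", "alternative", "note", "marker",
    "mute", "hypothesis", "friction", "crystallize_here",
})


def classify_kind(annotation):
    """Return the annotation's kind, or 'unknown' for kinds this reader predates."""
    kind = annotation.get("kind")
    return kind if kind in KNOWN_KINDS else "unknown"


def partition_by_known_kind(annotations):
    """Split annotations into (known, unknown) without rejecting the file."""
    known, unknown = [], []
    for ann in annotations:
        bucket = unknown if classify_kind(ann) == "unknown" else known
        bucket.append(ann)
    return known, unknown
```

Because unknown records are set aside rather than raised as errors, the remaining records can still be validated and reported on.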

Friction-kind taxonomy

friction_kind values must match the existing friction-event taxonomy in harn_vm::orchestration::friction so a bag of friction annotations and a bag of natively-emitted FrictionEvents are interchangeable for downstream consumers:

  • repeated_query
  • repeated_clarification
  • approval_stall
  • missing_context
  • manual_handoff
  • tool_gap
  • failed_assumption
  • expensive_model_used_for_deterministic_step
  • human_hypothesis
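
A check against the taxonomy above might look like the following Python sketch. The set is copied from the list in this section; the function name unknown_friction_kinds is hypothetical, and treating a missing friction_kind the same as an unrecognised one is an assumption.

```python
# Values copied from the friction-kind taxonomy listed above.
FRICTION_KINDS = frozenset({
    "repeated_query",
    "repeated_clarification",
    "approval_stall",
    "missing_context",
    "manual_handoff",
    "tool_gap",
    "failed_assumption",
    "expensive_model_used_for_deterministic_step",
    "human_hypothesis",
})


def unknown_friction_kinds(annotations):
    """Return ids of friction annotations whose friction_kind is outside the taxonomy.

    Assumption: a friction annotation with no friction_kind at all is
    reported too, since it cannot be matched to a taxonomy entry.
    """
    return [
        ann.get("id")
        for ann in annotations
        if ann.get("kind") == "friction"
        and ann.get("friction_kind") not in FRICTION_KINDS
    ]
```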

CLI surface

Surface annotations during replay

harn test-bench replay script.harn \
    --process-tape run.process.json \
    --annotations run.tape.annotations.jsonl

The runner loads and pre-validates the annotations sidecar, then prints each annotation in the run-summary block, grouped by the event it references. If validation surfaces any problems, the exit code is 2, so CI can gate on it.

Validate a sidecar against its tape

harn test-bench validate-annotations \
    --tape run.tape \
    --report validation.json \
    run.tape.annotations.jsonl

The report enumerates unknown_event_id, hypothesis_status_missing, friction_kind_unknown, invalid_span, duplicate_id, and tape_digest_mismatch problems with stable code tags so CI can gate on specific failure classes. The command exits non-zero (status 2) when problems exist.
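
To make the problem codes concrete, here is a minimal Python sketch of a validator that emits four of them. It is not the harn implementation: the report shape (a list of dicts with code and annotation_id), the function names, and the rule that a span is invalid when its endpoints are missing or start_event_id exceeds end_event_id are all assumptions made for illustration.

```python
def validate_annotations(annotations, tape_event_ids):
    """Return a list of {"code", "annotation_id"} problems (assumed report shape)."""
    problems = []
    seen_ids = set()

    def report(code, ann_id):
        problems.append({"code": code, "annotation_id": ann_id})

    for ann in annotations:
        ann_id = ann.get("id")
        # ids must be unique within the file.
        if ann_id in seen_ids:
            report("duplicate_id", ann_id)
        seen_ids.add(ann_id)
        # event_id must reference an event present on the tape.
        if ann.get("event_id") not in tape_event_ids:
            report("unknown_event_id", ann_id)
        # hypothesis annotations require a hypothesis_status.
        if ann.get("kind") == "hypothesis" and not ann.get("hypothesis_status"):
            report("hypothesis_status_missing", ann_id)
        # Assumed span rule: both endpoints are integers and start <= end.
        span = ann.get("span")
        if span is not None:
            start = span.get("start_event_id")
            end = span.get("end_event_id")
            if not isinstance(start, int) or not isinstance(end, int) or start > end:
                report("invalid_span", ann_id)
    return problems


def exit_status(problems):
    """Mirror the documented exit code: 2 when any problem exists, else 0."""
    return 2 if problems else 0
```

Keeping the code tags as plain strings lets a CI step filter the report for the specific failure classes it wants to gate on.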

Export annotations by kind

# Re-emit hypothesis annotations as JSONL for the human-prior pipeline.
harn test-bench export-annotations run.tape.annotations.jsonl \
    --kind hypothesis --format jsonl

# Re-emit friction annotations as `FrictionEvent`s ready to feed
# `orchestration::generate_context_pack_suggestions`.
harn test-bench export-annotations run.tape.annotations.jsonl \
    --kind friction --format friction

--kind is repeatable; the output is the union of annotations matching any of the given kinds. --format friction emits only records that successfully adapt to a FrictionEvent (that is, kind == friction with a recognised friction_kind). --format jsonl (the default) emits the raw annotation rows, so the output is bundle-compatible with the v0.8.4 portable workflow bundle pattern.
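
The selection rules for --kind and --format friction can be sketched in Python as two filters. This shows only which records would be selected, not how a FrictionEvent is built; both function names are hypothetical, and the set of recognised friction kinds is passed in rather than assumed.

```python
def select_by_kind(annotations, kinds):
    """Union semantics for a repeatable --kind: keep records matching any kind."""
    wanted = set(kinds)
    return [ann for ann in annotations if ann.get("kind") in wanted]


def select_for_friction_format(annotations, kinds, recognised_friction_kinds):
    """Keep only selected records that could adapt to a friction event:
    kind == "friction" with a recognised friction_kind."""
    return [
        ann
        for ann in select_by_kind(annotations, kinds)
        if ann.get("kind") == "friction"
        and ann.get("friction_kind") in recognised_friction_kinds
    ]
```

File order is preserved in both filters, so exported rows appear in the same sequence as in the sidecar.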

Round-trip + bundling

Annotations round-trip through serialization and deserialization byte-identically: the AnnotationLine enum is the only producer of on-disk shapes, and optional fields default to None, so records with missing fields round-trip stably. Bundling a tape with its annotations takes two files plus the optional CAS sidecar:

run.tape
run.tape.cas/<blake3>...
run.tape.annotations.jsonl

The conformance suite exercises this round-trip via conformance/tests/testbench/testbench_replay_fidelity.annotations.jsonl, which the runner validates against the emitted tape after the fidelity check.

Versioning contract

The header's schema_version integer gates compatibility:

  • Loaders accept files with schema_version <= ANNOTATION_SCHEMA_VERSION and refuse newer files with a structured error.
  • Adding a kind is non-breaking: older readers see it as AnnotationKind::Unknown and the validator emits a single unknown_kind problem per affected record.
  • Renaming or repurposing a field is breaking and requires a version bump.
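
The first rule above, accepting older or equal versions and refusing newer ones with a structured error, might be sketched in Python as follows. The value 1 for ANNOTATION_SCHEMA_VERSION is taken from the header example in this document; the exception class and function name are hypothetical.

```python
# Assumed to be 1, matching the schema_version shown in the header example.
ANNOTATION_SCHEMA_VERSION = 1


class UnsupportedSchemaVersion(Exception):
    """Structured error carrying the found and supported versions."""

    def __init__(self, found, supported):
        super().__init__(
            f"annotation schema_version {found} is newer than supported version {supported}"
        )
        self.found = found
        self.supported = supported


def check_schema_version(header, supported=ANNOTATION_SCHEMA_VERSION):
    """Accept schema_version <= supported; refuse newer files."""
    found = header.get("schema_version")
    if not isinstance(found, int):
        raise ValueError("header.schema_version must be an integer")
    if found > supported:
        raise UnsupportedSchemaVersion(found, supported)
    return found
```

Exposing found and supported as attributes lets a caller report the mismatch without parsing the error message.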

When you bump the version, add a "Changes from v<previous>" section below.

Cross-references

  • Tape format — the artifact this sidecar references.
  • Testbench mode — how recording is activated.
  • Issue #1474 — design rationale and acceptance criteria.
  • Issue #451 / #452 — downstream consumers (crystallization, friction events).
  • burin-code#599 — the IDE authoring surface that produces these files.