Event tape format

The event tape is the canonical artifact behind harn test-bench --emit-tape. Every non-deterministic input a script consumed during a run (clock reads, sleeps, LLM responses, FS reads/writes, subprocess spawns) lands as a typed TapeRecord with a logical sequence number and a virtual-time stamp. Tapes are diffable, content-addressed, and versioned so they survive runtime upgrades.

This document is the source of truth for the on-disk schema. Bumping TAPE_FORMAT_VERSION requires an entry here.

File layout

run.tape           # NDJSON: one header line + one record line per event
run.tape.cas/      # content-addressed sidecar (BLAKE3 hex names)

run.tape is line-delimited JSON. Each line is one of two shapes:

  • Header (always the first line):

    {
      "type": "header",
      "version": 1,
      "harn_version": "0.8.4",
      "started_at_unix_ms": 1700000000000,
      "script_path": "examples/cron-rollup.harn",
      "argv": ["--mode=daily"]
    }
    
  • Record (zero or more, after the header):

    {
      "type": "record",
      "seq": 0,
      "phase": "user_script",
      "virtual_time_ms": 1700000000000,
      "monotonic_ms": 0,
      "kind": "clock_sleep",
      "duration_ms": 250
    }
    

run.tape.cas/ holds payload bytes that are too large to inline (see CAS thresholds). The directory is optional — tapes that never spilled a payload have no sidecar.
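
As a rough illustration of the layout above, the following Python sketch splits tape text into its header and records. It is not the runtime's loader: parse_tape is a hypothetical helper, and the sample lines simply reuse the example header and record shown above.

```python
import json


def parse_tape(text):
    """Split NDJSON tape text into (header, records).

    Hypothetical helper for illustration; it only checks the two line
    shapes described in this document.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty tape: missing header line")
    header = json.loads(lines[0])
    if header.get("type") != "header":
        raise ValueError("first line must be the header")
    records = []
    for entry, ln in enumerate(lines[1:], start=1):
        obj = json.loads(ln)
        if obj.get("type") != "record":
            raise ValueError(f"entry {entry}: expected a record line")
        records.append(obj)
    return header, records


sample = "\n".join([
    json.dumps({"type": "header", "version": 1, "harn_version": "0.8.4",
                "started_at_unix_ms": 1700000000000,
                "script_path": "examples/cron-rollup.harn",
                "argv": ["--mode=daily"]}),
    json.dumps({"type": "record", "seq": 0, "phase": "user_script",
                "virtual_time_ms": 1700000000000, "monotonic_ms": 0,
                "kind": "clock_sleep", "duration_ms": 250}),
])
header, records = parse_tape(sample)
print(header["version"], len(records), records[0]["kind"])
```

Because each line is a complete JSON document, a reader like this can stream a tape line by line without loading the whole file.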

Record kinds

The current schema (version = 1) emits the kinds below. Unknown record kinds in newer tapes deserialize as unknown so older fidelity checkers still produce a structured report.

| Kind | Payload fields | Source |
|------|----------------|--------|
| clock_read | source ("wall" or "monotonic"), value_ms | now_ms() / monotonic_ms() builtins |
| clock_sleep | duration_ms | sleep(...) / advance_time(...) |
| llm_call | request_digest, response (inline or CAS) | LLM provider interception |
| file_read | path, content_hash, len_bytes | read_file(...) builtins |
| file_write | path, content_hash, len_bytes | write_file(...), append_file(...) |
| file_delete | path | remove_file(...) |
| process_spawn | program, args, cwd, exit_code, duration_ms, stdout_payload, stderr_payload | Sandboxed subprocess invocation |

Every record carries the wrapping fields:

  • seq — monotonic logical sequence number (assigned at record time).
  • phase — user_script for records produced by the script body, or runtime_finalize for records produced while the runtime drains finish/resume/finalizer lifecycle work.
  • virtual_time_ms — UNIX-epoch ms observed via the unified mock clock.
  • monotonic_ms — ms since the testbench session activated.

Two tapes that differ only in real-time stamps (e.g. CI machines with different NTP skew) diff cleanly on the logical structure.
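
The wrapping fields lend themselves to a simple sanity check. The sketch below is a hypothetical validator, not part of the harn tooling: it verifies that each record carries the four wrapping fields, that phase is one of the two documented values, and that seq increases from one record to the next. Reading "monotonic" as strictly increasing is an assumption of this sketch.

```python
REQUIRED_FIELDS = ("seq", "phase", "virtual_time_ms", "monotonic_ms")
PHASES = {"user_script", "runtime_finalize"}


def check_envelope(records):
    """Return human-readable problems found in the wrapping fields."""
    problems = []
    prev_seq = None
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            problems.append(f"record {i}: missing {', '.join(missing)}")
            continue
        if rec["phase"] not in PHASES:
            problems.append(f"record {i}: unexpected phase {rec['phase']!r}")
        # Assumption: "monotonic" means strictly increasing.
        if prev_seq is not None and rec["seq"] <= prev_seq:
            problems.append(
                f"record {i}: seq {rec['seq']} does not increase past {prev_seq}")
        prev_seq = rec["seq"]
    return problems


good = [
    {"seq": 0, "phase": "user_script",
     "virtual_time_ms": 1700000000000, "monotonic_ms": 0},
    {"seq": 1, "phase": "runtime_finalize",
     "virtual_time_ms": 1700000000250, "monotonic_ms": 250},
]
bad = [
    {"seq": 1, "phase": "user_script",
     "virtual_time_ms": 1700000000000, "monotonic_ms": 0},
    {"seq": 1, "phase": "setup",
     "virtual_time_ms": 1700000000000, "monotonic_ms": 0},
    {"seq": 2, "phase": "user_script"},
]
for problem in check_envelope(bad):
    print(problem)
```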

Content-addressed storage

Records whose serialized payload exceeds 4 KiB (the MAX_INLINE_BYTES threshold) spill to the sidecar. The inline JSON carries:

{ "content_hash": "<blake3-hex>", "len_bytes": 12345 }

…and the bytes themselves live at run.tape.cas/<blake3-hex>. Reusing the same payload across records — e.g. an idempotent LLM response served to two callers — stores it once.

Smaller payloads stay inline:

{ "content_hash": "<blake3-hex>", "text": "...stdout..." }

content_hash is a hex BLAKE3 digest of the raw bytes. The fidelity oracle compares hashes only; it never re-hashes payloads at compare time.
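
The spill rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the recorder's code: store_payload and digest_hex are hypothetical names, the threshold is applied to the raw payload bytes rather than the serialized form, and BLAKE2b from the standard library stands in for BLAKE3, which Python does not ship.

```python
import hashlib
import os
import tempfile

MAX_INLINE_BYTES = 4 * 1024  # the 4 KiB threshold named above


def digest_hex(data: bytes) -> str:
    # Stand-in only: the tape format uses BLAKE3. BLAKE2b keeps this
    # sketch runnable with the standard library alone.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def store_payload(data: bytes, cas_dir: str) -> dict:
    """Return the JSON fragment for a payload, spilling large ones to the sidecar."""
    content_hash = digest_hex(data)
    if len(data) <= MAX_INLINE_BYTES:
        return {"content_hash": content_hash,
                "text": data.decode("utf-8", errors="replace")}
    os.makedirs(cas_dir, exist_ok=True)
    path = os.path.join(cas_dir, content_hash)
    if not os.path.exists(path):  # identical payloads are stored once
        with open(path, "wb") as f:
            f.write(data)
    return {"content_hash": content_hash, "len_bytes": len(data)}


with tempfile.TemporaryDirectory() as tmp:
    cas = os.path.join(tmp, "run.tape.cas")
    small = store_payload(b"ok\n", cas)
    big_a = store_payload(b"x" * 5000, cas)
    big_b = store_payload(b"x" * 5000, cas)
    stored = sorted(os.listdir(cas))

print(small["text"] == "ok\n", big_a == big_b, len(stored))
```

Naming each sidecar file after its digest is what makes deduplication free: the second write of the same payload finds the file already present.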

MCP client calls appear as mcp_json_rpc records. Each record carries the server name, method, request/response digests, latency, and redacted request/response payloads. This lets a single unified tape catch MCP schema or behavior drift alongside LLM, subprocess, filesystem, and clock drift.

Versioning contract

The header's version integer gates compatibility:

  • Loaders accept tapes with version <= TAPE_FORMAT_VERSION and refuse newer tapes with a structured error so a downgrade doesn't silently drop records.
  • Adding a record kind is non-breaking: older fidelity checkers see the unknown kind as TapeRecordKind::Unknown and emit a divergence with category unknown_kind.
  • Renaming or repurposing an existing field is breaking and requires a version bump.
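
A loader that follows the first two rules might look like the Python sketch below. The names UnsupportedTapeVersion, check_version, classify_kind, and unknown_kind_divergences are hypothetical, as is the shape of the divergence entries; only the version comparison, the unknown fallback, and the unknown_kind category come from this document.

```python
TAPE_FORMAT_VERSION = 1

# Kinds named in this document for version 1, including mcp_json_rpc.
KNOWN_KINDS = {
    "clock_read", "clock_sleep", "llm_call", "file_read",
    "file_write", "file_delete", "process_spawn", "mcp_json_rpc",
}


class UnsupportedTapeVersion(Exception):
    """Structured error raised for tapes newer than this loader."""

    def __init__(self, found, supported):
        super().__init__(
            f"tape version {found} is newer than supported version {supported}")
        self.found = found
        self.supported = supported


def check_version(header):
    version = header["version"]
    if version > TAPE_FORMAT_VERSION:
        raise UnsupportedTapeVersion(version, TAPE_FORMAT_VERSION)
    return version


def classify_kind(record):
    kind = record.get("kind")
    return kind if kind in KNOWN_KINDS else "unknown"


def unknown_kind_divergences(records):
    # One divergence per record whose kind this loader does not know.
    return [{"seq": r.get("seq"), "category": "unknown_kind"}
            for r in records if classify_kind(r) == "unknown"]


records = [
    {"seq": 0, "kind": "clock_sleep", "duration_ms": 250},
    {"seq": 1, "kind": "http_request"},  # hypothetical future kind
]
print(check_version({"version": 1}), unknown_kind_divergences(records))
```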

When you bump the version, add a "Changes from v<previous>" section below.

Fidelity oracle

harn test-bench fidelity compares two tapes under one of four modes:

  • byte-identical (default, strictest). Every record matches by position, kind, content hash, and timing. This is the mode CI uses to gate "this PR did not regress replay determinism."
  • semantic. Ignores diffs that are non-meaningful by construction: monotonic-only sequence stamps, pure virtual-time drift, and recorded monotonic_ms deltas. Content hashes still gate every payload.
  • outcome (loosest). Compares only the script's externally observable result: the final FS write set, the exit status of the last subprocess, and the count of LLM calls. Useful for stochastic LLM runs where intermediate token streams legitimately diverge.
  • phase-aware. Compares user_script records byte-identically and runtime_finalize records semantically. Runtime-finalize clock reads are ignored so internal lifecycle observability can grow without regenerating user-script fidelity fixtures; runtime finalization file/process/LLM effects still participate in the semantic diff.

The CLI emits a structured JSON report (a FidelityReport) listing every diverging record with a stable category tag. CI pipelines gate on divergences == []; the public leaderboard (harn-cloud#19) ingests the score directly.
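
To make the outcome mode concrete, here is a minimal Python sketch of the comparison it describes. It is not the oracle's implementation: outcome_summary and outcome_equal are hypothetical names, and treating the "final FS write set" as the last content hash written to each path, minus any paths deleted afterwards, is one possible reading.

```python
def outcome_summary(records):
    """Reduce a tape to the three observables the outcome mode compares."""
    files = {}          # path -> content_hash of the last write
    last_exit = None    # exit status of the last subprocess, if any
    llm_calls = 0
    for rec in records:
        kind = rec.get("kind")
        if kind == "file_write":
            files[rec["path"]] = rec["content_hash"]
        elif kind == "file_delete":
            # Assumption: a deleted path drops out of the final write set.
            files.pop(rec["path"], None)
        elif kind == "process_spawn":
            last_exit = rec["exit_code"]
        elif kind == "llm_call":
            llm_calls += 1
    return {"files": files, "last_exit_code": last_exit, "llm_calls": llm_calls}


def outcome_equal(tape_a, tape_b):
    return outcome_summary(tape_a) == outcome_summary(tape_b)


recorded = [
    {"kind": "llm_call", "request_digest": "aa"},
    {"kind": "file_write", "path": "out.txt", "content_hash": "h1"},
    {"kind": "process_spawn", "program": "true", "exit_code": 0},
]
replay = [
    {"kind": "clock_read", "source": "wall", "value_ms": 1700000000000},
    {"kind": "llm_call", "request_digest": "bb"},  # different request digest
    {"kind": "file_write", "path": "out.txt", "content_hash": "h1"},
    {"kind": "process_spawn", "program": "true", "exit_code": 0},
]
print(outcome_equal(recorded, replay))
```

The two sample tapes differ record by record, so they would fail a byte-identical comparison, yet they agree on all three observables.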

Example

Diff two recorded tapes:

harn test-bench fidelity recorded.tape replay.tape --mode byte-identical

Re-run a script under testbench replay and compare against the recorded tape:

harn test-bench fidelity script.harn --against recorded.tape \
    --mode semantic --report fidelity.json

Both forms exit non-zero (status 2) when the report has any divergences. CI gates can therefore rely on the exit code without parsing JSON.

Producing a tape

harn test-bench run script.harn \
    --clock paused --start-at 1767225600000 \
    --emit-tape run.tape

The tape is a byproduct of a normal testbench run; the script executes unmodified, and the recorder pushes a record at every host-capability boundary it crosses. When the run finishes, the tape (and its CAS sidecar) lands at the requested path.

Out of scope (v1)

The first version intentionally ships a small but principled set of record kinds. The following are tracked separately:

  • HTTP request/response capture (independent of LLM calls). The necessary capture point is the egress allowlist hook, but plumbing through every connector is its own ticket.
  • Tape compaction / GC. The current encoding stores every record; pruning can come once tapes get large in practice.
  • Cross-runtime tape exchange (Inngest/Temporal interop). Lives in the harn-cloud#19 harness, not this crate.

See also

  • Testbench mode — the composition primitive that installs the tape recorder.
  • Annotation tape format — sidecar that attaches structured human judgment to tape events.
  • Testing — approved deterministic test patterns.
  • Issue #1441 — design rationale for the unified tape and fidelity oracle.