Debugging agent runs

Harn provides several tools for inspecting, replaying, and evaluating agent runs. This page walks through the debugging workflow.

Source-level debugging

For step-through debugging, start the Debug Adapter Protocol server:

cargo run --bin harn-dap

In VS Code, the Harn extension contributes a harn debug configuration automatically. The equivalent launch.json entry is:

{
  "type": "harn",
  "request": "launch",
  "name": "Debug Current Harn File",
  "program": "${file}",
  "cwd": "${workspaceFolder}"
}

This supports line breakpoints, variable inspection, stack traces, and step in / over / out against .harn files.

Host-call bridge (harnHostCall)

The debug adapter advertises supportsHarnHostCall: true in its Capabilities response. When a script calls host_call(capability, operation, params) and the VM has no built-in handler for the op, the adapter forwards it to the DAP client as a reverse request named harnHostCall — mirroring the DAP runInTerminal pattern:

{"seq": 17, "type": "request", "command": "harnHostCall",
 "arguments": {"capability": "workspace", "operation": "project_root",
               "params": {}}}

The client replies with a normal DAP response:

{"seq": 18, "type": "response", "request_seq": 17, "command": "harnHostCall",
 "success": true, "body": {"value": "/Users/x/proj"}}

On success: true, the adapter returns the body's value field (or the whole body when value is absent) to the script. On success: false, the adapter throws VmError::Thrown(message) so scripts can try / catch the failure like any other Harn exception. Clients that do not implement harnHostCall still work — the script just sees the standalone fallbacks (workspace.project_root, workspace.cwd, etc.).

LLM telemetry output events

During run / step-through, the adapter forwards every llm_call the VM makes as a DAP output event with category: "telemetry" and a JSON body:

{"category": "telemetry",
 "output": "{\"call_id\":\"…\",\"model\":\"…\",\"prompt_tokens\":…,\"completion_tokens\":…,\"cache_tokens\":…,\"total_ms\":…,\"iteration\":…}"}

IDEs can parse these to show a live LLM-call ledger alongside the debug session.

Run records

Every agent_loop() or workflow_execute() call can produce a run record — a JSON file in .harn-runs/ that captures the full execution trace including LLM calls, tool invocations, and intermediate results.

# List recent runs
ls .harn-runs/

# Inspect a run record
harn runs inspect .harn-runs/<run-id>.json

The inspect command shows a structured summary: stages executed, tools called, token usage, timing, and final output.

Comparing runs

Compare a run against a baseline to identify regressions:

harn runs inspect .harn-runs/new.json --compare .harn-runs/old.json

This highlights differences in tool calls, outputs, and token consumption.

Replay

Replay re-executes a recorded run, using the saved LLM responses instead of making live API calls. This is useful for deterministic debugging:

harn replay .harn-runs/<run-id>.json

Replay shows each stage transition and lets you verify that your pipeline produces the same results given the same LLM responses.

Visualizing a pipeline

When you want a quick structural view instead of a live debug session, render a Mermaid graph from the AST:

harn viz main.harn
harn viz main.harn --output docs/main.mmd

The generated graph is useful for reviewing branch-heavy pipelines, match arms, parallel blocks, and nested retries before you start stepping through them.

Evaluation

The harn eval command scores a run or set of runs against expected outcomes:

# Evaluate a single run
harn eval .harn-runs/<run-id>.json

# Evaluate all runs in a directory
harn eval .harn-runs/

# Evaluate using a manifest
harn eval eval-suite.json

Custom metrics

Use eval_metric() in your pipeline to record domain-specific metrics:

eval_metric("accuracy", 0.95, {dataset: "test-v2"})
eval_metric("latency_ms", 1200)

These metrics appear in run records and are aggregated by harn eval.

Token usage tracking

Track LLM costs during a run:

let usage = llm_usage()
log("Tokens used: ${usage.input_tokens + usage.output_tokens}")
log("LLM calls: ${usage.total_calls}")

Portal

The Harn portal is an interactive web UI for inspecting runs:

harn portal

This opens a dashboard showing all runs in .harn-runs/, with drill-down into individual stages, tool calls, and transcript snapshots.

Tips

  • Add eval_metric() calls to your pipelines early — they're cheap to record and invaluable for tracking quality over time.
  • Use replay for debugging non-deterministic failures: record the failing run, then replay it locally to step through the logic.
  • Compare baselines when refactoring prompts or changing tool definitions to catch regressions before they ship.