Debugging Agent Runs

Harn provides several tools for inspecting, replaying, and evaluating agent runs. This page walks through the debugging workflow.

Source-level debugging

For step-through debugging, start the Debug Adapter Protocol server:

cargo run --bin harn-dap

In VS Code, the Harn extension contributes a harn debug configuration automatically. The equivalent launch.json entry is:

{
  "type": "harn",
  "request": "launch",
  "name": "Debug Current Harn File",
  "program": "${file}",
  "cwd": "${workspaceFolder}"
}

This supports line breakpoints, variable inspection, stack traces, and step in / over / out against .harn files.

Host-call bridge (harnHostCall)

The debug adapter advertises supportsHarnHostCall: true in its Capabilities response. When a script calls host_call(capability, operation, params) and the VM has no built-in handler for the operation, the adapter forwards it to the DAP client as a reverse request named harnHostCall, mirroring the DAP runInTerminal pattern:

{"seq": 17, "type": "request", "command": "harnHostCall",
 "arguments": {"capability": "workspace", "operation": "project_root",
               "params": {}}}

The client replies with a normal DAP response:

{"seq": 18, "type": "response", "request_seq": 17, "command": "harnHostCall",
 "success": true, "body": {"value": "/Users/x/proj"}}

On success: true, the adapter returns the body’s value field (or the whole body when value is absent) to the script. On success: false, the adapter throws VmError::Thrown(message) so scripts can try / catch the failure like any other Harn exception. Clients that do not implement harnHostCall still work — the script just sees the standalone fallbacks (workspace.project_root, workspace.cwd, etc.).
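A DAP client that wants to support the bridge only needs to map (capability, operation) pairs to handlers and answer with the response shape shown above. The sketch below is illustrative, assuming a plain-dict representation of DAP messages; the handler table and the "/tmp/proj" path are made up for the example.

```python
# Sketch of a DAP client answering a harnHostCall reverse request.
# The handler table and its entries are illustrative, not part of Harn.

def handle_harn_host_call(request, handlers):
    """Build a DAP response for a harnHostCall reverse request."""
    args = request["arguments"]
    key = (args["capability"], args["operation"])
    response = {
        "seq": request["seq"] + 1,  # client-assigned sequence number
        "type": "response",
        "request_seq": request["seq"],
        "command": "harnHostCall",
    }
    handler = handlers.get(key)
    if handler is None:
        # success: false -> the adapter raises VmError::Thrown in the script
        response["success"] = False
        response["message"] = f"unsupported host call: {key}"
    else:
        response["success"] = True
        response["body"] = {"value": handler(args["params"])}
    return response

handlers = {("workspace", "project_root"): lambda params: "/tmp/proj"}
req = {"seq": 17, "type": "request", "command": "harnHostCall",
       "arguments": {"capability": "workspace", "operation": "project_root",
                     "params": {}}}
resp = handle_harn_host_call(req, handlers)
```

Unknown operations get success: false rather than a crash, so the script-side try / catch path described above stays reachable.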

LLM telemetry output events

During a run or step-through session, the adapter forwards every llm_call the VM makes as a DAP output event with category: "telemetry" and a JSON body:

{"category": "telemetry",
 "output": "{\"call_id\":\"…\",\"model\":\"…\",\"prompt_tokens\":…,\"completion_tokens\":…,\"cache_tokens\":…,\"total_ms\":…,\"iteration\":…}"}

IDEs can parse these to show a live LLM-call ledger alongside the debug session.
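A minimal consumer just filters output events by category and decodes the JSON payload. The sketch below aggregates a per-model ledger; the field names follow the body shown above, and the sample event list is made up.

```python
import json

# Sketch: accumulate a per-model LLM-call ledger from DAP output events.
# Field names follow the telemetry body shown above; the events are made up.

def build_ledger(events):
    ledger = {}
    for event in events:
        if event.get("category") != "telemetry":
            continue  # ignore ordinary stdout/stderr output events
        call = json.loads(event["output"])
        entry = ledger.setdefault(call["model"], {
            "calls": 0, "prompt_tokens": 0, "completion_tokens": 0, "total_ms": 0,
        })
        entry["calls"] += 1
        entry["prompt_tokens"] += call["prompt_tokens"]
        entry["completion_tokens"] += call["completion_tokens"]
        entry["total_ms"] += call["total_ms"]
    return ledger

events = [
    {"category": "telemetry",
     "output": json.dumps({"call_id": "a1", "model": "m-large",
                           "prompt_tokens": 900, "completion_tokens": 120,
                           "cache_tokens": 0, "total_ms": 800, "iteration": 1})},
    {"category": "stdout", "output": "hello\n"},
]
ledger = build_ledger(events)
```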

Run records

Every agent_loop() or workflow_execute() call can produce a run record — a JSON file in .harn-runs/ that captures the full execution trace including LLM calls, tool invocations, and intermediate results.

# List recent runs
ls .harn-runs/

# Inspect a run record
harn runs inspect .harn-runs/<run-id>.json

The inspect command shows a structured summary: stages executed, tools called, token usage, timing, and final output.
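Because run records are plain JSON, you can also script your own summaries. The field names below ("llm_calls", "tool_calls", "final_output") are assumptions about the record schema for illustration, not documented guarantees; check a real file in .harn-runs/ first.

```python
import json
import tempfile

# Sketch: summarize a run record without the CLI. The top-level keys
# are assumed for illustration -- inspect a real record for the schema.

def summarize(path):
    with open(path) as f:
        record = json.load(f)
    return {
        "llm_calls": len(record.get("llm_calls", [])),
        "tool_calls": [t.get("name") for t in record.get("tool_calls", [])],
        "final_output": record.get("final_output"),
    }

# Fabricated sample record for the example.
record = {"llm_calls": [{}, {}], "tool_calls": [{"name": "grep"}],
          "final_output": "done"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(record, f)
    sample_path = f.name
summary = summarize(sample_path)
```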

Comparing runs

Compare a run against a baseline to identify regressions:

harn runs inspect .harn-runs/new.json --baseline .harn-runs/old.json

This highlights differences in tool calls, outputs, and token consumption.
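If you want a programmatic version of the tool-call portion of that diff, the core comparison is small. This is a sketch under the assumption that each run yields an ordered list of tool names; it is not how harn runs inspect is implemented.

```python
# Sketch: flag tool-call differences between a run and a baseline,
# assuming each run is reduced to a list of tool names.

def diff_tool_calls(new_tools, old_tools):
    added = [t for t in new_tools if t not in old_tools]
    removed = [t for t in old_tools if t not in new_tools]
    return {"added": added, "removed": removed}

delta = diff_tool_calls(["grep", "fmt"], ["grep", "lint"])
```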

Replay

Replay re-executes a recorded run, using the saved LLM responses instead of making live API calls. This is useful for deterministic debugging:

harn replay .harn-runs/<run-id>.json

Replay shows each stage transition and lets you verify that your pipeline produces the same results given the same LLM responses.

Visualizing a pipeline

When you want a quick structural view instead of a live debug session, render a Mermaid graph from the AST:

harn viz main.harn
harn viz main.harn --output docs/main.mmd

The generated graph is useful for reviewing branch-heavy pipelines, match arms, parallel blocks, and nested retries before you start stepping through them.

Evaluation

The harn eval command scores a run or set of runs against expected outcomes:

# Evaluate a single run
harn eval .harn-runs/<run-id>.json

# Evaluate all runs in a directory
harn eval .harn-runs/

# Evaluate using a manifest
harn eval eval-suite.json

Custom metrics

Use eval_metric() in your pipeline to record domain-specific metrics:

eval_metric("accuracy", 0.95, {dataset: "test-v2"})
eval_metric("latency_ms", 1200)

These metrics appear in run records and are aggregated by harn eval.
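Aggregation across runs can be as simple as averaging values per metric name. The sketch below assumes metrics are collected as name/value pairs; the entry shape is an assumption for illustration, not the exact structure harn eval reads.

```python
from statistics import mean

# Sketch: average eval_metric entries per metric name across runs.
# The name/value entry shape is assumed for illustration.

def aggregate_metrics(entries):
    by_name = {}
    for entry in entries:
        by_name.setdefault(entry["name"], []).append(entry["value"])
    return {name: mean(values) for name, values in by_name.items()}

metrics = [{"name": "accuracy", "value": 0.95},
           {"name": "accuracy", "value": 0.85},
           {"name": "latency_ms", "value": 1200}]
summary = aggregate_metrics(metrics)
```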

Token usage tracking

Track LLM costs during a run:

let usage = llm_usage()
log("Tokens used: ${usage.input_tokens + usage.output_tokens}")
log("LLM calls: ${usage.total_calls}")
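Those counters are also enough to estimate spend outside the pipeline. The sketch below assumes a usage object with the fields shown above; the per-million-token prices are placeholders, not real rates.

```python
# Sketch: estimate spend from llm_usage()-style counters.
# Prices are placeholder USD-per-1M-token rates, not real pricing.

def estimate_cost(usage, input_per_m=3.0, output_per_m=15.0):
    return (usage["input_tokens"] * input_per_m
            + usage["output_tokens"] * output_per_m) / 1_000_000

usage = {"input_tokens": 400_000, "output_tokens": 20_000, "total_calls": 12}
cost = estimate_cost(usage)
```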

Portal

The Harn portal is an interactive web UI for inspecting runs:

harn portal

This opens a dashboard showing all runs in .harn-runs/, with drill-down into individual stages, tool calls, and transcript snapshots.

Tips

  • Add eval_metric() calls to your pipelines early — they’re cheap to record and invaluable for tracking quality over time.
  • Use replay for debugging non-deterministic failures: record the failing run, then replay it locally to step through the logic.
  • Compare baselines when refactoring prompts or changing tool definitions to catch regressions before they ship.