VM and stdlib hot-path profile
This page captures the allocation profile behind issue #1426 and the follow-on runtime/typechecker performance wave tracked by issue #2095. The first sections are historical context for the May 2026 optimization series; the post-#2095 section records which bottlenecks have since landed so new work starts from the current shape instead of refiling already-fixed hotspots. Reproduce locally with:
./scripts/bench_vm.sh --no-build --iterations 20
cargo bench -p harn-vm-perf --bench bench_vm_fixtures
cargo bench -p harn-orchestration-perf --bench bench_workflow_bundle
The fixture set covers the option-builder pipelines (dict merge, subscript
assign, filter_nil/pick_keys) that the connector helpers and agent loops run
on every call, plus the workflow bundle export that the host previews when it
ships a portable bundle.
Landed optimizations
1. SetSubscript mutates in place via Rc::make_mut
The previous out[k] = v fast path cloned the entire backing
BTreeMap/Vec on every assignment because active_local_slot_value
returns the slot value by clone, leaving the Rc strong count ≥ 2. The new
path looks up the slot by index, then mutates the contained Rc<...>
directly with Rc::make_mut, which skips the copy when the slot holds the
only reference (the steady state for builder loops).
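The difference between the two paths can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the VM's code: the Dict alias, the slot table, and both function names are hypothetical stand-ins.

```rust
use std::collections::BTreeMap;
use std::rc::Rc;

// Hypothetical stand-in for a VM dict value held in a local slot.
type Dict = Rc<BTreeMap<String, i64>>;

/// Old shape: read the slot by clone, so the strong count is 2 and
/// `Rc::make_mut` has to copy the whole map before inserting.
fn set_subscript_cloning(slots: &mut [Dict], slot: usize, key: &str, value: i64) {
    let mut held = Rc::clone(&slots[slot]);
    Rc::make_mut(&mut held).insert(key.to_string(), value);
    slots[slot] = held;
}

/// New shape: borrow the slot by index and mutate through it. With a
/// strong count of 1, `Rc::make_mut` hands back the existing map.
fn set_subscript_in_place(slots: &mut [Dict], slot: usize, key: &str, value: i64) {
    Rc::make_mut(&mut slots[slot]).insert(key.to_string(), value);
}

fn main() {
    let mut slots: Vec<Dict> = vec![Rc::new(BTreeMap::new())];

    // Compare the backing allocation before and after each store.
    let before = Rc::as_ptr(&slots[0]);
    set_subscript_in_place(&mut slots, 0, "a", 1);
    println!("in-place kept allocation: {}", before == Rc::as_ptr(&slots[0]));

    let before = Rc::as_ptr(&slots[0]);
    set_subscript_cloning(&mut slots, 0, "b", 2);
    println!("cloning kept allocation: {}", before == Rc::as_ptr(&slots[0]));
}
```

Comparing Rc::as_ptr before and after each store is a cheap way to confirm whether the backing map was reused or copied.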
Effect on dict_subscript_assign:
| metric | baseline (Harn 0.8.3) | post-#1426 | delta |
|---|---|---|---|
| allocations/run | 684,058 | 328,058 | −52 % |
| allocated bytes | 58,406,677 | 19,670,677 | −66 % |
| criterion median | 25.4 ms | 21.7 ms | −15 % |
| bench_vm.sh 3-pass mean | ~30 ms | 15.9 ms | −47 % |
The closure-captured / env-fallback path is preserved — when the binding
lives in env (e.g. captured by a closure rather than a slot-resolved
local), Rc::try_unwrap keeps the no-other-references case
allocation-free.
2. Native option-builder helpers replace Harn + {[k]: v} loops
std/collections::filter_nil, std/collections::pick_keys, and the
std/json merge, pick, and omit helpers all expanded to a
var result = {} accumulator with result = result + {[k]: v} per
iteration: a fresh Rc<BTreeMap> allocation per inserted entry, plus a
per-call closure dispatch in filter_nil. Every connector wrapper
(std/connectors/{github,linear,notion,slack}), std/context,
std/graphql, the agents stdlib, and the workflow scaffolding lean on
these helpers.
Five new builtins under crates/harn-vm/src/stdlib/collections.rs
handle the work in one allocation:
- __dict_filter_nil(d): drop nil, "", and the literal string "null"; returns the original Rc when nothing changes.
- __dict_merge(a, b): Rc::try_unwrap(a) + BTreeMap::extend.
- __dict_pick(data, keys): match std/json::pick semantics (drop missing + nil).
- __dict_pick_keys(d, keys, drop_nil): match std/collections::pick_keys (preserve nil unless drop_nil is set).
- __dict_omit(d, keys): Rc::try_unwrap(d) + BTreeMap::retain.
The Harn-level pub fns in stdlib_collections.harn and
stdlib_json.harn now thin-wrap these so every existing
import { filter_nil } from "std/collections" consumer transparently
picks them up; the public API is unchanged.
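The builtin shapes described above can be approximated in a few lines of Rust. This is a hedged sketch rather than the code in collections.rs: the Value enum, the Dict alias, and the unprefixed function names are assumptions made for illustration, and only three of the five helpers are shown.

```rust
use std::collections::BTreeMap;
use std::rc::Rc;

// Hypothetical, heavily reduced value type; the real VM value has more variants.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Nil,
    Str(String),
    Int(i64),
}

type Map = BTreeMap<String, Value>;
type Dict = Rc<Map>;

/// nil, "", and the literal string "null" are dropped by filter_nil.
fn is_droppable(v: &Value) -> bool {
    match v {
        Value::Nil => true,
        Value::Str(s) => s.is_empty() || s == "null",
        _ => false,
    }
}

/// Take the map out of the Rc when unique; fall back to one copy when shared.
fn take_or_copy(d: Dict) -> Map {
    Rc::try_unwrap(d).unwrap_or_else(|shared| (*shared).clone())
}

fn dict_filter_nil(d: Dict) -> Dict {
    if !d.values().any(is_droppable) {
        return d; // nothing changes: hand back the original Rc
    }
    let mut map = take_or_copy(d);
    map.retain(|_, v| !is_droppable(v));
    Rc::new(map)
}

fn dict_merge(a: Dict, b: &Map) -> Dict {
    let mut map = take_or_copy(a);
    map.extend(b.iter().map(|(k, v)| (k.clone(), v.clone())));
    Rc::new(map)
}

fn dict_omit(d: Dict, keys: &[&str]) -> Dict {
    let mut map = take_or_copy(d);
    map.retain(|k, _| !keys.contains(&k.as_str()));
    Rc::new(map)
}

fn main() {
    let config: Dict = Rc::new(BTreeMap::from([
        ("token".to_string(), Value::Str("abc".to_string())),
        ("cursor".to_string(), Value::Nil),
    ]));
    let overlay = BTreeMap::from([("limit".to_string(), Value::Int(50))]);
    let options = dict_omit(dict_filter_nil(dict_merge(config, &overlay)), &["token"]);
    println!("{:?}", options);
}
```

Each helper walks the input once and wraps the result in a single new Rc, in contrast to the accumulator loop that built a fresh dict per inserted entry.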
Effect on filter_nil_loop (4,000 iterations of
filter_nil(merge(config, overlay)) plus a pick_keys projection — the
canonical connector option-builder shape):
| metric | baseline (Harn 0.8.3) | post-#1426 | delta |
|---|---|---|---|
| allocations/run | 1,868,316 | 412,276 | −78 % |
| allocated bytes | 535,181,340 | 34,187,963 | −94 % |
| criterion median | 161.9 ms | 25.5 ms | −84 % |
| bench_vm.sh 3-pass mean | ~98 ms | 17.7 ms | −82 % |
Conformance was unchanged (stdlib_collections, stdlib_json, and the
broader 933-test suite all pass).
3. Regex builtins share compiled patterns via Rc instead of cloning
Issue #2796 surfaced this while porting a TypeScript repo-audit script:
a line-oriented scan of the repository was ~3.4× slower in Harn than the
Node baseline. Decomposing the scan (file walk / read_text / line split
/ contains / regex_match, each timed separately over ~740k lines)
isolated the cost entirely to regex_match at ~4.4 µs/call — file I/O,
splitting, and contains were all already cheap.
The pattern was cached, but get_cached_regex returned a cloned regex::Regex
on every hit, and regex::Regex::clone deep-copies the compiled program and
its lazy-DFA match-cache pool. A standalone Rust probe confirmed the cost:
134k find_iter calls over a real file took 3.3 ms reusing one Regex,
466 ms cloning the Regex per call, and 2.7 ms cloning an Rc<Regex> per
call — i.e. the deep clone, not the match, was ~99% of the time.
The fix stores Rc<regex::Regex> in the thread-local cache so a hit is a
refcount bump, adds a single-slot "last pattern" memo that skips the
cache-key format! and HashMap hash when a scan loop reuses one pattern,
and switches the regex/contains/split family to borrow their subject and
needle (VmValue::as_str_cow) instead of display()-cloning per call.
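The cache shape described above (shared Rc entries plus a single-slot "last pattern" memo) might look roughly like the following. To keep the sketch std-only, Compiled is a hypothetical stand-in for a compiled regex::Regex that does a plain substring test; PatternCache, get_cached, and compile_count are likewise illustrative names, not the VM's.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for a compiled regex::Regex (substring match only).
struct Compiled {
    needle: String,
}

impl Compiled {
    fn compile(pattern: &str) -> Self {
        Compiled { needle: pattern.to_string() }
    }

    fn is_match(&self, subject: &str) -> bool {
        subject.contains(self.needle.as_str())
    }
}

#[derive(Default)]
struct PatternCache {
    last: Option<(String, Rc<Compiled>)>, // single-slot memo for scan loops
    by_pattern: HashMap<String, Rc<Compiled>>,
    compiles: usize,
}

thread_local! {
    static CACHE: RefCell<PatternCache> = RefCell::new(PatternCache::default());
}

/// A hit is a refcount bump; the compiled pattern itself is never copied.
fn get_cached(pattern: &str) -> Rc<Compiled> {
    CACHE.with(|cell| {
        let mut cache = cell.borrow_mut();
        // Fast path: same pattern as the previous call, no hashing needed.
        if let Some((last, compiled)) = &cache.last {
            if last.as_str() == pattern {
                return Rc::clone(compiled);
            }
        }
        let cached = cache.by_pattern.get(pattern).cloned();
        let compiled = match cached {
            Some(hit) => hit,
            None => {
                cache.compiles += 1;
                let fresh = Rc::new(Compiled::compile(pattern));
                cache.by_pattern.insert(pattern.to_string(), Rc::clone(&fresh));
                fresh
            }
        };
        cache.last = Some((pattern.to_string(), Rc::clone(&compiled)));
        compiled
    })
}

fn compile_count() -> usize {
    CACHE.with(|cell| cell.borrow().compiles)
}

fn main() {
    let lines = ["// TODO: tidy", "let x = 1;", "// TODO: test"];
    let hits = lines.iter().filter(|line| get_cached("TODO").is_match(line)).count();
    println!("{} hits, {} compile(s)", hits, compile_count());
}
```

Every caller receives a handle to the same compiled object, so the per-call cost in a scan loop reduces to one string comparison and one refcount increment.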
Effect on regex_scan_loop (4,000 iterations × 10 lines of two contains
plus one regex_match — the canonical line scan shape):
| metric | baseline (Harn 0.8.60) | post-#2796 | delta |
|---|---|---|---|
| harn bench 10-iter mean | 367.9 ms | 36.6 ms | −90 % |
A full-repository scan (~740k regex_match calls over 1,409 files) drops
its regex phase from ~4.0 s to ~1.2 s; the remaining cost is VM call
dispatch and result-list construction, not the regex engine. Conformance
(regex*, string*) was unchanged.
Post-#2095 performance wave
The #1426 profile left a second wave of runtime and typechecker work. Issue #2095 split that wave into small PRs so each hot path could land with isolated measurements and conformance parity.
| Area | Issues / PRs | Change | Measured signal |
|---|---|---|---|
| Typechecker scope entry | #2093 / #2102 | Replaced deep-cloned TypeScope.parent chains with Rc<TypeScope> parents and shared root-scope children. | Synthetic one-line function corpus typecheck dropped from 69 ms to 3 ms at 500 fns and from 8.27 s to 29 ms at 10,000 fns. |
| Closure callbacks | #2086 / #2099 | Pushed callback closures onto the existing VM frame stack and drove them with drive_until_frame_depth, removing the per-callback boxed future and the frame/iterator/deadline mem::take isolation. | list_map_filter moved from the checked-in #1426 baseline mean of 298.64 ms to 76.08 ms in the PR bench table. |
| Named user calls | #2085 / #2101 | Split Op::CallBuiltin into a sync user-closure fast path and async fallback. | function_call_loop best-of-three minimum improved by 11.2%. |
| Tail calls | #2088 / #2103 | Split Op::TailCall into a sync TCO fast path and async fallback for tracked/generator/non-closure cases. | recursive_countdown best-of-three minimum improved by 5.6%. |
| Call argument packing | #2091 / #2107 | Bound regular closure, tail-call, pipe, and sync-builtin arguments directly from VM stack slices; materialized Vecs only for paths that need ownership. | Conformance stayed green, with targeted hot fixture smoke runs covering function_call_loop, method_call_dispatch, and list_map_filter. |
| VmValue layout | #2092 / #2100 | Boxed rare/large variants behind shared payloads and added a layout-budget test. | VmValue size budget tightened from 48 bytes to 32 bytes. |
| Method dispatch | #2087 / #2108 | Added sync method dispatch for optional nil, inline-cache hits, and pure receiver methods, leaving callable-backed methods on the async path. | method_call_dispatch release mean measured at 32.74 ms; list_map_filter stayed near 83 ms after the dispatch split. |
| harn run setup | #2094 / #2109 | Deferred LLM builtin registration and lazy-loaded setup-only runtime config. | Warm run-setup samples for function_call_loop settled at roughly 1 ms after first-touch initialization. |
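As an illustration of the typechecker scope-entry row in the table above, the following sketch shows why Rc parents make scope entry cheap: each function body shares the root scope through a refcount bump instead of copying the parent chain. The field and method names are assumptions; only the TypeScope name and its Rc<TypeScope> parent come from the table.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Simplified scope; the fields are illustrative, not the typechecker's.
#[derive(Default)]
struct TypeScope {
    vars: HashMap<String, String>,
    parent: Option<Rc<TypeScope>>,
}

impl TypeScope {
    /// Entering a scope bumps the parent's refcount instead of copying it.
    fn child_of(parent: &Rc<TypeScope>) -> TypeScope {
        TypeScope { vars: HashMap::new(), parent: Some(Rc::clone(parent)) }
    }

    fn define(&mut self, name: &str, ty: &str) {
        self.vars.insert(name.to_string(), ty.to_string());
    }

    /// Walk outward through the shared parents until the name resolves.
    fn lookup(&self, name: &str) -> Option<&str> {
        let mut scope = Some(self);
        while let Some(current) = scope {
            if let Some(ty) = current.vars.get(name) {
                return Some(ty.as_str());
            }
            scope = current.parent.as_deref();
        }
        None
    }
}

fn main() {
    let mut root = TypeScope::default();
    root.define("len", "fn(list) -> int");
    let root = Rc::new(root);

    // One child scope per function body, all sharing the same root.
    let mut bodies = Vec::new();
    for i in 0..1_000 {
        let mut scope = TypeScope::child_of(&root);
        scope.define("arg", &format!("T{}", i));
        bodies.push(scope);
    }
    println!("root strong count: {}", Rc::strong_count(&root));
    println!("len resolves to: {:?}", bodies[999].lookup("len"));
}
```

With a deep-cloned parent, each scope entry costs time proportional to the size of the enclosing chain, so total work grows much faster than the function count. That is consistent with the superlinear gap between the 500-function and 10,000-function measurements in the table.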
Historical pre-#2095 hotspots
bench_vm_fixtures numbers (allocations × wall-time per fixture run, on
the post-#1426 binary) were the input to #2095:
| fixture | alloc/run | bytes/run | median wall | shape |
|---|---|---|---|---|
| list_map_filter | 10.9M | 4.43 GB | 376 ms | list.filter(closure).map(closure) in a loop |
| local_variable_lookup | 2.20M | 3.0 MB | 161 ms | bare local-slot reads |
| function_call_loop | 1.70M | 219 MB | 96 ms | tight step(value) recursion |
| agent_tool_dispatch | 1.54M | 261 MB | 53 ms | agent_dispatch_tool_batch over 6 calls × 500 iters |
| comparison_loop | 1.10M | 1.4 MB | 200 ms | numeric/string <,==,!= mix |
| struct_field_read | 0.90M | 3.3 MB | 94 ms | struct field access in a hot loop |
| dict_merge_loop | 0.85M | 96 MB | 45 ms | result = result + {[k]: v} accumulator |
Two patterns dominated that snapshot:
- Closure callbacks per element. list_map_filter allocates ~2,725 bytes and ~5,450 ops per iteration's worth of map+filter calls; the per-callback VmEnv clone-on-call probe (bench_vmenv_clone) shows each call constructs a fresh capture environment even for closures with zero captures. #2086 removed the per-callback boxed future and mem::take isolation, and the later method/call-argument work reduced the remaining callback dispatch overhead. Re-measure before filing more callback-specific work; the old list_map_filter numbers are no longer representative.
- Rc::try_unwrap defeated by the slot/stack double-hold. The dict + dict operator already does Rc::try_unwrap for the unique case, but result = result + {[k]: v} always sees the slot still holding the value while the operator runs (slot ref + stack ref). The right answer is either a var <op>= rhs peephole that emits a "swap-take" sequence, or a compiler pass that moves a slot value onto the stack when it knows the slot is about to be overwritten. Cheaper interim is to keep migrating Harn helpers to subscript-store (now allocation-free) instead of the + accumulator.
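The double-hold problem and the proposed swap-take sequence can be demonstrated in isolation. In this sketch the slot is modeled as an Option<Dict> so the value can be moved out without allocating a placeholder; that modeling choice and all function names are assumptions, not the VM's design.

```rust
use std::collections::BTreeMap;
use std::rc::Rc;

type Map = BTreeMap<String, i64>;
type Dict = Rc<Map>;

/// dict + dict: reuse the left map when it is uniquely owned.
/// The bool reports whether a full copy of the left operand was needed.
fn dict_add(lhs: Dict, rhs: Map) -> (Dict, bool) {
    let (mut map, copied) = match Rc::try_unwrap(lhs) {
        Ok(unique) => (unique, false),
        Err(shared) => ((*shared).clone(), true),
    };
    map.extend(rhs);
    (Rc::new(map), copied)
}

/// result = result + rhs as emitted today: the slot keeps its reference
/// while the operator runs, so try_unwrap always sees two owners.
fn accumulate_double_hold(slot: &mut Option<Dict>, rhs: Map) -> bool {
    let on_stack = Rc::clone(slot.as_ref().expect("slot is initialized"));
    let (sum, copied) = dict_add(on_stack, rhs);
    *slot = Some(sum);
    copied
}

/// Swap-take: move the value out of the slot first, so the operand on the
/// stack is the only owner and its map is reused without a copy.
fn accumulate_swap_take(slot: &mut Option<Dict>, rhs: Map) -> bool {
    let taken = slot.take().expect("slot is initialized");
    let (sum, copied) = dict_add(taken, rhs);
    *slot = Some(sum);
    copied
}

fn entry(key: &str, value: i64) -> Map {
    BTreeMap::from([(key.to_string(), value)])
}

fn main() {
    let mut slot = Some(Rc::new(Map::new()));
    println!("double hold copied: {}", accumulate_double_hold(&mut slot, entry("a", 1)));
    println!("swap-take copied: {}", accumulate_swap_take(&mut slot, entry("b", 2)));
}
```

The swap-take variant is only valid when the slot is about to be overwritten by the same statement, which is the condition the peephole or compiler pass would have to prove.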
Workflow-bundle export profile
bench_workflow_bundle exercises the validation + graph normalization +
portable-bundle export path (crates/harn-vm/src/orchestration/workflow_bundle.rs).
Allocation counts on a representative 6-node, 4-trigger, 2-connector,
2-capsule fixture:
| stage | alloc/run | bytes/run | criterion median |
|---|---|---|---|
| validate | 205 | 86 KB | 18 µs |
| preview | 2,567 | 310 KB | 102 µs |
| export_graph | 2,408 | 277 KB | 88 µs |
export_workflow_bundle_graph clones every per-node
editable_fields slot once into the node and once into the global
list. Those clones are correctness-preserving today (the global list is
sorted afterwards), but they're an obvious follow-up if this gets hot in
real CI loads. Numbers here are baseline for the new fixture; reproduce
with cargo bench -p harn-orchestration-perf --bench bench_workflow_bundle.
What's now realistic to port from Rust to Harn
With the option-builder cost paid natively and out[k] = v running
allocation-free, several control-plane paths previously kept in Rust on
performance grounds become reasonable Harn candidates:
- Trigger preflight wiring. crates/harn-vm/src/triggers/dispatcher builds option dicts the same way connectors do; the bookkeeping is trivially expressible in Harn now that builder loops are cheap.
- Workflow stage option assembly. assemble_stage_options in orchestration/stage_options.rs does dozens of small merge/filter_nil style merges on every stage start. Moving this to a Harn helper that delegates to __dict_* builtins keeps the Rust crate boundary clean for the actual orchestrator while pushing the editorial work into Harn.
- Connector setup-status normalization. connectors/shared.harn already runs in Harn but used to be cost-prohibitive for high-fan-out trigger packs; the option-builder cost is no longer the bottleneck.
Areas still better served by Rust: workflow_bundle graph normalization
(needs serde, deterministic sort, and SHA-256 digests in one place);
agent tool dispatch (touches the host bridge and tool annotation cache);
flow store atom emission (Ed25519 + SQLite). Profile reruns will tell us
when any of those tip over.