Agent lifecycle: suspend, resume, stop, self-park

Harn agents are not just one-shot calls — they are cooperatively schedulable units that can park mid-loop, persist a resumable snapshot, resume later, or hand unfinished child work back to a parent with a typed stop handoff. This page is the long-form reference for lifecycle primitives: when to reach for them, who owns the resume responsibility, and how transcript continuity or stop handoffs work across the boundary.

For a one-screen LLM quickref see the "Agent lifecycle: pause, resume, self-park" section in docs/llm/harn-quickref.md. For the upstream protocol contributions that mirror this surface see ACP session/suspend and A2A TaskState.PAUSED.

When to suspend

Suspend/resume is one of several ways Harn lets an agent yield. Pick the narrowest one that fits.

GoalUseWhy
End the loop because the work is doneNatural completion (status: "done")No state to preserve. The transcript is final.
Cap tokens or wall-clock timemax_iterations, token_budget, autonomy_budgetBudgeting is policy, not a checkpoint. The loop terminates with status: "budget_exhausted" or status: "approval_required".
Wait for one specific external eventSuspend/resume with agent_await_resumption(reason, conditions)Park the agent, declare what should wake it, keep the transcript.
Park indefinitely until an operator resumes by handSuspend/resume with agent_await_resumption(reason) (no conditions)Parent agent or operator owns the resume; no trigger gets registered.
Park one agent while another runs to completionParent-driven subagent_pause(handle, reason) plus subagent_resume(handle)The parent loop owns when the child resumes.
Stop a child and take over its unfinished workagent_stop(handle, {graceful: true}) or parent-driven subagent_stop(handle)The runtime folds the child transcript and descendant subagent summaries into a typed handoff artifact.
Background-run an agent and rejoin it laterspawn_agent({background: true}) + wait_agent(handle)The agent is running, not parked. Use suspend/resume only when the agent itself should yield CPU.

Two anti-patterns worth calling out:

  • Suspend is not a sleep. The runtime does not poll for resume — it registers conditions (when present) with the trigger dispatcher, persists a snapshot, and exits the loop frame. A polling sleep keeps the worker alive on a thread; suspend releases it. If you want fixed-cadence wakeup, pass conditions.timeout: {duration_minutes: N}.
  • Suspend is cooperative. A suspend request flips a flag that is honored at the next turn boundary. Tool calls and LLM requests in flight finish first. A worker that does not return to a turn boundary (e.g. an infinite shell command in a single tool) will never observe the request.

The lifecycle surface

The suspend/resume primitive ships as three builtin layers:

  1. Script-levelsuspend_agent, resume_agent, agent_stop, parse_resume_conditions, agent_await_resumption from std/agent/workers. Operators and parent pipelines call these directly.
  2. Model-facing toolsagent_loop(...) automatically exposes agent_await_resumption as a callable tool to the model so an agent can self-park. Pass subagents: true to also expose subagent_pause, subagent_resume, and subagent_stop.
  3. Resume-responsibility callbacksResumeBy.parent_llm, ResumeBy.local_runtime, ResumeBy.cloud_harness, and ResumeBy.pipeline_drain from std/agent/resume_by name who owns resuming the suspended agent. The runtime calls default_resume_by(...) when the script does not supply one.

Builtins

FunctionPurpose
suspend_agent(worker, reason?, options?)Cooperatively suspend a running worker. Persists a resumable snapshot, emits a WorkerSuspended lifecycle event, returns status: "suspended" with suspension metadata. Idempotent on already-suspended workers.
resume_agent(worker_or_snapshot, input?, continue_transcript?)Resume a suspended worker, optionally with new input. continue_transcript defaults to true (full transcript preserved); pass false to resume from prior summary plus new input only. Accepts either a live handle or a persisted snapshot dict.
agent_stop(worker, options?)Stop a worker. The default is a hard cancel; {graceful: true} emits WorkerStopped and returns {status: "stopped", handoff, children, handoffs, worker} where handoff is a normalized handoff artifact and children recursively folds descendant subagent handoffs.
agent_await_resumption(reason, conditions?, resume_by?)Build the normalized lifecycle-tool request. Inside an agent_loop running as a worker, the loop intercepts this call structurally and routes it through the same suspend path as suspend_agent.
parse_resume_conditions(conditions?)Validate and normalize a ResumeConditions dict without spawning a worker. Useful for input validation in handlers and pre-flight checks.
agent_lifecycle_tools(registry?, options?)Decorate a tool registry with agent_await_resumption (always) and, when subagents: true, subagent_pause / subagent_resume / subagent_stop. agent_loop calls this internally; scripts compose it when they build registries by hand.

Resume conditions

ResumeConditions is the shared shape consumed by both agent_await_resumption and spawn_agent({options: {resume_when: ...}}):

{
  trigger?:  TriggerSpec,    // any std/triggers trigger spec
  timeout?:  {duration_minutes: int, on_timeout?: string},
  on_event?: string,         // EventLog topic name
}

All three fields are optional. When all are omitted the worker is parked "open" — only the parent agent, an operator, or resume_agent(...) can wake it. parse_resume_conditions(nil) returns nil; bad input raises HARN-SUS-002 with the failing field path.

timeout.on_timeout defaults to "resume_with_summary" and accepts "fail" or "resume_with_input". trigger is validated by the same trigger-spec parser used by trigger_register(...), so any provider that works as a trigger source works as a resume condition.

Self-park mid-loop

When agent_loop(...) exposes agent_await_resumption as a model tool, the model can park itself between turns. The loop intercepts the call before normal tool dispatch, validates conditions, persists a snapshot, and returns a structured result to the parent (or the direct caller, for top-level loops):

import { agent_await_resumption } from "std/agent/workers"

let result = agent_loop("Wait for the maintainer's review.", nil, {
  provider: "openai",
  model: "gpt-5",
  tool_format: "native",
  // agent_await_resumption is registered automatically.
})

// The model decided to park; `result.status == "suspended"`.
if result.status == "suspended" {
  log(result.reason)                      // model-supplied
  log(result.initiator)                   // "self"
  log(result.conditions?.timeout?.duration_minutes)
  log(result.handle.snapshot_path)        // persisted snapshot
}

status: "suspended" carries handle, reason, initiator (one of "self", "parent", "operator", "triggered"), conditions, and iterations_completed (turns finished before the park). The handle is a resumable worker reference; the snapshot is a JSON document on disk.

Restore the snapshot from the CLI in a fresh process:

harn run --resume .harn/workers/worker_01HF...json

harn run --resume <path> rehydrates the worker, replays the resume- continuity reminder onto the next turn, and continues the loop in the same session.

Parent-driven pause, resume, and stop

For multi-agent setups, the parent loop owns pause/resume of its children. Pass subagents: true (or subagent_tools: true) to agent_loop(...) to expose subagent_pause, subagent_resume, and subagent_stop as model-callable tools:

import { agent_lifecycle_tools } from "std/agent/workers"

let parent_registry = agent_lifecycle_tools(tool_registry(), {subagents: true})

agent_loop("Coordinate the review.", nil, {
  tools: parent_registry,
  subagents: true,
  provider: "openai",
})

The parent's model can now call:

ToolEffect
subagent_pause(handle, reason)Pause a running child after its current turn settles. Idempotent on already-suspended children.
subagent_resume(handle, input?, continue_transcript? = true)Resume a suspended child with optional new input.
subagent_stop(handle, graceful? = true, reason?)Stop a child. Graceful mode is the default and returns a recursive typed handoff summary; graceful: false preserves hard-cancel behavior.

Scripts can drive the same lifecycle directly without going through the model:

let handle = sub_agent_run("Draft the changelog.", {
  background: true,
  provider: "openai",
})

let snapshot = suspend_agent(handle, "operator pulled context")
// ... do other work ...
let resumed = resume_agent(handle, "Pick up where you left off.")
let final = wait_agent(handle)
log(final.status)         // "done"
log(final.has_transcript) // true — transcript continuity preserved

subagent_pause and subagent_resume emit tool_call_audit telemetry with initiator: "parent" so the trust graph can distinguish parent- driven pauses from self-parks.

subagent_stop emits the same audit telemetry and, in graceful mode, returns a handoff artifact shaped for parent takeover. handoff.metadata includes the child session_id, workspace anchor, token budget/usage, snapshot path, and child_handoffs for recursively stopped descendant subagents.

Conditioned resume

Pair agent_await_resumption with conditions to declare what should wake the parked agent. The runtime registers the trigger (if any) with the local dispatcher and persists the conditions alongside the snapshot so they survive restart:

import { agent_await_resumption } from "std/agent/workers"

// Self-park until either a review approval lands OR 30 minutes pass.
let request = agent_await_resumption("waiting on review", {
  trigger: {
    kind: "review.approved",
    provider: "github",
    match: {events: ["review.approved"]},
  },
  timeout: {
    duration_minutes: 30,
    on_timeout: "resume_with_summary",
  },
})

on_timeout semantics:

  • "resume_with_summary" (default) — resume with a system reminder summarizing what happened during the wait.
  • "fail" — terminate the worker with status: "failed".
  • "resume_with_input" — resume with a caller-supplied input string.

spawn_agent({options: {resume_when: ...}}) accepts the same shape so a worker can be born already parked:

import { parse_resume_conditions, spawn_agent } from "std/agent/workers"

let resume_when = parse_resume_conditions({
  trigger: {
    kind: "channel.emit",
    provider: "channel",
    match: {events: ["channel:pr.merged"]},
  },
  timeout: {duration_minutes: 60, on_timeout: "fail"},
})

spawn_agent({
  task: "Cut a release once the PR merges.",
  node: {
    kind: "subagent",
    mode: "llm",
    model_policy: {provider: "openai", model: "gpt-5"},
    output_contract: {output_kinds: ["summary"]},
  },
  options: {resume_when: resume_when},
})

Resume responsibility (ResumeBy.*)

Every suspension has exactly one resume owner. The optional resume_by argument on agent_await_resumption names that owner with a callback:

CallbackWhen it runs
ResumeBy.parent_llmParent agent's LLM resumes the child via subagent_resume. The default when no conditions are present.
ResumeBy.local_runtimeLocal trigger dispatcher registers conditions.trigger and calls resume_agent on fire. The default when conditions are present and no cloud session is bound.
ResumeBy.cloud_harnessThe harn-cloud webhook receiver persists the suspension across process restart and resumes when the configured event lands. Selected when a cloud session is bound.
ResumeBy.pipeline_drainThe enclosing pipeline's drain step owns the resume responsibility. Use inside lifecycle combinators.

Each callback is a pure closure (harness, suspension) -> {handled, mechanism} so they compose with first_available(...), compose(...), and the rest of std/lifecycle/combinators:

import { ResumeBy, first_handled } from "std/agent/resume_by"

let RB = ResumeBy()
let chain = first_handled([RB.cloud_harness, RB.local_runtime, RB.parent_llm])

let request = agent_await_resumption(
  "wait for upstream merge",
  {trigger: {kind: "channel.emit", provider: "channel", match: {events: ["channel:upstream.merged"]}}},
  chain,
)

default_resume_by(...) mirrors the runtime's default policy and is publicly exported so scripts can call it explicitly:

  • conditions == nilResumeBy.parent_llm
  • conditions != nil, no cloud session → ResumeBy.local_runtime
  • conditions != nil, cloud session → first_handled([cloud_harness, local_runtime])

Every resolved dispatch emits a resume_by_dispatched audit through harness.emit_audit. Declined entries (e.g. cloud_harness with no cloud session) emit resume_by_declined. Both surface through lifecycle_audit_log_take() for tests and observability.

Transcript continuity

By default a resumed worker carries its full transcript forward. The runtime injects a single-shot system_reminder with dedupe_key: "resume_continuity" describing the gap, the suspend reason, and (when available) what fired the resume. The reminder is consumed on the next turn and does not re-apply on subsequent suspends.

Set continue_transcript: false on resume_agent(...) to drop the recorded turns and restart from the prior summary plus new input:

// Discard the long deliberation transcript; restart from a clean state
// with a fresh prompt.
resume_agent(handle, "Try a completely different approach.", false)

This is the right call when the previous deliberation hit a dead-end and you want the model to think from first principles, not "patch over" the prior reasoning.

Top-level loops and --resume

A root agent_loop(...) — one called from a pipeline or harn run script, not as a child of another agent — uses the same suspend path. On self-park the runtime persists a snapshot under .harn/workers/worker_*.json and returns status: "suspended" to the caller. The CLI can cold-restore that snapshot in any subsequent process:

# First run: agent parks itself, prints the snapshot path.
harn run scripts/triage.harn
# status=suspended
# snapshot=.harn/workers/worker_01HFEX...json

# Later, in a fresh process:
harn run --resume .harn/workers/worker_01HFEX...json --json
# {"event_type": "done", "value": {"status": "done", ...}}

--resume accepts both absolute and script-relative snapshot paths. The resumed loop emits its final value as a JSON event when --json is set so callers can pipeline it without parsing prose output.

Daemon idle is a degenerate case

agent_loop(..., {daemon: true}) and the daemon stdlib wrappers build on the same primitive. When a daemon idles waiting for a trigger event or a wake-interval tick, the stdlib records an agent_await_resumption request internally and parks the loop. Daemon-specific snapshot fields (pending_event_count, queued_event_count, inflight_event, wake_interval_ms, watch_paths) are persisted alongside the standard suspend metadata.

That means daemons can be cold-restored with daemon_resume(path) the same way a parked agent can be cold-restored with harn run --resume.

See Agent loops — Daemon stdlib wrappers for the daemon-specific surface.

Diagnostic codes

CodeMeaning
HARN-SUS-001suspend_agent was called on a worker that is not running.
HARN-SUS-002ResumeConditions validation failed.
HARN-SUS-003resume_agent was called on a live worker that is not suspended.
HARN-SUS-004A resume snapshot was missing, stale, unreadable, or version-incompatible.
HARN-SUS-005agent_await_resumption was invoked outside agent_loop structural handling (e.g. called directly as a tool handler).
HARN-SUS-006A concurrent resume changed the worker before resume completed.
HARN-SUS-007conditions.trigger could not be registered with the trigger dispatcher.
HARN-SUS-008A timeout fired or was configured with an unsupported on_timeout action.
HARN-SUS-009Resume input failed agent_loop input validation.
HARN-SUS-010A worker closed while suspended rejected a later resume.

The diagnostic catalog at Diagnostic codes catalog is authoritative; the table above lists only the HARN-SUS-* namespace.

Gotchas

  • Suspend is cooperative, not preemptive. suspend_agent(...) flips a flag; the worker exits the loop frame at the next turn boundary, not inside a tool call or LLM request. Workers that never return to a turn boundary will never observe the request. Cap long-running tools with tool_call_timeout (see Long-running tools).
  • Conditions are optional. A bare agent_await_resumption("waiting") parks the worker open — only the parent agent or an operator can wake it. Add conditions only when the agent itself knows what should fire the resume.
  • Snapshots survive process restart. Both the snapshot file (.harn/workers/worker_*.json) and any registered trigger conditions are durable. A fresh process can load the snapshot with resume_agent(snapshot) or harn run --resume <path> and continue. When a cloud session is bound, ResumeBy.cloud_harness keeps the registration in harn-cloud so the resume crosses machines.
  • Double-resume is detected. Concurrent resume_agent calls on the same handle raise HARN-SUS-006; the second caller observes the race and can retry against the now-running handle. See the suspend_double_resume_race conformance test for the exact contract.
  • Closing a suspended worker is terminal. close_agent(handle) on a suspended worker marks the snapshot rejected; a later resume_agent raises HARN-SUS-010. Closing is the correct call when the operator has decided the work is no longer relevant.
  • Graceful stop is a handoff, not a resume checkpoint. Use agent_stop(handle, {graceful: true}) when the parent should continue the work itself. Use suspend_agent/resume_agent when the same child should continue later.

See also

  • LLM quick reference — "Agent lifecycle: pause, resume, self-park" section.
  • Agent loops — full agent_loop reference.
  • Daemon stdlib — daemon-mode wrappers built on suspend/resume.
  • Agent channels — the durable pub/sub primitive that pairs naturally with agent_await_resumption(reason, conditions.trigger: {kind: "channel.emit", ...}).
  • ACP session/suspend RFC and A2A TaskState.PAUSED RFC — upstream protocol companions.
  • Conformance fixtures under conformance/tests/agents/suspend_*.harn, agent_await_resumption.harn, agent_top_level_await_resumption_cli.harn, and resume_by_*.harn exercise every contract documented here.