Agent lifecycle: suspend, resume, stop, self-park
Harn agents are not just one-shot calls — they are cooperatively schedulable units that can park mid-loop, persist a resumable snapshot, resume later, or hand unfinished child work back to a parent with a typed stop handoff. This page is the long-form reference for lifecycle primitives: when to reach for them, who owns the resume responsibility, and how transcript continuity or stop handoffs work across the boundary.
For a one-screen LLM quickref see the "Agent lifecycle: pause, resume,
self-park" section in docs/llm/harn-quickref.md. For the upstream protocol
contributions that mirror this surface see
ACP session/suspend and
A2A TaskState.PAUSED.
When to suspend
Suspend/resume is one of several ways Harn lets an agent yield. Pick the narrowest one that fits.
| Goal | Use | Why |
|---|---|---|
| End the loop because the work is done | Natural completion (status: "done") | No state to preserve. The transcript is final. |
| Cap tokens or wall-clock time | max_iterations, token_budget, autonomy_budget | Budgeting is policy, not a checkpoint. The loop terminates with status: "budget_exhausted" or status: "approval_required". |
| Wait for one specific external event | Suspend/resume with agent_await_resumption(reason, conditions) | Park the agent, declare what should wake it, keep the transcript. |
| Park indefinitely until an operator resumes by hand | Suspend/resume with agent_await_resumption(reason) (no conditions) | Parent agent or operator owns the resume; no trigger gets registered. |
| Park one agent while another runs to completion | Parent-driven subagent_pause(handle, reason) plus subagent_resume(handle) | The parent loop owns when the child resumes. |
| Stop a child and take over its unfinished work | agent_stop(handle, {graceful: true}) or parent-driven subagent_stop(handle) | The runtime folds the child transcript and descendant subagent summaries into a typed handoff artifact. |
| Background-run an agent and rejoin it later | spawn_agent({background: true}) + wait_agent(handle) | The agent is running, not parked. Use suspend/resume only when the agent itself should yield CPU. |
Two anti-patterns worth calling out:
- Suspend is not a sleep. The runtime does not poll for resume — it
registers conditions (when present) with the trigger dispatcher, persists
a snapshot, and exits the loop frame. A polling sleep keeps the worker
alive on a thread; suspend releases it. If you want fixed-cadence wakeup,
pass
conditions.timeout: {duration_minutes: N}. - Suspend is cooperative. A suspend request flips a flag that is honored at the next turn boundary. Tool calls and LLM requests in flight finish first. A worker that does not return to a turn boundary (e.g. an infinite shell command in a single tool) will never observe the request.
The lifecycle surface
The suspend/resume primitive ships as three builtin layers:
- Script-level —
suspend_agent,resume_agent,agent_stop,parse_resume_conditions,agent_await_resumptionfromstd/agent/workers. Operators and parent pipelines call these directly. - Model-facing tools —
agent_loop(...)automatically exposesagent_await_resumptionas a callable tool to the model so an agent can self-park. Passsubagents: trueto also exposesubagent_pause,subagent_resume, andsubagent_stop. - Resume-responsibility callbacks —
ResumeBy.parent_llm,ResumeBy.local_runtime,ResumeBy.cloud_harness, andResumeBy.pipeline_drainfromstd/agent/resume_byname who owns resuming the suspended agent. The runtime callsdefault_resume_by(...)when the script does not supply one.
Builtins
| Function | Purpose |
|---|---|
suspend_agent(worker, reason?, options?) | Cooperatively suspend a running worker. Persists a resumable snapshot, emits a WorkerSuspended lifecycle event, returns status: "suspended" with suspension metadata. Idempotent on already-suspended workers. |
resume_agent(worker_or_snapshot, input?, continue_transcript?) | Resume a suspended worker, optionally with new input. continue_transcript defaults to true (full transcript preserved); pass false to resume from prior summary plus new input only. Accepts either a live handle or a persisted snapshot dict. |
agent_stop(worker, options?) | Stop a worker. The default is a hard cancel; {graceful: true} emits WorkerStopped and returns {status: "stopped", handoff, children, handoffs, worker} where handoff is a normalized handoff artifact and children recursively folds descendant subagent handoffs. |
agent_await_resumption(reason, conditions?, resume_by?) | Build the normalized lifecycle-tool request. Inside an agent_loop running as a worker, the loop intercepts this call structurally and routes it through the same suspend path as suspend_agent. |
parse_resume_conditions(conditions?) | Validate and normalize a ResumeConditions dict without spawning a worker. Useful for input validation in handlers and pre-flight checks. |
agent_lifecycle_tools(registry?, options?) | Decorate a tool registry with agent_await_resumption (always) and, when subagents: true, subagent_pause / subagent_resume / subagent_stop. agent_loop calls this internally; scripts compose it when they build registries by hand. |
Resume conditions
ResumeConditions is the shared shape consumed by both agent_await_resumption
and spawn_agent({options: {resume_when: ...}}):
{
trigger?: TriggerSpec, // any std/triggers trigger spec
timeout?: {duration_minutes: int, on_timeout?: string},
on_event?: string, // EventLog topic name
}
All three fields are optional. When all are omitted the worker is parked
"open" — only the parent agent, an operator, or resume_agent(...) can
wake it. parse_resume_conditions(nil) returns nil; bad input raises
HARN-SUS-002 with the failing field path.
timeout.on_timeout defaults to "resume_with_summary" and accepts
"fail" or "resume_with_input". trigger is validated by the same
trigger-spec parser used by trigger_register(...), so any provider that
works as a trigger source works as a resume condition.
Self-park mid-loop
When agent_loop(...) exposes agent_await_resumption as a model tool,
the model can park itself between turns. The loop intercepts the call
before normal tool dispatch, validates conditions, persists a snapshot,
and returns a structured result to the parent (or the direct caller, for
top-level loops):
import { agent_await_resumption } from "std/agent/workers"
let result = agent_loop("Wait for the maintainer's review.", nil, {
provider: "openai",
model: "gpt-5",
tool_format: "native",
// agent_await_resumption is registered automatically.
})
// The model decided to park; `result.status == "suspended"`.
if result.status == "suspended" {
log(result.reason) // model-supplied
log(result.initiator) // "self"
log(result.conditions?.timeout?.duration_minutes)
log(result.handle.snapshot_path) // persisted snapshot
}
status: "suspended" carries handle, reason, initiator (one of
"self", "parent", "operator", "triggered"), conditions, and
iterations_completed (turns finished before the park). The handle is a
resumable worker reference; the snapshot is a JSON document on disk.
Restore the snapshot from the CLI in a fresh process:
harn run --resume .harn/workers/worker_01HF...json
harn run --resume <path> rehydrates the worker, replays the resume-
continuity reminder onto the next turn, and continues the loop in the same
session.
Parent-driven pause, resume, and stop
For multi-agent setups, the parent loop owns pause/resume of its children.
Pass subagents: true (or subagent_tools: true) to agent_loop(...) to
expose subagent_pause, subagent_resume, and subagent_stop as
model-callable tools:
import { agent_lifecycle_tools } from "std/agent/workers"
let parent_registry = agent_lifecycle_tools(tool_registry(), {subagents: true})
agent_loop("Coordinate the review.", nil, {
tools: parent_registry,
subagents: true,
provider: "openai",
})
The parent's model can now call:
| Tool | Effect |
|---|---|
subagent_pause(handle, reason) | Pause a running child after its current turn settles. Idempotent on already-suspended children. |
subagent_resume(handle, input?, continue_transcript? = true) | Resume a suspended child with optional new input. |
subagent_stop(handle, graceful? = true, reason?) | Stop a child. Graceful mode is the default and returns a recursive typed handoff summary; graceful: false preserves hard-cancel behavior. |
Scripts can drive the same lifecycle directly without going through the model:
let handle = sub_agent_run("Draft the changelog.", {
background: true,
provider: "openai",
})
let snapshot = suspend_agent(handle, "operator pulled context")
// ... do other work ...
let resumed = resume_agent(handle, "Pick up where you left off.")
let final = wait_agent(handle)
log(final.status) // "done"
log(final.has_transcript) // true — transcript continuity preserved
subagent_pause and subagent_resume emit tool_call_audit telemetry
with initiator: "parent" so the trust graph can distinguish parent-
driven pauses from self-parks.
subagent_stop emits the same audit telemetry and, in graceful mode, returns
a handoff artifact shaped for parent takeover. handoff.metadata includes
the child session_id, workspace anchor, token budget/usage, snapshot path,
and child_handoffs for recursively stopped descendant subagents.
Conditioned resume
Pair agent_await_resumption with conditions to declare what should
wake the parked agent. The runtime registers the trigger (if any) with
the local dispatcher and persists the conditions alongside the snapshot
so they survive restart:
import { agent_await_resumption } from "std/agent/workers"
// Self-park until either a review approval lands OR 30 minutes pass.
let request = agent_await_resumption("waiting on review", {
trigger: {
kind: "review.approved",
provider: "github",
match: {events: ["review.approved"]},
},
timeout: {
duration_minutes: 30,
on_timeout: "resume_with_summary",
},
})
on_timeout semantics:
"resume_with_summary"(default) — resume with a system reminder summarizing what happened during the wait."fail"— terminate the worker withstatus: "failed"."resume_with_input"— resume with a caller-supplied input string.
spawn_agent({options: {resume_when: ...}}) accepts the same shape so a
worker can be born already parked:
import { parse_resume_conditions, spawn_agent } from "std/agent/workers"
let resume_when = parse_resume_conditions({
trigger: {
kind: "channel.emit",
provider: "channel",
match: {events: ["channel:pr.merged"]},
},
timeout: {duration_minutes: 60, on_timeout: "fail"},
})
spawn_agent({
task: "Cut a release once the PR merges.",
node: {
kind: "subagent",
mode: "llm",
model_policy: {provider: "openai", model: "gpt-5"},
output_contract: {output_kinds: ["summary"]},
},
options: {resume_when: resume_when},
})
Resume responsibility (ResumeBy.*)
Every suspension has exactly one resume owner. The optional resume_by
argument on agent_await_resumption names that owner with a callback:
| Callback | When it runs |
|---|---|
ResumeBy.parent_llm | Parent agent's LLM resumes the child via subagent_resume. The default when no conditions are present. |
ResumeBy.local_runtime | Local trigger dispatcher registers conditions.trigger and calls resume_agent on fire. The default when conditions are present and no cloud session is bound. |
ResumeBy.cloud_harness | The harn-cloud webhook receiver persists the suspension across process restart and resumes when the configured event lands. Selected when a cloud session is bound. |
ResumeBy.pipeline_drain | The enclosing pipeline's drain step owns the resume responsibility. Use inside lifecycle combinators. |
Each callback is a pure closure (harness, suspension) -> {handled, mechanism}
so they compose with first_available(...), compose(...), and the rest of
std/lifecycle/combinators:
import { ResumeBy, first_handled } from "std/agent/resume_by"
let RB = ResumeBy()
let chain = first_handled([RB.cloud_harness, RB.local_runtime, RB.parent_llm])
let request = agent_await_resumption(
"wait for upstream merge",
{trigger: {kind: "channel.emit", provider: "channel", match: {events: ["channel:upstream.merged"]}}},
chain,
)
default_resume_by(...) mirrors the runtime's default policy and is
publicly exported so scripts can call it explicitly:
conditions == nil→ResumeBy.parent_llmconditions != nil, no cloud session →ResumeBy.local_runtimeconditions != nil, cloud session →first_handled([cloud_harness, local_runtime])
Every resolved dispatch emits a resume_by_dispatched audit through
harness.emit_audit. Declined entries (e.g. cloud_harness with no
cloud session) emit resume_by_declined. Both surface through
lifecycle_audit_log_take() for tests and observability.
Transcript continuity
By default a resumed worker carries its full transcript forward. The
runtime injects a single-shot system_reminder with dedupe_key: "resume_continuity" describing the gap, the suspend reason, and (when
available) what fired the resume. The reminder is consumed on the next
turn and does not re-apply on subsequent suspends.
Set continue_transcript: false on resume_agent(...) to drop the
recorded turns and restart from the prior summary plus new input:
// Discard the long deliberation transcript; restart from a clean state
// with a fresh prompt.
resume_agent(handle, "Try a completely different approach.", false)
This is the right call when the previous deliberation hit a dead-end and you want the model to think from first principles, not "patch over" the prior reasoning.
Top-level loops and --resume
A root agent_loop(...) — one called from a pipeline or harn run
script, not as a child of another agent — uses the same suspend path. On
self-park the runtime persists a snapshot under .harn/workers/worker_*.json
and returns status: "suspended" to the caller. The CLI can cold-restore
that snapshot in any subsequent process:
# First run: agent parks itself, prints the snapshot path.
harn run scripts/triage.harn
# status=suspended
# snapshot=.harn/workers/worker_01HFEX...json
# Later, in a fresh process:
harn run --resume .harn/workers/worker_01HFEX...json --json
# {"event_type": "done", "value": {"status": "done", ...}}
--resume accepts both absolute and script-relative snapshot paths. The
resumed loop emits its final value as a JSON event when --json is set
so callers can pipeline it without parsing prose output.
Daemon idle is a degenerate case
agent_loop(..., {daemon: true}) and the
daemon stdlib wrappers build on the same primitive.
When a daemon idles waiting for a trigger event or a wake-interval tick,
the stdlib records an agent_await_resumption request internally and
parks the loop. Daemon-specific snapshot fields
(pending_event_count, queued_event_count, inflight_event,
wake_interval_ms, watch_paths) are persisted alongside the standard
suspend metadata.
That means daemons can be cold-restored with daemon_resume(path) the
same way a parked agent can be cold-restored with harn run --resume.
See Agent loops — Daemon stdlib wrappers for the daemon-specific surface.
Diagnostic codes
| Code | Meaning |
|---|---|
HARN-SUS-001 | suspend_agent was called on a worker that is not running. |
HARN-SUS-002 | ResumeConditions validation failed. |
HARN-SUS-003 | resume_agent was called on a live worker that is not suspended. |
HARN-SUS-004 | A resume snapshot was missing, stale, unreadable, or version-incompatible. |
HARN-SUS-005 | agent_await_resumption was invoked outside agent_loop structural handling (e.g. called directly as a tool handler). |
HARN-SUS-006 | A concurrent resume changed the worker before resume completed. |
HARN-SUS-007 | conditions.trigger could not be registered with the trigger dispatcher. |
HARN-SUS-008 | A timeout fired or was configured with an unsupported on_timeout action. |
HARN-SUS-009 | Resume input failed agent_loop input validation. |
HARN-SUS-010 | A worker closed while suspended rejected a later resume. |
The diagnostic catalog at Diagnostic codes catalog is
authoritative; the table above lists only the HARN-SUS-* namespace.
Gotchas
- Suspend is cooperative, not preemptive.
suspend_agent(...)flips a flag; the worker exits the loop frame at the next turn boundary, not inside a tool call or LLM request. Workers that never return to a turn boundary will never observe the request. Cap long-running tools withtool_call_timeout(see Long-running tools). - Conditions are optional. A bare
agent_await_resumption("waiting")parks the worker open — only the parent agent or an operator can wake it. Addconditionsonly when the agent itself knows what should fire the resume. - Snapshots survive process restart. Both the snapshot file
(
.harn/workers/worker_*.json) and any registered trigger conditions are durable. A fresh process can load the snapshot withresume_agent(snapshot)orharn run --resume <path>and continue. When a cloud session is bound,ResumeBy.cloud_harnesskeeps the registration in harn-cloud so the resume crosses machines. - Double-resume is detected. Concurrent
resume_agentcalls on the same handle raiseHARN-SUS-006; the second caller observes the race and can retry against the now-running handle. See thesuspend_double_resume_raceconformance test for the exact contract. - Closing a suspended worker is terminal.
close_agent(handle)on a suspended worker marks the snapshot rejected; a laterresume_agentraisesHARN-SUS-010. Closing is the correct call when the operator has decided the work is no longer relevant. - Graceful stop is a handoff, not a resume checkpoint. Use
agent_stop(handle, {graceful: true})when the parent should continue the work itself. Usesuspend_agent/resume_agentwhen the same child should continue later.
See also
- LLM quick reference — "Agent lifecycle: pause, resume, self-park" section.
- Agent loops — full
agent_loopreference. - Daemon stdlib — daemon-mode wrappers built on suspend/resume.
- Agent channels — the durable pub/sub primitive
that pairs naturally with
agent_await_resumption(reason, conditions.trigger: {kind: "channel.emit", ...}). - ACP
session/suspendRFC and A2ATaskState.PAUSEDRFC — upstream protocol companions. - Conformance fixtures under
conformance/tests/agents/—suspend_*.harn,agent_await_resumption.harn,agent_top_level_await_resumption_cli.harn, andresume_by_*.harnexercise every contract documented here.