LLM calls

llm_call

Make a single LLM request. Harn normalizes provider responses into a canonical dict so product code does not need to parse provider-native message shapes.

let result = llm_call("What is 2 + 2?")
log(result.text)

With a system message:

let result = llm_call(
  "Explain quicksort",
  "You are a computer science teacher. Be concise."
)
log(result.text)

With options:

let result = llm_call(
  "Translate to French: Hello, world",
  "You are a translator.",
  {
    provider: "openai",
    model: "gpt-4o",
    max_tokens: 1024
  }
)
log(result.text)

With image or video content:

import { image_content, video_content } from "std/llm/media"

let result = llm_call("", nil, {
  provider: "minimax",
  model: "MiniMax-M3",
  messages: [{
    role: "user",
    content: [
      {type: "text", text: "Summarize these inputs."},
      image_content("diagram.png", {detail: "auto"}),
      video_content("demo.mp4"),
    ],
  }],
})
log(result.text)

Image blocks use the provider-neutral shape {type: "image", url?: string, base64?: string, media_type: string, detail?: "low"|"high"|"auto"}. Exactly one of url or base64 is required. Harn translates it to Anthropic source, OpenAI image_url, Gemini inline_data/file_data, or Ollama images fields at the provider boundary. Ollama's REST API only accepts base64 image data, so url image blocks are rejected for provider: "ollama". std/llm/media also provides image_message(...) and image_vision_context(...) helpers when a harness wants the same image as both LLM content and deterministic vision_ocr(...) context.

Video blocks use the provider-neutral shape {type: "video", url?: string, base64?: string, media_type: string}. Exactly one of url or base64 is required. Harn translates video blocks to OpenAI-compatible video_url content, and to Gemini inline_data/file_data parts for routes that declare video support. std/llm/media also provides video_message(...).
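The provider-neutral shapes above can also be written by hand rather than through the std/llm/media helpers. The following is a minimal sketch, assuming raw blocks are accepted verbatim inside messages content; the base64 string is a placeholder, not real image data:

```harn
// Placeholder: base64-encoded PNG bytes would go here.
let png_base64 = "..."

// Ollama rejects url image blocks, so the image is inlined as base64.
let diagram = {
  type: "image",
  base64: png_base64,
  media_type: "image/png",
}

let result = llm_call("", nil, {
  provider: "ollama",
  messages: [{
    role: "user",
    content: [
      {type: "text", text: "Describe this diagram."},
      diagram,
    ],
  }],
})
log(result.text)
```

Switching the block to a url field would suit providers that fetch remote images, but per the note above it would be rejected for provider: "ollama".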

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | yes | The user message |
| system | string | no | System message for the model |
| options | dict | no | Provider, model, and generation settings |

Return value

llm_call always returns a dict:

| Field | Type | Description |
| --- | --- | --- |
| text | string | The text content of the response |
| visible_text | string | Human-visible assistant output |
| model | string | The model used |
| provider | string | Canonical provider identifier |
| input_tokens | int | Input/prompt token count |
| output_tokens | int | Output/completion token count |
| cache_read_tokens | int | Prompt tokens served from provider-side cache when supported |
| cache_write_tokens | int | Prompt tokens written into provider-side cache when supported |
| cache_creation_input_tokens | int | Anthropic-compatible alias for cache_write_tokens |
| cache_hit_ratio | float | Fraction of prompt tokens served from provider-side cache |
| cache_savings_usd | float | Estimated prompt-cache savings versus full input-token price; the value can be negative when cache writes cost more than normal input |
| served_fast | bool | true when the provider confirmed it served this request at the accelerated ("fast mode") tier; drives premium-tier billing |
| usage | dict | Token and prompt-cache accounting fields, including the cache fields above and served_fast |
| data | any | Parsed JSON (when response_format: "json") |
| tool_calls | list | Tool calls (when the model uses tools) |
| thinking | string | Reasoning trace (when thinking is enabled) |
| private_reasoning | string | Provider reasoning metadata kept separate from visible text |
| blocks | list | Canonical structured content blocks across providers |
| logprobs | list | Token log probability records when requested and returned by the provider |
| stop_reason | string | One of "end_turn", "max_tokens", "tool_use", "stop_sequence" |
| provider_response_id | string | Provider-native response id when available, such as OpenAI Responses resp_* |
| transcript | dict | Transcript carrying message history, events, summary, metadata, and id |
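
As an illustration of reading the result dict, the sketch below checks for truncation and logs the accounting fields. It uses only fields listed in the table; the == comparison is assumed to behave as in other C-style languages, since this page does not show it:

```harn
let result = llm_call("Summarize the release notes.", nil, {max_tokens: 256})

// A "max_tokens" stop reason means the reply hit the output limit.
if result.stop_reason == "max_tokens" {
  log("response truncated after ${result.output_tokens} output tokens")
}

log("model: ${result.model} (${result.provider})")
log("tokens in/out: ${result.input_tokens}/${result.output_tokens}")
log("cache hit ratio: ${result.cache_hit_ratio}")
```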

Options dict

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| provider | string | "anthropic" | Any configured provider. Built-in names include "anthropic", "openai", "openrouter", "huggingface", "ollama", "gemini", and "local" |
| model | string | varies by provider | Model identifier |
| model_role | string | nil | Fill missing call options from [model_roles.<role>] before normal provider/model/routing resolution. Explicit call options win. The merge/fast_apply roles also read HARN_LLM_MERGE_* and HARN_LLM_FAST_APPLY_* provider/model/route-policy overrides. |
| max_tokens | int | 16384 | Maximum tokens in the response |
| temperature | float | provider default | Sampling temperature (0.0-2.0) |
| top_p | float | nil | Nucleus sampling |
| top_k | int | nil | Top-K sampling (Anthropic/Ollama only) |
| stop | list | nil | Stop sequences |
| seed | int | nil | Reproducibility seed (OpenAI/Ollama) |
| frequency_penalty | float | nil | Frequency penalty (OpenAI only) |
| presence_penalty | float | nil | Presence penalty (OpenAI only) |
| logprobs | bool | false | Request token log probabilities when the selected provider route supports them |
| top_logprobs | int | nil | Request top alternative token log probabilities where supported |
| response_format | string | "text" | "text" or "json"; with output_schema/json_schema, "json" selects schema-validated JSON rather than loose JSON-object mode |
| output_format | string/dict | {kind: "text"} | Provider-neutral output shape: "text", "json_object", or {kind: "json_schema", schema, strict?} |
| schema / json_schema / output_schema | dict | nil | JSON Schema, OpenAPI Schema Object, canonical Harn schema dict, or Schema<T> type alias for structured output |
| output_validation | string | "off" | "error" throws after exhausted schema retries; "warn" logs and returns the final envelope; "off" returns the final envelope without a warning |
| schema_retries | int | 1 | Re-prompt on schema validation failure with a corrective user message. Applies to direct and routing_policy calls. |
| schema_stream_abort | bool | inferred | Defaults to true when output_schema is set. Aborts impossible streaming JSON early and consumes one schema_retries slot. |
| llm_retries | int | 0 | (deprecated; prefer with_retry from std/llm/handlers) Retries on transient HTTP / provider errors. Raw llm_call is fail-fast by default; set to N to allow N retries after the first attempt. Mind the off-by-one: llm_retries: 3 is equivalent to with_retry(..., {max_attempts: 4}) |
| llm_backoff_ms | int | 250 | (deprecated; prefer with_retry) Base exponential backoff in ms between LLM retries |
| reasoning_policy / thinking_policy | string/bool | nil | Provider-aware reasoning policy. Values: auto, off, minimal, low, medium, high, xhigh; none, disabled, and no_think alias to off. Harn lowers this to the selected route's native thinking shape. Explicit thinking or reasoning_effort wins. |
| reasoning_scale / problem_scale | string | "medium" | Scale hint used when reasoning_policy is "auto": small, medium, or large. |
| reasoning_task | string | inferred | Task hint used when reasoning_policy is "auto": chat, agent, code, verify, or summarize. |
| thinking | bool/dict | nil | Enable typed provider reasoning. true and {budget_tokens: N} remain shorthand for {mode: "enabled"}; use {mode: "enabled", budget_tokens: N}, {mode: "adaptive"}, or {mode: "effort", level: L} where L is one of "none", "minimal", "low", "medium", "high", "xhigh". On Anthropic Opus models that declare interleaved-thinking support, {mode: "enabled"} also sends anthropic-beta: interleaved-thinking-2025-05-14. When thinking: false is set on a model whose chat template uses an in-prompt directive (Qwen3's /no_think), Harn auto-prepends the directive to the system message, so thinking: false works uniformly across providers without scripts needing to know per-template prompt syntax. |
| interleaved_thinking | bool | false | Add Anthropic's interleaved-thinking-2025-05-14 beta header for this call. thinking: true enables it automatically on supported Anthropic Opus models. |
| anthropic_beta_features | string/list | nil | Extra Anthropic beta feature names to pass in the comma-separated anthropic-beta header on Anthropic-style routes. |
| vision | bool | inferred | Require image-input support. Image content blocks set this implicitly; vision: true fails before transport unless the selected provider/model declares vision_supported. |
| tools | list | nil | Tool definitions |
| tool_choice | string/dict | "auto" | "auto", "none", "required", or {name: "tool"} |
| tool_search | bool/string/dict | nil | Progressive tool disclosure. See Tool Vault |
| api_mode | string | "chat_completions" | OpenAI only: set "responses" to use Harn's native OpenAI Responses path. Generic OpenAI-compatible providers stay on chat completions. |
| provider_tools / hosted_tools | list | nil | OpenAI Responses only. Pass provider-hosted tools such as {type: "web_search"}, {type: "file_search", ...}, or {type: "mcp", server_label, server_url, require_approval}. Harn records provider-native IDs and normalized metadata but does not execute these tools locally. |
| previous_response_id | string | nil | OpenAI Responses conversation-state link. Use only when provider-side state is desired instead of replaying the full Harn transcript. |
| response_store / responses_store | bool | provider default | OpenAI Responses persistence flag. A bool store is also accepted for direct raw Responses calls, but cache handlers reserve store: {backend...} for cache storage configuration. |
| background | bool | provider default | OpenAI Responses background-mode flag. |
| truncation | string | provider default | OpenAI Responses provider-side truncation/compaction policy such as "auto". |
| compact | bool | false | OpenAI Responses standalone compaction. When true, Harn posts the request to /responses/compact and returns provider compaction items in result.blocks. |
| include | list | nil | OpenAI Responses metadata expansions to request. |
| max_tool_calls | int | nil | OpenAI Responses provider-executed tool-call limit. |
| budget | dict | nil | Pre-flight LLM budget envelope. Supports max_cost_usd, max_input_tokens, max_output_tokens, and total_budget_usd |
| cache | bool | false | Enable prompt caching (Anthropic) |
| fast | bool | false | Opt into the model's accelerated-serving ("fast mode") tier. Maps to the per-provider knob declared in the catalog (speed for Anthropic, service_tier for OpenAI) and injects the Anthropic beta header when required. Rejected for models with no fast_mode tier or a deprecated one. Billed at the catalog's premium fast_mode.pricing only when the provider confirms it served fast (result.served_fast). speed: "fast" is accepted as an alias. |
| stream | bool | true | Use streaming SSE transport. Set false for synchronous request/response. Env: HARN_LLM_STREAM |
| timeout | int | 120 | Request timeout in seconds. timeout_ms is accepted as an alias and rounded up to whole seconds (HTTP transports take Duration::from_secs); sub-second budgets must be enforced at the caller. |
| messages | list | nil | Full message list (overrides prompt) |
| structural_experiment | string/dict/closure | nil | Prompt-structure transform applied immediately before the provider call. Built-ins: prompt_order_permutation(seed: N), doubled_prompt, chain_of_draft, inverted_system. Env: HARN_STRUCTURAL_EXPERIMENT |
| transcript | dict | nil | Continue from a previous transcript; prompt is appended as the next user turn |
| model_tier | string | nil | Resolve a configured tier alias such as "small", "mid", or "frontier" |
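
To show how the structured-output options combine on a plain llm_call, here is a minimal sketch. The schema and prompt are made up for illustration; the option names are the ones listed in the table:

```harn
// Hypothetical schema for a one-field classification result.
let schema = {
  type: "object",
  required: ["sentiment"],
  properties: {
    sentiment: {type: "string"},
  },
}

let result = llm_call("Classify: 'The update broke my build.'", nil, {
  response_format: "json",
  output_schema: schema,
  // Throw if the output still fails validation after the retries.
  output_validation: "error",
  schema_retries: 2,
  temperature: 0.0,
})

// With response_format: "json", the parsed payload is on .data.
log(result.data.sentiment)
```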

Model roles are ordinary option defaults, so they compose with the existing routing layer instead of bypassing it:

[model_roles.merge]
provider = "ollama"
model = "devstral-small-2"
temperature = 0.0
route_policy = "manual"

let merged = llm_call(prompt, sys, {model_role: "merge", output_schema: schema})

The cache option above enables provider-side prompt caching when a provider supports it. It does not memoize full LLM responses. For Harn-owned response caching, import with_cache from std/llm/handlers:

import { with_cache } from "std/llm/handlers"

let result = with_cache("Summarize this file", nil, {
  provider: "anthropic",
  model: "claude-haiku-4-5",
  store: {backend: "sqlite", namespace: "summaries"},
  ttl: "10m",
  max_entries: 256,
})

with_cache returns the same envelope as llm_call. Its key is content-addressed as sha256: over canonical JSON for {prompt, system, provider, model, temperature, top_p, max_tokens} after defaults resolve. The default store is sqlite under Harn state, namespace llm.with_cache, TTL 10 minutes, and LRU size 256. Use store: {backend: "fs", namespace, path?} for one-file-per-entry storage. Calls with tools bypass the cache by default; set skip_when to a bool or predicate closure to override that policy.
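
A short sketch of the fs backend and the content-addressed key follows. It assumes that two calls with identical prompt, system, provider, model, and sampling options inside the TTL resolve to the same key, so the second call should be served from the store; the namespace is illustrative:

```harn
import { with_cache } from "std/llm/handlers"

// Shared options so both calls resolve to the same cache key.
let opts = {
  provider: "anthropic",
  model: "claude-haiku-4-5",
  temperature: 0.0,
  // One file per entry; path is omitted so Harn chooses the location.
  store: {backend: "fs", namespace: "summaries"},
  ttl: "10m",
}

let first = with_cache("Summarize this file", nil, opts)
// Same key inputs within the TTL, so this is expected to be a cache hit.
let second = with_cache("Summarize this file", nil, opts)
assert_eq(first.text, second.text)
```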

Provider-specific overrides can be passed as sub-dicts:

let result = llm_call("hello", nil, {
  provider: "ollama",
  ollama: {num_ctx: 32768}
})

OpenAI Responses mode

Set api_mode: "responses" with provider: "openai" when the call should use OpenAI's native Responses API instead of the generic /chat/completions adapter:

let result = llm_call("Search and summarize current docs.", nil, {
  provider: "openai",
  model: "gpt-5.4",
  api_mode: "responses",
  output_format: {kind: "json_schema", schema: summary_schema, strict: true},
  provider_tools: [
    {type: "web_search"},
    {type: "mcp", server_label: "docs", server_url: "https://mcp.example.com", require_approval: "always"},
  ],
  truncation: "auto",
  max_tool_calls: 4,
})

Use normal Harn tools when Harn should execute, approve, and audit a tool or MCP server locally. Use provider_tools only when the provider should execute a hosted tool or remote MCP connector. Provider-executed calls appear in result.blocks, transcript provider_payload.blocks, and provider_response_id metadata with executor: "provider_native" and the provider-native IDs. Set compact: true for a standalone Responses compaction pass; Harn records the opaque compaction items as private blocks so later turns can explicitly choose whether to feed the compacted provider window back as input.
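
When provider-side conversation state is wanted, a follow-up call can be linked through previous_response_id. This is a sketch under the assumption that the provider retained the first response (the response_store provider default) and that provider_response_id was populated:

```harn
let first = llm_call("Draft a migration plan.", nil, {
  provider: "openai",
  model: "gpt-5.4",
  api_mode: "responses",
})

// Link the follow-up to provider-side state instead of replaying
// the full Harn transcript.
let followup = llm_call("Shorten the plan to five bullets.", nil, {
  provider: "openai",
  model: "gpt-5.4",
  api_mode: "responses",
  previous_response_id: first.provider_response_id,
})
log(followup.text)
```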

Structural experiments can be enabled directly on a call:

let result = llm_call("Instruction\n\nContext block", nil, {
  provider: "mock",
  structural_experiment: "prompt_order_permutation(seed: 42)",
})

For custom transforms, pass a closure (or a std/experiments.custom(...) spec) that rewrites {messages, system} and returns either nil, a new message list, or {messages?, system?, metadata?}.

llm_call_structured

llm_call_structured(prompt, schema, options?) is the ergonomic helper for the "ask for JSON against this schema, retry on validation failure, return just the parsed data" pattern. It wraps llm_call and pre-applies the schema-validated-JSON defaults so callsites stop repeating the same four options.

let schema = {
  type: "object",
  required: ["name", "age"],
  properties: {
    name: {type: "string"},
    age: {type: "integer"},
  },
}
let person = llm_call_structured(
  "Extract the speaker's name and age from the transcript.",
  schema,
  {provider: "anthropic", system: "You are precise."},
)
log(person.name)
log(person.age)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | yes | The user message |
| schema | dict or Schema<T> | yes | JSON Schema dict or a type alias in value position. When passed a Schema<T> the return narrows to T. |
| options | dict | no | Any option llm_call accepts, plus system (lifted into the system-message slot) and retries (alias for schema_retries) |

Return value

The validated data payload, typed as T when the schema is a Schema<T>. Throws on exhausted schema retries or transport failure — callers can assume the return matches the schema.

The {response_format: "json", output_validation: "error", schema_retries: 3} defaults are applied unless the caller overrides them in options.
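
Overriding one of those defaults might look like the sketch below, which lowers the retry count through the retries alias and pins the temperature. The schema is illustrative:

```harn
let schema = {
  type: "object",
  required: ["name"],
  properties: {
    name: {type: "string"},
  },
}

let person = llm_call_structured(
  "Extract the speaker's name from the transcript.",
  schema,
  {
    provider: "anthropic",
    system: "You are precise.",
    // retries is the alias for schema_retries; 1 replaces the default of 3.
    retries: 1,
    temperature: 0.0,
  },
)
log(person.name)
```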

Non-throwing variant

llm_call_structured_safe(prompt, schema, options?) returns the {ok, data, error} envelope (mirroring llm_call_safe but with the validated .data pre-unwrapped) instead of throwing:

let r = llm_call_structured_safe(prompt, schema, {provider: "openai"})
if !r.ok {
  log("structured call failed:", r.error.category, r.error.message)
  return nil
}
let person = r.data

r.error.category is one of the canonical ErrorCategory strings ("rate_limit", "timeout", "schema_validation", "auth", "budget_exceeded", "transient_network", "generic", …) — match on the category instead of string-sniffing the message.
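
Matching on the category could be sketched as follows. The prompt and schema are placeholders, the handling in each branch is only a log line, and the == comparison is assumed to behave as in other C-style languages:

```harn
let prompt = "Extract the speaker's name from the transcript."
let schema = {
  type: "object",
  required: ["name"],
  properties: {name: {type: "string"}},
}

let r = llm_call_structured_safe(prompt, schema, {provider: "openai"})
if !r.ok {
  // Branch on the canonical category, not on the message text.
  if r.error.category == "rate_limit" {
    log("rate limited; try again later")
  }
  if r.error.category == "schema_validation" {
    log("model output never matched the schema:", r.error.message)
  }
}
```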

Diagnostic envelope variant

llm_call_structured_result(prompt, schema, options?) returns the full failure-mode envelope production agent pipelines need, so callers can keep raw model text, attempt counts, and validation / repair state without hand-rolling parse / repair chains. It never throws on transport or schema failures — ok: false plus error_category distinguishes the failure mode.

let r = llm_call_structured_result(prompt, schema, {
  provider: "auto",
  schema_retries: 2,
  // Optional repair pass — runs only on malformed JSON or
  // schema-invalid output. Skipped on transport failures.
  repair: {
    enabled: true,
    model: "cheapest_over_quality(low)",
    max_tokens: 600,
  },
})
if r.ok {
  let person = r.data
  // ...
} else {
  log("structured call failed:", r.error_category, "raw:", r.raw_text)
}

Envelope fields:

| Field | Type | Description |
| --- | --- | --- |
| ok | bool | true when the parsed payload validated against the schema. |
| data | T or nil | Validated payload, or nil on failure. Narrows to T when schema: Schema<T>. |
| raw_text | string | Final attempt's raw model text. Preserved on failure for offline diagnostics or manual repair. |
| error | string | Human-readable error message (empty on success). |
| error_category | string or nil | nil on success. On failure, either a transport-class category (rate_limit, timeout, auth, transient_network, …) or one of missing_json, schema_validation, repair_failed. |
| attempts | int | Number of model calls made. 1 = no retries; 2+ = schema retries kicked in. 0 only when arg parsing failed before any call. |
| repaired | bool | true when the repair pass produced valid JSON. |
| extracted_json | bool | true when JSON had to be lifted from prose / markdown fences. |
| usage | {input_tokens, output_tokens, cache_read_tokens, cache_write_tokens, cache_creation_input_tokens, cache_hit_ratio, cache_savings_usd, served_fast} | Token and prompt-cache accounting from the final attempt. |
| model | string | Model that produced the final attempt. |
| provider | string | Provider that produced the final attempt. |

Repair-pass semantics:

  • The repair block is recognized only by llm_call_structured_result. Pass repair: {enabled: true, ...} to enable it; presence of the dict implies opt-in.
  • Repair runs at most once, with schema_retries: 0, only when the main call ended with malformed JSON or schema-invalid output. It is skipped on transport failures because there is no raw text to salvage.
  • Override keys (model, provider, max_tokens, system, …) are merged onto the main call's options for the repair attempt.
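
The diagnostic fields can be used to tell a clean first-pass result from a repaired one. A minimal sketch, with a placeholder prompt and schema and only the envelope fields documented above:

```harn
let prompt = "Extract the speaker's name from the transcript."
let schema = {
  type: "object",
  required: ["name"],
  properties: {name: {type: "string"}},
}

let r = llm_call_structured_result(prompt, schema, {
  schema_retries: 1,
  // Repair override keys are merged onto the main call's options.
  repair: {enabled: true, max_tokens: 600},
})

if r.ok {
  if r.repaired {
    log("payload came from the repair pass; attempts:", r.attempts)
  }
  if r.extracted_json {
    log("JSON was lifted out of surrounding prose or fences")
  }
} else {
  log("failed:", r.error_category, "attempts:", r.attempts)
}
```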

When to use which helper

  • Product code that needs just the parsed payload: prefer llm_call_structured. It removes the output_validation, schema_retries, response_format, and .data noise from every callsite.
  • Code that also needs token counts, transcript, thinking traces, or to pass a pre-built transcript: call llm_call directly and read .text / .data / .input_tokens / etc. off the full result dict.
  • Call sites that prefer explicit branching over try blocks: llm_call_structured_safe (the non-throwing envelope).
  • Production agent pipelines that need raw-text retention, attempt counts, and an optional repair pass on malformed JSON: llm_call_structured_result — replaces the llm_call → response.data → safe_parse → json_extract → repair → schema_check chain that downstream callers would otherwise hand-roll.

Composable callers

agent_loop accepts an llm_caller: option — a closure that owns each turn's llm_call(...). Wrap it with middleware from std/llm/handlers (retry / fallback / shadow / logging / budget / cache / circuit breaker) to compose resilience without forking the loop:

import {default_llm_caller, with_retry} from "std/llm/handlers"

let caller = with_retry(default_llm_caller(), {max_attempts: 4})

let result = agent_loop(task, system, {
  loop_until_done: true,
  llm_caller: caller,
})

Caller contract:

fn(call) -> {ok: true, value: <llm dict>}
          | {ok: false, status: <reserved>, error?: any, retryable?: bool}
//   call = {prompt, system, opts, turn: {iteration, session_id, attempt}}

with_retry's max_attempts: N counts total attempts. Migrating llm_retries: K (deprecated): pass max_attempts: K + 1 — the legacy option counted retries after the first attempt.

See Composable callers and middleware for the full module catalog (handlers, ensemble, refine, budget, defaults, safe, prompts, catalog).

llm_completion

Use llm_completion for text continuation and fill-in-the-middle generation. It lives at the same abstraction level as llm_call.

let result = llm_completion("let total = ", ";", nil, {
  provider: "ollama",
  model_tier: "small"
})
log(result.text)

Cost tracking

Harn provides builtins for estimating and controlling LLM costs:

// Estimate cost for a specific call
let cost = llm_cost("claude-sonnet-4-6", 1000, 500)
log("Estimated cost: $${cost}")

// Check cumulative session costs
let session = llm_session_cost()
log("Total: $${session.total_cost}")
log("Calls: ${session.call_count}")
log("Input tokens: ${session.input_tokens}")
log("Output tokens: ${session.output_tokens}")

// Set a budget (LLM calls throw if exceeded)
llm_budget(1.00)
log("Remaining: $${llm_budget_remaining()}")

For per-call controls, pass a budget envelope on llm_call:

let result = try {
  llm_call("Summarize this", nil, {
    provider: "openai",
    model: "gpt-4o",
    max_tokens: 1024,
    budget: {
      max_cost_usd: 0.001,
      max_input_tokens: 8000,
      max_output_tokens: 1024,
    },
  })
}

Harn estimates prompt tokens before the provider request leaves the process, using tiktoken-rs for known OpenAI models, labeled tiktoken approximations for Claude/Gemini families, and a heuristic fallback for unknown model IDs. It then projects cost with the provider/model pricing table and throws a terminal budget_exceeded dict when a limit would be exceeded. In a try { ... } expression that surfaces as Result.Err({kind: "terminal", reason: "budget_exceeded", projected_cost_usd: ...}).

agent_loop accepts the same envelope. max_* limits apply to each model turn; total_budget_usd is an aggregate loop budget and exits gracefully with status: "budget_exhausted" before starting a turn that would exceed it.
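
A sketch of the loop-level envelope follows. It assumes the envelope is passed under the same budget key as on llm_call and that status is readable on the value agent_loop returns; the task and system strings are illustrative:

```harn
let result = agent_loop("Triage the open issues.", "You are a careful maintainer.", {
  budget: {
    // Per-turn ceilings.
    max_input_tokens: 8000,
    max_output_tokens: 1024,
    // Aggregate ceiling for the whole loop.
    total_budget_usd: 0.50,
  },
})

if result.status == "budget_exhausted" {
  log("loop stopped before a turn that would exceed the aggregate budget")
}
```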

| Function | Description |
| --- | --- |
| llm_cost(model, input_tokens, output_tokens) | Estimate USD cost from embedded pricing table |
| llm_session_cost() | Session totals: {total_cost, input_tokens, output_tokens, call_count} |
| llm_budget(max_cost) | Set session budget in USD. LLM calls throw if exceeded |
| llm_budget_remaining() | Remaining budget (nil if no budget set) |
| tiktoken_count_tokens(text, model) | Count text with the selected tiktoken encoder for known OpenAI/Claude/Gemini model families |

Import std/llm/budget for reusable helpers such as estimate_text_tokens_detail(text, model), which includes the encoder label (cl100k_base, o200k_base, etc.) and whether the count is exact or an approximation.
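
The counting and pricing builtins can be combined to size a call before sending it. This sketch assumes tiktoken_count_tokens returns an integer count and that the model id is present in the embedded pricing table; the 500-token reply size is a guess, not a measured value:

```harn
let text = "Summarize the attached incident report in three bullets."

// Count prompt tokens with the encoder Harn selects for this model.
let input_tokens = tiktoken_count_tokens(text, "gpt-4o")

// Price the call assuming a reply of about 500 tokens.
let estimate = llm_cost("gpt-4o", input_tokens, 500)
log("input tokens: ${input_tokens}, estimated cost: $${estimate}")
```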

Testing with mock LLM responses

The mock provider returns deterministic responses without API keys. Use llm_mock() to queue specific responses — text, tool calls, or both:

// Queue a text response (consumed in FIFO order)
llm_mock({text: "The capital of France is Paris."})
let r = llm_call("What is the capital of France?", nil, {provider: "mock"})
assert_eq(r.text, "The capital of France is Paris.")

// Queue a response with tool calls
llm_mock({
  text: "Let me read that file.",
  tool_calls: [{name: "read_file", arguments: {path: "src/main.rs"}}],
})

// Queue token logprobs for confidence/reranking tests
llm_mock({text: "certain", logprobs: [{token: "certain", logprob: 0.0}]})

// Pattern-matched mocks (reusable by default, matched in declaration order)
llm_mock({text: "I don't know.", match: "*unknown*"})
llm_mock({text: "step 1", match: "*planner*", consume_match: true})
llm_mock({text: "step 2", match: "*planner*", consume_match: true})

// Provider-style error envelopes exercise the same catch/safe-call paths
// as live provider failures.
llm_mock({error: {status: 503, kind: "transient", reason: "upstream_unavailable"}})

// Inspect what was sent to the mock provider
let calls = llm_mock_calls()
// Each entry: {messages: [...], system: "..." or nil, tools: [...] or nil}

// Clear all mocks and call log between tests
llm_mock_clear()

When no llm_mock() responses are queued, the mock provider falls back to its default deterministic behavior (echoing prompt metadata). This means existing tests using provider: "mock" without llm_mock() continue to work unchanged.
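
Put together, a self-contained test might look like the sketch below. It assumes list indexing with [0] and that the first logged call is the one just made, which holds when the log is cleared first:

```harn
// Start from a clean queue and call log.
llm_mock_clear()
llm_mock({text: "4"})

let r = llm_call("What is 2 + 2?", "Answer with a number only.", {provider: "mock"})
assert_eq(r.text, "4")

// The call log records what the mock provider received.
let calls = llm_mock_calls()
assert_eq(calls[0].system, "Answer with a number only.")

// Leave nothing queued for the next test.
llm_mock_clear()
```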