Project scanning
The std/project module now includes a deterministic L0/L1 project scanner for
lightweight “what kind of project is this?” evidence without any LLM calls.
Import it with:
import "std/project"
What it returns
project_scan(path, options?) resolves path to a directory and returns a
dictionary describing exactly that directory:
let ev = project_scan(".", {tiers: ["ambient", "config"]})
Typical fields:
- path: absolute path to the scanned directory
- languages: stable, confidence-filtered language IDs such as ["rust"]
- frameworks: coarse framework IDs when an anchor is obvious
- build_systems: coarse build systems such as ["cargo"] or ["npm"]
- vcs: currently "git" when the directory is inside a Git checkout
- anchors: anchor files or directories found at the project root
- lockfiles: lockfiles found at the project root
- confidence: coarse per-language/per-framework scores
- package_name: root package/module name when it can be parsed deterministically
When tiers includes "config", the scan also fills in:
- build_commands: default or discovered build/test commands
- declared_scripts: parsed package.json scripts
- makefile_targets: parsed Makefile targets
- dockerfile_commands: parsed RUN, CMD, and ENTRYPOINT commands
- readme_code_fences: fenced-language labels found in the README
Tiers
- ambient: anchor files, lockfiles, coarse build system detection, VCS, and confidence scoring. No config parsing.
- config: deterministic config reads for files already found by ambient.
If tiers is omitted, project_scan(...) defaults to ["ambient"].
Polyglot repos
Single-directory scans stay leaf-scoped on purpose. For polyglot repos and
monorepos, use project_scan_tree(...) and let callers decide how to combine
sub-project evidence:
let tree = project_scan_tree(".", {tiers: ["ambient"], depth: 3})
// {".": {...}, "frontend": {...}, "backend": {...}}
project_scan_tree(...):
- always includes "." for the requested base directory
- walks subdirectories deterministically
- honors .gitignore by default
- skips standard vendor/build directories such as node_modules/ and target/ by default
You can override those defaults with:
- respect_gitignore: false
- include_vendor: true
- include_hidden: true
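Because combining sub-project evidence is left to the caller, one possible combination step is sketched below in Python rather than Harn. It assumes the tree has already been obtained as a plain mapping from relative path to an evidence dictionary with a languages list, as in the comment above. The helper name combine_languages and the sample data are hypothetical and are not part of std/project.

```python
from collections import defaultdict


def combine_languages(tree):
    """Map each language ID to the sorted sub-project paths that report it."""
    by_language = defaultdict(list)
    for sub_path, evidence in tree.items():
        # Missing "languages" is treated as "no languages detected".
        for language in evidence.get("languages", []):
            by_language[language].append(sub_path)
    # Sort keys and values so the result is deterministic.
    return {lang: sorted(paths) for lang, paths in sorted(by_language.items())}


# Hypothetical tree, hand-written to mirror the {".": ..., "frontend": ...} shape.
sample_tree = {
    ".": {"languages": []},
    "frontend": {"languages": ["typescript"]},
    "backend": {"languages": ["rust"]},
    "tools": {"languages": ["rust", "python"]},
}
combined = combine_languages(sample_tree)
```

Other callers might prefer a different policy, such as keeping only the highest-confidence language per sub-project. The scanner does not prescribe one.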
Enrichment
project_enrich(path, options) layers an L2, caller-owned enrichment pass on
top of deterministic project_scan(...) evidence. The caller supplies the
prompt template and the output schema; Harn owns prompt rendering, bounded file
selection, schema-retry plumbing, and content-hash caching.
Typical use:
let base = project_scan(".", {tiers: ["ambient", "config"]})
let enriched = project_enrich(".", {
  base_evidence: base,
  prompt: "Project: {{package_name}}\n{{ for file in files }}FILE {{file.path}}\n{{file.content}}\n{{ end }}\nReturn JSON.",
  schema: {
    type: "object",
    required: ["framework", "indent_style"],
    properties: {
      framework: {type: "string"},
      indent_style: {type: "string"},
    },
  },
  budget_tokens: 4000,
  model: "auto",
  cache_key: "coding-enrichment-v1",
})
Bindings available to the template:
- path: absolute project path
- base_evidence / evidence: the supplied or auto-scanned L0/L1 evidence
- every top-level key from base_evidence
- files: deterministic bounded file context as {path, content, truncated}
Behavior:
- cache key includes cache_key, path, schema, rendered prompt, and the content hash of the selected files
- cached hits surface _provenance.cached == true
- when the rendered prompt would exceed budget_tokens, the call returns the base evidence with budget_exceeded: true instead of failing
- schema-retry exhaustion returns an envelope with validation_error and base_evidence instead of raising
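Since none of these outcomes raise, a caller has to inspect the returned dictionary to tell them apart. The Python sketch below shows one way to do that, assuming the result is available as a plain dictionary with the fields named above. The function classify_enrichment and its labels are hypothetical. Whether uncached results carry a _provenance entry is not stated here, so the lookup tolerates its absence.

```python
def classify_enrichment(result):
    """Label an enrichment result using only the documented marker fields."""
    if result.get("budget_exceeded") is True:
        # Base evidence came back untouched because the prompt was too large.
        return "budget_exceeded"
    if "validation_error" in result:
        # Schema retries ran out; the envelope still carries base_evidence.
        return "validation_failed"
    if result.get("_provenance", {}).get("cached") is True:
        return "cached"
    return "fresh"


# Hypothetical result shapes, written by hand for illustration.
outcomes = [
    classify_enrichment({"languages": ["rust"], "budget_exceeded": True}),
    classify_enrichment({"validation_error": "missing framework", "base_evidence": {}}),
    classify_enrichment({"framework": "axum", "_provenance": {"cached": True}}),
    classify_enrichment({"framework": "axum"}),
]
```

Checking budget_exceeded and validation_error before reading schema fields keeps a caller from treating plain base evidence as an enriched result.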
By default, cache entries live under .harn/cache/enrichment/ inside the
project root. Override that with cache_dir when a caller wants a different
location.
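To illustrate why a change to any listed input causes a cache miss, here is a minimal Python model of a composite cache key. It is not Harn's actual derivation: the hash function, the serialization, and the helper names content_hash and enrichment_cache_key are all assumptions made for this sketch.

```python
import hashlib
import json


def content_hash(files):
    """Hash the selected files' paths and contents in a stable order."""
    digest = hashlib.sha256()
    for entry in sorted(files, key=lambda item: item["path"]):
        digest.update(entry["path"].encode("utf-8"))
        digest.update(b"\0")  # separator so path/content boundaries stay unambiguous
        digest.update(entry["content"].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def enrichment_cache_key(cache_key, path, schema, rendered_prompt, files):
    """Combine every documented input into one hex digest (assumed scheme)."""
    payload = json.dumps(
        {
            "cache_key": cache_key,
            "path": path,
            "schema": schema,
            "prompt": rendered_prompt,
            "content_hash": content_hash(files),
        },
        sort_keys=True,  # stable serialization regardless of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical inputs: same request twice, then one file edited.
schema = {"type": "object", "required": ["framework"]}
files_v1 = [{"path": "src/main.rs", "content": "fn main() {}"}]
files_v2 = [{"path": "src/main.rs", "content": "fn main() { run(); }"}]
key_a = enrichment_cache_key("coding-enrichment-v1", "/repo", schema, "Return JSON.", files_v1)
key_b = enrichment_cache_key("coding-enrichment-v1", "/repo", schema, "Return JSON.", files_v1)
key_c = enrichment_cache_key("coding-enrichment-v1", "/repo", schema, "Return JSON.", files_v2)
```

In this model key_a and key_b are identical, while key_c differs because one selected file changed. The behavior list above implies the same for the real cache.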
Cached deep scans
project_deep_scan(path, options?) layers a cached per-directory tree on top
of the metadata store. It is intended for repeated L2/L3 repo analysis where
callers want stable hierarchical evidence instead of re-running enrichment on
every turn.
Typical shape:
let tree = project_deep_scan(".", {
  namespace: "coding-enrichment-v1",
  tiers: ["ambient", "config", "enriched"],
  incremental: true,
  max_staleness_seconds: 86400,
  depth: nil,
  enrichment: {
    prompt: "Return valid JSON only.",
    schema: {purpose: "string", conventions: ["string"]},
    provider: "mock",
    budget_tokens_per_dir: 1024,
  },
})
Notes:
- namespace is caller-owned, so multiple agents can keep separate trees for the same repo without collisions.
- incremental: true reuses cached directories whose local directory structure_hash and content_hash still match.
- depth: nil means unbounded traversal.
- The filesystem backend persists namespace shards under .harn/metadata/<namespace>/entries.json.
- project_deep_scan_status(namespace, path?) returns the last recorded scan summary for that scope: {total_dirs, enriched_dirs, stale_dirs, cache_hits, last_refresh, ...}.
project_enrich(path, options?) is the single-directory building block used by
deep scan when the enriched tier is requested.
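The incremental reuse rule can be modeled as a comparison of two hashes per directory. The Python sketch below is illustrative only and assumes cached and current state are both available as mappings from directory to {structure_hash, content_hash}. The function plan_refresh and the sample hashes are hypothetical. How Harn computes the hashes, and how max_staleness_seconds interacts with them, is not modeled.

```python
def plan_refresh(cached, current):
    """Split directories into those reusable from cache and those to rescan."""
    reuse, rescan = [], []
    for directory, hashes in sorted(current.items()):
        entry = cached.get(directory)
        unchanged = (
            entry is not None
            and entry["structure_hash"] == hashes["structure_hash"]
            and entry["content_hash"] == hashes["content_hash"]
        )
        # Both hashes must match; a new or edited directory is rescanned.
        (reuse if unchanged else rescan).append(directory)
    return reuse, rescan


# Hypothetical state: "src" was edited and "docs" is new since the last scan.
cached_state = {
    ".": {"structure_hash": "s1", "content_hash": "c1"},
    "src": {"structure_hash": "s2", "content_hash": "c2"},
}
current_state = {
    ".": {"structure_hash": "s1", "content_hash": "c1"},
    "src": {"structure_hash": "s2", "content_hash": "c3"},
    "docs": {"structure_hash": "s4", "content_hash": "c4"},
}
reuse, rescan = plan_refresh(cached_state, current_state)
```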
Catalog
project_catalog() returns the authoritative built-in catalog that drives
ambient detection. Each entry includes:
- id
- languages
- frameworks
- build_systems
- anchors
- lockfiles
- source_globs
- default_build_cmd
- default_test_cmd
The catalog lives in
crates/harn-vm/src/stdlib/project_catalog.rs. Adding a new language should be
a table entry plus a test, not a new custom code path.
Existing helper
project_root_package() now delegates to the scanner’s config tier after
checking metadata enrichment, so existing callers keep the same package-name
surface while the manifest parsing logic stays centralized.