Project scanning

The std/project module now includes a deterministic L0/L1 project scanner for lightweight “what kind of project is this?” evidence without any LLM calls.

Import it with:

import "std/project"

What it returns

project_scan(path, options?) resolves path to a directory and returns a dictionary describing exactly that directory:

let ev = project_scan(".", {tiers: ["ambient", "config"]})

Typical fields:

path: absolute path to the scanned directory
languages: stable, confidence-filtered language IDs such as ["rust"]
frameworks: coarse framework IDs when an anchor is obvious
build_systems: coarse build systems such as ["cargo"] or ["npm"]
vcs: currently "git" when the directory is inside a Git checkout
anchors: anchor files or directories found at the project root
lockfiles: lockfiles found at the project root
confidence: coarse per-language/per-framework scores
package_name: root package/module name when it can be parsed deterministically

When tiers includes "config", the scan also fills in:

build_commands: default or discovered build/test commands
declared_scripts: parsed package.json scripts
makefile_targets: parsed Makefile targets
dockerfile_commands: parsed RUN, CMD, and ENTRYPOINT commands
readme_code_fences: fenced-language labels found in the README

Tiers

ambient: anchor files, lockfiles, coarse build system detection, VCS, and confidence scoring. No config parsing.
config: deterministic config reads for files already found by ambient.

If tiers is omitted, project_scan(...) defaults to ["ambient"].

Polyglot repos

Single-directory scans stay leaf-scoped on purpose. For polyglot repos and monorepos, use project_scan_tree(...) and let callers decide how to combine sub-project evidence:

let tree = project_scan_tree(".", {tiers: ["ambient"], depth: 3})
// {".": {...}, "frontend": {...}, "backend": {...}}

project_scan_tree(...):

always includes "." for the requested base directory
walks subdirectories deterministically
honors .gitignore by default
skips standard vendor/build directories such as node_modules/ and target/ by default

You can override those defaults with:

respect_gitignore: false
include_vendor: true
include_hidden: true
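
Because combining sub-project evidence is left to the caller, one possible merge policy is sketched below. It is written in Python rather than Harn and operates on a plain dictionary shaped like the project_scan_tree(...) result. The merge_languages helper and the sample data are hypothetical; only the relative-path keys and the languages field come from the description above.

```python
def merge_languages(tree):
    """Union language IDs across all scanned directories.

    `tree` maps relative directory paths to evidence dictionaries.
    Directories are visited in sorted order so the result is deterministic.
    """
    seen = []
    for rel_path in sorted(tree):
        for lang in tree[rel_path].get("languages", []):
            if lang not in seen:
                seen.append(lang)
    return seen


# Hypothetical tree result for a two-package monorepo.
tree = {
    ".": {"languages": []},
    "frontend": {"languages": ["typescript"]},
    "backend": {"languages": ["rust"]},
}
print(merge_languages(tree))  # ['rust', 'typescript']
```

A caller that cares about the dominant language instead of the union could weight each directory by its confidence scores; that choice is deliberately left outside the scanner.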

Enrichment

project_enrich(path, options) layers an L2, caller-owned enrichment pass on top of deterministic project_scan(...) evidence. The caller supplies the prompt template and the output schema; Harn owns prompt rendering, bounded file selection, schema-retry plumbing, and content-hash caching.

Typical use:

let base = project_scan(".", {tiers: ["ambient", "config"]})
let enriched = project_enrich(".", {
  base_evidence: base,
  prompt: "Project: {{package_name}}\n{{ for file in files }}FILE {{file.path}}\n{{file.content}}\n{{ end }}\nReturn JSON.",
  schema: {
    type: "object",
    required: ["framework", "indent_style"],
    properties: {
      framework: {type: "string"},
      indent_style: {type: "string"},
    },
  },
  budget_tokens: 4000,
  model: "auto",
  cache_key: "coding-enrichment-v1",
})

Bindings available to the template:

path: absolute project path
base_evidence / evidence: the supplied or auto-scanned L0/L1 evidence
every top-level key from base_evidence
files: deterministic bounded file context as {path, content, truncated}

Behavior:

cache key includes cache_key, path, schema, rendered prompt, and the content hash of the selected files
cached hits surface _provenance.cached == true
when the rendered prompt would exceed budget_tokens, the call returns the base evidence with budget_exceeded: true instead of failing
schema-retry exhaustion returns an envelope with validation_error and base_evidence instead of raising
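
Since none of these outcomes raise, a caller has to inspect the returned dictionary to tell them apart. The sketch below shows one way to do that, in Python rather than Harn, assuming the flags listed above appear as keys on the returned dictionary. The classify_enrichment name and its return labels are hypothetical.

```python
def classify_enrichment(result):
    """Map an enrichment result dictionary to a coarse outcome label."""
    if result.get("budget_exceeded") is True:
        # Base evidence was returned unchanged; no model call result is present.
        return "budget_exceeded"
    if "validation_error" in result:
        # Schema retries were exhausted; base_evidence is still available.
        return "validation_failed"
    if result.get("_provenance", {}).get("cached") is True:
        return "cached"
    return "fresh"


print(classify_enrichment({"framework": "axum", "_provenance": {"cached": True}}))
```

Checking budget_exceeded and validation_error before reading schema fields keeps a caller from treating fallback evidence as a validated enrichment.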

By default, cache entries live under .harn/cache/enrichment/ inside the project root. Override that with cache_dir when a caller wants a different location.
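
To make the cache-key inputs concrete, here is a minimal Python sketch of a key derived from the five components listed above. The hash algorithm (SHA-256), the length-prefixed encoding, and the function names are all assumptions for illustration; the source does not say how Harn actually combines these inputs.

```python
import hashlib
import json


def _digest(parts):
    """Hash a sequence of strings, length-prefixing each to avoid ambiguity."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def files_content_hash(files):
    """Content hash over the selected files, independent of input order."""
    ordered = sorted(files, key=lambda f: f["path"])
    return _digest([x for f in ordered for x in (f["path"], f["content"])])


def enrichment_cache_key(cache_key, path, schema, rendered_prompt, files):
    """Combine caller key, path, schema, prompt, and file content hash."""
    return _digest([
        cache_key,
        path,
        json.dumps(schema, sort_keys=True),  # canonical schema serialization
        rendered_prompt,
        files_content_hash(files),
    ])


files = [{"path": "src/main.rs", "content": "fn main() {}"}]
key = enrichment_cache_key("coding-enrichment-v1", "/repo", {"type": "object"}, "Project: demo", files)
print(len(key))  # 64 hex characters
```

Under this scheme, editing any selected file, changing the schema, or bumping cache_key produces a different key, which matches the invalidation behavior described above.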

Cached deep scans

project_deep_scan(path, options?) layers a cached per-directory tree on top of the metadata store. It is intended for repeated L2/L3 repo analysis where callers want stable hierarchical evidence instead of re-running enrichment on every turn.

Typical shape:

let tree = project_deep_scan(".", {
  namespace: "coding-enrichment-v1",
  tiers: ["ambient", "config", "enriched"],
  incremental: true,
  max_staleness_seconds: 86400,
  depth: nil,
  enrichment: {
    prompt: "Return valid JSON only.",
    schema: {purpose: "string", conventions: ["string"]},
    provider: "mock",
    budget_tokens_per_dir: 1024,
  },
})

Notes:

namespace is caller-owned, so multiple agents can keep separate trees for the same repo without collisions.
incremental: true reuses a cached directory when its local structure_hash and content_hash still match.
depth: nil means unbounded traversal.
The filesystem backend persists namespace shards under .harn/metadata/<namespace>/entries.json.
project_deep_scan_status(namespace, path?) returns the last recorded scan summary for that scope: {total_dirs, enriched_dirs, stale_dirs, cache_hits, last_refresh, ...}.
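
The incremental reuse rule can be sketched as a small predicate. This is a Python illustration, not Harn's implementation: the refreshed_at field and the way max_staleness_seconds is combined with the hash comparison are assumptions, and only structure_hash, content_hash, and max_staleness_seconds are names taken from the text above.

```python
def can_reuse(cached, current, now, max_staleness_seconds):
    """Decide whether a cached directory entry may be reused.

    `cached` is the stored entry (or None), `current` holds freshly computed
    hashes, and `now` / `refreshed_at` are Unix timestamps in seconds.
    """
    if cached is None:
        return False
    if cached["structure_hash"] != current["structure_hash"]:
        return False  # entries were added, removed, or renamed
    if cached["content_hash"] != current["content_hash"]:
        return False  # file contents changed
    # Assumed policy: entries older than the staleness bound are refreshed.
    return now - cached["refreshed_at"] <= max_staleness_seconds


cached = {"structure_hash": "s1", "content_hash": "c1", "refreshed_at": 1000}
current = {"structure_hash": "s1", "content_hash": "c1"}
print(can_reuse(cached, current, now=2000, max_staleness_seconds=86400))  # True
```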

project_enrich(path, options) is the single-directory building block used by deep scan when the enriched tier is requested.

Catalog

project_catalog() returns the authoritative built-in catalog that drives ambient detection. Each entry includes:

id
languages
frameworks
build_systems
anchors
lockfiles
source_globs
default_build_cmd
default_test_cmd

The catalog lives in crates/harn-vm/src/stdlib/project_catalog.rs. Adding a new language should be a table entry plus a test, not a new custom code path.
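
The table-driven idea can be illustrated with a small Python sketch. The two entries below are illustrative stand-ins that reuse the field names listed above; they are not copied from project_catalog.rs, and the detect function is a hypothetical simplification that ignores confidence scoring.

```python
# Illustrative catalog table; real entries live in the Rust source.
CATALOG = [
    {"id": "rust", "languages": ["rust"], "build_systems": ["cargo"],
     "anchors": ["Cargo.toml"], "lockfiles": ["Cargo.lock"]},
    {"id": "node", "languages": ["javascript"], "build_systems": ["npm"],
     "anchors": ["package.json"], "lockfiles": ["package-lock.json"]},
]


def detect(root_entries):
    """Match root-level file names against catalog anchors and lockfiles."""
    names = set(root_entries)
    found = {"languages": [], "build_systems": [], "anchors": [], "lockfiles": []}
    for entry in CATALOG:
        anchors = [a for a in entry["anchors"] if a in names]
        if not anchors:
            continue  # no anchor present, so this entry does not apply
        found["languages"] += entry["languages"]
        found["build_systems"] += entry["build_systems"]
        found["anchors"] += anchors
        found["lockfiles"] += [l for l in entry["lockfiles"] if l in names]
    return found


print(detect(["Cargo.toml", "Cargo.lock", "README.md"]))
```

With this layout, supporting another ecosystem means appending one dictionary to the table; detect itself does not change, which is the property the catalog design aims for.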

Existing helper

project_root_package() now delegates to the scanner’s config tier after checking metadata enrichment, so existing callers keep the same package-name surface while the manifest parsing logic stays centralized.
