onboarding an existing pile of markdown notes into a fresh kb is tedious today: there's no one-shot path that turns docs/*.md into reviewable pages. bundle import (import_check/import_apply) moves already-approved artifacts between two vouch kbs, and vault sync (sync_vault) mirrors approved pages out and brings user edits back — neither seeds a kb from a plain directory of prose that has never been through the gate. this asks for a one-way onboarding path: walk a directory of markdown and file one page proposal per file, so a reviewer works down the queue exactly as they would for agent-proposed pages.
this is distinct from #181 (bidirectional sync — a live mirror of approved pages, with edits flowing back) and #90 (deterministic sync/merge of two diverged .vouch/ trees). corpus import is strictly forward, strictly propose-only, and has no notion of a mirror or a merge: it never writes to pages/, only to proposed/.
proposed surface
a new cli command:
    vouch import-corpus <dir> [--glob '**/*.md'] [--page-type concept]
                              [--tag onboarding] [--actor corpus-import]
                              [--dry-run]
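a minimal sketch of how those flags might be parsed, assuming argparse; the real vouch cli may use a different framework. build_parser is a hypothetical name, and treating the bracketed values in the usage line as defaults is an assumption (only the **/*.md glob default is stated in this issue).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # hypothetical parser: only the flag names come from the usage line above
    parser = argparse.ArgumentParser(prog="vouch import-corpus")
    parser.add_argument("dir", help="directory of markdown files to walk")
    parser.add_argument("--glob", default="**/*.md", help="pattern matched under <dir>")
    # defaults below are assumed from the example values in the usage line
    parser.add_argument("--page-type", default="concept", help="page type for every proposal")
    parser.add_argument("--tag", default="onboarding", help="tag attached to every proposal")
    parser.add_argument("--actor", default="corpus-import", help="recorded as proposed_by")
    parser.add_argument("--dry-run", action="store_true", help="print the plan, write nothing")
    return parser


# parse an explicit argv so the sketch runs without a real command line
args = build_parser().parse_args(["notes", "--dry-run"])
print(args.dir, args.glob, args.page_type, args.tag, args.actor, args.dry_run)
```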
for each matching file it:
- registers the file bytes as a content-addressed source through the existing read_under_root containment path that kb.register_source_from_path uses, so the reviewer sees the exact bytes the proposal came from; put_source coalesces on the sha256 content hash, so re-runs reuse the same source id.
- files one page proposal through proposals.propose_page(...) with proposed_by=<actor>, the file's frontmatter title (or filename stem as fallback), page_type=<--page-type>, source_ids=[<source id>], and slug_hint derived from the relative path so re-imports target the same page id rather than spawning duplicates.
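the per-file derivations above could be sketched roughly as follows. every helper name here (content_sha256, frontmatter_title, slug_hint_for) is hypothetical, the frontmatter parsing is a deliberately naive title: lookup rather than a real yaml parse, and the slug normalisation is an assumption; the actual slug logic lives in proposals.py and may differ.

```python
import hashlib
import re
from pathlib import Path
from typing import Optional


def content_sha256(data: bytes) -> str:
    # same bytes always give the same digest, which is what lets re-runs coalesce
    return hashlib.sha256(data).hexdigest()


def frontmatter_title(text: str) -> Optional[str]:
    """return the title: value from a leading --- block, or None if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if sep and key.strip() == "title":
            return value.strip().strip("'\"") or None
    return None


def slug_hint_for(rel_path: Path) -> str:
    """derive a hint from the path relative to the corpus root (assumed scheme)."""
    stem = rel_path.with_suffix("").as_posix().lower()
    return re.sub(r"[^a-z0-9]+", "-", stem).strip("-")


def title_for(path: Path, text: str) -> str:
    # frontmatter title first, filename stem as the fallback
    return frontmatter_title(text) or path.stem
```

because the hint depends only on the relative path, editing a file's body and re-importing it yields the same hint, which is the property the proposal relies on to avoid duplicate pages.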
idempotency is by content hash: the command records what it has already proposed (per-file content sha) so a second run over an unchanged corpus skips those files. --dry-run prints the per-file plan (propose / skip-unchanged / skip-no-title) and writes nothing, threading the existing dry_run flag already accepted by propose_page.
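a rough sketch of the skip-by-hash and dry-run behaviour, under stated assumptions: plan_corpus and run_import are hypothetical names, the seen mapping stands in for wherever the command ends up recording per-file shas (this issue does not say where), and propose is an injected stand-in for the real source-registration and propose_page calls. the skip-no-title outcome is left out because its exact rule is not pinned down here.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Tuple


def plan_corpus(
    root: Path, seen: Dict[str, str], glob: str = "**/*.md"
) -> Iterator[Tuple[str, str, str]]:
    """yield (relative path, content sha256, action); reads files, writes nothing."""
    for path in sorted(root.glob(glob)):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        sha = hashlib.sha256(path.read_bytes()).hexdigest()
        action = "skip-unchanged" if seen.get(rel) == sha else "propose"
        yield rel, sha, action


def run_import(
    root: Path,
    seen: Dict[str, str],
    propose: Callable[[Path, bool], None],
    dry_run: bool = False,
) -> List[Tuple[str, str]]:
    plan: List[Tuple[str, str]] = []
    for rel, sha, action in plan_corpus(root, seen):
        plan.append((rel, action))
        if action == "propose":
            # thread dry_run through to the stand-in for propose_page
            propose(root / rel, dry_run)
            if not dry_run:
                seen[rel] = sha  # record only what was really proposed
    return plan


# small demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "intro.md").write_text("# intro\n", encoding="utf-8")
    seen: Dict[str, str] = {}
    calls: List[Tuple[str, bool]] = []

    def fake_propose(path: Path, dry_run: bool) -> None:
        calls.append((path.name, dry_run))

    dry_plan = run_import(corpus, seen, fake_propose, dry_run=True)
    seen_after_dry = dict(seen)
    first_plan = run_import(corpus, seen, fake_propose)
    second_plan = run_import(corpus, seen, fake_propose)
```

in this sketch the dry run leaves seen empty, the first real run proposes the file, and the second run over the unchanged corpus reports skip-unchanged.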
no new kb.* method is required if this ships as cli-only orchestration over propose_page. if a machine-callable kb.import_corpus is later wanted on the mcp/jsonl surfaces, it must touch all four registration sites — @mcp.tool in server.py, _h_import_corpus + HANDLERS in jsonl_server.py, METHODS in capabilities.py, and the cli.py mirror — plus tests/test_import_corpus.py. the walk/hash/skip logic belongs in a new module (e.g. src/vouch/corpus_import.py), not in storage.py, which stays pure i/o.
review gate & scope
every file becomes a proposal, never an approved page — approval stays a separate human kb.approve per proposal (or a future batch approve, #110). the command must not call proposals.approve() or write under pages/ directly. status and slug logic lives in proposals.py; the new module only orchestrates the walk and the source-registration / propose calls. it stays local-first: it reads a local directory, writes only into the local .vouch/proposed/ tree, and reaches nothing over the network. sources are registered through the containment-checked read_under_root path so a corpus outside the kb root can't smuggle bytes past the boundary.
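for illustration, a containment check of the general kind described above might look like this. it is not the actual read_under_root implementation, whose signature and behaviour are not shown in this issue; is_contained and read_contained are hypothetical names.

```python
from pathlib import Path


def is_contained(root: Path, candidate: Path) -> bool:
    """true if candidate, taken relative to root, resolves inside root."""
    resolved_root = root.resolve()
    # resolve() collapses ".." and follows symlinks, so real paths are compared;
    # an absolute candidate replaces the root in the join and is then rejected
    resolved = (resolved_root / candidate).resolve()
    return resolved == resolved_root or resolved_root in resolved.parents


def read_contained(root: Path, candidate: Path) -> bytes:
    if not is_contained(root, candidate):
        raise ValueError(f"{candidate} escapes {root}")
    return (root.resolve() / candidate).resolve().read_bytes()
```

comparing resolved paths rather than raw strings is the design choice that matters here: a prefix test on unresolved strings would accept notes/../../outside.md.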
acceptance criteria
- vouch import-corpus <dir> walks the directory (default glob **/*.md) and files exactly one page proposal per matching file
- proposals land only under proposed/; nothing is written to pages/, and proposals.approve() is never called
- --dry-run prints the per-file plan and writes nothing (no sources, no proposals)
- slug_hint is stable across runs so an edited-then-reimported file targets the same page id instead of duplicating
- a corpus.import event is written to the audit log recording the actor and file count
- tests in tests/test_import_corpus.py cover: happy-path propose, idempotent skip, dry-run, missing-title skip, and that no approved artifact is written
- make check is green (pytest, mypy, ruff)