feat: bulk corpus import — propose pages for a directory of markdown

onboarding an existing pile of markdown notes into a fresh kb is tedious today: there's no one-shot path that turns `docs/*.md` into reviewable pages. bundle import (`import_check`/`import_apply`) moves already-approved artifacts between two vouch kbs, and vault sync (`sync_vault`) mirrors approved pages out and brings user edits back — neither seeds a kb from a plain directory of prose that has never been through the gate. this asks for a one-way onboarding path: walk a directory of markdown and file one page **proposal** per file, so a reviewer works down the queue exactly as they would for agent-proposed pages.

this is distinct from #181 (bidirectional sync — a live mirror of *approved* pages, with edits flowing back) and #90 (deterministic sync/merge of two diverged `.vouch/` trees). corpus import is strictly forward, strictly propose-only, and has no notion of a mirror or a merge: it never writes to `pages/`, only to `proposed/`.

## proposed surface

a new cli command:

```
vouch import-corpus <dir> [--glob '**/*.md'] [--page-type concept]
                          [--tag onboarding] [--actor corpus-import]
                          [--dry-run]
```
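a minimal sketch of how the flags above could be parsed, assuming `argparse` (the real `cli.py` may use a different framework). only the `**/*.md` default is stated in this issue; treating the other bracketed values as defaults is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # hypothetical parser for the proposed command, not the project's cli.py
    parser = argparse.ArgumentParser(prog="vouch import-corpus")
    parser.add_argument("dir", help="directory of markdown files to walk")
    # default glob is the one named in the acceptance criteria
    parser.add_argument("--glob", default="**/*.md")
    # assumption: the bracketed values in the usage line are the defaults
    parser.add_argument("--page-type", default="concept")
    parser.add_argument("--tag", default="onboarding")
    parser.add_argument("--actor", default="corpus-import")
    parser.add_argument("--dry-run", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["docs", "--dry-run"])
    print(args.dir, args.glob, args.page_type, args.actor, args.dry_run)
```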

for each matching file it:

1. registers the file bytes as a content-addressed source through the existing `read_under_root` containment path that `kb.register_source_from_path` uses, so the reviewer sees the exact bytes the proposal came from; `put_source` coalesces on the sha256 content hash, so re-runs reuse the same source id.
2. files one page proposal through `proposals.propose_page(...)` with `proposed_by=<actor>`, the file's frontmatter title (or filename stem as fallback), `page_type=<--page-type>`, `source_ids=[<source id>]`, and `slug_hint` derived from the relative path so re-imports target the same page id rather than spawning duplicates.
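
the title fallback and path-derived `slug_hint` from step 2 could look roughly like this. it is a sketch under assumptions: `derive_title` and `derive_slug_hint` are hypothetical helpers, the slug format (lowercase, hyphen-joined path parts) is a guess at what `proposals.py` would accept, and the regex frontmatter scan stands in for whatever yaml parsing the project already uses.

```python
import re
from pathlib import Path

# matches a top-level `title:` line inside a frontmatter block
_TITLE_LINE = re.compile(r"^title:\s*(.+?)\s*$", re.MULTILINE)


def derive_title(text: str, path: Path) -> str:
    """frontmatter title if one is present, else the filename stem."""
    if text.startswith("---\n"):
        end = text.find("\n---", 3)  # closing frontmatter fence
        if end != -1:
            match = _TITLE_LINE.search(text[4:end])
            if match:
                return match.group(1).strip("'\"")
    return path.stem


def derive_slug_hint(path: Path, root: Path) -> str:
    """same relative path in, same hint out, so re-imports hit the same page id."""
    rel = path.relative_to(root).with_suffix("")
    raw = "-".join(rel.parts).lower()
    # collapse anything outside [a-z0-9] into single hyphens
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
```

because the hint depends only on the path relative to `<dir>`, editing a file's body leaves it unchanged, while moving or renaming the file would produce a new hint.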

idempotency is by content hash: the command records the content sha of each file it has already proposed, so a second run over an unchanged corpus skips those files. `--dry-run` prints the per-file plan (propose / skip-unchanged / skip-no-title) and writes nothing, by passing through the `dry_run` flag that `propose_page` already accepts.
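
the skip-by-hash plan can be sketched as a pure function, which is also what `--dry-run` would print. this assumes the set of already-proposed hashes is loaded from wherever the command ends up recording it (not specified here); `plan_corpus` and `content_sha` are hypothetical names, and the skip-no-title branch is left out for brevity.

```python
import hashlib
from pathlib import Path


def content_sha(path: Path) -> str:
    # sha256 over the raw file bytes, hex-encoded
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_corpus(
    root: Path, already_proposed: set[str], pattern: str = "**/*.md"
) -> list[tuple[str, str]]:
    """return (relative path, action) pairs without writing anything."""
    plan: list[tuple[str, str]] = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        if content_sha(path) in already_proposed:
            action = "skip-unchanged"
        else:
            action = "propose"
        plan.append((path.relative_to(root).as_posix(), action))
    return plan
```

keeping the plan separate from the writes means a dry run and a real run share the same decision logic, and only the real run goes on to register sources and call `propose_page`.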

no new `kb.*` method is required if this ships as cli-only orchestration over `propose_page`. if a machine-callable `kb.import_corpus` is later wanted on the mcp/jsonl surfaces, it must touch all four registration sites — `@mcp.tool` in `server.py`, `_h_import_corpus` + `HANDLERS` in `jsonl_server.py`, `METHODS` in `capabilities.py`, and the `cli.py` mirror — plus `tests/test_import_corpus.py`. the walk/hash/skip logic belongs in a new module (e.g. `src/vouch/corpus_import.py`), not in `storage.py`, which stays pure i/o.

## review gate & scope

every file becomes a **proposal**, never an approved page — approval stays a separate human `kb.approve` per proposal (or a future batch approve, #110). the command must not call `proposals.approve()` or write under `pages/` directly. status and slug logic lives in `proposals.py`; the new module only orchestrates the walk and the source-registration / propose calls. it stays local-first: it reads a local directory, writes only into the local `.vouch/proposed/` tree, and reaches nothing over the network. sources are registered through the containment-checked `read_under_root` path so a corpus outside the kb root can't smuggle bytes past the boundary.
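
for readers unfamiliar with the containment check, here is a rough stand-in for what a `read_under_root`-style guard does. it is not the project's implementation: `read_contained` is a hypothetical name, and which root the check is anchored to is an assumption.

```python
from pathlib import Path


def read_contained(root: Path, candidate: Path) -> bytes:
    """read candidate only if it resolves to a location under root."""
    resolved_root = root.resolve()
    # resolve() follows symlinks and collapses `..` before the comparison
    resolved = (resolved_root / candidate).resolve()
    if not resolved.is_relative_to(resolved_root):
        raise ValueError(f"{candidate} escapes {root}")
    return resolved.read_bytes()
```

resolving before comparing is the important part: a plain string-prefix check would let `../` segments or symlinks slip through.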

## acceptance criteria

- [ ] `vouch import-corpus <dir>` walks the directory (default glob `**/*.md`) and files exactly one page proposal per matching file
- [ ] every file lands as a pending proposal in `proposed/` — nothing is written to `pages/`, and `proposals.approve()` is never called
- [ ] each proposal carries a content-addressed source registered from the file bytes, and re-running over the same bytes reuses the same source id
- [ ] idempotent by content hash: a second run over an unchanged corpus proposes zero new pages and reports the files as skipped-unchanged
- [ ] a file with no derivable title / unparseable frontmatter is skipped with a clear log line rather than filing a malformed proposal
- [ ] `--dry-run` prints the per-file plan and writes nothing (no sources, no proposals)
- [ ] `slug_hint` is stable across runs so an edited-then-reimported file targets the same page id instead of duplicating
- [ ] a `corpus.import` event is written to the audit log recording the actor and file count
- [ ] tests under `tests/test_import_corpus.py` cover: happy-path propose, idempotent skip, dry-run, missing-title skip, and that no approved artifact is written
- [ ] `make check` green (pytest, mypy, ruff)
