feat: bulk corpus import — propose pages for a directory of markdown #321

Description

@plind-junior

onboarding an existing pile of markdown notes into a fresh kb is tedious today: there's no one-shot path that turns docs/*.md into reviewable pages. bundle import (import_check/import_apply) moves already-approved artifacts between two vouch kbs, and vault sync (sync_vault) mirrors approved pages out and brings user edits back — neither seeds a kb from a plain directory of prose that has never been through the gate. this asks for a one-way onboarding path: walk a directory of markdown and file one page proposal per file, so a reviewer works down the queue exactly as they would for agent-proposed pages.

this is distinct from #181 (bidirectional sync — a live mirror of approved pages, with edits flowing back) and #90 (deterministic sync/merge of two diverged .vouch/ trees). corpus import is strictly forward, strictly propose-only, and has no notion of a mirror or a merge: it never writes to pages/, only to proposed/.

proposed surface

a new cli command:

vouch import-corpus <dir> [--glob '**/*.md'] [--page-type concept]
                          [--tag onboarding] [--actor corpus-import]
                          [--dry-run]
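
a minimal sketch of how that surface could be parsed, using the standard library's argparse. this is an illustration only: the issue does not say which cli framework cli.py uses, and the issue confirms only the `**/*.md` glob as a default. the other defaults here are read off the usage line as an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the proposed import-corpus usage line."""
    parser = argparse.ArgumentParser(prog="vouch import-corpus")
    parser.add_argument("dir", help="directory of markdown files to walk")
    # '**/*.md' is the default glob named in the acceptance criteria.
    parser.add_argument("--glob", default="**/*.md")
    # Assumption: the values shown in the usage line are the defaults.
    parser.add_argument("--page-type", default="concept")
    parser.add_argument("--tag", default="onboarding")
    parser.add_argument("--actor", default="corpus-import")
    # When set, print the per-file plan and write nothing.
    parser.add_argument("--dry-run", action="store_true")
    return parser
```

calling `build_parser().parse_args(["docs", "--dry-run"])` yields a namespace with `dir="docs"`, `dry_run=True`, and the defaults above for everything else.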

for each matching file it:

  1. registers the file bytes as a content-addressed source through the existing read_under_root containment path that kb.register_source_from_path uses, so the reviewer sees the exact bytes the proposal came from; put_source coalesces on the sha256 content hash, so re-runs reuse the same source id.
  2. files one page proposal through proposals.propose_page(...) with proposed_by=<actor>, the file's frontmatter title (or filename stem as fallback), page_type=<--page-type>, source_ids=[<source id>], and slug_hint derived from the relative path so re-imports target the same page id rather than spawning duplicates.
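
the title and slug_hint derivation in step 2 could look roughly like the sketch below. it is stdlib-only and makes assumptions the issue does not settle: frontmatter is taken to be a leading `---` block with a plain `title:` line (no real yaml parsing), and the slug format (lowercased relative path, non-alphanumerics collapsed to hyphens) is a guess. `derive_title` and `derive_slug_hint` are hypothetical names, not existing vouch functions.

```python
import re
from pathlib import PurePosixPath

# Assumption: frontmatter is a leading block delimited by '---' lines.
_FRONTMATTER = re.compile(r"\A---\r?\n(.*?)\r?\n---\r?\n", re.DOTALL)


def derive_title(text: str, rel_path: str) -> str:
    """Return the frontmatter title, or the filename stem as fallback."""
    match = _FRONTMATTER.match(text)
    if match:
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() == "title":
                title = value.strip().strip("'\"")
                if title:
                    return title
    return PurePosixPath(rel_path).stem


def derive_slug_hint(rel_path: str) -> str:
    """Build a slug from the relative path so re-imports hit the same id."""
    no_ext = PurePosixPath(rel_path).with_suffix("").as_posix().lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", no_ext).strip("-")
```

because the slug depends only on the relative path, editing a file's body or title leaves its slug_hint unchanged, which is the property the re-import behaviour relies on. renaming or moving the file changes it.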

idempotency is by content hash: the command records the content sha of each file it has already proposed, so a second run over an unchanged corpus skips those files. --dry-run prints the per-file plan (propose / skip-unchanged / skip-no-title) and writes nothing, passing through the dry_run flag that propose_page already accepts.
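
the skip logic can be sketched as a two-phase plan/apply over a ledger of already-proposed hashes. this is a sketch under stated assumptions: the ledger is modelled as an in-memory dict from relative path to sha256 (the issue does not say where or how it is persisted), the skip-no-title case is left out for brevity, and `PlanEntry`, `plan_corpus` and `apply_plan` are hypothetical names. the real command would register the source and call propose_page where the comment indicates.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlanEntry:
    rel_path: str
    sha256: str
    action: str  # "propose" or "skip-unchanged"


def plan_corpus(root: Path, pattern: str, ledger: dict[str, str]) -> list[PlanEntry]:
    """Walk root and classify each matching file against the ledger."""
    plan = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        action = "skip-unchanged" if ledger.get(rel) == digest else "propose"
        plan.append(PlanEntry(rel, digest, action))
    return plan


def apply_plan(plan: list[PlanEntry], ledger: dict[str, str], dry_run: bool) -> int:
    """Return how many files would be proposed; record them unless dry_run."""
    proposed = 0
    for entry in plan:
        if entry.action != "propose":
            continue
        if not dry_run:
            # Hypothetical: source registration and propose_page go here.
            ledger[entry.rel_path] = entry.sha256
        proposed += 1
    return proposed
```

keeping planning separate from applying means --dry-run can print the same plan the real run would execute while leaving the ledger untouched.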

no new kb.* method is required if this ships as cli-only orchestration over propose_page. if a machine-callable kb.import_corpus is later wanted on the mcp/jsonl surfaces, it must touch all four registration sites — @mcp.tool in server.py, _h_import_corpus + HANDLERS in jsonl_server.py, METHODS in capabilities.py, and the cli.py mirror — plus tests/test_import_corpus.py. the walk/hash/skip logic belongs in a new module (e.g. src/vouch/corpus_import.py), not in storage.py, which stays pure i/o.

review gate & scope

every file becomes a proposal, never an approved page — approval stays a separate human kb.approve per proposal (or a future batch approve, #110). the command must not call proposals.approve() or write under pages/ directly. status and slug logic lives in proposals.py; the new module only orchestrates the walk and the source-registration / propose calls. it stays local-first: it reads a local directory, writes only into the local .vouch/proposed/ tree, and reaches nothing over the network. sources are registered through the containment-checked read_under_root path so a corpus outside the kb root can't smuggle bytes past the boundary.
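
for readers unfamiliar with the containment idea, here is a generic sketch of a containment-checked read. it is not the actual read_under_root implementation, whose signature and behaviour the issue does not show. `read_contained` is a hypothetical stand-in that resolves the candidate path and refuses anything that lands outside the given root.

```python
from pathlib import Path


def read_contained(root: Path, candidate: str) -> bytes:
    """Read candidate only if it resolves to a location under root."""
    root_resolved = root.resolve()
    # resolve() collapses '..' segments and follows symlinks before the check.
    target = (root_resolved / candidate).resolve()
    try:
        target.relative_to(root_resolved)
    except ValueError:
        raise ValueError(f"{candidate!r} escapes {root_resolved}") from None
    return target.read_bytes()
```

resolving before comparing is the important step: a textual prefix check on the unresolved path would accept `notes/../../secret.md`.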

acceptance criteria

  • vouch import-corpus <dir> walks the directory (default glob **/*.md) and files exactly one page proposal per matching file
  • every file lands as a pending proposal in proposed/ — nothing is written to pages/, and proposals.approve() is never called
  • each proposal carries a content-addressed source registered from the file bytes, and re-running over the same bytes reuses the same source id
  • idempotent by content hash: a second run over an unchanged corpus proposes zero new pages and reports the files as skipped-unchanged
  • a file with unparseable frontmatter, or with no title derivable from the frontmatter or the filename stem, is skipped with a clear log line rather than filing a malformed proposal
  • --dry-run prints the per-file plan and writes nothing (no sources, no proposals)
  • slug_hint is stable across runs so an edited-then-reimported file targets the same page id instead of duplicating
  • a corpus.import event is written to the audit log recording the actor and file count
  • tests under tests/test_import_corpus.py cover: happy-path propose, idempotent skip, dry-run, missing-title skip, and that no approved artifact is written
  • make check green (pytest, mypy, ruff)

Metadata

    Labels

    enhancement (New feature or request); size: M (200-499 changed non-doc lines); sync (sync, vault mirror, and diff flows)
