feat(ingest): minimal mode — parse→persist→ready, skip LLM enrichment + tables#30
Add IngestConfig.Mode (yaml `mode`, values full|minimal, default full)
to the engine config, with VLE_INGEST_MODE env override and Validate
rejecting unknown values. Forward it from the deployed server's config
wrapper via firstEnv("VLS_INGEST_MODE", "VLE_INGEST_MODE") so the live
vectorless-server can be flipped to minimal ingest with a single env
var, no secret edit.
…M/tables

Add Pipeline.Mode; when "minimal", Run dispatches to runMinimal which does parse → build tree → persist → ready and skips every per-section LLM stage (summarize, HyDE, multi-axis summaries, TOC build).

The parser registry is rebuilt with table extraction DISABLED (nil opts) regardless of ingest.tables.enabled, since the pdftable table-finding pass is the slow/hang-prone part of parse and the page-based strategy reads raw page text (which still contains the table's text).

persistTree/parse/fail now take the persistence target through a narrow docPersister interface (*db.Pool satisfies it) so the minimal path is exercisable without a live Postgres. Both cmd/engine and cmd/server set Mode from cfg.Ingest.Mode and log when minimal mode is active.
- pkg/ingest/minimal_mode_test.go: a minimal-mode pipeline run with an LLM client that fails the test on any call reaches StatusReady with sections persisted and a call counter of 0, proving minimal ingest is pure-Go. A second test reconstructs the persisted tree and confirms the synthesised-TOC fallback is title-bearing and section bodies load back from storage.
- pkg/retrieval: TestPageIndexMinimalIngestedDoc drives the page-based strategy end-to-end against a minimal-ingested doc shape (page ranges + content refs, NO summaries, nil TOC) and asserts it produces a cited answer from the synthesised TOC + raw page reads.
- pkg/config: default mode is "full"; VLE_INGEST_MODE=minimal override and Validate accept/reject coverage.
- Document ingest.mode in both example configs.
Caution: Review failed. Pull request was closed or merged during review.

Walkthrough
This PR adds ingest mode configuration that allows documents to be marked ready after parsing and persisting, skipping expensive LLM enrichment stages (summarize, HyDE, multi-axis) and table/TOC extraction. The feature wires mode through config validation, pipeline branching, startup initialization, and includes comprehensive tests ensuring minimal-mode documents remain queryable.
Pull request overview
Adds a new minimal ingest mode that collapses the pipeline to parse → persist → ready, skipping all LLM enrichment stages (summarize, HyDE, multi-axis, TOC build) and the pdftable table-finding pass. This makes documents queryable in seconds via the page-based retrieval strategy, which doesn't need any of the skipped enrichment.
Changes:
- New `Pipeline.Mode` field with `ModeMinimal` constant; `Run` dispatches to a new `runMinimal` that parses with `RegistryFromTableOpts(nil)`, persists the section tree, and flips straight to `StatusReady`.
- Introduces a `docPersister` interface over the persistence calls so the minimal path is testable without Postgres; `parse`, `persistTree`, and `fail` now take it as a parameter.
- Wires `ingest.mode` config (with `VLE_INGEST_MODE` and forwarded `VLS_INGEST_MODE` overrides), validation, defaults, example YAMLs, and startup logging.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/ingest/ingest.go | Adds ModeMinimal, docPersister, runMinimal; threads persister through parse/persistTree/fail. |
| pkg/ingest/minimal_mode_test.go | New tests asserting zero LLM calls, ready status, and queryable post-ingest shape. |
| pkg/retrieval/pageindex_strategy_test.go | New cross-package test proving page-based strategy answers a minimal-ingested doc (nil TOC, no summaries). |
| pkg/config/config.go | Adds IngestConfig.Mode with default full, env override, and validation. |
| pkg/config/config_test.go | Covers default, env override, and validate accept/reject for ingest.mode. |
| internal/config/config.go | Forwards VLS_INGEST_MODE/VLE_INGEST_MODE to Engine.Ingest.Mode. |
| cmd/engine/main.go | Sets Pipeline.Mode from config; logs when minimal mode is active. |
| cmd/server/main.go | Same wiring + logging on the server binary. |
| config.example.yaml | Documents the new ingest.mode option. |
| config.server.example.yaml | Documents engine.ingest.mode plus the VLS_INGEST_MODE override. |
Why
Today, full ingest of a ~90-page 10-K does ~1,000–3,000 LLM calls (summarize every section + HyDE every leaf + multi-axis + a TOC build) AND a slow/hang-prone pdftable table-extraction pass: minutes of wall time before a document is `ready`.

But the engine's page-based retrieval strategy (`/v1/answer/pageindex`, the path that produces cited answers) needs none of that enrichment. It navigates a TOC tree (synthesising one from the section tree when `documents.toc_tree` is NULL) and reads raw section/page text at query time. So for that path we can collapse ingest to: parse → build section tree → persist → ready, at roughly parse speed (seconds).

This PR adds a minimal ingest mode that does exactly that.
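
The synthesised-TOC fallback mentioned above can be illustrated with a small sketch: when no stored TOC exists, one is derived from the section tree by copying titles and page ranges. The `Section` and `TOCNode` types and both function names are hypothetical simplifications, not the engine's actual types.

```go
package main

import "fmt"

// Section is a simplified stand-in for a persisted section-tree node.
type Section struct {
	Title     string
	PageStart int
	PageEnd   int
	Children  []*Section
}

// TOCNode is a simplified stand-in for a TOC entry.
type TOCNode struct {
	Title     string
	PageStart int
	PageEnd   int
	Children  []*TOCNode
}

// synthesiseTOC mirrors the section tree into a TOC, keeping titles and
// page ranges so the retrieval strategy can still navigate by page.
func synthesiseTOC(s *Section) *TOCNode {
	if s == nil {
		return nil
	}
	n := &TOCNode{Title: s.Title, PageStart: s.PageStart, PageEnd: s.PageEnd}
	for _, c := range s.Children {
		n.Children = append(n.Children, synthesiseTOC(c))
	}
	return n
}

// tocFor prefers a stored TOC and falls back to the synthesised one,
// which is the case for minimal-ingested documents (stored TOC is nil).
func tocFor(stored *TOCNode, root *Section) *TOCNode {
	if stored != nil {
		return stored
	}
	return synthesiseTOC(root)
}

func main() {
	root := &Section{Title: "10-K", PageStart: 1, PageEnd: 90, Children: []*Section{
		{Title: "Item 1. Business", PageStart: 3, PageEnd: 12},
		{Title: "Item 1A. Risk Factors", PageStart: 13, PageEnd: 30},
	}}
	toc := tocFor(nil, root)
	for _, c := range toc.Children {
		fmt.Printf("%s (pp. %d-%d)\n", c.Title, c.PageStart, c.PageEnd)
	}
}
```

Because the fallback needs only titles and page ranges, nothing produced by the LLM stages is required to build it.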
What's skipped in minimal mode
- Every per-section LLM stage: summarize, HyDE, multi-axis summaries, TOC build
- Table extraction: the parser registry is built with table options `nil` regardless of `ingest.tables.enabled` (the table-finding pass is the slow/hang-prone part of parse; the page strategy reads raw page text, which still contains the table's text, so dropping table sections loses nothing for it)

The document flips to `StatusReady` immediately after the section tree is persisted. `mode: full` (the default) is unchanged.

Design
- `pkg/config`: `IngestConfig.Mode` (yaml `mode`, `full` default | `minimal`), env override `VLE_INGEST_MODE`, `Validate` rejects unknown values.
- `internal/config` (deployed server wrapper): forwards `firstEnv("VLS_INGEST_MODE", "VLE_INGEST_MODE")` → `c.Engine.Ingest.Mode`, so the live `vectorless-server` flips with one env var, no secret edit.
- `pkg/ingest`: `Pipeline.Mode`; `Run` dispatches to a new `runMinimal` when `minimal`. `persistTree`/`parse`/`fail` now take the persistence target through a narrow `docPersister` interface (`*db.Pool` satisfies it) so the minimal path is testable without a live Postgres.
- `cmd/engine` + `cmd/server` set `Mode` from config and log when minimal mode is active.

How to enable
- `VLE_INGEST_MODE=minimal`
- `vectorless-server`: `VLS_INGEST_MODE=minimal` (env-only, no secret/config edit)
- `ingest.mode: minimal` (engine config) / `engine.ingest.mode: minimal` (server config)

Test plan
- `pkg/ingest/minimal_mode_test.go`: minimal-mode run with an LLM client that fails the test on any call reaches `ready` with sections persisted and a call counter of 0 (proof minimal mode is pure-Go). Asserts no summaries / axes / HyDE questions were written.
- `pkg/retrieval` `TestPageIndexMinimalIngestedDoc`: page-based strategy drives structure→get_pages→done end-to-end against a minimal-ingested doc shape (page ranges + content refs, no summaries, nil TOC) and returns a cited answer from the synthesised TOC + raw page reads.
- `pkg/config`: default mode `full`; `VLE_INGEST_MODE=minimal` override; Validate accept/reject coverage.
- `go build ./...`, `go vet ./...`, `go test ./...` all green (both binaries build).
- `ingest.mode` documented in `config.example.yaml` + `config.server.example.yaml`.