feat(ingest): LLM-built TOC tree (PageIndex-style, PR-A) #24
PR-A of the PageIndex-style redesign: schema and types only. The LLM-driven TOC builder lands on top of this; persistence is wired here so the builder can write its result independently of the existing sections tree.
- 0006 migration adds the JSONB column (NULL until the builder runs; non-PDF docs keep it NULL forever).
- tree.TOCNode mirrors PageIndex's tree-output shape (structure / start_page / end_page / nodes) so external tooling that already speaks that vocabulary can interop without translation.
- Document.TOCTree stores the raw JSONB bytes so the column round-trips byte-identically.
- UpdateDocumentTOCTree mirrors UpdateSectionSummaryAxes: patch one column without touching the rest of the document row.
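As a rough illustration of the node shape described above, here is a minimal sketch. The Go field names, the JSON tags, the Title field, and the decodeTOC helper are assumptions made for this example; only the structure / start_page / end_page / nodes vocabulary comes from the commit message, and the real tree.TOCNode may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TOCNode is a hypothetical sketch of the tree node. Tags follow the
// structure / start_page / end_page / nodes vocabulary; Title is assumed.
type TOCNode struct {
	Structure string    `json:"structure"`
	Title     string    `json:"title,omitempty"`
	StartPage int       `json:"start_page,omitempty"` // 0 = unknown / open
	EndPage   int       `json:"end_page,omitempty"`
	Nodes     []TOCNode `json:"nodes,omitempty"`
}

// decodeTOC turns the raw column bytes into nodes. A NULL column
// (empty input) yields a nil slice and no error.
func decodeTOC(raw []byte) ([]TOCNode, error) {
	if len(raw) == 0 {
		return nil, nil
	}
	var nodes []TOCNode
	if err := json.Unmarshal(raw, &nodes); err != nil {
		return nil, err
	}
	return nodes, nil
}

func main() {
	raw := []byte(`[{"structure":"1","title":"Intro","start_page":1,"end_page":4,
	  "nodes":[{"structure":"1.1","title":"Scope","start_page":2,"end_page":4}]}]`)
	nodes, err := decodeTOC(raw)
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(nodes)
	fmt.Println(string(out))
}
```

Keeping the stored value as raw bytes and decoding only on demand means callers that just pass the tree through never pay for a parse.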
Ports the three-phase PageIndex pipeline into Go:
1. detect:  single-page TOC detector over the first N pages.
2. extract: if a TOC page was found, parse it into nested
            nodes; otherwise fall through to the no-TOC
            path that generates a TOC straight from body
            text tagged with <physical_index_X> markers.
3. verify:  concurrently re-check every node's claimed
            start page; mismatches clear the page back to
            zero (the "unknown / open" sentinel) rather
            than making one up.
End pages are derived from sibling ordering after verification.
Node IDs are stamped deterministically from the dotted
structure so callers can diff trees across re-ingestions.
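A minimal sketch of these two post-verification steps follows. The exact rules are assumptions: here a node ends one page before the next sibling with a known start, the last sibling inherits its parent's end, and the ID is simply a prefix plus the dotted structure. The real builder may use different conventions.

```go
package main

import "fmt"

// TOCNode is a hypothetical in-memory node for this sketch.
type TOCNode struct {
	ID        string
	Structure string // dotted position, e.g. "1.2"
	StartPage int    // 0 = unknown / open
	EndPage   int
	Nodes     []*TOCNode
}

// deriveEndPages fills EndPage from sibling ordering. parentEnd bounds
// the last sibling (pass the document's page count at the root).
func deriveEndPages(siblings []*TOCNode, parentEnd int) {
	for i, n := range siblings {
		end := parentEnd
		// The next sibling with a known start page bounds this node.
		for _, next := range siblings[i+1:] {
			if next.StartPage > 0 {
				end = next.StartPage - 1
				break
			}
		}
		switch {
		case n.StartPage == 0:
			end = 0 // unknown start: leave the end unknown too
		case end < n.StartPage:
			end = n.StartPage // two siblings share a page
		}
		n.EndPage = end
		deriveEndPages(n.Nodes, n.EndPage)
	}
}

// stampIDs assigns IDs that depend only on the dotted structure, so the
// same position gets the same ID on every re-ingestion.
func stampIDs(nodes []*TOCNode) {
	for _, n := range nodes {
		n.ID = "toc-" + n.Structure
		stampIDs(n.Nodes)
	}
}

func main() {
	roots := []*TOCNode{
		{Structure: "1", StartPage: 1, Nodes: []*TOCNode{
			{Structure: "1.1", StartPage: 2},
			{Structure: "1.2", StartPage: 5},
		}},
		{Structure: "2", StartPage: 10},
	}
	deriveEndPages(roots, 20)
	stampIDs(roots)
	for _, n := range roots {
		fmt.Println(n.ID, n.StartPage, n.EndPage)
	}
}
```

Because the ID carries no content hash, a diff across re-ingestions compares nodes by position rather than by title text.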
The retry-on-parse-failure helper mirrors
pkg/retrieval/single_pass.go::runSelectionWithRetry; it is
duplicated rather than imported so the builder doesn't drag the
retrieval package into its dependency graph.
LLM parse blips degrade to "no usable nodes" with a logged
warning so a single bad response never fails ingest — the
document remains fully retrievable via the existing sections
tree.
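The retry-then-degrade behaviour can be sketched as below. The function name parseWithRetry, the tocItem shape, and the call signature are hypothetical; this is not the helper from pkg/retrieval, only an illustration of retrying on parse failure and returning "no usable nodes" instead of an error.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// tocItem is an assumed shape for one entry in the model's JSON reply.
type tocItem struct {
	Structure string `json:"structure"`
	Title     string `json:"title"`
	Page      int    `json:"page"`
}

// parseWithRetry calls the model up to maxAttempts times until the reply
// parses as JSON. Transport errors are returned immediately; repeated
// parse failures degrade to nil items with a logged warning and no error.
// The second result is the number of calls made, for accounting.
func parseWithRetry(call func() (string, error), maxAttempts int) ([]tocItem, int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		raw, err := call()
		if err != nil {
			return nil, attempt, err
		}
		var items []tocItem
		if err := json.Unmarshal([]byte(raw), &items); err != nil {
			lastErr = err
			continue // parse blip: ask again
		}
		return items, attempt, nil
	}
	log.Printf("toc: no usable nodes after %d attempts: %v", maxAttempts, lastErr)
	return nil, maxAttempts, nil
}

func main() {
	calls := 0
	fake := func() (string, error) {
		calls++
		if calls == 1 {
			return "Sure! Here is the TOC:", nil // not JSON
		}
		return `[{"structure":"1","title":"Intro","page":3}]`, nil
	}
	items, attempts, err := parseWithRetry(fake, 3)
	fmt.Println(items, attempts, err)
}
```

Returning a nil slice rather than an error lets the caller treat "nothing usable" as an ordinary outcome and leave the column NULL.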
Tests cover the happy path, the no-TOC path, verification
repair, JSON retry, end-page derivation, hierarchy assembly,
the <physical_index_X> tag parser, and the empty-input
short-circuit.
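For readers unfamiliar with the <physical_index_X> markers, a small sketch of tagging and parsing follows. The wrapping layout (the same tag before and after each page) and the helper names tagPages and parsePhysicalIndex are assumptions for illustration; the builder's actual parser may accept a different format.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var physicalIndexRe = regexp.MustCompile(`<physical_index_(\d+)>`)

// tagPages wraps each page's text in <physical_index_N> markers
// (1-based) so a model can cite the page a heading appears on.
func tagPages(pages []string) string {
	var b strings.Builder
	for i, text := range pages {
		fmt.Fprintf(&b, "<physical_index_%d>\n%s\n<physical_index_%d>\n\n", i+1, text, i+1)
	}
	return b.String()
}

// parsePhysicalIndex extracts the page number from the first marker in s.
// It reports false when no marker is present or the number is not positive.
func parsePhysicalIndex(s string) (int, bool) {
	m := physicalIndexRe.FindStringSubmatch(s)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

func main() {
	fmt.Print(tagPages([]string{"Item 1. Business", "Item 1A. Risk Factors"}))
	page, ok := parsePhysicalIndex("<physical_index_2>")
	fmt.Println(page, ok)
}
```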
- Adds IngestConfig.TOC (Enabled / Model / Concurrency /
TOCCheckPages) with defaults (Enabled=true, Concurrency=4,
TOCCheckPages=20), env overrides
(VLE_INGEST_TOC_{ENABLED,MODEL,CONCURRENCY,TOC_CHECK_PAGES}),
validation, and example-config documentation.
- Pipeline.Run calls the new builder after summarize+HyDE for
PDF inputs, persists the result via UpdateDocumentTOCTree, and
logs the LLM-call accounting. Failures are non-fatal — they
leave documents.toc_tree NULL and the document remains fully
retrievable via the existing sections tree.
- assemblePagesFromSections groups parsed sections by PageStart
to reconstruct per-page text the builder can reason over.
PageStart==0 sections are skipped so the builder never sees
ambiguous page numbers.
- cmd/server wires the new config block into the pipeline literal.
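The page-reconstruction step in the list above might look roughly like this. The Section and Page types, the newline join, and the function name assemblePages are assumptions; only the grouping by PageStart and the skipping of PageStart==0 come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Section and Page are hypothetical stand-ins for the parser's types.
type Section struct {
	PageStart int // 0 = page unknown
	Text      string
}

type Page struct {
	Number int
	Text   string
}

// assemblePages groups section text by starting page, in page order.
// Sections with PageStart == 0 are skipped so every returned page has
// an unambiguous number.
func assemblePages(sections []Section) []Page {
	byPage := make(map[int][]string)
	for _, s := range sections {
		if s.PageStart == 0 {
			continue
		}
		byPage[s.PageStart] = append(byPage[s.PageStart], s.Text)
	}
	nums := make([]int, 0, len(byPage))
	for n := range byPage {
		nums = append(nums, n)
	}
	sort.Ints(nums)
	pages := make([]Page, 0, len(nums))
	for _, n := range nums {
		pages = append(pages, Page{Number: n, Text: strings.Join(byPage[n], "\n")})
	}
	return pages
}

func main() {
	pages := assemblePages([]Section{
		{PageStart: 2, Text: "Risk Factors"},
		{PageStart: 0, Text: "orphan"},
		{PageStart: 1, Text: "Business"},
		{PageStart: 2, Text: "Market risk"},
	})
	for _, p := range pages {
		fmt.Printf("%d: %q\n", p.Number, p.Text)
	}
}
```

Note that a section spanning several pages contributes only to its starting page in this sketch, so the reconstructed text is an approximation of the true per-page content.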
Caution: Review failed. The pull request is closed.
Walkthrough
This PR introduces an LLM-driven table-of-contents builder for PDF documents during ingestion. It adds a TOCNode data model, database persistence, configuration management, a three-phase builder pipeline (detect → extract/generate → verify), and wires the builder into the ingest orchestration with non-fatal failure handling.
Estimated code review effort: 4 (Complex), ~60 minutes
Summary
PR-A of the PageIndex-style redesign. Adds an LLM-driven table-of-contents tree builder that runs as a new stage of the ingest pipeline on PDF inputs. The resulting hierarchical TOC is persisted on documents.toc_tree (JSONB) and is intended as a higher-level map that retrieval strategies can reason over before drilling into the parser-derived sections tree.
This PR is intentionally additive: the existing summarize / HyDE / retrieval paths are unchanged, the column defaults to NULL on every row that pre-dates the migration, and a builder failure simply leaves the column NULL (the document remains fully retrievable via the existing sections tree).
The retrieval strategy that consumes toc_tree is part of PR-B (feat/pageindex-strategy), which runs in parallel.
Design
Three-phase pipeline ported from PageIndex/pageindex/page_index.py:
1. detect: single-page TOC detector (toc_detector_single_page) over the first TOCCheckPages pages (default 20).
2. extract: if a TOC page was found, parse it into nested nodes; otherwise fall through to the no-TOC path (generate_toc_init) that emits a TOC straight from body text tagged with <physical_index_X> markers.
3. verify: concurrently re-check every node's claimed start page (check_title_appearance_in_start). Mismatches clear the page back to zero (the documented "unknown / open" sentinel) rather than making one up.
End pages are derived from sibling ordering after verification. Node IDs are stamped deterministically from the dotted Structure so callers can diff trees across re-ingestions.
Opt-out
- ingest.toc.enabled: false
- VLE_INGEST_TOC_ENABLED=false
The stage is on by default for PDFs because the benefit-per-LLM-call ratio is excellent on the documents the engine targets (filings, manuals, papers), and any regression can be switched off with a single config flag.
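To make the opt-out concrete, here is a hedged sketch of how the config block, its defaults, and the environment overrides could fit together. The defaults and variable names are the ones listed in the commit message; the TOCConfig type, the applyEnv and validate helpers, and the specific validation rules are assumptions for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// TOCConfig is a hypothetical stand-in for IngestConfig.TOC.
type TOCConfig struct {
	Enabled       bool
	Model         string
	Concurrency   int
	TOCCheckPages int
}

func defaultTOCConfig() TOCConfig {
	return TOCConfig{Enabled: true, Concurrency: 4, TOCCheckPages: 20}
}

// applyEnv overlays VLE_INGEST_TOC_* variables on cfg. lookup has the
// same shape as os.LookupEnv so tests can pass a map-backed function.
func applyEnv(cfg TOCConfig, lookup func(string) (string, bool)) (TOCConfig, error) {
	if v, ok := lookup("VLE_INGEST_TOC_ENABLED"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("VLE_INGEST_TOC_ENABLED: %w", err)
		}
		cfg.Enabled = b
	}
	if v, ok := lookup("VLE_INGEST_TOC_MODEL"); ok {
		cfg.Model = v
	}
	if v, ok := lookup("VLE_INGEST_TOC_CONCURRENCY"); ok {
		n, err := strconv.Atoi(v)
		if err != nil {
			return cfg, fmt.Errorf("VLE_INGEST_TOC_CONCURRENCY: %w", err)
		}
		cfg.Concurrency = n
	}
	if v, ok := lookup("VLE_INGEST_TOC_TOC_CHECK_PAGES"); ok {
		n, err := strconv.Atoi(v)
		if err != nil {
			return cfg, fmt.Errorf("VLE_INGEST_TOC_TOC_CHECK_PAGES: %w", err)
		}
		cfg.TOCCheckPages = n
	}
	return cfg, nil
}

// validate uses assumed rules: a disabled stage is always valid,
// otherwise both numeric knobs must be at least 1.
func (c TOCConfig) validate() error {
	if !c.Enabled {
		return nil
	}
	if c.Concurrency < 1 {
		return fmt.Errorf("toc.concurrency must be >= 1, got %d", c.Concurrency)
	}
	if c.TOCCheckPages < 1 {
		return fmt.Errorf("toc.toc_check_pages must be >= 1, got %d", c.TOCCheckPages)
	}
	return nil
}

func main() {
	cfg, err := applyEnv(defaultTOCConfig(), os.LookupEnv)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v valid=%v\n", cfg, cfg.validate() == nil)
}
```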
Files
- pkg/db/migrations/0006_documents_toc_tree.{up,down}.sql: JSONB column.
- pkg/tree/tree.go: additive TOCNode type mirroring PageIndex's JSON shape.
- pkg/db/documents.go: Document.TOCTree field + UpdateDocumentTOCTree helper. SELECT lists updated to include the new column.
- pkg/db/documents_marshal_test.go: round-trip + omit-empty-fields tests.
- pkg/ingest/toc_builder.go: the three-phase builder + retry helper.
- pkg/ingest/toc_builder_test.go: happy path / no-TOC / verify-repair / JSON retry / end-page derivation / hierarchy assembly / assemblePagesFromSections bridge / synthetic 10-K (4 top-level nodes).
- pkg/ingest/ingest.go: one call site after summarize+HyDE; non-fatal.
- pkg/config/config.go + pkg/config/config_test.go: IngestConfig.TOC block, defaults, env overrides, validation, default-values + env-override coverage.
- cmd/server/main.go: pipeline literal updated.
- config.example.yaml: documents the new block.
The retry-on-parse-failure helper is duplicated from pkg/retrieval/single_pass.go::runSelectionWithRetry (marked with a comment) so the builder doesn't drag the retrieval package into its dependency graph.
Test plan
- go build ./... clean.
- go vet ./... clean.
- go test ./... all green (every existing package + new pkg/ingest TOC suite).
- ... ADD COLUMN IF NOT EXISTS).
- ... ContentType gate).
Out of scope
- pkg/retrieval/: the strategy that consumes toc_tree is PR-B.
- internal/api/ and cmd/engine/: unchanged.