feat(ingest): LLM-built TOC tree (PageIndex-style, PR-A) #24

Merged
hallelx2 merged 3 commits into main from feat/toc-tree-builder
May 27, 2026

@hallelx2 (Owner) commented May 27, 2026

Summary

PR-A of the PageIndex-style redesign. Adds an LLM-driven table-of-contents tree builder that runs as a new stage of the ingest pipeline on PDF inputs. The resulting hierarchical TOC is persisted on documents.toc_tree (JSONB) and is intended as a higher-level map that retrieval strategies can reason over before drilling into the parser-derived sections tree.

This PR is intentionally additive: the existing summarize / HyDE / retrieval paths are unchanged, the column defaults to NULL on every row that pre-dates the migration, and a builder failure simply leaves the column NULL (the document remains fully retrievable via the existing sections tree).

The retrieval strategy that consumes toc_tree is part of PR-B (feat/pageindex-strategy), which runs in parallel.

Design

Three-phase pipeline ported from PageIndex/pageindex/page_index.py:

  1. detect — single-page TOC detector (toc_detector_single_page) over the first TOCCheckPages pages (default 20).
  2. extract — if a TOC page was found, parse it into nested nodes. Otherwise fall through to the no-TOC path (generate_toc_init) that emits a TOC straight from body text tagged with <physical_index_X> markers.
  3. verify — concurrently re-check every node's claimed start page (check_title_appearance_in_start). Mismatches clear the page back to zero (the documented "unknown / open" sentinel) rather than making one up.

End pages are derived from sibling ordering after verification. Node IDs are stamped deterministically from the dotted Structure so callers can diff trees across re-ingestions.
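
As a rough sketch of the two post-verification steps: the rule used here (a node ends on the page before the next sibling with a known start page, and the last sibling inherits its parent's end) and the `toc-` ID prefix are assumptions made for illustration. The PR text only states that end pages come from sibling ordering and that IDs are derived deterministically from the dotted structure.

```go
package main

import "fmt"

// tocNode is a hypothetical in-memory node; 0 means "unknown" for both pages.
type tocNode struct {
	ID        string
	Structure string // dotted position, e.g. "1.2"
	StartPage int
	EndPage   int
	Nodes     []*tocNode
}

// deriveEndPages fills EndPage from sibling ordering. parentEnd bounds the
// last sibling (pass the document's page count for the roots).
func deriveEndPages(siblings []*tocNode, parentEnd int) {
	for i, n := range siblings {
		end := parentEnd
		for _, next := range siblings[i+1:] {
			if next.StartPage > 0 { // skip siblings whose start page was cleared
				end = next.StartPage - 1
				break
			}
		}
		if n.StartPage > 0 && end > 0 {
			if end < n.StartPage {
				end = n.StartPage // next sibling starts on the same page
			}
			n.EndPage = end
		}
		deriveEndPages(n.Nodes, n.EndPage)
	}
}

// stampIDs derives each ID purely from Structure, so the same tree shape
// yields the same IDs on every re-ingestion.
func stampIDs(nodes []*tocNode) {
	for _, n := range nodes {
		n.ID = "toc-" + n.Structure
		stampIDs(n.Nodes)
	}
}

func main() {
	roots := []*tocNode{
		{Structure: "1", StartPage: 1, Nodes: []*tocNode{
			{Structure: "1.1", StartPage: 2},
			{Structure: "1.2", StartPage: 6},
		}},
		{Structure: "2", StartPage: 10},
	}
	deriveEndPages(roots, 20)
	stampIDs(roots)
	for _, n := range roots {
		fmt.Println(n.ID, n.StartPage, n.EndPage)
	}
}
```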

Opt-out

  • YAML: ingest.toc.enabled: false
  • Env: VLE_INGEST_TOC_ENABLED=false
  • Per-document: non-PDF inputs unconditionally skip the stage.

The stage is on by default for PDFs because the benefit-per-LLM-call ratio is excellent on the documents the engine targets (filings, manuals, papers), and if it regresses it can be disabled with a single config flag.
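
A minimal sketch of how defaults, environment overrides and validation might fit together. The defaults (enabled, concurrency 4, 20 pages) and the variable names follow the commit message for the config change; the `TOCConfig` struct, `applyTOCEnv`, and the injected `getenv` function are hypothetical and are not the actual pkg/config API.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// TOCConfig is a hypothetical mirror of the IngestConfig.TOC block.
type TOCConfig struct {
	Enabled       bool
	Model         string // empty means "fall back to another model" (assumption)
	Concurrency   int
	TOCCheckPages int
}

func defaultTOCConfig() TOCConfig {
	return TOCConfig{Enabled: true, Concurrency: 4, TOCCheckPages: 20}
}

// applyTOCEnv overlays VLE_INGEST_TOC_* variables on cfg and validates the result.
// getenv is injected so the function can be exercised without touching the process environment.
func applyTOCEnv(cfg TOCConfig, getenv func(string) string) (TOCConfig, error) {
	if v := getenv("VLE_INGEST_TOC_ENABLED"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("VLE_INGEST_TOC_ENABLED: %w", err)
		}
		cfg.Enabled = b
	}
	if v := getenv("VLE_INGEST_TOC_MODEL"); v != "" {
		cfg.Model = v
	}
	for name, dst := range map[string]*int{
		"VLE_INGEST_TOC_CONCURRENCY":     &cfg.Concurrency,
		"VLE_INGEST_TOC_TOC_CHECK_PAGES": &cfg.TOCCheckPages,
	} {
		if v := getenv(name); v != "" {
			n, err := strconv.Atoi(v)
			if err != nil {
				return cfg, fmt.Errorf("%s: %w", name, err)
			}
			*dst = n
		}
	}
	if cfg.Concurrency < 0 || cfg.TOCCheckPages < 0 {
		return cfg, errors.New("toc: concurrency and toc_check_pages must be non-negative")
	}
	return cfg, nil
}

func main() {
	cfg, err := applyTOCEnv(defaultTOCConfig(), os.Getenv)
	if err != nil {
		fmt.Println("invalid TOC config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```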

Files

  • pkg/db/migrations/0006_documents_toc_tree.{up,down}.sql — JSONB column.
  • pkg/tree/tree.go — additive TOCNode type mirroring PageIndex's JSON shape.
  • pkg/db/documents.go: Document.TOCTree field + UpdateDocumentTOCTree helper. SELECT lists updated to include the new column.
  • pkg/db/documents_marshal_test.go — round-trip + omit-empty-fields tests.
  • pkg/ingest/toc_builder.go — the three-phase builder + retry helper.
  • pkg/ingest/toc_builder_test.go — happy path / no-TOC / verify-repair / JSON retry / end-page derivation / hierarchy assembly / assemblePagesFromSections bridge / synthetic 10-K (4 top-level nodes).
  • pkg/ingest/ingest.go — one call site after summarize+HyDE; non-fatal.
  • pkg/config/config.go + pkg/config/config_test.go: IngestConfig.TOC block, defaults, env overrides, validation, default-values + env-override coverage.
  • cmd/server/main.go — pipeline literal updated.
  • config.example.yaml — documents the new block.

The retry-on-parse-failure helper is duplicated from pkg/retrieval/single_pass.go::runSelectionWithRetry (marked with a comment) so the builder doesn't drag the retrieval package into its dependency graph.

Test plan

  • go build ./... clean.
  • go vet ./... clean.
  • go test ./... all green (every existing package + new pkg/ingest TOC suite).
  • Migration is additive (ADD COLUMN IF NOT EXISTS).
  • Down migration drops the column.
  • Non-PDF documents bypass the builder entirely (ContentType gate).
  • Builder failures never fail ingest (logged + downgraded to NULL).
  • Mock LLM with scripted phase routing reproduces a synthetic 10-K with four top-level nodes.

Out of scope

  • pkg/retrieval/ — the strategy that consumes toc_tree is PR-B.
  • internal/api/ and cmd/engine/ — unchanged.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added AI-powered table-of-contents extraction for PDF documents during ingestion
    • Introduced configurable TOC settings including enablement, model selection, concurrency, and page scanning parameters
  • Tests

    • Added comprehensive test coverage for TOC extraction and building logic


Halleluyah Oladipo added 3 commits May 27, 2026 17:10
PR-A of the PageIndex-style redesign — schema and types only. The
LLM-driven TOC builder lands on top of this; persistence is wired
here so the builder can write its result independently of the
existing sections tree.

- 0006 migration adds the JSONB column (NULL until the builder
  runs; non-PDF docs keep it NULL forever).
- tree.TOCNode mirrors PageIndex's tree-output shape (structure /
  start_page / end_page / nodes) so external tooling that already
  speaks that vocabulary can interop without translation.
- Document.TOCTree stores the raw JSONB bytes so the column
  round-trips byte-identically.
- UpdateDocumentTOCTree mirrors UpdateSectionSummaryAxes — patch
  one column without touching the rest of the document row.

Ports the three-phase PageIndex pipeline into Go:

  1. detect    — single-page TOC detector over the first N pages.
  2. extract   — if a TOC page was found, parse it into nested
                 nodes; otherwise fall through to the no-TOC
                 path that generates a TOC straight from body
                 text tagged with <physical_index_X> markers.
  3. verify    — concurrently re-check every node's claimed
                 start page; mismatches clear the page back to
                 zero (the "unknown / open" sentinel) rather
                 than making one up.

End pages are derived from sibling ordering after verification.
Node IDs are stamped deterministically from the dotted
structure so callers can diff trees across re-ingestions.

The retry-on-parse-failure helper mirrors
pkg/retrieval/single_pass.go::runSelectionWithRetry; it is
duplicated rather than imported so the builder doesn't drag the
retrieval package into its dependency graph.

LLM parse blips degrade to "no usable nodes" with a logged
warning so a single bad response never fails ingest — the
document remains fully retrievable via the existing sections
tree.

Tests cover the happy path, the no-TOC path, verification
repair, JSON retry, end-page derivation, hierarchy assembly,
the <physical_index_X> tag parser, and the empty-input
short-circuit.
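
A small sketch of the `<physical_index_X>` convention: wrapping page text in tags before prompting, and reading a page number back out of a tag in the model's reply. The exact wrapping layout and the names `tagPages` and `parsePhysicalIndex` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var physicalIndexRe = regexp.MustCompile(`<physical_index_(\d+)>`)

// tagPages wraps each page (1-based) in physical-index markers so the model
// can cite page numbers that refer to the PDF's physical pages.
func tagPages(pages []string) string {
	var b strings.Builder
	for i, p := range pages {
		fmt.Fprintf(&b, "<physical_index_%d>\n%s\n<physical_index_%d>\n\n", i+1, p, i+1)
	}
	return b.String()
}

// parsePhysicalIndex extracts X from the first "<physical_index_X>" in s.
// It returns 0 (the "unknown" sentinel) when no usable tag is present.
func parsePhysicalIndex(s string) int {
	m := physicalIndexRe.FindStringSubmatch(s)
	if m == nil {
		return 0
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0 // e.g. digits that overflow int
	}
	return n
}

func main() {
	fmt.Print(tagPages([]string{"Annual Report", "Item 1. Business"}))
	fmt.Println(parsePhysicalIndex("<physical_index_2>"))
}
```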

- Adds IngestConfig.TOC (Enabled / Model / Concurrency /
  TOCCheckPages) with defaults (Enabled=true, Concurrency=4,
  TOCCheckPages=20), env overrides
  (VLE_INGEST_TOC_{ENABLED,MODEL,CONCURRENCY,TOC_CHECK_PAGES}),
  validation, and example-config documentation.
- Pipeline.Run calls the new builder after summarize+HyDE for
  PDF inputs, persists the result via UpdateDocumentTOCTree, and
  logs the LLM-call accounting. Failures are non-fatal — they
  leave documents.toc_tree NULL and the document remains fully
  retrievable via the existing sections tree.
- assemblePagesFromSections groups parsed sections by PageStart
  to reconstruct per-page text the builder can reason over.
  PageStart==0 sections are skipped so the builder never sees
  ambiguous page numbers.
- cmd/server wires the new config block into the pipeline literal.
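
The page-reconstruction bridge could be approximated as below. Only the grouping key (PageStart) and the skip rule for PageStart==0 come from the commit message; the `section` and `pageText` types, the `groupSectionsByPage` name, and the blank-line separator between sections are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// section is a hypothetical slice of the parser's section record.
type section struct {
	PageStart int // 0 = page unknown
	Text      string
}

// pageText is the reconstructed text of one physical page.
type pageText struct {
	Page int
	Text string
}

// groupSectionsByPage concatenates section text per starting page, in input
// order, and returns the pages sorted by page number.
func groupSectionsByPage(sections []section) []pageText {
	byPage := map[int][]string{}
	for _, s := range sections {
		if s.PageStart == 0 {
			continue // never hand the builder an ambiguous page number
		}
		byPage[s.PageStart] = append(byPage[s.PageStart], s.Text)
	}
	pages := make([]pageText, 0, len(byPage))
	for p, texts := range byPage {
		pages = append(pages, pageText{Page: p, Text: strings.Join(texts, "\n\n")})
	}
	sort.Slice(pages, func(i, j int) bool { return pages[i].Page < pages[j].Page })
	return pages
}

func main() {
	pages := groupSectionsByPage([]section{
		{PageStart: 2, Text: "Item 1. Business"},
		{PageStart: 0, Text: "floating footnote"},
		{PageStart: 1, Text: "Cover"},
		{PageStart: 2, Text: "Overview"},
	})
	for _, p := range pages {
		fmt.Printf("page %d: %q\n", p.Page, p.Text)
	}
}
```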

@hallelx2 hallelx2 merged commit 28ffc33 into main May 27, 2026
4 of 8 checks passed
@hallelx2 hallelx2 deleted the feat/toc-tree-builder branch May 27, 2026 16:26
coderabbitai Bot commented May 27, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 52b7381 and 4f1f49d.

📒 Files selected for processing (12)
  • cmd/server/main.go
  • config.example.yaml
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/db/documents.go
  • pkg/db/documents_marshal_test.go
  • pkg/db/migrations/0006_documents_toc_tree.down.sql
  • pkg/db/migrations/0006_documents_toc_tree.up.sql
  • pkg/ingest/ingest.go
  • pkg/ingest/toc_builder.go
  • pkg/ingest/toc_builder_test.go
  • pkg/tree/tree.go

Walkthrough

This PR introduces an LLM-driven table-of-contents builder for PDF documents during ingestion. It adds a TOCNode data model, database persistence, configuration management, a three-phase builder pipeline (detect → extract/generate → verify), and wires the builder into the ingest orchestration with non-fatal failure handling.

Changes

TOC Builder Feature

Layer: TOC Data Model and Database Persistence
File(s): pkg/tree/tree.go, pkg/db/migrations/0006_documents_toc_tree.*.sql, pkg/db/documents.go, pkg/db/documents_marshal_test.go
Summary: TOCNode represents a hierarchical table-of-contents entry with node ID, structure, title, page range, and optional summary. A new toc_tree JSONB column persists this in the documents table. Retrieval methods (GetDocument, ListDocuments) and a new UpdateDocumentTOCTree method handle JSONB serialization and deserialization. JSON marshaling tests validate the wire contract and confirm optional fields are omitted when zero.

Layer: TOC Configuration Schema and Defaults
File(s): pkg/config/config.go, config.example.yaml, pkg/config/config_test.go
Summary: TOCBlock configuration type includes enablement flag, optional model override, concurrency for verification, and toc_check_pages scan bound. Defaults enable the stage with concurrency 4 and 20-page scan. Environment variables VLE_INGEST_TOC_* allow runtime override. Configuration validation enforces non-negative bounds on numeric parameters.

Layer: LLM-Driven TOC Builder Pipeline
File(s): pkg/ingest/toc_builder.go, pkg/ingest/toc_builder_test.go
Summary: Three-phase LLM pipeline: (1) detect whether first N pages contain a TOC via single-page classifier, (2) if found, extract hierarchy from those pages; otherwise generate from body text, (3) concurrently verify leaf-node page titles and repair incorrect claims (clear to 0). Derives end-page ranges from sibling ordering, stamps deterministic node IDs, and aggregates LLM token/cost usage. Includes lenient JSON parsing, retry logic, and hierarchy assembly from dotted structure strings. Comprehensive tests cover detect/no-detect paths, verification repair, JSON retry, and helper functions.

Layer: Ingest Pipeline Orchestration and Server Wiring
File(s): pkg/ingest/ingest.go, cmd/server/main.go
Summary: Pipeline struct gains four TOC configuration fields. Run conditionally executes TOC building for PDFs after summarize/HyDE stages complete; failures are non-fatal and leave toc_tree as NULL. NewPipeline applies defaults to concurrency and page-scan bounds. runTOCBuilder assembles page text from parsed sections, selects the builder model (falling back to summary model), runs the builder, and persists results. Server startup wires TOC config into the Pipeline.
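
The "hierarchy assembly from dotted structure strings" mentioned in the builder summary can be illustrated with a short sketch. It assumes the flat list arrives in document order with parents before children, and it promotes items whose parent is missing to the top level; both choices, along with the `flatItem`, `treeNode` and `assembleHierarchy` names, are assumptions rather than documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// flatItem is one entry as a model might emit it: a dotted position plus a title.
type flatItem struct {
	Structure string // e.g. "1", "1.2", "1.2.3"
	Title     string
}

type treeNode struct {
	Structure string
	Title     string
	Nodes     []*treeNode
}

// assembleHierarchy nests items under the node whose Structure equals the
// item's Structure minus its last dotted component.
func assembleHierarchy(items []flatItem) []*treeNode {
	byStructure := map[string]*treeNode{}
	var roots []*treeNode
	for _, it := range items {
		n := &treeNode{Structure: it.Structure, Title: it.Title}
		byStructure[it.Structure] = n

		parentKey := ""
		if i := strings.LastIndex(it.Structure, "."); i >= 0 {
			parentKey = it.Structure[:i]
		}
		if parent, ok := byStructure[parentKey]; ok && parentKey != "" {
			parent.Nodes = append(parent.Nodes, n)
		} else {
			roots = append(roots, n) // top-level entry, or orphan with no known parent
		}
	}
	return roots
}

func main() {
	roots := assembleHierarchy([]flatItem{
		{"1", "Business"},
		{"1.1", "Overview"},
		{"1.2", "Competition"},
		{"2", "Risk Factors"},
	})
	for _, r := range roots {
		fmt.Println(r.Structure, r.Title, len(r.Nodes))
	}
}
```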

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes
