fix(ingest): total-parse timeout + working section cap + pdftable v0.3.1 (robust full-feature ingest)#31
Conversation
v0.3.1 ships the grid-indexed cell finder (the O(n^2/n^3) -> O(cells) fix), so table extraction no longer degrades pathologically on dense financial pages. Full table-extraction feature stays on; this only makes it faster/bounded.
A 10-K was observed hanging 600s+ in `parsing` even in minimal mode: the hang is in ledongthuc row extraction (extractPDFRows -> reader.Page(n).Content()), which is pure-Go and pre-LLM, so none of the existing per-stage table-extraction budgets bound it. Wrap the entire PDF parse (row extraction, table extraction, section building, leaf cap) in a deadline. The work runs on a goroutine; on timeout or ctx cancellation Parse returns a clear error and abandons the goroutine (buffered result channel, so no leak on send; a panic in the work is recovered). This is the same abandon-on-deadline pattern safeExtractTables already uses for a single table page, lifted to cover the whole parse so ANY parse pathology fails fast and cleanly instead of wedging ingest. The ingest pipeline already treats a parse error as a doc-level failure, so the document goes to `failed` and is visible to ops/bench rather than hanging forever. Nothing is disabled: the full feature set (LLM TOC, tables, summarize, HyDE, multi-axis) still runs — parse is merely bounded. Config: IngestConfig.ParseTimeoutSeconds (default 120). Env VLE_INGEST_PARSE_TIMEOUT_SECONDS; the server binary forwards VLS_/VLE_INGEST_PARSE_TIMEOUT_SECONDS. Also threads the existing ingest.max_sections through to the parser (previously dropped on the floor) via RegistryFromIngestParams, so both robustness valves are operator-tunable. A negative value disables the bound (escape hatch). Tests: the deadline mechanism returns a timeout error in ~the timeout (not after a 10s sleep), passes fast work through, propagates real parse errors, runs inline when disabled, honours ctx cancel, and recovers panics; plus config default/env-override/validate coverage.
The cap only merged ADJACENT leaf siblings, but the real 10-K explosion
is hundreds of SINGLE-LEAF PARENTS (heading -> one body leaf) that have
no adjacent leaf-sibling pairs at all — so the cap silently did nothing
and a 92-page filing sailed past the 400 cap at 463-1465 leaves, each
costing a summarize + HyDE + multi-axis LLM call (the throughput killer
for full ingest).
Fix it in two phases:
1. collapseSingleLeafParents flattens every heading -> lone-leaf chain
(bottom-up, so deep chains fold in one pass) so the formerly
only-children become adjacent leaf siblings. Count is unchanged; the
parent absorbs the child's content + page range and becomes the leaf.
2. The existing smallest-first adjacent-pair merge then reduces the
count to the cap, with a defensive single-leaf-parent collapse for
any pair a table leaf blocked.
The top-level sections are wrapped under a synthetic root so the merge
step — which shrinks a sibling list by rewriting parent.Children — can
shrink the TOP-level list too (a bare slice parameter would not
propagate the shrink back). This is what makes the invariant hold for a
flat list of single-leaf parents.
Invariant: for any tree with > N mergeable leaves, capLeafSections drives
the leaf count to <= N. Content is always preserved (concatenated,
page ranges unioned) and table sections (Metadata["table"]=="true") are
never merged or collapsed. Nothing is disabled — the full section tree is
still produced; it's just bounded.
Tests: 1000 single-leaf-parents -> <= 400 with no content loss (the case
the old cap failed and that let 1465 through), deep heading->subheading->
body chains, a mixed flat/single-leaf/multi-child tree, and a
table-protection test asserting table leaves survive verbatim.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (11)
📝 WalkthroughWalkthroughThis PR introduces configurable parse-timeout and leaf-section limits to the PDF ingest pipeline. Configuration adds ChangesParse Timeout Configuration & Defaults
Ingest Pipeline Wiring
PDF Parser Timeout Implementation
Leaf-Section Capping Algorithm Rewrite
Possibly related PRs
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Hardens PDF ingest so the full feature set stays ON (LLM-built TOC, table extraction, summarize, HyDE, multi-axis — all of it) but can no longer hang. Nothing is disabled; the pipeline is bounded and the runaway-section path is fixed.
Three changes:
1. Bump
pdftablev0.3.0 → v0.3.1v0.3.1 ships the grid-indexed cell finder (the O(n²/n³)→O(cells) fix), so table extraction no longer degrades pathologically on dense financial pages. Table extraction stays enabled; it's just faster/bounded.
2. Total-parse timeout (the key robustness fix)
A 10-K (ACTIVISIONBLIZZARD_2019_10K) was observed hanging 600s+ in
parsingeven in minimal mode — the hang is in ledongthuc row extraction (extractPDFRows→reader.Page(n).Content()), which is pure-Go and pre-LLM, so none of the existing per-stage table-extraction budgets bound it.The entire PDF parse (row extraction → table extraction → section building → leaf cap) is now wrapped in a configurable deadline. The work runs on a goroutine; on timeout or ctx cancellation
Parsereturns a clear error (pdf: parse exceeded <timeout> — document too complex or malformed) and abandons the goroutine (buffered result channel, no leak on send; a panic in the work is recovered). Same abandon-on-deadline patternsafeExtractTablesalready uses for one table page, lifted to cover the whole parse. The ingest pipeline already treats a parse error as a doc-level failure, so a doc that can't parse fast goes tofailedand is visible to ops/bench instead of wedging forever.ingest.parse_timeout_seconds(default 120). EnvVLE_INGEST_PARSE_TIMEOUT_SECONDS; the server binary forwardsVLS_/VLE_INGEST_PARSE_TIMEOUT_SECONDS.ingest.max_sectionsthrough to the parser (previously dropped on the floor) viaRegistryFromIngestParams, so both robustness valves are operator-tunable.3. Fix the section-cap merge bug
capLeafSectionsonly merged adjacent leaf siblings, but the real 10-K explosion is hundreds of single-leaf parents (heading → one body leaf) with no adjacent leaf-sibling pairs — so the cap silently did nothing and a 92-page filing sailed past the 400 cap at 463-1465 leaves (each one a summarize + HyDE + multi-axis LLM call, the throughput killer).Now reduces any tree shape:
collapseSingleLeafParentsflattens heading→lone-leaf chains (bottom-up) so only-children become adjacent siblings (count unchanged; parent absorbs the leaf's content + page range).Invariant: for any tree with > N mergeable leaves,
capLeafSectionsdrives the leaf count to ≤ N. Content is always preserved (concatenated, page ranges unioned); table sections (Metadata["table"]=="true") are never merged or collapsed.Nothing is turned off. The full enrichment pipeline (TOC, tables, summarize, HyDE, multi-axis) still runs end to end — parse is merely bounded and the section count is correctly capped.
Test plan
go build ./...green (both binaries)go vet ./...cleango test ./...green (all packages)Parsetest with a 1ms deadline on a real PDF.parse_timeout_seconds.pdftablev0.3.1 resolves cleanly and pins in go.mod/go.sum.config.example.yamldocumentsingest.parse_timeout_seconds(andingest.max_sections).Summary by CodeRabbit
New Features
Chores