fix(ingest): total-parse timeout + working section cap + pdftable v0.3.1 (robust full-feature ingest) by hallelx2 · Pull Request #31 · hallelx2/vectorless-engine

hallelx2 · 2026-05-29T16:58:46Z

Summary

Hardens PDF ingest so the full feature set stays ON (LLM-built TOC, table extraction, summarize, HyDE, multi-axis — all of it) but can no longer hang. Nothing is disabled; the pipeline is bounded and the runaway-section path is fixed.

Three changes:

1. Bump `pdftable` v0.3.0 → v0.3.1

v0.3.1 ships the grid-indexed cell finder (the O(n²/n³)→O(cells) fix), so table extraction no longer degrades pathologically on dense financial pages. Table extraction stays enabled; it's just faster/bounded.

2. Total-parse timeout (the key robustness fix)

A 10-K (ACTIVISIONBLIZZARD_2019_10K) was observed hanging 600s+ in parsing even in minimal mode — the hang is in ledongthuc row extraction (extractPDFRows → reader.Page(n).Content()), which is pure-Go and pre-LLM, so none of the existing per-stage table-extraction budgets bound it.

The entire PDF parse (row extraction → table extraction → section building → leaf cap) is now wrapped in a configurable deadline. The work runs on a goroutine; on timeout or ctx cancellation Parse returns a clear error (pdf: parse exceeded <timeout> — document too complex or malformed) and abandons the goroutine (buffered result channel, no leak on send; a panic in the work is recovered). Same abandon-on-deadline pattern safeExtractTables already uses for one table page, lifted to cover the whole parse. The ingest pipeline already treats a parse error as a doc-level failure, so a doc that can't parse fast goes to failed and is visible to ops/bench instead of wedging forever.
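The abandon-on-deadline pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `runWithDeadline`, `parseResult`, `errParseTimeout`, and the `demo*` helpers are hypothetical names, the document is stood in for by a string, and the sketch assumes the default timeout has already been resolved before the call (a non-positive timeout simply runs inline here).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errParseTimeout is a hypothetical sentinel standing in for the PR's
// "pdf: parse exceeded <timeout>" error.
var errParseTimeout = errors.New("pdf: parse exceeded deadline")

type parseResult struct {
	doc string
	err error
}

// runWithDeadline runs work on its own goroutine and returns as soon as the
// work finishes, the timer fires, or ctx is cancelled.
// A non-positive timeout runs the work inline with no bound.
func runWithDeadline(ctx context.Context, timeout time.Duration, work func() (string, error)) (string, error) {
	if timeout <= 0 {
		return work()
	}
	// Capacity 1: an abandoned goroutine can still send its result and exit,
	// so nothing blocks forever on the send.
	ch := make(chan parseResult, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				ch <- parseResult{err: fmt.Errorf("pdf: parse panicked: %v", r)}
			}
		}()
		doc, err := work()
		ch <- parseResult{doc: doc, err: err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case r := <-ch:
		return r.doc, r.err
	case <-timer.C:
		return "", fmt.Errorf("%w (limit %s)", errParseTimeout, timeout)
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demoTimeout simulates a hang: the work sleeps far longer than the deadline.
func demoTimeout() error {
	_, err := runWithDeadline(context.Background(), 20*time.Millisecond, func() (string, error) {
		time.Sleep(2 * time.Second)
		return "late", nil
	})
	return err
}

// demoPanic simulates a parser panic, which comes back as an ordinary error.
func demoPanic() error {
	_, err := runWithDeadline(context.Background(), time.Second, func() (string, error) {
		panic("corrupt page stream")
	})
	return err
}

// demoFast shows that quick work passes straight through.
func demoFast() string {
	doc, _ := runWithDeadline(context.Background(), time.Second, func() (string, error) {
		return "parsed", nil
	})
	return doc
}

func isTimeout(err error) bool { return errors.Is(err, errParseTimeout) }

func main() {
	fmt.Println(demoTimeout())
	fmt.Println(demoPanic())
	fmt.Println(demoFast())
}
```

Note that the abandoned goroutine keeps running until the underlying work returns; the pattern bounds the caller's wait, not the CPU spent by the stuck parse.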

Config: ingest.parse_timeout_seconds (default 120). Env VLE_INGEST_PARSE_TIMEOUT_SECONDS; the server binary forwards VLS_/VLE_INGEST_PARSE_TIMEOUT_SECONDS.
Also threads the existing ingest.max_sections through to the parser (previously dropped on the floor) via RegistryFromIngestParams, so both robustness valves are operator-tunable.
A negative value disables the bound (escape hatch / legacy unbounded behaviour).
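As a rough sketch of how these settings could resolve to a parser-level duration: zero selects the 120s default and a negative value disables the bound, matching the behaviour stated above. The function names and the `(duration, bounded)` return shape are assumptions for illustration; only the env var name and the default come from this PR. The test plan also mentions config validation rejecting negatives, so this sketch covers the parser-level semantics only.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultParseTimeout matches the documented default of 120 seconds.
const defaultParseTimeout = 120 * time.Second

// resolveParseTimeout (hypothetical name) maps a configured value to an
// effective deadline: zero -> default, negative -> unbounded.
func resolveParseTimeout(configured time.Duration) (d time.Duration, bounded bool) {
	switch {
	case configured < 0:
		return 0, false
	case configured == 0:
		return defaultParseTimeout, true
	default:
		return configured, true
	}
}

// parseTimeoutFromEnv (hypothetical name) reads VLE_INGEST_PARSE_TIMEOUT_SECONDS
// and falls back to fallbackSeconds when the variable is unset or not an integer.
func parseTimeoutFromEnv(fallbackSeconds int) time.Duration {
	if v, ok := os.LookupEnv("VLE_INGEST_PARSE_TIMEOUT_SECONDS"); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return time.Duration(n) * time.Second
		}
	}
	return time.Duration(fallbackSeconds) * time.Second
}

// resolvedSeconds is a small convenience wrapper for whole-second inputs.
func resolvedSeconds(sec int) (int, bool) {
	d, bounded := resolveParseTimeout(time.Duration(sec) * time.Second)
	return int(d / time.Second), bounded
}

func main() {
	d, bounded := resolveParseTimeout(parseTimeoutFromEnv(0))
	fmt.Println(d, bounded)
}
```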

3. Fix the section-cap merge bug

capLeafSections only merged adjacent leaf siblings, but the real 10-K explosion is hundreds of single-leaf parents (heading → one body leaf) with no adjacent leaf-sibling pairs — so the cap silently did nothing and a 92-page filing sailed past the 400 cap at 463-1465 leaves (each one a summarize + HyDE + multi-axis LLM call, the throughput killer).

Now reduces any tree shape:

Phase 1: collapseSingleLeafParents flattens heading→lone-leaf chains (bottom-up) so only-children become adjacent siblings (count unchanged; parent absorbs the leaf's content + page range).
Phase 2: the existing smallest-first adjacent-pair merge reduces to the cap, with a defensive collapse for any pair a table leaf blocked.
Top-level sections are wrapped under a synthetic root so the merge step can shrink the top-level sibling list too (a bare slice parameter would not propagate the shrink back; this was also why the naive fix spun 90-140s on the guard).
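Phase 1 can be illustrated with a small sketch. The `Section` type below is a hypothetical stand-in for the parser's section node, and this `collapseSingleLeafParents` is an assumed shape rather than the PR's implementation. It folds each heading with a lone non-table leaf into a leaf, working bottom-up, so the leaf count stays the same while the former only-children become adjacent siblings.

```go
package main

import "fmt"

// Section is a hypothetical stand-in for the parser's section node.
type Section struct {
	Title              string
	Content            string
	PageStart, PageEnd int
	Metadata           map[string]string
	Children           []*Section
}

func isLeaf(s *Section) bool  { return len(s.Children) == 0 }
func isTable(s *Section) bool { return s.Metadata["table"] == "true" }

// absorb moves the lone leaf's content into its parent and unions page ranges.
func absorb(parent, leaf *Section) {
	if parent.Content != "" && leaf.Content != "" {
		parent.Content += "\n"
	}
	parent.Content += leaf.Content
	if leaf.PageStart < parent.PageStart {
		parent.PageStart = leaf.PageStart
	}
	if leaf.PageEnd > parent.PageEnd {
		parent.PageEnd = leaf.PageEnd
	}
	parent.Children = nil
}

// collapseSingleLeafParents collapses descendants of parent (never parent
// itself, so a synthetic root survives). Recursing before checking each child
// makes deep heading -> subheading -> body chains fold in a single pass.
func collapseSingleLeafParents(parent *Section) {
	for _, c := range parent.Children {
		collapseSingleLeafParents(c)
		if len(c.Children) == 1 && isLeaf(c.Children[0]) && !isTable(c.Children[0]) {
			absorb(c, c.Children[0])
		}
	}
}

func countLeaves(s *Section) int {
	if isLeaf(s) {
		return 1
	}
	n := 0
	for _, c := range s.Children {
		n += countLeaves(c)
	}
	return n
}

// singleLeafParents builds n "heading -> one body leaf" pairs under a root.
func singleLeafParents(n int) *Section {
	root := &Section{Title: "synthetic root"}
	for i := 0; i < n; i++ {
		body := &Section{Content: fmt.Sprintf("body %d", i), PageStart: i + 1, PageEnd: i + 1}
		head := &Section{Title: fmt.Sprintf("Heading %d", i), PageStart: i + 1, PageEnd: i + 1,
			Children: []*Section{body}}
		root.Children = append(root.Children, head)
	}
	return root
}

func main() {
	root := singleLeafParents(1000)
	before := countLeaves(root)
	collapseSingleLeafParents(root)
	fmt.Println(before, countLeaves(root), isLeaf(root.Children[0]))
}
```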

Invariant: for any tree with > N mergeable leaves, capLeafSections drives the leaf count to ≤ N. Content is always preserved (concatenated, page ranges unioned); table sections (Metadata["table"]=="true") are never merged or collapsed.
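The merge phase and the role of the synthetic root might look roughly like the sketch below, under simplifying assumptions: the input is already a flat list of leaves (as after Phase 1), "smallest" is measured by combined content length, and the loop simply stops if table leaves block every remaining pair. `Section`, `mergeSmallestPair`, and `capFlat` are hypothetical names, not the PR's code.

```go
package main

import "fmt"

// Section is a hypothetical stand-in for the parser's section node.
type Section struct {
	Content            string
	PageStart, PageEnd int
	Metadata           map[string]string
	Children           []*Section
}

func isTable(s *Section) bool   { return s.Metadata["table"] == "true" }
func mergeable(s *Section) bool { return len(s.Children) == 0 && !isTable(s) }

// mergeSmallestPair merges the adjacent mergeable pair with the smallest
// combined content. It reports false when no eligible pair exists.
func mergeSmallestPair(parent *Section) bool {
	best, bestSize := -1, 0
	for i := 0; i+1 < len(parent.Children); i++ {
		a, b := parent.Children[i], parent.Children[i+1]
		if !mergeable(a) || !mergeable(b) {
			continue // table leaves are never merged
		}
		if size := len(a.Content) + len(b.Content); best == -1 || size < bestSize {
			best, bestSize = i, size
		}
	}
	if best == -1 {
		return false
	}
	a, b := parent.Children[best], parent.Children[best+1]
	a.Content += "\n" + b.Content // content concatenated, never dropped
	if b.PageStart < a.PageStart {
		a.PageStart = b.PageStart
	}
	if b.PageEnd > a.PageEnd {
		a.PageEnd = b.PageEnd
	}
	// The shrink is recorded by rewriting parent.Children. A caller holding
	// only a copy of the slice header would still see the old length.
	parent.Children = append(parent.Children[:best+1], parent.Children[best+2:]...)
	return true
}

// capFlat wraps the top-level list in a synthetic root so the shrink has a
// parent to live in, then returns the shortened list.
func capFlat(top []*Section, max int) []*Section {
	root := &Section{Children: top}
	for len(root.Children) > max && mergeSmallestPair(root) {
	}
	return root.Children
}

// demoLeaves builds n leaves with one table leaf at index tableAt.
func demoLeaves(n, tableAt int) []*Section {
	out := make([]*Section, 0, n)
	for i := 0; i < n; i++ {
		s := &Section{Content: fmt.Sprintf("body %d", i), PageStart: i + 1, PageEnd: i + 1}
		if i == tableAt {
			s.Content = "TABLE"
			s.Metadata = map[string]string{"table": "true"}
		}
		out = append(out, s)
	}
	return out
}

func totalLen(ss []*Section) int {
	n := 0
	for _, s := range ss {
		n += len(s.Content)
	}
	return n
}

func countVerbatimTables(ss []*Section) int {
	n := 0
	for _, s := range ss {
		if isTable(s) && s.Content == "TABLE" {
			n++
		}
	}
	return n
}

func main() {
	out := capFlat(demoLeaves(1000, 10), 400)
	fmt.Println(len(out), countVerbatimTables(out))
}
```

Because `capFlat` mutates the backing array of its input, callers should use the returned slice and discard the original.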

Nothing is turned off. The full enrichment pipeline (TOC, tables, summarize, HyDE, multi-axis) still runs end to end — parse is merely bounded and the section count is correctly capped.

Test plan

go build ./... green (both binaries)
go vet ./... clean
go test ./... green (all packages)
Total-parse-timeout tests: the deadline mechanism returns a timeout error after roughly the timeout duration (the work sleeps 10s, the call returns in ~0.05s, which proves it does NOT wait out the hang), passes fast work through, propagates real parse errors, runs inline when disabled, honours ctx cancel, and recovers panics; plus a full-Parse test with a 1ms deadline on a real PDF.
Cap tests: 1000 single-leaf-parents → ≤ 400 with no content loss (the exact case the old cap failed and that let 1465 through), deep heading→subheading→body chains, mixed flat/single-leaf/multi-child tree, and a table-protection test asserting table leaves survive verbatim.
Config: default (120) / env-override / validate-rejects-negative coverage for parse_timeout_seconds.
pdftable v0.3.1 resolves cleanly and pins in go.mod/go.sum.
config.example.yaml documents ingest.parse_timeout_seconds (and ingest.max_sections).

Summary by CodeRabbit

New Features
- Added configurable parse timeout limit for PDF documents (default 120 seconds)
- Added configurable maximum section limit to manage document processing complexity (default 400 sections)
Chores
- Updated dependencies

v0.3.1 ships the grid-indexed cell finder (the O(n^2/n^3) -> O(cells) fix), so table extraction no longer degrades pathologically on dense financial pages. Full table-extraction feature stays on; this only makes it faster/bounded.

A 10-K was observed hanging 600s+ in `parsing` even in minimal mode: the hang is in ledongthuc row extraction (extractPDFRows -> reader.Page(n).Content()), which is pure-Go and pre-LLM, so none of the existing per-stage table-extraction budgets bound it.

Wrap the entire PDF parse (row extraction, table extraction, section building, leaf cap) in a deadline. The work runs on a goroutine; on timeout or ctx cancellation Parse returns a clear error and abandons the goroutine (buffered result channel, so no leak on send; a panic in the work is recovered). This is the same abandon-on-deadline pattern safeExtractTables already uses for a single table page, lifted to cover the whole parse so ANY parse pathology fails fast and cleanly instead of wedging ingest. The ingest pipeline already treats a parse error as a doc-level failure, so the document goes to `failed` and is visible to ops/bench rather than hanging forever.

Nothing is disabled: the full feature set (LLM TOC, tables, summarize, HyDE, multi-axis) still runs; parse is merely bounded.

Config: IngestConfig.ParseTimeoutSeconds (default 120). Env VLE_INGEST_PARSE_TIMEOUT_SECONDS; the server binary forwards VLS_/VLE_INGEST_PARSE_TIMEOUT_SECONDS. Also threads the existing ingest.max_sections through to the parser (previously dropped on the floor) via RegistryFromIngestParams, so both robustness valves are operator-tunable. A negative value disables the bound (escape hatch).

Tests: the deadline mechanism returns a timeout error in ~the timeout (not after a 10s sleep), passes fast work through, propagates real parse errors, runs inline when disabled, honours ctx cancel, and recovers panics; plus config default/env-override/validate coverage.

The cap only merged ADJACENT leaf siblings, but the real 10-K explosion is hundreds of SINGLE-LEAF PARENTS (heading -> one body leaf) that have no adjacent leaf-sibling pairs at all, so the cap silently did nothing and a 92-page filing sailed past the 400 cap at 463-1465 leaves, each costing a summarize + HyDE + multi-axis LLM call (the throughput killer for full ingest).

Fix it in two phases:

1. collapseSingleLeafParents flattens every heading -> lone-leaf chain (bottom-up, so deep chains fold in one pass) so the formerly only-children become adjacent leaf siblings. Count is unchanged; the parent absorbs the child's content + page range and becomes the leaf.
2. The existing smallest-first adjacent-pair merge then reduces the count to the cap, with a defensive single-leaf-parent collapse for any pair a table leaf blocked.

The top-level sections are wrapped under a synthetic root so the merge step, which shrinks a sibling list by rewriting parent.Children, can shrink the TOP-level list too (a bare slice parameter would not propagate the shrink back). This is what makes the invariant hold for a flat list of single-leaf parents.

Invariant: for any tree with > N mergeable leaves, capLeafSections drives the leaf count to <= N. Content is always preserved (concatenated, page ranges unioned) and table sections (Metadata["table"]=="true") are never merged or collapsed. Nothing is disabled: the full section tree is still produced; it's just bounded.

Tests: 1000 single-leaf-parents -> <= 400 with no content loss (the case the old cap failed and that let 1465 through), deep heading -> subheading -> body chains, a mixed flat/single-leaf/multi-child tree, and a table-protection test asserting table leaves survive verbatim.


coderabbitai · 2026-05-29T16:58:59Z


📥 Commits

Reviewing files that changed from the base of the PR and between dfc1c45 and 9d8c7b4.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (11)

cmd/engine/main.go
cmd/server/main.go
config.example.yaml
go.mod
internal/config/config.go
pkg/config/config.go
pkg/config/config_test.go
pkg/ingest/ingest.go
pkg/parser/cap_test.go
pkg/parser/pdf.go
pkg/parser/pdf_parse_timeout_test.go

📝 Walkthrough

This PR introduces configurable parse-timeout and leaf-section limits to the PDF ingest pipeline. Configuration adds ParseTimeoutSeconds (default 120s) and MaxSections (default 400) with environment override support; the pipeline routes these to a new RegistryFromIngestParams constructor that wires the PDF parser with deadline enforcement and capping. The parser implements timeout via goroutine-based deadline wrapper with panic recovery; leaf-capping is rewritten to reliably handle complex outline trees by collapsing single-leaf parents then merging smallest pairs, while preserving table-marked sections.

Changes

Parse Timeout Configuration & Defaults

| Layer / File(s) | Summary |
| --- | --- |
| Configuration schema and environment handling `pkg/config/config.go`, `internal/config/config.go`, `go.mod` | `IngestConfig` adds `ParseTimeoutSeconds` field with 120s default and YAML binding; `applyEnvOverrides` handles `VLE_INGEST_PARSE_TIMEOUT_SECONDS` environment variable; validation rejects negative values; `pdftable` dependency updated to v0.3.1. |
| Configuration tests and documentation `pkg/config/config_test.go`, `config.example.yaml` | `TestDefaultValues` asserts 120s timeout and 400 max-sections defaults; new tests validate environment override and validation error paths; example YAML documents both fields with enforcement semantics. |

Ingest Pipeline Wiring

| Layer / File(s) | Summary |
| --- | --- |
| Registry factory and parameterized construction `pkg/ingest/ingest.go` | New `RegistryFromIngestParams(opts, maxSections, parseTimeout)` builds parser.Registry with PDF parser created via `NewPDFWithConfig`; `RegistryFromTableOpts` delegates to it with zero values for backward compatibility. |
| Application entry-point pipeline wiring `cmd/engine/main.go`, `cmd/server/main.go` | Ingest pipeline switches from `RegistryFromTableOpts` to `RegistryFromIngestParams`, passing config-derived `MaxSections` and `ParseTimeoutSeconds` converted to time.Duration. |

PDF Parser Timeout Implementation

| Layer / File(s) | Summary |
| --- | --- |
| ParseTimeout field and timeout resolution `pkg/parser/pdf.go` | `PDF` struct adds `ParseTimeout` field; `resolvedParseTimeout()` selects 120s default when zero, disables when negative; `NewPDFWithConfig(opts, maxSections, parseTimeout)` constructor wires both bounds explicitly. |
| Deadline wrapper and Parse rewrite `pkg/parser/pdf.go` | `Parse` method wraps full read/parse work in `runParseWithDeadline`; goroutine executes `parseDoc`, select on result/timeout/context-cancel, abandons goroutine on timeout using buffered channel, recovers panics as errors. |
| Timeout and deadline tests `pkg/parser/pdf_parse_timeout_test.go` | Tests cover timeout enforcement (error, no hang), fast completion, work-error propagation, disabled timeout (inline), context cancellation, panic recovery, timeout resolution logic, and end-to-end integration with 1ms timeout on real PDF fixture. |

Leaf-Section Capping Algorithm Rewrite

| Layer / File(s) | Summary |
| --- | --- |
| Two-phase leaf-capping and helpers `pkg/parser/pdf.go` | `capLeafSections` rewritten to first collapse single-leaf parent chains into mergeable siblings (skipping table leaves), then iteratively merge smallest adjacent leaf pairs until within cap; new helpers detect table leaves, absorb collapsed content/ranges, exclude table leaves from merge eligibility. |
| Leaf-capping test suite `pkg/parser/cap_test.go` | `singleLeafParentTree` helper builds deterministic outline models with single-leaf parents; new tests validate collapse behavior, deep chain flattening, mixed-shape reduction, and table-leaf preservation even under restrictive caps. |

Possibly related PRs

hallelx2/vectorless-engine#30: Modifies ingest pipeline wiring to support minimal mode with altered parser registry setup and disabled table extraction.
hallelx2/vectorless-engine#23: Changes PDF.Parse pipeline at the primitive glyph-to-row extraction and reader error handling layer.
hallelx2/vectorless-engine#12: Modifies (*PDF).Parse and downstream section/leaf post-processing for heading detection and oversized leaf splitting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 A Parse Timeout Tale

With bounds to check and deadlines near,
Leaf sections cap—no runaway fear!
Goroutines race, then timeout calls,
Panic's caught before it falls. ⏰



hallelx2 added 3 commits May 29, 2026 17:13

build(deps): bump pdftable to v0.3.1

084abc8



hallelx2 merged commit cbd46f5 into main May 29, 2026
3 of 9 checks passed

hallelx2 deleted the feat/parse-robustness branch May 29, 2026 17:00
