Skip to content

feat: pdftable integration - table-aware ingest (Phase 1.3.F)#20

Merged
hallelx2 merged 5 commits into
mainfrom
feat/pdftable-integration
May 27, 2026
Merged

feat: pdftable integration - table-aware ingest (Phase 1.3.F)#20
hallelx2 merged 5 commits into
mainfrom
feat/pdftable-integration

Conversation

@hallelx2
Copy link
Copy Markdown
Owner

Summary

  • Swap the PDF parser's primitive layer from ledongthuc/pdf glyph reassembly to pdftable v0.3.0, which already ships pdfplumber-parity word grouping, letter-spacing collapse, and ligature expansion.
  • Add a new table-extraction stage: every table pdftable detects on a page becomes its own Section flagged with Metadata["table"]="true" and content rendered as GitHub-flavoured Markdown. Tables are wrapped under a synthetic "Tables" container at the document root so the prose outline order stays untouched.
  • Expose the knobs as ingest.tables.{enabled,vertical_strategy,horizontal_strategy,min_table_rows,min_table_cols} with matching VLE_INGEST_TABLES_* env overrides. Default on; one flip disables the integration entirely.

Design rationale

PDF is a layout format — every numeric question in a 10-K balance sheet lives in a ruled table that text-only extraction collapses into a space-joined run. pdftable's lines strategy reconstructs those tables from drawn rules; the text strategy handles borderless / narrative tables; both axes can be configured independently. The parser keeps the existing font-size + bold heuristics and outline-driven structure recovery (the latter still backed by ledongthuc/pdf since pdftable doesn't surface /Outlines yet).

The ledongthuc/pdf dependency stays for outline access only, which is graceful: if the secondary reader can't open a PDF that pdftable accepts, the engine falls through to the heuristic path without failing ingest.

Risk envelope

  • Tables on by default — tables.enabled: false is a one-line rollback.
  • pdftable errors and panics are caught: safeExtractTables wraps every page.ExtractTables in recover() plus error logging at warn. Ingest produces text-only output in the worst case.
  • Validation rejects unknown strategy names + negative minima at startup.
  • Encrypted PDFs are still auto-decrypted via pdfcpu's empty-password path; the encryption sentinel is matched via errors.Is(err, pdftable.ErrEncrypted).

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test ./... all green (incl. five new parser tests, three new config tests)
  • Manual smoke against 3M_2023Q2_10Q.pdf via VLE_TEST_FILING_PDF -> 62 table sections across 38 distinct pages ingested cleanly
  • Corrupt PDF fixture returns a wrapped pdf: open: error rather than panicking
  • tables.enabled: false produces no table sections and no "Tables" container, all other sections still emit

Before / after snippet

A 3x4 ruled table on the issue-466 fixture used to leak into the document outline as one prose run:

T0-C0 T0-C1 T0-C2 T0-00 T0-01 T0-02 T0-10 T0-11 T0-12 T0-20-last ...

After the integration it surfaces as its own section:

Title: "Table (page 1)"
Metadata:
  table: "true"
  rows: "4"
  cols: "3"
Content: |
  | T0-C0 | T0-C1 | T0-C2 |
  | --- | --- | --- |
  | T0-00 | T0-01 | T0-02 |
  | T0-10 | T0-11 | T0-12 |
  | T0-20-last | T0-21-last | T0-22-last |

Notes

  • go.mod carries a replace github.com/hallelx2/pdftable => ../pdftable directive until pdftable v0.3.0 is tagged on the remote. Strip the directive in a follow-up commit once go get github.com/hallelx2/pdftable@v0.3.0 resolves cleanly.
  • The fixture (pkg/parser/testdata/tables-example.pdf, 13 KB) is the golden two-table PDF from pdftable's testdata.
  • Do not merge until benchmark validation lands on top.

hallelx2 added 3 commits May 27, 2026 02:44
Replace the ledongthuc/pdf glyph-reassembly path with pdftable's
positioned-word extractor and add a new table-aware extraction stage
that emits each detected table as its own Section flagged with
Metadata["table"]="true".

- pdftable.Page.Words() handles intra-word glyph reassembly,
  letter-spacing collapse, and ligature expansion natively. The
  bespoke collapseLetterSpacing / looksLetterSpaced / multiSpaceRe
  helpers are deleted (handled by pdftable's WordOpts).
- The engine still uses ledongthuc/pdf solely for /Outlines access —
  pdftable does not yet expose the outline dictionary. Outline-driven
  parsing degrades gracefully when ledongthuc fails on a PDF that
  pdftable accepts.
- Encrypted PDFs are detected via pdftable.ErrEncrypted and routed
  through pdfcpu's empty-password decryptor as before.
- Table extraction runs after section building; tables are wrapped
  under a synthetic "Tables" container at the document root so the
  prose outline order stays untouched. Markdown rendering escapes
  pipes and collapses embedded newlines to keep GFM well-formed.
- Resilience: every page.ExtractTables() call is wrapped in
  safeExtractTables (recover()) and errors are logged-and-swallowed.
  pdftable cannot break ingest under any condition.

On the 3M 2023Q2 10-Q this surfaces 62 table sections across 38
distinct pages — content that previously collapsed into space-joined
runs and was effectively unsearchable.
Surface pdftable's table-extraction knobs through the engine config so
operators can flip strategies / minima / kill-switch without code
changes.

- IngestConfig.Tables (yaml: ingest.tables) with Enabled (default
  true), VerticalStrategy / HorizontalStrategy ("lines" defaults),
  MinTableRows / MinTableCols (2 / 2 floor).
- VLE_INGEST_TABLES_ENABLED, VLE_INGEST_TABLES_VERTICAL_STRATEGY,
  VLE_INGEST_TABLES_HORIZONTAL_STRATEGY, VLE_INGEST_TABLES_MIN_ROWS,
  VLE_INGEST_TABLES_MIN_COLS env overrides following the existing
  pattern.
- Validate() rejects unknown strategy values and negative minima.
- ingest.RegistryFromTableOpts() constructs a parser.Registry with a
  table-aware PDF parser; DefaultRegistry stays compatible for tests.
- cmd/engine + cmd/server wire the config block through, log the
  enabled / disabled state at startup so the operator can see the
  active configuration in the journal.
- config.example.yaml documents the block alongside its sibling
  HyDE / global LLM concurrency knobs.
Adds the regression gate for the integration: a small (13 KB) two-table
PDF copied from pdftable's golden fixtures asserts the parser actually
emits table sections with the expected metadata, the synthetic "Tables"
container is in place, the kill-switch works, and corrupt input never
panics.

- pkg/parser/pdf_tables_test.go: TestPDFParserEmitsTableSections
  asserts pages, GFM rendering, known cell substrings; rows/cols
  metadata; TestPDFParserTablesContainerHidesUnderParent verifies the
  container wrapping; TestPDFParserDisabledTables verifies the
  rollback path; TestPDFParserCorruptInputReturnsCleanError pins the
  error contract; TestPDFParser10KSmokeOptional is gated on
  VLE_TEST_FILING_PDF for manual benchmark validation against real
  10-Ks.
- pkg/parser/testdata/tables-example.pdf: the issue-466 two-table
  golden fixture from pdftable. Small enough to commit.
- pkg/config/config_test.go: TestTablesDefaults / TestTablesEnvOverride
  / TestTablesValidateRejectsBadStrategy round-trip the new config
  block through YAML + env + Validate.
Copilot AI review requested due to automatic review settings May 27, 2026 01:46
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @hallelx2, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 11 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4bfcd35-705e-4eb1-acf0-e4679a72fdd8

📥 Commits

Reviewing files that changed from the base of the PR and between 75efc0c and 9b77f36.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • pkg/parser/testdata/tables-example.pdf is excluded by !**/*.pdf
📒 Files selected for processing (11)
  • README.md
  • cmd/engine/main.go
  • cmd/server/main.go
  • config.example.yaml
  • docs/ENGINE.md
  • go.mod
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/ingest/ingest.go
  • pkg/parser/pdf.go
  • pkg/parser/pdf_tables_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/pdftable-integration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

pdftable v0.3.0 is now tagged on the remote. Resolves cleanly via
go get and removes the only blocker for clean external go-module
fetch — CI can now build the engine without a sibling pdftable
checkout.
@hallelx2 hallelx2 merged commit cd42ae8 into main May 27, 2026
6 of 8 checks passed
@hallelx2 hallelx2 deleted the feat/pdftable-integration branch May 27, 2026 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants