fix(parser): revert primitive layer to ledongthuc/pdf (keep pdftable for tables)#23

Merged
hallelx2 merged 1 commit into main from fix/revert-parser-primitives-to-ledongthuc
May 27, 2026

Conversation

@hallelx2
Owner

@hallelx2 hallelx2 commented May 27, 2026

Summary

PR #20's pdftable.Words()-based primitive layer broke section titles on real SEC filings. pdftable v0.3.0 ships without standard-14 AFM metrics → glyph X-advance estimated wrong on Times Roman / Helvetica → inter-word X-gaps shrink below pdftable's default 3pt tolerance → adjacent words get concatenated.

Before / after on 3M 10-Q

Before this fix (live 00038-jpt): 508 sections, titles like:

ChangesinAccumulatedOtherComprehensiveIncome(Loss)Attributableto3MbyComponent
Currentmarketablesecurities 56 238

Selection LLM picks 0 sections on 4/5 FinanceBench questions. F1 = 0.000.

After: extractPDFRows restored to ledongthuc/pdf's Content() glyph stream with the per-glyph X-gap-into-spaces heuristic PR #12 tuned for 10-Ks. Section titles are readable; chunked-tree retrieval works again.

What's preserved

Followup

When pdftable bundles standard-14 AFM metrics (filed as a v0.4.x goal in hallelx2/pdftable#5), flip back to pdftable.Words().

Test plan

  • go build ./...
  • go vet ./pkg/parser/...
  • go test ./pkg/parser/... — all green

Summary by Sourcery

Restore the PDF parser’s glyph-level primitive layer to ledongthuc/pdf while preserving pdftable-based table extraction to fix broken text rows and headings.

Bug Fixes:

  • Prevent word concatenation on standard-14 fonts by no longer relying on pdftable’s word grouping for text rows, restoring accurate section titles and headings.
  • Fail PDF parsing early when the ledongthuc/pdf backend rejects a document instead of silently falling back to degraded text extraction.

Enhancements:

  • Reintroduce glyph-level spacing heuristics and letter-spacing collapse logic to robustly reconstruct readable lines from raw PDF text content.

Summary by CodeRabbit

  • Bug Fixes
    • Improved PDF text extraction accuracy with enhanced character spacing detection and bold formatting recognition.
    • Strengthened error handling for invalid PDF file parsing to prevent silent failures.

…for tables)

PR #20's pdftable.Words()-based primitive layer broke section titles on
real SEC filings. Root cause: pdftable v0.3.0 ships without standard-14
AFM metrics, so glyph X-advance is estimated wrong on Times Roman / Helvetica
(the only fonts a 10-K uses), inter-word X-gaps shrink below pdftable's
default 3pt tolerance, and adjacent words get concatenated. The 3M 10-Q's
508 sections ended up with titles like:

  ChangesinAccumulatedOtherComprehensiveIncome(Loss)Attributableto3MbyComponent
  Currentmarketablesecurities 56 238

The selection LLM, given 112K tokens of outline like that, picked zero
sections — driving FinanceBench vectorless to 0.000 on the post-deploy
run. Pre-#20 the parser fix from PR #12 was producing 174 readable
sections from the same PDF and the bench was on track.

This commit restores extractPDFRows to use ledongthuc/pdf's Content()
glyph stream (the implementation PR #12 tuned for SEC filings):

  - X-gap > 0.20·fontSize → insert a space (the per-glyph heuristic
    that gave us clean word boundaries on 10-Ks).
  - collapseLetterSpacing / looksLetterSpaced restored — they fix the
    "U N I T E D S T A T E S" cover-page artifact.
  - multiSpaceRe restored.
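
The gap rule in the first bullet can be sketched like this. It is a minimal illustration, not the restored implementation: the `glyph` struct, its field names, and `joinGlyphs` are hypothetical, and only the 0.20·fontSize threshold is taken from this commit message. The sketch assumes the glyphs of one row are already sorted left to right.

```go
package main

import (
	"fmt"
	"strings"
)

// glyph is a hypothetical positioned text run: its string, left edge X,
// advance width W, and font size, all in PDF points.
type glyph struct {
	S        string
	X, W     float64
	FontSize float64
}

// joinGlyphs rebuilds a line from glyphs sorted by X. A space is inserted
// whenever the horizontal gap between the end of the previous glyph and
// the start of the current one exceeds 0.20 times the current font size.
func joinGlyphs(row []glyph) string {
	var b strings.Builder
	for i, g := range row {
		if i > 0 {
			prev := row[i-1]
			gap := g.X - (prev.X + prev.W)
			if gap > 0.20*g.FontSize {
				b.WriteByte(' ')
			}
		}
		b.WriteString(g.S)
	}
	return b.String()
}

func main() {
	row := []glyph{
		{"T", 72, 6, 10},
		{"o", 78, 5, 10},
		{"3", 86, 5, 10}, // 3pt gap after "o", above the 2pt threshold
		{"M", 91, 8, 10},
	}
	fmt.Println(joinGlyphs(row))
}
```

Because the threshold scales with font size, the same rule applies to body text and large headings without a fixed point tolerance.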

The pdftable extractPDFTables stage is UNTOUCHED — line/lines_strict
table finding works correctly because it operates on drawn rules, not
glyph X-advances. The 3M 10-Q still emits 62 table sections under the
"Tables" container; verified end-to-end via /v1/documents/.../tree.

When pdftable bundles standard-14 AFM metrics (filed upstream as a
v0.4.x goal), we can flip extractPDFRows back to pdftable.Words().
Copilot AI review requested due to automatic review settings May 27, 2026 09:42
@sourcery-ai

sourcery-ai Bot commented May 27, 2026

Reviewer's Guide

Reverts the PDF text primitive layer from pdftable.Words() back to ledongthuc/pdf’s glyph stream for row extraction, tightening error handling when ledongthuc/pdf fails and reintroducing heuristics to reconstruct word boundaries and handle letter-spaced headings while preserving pdftable’s table extraction.

File-Level Changes

Change: Restore ledongthuc/pdf as the primitive text source for row extraction and make it a hard requirement
Details:
  • Change Parse to treat ledongthuc/pdf reader initialization failure as a fatal error with a clear wrapped message instead of silently falling back
  • Update extractPDFRows signature to accept a ledongthuc/pdf Reader instead of a pdftable.Document and pass the Reader from Parse
Files: pkg/parser/pdf.go

Change: Reimplement row extraction to operate on glyphs from ledongthuc/pdf.Content() rather than pdftable.Words()
Details:
  • Iterate pages via reader.NumPage()/Page() and skip null pages
  • Use page.Content().Text glyphs, bucket them into rows by Y position, and track max font size per bucket
  • Sort rows top-to-bottom and glyphs left-to-right, then reconstruct line text by inserting spaces when X gaps exceed 0.2×font size
  • Compute boldness based on glyph font metadata instead of word font names and continue to drop boilerplate lines
  • Return pdfRow slices built from the new glyph-based representation
Files: pkg/parser/pdf.go

Change: Reintroduce and wire up letter-spacing heuristics to fix over-spaced headings
Details:
  • Add multiSpaceRe, looksLetterSpaced, and collapseLetterSpacing helpers to detect and collapse letter-spaced runs like "U N I T E D S T A T E S"
  • Apply collapseLetterSpacing to each reconstructed row’s text before boilerplate filtering and row emission
Files: pkg/parser/pdf.go
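
The bucketing and sorting steps listed above could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the `glyph` and `row` types, `bucketRows`, and the 2pt tolerance used in `main` are hypothetical. It also assumes PDF user-space coordinates, where Y grows upward, so top-to-bottom order means descending Y.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// glyph is a hypothetical positioned text run.
type glyph struct {
	S        string
	X, Y     float64
	FontSize float64
}

// row collects the glyphs that share a baseline bucket.
type row struct {
	Y           float64 // Y of the first glyph seen in the bucket
	MaxFontSize float64 // largest font size in the bucket
	Glyphs      []glyph
}

// bucketRows groups glyphs whose Y values round to the same multiple of
// tol, sorts each row left to right, and returns rows top to bottom.
func bucketRows(glyphs []glyph, tol float64) []row {
	buckets := map[int]*row{}
	for _, g := range glyphs {
		key := int(math.Round(g.Y / tol))
		r, ok := buckets[key]
		if !ok {
			r = &row{Y: g.Y}
			buckets[key] = r
		}
		r.Glyphs = append(r.Glyphs, g)
		if g.FontSize > r.MaxFontSize {
			r.MaxFontSize = g.FontSize
		}
	}
	rows := make([]row, 0, len(buckets))
	for _, r := range buckets {
		gs := r.Glyphs
		sort.Slice(gs, func(i, j int) bool { return gs[i].X < gs[j].X })
		rows = append(rows, *r)
	}
	// Higher Y is nearer the top of the page in PDF coordinates.
	sort.Slice(rows, func(i, j int) bool { return rows[i].Y > rows[j].Y })
	return rows
}

func main() {
	glyphs := []glyph{
		{"b", 20, 700.0, 10},
		{"a", 10, 700.4, 12},
		{"c", 10, 680.0, 10},
	}
	for _, r := range bucketRows(glyphs, 2) {
		line := ""
		for _, g := range r.Glyphs {
			line += g.S
		}
		fmt.Printf("y=%.1f max=%.0f %s\n", r.Y, r.MaxFontSize, line)
	}
}
```

Rounding to a fixed grid is simple but can split two glyphs that sit just either side of a bucket boundary; a real implementation may need a different grouping rule.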

@coderabbitai

coderabbitai Bot commented May 27, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 99a1963 and 19feb96.

📒 Files selected for processing (1)
  • pkg/parser/pdf.go

📝 Walkthrough

PDF row extraction is refactored from using high-level word extraction to primitive glyph streams, making pdflib.Reader creation mandatory and rewriting row assembly with glyph bucketing, spacing, bold detection, and letter-spacing collapse.

Changes

PDF Row Extraction from Glyph Stream

Layer: Parse entry point: pdflib.Reader creation as fatal
File(s): pkg/parser/pdf.go
Summary: Parse now returns an error when pdflib.NewReader fails instead of treating outline access as optional, and passes the created Reader to extractPDFRows for row extraction.

Layer: extractPDFRows refactor: Reader dependency and glyph bucketing structure
File(s): pkg/parser/pdf.go
Summary: extractPDFRows signature changes to accept *pdflib.Reader instead of pdftable.Document; internal row-bucketing structure shifts from word-level grouping to glyph-level grouping with per-bucket font-size tracking.

Layer: Glyph assembly and letter-spacing collapse
File(s): pkg/parser/pdf.go
Summary: Core glyph-to-row pipeline: bucket glyphs by Y, sort by X, insert spaces using font-size-scaled gaps, detect bold from glyph font name, collapse letter-spaced runs via new helpers and regex, and emit pdfRow entries.
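
One way the letter-spacing collapse could work is sketched below. The helper names are reused from the PR description for readability, but the bodies are guesses, not the merged code: the `\s{2,}` pattern, the minimum of three single-character tokens, and the use of rune counts are all assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// multiSpaceRe matches runs of two or more whitespace characters, which
// this sketch treats as the boundary between letter-spaced words.
var multiSpaceRe = regexp.MustCompile(`\s{2,}`)

// looksLetterSpaced reports whether a group consists of at least three
// tokens that are each exactly one rune, e.g. "U N I T E D".
func looksLetterSpaced(group string) bool {
	parts := strings.Fields(group)
	if len(parts) < 3 {
		return false
	}
	for _, p := range parts {
		if utf8.RuneCountInString(p) != 1 {
			return false
		}
	}
	return true
}

// collapseLetterSpacing glues letter-spaced groups back into words and
// joins the groups with single spaces. Other groups are left unchanged.
func collapseLetterSpacing(line string) string {
	groups := multiSpaceRe.Split(strings.TrimSpace(line), -1)
	for i, g := range groups {
		if looksLetterSpaced(g) {
			groups[i] = strings.Join(strings.Fields(g), "")
		}
	}
	return strings.Join(groups, " ")
}

func main() {
	fmt.Println(collapseLetterSpacing("U N I T E D   S T A T E S"))
	fmt.Println(collapseLetterSpacing("Form 10-Q"))
}
```

This sketch relies on a wider gap between words than between letters. If the extracted text has uniform single spaces throughout, the word boundary cannot be recovered this way and the whole run would be glued into one token.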

Possibly Related PRs

  • hallelx2/vectorless-engine#12: Directly aligned with this PR's glyph-stream refactor, bold detection, and letter-spacing collapse in the PDF row-to-heading assembly flow.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes


@hallelx2 hallelx2 merged commit 52b7381 into main May 27, 2026
5 of 9 checks passed
@hallelx2 hallelx2 deleted the fix/revert-parser-primitives-to-ledongthuc branch May 27, 2026 09:43

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • The new hard failure when ledongthuc/pdf cannot open a document changes previous behavior (outline-only optional) to rejecting PDFs that pdftable can still parse; consider whether a guarded fallback path (e.g., feature-flagged or best-effort pdftable.Words() mode) is preferable for those cases.
  • When calling page.Content() in extractPDFRows, the returned error is ignored; it would be safer to check the error and skip the page (mirroring the previous per-page failure handling) instead of silently treating a failed content decode as an empty page.

Comment thread pkg/parser/pdf.go
		if page.V.IsNull() {
			continue
		}
		content := page.Content()

issue (bug_risk): Page.Content() return value looks like it might be ignoring a possible error

In ledongthuc/pdf, Content() typically returns (Content, error). Here it’s treated as value-only, so any error would be silently ignored and could lead to mis-parsing or panics when accessing content.Text. Please either handle the error (e.g., skip the page on failure, consistent with prior behavior) or confirm this API cannot fail and document that assumption clearly.

Comment thread pkg/parser/pdf.go
Comment on lines +670 to +671
		for _, p := range parts {
			if len(p) > 1 {

suggestion (bug_risk): Using byte length to detect single-character tokens can misclassify multi-byte runes

In collapseLetterSpacing, len(p) is used for this check, which counts bytes, so non-ASCII glyphs (e.g., Japanese or accented Latin) will appear as length > 1 and be treated as multi-character. If you need to support non-ASCII PDFs, use utf8.RuneCountInString(p) so the condition reflects rune count instead of byte length.

Suggested implementation:

		for _, p := range parts {
			if utf8.RuneCountInString(p) > 1 {
				allSingles = false
				break
			}
		}

To compile successfully, ensure that pkg/parser/pdf.go imports the utf8 package:

  • Add import "unicode/utf8" to the existing import block (or include utf8 in an existing grouped import).
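
The difference between the two checks is easy to see in isolation. The snippet below is a standalone illustration, not part of the PR: `singleByBytes` and `singleByRunes` are hypothetical helpers that isolate the byte-length test and the suggested rune-count test.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// singleByBytes treats a token as a single character when it is at most
// one byte long, which is what a len(p) > 1 check effectively tests.
func singleByBytes(tok string) bool { return len(tok) <= 1 }

// singleByRunes treats a token as a single character when it holds at
// most one Unicode code point.
func singleByRunes(tok string) bool { return utf8.RuneCountInString(tok) <= 1 }

func main() {
	// "\u00c9" is É (2 bytes in UTF-8); "\u65e5" is 日 (3 bytes).
	for _, tok := range []string{"A", "\u00c9", "\u65e5"} {
		fmt.Printf("%q bytes=%d runes=%d byBytes=%v byRunes=%v\n",
			tok, len(tok), utf8.RuneCountInString(tok),
			singleByBytes(tok), singleByRunes(tok))
	}
}
```

Note that rune count still differs from what a reader perceives as one character when combining marks are involved; for the Latin text in SEC filings that distinction is unlikely to matter.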

@hallelx2 hallelx2 review requested due to automatic review settings May 27, 2026 10:04