fix(parser): revert primitive layer to ledongthuc/pdf (keep pdftable for tables) by hallelx2 · Pull Request #23 · hallelx2/vectorless-engine

hallelx2 · 2026-05-27T09:42:50Z

Summary

PR #20's pdftable.Words()-based primitive layer broke section titles on real SEC filings. pdftable v0.3.0 ships without standard-14 AFM metrics → glyph X-advance estimated wrong on Times Roman / Helvetica → inter-word X-gaps shrink below pdftable's default 3pt tolerance → adjacent words get concatenated.
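
The failure chain above can be illustrated with a small sketch. The `glyph` type and the numbers are hypothetical, and the sketch assumes the estimated advance is too wide (which is what makes the gap shrink); only the 3pt tolerance comes from the description.

```go
package main

import "fmt"

// glyph is a hypothetical, simplified record: left edge X and advance width W, in points.
type glyph struct {
	X, W float64
}

// gap is the horizontal distance between the end of a and the start of b.
func gap(a, b glyph) float64 {
	return b.X - (a.X + a.W)
}

// sameWord reports whether a gap is small enough for two glyphs to be grouped into one word.
func sameWord(g, tolerance float64) bool {
	return g < tolerance
}

func main() {
	const tolerance = 3.0 // default tolerance quoted in the PR description

	next := glyph{X: 109, W: 5}
	trueLast := glyph{X: 100, W: 5} // correct advance: gap is 4pt
	estLast := glyph{X: 100, W: 7}  // overestimated advance: gap is 2pt

	fmt.Println(sameWord(gap(trueLast, next), tolerance)) // false: words stay separate
	fmt.Println(sameWord(gap(estLast, next), tolerance))  // true: words get concatenated
}
```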

Before / after on 3M 10-Q

Before this fix (live 00038-jpt): 508 sections, titles like:

ChangesinAccumulatedOtherComprehensiveIncome(Loss)Attributableto3MbyComponent
Currentmarketablesecurities 56 238

Selection LLM picks 0 sections on 4/5 FinanceBench questions. F1 = 0.000.

After: extractPDFRows restored to ledongthuc/pdf's Content() glyph stream with the per-glyph X-gap-into-spaces heuristic PR #12 tuned for 10-Ks. Section titles are readable; chunked-tree retrieval works again.
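
A minimal sketch of the per-glyph X-gap heuristic, assuming glyphs are already sorted left to right within a row. The `glyph` type and `joinGlyphs` are hypothetical names, not the code in `pkg/parser/pdf.go`; the 0.20 × font size threshold is the one quoted in this PR's commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// glyph is a hypothetical stand-in for one positioned text fragment.
type glyph struct {
	S        string  // text of the glyph
	X, W     float64 // left edge and advance width, in points
	FontSize float64
}

// joinGlyphs rebuilds a line of text, inserting a space whenever the gap
// between consecutive glyphs exceeds 0.20 x the font size.
func joinGlyphs(gs []glyph) string {
	var b strings.Builder
	for i, g := range gs {
		if i > 0 {
			prev := gs[i-1]
			if g.X-(prev.X+prev.W) > 0.20*g.FontSize {
				b.WriteByte(' ')
			}
		}
		b.WriteString(g.S)
	}
	return b.String()
}

func main() {
	// Illustrative coordinates: a 3pt gap separates "Net" from "sales".
	row := []glyph{
		{"N", 10, 7, 10}, {"e", 17, 5, 10}, {"t", 22, 3, 10},
		{"s", 28, 4, 10}, {"a", 32, 5, 10}, {"l", 37, 2, 10},
		{"e", 39, 5, 10}, {"s", 44, 4, 10},
	}
	fmt.Println(joinGlyphs(row))
}
```

Scaling the threshold by font size keeps the rule stable across headings and body text, which a fixed point tolerance does not.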

What's preserved

- pdftable table extraction is untouched: line/lines_strict finding works on drawn rules, not glyph X-advances. The 3M 10-Q still emits 62 table sections.
- collapseLetterSpacing / looksLetterSpaced / multiSpaceRe restored (the cover-page "U N I T E D S T A T E S" fix from PR #12, "parser: detect bold-as-heading + collapse letter-spacing (fixes filing parse)").
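
For readers unfamiliar with the letter-spacing fix, here is one way such helpers could work. The three names match the PR, but the bodies below are a guess, not the restored implementation. The sketch assumes real word boundaries survive as runs of two or more spaces.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// multiSpaceRe matches runs of two or more whitespace characters,
// assumed here to mark real word boundaries in letter-spaced text.
var multiSpaceRe = regexp.MustCompile(`\s{2,}`)

// looksLetterSpaced reports whether s is at least three
// single-character tokens separated by spaces.
func looksLetterSpaced(s string) bool {
	parts := strings.Fields(s)
	if len(parts) < 3 {
		return false
	}
	for _, p := range parts {
		if utf8.RuneCountInString(p) != 1 {
			return false
		}
	}
	return true
}

// collapseLetterSpacing glues letter-spaced groups back into words and
// joins the groups with single spaces.
func collapseLetterSpacing(s string) string {
	groups := multiSpaceRe.Split(strings.TrimSpace(s), -1)
	for i, g := range groups {
		if looksLetterSpaced(g) {
			groups[i] = strings.Join(strings.Fields(g), "")
		}
	}
	return strings.Join(groups, " ")
}

func main() {
	fmt.Println(collapseLetterSpacing("U N I T E D  S T A T E S"))
	fmt.Println(collapseLetterSpacing("Net sales  56"))
}
```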

Followup

When pdftable bundles standard-14 AFM metrics (filed as a v0.4.x goal in hallelx2/pdftable#5), flip back to pdftable.Words().

Test plan

go build ./...
go vet ./pkg/parser/...
go test ./pkg/parser/... — all green

Summary by Sourcery

Restore the PDF parser’s glyph-level primitive layer to ledongthuc/pdf while preserving pdftable-based table extraction to fix broken text rows and headings.

Bug Fixes:

Prevent word concatenation on standard-14 fonts by no longer relying on pdftable’s word grouping for text rows, restoring accurate section titles and headings.
Fail PDF parsing early when the ledongthuc/pdf backend rejects a document instead of silently falling back to degraded text extraction.
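
A minimal sketch of the fail-early pattern described above, assuming a hypothetical `opener` function standing in for the real reader constructor. `parse`, `errBadPDF`, and the error message are illustrative names, not the ones in `pkg/parser/pdf.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadPDF is a hypothetical sentinel standing in for a backend failure.
var errBadPDF = errors.New("malformed xref table")

// opener is a hypothetical stand-in for the reader constructor; it returns a page count.
type opener func(data []byte) (int, error)

// parse returns a wrapped error as soon as the backend rejects the
// document, instead of continuing with a degraded extraction path.
func parse(data []byte, open opener) (int, error) {
	pages, err := open(data)
	if err != nil {
		return 0, fmt.Errorf("parser: pdf backend rejected document: %w", err)
	}
	return pages, nil
}

// isBadPDF shows that callers can still match the underlying cause.
func isBadPDF(err error) bool {
	return errors.Is(err, errBadPDF)
}

func main() {
	_, err := parse(nil, func([]byte) (int, error) { return 0, errBadPDF })
	fmt.Println(err)
	fmt.Println(isBadPDF(err))
}
```

Wrapping with `%w` keeps the original error inspectable, so a caller that wants a best-effort fallback can still branch on the cause.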

Enhancements:

Reintroduce glyph-level spacing heuristics and letter-spacing collapse logic to robustly reconstruct readable lines from raw PDF text content.

Summary by CodeRabbit

Bug Fixes
- Improved PDF text extraction accuracy with enhanced character spacing detection and bold formatting recognition.
- Strengthened error handling for invalid PDF file parsing to prevent silent failures.

…for tables)

PR #20's pdftable.Words()-based primitive layer broke section titles on real SEC filings. Root cause: pdftable v0.3.0 ships without standard-14 AFM metrics, so glyph X-advance is estimated wrong on Times Roman / Helvetica (the only fonts a 10-K uses), inter-word X-gaps shrink below pdftable's default 3pt tolerance, and adjacent words get concatenated.

The 3M 10-Q's 508 sections ended up with titles like:

ChangesinAccumulatedOtherComprehensiveIncome(Loss)Attributableto3MbyComponent
Currentmarketablesecurities 56 238

The selection LLM, given 112K tokens of outline like that, picked zero sections, driving FinanceBench vectorless to 0.000 on the post-deploy run. Pre-#20 the parser fix from PR #12 was producing 174 readable sections from the same PDF and the bench was on track.

This commit restores extractPDFRows to use ledongthuc/pdf's Content() glyph stream (the implementation PR #12 tuned for SEC filings):

- X-gap > 0.20·fontSize → insert a space (the per-glyph heuristic that gave us clean word boundaries on 10-Ks).
- collapseLetterSpacing / looksLetterSpaced restored: they fix the "U N I T E D S T A T E S" cover-page artifact.
- multiSpaceRe restored.

The pdftable extractPDFTables stage is UNTOUCHED: line/lines_strict table finding works correctly because it operates on drawn rules, not glyph X-advances. The 3M 10-Q still emits 62 table sections under the "Tables" container; verified end-to-end via /v1/documents/.../tree.

When pdftable bundles standard-14 AFM metrics (filed upstream as a v0.4.x goal), we can flip extractPDFRows back to pdftable.Words().

sourcery-ai · 2026-05-27T09:42:56Z

Reviewer's Guide

Reverts the PDF text primitive layer from pdftable.Words() back to ledongthuc/pdf’s glyph stream for row extraction, tightening error handling when ledongthuc/pdf fails and reintroducing heuristics to reconstruct word boundaries and handle letter-spaced headings while preserving pdftable’s table extraction.

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Restore ledongthuc/pdf as the primitive text source for row extraction and make it a hard requirement | Change Parse to treat ledongthuc/pdf reader initialization failure as a fatal error with a clear wrapped message instead of silently falling back<br>Update extractPDFRows signature to accept a ledongthuc/pdf Reader instead of a pdftable.Document and pass the Reader from Parse | `pkg/parser/pdf.go` |
| Reimplement row extraction to operate on glyphs from ledongthuc/pdf.Content() rather than pdftable.Words() | Iterate pages via reader.NumPage()/Page() and skip null pages<br>Use page.Content().Text glyphs, bucket them into rows by Y position, and track max font size per bucket<br>Sort rows top-to-bottom and glyphs left-to-right, then reconstruct line text by inserting spaces when X gaps exceed 0.2×font size<br>Compute boldness based on glyph font metadata instead of word font names and continue to drop boilerplate lines<br>Return pdfRow slices built from the new glyph-based representation | `pkg/parser/pdf.go` |
| Reintroduce and wire up letter-spacing heuristics to fix over-spaced headings | Add multiSpaceRe, looksLetterSpaced, and collapseLetterSpacing helpers to detect and collapse letter-spaced runs like "U N I T E D S T A T E S"<br>Apply collapseLetterSpacing to each reconstructed row’s text before boilerplate filtering and row emission | `pkg/parser/pdf.go` |
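
The bucketing and sorting steps listed in the guide could look roughly like the sketch below. `glyph`, `row`, and `bucketRows` are hypothetical names, and the 2pt Y tolerance is an assumption borrowed from a comment in the removed word-based code. It also assumes PDF user space, where Y grows upward, so top-to-bottom means descending Y.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// glyph is a hypothetical positioned text fragment.
type glyph struct {
	S        string
	X, Y     float64
	FontSize float64
}

// row groups glyphs that share (approximately) one baseline.
type row struct {
	Y           float64
	MaxFontSize float64
	Glyphs      []glyph
}

// bucketRows groups glyphs whose Y values round to the same tol-sized
// bucket, sorts glyphs left to right, and sorts rows top to bottom.
func bucketRows(gs []glyph, tol float64) []row {
	buckets := map[int]*row{}
	for _, g := range gs {
		key := int(math.Round(g.Y / tol))
		r, ok := buckets[key]
		if !ok {
			r = &row{Y: g.Y}
			buckets[key] = r
		}
		r.Glyphs = append(r.Glyphs, g)
		if g.FontSize > r.MaxFontSize {
			r.MaxFontSize = g.FontSize
		}
	}
	rows := make([]row, 0, len(buckets))
	for _, r := range buckets {
		sort.Slice(r.Glyphs, func(i, j int) bool { return r.Glyphs[i].X < r.Glyphs[j].X })
		rows = append(rows, *r)
	}
	// Larger Y is higher on the page, so descending Y is top to bottom.
	sort.Slice(rows, func(i, j int) bool { return rows[i].Y > rows[j].Y })
	return rows
}

func main() {
	rows := bucketRows([]glyph{
		{"b", 20, 700, 10}, {"a", 10, 700.4, 12}, {"c", 10, 680, 10},
	}, 2)
	for _, r := range rows {
		fmt.Println(r.Y, r.MaxFontSize, len(r.Glyphs))
	}
}
```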

coderabbitai · 2026-05-27T09:43:03Z

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 99a1963 and 19feb96.

📒 Files selected for processing (1)

pkg/parser/pdf.go

Walkthrough

PDF row extraction is refactored from using high-level word extraction to primitive glyph streams, making pdflib.Reader creation mandatory and rewriting row assembly with glyph bucketing, spacing, bold detection, and letter-spacing collapse.

Changes

PDF Row Extraction from Glyph Stream

| Layer / File(s) | Summary |
| --- | --- |
| Parse entry point: pdflib.Reader creation as fatal (`pkg/parser/pdf.go`) | `Parse` now returns an error when `pdflib.NewReader` fails instead of treating outline access as optional, and passes the created `Reader` to `extractPDFRows` for row extraction. |
| extractPDFRows refactor: Reader dependency and glyph bucketing structure (`pkg/parser/pdf.go`) | `extractPDFRows` signature changes to accept `*pdflib.Reader` instead of `pdftable.Document`; internal row-bucketing structure shifts from word-level grouping to glyph-level grouping with per-bucket font-size tracking. |
| Glyph assembly and letter-spacing collapse (`pkg/parser/pdf.go`) | Core glyph-to-row pipeline: bucket glyphs by Y, sort by X, insert spaces using font-size-scaled gaps, detect bold from glyph font name, collapse letter-spaced runs via new helpers and regex, and emit `pdfRow` entries. |
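
The walkthrough mentions detecting bold from the glyph font name. One plausible rule, shown purely as an assumption (the PR does not spell out the actual check), is a case-insensitive substring match on the font name, with a row counting as bold only when every glyph is. `isBoldFont` and `rowIsBold` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// isBoldFont guesses boldness from a PDF font name such as
// "ABCDEF+TimesNewRomanPS-BoldMT". This rule is an assumption.
func isBoldFont(name string) bool {
	return strings.Contains(strings.ToLower(name), "bold")
}

// rowIsBold treats a row as bold only if it has glyphs and all of
// their fonts look bold.
func rowIsBold(fonts []string) bool {
	if len(fonts) == 0 {
		return false
	}
	for _, f := range fonts {
		if !isBoldFont(f) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(rowIsBold([]string{"ABCDEF+TimesNewRomanPS-BoldMT", "Helvetica-Bold"}))
	fmt.Println(rowIsBold([]string{"Helvetica-Bold", "Helvetica"}))
}
```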

Possibly Related PRs

hallelx2/vectorless-engine#12: Directly aligned with this PR's glyph-stream refactor, bold detection, and letter-spacing collapse in the PDF row-to-heading assembly flow.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

The new hard failure when ledongthuc/pdf cannot open a document changes previous behavior (outline-only optional) to rejecting PDFs that pdftable can still parse; consider whether a guarded fallback path (e.g., feature-flagged or best-effort pdftable.Words() mode) is preferable for those cases.
When calling page.Content() in extractPDFRows, the returned error is ignored; it would be safer to check the error and skip the page (mirroring the previous per-page failure handling) instead of silently treating a failed content decode as an empty page.

## Individual Comments

### Comment 1
<location path="pkg/parser/pdf.go" line_range="553" />
<code_context>
+		if page.V.IsNull() {
 			continue
 		}
+		content := page.Content()

-		// Group words by visual top (Y1). Values within 2pt are
</code_context>
<issue_to_address>
**issue (bug_risk):** Page.Content() return value looks like it might be ignoring a possible error

In ledongthuc/pdf, `Content()` typically returns `(Content, error)`. Here it’s treated as value-only, so any error would be silently ignored and could lead to mis-parsing or panics when accessing `content.Text`. Please either handle the error (e.g., skip the page on failure, consistent with prior behavior) or confirm this API cannot fail and document that assumption clearly.
</issue_to_address>

### Comment 2
<location path="pkg/parser/pdf.go" line_range="670-671" />
<code_context>
+		parts := strings.Fields(g)
+		// If every part is a single character, glue them.
+		allSingles := true
+		for _, p := range parts {
+			if len(p) > 1 {
+				allSingles = false
+				break
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using byte length to detect single-character tokens can misclassify multi-byte runes

In `collapseLetterSpacing`, `len(p)` is used for this check, which counts bytes, so non-ASCII glyphs (e.g., Japanese or accented Latin) will appear as length > 1 and be treated as multi-character. If you need to support non-ASCII PDFs, use `utf8.RuneCountInString(p)` so the condition reflects rune count instead of byte length.

Suggested implementation:

```golang
		for _, p := range parts {
			if utf8.RuneCountInString(p) > 1 {
				allSingles = false
				break
			}
		}

```

To compile successfully, ensure that `pkg/parser/pdf.go` imports the utf8 package:

- Add `import "unicode/utf8"` to the existing import block (or include `utf8` in an existing grouped import).
</issue_to_address>
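
To make the byte-versus-rune point concrete, here is a small standalone comparison. `allSingleBytes` and `allSingleRunes` are hypothetical helpers written for this illustration; they mirror the two variants of the check discussed in Comment 2.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// allSingleBytes mirrors the original check: len counts bytes.
func allSingleBytes(parts []string) bool {
	for _, p := range parts {
		if len(p) > 1 {
			return false
		}
	}
	return true
}

// allSingleRunes mirrors the suggested check: it counts runes.
func allSingleRunes(parts []string) bool {
	for _, p := range parts {
		if utf8.RuneCountInString(p) > 1 {
			return false
		}
	}
	return true
}

func main() {
	// "É" is one rune but two bytes in UTF-8.
	parts := strings.Fields("É T A T S")
	fmt.Println(allSingleBytes(parts)) // false: the byte check rejects the run
	fmt.Println(allSingleRunes(parts)) // true: the rune check accepts it
}
```

For ASCII-only filings the two checks agree; they diverge only once a token contains a multi-byte character.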

Copilot AI review requested due to automatic review settings May 27, 2026 09:42

Copilot started reviewing on behalf of hallelx2 May 27, 2026 09:43

hallelx2 merged commit 52b7381 into main May 27, 2026
5 of 9 checks passed

hallelx2 deleted the fix/revert-parser-primitives-to-ledongthuc branch May 27, 2026 09:43

sourcery-ai Bot reviewed May 27, 2026

hallelx2 review requested due to automatic review settings May 27, 2026 10:04
