fix(parser): revert primitive layer to ledongthuc/pdf (keep pdftable for tables)#23
Conversation
…for tables) PR #20's pdftable.Words()-based primitive layer broke section titles on real SEC filings. Root cause: pdftable v0.3.0 ships without standard-14 AFM metrics, so glyph X-advance is estimated wrong on Times Roman / Helvetica (the only fonts a 10-K uses), inter-word X-gaps shrink below pdftable's default 3pt tolerance, and adjacent words get concatenated. The 3M 10-Q's 508 sections ended up with titles like: ChangesinAccumulatedOtherComprehensiveIncome(Loss)Attributableto3MbyComponent Currentmarketablesecurities 56 238 The selection LLM, given 112K tokens of outline like that, picked zero sections — driving FinanceBench vectorless to 0.000 on the post-deploy run. Pre-#20 the parser fix from PR #12 was producing 174 readable sections from the same PDF and the bench was on track. This commit restores extractPDFRows to use ledongthuc/pdf's Content() glyph stream (the implementation PR #12 tuned for SEC filings): - X-gap > 0.20·fontSize → insert a space (the per-glyph heuristic that gave us clean word boundaries on 10-Ks). - collapseLetterSpacing / looksLetterSpaced restored — they fix the "U N I T E D S T A T E S" cover-page artifact. - multiSpaceRe restored. The pdftable extractPDFTables stage is UNTOUCHED — line/lines_strict table finding works correctly because it operates on drawn rules, not glyph X-advances. The 3M 10-Q still emits 62 table sections under the "Tables" container; verified end-to-end via /v1/documents/.../tree. When pdftable bundles standard-14 AFM metrics (filed upstream as a v0.4.x goal), we can flip extractPDFRows back to pdftable.Words().
Reviewer's GuideReverts the PDF text primitive layer from pdftable.Words() back to ledongthuc/pdf’s glyph stream for row extraction, tightening error handling when ledongthuc/pdf fails and reintroducing heuristics to reconstruct word boundaries and handle letter-spaced headings while preserving pdftable’s table extraction. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughPDF row extraction is refactored from using high-level word extraction to primitive glyph streams, making ChangesPDF Row Extraction from Glyph Stream
Possibly Related PRs
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The new hard failure when
ledongthuc/pdfcannot open a document changes previous behavior (outline-only optional) to rejecting PDFs thatpdftablecan still parse; consider whether a guarded fallback path (e.g., feature-flagged or best-effortpdftable.Words()mode) is preferable for those cases. - When calling
page.Content()inextractPDFRows, the returned error is ignored; it would be safer to check the error and skip the page (mirroring the previous per-page failure handling) instead of silently treating a failed content decode as an empty page.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new hard failure when `ledongthuc/pdf` cannot open a document changes previous behavior (outline-only optional) to rejecting PDFs that `pdftable` can still parse; consider whether a guarded fallback path (e.g., feature-flagged or best-effort `pdftable.Words()` mode) is preferable for those cases.
- When calling `page.Content()` in `extractPDFRows`, the returned error is ignored; it would be safer to check the error and skip the page (mirroring the previous per-page failure handling) instead of silently treating a failed content decode as an empty page.
## Individual Comments
### Comment 1
<location path="pkg/parser/pdf.go" line_range="553" />
<code_context>
+ if page.V.IsNull() {
continue
}
+ content := page.Content()
- // Group words by visual top (Y1). Values within 2pt are
</code_context>
<issue_to_address>
**issue (bug_risk):** Page.Content() return value looks like it might be ignoring a possible error
In ledongthuc/pdf, `Content()` typically returns `(Content, error)`. Here it’s treated as value-only, so any error would be silently ignored and could lead to mis-parsing or panics when accessing `content.Text`. Please either handle the error (e.g., skip the page on failure, consistent with prior behavior) or confirm this API cannot fail and document that assumption clearly.
</issue_to_address>
### Comment 2
<location path="pkg/parser/pdf.go" line_range="670-671" />
<code_context>
+ parts := strings.Fields(g)
+ // If every part is a single character, glue them.
+ allSingles := true
+ for _, p := range parts {
+ if len(p) > 1 {
+ allSingles = false
+ break
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using byte length to detect single-character tokens can misclassify multi-byte runes
In `collapseLetterSpacing`, `len(p)` is used for this check, which counts bytes, so non-ASCII glyphs (e.g., Japanese or accented Latin) will appear as length > 1 and be treated as multi-character. If you need to support non-ASCII PDFs, use `utf8.RuneCountInString(p)` so the condition reflects rune count instead of byte length.
Suggested implementation:
```golang
for _, p := range parts {
if utf8.RuneCountInString(p) > 1 {
allSingles = false
break
}
}
```
To compile successfully, ensure that `pkg/parser/pdf.go` imports the utf8 package:
- Add `import "unicode/utf8"` to the existing import block (or include `utf8` in an existing grouped import).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| if page.V.IsNull() { | ||
| continue | ||
| } | ||
| content := page.Content() |
There was a problem hiding this comment.
issue (bug_risk): Page.Content() return value looks like it might be ignoring a possible error
In ledongthuc/pdf, Content() typically returns (Content, error). Here it’s treated as value-only, so any error would be silently ignored and could lead to mis-parsing or panics when accessing content.Text. Please either handle the error (e.g., skip the page on failure, consistent with prior behavior) or confirm this API cannot fail and document that assumption clearly.
| for _, p := range parts { | ||
| if len(p) > 1 { |
There was a problem hiding this comment.
suggestion (bug_risk): Using byte length to detect single-character tokens can misclassify multi-byte runes
In collapseLetterSpacing, len(p) is used for this check, which counts bytes, so non-ASCII glyphs (e.g., Japanese or accented Latin) will appear as length > 1 and be treated as multi-character. If you need to support non-ASCII PDFs, use utf8.RuneCountInString(p) so the condition reflects rune count instead of byte length.
Suggested implementation:
for _, p := range parts {
if utf8.RuneCountInString(p) > 1 {
allSingles = false
break
}
}To compile successfully, ensure that pkg/parser/pdf.go imports the utf8 package:
- Add
import "unicode/utf8"to the existing import block (or includeutf8in an existing grouped import).
Summary
PR #20's pdftable.Words()-based primitive layer broke section titles on real SEC filings. pdftable v0.3.0 ships without standard-14 AFM metrics → glyph X-advance estimated wrong on Times Roman / Helvetica → inter-word X-gaps shrink below pdftable's default 3pt tolerance → adjacent words get concatenated.
Before / after on 3M 10-Q
Before this fix (live
00038-jpt): 508 sections, titles like:Selection LLM picks 0 sections on 4/5 FinanceBench questions. F1 = 0.000.
After:
extractPDFRowsrestored to ledongthuc/pdf's Content() glyph stream with the per-glyph X-gap-into-spaces heuristic PR #12 tuned for 10-Ks. Section titles are readable; chunked-tree retrieval works again.What's preserved
Followup
When pdftable bundles standard-14 AFM metrics (filed as a v0.4.x goal in hallelx2/pdftable#5), flip back to
pdftable.Words().Test plan
go build ./...go vet ./pkg/parser/...go test ./pkg/parser/...— all greenSummary by Sourcery
Restore the PDF parser’s glyph-level primitive layer to ledongthuc/pdf while preserving pdftable-based table extraction to fix broken text rows and headings.
Bug Fixes:
Enhancements:
Summary by CodeRabbit