v0.1.0: Words + ExtractText (Phase 1.3.B) #1
Port pdfplumber's WordExtractor and extract_text into Go.

Three new methods on the Page interface:
- Page.Words(WordOpts) → []Word, error
- Page.ExtractText(TextOpts) → string, error
- Page.ExtractTextSimple(xt, yt) → string, error

Each Word carries its bbox, font name/size, upright flag, and direction (ltr/rtl/ttb/btt), with an optional Chars slice when KeepChars=true.

Supporting infrastructure:
- geometry.go: BBox value type with Union/Intersect/Contains/Snap and MergeBBoxes helpers.
- clustering.go: 1-D agglomerative clustering primitives (clusterFloat1D, clusterObjects[T], groupObjectsByAttr[T,K], dedupeChars). Ports of pdfplumber/utils/clustering.py.
- text.go: Word + WordExtractor algorithm, dense and layout-preserving ExtractText paths, ligature expansion table.

The Page interface is additive: v0.0.1 callers that only use Chars/Lines/Rects/Curves continue to compile and work unchanged.

Tests:
- geometry_test.go, clustering_test.go, text_test.go: table-driven unit tests for each primitive and each public entry point.
- golden_test.go: parity tests against pdfplumber output on three fixture PDFs (hello, rules, simple1). Expected outputs in testdata/golden/*.expected.json, regenerable via scripts/gen_golden.py.

Parity notes:
- Word text, count, order, and direction match pdfplumber exactly.
- Word bbox positions drift by up to ~10 PDF points on standard-14 fonts because the AFM metrics aren't yet bundled (planned for v0.2.x). The golden test tolerance is 15 points to absorb this.
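As a rough illustration of the geometry helpers listed above, a BBox value type with Union and Intersect might look like the following. This is a minimal sketch, not the code in geometry.go: the field names (`X0`, `Top`, `X1`, `Bottom`) and the `(BBox, bool)` return shape of `Intersect` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// BBox is a hypothetical axis-aligned box in page coordinates with a
// top-left origin. The field names are assumed for this sketch.
type BBox struct {
	X0, Top, X1, Bottom float64
}

// Union returns the smallest box that covers both b and o.
func (b BBox) Union(o BBox) BBox {
	return BBox{
		X0:     math.Min(b.X0, o.X0),
		Top:    math.Min(b.Top, o.Top),
		X1:     math.Max(b.X1, o.X1),
		Bottom: math.Max(b.Bottom, o.Bottom),
	}
}

// Intersect returns the overlapping region and true, or a zero BBox and
// false when the boxes do not overlap with positive area.
func (b BBox) Intersect(o BBox) (BBox, bool) {
	r := BBox{
		X0:     math.Max(b.X0, o.X0),
		Top:    math.Max(b.Top, o.Top),
		X1:     math.Min(b.X1, o.X1),
		Bottom: math.Min(b.Bottom, o.Bottom),
	}
	if r.X1 <= r.X0 || r.Bottom <= r.Top {
		return BBox{}, false
	}
	return r, true
}

func main() {
	a := BBox{X0: 0, Top: 0, X1: 10, Bottom: 10}
	b := BBox{X0: 5, Top: 5, X1: 20, Bottom: 8}
	fmt.Println(a.Union(b))
	fmt.Println(a.Intersect(b))
}
```

Using a plain value type keeps boxes comparable with `==` and cheap to copy, which suits merging the per-character boxes of a word.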
Reviewer's Guide
Implements pdfplumber-parity word and text extraction in Go by extending the Page API with Words/ExtractText/ExtractTextSimple, adding clustering and geometry primitives to support word grouping and layout-preserving extraction, and wiring up golden tests and documentation for v0.1.0.
Sequence diagram for Page.ExtractText workflow
sequenceDiagram
actor Client
participant Page
participant page
participant clustering as clusterObjects
participant text as extractTextFromChars
participant layout as extractTextWithLayout
Client->>Page: ExtractText(opts TextOpts)
Page->>page: ExtractText(opts TextOpts)
page->>page: applyTextOptDefaults(opts)
page->>page: Chars() []Char
alt [opts.Layout]
page->>layout: extractTextWithLayout(chars, Width(), Height(), opts)
layout-->>page: text string
else [!opts.Layout]
page->>text: extractTextFromChars(chars, opts)
text->>clustering: clusterObjects(words, keyFn, opts.YTolerance, false)
clustering-->>text: lines of words
text-->>page: text string
end
page-->>Client: text string
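The dense path in the diagram groups words into lines by clustering a 1-D key (a vertical position) under a tolerance. The sketch below illustrates that idea with plain floats. `clusterFloats` is a hypothetical stand-in rather than the PR's `clusterFloat1D`, and it assumes the sort-then-scan agglomeration described later in this review (consecutive sorted values within the tolerance share a cluster).

```go
package main

import (
	"fmt"
	"sort"
)

// clusterFloats groups values so that consecutive sorted values that are
// no more than tol apart share a cluster. Hypothetical helper for
// illustration only.
func clusterFloats(xs []float64, tol float64) [][]float64 {
	if len(xs) == 0 {
		return nil
	}
	sorted := append([]float64(nil), xs...) // copy so the input is untouched
	sort.Float64s(sorted)

	clusters := [][]float64{{sorted[0]}}
	last := sorted[0]
	for _, x := range sorted[1:] {
		if x-last <= tol {
			n := len(clusters) - 1
			clusters[n] = append(clusters[n], x)
		} else {
			clusters = append(clusters, []float64{x})
		}
		last = x
	}
	return clusters
}

func main() {
	// Vertical positions of five words; values close together belong
	// to the same text line.
	tops := []float64{100.2, 72.0, 100.0, 72.4, 130.0}
	for i, line := range clusterFloats(tops, 0.5) {
		fmt.Printf("line %d: %v\n", i, line)
	}
}
```

Because each value is compared with its immediate predecessor rather than with the cluster's first element, a long chain of small steps can span more than the tolerance overall, which is the usual trade-off of this kind of agglomeration.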
📝 Walkthrough
This PR introduces v0.1.0 text and word extraction capabilities. It adds geometry primitives (…
Changes: Word and Text Extraction Pipeline
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Hey - I've found 5 issues, and left some high level feedback:
- The option defaulting logic for tolerances (e.g., `applyWordOptDefaults`, `applyTextOptDefaults`, and `textOptsToWordOpts`) treats `0` as "use default", which makes it impossible for callers to intentionally request a true zero-tolerance behaviour (even though lower-level clustering supports a special `tolerance == 0` semantics); consider either using a separate sentinel (e.g., negative) or a `*float64`/`nil`-means-default pattern so `0` can be expressed explicitly.
- The key-building helpers in `extractWordsFromChars`/`dedupeChars` (`keyOf` functions) allocate fresh `[]byte` and strings on every call and re-run even inside tight loops (e.g., the `sort.SliceStable` comparator in `dedupeChars` recomputes keys repeatedly); if profiling shows this as hot, consider caching keys per element or precomputing them once to reduce per-char allocations.
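One way the `*float64`/`nil`-means-default suggestion could look is sketched below. The `WordOpts` fields shown here, the default value of 3, and the `Float` and `resolveTolerance` helpers are all assumptions made for illustration, not the PR's actual definitions.

```go
package main

import "fmt"

// defaultTolerance is an assumed default, chosen only for this example.
const defaultTolerance = 3.0

// WordOpts is a hypothetical options struct. A nil pointer means
// "use the default"; a pointer to 0 means a genuine zero tolerance.
type WordOpts struct {
	XTolerance *float64
	YTolerance *float64
}

// Float returns a pointer to v so optional fields can be set inline.
func Float(v float64) *float64 { return &v }

// resolveTolerance returns the caller's value when one was set and the
// default otherwise.
func resolveTolerance(p *float64) float64 {
	if p == nil {
		return defaultTolerance
	}
	return *p
}

func main() {
	opts := WordOpts{XTolerance: Float(0)} // YTolerance is left nil
	fmt.Println(resolveTolerance(opts.XTolerance))
	fmt.Println(resolveTolerance(opts.YTolerance))
}
```

The pointer form costs callers a helper call per field, but unlike a negative sentinel it leaves every float value, including 0, available as an explicit setting.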
## Individual Comments
### Comment 1
<location path="clustering.go" line_range="286" />
<code_context>
+// Float64bits dependency so other tests/files don't have to import
+// "math" just to compare floats. Left as a package-private helper for
+// now — only dedupeChars's key construction uses it.
+func float64Bits(f float64) uint64 { return math.Float64bits(f) }
</code_context>
<issue_to_address>
**nitpick:** float64Bits helper is currently unused and adds unnecessary surface area.
Given it isn’t used, this is effectively dead code. Unless there’s a concrete near-term use, consider removing it and calling `math.Float64bits` directly where needed instead. If it really is unused, the doc comment’s claim that `dedupeChars`'s key construction uses it is stale as well.
</issue_to_address>
### Comment 2
<location path="text_test.go" line_range="342" />
<code_context>
+ }
+}
+
+func TestDirectionFor(t *testing.T) {
+ tests := []struct {
+ upright, ltr, ttb bool
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests that exercise vertical and rotated text flows (ttb/btt) end-to-end
Existing tests only cover upright horizontal (LTR) text. Since mergeLineIntoWords/charBeginsNewWord/sortCharsByDir have direction-specific branches for rotated text (Upright=false, TTB/BTT), please add tests that construct rotated Char slices and verify resulting Words’ Direction and grouping. For instance, use a synthetic top-to-bottom column and assert word order and behavior when toggling VerticalTTB/HorizontalLTR, so the vertical/rotated paths are exercised and protected against regressions.
</issue_to_address>
### Comment 3
<location path="text_test.go" line_range="325" />
<code_context>
+ }
+}
+
+func TestPageExtractTextSimple(t *testing.T) {
+ doc, err := openHelloWorldDoc()
+ if err != nil {
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen coverage of ExtractTextSimple with unit-style Char slices and edge cases
ExtractTextSimple has bespoke behavior (dropping empty-text chars, gap-based space insertion, defaulting tolerances, handling empty pages), but it’s only tested via a full PDF and substring checks. Please add unit-style tests that exercise the core logic directly (e.g., via a helper or by factoring it out) with hand-crafted Char slices, and assert exact output for key cases: no chars, empty Text chars, explicit spaces vs gap-inferred spaces, and xTolerance/yTolerance = 0 vs non-zero. This will better lock in the intended semantics and protect future refactors.
Suggested implementation:
```golang
func TestExtractTextSimple_EmptyChars(t *testing.T) {
	// Empty input should yield empty output without error.
	got := extractTextSimple(nil, DefaultTextOpts())
	if got != "" {
		t.Fatalf("extractTextSimple(nil) = %q, want empty string", got)
	}
	got = extractTextSimple([]Char{}, DefaultTextOpts())
	if got != "" {
		t.Fatalf("extractTextSimple([]Char{}) = %q, want empty string", got)
	}
}

func TestExtractTextSimple_DropsEmptyTextChars(t *testing.T) {
	opts := DefaultTextOpts()
	chars := []Char{
		{Text: ""},
		{Text: "A"},
		{Text: ""},
		{Text: "B"},
		{Text: ""},
	}
	got := extractTextSimple(chars, opts)
	want := "AB"
	if got != want {
		t.Fatalf("extractTextSimple dropped-empty-text: got %q, want %q", got, want)
	}
}

func TestExtractTextSimple_ExplicitSpacesVsGapSpaces(t *testing.T) {
	// This test assumes ExtractTextSimple:
	// * keeps explicit space characters as-is
	// * inserts spaces when the gap between consecutive chars exceeds xTolerance
	opts := DefaultTextOpts()
	opts.XTolerance = 2.0
	chars := []Char{
		// "A B" with explicit space
		{Text: "A", X: 0, Y: 0},
		{Text: " ", X: 1, Y: 0},
		{Text: "B", X: 2, Y: 0},
		// large gap should cause an inserted space between C and D
		{Text: "C", X: 10, Y: 0},
		{Text: "D", X: 20, Y: 0},
	}
	got := extractTextSimple(chars, opts)
	// We expect both the explicit space (between A and B) and the gap-based space (between C and D).
	if !strings.Contains(got, "A B") {
		t.Fatalf("extractTextSimple explicit-space: got %q, want to contain %q", got, "A B")
	}
	if !strings.Contains(got, "C D") {
		t.Fatalf("extractTextSimple gap-space: got %q, want to contain %q", got, "C D")
	}
}

func TestExtractTextSimple_XToleranceZero(t *testing.T) {
	// Assumes xTolerance == 0 means "no gap-based insertion", so only explicit
	// spaces appear. The current defaulting logic treats 0 as "use default", in
	// which case this input would likely yield "A B"; pick the intended contract
	// and adjust `want` accordingly.
	opts := DefaultTextOpts()
	opts.XTolerance = 0
	chars := []Char{
		{Text: "A", X: 0, Y: 0},
		{Text: "B", X: 10, Y: 0}, // large gap, but xTolerance == 0
	}
	got := extractTextSimple(chars, opts)
	want := "AB" // no gap-based space
	if got != want {
		t.Fatalf("extractTextSimple xTolerance=0: got %q, want %q", got, want)
	}
}

func TestExtractTextSimple_YToleranceGrouping(t *testing.T) {
	// This test exercises how yTolerance affects line grouping.
	opts := DefaultTextOpts()
	opts.YTolerance = 1.0
	chars := []Char{
		// "AB" on first line
		{Text: "A", X: 0, Y: 0},
		{Text: "B", X: 1, Y: 0},
		// "CD" on a second line outside yTolerance
		{Text: "C", X: 0, Y: 3},
		{Text: "D", X: 1, Y: 3},
	}
	got := extractTextSimple(chars, opts)
	// We expect a line break between "AB" and "CD".
	if !strings.Contains(got, "AB") || !strings.Contains(got, "CD") {
		t.Fatalf("extractTextSimple yTolerance: got %q, want to contain both %q and %q", got, "AB", "CD")
	}
	if !strings.Contains(got, "AB\nCD") && !strings.Contains(got, "AB\r\nCD") {
		t.Fatalf("extractTextSimple yTolerance: got %q, want line break between lines", got)
	}
}
```
I assumed there is an internal helper `extractTextSimple(chars []Char, opts TextOpts) string` and a `Char` type with at least `Text string`, `X float64`, and `Y float64` fields, plus `DefaultTextOpts()` returning a `TextOpts` value that includes `XTolerance` and `YTolerance`. If the actual API differs, you will need to:
1. Adjust the helper calls to the real function that implements ExtractTextSimple’s core logic (or factor such a helper out from the method under test into the package).
2. Update the `Char` field initializations in the tests to match the real `Char` struct (e.g., use `BBox`/`Point`/`Pos` fields instead of `X`/`Y` if necessary).
3. If `extractTextSimple` returns `(string, error)` instead of just `string`, capture and assert `err == nil` before checking `got`.
4. Place these tests near your existing `TestPageExtractTextSimple` (or update/replace that test if you prefer these unit-style cases instead of the current PDF-based one).
</issue_to_address>
### Comment 4
<location path="clustering_test.go" line_range="89" />
<code_context>
+
+// TestClusterObjects exercises both preserveOrder modes on a tiny set
+// of struct-valued inputs.
+func TestClusterObjects(t *testing.T) {
+ type pt struct {
+ x float64
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests for clusterObjects and dedupeChars with zero/negative tolerances
One remaining gap is how clusterObjects and dedupeChars behave when tolerance is 0 (or negative) in their own APIs. Please add:
- A test for clusterObjects with tolerance==0 that asserts the expected grouping (identical vs near-equal values).
- A test for dedupeChars with tolerance==0 that confirms whether only exact coordinate matches are removed.
This will lock in the intended contract and protect against subtle regressions around tolerance handling.
Suggested implementation:
```golang
// TestClusterObjects exercises both preserveOrder modes on a tiny set
// of struct-valued inputs.
func TestClusterObjects(t *testing.T) {
	type pt struct {
		x   float64
		tag string
	}
	xs := []pt{
		{x: 1, tag: "a"},
		{x: 10, tag: "b"},
		{x: 2, tag: "c"},
		{x: 11, tag: "d"},
	}

	// tolerance == 0: identical keys should cluster, distinct keys should not.
	// This assumes clusterObjects returns the clusters themselves ([][]pt,
	// sorted by key when preserveOrder is false) rather than per-item labels.
	zeroTol := 0.0
	xsZeroTol := []pt{
		{x: 1, tag: "a1"},
		{x: 1, tag: "a2"},        // identical to a1, should be in same cluster at tol=0
		{x: 1.000001, tag: "a3"}, // near-equal but not identical, should be in a different cluster at tol=0
	}
	clustersZero := clusterObjects(xsZeroTol, func(p pt) float64 { return p.x }, zeroTol, false)
	if len(clustersZero) != 2 {
		t.Fatalf("clusterObjects(tol=0) returned %d clusters, want 2", len(clustersZero))
	}
	if len(clustersZero[0]) != 2 {
		t.Errorf("expected identical keys (a1, a2) to share a cluster at tol=0, got %d items in first cluster", len(clustersZero[0]))
	}
	if len(clustersZero[1]) != 1 || clustersZero[1][0].tag != "a3" {
		t.Errorf("expected near-equal but non-identical key (a3) to be alone in its own cluster at tol=0, got %v", clustersZero[1])
	}

	// preserveOrder=false: clusters sorted by key, items sorted within
```
To fully implement your suggestion, you’ll also want to add a dedicated test for `dedupeChars` with `tolerance == 0`. The signature quoted in Comment 5 is `dedupeChars(chars []Char, tolerance float64, extraAttrs []string)`; the coordinate fields used below (`X0`, `Y0`) are assumed from the snippets there, so align them with your actual `Char` type. The intended structure is:
1. Confirm which `Char` fields carry the coordinates used for deduping (assumed here to be `X0` and `Y0`).
2. Add a new test function, e.g. `func TestDedupeCharsZeroTolerance(t *testing.T)`, in `clustering_test.go`.
3. In that test:
   - Construct an input slice like:
     - Two characters with exactly identical coordinates and other relevant fields.
     - One or more characters with very close but not bitwise-identical coordinates.
   - Call `dedupeChars` with `tolerance == 0`.
   - Assert that:
     - Only one of the exactly identical characters remains.
     - None of the near-equal (but not identical) characters are removed.
4. Place this new test after your existing `TestDedupeChars` (if present) to keep related tests grouped.
Conceptually, the test body will look like:
```go
func TestDedupeCharsZeroTolerance(t *testing.T) {
	chars := []Char{
		{Text: "a", X0: 10, Y0: 20},
		{Text: "a", X0: 10, Y0: 20},        // exact duplicate
		{Text: "a", X0: 10.000001, Y0: 20}, // near-duplicate
	}
	got := dedupeChars(chars, 0, nil)
	// Only the exact duplicate should be dropped; the near-duplicate
	// must still be present.
	if len(got) != 2 {
		t.Fatalf("dedupeChars(tol=0) kept %d chars, want 2", len(got))
	}
}
```
You’ll need to adjust the field initializers if the real `Char` struct differs from what is assumed here.
</issue_to_address>
### Comment 5
<location path="clustering.go" line_range="200" />
<code_context>
+// The output is in the SAME ORDER as the input — the first occurrence
+// of each cluster is kept and subsequent duplicates are dropped. This
+// preserves content-stream order, which downstream code may rely on.
+func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
+ if len(chars) == 0 {
+ return nil
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the `dedupeChars` implementation by precomputing per-char keys, using a lightweight local coordinate-clustering helper instead of the generic clustering pipeline, and removing the unused `float64Bits` wrapper.
You can simplify the hot path in `dedupeChars` without changing behavior, and also drop some indirection that’s not paying for itself.
### 1. Avoid recomputing the string key for each comparison
Right now `keyOf` is called repeatedly:
- Inside the `sort.SliceStable` comparator.
- Inside the outer `for` that walks runs of equal keys.
That makes the grouping logic harder to follow and less efficient. Precompute the key once per char and carry it around:
```go
func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
	if len(chars) == 0 {
		return nil
	}

	buildKey := func(c Char) string {
		buf := make([]byte, 0, 32+len(c.Text)+len(c.FontName))
		if c.Upright {
			buf = append(buf, 'U')
		} else {
			buf = append(buf, 'u')
		}
		buf = append(buf, '\x00')
		buf = append(buf, c.Text...)
		for _, attr := range extraAttrs {
			buf = append(buf, '\x00')
			switch attr {
			case "fontname":
				buf = append(buf, c.FontName...)
			case "size":
				bits := math.Float64bits(c.FontSize)
				for i := 7; i >= 0; i-- {
					buf = append(buf, byte(bits>>(i*8)))
				}
			}
		}
		return string(buf)
	}

	type indexed struct {
		c   Char
		idx int
		key string
	}

	sorted := make([]indexed, len(chars))
	for i, c := range chars {
		sorted[i] = indexed{c: c, idx: i, key: buildKey(c)}
	}

	// Sort once by the precomputed key.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].key < sorted[j].key
	})

	keepIdx := make(map[int]struct{}, len(chars))

	// Walk equal-key runs without recomputing keys.
	for i := 0; i < len(sorted); {
		j := i + 1
		k := sorted[i].key
		for j < len(sorted) && sorted[j].key == k {
			j++
		}
		run := sorted[i:j]
		// ... inner clustering on run (see next section) ...
		i = j
	}

	// unchanged: build output using keepIdx
}
```
This makes the equivalence relation explicit (`key` field) and the logic easier to audit.
### 2. Use a simpler clustering helper for dedupe (avoid `makeClusterDict` indirection)
Within each equal-key run, you don’t need the full `clusterObjects` pipeline (which goes through `makeClusterDict` + map + extra allocation). For dedupe, a straightforward “sort by coord + linear scan” per dimension is enough and simpler to read.
You can keep the generic helper *local* to this file and use it only from `dedupeChars`, while leaving `clusterObjects` untouched for other callsites:
```go
// clusterByCoord clusters a slice by a float coordinate extracted by keyFn.
// It sorts by the coordinate and then groups consecutive items whose
// coordinate differs by <= tol.
// Note: xs is sorted in place, so a caller passing a subslice will see
// that subslice reordered.
func clusterByCoord[T any](xs []T, keyFn func(T) float64, tol float64) [][]T {
	if len(xs) == 0 {
		return nil
	}
	sort.Slice(xs, func(i, j int) bool {
		return keyFn(xs[i]) < keyFn(xs[j])
	})

	var out [][]T
	current := []T{xs[0]}
	last := keyFn(xs[0])
	for _, v := range xs[1:] {
		k := keyFn(v)
		if k <= last+tol {
			current = append(current, v)
		} else {
			out = append(out, current)
			current = []T{v}
		}
		last = k
	}
	out = append(out, current)
	return out
}
```
Then use it in `dedupeChars` instead of nested `clusterObjects`:
```go
// inside dedupeChars, after computing `run := sorted[i:j]`:
yClusters := clusterByCoord(run, func(e indexed) float64 { return e.c.Y0 }, tolerance)
for _, yc := range yClusters {
	xClusters := clusterByCoord(yc, func(e indexed) float64 { return e.c.X0 }, tolerance)
	for _, xc := range xClusters {
		minIdx := xc[0].idx
		for _, e := range xc[1:] {
			if e.idx < minIdx {
				minIdx = e.idx
			}
		}
		keepIdx[minIdx] = struct{}{}
	}
}
Behaviorally this matches the existing logic:
- `clusterByCoord` sorts by Y0/X0 and does a simple `delta <= tolerance` agglomeration (same as `clusterFloat1D`).
- You still choose the smallest original index within each (Y, X) bucket, preserving “first occurrence wins”.
But the dedupe path no longer depends on:
- `makeClusterDict`
- `clusterObjects`
- `clusterFloat1D`/map rebuilds for each run
which significantly reduces the conceptual depth of this function.
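For reference, sections 1 and 2 can be combined into a single runnable sketch. The `Char` struct here is a pared-down stand-in with only the fields the snippets above touch, the key omits `extraAttrs` for brevity, and `dedupe` is a hypothetical name. Treat it as an illustration under those assumptions rather than a drop-in replacement for `dedupeChars`.

```go
package main

import (
	"fmt"
	"sort"
)

// Char is a minimal stand-in for the package's Char type.
type Char struct {
	Text    string
	X0, Y0  float64
	Upright bool
}

type indexed struct {
	c   Char
	idx int    // position in the original slice
	key string // precomputed grouping key
}

// clusterByCoord sorts xs by keyFn and groups consecutive items whose
// coordinates differ by at most tol. It reorders xs in place.
func clusterByCoord(xs []indexed, keyFn func(indexed) float64, tol float64) [][]indexed {
	if len(xs) == 0 {
		return nil
	}
	sort.Slice(xs, func(i, j int) bool { return keyFn(xs[i]) < keyFn(xs[j]) })
	var out [][]indexed
	current := []indexed{xs[0]}
	last := keyFn(xs[0])
	for _, v := range xs[1:] {
		k := keyFn(v)
		if k <= last+tol {
			current = append(current, v)
		} else {
			out = append(out, current)
			current = []indexed{v}
		}
		last = k
	}
	return append(out, current)
}

// dedupe keeps the first occurrence of each (key, Y, X) cluster and
// returns the survivors in their original order.
func dedupe(chars []Char, tol float64) []Char {
	sorted := make([]indexed, len(chars))
	for i, c := range chars {
		// The key is computed once per char, not once per comparison.
		sorted[i] = indexed{c: c, idx: i, key: fmt.Sprintf("%t\x00%s", c.Upright, c.Text)}
	}
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].key < sorted[j].key })

	keep := make(map[int]struct{}, len(chars))
	for i := 0; i < len(sorted); {
		j := i + 1
		for j < len(sorted) && sorted[j].key == sorted[i].key {
			j++
		}
		byY := func(e indexed) float64 { return e.c.Y0 }
		byX := func(e indexed) float64 { return e.c.X0 }
		for _, yc := range clusterByCoord(sorted[i:j], byY, tol) {
			for _, xc := range clusterByCoord(yc, byX, tol) {
				minIdx := xc[0].idx
				for _, e := range xc[1:] {
					if e.idx < minIdx {
						minIdx = e.idx
					}
				}
				keep[minIdx] = struct{}{}
			}
		}
		i = j
	}

	out := make([]Char, 0, len(keep))
	for i, c := range chars {
		if _, ok := keep[i]; ok {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chars := []Char{
		{Text: "a", X0: 10, Y0: 20, Upright: true},
		{Text: "b", X0: 16, Y0: 20, Upright: true},
		{Text: "a", X0: 10.4, Y0: 20.2, Upright: true}, // overprinted copy of the first "a"
	}
	for _, c := range dedupe(chars, 1) {
		fmt.Println(c.Text, c.X0, c.Y0)
	}
}
```

Reordering each equal-key run in place is harmless here because every element in the run shares the same key, so the outer sort order by key is preserved.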
### 3. Remove the unused `float64Bits` wrapper
`float64Bits` is defined but not used; it adds a name without simplifying anything:
```go
// currently:
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
```
You can safely delete this function. If you later find multiple callsites wanting stable float serialization for keys, you could instead introduce a more descriptive helper (e.g. the `buildKey` logic above already shows the pattern).
</issue_to_address>
| // Float64bits dependency so other tests/files don't have to import | ||
| // "math" just to compare floats. Left as a package-private helper for | ||
| // now — only dedupeChars's key construction uses it. | ||
| func float64Bits(f float64) uint64 { return math.Float64bits(f) } |
There was a problem hiding this comment.
nitpick: float64Bits helper is currently unused and adds unnecessary surface area.
Given it isn’t used, this is effectively dead code and expands the package surface unnecessarily. Unless there’s a concrete near-term use, consider removing it and calling math.Float64bits directly where needed instead.
| } | ||
| } | ||
|
|
||
| func TestDirectionFor(t *testing.T) { |
There was a problem hiding this comment.
suggestion (testing): Add tests that exercise vertical and rotated text flows (ttb/btt) end-to-end
Existing tests only cover upright horizontal (LTR) text. Since mergeLineIntoWords/charBeginsNewWord/sortCharsByDir have direction-specific branches for rotated text (Upright=false, TTB/BTT), please add tests that construct rotated Char slices and verify resulting Words’ Direction and grouping. For instance, use a synthetic top-to-bottom column and assert word order and behavior when toggling VerticalTTB/HorizontalLTR, so the vertical/rotated paths are exercised and protected against regressions.
| } | ||
| } | ||
|
|
||
| func TestPageExtractTextSimple(t *testing.T) { |
There was a problem hiding this comment.
suggestion (testing): Strengthen coverage of ExtractTextSimple with unit-style Char slices and edge cases
ExtractTextSimple has bespoke behavior (dropping empty-text chars, gap-based space insertion, defaulting tolerances, handling empty pages), but it’s only tested via a full PDF and substring checks. Please add unit-style tests that exercise the core logic directly (e.g., via a helper or by factoring it out) with hand-crafted Char slices, and assert exact output for key cases: no chars, empty Text chars, explicit spaces vs gap-inferred spaces, and xTolerance/yTolerance = 0 vs non-zero. This will better lock in the intended semantics and protect future refactors.
Suggested implementation:
func TestExtractTextSimple_EmptyChars(t *testing.T) {
// Empty input should yield empty output without error.
got := extractTextSimple(nil, DefaultTextOpts())
if got != "" {
t.Fatalf("extractTextSimple(nil) = %q, want empty string", got)
}
got = extractTextSimple([]Char{}, DefaultTextOpts())
if got != "" {
t.Fatalf("extractTextSimple([]Char{}) = %q, want empty string", got)
}
}
func TestExtractTextSimple_DropsEmptyTextChars(t *testing.T) {
opts := DefaultTextOpts()
chars := []Char{
{Text: ""},
{Text: "A"},
{Text: ""},
{Text: "B"},
{Text: ""},
}
got := extractTextSimple(chars, opts)
want := "AB"
if got != want {
t.Fatalf("extractTextSimple dropped-empty-text: got %q, want %q", got, want)
}
}
func TestExtractTextSimple_ExplicitSpacesVsGapSpaces(t *testing.T) {
// This test assumes ExtractTextSimple:
// * keeps explicit space characters as-is
// * inserts spaces when the gap between consecutive chars exceeds xTolerance
opts := DefaultTextOpts()
opts.XTolerance = 2.0
chars := []Char{
// "A B" with explicit space
{Text: "A", X: 0, Y: 0},
{Text: " ", X: 1, Y: 0},
{Text: "B", X: 2, Y: 0},
// large gap should cause an inserted space between C and D
{Text: "C", X: 10, Y: 0},
{Text: "D", X: 20, Y: 0},
}
got := extractTextSimple(chars, opts)
// We expect both the explicit space (between A and B) and the gap-based space (between C and D).
if !strings.Contains(got, "A B") {
t.Fatalf("extractTextSimple explicit-space: got %q, want to contain %q", got, "A B")
}
if !strings.Contains(got, "C D") {
t.Fatalf("extractTextSimple gap-space: got %q, want to contain %q", got, "C D")
}
}
func TestExtractTextSimple_XToleranceZero(t *testing.T) {
// With xTolerance = 0, only explicit spaces should appear; no gap-based insertion.
opts := DefaultTextOpts()
opts.XTolerance = 0
chars := []Char{
{Text: "A", X: 0, Y: 0},
{Text: "B", X: 10, Y: 0}, // large gap, but xTolerance == 0
}
got := extractTextSimple(chars, opts)
want := "AB" // no gap-based space
if got != want {
t.Fatalf("extractTextSimple xTolerance=0: got %q, want %q", got, want)
}
}
func TestExtractTextSimple_YToleranceGrouping(t *testing.T) {
// This test exercises how yTolerance affects line grouping.
opts := DefaultTextOpts()
opts.YTolerance = 1.0
chars := []Char{
// "AB" on first line
{Text: "A", X: 0, Y: 0},
{Text: "B", X: 1, Y: 0},
// "CD" on a second line outside yTolerance
{Text: "C", X: 0, Y: 3},
{Text: "D", X: 1, Y: 3},
}
got := extractTextSimple(chars, opts)
// We expect a line break between "AB" and "CD".
if !strings.Contains(got, "AB") || !strings.Contains(got, "CD") {
t.Fatalf("extractTextSimple yTolerance: got %q, want to contain both %q and %q", got, "AB", "CD")
}
if !strings.Contains(got, "AB\nCD") && !strings.Contains(got, "AB\r\nCD") {
t.Fatalf("extractTextSimple yTolerance: got %q, want line break between lines", got)
}
}I assumed there is an internal helper extractTextSimple(chars []Char, opts TextOptions) string and a Char type with at least Text string, X float64, and Y float64 fields, plus DefaultTextOpts() returning a struct that includes XTolerance and YTolerance. If the actual API differs, you will need to:
- Adjust the helper calls to the real function that implements ExtractTextSimple’s core logic (or factor such a helper out from the method under test into the package).
- Update the
Charfield initializations in the tests to match the realCharstruct (e.g., useBBox/Point/Posfields instead ofX/Yif necessary). - If
extractTextSimplereturns(string, error)instead of juststring, capture and asserterr == nilbefore checkinggot. - Place these tests near your existing
TestPageExtractTextSimple(or update/replace that test if you prefer these unit-style cases instead of the current PDF-based one).
|
|
||
| // TestClusterObjects exercises both preserveOrder modes on a tiny set | ||
| // of struct-valued inputs. | ||
| func TestClusterObjects(t *testing.T) { |
There was a problem hiding this comment.
suggestion (testing): Add tests for clusterObjects and dedupeChars with zero/negative tolerances
One remaining gap is how clusterObjects and dedupeChars behave when tolerance is 0 (or negative) in their own APIs. Please add:
- A test for clusterObjects with tolerance==0 that asserts the expected grouping (identical vs near-equal values).
- A test for dedupeChars with tolerance==0 that confirms whether only exact coordinate matches are removed.
This will lock in the intended contract and protect against subtle regressions around tolerance handling.
Suggested implementation:
// TestClusterObjects exercises both preserveOrder modes on a tiny set
// of struct-valued inputs.
func TestClusterObjects(t *testing.T) {
type pt struct {
x float64
tag string
}
xs := []pt{
{x: 1, tag: "a"},
{x: 10, tag: "b"},
{x: 2, tag: "c"},
{x: 11, tag: "d"},
}
// tolerance == 0: identical keys should cluster, distinct keys should not.
zeroTol := 0.0
xsZeroTol := []pt{
{x: 1, tag: "a1"},
{x: 1, tag: "a2"}, // identical to a1, should be in same cluster at tol=0
{x: 1.000001, tag: "a3"}, // near-equal but not identical, should be in a different cluster at tol=0
}
clustersZero := clusterObjects(xsZeroTol, func(p pt) float64 { return p.x }, zeroTol, false)
if len(clustersZero) != len(xsZeroTol) {
t.Fatalf("clusterObjects(tol=0) returned %d labels for %d inputs", len(clustersZero), len(xsZeroTol))
}
if clustersZero[0] != clustersZero[1] {
t.Errorf("expected identical keys (indices 0 and 1) to be in same cluster at tol=0, got %d vs %d", clustersZero[0], clustersZero[1])
}
if clustersZero[0] == clustersZero[2] {
t.Errorf("expected near-equal but non-identical keys (indices 0 and 2) to be in different clusters at tol=0, both got %d", clustersZero[0])
}
// preserveOrder=false: clusters sorted by key, items sorted withinTo fully implement your suggestion, you’ll also want to add a dedicated test for dedupeChars with tolerance == 0. I can’t see the definition/signature of dedupeChars or the concrete character type it operates on, so the exact code needs to be aligned with your existing types, but the intended structure is:
- Identify the character type used by
dedupeChars(for example,char,glyph, or similar), including whatever fields carry the coordinates used for deduping (e.g.x,y, or apt/posfield). - Add a new test function, e.g.
func TestDedupeCharsZeroTolerance(t *testing.T), inclustering_test.go. - In that test:
- Construct an input slice like:
- Two characters with exactly identical coordinates and other relevant fields.
- One or more characters with very close but not bitwise-identical coordinates.
- Call
dedupeCharswithtolerance == 0. - Assert that:
- Only one of the exactly identical characters remains.
- None of the near-equal (but not identical) characters are removed.
- Construct an input slice like:
- Place this new test after your existing
TestDedupeChars(if present) to keep related tests grouped.
Conceptually, the test body will look like:
func TestDedupeCharsZeroTolerance(t *testing.T) {
chars := []YourCharType{
{/* coord: (10, 20), rune: 'a' */},
{/* coord: (10, 20), rune: 'a' */}, // exact duplicate
{/* coord: (10.000001, 20), rune: 'a' */}, // near-duplicate
}
got := dedupeChars(chars, 0)
// Assert that only one exact duplicate remains, and
// the near-duplicate is still present.
}You’ll need to replace YourCharType and the field initializers with whatever is actually used in your package and wired into dedupeChars.
| // The output is in the SAME ORDER as the input — the first occurrence | ||
| // of each cluster is kept and subsequent duplicates are dropped. This | ||
| // preserves content-stream order, which downstream code may rely on. | ||
| func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char { |
There was a problem hiding this comment.
issue (complexity): Consider simplifying the dedupeChars implementation by precomputing per-char keys, using a lightweight local coordinate-clustering helper instead of the generic clustering pipeline, and removing the unused float64Bits wrapper.
You can simplify the hot path in dedupeChars without changing behavior, and also drop some indirection that’s not paying for itself.
1. Avoid recomputing the string key for each comparison
Right now keyOf is called repeatedly:
- Inside the
sort.SliceStablecomparator. - Inside the outer
forthat walks runs of equal keys.
That makes the grouping logic harder to follow and less efficient. Precompute the key once per char and carry it around:
func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
if len(chars) == 0 {
return nil
}
buildKey := func(c Char) string {
buf := make([]byte, 0, 32+len(c.Text)+len(c.FontName))
if c.Upright {
buf = append(buf, 'U')
} else {
buf = append(buf, 'u')
}
buf = append(buf, '\x00')
buf = append(buf, c.Text...)
for _, attr := range extraAttrs {
buf = append(buf, '\x00')
switch attr {
case "fontname":
buf = append(buf, c.FontName...)
case "size":
bits := math.Float64bits(c.FontSize)
for i := 7; i >= 0; i-- {
buf = append(buf, byte(bits>>(i*8)))
}
}
}
return string(buf)
}
type indexed struct {
c Char
idx int
key string
}
sorted := make([]indexed, len(chars))
for i, c := range chars {
sorted[i] = indexed{c: c, idx: i, key: buildKey(c)}
}
// Sort once by the precomputed key.
sort.SliceStable(sorted, func(i, j int) bool {
return sorted[i].key < sorted[j].key
})
keepIdx := make(map[int]struct{}, len(chars))
// Walk equal-key runs without recomputing keys.
for i := 0; i < len(sorted); {
j := i + 1
k := sorted[i].key
for j < len(sorted) && sorted[j].key == k {
j++
}
run := sorted[i:j]
// ... inner clustering on run (see next section) ...
i = j
}
// unchanged: build output using keepIdx
}
This makes the equivalence relation explicit (key field) and the logic easier to audit.
2. Use a simpler clustering helper for dedupe (avoid makeClusterDict indirection)
Within each equal-key run, you don’t need the full clusterObjects pipeline (which goes through makeClusterDict + map + extra allocation). For dedupe, a straightforward “sort by coord + linear scan” per dimension is enough and simpler to read.
You can keep the generic helper local to this file and use it only from dedupeChars, while leaving clusterObjects untouched for other callsites:
// clusterByCoord clusters a slice by a float coordinate extracted by keyFn.
// It sorts by the coordinate and then groups consecutive items whose
// coordinate differs by <= tol.
func clusterByCoord[T any](xs []T, keyFn func(T) float64, tol float64) [][]T {
if len(xs) == 0 {
return nil
}
sort.Slice(xs, func(i, j int) bool {
return keyFn(xs[i]) < keyFn(xs[j])
})
var out [][]T
current := []T{xs[0]}
last := keyFn(xs[0])
for _, v := range xs[1:] {
k := keyFn(v)
if k <= last+tol {
current = append(current, v)
} else {
out = append(out, current)
current = []T{v}
}
last = k
}
out = append(out, current)
return out
}
Then use it in dedupeChars instead of nested clusterObjects:
// inside dedupeChars, after computing `run := sorted[i:j]`:
yClusters := clusterByCoord(run, func(e indexed) float64 { return e.c.Y0 }, tolerance)
for _, yc := range yClusters {
xClusters := clusterByCoord(yc, func(e indexed) float64 { return e.c.X0 }, tolerance)
for _, xc := range xClusters {
minIdx := xc[0].idx
for _, e := range xc[1:] {
if e.idx < minIdx {
minIdx = e.idx
}
}
keepIdx[minIdx] = struct{}{}
}
}
Behaviorally this matches the existing logic:
- clusterByCoord sorts by Y0/X0 and does a simple delta <= tolerance agglomeration (same as clusterFloat1D).
- You still choose the smallest original index within each (Y, X) bucket, preserving “first occurrence wins”.
But the dedupe path no longer depends on:
- makeClusterDict
- clusterObjects
- clusterFloat1D/map rebuilds for each run
which significantly reduces the conceptual depth of this function.
3. Remove the unused float64Bits wrapper
float64Bits is defined but not used; it adds a name without simplifying anything:
// currently:
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
You can safely delete this function. If you later find multiple callsites wanting stable float serialization for keys, you could instead introduce a more descriptive helper (e.g. the buildKey logic above already shows the pattern).
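As a minimal, self-contained sketch of the suggestion above (not the repository's actual code), the following runs the proposed clusterByCoord helper on a toy `entry` type, a hypothetical stand-in for the `indexed` struct, to show the Y-then-X clustering and the "smallest original index wins" rule. The `keepIndices` wrapper and the sample coordinates are assumptions made for illustration. One detail worth noticing is that clusterByCoord sorts its argument in place, so the sketch works on a copy.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for the review's `indexed` struct:
// a position plus the char's original index in content-stream order.
type entry struct {
	x, y float64
	idx  int
}

// clusterByCoord sorts xs in place by keyFn, then groups consecutive items
// whose coordinates differ by at most tol (single-linkage chaining).
func clusterByCoord[T any](xs []T, keyFn func(T) float64, tol float64) [][]T {
	if len(xs) == 0 {
		return nil
	}
	sort.Slice(xs, func(i, j int) bool { return keyFn(xs[i]) < keyFn(xs[j]) })
	var out [][]T
	current := []T{xs[0]}
	last := keyFn(xs[0])
	for _, v := range xs[1:] {
		k := keyFn(v)
		if k <= last+tol {
			current = append(current, v)
		} else {
			out = append(out, current)
			current = []T{v}
		}
		last = k
	}
	return append(out, current)
}

// keepIndices returns, in ascending order, the smallest original index of
// each (Y, X) cluster, i.e. "first occurrence wins".
func keepIndices(run []entry, tol float64) []int {
	scratch := append([]entry(nil), run...) // copy: clusterByCoord reorders its input
	var keep []int
	for _, yc := range clusterByCoord(scratch, func(e entry) float64 { return e.y }, tol) {
		for _, xc := range clusterByCoord(yc, func(e entry) float64 { return e.x }, tol) {
			minIdx := xc[0].idx
			for _, e := range xc[1:] {
				if e.idx < minIdx {
					minIdx = e.idx
				}
			}
			keep = append(keep, minIdx)
		}
	}
	sort.Ints(keep)
	return keep
}

func main() {
	run := []entry{
		{x: 10.0, y: 100.0, idx: 0},
		{x: 10.4, y: 100.3, idx: 1}, // near-duplicate of idx 0
		{x: 50.0, y: 100.0, idx: 2},
		{x: 10.0, y: 200.0, idx: 3},
	}
	fmt.Println(keepIndices(run, 1.0)) // prints [0 2 3]
}
```

Because grouping chains through neighbours, a long run of chars each within the tolerance of the next can merge into one cluster even when its ends are further apart than the tolerance; that is the same property the review attributes to clusterFloat1D.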
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
README.md (1)
217-237: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Define the must helper or remove it.
Line 236 uses an undefined must helper function. Users copying this example will encounter a compilation error. Either define must (e.g., func must(s string, _ error) string { return s }) in the example, or replace it with explicit error handling as shown in the quickstart above.
📝 Proposed fix: remove the must helper
-fmt.Println(must(page.ExtractText(pdftable.DefaultTextOpts())))
+text, _ := page.ExtractText(pdftable.DefaultTextOpts())
+fmt.Println(text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 217 - 237, The README example calls an undefined helper must around page.ExtractText(pdftable.DefaultTextOpts()), causing a compile error; fix by either adding a small helper func must(s string, err error) string that returns s or by replacing the must call with explicit error handling: call page.ExtractText(pdftable.DefaultTextOpts()), check the returned error, handle/log/exit on error and then print the returned string. Update the example to reference the chosen approach and ensure the symbol names (must, page.ExtractText, pdftable.DefaultTextOpts) are used correctly.
page.go (1)
31-31: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Avoid source-breaking by expanding Page
- page.go expands the exported Page interface with Words, ExtractText, and ExtractTextSimple; in Go, adding methods to a public interface is source-breaking for any downstream type/mocks that already implement pdftable.Page.
- The repo’s “Page interface is additive” wording covers callers that only use the interface, not external implementers. If third-party implementations are expected, keep Page stable and add a separate extended interface (e.g., TextPage embedding Page).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@page.go` at line 31, The change expands the exported Page interface by adding Words, ExtractText, and ExtractTextSimple which is source-breaking for external implementers; instead, keep the existing Page unchanged and introduce a new extended interface (e.g., TextPage) that embeds Page and declares Words, ExtractText, and ExtractTextSimple, then update internal code paths that need text features to use TextPage while leaving Page consumers/implementers intact.
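To make the extended-interface suggestion concrete, here is a small sketch of the pattern. Every name in it is a hypothetical stand-in rather than the pdftable API: `Page` is reduced to a single method, `TextPage` embeds it, and `textOf` uses a type assertion to upgrade a `Page` only when the concrete type supports text extraction.

```go
package main

import "fmt"

// All names below are hypothetical stand-ins, not the pdftable API.

// Page models the frozen, already-published interface (reduced to one method).
type Page interface {
	Chars() []string
}

// TextPage embeds Page and adds the new method, so types that only
// implement Page keep compiling.
type TextPage interface {
	Page
	ExtractText() (string, error)
}

// legacyPage models a third-party type written against the old interface.
type legacyPage struct{}

func (legacyPage) Chars() []string { return []string{"h", "i"} }

// richPage models an implementation that also supports text extraction.
type richPage struct{ legacyPage }

func (p richPage) ExtractText() (string, error) {
	out := ""
	for _, c := range p.Chars() {
		out += c
	}
	return out, nil
}

// textOf upgrades p to TextPage when possible and reports whether it could.
func textOf(p Page) (text string, supported bool, err error) {
	tp, ok := p.(TextPage)
	if !ok {
		return "", false, nil
	}
	text, err = tp.ExtractText()
	return text, true, err
}

func main() {
	for _, p := range []Page{legacyPage{}, richPage{}} {
		s, ok, err := textOf(p)
		fmt.Printf("%T: text=%q supported=%v err=%v\n", p, s, ok, err)
	}
}
```

The trade-off is that callers needing text features must either accept `TextPage` directly or perform the assertion, which is the price of leaving existing implementers untouched.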
🧹 Nitpick comments (2)
text.go (2)
795-799: 💤 Low value
Dead code: no-op space assignment.
This block checks if a position is already a space, then sets it to a space, which is a no-op. The comment suggests separator insertion intent, but since the grid is initialized with spaces, this code has no effect.
♻️ Remove the dead code
 col++
 }
-// Insert a separator space if there's room — but only if
-// the next position isn't already non-blank.
-if col < widthChars && rows[row][col] == ' ' {
-	rows[row][col] = ' '
-}
 }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@text.go` around lines 795 - 799, Remove the no-op separator assignment: the if block that checks "if col < widthChars && rows[row][col] == ' ' { rows[row][col] = ' ' }" is dead code (grid is initialized with spaces) and should be deleted; locate the code using the identifiers rows, row, col, and widthChars and remove that conditional and its body (or, if the intended behavior was to insert a separator when the cell is not blank, change the condition to rows[row][col] != ' ' and set it to ' ' instead).
282-305: 💤 Low value
Ineffectual assignment flagged by static analysis.
The assignment filtered := chars at line 283 is never used; both branches of the if/else overwrite it. This can be simplified into a single filtering loop.
♻️ Suggested simplification
 // Step 1/2: filter blanks (unless KeepBlankChars) and empties.
-filtered := chars
-if !opts.KeepBlankChars {
-	out := make([]Char, 0, len(chars))
-	for _, c := range chars {
-		if c.Text == "" {
-			continue
-		}
-		if isAllSpace(c.Text) {
-			continue
-		}
-		out = append(out, c)
-	}
-	filtered = out
-} else {
-	out := make([]Char, 0, len(chars))
-	for _, c := range chars {
-		if c.Text == "" {
-			continue
-		}
-		out = append(out, c)
-	}
-	filtered = out
-}
+filtered := make([]Char, 0, len(chars))
+for _, c := range chars {
+	if c.Text == "" {
+		continue
+	}
+	if !opts.KeepBlankChars && isAllSpace(c.Text) {
+		continue
+	}
+	filtered = append(filtered, c)
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@text.go` around lines 282 - 305, The preallocated assignment filtered := chars is ineffectual because both branches replace it; remove that assignment and replace the duplicated branches with a single loop that builds out := make([]Char, 0, len(chars)) iterating over chars and applying the two checks: always skip empty Text (c.Text == ""), and conditionally skip all-space tokens by calling isAllSpace(c.Text) only when opts.KeepBlankChars is false; assign filtered = out at the end (use the existing types Char, variable chars, and option opts.KeepBlankChars).
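The single-loop form can be checked in isolation with a sketch like the one below. `Char` is reduced to its `Text` field and `isAllSpace` is an assumed implementation (non-empty and whitespace-only); the real helper in text.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Char is a hypothetical reduction of the library's Char to its Text field.
type Char struct{ Text string }

// isAllSpace is an assumed implementation: true for non-empty strings
// that consist only of whitespace.
func isAllSpace(s string) bool {
	return s != "" && strings.TrimSpace(s) == ""
}

// filterChars always drops empty chars and drops whitespace-only chars
// unless keepBlankChars is set.
func filterChars(chars []Char, keepBlankChars bool) []Char {
	filtered := make([]Char, 0, len(chars))
	for _, c := range chars {
		if c.Text == "" {
			continue
		}
		if !keepBlankChars && isAllSpace(c.Text) {
			continue
		}
		filtered = append(filtered, c)
	}
	return filtered
}

func main() {
	in := []Char{{"a"}, {""}, {" "}, {"b"}}
	fmt.Println(len(filterChars(in, false)), len(filterChars(in, true))) // prints 2 3
}
```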
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustering.go`:
- Around line 282-286: The helper function float64Bits is unused (lint error);
remove the unused function float64Bits (the small wrapper around
math.Float64bits) from the file, or if intended to be used, replace direct calls
to math.Float64bits in dedupeChars's key construction with float64Bits to make
it referenced; prefer deleting the float64Bits function to satisfy static
analysis if no call sites are added.
In `@geometry.go`:
- Around line 75-78: The comment above the Intersect logic incorrectly states
that shared-edge (zero-area) cases are treated as non-overlap while the
Intersect function and tests treat edge-touch (w>0,h=0 or w=0,h>0) as
overlapping; update the documentation comment for the Intersect function in
geometry.go to accurately describe the implemented behavior (edge-touch where
one dimension is zero and the other >0 is considered an intersection) and
mention the specific condition used (w>0 || h>0) so callers are not misled.
In `@text_test.go`:
- Line 260: Replace the naked defer doc.Close() calls with a deferred closure
that checks and surfaces the error returned by Document.Close; for each
occurrence (the doc.Close() calls in text_test.go) change to defer func() { if
err := doc.Close(); err != nil { t.Fatalf("closing document: %v", err) } }() (or
t.Errorf if you prefer non-fatal) so cleanup failures are not silently discarded
— locate the doc.Close() usages and wrap them in the deferred error-checking
closure.
- Line 261: The test is discarding the error returned by Document.Page; change
all occurrences like p, _ := doc.Page(1) to capture the error (p, err :=
doc.Page(1)) and immediately check it (fail/assert) so lookup/out-of-range
failures surface; update the instances in text_test.go (lines around
261/283/314/331) and page_test.go (around line 159) to assert err == nil (or
t.Fatalf/t.Helper with the error) before using p.
---
Outside diff comments:
In `@page.go`:
- Line 31: The change expands the exported Page interface by adding Words,
ExtractText, and ExtractTextSimple which is source-breaking for external
implementers; instead, keep the existing Page unchanged and introduce a new
extended interface (e.g., TextPage) that embeds Page and declares Words,
ExtractText, and ExtractTextSimple, then update internal code paths that need
text features to use TextPage while leaving Page consumers/implementers intact.
In `@README.md`:
- Around line 217-237: The README example calls an undefined helper must around
page.ExtractText(pdftable.DefaultTextOpts()), causing a compile error; fix by
either adding a small helper func must(s string, err error) string that returns
s or by replacing the must call with explicit error handling: call
page.ExtractText(pdftable.DefaultTextOpts()), check the returned error,
handle/log/exit on error and then print the returned string. Update the example
to reference the chosen approach and ensure the symbol names (must,
page.ExtractText, pdftable.DefaultTextOpts) are used correctly.
---
Nitpick comments:
In `@text.go`:
- Around line 795-799: Remove the no-op separator assignment: the if block that
checks "if col < widthChars && rows[row][col] == ' ' { rows[row][col] = ' ' }"
is dead code (grid is initialized with spaces) and should be deleted; locate the
code using the identifiers rows, row, col, and widthChars and remove that
conditional and its body (or, if the intended behavior was to insert a separator
when the cell is not blank, change the condition to rows[row][col] != ' ' and
set it to ' ' instead).
- Around line 282-305: The preallocated assignment filtered := chars is
ineffectual because both branches replace it; remove that assignment and replace
the duplicated branches with a single loop that builds out := make([]Char, 0,
len(chars)) iterating over chars and applying the two checks: always skip empty
Text (c.Text == ""), and conditionally skip all-space tokens by calling
isAllSpace(c.Text) only when opts.KeepBlankChars is false; assign filtered = out
at the end (use the existing types Char, variable chars, and option
opts.KeepBlankChars).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 21f89b40-896e-4cd7-9005-ea55c3209473
⛔ Files ignored due to path filters (3)
- testdata/golden/hello.pdf is excluded by !**/*.pdf
- testdata/golden/rules.pdf is excluded by !**/*.pdf
- testdata/golden/simple1.pdf is excluded by !**/*.pdf
📒 Files selected for processing (15)
- CHANGELOG.md
- README.md
- clustering.go
- clustering_test.go
- geometry.go
- geometry_test.go
- golden_test.go
- page.go
- pdftable.go
- scripts/gen_golden.py
- testdata/golden/hello.expected.json
- testdata/golden/rules.expected.json
- testdata/golden/simple1.expected.json
- text.go
- text_test.go
// float64Bits is a tiny re-export point that consolidates the math.
// Float64bits dependency so other tests/files don't have to import
// "math" just to compare floats. Left as a package-private helper for
// now — only dedupeChars's key construction uses it.
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
Remove unused float64Bits helper (lint error).
Line 286 is unused and already flagged by static analysis. Please remove it (or use it consistently) to keep CI/lint clean.
Proposed cleanup
-// float64Bits is a tiny re-export point that consolidates the math.
-// Float64bits dependency so other tests/files don't have to import
-// "math" just to compare floats. Left as a package-private helper for
-// now — only dedupeChars's key construction uses it.
-func float64Bits(f float64) uint64 { return math.Float64bits(f) }
🧰 Tools
🪛 golangci-lint (2.12.2)
[error] 286-286: func float64Bits is unused
(unused)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@clustering.go` around lines 282 - 286, The helper function float64Bits is
unused (lint error); remove the unused function float64Bits (the small wrapper
around math.Float64bits) from the file, or if intended to be used, replace
direct calls to math.Float64bits in dedupeChars's key construction with
float64Bits to make it referenced; prefer deleting the float64Bits function to
satisfy static analysis if no call sites are added.
// We treat touching-but-not-overlapping (shared edge, zero area) as
// non-overlap, matching pdfplumber's `o_height + o_width > 0` check —
// a single-line ruler that grazes a word's bbox should not be reported
// as "intersecting" the word.
Fix Intersect doc semantics for edge-touch cases.
Line 75 currently says shared-edge, zero-area intersections are treated as non-overlap, but Line 89 logic and tests treat edge-touch (w>0,h=0 or w=0,h>0) as overlap. Please align the comment to avoid misleading callers.
Proposed doc fix
-// We treat touching-but-not-overlapping (shared edge, zero area) as
-// non-overlap, matching pdfplumber's `o_height + o_width > 0` check —
-// a single-line ruler that grazes a word's bbox should not be reported
-// as "intersecting" the word.
+// We treat point-touch (w=0,h=0) as non-overlap, matching
+// pdfplumber's `o_height + o_width > 0` check. Edge-touch
+// intersections (w=0,h>0 or w>0,h=0) are counted as overlap.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@geometry.go` around lines 75 - 78, The comment above the Intersect logic
incorrectly states that shared-edge (zero-area) cases are treated as non-overlap
while the Intersect function and tests treat edge-touch (w>0,h=0 or w=0,h>0) as
overlapping; update the documentation comment for the Intersect function in
geometry.go to accurately describe the implemented behavior (edge-touch where
one dimension is zero and the other >0 is considered an intersection) and
mention the specific condition used (w>0 || h>0) so callers are not misled.
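The edge-touch versus point-touch distinction is easy to get wrong, so here is a sketch of the semantics the comment describes. `BBox` and `Intersect` below are hypothetical stand-ins written for this illustration, not the code in geometry.go; the only thing taken from the thread is the `o_height + o_width > 0` rule, applied to non-negative width and height.

```go
package main

import (
	"fmt"
	"math"
)

// BBox is a hypothetical stand-in for the library's bounding-box type.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// Intersect returns the overlap region and whether the boxes overlap.
// With w and h non-negative, w+h > 0 is equivalent to w > 0 || h > 0:
// sharing an edge counts as overlap, sharing only a corner does not.
func Intersect(a, b BBox) (BBox, bool) {
	x0, y0 := math.Max(a.X0, b.X0), math.Max(a.Y0, b.Y0)
	x1, y1 := math.Min(a.X1, b.X1), math.Min(a.Y1, b.Y1)
	w, h := x1-x0, y1-y0
	if w < 0 || h < 0 || w+h <= 0 {
		return BBox{}, false
	}
	return BBox{x0, y0, x1, y1}, true
}

func main() {
	a := BBox{0, 0, 10, 10}
	cases := map[string]BBox{
		"area overlap": {5, 5, 15, 15},
		"shared edge":  {10, 0, 20, 10},
		"shared point": {10, 10, 20, 20},
		"disjoint":     {11, 0, 20, 10},
	}
	for name, b := range cases {
		_, ok := Intersect(a, b)
		fmt.Printf("%-12s overlap=%v\n", name, ok)
	}
}
```

Under these assumptions the shared-edge case reports an overlap with zero area, which is exactly the case the original doc comment mislabelled.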
if err != nil {
	t.Fatalf("open: %v", err)
}
defer doc.Close()
Handle Document.Close errors instead of discarding them.
Line 260, Line 282, Line 313, and Line 330 use defer doc.Close() without checking the returned error, which is currently tripping errcheck and can hide cleanup failures.
Suggested fix pattern
- defer doc.Close()
+ t.Cleanup(func() {
+ if cerr := doc.Close(); cerr != nil {
+ t.Errorf("close: %v", cerr)
+ }
+ })
Also applies to: 282-282, 313-313, 330-330
🧰 Tools
🪛 golangci-lint (2.12.2)
[error] 260-260: Error return value of doc.Close is not checked
(errcheck)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@text_test.go` at line 260, Replace the naked defer doc.Close() calls with a
deferred closure that checks and surfaces the error returned by Document.Close;
for each occurrence (the doc.Close() calls in text_test.go) change to defer
func() { if err := doc.Close(); err != nil { t.Fatalf("closing document: %v",
err) } }() (or t.Errorf if you prefer non-fatal) so cleanup failures are not
silently discarded — locate the doc.Close() usages and wrap them in the deferred
error-checking closure.
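Since the same cleanup pattern is needed at four call sites, one option is a small helper. The sketch below is an assumption-laden illustration: `closeOnCleanup`, `reporter`, `fakeT`, and `failingCloser` are all hypothetical names, and the helper assumes the document type satisfies `io.Closer`. The `reporter` interface lists only the two `*testing.T` methods the helper uses, which lets the sketch run outside `go test`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// reporter is the subset of *testing.T used by closeOnCleanup.
type reporter interface {
	Errorf(format string, args ...any)
	Cleanup(func())
}

// closeOnCleanup registers c.Close to run at test cleanup and reports
// any error instead of silently discarding it.
func closeOnCleanup(t reporter, c io.Closer) {
	t.Cleanup(func() {
		if err := c.Close(); err != nil {
			t.Errorf("close: %v", err)
		}
	})
}

// fakeT is a minimal in-memory reporter used only to exercise the helper.
type fakeT struct {
	cleanups []func()
	errs     []string
}

func (f *fakeT) Errorf(format string, args ...any) {
	f.errs = append(f.errs, fmt.Sprintf(format, args...))
}

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// finish runs cleanups last-in, first-out, as the testing package does.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

// failingCloser always fails, to show the error being surfaced.
type failingCloser struct{}

func (failingCloser) Close() error { return errors.New("disk gone") }

func main() {
	t := &fakeT{}
	closeOnCleanup(t, failingCloser{})
	t.finish()
	fmt.Println(t.errs) // prints [close: disk gone]
}
```

In a real test the call would look like `closeOnCleanup(t, doc)` in place of `defer doc.Close()`; using `t.Errorf` rather than `t.Fatalf` keeps the remaining cleanups running.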
	t.Fatalf("open: %v", err)
}
defer doc.Close()
p, _ := doc.Page(1)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Inspect Document/Page signatures:"
rg -nP --type=go -C2 '\btype\s+Document\b|\bfunc\s*\([^)]*\)\s*Page\s*\('
echo
echo "Find call sites that ignore the second return value:"
rg -nP --type=go -C2 '\bp\s*,\s*_\s*:=\s*doc\.Page\s*\('
Repository: hallelx2/pdftable
Length of output: 1610
Don’t discard the error from doc.Page(1).
Document.Page returns (Page, error) (pdf.go), so p, _ := doc.Page(1) can hide lookup/out-of-range failures and make later assertions fail unclearly.
In text_test.go lines 261/283/314/331 (and also page_test.go line 159), replace _ with an err check, e.g. p, err := doc.Page(1) and fail/assert on err.
p, _ := doc.Page(1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@text_test.go` at line 261, The test is discarding the error returned by
Document.Page; change all occurrences like p, _ := doc.Page(1) to capture the
error (p, err := doc.Page(1)) and immediately check it (fail/assert) so
lookup/out-of-range failures surface; update the instances in text_test.go
(lines around 261/283/314/331) and page_test.go (around line 159) to assert err
== nil (or t.Fatalf/t.Helper with the error) before using p.
Summary
Port pdfplumber's word + text extraction algorithms into Go. Three new methods on Page:
- Words(WordOpts) ([]Word, error)
- ExtractText(TextOpts) (string, error)
- ExtractTextSimple(xTolerance, yTolerance float64) (string, error)
Plus supporting machinery: BBox helpers (geometry.go), 1-D clustering primitives (clustering.go), and the WordExtractor algorithm (text.go).
API additions
Parity with pdfplumber
Matches exactly:
(KeepBlankChars, UseTextFlow, SplitAtPunctuation, Expand).
- dedupeChars semantics (text + position + extra_attrs equality).
Intentionally differs:
- Layout=true produces a structurally similar fixed-width grid but is not byte-equal to pdfplumber's extract_text(layout=True) (pdfplumber's layout output has its own version-to-version drift).
Not yet ported (documented as future work):
- extract_text_lines (regex-based line extraction).
- TextMap.search (regex over assembled page text with char-back-references).
- extra_attrs beyond fontname and size.
Tests
- geometry_test.go, clustering_test.go, text_test.go.
- hello.pdf, rules.pdf, simple1.pdf (in testdata/golden/). Regenerable via python scripts/gen_golden.py.
- go test ./... runs in ~4 seconds.
Test plan
- go build ./...
- go vet ./...
- go test -count=1 ./... clean (final two-line output: ok github.com/hallelx2/pdftable 0.803s / ok github.com/hallelx2/pdftable/internal/pdf 2.975s).
- go test -run TestGolden ./...).
Summary by Sourcery
Add word-level and text extraction APIs to pages and align their behaviour with pdfplumber, including clustering and geometry helpers plus golden parity tests.
New Features:
Enhancements:
Documentation:
Tests:
Summary by CodeRabbit
Release Notes
New Features
Words() extracts positioned text runs, ExtractText() provides layout-preserving text output, and ExtractTextSimple() offers a streamlined alternative.
BBox geometry type with spatial operations for bounding box calculations and containment checks.
Tests