doc-extract

Text extraction library for RAG pipelines, LLM apps, and AI agents — turn uploaded documents into clean plain text before chunking, embedding, or indexing.

Native addon for Node.js and Bun (Rust + N-API): fast, in-process, no subprocesses. PDF, Office, EPUB, calendars, contacts, Apple Wallet passes, and more.

Why doc-extract

Fast - Rust parsers run natively
Small footprint - ~6 MB native addon per platform
Non-blocking - extractText() returns a Promise; CPU work runs on a Rust thread pool
Backpressure - built-in concurrency limit (default 32) via setMaxConcurrent
Flexible limits - global, per-instance, or per-call file size in megabytes
Large files - file paths and ZIP-based formats read from disk; buffers above threshold auto-spill to temp
No subprocesses - PDF and Office handled in-process

Install

npm install @alexstep/doc-extract
# or
bun add @alexstep/doc-extract

Requires Node.js ≥ 18 or Bun ≥ 1.3.

Quick start

import docExtract from '@alexstep/doc-extract'

// Global defaults
docExtract.setMaxConcurrent(4)
docExtract.setMaxFilesizeMB(42)
docExtract.setInMemoryThresholdMB(64)

const text = await docExtract.extractText('./report.pdf')
const fromUrl = await docExtract.extractText('https://example.com/file.pdf')
const pass = await docExtract.parsePkPass('./ticket.pkpass')

// Isolated instance
const custom = new docExtract({ maxConcurrent: 4, maxFileSizeMB: 200 })

// Buffer: auto-detect by magic bytes (%PDF, ZIP/docx, etc.)
const buffer = await Bun.file('./report.pdf').bytes()
const doc = await custom.extractText(Buffer.from(buffer))

// Explicit format when magic is ambiguous (e.g. plain CSV bytes)
const csv = await custom.extractText(someBuffer, 'csv')

Auto-detect

Input	Hint source
File path	extension hint + magic bytes from file head
URL	extension from pathname + magic bytes
`Buffer`	magic bytes only (no filename)

Works without a second argument for PDF (%PDF), Office ZIP (docx/xlsx/pptx), ICS/VCF, JSON, HTML, and similar.

For buffers, pass format explicitly when the content has no clear signature (e.g. legacy .doc, ambiguous plain text):

await docExtract.extractText(buffer, 'docx')
await docExtract.extractText(buffer, { format: 'pdf', debug: true })
await docExtract.extractText(buffer, { unknown: 'reject' }) // strict: no text heuristic

Unknown content policy

When format cannot be determined confidently:

`unknown`	Behavior
`text-if-likely` (default)	Treat as `txt` only if bytes look like text (UTF-8/UTF-16/BOM, low control-byte ratio)
`reject`	Return `""` (unsupported)
`text-lossy`	Try `txt` unless bytes are obviously binary

Detection uses explicit format first, then combines extension hints and magic bytes. Magic bytes override conflicting extensions when possible (e.g. %PDF vs a .zip path, ICS content vs a .txt name). Unknown bytes fall back according to unknown policy.

Path-based text heuristics read up to 32 KB of file head; magic/ZIP sniffing uses the first 4 KB.

API

Method	Description
`docExtract.extractText(input, format?)`	Extract text. `input` = `Buffer`, file path, or `http(s)://` URL. Optional `format` or `{ format, unknown, maxFileSizeMB, inMemoryThresholdMB, tempDir }`.
`docExtract.parsePkPass(input, options?)`	Parse `.pkpass` → `{ pass, localization?, stripImage? }`.
`docExtract.setMaxConcurrent(n)`	Global parallel parse limit (`n === 0` or negative = no-op).
`docExtract.setMaxFilesizeMB(n)`	Global max input size in MB (default 42). `0` = unlimited.
`docExtract.setInMemoryThresholdMB(n)`	Above this size, paths/URLs skip JS heap; buffers spill to temp (default 64).
`docExtract.setMaxWorkingSetMB(n)`	Optional cap on total in-flight parse memory (`0` = disabled).
`docExtract.setTempDir(dir)`	Directory for auto-spill temp files.
`new docExtract({ maxConcurrent, maxFileSizeMB, inMemoryThresholdMB, tempDir, debug })`	Instance with its own limits and optional debug logging.

Environment: DOCEXTRACT_MAX_CONCURRENT, DOCEXTRACT_MAX_FILESIZE_MB, DOCEXTRACT_MAX_BYTES, DOCEXTRACT_IN_MEMORY_THRESHOLD_MB, DOCEXTRACT_MAX_WORKING_SET_MB, DOCEXTRACT_TMPDIR, DOCEXTRACT_DEBUG.

Error handling

extractText resolves to a string — no try/catch needed for content issues:

Situation	Result
Empty PDF (image scan), unsupported format, parse failure	`""`
File too large, missing file, URL fetch error	throws — use `.catch()`

const text = await docExtract.extractText('./scan.pdf') // "" if image-only PDF

await docExtract.extractText('https://example.com/huge.pdf').catch((err) => {
  // size limit, network, missing file
})

await docExtract.extractText('./scan.pdf', { debug: true })
// console.debug: PDF has no text layer (likely image-only scan)

parsePkPass still returns null for invalid passes (unchanged).

Large files

doc-extract is built for batch imports and server-side pipelines, not only small uploads.

Input	Behavior
File path	Rust opens the file directly; ZIP formats (docx, xlsx, pptx, epub, odt, pkpass) read entries from disk
URL	Streamed to a temp file, then parsed via path pipeline
Buffer ≤ threshold	Passed to native code in memory (default threshold 64 MB)
Buffer > threshold	Auto-written to temp, parsed from disk, temp removed

Policy vs memory:

maxFileSizeMB — reject files larger than N (0 = no limit)
inMemoryThresholdMB — when to avoid keeping full payload in the JS heap

Batch tip: for many large files (e.g. 20 × 200 MB), lower maxConcurrent (2–4) and optionally set setMaxWorkingSetMB to cap total in-flight memory. Peak RAM for ZIP formats is roughly concurrency × parser working set, not concurrency × file size.

pkpass

const text = await docExtract.extractText('./ticket.pkpass') // formatted text for AI
const json = await docExtract.parsePkPass('./ticket.pkpass') // structured pass.json

Supported formats

Group	Extensions
Office	`pdf`, `docx`, `docm`, `xlsx`, `xls`, `ods`, `pptx`, `pptm`, `odt`, `rtf`
Books	`epub`, `fb2`
Calendar / contacts	`ics`, `ifb`, `ical`, `vcf`, `vcard`
Data	`json`, `jsonl`, `ndjson`, `csv`, `tsv`
Web / text	`html`, `htm`, `xhtml`, `xml`, `txt`, `md`, `markdown`, `log`
Wallet	`pkpass` (auto-detected from ZIP + `pass.json`)

Limitations & alternatives

doc-extract targets in-process text extraction: text-layer PDF, Office Open XML, EPUB, calendars, and similar. No OCR, no subprocesses, no Docker sidecar.

Not supported (returns "" or needs another tool):

Image-only / scanned PDF (no text layer) — see OCR below
Legacy .doc (binary Word), .msg, PostScript
Images with text: PNG, JPEG, TIFF (needs Tesseract OCR)
Audio: mp3, wav

For those cases, a HTTP sidecar such as textract-docker is a practical fallback. It wraps Python textract with Tesseract OCR, antiword for .doc, and many other backends behind a simple REST API.

Typical integration pattern:

Try docExtract.extractText() first (fast, in-process).
If result is "" or format is unsupported — call textract-docker (or your existing docparser service).

Performance

doc-extract runs inside the Node/Bun process — no HTTP, base64 encoding, or Docker hop per request. That makes it a better fit for high-throughput paths (batch imports) where latency and concurrency matter.

textract-docker adds network and Python/subprocess overhead on each call, but covers OCR and legacy formats doc-extract deliberately skips.

Build from source

git clone https://github.com/alexstep/doc-extract.git
cd doc-extract
bun install
bun run build   # Rust ≥ 1.88
bun test

License

MIT

Stats


Native addon size	~6 MB per platform
Default max input	42 MB (`setMaxFilesizeMB`, `0` = unlimited)
In-memory threshold	64 MB (`setInMemoryThresholdMB`)
Zip entry cap	64 MB per entry — exceeds limit throws (not silent truncate)
Supported extensions	30+
Default concurrency	32 (`DOCEXTRACT_MAX_CONCURRENT`)
Runtime	Node.js ≥ 18, Bun ≥ 1.3

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
__tests__		__tests__
examples		examples
fixtures		fixtures
scripts		scripts
src		src
.gitignore		.gitignore
.npmignore		.npmignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
browser.js		browser.js
build.rs		build.rs
bun.lock		bun.lock
demo.html		demo.html
doc-extract.d.ts		doc-extract.d.ts
doc-extract.js		doc-extract.js
index.d.ts		index.d.ts
index.js		index.js
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

doc-extract

Why doc-extract

Install

Quick start

Auto-detect

Unknown content policy

API

Error handling

Large files

pkpass

Supported formats

Limitations & alternatives

Performance

Build from source

License

Stats

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

doc-extract

Why doc-extract

Install

Quick start

Auto-detect

Unknown content policy

API

Error handling

Large files

pkpass

Supported formats

Limitations & alternatives

Performance

Build from source

License

Stats

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages