bench: ndarray SIMD OCR 10x faster than tesseract preprocessing

Benchmark on real WikiLeaks PDF (KENOZA vs GIAT, 2481×3508 @ 300 DPI):

  ndarray SIMD preprocess:  477ms (57 Mpix/s)
  tesseract full pipeline:  4866ms (5.3 Mpix/s)
  Speedup:                  10.2x

Per-step breakdown:

  Otsu threshold:   21ms (histogram + optimal split)
  Binarize:         8ms (64 pixels/u64, bit-packed, 1.1 Gpix/s)
  Density:          0.15ms (popcount, instant)
  Skew detection:   102-174ms (bottleneck, 101-angle projection)
  Adaptive thresh:  80-91ms (integral image + local mean)

Optimal pipeline: ndarray preprocess → pipe to tesseract LSTM only.
Skipping tesseract's scalar C++ preprocessing saves ~2-3s/page.

ocr_benchmark.rs: loads raw grayscale pages, benchmarks both paths,
shows quality metrics (threshold, density, skew angle, word count).

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp

#85
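The first three steps of the breakdown (Otsu threshold, bit-packed binarize, popcount density) can be sketched in plain scalar Rust as below. This is a minimal illustration, not the code from ocr_benchmark.rs: the function names `otsu_threshold`, `binarize_packed`, `ink_density`, and `synthetic_page` are hypothetical, no SIMD or ndarray types are used, and the polarity (dark pixels are ink, stored as 1 bits) is an assumption.

```rust
/// Otsu threshold: build a 256-bin histogram, then pick the split that
/// maximizes between-class variance. Pixels <= t form the dark class.
fn otsu_threshold(pixels: &[u8]) -> u8 {
    let mut hist = [0u64; 256];
    for &p in pixels {
        hist[p as usize] += 1;
    }
    let total = pixels.len() as f64;
    let sum_all: f64 = hist
        .iter()
        .enumerate()
        .map(|(i, &c)| i as f64 * c as f64)
        .sum();

    let (mut w_dark, mut sum_dark) = (0.0f64, 0.0f64);
    let (mut best_t, mut best_var) = (0u8, 0.0f64);
    for t in 0..256usize {
        w_dark += hist[t] as f64;
        sum_dark += t as f64 * hist[t] as f64;
        if w_dark == 0.0 {
            continue; // no pixels in the dark class yet
        }
        let w_light = total - w_dark;
        if w_light == 0.0 {
            break; // every pixel is already in the dark class
        }
        let mean_dark = sum_dark / w_dark;
        let mean_light = (sum_all - sum_dark) / w_light;
        let between_var = w_dark * w_light * (mean_dark - mean_light).powi(2);
        // Strict comparison keeps the lowest threshold among ties.
        if between_var > best_var {
            best_var = between_var;
            best_t = t as u8;
        }
    }
    best_t
}

/// Binarize into a bit-packed buffer: 64 pixels per u64, bit i of a word
/// is 1 when the corresponding pixel is ink (assumed: pixel <= threshold).
fn binarize_packed(pixels: &[u8], thr: u8) -> Vec<u64> {
    pixels
        .chunks(64)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u64, |word, (i, &p)| word | (((p <= thr) as u64) << i))
        })
        .collect()
}

/// Ink density via popcount: count the set bits and divide by pixel count.
/// Padding bits in the last word are always 0, so they do not skew the sum.
fn ink_density(words: &[u64], n_pixels: usize) -> f64 {
    if n_pixels == 0 {
        return 0.0;
    }
    let ink: u64 = words.iter().map(|w| u64::from(w.count_ones())).sum();
    ink as f64 / n_pixels as f64
}

/// Hypothetical test input: 100 pixels, every fourth one dark (10),
/// the rest light (200).
fn synthetic_page() -> Vec<u8> {
    (0..100).map(|i| if i % 4 == 0 { 10 } else { 200 }).collect()
}
```

Packing 64 pixels into each `u64` is what makes the density step nearly free: one `count_ones` call (a single popcount instruction on most targets) covers 64 pixels at once.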

Merged

Apr 5, 2026

+595 −0
