AdaWorldAPI · 2026-04-05T09:15:02Z

No description provided.

ndarray/hpc/ocr_simd.rs: SIMD-accelerated image preprocessing for OCR:

- Otsu binarization: U8x64 histogram + optimal threshold
- Bit-packed BinaryImage: 64 pixels per u64 word
- Adaptive threshold: integral image + local mean (handles uneven lighting)
- Skew estimation: horizontal projection profile variance
- Foreground density: popcount for blank page detection
- Full preprocess_page() pipeline: binarize → skew → density check

For tesseract integration: preprocess with SIMD, then pipe binary image to tesseract LSTM (which only does character recognition, the fast part).

For our own OCR: binary image → connected components → Base17 fingerprint per character glyph → codebook lookup = O(1) character recognition.

10 tests: Otsu bimodal, binarize all-white/black/checkerboard, density, blank page, text page, skew detection, adaptive vs Otsu.

Data-flow: &[u8] slices (SIMD), owned BinaryImage (write-back), no &mut self.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
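The first, second, and fifth steps listed in this commit message (Otsu threshold, bit-packed binarization, popcount density) can be illustrated with a minimal scalar sketch. This is not the code in ocr_simd.rs: it uses plain loops instead of U8x64 lanes, and the struct layout, the function names `otsu_threshold`, `binarize`, and `foreground_density`, and the "dark pixel = set bit" convention are all assumptions made for illustration.

```rust
/// Hypothetical bit-packed image: one bit per pixel, 64 pixels per u64 word.
/// Rows are padded to a whole number of words. A set bit marks a dark pixel.
pub struct BinaryImage {
    pub width: usize,
    pub height: usize,
    pub words_per_row: usize,
    pub words: Vec<u64>,
}

/// Otsu's method: pick the threshold that maximizes between-class variance.
/// Pixels with value <= threshold count as the dark (foreground) class.
pub fn otsu_threshold(gray: &[u8]) -> u8 {
    let mut hist = [0u64; 256];
    for &p in gray {
        hist[p as usize] += 1;
    }
    let total = gray.len() as f64;
    let sum_all: f64 = hist
        .iter()
        .enumerate()
        .map(|(v, &c)| v as f64 * c as f64)
        .sum();

    let (mut w_lo, mut sum_lo) = (0.0f64, 0.0f64);
    let (mut best_t, mut best_var) = (0u8, -1.0f64);
    for t in 0..256usize {
        w_lo += hist[t] as f64;
        sum_lo += t as f64 * hist[t] as f64;
        if w_lo == 0.0 {
            continue; // no pixels in the dark class yet
        }
        let w_hi = total - w_lo;
        if w_hi == 0.0 {
            break; // every pixel is already in the dark class
        }
        let mean_lo = sum_lo / w_lo;
        let mean_hi = (sum_all - sum_lo) / w_hi;
        // Between-class variance, up to a constant factor of 1/total^2.
        let var = w_lo * w_hi * (mean_lo - mean_hi).powi(2);
        if var > best_var {
            best_var = var;
            best_t = t as u8;
        }
    }
    best_t
}

/// Pack a row-major grayscale buffer into bits, returning an owned image.
pub fn binarize(gray: &[u8], width: usize, height: usize, threshold: u8) -> BinaryImage {
    assert_eq!(gray.len(), width * height);
    let words_per_row = (width + 63) / 64;
    let mut words = vec![0u64; words_per_row * height];
    for y in 0..height {
        let row = &gray[y * width..(y + 1) * width];
        for (x, &p) in row.iter().enumerate() {
            if p <= threshold {
                words[y * words_per_row + x / 64] |= 1u64 << (x % 64);
            }
        }
    }
    BinaryImage { width, height, words_per_row, words }
}

/// Fraction of set bits, computed with one popcount per word.
/// Padding bits are never set, so they do not affect the count.
pub fn foreground_density(img: &BinaryImage) -> f64 {
    let pixels = img.width * img.height;
    if pixels == 0 {
        return 0.0;
    }
    let set: u64 = img.words.iter().map(|w| w.count_ones() as u64).sum();
    set as f64 / pixels as f64
}
```

A blank-page check in this sketch would compare `foreground_density` against a small cutoff; the cutoff used by preprocess_page() is not stated in the commit message.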

Benchmark on real Wikileaks PDF (KENOZA vs GIAT, 2481×3508 @ 300 DPI):

- ndarray SIMD preprocess: 477ms (57 Mpix/s)
- tesseract full pipeline: 4866ms (5.3 Mpix/s)
- Speedup: 10.2x

Per-step breakdown:

- Otsu threshold: 21ms (histogram + optimal split)
- Binarize: 8ms (64 pixels/u64, bit-packed, 1.1 Gpix/s)
- Density: 0.15ms (popcount, instant)
- Skew detection: 102-174ms (bottleneck, 101-angle projection)
- Adaptive thresh: 80-91ms (integral image + local mean)

Optimal pipeline: ndarray preprocess → pipe to tesseract LSTM only. Skipping tesseract's scalar C++ preprocessing saves ~2-3s/page.

ocr_benchmark.rs: loads raw grayscale pages, benchmarks both paths, shows quality metrics (threshold, density, skew angle, word count).

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
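Skew detection is the step this benchmark names as the bottleneck, so a rough sketch of a projection-profile search may help show where the time goes: every candidate angle needs a full pass over the foreground pixels. The sketch below is an assumption-laden illustration, not the merged implementation. It works on a plain `&[bool]` mask rather than the bit-packed image, approximates rotation with a vertical shear, and the names `profile_variance` and `estimate_skew_deg` as well as the ±5° search range are hypothetical. Only the count of 101 angles comes from the commit message.

```rust
/// Variance of the horizontal projection profile after shearing by `angle_rad`.
/// `pad` must be large enough to hold rows pushed outside 0..height.
fn profile_variance(
    fg: &[bool],
    width: usize,
    height: usize,
    angle_rad: f64,
    pad: usize,
) -> f64 {
    let slope = angle_rad.tan();
    let mut profile = vec![0u32; height + 2 * pad];
    for y in 0..height {
        for x in 0..width {
            if fg[y * width + x] {
                // Undo the assumed skew: move each pixel back onto its text row.
                let shifted = (y as f64 - x as f64 * slope).round() as isize;
                profile[(shifted + pad as isize) as usize] += 1;
            }
        }
    }
    let n = profile.len() as f64;
    let mean = profile.iter().map(|&c| c as f64).sum::<f64>() / n;
    profile.iter().map(|&c| (c as f64 - mean).powi(2)).sum::<f64>() / n
}

/// Try `steps` evenly spaced angles in [-max_deg, +max_deg] and return the one
/// whose profile has the highest variance (sharpest text rows).
/// Sign convention (an assumption): positive means lines descend to the right,
/// with y growing downward.
pub fn estimate_skew_deg(
    fg: &[bool],
    width: usize,
    height: usize,
    max_deg: f64,
    steps: usize,
) -> f64 {
    assert_eq!(fg.len(), width * height);
    assert!(steps >= 2);
    // One fixed padding for all angles keeps the variances comparable.
    let pad = (max_deg.to_radians().tan().abs() * width as f64).ceil() as usize + 1;
    let mut best = (0.0f64, f64::MIN);
    for i in 0..steps {
        let deg = -max_deg + 2.0 * max_deg * i as f64 / (steps - 1) as f64;
        let var = profile_variance(fg, width, height, deg.to_radians(), pad);
        if var > best.1 {
            best = (deg, var);
        }
    }
    best.0
}
```

Calling `estimate_skew_deg(&mask, width, height, 5.0, 101)` scans 0.1° steps. The cost is proportional to angles × pixels, which is consistent with this step dominating the per-step breakdown; a coarse-to-fine search would be one way to cut the number of angles, though the commit does not say whether that is planned.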

claude added 2 commits April 4, 2026 23:33

AdaWorldAPI merged commit d4fc733 into master Apr 5, 2026
4 of 10 checks passed

AdaWorldAPI commented Apr 5, 2026

2 participants
