Add README with speed comparison + cosine emulation docs (#93)
Merged
…orch

Comprehensive benchmark document (rustynum-style):
- GEMM: 10.5× over upstream at 1024×1024 (139 vs 13 GFLOPS)
- Codebook inference: 380K tok/s (AMX) down to 500 tok/s (Pi 4 NEON)
- SPO palette: 611M lookups/sec, 1.8ns latency, 388KB RAM
- f16 transcoding: 30MB (2× compression), 94M params/sec, 7.3e-6 max error

Feature comparison table: upstream (no SIMD) vs fork (55 HPC modules).
SIMD tier table: AMX → AVX-512 → AVX2 → NEON dotprod → NEON → Scalar.
ARM SBC support: Pi Zero 2W through Pi 5, Orange Pi 4/5 (big.LITTLE aware).
Precision toolkit: f16, Scaled-f16, Double-f16, Kahan summation.
Ecosystem links: lance-graph, home-automation-rs, ada-rs.

https://claude.ai/code/session_017ZN5PNEf8boFBgorUZVrFU
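The precision toolkit above lists Kahan summation. As a minimal illustration of why a compensated accumulator earns its place next to the f16 types (function names here are illustrative, not this fork's API):

```rust
/// Kahan (compensated) summation: carries a running error term `c`
/// so tiny addends are not lost once `sum` grows large.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32; // compensation for lost low-order bits
    for &x in xs {
        let y = x - c;     // re-inject previously lost bits
        let t = sum + y;   // big + small: low bits of y may round away
        c = (t - sum) - y; // recover exactly what was lost
        sum = t;
    }
    sum
}

fn main() {
    // 1.0 followed by a million 1e-8 addends: each is below half an
    // ulp of 1.0, so naive f32 summation stagnates at 1.0 while the
    // compensated sum tracks the exact answer of 1.01.
    let mut xs = vec![1.0f32];
    xs.extend(std::iter::repeat(1e-8f32).take(1_000_000));
    let naive: f32 = xs.iter().sum();
    println!("naive = {naive}, kahan = {}", kahan_sum(&xs));
}
```

The compensation costs three extra flops per element, which is why a SIMD-heavy library would offer it as an opt-in precision tier rather than the default reduction.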
… Rust tricks
Rewritten in rustynum README style for a senior Rust developer audience.
Performance data:
- GEMM: 139 GFLOPS (10.5× over upstream, matches NumPy OpenBLAS)
- Codebook: 380K tok/s (AMX) → 500 tok/s (Pi 4 NEON) per-tier breakdown
- SPO palette: 611M lookups/s, 1.8ns latency, 388KB working set
- f16 transcoding: 94M params/s, 7.3e-6 max error on 15M param model
- Cosine emulation: 611M/s via 256-step palette (0.4% error at 1/40σ)
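The f16 transcoding figures rest on carrying IEEE binary16 bits in a plain u16. A scalar round-to-nearest-even sketch of that carrier scheme, assuming no `half` crate and ignoring the hardware F16C/FCVTL fast path the fork would actually dispatch to:

```rust
/// IEEE binary16 carried in a plain u16 -- no nightly `f16` type needed.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let mant = bits & 0x007f_ffff;
    if exp == 0xff {
        // Inf/NaN: preserve NaN-ness via a quiet-bit payload
        return sign | 0x7c00 | if mant != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // re-bias: f32 (127) -> f16 (15)
    if e >= 0x1f {
        return sign | 0x7c00; // overflow -> signed infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // underflow -> signed zero
        }
        // Subnormal: shift in the implicit bit, then round (double
        // rounding on the shifted-out bits is ignored in this sketch)
        let m = (mant | 0x0080_0000) >> (1 - e);
        return sign | ((m + 0x0fff + ((m >> 13) & 1)) >> 13) as u16;
    }
    // Normal: round the 23-bit mantissa to 10 bits, nearest-even.
    // A mantissa carry spills into the exponent; the `+` makes that correct.
    let rounded = (mant + 0x0fff + ((mant >> 13) & 1)) >> 13;
    sign | (((e as u32) << 10) + rounded) as u16
}

fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let mant = (h & 0x03ff) as u32;
    if exp == 0x1f {
        return f32::from_bits(sign | 0x7f80_0000 | (mant << 13)); // Inf/NaN
    }
    if exp == 0 {
        if mant == 0 {
            return f32::from_bits(sign); // signed zero
        }
        // Subnormal: normalize by shifting the mantissa up
        let mut e: u32 = 113; // 127 - 15 + 1
        let mut m = mant;
        while m & 0x0400 == 0 {
            m <<= 1;
            e -= 1;
        }
        return f32::from_bits(sign | (e << 23) | ((m & 0x03ff) << 13));
    }
    f32::from_bits(sign | ((exp + 112) << 23) | (mant << 13))
}

fn main() {
    for v in [0.0f32, 1.0, -2.0, 0.333, 65504.0, 1e-5] {
        println!("{v} -> {}", f16_bits_to_f32(f32_to_f16_bits(v)));
    }
}
```

Because the round-trip is deterministic bit manipulation, the max-error number above is something a test suite can pin down exactly by sweeping a model's weights.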
Architecture sections:
- SIMD polyfill layer (F32x16 etc. on stable, LazyLock dispatch)
- Backend layer (Goto GEMM, MKL/OpenBLAS feature-gated)
- HPC module library (55 modules, 880 tests)
- Codec layer (Fingerprint, Base17, CAM-PQ, palette semiring)
- Burn integration (SIMD-augmented tensor ops)
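The backend layer's Goto GEMM is, at its core, loop blocking so panels of A and B stay cache-resident. A sketch of that blocking structure, without the packed panels and SIMD micro-kernels a real Goto kernel adds (block sizes and names are illustrative):

```rust
/// Naive reference: C += A * B, row-major; A is m×k, B is k×n, C is m×n.
fn gemm_naive(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
}

/// Cache-blocked variant in the spirit of Goto's algorithm: walk NC-wide
/// panels of B and MC×KC panels of A so each block's working set fits in
/// cache before moving on. Same arithmetic, different traversal order.
fn gemm_blocked(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    const MC: usize = 64; // rows of A per block
    const KC: usize = 64; // shared dimension per block
    const NC: usize = 64; // columns of B per block
    for jc in (0..n).step_by(NC) {
        for pc in (0..k).step_by(KC) {
            for ic in (0..m).step_by(MC) {
                for i in ic..(ic + MC).min(m) {
                    for p in pc..(pc + KC).min(k) {
                        let aip = a[i * k + p];
                        for j in jc..(jc + NC).min(n) {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, k, n) = (3, 4, 2);
    let a: Vec<f32> = (0..m * k).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..k * n).map(|i| (i % 5) as f32).collect();
    let mut c1 = vec![0.0; m * n];
    let mut c2 = vec![0.0; m * n];
    gemm_naive(&a, &b, &mut c1, m, k, n);
    gemm_blocked(&a, &b, &mut c2, m, k, n);
    assert_eq!(c1, c2);
    println!("{c1:?}");
}
```

The blocking alone buys cache locality; the large GFLOPS gap in the benchmarks comes from layering packing and vectorized micro-kernels on top of exactly this loop structure.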
7 "What We Build That Nobody Else Does":
1. Complete std::simd polyfill on stable
2. f16 types without nightly (u16 carrier + F16C/FCVTL)
3. AMX on stable via asm!(".byte") encoding
4. Tiered ARM NEON (A53/A72/A76 with microarch awareness)
5. Frozen dispatch (0.3ns function pointer, no branch)
6. BF16 RNE bit-exact with hardware VCVTNEPS2BF16
7. Cognitive codec stack (Fingerprint→Base17→CAM-PQ→Palette→bgz7)
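Item 5's frozen dispatch can be sketched on stable with `std::sync::LazyLock`: probe CPU features once, freeze a function pointer, and every later call is a single indirect jump with no per-call feature branch. The kernels below are stand-ins, not the fork's implementations:

```rust
use std::sync::LazyLock;

type DotFn = fn(&[f32], &[f32]) -> f32;

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Unrolled 4-accumulator variant standing in for a real SIMD kernel.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 4;
    let mut acc = [0.0f32; 4];
    for c in 0..chunks {
        for lane in 0..4 {
            let i = c * 4 + lane;
            acc[lane] += a[i] * b[i];
        }
    }
    let mut s: f32 = acc.iter().sum();
    for i in chunks * 4..n {
        s += a[i] * b[i];
    }
    s
}

/// Frozen dispatch: the closure runs exactly once; afterwards `*DOT` is
/// just a function pointer, so calls carry no detection overhead.
static DOT: LazyLock<DotFn> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return dot_unrolled; // stand-in for an AVX2 kernel
        }
    }
    dot_scalar // portable fallback tier
});

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [5.0f32, 4.0, 3.0, 2.0, 1.0];
    println!("dot = {}", (*DOT)(&a, &b)); // 35
}
```

The sub-nanosecond call figure in the list would be the cost of the indirect call itself; the one-time probe happens on first deref of `DOT`.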
Cosine emulation section explaining palette distance tables:
- 256×256 u8 table = 64KB (fits L1 cache)
- Foveal (1/40σ): 0.4% error, 611M/s
- Good (1/4σ): 2% error, 611M/s
- Near (1σ): 8% error, 2.4B/s (64-step)
- 12× faster than SIMD f32 dot product (no FP division/multiply)
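A minimal sketch of the palette-distance idea, assuming a linear u8 code over [-1, 1] (the fork's actual palette and semiring are more elaborate): precompute all 256×256 per-element products once, and the inner loop becomes pure u8 table lookups with no floating-point multiply. For cosine, normalize inputs before quantizing and the rescaled lookup sum approximates cosine similarity directly.

```rust
/// Decode a u8 palette code back to a value in [-1, 1].
fn code_to_val(c: u8) -> f32 {
    (c as f32) / 127.5 - 1.0
}

/// 256×256 table of quantized per-element products: 64KB, L1-resident.
/// Each entry stores (product + 1) * 127.5 rounded to u8.
fn build_table() -> Vec<u8> {
    let mut t = vec![0u8; 256 * 256];
    for a in 0..256 {
        for b in 0..256 {
            let p = code_to_val(a as u8) * code_to_val(b as u8); // in [-1, 1]
            t[a * 256 + b] = ((p + 1.0) * 127.5).round() as u8;
        }
    }
    t
}

/// Table-driven similarity: the hot loop is one L1 load per element;
/// the encoding is undone once, after accumulation.
fn table_dot(t: &[u8], xs: &[u8], ys: &[u8]) -> f32 {
    let mut acc: u32 = 0;
    for (&x, &y) in xs.iter().zip(ys) {
        acc += t[(x as usize) * 256 + y as usize] as u32;
    }
    // invert the per-element (+1, ×127.5) encoding
    (acc as f32) / 127.5 - xs.len() as f32
}

fn main() {
    let t = build_table();
    // code 255 -> +1.0, code 0 -> -1.0
    let sim = table_dot(&t, &[255, 0, 255], &[255, 0, 0]);
    println!("similarity = {sim}"); // true dot: 1 + 1 - 1 = 1
}
```

The error tiers above then fall out of table resolution: a full 256-step table bounds per-element quantization error near 1/255, while coarser 64-step tables trade accuracy for an even smaller, faster-to-index working set.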