Add README with speed comparison + cosine emulation docs (#93)
Merged
…orch

Comprehensive benchmark document (rustynum-style):
- GEMM: 10.5× over upstream at 1024×1024 (139 vs 13 GFLOPS)
- Codebook inference: 380K tok/s (AMX) down to 500 tok/s (Pi 4 NEON)
- SPO palette: 611M lookups/sec, 1.8ns latency, 388KB RAM
- f16 transcoding: 30MB (2× compression), 94M params/sec, 7.3e-6 max error

Feature comparison table: upstream (no SIMD) vs fork (55 HPC modules).
SIMD tier table: AMX → AVX-512 → AVX2 → NEON dotprod → NEON → Scalar.
ARM SBC support: Pi Zero 2W through Pi 5, Orange Pi 4/5 (big.LITTLE aware).
Precision toolkit: f16, Scaled-f16, Double-f16, Kahan summation.
Ecosystem links: lance-graph, home-automation-rs, ada-rs.

https://claude.ai/code/session_017ZN5PNEf8boFBgorUZVrFU
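The precision toolkit above lists Kahan summation. As a minimal illustration of why a compensated accumulator earns its place next to the f16 types (function names here are illustrative, not this fork's API):

```rust
/// Kahan (compensated) summation: carries a running error term `c`
/// so tiny addends are not lost once `sum` grows large.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32; // compensation for lost low-order bits
    for &x in xs {
        let y = x - c;     // re-inject previously lost bits
        let t = sum + y;   // big + small: low bits of y may round away
        c = (t - sum) - y; // recover exactly what was lost
        sum = t;
    }
    sum
}

fn main() {
    // 1.0 followed by a million 1e-8 addends: each is below half an
    // ulp of 1.0, so naive f32 summation stagnates at 1.0 while the
    // compensated sum tracks the exact answer of 1.01.
    let mut xs = vec![1.0f32];
    xs.extend(std::iter::repeat(1e-8f32).take(1_000_000));
    let naive: f32 = xs.iter().sum();
    println!("naive = {naive}, kahan = {}", kahan_sum(&xs));
}
```

The compensation costs three extra flops per element, which is why a SIMD-heavy library would offer it as an opt-in precision tier rather than the default reduction.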
… Rust tricks
Rewritten in rustynum README style for a senior Rust developer audience.
Performance data:
- GEMM: 139 GFLOPS (10.5× over upstream, matches NumPy OpenBLAS)
- Codebook: 380K tok/s (AMX) → 500 tok/s (Pi 4 NEON) per-tier breakdown
- SPO palette: 611M lookups/s, 1.8ns latency, 388KB working set
- f16 transcoding: 94M params/s, 7.3e-6 max error on 15M param model
- Cosine emulation: 611M/s via 256-step palette (0.4% error at 1/40σ)
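The f16 transcoding figures rest on carrying IEEE binary16 bits in a plain u16. A scalar round-to-nearest-even sketch of that carrier scheme, assuming no `half` crate and ignoring the hardware F16C/FCVTL fast path the fork would actually dispatch to:

```rust
/// IEEE binary16 carried in a plain u16 -- no nightly `f16` type needed.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let mant = bits & 0x007f_ffff;
    if exp == 0xff {
        // Inf/NaN: preserve NaN-ness via a quiet-bit payload
        return sign | 0x7c00 | if mant != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // re-bias: f32 (127) -> f16 (15)
    if e >= 0x1f {
        return sign | 0x7c00; // overflow -> signed infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // underflow -> signed zero
        }
        // Subnormal: shift in the implicit bit, then round (double
        // rounding on the shifted-out bits is ignored in this sketch)
        let m = (mant | 0x0080_0000) >> (1 - e);
        return sign | ((m + 0x0fff + ((m >> 13) & 1)) >> 13) as u16;
    }
    // Normal: round the 23-bit mantissa to 10 bits, nearest-even.
    // A mantissa carry spills into the exponent; the `+` makes that correct.
    let rounded = (mant + 0x0fff + ((mant >> 13) & 1)) >> 13;
    sign | (((e as u32) << 10) + rounded) as u16
}

fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let mant = (h & 0x03ff) as u32;
    if exp == 0x1f {
        return f32::from_bits(sign | 0x7f80_0000 | (mant << 13)); // Inf/NaN
    }
    if exp == 0 {
        if mant == 0 {
            return f32::from_bits(sign); // signed zero
        }
        // Subnormal: normalize by shifting the mantissa up
        let mut e: u32 = 113; // 127 - 15 + 1
        let mut m = mant;
        while m & 0x0400 == 0 {
            m <<= 1;
            e -= 1;
        }
        return f32::from_bits(sign | (e << 23) | ((m & 0x03ff) << 13));
    }
    f32::from_bits(sign | ((exp + 112) << 23) | (mant << 13))
}

fn main() {
    for v in [0.0f32, 1.0, -2.0, 0.333, 65504.0, 1e-5] {
        println!("{v} -> {}", f16_bits_to_f32(f32_to_f16_bits(v)));
    }
}
```

Because the round-trip is deterministic bit manipulation, the max-error number above is something a test suite can pin down exactly by sweeping a model's weights.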
Architecture sections:
- SIMD polyfill layer (F32x16 etc. on stable, LazyLock dispatch)
- Backend layer (Goto GEMM, MKL/OpenBLAS feature-gated)
- HPC module library (55 modules, 880 tests)
- Codec layer (Fingerprint, Base17, CAM-PQ, palette semiring)
- Burn integration (SIMD-augmented tensor ops)
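The backend layer's Goto GEMM is, at its core, loop blocking so panels of A and B stay cache-resident. A sketch of that blocking structure, without the packed panels and SIMD micro-kernels a real Goto kernel adds (block sizes and names are illustrative):

```rust
/// Naive reference: C += A * B, row-major; A is m×k, B is k×n, C is m×n.
fn gemm_naive(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
}

/// Cache-blocked variant in the spirit of Goto's algorithm: walk NC-wide
/// panels of B and MC×KC panels of A so each block's working set fits in
/// cache before moving on. Same arithmetic, different traversal order.
fn gemm_blocked(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    const MC: usize = 64; // rows of A per block
    const KC: usize = 64; // shared dimension per block
    const NC: usize = 64; // columns of B per block
    for jc in (0..n).step_by(NC) {
        for pc in (0..k).step_by(KC) {
            for ic in (0..m).step_by(MC) {
                for i in ic..(ic + MC).min(m) {
                    for p in pc..(pc + KC).min(k) {
                        let aip = a[i * k + p];
                        for j in jc..(jc + NC).min(n) {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, k, n) = (3, 4, 2);
    let a: Vec<f32> = (0..m * k).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..k * n).map(|i| (i % 5) as f32).collect();
    let mut c1 = vec![0.0; m * n];
    let mut c2 = vec![0.0; m * n];
    gemm_naive(&a, &b, &mut c1, m, k, n);
    gemm_blocked(&a, &b, &mut c2, m, k, n);
    assert_eq!(c1, c2);
    println!("{c1:?}");
}
```

The blocking alone buys cache locality; the large GFLOPS gap in the benchmarks comes from layering packing and vectorized micro-kernels on top of exactly this loop structure.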
7 "What We Build That Nobody Else Does":
1. Complete std::simd polyfill on stable
2. f16 types without nightly (u16 carrier + F16C/FCVTL)
3. AMX on stable via asm!(".byte") encoding
4. Tiered ARM NEON (A53/A72/A76 with microarch awareness)
5. Frozen dispatch (0.3ns function pointer, no branch)
6. BF16 RNE bit-exact with hardware VCVTNEPS2BF16
7. Cognitive codec stack (Fingerprint→Base17→CAM-PQ→Palette→bgz7)
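Item 5's frozen dispatch can be sketched on stable with `std::sync::LazyLock`: probe CPU features once, freeze a function pointer, and every later call is a single indirect jump with no per-call feature branch. The kernels below are stand-ins, not the fork's implementations:

```rust
use std::sync::LazyLock;

type DotFn = fn(&[f32], &[f32]) -> f32;

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Unrolled 4-accumulator variant standing in for a real SIMD kernel.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 4;
    let mut acc = [0.0f32; 4];
    for c in 0..chunks {
        for lane in 0..4 {
            let i = c * 4 + lane;
            acc[lane] += a[i] * b[i];
        }
    }
    let mut s: f32 = acc.iter().sum();
    for i in chunks * 4..n {
        s += a[i] * b[i];
    }
    s
}

/// Frozen dispatch: the closure runs exactly once; afterwards `*DOT` is
/// just a function pointer, so calls carry no detection overhead.
static DOT: LazyLock<DotFn> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return dot_unrolled; // stand-in for an AVX2 kernel
        }
    }
    dot_scalar // portable fallback tier
});

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [5.0f32, 4.0, 3.0, 2.0, 1.0];
    println!("dot = {}", (*DOT)(&a, &b)); // 35
}
```

The sub-nanosecond call figure in the list would be the cost of the indirect call itself; the one-time probe happens on first deref of `DOT`.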
Cosine emulation section explaining palette distance tables:
- 256×256 u8 table = 64KB (fits L1 cache)
- Foveal (1/40σ): 0.4% error, 611M/s
- Good (1/4σ): 2% error, 611M/s
- Near (1σ): 8% error, 2.4B/s (64-step)
- 12× faster than SIMD f32 dot product (no FP division/multiply)
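A minimal sketch of the palette-distance idea, assuming a linear u8 code over [-1, 1] (the fork's actual palette and semiring are more elaborate): precompute all 256×256 per-element products once, and the inner loop becomes pure u8 table lookups with no floating-point multiply. For cosine, normalize inputs before quantizing and the rescaled lookup sum approximates cosine similarity directly.

```rust
/// Decode a u8 palette code back to a value in [-1, 1].
fn code_to_val(c: u8) -> f32 {
    (c as f32) / 127.5 - 1.0
}

/// 256×256 table of quantized per-element products: 64KB, L1-resident.
/// Each entry stores (product + 1) * 127.5 rounded to u8.
fn build_table() -> Vec<u8> {
    let mut t = vec![0u8; 256 * 256];
    for a in 0..256 {
        for b in 0..256 {
            let p = code_to_val(a as u8) * code_to_val(b as u8); // in [-1, 1]
            t[a * 256 + b] = ((p + 1.0) * 127.5).round() as u8;
        }
    }
    t
}

/// Table-driven similarity: the hot loop is one L1 load per element;
/// the encoding is undone once, after accumulation.
fn table_dot(t: &[u8], xs: &[u8], ys: &[u8]) -> f32 {
    let mut acc: u32 = 0;
    for (&x, &y) in xs.iter().zip(ys) {
        acc += t[(x as usize) * 256 + y as usize] as u32;
    }
    // invert the per-element (+1, ×127.5) encoding
    (acc as f32) / 127.5 - xs.len() as f32
}

fn main() {
    let t = build_table();
    // code 255 -> +1.0, code 0 -> -1.0
    let sim = table_dot(&t, &[255, 0, 255], &[255, 0, 0]);
    println!("similarity = {sim}"); // true dot: 1 + 1 - 1 = 1
}
```

The error tiers above then fall out of table resolution: a full 256-step table bounds per-element quantization error near 1/255, while coarser 64-step tables trade accuracy for an even smaller, faster-to-index working set.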