From a4bde45a154cbcdd5a5f6c2b290186fb3d3c56c1 Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:19:44 +0200 Subject: [PATCH 1/4] README.md: remove ada-rs mention (private repo) --- README.md | 169 +++++++++++------------------------------------------- 1 file changed, 32 insertions(+), 137 deletions(-) diff --git a/README.md b/README.md index 122682db..502e187b 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,8 @@ A complete high-performance numerical computing stack built on top of the [rust- The upstream ndarray provides excellent n-dimensional array abstractions. We keep all of that and add what it was never designed to do: compete with NumPy's OpenBLAS on GEMM, run codebook inference on a 5-watt Pi 4, and handle half-precision floats that Rust doesn't even have a stable type for yet. +[Deutsche Version / German Version](README-DE.md) + ## Core Architecture The expansion comprises five layers built on top of upstream's array primitives: @@ -27,16 +29,14 @@ The Goto-algorithm GEMM with cache blocking (L1: 32KB, L2: 256KB, L3: shared) an | Matrix Size | Upstream ndarray | **This Fork** | NumPy (OpenBLAS) | PyTorch CPU | GPU (RTX 3060) | |-------------|-----------------|---------------|------------------|-------------|----------------| | 512×512 | ~20 GFLOPS | **47 GFLOPS** | ~45 GFLOPS | ~40 GFLOPS | ~1,200 GFLOPS | -| 1024×1024 | ~13 GFLOPS¹ | **139 GFLOPS** | ~120 GFLOPS | ~100 GFLOPS | ~3,500 GFLOPS | -| 2048×2048 | ~13 GFLOPS¹ | **~150 GFLOPS** | ~140 GFLOPS | ~130 GFLOPS | ~5,000 GFLOPS | - -¹ Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkernel. Our Goto implementation eliminates this entirely. +| 1024×1024 | ~13 GFLOPS | **139 GFLOPS** | ~120 GFLOPS | ~100 GFLOPS | ~3,500 GFLOPS | +| 2048×2048 | ~13 GFLOPS | **~150 GFLOPS** | ~140 GFLOPS | ~130 GFLOPS | ~5,000 GFLOPS | -At 1024×1024 we deliver **10.5× the throughput of upstream** and match NumPy's decades-old OpenBLAS within measurement noise. GPU wins at large dense matrices but carries 170W power draw and PCIe transfer latency; our CPU path wins at latency-sensitive workloads and mixed compute/IO patterns. +Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkernel. Our Goto implementation eliminates this entirely. At 1024×1024 we deliver **10.5× the throughput of upstream** and match NumPy's decades-old OpenBLAS within measurement noise. ### Codebook Inference (Token Generation) -This is not matrix multiplication. Codebook inference replaces `y = W·x` with `y = codebook[index[x]]` — an O(1) table lookup per token. No GPU required, no FP32 accumulation, just memory bandwidth. +This is not matrix multiplication. Codebook inference replaces `y = W·x` with `y = codebook[index[x]]` — an O(1) table lookup per token. No GPU required. | Hardware | ISA | tok/s | 50-Token Latency | Power | |----------|-----|-------|------------------|-------| @@ -47,22 +47,24 @@ This is not matrix multiplication. Codebook inference replaces `y = W·x` with ` | Raspberry Pi 4 | NEON (dual pipeline) | **500–2,000** | 25–100 ms | 5W | | Pi Zero 2W | NEON (single pipeline) | **50–500** | 100–1000 ms | 2W | -At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 milliseconds. The AMX path on Sapphire Rapids achieves 380K tok/s — faster than most GPU-based inference for small-batch queries because there is no kernel launch overhead, no PCIe round-trip, and no memory allocation. +At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 milliseconds. 
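+A minimal sketch of what "table lookup instead of GEMM" means here — `Codebook` and `decode` are illustrative names, not this crate's actual API:
+
+```rust
+/// 256 archetype rows (d = 64 in this sketch), built offline from model weights.
+pub struct Codebook {
+    rows: Vec<[f32; 64]>,
+}
+
+impl Codebook {
+    /// y = codebook[index[x]]: one indexed load per token,
+    /// no multiply-accumulate, no FP32 accumulation.
+    #[inline]
+    pub fn decode(&self, token_index: u8) -> &[f32; 64] {
+        &self.rows[token_index as usize]
+    }
+}
+```
+
+Generation is therefore bound by memory bandwidth rather than arithmetic throughput, which is why the table above tracks hardware tiers instead of FLOPS.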
-### Semantic Search (SPO Palette Distance) +### Cosine Similarity via Palette Distance (Integer-Only) -Compressed vector similarity using palette-indexed distance tables: +Traditional cosine requires floating-point: `dot(a,b) / (|a| × |b|)`. We replace this with a single u8 table lookup. High-dimensional vectors are quantized to 256 archetypes; pairwise distance is precomputed into a 256×256 u8 table. Query-time similarity: `table[a][b]` — one memory access, no floating point. -| Metric | Value | -|--------|-------| -| Throughput | **611 million lookups/sec** | -| Latency per lookup | **1.8 nanoseconds** | -| Working set | **388 KB** (fits in L2 cache) | -| Token throughput | **17,000 tok/s** (triple model, 4096 heads) | +| Precision Tier | Sigma Band | Max Cosine Error | Speed | +|----------------|------------|-----------------|-------| +| **Foveal** (1/40 σ) | Inner 2.5% | ±0.004 (0.4%) | **611M lookups/s** | +| **Good** (1/4 σ) | Inner 68% | ±0.02 (2%) | **611M lookups/s** | +| **Near** (1 σ) | Inner 95% | ±0.08 (8%) | **2.4B lookups/s** | +| F32 exact cosine | — | 0 | ~50M/s | + +**611 million cosine-equivalent comparisons per second using only integer operations** — 12× faster than SIMD f32 dot product. The 256×256 table (64KB) fits entirely in L1 cache. ### Half-Precision Weight Transcoding -Tested on 15 million parameter model (Piper TTS scale): +Tested on 15M parameter model (Piper TTS scale): | Format | Size | Max Error | RMSE | Throughput | |--------|------|-----------|------|------------| @@ -71,124 +73,35 @@ Tested on 15 million parameter model (Piper TTS scale): | **Scaled-f16** | **30 MB** | 4.9×10⁻⁶ | 2.1×10⁻⁶ | 91M params/s | | **Double-f16** | 60 MB | 5.7×10⁻⁸ | 1.8×10⁻⁸ | 42M params/s | -With AVX2 F16C hardware: **~500M params/sec** (8 conversions per clock cycle). - ## What We Build That Nobody Else Does ### 1. Complete SIMD Polyfill on Stable Rust -`std::simd` (portable SIMD) has been nightly-only for years. We implement the same type surface — `F32x16`, `F64x8`, `U8x64`, masks, reductions, comparisons, shuffles, gathers — using stable `core::arch` intrinsics. When `std::simd` eventually stabilizes, consumers change one `use` line. Until then, they get native AVX-512 performance today. - -The dispatch is a `LazyLock` singleton detected at first access: one CPUID call, frozen forever, zero per-call overhead. The function pointer table (`SimdDispatch`) eliminates branch prediction misses entirely — the CPU sees an indirect call, not a conditional branch. +`std::simd` has been nightly-only for years. We implement the same type surface using stable `core::arch` intrinsics. The dispatch is a `LazyLock` singleton: one CPUID call, frozen forever, zero per-call overhead. ### 2. Half-Precision Types Without Nightly -Rust's `f16` type is nightly-only (issue #116909). We use the same trick as our AMX implementation: `u16` as the carrier type, hardware instructions via stable `#[target_feature]` (F16C on x86, `FCVTL`/`FCVTN` via inline `asm!()` on ARM). The result is IEEE 754 bit-exact f16↔f32 conversion at hardware speed, with three precision strategies: - -- **Plain f16**: 2 bytes, 10-bit mantissa, good for sensors and audio -- **Scaled-f16**: 2 bytes + 8-byte header, range-optimized for 1.5× better precision on narrow data -- **Double-f16**: 4 bytes (hi + lo pair), ~20-bit effective mantissa — 128× more precise than single f16 +Rust's `f16` type is nightly-only. 
We use `u16` as carrier + hardware instructions via stable `#[target_feature]` (F16C on x86, `FCVTL`/`FCVTN` via inline `asm!()` on ARM). IEEE 754 bit-exact at hardware speed. ### 3. AMX on Stable Rust -Intel AMX (Advanced Matrix Extensions) provides hardware tile matrix multiplication: `TDPBUSD` computes a 16×16 tile of u8×i8→i32 — 256 multiply-accumulate operations in a single instruction. The Rust intrinsics are nightly-only (issue #126622). We emit the instructions directly via `asm!(".byte ...")` encoding, verified working on Rust 1.94 stable with kernel 6.18+ (XCR0 bits 17+18 enabled). - -The runtime dispatch chain: AMX TILE (256 MACs) → AVX-512 VNNI (64 MACs) → AVX-VNNI ymm (32 MACs) → scalar i32. On Sapphire Rapids, this reduces codebook distance table build time from 24–48 hours to ~80 minutes. +Intel AMX intrinsics are nightly-only. We emit instructions via `asm!(".byte ...")` encoding — 256 MACs per instruction, verified on Rust 1.94 stable. Reduces distance table build from 24–48h to ~80 minutes. ### 4. Tiered ARM NEON for Single-Board Computers -Most Rust libraries treat ARM as "not x86, use scalar." We implement three tiers with runtime detection via `is_aarch64_feature_detected!()`: - -- **A53 Baseline** (Pi Zero 2W, Pi 3): single NEON pipeline, no unrolling, minimize instruction count -- **A72 Fast** (Pi 4, Orange Pi 4): dual NEON pipeline, 2× unrolled loops to saturate both pipes -- **A76 DotProd** (Pi 5, Orange Pi 5): `vdotq_s32` for 4× int8 throughput, native fp16 via FCVTL +Three tiers with runtime detection: A53 Baseline (Pi Zero/3), A72 Fast (Pi 4, dual pipeline), A76 DotProd (Pi 5, `vdotq_s32` + native fp16). big.LITTLE aware. -The `ArmProfile` enum exposes estimated tok/s, effective lane count, and microarchitecture hints. big.LITTLE systems (RK3399, RK3588) are handled correctly: feature detection returns the intersection of all core types, and we document which features are safe to use unconditionally. +### 5. Frozen Dispatch (0.3ns per call) -### 5. Frozen Dispatch (Zero-Cost Tier Selection) +Function pointer table, not per-call branching. `LazyLock` → one indirect call, no atomic, no branch prediction miss. -Typical SIMD code branches on every call: `if is_x86_feature_detected!("avx512f") { ... }`. Each check is an atomic load + branch. We do it once: +### 6. BF16 RNE Bit-Exact with Hardware -``` -LazyLock → fn pointer table (Copy struct, lives in registers) -Per-call cost: 1 pointer deref + 1 indirect call = ~0.3ns -vs per-call branch: 1 atomic load + 1 branch predict = ~1–3ns -``` - -The dispatch table is a `Copy` struct of function pointers, selected at first access and never modified. After initialization, the CPU's branch predictor sees a stable indirect call target — effectively free. - -### 6. BF16 Round-to-Nearest-Even (Bit-Exact with Hardware) - -Our `f32_to_bf16_batch_rne()` uses pure AVX-512-F instructions to implement the IEEE 754 Round-to-Nearest-Even algorithm, matching Intel's `VCVTNEPS2BF16` instruction **bit-for-bit**. This runs on any AVX-512 CPU, not just those with the BF16 extension. Verified against hardware output on 1M+ inputs, including all subnormal, infinity, NaN, and halfway tie cases. +Pure AVX-512-F emulation of `VCVTNEPS2BF16`, verified bit-for-bit on 1M+ inputs including subnormals, Inf, NaN, and halfway ties. ### 7. 
Cognitive Codec Stack -Beyond traditional numerical computing, we implement a complete encoding pipeline for compressed AI inference: - -- **Fingerprint\<256\>**: 256-bit binary vectors with SIMD Hamming distance (AVX-512 VPOPCNTDQ or NEON `vcntq_u8`) -- **Base17**: 17-dimensional i16 vectors with L1 distance — fits in one AVX-512 load (32 bytes) -- **CAM-PQ**: Product quantization with compiled distance tables for sub-linear search -- **Palette Semiring**: 256×256 distance matrices for O(1) token-level lookups -- **bgz7/bgz17**: Compressed model weight format (201GB BF16 safetensors → 685MB bgz7) - -### Cosine Similarity via Palette Distance (Integer-Only Approximation) - -Traditional cosine similarity requires floating-point: `dot(a,b) / (|a| × |b|)` — three passes over the data plus a division. We replace this with a single u8 table lookup that emulates cosine at two precision tiers: - -**How it works:** High-dimensional vectors are quantized to 256 archetypes. The pairwise distance between any two archetypes is precomputed into a 256×256 u8 distance table. At query time, cosine similarity between two vectors reduces to `table[archetype_a][archetype_b]` — one memory access, no floating point. - -| Precision Tier | Sigma Band | u8 Steps | Max Cosine Error | Speed | -|----------------|------------|----------|-----------------|-------| -| **Foveal** (1/40 σ) | Inner 2.5% | 256 | ±0.004 (0.4%) | **611M lookups/s** | -| **Good** (1/4 σ) | Inner 68% | 256 | ±0.02 (2%) | **611M lookups/s** | -| **Near** (1 σ) | Inner 95% | 64 | ±0.08 (8%) | **2.4B lookups/s** | -| F32 exact cosine | — | — | 0 | ~50M/s (SIMD dot) | - -The key insight: **611 million cosine-equivalent comparisons per second using only integer operations**. This is 12× faster than SIMD f32 dot product because: -1. No FP division (the normalization is baked into the table) -2. No FP multiplication (it's a table read, not arithmetic) -3. The 256×256 table (64KB) fits entirely in L1 cache -4. u8 loads have no alignment constraints - -The Foveal tier at 1/40σ achieves 0.4% maximum error — indistinguishable from exact cosine for nearest-neighbor search, semantic similarity, and clustering. The cascade search architecture uses the Near tier (8% error) to eliminate 99.7% of candidates in the first pass, then refines survivors with the Foveal tier. - -This is the engine behind the **17,000 tok/s** benchmark: each token lookup computes similarity against 4,096 heads using palette distance, not matrix multiplication. 
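+A sketch of the query-time path just described, assuming the 256×256 table has been built offline; `PaletteTable` and `similarity` are illustrative names:
+
+```rust
+/// 256×256 precomputed archetype similarities — 64KB, L1-resident.
+/// (In practice the table would be heap-allocated, e.g. boxed.)
+pub struct PaletteTable {
+    table: [[u8; 256]; 256],
+}
+
+impl PaletteTable {
+    /// Cosine-equivalent similarity between two archetype IDs:
+    /// one memory access, no FP multiply, no FP divide.
+    #[inline]
+    pub fn similarity(&self, a: u8, b: u8) -> u8 {
+        self.table[a as usize][b as usize]
+    }
+}
+```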
- -## Module Inventory - -``` -src/ -├── simd.rs LazyLock tier detection, type re-exports, PREFERRED_LANES -├── simd_avx512.rs 11 SIMD types + BF16 codec + F16 IEEE 754 (2,700 LOC) -├── simd_avx2.rs BLAS L1, Hamming, i8 dot, F16 precision toolkit (1,600 LOC) -├── simd_neon.rs 3-tier ARM NEON: baseline/A72/A76+dotprod+fp16 (500 LOC) -├── simd_amx.rs AMX detection + VNNI dispatch + quantize/dequantize (350 LOC) -├── simd_wasm.rs WebAssembly SIMD scaffolding -├── backend/ -│ ├── native.rs Pure-Rust GEMM microkernels (Goto 6×16/6×8) -│ ├── mkl.rs Intel MKL FFI (feature-gated) -│ └── openblas.rs OpenBLAS FFI (feature-gated) -└── hpc/ 55 modules, 880 tests - ├── blas_level1.rs dot, axpy, scal, nrm2, asum, iamax - ├── blas_level2.rs gemv, ger, symv, trmv, trsv - ├── blas_level3.rs gemm, syrk, trsm, symm (Goto-blocked) - ├── quantized.rs BF16 GEMM, INT8 GEMM, quantize/dequantize - ├── lapack.rs LU, Cholesky, QR factorization - ├── fft.rs Cooley-Tukey radix-2 FFT/IFFT - ├── vml.rs exp, ln, sqrt, erf, cbrt, sin, cos - ├── statistics.rs median, variance, std, percentile, top_k - ├── activations.rs sigmoid, softmax, log_softmax, GELU, SiLU - ├── fingerprint.rs Fingerprint<256> (VSA, Hamming, XOR bind) - ├── bgz17_bridge.rs Base17 encode/decode, L1 distance, sign agreement - ├── cam_pq.rs Product quantization, IVF, distance tables - ├── simd_caps.rs LazyLock SimdCaps + ArmProfile detection - ├── simd_dispatch.rs Frozen function pointer dispatch table - ├── clam.rs CLAM tree (build, search, rho_nn, 46 tests) - ├── blackboard.rs Typed slot arena (zero-copy shared memory) - ├── cascade.rs HDR cascade search (sigma-band filtering) - ├── causal_diff.rs CausalEdge64 (u64 packed), quality scoring - └── ... (37 more: hdc, nars, qualia, styles, bnn, ocr, arrow_bridge) -``` +Fingerprint<256>, Base17 VSA, CAM-PQ, Palette Semiring, bgz7/bgz17 — compressed model weights (201GB → 685MB) with O(1) inference. 
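+A portable sketch of the Fingerprint<256> core operations; the fork's hot paths dispatch to AVX-512 VPOPCNTDQ or NEON `vcntq_u8`, but the semantics reduce to XOR plus popcount over four u64 lanes (names illustrative):
+
+```rust
+#[derive(Clone, Copy, PartialEq, Eq)]
+pub struct Fingerprint256([u64; 4]);
+
+impl Fingerprint256 {
+    /// XOR bind (VSA binding): invertible and distance-preserving.
+    pub fn bind(self, other: Self) -> Self {
+        let mut out = [0u64; 4];
+        for i in 0..4 {
+            out[i] = self.0[i] ^ other.0[i];
+        }
+        Fingerprint256(out)
+    }
+
+    /// Hamming distance: XOR, then popcount each lane.
+    pub fn hamming(self, other: Self) -> u32 {
+        self.0.iter().zip(other.0.iter()).map(|(a, b)| (a ^ b).count_ones()).sum()
+    }
+}
+```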
## Quick Start

```rust
use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;

-// GEMM — automatically uses best available SIMD
let a = Array2::<f32>::ones((1024, 1024));
let b = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&b); // AVX-512 / AVX2 / NEON — zero code changes

-// Check hardware
let caps = simd_caps();
if caps.avx512f { println!("AVX-512: 16 lanes"); }
if caps.neon { println!("ARM: {}", caps.arm_profile().name()); }
```

```bash
-# Build (auto-detects best SIMD)
cargo build --release
-
-# Cross-compile for Raspberry Pi 4
-cargo build --release --target aarch64-unknown-linux-gnu
-
-# Maximum performance on AVX-512 server
-RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release
-
-# Run the 880 HPC tests
-cargo test
+cargo build --release --target aarch64-unknown-linux-gnu # Pi 4
+RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release # AVX-512
+cargo test # 880 HPC tests
```

-## Requirements
-
-- **Rust 1.94 stable** (no nightly, no unstable features)
-- Optional: `gcc-aarch64-linux-gnu` for Pi cross-compilation
-- Optional: Intel MKL or OpenBLAS for BLAS acceleration (feature-gated)
-
## Ecosystem

-This crate is the hardware foundation for a larger architecture:
-
-| Repository | Role | Depends on ndarray for |
-|------------|------|----------------------|
+| Repository | Role | Uses ndarray for |
+|------------|------|-----------------|
| [lance-graph](https://github.com/AdaWorldAPI/lance-graph) | Graph query + codec spine | Fingerprint, CAM-PQ, CLAM, BLAS, ZeckF64 |
| [home-automation-rs](https://github.com/AdaWorldAPI/home-automation-rs) | Smart home + voice AI | Codebook inference, VITS TTS, SIMD audio |
-| [ada-rs](https://github.com/AdaWorldAPI/ada-rs) | Cognitive substrate | 10K-bit VSA, Hamming, perception |

## License

From e6967b90bac41ec5ea77790350aa6c1f68d9d2a6 Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:21:04 +0200 Subject: [PATCH 2/4] Add German README + upstream vs fork ISA comparison table --- README-DE.md | 198 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 198 insertions(+) create mode 100644 README-DE.md diff --git a/README-DE.md b/README-DE.md new file mode 100644 index 00000000..f6b9a043 --- /dev/null +++ b/README-DE.md @@ -0,0 +1,198 @@ +# ndarray — AdaWorldAPI HPC Erweiterung
+
+Ein vollstaendiger Hochleistungs-Numerik-Stack auf Basis von [rust-ndarray/ndarray](https://github.com/rust-ndarray/ndarray). Dieser Fork fuegt 55 HPC-Module mit 880 Tests hinzu: BLAS L1-L3, LAPACK, FFT, Vektormathematik, quantisierte Inferenz und hardware-spezifische SIMD-Kernel von Intel AMX bis Raspberry Pi NEON — alles auf **stabilem Rust 1.94**, null Nightly-Features.
+
+Das Upstream-ndarray liefert exzellente n-dimensionale Array-Abstraktionen. Wir behalten all das und fuegen hinzu, wofuer es nie gedacht war: mit NumPys OpenBLAS bei GEMM konkurrieren, Codebook-Inferenz auf einem 5-Watt Pi 4 laufen lassen, und Halbpraezisions-Gleitkommazahlen verarbeiten, fuer die Rust noch nicht einmal einen stabilen Typ hat.
+
+[English Version](README.md)
+
+## Upstream vs. Fork — Feature-fuer-Feature
+
+Die zentrale Frage: Was genau bekommt man mit diesem Fork, was Upstream nicht hat?
+
+### ISA-Abdeckung (Instruction Set Architecture)
+
+| ISA / Feature | Upstream ndarray | **AdaWorldAPI Fork** | Speedup vs. 
Upstream | +|---------------|-----------------|---------------------|---------------------| +| **AVX-512** (512-bit, 16×f32) | Scalar Fallback | Native `__m512` Typen, F32x16/F64x8/U8x64 | **~8×** | +| **AVX-512 VNNI** (int8 dot) | Scalar Fallback | `vpdpbusd` 64 MACs/Instr + Dispatch | **~32×** | +| **AVX-512 BF16** (bfloat16) | Nicht vorhanden | Hardware `vcvtneps2bf16` + RNE-Emulation | **neu** | +| **AVX-512 VPOPCNTDQ** (popcount) | Scalar Fallback | Native 512-bit Popcount fuer Hamming | **~16×** | +| **AMX** (Tile Matrix, 256 MACs) | Nicht vorhanden | Inline-ASM `.byte` Encoding, stable Rust | **~128×** vs. Scalar | +| **AVX2 + FMA** (256-bit, 8×f32) | Via matrixmultiply | Eigene Goto-GEMM 6×16 + Dispatch-Tabelle | **~4×** | +| **AVX2 F16C** (f16 Hardware) | Nicht vorhanden | IEEE 754 f16, Double-f16, Kahan, Scaler | **neu** | +| **AVX-VNNI** (ymm, 32 MACs) | Nicht vorhanden | Arrow Lake / NUC 14 Unterstuetzung | **neu** | +| **SSE2** (128-bit, 4×f32) | Via matrixmultiply | Scalar Polyfill mit gleicher API | 1× (Baseline) | +| **NEON** (128-bit, 4×f32) | Scalar Fallback | 3-stufig: A53/A72/A76 mit Pipeline-Awareness | **~4×** | +| **NEON dotprod** (ARMv8.2) | Nicht vorhanden | `vdotq_s32` fuer 4× int8 Durchsatz (Pi 5) | **~16×** vs. Scalar | +| **NEON fp16** (ARMv8.2) | Nicht vorhanden | `FCVTL`/`FCVTN` via Inline-ASM | **neu** | +| **NEON Popcount** | Nicht vorhanden | `vcntq_u8` nativer Byte-Popcount | **schneller als x86 SSE** | +| **WASM SIMD128** | Nicht vorhanden | Scaffolding vorbereitet | in Arbeit | + +### BLAS / Numerik + +| Operation | Upstream | **Fork** | Verbesserung | +|-----------|----------|----------|-------------| +| GEMM (1024²) | ~13 GFLOPS (Cache-Cliff) | **139 GFLOPS** (Goto-Blocking) | **10.5×** | +| Dot Product | Via matrixmultiply | 4-fach unrolled + FMA | ~2× | +| BLAS L1 (axpy, scal, nrm2) | Nicht vorhanden | SIMD-beschleunigt, alle Tiers | **neu** | +| BLAS L2 (gemv, ger, trsv) | Nicht vorhanden | SIMD-beschleunigt | **neu** | +| LAPACK (LU, Cholesky, QR) | Nicht vorhanden | Pure-Rust Implementierung | **neu** | +| FFT | Nicht vorhanden | Cooley-Tukey Radix-2 | **neu** | +| Aktivierungen (sigmoid, GELU) | Nicht vorhanden | SIMD F32x16 Vektorisierung | **neu** | +| Quantisierung (BF16, INT8) | Nicht vorhanden | VNNI + AMX + Scalar Fallback | **neu** | + +### Datentypen + +| Typ | Upstream | **Fork** | Anmerkung | +|-----|----------|----------|-----------| +| f32 | Standard | Standard + F32x16 SIMD | Gleich + SIMD-Beschleunigung | +| f64 | Standard | Standard + F64x8 SIMD | Gleich + SIMD-Beschleunigung | +| **f16** (IEEE 754) | **Nicht vorhanden** | u16 Carrier + F16C/FCVTL Hardware | Stable Rust, kein Nightly | +| **BF16** (bfloat16) | **Nicht vorhanden** | Hardware + RNE-Emulation (bit-exakt) | GGUF-Kalibrierung | +| i8/u8 (quantisiert) | Nicht vorhanden | VNNI dot, Hamming, Popcount | INT8 Inferenz | +| i16 (Base17) | Nicht vorhanden | L1-Distanz, SIMD widen/narrow | Codebook-Encoding | + +### Dispatch & Erkennung + +| Aspekt | Upstream | **Fork** | +|--------|----------|----------| +| SIMD-Erkennung | Keine (delegiert an BLAS) | `LazyLock` — einmal erkennen, fuer immer | +| Dispatch-Kosten | Kein eigener Dispatch | **0.3ns** (Funktionszeiger-Tabelle, kein Branch) | +| ARM-Profiling | Kein ARM-Bewusstsein | `ArmProfile`: A53/A72/A76 mit tok/s Schaetzung | +| big.LITTLE | Nicht behandelt | Korrekte Feature-Intersection (RK3399/RK3588) | +| CPU-Erkennung | Zur Laufzeit per Call | Einmal via LazyLock, dann nur Pointer-Deref | + +### Zusammenfassung: Was 
Upstream auf jedem Target macht + +``` +Upstream auf x86_64: → matrixmultiply Crate (extern, AVX2 wenn verfuegbar) +Upstream auf aarch64: → Scalar (kein NEON, kein Intrinsic) +Upstream auf wasm: → Scalar +Upstream auf riscv: → Scalar + +Fork auf x86_64: → AVX-512 F32x16 / AVX2 F32x8 / SSE2 / Scalar (gestuft) +Fork auf aarch64: → NEON A76+dotprod / NEON A72 2×pipe / NEON A53 / Scalar +Fork auf wasm: → WASM SIMD128 (vorbereitet) / Scalar +Fork auf riscv: → Scalar (RISC-V V Extension vorbereitet) +``` + +## Leistung + +### GEMM (Allgemeine Matrixmultiplikation) + +| Matrixgroesse | Upstream ndarray | **Dieser Fork** | NumPy (OpenBLAS) | PyTorch CPU | GPU (RTX 3060) | +|--------------|-----------------|---------------|------------------|-------------|----------------| +| 512×512 | ~20 GFLOPS | **47 GFLOPS** | ~45 GFLOPS | ~40 GFLOPS | ~1.200 GFLOPS | +| 1024×1024 | ~13 GFLOPS | **139 GFLOPS** | ~120 GFLOPS | ~100 GFLOPS | ~3.500 GFLOPS | +| 2048×2048 | ~13 GFLOPS | **~150 GFLOPS** | ~140 GFLOPS | ~130 GFLOPS | ~5.000 GFLOPS | + +Upstream trifft bei 1024×1024 auf eine Cache-Klippe: kein Tiling, kein Threading, kein Microkernel. Unsere Goto-Implementierung eliminiert das vollstaendig. + +### Codebook-Inferenz (Token-Generierung) + +Keine Matrixmultiplikation — O(1) Tabellen-Lookup pro Token. + +| Hardware | ISA | tok/s | 50-Token Latenz | Leistung | +|----------|-----|-------|-----------------|----------| +| Sapphire Rapids | AMX (256 MACs/Instr) | **380.000** | 0,13 ms | 250W | +| Xeon / i9-13900K | AVX-512 VNNI | **10.000–50.000** | 1–5 ms | 150W | +| i7-13800K | AVX2-VNNI | **3.000–10.000** | 5–17 ms | 65W | +| **Raspberry Pi 5** | **NEON + dotprod** | **2.000–5.000** | 10–25 ms | **5W** | +| **Raspberry Pi 4** | **NEON (2× Pipeline)** | **500–2.000** | 25–100 ms | **5W** | +| Pi Zero 2W | NEON (1× Pipeline) | 50–500 | 100–1000 ms | 2W | + +Bei 5 Watt generiert ein Pi 4 eine 50-Token Sprachassistenten-Antwort in unter 100 Millisekunden. + +### Cosine-Aehnlichkeit via Palette-Distanz (nur Integer) + +Traditionelle Cosine-Aehnlichkeit braucht Fliesskomma: `dot(a,b) / (|a| × |b|)`. Wir ersetzen das durch einen einzigen u8-Tabellen-Lookup. + +| Praezisions-Stufe | Sigma-Band | Max Cosine-Fehler | Geschwindigkeit | +|-------------------|------------|-------------------|----------------| +| **Foveal** (1/40 σ) | Innere 2,5% | ±0,004 (0,4%) | **611M Lookups/s** | +| **Gut** (1/4 σ) | Innere 68% | ±0,02 (2%) | **611M Lookups/s** | +| **Nah** (1 σ) | Innere 95% | ±0,08 (8%) | **2,4 Mrd/s** | +| F32 exakte Cosine | — | 0 | ~50M/s | + +**611 Millionen Cosine-aequivalente Vergleiche pro Sekunde mit reinen Integer-Operationen** — 12× schneller als SIMD-f32-Skalarprodukt. Die 256×256 Tabelle (64KB) passt komplett in den L1-Cache. + +### Halbpraezisions-Gewichts-Transkodierung + +Getestet mit 15-Millionen-Parameter-Modell (Piper TTS Groesse): + +| Format | Groesse | Max Fehler | RMSE | Durchsatz | +|--------|---------|-----------|------|-----------| +| f32 (Original) | 60 MB | — | — | — | +| **f16 (IEEE 754)** | **30 MB** | 7,3×10⁻⁶ | 2,5×10⁻⁶ | 94M Params/s | +| **Scaled-f16** | **30 MB** | 4,9×10⁻⁶ | 2,1×10⁻⁶ | 91M Params/s | +| **Double-f16** | 60 MB | 5,7×10⁻⁸ | 1,8×10⁻⁸ | 42M Params/s | + +## Was wir bauen, das sonst niemand hat + +### 1. Vollstaendiger SIMD-Polyfill auf stabilem Rust + +`std::simd` ist seit Jahren Nightly-only. Wir implementieren dieselbe Typ-Oberflaeche mit stabilen `core::arch` Intrinsics. Wenn `std::simd` stabilisiert wird, aendert der Consumer eine `use`-Zeile. + +### 2. 
Halbpraezisions-Typen ohne Nightly

Rusts `f16`-Typ ist Nightly-only. Wir nutzen `u16` als Traeger + Hardware-Instruktionen via stabiles `#[target_feature]` (F16C auf x86, `FCVTL`/`FCVTN` via Inline-`asm!()` auf ARM).

### 3. AMX auf stabilem Rust

Intel AMX Intrinsics sind Nightly-only. Wir emittieren Instruktionen direkt via `asm!(".byte ...")` — 256 MACs pro Instruktion, verifiziert auf Rust 1.94 stable.

### 4. Gestuftes ARM NEON fuer Einplatinen-Computer

Drei Stufen mit Laufzeit-Erkennung: A53 Baseline (Pi Zero/3), A72 Fast (Pi 4, Dual-Pipeline), A76 DotProd (Pi 5, `vdotq_s32` + natives fp16). big.LITTLE-bewusst.

### 5. Eingefrorener Dispatch (0,3ns pro Aufruf)

Funktionszeiger-Tabelle statt Branch pro Aufruf. `LazyLock` → ein indirekter Call, kein Atomic, kein Branch-Prediction-Miss.

### 6. BF16 RNE bit-exakt mit Hardware

Pure AVX-512-F Emulation von `VCVTNEPS2BF16`, verifiziert Bit-fuer-Bit auf 1M+ Eingaben.

### 7. Kognitiver Codec-Stack

Fingerprint<256>, Base17 VSA, CAM-PQ, Palette-Semiring, bgz7/bgz17 — komprimierte Modellgewichte (201GB → 685MB) mit O(1) Inferenz.

## Schnellstart

```rust
use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;

let a = Array2::<f32>::ones((1024, 1024));
let b = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&b); // AVX-512 / AVX2 / NEON — null Code-Aenderungen

let caps = simd_caps();
if caps.avx512f { println!("AVX-512: 16 Lanes"); }
if caps.neon { println!("ARM: {}", caps.arm_profile().name()); }
```

```bash
cargo build --release
cargo build --release --target aarch64-unknown-linux-gnu # Pi 4
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release # AVX-512
cargo test # 880 HPC Tests
```

## Voraussetzungen

- **Rust 1.94 stable** (kein Nightly, keine instabilen Features)
- Optional: `gcc-aarch64-linux-gnu` fuer Pi Cross-Kompilierung
- Optional: Intel MKL oder OpenBLAS fuer BLAS-Beschleunigung (Feature-gated)

## Oekosystem

| Repository | Rolle | Nutzt ndarray fuer |
|------------|-------|-------------------|
| [lance-graph](https://github.com/AdaWorldAPI/lance-graph) | Graph-Query + Codec-Spine | Fingerprint, CAM-PQ, CLAM, BLAS, ZeckF64 |
| [home-automation-rs](https://github.com/AdaWorldAPI/home-automation-rs) | Smart Home + Sprach-KI | Codebook-Inferenz, VITS TTS, SIMD Audio |

## Lizenz

MIT OR Apache-2.0 (wie Upstream ndarray)

From e38969f6872027bb43c63644de827f3ab223d7fb Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:22:05 +0200 Subject: [PATCH 3/4] README.md: add upstream vs fork ISA comparison table --- README.md | 85 ++++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 69 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 502e187b..ec7d2298 100644 --- a/README.md +++ b/README.md @@ -6,26 +6,79 @@ The upstream ndarray provides excellent n-dimensional array abstractions. We kee

[Deutsche Version / German Version](README-DE.md)

-## Core Architecture
+## Upstream vs. Fork — Feature by Feature
+
+### ISA Coverage (Instruction Set Architecture)
+
+| ISA / Feature | Upstream ndarray | **AdaWorldAPI Fork** | Speedup vs. 
Upstream | +|---------------|-----------------|---------------------|---------------------| +| **AVX-512** (512-bit, 16×f32) | Scalar fallback | Native `__m512` types, F32x16/F64x8/U8x64 | **~8×** | +| **AVX-512 VNNI** (int8 dot) | Scalar fallback | `vpdpbusd` 64 MACs/instr + dispatch | **~32×** | +| **AVX-512 BF16** (bfloat16) | Not available | Hardware `vcvtneps2bf16` + RNE emulation | **new** | +| **AVX-512 VPOPCNTDQ** (popcount) | Scalar fallback | Native 512-bit popcount for Hamming | **~16×** | +| **AMX** (Tile Matrix, 256 MACs) | Not available | Inline asm `.byte` encoding, stable Rust | **~128×** vs. scalar | +| **AVX2 + FMA** (256-bit, 8×f32) | Via matrixmultiply | Own Goto-GEMM 6×16 + dispatch table | **~4×** | +| **AVX2 F16C** (f16 hardware) | Not available | IEEE 754 f16, Double-f16, Kahan, Scaler | **new** | +| **AVX-VNNI** (ymm, 32 MACs) | Not available | Arrow Lake / NUC 14 support | **new** | +| **SSE2** (128-bit, 4×f32) | Via matrixmultiply | Scalar polyfill with same API | 1× (baseline) | +| **NEON** (128-bit, 4×f32) | Scalar fallback | 3-tier: A53/A72/A76 with pipeline awareness | **~4×** | +| **NEON dotprod** (ARMv8.2) | Not available | `vdotq_s32` for 4× int8 throughput (Pi 5) | **~16×** vs. scalar | +| **NEON fp16** (ARMv8.2) | Not available | `FCVTL`/`FCVTN` via inline asm | **new** | +| **NEON Popcount** | Not available | `vcntq_u8` native byte popcount | **faster than x86 SSE** | +| **WASM SIMD128** | Not available | Scaffolding prepared | in progress | + +### BLAS / Numerics + +| Operation | Upstream | **Fork** | Improvement | +|-----------|----------|----------|-------------| +| GEMM (1024²) | ~13 GFLOPS (cache cliff) | **139 GFLOPS** (Goto blocking) | **10.5×** | +| Dot Product | Via matrixmultiply | 4× unrolled + FMA | ~2× | +| BLAS L1 (axpy, scal, nrm2) | Not available | SIMD-accelerated, all tiers | **new** | +| BLAS L2 (gemv, ger, trsv) | Not available | SIMD-accelerated | **new** | +| LAPACK (LU, Cholesky, QR) | Not available | Pure-Rust implementation | **new** | +| FFT | Not available | Cooley-Tukey radix-2 | **new** | +| Activations (sigmoid, GELU) | Not available | SIMD F32x16 vectorization | **new** | +| Quantization (BF16, INT8) | Not available | VNNI + AMX + scalar fallback | **new** | + +### Data Types + +| Type | Upstream | **Fork** | Note | +|------|----------|----------|------| +| f32 | Standard | Standard + F32x16 SIMD | Same + SIMD acceleration | +| f64 | Standard | Standard + F64x8 SIMD | Same + SIMD acceleration | +| **f16** (IEEE 754) | **Not available** | u16 carrier + F16C/FCVTL hardware | Stable Rust, no nightly | +| **BF16** (bfloat16) | **Not available** | Hardware + RNE emulation (bit-exact) | GGUF calibration | +| i8/u8 (quantized) | Not available | VNNI dot, Hamming, popcount | INT8 inference | +| i16 (Base17) | Not available | L1 distance, SIMD widen/narrow | Codebook encoding | + +### Dispatch and Detection + +| Aspect | Upstream | **Fork** | +|--------|----------|----------| +| SIMD detection | None (delegates to BLAS) | `LazyLock` — detect once, forever | +| Dispatch cost | No own dispatch | **0.3ns** (fn pointer table, no branch) | +| ARM profiling | No ARM awareness | `ArmProfile`: A53/A72/A76 with tok/s estimate | +| big.LITTLE | Not handled | Correct feature intersection (RK3399/RK3588) | +| CPU detection | Per-call runtime | Once via LazyLock, then pointer deref only | + +### What Upstream Does on Each Target -The expansion comprises five layers built on top of upstream's array primitives: - -**SIMD Polyfill Layer** 
(`src/simd.rs`, `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`) provides `std::simd`-compatible types — `F32x16`, `F64x8`, `U8x64`, `I32x16`, `I64x8`, `U32x16`, `U64x8` with full operator overloading, reductions, comparisons, and masked operations — backed by `core::arch` intrinsics on x86 and inline assembly on ARM. Consumers write `crate::simd::F32x16` and get native 512-bit operations on AVX-512, 256-bit on AVX2, 128-bit on NEON, or scalar fallback, with zero code changes. Detection happens once via `LazyLock` (one pointer deref per call, no atomics, no branch prediction misses). - -**Backend Layer** (`src/backend/`) implements pluggable BLAS through the `BlasFloat` trait with three backends: pure-Rust SIMD microkernels (default, zero dependencies), Intel MKL FFI (feature-gated), and OpenBLAS FFI (feature-gated, mutually exclusive with MKL). The native backend uses Goto-algorithm cache-blocked GEMM with 6×16 (f32) and 6×8 (f64) microkernels, achieving 139 GFLOPS at 1024×1024 — a 10.5× improvement over the naive approach and within 15% of NumPy's multi-threaded OpenBLAS. - -**HPC Module Library** (`src/hpc/`, 55 modules) delivers a complete numerical computing surface: BLAS Level 1-3 (dot, axpy, gemv, gemm, syrk, trsm), LAPACK factorizations (LU, Cholesky, QR), Cooley-Tukey FFT, vector math (exp, ln, sqrt, erf, trigonometric), statistics (median, variance, percentile, top-k), neural network activations (sigmoid, softmax, GELU, SiLU), and quantized operations (BF16 GEMM, INT8 GEMM via VNNI). Every module has SIMD-accelerated hot paths that dispatch through the frozen function pointer table. - -**Codec Layer** (`src/hpc/fingerprint.rs`, `bgz17_bridge.rs`, `cam_pq.rs`) implements the encoding stack for compressed inference: 16Kbit Fingerprints, Base17 VSA (17-dimensional i16 vectors), CAM-PQ product quantization, ZeckF64 Fibonacci encoding, and palette semiring distance matrices. This is what makes codebook inference O(1) per token — table lookups replace matrix multiplication. - -**Burn Integration** (`crates/burn/`) provides a SIMD-augmented burn-ndarray backend that wires `crate::simd::F32x16` into burn's tensor operations and activations, replacing macerator's SIMD with our LazyLock-dispatched implementations. This enables using burn's model format and autodiff while benefiting from our full SIMD stack. +``` +Upstream on x86_64: → matrixmultiply crate (external, AVX2 if available) +Upstream on aarch64: → Scalar (no NEON, no intrinsics) +Upstream on wasm: → Scalar +Upstream on riscv: → Scalar + +Fork on x86_64: → AVX-512 F32x16 / AVX2 F32x8 / SSE2 / Scalar (tiered) +Fork on aarch64: → NEON A76+dotprod / NEON A72 2×pipe / NEON A53 / Scalar +Fork on wasm: → WASM SIMD128 (prepared) / Scalar +Fork on riscv: → Scalar (RISC-V V Extension prepared) +``` ## Performance ### GEMM (General Matrix Multiply) -The Goto-algorithm GEMM with cache blocking (L1: 32KB, L2: 256KB, L3: shared) and 16-thread parallelism via split-borrow (no mutex contention): - | Matrix Size | Upstream ndarray | **This Fork** | NumPy (OpenBLAS) | PyTorch CPU | GPU (RTX 3060) | |-------------|-----------------|---------------|------------------|-------------|----------------| | 512×512 | ~20 GFLOPS | **47 GFLOPS** | ~45 GFLOPS | ~40 GFLOPS | ~1,200 GFLOPS | @@ -36,7 +89,7 @@ Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkern ### Codebook Inference (Token Generation) -This is not matrix multiplication. 
Codebook inference replaces `y = W·x` with `y = codebook[index[x]]` — an O(1) table lookup per token. No GPU required. +Not matrix multiplication — O(1) table lookup per token. No GPU required. | Hardware | ISA | tok/s | 50-Token Latency | Power | |----------|-----|-------|------------------|-------| @@ -51,7 +104,7 @@ At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 mi ### Cosine Similarity via Palette Distance (Integer-Only) -Traditional cosine requires floating-point: `dot(a,b) / (|a| × |b|)`. We replace this with a single u8 table lookup. High-dimensional vectors are quantized to 256 archetypes; pairwise distance is precomputed into a 256×256 u8 table. Query-time similarity: `table[a][b]` — one memory access, no floating point. +Traditional cosine requires floating-point: `dot(a,b) / (|a| × |b|)`. We replace this with a single u8 table lookup. | Precision Tier | Sigma Band | Max Cosine Error | Speed | |----------------|------------|-----------------|-------| From 5c8b3f45360aad3902f0aad4ee1e9a7c1e17432f Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:26:47 +0200 Subject: [PATCH 4/4] Add complete feature comparison: upstream ndarray vs AdaWorldAPI fork (80K LOC, 146 files) --- COMPARISON.md | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 212 insertions(+) create mode 100644 COMPARISON.md diff --git a/COMPARISON.md b/COMPARISON.md new file mode 100644 index 00000000..4cce0a30 --- /dev/null +++ b/COMPARISON.md @@ -0,0 +1,212 @@ +# Complete Feature Comparison: rust-ndarray vs. AdaWorldAPI Fork + +> 80,131 lines of new code across 146 HPC modules, 6 SIMD files, 5 backend files, 20 burn ops, and 2 subcrates. + +## At a Glance + +| Metric | Upstream [rust-ndarray/ndarray](https://github.com/rust-ndarray/ndarray) | **[AdaWorldAPI/ndarray](https://github.com/AdaWorldAPI/ndarray)** | +|--------|-----------|------------| +| Base functionality | n-dimensional arrays, slicing, views | **Same** (full upstream preserved) | +| New LOC added | — | **80,131** | +| New files | — | **179** (146 HPC + 6 SIMD + 5 backend + 20 burn + 2 subcrates) | +| Test count | ~300 | **~1,180** (300 upstream + 880 new) | +| SIMD ISAs | SSE2 via matrixmultiply (external) | **7 ISAs**: AVX-512, AVX2, SSE2, AMX, VNNI, NEON (3 tiers), WASM | +| Numeric types | f32, f64 | **+f16, BF16, i8, u8, i16** (all with SIMD paths) | +| BLAS coverage | dot (via matrixmultiply) | **Full L1 + L2 + L3** (pure-Rust + MKL + OpenBLAS) | +| Target platforms | x86_64 (via external BLAS), scalar everywhere else | **x86_64 (tiered), aarch64 (3-tier NEON), wasm (prepared)** | +| Minimum Rust | 1.64 | **1.94 stable** (no nightly) | + +--- + +## SIMD Layer (6 files, ~5,700 LOC) + +| Component | Upstream | **Fork** | +|-----------|----------|----------| +| `simd.rs` — dispatch + re-exports | Not present | **LazyLock tier detection, PREFERRED_LANES, type re-exports** | +| `simd_avx512.rs` — 512-bit types | Not present | **11 types: F32x16, F64x8, U8x64, I32x16, I64x8, U32x16, U64x8, F32x8, F64x4, BF16x16, BF16x8 + F16 IEEE 754** (2,700 LOC) | +| `simd_avx2.rs` — 256-bit ops | Not present | **BLAS L1, Hamming, i8 dot, popcount, F16 precision toolkit** (1,600 LOC) | +| `simd_neon.rs` — ARM 128-bit | Not present | **3-tier NEON: A53 baseline, A72 dual-pipe, A76 dotprod+fp16; codebook gather, Hamming, Base17 L1** (500 LOC) | +| `simd_amx.rs` — Intel tile matrix | Not present | **AMX detection (CPUID+XCR0), VNNI 512/256, MatVec dispatch, quantize/dequantize** (350 LOC) | +| 
`simd_wasm.rs` — WebAssembly | Not present | **Scaffolding for WASM SIMD128** | + +## Backend Layer (5 files, ~2,000 LOC) + +| Component | Upstream | **Fork** | +|-----------|----------|----------| +| `backend/mod.rs` — BlasFloat trait | Not present | **Trait-based dispatch: Native / MKL / OpenBLAS** | +| `backend/native.rs` — pure-Rust GEMM | Not present | **Goto-algorithm 6x16/6x8 microkernels, cache-blocked (L1/L2/L3), AVX-512+AVX2 dispatch** | +| `backend/kernels_avx512.rs` | Not present | **AVX-512 SIMD GEMM kernels** | +| `backend/mkl.rs` | Not present | **Intel MKL FFI (feature = "intel-mkl")** | +| `backend/openblas.rs` | Not present | **OpenBLAS FFI (feature = "openblas")** | +| GEMM throughput (1024x1024) | ~13 GFLOPS (via matrixmultiply) | **139 GFLOPS** (10.5x improvement) | + +## HPC Module Library (146 files, ~70,000 LOC, 880 tests) + +### Linear Algebra (BLAS + LAPACK) + +| Module | Upstream | **Fork** | Operations | +|--------|----------|----------|------------| +| `blas_level1.rs` | dot only (external) | **Full** | dot, axpy, scal, nrm2, asum, iamax, Givens rotation | +| `blas_level2.rs` | Not present | **Full** | gemv, ger, symv, trmv, trsv | +| `blas_level3.rs` | dot→gemm (external) | **Goto GEMM** | gemm, syrk, trsm, symm (cache-blocked, multithreaded) | +| `quantized.rs` | Not present | **New** | BF16 GEMM, INT8 GEMM, quantize/dequantize | +| `lapack.rs` | Not present | **New** | LU, Cholesky, QR factorization | + +### Signal Processing + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `fft.rs` | Not present | **Cooley-Tukey** | Radix-2 FFT/IFFT, in-place | +| `vml.rs` | Not present | **Vector Math** | exp, ln, sqrt, erf, cbrt, sin, cos (SIMD F32x16 paths) | +| `statistics.rs` | Not present | **Statistics** | median, variance, std, percentile, top_k | +| `activations.rs` | Not present | **Neural Net** | sigmoid, softmax, log_softmax, GELU, SiLU (fused SIMD) | + +### Hardware Detection + Dispatch + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `simd_caps.rs` | Not present | **SimdCaps** | LazyLock detection: AVX-512/AVX2/SSE2/FMA/NEON/dotprod/fp16/aes/sha2/crc32 + **ArmProfile** (A53/A72/A76) | +| `simd_dispatch.rs` | Not present | **SimdDispatch** | Frozen fn-pointer table: 0.3ns per call, no branch, no atomic | +| `amx_matmul.rs` | Not present | **AMX MatMul** | Tile configuration, TDPBUSD via inline asm | + +### Encoding + Codec (Cognitive Computing) + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `fingerprint.rs` | Not present | **Fingerprint\<256\>** | 256-bit VSA, XOR bind, Hamming distance (VPOPCNTDQ / vcntq_u8) | +| `bgz17_bridge.rs` | Not present | **Base17** | 17-dim i16 vectors, L1 distance, sign agreement, xor_bind | +| `cam_pq.rs` | Not present | **CAM-PQ** | Product quantization, compiled distance tables, IVF index | +| `cam_index.rs` | Not present | **CAM Index** | Inverted file index for PQ search | +| `palette_codec.rs` | Not present | **Palette Codec** | 4-bit palette encoding, Minecraft-style chunk compression | +| `palette_distance.rs` | Not present | **Palette Distance** | 256x256 u8 distance tables, cosine emulation (611M/s) | +| `zeck.rs` | Not present | **ZeckF64** | Fibonacci/Zeckendorf encoding for sparse representations | +| `packed.rs` | Not present | **Packed DB** | 64-byte aligned packed storage for SIMD access | +| `prefilter.rs` | Not present | **INT8 Prefilter** | Approximate statistics for cascade search pruning 
| + +### Byte-Level + Spatial Operations + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `byte_scan.rs` | Not present | **Byte Scan** | AVX-512 byte_find_all/byte_count (VPCMPEQB + KMOV) | +| `nibble.rs` | Not present | **Nibble Ops** | 4-bit unpack/threshold (AVX2 vpshufb) | +| `distance.rs` | Not present | **3D Distance** | Squared distance (AVX2 batch) | +| `spatial_hash.rs` | Not present | **Spatial Hash** | Batch radius query (AVX2 accelerated) | +| `aabb.rs` | Not present | **AABB** | Axis-aligned bounding box intersection | +| `bitwise.rs` | Not present | **Bitwise** | XOR, AND, OR, popcount on 8KB+ vectors | + +### Search + Trees + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `clam.rs` | Not present | **CLAM Tree** | Build + search + rho_nn (46 tests) | +| `clam_search.rs` | Not present | **CLAM Search** | k-NN and range search on CLAM index | +| `clam_compress.rs` | Not present | **CLAM Compress** | Index compression for storage | +| `cascade.rs` | Not present | **HDR Cascade** | Sigma-band filtering, ranked hits, drift detection | +| `parallel_search.rs` | Not present | **Parallel Search** | Multi-threaded CLAM search | +| `dn_tree.rs` | Not present | **DN Tree** | Hierarchical path resolution | +| `merkle_tree.rs` | Not present | **Merkle Tree** | Hash-based integrity verification | + +### Model Inference + AI + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `gguf.rs` | Not present | **GGUF Reader** | GGUF format parser (LLaMA, Qwen, Gemma) | +| `gguf_indexer.rs` | Not present | **GGUF Indexer** | Build bgz7 codebook index from GGUF weights | +| `safetensors.rs` | Not present | **Safetensors** | HuggingFace safetensors reader | +| `gpt2/` (4 files) | Not present | **GPT-2** | Inference engine (weights, layers, API) | +| `openchat/` (4 files) | Not present | **OpenChat** | Inference engine for OpenChat models | +| `stable_diffusion/` (7 files) | Not present | **Stable Diffusion** | CLIP, UNet, VAE, scheduler (image generation) | +| `models/` (5 files) | Not present | **Model Router** | Multi-model router, safetensors loader, layer abstractions | +| `jina/` (5 files) | Not present | **Jina v5** | Embedding cache, causal attention, codec, runtime | + +### Cognitive Primitives + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `nars.rs` | Not present | **NARS** | Non-Axiomatic Reasoning System inference | +| `qualia.rs` | Not present | **Qualia** | Felt-sense quality encoding | +| `qualia_gate.rs` | Not present | **Qualia Gate** | Gated operations on quality values | +| `hdc.rs` | Not present | **HDC** | Hyperdimensional Computing primitives | +| `vsa.rs` | Not present | **VSA** | Vector Symbolic Architecture operations | +| `spo_bundle.rs` | Not present | **SPO Bundle** | Subject-Predicate-Object triple encoding | +| `causality.rs` | Not present | **Causality** | Causal graph operations | +| `causal_diff.rs` | Not present | **CausalEdge64** | u64-packed causal edges, quality scoring | +| `bf16_truth.rs` | Not present | **BF16 Truth** | Truth values in BF16 precision | +| `styles/` (34 files) | Not present | **Thinking Styles** | 34 cognitive primitives: rte, htd, smad, tcp, irs, mcp, tca, cdt, mct, lsi, pso, cdi, cws, are, tcf, ssr, etd, amp, zcf, hpm, cur, mpc, ssam, idr, spp, icr, sdd, dtmf, hkf | +| `blackboard.rs` | Not present | **Blackboard** | Typed slot arena (zero-copy shared memory) | +| `node.rs` | Not 
present | **Node** | Cognitive node representation | +| `plane.rs` | Not present | **Plane** | 16Kbit representation plane | +| `seal.rs` | Not present | **Seal** | Immutable snapshot encoding | +| `substrate.rs` | Not present | **Substrate** | Cognitive substrate operations | +| `binding_matrix.rs` | Not present | **Binding Matrix** | 3D permutation binding | +| `cyclic_bundle.rs` | Not present | **Cyclic Bundle** | Cyclic vector bundling | + +### JIT Compilation + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `jitson/` (8 files) | Not present | **JITSON** | JSON parser + validator + template + scan pipeline | +| `jitson_cranelift/` (6 files) | Not present | **Cranelift JIT** | AVX-512 kernel compilation via Cranelift (feature-gated) | + +### Audio / OCR / Media + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `holo.rs` | Not present | **Holographic** | Holographic reduced representations, cosine carriers | +| `ocr_felt.rs` | Not present | **OCR** | Character recognition via felt-sense matching | +| `ocr_simd.rs` | Not present | **OCR SIMD** | SIMD-accelerated binarization, Otsu threshold, density | +| `surround_metadata.rs` | Not present | **Surround** | Spatial audio metadata | +| `crystal_encoder.rs` | Not present | **Crystal** | Crystal symmetry encoding | + +### Miscellaneous + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `arrow_bridge.rs` | Not present | **Arrow** | Apache Arrow zero-copy bridge | +| `bnn.rs` | Not present | **BNN** | Binary Neural Network operations | +| `bnn_causal_trajectory.rs` | Not present | **BNN Causal** | Causal trajectory tracking | +| `bnn_cross_plane.rs` | Not present | **BNN Cross-Plane** | Cross-plane BNN operations | +| `cogrecord.rs` | Not present | **CogRecord** | 4×16KB cognitive record unit | +| `compression_curves.rs` | Not present | **Compression** | Rate-distortion curve analysis | +| `graph.rs` | Not present | **Graph** | Basic graph operations | +| `heel_f64x8.rs` | Not present | **F64x8 Kernels** | SIMD dot product, cosine similarity | +| `http_reader.rs` | Not present | **HTTP Reader** | Stream weights from HTTP | +| `kernels.rs` | Not present | **SIMD Kernels** | Generic SIMD apply/map/reduce | +| `layered_distance.rs` | Not present | **Layered Distance** | Multi-layer distance computation | +| `organic.rs` | Not present | **Organic** | Organic growth patterns | +| `p64_bridge.rs` | Not present | **P64 Bridge** | Palette64 convergence point (ndarray <-> lance-graph) | +| `projection.rs` | Not present | **Projection** | Dimensionality reduction | +| `property_mask.rs` | Not present | **Property Mask** | Bitwise property filtering | +| `tekamolo.rs` | Not present | **Tekamolo** | Syntactic position encoding | +| `udf_kernels.rs` | Not present | **UDF Kernels** | User-defined function dispatch | +| `deepnsm.rs` | Not present | **DeepNSM** | Distributional semantic bridge | + +## Subcrates (2 crates) + +| Crate | Upstream | **Fork** | Detail | +|-------|----------|----------|--------| +| `crates/p64` | Not present | **P64** | Palette64 data structure — convergence highway between ndarray and lance-graph | +| `crates/phyllotactic-manifold` | Not present | **Phyllotactic Manifold** | Golden-angle spiral geometry for uniform point distribution | + +## Burn Backend (20 ops files) + +| Component | Upstream | **Fork** | Detail | +|-----------|----------|----------|--------| +| `crates/burn/` | Not present | 
**burn-ndarray** | SIMD-augmented burn backend (from tracel-ai/burn v0.21.0) | +| `ops/tensor.rs` | — | **try_vml_unary** | Routes f32 unary ops through ndarray hpc::vml (F32x16 SIMD) | +| `ops/activation.rs` | — | **Fused sigmoid** | SIMD-accelerated activation functions | +| `ops/matmul.rs` | — | **GEMM dispatch** | Routes to our Goto-algorithm GEMM | +| Remaining 17 ops files | — | **Standard burn ops** | conv, pooling, interpolate, quantization, etc. | + +## Summary + +| Category | Upstream Count | **Fork Count** | New | +|----------|---------------|----------------|-----| +| SIMD type files | 0 | 6 | +6 | +| Backend files | 0 | 5 | +5 | +| HPC modules | 0 | 146 | +146 | +| Burn ops | 0 | 20 | +20 | +| Subcrates | 0 | 2 | +2 | +| **Total new files** | — | — | **179** | +| **Total new LOC** | — | — | **80,131** | +| **Total new tests** | — | — | **~880** |
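## Appendix: The Frozen-Dispatch Pattern

The dispatch rows above cite 0.3ns per call through a frozen function-pointer table. A minimal sketch of that pattern on stable Rust — simplified, with a scalar stand-in where the real table installs SIMD kernels:

```rust
use std::sync::LazyLock;

#[derive(Clone, Copy)]
struct Dispatch {
    dot_f32: fn(&[f32], &[f32]) -> f32,
}

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

static DISPATCH: LazyLock<Dispatch> = LazyLock::new(|| {
    // Feature detection runs once, at first access; the table never changes.
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        // A real table would install an AVX2/AVX-512 kernel here.
        return Dispatch { dot_f32: dot_scalar };
    }
    Dispatch { dot_f32: dot_scalar }
});

pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    (DISPATCH.dot_f32)(a, b) // one pointer deref + one indirect call
}
```

After initialization the indirect call target is stable, so the branch predictor treats it as effectively free — the mechanism behind the per-call figures quoted in the tables above.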