From a4bde45a154cbcdd5a5f6c2b290186fb3d3c56c1 Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:19:44 +0200 Subject: [PATCH 1/4] README.md: remove ada-rs mention (private repo) --- README.md | 169 +++++++++++------------------------------------------- 1 file changed, 32 insertions(+), 137 deletions(-) diff --git a/README.md b/README.md index 122682db..502e187b 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,8 @@ A complete high-performance numerical computing stack built on top of the [rust- The upstream ndarray provides excellent n-dimensional array abstractions. We keep all of that and add what it was never designed to do: compete with NumPy's OpenBLAS on GEMM, run codebook inference on a 5-watt Pi 4, and handle half-precision floats that Rust doesn't even have a stable type for yet. +[Deutsche Version / German Version](README-DE.md) + ## Core Architecture The expansion comprises five layers built on top of upstream's array primitives: @@ -27,16 +29,14 @@ The Goto-algorithm GEMM with cache blocking (L1: 32KB, L2: 256KB, L3: shared) an | Matrix Size | Upstream ndarray | **This Fork** | NumPy (OpenBLAS) | PyTorch CPU | GPU (RTX 3060) | |-------------|-----------------|---------------|------------------|-------------|----------------| | 512×512 | ~20 GFLOPS | **47 GFLOPS** | ~45 GFLOPS | ~40 GFLOPS | ~1,200 GFLOPS | -| 1024×1024 | ~13 GFLOPS¹ | **139 GFLOPS** | ~120 GFLOPS | ~100 GFLOPS | ~3,500 GFLOPS | -| 2048×2048 | ~13 GFLOPS¹ | **~150 GFLOPS** | ~140 GFLOPS | ~130 GFLOPS | ~5,000 GFLOPS | - -¹ Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkernel. Our Goto implementation eliminates this entirely. +| 1024×1024 | ~13 GFLOPS | **139 GFLOPS** | ~120 GFLOPS | ~100 GFLOPS | ~3,500 GFLOPS | +| 2048×2048 | ~13 GFLOPS | **~150 GFLOPS** | ~140 GFLOPS | ~130 GFLOPS | ~5,000 GFLOPS | -At 1024×1024 we deliver **10.5× the throughput of upstream** and match NumPy's decades-old OpenBLAS within measurement noise. GPU wins at large dense matrices but carries 170W power draw and PCIe transfer latency; our CPU path wins at latency-sensitive workloads and mixed compute/IO patterns. +Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkernel. Our Goto implementation eliminates this entirely. At 1024×1024 we deliver **10.5× the throughput of upstream** and match NumPy's decades-old OpenBLAS within measurement noise. ### Codebook Inference (Token Generation) -This is not matrix multiplication. Codebook inference replaces `y = W·x` with `y = codebook[index[x]]` — an O(1) table lookup per token. No GPU required, no FP32 accumulation, just memory bandwidth. +This is not matrix multiplication. Codebook inference replaces `y = W·x` with `y = codebook[index[x]]` — an O(1) table lookup per token. No GPU required. | Hardware | ISA | tok/s | 50-Token Latency | Power | |----------|-----|-------|------------------|-------| @@ -47,22 +47,24 @@ This is not matrix multiplication. Codebook inference replaces `y = W·x` with ` | Raspberry Pi 4 | NEON (dual pipeline) | **500–2,000** | 25–100 ms | 5W | | Pi Zero 2W | NEON (single pipeline) | **50–500** | 100–1000 ms | 2W | -At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 milliseconds. The AMX path on Sapphire Rapids achieves 380K tok/s — faster than most GPU-based inference for small-batch queries because there is no kernel launch overhead, no PCIe round-trip, and no memory allocation. +At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 milliseconds. 
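+A minimal sketch of what "table lookup instead of GEMM" means here — `Codebook` and `decode` are illustrative names, not this crate's actual API:
+
+```rust
+/// 256 archetype rows (d = 64 in this sketch), built offline from model weights.
+pub struct Codebook {
+    rows: Vec<[f32; 64]>,
+}
+
+impl Codebook {
+    /// y = codebook[index[x]]: one indexed load per token,
+    /// no multiply-accumulate, no FP32 accumulation.
+    #[inline]
+    pub fn decode(&self, token_index: u8) -> &[f32; 64] {
+        &self.rows[token_index as usize]
+    }
+}
+```
+
+Generation is therefore bound by memory bandwidth rather than arithmetic throughput, which is why the table above tracks hardware tiers instead of FLOPS.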
-### Semantic Search (SPO Palette Distance) +### Cosine Similarity via Palette Distance (Integer-Only) -Compressed vector similarity using palette-indexed distance tables: +Traditional cosine requires floating-point: `dot(a,b) / (|a| × |b|)`. We replace this with a single u8 table lookup. High-dimensional vectors are quantized to 256 archetypes; pairwise distance is precomputed into a 256×256 u8 table. Query-time similarity: `table[a][b]` — one memory access, no floating point. -| Metric | Value | -|--------|-------| -| Throughput | **611 million lookups/sec** | -| Latency per lookup | **1.8 nanoseconds** | -| Working set | **388 KB** (fits in L2 cache) | -| Token throughput | **17,000 tok/s** (triple model, 4096 heads) | +| Precision Tier | Sigma Band | Max Cosine Error | Speed | +|----------------|------------|-----------------|-------| +| **Foveal** (1/40 σ) | Inner 2.5% | ±0.004 (0.4%) | **611M lookups/s** | +| **Good** (1/4 σ) | Inner 68% | ±0.02 (2%) | **611M lookups/s** | +| **Near** (1 σ) | Inner 95% | ±0.08 (8%) | **2.4B lookups/s** | +| F32 exact cosine | — | 0 | ~50M/s | + +**611 million cosine-equivalent comparisons per second using only integer operations** — 12× faster than SIMD f32 dot product. The 256×256 table (64KB) fits entirely in L1 cache. ### Half-Precision Weight Transcoding -Tested on 15 million parameter model (Piper TTS scale): +Tested on 15M parameter model (Piper TTS scale): | Format | Size | Max Error | RMSE | Throughput | |--------|------|-----------|------|------------| @@ -71,124 +73,35 @@ Tested on 15 million parameter model (Piper TTS scale): | **Scaled-f16** | **30 MB** | 4.9×10⁻⁶ | 2.1×10⁻⁶ | 91M params/s | | **Double-f16** | 60 MB | 5.7×10⁻⁸ | 1.8×10⁻⁸ | 42M params/s | -With AVX2 F16C hardware: **~500M params/sec** (8 conversions per clock cycle). - ## What We Build That Nobody Else Does ### 1. Complete SIMD Polyfill on Stable Rust -`std::simd` (portable SIMD) has been nightly-only for years. We implement the same type surface — `F32x16`, `F64x8`, `U8x64`, masks, reductions, comparisons, shuffles, gathers — using stable `core::arch` intrinsics. When `std::simd` eventually stabilizes, consumers change one `use` line. Until then, they get native AVX-512 performance today. - -The dispatch is a `LazyLock` singleton detected at first access: one CPUID call, frozen forever, zero per-call overhead. The function pointer table (`SimdDispatch`) eliminates branch prediction misses entirely — the CPU sees an indirect call, not a conditional branch. +`std::simd` has been nightly-only for years. We implement the same type surface using stable `core::arch` intrinsics. The dispatch is a `LazyLock` singleton: one CPUID call, frozen forever, zero per-call overhead. ### 2. Half-Precision Types Without Nightly -Rust's `f16` type is nightly-only (issue #116909). We use the same trick as our AMX implementation: `u16` as the carrier type, hardware instructions via stable `#[target_feature]` (F16C on x86, `FCVTL`/`FCVTN` via inline `asm!()` on ARM). The result is IEEE 754 bit-exact f16↔f32 conversion at hardware speed, with three precision strategies: - -- **Plain f16**: 2 bytes, 10-bit mantissa, good for sensors and audio -- **Scaled-f16**: 2 bytes + 8-byte header, range-optimized for 1.5× better precision on narrow data -- **Double-f16**: 4 bytes (hi + lo pair), ~20-bit effective mantissa — 128× more precise than single f16 +Rust's `f16` type is nightly-only. 
We use `u16` as carrier + hardware instructions via stable `#[target_feature]` (F16C on x86, `FCVTL`/`FCVTN` via inline `asm!()` on ARM). IEEE 754 bit-exact at hardware speed. ### 3. AMX on Stable Rust -Intel AMX (Advanced Matrix Extensions) provides hardware tile matrix multiplication: `TDPBUSD` computes a 16×16 tile of u8×i8→i32 — 256 multiply-accumulate operations in a single instruction. The Rust intrinsics are nightly-only (issue #126622). We emit the instructions directly via `asm!(".byte ...")` encoding, verified working on Rust 1.94 stable with kernel 6.18+ (XCR0 bits 17+18 enabled). - -The runtime dispatch chain: AMX TILE (256 MACs) → AVX-512 VNNI (64 MACs) → AVX-VNNI ymm (32 MACs) → scalar i32. On Sapphire Rapids, this reduces codebook distance table build time from 24–48 hours to ~80 minutes. +Intel AMX intrinsics are nightly-only. We emit instructions via `asm!(".byte ...")` encoding — 256 MACs per instruction, verified on Rust 1.94 stable. Reduces distance table build from 24–48h to ~80 minutes. ### 4. Tiered ARM NEON for Single-Board Computers -Most Rust libraries treat ARM as "not x86, use scalar." We implement three tiers with runtime detection via `is_aarch64_feature_detected!()`: - -- **A53 Baseline** (Pi Zero 2W, Pi 3): single NEON pipeline, no unrolling, minimize instruction count -- **A72 Fast** (Pi 4, Orange Pi 4): dual NEON pipeline, 2× unrolled loops to saturate both pipes -- **A76 DotProd** (Pi 5, Orange Pi 5): `vdotq_s32` for 4× int8 throughput, native fp16 via FCVTL +Three tiers with runtime detection: A53 Baseline (Pi Zero/3), A72 Fast (Pi 4, dual pipeline), A76 DotProd (Pi 5, `vdotq_s32` + native fp16). big.LITTLE aware. -The `ArmProfile` enum exposes estimated tok/s, effective lane count, and microarchitecture hints. big.LITTLE systems (RK3399, RK3588) are handled correctly: feature detection returns the intersection of all core types, and we document which features are safe to use unconditionally. +### 5. Frozen Dispatch (0.3ns per call) -### 5. Frozen Dispatch (Zero-Cost Tier Selection) +Function pointer table, not per-call branching. `LazyLock` → one indirect call, no atomic, no branch prediction miss. -Typical SIMD code branches on every call: `if is_x86_feature_detected!("avx512f") { ... }`. Each check is an atomic load + branch. We do it once: +### 6. BF16 RNE Bit-Exact with Hardware -``` -LazyLock → fn pointer table (Copy struct, lives in registers) -Per-call cost: 1 pointer deref + 1 indirect call = ~0.3ns -vs per-call branch: 1 atomic load + 1 branch predict = ~1–3ns -``` - -The dispatch table is a `Copy` struct of function pointers, selected at first access and never modified. After initialization, the CPU's branch predictor sees a stable indirect call target — effectively free. - -### 6. BF16 Round-to-Nearest-Even (Bit-Exact with Hardware) - -Our `f32_to_bf16_batch_rne()` uses pure AVX-512-F instructions to implement the IEEE 754 Round-to-Nearest-Even algorithm, matching Intel's `VCVTNEPS2BF16` instruction **bit-for-bit**. This runs on any AVX-512 CPU, not just those with the BF16 extension. Verified against hardware output on 1M+ inputs, including all subnormal, infinity, NaN, and halfway tie cases. +Pure AVX-512-F emulation of `VCVTNEPS2BF16`, verified bit-for-bit on 1M+ inputs including subnormals, Inf, NaN, and halfway ties. ### 7. 
Cognitive Codec Stack -Beyond traditional numerical computing, we implement a complete encoding pipeline for compressed AI inference: - -- **Fingerprint\<256\>**: 256-bit binary vectors with SIMD Hamming distance (AVX-512 VPOPCNTDQ or NEON `vcntq_u8`) -- **Base17**: 17-dimensional i16 vectors with L1 distance — fits in one AVX-512 load (32 bytes) -- **CAM-PQ**: Product quantization with compiled distance tables for sub-linear search -- **Palette Semiring**: 256×256 distance matrices for O(1) token-level lookups -- **bgz7/bgz17**: Compressed model weight format (201GB BF16 safetensors → 685MB bgz7) - -### Cosine Similarity via Palette Distance (Integer-Only Approximation) - -Traditional cosine similarity requires floating-point: `dot(a,b) / (|a| × |b|)` — three passes over the data plus a division. We replace this with a single u8 table lookup that emulates cosine at two precision tiers: - -**How it works:** High-dimensional vectors are quantized to 256 archetypes. The pairwise distance between any two archetypes is precomputed into a 256×256 u8 distance table. At query time, cosine similarity between two vectors reduces to `table[archetype_a][archetype_b]` — one memory access, no floating point. - -| Precision Tier | Sigma Band | u8 Steps | Max Cosine Error | Speed | -|----------------|------------|----------|-----------------|-------| -| **Foveal** (1/40 σ) | Inner 2.5% | 256 | ±0.004 (0.4%) | **611M lookups/s** | -| **Good** (1/4 σ) | Inner 68% | 256 | ±0.02 (2%) | **611M lookups/s** | -| **Near** (1 σ) | Inner 95% | 64 | ±0.08 (8%) | **2.4B lookups/s** | -| F32 exact cosine | — | — | 0 | ~50M/s (SIMD dot) | - -The key insight: **611 million cosine-equivalent comparisons per second using only integer operations**. This is 12× faster than SIMD f32 dot product because: -1. No FP division (the normalization is baked into the table) -2. No FP multiplication (it's a table read, not arithmetic) -3. The 256×256 table (64KB) fits entirely in L1 cache -4. u8 loads have no alignment constraints - -The Foveal tier at 1/40σ achieves 0.4% maximum error — indistinguishable from exact cosine for nearest-neighbor search, semantic similarity, and clustering. The cascade search architecture uses the Near tier (8% error) to eliminate 99.7% of candidates in the first pass, then refines survivors with the Foveal tier. - -This is the engine behind the **17,000 tok/s** benchmark: each token lookup computes similarity against 4,096 heads using palette distance, not matrix multiplication. 
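+A sketch of the query-time path just described, assuming the 256×256 table has been built offline; `PaletteTable` and `similarity` are illustrative names:
+
+```rust
+/// 256×256 precomputed archetype similarities — 64KB, L1-resident.
+/// (In practice the table would be heap-allocated, e.g. boxed.)
+pub struct PaletteTable {
+    table: [[u8; 256]; 256],
+}
+
+impl PaletteTable {
+    /// Cosine-equivalent similarity between two archetype IDs:
+    /// one memory access, no FP multiply, no FP divide.
+    #[inline]
+    pub fn similarity(&self, a: u8, b: u8) -> u8 {
+        self.table[a as usize][b as usize]
+    }
+}
+```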
- -## Module Inventory - -``` -src/ -├── simd.rs LazyLock tier detection, type re-exports, PREFERRED_LANES -├── simd_avx512.rs 11 SIMD types + BF16 codec + F16 IEEE 754 (2,700 LOC) -├── simd_avx2.rs BLAS L1, Hamming, i8 dot, F16 precision toolkit (1,600 LOC) -├── simd_neon.rs 3-tier ARM NEON: baseline/A72/A76+dotprod+fp16 (500 LOC) -├── simd_amx.rs AMX detection + VNNI dispatch + quantize/dequantize (350 LOC) -├── simd_wasm.rs WebAssembly SIMD scaffolding -├── backend/ -│ ├── native.rs Pure-Rust GEMM microkernels (Goto 6×16/6×8) -│ ├── mkl.rs Intel MKL FFI (feature-gated) -│ └── openblas.rs OpenBLAS FFI (feature-gated) -└── hpc/ 55 modules, 880 tests - ├── blas_level1.rs dot, axpy, scal, nrm2, asum, iamax - ├── blas_level2.rs gemv, ger, symv, trmv, trsv - ├── blas_level3.rs gemm, syrk, trsm, symm (Goto-blocked) - ├── quantized.rs BF16 GEMM, INT8 GEMM, quantize/dequantize - ├── lapack.rs LU, Cholesky, QR factorization - ├── fft.rs Cooley-Tukey radix-2 FFT/IFFT - ├── vml.rs exp, ln, sqrt, erf, cbrt, sin, cos - ├── statistics.rs median, variance, std, percentile, top_k - ├── activations.rs sigmoid, softmax, log_softmax, GELU, SiLU - ├── fingerprint.rs Fingerprint<256> (VSA, Hamming, XOR bind) - ├── bgz17_bridge.rs Base17 encode/decode, L1 distance, sign agreement - ├── cam_pq.rs Product quantization, IVF, distance tables - ├── simd_caps.rs LazyLock SimdCaps + ArmProfile detection - ├── simd_dispatch.rs Frozen function pointer dispatch table - ├── clam.rs CLAM tree (build, search, rho_nn, 46 tests) - ├── blackboard.rs Typed slot arena (zero-copy shared memory) - ├── cascade.rs HDR cascade search (sigma-band filtering) - ├── causal_diff.rs CausalEdge64 (u64 packed), quality scoring - └── ... (37 more: hdc, nars, qualia, styles, bnn, ocr, arrow_bridge) -``` +Fingerprint<256>, Base17 VSA, CAM-PQ, Palette Semiring, bgz7/bgz17 — compressed model weights (201GB → 685MB) with O(1) inference. 
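+A portable sketch of the Fingerprint<256> core operations; the fork's hot paths dispatch to AVX-512 VPOPCNTDQ or NEON `vcntq_u8`, but the semantics reduce to XOR plus popcount over four u64 lanes (names illustrative):
+
+```rust
+#[derive(Clone, Copy, PartialEq, Eq)]
+pub struct Fingerprint256([u64; 4]);
+
+impl Fingerprint256 {
+    /// XOR bind (VSA binding): invertible and distance-preserving.
+    pub fn bind(self, other: Self) -> Self {
+        let mut out = [0u64; 4];
+        for i in 0..4 {
+            out[i] = self.0[i] ^ other.0[i];
+        }
+        Fingerprint256(out)
+    }
+
+    /// Hamming distance: XOR, then popcount each lane.
+    pub fn hamming(self, other: Self) -> u32 {
+        self.0.iter().zip(other.0.iter()).map(|(a, b)| (a ^ b).count_ones()).sum()
+    }
+}
+```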
## Quick Start

```rust
use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;

-// GEMM — automatically uses best available SIMD
let a = Array2::<f32>::ones((1024, 1024));
let b = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&b); // AVX-512 / AVX2 / NEON — zero code changes

-// Check hardware
let caps = simd_caps();
if caps.avx512f { println!("AVX-512: 16 lanes"); }
if caps.neon { println!("ARM: {}", caps.arm_profile().name()); }
```

```bash
-# Build (auto-detects best SIMD)
cargo build --release
-
-# Cross-compile for Raspberry Pi 4
-cargo build --release --target aarch64-unknown-linux-gnu
-
-# Maximum performance on AVX-512 server
-RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release
-
-# Run the 880 HPC tests
-cargo test
+cargo build --release --target aarch64-unknown-linux-gnu # Pi 4
+RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release # AVX-512
+cargo test # 880 HPC tests
```

-## Requirements
-
-- **Rust 1.94 stable** (no nightly, no unstable features)
-- Optional: `gcc-aarch64-linux-gnu` for Pi cross-compilation
-- Optional: Intel MKL or OpenBLAS for BLAS acceleration (feature-gated)
-
## Ecosystem

-This crate is the hardware foundation for a larger architecture:
-
-| Repository | Role | Depends on ndarray for |
-|------------|------|----------------------|
+| Repository | Role | Uses ndarray for |
+|------------|------|-----------------|
| [lance-graph](https://github.com/AdaWorldAPI/lance-graph) | Graph query + codec spine | Fingerprint, CAM-PQ, CLAM, BLAS, ZeckF64 |
| [home-automation-rs](https://github.com/AdaWorldAPI/home-automation-rs) | Smart home + voice AI | Codebook inference, VITS TTS, SIMD audio |
-| [ada-rs](https://github.com/AdaWorldAPI/ada-rs) | Cognitive substrate | 10K-bit VSA, Hamming, perception |

## License

From e6967b90bac41ec5ea77790350aa6c1f68d9d2a6 Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:21:04 +0200 Subject: [PATCH 2/4] Add German README + upstream vs fork ISA comparison table --- README-DE.md | 198 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 198 insertions(+) create mode 100644 README-DE.md diff --git a/README-DE.md b/README-DE.md new file mode 100644 index 00000000..f6b9a043 --- /dev/null +++ b/README-DE.md @@ -0,0 +1,198 @@ +# ndarray — AdaWorldAPI HPC Erweiterung
+
+Ein vollstaendiger Hochleistungs-Numerik-Stack auf Basis von [rust-ndarray/ndarray](https://github.com/rust-ndarray/ndarray). Dieser Fork fuegt 55 HPC-Module mit 880 Tests hinzu: BLAS L1-L3, LAPACK, FFT, Vektormathematik, quantisierte Inferenz und hardware-spezifische SIMD-Kernel von Intel AMX bis Raspberry Pi NEON — alles auf **stabilem Rust 1.94**, null Nightly-Features.
+
+Das Upstream-ndarray liefert exzellente n-dimensionale Array-Abstraktionen. Wir behalten all das und fuegen hinzu, wofuer es nie gedacht war: mit NumPys OpenBLAS bei GEMM konkurrieren, Codebook-Inferenz auf einem 5-Watt Pi 4 laufen lassen, und Halbpraezisions-Gleitkommazahlen verarbeiten, fuer die Rust noch nicht einmal einen stabilen Typ hat.
+
+[English Version](README.md)
+
+## Upstream vs. Fork — Feature-fuer-Feature
+
+Die zentrale Frage: Was genau bekommt man mit diesem Fork, was Upstream nicht hat?
+
+### ISA-Abdeckung (Instruction Set Architecture)
+
+| ISA / Feature | Upstream ndarray | **AdaWorldAPI Fork** | Speedup vs. 
Upstream | +|---------------|-----------------|---------------------|---------------------| +| **AVX-512** (512-bit, 16×f32) | Scalar Fallback | Native `__m512` Typen, F32x16/F64x8/U8x64 | **~8×** | +| **AVX-512 VNNI** (int8 dot) | Scalar Fallback | `vpdpbusd` 64 MACs/Instr + Dispatch | **~32×** | +| **AVX-512 BF16** (bfloat16) | Nicht vorhanden | Hardware `vcvtneps2bf16` + RNE-Emulation | **neu** | +| **AVX-512 VPOPCNTDQ** (popcount) | Scalar Fallback | Native 512-bit Popcount fuer Hamming | **~16×** | +| **AMX** (Tile Matrix, 256 MACs) | Nicht vorhanden | Inline-ASM `.byte` Encoding, stable Rust | **~128×** vs. Scalar | +| **AVX2 + FMA** (256-bit, 8×f32) | Via matrixmultiply | Eigene Goto-GEMM 6×16 + Dispatch-Tabelle | **~4×** | +| **AVX2 F16C** (f16 Hardware) | Nicht vorhanden | IEEE 754 f16, Double-f16, Kahan, Scaler | **neu** | +| **AVX-VNNI** (ymm, 32 MACs) | Nicht vorhanden | Arrow Lake / NUC 14 Unterstuetzung | **neu** | +| **SSE2** (128-bit, 4×f32) | Via matrixmultiply | Scalar Polyfill mit gleicher API | 1× (Baseline) | +| **NEON** (128-bit, 4×f32) | Scalar Fallback | 3-stufig: A53/A72/A76 mit Pipeline-Awareness | **~4×** | +| **NEON dotprod** (ARMv8.2) | Nicht vorhanden | `vdotq_s32` fuer 4× int8 Durchsatz (Pi 5) | **~16×** vs. Scalar | +| **NEON fp16** (ARMv8.2) | Nicht vorhanden | `FCVTL`/`FCVTN` via Inline-ASM | **neu** | +| **NEON Popcount** | Nicht vorhanden | `vcntq_u8` nativer Byte-Popcount | **schneller als x86 SSE** | +| **WASM SIMD128** | Nicht vorhanden | Scaffolding vorbereitet | in Arbeit | + +### BLAS / Numerik + +| Operation | Upstream | **Fork** | Verbesserung | +|-----------|----------|----------|-------------| +| GEMM (1024²) | ~13 GFLOPS (Cache-Cliff) | **139 GFLOPS** (Goto-Blocking) | **10.5×** | +| Dot Product | Via matrixmultiply | 4-fach unrolled + FMA | ~2× | +| BLAS L1 (axpy, scal, nrm2) | Nicht vorhanden | SIMD-beschleunigt, alle Tiers | **neu** | +| BLAS L2 (gemv, ger, trsv) | Nicht vorhanden | SIMD-beschleunigt | **neu** | +| LAPACK (LU, Cholesky, QR) | Nicht vorhanden | Pure-Rust Implementierung | **neu** | +| FFT | Nicht vorhanden | Cooley-Tukey Radix-2 | **neu** | +| Aktivierungen (sigmoid, GELU) | Nicht vorhanden | SIMD F32x16 Vektorisierung | **neu** | +| Quantisierung (BF16, INT8) | Nicht vorhanden | VNNI + AMX + Scalar Fallback | **neu** | + +### Datentypen + +| Typ | Upstream | **Fork** | Anmerkung | +|-----|----------|----------|-----------| +| f32 | Standard | Standard + F32x16 SIMD | Gleich + SIMD-Beschleunigung | +| f64 | Standard | Standard + F64x8 SIMD | Gleich + SIMD-Beschleunigung | +| **f16** (IEEE 754) | **Nicht vorhanden** | u16 Carrier + F16C/FCVTL Hardware | Stable Rust, kein Nightly | +| **BF16** (bfloat16) | **Nicht vorhanden** | Hardware + RNE-Emulation (bit-exakt) | GGUF-Kalibrierung | +| i8/u8 (quantisiert) | Nicht vorhanden | VNNI dot, Hamming, Popcount | INT8 Inferenz | +| i16 (Base17) | Nicht vorhanden | L1-Distanz, SIMD widen/narrow | Codebook-Encoding | + +### Dispatch & Erkennung + +| Aspekt | Upstream | **Fork** | +|--------|----------|----------| +| SIMD-Erkennung | Keine (delegiert an BLAS) | `LazyLock` — einmal erkennen, fuer immer | +| Dispatch-Kosten | Kein eigener Dispatch | **0.3ns** (Funktionszeiger-Tabelle, kein Branch) | +| ARM-Profiling | Kein ARM-Bewusstsein | `ArmProfile`: A53/A72/A76 mit tok/s Schaetzung | +| big.LITTLE | Nicht behandelt | Korrekte Feature-Intersection (RK3399/RK3588) | +| CPU-Erkennung | Zur Laufzeit per Call | Einmal via LazyLock, dann nur Pointer-Deref | + +### Zusammenfassung: Was 
Upstream auf jedem Target macht + +``` +Upstream auf x86_64: → matrixmultiply Crate (extern, AVX2 wenn verfuegbar) +Upstream auf aarch64: → Scalar (kein NEON, kein Intrinsic) +Upstream auf wasm: → Scalar +Upstream auf riscv: → Scalar + +Fork auf x86_64: → AVX-512 F32x16 / AVX2 F32x8 / SSE2 / Scalar (gestuft) +Fork auf aarch64: → NEON A76+dotprod / NEON A72 2×pipe / NEON A53 / Scalar +Fork auf wasm: → WASM SIMD128 (vorbereitet) / Scalar +Fork auf riscv: → Scalar (RISC-V V Extension vorbereitet) +``` + +## Leistung + +### GEMM (Allgemeine Matrixmultiplikation) + +| Matrixgroesse | Upstream ndarray | **Dieser Fork** | NumPy (OpenBLAS) | PyTorch CPU | GPU (RTX 3060) | +|--------------|-----------------|---------------|------------------|-------------|----------------| +| 512×512 | ~20 GFLOPS | **47 GFLOPS** | ~45 GFLOPS | ~40 GFLOPS | ~1.200 GFLOPS | +| 1024×1024 | ~13 GFLOPS | **139 GFLOPS** | ~120 GFLOPS | ~100 GFLOPS | ~3.500 GFLOPS | +| 2048×2048 | ~13 GFLOPS | **~150 GFLOPS** | ~140 GFLOPS | ~130 GFLOPS | ~5.000 GFLOPS | + +Upstream trifft bei 1024×1024 auf eine Cache-Klippe: kein Tiling, kein Threading, kein Microkernel. Unsere Goto-Implementierung eliminiert das vollstaendig. + +### Codebook-Inferenz (Token-Generierung) + +Keine Matrixmultiplikation — O(1) Tabellen-Lookup pro Token. + +| Hardware | ISA | tok/s | 50-Token Latenz | Leistung | +|----------|-----|-------|-----------------|----------| +| Sapphire Rapids | AMX (256 MACs/Instr) | **380.000** | 0,13 ms | 250W | +| Xeon / i9-13900K | AVX-512 VNNI | **10.000–50.000** | 1–5 ms | 150W | +| i7-13800K | AVX2-VNNI | **3.000–10.000** | 5–17 ms | 65W | +| **Raspberry Pi 5** | **NEON + dotprod** | **2.000–5.000** | 10–25 ms | **5W** | +| **Raspberry Pi 4** | **NEON (2× Pipeline)** | **500–2.000** | 25–100 ms | **5W** | +| Pi Zero 2W | NEON (1× Pipeline) | 50–500 | 100–1000 ms | 2W | + +Bei 5 Watt generiert ein Pi 4 eine 50-Token Sprachassistenten-Antwort in unter 100 Millisekunden. + +### Cosine-Aehnlichkeit via Palette-Distanz (nur Integer) + +Traditionelle Cosine-Aehnlichkeit braucht Fliesskomma: `dot(a,b) / (|a| × |b|)`. Wir ersetzen das durch einen einzigen u8-Tabellen-Lookup. + +| Praezisions-Stufe | Sigma-Band | Max Cosine-Fehler | Geschwindigkeit | +|-------------------|------------|-------------------|----------------| +| **Foveal** (1/40 σ) | Innere 2,5% | ±0,004 (0,4%) | **611M Lookups/s** | +| **Gut** (1/4 σ) | Innere 68% | ±0,02 (2%) | **611M Lookups/s** | +| **Nah** (1 σ) | Innere 95% | ±0,08 (8%) | **2,4 Mrd/s** | +| F32 exakte Cosine | — | 0 | ~50M/s | + +**611 Millionen Cosine-aequivalente Vergleiche pro Sekunde mit reinen Integer-Operationen** — 12× schneller als SIMD-f32-Skalarprodukt. Die 256×256 Tabelle (64KB) passt komplett in den L1-Cache. + +### Halbpraezisions-Gewichts-Transkodierung + +Getestet mit 15-Millionen-Parameter-Modell (Piper TTS Groesse): + +| Format | Groesse | Max Fehler | RMSE | Durchsatz | +|--------|---------|-----------|------|-----------| +| f32 (Original) | 60 MB | — | — | — | +| **f16 (IEEE 754)** | **30 MB** | 7,3×10⁻⁶ | 2,5×10⁻⁶ | 94M Params/s | +| **Scaled-f16** | **30 MB** | 4,9×10⁻⁶ | 2,1×10⁻⁶ | 91M Params/s | +| **Double-f16** | 60 MB | 5,7×10⁻⁸ | 1,8×10⁻⁸ | 42M Params/s | + +## Was wir bauen, das sonst niemand hat + +### 1. Vollstaendiger SIMD-Polyfill auf stabilem Rust + +`std::simd` ist seit Jahren Nightly-only. Wir implementieren dieselbe Typ-Oberflaeche mit stabilen `core::arch` Intrinsics. Wenn `std::simd` stabilisiert wird, aendert der Consumer eine `use`-Zeile. + +### 2. 
Halbpraezisions-Typen ohne Nightly

Rusts `f16`-Typ ist Nightly-only. Wir nutzen `u16` als Traeger + Hardware-Instruktionen via stabiles `#[target_feature]` (F16C auf x86, `FCVTL`/`FCVTN` via Inline-`asm!()` auf ARM).

### 3. AMX auf stabilem Rust

Intel AMX Intrinsics sind Nightly-only. Wir emittieren Instruktionen direkt via `asm!(".byte ...")` — 256 MACs pro Instruktion, verifiziert auf Rust 1.94 stable.

### 4. Gestuftes ARM NEON fuer Einplatinen-Computer

Drei Stufen mit Laufzeit-Erkennung: A53 Baseline (Pi Zero/3), A72 Fast (Pi 4, Dual-Pipeline), A76 DotProd (Pi 5, `vdotq_s32` + natives fp16). big.LITTLE-bewusst.

### 5. Eingefrorener Dispatch (0,3ns pro Aufruf)

Funktionszeiger-Tabelle statt Branch pro Aufruf. `LazyLock` → ein indirekter Call, kein Atomic, kein Branch-Prediction-Miss.

### 6. BF16 RNE bit-exakt mit Hardware

Pure AVX-512-F Emulation von `VCVTNEPS2BF16`, verifiziert Bit-fuer-Bit auf 1M+ Eingaben.

### 7. Kognitiver Codec-Stack

Fingerprint<256>, Base17 VSA, CAM-PQ, Palette-Semiring, bgz7/bgz17 — komprimierte Modellgewichte (201GB → 685MB) mit O(1) Inferenz.

## Schnellstart

```rust
use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;

let a = Array2::<f32>::ones((1024, 1024));
let b = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&b); // AVX-512 / AVX2 / NEON — null Code-Aenderungen

let caps = simd_caps();
if caps.avx512f { println!("AVX-512: 16 Lanes"); }
if caps.neon { println!("ARM: {}", caps.arm_profile().name()); }
```

```bash
cargo build --release
cargo build --release --target aarch64-unknown-linux-gnu # Pi 4
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release # AVX-512
cargo test # 880 HPC Tests
```

## Voraussetzungen

- **Rust 1.94 stable** (kein Nightly, keine instabilen Features)
- Optional: `gcc-aarch64-linux-gnu` fuer Pi Cross-Kompilierung
- Optional: Intel MKL oder OpenBLAS fuer BLAS-Beschleunigung (Feature-gated)

## Oekosystem

| Repository | Rolle | Nutzt ndarray fuer |
|------------|-------|-------------------|
| [lance-graph](https://github.com/AdaWorldAPI/lance-graph) | Graph-Query + Codec-Spine | Fingerprint, CAM-PQ, CLAM, BLAS, ZeckF64 |
| [home-automation-rs](https://github.com/AdaWorldAPI/home-automation-rs) | Smart Home + Sprach-KI | Codebook-Inferenz, VITS TTS, SIMD Audio |

## Lizenz

MIT OR Apache-2.0 (wie Upstream ndarray)

From e38969f6872027bb43c63644de827f3ab223d7fb Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:22:05 +0200 Subject: [PATCH 3/4] README.md: add upstream vs fork ISA comparison table --- README.md | 85 ++++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 69 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 502e187b..ec7d2298 100644 --- a/README.md +++ b/README.md @@ -6,26 +6,79 @@ The upstream ndarray provides excellent n-dimensional array abstractions. We kee

[Deutsche Version / German Version](README-DE.md)

-## Core Architecture
+## Upstream vs. Fork — Feature by Feature
+
+### ISA Coverage (Instruction Set Architecture)
+
+| ISA / Feature | Upstream ndarray | **AdaWorldAPI Fork** | Speedup vs. 
Upstream | +|---------------|-----------------|---------------------|---------------------| +| **AVX-512** (512-bit, 16×f32) | Scalar fallback | Native `__m512` types, F32x16/F64x8/U8x64 | **~8×** | +| **AVX-512 VNNI** (int8 dot) | Scalar fallback | `vpdpbusd` 64 MACs/instr + dispatch | **~32×** | +| **AVX-512 BF16** (bfloat16) | Not available | Hardware `vcvtneps2bf16` + RNE emulation | **new** | +| **AVX-512 VPOPCNTDQ** (popcount) | Scalar fallback | Native 512-bit popcount for Hamming | **~16×** | +| **AMX** (Tile Matrix, 256 MACs) | Not available | Inline asm `.byte` encoding, stable Rust | **~128×** vs. scalar | +| **AVX2 + FMA** (256-bit, 8×f32) | Via matrixmultiply | Own Goto-GEMM 6×16 + dispatch table | **~4×** | +| **AVX2 F16C** (f16 hardware) | Not available | IEEE 754 f16, Double-f16, Kahan, Scaler | **new** | +| **AVX-VNNI** (ymm, 32 MACs) | Not available | Arrow Lake / NUC 14 support | **new** | +| **SSE2** (128-bit, 4×f32) | Via matrixmultiply | Scalar polyfill with same API | 1× (baseline) | +| **NEON** (128-bit, 4×f32) | Scalar fallback | 3-tier: A53/A72/A76 with pipeline awareness | **~4×** | +| **NEON dotprod** (ARMv8.2) | Not available | `vdotq_s32` for 4× int8 throughput (Pi 5) | **~16×** vs. scalar | +| **NEON fp16** (ARMv8.2) | Not available | `FCVTL`/`FCVTN` via inline asm | **new** | +| **NEON Popcount** | Not available | `vcntq_u8` native byte popcount | **faster than x86 SSE** | +| **WASM SIMD128** | Not available | Scaffolding prepared | in progress | + +### BLAS / Numerics + +| Operation | Upstream | **Fork** | Improvement | +|-----------|----------|----------|-------------| +| GEMM (1024²) | ~13 GFLOPS (cache cliff) | **139 GFLOPS** (Goto blocking) | **10.5×** | +| Dot Product | Via matrixmultiply | 4× unrolled + FMA | ~2× | +| BLAS L1 (axpy, scal, nrm2) | Not available | SIMD-accelerated, all tiers | **new** | +| BLAS L2 (gemv, ger, trsv) | Not available | SIMD-accelerated | **new** | +| LAPACK (LU, Cholesky, QR) | Not available | Pure-Rust implementation | **new** | +| FFT | Not available | Cooley-Tukey radix-2 | **new** | +| Activations (sigmoid, GELU) | Not available | SIMD F32x16 vectorization | **new** | +| Quantization (BF16, INT8) | Not available | VNNI + AMX + scalar fallback | **new** | + +### Data Types + +| Type | Upstream | **Fork** | Note | +|------|----------|----------|------| +| f32 | Standard | Standard + F32x16 SIMD | Same + SIMD acceleration | +| f64 | Standard | Standard + F64x8 SIMD | Same + SIMD acceleration | +| **f16** (IEEE 754) | **Not available** | u16 carrier + F16C/FCVTL hardware | Stable Rust, no nightly | +| **BF16** (bfloat16) | **Not available** | Hardware + RNE emulation (bit-exact) | GGUF calibration | +| i8/u8 (quantized) | Not available | VNNI dot, Hamming, popcount | INT8 inference | +| i16 (Base17) | Not available | L1 distance, SIMD widen/narrow | Codebook encoding | + +### Dispatch and Detection + +| Aspect | Upstream | **Fork** | +|--------|----------|----------| +| SIMD detection | None (delegates to BLAS) | `LazyLock` — detect once, forever | +| Dispatch cost | No own dispatch | **0.3ns** (fn pointer table, no branch) | +| ARM profiling | No ARM awareness | `ArmProfile`: A53/A72/A76 with tok/s estimate | +| big.LITTLE | Not handled | Correct feature intersection (RK3399/RK3588) | +| CPU detection | Per-call runtime | Once via LazyLock, then pointer deref only | + +### What Upstream Does on Each Target -The expansion comprises five layers built on top of upstream's array primitives: - -**SIMD Polyfill Layer** 
(`src/simd.rs`, `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`) provides `std::simd`-compatible types — `F32x16`, `F64x8`, `U8x64`, `I32x16`, `I64x8`, `U32x16`, `U64x8` with full operator overloading, reductions, comparisons, and masked operations — backed by `core::arch` intrinsics on x86 and inline assembly on ARM. Consumers write `crate::simd::F32x16` and get native 512-bit operations on AVX-512, 256-bit on AVX2, 128-bit on NEON, or scalar fallback, with zero code changes. Detection happens once via `LazyLock` (one pointer deref per call, no atomics, no branch prediction misses). - -**Backend Layer** (`src/backend/`) implements pluggable BLAS through the `BlasFloat` trait with three backends: pure-Rust SIMD microkernels (default, zero dependencies), Intel MKL FFI (feature-gated), and OpenBLAS FFI (feature-gated, mutually exclusive with MKL). The native backend uses Goto-algorithm cache-blocked GEMM with 6×16 (f32) and 6×8 (f64) microkernels, achieving 139 GFLOPS at 1024×1024 — a 10.5× improvement over the naive approach and within 15% of NumPy's multi-threaded OpenBLAS. - -**HPC Module Library** (`src/hpc/`, 55 modules) delivers a complete numerical computing surface: BLAS Level 1-3 (dot, axpy, gemv, gemm, syrk, trsm), LAPACK factorizations (LU, Cholesky, QR), Cooley-Tukey FFT, vector math (exp, ln, sqrt, erf, trigonometric), statistics (median, variance, percentile, top-k), neural network activations (sigmoid, softmax, GELU, SiLU), and quantized operations (BF16 GEMM, INT8 GEMM via VNNI). Every module has SIMD-accelerated hot paths that dispatch through the frozen function pointer table. - -**Codec Layer** (`src/hpc/fingerprint.rs`, `bgz17_bridge.rs`, `cam_pq.rs`) implements the encoding stack for compressed inference: 16Kbit Fingerprints, Base17 VSA (17-dimensional i16 vectors), CAM-PQ product quantization, ZeckF64 Fibonacci encoding, and palette semiring distance matrices. This is what makes codebook inference O(1) per token — table lookups replace matrix multiplication. - -**Burn Integration** (`crates/burn/`) provides a SIMD-augmented burn-ndarray backend that wires `crate::simd::F32x16` into burn's tensor operations and activations, replacing macerator's SIMD with our LazyLock-dispatched implementations. This enables using burn's model format and autodiff while benefiting from our full SIMD stack. +``` +Upstream on x86_64: → matrixmultiply crate (external, AVX2 if available) +Upstream on aarch64: → Scalar (no NEON, no intrinsics) +Upstream on wasm: → Scalar +Upstream on riscv: → Scalar + +Fork on x86_64: → AVX-512 F32x16 / AVX2 F32x8 / SSE2 / Scalar (tiered) +Fork on aarch64: → NEON A76+dotprod / NEON A72 2×pipe / NEON A53 / Scalar +Fork on wasm: → WASM SIMD128 (prepared) / Scalar +Fork on riscv: → Scalar (RISC-V V Extension prepared) +``` ## Performance ### GEMM (General Matrix Multiply) -The Goto-algorithm GEMM with cache blocking (L1: 32KB, L2: 256KB, L3: shared) and 16-thread parallelism via split-borrow (no mutex contention): - | Matrix Size | Upstream ndarray | **This Fork** | NumPy (OpenBLAS) | PyTorch CPU | GPU (RTX 3060) | |-------------|-----------------|---------------|------------------|-------------|----------------| | 512×512 | ~20 GFLOPS | **47 GFLOPS** | ~45 GFLOPS | ~40 GFLOPS | ~1,200 GFLOPS | @@ -36,7 +89,7 @@ Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkern ### Codebook Inference (Token Generation) -This is not matrix multiplication. 
Codebook inference replaces `y = W·x` with `y = codebook[index[x]]` — an O(1) table lookup per token. No GPU required. +Not matrix multiplication — O(1) table lookup per token. No GPU required. | Hardware | ISA | tok/s | 50-Token Latency | Power | |----------|-----|-------|------------------|-------| @@ -51,7 +104,7 @@ At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 mi ### Cosine Similarity via Palette Distance (Integer-Only) -Traditional cosine requires floating-point: `dot(a,b) / (|a| × |b|)`. We replace this with a single u8 table lookup. High-dimensional vectors are quantized to 256 archetypes; pairwise distance is precomputed into a 256×256 u8 table. Query-time similarity: `table[a][b]` — one memory access, no floating point. +Traditional cosine requires floating-point: `dot(a,b) / (|a| × |b|)`. We replace this with a single u8 table lookup. | Precision Tier | Sigma Band | Max Cosine Error | Speed | |----------------|------------|-----------------|-------| From 5c8b3f45360aad3902f0aad4ee1e9a7c1e17432f Mon Sep 17 00:00:00 2001 From: AdaWorldAPI Date: Mon, 13 Apr 2026 13:26:47 +0200 Subject: [PATCH 4/4] Add complete feature comparison: upstream ndarray vs AdaWorldAPI fork (80K LOC, 146 files) --- COMPARISON.md | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 212 insertions(+) create mode 100644 COMPARISON.md diff --git a/COMPARISON.md b/COMPARISON.md new file mode 100644 index 00000000..4cce0a30 --- /dev/null +++ b/COMPARISON.md @@ -0,0 +1,212 @@ +# Complete Feature Comparison: rust-ndarray vs. AdaWorldAPI Fork + +> 80,131 lines of new code across 146 HPC modules, 6 SIMD files, 5 backend files, 20 burn ops, and 2 subcrates. + +## At a Glance + +| Metric | Upstream [rust-ndarray/ndarray](https://github.com/rust-ndarray/ndarray) | **[AdaWorldAPI/ndarray](https://github.com/AdaWorldAPI/ndarray)** | +|--------|-----------|------------| +| Base functionality | n-dimensional arrays, slicing, views | **Same** (full upstream preserved) | +| New LOC added | — | **80,131** | +| New files | — | **179** (146 HPC + 6 SIMD + 5 backend + 20 burn + 2 subcrates) | +| Test count | ~300 | **~1,180** (300 upstream + 880 new) | +| SIMD ISAs | SSE2 via matrixmultiply (external) | **7 ISAs**: AVX-512, AVX2, SSE2, AMX, VNNI, NEON (3 tiers), WASM | +| Numeric types | f32, f64 | **+f16, BF16, i8, u8, i16** (all with SIMD paths) | +| BLAS coverage | dot (via matrixmultiply) | **Full L1 + L2 + L3** (pure-Rust + MKL + OpenBLAS) | +| Target platforms | x86_64 (via external BLAS), scalar everywhere else | **x86_64 (tiered), aarch64 (3-tier NEON), wasm (prepared)** | +| Minimum Rust | 1.64 | **1.94 stable** (no nightly) | + +--- + +## SIMD Layer (6 files, ~5,700 LOC) + +| Component | Upstream | **Fork** | +|-----------|----------|----------| +| `simd.rs` — dispatch + re-exports | Not present | **LazyLock tier detection, PREFERRED_LANES, type re-exports** | +| `simd_avx512.rs` — 512-bit types | Not present | **11 types: F32x16, F64x8, U8x64, I32x16, I64x8, U32x16, U64x8, F32x8, F64x4, BF16x16, BF16x8 + F16 IEEE 754** (2,700 LOC) | +| `simd_avx2.rs` — 256-bit ops | Not present | **BLAS L1, Hamming, i8 dot, popcount, F16 precision toolkit** (1,600 LOC) | +| `simd_neon.rs` — ARM 128-bit | Not present | **3-tier NEON: A53 baseline, A72 dual-pipe, A76 dotprod+fp16; codebook gather, Hamming, Base17 L1** (500 LOC) | +| `simd_amx.rs` — Intel tile matrix | Not present | **AMX detection (CPUID+XCR0), VNNI 512/256, MatVec dispatch, quantize/dequantize** (350 LOC) | +| 
`simd_wasm.rs` — WebAssembly | Not present | **Scaffolding for WASM SIMD128** | + +## Backend Layer (5 files, ~2,000 LOC) + +| Component | Upstream | **Fork** | +|-----------|----------|----------| +| `backend/mod.rs` — BlasFloat trait | Not present | **Trait-based dispatch: Native / MKL / OpenBLAS** | +| `backend/native.rs` — pure-Rust GEMM | Not present | **Goto-algorithm 6x16/6x8 microkernels, cache-blocked (L1/L2/L3), AVX-512+AVX2 dispatch** | +| `backend/kernels_avx512.rs` | Not present | **AVX-512 SIMD GEMM kernels** | +| `backend/mkl.rs` | Not present | **Intel MKL FFI (feature = "intel-mkl")** | +| `backend/openblas.rs` | Not present | **OpenBLAS FFI (feature = "openblas")** | +| GEMM throughput (1024x1024) | ~13 GFLOPS (via matrixmultiply) | **139 GFLOPS** (10.5x improvement) | + +## HPC Module Library (146 files, ~70,000 LOC, 880 tests) + +### Linear Algebra (BLAS + LAPACK) + +| Module | Upstream | **Fork** | Operations | +|--------|----------|----------|------------| +| `blas_level1.rs` | dot only (external) | **Full** | dot, axpy, scal, nrm2, asum, iamax, Givens rotation | +| `blas_level2.rs` | Not present | **Full** | gemv, ger, symv, trmv, trsv | +| `blas_level3.rs` | dot→gemm (external) | **Goto GEMM** | gemm, syrk, trsm, symm (cache-blocked, multithreaded) | +| `quantized.rs` | Not present | **New** | BF16 GEMM, INT8 GEMM, quantize/dequantize | +| `lapack.rs` | Not present | **New** | LU, Cholesky, QR factorization | + +### Signal Processing + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `fft.rs` | Not present | **Cooley-Tukey** | Radix-2 FFT/IFFT, in-place | +| `vml.rs` | Not present | **Vector Math** | exp, ln, sqrt, erf, cbrt, sin, cos (SIMD F32x16 paths) | +| `statistics.rs` | Not present | **Statistics** | median, variance, std, percentile, top_k | +| `activations.rs` | Not present | **Neural Net** | sigmoid, softmax, log_softmax, GELU, SiLU (fused SIMD) | + +### Hardware Detection + Dispatch + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `simd_caps.rs` | Not present | **SimdCaps** | LazyLock detection: AVX-512/AVX2/SSE2/FMA/NEON/dotprod/fp16/aes/sha2/crc32 + **ArmProfile** (A53/A72/A76) | +| `simd_dispatch.rs` | Not present | **SimdDispatch** | Frozen fn-pointer table: 0.3ns per call, no branch, no atomic | +| `amx_matmul.rs` | Not present | **AMX MatMul** | Tile configuration, TDPBUSD via inline asm | + +### Encoding + Codec (Cognitive Computing) + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `fingerprint.rs` | Not present | **Fingerprint\<256\>** | 256-bit VSA, XOR bind, Hamming distance (VPOPCNTDQ / vcntq_u8) | +| `bgz17_bridge.rs` | Not present | **Base17** | 17-dim i16 vectors, L1 distance, sign agreement, xor_bind | +| `cam_pq.rs` | Not present | **CAM-PQ** | Product quantization, compiled distance tables, IVF index | +| `cam_index.rs` | Not present | **CAM Index** | Inverted file index for PQ search | +| `palette_codec.rs` | Not present | **Palette Codec** | 4-bit palette encoding, Minecraft-style chunk compression | +| `palette_distance.rs` | Not present | **Palette Distance** | 256x256 u8 distance tables, cosine emulation (611M/s) | +| `zeck.rs` | Not present | **ZeckF64** | Fibonacci/Zeckendorf encoding for sparse representations | +| `packed.rs` | Not present | **Packed DB** | 64-byte aligned packed storage for SIMD access | +| `prefilter.rs` | Not present | **INT8 Prefilter** | Approximate statistics for cascade search pruning 
| + +### Byte-Level + Spatial Operations + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `byte_scan.rs` | Not present | **Byte Scan** | AVX-512 byte_find_all/byte_count (VPCMPEQB + KMOV) | +| `nibble.rs` | Not present | **Nibble Ops** | 4-bit unpack/threshold (AVX2 vpshufb) | +| `distance.rs` | Not present | **3D Distance** | Squared distance (AVX2 batch) | +| `spatial_hash.rs` | Not present | **Spatial Hash** | Batch radius query (AVX2 accelerated) | +| `aabb.rs` | Not present | **AABB** | Axis-aligned bounding box intersection | +| `bitwise.rs` | Not present | **Bitwise** | XOR, AND, OR, popcount on 8KB+ vectors | + +### Search + Trees + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `clam.rs` | Not present | **CLAM Tree** | Build + search + rho_nn (46 tests) | +| `clam_search.rs` | Not present | **CLAM Search** | k-NN and range search on CLAM index | +| `clam_compress.rs` | Not present | **CLAM Compress** | Index compression for storage | +| `cascade.rs` | Not present | **HDR Cascade** | Sigma-band filtering, ranked hits, drift detection | +| `parallel_search.rs` | Not present | **Parallel Search** | Multi-threaded CLAM search | +| `dn_tree.rs` | Not present | **DN Tree** | Hierarchical path resolution | +| `merkle_tree.rs` | Not present | **Merkle Tree** | Hash-based integrity verification | + +### Model Inference + AI + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `gguf.rs` | Not present | **GGUF Reader** | GGUF format parser (LLaMA, Qwen, Gemma) | +| `gguf_indexer.rs` | Not present | **GGUF Indexer** | Build bgz7 codebook index from GGUF weights | +| `safetensors.rs` | Not present | **Safetensors** | HuggingFace safetensors reader | +| `gpt2/` (4 files) | Not present | **GPT-2** | Inference engine (weights, layers, API) | +| `openchat/` (4 files) | Not present | **OpenChat** | Inference engine for OpenChat models | +| `stable_diffusion/` (7 files) | Not present | **Stable Diffusion** | CLIP, UNet, VAE, scheduler (image generation) | +| `models/` (5 files) | Not present | **Model Router** | Multi-model router, safetensors loader, layer abstractions | +| `jina/` (5 files) | Not present | **Jina v5** | Embedding cache, causal attention, codec, runtime | + +### Cognitive Primitives + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `nars.rs` | Not present | **NARS** | Non-Axiomatic Reasoning System inference | +| `qualia.rs` | Not present | **Qualia** | Felt-sense quality encoding | +| `qualia_gate.rs` | Not present | **Qualia Gate** | Gated operations on quality values | +| `hdc.rs` | Not present | **HDC** | Hyperdimensional Computing primitives | +| `vsa.rs` | Not present | **VSA** | Vector Symbolic Architecture operations | +| `spo_bundle.rs` | Not present | **SPO Bundle** | Subject-Predicate-Object triple encoding | +| `causality.rs` | Not present | **Causality** | Causal graph operations | +| `causal_diff.rs` | Not present | **CausalEdge64** | u64-packed causal edges, quality scoring | +| `bf16_truth.rs` | Not present | **BF16 Truth** | Truth values in BF16 precision | +| `styles/` (34 files) | Not present | **Thinking Styles** | 34 cognitive primitives: rte, htd, smad, tcp, irs, mcp, tca, cdt, mct, lsi, pso, cdi, cws, are, tcf, ssr, etd, amp, zcf, hpm, cur, mpc, ssam, idr, spp, icr, sdd, dtmf, hkf | +| `blackboard.rs` | Not present | **Blackboard** | Typed slot arena (zero-copy shared memory) | +| `node.rs` | Not 
present | **Node** | Cognitive node representation | +| `plane.rs` | Not present | **Plane** | 16Kbit representation plane | +| `seal.rs` | Not present | **Seal** | Immutable snapshot encoding | +| `substrate.rs` | Not present | **Substrate** | Cognitive substrate operations | +| `binding_matrix.rs` | Not present | **Binding Matrix** | 3D permutation binding | +| `cyclic_bundle.rs` | Not present | **Cyclic Bundle** | Cyclic vector bundling | + +### JIT Compilation + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `jitson/` (8 files) | Not present | **JITSON** | JSON parser + validator + template + scan pipeline | +| `jitson_cranelift/` (6 files) | Not present | **Cranelift JIT** | AVX-512 kernel compilation via Cranelift (feature-gated) | + +### Audio / OCR / Media + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `holo.rs` | Not present | **Holographic** | Holographic reduced representations, cosine carriers | +| `ocr_felt.rs` | Not present | **OCR** | Character recognition via felt-sense matching | +| `ocr_simd.rs` | Not present | **OCR SIMD** | SIMD-accelerated binarization, Otsu threshold, density | +| `surround_metadata.rs` | Not present | **Surround** | Spatial audio metadata | +| `crystal_encoder.rs` | Not present | **Crystal** | Crystal symmetry encoding | + +### Miscellaneous + +| Module | Upstream | **Fork** | Detail | +|--------|----------|----------|--------| +| `arrow_bridge.rs` | Not present | **Arrow** | Apache Arrow zero-copy bridge | +| `bnn.rs` | Not present | **BNN** | Binary Neural Network operations | +| `bnn_causal_trajectory.rs` | Not present | **BNN Causal** | Causal trajectory tracking | +| `bnn_cross_plane.rs` | Not present | **BNN Cross-Plane** | Cross-plane BNN operations | +| `cogrecord.rs` | Not present | **CogRecord** | 4×16KB cognitive record unit | +| `compression_curves.rs` | Not present | **Compression** | Rate-distortion curve analysis | +| `graph.rs` | Not present | **Graph** | Basic graph operations | +| `heel_f64x8.rs` | Not present | **F64x8 Kernels** | SIMD dot product, cosine similarity | +| `http_reader.rs` | Not present | **HTTP Reader** | Stream weights from HTTP | +| `kernels.rs` | Not present | **SIMD Kernels** | Generic SIMD apply/map/reduce | +| `layered_distance.rs` | Not present | **Layered Distance** | Multi-layer distance computation | +| `organic.rs` | Not present | **Organic** | Organic growth patterns | +| `p64_bridge.rs` | Not present | **P64 Bridge** | Palette64 convergence point (ndarray <-> lance-graph) | +| `projection.rs` | Not present | **Projection** | Dimensionality reduction | +| `property_mask.rs` | Not present | **Property Mask** | Bitwise property filtering | +| `tekamolo.rs` | Not present | **Tekamolo** | Syntactic position encoding | +| `udf_kernels.rs` | Not present | **UDF Kernels** | User-defined function dispatch | +| `deepnsm.rs` | Not present | **DeepNSM** | Distributional semantic bridge | + +## Subcrates (2 crates) + +| Crate | Upstream | **Fork** | Detail | +|-------|----------|----------|--------| +| `crates/p64` | Not present | **P64** | Palette64 data structure — convergence highway between ndarray and lance-graph | +| `crates/phyllotactic-manifold` | Not present | **Phyllotactic Manifold** | Golden-angle spiral geometry for uniform point distribution | + +## Burn Backend (20 ops files) + +| Component | Upstream | **Fork** | Detail | +|-----------|----------|----------|--------| +| `crates/burn/` | Not present | 
**burn-ndarray** | SIMD-augmented burn backend (from tracel-ai/burn v0.21.0) | +| `ops/tensor.rs` | — | **try_vml_unary** | Routes f32 unary ops through ndarray hpc::vml (F32x16 SIMD) | +| `ops/activation.rs` | — | **Fused sigmoid** | SIMD-accelerated activation functions | +| `ops/matmul.rs` | — | **GEMM dispatch** | Routes to our Goto-algorithm GEMM | +| Remaining 17 ops files | — | **Standard burn ops** | conv, pooling, interpolate, quantization, etc. | + +## Summary + +| Category | Upstream Count | **Fork Count** | New | +|----------|---------------|----------------|-----| +| SIMD type files | 0 | 6 | +6 | +| Backend files | 0 | 5 | +5 | +| HPC modules | 0 | 146 | +146 | +| Burn ops | 0 | 20 | +20 | +| Subcrates | 0 | 2 | +2 | +| **Total new files** | — | — | **179** | +| **Total new LOC** | — | — | **80,131** | +| **Total new tests** | — | — | **~880** |
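## Appendix: The Frozen-Dispatch Pattern

The dispatch rows above cite 0.3ns per call through a frozen function-pointer table. A minimal sketch of that pattern on stable Rust — simplified, with a scalar stand-in where the real table installs SIMD kernels:

```rust
use std::sync::LazyLock;

#[derive(Clone, Copy)]
struct Dispatch {
    dot_f32: fn(&[f32], &[f32]) -> f32,
}

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

static DISPATCH: LazyLock<Dispatch> = LazyLock::new(|| {
    // Feature detection runs once, at first access; the table never changes.
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        // A real table would install an AVX2/AVX-512 kernel here.
        return Dispatch { dot_f32: dot_scalar };
    }
    Dispatch { dot_f32: dot_scalar }
});

pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    (DISPATCH.dot_f32)(a, b) // one pointer deref + one indirect call
}
```

After initialization the indirect call target is stable, so the branch predictor treats it as effectively free — the mechanism behind the per-call figures quoted in the tables above.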