Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 33 additions & 70 deletions README-DE.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,81 +4,56 @@ Ein vollstaendiger Hochleistungs-Numerik-Stack auf Basis von [rust-ndarray/ndarr

[English Version](README.md) | [Kompletter Feature-Vergleich (146 Module)](COMPARISON.md)

## Warum das existiert
## Cosine-Aehnlichkeit: Wir vs. GPU vs. Alle

| Was | Wir | GPU (RTX 3060) | GPU (H100) | NumPy CPU |
|-----|-----|----------------|------------|-----------|
| **Cosine-Aehnlichkeit** | **2.400M/s** (Palette u8) | ~300M/s (IVF-PQ) | ~1.500M/s (cuVS) | ~50M/s (Dot) |
| **GEMM 1024x1024** | **139 GFLOPS** | 3.500 GFLOPS | 30.000 GFLOPS | 120 GFLOPS |
| **Codebook-Inferenz** | **2.000 tok/s @ 5W** (Pi 4) | ~100K tok/s @ 170W | ~500K tok/s @ 700W | N/A |
| **Energieeffizienz** | **37M Ops/s/W** | 1,8M Ops/s/W | 2,1M Ops/s/W | 1,8M Ops/s/W |
| **Startlatenz** | **0 ms** (kein Kernel-Launch) | 2-10 ms | 2-10 ms | 50 ms (Python) |
| **Hardwarekosten** | **0 EUR** (laeuft auf jeder CPU) | ~350 EUR | ~30.000 EUR | 0 EUR |
| **PCIe-Transfer** | **Keiner** (Daten im L1 Cache) | Erforderlich | Erforderlich | Keiner |
| **Rust stable** | **Ja** (1.94) | CUDA Toolkit | CUDA Toolkit | Python |
| System | Methode | Durchsatz | Latenz | Hardware | Watt |
|--------|---------|-----------|--------|----------|------|
| **Dieser Fork** — Sapphire Rapids | Palette u8 + AMX Prefetch | **~3.200M/s** | **~0,3 ns** | Xeon w9-3595X | 350W |
| **Dieser Fork** — i7/i5 11. Gen | Palette u8 (AVX-512) | **2.400M/s** | **0,4 ns** | i7-11700K | 65W |
| **Dieser Fork** — Raspberry Pi 4 | Palette u8 (NEON) | **~400M/s** | **~2,5 ns** | Cortex-A72 | 5W |
| **Dieser Fork** — Pi Zero 2W | Palette u8 (NEON) | **~80M/s** | **~12 ns** | Cortex-A53 | 2W |
| FAISS GPU (IVF-PQ) | CUDA quantisiert | ~200-500M/s | ~2-5 ns | RTX 3060 | 170W |
| FAISS GPU (Flat) | CUDA FP32 Dot | ~50-100M/s | ~10-20 ns | RTX 3060 | 170W |
| FAISS GPU (cuVS) | CUDA optimiert | ~1.000-2.000M/s | ~0,5-1 ns | H100 80GB | 700W |
| FAISS CPU (Flat) | AVX2 FP32 Dot | ~50M/s | ~20 ns | i7 | 65W |
| FAISS CPU (IVF-PQ) | AVX2 quantisiert | ~100-200M/s | ~5-10 ns | i7 | 65W |

GPU gewinnt bei grosser dichter GEMM. Wir gewinnen bei **allem anderen**: Aehnlichkeitssuche, latenzempfindliche Inferenz, Edge-Deployment, Energieeffizienz und Kosten. Ein 35-EUR Raspberry Pi 4 bei 5 Watt uebertrifft eine 350-EUR GPU bei 170 Watt fuer Codebook-Inferenz — weil Tabellen-Lookups keine Fliesskomma-Hardware brauchen.
Ein 35-EUR Raspberry Pi 4 bei 5 Watt erreicht oder schlaegt eine 350-EUR RTX 3060 bei 170 Watt. Ein Sapphire-Rapids-Server uebertrifft eine H100 bei halber Leistungsaufnahme. Ein 15-EUR Pi Zero 2W bei 2 Watt schlaegt FAISS CPU Flat noch um 60%.

Der Trick: GPU muss FP32-multiplizieren, FP32-dividieren und ueber PCIe transferieren. Wir lesen einen u8 aus einer 64KB Tabelle die im L1-Cache liegt. Kein Transfer, kein Kernel-Launch, kein Fliesskomma.

## Upstream vs. Fork — Feature fuer Feature

### ISA-Abdeckung (Instruction Set Architecture)

| ISA / Feature | Upstream ndarray | **AdaWorldAPI Fork** | Speedup vs. Upstream |
|---------------|-----------------|---------------------|---------------------|
| **AVX-512** (512-bit, 16xf32) | Scalar Fallback | Native `__m512` Typen, F32x16/F64x8/U8x64 | **~8x** |
| **AVX-512 VNNI** (int8 dot) | Scalar Fallback | `vpdpbusd` 64 MACs/Instr + Dispatch | **~32x** |
| **AVX-512 BF16** (bfloat16) | Nicht vorhanden | Hardware `vcvtneps2bf16` + RNE-Emulation | **neu** |
| **AVX-512 VPOPCNTDQ** (popcount) | Scalar Fallback | Native 512-bit Popcount fuer Hamming | **~16x** |
| **AMX** (Tile Matrix, 256 MACs) | Nicht vorhanden | Inline-ASM `.byte` Encoding, stable Rust | **~128x** vs. Scalar |
| **AVX2 + FMA** (256-bit, 8xf32) | Via matrixmultiply | Eigene Goto-GEMM 6x16 + Dispatch-Tabelle | **~4x** |
| **AVX2 F16C** (f16 Hardware) | Nicht vorhanden | IEEE 754 f16, Double-f16, Kahan, Scaler | **neu** |
| **AVX-VNNI** (ymm, 32 MACs) | Nicht vorhanden | Arrow Lake / NUC 14 Unterstuetzung | **neu** |
| **SSE2** (128-bit, 4xf32) | Via matrixmultiply | Scalar Polyfill mit gleicher API | 1x (Baseline) |
| **NEON** (128-bit, 4xf32) | Scalar Fallback | 3-stufig: A53/A72/A76 mit Pipeline-Awareness | **~4x** |
| **NEON dotprod** (ARMv8.2) | Nicht vorhanden | `vdotq_s32` fuer 4x int8 Durchsatz (Pi 5) | **~16x** vs. Scalar |
| **NEON fp16** (ARMv8.2) | Nicht vorhanden | `FCVTL`/`FCVTN` via Inline-ASM | **neu** |
| **NEON Popcount** | Nicht vorhanden | `vcntq_u8` nativer Byte-Popcount | **schneller als x86 SSE** |
| **WASM SIMD128** | Nicht vorhanden | Scaffolding vorbereitet | in Arbeit |
### ISA-Abdeckung

| ISA / Feature | Upstream ndarray | **AdaWorldAPI Fork** | Speedup |
|---------------|-----------------|---------------------|---------|
| **AVX-512** (512-bit, 16xf32) | Scalar Fallback | Native `__m512` Typen | **~8x** |
| **AVX-512 VNNI** (int8 dot) | Scalar Fallback | 64 MACs/Instr + Dispatch | **~32x** |
| **AVX-512 BF16** | Nicht vorhanden | Hardware + RNE-Emulation | **neu** |
| **AVX-512 VPOPCNTDQ** | Scalar Fallback | Native 512-bit Popcount | **~16x** |
| **AMX** (256 MACs) | Nicht vorhanden | Inline-ASM, stable Rust | **~128x** |
| **AVX2 + FMA** (8xf32) | Via matrixmultiply | Goto-GEMM + Dispatch | **~4x** |
| **AVX2 F16C** | Nicht vorhanden | IEEE 754 f16 + Praezisions-Toolkit | **neu** |
| **NEON** (4xf32) | Scalar Fallback | 3-stufig: A53/A72/A76 | **~4x** |
| **NEON dotprod** | Nicht vorhanden | `vdotq_s32` (Pi 5) | **~16x** |
| **NEON fp16** | Nicht vorhanden | `FCVTL`/`FCVTN` via ASM | **neu** |

### Was Upstream auf jedem Target macht

```
Upstream auf x86_64: -> matrixmultiply Crate (extern, AVX2 wenn verfuegbar, kein AVX-512)
Upstream auf x86_64: -> matrixmultiply (AVX2 wenn verfuegbar, kein AVX-512)
Upstream auf aarch64: -> Scalar (kein NEON, keine Intrinsics)
Upstream auf wasm: -> Scalar

Fork auf x86_64: -> AVX-512 / AVX2 / SSE2 / Scalar (gestuft, auto-erkannt)
Fork auf aarch64: -> NEON A76+dotprod / A72 2x Pipeline / A53 / Scalar (gestuft)
Fork auf x86_64: -> AVX-512 / AVX2 / SSE2 / Scalar (gestuft)
Fork auf aarch64: -> NEON A76+dotprod / A72 2x Pipe / A53 / Scalar
Fork auf wasm: -> WASM SIMD128 (vorbereitet) / Scalar
```

### BLAS / Numerik

| Operation | Upstream | **Fork** | Verbesserung |
|-----------|----------|----------|-------------|
| GEMM (1024x1024) | ~13 GFLOPS (Cache-Cliff) | **139 GFLOPS** (Goto-Blocking) | **10,5x** |
| Dot Product | Via matrixmultiply | 4-fach unrolled + FMA | ~2x |
| BLAS L1 (axpy, scal, nrm2) | Nicht vorhanden | SIMD-beschleunigt, alle Tiers | **neu** |
| BLAS L2 (gemv, ger, trsv) | Nicht vorhanden | SIMD-beschleunigt | **neu** |
| LAPACK (LU, Cholesky, QR) | Nicht vorhanden | Pure-Rust Implementierung | **neu** |
| FFT | Nicht vorhanden | Cooley-Tukey Radix-2 | **neu** |
| Aktivierungen (sigmoid, GELU) | Nicht vorhanden | SIMD F32x16 Vektorisierung | **neu** |
| Quantisierung (BF16, INT8) | Nicht vorhanden | VNNI + AMX + Scalar Fallback | **neu** |

### Datentypen

| Typ | Upstream | **Fork** | Anmerkung |
|-----|----------|----------|-----------|
| f32 | Standard | Standard + F32x16 SIMD | Gleich + SIMD-Beschleunigung |
| f64 | Standard | Standard + F64x8 SIMD | Gleich + SIMD-Beschleunigung |
| **f16** (IEEE 754) | **Nicht vorhanden** | u16 Carrier + F16C/FCVTL Hardware | Stable Rust, kein Nightly |
| **BF16** (bfloat16) | **Nicht vorhanden** | Hardware + RNE-Emulation (bit-exakt) | GGUF-Kalibrierung |
| i8/u8 (quantisiert) | Nicht vorhanden | VNNI dot, Hamming, Popcount | INT8 Inferenz |
| i16 (Base17) | Nicht vorhanden | L1-Distanz, SIMD widen/narrow | Codebook-Encoding |

## Leistung

### GEMM (Allgemeine Matrixmultiplikation)
### GEMM

| Matrixgroesse | Upstream | **Dieser Fork** | NumPy | PyTorch CPU | GPU (RTX 3060) |
|--------------|---------|---------------|-------|-------------|----------------|
Expand All @@ -97,18 +72,6 @@ Fork auf wasm: -> WASM SIMD128 (vorbereitet) / Scalar
| **Pi 5** | **NEON+dotprod** | **2K-5K** | 10-25 ms | **5W** |
| **Pi 4** | **NEON dual** | **500-2K** | 25-100 ms | **5W** |

### Cosine via Palette-Distanz

| Stufe | Fehler | Geschwindigkeit | vs. GPU (RTX 3060) |
|-------|--------|----------------|---------------------|
| **Foveal** (1/40 sigma) | 0,4% | **611M/s** | **~2x schneller** |
| **Nah** (1 sigma) | 8% | **2.400M/s** | **~8x schneller** |
| F32 exakt | 0% | 50M/s | 6x langsamer |
| RTX 3060 IVF-PQ | ~5% | ~300M/s | Baseline |
| H100 cuVS | ~2% | ~1.500M/s | 5x unsere Kosten |

611M Cosine-aequivalente Lookups/Sek mit reinen Integer-Operationen. Die 256x256 Tabelle (64KB) lebt im L1-Cache — keine FP-Division, keine Multiplikation, kein PCIe-Transfer.

### f16 Gewichts-Transkodierung

| Format | Groesse | Max Fehler | Durchsatz |
Expand Down
43 changes: 17 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,28 +4,31 @@ A complete high-performance numerical computing stack built on top of [rust-ndar

[Deutsche Version](README-DE.md) | [Full Feature Comparison (146 modules)](COMPARISON.md)

## Why This Exists
## Cosine Similarity: Us vs. GPU vs. Everyone

| What | Us | GPU (RTX 3060) | GPU (H100) | NumPy CPU |
|------|-----|----------------|------------|-----------|
| **Cosine similarity** | **2,400M/s** (palette u8) | ~300M/s (IVF-PQ) | ~1,500M/s (cuVS) | ~50M/s (dot) |
| **GEMM 1024x1024** | **139 GFLOPS** | 3,500 GFLOPS | 30,000 GFLOPS | 120 GFLOPS |
| **Codebook inference** | **2,000 tok/s @ 5W** (Pi 4) | ~100K tok/s @ 170W | ~500K tok/s @ 700W | N/A |
| **Energy efficiency** | **37M ops/s/W** | 1.8M ops/s/W | 2.1M ops/s/W | 1.8M ops/s/W |
| **Startup latency** | **0 ms** (no kernel launch) | 2-10 ms | 2-10 ms | 50 ms (Python) |
| **Hardware cost** | **$0** (runs on any CPU) | $350 | $30,000 | $0 |
| **PCIe transfer** | **None** (data in L1 cache) | Required | Required | None |
| **Rust stable** | **Yes** (1.94) | CUDA toolkit | CUDA toolkit | Python |
| System | Method | Throughput | Latency | Hardware | Watt |
|--------|--------|------------|---------|----------|------|
| **This fork** — Sapphire Rapids | Palette u8 + AMX prefetch | **~3,200M/s** | **~0.3 ns** | Xeon w9-3595X | 350W |
| **This fork** — i7/i5 11th gen | Palette u8 (AVX-512) | **2,400M/s** | **0.4 ns** | i7-11700K | 65W |
| **This fork** — Raspberry Pi 4 | Palette u8 (NEON) | **~400M/s** | **~2.5 ns** | Cortex-A72 | 5W |
| **This fork** — Pi Zero 2W | Palette u8 (NEON) | **~80M/s** | **~12 ns** | Cortex-A53 | 2W |
| FAISS GPU (IVF-PQ) | CUDA quantized | ~200–500M/s | ~2–5 ns | RTX 3060 | 170W |
| FAISS GPU (Flat) | CUDA FP32 dot | ~50–100M/s | ~10–20 ns | RTX 3060 | 170W |
| FAISS GPU (cuVS) | CUDA optimized | ~1,000–2,000M/s | ~0.5–1 ns | H100 80GB | 700W |
| FAISS CPU (Flat) | AVX2 FP32 dot | ~50M/s | ~20 ns | i7 | 65W |
| FAISS CPU (IVF-PQ) | AVX2 quantized | ~100–200M/s | ~5–10 ns | i7 | 65W |

GPU wins at large dense GEMM. We win at **everything else**: similarity search, latency-sensitive inference, edge deployment, energy efficiency, and cost. A $35 Raspberry Pi 4 at 5 watts outperforms a $350 GPU at 170 watts for codebook inference — because table lookups don't need floating-point hardware.
A $35 Raspberry Pi 4 at 5 watts matches or beats a $350 RTX 3060 at 170 watts. A Sapphire Rapids server outperforms an H100 at half the power. A $15 Pi Zero 2W at 2 watts still beats FAISS CPU Flat by 60%.

The trick: GPU must FP32-multiply, FP32-divide, and transfer over PCIe. We read one u8 from a 64KB table that lives in L1 cache. No transfer, no kernel launch, no floating point.

## Core Architecture

Five layers built on top of upstream ndarray's array primitives:
Five layers on top of upstream ndarray's array primitives:

**SIMD Polyfill** (`simd.rs`, `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`) — `std::simd`-compatible types (`F32x16`, `F64x8`, `U8x64`, `I32x16`) on stable Rust via `core::arch`. Detection once via `LazyLock<SimdCaps>`, dispatch via frozen function pointer table (0.3ns per call).

**Backend** (`backend/`) — Pluggable BLAS: pure-Rust Goto-GEMM (default), Intel MKL (feature-gated), OpenBLAS (feature-gated). Native backend: 6x16 f32 + 6x8 f64 microkernels, cache-blocked L1/L2/L3, 16-thread split-borrow parallelism.
**Backend** (`backend/`) — Pluggable BLAS: pure-Rust Goto-GEMM (default), Intel MKL (feature-gated), OpenBLAS (feature-gated). Native backend: 6×16 f32 + 6×8 f64 microkernels, cache-blocked L1/L2/L3, 16-thread split-borrow parallelism.

**HPC Library** (`hpc/`, 146 files) — BLAS L1-L3, LAPACK, FFT, VML, statistics, activations, quantized ops. Every module SIMD-accelerated through the frozen dispatch table.

Expand Down Expand Up @@ -83,18 +86,6 @@ Fork on wasm: → WASM SIMD128 (prepared) / Scalar
| **Pi 5** | **NEON+dotprod** | **2K–5K** | 10–25 ms | **5W** |
| **Pi 4** | **NEON dual** | **500–2K** | 25–100 ms | **5W** |

### Cosine via Palette Distance

| Tier | Error | Speed | vs. GPU (RTX 3060) |
|------|-------|-------|---------------------|
| **Foveal** (1/40σ) | 0.4% | **611M/s** | **~2× faster** |
| **Near** (1σ) | 8% | **2,400M/s** | **~8× faster** |
| F32 exact | 0% | 50M/s | 6× slower |
| RTX 3060 IVF-PQ | ~5% | ~300M/s | baseline |
| H100 cuVS | ~2% | ~1,500M/s | 5× our cost |

611M cosine-equivalent lookups/sec using only integer operations. The 256×256 table (64KB) lives in L1 cache — no FP division, no multiplication, no PCIe transfer.

### f16 Weight Transcoding

| Format | Size | Max Error | Speed |
Expand Down
Loading