SKaiNET-developers · michalharakal · Apr 29, 2026
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
@@ -25,5 +25,8 @@
 * xref:explanation/examples/index.adoc[Worked examples]
 ** xref:explanation/examples/matmul.adoc[Matrix multiplication examples]
 * xref:explanation/perf/jvm-cpu.adoc[JVM CPU performance]
+* xref:explanation/perf/simd-kernels.adoc[How SIMD kernels are built]
+* xref:explanation/perf/quantized-simd-kernels.adoc[How quantized SIMD kernels are built]
+* xref:explanation/perf/native-ffm-plan.adoc[Plan: native FFM kernel provider]
 * xref:explanation/perf/java-25-cpu-backend.adoc[Java 25 CPU backend notes]
 * xref:explanation/issues/native-macos-accelerate-simd.adoc[Native macOS Accelerate SIMD issues]
diff --git a/docs/modules/ROOT/pages/explanation/perf/native-ffm-plan.adoc b/docs/modules/ROOT/pages/explanation/perf/native-ffm-plan.adoc
@@ -0,0 +1,278 @@
+= Plan: Native (FFM) Kernel Provider
+:description: Where the JVM Vector kernels stop, what a native priority-100 provider would look like, and when to build it.
+
+This page is a *plan*, not shipped code. The intent is to capture
+enough detail that the design doesn't drift between the time someone
+decides to start the work and the moment a PR is opened. The earlier
+content of this page lived briefly in `NATIVE_FFM_KERNEL_PROVIDER.md`
+at the repo root and was removed on advice of "ship the release first,
+keep the plan in docs"; this is its permanent home.
+
+== Where the JVM Vector kernels run out
+
+After the M5 milestone work landed (PRs #554–#565 across the 0.21.0
+release), every CPU matmul path goes through the kernel SPI — see
+xref:explanation/perf/simd-kernels.adoc[] and
+xref:explanation/perf/quantized-simd-kernels.adoc[]. The Panama Vector
+provider runs at:
+
+* ~73 GFLOPS on FP32 4096² matmul (Apple Silicon NEON)
+* ~73 GFLOPS on Q4_K 4096² matmul-vector (same regime; fused dequant
+adds essentially zero cost on top of the FMA)
+
+That's already in the ggml NEON ballpark in absolute terms. But
+ggml's hand-tuned NEON / AVX2 still outruns the JVM Vector API on:
+
+* dense FLOPs/cycle on shapes the Vector API can't tile-block
+optimally (the 8×8×128 default is heuristic)
+* AVX-512 VNNI fused INT8 dot products
+* NEON dot-product and half-precision instructions (`SDOT` for INT8, `BFDOT` for `bf16`, FP16 FMA)
+* future SVE / SME — none of which the Vector API exposes portably
+today
+
+A native provider closes that gap and unlocks two follow-ons that
+*can't* be built on the Vector API alone:
+
+. *M4 ↔ M5 zero-copy.* Mmap'd Q4_K weights stay as `MemorySegment`
+views; a native kernel reads the same pages with no heap copy and
+no staging buffer.
+. *Hardware-specific lanes* unreachable from portable Vector code.
+
+== Provider shape
+
+[cols="1,1,1",options="header"]
+|===
+| Priority | Provider | Status
+| 0 | `ScalarKernelProvider` | shipped (PR #554)
+| 50 | `PanamaVectorKernelProvider` | shipped (PRs #557, #560 + ServiceLoader #559)
+| *100* | *`NativeKernelProvider` (FFM)* | *this plan*
+|===
+
+The `KernelRegistry.bestAvailable()` cascade means: when the native
+lib loads, native wins; when it doesn't (sandbox, missing arch, JDK
+without FFM, kill-switch flipped), Panama wins; on Native targets and
+JS / Wasm where neither is available, scalar wins. No code change
+above the registry layer.
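+
+The cascade can be sketched as follows. This is a minimal illustration,
+not the shipped registry: `SketchProvider` and this `bestAvailable`
+function are hypothetical stand-ins, and only the priority values
+(0 / 50 / 100) come from the table above.
+
+[source,kotlin]
+----
+// Hypothetical stand-in for the real KernelProvider SPI.
+data class SketchProvider(val name: String, val priority: Int, val available: Boolean)
+
+// Pick the highest-priority provider that reports itself available.
+fun bestAvailable(providers: List<SketchProvider>): SketchProvider =
+    providers.filter { it.available }.maxByOrNull { it.priority }
+        ?: error("no kernel provider available")
+
+fun main() {
+    val scalar = SketchProvider("scalar", 0, available = true)
+    val panama = SketchProvider("panama", 50, available = true)
+    // Native lib failed to load (sandbox, missing arch, kill-switch flipped).
+    val native = SketchProvider("native", 100, available = false)
+    println(bestAvailable(listOf(scalar, panama, native)).name) // prints panama
+}
+----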
+
+== Goals
+
+. *A `NativeKernelProvider` registered at priority 100* that on JDK
+21+ wins `KernelRegistry.bestAvailable()` over Panama whenever the
+native lib loads successfully.
+. *A first concrete kernel: native Q4_K matmul.* It must:
+.. take a `MemorySegment` for both input (FP32) and packed Q4_K
+weights (canonical ggml layout — same as `Q4_KBlockTensorData`
+and `matmulF32Q4_KMemSeg`);
+.. produce numerically equivalent output to
+`PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance
+(same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
+.. clear *≥2.5×* over the prior Q4_K scalar dequant baseline — the
+M5 success metric — on the bench shapes from
+`QuantizedMatmulBench` (1024², 4096×1024, 4096²).
+. *Optional follow-on kernels* — Q6_K, Q8_0, FP32 — share the build
+system but each ship as a separate small PR.
+. *One supported architecture for the first PR* (likely Apple
+Silicon NEON since that's the development hardware in use), with a
+clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.
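+
+The `1e-4` parity bar from goal 2 could be checked with a helper along
+these lines. It is a sketch under assumptions: the exact error
+definition used by `PanamaVectorQ4KMatmulKernelTest` is not stated
+here, so `maxRelativeError` and its `floor` parameter are hypothetical.
+
+[source,kotlin]
+----
+import kotlin.math.abs
+import kotlin.math.max
+
+// Largest element-wise relative error. The floor keeps the division
+// well-defined when the expected value is (near) zero.
+fun maxRelativeError(expected: FloatArray, actual: FloatArray, floor: Float = 1e-6f): Float {
+    require(expected.size == actual.size) { "length mismatch" }
+    var worst = 0f
+    for (i in expected.indices) {
+        val denom = max(abs(expected[i]), floor)
+        worst = max(worst, abs(expected[i] - actual[i]) / denom)
+    }
+    return worst
+}
+
+fun main() {
+    val panama = floatArrayOf(1.0f, -2.0f, 4.0f)            // reference output
+    val native = floatArrayOf(1.00001f, -2.00002f, 3.99996f) // candidate output
+    check(maxRelativeError(panama, native) <= 1e-4f) { "parity bar missed" }
+}
+----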
+
+== Non-goals
+
+* *JNI.* The roadmap explicitly says "FFM not JNI". JNI's per-call
+overhead and its array copy-or-pin semantics are wrong for hot
+per-token kernels; FFM (Java 22 stable, Java 21 preview) gives
+low-overhead native calls and a direct `MemorySegment` ABI.
+* *Cross-compilation matrix on day one.* The first PR can ship just
+one (host-arch) variant; CI cross-arch builds come later.
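+* *Replacing Panama.* Panama remains the priority-50 fallback for JVM
+environments that can't load native libs (sandboxes, unsupported
+architectures, kill-switch flipped). Where Panama itself is
+unavailable (Wasm, Native targets, JDK without
+`jdk.incubator.vector`), scalar remains the fallback.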
+* *Distribution via pre-built native artifacts on Maven Central.*
+Out of scope for the first PR — local build only. Publishing
+classifier JARs comes in a separate plan.
+
+== Architecture
+
+=== Module layout
+
+[source]
+----
+skainet-backends/
+  skainet-backend-native-cpu/                  # NEW
+    src/
+      jvmMain/kotlin/sk/ainet/exec/kernel/     # Kotlin side
+        NativeKernelProvider.kt                # priority=100, isAvailable()=libLoaded
+        NativeQ4KMatmulKernel.kt               # implements Q4KMatmulKernel via FFM
+        NativeLibraryLoader.kt                 # locate, extract, load, version
+      jvmMain/resources/META-INF/services/
+        sk.ainet.backend.api.kernel.KernelProvider  # appends NativeKernelProviderFactory
+      jvmTest/kotlin/sk/ainet/exec/kernel/
+        NativeQ4KMatmulKernelTest.kt           # parity vs PanamaVectorQ4KMatmulKernel
+      native/                                  # native source tree
+        c/
+          q4k_matmul.c                         # ggml-style hand-tuned kernel
+          q4k_matmul.h
+        CMakeLists.txt                         # or Bazel BUILD
+        build.gradle.kts                       # Gradle wrapper that invokes CMake
+----
+
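+The native library compiles to a shared object (`libskainet_kernels.dylib`
+on macOS, `libskainet_kernels.so` on Linux, `skainet_kernels.dll` on
+Windows) and is packaged into the module's resources. `System.loadLibrary`
+only searches `java.library.path`, not classpath resources, so
+`NativeLibraryLoader` has to extract the library to a temporary file and
+load it by absolute path (`System.load` or `SymbolLookup.libraryLookup`).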
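+
+`System.loadLibrary` resolves names against `java.library.path` rather
+than the classpath, so a library bundled as a resource is usually
+extracted first. The sketch below shows one way to do that, assuming
+JDK 22+ (or JDK 21 with `--enable-preview`); `loadBundledLibrary` and
+the resource path are hypothetical, not the planned
+`NativeLibraryLoader` API.
+
+[source,kotlin]
+----
+import java.lang.foreign.Arena
+import java.lang.foreign.SymbolLookup
+import java.nio.file.Files
+import java.nio.file.StandardCopyOption
+
+// Returns null when the resource is missing or cannot be loaded, so a
+// provider can report isAvailable() = false and the registry falls through.
+fun loadBundledLibrary(resourcePath: String): SymbolLookup? {
+    val stream = ClassLoader.getSystemResourceAsStream(resourcePath) ?: return null
+    // Keep the original extension (.dylib / .so / .dll) on the temp file.
+    val suffix = "." + resourcePath.substringAfterLast('.', "lib")
+    val tmp = Files.createTempFile("skainet_kernels", suffix)
+    tmp.toFile().deleteOnExit()
+    stream.use { Files.copy(it, tmp, StandardCopyOption.REPLACE_EXISTING) }
+    return try {
+        // Global arena: the library stays loaded for the JVM's lifetime.
+        SymbolLookup.libraryLookup(tmp, Arena.global())
+    } catch (e: IllegalArgumentException) {
+        null // the file is not a loadable library on this platform
+    }
+}
+
+fun main() {
+    // Hypothetical resource layout: native/<os>-<arch>/<file>.
+    val lib = loadBundledLibrary("native/macos-aarch64/libskainet_kernels.dylib")
+    println(if (lib == null) "native unavailable" else "native loaded")
+}
+----
+
+Returning `null` instead of throwing keeps the failure path cheap: the
+provider only has to check the result once at registration time.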
+
+=== FFM binding pattern
+
+Single C entry point per kernel:
+
+[source,c]
+----
+// q4k_matmul.h
+#include <stdint.h>  // int32_t, uint8_t
+
+void skainet_q4k_matmul(
+    const float* input,        // FP32 input vector, length input_dim
+    const uint8_t* weight,     // packed Q4_K bytes (canonical ggml layout)
+    int32_t weight_byte_offset,
+    int32_t input_dim,
+    int32_t output_dim,
+    float* output,             // FP32 output, length output_dim
+    int32_t output_offset
+);
+----
+
+Kotlin side:
+
+[source,kotlin]
+----
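+import java.lang.foreign.FunctionDescriptor
+import java.lang.foreign.Linker
+import java.lang.foreign.ValueLayout
+import java.lang.invoke.MethodHandle
+
+internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
+    // Downcall handle for skainet_q4k_matmul; layouts mirror the C parameter order.
+    private val handle: MethodHandle = run {
+        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
+        Linker.nativeLinker().downcallHandle(
+            symbol,
+            FunctionDescriptor.ofVoid(
+                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT, // input, weight, weight_byte_offset
+                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,                     // input_dim, output_dim
+                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,                      // output, output_offset
+            ),
+        )
+    }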
+
+    override fun matmul(
+        input: FloatArray, inputOffset: Int,
+        weight: ByteArray, weightByteOffset: Int,
+        inputDim: Int, outputDim: Int,
+        output: FloatArray, outputOffset: Int,
+    ) {
+        // Heap arrays: pass via temporary off-heap MemorySegment + bulk copy,
+        // OR (preferred) overload with a MemorySegment-input variant for
+        // mmap'd weights to avoid the copy.
+    }
+}
+----
+
+The cleaner path is to introduce a sibling `Q4KMemSegMatmulKernel`
+SPI (mentioned as out-of-scope in PR #563) that takes `MemorySegment`
+directly, and have the native provider implement *that* — no heap
+copy. The `Q4KMatmulKernel` (`ByteArray`) variant can wrap the
+MemSeg one with a temporary `Arena.ofConfined()` copy if needed for
+legacy callers.
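+
+The `ByteArray` wrapper described above might look like the sketch
+below, assuming JDK 22+ (or JDK 21 with `--enable-preview`).
+`MemSegQ4KKernel` and `matmulViaCopy` are hypothetical names, the real
+`Q4KMemSegMatmulKernel` signature may differ, and the 144-byte /
+256-element Q4_K block size is assumed from the canonical ggml layout.
+A fake kernel stands in for the native downcall so the copy logic can
+be exercised without a native library.
+
+[source,kotlin]
+----
+import java.lang.foreign.Arena
+import java.lang.foreign.MemorySegment
+import java.lang.foreign.ValueLayout
+
+// Hypothetical shape of the MemorySegment-based SPI.
+fun interface MemSegQ4KKernel {
+    fun matmul(input: MemorySegment, weight: MemorySegment, inputDim: Int, outputDim: Int, output: MemorySegment)
+}
+
+// Assumption: canonical ggml Q4_K packs 256 weights into a 144-byte block.
+const val Q4K_BLOCK_ELEMS = 256
+const val Q4K_BLOCK_BYTES = 144
+
+fun matmulViaCopy(
+    kernel: MemSegQ4KKernel,
+    input: FloatArray, inputOffset: Int,
+    weight: ByteArray, weightByteOffset: Int,
+    inputDim: Int, outputDim: Int,
+    output: FloatArray, outputOffset: Int,
+) {
+    val weightBytes = outputDim.toLong() * (inputDim / Q4K_BLOCK_ELEMS) * Q4K_BLOCK_BYTES
+    // Confined arena: off-heap staging buffers are freed when the block exits,
+    // so the kernel must not retain these pointers after it returns.
+    Arena.ofConfined().use { arena ->
+        val inSeg = arena.allocate(inputDim * 4L, 4L)
+        val wSeg = arena.allocate(weightBytes, 1L)
+        val outSeg = arena.allocate(outputDim * 4L, 4L)
+        MemorySegment.copy(input, inputOffset, inSeg, ValueLayout.JAVA_FLOAT, 0L, inputDim)
+        MemorySegment.copy(weight, weightByteOffset, wSeg, ValueLayout.JAVA_BYTE, 0L, weightBytes.toInt())
+        kernel.matmul(inSeg, wSeg, inputDim, outputDim, outSeg)
+        MemorySegment.copy(outSeg, ValueLayout.JAVA_FLOAT, 0L, output, outputOffset, outputDim)
+    }
+}
+
+fun main() {
+    // Fake kernel in place of the downcall: output[r] = input[0] * (r + 1).
+    val fake = MemSegQ4KKernel { inp, _, _, outDim, out ->
+        val x0 = inp.getAtIndex(ValueLayout.JAVA_FLOAT, 0L)
+        for (r in 0 until outDim) out.setAtIndex(ValueLayout.JAVA_FLOAT, r.toLong(), x0 * (r + 1))
+    }
+    val output = FloatArray(2)
+    matmulViaCopy(fake, FloatArray(256) { 3f }, 0, ByteArray(288), 0, 256, 2, output, 0)
+    println(output.toList()) // prints [3.0, 6.0]
+}
+----
+
+The three copies are exactly the cost the MemSeg-first SPI avoids for
+mmap'd weights, which is why the wrapper is kept for legacy callers only.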
+
+=== Build system
+
+*Gradle + CMake* is the path of least resistance:
+
+* A new Gradle module (or hand-rolled `Exec` tasks) invokes CMake
+for the native module's `build` task.
+* Native artifacts land in `build/native/<arch>/` and are copied
+into `src/jvmMain/resources/native/<os>-<arch>/` so
+`NativeLibraryLoader` can locate and load them at runtime.
+* Kotlin compile depends on the native artifact being built first.
+
+The xnnpack backend already in the repo
+(`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar
+pattern — Gradle invokes CMake to build a native lib via cinterop.
+*Reuse that template* rather than reinventing.
+
+== Staged delivery
+
+PRs in order, each independently mergeable:
+
+. *`skainet-backend-native-cpu` module scaffolding.* Gradle module,
+`build.gradle.kts` wired to invoke CMake, a *trivial* C kernel
+(e.g. just multiplies its first input by 2.0) to prove the FFM
+pipeline end-to-end. `NativeKernelProvider` reports
+`isAvailable() = false` until the real kernel lands. Sets up the CI
+artifact path on host arch.
+. *First real native kernel: Q4_K matmul (Apple Silicon NEON).*
+Hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`,
+JMH bench variant added to `QuantizedMatmulBench`.
+. *`Q4KMemSegMatmulKernel` SPI sibling + native variant.* Closes
+the M4↔M5 zero-copy story for mmap'd weights.
+. *`linuxX64` AVX2 variant + cross-arch CI build.* The
+cross-compilation matrix story.
+. *Optional: native FP32 matmul, native Q6_K, native Q8_0.* Same
+shape as PRs 2–3, one per format.
+
+The first PR is the largest in scaffolding terms (~500–800 LoC of
+build glue + 1 trivial kernel), but every subsequent PR is small and
+template-able.
+
+== Success metrics
+
+* *PR 2 sign-off*: native Q4_K matmul on Apple Silicon clears *≥2.5×*
+over the scalar Q4_K dequant-then-matmul baseline at 4096² (the M5
+milestone target). Panama Q4_K SIMD already exceeds this metric
+(~73 GFLOPS, see
+xref:explanation/perf/quantized-simd-kernels.adoc[]), so the bar that
+actually matters is "beats Panama by a meaningful margin", probably
+≥1.5× over Panama.
+* *PR 3 sign-off*: Q4_K MemSeg native path is faster than the Panama
+Q4_K MemSeg path from PR #563, with no heap copy in the timed
+region.
+* *No regression on JVM-only environments* — when the native lib
+fails to load (sandbox, missing arch, kill-switch), `bestAvailable()`
+cleanly falls through to Panama, and existing tests / benches show
+the same numbers as today.
+
+== Risks & open questions
+
+. *JDK 21 preview FFM vs JDK 22 stable.* FFM left preview in Java 22.
+The repo currently builds on JDK 21 with `--enable-preview
+--add-modules jdk.incubator.vector`. Recommendation: stay on 21
+preview; flip to 22 in a separate toolchain-bump PR.
+. *`MethodHandle` invocation overhead.* Even with FFM, each native
+call has a small fixed cost (~µs). For the smallest matmul shapes
+(e.g. 256² FP32) this could swamp the FLOPs win. Mitigation: route
+small inputs to Panama and large inputs to native at the
+registry/provider level, OR accept that the win is sized for
+production-relevant shapes (4096²+).
+. *Native code quality and maintenance.* Hand-tuned NEON / AVX2 in C
+is harder to audit than Kotlin Vector API code. Mitigation: keep
+kernels small (<300 LoC each), parity-test exhaustively, prefer
+porting from ggml's reference (MIT-licensed, well-vetted) over
+writing from scratch.
+. *Distribution.* Native artifacts complicate Maven Central
+publication (need `<classifier>` per OS/arch). Not a blocker for
+the first internal-use PR; a separate "publish native classifier
+JARs" plan will be needed before community use.
+. *Cross-arch CI cost.* Building NEON natively on Apple Silicon CI
+plus AVX2 on linuxX64 plus Android NDK doubles or triples build
+time. The xnnpack backend's existing CI matrix is a precedent —
+reuse the same approach.
+. *Native `MemorySegment` lifetime.* The Kotlin caller owns the
+`Arena` for arrays it copies in. The native kernel must NOT retain
+pointers past the FFM call return. Document this contract in
+`NativeQ4KMatmulKernel.matmul` kdoc.
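+
+The size-based routing mitigation from the `MethodHandle` overhead item
+could be sketched as below. All names are hypothetical, the kernels are
+reduced to labels so the routing decision is visible, and the threshold
+is a placeholder that would have to come from benchmarks.
+
+[source,kotlin]
+----
+// Hypothetical kernel shape, reduced to what routing needs.
+fun interface SketchMatmul {
+    fun run(inputDim: Int, outputDim: Int): String
+}
+
+// Route by work size: below the threshold the fixed call cost dominates,
+// so the small kernel (Panama) is used instead of the native one.
+class SizeRoutedMatmul(
+    private val small: SketchMatmul,
+    private val large: SketchMatmul,
+    private val minNativeElems: Long,
+) : SketchMatmul {
+    override fun run(inputDim: Int, outputDim: Int): String =
+        if (inputDim.toLong() * outputDim < minNativeElems) small.run(inputDim, outputDim)
+        else large.run(inputDim, outputDim)
+}
+
+fun main() {
+    val routed = SizeRoutedMatmul(
+        small = SketchMatmul { _, _ -> "panama" },
+        large = SketchMatmul { _, _ -> "native" },
+        minNativeElems = 1024L * 1024L, // placeholder, not a measured value
+    )
+    println(routed.run(256, 256))   // prints panama
+    println(routed.run(4096, 4096)) // prints native
+}
+----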
+
+== When to start
+
+Trigger conditions (any one):
+
+* A real workload needs more than Panama delivers (Panama Q4_K stops
+being fast enough on a customer machine).
+* A community contributor offers a hand-tuned NEON / AVX2 Q4_K
+kernel that's measurably faster than Panama.
+* A second M5 metric (e.g. SDPA throughput, training-loop
+throughput) needs hand-tuned native code.
+
+Until then: *pause.* The Panama provider is doing the
+milestone-equivalent work in absolute terms, and adding a native
+build system is a meaningful complexity tax to take on
+speculatively.