3 changes: 3 additions & 0 deletions docs/modules/ROOT/nav.adoc
@@ -25,5 +25,8 @@
* xref:explanation/examples/index.adoc[Worked examples]
** xref:explanation/examples/matmul.adoc[Matrix multiplication examples]
* xref:explanation/perf/jvm-cpu.adoc[JVM CPU performance]
* xref:explanation/perf/simd-kernels.adoc[How SIMD kernels are built]
* xref:explanation/perf/quantized-simd-kernels.adoc[How quantized SIMD kernels are built]
* xref:explanation/perf/native-ffm-plan.adoc[Plan: native FFM kernel provider]
* xref:explanation/perf/java-25-cpu-backend.adoc[Java 25 CPU backend notes]
* xref:explanation/issues/native-macos-accelerate-simd.adoc[Native macOS Accelerate SIMD issues]
278 changes: 278 additions & 0 deletions docs/modules/ROOT/pages/explanation/perf/native-ffm-plan.adoc
@@ -0,0 +1,278 @@
= Plan: Native (FFM) Kernel Provider
:description: Where the JVM Vector kernels stop, what a native priority-100 provider would look like, and when to build it.

This page is a *plan*, not shipped code. The intent is to capture
enough detail that the design doesn't drift between the time someone
decides to start the work and the moment a PR is opened. An earlier
version of this page lived briefly in `NATIVE_FFM_KERNEL_PROVIDER.md`
at the repo root and was removed on the advice to "ship the release
first, keep the plan in docs"; this is its permanent home.

== Where the JVM Vector kernels run out

After the M5 milestone work landed (PRs #554–#565 across the 0.21.0
release), every CPU matmul path goes through the kernel SPI — see
xref:explanation/perf/simd-kernels.adoc[] and
xref:explanation/perf/quantized-simd-kernels.adoc[]. The Panama Vector
provider runs at:

* ~73 GFLOPS on FP32 4096² matmul (Apple Silicon NEON)
* ~73 GFLOPS on Q4_K 4096² matmul-vector (same regime; fused dequant
adds essentially zero cost on top of the FMA)

That's already in the ggml NEON ballpark in absolute terms. But
ggml's hand-tuned NEON / AVX2 still outruns the JVM Vector API on:

* dense FLOPs/cycle on shapes the Vector API can't tile-block
optimally (the 8×8×128 default is a heuristic)
* AVX-512 VNNI fused INT8 dot products
* NEON INT8 dot-product (`SDOT`) and `bf16` / `fp16` instructions
* future SVE / SME — none of which the Vector API exposes portably
today

A native provider closes that gap and unlocks two follow-ons that
*can't* be built on the Vector API alone:

. *M4 ↔ M5 zero-copy.* Mmap'd Q4_K weights stay as `MemorySegment`
views; a native kernel reads the same pages with no heap copy and
no staging buffer.
. *Hardware-specific lanes* unreachable from portable Vector code.

== Provider shape

[cols="1,1,1",options="header"]
|===
| Priority | Provider | Status
| 0 | `ScalarKernelProvider` | shipped (PR #554)
| 50 | `PanamaVectorKernelProvider` | shipped (PRs #557, #560 + ServiceLoader #559)
| *100* | *`NativeKernelProvider` (FFM)* | *this plan*
|===

The `KernelRegistry.bestAvailable()` cascade means: when the native
lib loads, native wins; when it doesn't (sandbox, missing arch, JDK
without FFM, kill-switch flipped), Panama wins; on Kotlin/Native and
JS / Wasm targets, where neither is available, scalar wins. Nothing
above the registry layer has to change.
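
The cascade can be pictured with a small, self-contained sketch. The interface and class names below are simplified stand-ins invented for illustration; only the priority values and the "highest available priority wins" rule come from the table above, and the real SPI may look different.

```kotlin
// Hypothetical, simplified stand-in for the real kernel provider SPI.
interface KernelProviderSketch {
    val name: String
    val priority: Int
    fun isAvailable(): Boolean
}

// Test double whose availability is fixed at construction time.
class FixedProvider(
    override val name: String,
    override val priority: Int,
    private val available: Boolean,
) : KernelProviderSketch {
    override fun isAvailable(): Boolean = available
}

// The highest-priority provider that reports itself available wins.
fun bestAvailable(providers: List<KernelProviderSketch>): KernelProviderSketch =
    providers.filter { it.isAvailable() }.maxByOrNull { it.priority }
        ?: error("no kernel provider available")
```

With this shape, a failed library load or a flipped kill-switch only has to make `isAvailable()` return `false`; callers keep asking the registry and never name a provider directly.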

== Goals

. *A `NativeKernelProvider` registered at priority 100* that on JDK
21+ wins `KernelRegistry.bestAvailable()` over Panama whenever the
native lib loads successfully.
. *A first concrete kernel: native Q4_K matmul.* It must:
.. take a `MemorySegment` for both input (FP32) and packed Q4_K
weights (canonical ggml layout — same as `Q4_KBlockTensorData`
and `matmulF32Q4_KMemSeg`);
.. produce numerically equivalent output to
`PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance
(same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
.. run *≥2.5×* faster than the prior Q4_K scalar dequant baseline (the
M5 success metric) on the bench shapes from
`QuantizedMatmulBench` (1024², 4096×1024, 4096²).
. *Optional follow-on kernels* — Q6_K, Q8_0, FP32 — share the build
system but each ship as a separate small PR.
. *One supported architecture for the first PR* (likely Apple
Silicon NEON since that's the development hardware in use), with a
clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.
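
To make the `1e-4` parity bar concrete, here is a minimal sketch of a relative-tolerance comparison. The exact formula used by `PanamaVectorQ4KMatmulKernelTest` is not shown on this page, so the definition below (difference scaled by the larger magnitude, with a small floor to avoid dividing the budget by zero) is an assumption, and `withinRelativeTolerance` is a hypothetical helper name.

```kotlin
import kotlin.math.abs
import kotlin.math.max

// Assumed definition: |actual - expected| <= relTol * max(|actual|, |expected|, floor).
// The real parity test may use a different formula.
fun withinRelativeTolerance(
    expected: FloatArray,
    actual: FloatArray,
    relTol: Float = 1e-4f,
    floor: Float = 1e-6f,
): Boolean {
    require(expected.size == actual.size) { "length mismatch" }
    for (i in expected.indices) {
        val scale = max(max(abs(expected[i]), abs(actual[i])), floor)
        if (abs(expected[i] - actual[i]) > relTol * scale) return false
    }
    return true
}
```

A native parity test would run both kernels on the same input and weights and assert this predicate on the two output arrays.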

== Non-goals

* *JNI.* The roadmap explicitly says "FFM not JNI". JNI's per-call
transition overhead and its array-access model (copies, or critical
sections that can stall the GC) are wrong for hot per-token
kernels; FFM (Java 22 stable, Java 21 preview) gives low-overhead
native calls and a direct `MemorySegment` ABI.
* *Cross-compilation matrix on day one.* The first PR can ship just
one (host-arch) variant; CI cross-arch builds come later.
* *Replacing Panama.* Panama remains the priority-50 fallback for
JVM environments that can't load native libs (sandboxes, missing
arch, kill-switch); scalar still covers Wasm, Kotlin/Native targets,
and JDKs without `jdk.incubator.vector`.
* *Distribution via pre-built native artifacts on Maven Central.*
Out of scope for the first PR — local build only. Publishing
classifier JARs comes in a separate plan.

== Architecture

=== Module layout

[source]
----
skainet-backends/
  skainet-backend-native-cpu/                       # NEW
    src/
      jvmMain/kotlin/sk/ainet/exec/kernel/          # Kotlin side
        NativeKernelProvider.kt                     # priority=100, isAvailable()=libLoaded
        NativeQ4KMatmulKernel.kt                    # implements Q4KMatmulKernel via FFM
        NativeLibraryLoader.kt                      # System.loadLibrary, locate, version
      jvmMain/resources/META-INF/services/
        sk.ainet.backend.api.kernel.KernelProvider  # appends NativeKernelProviderFactory
      jvmTest/kotlin/sk/ainet/exec/kernel/
        NativeQ4KMatmulKernelTest.kt                # parity vs PanamaVectorQ4KMatmulKernel
    native/                                         # native source tree
      c/
        q4k_matmul.c                                # ggml-style hand-tuned kernel
        q4k_matmul.h
      CMakeLists.txt                                # or Bazel BUILD
    build.gradle.kts                                # Gradle wrapper that invokes CMake
----

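The native library compiles to a shared object (`libskainet_kernels.dylib`
on macOS, `.so` on Linux, `.dll` on Windows) and is packaged into the
module's resources. `System.loadLibrary` searches `java.library.path`,
not the classpath, so `NativeLibraryLoader` has to extract the packaged
library to a temporary file and load it from there.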
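
One way `NativeLibraryLoader` could locate and open the packaged library is sketched below. It extracts the resource to a temporary file first, because the JVM cannot load a shared library directly from inside a JAR, and then opens it with `SymbolLookup.libraryLookup`. The object name, the `native/<os>-<arch>/` tag mapping, and the "return `null` on any failure" policy are assumptions for illustration, not the planned implementation.

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.SymbolLookup
import java.nio.file.Files
import java.nio.file.StandardCopyOption

// Hypothetical helper; not the real NativeLibraryLoader.
object NativeLibraryLoaderSketch {
    // Maps os.name / os.arch values to a resource directory such as "macos-aarch64".
    fun platformTag(osName: String, osArch: String): String {
        val os = when {
            osName.startsWith("Mac", ignoreCase = true) -> "macos"
            osName.startsWith("Windows", ignoreCase = true) -> "windows"
            else -> "linux"
        }
        val arch = when (val a = osArch.lowercase()) {
            "aarch64", "arm64" -> "aarch64"
            "amd64", "x86_64" -> "x86_64"
            else -> a
        }
        return "$os-$arch"
    }

    // Returns null on any failure so that isAvailable() can simply report false.
    fun open(baseName: String = "skainet_kernels"): SymbolLookup? {
        val tag = platformTag(System.getProperty("os.name"), System.getProperty("os.arch"))
        val fileName = System.mapLibraryName(baseName) // e.g. libskainet_kernels.dylib
        val stream = javaClass.getResourceAsStream("/native/$tag/$fileName") ?: return null
        return try {
            stream.use { input ->
                val tmp = Files.createTempFile("skainet-", "-$fileName")
                tmp.toFile().deleteOnExit()
                Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING)
                // The global arena keeps the library loaded for the JVM's lifetime.
                SymbolLookup.libraryLookup(tmp, Arena.global())
            }
        } catch (e: Exception) {
            null
        }
    }
}
```

Returning `null` rather than throwing keeps the failure mode aligned with the registry cascade: a missing or unloadable library just means the provider is unavailable.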

=== FFM binding pattern

Single C entry point per kernel:

[source,c]
----
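// q4k_matmul.h
#ifndef SKAINET_Q4K_MATMUL_H
#define SKAINET_Q4K_MATMUL_H

#include <stdint.h>  // int32_t, uint8_t

void skainet_q4k_matmul(
    const float*   input,              // FP32 input vector, length input_dim
    const uint8_t* weight,             // packed Q4_K bytes (canonical ggml layout)
    int32_t        weight_byte_offset, // byte offset of the first block in weight
    int32_t        input_dim,
    int32_t        output_dim,
    float*         output,             // FP32 output, length output_dim
    int32_t        output_offset       // offset of the first result in output
);

#endif  // SKAINET_Q4K_MATMUL_H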
----

Kotlin side:

[source,kotlin]
----
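import java.lang.foreign.Arena
import java.lang.foreign.FunctionDescriptor
import java.lang.foreign.Linker
import java.lang.foreign.MemorySegment
import java.lang.foreign.ValueLayout
import java.lang.invoke.MethodHandle

internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
    // Bound once; the layouts mirror skainet_q4k_matmul's parameters in order.
    private val handle: MethodHandle = run {
        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
        Linker.nativeLinker().downcallHandle(
            symbol,
            FunctionDescriptor.ofVoid(
                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
            ),
        )
    }

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        // Heap arrays: stage through temporary off-heap segments + bulk copy.
        // Preferred: a MemorySegment-input overload for mmap'd weights, which
        // avoids copying the weights on every call.
        Arena.ofConfined().use { arena ->
            val weightBytes = weight.size - weightByteOffset
            val inSeg = arena.allocate(inputDim * 4L, 4L)
            val wSeg = arena.allocate(weightBytes.toLong(), 1L)
            val outSeg = arena.allocate(outputDim * 4L, 4L)
            MemorySegment.copy(input, inputOffset, inSeg, ValueLayout.JAVA_FLOAT, 0L, inputDim)
            MemorySegment.copy(weight, weightByteOffset, wSeg, ValueLayout.JAVA_BYTE, 0L, weightBytes)
            // Offsets are already applied by the copies, so the native side gets 0.
            handle.invoke(inSeg, wSeg, 0, inputDim, outputDim, outSeg, 0)
            MemorySegment.copy(outSeg, ValueLayout.JAVA_FLOAT, 0L, output, outputOffset, outputDim)
        }
    }
}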
----

The cleaner path is to introduce a sibling `Q4KMemSegMatmulKernel`
SPI (mentioned as out-of-scope in PR #563) that takes `MemorySegment`
directly, and have the native provider implement *that* — no heap
copy. The `Q4KMatmulKernel` (`ByteArray`) variant can wrap the
MemSeg one with a temporary `Arena.ofConfined()` copy if needed for
legacy callers.
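
A sketch of what the MemSeg-first path could look like follows. The interface signature is a guess (the real `Q4KMemSegMatmulKernel` SPI is not defined on this page) and `mapWeights` is a hypothetical helper; the point is only that a memory-mapped weight file becomes a `MemorySegment` that can be handed to a native kernel without touching the heap.

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.MemorySegment
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Hypothetical signature; the real SPI may carry offsets or strides as well.
interface Q4KMemSegMatmulKernelSketch {
    fun matmul(
        input: MemorySegment,   // FP32 input, inputDim floats
        weight: MemorySegment,  // packed Q4_K blocks, e.g. a mapped file region
        inputDim: Int,
        outputDim: Int,
        output: MemorySegment,  // FP32 output, outputDim floats
    )
}

// Maps a weight file read-only. The mapping lives as long as the arena,
// even after the channel itself has been closed.
fun mapWeights(path: Path, arena: Arena): MemorySegment =
    FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size(), arena)
    }
```

The arena passed to `mapWeights` decides when the mapping is released, so a model loader would typically own one long-lived arena per weight file.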

=== Build system

*Gradle + CMake* is the path of least resistance:

* A new Gradle module (or hand-rolled `Exec` tasks) invokes CMake
for the native module's `build` task.
* Native artifacts land in `build/native/<arch>/` and are copied
into `src/jvmMain/resources/native/<os>-<arch>/` so
`NativeLibraryLoader` can find them on the classpath.
* Kotlin compile depends on the native artifact being built first.

The xnnpack backend already in the repo
(`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar
pattern — Gradle invokes CMake to build a native lib via cinterop.
*Reuse that template* rather than reinventing.
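
For orientation, the hand-rolled `Exec` variant might look roughly like the `build.gradle.kts` fragment below. Task names, the `native/host` build directory, and the hard-coded `macos-aarch64` tag are all assumptions; the xnnpack module's actual wiring should take precedence where it differs.

```kotlin
// build.gradle.kts (sketch): task names, paths, and the platform tag are assumptions.
val nativeBuildDir = layout.buildDirectory.dir("native/host")

// Configure step: generate the CMake build tree from the module's native/ sources.
val configureNative by tasks.registering(Exec::class) {
    commandLine(
        "cmake", "-S", "native", "-B", nativeBuildDir.get().asFile.path,
        "-DCMAKE_BUILD_TYPE=Release",
    )
}

// Build step: compile the shared library.
val buildNative by tasks.registering(Exec::class) {
    dependsOn(configureNative)
    commandLine("cmake", "--build", nativeBuildDir.get().asFile.path)
}

// Package the built library alongside the JVM resources.
tasks.named<Copy>("jvmProcessResources") {
    dependsOn(buildNative)
    from(nativeBuildDir) {
        include("*.dylib", "*.so", "*.dll")
        into("native/macos-aarch64") // hard-coded host tag for a first PR
    }
}
```

This variant feeds the library into the processed resources instead of copying it into `src/`, which keeps build output out of the source tree; either placement ends up at the same classpath location.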

== Staged delivery

PRs in order, each independently mergeable:

. *`skainet-backend-native-cpu` module scaffolding.* Gradle module,
`build.gradle.kts` wired to invoke CMake, a *trivial* C kernel
(e.g. one that just multiplies its first input by 2.0) to prove the FFM
pipeline end-to-end. A `NativeKernelProvider` whose `isAvailable()`
returns `false` until the real kernel lands. Sets up the CI artifact
path on the host arch.
. *First real native kernel: Q4_K matmul (Apple Silicon NEON).*
Hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`,
JMH bench variant added to `QuantizedMatmulBench`.
. *`Q4KMemSegMatmulKernel` SPI sibling + native variant.* Closes
the M4↔M5 zero-copy story for mmap'd weights.
. *`linuxX64` AVX2 variant + cross-arch CI build.* The
cross-compilation matrix story.
. *Optional: native FP32 matmul, native Q6_K, native Q8_0.* Same
shape as PRs 2–3, one per format.
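
Before any CMake glue exists, the Kotlin half of the PR 1 pipeline (lookup, descriptor, downcall, off-heap argument) can be exercised against a symbol the platform already ships. The sketch below binds the C library's `strlen` as a stand-in for the trivial kernel; it is an illustration only, and it assumes a 64-bit `size_t`.

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.FunctionDescriptor
import java.lang.foreign.Linker
import java.lang.foreign.MemorySegment
import java.lang.foreign.ValueLayout

// Calls the C library's strlen through an FFM downcall.
fun nativeStrlen(text: String): Long {
    val linker = Linker.nativeLinker()
    val strlen = linker.downcallHandle(
        linker.defaultLookup().find("strlen").orElseThrow(),
        // size_t strlen(const char*); size_t is assumed to be 64-bit here.
        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS),
    )
    return Arena.ofConfined().use { arena ->
        val bytes = text.toByteArray(Charsets.UTF_8)
        // Arena allocations are zero-filled, so the extra byte is the NUL terminator.
        val cString = arena.allocate(bytes.size + 1L, 1L)
        MemorySegment.copy(bytes, 0, cString, ValueLayout.JAVA_BYTE, 0L, bytes.size)
        // invoke (not invokeExact) lets the JVM adapt the boxed return type.
        strlen.invoke(cString) as Long
    }
}
```

Swapping `defaultLookup()` for the packaged library's lookup and `strlen` for the trivial kernel's symbol is then the only change needed once the native build produces an artifact.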

The first PR is the largest in scaffolding terms (~500–800 LoC of
build glue + 1 trivial kernel), but every subsequent PR is small and
template-able.

== Success metrics

* *PR 2 sign-off*: native Q4_K matmul on Apple Silicon runs *≥2.5×*
faster than the scalar Q4_K dequant-then-matmul baseline at 4096² (the
M5 milestone target). Panama Q4_K SIMD already exceeds this metric
(~73 GFLOPS, see
xref:explanation/perf/quantized-simd-kernels.adoc[]), so the effective
bar is "beats Panama by a meaningful margin", provisionally ≥1.5× over
Panama.
* *PR 3 sign-off*: Q4_K MemSeg native path is faster than the Panama
Q4_K MemSeg path from PR #563, with no heap copy in the timed
region.
* *No regression on JVM-only environments* — when the native lib
fails to load (sandbox, missing arch, kill-switch), `bestAvailable()`
cleanly falls through to Panama, and existing tests / benches show
the same numbers as today.
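
The PR 2 bar combines two ratios, which the following sketch spells out. `BenchResult` and `meetsPr2Bar` are hypothetical names, the timings would come from `QuantizedMatmulBench`, and the 1.5× factor is the provisional figure from the list above rather than a settled threshold.

```kotlin
// Per-shape timings in milliseconds (lower is better); hypothetical container.
data class BenchResult(val scalarMs: Double, val panamaMs: Double, val nativeMs: Double)

// True when native clears both the milestone bar (vs scalar) and the
// provisional margin over Panama.
fun meetsPr2Bar(
    result: BenchResult,
    overScalar: Double = 2.5,
    overPanama: Double = 1.5,
): Boolean =
    result.scalarMs / result.nativeMs >= overScalar &&
        result.panamaMs / result.nativeMs >= overPanama
```

For example, timings of 100 ms (scalar), 30 ms (Panama), and 20 ms (native) give speedups of 5.0× and 1.5×, which passes; 25 ms native gives 4.0× and 1.2×, which clears the milestone target but not the margin over Panama.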

== Risks & open questions

. *JDK 21 preview FFM vs JDK 22 stable.* FFM left preview in Java 22.
The repo currently builds on JDK 21 with `--enable-preview
--add-modules jdk.incubator.vector`. Recommendation: stay on 21
preview; flip to 22 in a separate toolchain-bump PR.
. *`MethodHandle` invocation overhead.* Even with FFM, each native
call has a small fixed cost (~µs). For the smallest matmul shapes
(e.g. 256² FP32) this could swamp the FLOPs win. Mitigation: route
small inputs to Panama and large inputs to native at the
registry/provider level, OR accept that the win is sized for
production-relevant shapes (4096²+).
. *Native code quality and maintenance.* Hand-tuned NEON / AVX2 in C
is harder to audit than Kotlin Vector API code. Mitigation: keep
kernels small (<300 LoC each), parity-test exhaustively, prefer
porting from ggml's reference (MIT-licensed, well-vetted) over
writing from scratch.
. *Distribution.* Native artifacts complicate Maven Central
publication (need `<classifier>` per OS/arch). Not a blocker for
the first internal-use PR; a separate "publish native classifier
JARs" plan will be needed before community use.
. *Cross-arch CI cost.* Building NEON natively on Apple Silicon CI
plus AVX2 on linuxX64 plus Android NDK doubles or triples build
time. The xnnpack backend's existing CI matrix is a precedent —
reuse the same approach.
. *Native `MemorySegment` lifetime.* The Kotlin caller owns the
`Arena` for arrays it copies in. The native kernel must NOT retain
pointers past the FFM call return. Document this contract in
`NativeQ4KMatmulKernel.matmul` kdoc.
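
The size-based routing mitigation from the `MethodHandle` overhead item could be as small as the sketch below. The `Route` enum, the function name, and the 1024² element threshold are placeholders; the real cut-over point would have to come from benchmarks.

```kotlin
enum class Route { PANAMA, NATIVE }

// Sends small shapes to Panama, where a fixed per-call cost would dominate,
// and large shapes to native. The default threshold is a placeholder, not a
// measured value.
fun chooseRoute(
    inputDim: Int,
    outputDim: Int,
    nativeAvailable: Boolean,
    minNativeElements: Long = 1024L * 1024L,
): Route {
    val elements = inputDim.toLong() * outputDim.toLong()
    return if (nativeAvailable && elements >= minNativeElements) Route.NATIVE else Route.PANAMA
}
```

Keeping the decision in one pure function makes it easy to unit-test and to retune once per-call overhead has been measured on the target hardware.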

== When to start

Trigger conditions (any one):

* A real workload demands the native ≥2.5× target (Panama Q4_K stops
being fast enough on a customer machine).
* A community contributor offers a hand-tuned NEON / AVX2 Q4_K
kernel that's measurably faster than Panama.
* A second M5 metric (e.g. SDPA throughput, training-loop
throughput) needs hand-tuned native code.

Until then: *pause.* The Panama provider is doing the
milestone-equivalent work in absolute terms, and adding a native
build system is a meaningful complexity tax to take on
speculatively.