Follow-up gap surfaced by the Native-parity kernel work (#711/#715/#716/#720) and visualized in the eager backends & kernels mindmap / kernel-support matrix.
Problem
On Kotlin/Native linux targets (linuxX64, linuxArm64) the CPU backend runs the scalar floor only — no SIMD, no BLAS:
- PlatformCpuOpsFactory.linux.kt returns plain DefaultCpuOps (scalar elementwise/reduction/matmul) + registers ScalarKernelProvider.
- The SIMD/accelerated tiers are all JVM-only: PanamaVectorKernelProvider needs jdk.incubator.vector; the native-FFM provider needs java.lang.foreign. Neither compiles on Kotlin/Native.
- Apple native targets already have AccelerateCpuOps (cinterop to the Accelerate framework: cblas_sgemm, vDSP_*) for dense FP32; linux has no equivalent.
Result in the support matrix: every format on Native·linux is scalar. Correct, but slow — packed-quant matmul (now functional on Native after #715) and dense FP32 both run unvectorized.
Ask
Add an accelerated CPU ops / kernel path for linux native, mirroring the Apple Accelerate precedent. Suggested phasing:
- Dense FP32 via cinterop BLAS (highest value, smallest surface). Add a linuxMain BlasCpuOps/OpenBlasCpuOps (cinterop to OpenBLAS or BLIS cblas_sgemm + a few cblas_*/vectorized elementwise), wired through PlatformCpuOpsFactory.linux.kt with a graceful fallback to scalar DefaultCpuOps when the shared lib isn't present. This matches what AccelerateCpuOps does on Apple.
- Packed-quant SIMD on native (larger). Either:
  - a Kotlin/Native KernelProvider that cinterops a small C kernel lib (the Kotlin/Native analogue of the JVM skainet-backend-native-cpu FFM module: same C kernels, bound via cinterop instead of FFM), registered in PlatformCpuOpsFactory.linux.kt; or
  - hand-vectorized Kotlin/Native kernels (no portable SIMD intrinsics story in Kotlin/Native today, so cinterop to C is likely the pragmatic route).
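One possible shape for the "graceful fallback" in the dense FP32 phase is sketched below in C, as a thin shim that a cinterop binding could wrap. It is an assumption, not the repo's design: a binary linked directly against libopenblas will not start when the library is missing, so this sketch resolves cblas_sgemm at runtime with dlopen/dlsym and returns a scalar routine otherwise. The names sgemm_fn, sgemm_scalar, sgemm_blas and select_sgemm are hypothetical, the sonames are supplied by the caller, and the CBLAS call assumes an LP64 build (32-bit int arguments) with row-major, non-transposed inputs.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical shim signature: row-major, no transpose.
 * C = alpha * A(m x k) * B(k x n) + beta * C(m x n). */
typedef void (*sgemm_fn)(int m, int n, int k, float alpha,
                         const float *a, const float *b,
                         float beta, float *c);

/* Scalar floor: always available, used when no BLAS can be loaded. */
static void sgemm_scalar(int m, int n, int k, float alpha,
                         const float *a, const float *b,
                         float beta, float *c) {
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) acc += a[i * k + p] * b[p * n + j];
            /* BLAS convention: with beta == 0 the old C is not read. */
            float prev = (beta == 0.0f) ? 0.0f : beta * c[i * n + j];
            c[i * n + j] = alpha * acc + prev;
        }
    }
}

/* cblas_sgemm as exported by a CBLAS library. The enum arguments are
 * passed as int here; this assumes an LP64 (32-bit int) build. */
typedef void (*cblas_sgemm_fn)(int order, int trans_a, int trans_b,
                               int m, int n, int k, float alpha,
                               const float *a, int lda,
                               const float *b, int ldb,
                               float beta, float *c, int ldc);

static cblas_sgemm_fn g_cblas_sgemm = NULL;

enum { ROW_MAJOR = 101, NO_TRANS = 111 }; /* CblasRowMajor, CblasNoTrans */

/* Accelerated path: forwards to the dynamically resolved cblas_sgemm. */
static void sgemm_blas(int m, int n, int k, float alpha,
                       const float *a, const float *b,
                       float beta, float *c) {
    g_cblas_sgemm(ROW_MAJOR, NO_TRANS, NO_TRANS, m, n, k,
                  alpha, a, k, b, n, beta, c, n);
}

/* Try each candidate soname in order; fall back to scalar if none
 * can be opened or none exports cblas_sgemm. Never fails. */
sgemm_fn select_sgemm(const char *const *sonames, size_t count) {
    for (size_t i = 0; i < count; i++) {
        void *lib = dlopen(sonames[i], RTLD_NOW | RTLD_LOCAL);
        if (lib == NULL) continue;
        void *sym = dlsym(lib, "cblas_sgemm");
        if (sym != NULL) {
            /* POSIX-sanctioned way to store a dlsym result in a
             * function pointer. The handle stays open on purpose. */
            *(void **)(&g_cblas_sgemm) = sym;
            return sgemm_blas;
        }
        dlclose(lib);
    }
    return sgemm_scalar;
}
```

Under these assumptions, the linux factory would call select_sgemm once at startup (for example with "libopenblas.so.0" as the first candidate) and pick the accelerated or the scalar ops implementation based on which function pointer comes back. On older glibc versions the shim needs -ldl at link time.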
Notes / acceptance
- A cinterop .def for the chosen lib (OpenBLAS/BLIS) + the build wiring (the repo has no .def files yet; Apple Accelerate links the system framework without one).
- Graceful runtime fallback to scalar when the native lib is unavailable (don't hard-fail a linuxX64 build/run without the lib installed).
- Benchmark vs the scalar floor (the repo has a Phoronix-compatible bench harness) showing the dense FP32 speedup.
- linuxArm64 covered too (NEON via OpenBLAS).
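The benchmark criterion boils down to timing the same matmul on both paths and reporting a ratio. The C sketch below shows only that arithmetic and is not the repo's bench harness: matmul_fn, matmul_scalar, matmul_flops, best_time and speedup are hypothetical names, and best-of-N wall time is an assumed measurement policy.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: C(m x n) = A(m x k) * B(k x n), row-major. */
typedef void (*matmul_fn)(int m, int n, int k,
                          const float *a, const float *b, float *c);

/* Scalar baseline used as the "floor" in the comparison. */
static void matmul_scalar(int m, int n, int k,
                          const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* One multiply and one add per inner-loop step. */
static double matmul_flops(int m, int n, int k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Best-of-reps wall time in seconds for one kernel invocation. */
static double best_time(matmul_fn fn, int m, int n, int k,
                        const float *a, const float *b, float *c,
                        int reps) {
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_seconds();
        fn(m, n, k, a, b, c);
        double dt = now_seconds() - t0;
        if (best < 0.0 || dt < best) best = dt;
    }
    return best;
}

/* Speedup of the accelerated path over the scalar floor. */
static double speedup(double scalar_seconds, double accel_seconds) {
    return scalar_seconds / accel_seconds;
}
```

Running best_time once with matmul_scalar and once with the accelerated kernel on identical inputs, then passing both times to speedup, gives the figure the acceptance note asks for; matmul_flops divided by the measured time gives throughput in FLOP/s if an absolute number is wanted as well.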
Related