
Accelerated CPU ops for Kotlin/Native linux targets (linuxX64/linuxArm64 are scalar-only) #722

@michalharakal

Follow-up gap surfaced by the Native-parity kernel work (#711/#715/#716/#720) and visualized in the eager backends & kernels mindmap / kernel-support matrix.

Problem

On Kotlin/Native linux targets (linuxX64, linuxArm64) the CPU backend runs the scalar floor only — no SIMD, no BLAS:

  • PlatformCpuOpsFactory.linux.kt returns plain DefaultCpuOps (scalar elementwise/reduction/matmul) and registers only ScalarKernelProvider.
  • The SIMD/accelerated tiers are all JVM-only: PanamaVectorKernelProvider needs jdk.incubator.vector; the native-FFM provider needs java.lang.foreign. Neither compiles on Kotlin/Native.
  • Apple native targets already have AccelerateCpuOps (cinterop to the Accelerate framework: cblas_sgemm, vDSP_*) for dense FP32 — linux has no equivalent.

Result in the support matrix: every format on Native·linux is scalar. Correct, but slow — packed-quant matmul (now functional on Native after #715) and dense FP32 both run unvectorized.

Ask

Add an accelerated CPU ops / kernel path for linux native, mirroring the Apple Accelerate precedent. Suggested phasing:

  1. Dense FP32 via cinterop BLAS (highest value, smallest surface). Add a linuxMain BlasCpuOps/OpenBlasCpuOps that cinterops to OpenBLAS or BLIS (cblas_sgemm plus a few cblas_*/vectorized elementwise routines), wired through PlatformCpuOpsFactory.linux.kt. Fall back gracefully to scalar DefaultCpuOps when the shared lib isn't present. This matches what AccelerateCpuOps does on Apple.
  2. Packed-quant SIMD on native (larger). Either:
    • a Kotlin/Native KernelProvider that cinterops a small C kernel lib (the Kotlin/Native analogue of the JVM skainet-backend-native-cpu FFM module — same C kernels, bound via cinterop instead of FFM), registered in PlatformCpuOpsFactory.linux.kt; or
    • hand-vectorized Kotlin/Native kernels (no portable SIMD intrinsics story in Kotlin/Native today, so cinterop to C is likely the pragmatic route).
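
To make the phase 1 fallback concrete, here is a minimal C sketch of the dispatch a cinterop-bound shim could perform: resolve cblas_sgemm at runtime with dlopen/dlsym, and use a scalar loop when the library is missing. It is an illustration under assumptions, not the repo's implementation. The names blas_try_load, sgemm_scalar and matmul_f32 are hypothetical, the soname libopenblas.so.0 is an assumption about the installed package, and the function-pointer type assumes an OpenBLAS build with 32-bit blasint whose enum arguments are ABI-compatible with int.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Values of CblasRowMajor and CblasNoTrans in the standard cblas.h. */
enum { ROW_MAJOR = 101, NO_TRANS = 111 };

/* Shape of cblas_sgemm. Assumption: 32-bit blasint, enums passed as int. */
typedef void (*sgemm_fn)(int order, int trans_a, int trans_b,
                         int m, int n, int k,
                         float alpha, const float *a, int lda,
                         const float *b, int ldb,
                         float beta, float *c, int ldc);

static sgemm_fn g_sgemm = NULL; /* NULL means "use the scalar path" */
static int g_probed = 0;        /* not thread-safe; fine for a sketch */

/* Try to resolve cblas_sgemm from a shared library. Returns 1 on success. */
int blas_try_load(const char *soname) {
    void *handle = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (!handle) return 0;
    void *sym = dlsym(handle, "cblas_sgemm");
    if (!sym) {
        dlclose(handle);
        return 0;
    }
    /* POSIX idiom for converting a dlsym result to a function pointer. */
    *(void **)(&g_sgemm) = sym;
    return 1;
}

/* Scalar floor: C[m x n] = A[m x k] * B[k x n], all row-major. */
void sgemm_scalar(int m, int n, int k,
                  const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Row-major matmul. Returns 1 if BLAS was used, 0 for the scalar path. */
int matmul_f32(int m, int n, int k,
               const float *a, const float *b, float *c) {
    if (!g_probed) {
        g_probed = 1;
        blas_try_load("libopenblas.so.0"); /* assumed soname */
    }
    if (g_sgemm) {
        g_sgemm(ROW_MAJOR, NO_TRANS, NO_TRANS, m, n, k,
                1.0f, a, k, b, n, 0.0f, c, n);
        return 1;
    }
    sgemm_scalar(m, n, k, a, b, c);
    return 0;
}
```

Because the symbol is looked up at runtime rather than linked with -lopenblas, a binary built this way still starts on a machine without the library. On glibc older than 2.34 the link step needs -ldl.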

Notes / acceptance

  • A cinterop .def for the chosen lib (OpenBLAS/BLIS) + the build wiring (the repo has no .def files yet; Apple Accelerate links the system framework without one).
  • Graceful runtime fallback to scalar when the native lib is unavailable (don't hard-fail a linuxX64 build/run without the lib installed).
  • Benchmark vs the scalar floor (the repo has a Phoronix-compatible bench harness) showing the dense FP32 speedup.
  • linuxArm64 covered too (NEON via OpenBLAS).
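
For the benchmark bullet, the scalar baseline number can be produced with something as small as the sketch below. It is not the repo's Phoronix-compatible harness; bench_scalar_gflops and sgemm_ref are hypothetical names, and the timer is the C11 timespec_get wall clock, which is good enough for a rough figure but is not monotonic.

```c
#include <stdlib.h>
#include <time.h>

/* Scalar reference: C[m x n] = A[m x k] * B[k x n], all row-major. */
static void sgemm_ref(int m, int n, int k,
                      const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* GFLOP/s of an n x n x n scalar matmul averaged over iters runs.
 * Returns 0.0 on bad arguments, allocation failure, or zero elapsed time. */
double bench_scalar_gflops(int n, int iters) {
    if (n <= 0 || iters <= 0) return 0.0;
    size_t len = (size_t)n * (size_t)n;
    float *a = malloc(len * sizeof *a);
    float *b = malloc(len * sizeof *b);
    float *c = malloc(len * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return 0.0;
    }
    for (size_t i = 0; i < len; i++) {
        a[i] = (float)(i % 7) * 0.25f;
        b[i] = (float)(i % 5) * 0.5f;
    }
    volatile float sink = 0.0f; /* keeps the compiler from dropping the work */
    double t0 = now_seconds();
    for (int it = 0; it < iters; it++) {
        sgemm_ref(n, n, n, a, b, c);
        sink += c[len - 1];
    }
    double dt = now_seconds() - t0;
    (void)sink;
    free(a); free(b); free(c);
    /* One multiply and one add per inner-loop step: 2 * n^3 flops per run. */
    double flops = 2.0 * (double)n * (double)n * (double)n * (double)iters;
    return dt > 0.0 ? flops / dt / 1e9 : 0.0;
}
```

Running the same measurement against the BLAS-backed path at a few sizes (for example 256, 512, 1024) would give the speedup figure the acceptance criteria ask for.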
