Accelerated CPU ops for Kotlin/Native linux targets (linuxX64/linuxArm64 are scalar-only) #722

Follow-up gap surfaced by the Native-parity kernel work (#711/#715/#716/#720) and visualized in the [eager backends & kernels mindmap](../blob/develop/docs/eager-execution-backends-and-kernels.md) / [kernel-support matrix](../blob/develop/docs/kernel-support-matrix.md).

## Problem

On **Kotlin/Native linux** targets (`linuxX64`, `linuxArm64`) the CPU backend runs the **scalar floor only** — no SIMD, no BLAS:

- `PlatformCpuOpsFactory.linux.kt` returns plain `DefaultCpuOps` (scalar elementwise/reduction/matmul) + registers `ScalarKernelProvider`.
- The SIMD/accelerated tiers are all JVM-only: `PanamaVectorKernelProvider` needs `jdk.incubator.vector`; the native-FFM provider needs `java.lang.foreign`. Neither compiles on Kotlin/Native.
- **Apple** native targets already have `AccelerateCpuOps` (cinterop to the Accelerate framework: `cblas_sgemm`, `vDSP_*`) for dense FP32 — **linux has no equivalent**.

Result in the support matrix: every format on `Native·linux` is `scalar`. Correct, but slow — packed-quant matmul (now functional on Native after #715) and dense FP32 both run unvectorized.

## Ask

Add an accelerated CPU ops / kernel path for linux native, mirroring the Apple Accelerate precedent. Suggested phasing:

1. **Dense FP32 via cinterop BLAS** (highest value, smallest surface). Add a `linuxMain` `BlasCpuOps`/`OpenBlasCpuOps` that cinterops to **OpenBLAS** or **BLIS** (`cblas_sgemm`, plus a few other `cblas_*` routines and vectorized elementwise ops), wired through `PlatformCpuOpsFactory.linux.kt` with a graceful fallback to scalar `DefaultCpuOps` when the shared lib isn't present. This matches what `AccelerateCpuOps` does on Apple.
2. **Packed-quant SIMD on native** (larger). Either:
   - a Kotlin/Native `KernelProvider` that cinterops a small C kernel lib (the Kotlin/Native analogue of the JVM `skainet-backend-native-cpu` FFM module — same C kernels, bound via cinterop instead of FFM), registered in `PlatformCpuOpsFactory.linux.kt`; or
   - hand-vectorized Kotlin/Native kernels (no portable SIMD intrinsics story in Kotlin/Native today, so cinterop to C is likely the pragmatic route).
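One possible shape for the phase 1 fallback, sketched in C because that is the layer cinterop would bind: resolve `cblas_sgemm` at runtime with `dlopen`/`dlsym` and drop to a scalar loop when the library is missing. The names `matmul_f32`, `resolve_sgemm`, and `scalar_sgemm` are hypothetical, and the soname `libopenblas.so.0` is an assumption that varies by distro. The hand-written function-pointer type assumes an LP64 (32-bit `int`) BLAS build.

```c
#include <dlfcn.h>   /* dlopen, dlsym */
#include <stddef.h>  /* NULL */

/* Values of CblasRowMajor and CblasNoTrans in the standard CBLAS interface. */
enum { ROW_MAJOR = 101, NO_TRANS = 111 };

/* cblas_sgemm signature, declared by hand so no BLAS header or
 * link-time dependency is needed. Assumes a 32-bit-int BLAS build. */
typedef void (*sgemm_fn)(int order, int trans_a, int trans_b,
                         int m, int n, int k,
                         float alpha, const float *a, int lda,
                         const float *b, int ldb,
                         float beta, float *c, int ldc);

/* Scalar floor: C[m x n] = A[m x k] * B[k x n], all row-major. */
static void scalar_sgemm(int m, int n, int k,
                         const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Try to load the library once and cache the result.
 * Not thread-safe; a real implementation would guard the first call.
 * The soname below is an assumption. */
static sgemm_fn resolve_sgemm(void) {
    static int tried = 0;
    static sgemm_fn fn = NULL;
    if (!tried) {
        tried = 1;
        void *handle = dlopen("libopenblas.so.0", RTLD_NOW | RTLD_LOCAL);
        if (handle != NULL) {
            fn = (sgemm_fn)dlsym(handle, "cblas_sgemm");
        }
    }
    return fn;
}

/* Returns 1 if the BLAS path ran, 0 if the scalar fallback ran. */
int matmul_f32(int m, int n, int k,
               const float *a, const float *b, float *c) {
    sgemm_fn sgemm = resolve_sgemm();
    if (sgemm != NULL) {
        sgemm(ROW_MAJOR, NO_TRANS, NO_TRANS, m, n, k,
              1.0f, a, k, b, n, 0.0f, c, n);
        return 1;
    }
    scalar_sgemm(m, n, k, a, b, c);
    return 0;
}
```

Resolving the symbol at runtime rather than linking `-lopenblas` keeps the binary loadable on machines without the library. A plain link-time dependency would instead fail at process start, before any fallback could run. Whether the real `BlasCpuOps` should do this in C or through Kotlin/Native's own `dlopen` binding is an open design choice.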

## Notes / acceptance

- A cinterop `.def` for the chosen lib (OpenBLAS/BLIS) + the build wiring (the repo has no `.def` files yet; Apple Accelerate links the system framework without one).
- Graceful runtime fallback to scalar when the native lib is unavailable (don't hard-fail a `linuxX64` build/run without the lib installed).
- Benchmark vs the scalar floor (the repo has a Phoronix-compatible bench harness) showing the dense FP32 speedup.
- `linuxArm64` covered too (NEON via OpenBLAS).
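For the benchmark item, the arithmetic behind a dense FP32 comparison is small enough to sketch. The C below is not the repo's bench harness; `gflops`, `time_matmul`, and `scalar_matmul` are hypothetical names. It assumes the usual 2·M·N·K flop count for a matmul and times a kernel passed as a function pointer, so the same routine can measure the scalar floor and an accelerated path.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std modes */
#include <time.h>

/* Common signature for any row-major C = A * B kernel under test. */
typedef void (*matmul_fn)(int m, int n, int k,
                          const float *a, const float *b, float *c);

/* Scalar baseline: C[m x n] = A[m x k] * B[k x n]. */
void scalar_matmul(int m, int n, int k,
                   const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* A matmul performs one multiply and one add per (i, j, p) triple,
 * hence 2*m*n*k floating-point operations. */
double gflops(int m, int n, int k, double seconds) {
    return 2.0 * m * n * k / seconds / 1e9;
}

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average wall-clock seconds per call over `reps` runs,
 * after one untimed warm-up call. */
double time_matmul(matmul_fn kernel, int m, int n, int k,
                   const float *a, const float *b, float *c, int reps) {
    kernel(m, n, k, a, b, c);
    double start = now_seconds();
    for (int r = 0; r < reps; r++) {
        kernel(m, n, k, a, b, c);
    }
    return (now_seconds() - start) / reps;
}
```

The speedup to report would then be the scalar time divided by the accelerated time at the same M, N, K, measured on both `linuxX64` and `linuxArm64`.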

## Related
- JVM already has Panama-Vector (priority 50) + native-FFM (priority 100).
- Native-FFM **Q5/Q6_K** is a separate gap: #708 (core) / SKaiNET-transformers#170 (converter). This issue is specifically the **linux Kotlin/Native acceleration** path.
