feat(q4_0): Panama SIMD kernel + reconcile MemSeg to split layout by michalharakal · Pull Request #649 · SKaiNET-developers/SKaiNET

michalharakal · 2026-05-30T17:47:47Z

Phase A, part 2. Stacked on #648.

What

- PanamaVectorQ4_0MatmulKernel: JDK Vector API kernel (decode scale → unpack split-layout nibbles into scratch → SIMD-FMA). Wired via PanamaVectorKernelProvider.matmulQ4_0() (priority 50).
- Latent-bug fix: the existing JVM MemSegment Q4_0 path used an interleaved nibble layout that doesn't match real GGUF Q4_0 weights. Reconciled dotQ4_0BlockMemSeg, Q4MemorySegmentTensorData (get/set/copyToFloatArray), and the test encoder to the canonical ggml split layout, so MemSeg now agrees with the heap type, the SPI kernels, and DequantOps.dequantQ4_0FromBytes.
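To make the layout mismatch concrete, here is a minimal scalar sketch in Java of the two nibble orderings. It is not the PR's code: the class and method names are hypothetical, the scale is passed as an already-decoded float, and the choice of which nibble lands at index 2k in the interleaved variant is an assumption (the PR only says codes 2k and 2k+1 come from byte k).

```java
// Illustrative sketch only; names are hypothetical and not part of SKaiNET.
final class Q4_0LayoutSketch {

    /** Split layout (canonical ggml): low nibble of byte k -> element k, high nibble -> element k + 16. */
    static float[] dequantSplit(byte[] codes, float d) {
        float[] out = new float[32];
        for (int k = 0; k < 16; k++) {
            int b = codes[k] & 0xFF;             // treat the byte as unsigned
            out[k]      = ((b & 0x0F) - 8) * d;  // low nibble, re-centred by 8
            out[k + 16] = ((b >>> 4) - 8) * d;   // high nibble, re-centred by 8
        }
        return out;
    }

    /** Interleaved layout (old MemSeg path): byte k -> elements 2k and 2k + 1.
     *  Assumption: the low nibble goes to 2k, the high nibble to 2k + 1. */
    static float[] dequantInterleaved(byte[] codes, float d) {
        float[] out = new float[32];
        for (int k = 0; k < 16; k++) {
            int b = codes[k] & 0xFF;
            out[2 * k]     = ((b & 0x0F) - 8) * d;
            out[2 * k + 1] = ((b >>> 4) - 8) * d;
        }
        return out;
    }
}
```

Both functions produce the same 32 values for a given block but in a different order, which is why the old path could pass its own round-trip tests while disagreeing with weights written by ggml.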

Behavior change

This changes the numerical output of the pre-existing Q4_0 MemSeg matmul path (it was self-consistent but mismatched vs ggml). That path had no callers in this repo and was unverified; the fix makes it correct for real Q4_0 weights.

Tests

- PanamaVectorQ4_0MatmulKernelParityTest: scalar ≈ panama within FMA tolerance across matvec / attention / FFN shapes.
- QuantizedMemSegMatmulTest: green under the corrected split layout.
- apiCheck green (delta: PanamaVectorQ4_0MatmulKernel).
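"Within FMA tolerance" matters because a fused multiply-add rounds once per step while a separate multiply and add round twice, so two correct kernels can differ in the last bits. The sketch below shows one way such a parity check could look in Java. The class name, helper names, and tolerance values are assumptions for illustration and are not taken from the actual parity test.

```java
// Illustrative sketch only; tolerances and names are assumptions.
final class FmaParitySketch {

    /** Reference dot product: separate multiply and add (two roundings per step). */
    static float dotPlain(float[] a, float[] b) {
        float acc = 0f;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];
        }
        return acc;
    }

    /** Fused variant: Math.fma rounds once per step, as hardware FMA does. */
    static float dotFma(float[] a, float[] b) {
        float acc = 0f;
        for (int i = 0; i < a.length; i++) {
            acc = Math.fma(a[i], b[i], acc);
        }
        return acc;
    }

    /** Combined absolute + relative tolerance check. */
    static boolean close(float expected, float actual, float relTol, float absTol) {
        return Math.abs(expected - actual) <= absTol + relTol * Math.abs(expected);
    }
}
```

A parity test would then call `close(dotPlain(a, b), dotFma(a, b), 1e-4f, 1e-3f)` for each shape instead of asserting exact equality.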

Targeting 0.27.0. Next: PR3 Native FFM.

🤖 Generated with Claude Code

Adds `PanamaVectorQ4_0MatmulKernel` (JDK Vector API): per block, decode the FP16 scale, unpack the 16 code bytes into 32 sign-corrected floats in the canonical ggml split layout, then SIMD-FMA against the input window. Wired through `PanamaVectorKernelProvider.matmulQ4_0()` (priority 50), so `DefaultCpuOpsJvm`'s `q4_0MatmulKernel` now prefers it over the scalar floor on JDK 21+.

Also fixes a latent layout bug: the existing JVM MemSegment Q4_0 path (`JvmQuantizedVectorKernels.dotQ4_0BlockMemSeg` and `Q4MemorySegmentTensorData` get/set/copyToFloatArray) used an *interleaved* nibble layout (code[2k]/[2k+1] from byte k), which does NOT match real GGUF Q4_0 weights (split layout: low nibbles → 0..15, high → 16..31, per `DequantOps.dequantQ4_0FromBytes`). This mismatch is the likely reason the Q4_0 MemSeg path was never exercised end-to-end. All three sites + the test encoder are reconciled to the split layout, so the MemSeg path now agrees with the heap `Q4_0BlockTensorData`, the scalar/Panama SPI kernels, and canonical ggml.

Tests: PanamaVectorQ4_0MatmulKernelParityTest (scalar≈panama within FMA tolerance), QuantizedMemSegMatmulTest still green under split layout. apiCheck green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
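The per-block steps described in this commit message (decode the FP16 scale, unpack the nibbles into a 32-float scratch buffer, FMA against the input window) can be sketched in scalar Java as follows. This is a stand-in for illustration, not the Vector API kernel itself: the class and method names are hypothetical, the FP16 decode is hand-written to avoid depending on a particular JDK version, and the 18-byte block layout (2-byte little-endian FP16 scale followed by 16 code bytes) is assumed from the ggml Q4_0 format rather than stated in this PR.

```java
// Illustrative scalar sketch; the real kernel uses the JDK Vector API.
final class Q4_0BlockDotSketch {

    /** Decode an IEEE 754 half-precision value to float. */
    static float halfToFloat(short h) {
        int bits = h & 0xFFFF;
        int sign = (bits & 0x8000) << 16;
        int exp = (bits >>> 10) & 0x1F;
        int man = bits & 0x3FF;
        if (exp == 0) {                       // zero or subnormal: man * 2^-24
            float v = man * 0x1p-24f;
            return sign != 0 ? -v : v;
        }
        if (exp == 31) {                      // infinity or NaN
            return Float.intBitsToFloat(sign | 0x7F800000 | (man << 13));
        }
        // Normal number: rebias exponent from 15 to 127.
        return Float.intBitsToFloat(sign | ((exp + 112) << 23) | (man << 13));
    }

    /** Dot product of one Q4_0 block (assumed 18 bytes at 'off') with 32 floats of x at 'xOff'. */
    static float dotBlock(byte[] block, int off, float[] x, int xOff) {
        // Step 1: decode the little-endian FP16 scale.
        short h = (short) ((block[off] & 0xFF) | ((block[off + 1] & 0xFF) << 8));
        float d = halfToFloat(h);

        // Step 2: unpack 16 code bytes into 32 floats, split layout.
        float[] scratch = new float[32];
        for (int k = 0; k < 16; k++) {
            int b = block[off + 2 + k] & 0xFF;
            scratch[k]      = ((b & 0x0F) - 8) * d;
            scratch[k + 16] = ((b >>> 4) - 8) * d;
        }

        // Step 3: fused multiply-add against the input window.
        float acc = 0f;
        for (int i = 0; i < 32; i++) {
            acc = Math.fma(scratch[i], x[xOff + i], acc);
        }
        return acc;
    }
}
```

A SIMD version would replace step 3 with vector loads and lane-wise FMA over the scratch buffer, which is why unpacking into contiguous floats first is convenient.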

Completes the Q4_0 kernel stack with a hand-written C kernel at priority 100. Adds native/src/q4_0_matmul.c (split-layout `(code - 8) * d` decode, tight auto-vectorizing inner loop mirroring q8_0_matmul.c), declares skainet_q4_0_matmul in skainet_kernels.h, and adds it to CMakeLists.

Kotlin side: NativeQ4_0MatmulKernel (FFM downcall, mirrors NativeQ8_0MatmulKernel) wired through NativeKernelProvider.matmulQ4_0(). With the bundled libskainet_kernels loaded, KernelRegistry.bestAvailable() now prefers native → Panama → scalar for Q4_0, same cascade as Q8_0/Q4_K.

Verified locally (cmake build): NativeQ4_0MatmulKernelParityTest passes, with native output matching PanamaVectorQ4_0MatmulKernel within FMA tolerance across matvec / attention / FFN shapes. CI without the native lib stays green via the same availability gate the other native parity tests use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
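The native → Panama → scalar cascade amounts to picking the highest-priority provider that reports itself available. The Java sketch below illustrates that selection rule only. The `Provider` type, its fields, and the scalar priority of 0 are hypothetical and do not reflect the real `KernelRegistry` API; the priorities 100 and 50 are the ones quoted in this PR.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; not the SKaiNET KernelRegistry.
final class KernelCascadeSketch {

    /** Hypothetical provider descriptor. */
    static final class Provider {
        final String name;
        final int priority;
        final boolean available;   // e.g. false when the native library failed to load

        Provider(String name, int priority, boolean available) {
            this.name = name;
            this.priority = priority;
            this.available = available;
        }
    }

    /** Highest-priority provider among those that are available, if any. */
    static Optional<Provider> bestAvailable(List<Provider> providers) {
        return providers.stream()
                .filter((Provider p) -> p.available)
                .max(Comparator.comparingInt((Provider p) -> p.priority));
    }
}
```

With this rule, a machine without the native library silently falls back to the Panama kernel, and a JVM without the Vector API falls back to the scalar one, which matches the availability gate that keeps CI green.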

feat(q4_0): native FFM kernel (skainet_q4_0_matmul)

michalharakal and others added 2 commits May 30, 2026 19:47

michalharakal mentioned this pull request May 30, 2026

feat(q4_0): native FFM kernel (skainet_q4_0_matmul) #650

Merged

Merge pull request #650 from SKaiNET-developers/feature/q4_0-native

7c5a1c9

feat(q4_0): native FFM kernel (skainet_q4_0_matmul)

Base automatically changed from feature/q4_0-core-format to chore/resync-api-dumps May 30, 2026 17:53

michalharakal merged commit e4b16f8 into chore/resync-api-dumps May 30, 2026
7 checks passed

michalharakal deleted the feature/q4_0-panama branch May 30, 2026 17:54

michalharakal merged 3 commits into chore/resync-api-dumps from feature/q4_0-panama
