feat(q4_0): Panama SIMD kernel + reconcile MemSeg to split layout #649
Merged
Adds `PanamaVectorQ4_0MatmulKernel` (JDK Vector API): per block, decode the FP16 scale, unpack the 16 code bytes into 32 sign-corrected floats in the canonical ggml split layout, then SIMD-FMA against the input window. Wired through `PanamaVectorKernelProvider.matmulQ4_0()` (priority 50), so `DefaultCpuOpsJvm`'s `q4_0MatmulKernel` now prefers it over the scalar floor on JDK 21+.

Also fixes a latent layout bug: the existing JVM MemSegment Q4_0 path (`JvmQuantizedVectorKernels.dotQ4_0BlockMemSeg` and `Q4MemorySegmentTensorData` get/set/copyToFloatArray) used an *interleaved* nibble layout (code[2k]/[2k+1] from byte k), which does NOT match real GGUF Q4_0 weights (split layout: low nibbles → 0..15, high → 16..31, per `DequantOps.dequantQ4_0FromBytes`). This mismatch is the likely reason the Q4_0 MemSeg path was never exercised end-to-end. All three sites + the test encoder are reconciled to the split layout, so the MemSeg path now agrees with the heap `Q4_0BlockTensorData`, the scalar/Panama SPI kernels, and canonical ggml.

Tests: `PanamaVectorQ4_0MatmulKernelParityTest` (scalar≈panama within FMA tolerance), `QuantizedMemSegMatmulTest` still green under split layout. `apiCheck` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
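The per-block steps described in this commit (decode the scale, unpack the split-layout codes, multiply-accumulate against the input window) can be sketched in scalar C. This is a minimal illustration, not the kernel from the PR: the function names are hypothetical, the 18-byte block (2-byte little-endian FP16 scale followed by 16 code bytes) is assumed from the ggml Q4_0 format, and a plain loop stands in for the SIMD FMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert an IEEE 754 half-precision bit pattern to float. */
float fp16_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;
    float out;
    if (exp == 0) {
        /* Zero or subnormal: man * 2^-24, exactly representable in float. */
        out = (float)man * (1.0f / 16777216.0f);
        return sign ? -out : out;
    }
    if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);      /* Inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* rebias 15 -> 127 */
    }
    memcpy(&out, &bits, sizeof out);
    return out;
}

/* Decode one Q4_0 block (assumed: 2 scale bytes + 16 code bytes) into 32
 * floats using the split layout: low nibbles fill 0..15, high fill 16..31. */
void q4_0_decode_block(const uint8_t *block, float *out) {
    assert(block != NULL && out != NULL);
    uint16_t raw = (uint16_t)(block[0] | (block[1] << 8));
    float d = fp16_to_float(raw);
    const uint8_t *qs = block + 2;
    for (int i = 0; i < 16; ++i) {
        out[i]      = (float)((qs[i] & 0x0F) - 8) * d; /* sign-correct by -8 */
        out[i + 16] = (float)((qs[i] >> 4) - 8) * d;
    }
}

/* Dot product of one decoded block against a 32-float input window. */
float q4_0_dot_block(const uint8_t *block, const float *x) {
    float scratch[32];
    float acc = 0.0f;
    q4_0_decode_block(block, scratch);
    for (int i = 0; i < 32; ++i) {
        acc += scratch[i] * x[i];
    }
    return acc;
}
```

Decoding into a scratch buffer first keeps the accumulate loop free of bit manipulation, which is the same separation the commit message describes for the Vector API kernel.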
feat(q4_0): native FFM kernel (skainet_q4_0_matmul)

Completes the Q4_0 kernel stack with a hand-written C kernel at priority 100. Adds `native/src/q4_0_matmul.c` (split-layout `(code - 8) * d` decode, tight auto-vectorizing inner loop mirroring `q8_0_matmul.c`), declares `skainet_q4_0_matmul` in `skainet_kernels.h`, and adds it to CMakeLists.

Kotlin side: `NativeQ4_0MatmulKernel` (FFM downcall, mirrors `NativeQ8_0MatmulKernel`) wired through `NativeKernelProvider.matmulQ4_0()`. With the bundled `libskainet_kernels` loaded, `KernelRegistry.bestAvailable()` now prefers native → Panama → scalar for Q4_0, same cascade as Q8_0/Q4_K.

Verified locally (cmake build): `NativeQ4_0MatmulKernelParityTest` passes; native output matches `PanamaVectorQ4_0MatmulKernel` within FMA tolerance across matvec / attention / FFN shapes. CI without the native lib stays green via the same availability gate the other native parity tests use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
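The native → Panama → scalar cascade amounts to picking the highest-priority kernel that is currently available. The sketch below illustrates that selection rule only; the struct, the function name, and the scalar priority of 0 are assumptions made for the example (the commits state only 100 for native and 50 for Panama), and it does not reproduce the actual `KernelRegistry` implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one registered kernel. */
typedef struct {
    const char *name;
    int priority;    /* higher wins */
    bool available;  /* e.g. false when the native library is not loaded */
} kernel_candidate;

/* Return the available candidate with the highest priority, or NULL if
 * none is available. Ties keep the earlier entry. */
const kernel_candidate *best_available(const kernel_candidate *c, size_t n) {
    const kernel_candidate *best = NULL;
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (!c[i].available) {
            continue;
        }
        if (best == NULL || c[i].priority > best->priority) {
            best = &c[i];
        }
    }
    return best;
}
```

With the native entry marked unavailable, the selection falls back to the priority-50 entry, which mirrors how CI without the native library still resolves a working kernel.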
Base automatically changed from `feature/q4_0-core-format` to `chore/resync-api-dumps` May 30, 2026 17:53
Phase A, part 2. Stacked on #648.
What
- `PanamaVectorQ4_0MatmulKernel`: JDK Vector API kernel (decode scale → unpack split-layout nibbles into scratch → SIMD-FMA). Wired via `PanamaVectorKernelProvider.matmulQ4_0()` (priority 50).
- Reconciles `dotQ4_0BlockMemSeg`, `Q4MemorySegmentTensorData` (get/set/copyToFloatArray), and the test encoder to the canonical ggml split layout, so MemSeg now agrees with the heap type, the SPI kernels, and `DequantOps.dequantQ4_0FromBytes`.

Behavior change
This changes the numerical output of the pre-existing Q4_0 MemSeg matmul path (it was self-consistent but mismatched vs ggml). That path had no callers in this repo and was unverified; the fix makes it correct for real Q4_0 weights.
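To make the numerical change concrete, the sketch below unpacks the same 16 code bytes under both layouts. The function names are hypothetical, and the interleaved variant assumes the low nibble of byte k lands at index 2k and the high nibble at 2k+1; the PR text says only that both indices come from byte k.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split layout (canonical ggml): low nibbles -> 0..15, high -> 16..31. */
void unpack_split(const uint8_t *qs, int8_t *codes) {
    assert(qs != NULL && codes != NULL);
    for (int k = 0; k < 16; ++k) {
        codes[k]      = (int8_t)((qs[k] & 0x0F) - 8);
        codes[k + 16] = (int8_t)((qs[k] >> 4) - 8);
    }
}

/* Interleaved layout (the previous MemSeg behavior as described above):
 * byte k supplies indices 2k and 2k+1. Nibble order is an assumption. */
void unpack_interleaved(const uint8_t *qs, int8_t *codes) {
    assert(qs != NULL && codes != NULL);
    for (int k = 0; k < 16; ++k) {
        codes[2 * k]     = (int8_t)((qs[k] & 0x0F) - 8);
        codes[2 * k + 1] = (int8_t)((qs[k] >> 4) - 8);
    }
}
```

Both functions are internally consistent with a matching encoder, which is why a round-trip test could pass under either layout; they disagree on where each weight lands as soon as the bytes come from a real GGUF file.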
Tests
- `PanamaVectorQ4_0MatmulKernelParityTest`: scalar ≈ panama within FMA tolerance across matvec / attention / FFN shapes.
- `QuantizedMemSegMatmulTest`: green under the corrected split layout.
- `apiCheck` green (delta: `PanamaVectorQ4_0MatmulKernel`).

Targeting 0.27.0. Next: PR3 Native FFM.
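A parity check of the kind listed under Tests compares two kernels' outputs element by element and allows a small difference, since fused multiply-add rounds once where separate multiply and add round twice. The following is a rough sketch of such a comparison; the function name and the combined absolute-plus-relative bound are assumptions, and the tolerance values the real tests use are not stated in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static float absf(float v) {
    return v < 0.0f ? -v : v;
}

/* Return true when every actual[i] is within abs_tol + rel_tol * |expected[i]|
 * of expected[i]. A NaN on either side fails the comparison. */
bool outputs_match(const float *expected, const float *actual, size_t n,
                   float abs_tol, float rel_tol) {
    assert((expected != NULL && actual != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i) {
        float diff = absf(expected[i] - actual[i]);
        float bound = abs_tol + rel_tol * absf(expected[i]);
        if (!(diff <= bound)) {
            return false;
        }
    }
    return true;
}
```

Mixing an absolute and a relative term keeps the check meaningful both for outputs near zero and for large accumulations, where the rounding difference grows with the magnitude of the sum.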
🤖 Generated with Claude Code