feat(q4_0): native FFM kernel (skainet_q4_0_matmul) #650
Merged
Completes the Q4_0 kernel stack with a hand-written C kernel at priority 100.

Adds native/src/q4_0_matmul.c (split-layout `(code - 8) * d` decode, tight auto-vectorizing inner loop mirroring q8_0_matmul.c), declares skainet_q4_0_matmul in skainet_kernels.h, and adds it to CMakeLists.

Kotlin side: NativeQ4_0MatmulKernel (FFM downcall, mirrors NativeQ8_0MatmulKernel) wired through NativeKernelProvider.matmulQ4_0(). With the bundled libskainet_kernels loaded, KernelRegistry.bestAvailable() now prefers native → Panama → scalar for Q4_0, same cascade as Q8_0/Q4_K.

Verified locally (cmake build): NativeQ4_0MatmulKernelParityTest passes; native output matches PanamaVectorQ4_0MatmulKernel within FMA tolerance across matvec / attention / FFN shapes. CI without the native lib stays green via the same availability gate the other native parity tests use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
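For readers unfamiliar with the `(code - 8) * d` decode, here is a minimal C sketch of a Q4_0 matrix-vector product. It is not the code from `native/src/q4_0_matmul.c`: the function name `q4_0_matvec_sketch`, its signature, and the storage layout are assumptions made for illustration. In particular, "split layout" is taken here to mean that per-block scales and packed 4-bit codes live in separate arrays, and the nibble order follows the common ggml convention (low nibble is element `i`, high nibble is element `i + 16` of a 32-weight block).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { QK4_0 = 32 };  /* weights per Q4_0 block (ggml convention) */

/* Hypothetical sketch: y[r] = sum_k W[r][k] * x[k] for a Q4_0-quantized W.
 * Assumed layout (an illustration, not taken from the PR):
 *   scales: rows * (cols / 32) floats, one scale d per block
 *   codes : rows * (cols / 32) * 16 bytes, two 4-bit codes per byte;
 *           low nibble = element i, high nibble = element i + 16 */
static void q4_0_matvec_sketch(const float *scales, const uint8_t *codes,
                               const float *x, float *y,
                               size_t rows, size_t cols)
{
    assert(cols % QK4_0 == 0);  /* whole blocks only */
    const size_t blocks_per_row = cols / QK4_0;

    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t b = 0; b < blocks_per_row; ++b) {
            const size_t blk = r * blocks_per_row + b;
            const uint8_t *q = codes + blk * (QK4_0 / 2);
            const float *xb = x + b * QK4_0;

            /* Accumulate (code - 8) * x first, then apply the block scale
             * once: sum((code - 8) * d * x) == d * sum((code - 8) * x). */
            float block_sum = 0.0f;
            for (int i = 0; i < QK4_0 / 2; ++i) {
                const int lo = (q[i] & 0x0F) - 8;  /* code in 0..15 -> -8..7 */
                const int hi = (q[i] >> 4) - 8;
                block_sum += (float)lo * xb[i]
                           + (float)hi * xb[i + QK4_0 / 2];
            }
            acc += scales[blk] * block_sum;
        }
        y[r] = acc;
    }
}
```

The inner loop has a fixed trip count and no branches, which is the kind of shape compilers can usually auto-vectorize. Whether the real kernel hoists the scale out of the inner loop in this way is not stated in the PR.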
This was referenced May 30, 2026
Phase A, part 3 — completes the Q4_0 kernel stack. Stacked on #649.
What
- `native/src/q4_0_matmul.c`: `skainet_q4_0_matmul`, split-layout `(code - 8) * d` decode with a tight auto-vectorizing inner loop (mirrors `q8_0_matmul.c`). Declared in `skainet_kernels.h`, added to `CMakeLists.txt`.
- `NativeQ4_0MatmulKernel` (mirrors `NativeQ8_0MatmulKernel`), wired via `NativeKernelProvider.matmulQ4_0()` at priority 100.

With the bundled `libskainet_kernels` loaded, `KernelRegistry.bestAvailable()` now prefers native → Panama → scalar for Q4_0, the same cascade as Q8_0 / Q4_K.

Verification
Built the native lib locally via CMake and ran `NativeQ4_0MatmulKernelParityTest`: green, with native ≈ Panama within FMA tolerance across matvec / attention / FFN shapes. CI without the native toolchain stays green via the same availability gate (`@BeforeTest assertTrue(isAvailable())`) the existing native parity tests use.

No `.api` change (native module isn't API-validated; new symbols are internal).

Targeting 0.27.0. Next: PR4 (FP32→Q4_0 quantizer + loader policy).
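A note on "within FMA tolerance": a fused multiply-add rounds once, while a separate multiply followed by an add rounds twice, so two correct kernels can differ in the last few bits of each output. A parity check therefore compares with a tolerance instead of testing exact equality. The C sketch below shows one common form of such a check. The helper names and the mixed absolute/relative bound are assumptions for illustration; the PR does not state which tolerance `NativeQ4_0MatmulKernelParityTest` actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: absolute value without pulling in libm. */
static float absf_sketch(float v) { return v < 0.0f ? -v : v; }

/* Hypothetical parity check: returns 1 if every actual[i] satisfies
 *   |expected[i] - actual[i]| <= abs_tol + rel_tol * |expected[i]|
 * and 0 otherwise. */
static int allclose_sketch(const float *expected, const float *actual,
                           size_t n, float abs_tol, float rel_tol)
{
    assert(expected != NULL && actual != NULL);
    for (size_t i = 0; i < n; ++i) {
        const float diff = absf_sketch(expected[i] - actual[i]);
        const float bound = abs_tol + rel_tol * absf_sketch(expected[i]);
        if (!(diff <= bound)) {
            return 0;  /* written this way so a NaN difference also fails */
        }
    }
    return 1;
}
```

The relative term scales the allowed error with the magnitude of the reference value, which matters for long dot products such as FFN-sized rows, while the absolute term keeps the check meaningful for outputs near zero.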
🤖 Generated with Claude Code