feat(gemma): eager Q5_K packed path + Kotlin/Native board load path #176
Merged
FunctionGemma-270M ships as Q5_K_M, but GemmaMemSegConverter dequantized
Q5_K weights to FP32 on load ("no native matmul kernel yet for Q5_K"),
losing the memory savings and the in-kernel dequant. Upstream SKaiNET
0.29.1 now provides a first-class Q5_K packed matmul (Q5_KBlockTensorData
+ Q5KMatmulKernel: scalar/Panama/native), so keep Q5_K packed here too:
relayout GGUF bytes to block-major + wrap as Q5_KBlockTensorData (176 B/
block). Dispatch + lazy transpose reach it via DefaultCpuOps.
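For orientation, the 176 B/block figure and the relayout step can be sketched as below. The field sizes follow the ggml Q5_K super-block definition (256 weights per block). The function name, its parameters, and the reading of "block-major" as a transpose of the block grid (all rows' block 0 first, then all rows' block 1, and so on) are assumptions for illustration, not the actual relayoutKSeriesRowMajorToBlockMajor implementation.

```kotlin
// Q5_K super-block layout per the ggml definition: 256 weights per block.
const val QK_K = 256
const val Q5K_D_BYTES = 2            // fp16 super-block scale
const val Q5K_DMIN_BYTES = 2         // fp16 super-block min
const val Q5K_SCALES_BYTES = 12      // 8 scales + 8 mins, 6 bits each
const val Q5K_QH_BYTES = QK_K / 8    // one high (5th) bit per weight = 32
const val Q5K_QS_BYTES = QK_K / 2    // low 4 bits per weight = 128
const val Q5K_BLOCK_BYTES =
    Q5K_D_BYTES + Q5K_DMIN_BYTES + Q5K_SCALES_BYTES + Q5K_QH_BYTES + Q5K_QS_BYTES // 176

/**
 * Hypothetical helper (not the PR's function): transposes a grid of
 * fixed-size blocks from [row][blockInRow] order to [blockInRow][row] order.
 * ASSUMPTION: this is what "block-major" means here.
 * Uses only ByteArray.copyInto, which is available on every Kotlin target.
 */
fun relayoutRowMajorToBlockMajor(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int = Q5K_BLOCK_BYTES,
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) {
        "expected ${rows * blocksPerRow * blockBytes} bytes, got ${src.size}"
    }
    val dst = ByteArray(src.size)
    for (r in 0 until rows) {
        for (b in 0 until blocksPerRow) {
            val from = (r * blocksPerRow + b) * blockBytes
            val to = (b * rows + r) * blockBytes
            // Blocks are moved whole; their contents are never decoded.
            src.copyInto(dst, destinationOffset = to, startIndex = from, endIndex = from + blockBytes)
        }
    }
    return dst
}
```

At 176 bytes per 256 weights, a packed Q5_K tensor takes 5.5 bits per weight, versus 1024 bytes for the same 256 weights in FP32.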
- Bump skainet 0.28.1 -> 0.29.1 (source-of-truth for the llm-bom platform).
- settings.gradle.kts: mavenLocal first so a locally-published SKaiNET
0.29.1 (carrying the in-progress Q5_K kernel) shadows Maven Central until
it's released; Central remains the fallback.
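The repository ordering described above might look roughly like the following. This is a sketch only: the enclosing dependencyResolutionManagement block is an assumption about how the project's settings.gradle.kts is organized. What matters is that Gradle consults repositories in declaration order.

```kotlin
// settings.gradle.kts (illustrative fragment; the surrounding structure is assumed)
dependencyResolutionManagement {
    repositories {
        mavenLocal()    // consulted first: finds a locally published SKaiNET 0.29.1
        mavenCentral()  // fallback, and the only source needed once 0.29.1 is released
    }
}
```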
Verified (GemmaQ5KPackedParityTest, -PincludeIntegration): the Q5_K packed
path decodes FunctionGemma byte-identically to the FP32 baseline —
[262146, 236769, 3255, 718, 498, 1373, 262152, 106] -> `<tool_0>(state="on")
<end>` for "Turn the light on." (the known-good tool call), 0.81 tok/s on
the JVM host incl. prefill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ard path
The board binary is Kotlin/Native, but GemmaMemSegConverter (the NATIVE_OPTIMIZED packed-weight path) is jvmMain-only (java.lang.foreign). Move the reusable, platform-neutral pieces to commonMain so K/N can keep K-quant weights packed:
- GemmaQuantLayout.kt (commonMain): logicalShapeFor + relayoutKSeriesRowMajorToBlockMajor (now copyInto, KMP-safe) + packGemmaKQuant<T>(), which builds heap-packed Q4_K/Q5_K/Q6_KBlockTensorData directly (no MemSeg/Arena).
- GemmaMemSegConverter (jvmMain) now shares those commonMain helpers (dup removed); MemSeg/FFM conversion + FP32 fallbacks stay JVM-only.
- commonTest GemmaQuantLayoutTest: block-transpose relayout + packing, runs on every target.
Verified: gemma compiles for JVM + linuxX64; layout tests green (3).
Next (board integration): a commonMain convertGemmaWeightsPacked wired into the K/N load path (byte extraction differs: JVM IntArrayTensorData vs native Byte-backed), then a full K/N decode on the SL2610.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oad()
NATIVE_OPTIMIZED loads produce raw-byte quant tensors the network mapper can't consume; on JVM an external convertGemmaWeightsToMemSeg (FFM) handled that, but the Kotlin/Native board has no such path. Add a commonMain converter and make load() apply it, so load(NATIVE_OPTIMIZED) yields a runnable network on the board AND the JVM (previously it couldn't be built from raw-byte weights at all).
- GemmaPackedWeights.kt (commonMain): convertGemmaWeightsPacked packs Q4/5/6_K matmul weights to heap Q*_KBlockTensorData (packGemmaKQuant), dequants token_embd/output to FP32 (gathered, no transpose) and other quant types to FP32 [out,in]. No java.lang.foreign. Plus extractRawBytes, which reads the loader's bytes back across both backings (JVM IntArrayTensorData / native Byte-typed).
- GemmaNetworkLoader.load(): for NATIVE_OPTIMIZED, run convertGemmaWeightsPacked before applyWeightsToNetwork.
Verified on JVM AND linuxX64 (GemmaQuantLayoutTest, 4 tests each): relayout, packing, and the byte-extraction round-trip, so native byte extraction is executed, not just compiled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends GemmaQ5KPackedParityTest to also decode via GemmaNetworkLoader.load(NATIVE_OPTIMIZED) — the wired commonMain convertGemmaWeightsPacked (board) path, no MemSeg/Arena. All three paths (FP32 baseline, jvmMain MemSeg-packed, load() packed) produce the identical token sequence -> `<tool_0>(state="on")<end>` for "Turn the light on." Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Six real-model integration tests (RealGemmaLoad/Eager/BakeIrpa/ExternalParam/DequantDump + GemmaBehavioralAb) pointed at an old workspace path (/home/miso/projects/coral/sl2610-voice-cc-kt/models/...) and failed with "File not found" under -PincludeIntegration. Repoint them to the actual model location (SKaiNET-embedded/sl2610-function-calling/models/), matching GemmaQ5KPackedParityTest.
Verified: all 6 pass against skainet 0.30.0 (mavenLocal), -PincludeIntegration.
Wires the new SKaiNET Q5_K packed kernel into the eager Gemma runtime and adds the Kotlin/Native (board) weight-load path, so FunctionGemma-270M (Q5_K_M) runs eager with KV-cache + in-kernel Q5_K dequant, with no FP32 inflation.

What's here
- GemmaMemSegConverter keeps Q5_K weights packed (Q5_KBlockTensorData, 176 B/block) instead of dequantizing to FP32; runs the in-kernel dequant matmul.
- GemmaQuantLayout.kt (relayoutKSeriesRowMajorToBlockMajor + logicalShapeFor + packGemmaKQuant) and GemmaPackedWeights.kt (convertGemmaWeightsPacked + extractRawBytes), the K/N analogue of the jvmMain MemSeg converter (no java.lang.foreign). Wired into GemmaNetworkLoader.load(NATIVE_OPTIMIZED).
- skainet 0.28.1 → 0.29.1; mavenLocal() first in settings (Central fallback).

Verification
- GemmaQuantLayoutTest: relayout + packing + byte-extraction round-trip green on JVM and linuxX64 (native byte extraction executes, not just compiles).
- GemmaQ5KPackedParityTest: FP32 baseline, jvmMain MemSeg-packed, and the wired load(NATIVE_OPTIMIZED) path all decode FunctionGemma to the identical token sequence → `<tool_0>(state="on")<end>` for "Turn the light on."

Remaining (board)
- Full on-device FunctionGemma decode on the SL2610 (build the gemma stack for linuxArm64, run on device) + benchmark vs the IREE path.

🤖 Generated with Claude Code