Follow-up to #169.
Context
#169 fixed the crash when loading a Gemma GGUF with kernel-less quant weights (e.g. Q5_1, as in functiongemma-270m "Q5_K_M") under QuantPolicy.NATIVE_OPTIMIZED: GemmaMemSegConverter now dequantizes those tensors to FP32 instead of leaving raw bytes (which made linearProject -> ops.transpose throw "Transpose requires at least 2 dimensions").
That makes the model correct (skainet -m functiongemma-…-Q5_K_M.gguf "The capital of France is" → "Paris."), but for those specific tensors it gives up the packed memory/speed benefit — they round-trip through FP32 while Q4_K/Q6_K/Q4_0/Q8_0 stay packed and fast.
Ask
Add a packed SIMD matmul path for Q5_1 (and Q5_0 for completeness), mirroring the existing Q4_K/Q6_K chain, so NATIVE_OPTIMIZED keeps these weights packed instead of dequantizing.
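For reference, the per-block arithmetic such a kernel would need can be sketched in scalar Java. This is only a sketch under stated assumptions, not the proposed SKaiNET implementation. It assumes the ggml Q5_1 block layout as I understand it: 32 weights in 24 bytes, made up of an fp16 scale d, an fp16 minimum m, 4 bytes of high bits (qh) and 16 bytes of packed low nibbles (qs), with each weight decoded as q * d + m. The class and method names (Q51Reference, dequantizeBlock, dotPacked) are hypothetical, and the loop is scalar rather than SIMD.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Scalar reference for ggml-style Q5_1 blocks (hypothetical helper, not SKaiNET API). */
final class Q51Reference {
    static final int BLOCK_SIZE = 32;   // weights per block
    static final int BLOCK_BYTES = 24;  // assumed layout: fp16 d, fp16 m, 4 bytes qh, 16 bytes qs

    /** IEEE 754 half to float, spelled out so the sketch does not depend on newer JDK helpers. */
    static float halfToFloat(short h) {
        int bits = h & 0xFFFF;
        int sign = (bits & 0x8000) << 16;
        int exp = (bits >>> 10) & 0x1F;
        int man = bits & 0x3FF;
        if (exp == 0) {                      // zero or subnormal: value = man * 2^-24
            float v = man * (1.0f / (1 << 24));
            return sign != 0 ? -v : v;
        }
        if (exp == 31) {                     // infinity or NaN
            return Float.intBitsToFloat(sign | 0x7F800000 | (man << 13));
        }
        return Float.intBitsToFloat(sign | ((exp + 112) << 23) | (man << 13));
    }

    /** Decodes the block at byte offset off into out[outOff .. outOff + 31]. */
    static void dequantizeBlock(ByteBuffer packed, int off, float[] out, int outOff) {
        ByteBuffer le = packed.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        float d = halfToFloat(le.getShort(off));
        float m = halfToFloat(le.getShort(off + 2));
        int qh = le.getInt(off + 4);
        for (int j = 0; j < 16; j++) {
            int qs = le.get(off + 8 + j) & 0xFF;
            int q0 = (qs & 0x0F) | (((qh >>> j) & 1) << 4);        // element j
            int q1 = (qs >>> 4) | (((qh >>> (j + 16)) & 1) << 4);  // element j + 16
            out[outOff + j] = q0 * d + m;
            out[outOff + j + 16] = q1 * d + m;
        }
    }

    /** Dot product of one packed row (nBlocks blocks) with FP32 activations, without materializing FP32 weights. */
    static float dotPacked(ByteBuffer row, int nBlocks, float[] x) {
        ByteBuffer le = row.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        float acc = 0f;
        for (int b = 0; b < nBlocks; b++) {
            int off = b * BLOCK_BYTES;
            int xOff = b * BLOCK_SIZE;
            float d = halfToFloat(le.getShort(off));
            float m = halfToFloat(le.getShort(off + 2));
            int qh = le.getInt(off + 4);
            float sumQX = 0f;
            float sumX = 0f;
            for (int j = 0; j < 16; j++) {
                int qs = le.get(off + 8 + j) & 0xFF;
                int q0 = (qs & 0x0F) | (((qh >>> j) & 1) << 4);
                int q1 = (qs >>> 4) | (((qh >>> (j + 16)) & 1) << 4);
                float x0 = x[xOff + j];
                float x1 = x[xOff + j + 16];
                sumQX += q0 * x0 + q1 * x1;
                sumX += x0 + x1;
            }
            // sum((q * d + m) * x) == d * sum(q * x) + m * sum(x)
            acc += d * sumQX + m * sumX;
        }
        return acc;
    }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(BLOCK_BYTES).order(ByteOrder.LITTLE_ENDIAN);
        block.putShort(0, (short) 0x3800);  // d = 0.5
        block.putShort(2, (short) 0xBC00);  // m = -1.0
        block.putInt(4, 0x00010001);        // high bit set for elements 0 and 16
        for (int j = 0; j < 16; j++) block.put(8 + j, (byte) ((j << 4) | (15 - j)));
        float[] x = new float[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) x[i] = 0.25f * i - 3f;
        float[] w = new float[BLOCK_SIZE];
        dequantizeBlock(block, 0, w, 0);
        float ref = 0f;
        for (int i = 0; i < BLOCK_SIZE; i++) ref += w[i] * x[i];
        System.out.println("packed=" + dotPacked(block, 1, x) + " reference=" + ref);
    }
}
```

Under the same assumed layout, a packed block takes 24 bytes for 32 weights (6 bits per weight) against 128 bytes once dequantized to FP32. Q5_0 should differ only in dropping m (22-byte blocks) and decoding as (q - 16) * d. Comparing dotPacked against the dequantize-then-multiply result is also the shape of check the kernel unit test would perform.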
This spans two repos:
- Core backend (SKaiNET-developers/SKaiNET):
  - a Q5_1 packed tensor-data type (cf. Q4_KBlockTensorData / Q6_KBlockTensorData)
  - matmulQ5_1Vec in JvmQuantizedVectorKernels (FP32 activations × packed Q5_1 weights)
  - a lazy Q5_1 transpose branch in DefaultCpuOpsJvm.transpose (shape-swap, no copy), like the Q4_K/Q6_K branches
- Transformers (this repo):
  - a Q5_1 (and Q5_0) case in GemmaMemSegConverter.convertOne that produces the packed type (with any required GGUF row-major → block-major re-layout), replacing the dequant fallback added in #169 (fix(gemma): dequant kernel-less quant types in NATIVE_OPTIMIZED instead of leaving raw bytes).
Acceptance
- A Q5_1 Gemma checkpoint runs under NATIVE_OPTIMIZED with the Q5_1 weights packed (no FP32 blow-up), output matching the DEQUANTIZE_TO_FP32 reference token-for-token.
- A kernel unit test (FP32 activations × packed Q5_1) vs an FP32-dequant reference, mirroring the Q4_K/Q6_K kernel tests.
Notes
- functiongemma-270m: ~0.67 tok/s with the dequant fallback vs ~0.23 when forcing a global FP32 dequant; a packed Q5_1 kernel should close the remaining gap to the fully-packed models.