Packed SIMD matmul kernel for Q5_1 / Q5_0 (keep NATIVE_OPTIMIZED fast, not dequant-to-FP32) #170

@michalharakal

Follow-up to #169.

Context

#169 fixed the crash when loading a Gemma GGUF that contains quant weights with no kernel (e.g. Q5_1, as in functiongemma-270m "Q5_K_M") under QuantPolicy.NATIVE_OPTIMIZED. GemmaMemSegConverter now dequantizes those tensors to FP32 instead of leaving them as raw bytes; the raw bytes had made linearProject -> ops.transpose throw "Transpose requires at least 2 dimensions".

That makes the model correct (skainet -m functiongemma-…-Q5_K_M.gguf "The capital of France is" → "Paris."), but those specific tensors lose the packed memory/speed benefit: they round-trip through FP32, while Q4_K/Q6_K/Q4_0/Q8_0 stay packed and fast.
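To put a rough number on that cost, here is a small worked calculation. It assumes the standard ggml block layouts (32 weights per block; Q5_1 stores an fp16 scale, an fp16 minimum, 4 bytes of high bits and 16 bytes of low nibbles; Q5_0 omits the minimum). The Python names are illustrative only and do not exist in SKaiNET.

```python
# Assumed ggml block layouts; 32 weights per block.
QK = 32
Q5_0_BLOCK_BYTES = 2 + 4 + 16       # fp16 scale, 32 high bits, 32 low nibbles
Q5_1_BLOCK_BYTES = 2 + 2 + 4 + 16   # fp16 scale, fp16 min, high bits, low nibbles
FP32_BLOCK_BYTES = QK * 4           # the same 32 weights after dequantization


def bits_per_weight(block_bytes):
    """Storage cost of one weight, in bits, for a packed block size."""
    return block_bytes * 8 / QK


def fp32_expansion(block_bytes):
    """How many times larger the tensor gets once dequantized to FP32."""
    return FP32_BLOCK_BYTES / block_bytes


if __name__ == "__main__":
    for name, size in (("Q5_0", Q5_0_BLOCK_BYTES), ("Q5_1", Q5_1_BLOCK_BYTES)):
        print(name, bits_per_weight(size), "bits/weight,",
              round(fp32_expansion(size), 2), "x larger as FP32")
```

Under these assumptions a Q5_1 tensor takes 6 bits per weight, so the FP32 fallback makes each affected tensor roughly 5.3 times larger in memory.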

Ask

Add a packed SIMD matmul path for Q5_1 (and Q5_0 for completeness), mirroring the existing Q4_K/Q6_K chain, so NATIVE_OPTIMIZED keeps these weights packed instead of dequantizing.
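As a rough illustration of what such a kernel has to compute, here is a scalar reference sketch of FP32 activations × packed Q5_1 weights. It is written in Python for brevity, assumes the standard ggml Q5_1 block layout (24 bytes: fp16 scale `d`, fp16 minimum `m`, 32 high bits `qh`, 16 bytes of low nibbles `qs`), and uses hypothetical function names; it is not the SKaiNET API and not a SIMD implementation.

```python
import struct

QK = 32           # weights per Q5_1 block
BLOCK_BYTES = 24  # assumed ggml layout: fp16 d, fp16 m, u32 qh, 16 x u8 qs


def unpack_q5_1_block(block):
    """Return (d, m, codes) where codes are the 32 unsigned 5-bit values."""
    d, m = struct.unpack_from("<ee", block, 0)
    (qh,) = struct.unpack_from("<I", block, 4)
    qs = block[8:24]
    codes = [0] * QK
    for j in range(16):
        # Low nibble -> weight j, high nibble -> weight j + 16;
        # the fifth bit of each code comes from qh.
        codes[j] = (qs[j] & 0x0F) | (((qh >> j) & 1) << 4)
        codes[j + 16] = (qs[j] >> 4) | (((qh >> (j + 16)) & 1) << 4)
    return d, m, codes


def dot_q5_1(activations, row):
    """Dot product of FP32 activations with one packed Q5_1 weight row."""
    total = 0.0
    for b in range(len(row) // BLOCK_BYTES):
        d, m, codes = unpack_q5_1_block(row[b * BLOCK_BYTES:(b + 1) * BLOCK_BYTES])
        a = activations[b * QK:(b + 1) * QK]
        # w = d * code + m, so sum(a * w) = d * sum(a * code) + m * sum(a).
        total += d * sum(ai * ci for ai, ci in zip(a, codes)) + m * sum(a)
    return total


def matvec_q5_1(activations, rows):
    """One output per packed weight row; weights are never expanded to FP32."""
    return [dot_q5_1(activations, row) for row in rows]
```

Factoring `d` and `m` out of the inner sum is what lets a packed kernel work on the raw 5-bit codes per block; a vectorized version would presumably apply the same identity across SIMD lanes.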

This spans two repos:

  1. Core backend (SKaiNET-developers/SKaiNET):
    • a Q5_1 packed tensor-data type (cf. Q4_KBlockTensorData / Q6_KBlockTensorData)
    • matmulQ5_1Vec in JvmQuantizedVectorKernels (FP32 activations × packed Q5_1 weights)
    • a lazy Q5_1 transpose branch in DefaultCpuOpsJvm.transpose (shape-swap, no copy), like the Q4_K/Q6_K branches
  2. Transformers (this repo):

Acceptance

  • A Q5_1 Gemma checkpoint runs under NATIVE_OPTIMIZED with the Q5_1 weights packed (no FP32 blow-up), output matching the DEQUANTIZE_TO_FP32 reference token-for-token.
  • A kernel unit test (FP32 activations × packed Q5_1) vs an FP32-dequant reference, mirroring the Q4_K/Q6_K kernel tests.
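The kernel test in the second criterion needs packed fixtures and an FP32-dequant reference. The sketch below shows one possible shape for both, in Python and under the same assumed ggml Q5_1 layout (24-byte blocks of 32 weights). The quantizer follows the usual min/max scheme with 31 steps; every name and the tolerance are hypothetical and should be aligned with the existing Q4_K/Q6_K kernel tests.

```python
import random
import struct

QK = 32  # weights per Q5_1 block


def quantize_q5_1_block(x):
    """Pack 32 floats into one 24-byte Q5_1 block (fp16 d, fp16 m, qh, qs)."""
    lo, hi = min(x), max(x)
    d = (hi - lo) / 31.0
    inv = 1.0 / d if d else 0.0
    codes = [min(31, int((v - lo) * inv + 0.5)) for v in x]
    qh, qs = 0, bytearray(16)
    for j in range(16):
        c0, c1 = codes[j], codes[j + 16]
        qs[j] = (c0 & 0x0F) | ((c1 & 0x0F) << 4)
        qh |= (c0 >> 4) << j          # fifth bit of weight j
        qh |= (c1 >> 4) << (j + 16)   # fifth bit of weight j + 16
    return struct.pack("<eeI", d, lo, qh) + bytes(qs)


def dequantize_q5_1_block(block):
    """Expand one packed block back to 32 floats: w = code * d + m."""
    d, m = struct.unpack_from("<ee", block, 0)
    (qh,) = struct.unpack_from("<I", block, 4)
    qs = block[8:24]
    out = [0.0] * QK
    for j in range(16):
        out[j] = ((qs[j] & 0x0F) | (((qh >> j) & 1) << 4)) * d + m
        out[j + 16] = ((qs[j] >> 4) | (((qh >> (j + 16)) & 1) << 4)) * d + m
    return out


def reference_dot(activations, row):
    """FP32-dequant reference: expand the whole row, then take a plain dot."""
    weights = []
    for off in range(0, len(row), 24):
        weights.extend(dequantize_q5_1_block(row[off:off + 24]))
    return sum(a * w for a, w in zip(activations, weights))


def assert_kernel_matches(kernel, n_blocks=4, seed=0, rel_tol=1e-4):
    """Compare kernel(activations, packed_row) against the dequant reference."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n_blocks * QK)]
    a = [rng.uniform(-1.0, 1.0) for _ in range(n_blocks * QK)]
    row = b"".join(quantize_q5_1_block(x[i:i + QK]) for i in range(0, len(x), QK))
    expected = reference_dot(a, row)
    got = kernel(a, row)
    assert abs(got - expected) <= rel_tol * max(1.0, abs(expected)), (got, expected)
```

The kernel under test is passed in as a callable, so the same check could be pointed at a scalar and a vectorized implementation alike.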

Notes
