Packed SIMD matmul kernel for Q5_1 / Q5_0 (keep NATIVE_OPTIMIZED fast, not dequant-to-FP32)

Follow-up to #169.

## Context

#169 fixed the **crash** when loading a Gemma GGUF whose weights use a quant type with no packed kernel (e.g. **Q5_1**, as in `functiongemma-270m` "Q5_K_M") under `QuantPolicy.NATIVE_OPTIMIZED`. Previously the raw bytes were left in place, which made `linearProject -> ops.transpose` throw "Transpose requires at least 2 dimensions"; `GemmaMemSegConverter` now **dequantizes** those tensors to FP32 instead.

That makes the model **correct** (`skainet -m functiongemma-…-Q5_K_M.gguf "The capital of France is"` → "Paris."), but those specific tensors lose the memory and speed benefit of staying packed: they are expanded to FP32, while `Q4_K`, `Q6_K`, `Q4_0` and `Q8_0` weights stay packed and fast.
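
For scale, here is the per-weight storage cost, assuming the standard ggml block layout (32 weights per block; Q5_1 stores an FP16 scale, an FP16 minimum, 4 bytes of high bits and 16 bytes of low nibbles; Q5_0 drops the minimum). SKaiNET's own packed types may add padding, so treat these as lower bounds:

$$
\text{Q5\_1}: \frac{(2 + 2 + 4 + 16) \cdot 8}{32} = 6 \ \text{bits/weight},
\qquad
\text{Q5\_0}: \frac{(2 + 4 + 16) \cdot 8}{32} = 5.5 \ \text{bits/weight}
$$

Against 32 bits/weight for FP32, the dequant fallback makes each affected tensor roughly $32/6 \approx 5.3\times$ (Q5_1) or $32/5.5 \approx 5.8\times$ (Q5_0) larger in memory.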

## Ask

Add a **packed SIMD matmul path for Q5_1** (and `Q5_0` for completeness), mirroring the existing `Q4_K`/`Q6_K` chain, so `NATIVE_OPTIMIZED` keeps these weights packed instead of dequantizing.

This spans two repos:

1. **Core backend (`SKaiNET-developers/SKaiNET`):**
   - a `Q5_1` packed tensor-data type (cf. `Q4_KBlockTensorData` / `Q6_KBlockTensorData`)
   - `matmulQ5_1Vec` in `JvmQuantizedVectorKernels` (FP32 activations × packed Q5_1 weights)
   - a lazy `Q5_1` transpose branch in `DefaultCpuOpsJvm.transpose` (shape-swap, no copy), like the Q4_K/Q6_K branches
2. **Transformers (this repo):**
   - a `Q5_1` (and `Q5_0`) case in `GemmaMemSegConverter.convertOne` that produces the packed type (with any required GGUF row-major → block-major re-layout), replacing the dequant fallback added in #169.
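
As a starting point for the kernel, here is a minimal scalar sketch of the per-row dot product, written in Java for brevity. It assumes the ggml `block_q5_1` layout (little-endian: FP16 `d`, FP16 `m`, 32 high bits `qh`, 16 bytes of low nibbles `qs`, 24 bytes per 32 weights). The class and method names are hypothetical and not part of SKaiNET. The useful property is that `w_j = d * q_j + m` lets the kernel factor each block as `d * Σ(q_j * x_j) + m * Σ(x_j)`, so no FP32 weights are ever materialized. A real `matmulQ5_1Vec` would presumably vectorize the inner loop.

```java
import java.nio.ByteBuffer;

/** Scalar reference for Q5_1 blocks. Layout assumed from ggml's block_q5_1; names are hypothetical. */
final class Q51Sketch {
    static final int QK = 32;          // weights per block
    static final int BLOCK_BYTES = 24; // fp16 d + fp16 m + 4 bytes qh + 16 bytes qs

    /** Decode an IEEE 754 half-precision value held in the low 16 bits of h. */
    static float halfToFloat(int h) {
        int exp = (h >>> 10) & 0x1F;
        int man = h & 0x3FF;
        float v;
        if (exp == 0) v = Math.scalb((float) man, -24);          // zero or subnormal
        else if (exp == 31) v = (man == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        else v = Math.scalb(1f + man / 1024f, exp - 15);          // normal
        return (h & 0x8000) != 0 ? -v : v;
    }

    /** 5-bit quant of element j (0..31): low nibble from qs, fifth bit from bit j of qh. */
    static int quant(ByteBuffer b, int off, int j) {
        int qh = b.getInt(off + 4);
        int packed = b.get(off + 8 + (j & 15)) & 0xFF;
        int low = (j < 16) ? (packed & 0x0F) : (packed >>> 4);
        return low | (((qh >>> j) & 1) << 4);
    }

    /** FP32 reference: expand one block, w_j = d * q_j + m. */
    static void dequantBlock(ByteBuffer b, int off, float[] out, int outOff) {
        float d = halfToFloat(b.getShort(off) & 0xFFFF);
        float m = halfToFloat(b.getShort(off + 2) & 0xFFFF);
        for (int j = 0; j < QK; j++) out[outOff + j] = d * quant(b, off, j) + m;
    }

    /**
     * Dot product of FP32 activations x with one packed row of k weights.
     * The buffer must be little-endian and k a multiple of 32.
     */
    static float dotRow(ByteBuffer b, int rowOff, float[] x, int k) {
        float acc = 0f;
        for (int blk = 0; blk < k / QK; blk++) {
            int off = rowOff + blk * BLOCK_BYTES;
            float d = halfToFloat(b.getShort(off) & 0xFFFF);
            float m = halfToFloat(b.getShort(off + 2) & 0xFFFF);
            float sumQx = 0f, sumX = 0f;
            for (int j = 0; j < QK; j++) {
                float xj = x[blk * QK + j];
                sumQx += quant(b, off, j) * xj; // integer quant times activation
                sumX += xj;                      // needed once per block for the m term
            }
            acc += d * sumQx + m * sumX;
        }
        return acc;
    }
}
```

Q5_0 would follow the same shape under the same assumption: there is no `m` field (22-byte blocks) and the quant is offset by 16, so `w_j = d * (q_j - 16)`.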

## Acceptance

- A `Q5_1` Gemma checkpoint runs under `NATIVE_OPTIMIZED` with the Q5_1 weights **packed** (no FP32 blow-up), output matching the `DEQUANTIZE_TO_FP32` reference token-for-token.
- A kernel unit test (FP32 activations × packed Q5_1) vs an FP32-dequant reference, mirroring the Q4_K/Q6_K kernel tests.
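
One detail for the kernel test: a packed kernel typically sums in a different order than "dequantize, then FP32 matmul", so the two results will usually not be bit-identical and the comparison needs a tolerance. A minimal sketch of such a check follows. The helper name is hypothetical and the tolerance values are placeholders, not the ones used by the existing Q4_K/Q6_K tests.

```java
/** Tolerance check for kernel-vs-reference outputs. Hypothetical helper, not SKaiNET API. */
final class KernelCheck {
    /**
     * Returns the worst error across all elements, expressed as a fraction of the
     * allowed error (absTol + relTol * |expected|). A result <= 1 means every
     * element is within tolerance. absTol must be > 0 so zero expectations work.
     */
    static float worstError(float[] actual, float[] expected, float relTol, float absTol) {
        if (actual.length != expected.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float worst = 0f;
        for (int i = 0; i < actual.length; i++) {
            float allowed = absTol + relTol * Math.abs(expected[i]);
            float err = Math.abs(actual[i] - expected[i]) / allowed;
            if (Float.isNaN(err)) return Float.POSITIVE_INFINITY; // NaN never passes
            worst = Math.max(worst, err);
        }
        return worst;
    }
}
```

A test would then assert something like `KernelCheck.worstError(packedOut, referenceOut, 1e-4f, 1e-6f) <= 1f`, with the tolerances tuned to the reduction length.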

## Notes

- The dequant fallback from #169 should remain as the correctness path for any quant type without a packed kernel — this issue just removes Q5_1/Q5_0 from that fallback.
- Current measured impact on `functiongemma-270m`: ~0.67 tok/s with the dequant fallback vs ~0.23 tok/s when forcing a global FP32 dequant (roughly 2.9× faster). A packed Q5_1 kernel should close the remaining gap to the fully-packed models.
