feat(backend-cpu): packed Q5_1 / Q5_0 matmul kernels + lazy transpose by michalharakal · Pull Request #709 · SKaiNET-developers/SKaiNET

michalharakal · 2026-06-07T22:34:49Z

Closes #708.

What

Adds a packed matmul path for the GGML Q5_1 and Q5_0 quantized formats, mirroring the existing Q4_0 / Q8_0 / Q4_K / Q6_K chain, so these weights can be consumed packed instead of dequantized to FP32. Motivated by functiongemma-270m (attention/FFN weights are Q5_1), where the lack of a packed kernel forced a full FP32 dequant downstream (SKaiNET-developers/SKaiNET-transformers#169 / #170).
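The size of the memory win follows from the block sizes listed under Changes (32 weights per block, 22 B for Q5_0, 24 B for Q5_1). A back-of-the-envelope sketch, assuming a hypothetical 1024 x 1024 weight matrix (not a shape taken from functiongemma-270m) and helper names invented for illustration:

```python
BLOCK = 32  # weights per Q5 block


def packed_bytes(out_dim, in_dim, block_bytes):
    """Bytes needed to keep an [out, in] weight matrix packed."""
    assert in_dim % BLOCK == 0, "in must be a multiple of the block size"
    return out_dim * (in_dim // BLOCK) * block_bytes


def fp32_bytes(out_dim, in_dim):
    """Bytes needed after a full FP32 dequant."""
    return out_dim * in_dim * 4


# Hypothetical shape, chosen only to make the arithmetic concrete.
out_dim, in_dim = 1024, 1024
q5_1 = packed_bytes(out_dim, in_dim, 24)   # 786432 bytes, 6.0 bits/weight
q5_0 = packed_bytes(out_dim, in_dim, 22)   # 720896 bytes, 5.5 bits/weight
fp32 = fp32_bytes(out_dim, in_dim)         # 4194304 bytes, 32 bits/weight
```

Under these assumptions the FP32 copy is about 5.3x larger than the Q5_1 bytes and about 5.8x larger than the Q5_0 bytes.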

Changes

- TensorEncoding: add Q5_0 (22 B/block) and Q5_1 (24 B/block) data objects.
- New Q5_1TensorData / Q5_0TensorData interfaces + Q5_{1,0}BlockTensorData, with dequantizeBlock matching DequantOps.dequantQ5_{1,0}FromBytes exactly:
  - Q5_1: w = d * (code + (highBit << 4)) + m
  - Q5_0: w = d * (code + (highBit << 4) - 16)
- JvmQuantizedVectorKernels.matmulQ5_1Vec / matmulQ5_0Vec: row-major [out, in] packing (the `in` weights of output row o are stored as in/32 contiguous blocks), so the kernel consumes raw GGUF bytes directly, with no block-major re-layout.
- DefaultCpuOpsJvm: matmul dispatch branches for Q5_1TensorData / Q5_0TensorData, and lazy transpose branches (pure shape swap, packed bytes kept as-is) so `ops.matmul(x, ops.transpose(W))` runs without a dequant round-trip.
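The two dequant formulas above can be sketched as a language-neutral reference in Python. This is not the SKaiNET code: the function names are hypothetical, and the byte order inside a block (fp16 d, fp16 m for Q5_1 only, 4 bytes of high bits, 16 bytes of packed nibbles, little-endian) is assumed to follow the ggml reference layout.

```python
import struct

BLOCK = 32  # weights per block


def _codes(qh, qs):
    """Rebuild the 32 five-bit values from 16 nibble bytes and 32 high bits."""
    out = [0] * BLOCK
    for j in range(16):
        # Weight j: low nibble of byte j, high bit j.
        out[j] = (qs[j] & 0x0F) | (((qh >> j) & 1) << 4)
        # Weight j + 16: high nibble of byte j, high bit j + 16.
        out[j + 16] = (qs[j] >> 4) | (((qh >> (j + 16)) & 1) << 4)
    return out


def dequant_q5_1_block(block):
    """24-byte block: w = d * (code + (highBit << 4)) + m."""
    assert len(block) == 24
    d, m = struct.unpack_from("<ee", block, 0)   # fp16 scale and minimum
    (qh,) = struct.unpack_from("<I", block, 4)   # one high bit per weight
    return [d * c + m for c in _codes(qh, block[8:24])]


def dequant_q5_0_block(block):
    """22-byte block: w = d * (code + (highBit << 4) - 16)."""
    assert len(block) == 22
    (d,) = struct.unpack_from("<e", block, 0)    # fp16 scale, no minimum
    (qh,) = struct.unpack_from("<I", block, 2)
    return [d * (c - 16) for c in _codes(qh, block[6:22])]
```

For example, with d = 0.5, m = 1.0, a low nibble of 1 and the high bit set, the five-bit value is 17: Q5_1 yields 0.5 * 17 + 1.0 = 9.5, and Q5_0 yields 0.5 * (17 - 16) = 0.5.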

Layout note

Q4_K/Q6_K kernels are block-major and require the converter to re-layout. Q5_1/Q5_0 are intentionally row-major so the downstream converter case (SKaiNET-transformers#170) just wraps the raw bytes — no re-layout.
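To illustrate why row-major packing lets a kernel read the raw bytes in place: block b of output row o starts at byte offset (o * in/32 + b) * 24 for Q5_1, so a matrix-vector product can walk each row's blocks without any re-layout. A minimal sketch, assuming the ggml in-block byte order and a hypothetical function name; the real matmulQ5_1Vec may be organized differently.

```python
import struct

BLOCK = 32
Q5_1_BLOCK_BYTES = 24


def matvec_q5_1_row_major(packed, x, out_dim, in_dim):
    """y[o] = sum_i W[o, i] * x[i], reading W from row-major Q5_1 blocks."""
    assert in_dim % BLOCK == 0
    blocks_per_row = in_dim // BLOCK
    assert len(packed) == out_dim * blocks_per_row * Q5_1_BLOCK_BYTES
    y = [0.0] * out_dim
    for o in range(out_dim):
        acc = 0.0
        for b in range(blocks_per_row):
            # Row-major: the blocks of row o are contiguous.
            off = (o * blocks_per_row + b) * Q5_1_BLOCK_BYTES
            d, m = struct.unpack_from("<ee", packed, off)
            (qh,) = struct.unpack_from("<I", packed, off + 4)
            code_dot = 0.0   # sum of code * x over this block
            x_sum = 0.0      # sum of x over this block
            for j in range(16):
                q = packed[off + 8 + j]
                lo = (q & 0x0F) | (((qh >> j) & 1) << 4)
                hi = (q >> 4) | (((qh >> (j + 16)) & 1) << 4)
                x_lo = x[b * BLOCK + j]
                x_hi = x[b * BLOCK + j + 16]
                code_dot += lo * x_lo + hi * x_hi
                x_sum += x_lo + x_hi
            # sum((d * code + m) * x) == d * sum(code * x) + m * sum(x)
            acc += d * code_dot + m * x_sum
        y[o] = acc
    return y
```

Factoring d and m out of the inner loop is a choice made for this sketch; it keeps the weights packed and applies the scale once per block.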

Tests

Q5MatmulDispatchTest: packed Q5_1/Q5_0 matmul through ops.matmul(x, ops.transpose(W)) matches the FP32-dequant matmul to <1e-3, single- and multi-batch. The FP32 reference is dequantized inline in the test (independent of the data-type code under test), matching ggml/DequantOps. Existing Q8_0 / Q4_0 / MemSeg / transpose tests stay green; affected modules compile clean (no sealed-when exhaustiveness breaks).
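The shape of such a check can be sketched as follows, here for Q5_0. Everything in it is an assumption made for illustration: `PackedQ5_0`, `transpose`, `matmul_packed`, and `max_abs_diff` are hypothetical names, the convention that a lazily transposed weight of logical shape [in, out] is still backed by row-major [out, in] blocks is one possible reading of "pure shape swap", and the random blocks stand in for real GGUF data.

```python
import random
import struct
from dataclasses import dataclass

BLOCK, Q5_0_BYTES = 32, 22


@dataclass(frozen=True)
class PackedQ5_0:
    shape: tuple   # logical shape seen by matmul
    data: bytes    # row-major [out, in] Q5_0 blocks, never re-laid out


def transpose(w):
    """Lazy transpose: swap the logical shape, share the packed bytes."""
    return PackedQ5_0((w.shape[1], w.shape[0]), w.data)


def matmul_packed(x, wt):
    """x: [batch][in]; wt: transposed weight with logical shape [in, out]."""
    in_dim, out_dim = wt.shape
    nb = in_dim // BLOCK
    y = [[0.0] * out_dim for _ in x]
    for bi, xv in enumerate(x):
        for o in range(out_dim):
            acc = 0.0
            for b in range(nb):
                off = (o * nb + b) * Q5_0_BYTES
                (d,) = struct.unpack_from("<e", wt.data, off)
                (qh,) = struct.unpack_from("<I", wt.data, off + 2)
                dot = 0.0
                for j in range(16):
                    q = wt.data[off + 6 + j]
                    lo = ((q & 0x0F) | (((qh >> j) & 1) << 4)) - 16
                    hi = ((q >> 4) | (((qh >> (j + 16)) & 1) << 4)) - 16
                    dot += lo * xv[b * BLOCK + j] + hi * xv[b * BLOCK + j + 16]
                acc += d * dot
            y[bi][o] = acc
    return y


def dequant_fp32(data, out_dim, in_dim):
    """Inline FP32 reference, written separately from the packed path."""
    nb = in_dim // BLOCK
    w = []
    for o in range(out_dim):
        row = [0.0] * in_dim
        for b in range(nb):
            off = (o * nb + b) * Q5_0_BYTES
            d = struct.unpack_from("<e", data, off)[0]
            qh = int.from_bytes(data[off + 2:off + 6], "little")
            for i in range(BLOCK):
                byte = data[off + 6 + (i % 16)]
                nibble = byte & 0x0F if i < 16 else byte >> 4
                row[b * BLOCK + i] = d * (nibble + (((qh >> i) & 1) << 4) - 16)
        w.append(row)
    return w


def max_abs_diff(out_dim, in_dim, batch, seed=0):
    """Largest |packed - reference| over all outputs for random inputs."""
    rng = random.Random(seed)
    data = b"".join(
        struct.pack("<e", rng.uniform(0.01, 0.1))
        + rng.getrandbits(32).to_bytes(4, "little")
        + bytes(rng.getrandbits(8) for _ in range(16))
        for _ in range(out_dim * in_dim // BLOCK)
    )
    w = PackedQ5_0((out_dim, in_dim), data)
    x = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(batch)]
    got = matmul_packed(x, transpose(w))
    ref_w = dequant_fp32(data, out_dim, in_dim)
    ref = [[sum(r[i] * xv[i] for i in range(in_dim)) for r in ref_w] for xv in x]
    return max(abs(g - r) for gr, rr in zip(got, ref) for g, r in zip(gr, rr))
```

A single-batch and a multi-batch run would then each be checked with `max_abs_diff(...) < 1e-3`.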

Follow-ups

- Scalar inner loop keeps weights packed (the memory win); SIMD vectorization of the dequant+dot loop is a follow-up.
- The transformers converter case that produces Q5_1BlockTensorData (replacing the Model Sharing #169 dequant fallback) lands in SKaiNET-transformers#170 once this is published.

🤖 Generated with Claude Code

Closes #708.

Adds a packed SIMD-path (currently scalar) matmul for the GGML Q5_1 and Q5_0 quantized formats, mirroring the existing Q4_0/Q8_0/Q4_K/Q6_K chain, so these weights can be consumed packed instead of dequantized to FP32 (avoids the FP32 memory blow-up; e.g. functiongemma-270m's attention/FFN weights are Q5_1).

Changes:
- TensorEncoding: add Q5_0 (22 B/block) and Q5_1 (24 B/block) data objects.
- New Q5_1TensorData / Q5_0TensorData interfaces + Q5_{1,0}BlockTensorData, with dequantizeBlock matching DequantOps.dequantQ5_{1,0}FromBytes exactly (w = d*(code + (highBit<<4)) + m for Q5_1; d*(code + (highBit<<4) - 16) for Q5_0).
- JvmQuantizedVectorKernels.matmulQ5_1Vec / matmulQ5_0Vec: row-major [out, in] packing (output row o's `in` weights are in/32 contiguous blocks), so the kernel consumes raw GGUF bytes directly, with no block-major re-layout.
- DefaultCpuOpsJvm: matmul dispatch branches for Q5_1TensorData / Q5_0TensorData, and lazy transpose branches (pure shape swap, keep packed bytes) so `ops.matmul(x, ops.transpose(W))` runs without a dequant round-trip.

Layout note: Q4_K/Q6_K kernels are block-major and need a converter re-layout; Q5_1/Q5_0 are intentionally row-major so the downstream converter case (SKaiNET-transformers#170) just wraps the raw bytes.

Tests (Q5MatmulDispatchTest): packed Q5_1/Q5_0 matmul through ops.matmul(x, transpose(W)) matches the FP32-dequant matmul to <1e-3, across single/multi-batch, with the FP32 reference dequantized inline (independent of the data-type code under test). Existing Q8_0/Q4_0/MemSeg/transpose tests stay green.

Scalar inner loop keeps the weights packed (the memory win); SIMD vectorization of the dequant+dot loop is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-07T22:38:45Z

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

Operator documentation: docs/modules/operators/_generated_/
JSON schema output: operators.json

Artifacts:

Download the documentation-preview-709 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

michalharakal merged commit 586db77 into develop Jun 8, 2026
11 checks passed

michalharakal deleted the feature/708-q5-packed-matmul-kernels branch June 8, 2026 10:16

michalharakal merged 1 commit into develop from feature/708-q5-packed-matmul-kernels
