feat(backend-cpu): packed Q5_1 / Q5_0 matmul kernels + lazy transpose #709

Merged
michalharakal merged 1 commit into develop from feature/708-q5-packed-matmul-kernels
Jun 8, 2026
Conversation

@michalharakal

Contributor

Closes #708.

What

Adds a packed matmul path for the GGML Q5_1 and Q5_0 quantized formats, mirroring the existing Q4_0 / Q8_0 / Q4_K / Q6_K chain, so these weights can be consumed packed instead of dequantized to FP32. Motivated by functiongemma-270m (attention/FFN weights are Q5_1), where the lack of a packed kernel forced a full FP32 dequant downstream (SKaiNET-developers/SKaiNET-transformers#169 / #170).

Changes

  • TensorEncoding: add Q5_0 (22 B/block) and Q5_1 (24 B/block) data objects.
  • New Q5_1TensorData / Q5_0TensorData interfaces + Q5_{1,0}BlockTensorData, with dequantizeBlock matching DequantOps.dequantQ5_{1,0}FromBytes exactly:
    • Q5_1: w = d * (code + (highBit << 4)) + m
    • Q5_0: w = d * (code + (highBit << 4) - 16)
  • JvmQuantizedVectorKernels.matmulQ5_1Vec / matmulQ5_0Vec: row-major [out, in] packing (output row o's in weights are in/32 contiguous blocks), so the kernel consumes raw GGUF bytes directly — no block-major re-layout.
  • DefaultCpuOpsJvm: matmul dispatch branches for Q5_1TensorData / Q5_0TensorData, and lazy transpose branches (pure shape swap, keep packed bytes) so ops.matmul(x, ops.transpose(W)) runs without a dequant round-trip.
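The two dequantization formulas in the list above can be sketched as follows. This is a minimal illustration, not the SKaiNET code: the function names (`halfToFloat`, `dequantQ5Block`, and the byte readers) are hypothetical, and the byte layout is assumed to be the ggml one (little-endian fp16 `d`, then fp16 `m` for Q5_1 only, then 4 bytes holding the 32 high bits, then 16 bytes of packed 4-bit codes), which is consistent with the 22 B and 24 B block sizes stated above.

```kotlin
import kotlin.math.pow

// Hypothetical helper names for illustration; not the SKaiNET implementation.
const val QK = 32 // weights per block

// IEEE 754 half precision (low 16 bits of an Int) to Float.
fun halfToFloat(h: Int): Float {
    val sign = if ((h and 0x8000) != 0) -1f else 1f
    val exp = (h ushr 10) and 0x1F
    val mant = h and 0x3FF
    return when (exp) {
        0 -> sign * mant * 2f.pow(-24) // zero and subnormals
        31 -> if (mant == 0) sign * Float.POSITIVE_INFINITY else Float.NaN
        else -> sign * (1f + mant / 1024f) * 2f.pow(exp - 15)
    }
}

fun readU16LE(b: ByteArray, off: Int): Int =
    (b[off].toInt() and 0xFF) or ((b[off + 1].toInt() and 0xFF) shl 8)

fun readU32LE(b: ByteArray, off: Int): Int =
    readU16LE(b, off) or (readU16LE(b, off + 2) shl 16)

// Dequantizes one block starting at byte `off` into out[outOff until outOff + 32].
// hasMin = true  -> Q5_1 (24 bytes): w = d * (code + (highBit << 4)) + m
// hasMin = false -> Q5_0 (22 bytes): w = d * (code + (highBit << 4) - 16)
fun dequantQ5Block(b: ByteArray, off: Int, hasMin: Boolean, out: FloatArray, outOff: Int) {
    val d = halfToFloat(readU16LE(b, off))
    val m = if (hasMin) halfToFloat(readU16LE(b, off + 2)) else 0f
    val qhOff = off + (if (hasMin) 4 else 2)
    val qh = readU32LE(b, qhOff) // bit i is the fifth bit of weight i
    val qsOff = qhOff + 4
    val bias = if (hasMin) 0 else 16
    for (j in 0 until QK / 2) {
        val q = b[qsOff + j].toInt() and 0xFF
        // Low nibble holds weight j, high nibble holds weight j + 16.
        val lo = (q and 0x0F) or (((qh ushr j) and 1) shl 4)
        val hi = (q ushr 4) or (((qh ushr (j + 16)) and 1) shl 4)
        out[outOff + j] = d * (lo - bias) + m
        out[outOff + j + QK / 2] = d * (hi - bias) + m
    }
}
```

The only differences between the two formats in this sketch are the extra fp16 `m` field and the fixed `- 16` bias, which is why a single routine with a `hasMin` switch is enough for illustration; the PR itself keeps separate Q5_1 and Q5_0 types.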

Layout note

Q4_K/Q6_K kernels are block-major and require the converter to re-layout. Q5_1/Q5_0 are intentionally row-major so the downstream converter case (SKaiNET-transformers#170) just wraps the raw bytes — no re-layout.
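To make the row-major claim concrete, here is a small sketch of the offset arithmetic it implies and of a scalar matrix-vector loop that walks the packed bytes row by row. All names (`packedSizeBytes`, `blockOffset`, `packedMatVec`) are hypothetical, and the block dequantizer is passed in as a lambda so the sketch stays independent of any particular format; it is not the kernel from this PR.

```kotlin
// Hypothetical names for illustration; not the SKaiNET kernel.
const val BLOCK_ELEMS = 32 // weights per quantized block

// Total packed size of an [out, in] weight matrix stored row-major in blocks.
fun packedSizeBytes(outFeatures: Int, inFeatures: Int, blockBytes: Int): Int {
    require(inFeatures % BLOCK_ELEMS == 0) { "in must be a multiple of $BLOCK_ELEMS" }
    return outFeatures * (inFeatures / BLOCK_ELEMS) * blockBytes
}

// Byte offset of block `blockInRow` of output row `row`:
// each row owns in/32 contiguous blocks.
fun blockOffset(row: Int, blockInRow: Int, inFeatures: Int, blockBytes: Int): Int =
    (row * (inFeatures / BLOCK_ELEMS) + blockInRow) * blockBytes

// y[o] = sum over i of W[o, i] * x[i], dequantizing one block at a time into
// a 32-float scratch buffer so the full FP32 matrix is never materialized.
fun packedMatVec(
    x: FloatArray,
    packed: ByteArray,
    outFeatures: Int,
    inFeatures: Int,
    blockBytes: Int,
    dequantBlock: (src: ByteArray, srcOffset: Int, dst: FloatArray) -> Unit
): FloatArray {
    val blocksPerRow = inFeatures / BLOCK_ELEMS
    val scratch = FloatArray(BLOCK_ELEMS)
    val y = FloatArray(outFeatures)
    for (o in 0 until outFeatures) {
        var acc = 0f
        for (k in 0 until blocksPerRow) {
            dequantBlock(packed, blockOffset(o, k, inFeatures, blockBytes), scratch)
            val base = k * BLOCK_ELEMS
            for (j in 0 until BLOCK_ELEMS) acc += scratch[j] * x[base + j]
        }
        y[o] = acc
    }
    return y
}
```

With this indexing, a Q5_1 matrix of shape [out, in] occupies `out * (in / 32) * 24` bytes in exactly the order the rows are consumed, which is the property that lets a converter wrap the raw bytes without re-layout.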

Tests

Q5MatmulDispatchTest: packed Q5_1/Q5_0 matmul through ops.matmul(x, ops.transpose(W)) matches the FP32-dequant matmul to <1e-3, single- and multi-batch. The FP32 reference is dequantized inline in the test (independent of the data-type code under test), matching ggml/DequantOps. Existing Q8_0 / Q4_0 / MemSeg / transpose tests stay green; affected modules compile clean (no sealed-when exhaustiveness breaks).
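The call shape exercised by the test, `ops.matmul(x, ops.transpose(W))`, depends on the transpose being a pure shape swap that keeps the packed bytes. A minimal sketch of that idea, assuming a hypothetical `PackedWeight` holder with a `transposed` flag (the real tensor-data types in the PR may track this differently):

```kotlin
// Hypothetical holder for illustration; not one of the PR's tensor-data types.
class PackedWeight(
    val bytes: ByteArray,   // raw packed blocks, always stored as [out, in] row-major
    val rows: Int,          // logical shape seen by callers
    val cols: Int,
    val transposed: Boolean = false
)

// Lazy transpose: swap the logical shape and flip the flag.
// The byte array is shared, so nothing is dequantized or copied.
fun lazyTranspose(w: PackedWeight): PackedWeight =
    PackedWeight(w.bytes, w.cols, w.rows, !w.transposed)
```

Under this assumption, a matmul dispatcher that sees a transposed packed weight can hand the untouched bytes to the row-major kernel, since the stored [out, in] order is already what that kernel expects.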

Follow-ups

  • Scalar inner loop keeps weights packed (the memory win); SIMD vectorization of the dequant+dot loop is a follow-up.
  • The transformers converter case that produces Q5_1BlockTensorData (replacing the Model Sharing #169 dequant fallback) lands in SKaiNET-transformers#170 once this is published.

🤖 Generated with Claude Code

Closes #708.

Adds a packed matmul path (scalar for now; SIMD is a follow-up) for the GGML
Q5_1 and Q5_0 quantized formats, mirroring the existing Q4_0/Q8_0/Q4_K/Q6_K
chain, so these weights can be consumed packed instead of dequantized to FP32
(avoids the FP32 memory blow-up; e.g. functiongemma-270m's attention/FFN
weights are Q5_1).

Changes:
- TensorEncoding: add Q5_0 (22 B/block) and Q5_1 (24 B/block) data objects.
- New Q5_1TensorData / Q5_0TensorData interfaces + Q5_{1,0}BlockTensorData,
  with dequantizeBlock matching DequantOps.dequantQ5_{1,0}FromBytes exactly
  (w = d*(code + (highBit<<4)) + m for Q5_1; d*(code + (highBit<<4) - 16) for Q5_0).
- JvmQuantizedVectorKernels.matmulQ5_1Vec / matmulQ5_0Vec: row-major
  [out, in] packing (output row o's `in` weights are in/32 contiguous blocks),
  so the kernel consumes raw GGUF bytes directly — no block-major re-layout.
- DefaultCpuOpsJvm: matmul dispatch branches for Q5_1TensorData / Q5_0TensorData,
  and lazy transpose branches (pure shape swap, keep packed bytes) so
  `ops.matmul(x, ops.transpose(W))` runs without a dequant round-trip.

Layout note: Q4_K/Q6_K kernels are block-major and need a converter re-layout;
Q5_1/Q5_0 are intentionally row-major so the downstream converter case
(SKaiNET-transformers#170) just wraps the raw bytes.

Tests (Q5MatmulDispatchTest): packed Q5_1/Q5_0 matmul through
ops.matmul(x, transpose(W)) matches the FP32-dequant matmul to <1e-3, across
single/multi-batch, with the FP32 reference dequantized inline (independent of
the data-type code under test). Existing Q8_0/Q4_0/MemSeg/transpose tests stay green.

Scalar inner loop keeps the weights packed (the memory win); SIMD vectorization
of the dequant+dot loop is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 7, 2026

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

  • Operator documentation: docs/modules/operators/_generated_/
  • JSON schema output: operators.json

Artifacts:

  • Download the documentation-preview-709 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

@michalharakal michalharakal merged commit 586db77 into develop Jun 8, 2026
11 checks passed
@michalharakal michalharakal deleted the feature/708-q5-packed-matmul-kernels branch June 8, 2026 10:16
Development

Successfully merging this pull request may close these issues.

Packed Q5_1 / Q5_0 SIMD matmul kernel + lazy transpose (CPU backend)