feat(backend-cpu): packed Q5_1 / Q5_0 matmul kernels + lazy transpose #709
Merged
Closes #708.

Adds a packed SIMD-path (currently scalar) matmul for the GGML Q5_1 and Q5_0 quantized formats, mirroring the existing Q4_0/Q8_0/Q4_K/Q6_K chain, so these weights can be consumed packed instead of dequantized to FP32. This avoids the FP32 memory blow-up; e.g. functiongemma-270m's attention/FFN weights are Q5_1.

Changes:
- TensorEncoding: add Q5_0 (22 B/block) and Q5_1 (24 B/block) data objects.
- New Q5_1TensorData / Q5_0TensorData interfaces + Q5_{1,0}BlockTensorData, with dequantizeBlock matching DequantOps.dequantQ5_{1,0}FromBytes exactly (w = d*(code + (highBit<<4)) + m for Q5_1; w = d*(code + (highBit<<4) - 16) for Q5_0).
- JvmQuantizedVectorKernels.matmulQ5_1Vec / matmulQ5_0Vec: row-major [out, in] packing (output row o's `in` weights are in/32 contiguous blocks), so the kernel consumes raw GGUF bytes directly, with no block-major re-layout.
- DefaultCpuOpsJvm: matmul dispatch branches for Q5_1TensorData / Q5_0TensorData, and lazy transpose branches (pure shape swap, keep packed bytes) so `ops.matmul(x, ops.transpose(W))` runs without a dequant round-trip.

Layout note: Q4_K/Q6_K kernels are block-major and need a converter re-layout; Q5_1/Q5_0 are intentionally row-major so the downstream converter case (SKaiNET-transformers#170) just wraps the raw bytes.

Tests (Q5MatmulDispatchTest): packed Q5_1/Q5_0 matmul through ops.matmul(x, transpose(W)) matches the FP32-dequant matmul to <1e-3, across single/multi-batch, with the FP32 reference dequantized inline (independent of the data-type code under test). Existing Q8_0/Q4_0/MemSeg/transpose tests stay green.

Scalar inner loop keeps the weights packed (the memory win); SIMD vectorization of the dequant+dot loop is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
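The two dequant formulas in the commit message can be illustrated with a small Kotlin sketch. It assumes the standard ggml block layout (Q5_1: fp16 `d`, fp16 `m`, 4 bytes of high bits, 16 bytes of packed nibbles; Q5_0: the same without `m`), which is consistent with the 24 B and 22 B block sizes above. The names `halfToFloat`, `readHalf`, and `dequantQ5Block` are hypothetical and are not the PR's `dequantizeBlock` implementation.

```kotlin
/** Converts IEEE 754 half-precision bits (low 16 bits of [bits]) to Float. */
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 0x1
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * (1.0f / (1 shl 24))   // zero and subnormals: mant * 2^-24
        31 -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13)) // rebias 15 -> 127
    }
    return if (sign == 1) -magnitude else magnitude
}

/** Reads a little-endian fp16 value at [offset]. */
fun readHalf(bytes: ByteArray, offset: Int): Float {
    val lo = bytes[offset].toInt() and 0xFF
    val hi = bytes[offset + 1].toInt() and 0xFF
    return halfToFloat(lo or (hi shl 8))
}

/**
 * Dequantizes one 32-weight block starting at [offset] into [dst].
 * hasMin = true  -> Q5_1 (24 bytes): w = d * (code + (highBit << 4)) + m
 * hasMin = false -> Q5_0 (22 bytes): w = d * (code + (highBit << 4) - 16)
 */
fun dequantQ5Block(bytes: ByteArray, offset: Int, hasMin: Boolean, dst: FloatArray, dstOffset: Int) {
    val d = readHalf(bytes, offset)
    val m = if (hasMin) readHalf(bytes, offset + 2) else 0f
    val qhOff = if (hasMin) offset + 4 else offset + 2
    val qsOff = qhOff + 4
    // The 32 high bits are stored as one little-endian 32-bit word.
    var qh = 0
    for (i in 0 until 4) qh = qh or ((bytes[qhOff + i].toInt() and 0xFF) shl (8 * i))
    val bias = if (hasMin) 0 else 16
    for (j in 0 until 16) {
        val q = bytes[qsOff + j].toInt() and 0xFF
        // Low nibble holds element j, high nibble holds element j + 16.
        val lo = (q and 0x0F) or (((qh ushr j) and 1) shl 4)
        val hi = (q ushr 4) or (((qh ushr (j + 16)) and 1) shl 4)
        dst[dstOffset + j] = d * (lo - bias) + m
        dst[dstOffset + j + 16] = d * (hi - bias) + m
    }
}

fun main() {
    // Q5_1 block: d = 0.5 (fp16 0x3800), m = 1.0 (fp16 0x3C00), high bit set for
    // element 0, qs[0] = 0x21 (code 1 for element 0, code 2 for element 16).
    val q51 = ByteArray(24)
    q51[1] = 0x38.toByte(); q51[3] = 0x3C.toByte(); q51[4] = 0x01.toByte(); q51[8] = 0x21.toByte()
    val w = FloatArray(32)
    dequantQ5Block(q51, 0, hasMin = true, dst = w, dstOffset = 0)
    println("Q5_1: w[0]=${w[0]} w[16]=${w[16]} w[1]=${w[1]}")
}
```

In this example element 0 of the Q5_1 block has code 1 with the high bit set, so it dequantizes to 0.5 * (1 + 16) + 1.0 = 9.5. The same code and high bit in a Q5_0 block with d = 0.5 would give 0.5 * (17 - 16) = 0.5.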
Closes #708.
What
Adds a packed matmul path for the GGML Q5_1 and Q5_0 quantized formats, mirroring the existing `Q4_0` / `Q8_0` / `Q4_K` / `Q6_K` chain, so these weights can be consumed packed instead of dequantized to FP32. Motivated by `functiongemma-270m` (attention/FFN weights are `Q5_1`), where the lack of a packed kernel forced a full FP32 dequant downstream (SKaiNET-developers/SKaiNET-transformers#169 / #170).
Changes
- `TensorEncoding`: add `Q5_0` (22 B/block) and `Q5_1` (24 B/block) data objects.
- `Q5_1TensorData` / `Q5_0TensorData` interfaces + `Q5_{1,0}BlockTensorData`, with `dequantizeBlock` matching `DequantOps.dequantQ5_{1,0}FromBytes` exactly:
  - Q5_1: `w = d * (code + (highBit << 4)) + m`
  - Q5_0: `w = d * (code + (highBit << 4) - 16)`
- `JvmQuantizedVectorKernels.matmulQ5_1Vec` / `matmulQ5_0Vec`: row-major `[out, in]` packing (output row `o`'s `in` weights are `in/32` contiguous blocks), so the kernel consumes raw GGUF bytes directly, with no block-major re-layout.
- `DefaultCpuOpsJvm`: matmul dispatch branches for `Q5_1TensorData` / `Q5_0TensorData`, and lazy transpose branches (pure shape swap, keep packed bytes) so `ops.matmul(x, ops.transpose(W))` runs without a dequant round-trip.
Layout note
`Q4_K` / `Q6_K` kernels are block-major and require the converter to re-layout. `Q5_1` / `Q5_0` are intentionally row-major so the downstream converter case (SKaiNET-transformers#170) just wraps the raw bytes, with no re-layout.
Tests
`Q5MatmulDispatchTest`: packed Q5_1/Q5_0 matmul through `ops.matmul(x, ops.transpose(W))` matches the FP32-dequant matmul to <1e-3, single- and multi-batch. The FP32 reference is dequantized inline in the test (independent of the data-type code under test), matching ggml/DequantOps. Existing `Q8_0` / `Q4_0` / MemSeg / transpose tests stay green; affected modules compile clean (no sealed-`when` exhaustiveness breaks).
Follow-ups
`Q5_1BlockTensorData` (replacing the Model Sharing #169 dequant fallback) lands in SKaiNET-transformers#170 once this is published.

🤖 Generated with Claude Code
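As a rough illustration of the row-major `[out, in]` layout described under Changes, the following Kotlin sketch computes `y = x · Wᵀ` directly from packed Q5_1 bytes, dequantizing each weight inside the dot-product loop instead of materializing an FP32 copy of `W`. It assumes the standard ggml Q5_1 block layout (fp16 `d`, fp16 `m`, 4 bytes of high bits, 16 bytes of nibbles). `matmulQ5_1Sketch` and its helpers are hypothetical names, not the actual `JvmQuantizedVectorKernels.matmulQ5_1Vec` code.

```kotlin
const val BLOCK_SIZE = 32          // weights per Q5_1 block
const val Q5_1_BLOCK_BYTES = 24    // 2 (d) + 2 (m) + 4 (high bits) + 16 (nibbles)

/** Converts IEEE 754 half-precision bits to Float. */
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 0x1
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * (1.0f / (1 shl 24))   // zero and subnormals
        31 -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13))
    }
    return if (sign == 1) -magnitude else magnitude
}

/** Reads a little-endian fp16 value at [offset]. */
fun readHalf(bytes: ByteArray, offset: Int): Float =
    halfToFloat((bytes[offset].toInt() and 0xFF) or ((bytes[offset + 1].toInt() and 0xFF) shl 8))

/**
 * y[b, o] = sum_i x[b, i] * W[o, i], where W stays packed as Q5_1 bytes.
 * Output row o owns inDim / 32 contiguous blocks, so raw row-major bytes are used as-is.
 */
fun matmulQ5_1Sketch(x: FloatArray, batch: Int, w: ByteArray, outDim: Int, inDim: Int): FloatArray {
    require(inDim % BLOCK_SIZE == 0) { "inDim must be a multiple of 32" }
    val blocksPerRow = inDim / BLOCK_SIZE
    require(w.size == outDim * blocksPerRow * Q5_1_BLOCK_BYTES) { "unexpected packed size" }
    val y = FloatArray(batch * outDim)
    for (b in 0 until batch) {
        for (o in 0 until outDim) {
            var acc = 0f
            for (blk in 0 until blocksPerRow) {
                val base = (o * blocksPerRow + blk) * Q5_1_BLOCK_BYTES
                val d = readHalf(w, base)
                val m = readHalf(w, base + 2)
                var qh = 0
                for (i in 0 until 4) qh = qh or ((w[base + 4 + i].toInt() and 0xFF) shl (8 * i))
                val xBase = b * inDim + blk * BLOCK_SIZE
                for (j in 0 until 16) {
                    val q = w[base + 8 + j].toInt() and 0xFF
                    val lo = (q and 0x0F) or (((qh ushr j) and 1) shl 4)          // element j
                    val hi = (q ushr 4) or (((qh ushr (j + 16)) and 1) shl 4)     // element j + 16
                    acc += (d * lo + m) * x[xBase + j]
                    acc += (d * hi + m) * x[xBase + j + 16]
                }
            }
            y[b * outDim + o] = acc
        }
    }
    return y
}

fun main() {
    val w = ByteArray(2 * Q5_1_BLOCK_BYTES)
    // Row 0: d = 1.0 (fp16 0x3C00), m = 0; only element 16 has code 1.
    w[1] = 0x3C.toByte(); w[8] = 0x10.toByte()
    // Row 1: d = 0.5 (fp16 0x3800), m = 1.0; all codes 0, so every weight is 1.0.
    w[24 + 1] = 0x38.toByte(); w[24 + 3] = 0x3C.toByte()
    // Batch of 2: first row all ones, second row 0, 1, ..., 31.
    val x = FloatArray(64) { i -> if (i < 32) 1f else (i - 32).toFloat() }
    println(matmulQ5_1Sketch(x, batch = 2, w = w, outDim = 2, inDim = 32).toList())
}
```

Because the kernel indexes `W` by output row, the caller never needs a physically transposed copy, which is why a transpose on such a tensor can remain a pure shape swap. A Q5_0 variant would drop `m`, read the high bits from offset 2, and subtract 16 from each code.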