feat(quantized): GGUF-compat Q4_0 quant/dequant for burn QuantValue::Q4F/Q4S (sprint A5)#120
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7609ccd67f
```rust
let lo = ((block[j] * id).round() + 8.5).floor().clamp(0.0, 15.0) as u8;
let hi = ((block[j + Q4_0_BYTES_PER_BLOCK] * id).round() + 8.5)
    .floor()
    .clamp(0.0, 15.0) as u8;
```
Use GGUF Q4_0 quantizer rounding rule
quantize_f32_to_q4_0 currently computes each nibble with ((x * id).round() + 8.5).floor(), but GGUF/llama.cpp Q4_0 uses truncation of x * id + 8.5 (effectively floor(x * id + 8.5) for this nonnegative range). These are not equivalent for negative half-step inputs (e.g. x*id = -0.5 gives 7 here vs 8 in GGUF), so this can produce different packed bytes from the same weights and break the advertised byte-level compatibility with existing Q4_0 tensors.
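To make the divergence concrete, here is a hypothetical standalone comparison of the two rounding rules (`scaled` stands for `x * id`; the function names are illustrative, not from the PR):

```rust
/// Rule as written in the PR: round half away from zero, then shift and floor.
fn nibble_pr(scaled: f32) -> u8 {
    ((scaled.round() + 8.5).floor()).clamp(0.0, 15.0) as u8
}

/// GGUF / llama.cpp rule: shift by 8.5, then truncate toward zero
/// (the `(int8_t)(x0 + 8.5f)` cast in the reference quantizer).
fn nibble_gguf(scaled: f32) -> u8 {
    ((scaled + 8.5) as i32).clamp(0, 15) as u8
}

fn main() {
    // The two rules agree on typical values...
    assert_eq!(nibble_pr(1.2), nibble_gguf(1.2)); // both 9
    // ...but not at a negative half step:
    assert_eq!(nibble_pr(-0.5), 7);   // round(-0.5) = -1 -> 7.5 -> floor -> 7
    assert_eq!(nibble_gguf(-0.5), 8); // -0.5 + 8.5 = 8.0 -> truncate -> 8
    println!("pr={} gguf={}", nibble_pr(-0.5), nibble_gguf(-0.5));
}
```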
Add `quantize_f32_to_q4_0` / `dequantize_q4_0_to_f32` implementing the GGUF / llama.cpp per-32-element block scheme: 16 packed bytes plus one f32 scale `d = max_signed / -8` per block, with the canonical interleaved nibble layout (element j -> low nibble of byte j; element j+16 -> high nibble of byte j).

The existing per-tensor `quantize_f32_to_i4` (low-nibble-first, non-interleaved, scale = abs_max / 7) is preserved unchanged for backwards compatibility. Burn `QuantValue::Q4F` / `Q4S` callers can opt into either scheme.

Tests: i4 boundary ±7 and clamp ±8; Q4_0 single-block, multi-block, zero-block, interleaved layout, non-aligned panic.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
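The block scheme described above can be sketched as a minimal standalone implementation. This follows the GGUF reference rounding; the function names are illustrative, not the PR's actual API:

```rust
const QK4_0: usize = 32; // elements per block

/// Quantize one 32-element block into (scale d, 16 packed bytes).
fn q4_0_quantize_block(x: &[f32; QK4_0]) -> (f32, [u8; QK4_0 / 2]) {
    // Scale from the signed value of largest magnitude: d = max_signed / -8.
    let mut max = 0.0f32;
    let mut amax = 0.0f32;
    for &v in x {
        if v.abs() > amax {
            amax = v.abs();
            max = v;
        }
    }
    let d = max / -8.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };

    let mut qs = [0u8; QK4_0 / 2];
    for j in 0..QK4_0 / 2 {
        // GGUF truncates x*id + 8.5 (integer cast), then clamps to [0, 15].
        let lo = ((x[j] * id + 8.5) as i32).clamp(0, 15) as u8;
        let hi = ((x[j + QK4_0 / 2] * id + 8.5) as i32).clamp(0, 15) as u8;
        // Interleave: element j -> low nibble of byte j, element j+16 -> high nibble.
        qs[j] = lo | (hi << 4);
    }
    (d, qs)
}

/// Dequantize one block: nibble n maps back to (n - 8) * d.
fn q4_0_dequantize_block(d: f32, qs: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for j in 0..QK4_0 / 2 {
        out[j] = ((qs[j] & 0x0F) as i32 - 8) as f32 * d;
        out[j + QK4_0 / 2] = ((qs[j] >> 4) as i32 - 8) as f32 * d;
    }
    out
}

fn main() {
    // Ramp -16..=15: largest-magnitude value is -16, so d = -16 / -8 = 2.0.
    let x: [f32; QK4_0] = core::array::from_fn(|i| (i as f32) - 16.0);
    let (d, qs) = q4_0_quantize_block(&x);
    let y = q4_0_dequantize_block(d, &qs);
    let max_err = x.iter().zip(&y).map(|(a, b)| (a - b).abs()).fold(0.0f32, f32::max);
    // Round-trip error stays within half a quantization step (|d| / 2).
    println!("d = {d}, max round-trip error = {max_err}");
    assert!(max_err <= d.abs() / 2.0 + f32::EPSILON);
}
```

Note how a zero block falls out naturally: `d = 0` forces `id = 0`, every nibble becomes `trunc(8.5) = 8`, and dequantization maps `(8 - 8) * 0` back to exact zeros.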
Force-pushed from 7609ccd to 376aacb.
Summary

Sprint A5 of burn-ndarray parity sprint v1. Closes item (11) of the parity list — Q4 quant helpers needed for burn's `QuantValue::Q4F`/`Q4S`.

Existing `quantize_f32_to_i4` audit

- Pre-existing impl at `src/hpc/quantized.rs:355`: `pub fn quantize_f32_to_i4(data: &[f32]) -> (Vec<u8>, QuantParams)` — already public
- `scale = abs_max / 7.0`, `zero_point = 0`
- Range `[-8, 7]`, sign-extended on dequant

Decision

Option (a), additive — kept existing `quantize_f32_to_i4` untouched (no breaking change to existing callers). Added new GGUF-compat functions alongside.

What's new (+211 LOC)

`src/hpc/quantized.rs:466-676`. The packing layout is the exact GGUF Q4_0 interleave (not the linear layout `quantize_f32_to_i4` uses), matching what `llama.cpp` produces.

Tests (6 new, 17/17 pass)

- `test_i4_boundary_values` — exact boundaries at ±7 (scale = 1.0) and clamp at ±8 (scale = 8/7)
- `test_q4_0_roundtrip_single_block` — 32 floats round-trip
- `test_q4_0_roundtrip_multi_block` — 3 blocks (96 floats)
- `test_q4_0_zero_block` — d = 0 edge case
- `test_q4_0_packing_layout_interleaved` — asserts byte j holds elements j and j+16
- `test_q4_0_requires_block_aligned` — `#[should_panic]` for input whose length is not a multiple of 32

Acceptance

- `cargo build`: clean (existing 39 warnings, none new)
- `cargo test --lib quantized`: 17 passed, 0 failed
- `cargo fmt --check`: only pre-existing diffs in code A5 didn't touch; new code is fmt-clean

Plan reference

`.claude/plans/burn-ndarray-parity-sprint-v1.md` — Item (11)

Notes
GPG-signed commit (A5 worked around the env's codesign-helper quirk by mirror-committing in `/home/user/ndarray` and fetching the SHA into its worktree). Same key as recent master commits.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
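The interleaved-layout assertion from the test list above can be sketched as a hypothetical standalone check (not the PR's actual test code): packed byte j should carry element j in its low nibble and element j+16 in its high nibble.

```rust
fn main() {
    // Pick a block where each element's expected nibble is easy to predict:
    // with scale d = 2.0 (so id = 0.5), x = 2 * (n - 8) maps to nibble n exactly.
    // (The derived d = max_signed / -8 for this block is also 2.0, since the
    // largest-magnitude value is -16.)
    let nibbles: [u8; 32] = core::array::from_fn(|i| (i % 16) as u8);
    let x: [f32; 32] = core::array::from_fn(|i| 2.0 * (nibbles[i] as f32 - 8.0));

    // Pack with the GGUF interleave: element j -> low nibble of byte j,
    // element j+16 -> high nibble of byte j.
    let mut qs = [0u8; 16];
    for j in 0..16 {
        let lo = ((x[j] * 0.5 + 8.5) as i32).clamp(0, 15) as u8;
        let hi = ((x[j + 16] * 0.5 + 8.5) as i32).clamp(0, 15) as u8;
        qs[j] = lo | (hi << 4);
    }

    // Layout assertion: byte j holds elements j and j+16.
    for j in 0..16 {
        assert_eq!(qs[j] & 0x0F, nibbles[j]);
        assert_eq!(qs[j] >> 4, nibbles[j + 16]);
    }
    println!("interleaved layout verified");
}
```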
Generated by Claude Code