fix(ggml): correct I2_S decode to microsoft strided block layout #156
Thanks for this; the strided-layout diagnosis is exactly right, and the
Two things before merge:
1. The header doc-comment is now stale (and contradicts the code).
2. The tests don't actually exercise the production path.
Minor: the partial-tail packing ( Heads-up: |
The current dequantize_i2_s assumes 4 sequential trits per byte
(byte b holds elements 4b..4b+3) with the mapping 0b01->+1,
0b10->-1. That is NOT how microsoft/BitNet packs I2_S.
Verified against microsoft/BitNet ggml-bitnet-mad.cpp
(quantize_i2_s + the AVX2 vec_dot): I2_S uses a STRIDED layout.
Elements are grouped into 128-element blocks; each block occupies
32 bytes. Within a block, byte p (0 <= p < 32) packs the four elements
{p, p+32, p+64, p+96} at bit-shifts 6,4,2,0 (group g -> shift
6-2g). The 2-bit code is UNSIGNED {0,1,2}; the ternary value is
code-1 (0 -> -1, 1 -> 0, 2 -> +1). A true zero tensor therefore
packs to 0x55 bytes, not 0x00.
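The layout described above can be sketched for a single full block as follows. This is a minimal illustration, not the patched dequantize_i2_s: the helper name decode_block and the constants are hypothetical, the trailing per-tensor scale is ignored, and the unused code 3 is not validated.

```rust
/// Bytes per full I2_S block, per the layout described above.
const BLOCK_BYTES: usize = 32;
/// Elements per full block: four 2-bit codes per byte.
const BLOCK_ELEMS: usize = 4 * BLOCK_BYTES;

/// Decode one full 32-byte block into 128 ternary values.
/// Hypothetical helper for illustration only.
fn decode_block(block: &[u8; BLOCK_BYTES]) -> [i8; BLOCK_ELEMS] {
    let mut out = [0i8; BLOCK_ELEMS];
    for (p, &byte) in block.iter().enumerate() {
        for g in 0..4 {
            // Group g sits at bit-shift 6 - 2g and lands at element p + 32g.
            let code = (byte >> (6 - 2 * g)) & 0b11;
            // Unsigned code {0, 1, 2} maps to the trit code - 1.
            out[p + g * BLOCK_BYTES] = code as i8 - 1;
        }
    }
    out
}
```

A block of 0x55 bytes decodes to all zeros under this sketch, while a block of 0x00 bytes decodes to all -1, which is why a zero tensor cannot be stored as zero bytes.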
The contiguous decode silently scrambles every weight: the model
loads fine and returns confident garbage. On
microsoft/bitnet-b1.58-2B-4T the difference is stark:
    prompt:            'The capital of France is'
    contiguous decode  -> cluster / mass / mam (nonsense)
    strided decode     -> Paris 94.5% (correct)
This patch:
* rewrites dequantize_i2_s to the strided block layout with the
  correct unsigned-code -> trit mapping (general invariant: a
  block of B bytes holds 4*B elements, byte p carrying
  {p, p+B, p+2B, p+3B}; full 128/32 blocks + a partial tail).
* rewrites quantize_i2_s as the exact inverse so round-trips
  hold and the encoder matches reality.
* updates the I2_S tests for the strided layout + 0x55-zero
  encoding, and adds a strided-layout round-trip regression.
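A rough sketch of the general invariant and the inverse relationship named in the list above. All function names here (pack_block, unpack_block, pack, unpack) are hypothetical, the per-tensor scale is left out, and the tail handling is an assumption: the sketch treats the tail as one shorter block of B = len/4 bytes and requires the tail length to be a multiple of 4. It is not the upstream tail format.

```rust
/// Pack 4*B trits (each -1, 0, or +1) into B bytes using the strided layout:
/// byte p carries elements {p, p+B, p+2B, p+3B} at shifts 6, 4, 2, 0.
fn pack_block(trits: &[i8]) -> Vec<u8> {
    // Assumption of this sketch: the block length is a multiple of 4.
    assert!(trits.len() % 4 == 0, "sketch assumes a multiple of 4 elements");
    let b = trits.len() / 4;
    let mut bytes = vec![0u8; b];
    for (i, &t) in trits.iter().enumerate() {
        assert!((-1..=1).contains(&t), "not a ternary value");
        let code = (t + 1) as u8; // trit -> unsigned code {0, 1, 2}
        bytes[i % b] |= code << (6 - 2 * (i / b));
    }
    bytes
}

/// Exact inverse of `pack_block` for a block of B bytes.
fn unpack_block(bytes: &[u8]) -> Vec<i8> {
    let b = bytes.len();
    let mut trits = vec![0i8; 4 * b];
    for (p, &byte) in bytes.iter().enumerate() {
        for g in 0..4 {
            trits[p + g * b] = ((byte >> (6 - 2 * g)) & 0b11) as i8 - 1;
        }
    }
    trits
}

/// Full 128-element blocks followed by one shorter tail block.
fn pack(trits: &[i8]) -> Vec<u8> {
    let full = trits.len() / 128 * 128;
    let mut out: Vec<u8> = trits[..full].chunks(128).flat_map(pack_block).collect();
    out.extend(pack_block(&trits[full..]));
    out
}

/// Full 32-byte blocks followed by one shorter tail block.
fn unpack(bytes: &[u8]) -> Vec<i8> {
    let full = bytes.len() / 32 * 32;
    let mut out: Vec<i8> = bytes[..full].chunks(32).flat_map(unpack_block).collect();
    out.extend(unpack_block(&bytes[full..]));
    out
}
```

Because pack_block and unpack_block share the same index arithmetic (byte i % B, group i / B), a round trip through pack and unpack returns the input unchanged for any length that satisfies the multiple-of-4 assumption.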
Verified end-to-end on the real microsoft/bitnet-b1.58-2B-4T GGUF
(Paris 94.5%, 'two plus two equals' -> four). cargo test -p
larql-models --lib i2_s: 9 passed.
Addresses PR chrishayuk#156 review:
- Rewrote the header block-comment above dequantize_i2_s: it still documented the OLD contiguous '4 sequential trits/byte' layout and the wrong 0b01->+1 / 0b10->-1 mapping (and the wrong sub_norm scale source). Now documents the strided 128-elem/32-byte layout, the unsigned code-1 mapping, the 0x55-zero encoding, and the trailing per-tensor scale, with the ggml-bitnet-mad.cpp cross-reference.
- Added i2_s_full_block_round_trip (256 elems = two full 128-blocks) driving the full_blocks path + the +32/+64/+96 stride that the 4-8 element tail tests never touched.
- Added i2_s_known_pattern_pins_strided_offsets: a hand-constructed block (byte 0 = 0b10_00_01_00, rest 0x55) whose +0/+32/+64/+96 decode is pinned to known trits computed by hand from microsoft's layout; it fails if the stride or shift is wrong, so the layout claim is anchored by the suite, not only by comments.
- Noted the partial-tail packing is NOT the upstream tail format.
cargo test -p larql-models --lib i2_s: 11 passed.
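The hand computation behind the known-pattern block can be worked through directly. Byte 0 = 0b10_00_01_00 yields code 0b10 = 2 at shift 6 (element 0, trit +1), code 0b00 = 0 at shift 4 (element 32, trit -1), code 0b01 = 1 at shift 2 (element 64, trit 0), and code 0b00 = 0 at shift 0 (element 96, trit -1). The sketch below expresses the same lookup per element; trit_at and known_pattern_block are hypothetical names written for illustration and are not the test added in this commit.

```rust
/// Ternary value of element `i` (0 <= i < 128) in one 32-byte strided block.
/// Hypothetical helper, written independently of the patched decoder.
fn trit_at(block: &[u8; 32], i: usize) -> i8 {
    let p = i % 32; // byte that holds element i
    let g = i / 32; // group 0..=3, selecting shift 6 - 2g
    ((block[p] >> (6 - 2 * g)) & 0b11) as i8 - 1
}

/// The hand-constructed block described above:
/// byte 0 carries the pattern, every other byte encodes four zeros.
fn known_pattern_block() -> [u8; 32] {
    let mut block = [0x55u8; 32];
    block[0] = 0b10_00_01_00;
    block
}
```

Reading the same byte with the shift order reversed (0, 2, 4, 6) would give -1, 0, -1, +1 for elements 0, 32, 64, 96 instead of +1, -1, 0, -1, so a pinned expectation of this kind distinguishes the two.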
Thanks — addressed both, and rebased onto current main (clears the inherited coverage failure; ran
sweet, thanks.. i'll merge this in tonight before i do anything else, hahaha, appreciate the contribution
Problem
dequantize_i2_s decodes I2_S as 4 sequential trits per byte (0b01->+1, 0b10->-1). That is not how microsoft/BitNet packs I2_S, so every weight is scrambled: the model loads and returns confident garbage.
What I verified
Against microsoft/BitNet ggml-bitnet-mad.cpp (quantize_i2_s + the AVX2 vec_dot): I2_S uses a strided layout. Elements are grouped into 128-element blocks of 32 bytes; within a block, byte p packs elements {p, p+32, p+64, p+96} at bit-shifts 6/4/2/0 (group g -> shift 6-2g). The 2-bit code is unsigned {0,1,2}, ternary value = code-1. A zero tensor packs to 0x55, not 0x00.
On microsoft/bitnet-b1.58-2B-4T:
Changes
* dequantize_i2_s: strided block layout + correct unsigned-code->trit mapping (general invariant: a block of B bytes holds 4·B elements, byte p -> {p, p+B, p+2B, p+3B}; full 128/32 blocks plus a partial tail).
* quantize_i2_s: rewritten as the exact inverse.
* I2_S tests: updated for the strided layout and 0x55-zero encoding; added a strided round-trip regression.
Verified end-to-end on the real GGUF. cargo test -p larql-models --lib i2_s: 9 passed.
This is a prerequisite for any correct BitNet 1.58 inference on the engine; happy to follow up with the native-ternary forward-pass work that consumes this.