Skip to content

fix(ggml): correct I2_S decode to microsoft strided block layout#156

Merged
chrishayuk merged 2 commits into
chrishayuk:mainfrom
gburd:pr/i2s-decode-fix
Jun 19, 2026
Merged

fix(ggml): correct I2_S decode to microsoft strided block layout#156
chrishayuk merged 2 commits into
chrishayuk:mainfrom
gburd:pr/i2s-decode-fix

Conversation

@gburd

@gburd gburd commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Problem

dequantize_i2_s decodes I2_S as 4 sequential trits per byte (0b01->+1, 0b10->-1). That is not how microsoft/BitNet packs I2_S, so every weight is scrambled — the model loads and returns confident garbage.

What I verified

Against microsoft/BitNet ggml-bitnet-mad.cpp (quantize_i2_s + the AVX2 vec_dot): I2_S uses a strided layout. Elements are grouped into 128-element blocks of 32 bytes; within a block, byte p packs elements {p, p+32, p+64, p+96} at bit-shifts 6/4/2/0 (group g -> shift 6-2g). The 2-bit code is unsigned {0,1,2}, ternary value = code-1. A zero tensor packs to 0x55, not 0x00.

On microsoft/bitnet-b1.58-2B-4T:

prompt contiguous decode (current) strided decode (this PR)
"The capital of France is" cluster / mass / mam Paris 94.5%
"Two plus two equals" (nonsense) four

Changes

  • dequantize_i2_s: strided block layout + correct unsigned-code->trit mapping (general invariant: a block of B bytes holds 4·B elements, byte p -> {p, p+B, p+2B, p+3B}; full 128/32 blocks plus a partial tail).
  • quantize_i2_s: rewritten as the exact inverse.
  • Tests updated for the strided layout + 0x55-zero encoding; added a strided round-trip regression.

Verified end-to-end on the real GGUF. cargo test -p larql-models --lib i2_s: 9 passed.

This is a prerequisite for any correct BitNet 1.58 inference on the engine; happy to follow up with the native-ternary forward-pass work that consumes this.

@chrishayuk

Copy link
Copy Markdown
Owner

Thanks for this — the strided-layout diagnosis is exactly right, and the Paris 94.5% before/after is a compelling proof it's a real correctness fix (the old contiguous decode was scrambling every BitNet weight). The strided indexing and the quantize_i2_s inverse read as clean and self-consistent.

Two things before merge:

1. The header doc-comment is now stale (and contradicts the code).
The block-comment above the functions (~tq.rs:363–380) still documents the old contiguous "4 sequential trits/byte" layout and the old 0b01→+1 / 0b10→-1 mapping — the exact thing this PR overturns. Anyone reading it will be misled. Please update it to the strided 128-elem/32-byte layout + the new unsigned code-1 mapping.

2. The tests don't actually exercise the production path.
Every round-trip test uses 4–8 elements, which only hits the partial-tail branch — the full-block loop and the +32/+64/+96 / 32-byte stride (i.e. the layout this PR is all about) have zero coverage. And because the tests are self-inverse, they'd pass even if the stride/shift were wrong. Could you add:

  • a ≥128-element round-trip that drives the full_blocks path, and
  • one test that pins a known byte pattern decoding to known trits at the +32/+64/+96 offsets (ideally a small real-tensor slice checked against upstream ggml-bitnet), so the "this matches Microsoft's layout" claim is anchored by the suite rather than only by comments.

Minor: the partial-tail packing (tb = tail_elems/4) isn't Microsoft's real partial-block format — harmless since real tensors are 128-multiples, but worth a // not the upstream tail format note so nobody relies on it.

Heads-up: main is green again now, so a rebase will clear the inherited gguf/writer.rs coverage failure automatically; the tq.rs fmt failure is just cargo fmt.

gburd added 2 commits June 19, 2026 09:35
The current dequantize_i2_s assumes 4 sequential trits per byte
(byte b holds elements 4b..4b+4) with the mapping 0b01->+1,
0b10->-1.  That is NOT how microsoft/BitNet packs I2_S.

Verified against microsoft/BitNet ggml-bitnet-mad.cpp
(quantize_i2_s + the AVX2 vec_dot): I2_S uses a STRIDED layout.
Elements are grouped into 128-element blocks; each block occupies
32 bytes.  Within a block, byte p (0..32) packs the four elements
{p, p+32, p+64, p+96} at bit-shifts 6,4,2,0 (group g -> shift
6-2g).  The 2-bit code is UNSIGNED {0,1,2}; the ternary value is
code-1 (0 -> -1, 1 -> 0, 2 -> +1).  A true zero tensor therefore
packs to 0x55 bytes, not 0x00.

The contiguous decode silently scrambles every weight: the model
loads fine and returns confident garbage.  On
microsoft/bitnet-b1.58-2B-4T the difference is stark:

  prompt 'The capital of France is'
    contiguous decode -> cluster / mass / mam   (nonsense)
    strided decode    -> Paris 94.5%            (correct)

This patch:
  * rewrites dequantize_i2_s to the strided block layout with the
    correct unsigned-code -> trit mapping (general invariant: a
    block of B bytes holds 4*B elements, byte p carrying
    {p, p+B, p+2B, p+3B}; full 128/32 blocks + a partial tail).
  * rewrites quantize_i2_s as the exact inverse so round-trips
    hold and the encoder matches reality.
  * updates the I2_S tests for the strided layout + 0x55-zero
    encoding, and adds a strided-layout round-trip regression.

Verified end-to-end on the real microsoft/bitnet-b1.58-2B-4T GGUF
(Paris 94.5%, 'two plus two equals' -> four).  cargo test -p
larql-models --lib i2_s: 9 passed.
Addresses PR chrishayuk#156 review:
- Rewrote the header block-comment above dequantize_i2_s: it still
  documented the OLD contiguous '4 sequential trits/byte' layout and
  the wrong 0b01->+1 / 0b10->-1 mapping (and the wrong sub_norm scale
  source).  Now documents the strided 128-elem/32-byte layout, the
  unsigned code-1 mapping, the 0x55-zero encoding, and the trailing
  per-tensor scale, with the ggml-bitnet-mad.cpp cross-reference.
- Added i2_s_full_block_round_trip (256 elems = two full 128-blocks)
  driving the full_blocks path + the +32/+64/+96 stride that the
  4-8 element tail tests never touched.
- Added i2_s_known_pattern_pins_strided_offsets: a hand-constructed
  block (byte 0 = 0b10_00_01_00, rest 0x55) whose +0/+32/+64/+96
  decode is pinned to known trits computed by hand from microsoft's
  layout \u2014 fails if the stride or shift is wrong, so the layout
  claim is anchored by the suite, not only by comments.
- Noted the partial-tail packing is NOT the upstream tail format.

cargo test -p larql-models --lib i2_s: 11 passed.
@gburd gburd force-pushed the pr/i2s-decode-fix branch from 9c97354 to 60c7faf Compare June 19, 2026 14:22
@gburd

gburd commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Thanks — addressed both, and rebased onto current main (clears the inherited coverage failure; ran cargo fmt).

  1. Stale header — rewrote the block-comment above dequantize_i2_s. It now documents the strided 128-elem/32-byte layout, the unsigned code-1 mapping (0→-1, 1→0, 2→+1), the 0x55-for-zero encoding, and the trailing per-tensor scale, with the ggml-bitnet-mad.cpp cross-reference. (Also corrected the old line that claimed the scale lived in *_sub_norm.weight — it doesn't.)

  2. Tests now exercise the production path — added:

    • i2_s_full_block_round_trip: 256 elements = two full 128-blocks, drives the full_blocks loop and the real +32/+64/+96 / 32-byte stride.
    • i2_s_known_pattern_pins_strided_offsets: a hand-constructed block (byte 0 = 0b10_00_01_00, rest 0x55) whose decode at offsets +0/+32/+64/+96 is pinned to known trits computed by hand from Microsoft's layout — independent of the self-inverse round-trips, so it fails if the stride or shift is wrong.
  3. Added a // NOT the upstream tail format note on the partial-tail packing.

cargo test -p larql-models --lib i2_s: 11 passed.

@chrishayuk

Copy link
Copy Markdown
Owner

sweet, thanks.. i'll merge this in tonight before i do anything else, hahaha, appreciate the contribution

@chrishayuk chrishayuk merged commit bc616b0 into chrishayuk:main Jun 19, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants