fix(ggml): correct I2_S decode to microsoft strided block layout by gburd · Pull Request #156 · chrishayuk/larql

gburd · 2026-06-17T19:56:16Z

Problem

dequantize_i2_s decodes I2_S as 4 sequential trits per byte (0b01->+1, 0b10->-1). That is not how microsoft/BitNet packs I2_S, so every weight is scrambled — the model loads and returns confident garbage.

What I verified

Against microsoft/BitNet ggml-bitnet-mad.cpp (quantize_i2_s + the AVX2 vec_dot): I2_S uses a strided layout. Elements are grouped into 128-element blocks of 32 bytes; within a block, byte p packs elements {p, p+32, p+64, p+96} at bit-shifts 6/4/2/0 (group g -> shift 6-2g). The 2-bit code is unsigned {0,1,2}, ternary value = code-1. A zero tensor packs to 0x55, not 0x00.

On microsoft/bitnet-b1.58-2B-4T:

prompt	contiguous decode (current)	strided decode (this PR)
"The capital of France is"	cluster / mass / mam	Paris 94.5%
"Two plus two equals"	(nonsense)	four

Changes

dequantize_i2_s: strided block layout + correct unsigned-code->trit mapping (general invariant: a block of B bytes holds 4·B elements, byte p -> {p, p+B, p+2B, p+3B}; full 128/32 blocks plus a partial tail).
quantize_i2_s: rewritten as the exact inverse.
Tests updated for the strided layout + 0x55-zero encoding; added a strided round-trip regression.

Verified end-to-end on the real GGUF. cargo test -p larql-models --lib i2_s: 9 passed.

This is a prerequisite for any correct BitNet 1.58 inference on the engine; happy to follow up with the native-ternary forward-pass work that consumes this.

chrishayuk · 2026-06-18T22:58:22Z

Thanks for this — the strided-layout diagnosis is exactly right, and the Paris 94.5% before/after is a compelling proof it's a real correctness fix (the old contiguous decode was scrambling every BitNet weight). The strided indexing and the quantize_i2_s inverse read as clean and self-consistent.

Two things before merge:

1. The header doc-comment is now stale (and contradicts the code).
The block-comment above the functions (~tq.rs:363–380) still documents the old contiguous "4 sequential trits/byte" layout and the old 0b01→+1 / 0b10→-1 mapping — the exact thing this PR overturns. Anyone reading it will be misled. Please update it to the strided 128-elem/32-byte layout + the new unsigned code-1 mapping.

2. The tests don't actually exercise the production path.
Every round-trip test uses 4–8 elements, which only hits the partial-tail branch — the full-block loop and the +32/+64/+96 / 32-byte stride (i.e. the layout this PR is all about) have zero coverage. And because the tests are self-inverse, they'd pass even if the stride/shift were wrong. Could you add:

a ≥128-element round-trip that drives the full_blocks path, and
one test that pins a known byte pattern decoding to known trits at the +32/+64/+96 offsets (ideally a small real-tensor slice checked against upstream ggml-bitnet), so the "this matches Microsoft's layout" claim is anchored by the suite rather than only by comments.

Minor: the partial-tail packing (tb = tail_elems/4) isn't Microsoft's real partial-block format — harmless since real tensors are 128-multiples, but worth a // not the upstream tail format note so nobody relies on it.

Heads-up: main is green again now, so a rebase will clear the inherited gguf/writer.rs coverage failure automatically; the tq.rs fmt failure is just cargo fmt.

The current dequantize_i2_s assumes 4 sequential trits per byte (byte b holds elements 4b..4b+4) with the mapping 0b01->+1, 0b10->-1. That is NOT how microsoft/BitNet packs I2_S. Verified against microsoft/BitNet ggml-bitnet-mad.cpp (quantize_i2_s + the AVX2 vec_dot): I2_S uses a STRIDED layout. Elements are grouped into 128-element blocks; each block occupies 32 bytes. Within a block, byte p (0..32) packs the four elements {p, p+32, p+64, p+96} at bit-shifts 6,4,2,0 (group g -> shift 6-2g). The 2-bit code is UNSIGNED {0,1,2}; the ternary value is code-1 (0 -> -1, 1 -> 0, 2 -> +1). A true zero tensor therefore packs to 0x55 bytes, not 0x00. The contiguous decode silently scrambles every weight: the model loads fine and returns confident garbage. On microsoft/bitnet-b1.58-2B-4T the difference is stark: prompt 'The capital of France is' contiguous decode -> cluster / mass / mam (nonsense) strided decode -> Paris 94.5% (correct) This patch: * rewrites dequantize_i2_s to the strided block layout with the correct unsigned-code -> trit mapping (general invariant: a block of B bytes holds 4*B elements, byte p carrying {p, p+B, p+2B, p+3B}; full 128/32 blocks + a partial tail). * rewrites quantize_i2_s as the exact inverse so round-trips hold and the encoder matches reality. * updates the I2_S tests for the strided layout + 0x55-zero encoding, and adds a strided-layout round-trip regression. Verified end-to-end on the real microsoft/bitnet-b1.58-2B-4T GGUF (Paris 94.5%, 'two plus two equals' -> four). cargo test -p larql-models --lib i2_s: 9 passed.

Addresses PR chrishayuk#156 review: - Rewrote the header block-comment above dequantize_i2_s: it still documented the OLD contiguous '4 sequential trits/byte' layout and the wrong 0b01->+1 / 0b10->-1 mapping (and the wrong sub_norm scale source). Now documents the strided 128-elem/32-byte layout, the unsigned code-1 mapping, the 0x55-zero encoding, and the trailing per-tensor scale, with the ggml-bitnet-mad.cpp cross-reference. - Added i2_s_full_block_round_trip (256 elems = two full 128-blocks) driving the full_blocks path + the +32/+64/+96 stride that the 4-8 element tail tests never touched. - Added i2_s_known_pattern_pins_strided_offsets: a hand-constructed block (byte 0 = 0b10_00_01_00, rest 0x55) whose +0/+32/+64/+96 decode is pinned to known trits computed by hand from microsoft's layout \u2014 fails if the stride or shift is wrong, so the layout claim is anchored by the suite, not only by comments. - Noted the partial-tail packing is NOT the upstream tail format. cargo test -p larql-models --lib i2_s: 11 passed.

gburd · 2026-06-19T14:23:50Z

Thanks — addressed both, and rebased onto current main (clears the inherited coverage failure; ran cargo fmt).

Stale header — rewrote the block-comment above dequantize_i2_s. It now documents the strided 128-elem/32-byte layout, the unsigned code-1 mapping (0→-1, 1→0, 2→+1), the 0x55-for-zero encoding, and the trailing per-tensor scale, with the ggml-bitnet-mad.cpp cross-reference. (Also corrected the old line that claimed the scale lived in *_sub_norm.weight — it doesn't.)
Tests now exercise the production path — added:
- i2_s_full_block_round_trip: 256 elements = two full 128-blocks, drives the full_blocks loop and the real +32/+64/+96 / 32-byte stride.
- i2_s_known_pattern_pins_strided_offsets: a hand-constructed block (byte 0 = 0b10_00_01_00, rest 0x55) whose decode at offsets +0/+32/+64/+96 is pinned to known trits computed by hand from Microsoft's layout — independent of the self-inverse round-trips, so it fails if the stride or shift is wrong.
Added a // NOT the upstream tail format note on the partial-tail packing.

cargo test -p larql-models --lib i2_s: 11 passed.

chrishayuk · 2026-06-19T15:13:10Z

sweet, thanks.. i'll merge this in tonight before i do anything else, hahaha, appreciate the contribution

chrishayuk mentioned this pull request Jun 18, 2026

feat(bitnet): native-ternary 1.58 inference end-to-end (microsoft/bitnet-b1.58-2B-4T) #159

Open

gburd added 2 commits June 19, 2026 09:35

gburd force-pushed the pr/i2s-decode-fix branch from 9c97354 to 60c7faf Compare June 19, 2026 14:22

chrishayuk merged commit bc616b0 into chrishayuk:main Jun 19, 2026
27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ggml): correct I2_S decode to microsoft strided block layout#156

fix(ggml): correct I2_S decode to microsoft strided block layout#156
chrishayuk merged 2 commits into
chrishayuk:mainfrom
gburd:pr/i2s-decode-fix

gburd commented Jun 17, 2026

Uh oh!

chrishayuk commented Jun 18, 2026

Uh oh!

gburd commented Jun 19, 2026

Uh oh!

chrishayuk commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gburd commented Jun 17, 2026

Problem

What I verified

Changes

Uh oh!

chrishayuk commented Jun 18, 2026

Uh oh!

gburd commented Jun 19, 2026

Uh oh!

chrishayuk commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants