From a548301ecb9fc74af57aa5d7ecc27813a8b5099e Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 30 May 2026 20:00:33 +0200
Subject: [PATCH] docs(q4_0): changelog + quantized-kernels page for
 first-class Q4_0

- CHANGELOG [Unreleased]: Added entry for the first-class Q4_0 format
  (heap type, SPI + scalar/Panama/native kernels, quantizer) and a Fixed
  entry for the MemSegment split-layout reconciliation.
- quantized-simd-kernels.adoc: correct the Q4_0 section to the canonical
  split nibble layout (was describing the old interleaved layout), note
  the SPI/scalar/Panama/native promotion + Q4_0Quantizer, and mark the
  per-format coverage matrix Q4_0 row as having an SPI sibling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                                  | 12 +++++
 .../perf/quantized-simd-kernels.adoc          | 49 ++++++++++---------
 2 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 953dcec7..3bda9d29 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,18 @@
 
 ## [Unreleased]
 
+### Added
+
+- **Q4_0 promoted to a first-class quantized format.** The older GGML 4-bit format (18 bytes / 32 elements) was previously a JVM/MemSegment-only, GGUF-only side-path; it is now wired across the full provider stack, mirroring Q8_0 / Q4_K:
+  - commonMain heap `Q4_0TensorData` / `Q4_0BlockTensorData` (+ `TensorEncoding.Q4_0`) so any loader can produce it and non-JVM targets can use it.
+  - `Q4_0MatmulKernel` SPI + `KernelProvider.matmulQ4_0()`, with **scalar** (commonMain), **Panama Vector** (JVM SIMD), and **native FFM** (`skainet_q4_0_matmul`) implementations selected via `KernelRegistry.bestAvailable()` (native → Panama → scalar). `DefaultCpuOpsJvm.chooseQuantizedMatmul` gains an `is Q4_0TensorData` branch.
+  - `Q4_0Quantizer` (FP32 → Q4_0). The producing side was previously missing; dense weights from any source (SafeTensors / JSON / in-memory) can now be quantized to canonical ggml Q4_0 without going through GGUF.
+- All Q4_0 paths use the canonical ggml **split** nibble layout (low nibbles → elements 0..15, high → 16..31, `(code - 8) * d`).
+
+### Fixed
+
+- **Q4_0 MemSegment matmul layout.** The pre-existing JVM MemSegment Q4_0 kernel (`JvmQuantizedVectorKernels.dotQ4_0BlockMemSeg`) and `Q4MemorySegmentTensorData` used an *interleaved* nibble layout that did not match real GGUF Q4_0 weights (a latent correctness bug; the path was unverified and had no in-repo callers). Reconciled to the canonical split layout so it now agrees with the heap type, the SPI kernels, and `DequantOps.dequantQ4_0FromBytes`.
+
 ## [0.25.0] - 2026-05-25
 
 ### Added
diff --git a/docs/modules/ROOT/pages/explanation/perf/quantized-simd-kernels.adoc b/docs/modules/ROOT/pages/explanation/perf/quantized-simd-kernels.adoc
index f3cf8a0c..99883e86 100644
--- a/docs/modules/ROOT/pages/explanation/perf/quantized-simd-kernels.adoc
+++ b/docs/modules/ROOT/pages/explanation/perf/quantized-simd-kernels.adoc
@@ -170,40 +170,45 @@ already existed; this change only replaced the scalar dequant loop.
 
 === Q4_0 — 32 elements / 18 bytes
 
-The simplest layout (single FP16 scale + 16 packed nibble bytes), but
-also the *least* SIMD-friendly: adjacent elements share a byte
-(`code[2k]` lo, `code[2k+1]` hi), so getting codes in natural element
-order from a `ByteVector` would need a lane-interleave shuffle or a
-strided gather. NEON has no native gather instruction, so on Apple
-Silicon a gather-based pipeline would fall back to scalar loads.
-
-The shipped Q4_0 implementation uses the *partial-vec* pattern Q4_K
-used before its fully-fused rewrite:
+The simplest layout: a single FP16 scale + 16 packed nibble bytes.
+Q4_0 uses the canonical ggml *split* layout — the low nibbles of bytes
+0..15 decode elements 0..15, the high nibbles decode elements 16..31
+(`(nibble - 8) * d`). This is the layout real GGUF Q4_0 weights ship in
+and the one `DequantOps.dequantQ4_0FromBytes` decodes.
+
+As of the first-class promotion, Q4_0 is a full SPI format on par with
+Q8_0 / Q4_K: a `Q4_0MatmulKernel` interface with scalar (commonMain),
+Panama Vector, and native FFM implementations, selected via
+`KernelRegistry.bestAvailable()`, plus a `Q4_0Quantizer` for producing
+Q4_0 from dense FP32. The Panama kernel uses the *partial-vec* pattern
+Q4_K used before its fully-fused rewrite — a scalar split-layout unpack
+into a 32-element scratch buffer, then a SIMD FMA dot product:
 
 [source,kotlin]
 ----
-// Stage 1: scalar byte-pair unpack into a 32-element scratch FloatArray.
-for (k in 0 until 16) {
-    val b = weightSeg.get(JAVA_BYTE_LE, codesOffset + k.toLong()).toInt() and 0xFF
-    codeBuf[2 * k] = (b and 0x0F).toFloat() - 8f
-    codeBuf[2 * k + 1] = (b ushr 4).toFloat() - 8f
+// Stage 1: split-layout unpack into a 32-element scratch FloatArray.
+for (j in 0 until 16) {
+    val b = weight[codesBase + j].toInt() and 0xFF
+    codeBuf[j] = ((b and 0x0F) - 8).toFloat()       // elements 0..15
+    codeBuf[16 + j] = ((b ushr 4) - 8).toFloat()    // elements 16..31
 }
 // Stage 2: SIMD FMA dot product.
 var accVec = FloatVector.zero(floatSpecies)
 while (idx < loopBound) {
-    val iv = FloatVector.fromArray(floatSpecies, input, inputOffset + idx)
+    val iv = FloatVector.fromArray(floatSpecies, input, inputBase + idx)
     val cv = FloatVector.fromArray(floatSpecies, codeBuf, idx)
     accVec = iv.fma(cv, accVec)
     idx += step
 }
-return (accVec.reduceLanes(ADD) + scalarTail) * scale
+return (accVec.reduceLanes(ADD) + scalarTail) * d
 ----
 
-If Q4_0 ever becomes a hot path (it's rarely seen in modern weights —
-Q4_K_M / Q4_K_S dominate Gemma 4, Llama, Qwen), the upgrade to a
-fully-fused `ByteVector` pipeline is a reasonable follow-up — same
-shape as the Q4_K rewrite, with the lane-interleave done via
-`VectorShuffle`.
+Q4_0 is rarely the hot path in modern weights (Q4_K_M / Q4_K_S dominate
+Gemma 4, Llama, Qwen), so the scratch-then-SIMD shape deliberately trades
+peak throughput for simplicity. A fully-fused `ByteVector` pipeline is a
+reasonable follow-up, and the split layout suits it: masking the low
+nibbles and shifting down the high nibbles of a 16-byte `ByteVector` load
+yields elements 0..15 and 16..31 directly, with no lane-interleave shuffle.
 
 == Per-format coverage matrix
 
@@ -213,7 +218,7 @@ shape as the Q4_K rewrite, with the lane-interleave done via
 | Q8_0 | no | yes | Fully fused (`ByteVector.castShape` + scaled FMA)
 | Q4_K | yes (`Q4KMatmulKernel`) | yes (inline, same algorithm) | Fully fused (single byte load → lo+hi nibble accumulators, lazy `dmin`)
 | Q6_K | no | n/a | SIMD dequant into scratch + SIMD dot (two-stage)
-| Q4_0 | no | yes | Scalar unpack into scratch + SIMD dot (two-stage)
+| Q4_0 | yes (`Q4_0MatmulKernel`) | yes | Scalar split-layout unpack into scratch + SIMD dot (two-stage)
 |===
 
 == Where to look in the code