12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,18 @@

## [Unreleased]

### Added

- **Q4_0 promoted to a first-class quantized format.** The older GGML 4-bit format (18 bytes / 32 elements) was previously a JVM/MemSegment-only, GGUF-only side-path; it is now wired across the full provider stack mirroring Q8_0 / Q4_K:
  - commonMain heap `Q4_0TensorData` / `Q4_0BlockTensorData` (+ `TensorEncoding.Q4_0`) so any loader can produce it and non-JVM targets can use it.
  - `Q4_0MatmulKernel` SPI + `KernelProvider.matmulQ4_0()`, with **scalar** (commonMain), **Panama Vector** (JVM SIMD), and **native FFM** (`skainet_q4_0_matmul`) implementations selected via `KernelRegistry.bestAvailable()` (native → Panama → scalar). `DefaultCpuOpsJvm.chooseQuantizedMatmul` gains an `is Q4_0TensorData` branch.
  - `Q4_0Quantizer` (FP32 → Q4_0) — the produce side was missing, so dense weights from any source (SafeTensors / JSON / in-memory) can now be quantized to canonical ggml Q4_0 without going through GGUF.
  - All Q4_0 paths use the canonical ggml **split** nibble layout (low nibbles → elements 0..15, high → 16..31, `(code - 8) * d`).

### Fixed

- **Q4_0 MemSegment matmul layout.** The pre-existing JVM MemSegment Q4_0 kernel (`JvmQuantizedVectorKernels.dotQ4_0BlockMemSeg`) and `Q4MemorySegmentTensorData` used an *interleaved* nibble layout that did not match real GGUF Q4_0 weights (a latent correctness bug; the path was unverified and had no in-repo callers). Reconciled to the canonical split layout so it now agrees with the heap type, the SPI kernels, and `DequantOps.dequantQ4_0FromBytes`.

## [0.25.0] - 2026-05-25

### Added
@@ -170,40 +170,45 @@ already existed; this change only replaced the scalar dequant loop.

=== Q4_0 — 32 elements / 18 bytes

-The simplest layout (single FP16 scale + 16 packed nibble bytes), but
-also the *least* SIMD-friendly: adjacent elements share a byte
-(`code[2k]` lo, `code[2k+1]` hi), so getting codes in natural element
-order from a `ByteVector` would need a lane-interleave shuffle or a
-strided gather. NEON has no native gather instruction, so on Apple
-Silicon a gather-based pipeline would fall back to scalar loads.
-
-The shipped Q4_0 implementation uses the *partial-vec* pattern Q4_K
-used before its fully-fused rewrite:
+The simplest layout: a single FP16 scale + 16 packed nibble bytes.
+Q4_0 uses the canonical ggml *split* layout — the low nibbles of bytes
+0..15 decode elements 0..15, the high nibbles decode elements 16..31
+(`(nibble - 8) * d`). This is the layout real GGUF Q4_0 weights ship in
+and what `DequantOps.dequantQ4_0FromBytes` produces.
+
+As of the first-class promotion, Q4_0 is a full SPI format on par with
+Q8_0 / Q4_K: a `Q4_0MatmulKernel` interface with scalar (commonMain),
+Panama Vector, and native FFM implementations, selected via
+`KernelRegistry.bestAvailable()`, plus a `Q4_0Quantizer` for producing
+Q4_0 from dense FP32. The Panama kernel uses the *partial-vec* pattern
+Q4_K used before its fully-fused rewrite — a scalar split-layout unpack
+into a 32-element scratch buffer, then a SIMD FMA dot product:

[source,kotlin]
----
-// Stage 1: scalar byte-pair unpack into a 32-element scratch FloatArray.
-for (k in 0 until 16) {
-    val b = weightSeg.get(JAVA_BYTE_LE, codesOffset + k.toLong()).toInt() and 0xFF
-    codeBuf[2 * k] = (b and 0x0F).toFloat() - 8f
-    codeBuf[2 * k + 1] = (b ushr 4).toFloat() - 8f
+// Stage 1: split-layout unpack into a 32-element scratch FloatArray.
+for (j in 0 until 16) {
+    val b = weight[codesBase + j].toInt() and 0xFF
+    codeBuf[j] = ((b and 0x0F) - 8).toFloat() // elements 0..15
+    codeBuf[16 + j] = ((b ushr 4) - 8).toFloat() // elements 16..31
 }
 // Stage 2: SIMD FMA dot product.
 var accVec = FloatVector.zero(floatSpecies)
 while (idx < loopBound) {
-    val iv = FloatVector.fromArray(floatSpecies, input, inputOffset + idx)
+    val iv = FloatVector.fromArray(floatSpecies, input, inputBase + idx)
     val cv = FloatVector.fromArray(floatSpecies, codeBuf, idx)
     accVec = iv.fma(cv, accVec)
     idx += step
 }
-return (accVec.reduceLanes(ADD) + scalarTail) * scale
+return (accVec.reduceLanes(ADD) + scalarTail) * d
----
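
The split layout described above can also be decoded with no SIMD at all, which is useful as a reference when checking a kernel. The following is a minimal scalar sketch, not the project's `DequantOps.dequantQ4_0FromBytes`: the names `halfToFloat` and `dequantQ4_0Block` are hypothetical, and it assumes the 18-byte block is stored as a little-endian FP16 scale followed by the 16 code bytes.

```kotlin
// Decode an IEEE 754 binary16 value (passed as its 16 raw bits) to a Float.
// Hypothetical helper for illustration only.
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 0x1
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * (1.0f / (1 shl 24)) // zero or subnormal: mant * 2^-24
        31 -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13)) // rebias 15 -> 127
    }
    return if (sign == 1) -magnitude else magnitude
}

// Dequantize one Q4_0 block (assumed layout: 2-byte LE FP16 scale, then 16 code bytes)
// into 32 floats using the split nibble layout.
fun dequantQ4_0Block(
    block: ByteArray,
    offset: Int = 0,
    out: FloatArray = FloatArray(32),
): FloatArray {
    val lo = block[offset].toInt() and 0xFF
    val hi = block[offset + 1].toInt() and 0xFF
    val d = halfToFloat(lo or (hi shl 8))
    for (j in 0 until 16) {
        val b = block[offset + 2 + j].toInt() and 0xFF
        out[j] = ((b and 0x0F) - 8) * d      // low nibble  -> elements 0..15
        out[16 + j] = ((b ushr 4) - 8) * d   // high nibble -> elements 16..31
    }
    return out
}
```

Dotting the result of `dequantQ4_0Block` against the matching 32 input floats should agree, up to floating-point rounding, with the two-stage kernel above, since both apply `(nibble - 8) * d` with the same element ordering.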

-If Q4_0 ever becomes a hot path (it's rarely seen in modern weights —
-Q4_K_M / Q4_K_S dominate Gemma 4, Llama, Qwen), the upgrade to a
-fully-fused `ByteVector` pipeline is a reasonable follow-up — same
-shape as the Q4_K rewrite, with the lane-interleave done via
-`VectorShuffle`.
+Q4_0 is rarely the hot path in modern weights (Q4_K_M / Q4_K_S dominate
+Gemma 4, Llama, Qwen), so the scratch-then-SIMD shape is a deliberate
+balance. A fully-fused `ByteVector` pipeline is a reasonable follow-up:
+the split layout is friendlier than it looks — lo/hi nibble masks of a
+16-byte `ByteVector` load yield elements 0..15 and 16..31 directly, no
+lane-interleave shuffle required.
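
To make the shape of a fused pipeline concrete, here is a scalar stand-in, assuming the same split layout: it skips the scratch buffer and keeps one accumulator for the low nibbles and one for the high nibbles. The function name `dotQ4_0BlockFused` and its parameters are hypothetical; a vectorized version would presumably hold the two accumulators in vector registers instead of scalars.

```kotlin
// Scalar sketch of a fused Q4_0 block dot product (hypothetical helper).
// One pass over the 16 code bytes, no 32-element scratch buffer:
// the low nibble of byte j pairs with input[j], the high nibble with input[16 + j].
fun dotQ4_0BlockFused(
    codes: ByteArray,
    codesBase: Int,
    input: FloatArray,
    inputBase: Int,
    d: Float,
): Float {
    var accLo = 0f // contribution of elements 0..15
    var accHi = 0f // contribution of elements 16..31
    for (j in 0 until 16) {
        val b = codes[codesBase + j].toInt() and 0xFF
        accLo += ((b and 0x0F) - 8) * input[inputBase + j]
        accHi += ((b ushr 4) - 8) * input[inputBase + 16 + j]
    }
    // The block scale is applied once, after the reduction.
    return (accLo + accHi) * d
}
```

Because each nibble half maps to a contiguous run of 16 inputs, neither accumulator needs any element reordering, which is the property the paragraph above relies on.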

== Per-format coverage matrix

@@ -213,7 +218,7 @@ shape as the Q4_K rewrite, with the lane-interleave done via
| Q8_0 | no | yes | Fully fused (`ByteVector.castShape` + scaled FMA)
| Q4_K | yes (`Q4KMatmulKernel`) | yes (inline, same algorithm) | Fully fused (single byte load → lo+hi nibble accumulators, lazy `dmin`)
| Q6_K | no | n/a | SIMD dequant into scratch + SIMD dot (two-stage)
-| Q4_0 | no | yes | Scalar unpack into scratch + SIMD dot (two-stage)
+| Q4_0 | yes (`Q4_0MatmulKernel`) | yes | Scalar split-layout unpack into scratch + SIMD dot (two-stage)
|===

== Where to look in the code