
release: SKaiNET-transformers 0.30.0 #177

Merged
michalharakal merged 9 commits into develop from release/0.30.0
Jun 14, 2026

@michalharakal


Prepares the 0.30.0 release, version-aligned with the released SKaiNET 0.30.0 (Q5_K packed matmul, NEON native kernels, Kotlin/Native cinterop). Skips 0.29.x — tracked internally without a tagged release.

Headline

  • Q5_K stays packed in the eager Gemma runtime. GemmaMemSegConverter used to dequantize Q5_K weights to FP32 on load; the engine now provides a first-class Q5_K packed matmul (Q5_KBlockTensorData + Q5KMatmulKernel), so weights stay packed (176 B/block). FunctionGemma-270M (Q5_K_M) decodes byte-identically to the FP32 baseline (GemmaQ5KPackedParityTest).
  • Gemma NATIVE_OPTIMIZED path is Kotlin/Native–ready. The layout + packing helpers (GemmaQuantLayout.kt, GemmaPackedWeights.kt) moved to commonMain, and GemmaNetworkLoader.load() now runs convertGemmaWeightsPacked — the board binary keeps K-quant weights packed with no java.lang.foreign MemSeg dependency. Verified on JVM and linuxX64.
  • Fixes. Kernel-less quant types under NATIVE_OPTIMIZED now dequant to FP32 [out, in] instead of crashing on a rank-1 transpose; DecoderGgufMemSegConverter dequantizes Q4_1 and every other non-packed quant type (#654).
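
For reference, the 176 B/block figure follows from the Q5_K super-block layout used by ggml (`block_q5_K`). The sketch below only adds up the field sizes; the constant and function names are illustrative and are not taken from SKaiNET.

```kotlin
// Field sizes follow the ggml `block_q5_K` struct; the Kotlin names are illustrative only.
const val QK_K = 256                 // weights per K-quant super-block
const val Q5K_D_BYTES = 2            // fp16 super-block scale
const val Q5K_DMIN_BYTES = 2         // fp16 super-block minimum
const val Q5K_SCALES_BYTES = 12      // packed 6-bit sub-block scales and minimums
const val Q5K_QH_BYTES = QK_K / 8    // one high bit per weight
const val Q5K_QS_BYTES = QK_K / 2    // low four bits per weight

const val Q5K_BLOCK_BYTES =
    Q5K_D_BYTES + Q5K_DMIN_BYTES + Q5K_SCALES_BYTES + Q5K_QH_BYTES + Q5K_QS_BYTES

/** Bytes needed to keep a [rows, cols] weight matrix packed as Q5_K. */
fun q5kPackedBytes(rows: Int, cols: Int): Long {
    require(cols % QK_K == 0) { "cols must be a multiple of $QK_K" }
    return rows.toLong() * (cols / QK_K) * Q5K_BLOCK_BYTES
}

/** Bytes for the same matrix after dequantizing to FP32. */
fun fp32Bytes(rows: Int, cols: Int): Long = rows.toLong() * cols * 4
```

At 176 bytes per 256 weights the packed form costs 5.5 bits per weight, against 32 bits per weight once dequantized to FP32.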

Release prep in this PR

  • gradle.properties: VERSION_NAME 0.28.1 → 0.30.0 (catalog skainet already pinned to 0.30.0).
  • settings.gradle.kts: reverted the mavenLocal()-first dev shim — 0.30.0 is on Maven Central; the -PuseLocalSkainet composite build is unchanged.
  • CHANGELOG.md: [0.30.0] entry + tag link.
  • README.md + doc tutorials: "Current release" / BOM coordinates → 0.30.0; new "What's new in 0.30.0".
  • API dumps refreshed (./gradlew apiDump). jvmApiCheck had flagged stale dumps; all deltas reflect public API already in the source — the 0.23.3 prefill callback (llm-agent), convertGemmaWeightsPacked (gemma), and the KClass dtype param on the vendored transformer modules (llm-core).
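
A rough sketch of what the settings.gradle.kts state described above might look like. The repository list and the included-build path are assumptions made for illustration; only the `useLocalSkainet` property name comes from this PR.

```kotlin
// settings.gradle.kts (sketch; repository list and include path are assumptions)
dependencyResolutionManagement {
    repositories {
        // The removed dev shim listed mavenLocal() here, ahead of mavenCentral(),
        // so a locally published SKaiNET build shadowed the Central artifact.
        mavenCentral()
    }
}

// Opt-in composite build for local engine work: ./gradlew build -PuseLocalSkainet
if (providers.gradleProperty("useLocalSkainet").isPresent) {
    includeBuild("../SKaiNET") // hypothetical checkout location
}
```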

Validation

./gradlew build → BUILD SUCCESSFUL in 3m 3s, no failed tasks (compilation, tests, all apiCheck variants).

Integration-tagged tests (-PincludeIntegration, e.g. GemmaQ5KPackedParityTest) are not part of the default build and were not run in this pass.

🤖 Generated with Claude Code

michalharakal and others added 9 commits June 10, 2026 23:41
FunctionGemma-270M ships as Q5_K_M, but GemmaMemSegConverter dequantized
Q5_K weights to FP32 on load ("no native matmul kernel yet for Q5_K"),
losing the memory savings and the in-kernel dequant. Upstream SKaiNET
0.29.1 now provides a first-class Q5_K packed matmul (Q5_KBlockTensorData
+ Q5KMatmulKernel: scalar/Panama/native), so keep Q5_K packed here too:
relayout GGUF bytes to block-major + wrap as Q5_KBlockTensorData (176 B/
block). Dispatch + lazy transpose reach it via DefaultCpuOps.

- Bump skainet 0.28.1 -> 0.29.1 (source-of-truth for the llm-bom platform).
- settings.gradle.kts: mavenLocal first so a locally-published SKaiNET
  0.29.1 (carrying the in-progress Q5_K kernel) shadows Maven Central until
  it's released; Central remains the fallback.

Verified (GemmaQ5KPackedParityTest, -PincludeIntegration): the Q5_K packed
path decodes FunctionGemma byte-identically to the FP32 baseline —
[262146, 236769, 3255, 718, 498, 1373, 262152, 106] -> `<tool_0>(state="on")
<end>` for "Turn the light on." (the known-good tool call), 0.81 tok/s on
the JVM host incl. prefill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ard path

The board binary is Kotlin/Native, but GemmaMemSegConverter (the NATIVE_OPTIMIZED
packed-weight path) is jvmMain-only (java.lang.foreign). Move the reusable,
platform-neutral pieces to commonMain so K/N can keep K-quant weights packed:

- GemmaQuantLayout.kt (commonMain): logicalShapeFor + relayoutKSeriesRowMajor
  ToBlockMajor (now copyInto, KMP-safe) + packGemmaKQuant<T>() which builds
  heap-packed Q4_K/Q5_K/Q6_KBlockTensorData directly (no MemSeg/Arena).
- GemmaMemSegConverter (jvmMain) now shares those commonMain helpers (dup
  removed); MemSeg/FFM conversion + FP32 fallbacks stay JVM-only.
- commonTest GemmaQuantLayoutTest: block-transpose relayout + packing, runs on
  every target.

Verified: gemma compiles for JVM + linuxX64; layout tests green (3).
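
A minimal sketch of a block-transpose relayout of the kind described above. It assumes "row-major" means blocks ordered by (row, block column) and "block-major" means blocks ordered by (block column, row). The real relayoutKSeriesRowMajorToBlockMajor is not shown in this PR, so the function name, signature, and target order here are hypothetical.

```kotlin
/**
 * Hypothetical sketch: transpose a grid of fixed-size quant blocks.
 * Source order: block (row, col) sits at index row * blocksPerRow + col.
 * Target order: block (row, col) sits at index col * rows + row.
 */
fun relayoutRowMajorToBlockMajor(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int,
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) { "unexpected source size" }
    val dst = ByteArray(src.size)
    for (row in 0 until rows) {
        for (col in 0 until blocksPerRow) {
            val srcOffset = (row * blocksPerRow + col) * blockBytes
            val dstOffset = (col * rows + row) * blockBytes
            // ByteArray.copyInto is part of the common stdlib, so this runs on every KMP target.
            src.copyInto(dst, dstOffset, srcOffset, srcOffset + blockBytes)
        }
    }
    return dst
}
```

Each block is moved whole, so the bytes inside a block (176 for Q5_K) are never reinterpreted; only the order of blocks changes.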

Next (board integration): a commonMain convertGemmaWeightsPacked wired into the
K/N load path (byte extraction differs JVM IntArrayTensorData vs native Byte-
backed), then a full K/N decode on the SL2610.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oad()

NATIVE_OPTIMIZED loads produce raw-byte quant tensors the network mapper can't
consume; on JVM an external convertGemmaWeightsToMemSeg (FFM) handled that, but
the Kotlin/Native board has no such path. Add a commonMain converter and make
load() apply it, so load(NATIVE_OPTIMIZED) yields a runnable network on the
board AND the JVM (previously it couldn't be built from raw-byte weights at all).

- GemmaPackedWeights.kt (commonMain): convertGemmaWeightsPacked — packs
  Q4/5/6_K matmul weights to heap Q*_KBlockTensorData (packGemmaKQuant),
  dequants token_embd/output to FP32 (gathered, no transpose) and other quant
  types to FP32 [out,in]. No java.lang.foreign. Plus extractRawBytes, which
  reads the loader's bytes back across both backings (JVM IntArrayTensorData /
  native Byte-typed).
- GemmaNetworkLoader.load(): for NATIVE_OPTIMIZED, run convertGemmaWeightsPacked
  before applyWeightsToNetwork.
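
The per-tensor decision described above could be sketched as follows. The enums, the function, and the order of the checks are hypothetical; `token_embd.weight` and `output.weight` are the conventional GGUF tensor names and are assumed to be what "token_embd/output" refers to.

```kotlin
/** Hypothetical subset of GGUF quant types, for illustration only. */
enum class QuantType { F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0 }

/** What the converter does with one tensor. */
enum class WeightAction { KEEP, PACK_K_QUANT, DEQUANT_GATHERED, DEQUANT_OUT_IN }

// K-quant types that have a packed matmul kernel according to the commit message.
val packedKQuants = setOf(QuantType.Q4_K, QuantType.Q5_K, QuantType.Q6_K)

// Assumed GGUF names for the embedding table and the output head.
val gatheredTensors = setOf("token_embd.weight", "output.weight")

fun planWeight(name: String, type: QuantType): WeightAction = when {
    type == QuantType.F32 -> WeightAction.KEEP                        // nothing to convert
    name in gatheredTensors -> WeightAction.DEQUANT_GATHERED          // FP32, no transpose
    type in packedKQuants -> WeightAction.PACK_K_QUANT                // stays packed on the heap
    else -> WeightAction.DEQUANT_OUT_IN                               // FP32 [out, in] fallback
}
```

Checking the tensor name before the quant type reflects the description that the embedding and output tensors are dequantized even when stored as a K-quant; that ordering is an inference, not something the commit states explicitly.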

Verified on JVM AND linuxX64 (GemmaQuantLayoutTest, 4 tests each): relayout,
packing, and the byte-extraction round-trip — so native byte extraction is
executed, not just compiled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends GemmaQ5KPackedParityTest to also decode via
GemmaNetworkLoader.load(NATIVE_OPTIMIZED) — the wired commonMain
convertGemmaWeightsPacked (board) path, no MemSeg/Arena. All three paths
(FP32 baseline, jvmMain MemSeg-packed, load() packed) produce the identical
token sequence -> `<tool_0>(state="on")<end>` for "Turn the light on."
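
A small sketch of the kind of parity check this describes: every decode path must reproduce the baseline token ids exactly. The helper and the decoder map are hypothetical; the token ids are the ones reported in the first commit of this PR.

```kotlin
/** Token ids reported for "Turn the light on." in this PR. */
val expectedTokens = listOf(262146, 236769, 3255, 718, 498, 1373, 262152, 106)

/**
 * Returns the labels of decode paths whose output differs from the baseline.
 * `decoders` maps a label to a function producing token ids; both are hypothetical.
 */
fun mismatchedPaths(
    baseline: List<Int>,
    decoders: Map<String, () -> List<Int>>,
): List<String> = decoders.filter { (_, decode) -> decode() != baseline }.keys.toList()
```

An empty result means all paths agree token for token, which is a stricter condition than comparing only the detokenized text.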

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Six real-model integration tests (RealGemmaLoad/Eager/BakeIrpa/ExternalParam/
DequantDump + GemmaBehavioralAb) pointed at an old workspace path
(/home/miso/projects/coral/sl2610-voice-cc-kt/models/...) and failed with
"File not found" under -PincludeIntegration. Repoint them to the actual model
location (SKaiNET-embedded/sl2610-function-calling/models/), matching
GemmaQ5KPackedParityTest.

Verified: all 6 pass against skainet 0.30.0 (mavenLocal), -PincludeIntegration.
Version-aligned with the released SKaiNET 0.30.0 (Q5_K packed matmul, NEON
native kernels, Kotlin/Native cinterop), already pinned in the catalog.

- gradle.properties: VERSION_NAME 0.28.1 -> 0.30.0.
- settings.gradle.kts: revert the mavenLocal()-first dev shim (0.30.0 is on
  Maven Central; the -PuseLocalSkainet composite build stays for local work).
- CHANGELOG.md: add the [0.30.0] entry (Q5_K packed eager runtime, K/N-ready
  NATIVE_OPTIMIZED Gemma path, kernel-less/Q4_1 dequant fixes) + tag link.
- README.md: bump "Current release" + BOM snippet to 0.30.0; add
  "What's new in 0.30.0".
- docs tutorials: bump BOM coordinates 0.28.1 -> 0.30.0.

No merge, no tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`./gradlew build` runs `jvmApiCheck`, which flagged the committed `.api`
dumps as stale. Regenerated via `./gradlew apiDump`; all changes reflect
public API already present in the source on this branch:

- llm-agent: the 0.23.3 prefill-progress callback — `generateUntilStop`
  gained its `onPrefill` `Function2` param and `AgentListener` gained
  `onPrefillProgress(Int, Int)`; the dump was never refreshed.
- llm-inference/gemma: `convertGemmaWeightsPacked` — the commonMain
  packed-weight converter added for the Kotlin/Native NATIVE_OPTIMIZED path.
- llm-core: trailing `KClass` dtype param on the vendored transformer
  modules (AttentionImpl / RMSNormalization / GeGLUFFN / MultiHeadAttention
  / LayerScalarMul / VoidDense) from earlier engine-aligned work.

`./gradlew build` now green end-to-end (3m 3s, no failed tasks).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st blocks

The real-model FunctionGemma-270M integration tests (-PincludeIntegration)
OOM'd with `Java heap space` at the previous 8g default once the model file
is present: GemmaQ5KPackedParityTest holds the FP32 baseline plus both packed
decode networks at once, and the bake-to-irpa test holds weights + serialized
bytes simultaneously.

- Bump the `gemmaTestMaxHeap` default 8g -> 12g.
- Merge the two overlapping `tasks.withType<Test>().configureEach { }` blocks
  into one — the second silently overrode the first's maxHeapSize (so jvmArgs
  ran with 6g declared but 8g effective). Now jvmArgs, heap, and the seqLen
  system property live in a single block.
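
A sketch of what the merged block might look like in build.gradle.kts. Only the `gemmaTestMaxHeap` property name and its 12g default come from this commit; the JVM argument and the system-property name and value are placeholders.

```kotlin
// build.gradle.kts (sketch; the JVM argument and system property are placeholders)
val gemmaTestMaxHeap = providers.gradleProperty("gemmaTestMaxHeap").orElse("12g")

tasks.withType<Test>().configureEach {
    // One block only: configureEach actions run in registration order, so a
    // second block that also assigns maxHeapSize would override this value.
    maxHeapSize = gemmaTestMaxHeap.get()
    jvmArgs("-XX:+HeapDumpOnOutOfMemoryError")   // placeholder for the module's real JVM args
    systemProperty("gemma.seqLen", "128")        // hypothetical property name and value
}
```

With this shape the default can still be overridden per invocation, for example with `-PgemmaTestMaxHeap=16g`.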

CI is unaffected: without the model file the integration tests self-skip and
never allocate the headroom. Verified: `:llm-inference:gemma:jvmTest
-PincludeIntegration` green with no -P override (87 tests, 6 skipped, 0
failures); GemmaQ5KPackedParityTest runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit eb505fe into develop Jun 14, 2026
6 checks passed
@michalharakal michalharakal deleted the release/0.30.0 branch June 14, 2026 19:41