SKaiNET-developers · michalharakal · Jun 16, 2026 · Jun 15, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,47 @@ version line is kept in lock-step with the underlying SKaiNET engine
 The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.31.0] — 2026-06-15
+
+Version-aligned with **SKaiNET 0.31.0**. Completes the eager board-decode path
+for FunctionGemma: the tied **Q8_0 lm_head now stays packed** (paired with the
+engine's `ops.transpose` fix for all packed dtypes), and `load()` can cap the
+context to fit constrained devices.
+
+### Added
+
+- **`maxInferenceLen` on `GemmaNetworkLoader.load()`** — an optional cap on the
+  context length the eager network sizes its KV cache and RoPE tables for,
+  threaded through `applyWeightsToNetwork` → `gemmaNetwork`. Without a cap, the
+  KV cache allocates ~0.4 GB at the first forward pass and OOMs a constrained
+  device (e.g. the 1.9 GB SL2610 board) after the weights load; passing a small
+  value (e.g. `32` for a short tool-call prompt) shrinks it ~100×. The default,
+  `null`, keeps the existing sizing of `min(contextLength, 4096)`. (#180)
+
+### Changed
+
+- **`gradle/libs.versions.toml` `skainet` pin: 0.30.0 → 0.31.0.** Picks up the
+  engine's `ops.transpose` lazy-rewrap fix for **all** packed matmul dtypes
+  (Q8_0/Q4_0 added) — required so the packed Q8_0 lm_head below transposes
+  through `linearProject` instead of throwing `ClassCastException`. Downstream
+  consumers get the upstream SKaiNET BOM transparently via `:llm-bom`.
+- **`gradle.properties` `VERSION_NAME=0.31.0`.** Lock-step with the engine.
+- **`com.networknt:json-schema-validator` → 3.0.4.** (#175)
+
+### Fixed
+
+- **Tied Q8_0 lm_head stays packed in the eager `NATIVE_OPTIMIZED` Gemma path.**
+  FunctionGemma's `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked`
+  was dequantizing **both** `token_embd` and `output` to FP32 (2×~0.67 GB) —
+  OOM on the 1.9 GB SL2610. `output`/lm_head now packs as Q8_0
+  (`packGemmaKQuant` gained a Q8_0 case; the row-major→block-major relayout is
+  generalized with a `blockSize` param) and runs on the (NEON) Q8_0 kernel;
+  `token_embd` stays FP32 (it is gathered, not matmul'd) but is wrapped no-copy
+  via `DenseFloatArrayTensorData` instead of `ctx.fromFloatArray` (which
+  allocated a second ~0.67 GB buffer). Tied embed/lm_head footprint
+  ~1.34 GB → ~0.76 GB. Verified byte-identical decode parity
+  (`GemmaQ5KPackedParityTest`) and a stable ~1.06 GB load on the SL2610. (#179)
+
 ## [0.30.0] — 2026-06-14
 
 Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
@@ -489,6 +530,7 @@ Version-aligned with **SKaiNET 0.21.0**.
 Last published transformers release before the engine-aligned version line.
 See `git log v0.16.0..0.18.0` for details.
 
+[0.31.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.31.0
 [0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
 [0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
 [0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
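
Reviewer note on the `maxInferenceLen` entry above: the following is a minimal sketch of why capping the context shrinks the KV cache, assuming the usual estimate of one K and one V tensor per layer. The helper names `effectiveInferenceLen` and `kvCacheBytes` are hypothetical and are not part of `GemmaNetworkLoader`; the layer, head, and dimension values are illustrative only and are not taken from the FunctionGemma configuration.

```kotlin
import kotlin.math.min

// Hypothetical helper (not the library's API): resolves the length the cache is
// sized for. A null cap falls back to min(contextLength, 4096), as the changelog states.
fun effectiveInferenceLen(contextLength: Int, maxInferenceLen: Int? = null): Int =
    maxInferenceLen ?: min(contextLength, 4096)

// Common KV-cache size estimate: K and V (factor 2) for every layer, KV head,
// head dimension, and cached position, times the element width in bytes.
fun kvCacheBytes(
    seqLen: Int,
    layers: Int,
    kvHeads: Int,
    headDim: Int,
    bytesPerElem: Int = 4, // FP32 cache assumed for this sketch
): Long = 2L * layers * kvHeads * headDim * seqLen * bytesPerElem

fun main() {
    // Illustrative dimensions only (assumption, not the real model config).
    val layers = 18
    val kvHeads = 1
    val headDim = 256

    val defaultLen = effectiveInferenceLen(contextLength = 32_768)
    val cappedLen = effectiveInferenceLen(contextLength = 32_768, maxInferenceLen = 32)

    val full = kvCacheBytes(defaultLen, layers, kvHeads, headDim)
    val capped = kvCacheBytes(cappedLen, layers, kvHeads, headDim)

    println("default length $defaultLen -> $full bytes")
    println("capped length $cappedLen -> $capped bytes")
    println("reduction factor: ${full / capped}")
}
```

Because the estimate is linear in the sequence length, going from 4096 to 32 positions divides the cache by 128 regardless of the other dimensions, which is in line with the "~100×" quoted in the entry.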

diff --git a/README.md b/README.md
@@ -103,21 +103,20 @@ Honest status — see the project-status note at the top of this README.
 
 ## Current release
 
-The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
-Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
-window without a tagged release. The headline is that **Q5_K weights now stay
-packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
-packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
-**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
-without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
-decodes byte-identically across the FP32 baseline and both packed paths
-(`GemmaQ5KPackedParityTest`).
+The current release is **0.31.0** — version-aligned with **SKaiNET 0.31.0**.
+The headline is that the eager `NATIVE_OPTIMIZED` Gemma path now keeps the
+**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
+for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
+`maxInferenceLen` to cap the KV cache for constrained devices — together
+dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
+Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
+across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
 
 The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
 
 ```kotlin
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
 
     // Versions resolved from the BOM:
     implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -194,6 +193,22 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
 
 See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
 
+## What's new in 0.31.0
+
+- **Tied Q8_0 lm_head stays packed (eager `NATIVE_OPTIMIZED`).** FunctionGemma's
+  `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked` was dequantizing
+  *both* `token_embd` and `output` to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
+  SL2610. `output`/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
+  `token_embd` stays FP32 (it's gathered) but is wrapped no-copy. Footprint
+  ~1.34 GB → ~0.76 GB; byte-identical decode (`GemmaQ5KPackedParityTest`),
+  stable ~1.06 GB load on the SL2610.
+- **`GemmaNetworkLoader.load(maxInferenceLen = …)`** — cap the context so the KV
+  cache + RoPE tables stay tiny on constrained devices (default
+  `min(contextLength, 4096)`).
+- **Engine pin `skainet 0.30.0 → 0.31.0`** — picks up `ops.transpose`'s
+  lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the packed
+  lm_head transposes through `linearProject` instead of throwing `ClassCastException`.
+
 ## What's new in 0.30.0
 
 - **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to

diff --git a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
     <dependency>
       <groupId>sk.ainet.transformers</groupId>
       <artifactId>skainet-transformers-bom</artifactId>
-      <version>0.30.0</version>
+      <version>0.31.0</version>
       <type>pom</type>
       <scope>import</scope>
     </dependency>

diff --git a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")

diff --git a/gradle.properties b/gradle.properties
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.transformers
-VERSION_NAME=0.30.0
+VERSION_NAME=0.31.0
 
 POM_DESCRIPTION=SKaiNET-transformers
 
@@ -33,6 +33,15 @@ kotlin.mpp.enableCInteropCommonization=true
 #Android
 android.useAndroidX=true
 android.nonTransitiveRClass=true
+# AGP's DependencyResolutionChecks fails the build when a configuration resolves
+# at configuration time. KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm
+# `*NpmAggregated` configs at config time (we have JS npm deps: ktor-client-js,
+# kotlinx-browser), so `assemble`/`allTests` throw `Configuration 'jsNpmAggregated'
+# was resolved during configuration time` (gradle#31483) — a false positive against
+# KGP's known behaviour. Downgrade AGP's check from fail to warn. NOTE: AGP reads
+# this option only from the project gradle.properties — NOT from -P or the CI's
+# ~/.gradle/gradle.properties.
+android.dependencyResolutionAtConfigurationTime.disallow=false
 
 kotlin.mpp.stability.nowarn=true
 

diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
@@ -1,5 +1,5 @@
 [versions]
-skainet = "0.30.0"
+skainet = "0.31.0"
 agp = "9.2.1"
 jacksonDatabind = "2.22.0"
 jsonSchemaValidator = "3.0.4"

diff --git a/llm-inference/gemma/api/jvm/gemma.api b/llm-inference/gemma/api/jvm/gemma.api
@@ -862,7 +862,8 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoader$WeightsProvider$Safe
 }
 
 public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt {
-	public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module;
+	public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;)Lsk/ainet/lang/nn/Module;
+	public static synthetic fun applyWeightsToNetworkNonReified$default (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;ILjava/lang/Object;)Lsk/ainet/lang/nn/Module;
 }
 
 public final class sk/ainet/models/gemma/GemmaPackedWeightsKt {

diff --git a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
@@ -8,6 +8,7 @@ import sk.ainet.context.DirectCpuExecutionContext
 import sk.ainet.io.gguf.GGMLQuantizationType
 import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q8_0BlockTensorData
 import sk.ainet.lang.types.FP32
 import sk.ainet.lang.types.Int8
 
@@ -55,8 +56,17 @@ class GemmaQuantLayoutTest {
     }
 
     @Test
-    fun pack_non_kquant_returns_null() {
-        assertNull(packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
+    fun pack_q8_0_produces_block_tensor() {
+        // Q8_0 is now packed (32 elems / 34 B per block) so a tied Q8_0 lm_head
+        // stays packed and runs on the Q8_0 kernel instead of dequantizing to FP32.
+        val td = packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32))
+        assertTrue(td is Q8_0BlockTensorData, "Q8_0 should pack to Q8_0BlockTensorData")
+    }
+
+    @Test
+    fun pack_unsupported_quant_returns_null() {
+        // A quant type with no packed kernel (e.g. Q4_1) falls back to FP32 dequant.
+        assertNull(packGemmaKQuant<FP32>(ByteArray(20), GGMLQuantizationType.Q4_1, Shape(1, 32)))
     }
 
     @Test
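
Reviewer note on the "32 elems / 34 B per block" comment in the test above: the sketch below shows where the 34 bytes come from and what dequantizing one block to FP32 involves, assuming the ggml Q8_0 layout of a 2-byte little-endian fp16 scale followed by 32 signed 8-bit quants. All function names here are hypothetical; this is not the code behind `packGemmaKQuant` or `Q8_0BlockTensorData`, and it does not model the row-major to block-major relayout.

```kotlin
const val Q8_0_BLOCK_ELEMS = 32
const val Q8_0_BLOCK_BYTES = 34 // 2-byte fp16 scale + 32 one-byte quants

// Bytes needed to store `elements` values packed as Q8_0.
fun q8_0Bytes(elements: Long): Long {
    require(elements % Q8_0_BLOCK_ELEMS == 0L) { "element count must be a multiple of 32" }
    return elements / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
}

// Bytes needed for the same values dequantized to FP32.
fun fp32Bytes(elements: Long): Long = elements * 4

// Converts IEEE 754 half-precision bits (low 16 bits of `h`) to a Float.
fun halfToFloat(h: Int): Float {
    val sign = (h shr 15) and 0x1
    val exp = (h shr 10) and 0x1F
    val mant = h and 0x3FF
    val bits = when {
        exp == 0 && mant == 0 -> sign shl 31 // signed zero
        exp == 0 -> { // subnormal half: normalise the mantissa
            var e = -1
            var m = mant
            do { e++; m = m shl 1 } while ((m and 0x400) == 0)
            (sign shl 31) or ((127 - 15 - e) shl 23) or ((m and 0x3FF) shl 13)
        }
        exp == 0x1F -> (sign shl 31) or (0xFF shl 23) or (mant shl 13) // inf / NaN
        else -> (sign shl 31) or ((exp - 15 + 127) shl 23) or (mant shl 13)
    }
    return Float.fromBits(bits)
}

// Dequantizes one Q8_0 block starting at `offset`: value = quant * scale.
fun dequantizeQ8_0Block(block: ByteArray, offset: Int = 0): FloatArray {
    require(block.size - offset >= Q8_0_BLOCK_BYTES) { "need 34 bytes for one block" }
    val lo = block[offset].toInt() and 0xFF
    val hi = block[offset + 1].toInt() and 0xFF
    val scale = halfToFloat(lo or (hi shl 8))
    return FloatArray(Q8_0_BLOCK_ELEMS) { i -> block[offset + 2 + i].toFloat() * scale }
}

fun main() {
    // One block with scale 0.5 (fp16 bits 0x3800) and quants -16..15.
    val block = ByteArray(Q8_0_BLOCK_BYTES)
    block[1] = 0x38
    for (i in 0 until Q8_0_BLOCK_ELEMS) block[2 + i] = (i - 16).toByte()

    val values = dequantizeQ8_0Block(block)
    println("first=${values.first()} last=${values.last()}")
    println("packed=${q8_0Bytes(32)} B, fp32=${fp32Bytes(32)} B per 32 elements")
}
```

Per 32 elements the packed form takes 34 bytes against 128 bytes for FP32, which is why keeping the lm_head packed avoids the second large FP32 buffer described in the changelog.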