diff --git a/CHANGELOG.md b/CHANGELOG.md
index d688dc8..e308e8a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,47 @@ version line is kept in lock-step with the underlying SKaiNET engine
The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.31.0] — 2026-06-15
+
+Version-aligned with **SKaiNET 0.31.0**. Completes the eager board-decode path
+for FunctionGemma: the tied **Q8_0 lm_head now stays packed** (paired with the
+engine's `ops.transpose` fix for all packed dtypes), and `load()` can cap the
+context to fit constrained devices.
+
+### Added
+
+- **`maxInferenceLen` on `GemmaNetworkLoader.load()`** — an optional cap on the
+ context length for which the eager network sizes its KV cache and RoPE tables
+ (threaded through `applyWeightsToNetwork` → `gemmaNetwork`). A constrained-device
+ consumer (e.g. the 1.9 GB SL2610 board) can pass a small value (e.g. `32` for a
+ short tool-call prompt) to shrink the KV cache ~100×; uncapped, it allocates
+ ~0.4 GB at the first forward pass and OOMs the board after the weights load. The
+ default `null` keeps the existing `min(contextLength, 4096)` sizing. (#180)
+
+### Changed
+
+- **`gradle/libs.versions.toml` `skainet` pin: 0.30.0 → 0.31.0.** Picks up the
+ engine's `ops.transpose` lazy-rewrap fix for **all** packed matmul dtypes
+ (Q8_0/Q4_0 added) — required so the packed Q8_0 lm_head below transposes
+ through `linearProject` instead of throwing `ClassCastException`. Downstream
+ consumers get the upstream SKaiNET BOM transparently via `:llm-bom`.
+- **`gradle.properties` `VERSION_NAME=0.31.0`.** Lock-step with the engine.
+- **`com.networknt:json-schema-validator` → 3.0.4.** (#175)
+
+### Fixed
+
+- **Tied Q8_0 lm_head stays packed in the eager `NATIVE_OPTIMIZED` Gemma path.**
+ FunctionGemma's `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked`
+ was dequantizing **both** `token_embd` and `output` to FP32 (2×~0.67 GB) —
+ OOM on the 1.9 GB SL2610. `output`/lm_head now packs as Q8_0
+ (`packGemmaKQuant` gained a Q8_0 case; the row-major→block-major relayout is
+ generalized with a `blockSize` param) and runs on the (NEON) Q8_0 kernel;
+ `token_embd` stays FP32 (it is gathered, not matmul'd) but is wrapped no-copy
+ via `DenseFloatArrayTensorData` instead of `ctx.fromFloatArray` (which
+ allocated a second ~0.67 GB buffer). Tied embed/lm_head footprint
+ ~1.34 GB → ~0.76 GB. Verified byte-identical decode parity
+ (`GemmaQ5KPackedParityTest`) and a stable ~1.06 GB load on the SL2610. (#179)
+
## [0.30.0] — 2026-06-14
Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
@@ -489,6 +530,7 @@ Version-aligned with **SKaiNET 0.21.0**.
Last published transformers release before the engine-aligned version line.
See `git log v0.16.0..0.18.0` for details.
+[0.31.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.31.0
[0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
[0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
[0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
diff --git a/README.md b/README.md
index a2d7681..b74313e 100644
--- a/README.md
+++ b/README.md
@@ -103,21 +103,20 @@ Honest status — see the project-status note at the top of this README.
## Current release
-The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
-Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
-window without a tagged release. The headline is that **Q5_K weights now stay
-packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
-packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
-**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
-without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
-decodes byte-identically across the FP32 baseline and both packed paths
-(`GemmaQ5KPackedParityTest`).
+The current release is **0.31.0** — version-aligned with **SKaiNET 0.31.0**.
+The headline is that the eager `NATIVE_OPTIMIZED` Gemma path now keeps the
+**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
+for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
+`maxInferenceLen` to cap the KV cache for constrained devices — together
+dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
+Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
+across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
```kotlin
dependencies {
- implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+ implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
// Versions resolved from the BOM:
implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -194,6 +193,22 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
+## What's new in 0.31.0
+
+- **Tied Q8_0 lm_head stays packed (eager `NATIVE_OPTIMIZED`).** FunctionGemma's
+ `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked` was dequantizing
+ *both* `token_embd` and `output` to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
+ SL2610. `output`/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
+ `token_embd` stays FP32 (it's gathered) but is wrapped no-copy. Footprint
+ ~1.34 GB → ~0.76 GB; byte-identical decode (`GemmaQ5KPackedParityTest`),
+ stable ~1.06 GB load on the SL2610.
+- **`GemmaNetworkLoader.load(maxInferenceLen = …)`** — cap the context so the KV
+ cache + RoPE tables stay tiny on constrained devices (default
+ `min(contextLength, 4096)`).
+- **Engine pin `skainet 0.30.0 → 0.31.0`** — picks up `ops.transpose`'s
+ lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the
+ packed lm_head transposes through `linearProject` instead of throwing `ClassCastException`.
+
## What's new in 0.30.0
- **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to
diff --git a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
index 87548dc..4723dd7 100644
--- a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
+++ b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
[source,kotlin]
----
dependencies {
- implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+ implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
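<groupId>sk.ainet.transformers</groupId>
<artifactId>skainet-transformers-bom</artifactId>
- <version>0.30.0</version>
+ <version>0.31.0</version>
<type>pom</type>
<scope>import</scope>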
diff --git a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
index 07f123c..0be131c 100644
--- a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
+++ b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
[source,kotlin]
----
dependencies {
- implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+ implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
diff --git a/gradle.properties b/gradle.properties
index 1987d82..942ff89 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -1,5 +1,5 @@
GROUP=sk.ainet.transformers
-VERSION_NAME=0.30.0
+VERSION_NAME=0.31.0
POM_DESCRIPTION=SKaiNET-transformers
@@ -33,6 +33,15 @@ kotlin.mpp.enableCInteropCommonization=true
#Android
android.useAndroidX=true
android.nonTransitiveRClass=true
+# AGP's DependencyResolutionChecks fails the build when a configuration resolves
+# at configuration time. KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm
+# `*NpmAggregated` configs at config time (we have JS npm deps: ktor-client-js,
+# kotlinx-browser), so `assemble`/`allTests` throw `Configuration 'jsNpmAggregated'
+# was resolved during configuration time` (gradle#31483) — a false positive against
+# KGP's known behaviour. Downgrade AGP's check from fail to warn. NOTE: AGP reads
+# this option only from the project gradle.properties — NOT from -P or the CI's
+# ~/.gradle/gradle.properties.
+android.dependencyResolutionAtConfigurationTime.disallow=false
kotlin.mpp.stability.nowarn=true
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
index 08e5498..462b011 100644
--- a/gradle/libs.versions.toml
+++ b/gradle/libs.versions.toml
@@ -1,5 +1,5 @@
[versions]
-skainet = "0.30.0"
+skainet = "0.31.0"
agp = "9.2.1"
jacksonDatabind = "2.22.0"
jsonSchemaValidator = "3.0.4"
diff --git a/llm-inference/gemma/api/jvm/gemma.api b/llm-inference/gemma/api/jvm/gemma.api
index 4483f8c..4360f2a 100644
--- a/llm-inference/gemma/api/jvm/gemma.api
+++ b/llm-inference/gemma/api/jvm/gemma.api
@@ -862,7 +862,8 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoader$WeightsProvider$Safe
}
public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt {
- public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module;
+ public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;)Lsk/ainet/lang/nn/Module;
+ public static synthetic fun applyWeightsToNetworkNonReified$default (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;ILjava/lang/Object;)Lsk/ainet/lang/nn/Module;
}
public final class sk/ainet/models/gemma/GemmaPackedWeightsKt {
diff --git a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
index 52a1cdd..82f40d9 100644
--- a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
+++ b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
@@ -8,6 +8,7 @@ import sk.ainet.context.DirectCpuExecutionContext
import sk.ainet.io.gguf.GGMLQuantizationType
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q8_0BlockTensorData
import sk.ainet.lang.types.FP32
import sk.ainet.lang.types.Int8
@@ -55,8 +56,17 @@ class GemmaQuantLayoutTest {
}
@Test
- fun pack_non_kquant_returns_null() {
- assertNull(packGemmaKQuant(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
+ fun pack_q8_0_produces_block_tensor() {
+ // Q8_0 is now packed (32 elems / 34 B per block) so a tied Q8_0 lm_head
+ // stays packed and runs on the Q8_0 kernel instead of being dequantized to FP32.
+ val td = packGemmaKQuant(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32))
+ assertTrue(td is Q8_0BlockTensorData, "Q8_0 should pack to Q8_0BlockTensorData")
+ }
+
+ @Test
+ fun pack_unsupported_quant_returns_null() {
+ // A quant type with no packed kernel (e.g. Q4_1) falls back to FP32 dequant.
+ assertNull(packGemmaKQuant(ByteArray(20), GGMLQuantizationType.Q4_1, Shape(1, 32)))
}
@Test