diff --git a/CHANGELOG.md b/CHANGELOG.md
index d688dc8..e308e8a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,47 @@ version line is kept in lock-step with the underlying SKaiNET engine
 The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.31.0] — 2026-06-15
+
+Version-aligned with **SKaiNET 0.31.0**. Completes the eager board-decode path
+for FunctionGemma: the tied **Q8_0 lm_head now stays packed** (paired with the
+engine's `ops.transpose` fix for all packed dtypes), and `load()` can cap the
+context to fit constrained devices.
+
+### Added
+
+- **`maxInferenceLen` on `GemmaNetworkLoader.load()`** — an optional cap on the
+  context length the eager network sizes its KV cache + RoPE tables for (default
+  `min(contextLength, 4096)`, threaded through `applyWeightsToNetwork` →
+  `gemmaNetwork`). A constrained-device consumer (e.g. the 1.9 GB SL2610 board)
+  can pass a small value (e.g. `32` for a short tool-call prompt) to shrink the
+  KV cache ~100×, which otherwise allocates ~0.4 GB at the first forward and OOMs
+  the board after the weights load. Default `null` preserves existing behaviour. (#180)
+
+### Changed
+
+- **`gradle/libs.versions.toml` `skainet` pin: 0.30.0 → 0.31.0.** Picks up the
+  engine's `ops.transpose` lazy-rewrap fix for **all** packed matmul dtypes
+  (Q8_0/Q4_0 added) — required so the packed Q8_0 lm_head below transposes
+  through `linearProject` instead of throwing `ClassCastException`. Downstream
+  consumers get the upstream SKaiNET BOM transparently via `:llm-bom`.
+- **`gradle.properties` `VERSION_NAME=0.31.0`.** Lock-step with the engine.
+- **`com.networknt:json-schema-validator` → 3.0.4.** (#175)
+
+### Fixed
+
+- **Tied Q8_0 lm_head stays packed in the eager `NATIVE_OPTIMIZED` Gemma path.**
+  FunctionGemma's `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked`
+  was dequantizing **both** `token_embd` and `output` to FP32 (2×~0.67 GB) —
+  OOM on the 1.9 GB SL2610. `output`/lm_head now packs as Q8_0
+  (`packGemmaKQuant` gained a Q8_0 case; the row-major→block-major relayout is
+  generalized with a `blockSize` param) and runs on the (NEON) Q8_0 kernel;
+  `token_embd` stays FP32 (it is gathered, not matmul'd) but is wrapped no-copy
+  via `DenseFloatArrayTensorData` instead of `ctx.fromFloatArray` (which
+  allocated a second ~0.67 GB buffer). Tied embed/lm_head footprint
+  ~1.34 GB → ~0.76 GB. Verified byte-identical decode parity
+  (`GemmaQ5KPackedParityTest`) and a stable ~1.06 GB load on the SL2610. (#179)
+
 ## [0.30.0] — 2026-06-14
 
 Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
@@ -489,6 +530,7 @@ Version-aligned with **SKaiNET 0.21.0**.
 Last published transformers release before the engine-aligned version line. See
 `git log v0.16.0..0.18.0` for details.
 
+[0.31.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.31.0
 [0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
 [0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
 [0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
diff --git a/README.md b/README.md
index a2d7681..b74313e 100644
--- a/README.md
+++ b/README.md
@@ -103,21 +103,20 @@ Honest status — see the project-status note at the top of this README.
 
 ## Current release
 
-The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
-Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
-window without a tagged release. The headline is that **Q5_K weights now stay
-packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
-packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
-**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
-without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
-decodes byte-identically across the FP32 baseline and both packed paths
-(`GemmaQ5KPackedParityTest`).
+The current release is **0.31.0** — version-aligned with **SKaiNET 0.31.0**.
+The headline is that the eager `NATIVE_OPTIMIZED` Gemma path now keeps the
+**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
+for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
+`maxInferenceLen` to cap the KV cache for constrained devices — together
+dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
+Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
+across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
 
 The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
 
 ```kotlin
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
 
     // Versions resolved from the BOM:
     implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -194,6 +193,22 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
 
 See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
 
+## What's new in 0.31.0
+
+- **Tied Q8_0 lm_head stays packed (eager `NATIVE_OPTIMIZED`).** FunctionGemma's
+  `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked` was dequantizing
+  *both* `token_embd` and `output` to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
+  SL2610. `output`/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
+  `token_embd` stays FP32 (it's gathered) but is wrapped no-copy. Footprint
+  ~1.34 GB → ~0.76 GB; byte-identical decode (`GemmaQ5KPackedParityTest`),
+  stable ~1.06 GB load on the SL2610.
+- **`GemmaNetworkLoader.load(maxInferenceLen = …)`** — cap the context so the KV
+  cache + RoPE tables stay tiny on constrained devices (default
+  `min(contextLength, 4096)`).
+- **Engine pin `skainet 0.30.0 → 0.31.0`** — picks up `ops.transpose`'s
+  lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the
+  packed lm_head transposes through `linearProject` instead of `ClassCastException`.
+
 ## What's new in 0.30.0
 
 - **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to
diff --git a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
index 87548dc..4723dd7 100644
--- a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
+++ b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
         <dependency>
             <groupId>sk.ainet.transformers</groupId>
             <artifactId>skainet-transformers-bom</artifactId>
-            <version>0.30.0</version>
+            <version>0.31.0</version>
             <type>pom</type>
             <scope>import</scope>
         </dependency>
diff --git a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
index 07f123c..0be131c 100644
--- a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
+++ b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
diff --git a/gradle.properties b/gradle.properties
index 1987d82..942ff89 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.transformers
-VERSION_NAME=0.30.0
+VERSION_NAME=0.31.0
 
 POM_DESCRIPTION=SKaiNET-transformers
 
@@ -33,6 +33,15 @@ kotlin.mpp.enableCInteropCommonization=true
 #Android
 android.useAndroidX=true
 android.nonTransitiveRClass=true
+# AGP's DependencyResolutionChecks fails the build when a configuration resolves
+# at configuration time. KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm
+# `*NpmAggregated` configs at config time (we have JS npm deps: ktor-client-js,
+# kotlinx-browser), so `assemble`/`allTests` throw `Configuration 'jsNpmAggregated'
+# was resolved during configuration time` (gradle#31483) — a false positive against
+# KGP's known behaviour. Downgrade AGP's check from fail to warn. NOTE: AGP reads
+# this option only from the project gradle.properties — NOT from -P or the CI's
+# ~/.gradle/gradle.properties.
+android.dependencyResolutionAtConfigurationTime.disallow=false
 
 kotlin.mpp.stability.nowarn=true
 
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
index 08e5498..462b011 100644
--- a/gradle/libs.versions.toml
+++ b/gradle/libs.versions.toml
@@ -1,5 +1,5 @@
 [versions]
-skainet = "0.30.0"
+skainet = "0.31.0"
 agp = "9.2.1"
 jacksonDatabind = "2.22.0"
 jsonSchemaValidator = "3.0.4"
diff --git a/llm-inference/gemma/api/jvm/gemma.api b/llm-inference/gemma/api/jvm/gemma.api
index 4483f8c..4360f2a 100644
--- a/llm-inference/gemma/api/jvm/gemma.api
+++ b/llm-inference/gemma/api/jvm/gemma.api
@@ -862,7 +862,8 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoader$WeightsProvider$Safe
 }
 
 public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt {
-	public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module;
+	public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;)Lsk/ainet/lang/nn/Module;
+	public static synthetic fun applyWeightsToNetworkNonReified$default (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;ILjava/lang/Object;)Lsk/ainet/lang/nn/Module;
 }
 
 public final class sk/ainet/models/gemma/GemmaPackedWeightsKt {
diff --git a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
index 52a1cdd..82f40d9 100644
--- a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
+++ b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
@@ -8,6 +8,7 @@ import sk.ainet.context.DirectCpuExecutionContext
 import sk.ainet.io.gguf.GGMLQuantizationType
 import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q8_0BlockTensorData
 import sk.ainet.lang.types.FP32
 import sk.ainet.lang.types.Int8
 
@@ -55,8 +56,17 @@ class GemmaQuantLayoutTest {
     }
 
     @Test
-    fun pack_non_kquant_returns_null() {
-        assertNull(packGemmaKQuant(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
+    fun pack_q8_0_produces_block_tensor() {
+        // Q8_0 is now packed (32 elems / 34 B per block) so a tied Q8_0 lm_head
+        // stays packed and runs on the Q8_0 kernel instead of dequanting to FP32.
+        val td = packGemmaKQuant(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32))
+        assertTrue(td is Q8_0BlockTensorData, "Q8_0 should pack to Q8_0BlockTensorData")
+    }
+
+    @Test
+    fun pack_unsupported_quant_returns_null() {
+        // A quant type with no packed kernel (e.g. Q4_1) falls back to FP32 dequant.
+        assertNull(packGemmaKQuant(ByteArray(20), GGMLQuantizationType.Q4_1, Shape(1, 32)))
     }
 
     @Test