42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,47 @@ version line is kept in lock-step with the underlying SKaiNET engine
The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.31.0] — 2026-06-15

Version-aligned with **SKaiNET 0.31.0**. Completes the eager board-decode path
for FunctionGemma: the tied **Q8_0 lm_head now stays packed** (paired with the
engine's `ops.transpose` fix for all packed dtypes), and `load()` can cap the
context to fit constrained devices.

### Added

- **`maxInferenceLen` on `GemmaNetworkLoader.load()`**: an optional cap on the
context length for which the eager network sizes its KV cache and RoPE tables.
The parameter defaults to `null`, which preserves the existing sizing of
`min(contextLength, 4096)`; it is threaded through `applyWeightsToNetwork` →
`gemmaNetwork`. Without a cap, the KV cache allocates ~0.4 GB at the first
forward pass and OOMs the 1.9 GB SL2610 board after the weights load. A
constrained-device consumer can pass a small value (e.g. `32` for a short
tool-call prompt) to shrink the KV cache ~100×. (#180)
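
The scaling behind the cap can be sketched in a few lines of Kotlin. This is an illustration only: `effectiveInferenceLen` and `kvCacheBytes` are hypothetical helpers, not part of the SKaiNET API, the model dimensions are placeholder values, and the sketch assumes an explicit `maxInferenceLen` is used as given.

```kotlin
import kotlin.math.min

// Hypothetical helper: a null cap falls back to the documented default,
// min(contextLength, 4096). An explicit cap is assumed to be used as given.
fun effectiveInferenceLen(contextLength: Int, maxInferenceLen: Int?): Int =
    maxInferenceLen ?: min(contextLength, 4096)

// Generic KV-cache size: keys and values (factor 2) for every layer,
// KV head, head dimension, and cached position.
fun kvCacheBytes(seqLen: Int, layers: Int, kvHeads: Int, headDim: Int, bytesPerElem: Int): Long =
    2L * layers * kvHeads * headDim * seqLen * bytesPerElem

fun main() {
    val contextLength = 32_768 // assumed context length, for illustration only
    val full = effectiveInferenceLen(contextLength, null) // 4096
    val capped = effectiveInferenceLen(contextLength, 32) // 32

    // Placeholder dimensions; the ratio does not depend on them.
    val fullBytes = kvCacheBytes(full, layers = 2, kvHeads = 1, headDim = 8, bytesPerElem = 4)
    val cappedBytes = kvCacheBytes(capped, layers = 2, kvHeads = 1, headDim = 8, bytesPerElem = 4)
    println("KV cache shrinks ${fullBytes / cappedBytes}x") // prints "KV cache shrinks 128x"
}
```

Because the cache grows linearly with the sequence length, capping 4096 positions at 32 gives a 128× reduction, in line with the "~100×" figure above.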

### Changed

- **`gradle/libs.versions.toml` `skainet` pin: 0.30.0 → 0.31.0.** Picks up the
engine's `ops.transpose` lazy-rewrap fix for **all** packed matmul dtypes
(Q8_0/Q4_0 added) — required so the packed Q8_0 lm_head below transposes
through `linearProject` instead of throwing `ClassCastException`. Downstream
consumers get the upstream SKaiNET BOM transparently via `:llm-bom`.
- **`gradle.properties` `VERSION_NAME=0.31.0`.** Lock-step with the engine.
- **`com.networknt:json-schema-validator` → 3.0.4.** (#175)

### Fixed

- **Tied Q8_0 lm_head stays packed in the eager `NATIVE_OPTIMIZED` Gemma path.**
FunctionGemma's `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked`
was dequantizing **both** `token_embd` and `output` to FP32 (2×~0.67 GB),
which OOMs the 1.9 GB SL2610. `output`/lm_head now packs as Q8_0 and runs on
the (NEON) Q8_0 kernel: `packGemmaKQuant` gained a Q8_0 case, and the
row-major→block-major relayout is generalized with a `blockSize` param.
`token_embd` stays FP32 (it is gathered, not matmul'd) but is wrapped no-copy
via `DenseFloatArrayTensorData` instead of `ctx.fromFloatArray` (which
allocated a second ~0.67 GB buffer). The tied embed/lm_head footprint drops
from ~1.34 GB to ~0.76 GB. Verified with byte-identical decode parity
(`GemmaQ5KPackedParityTest`) and a stable ~1.06 GB load on the SL2610. (#179)
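
As a rough guide to why packing helps, the per-block arithmetic can be sketched as follows. The helper names are hypothetical and not part of the SKaiNET API; the only facts used are that a Q8_0 block holds 32 elements in 34 bytes (a 2-byte scale plus 32 one-byte quants) and that FP32 uses 4 bytes per element.

```kotlin
const val Q8_0_BLOCK_ELEMS = 32 // elements per Q8_0 block
const val Q8_0_BLOCK_BYTES = 34 // 2-byte scale + 32 one-byte quants

// Bytes needed to hold `elements` values as FP32.
fun fp32Bytes(elements: Long): Long = elements * 4

// Bytes needed to hold `elements` values as packed Q8_0 blocks.
fun q8_0Bytes(elements: Long): Long {
    require(elements % Q8_0_BLOCK_ELEMS == 0L) { "element count must be a multiple of 32" }
    return elements / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
}

fun main() {
    val elements = 32L * 1024 // illustrative tensor size, not a real model shape
    println("FP32: ${fp32Bytes(elements)} B, Q8_0: ${q8_0Bytes(elements)} B")
    // prints "FP32: 131072 B, Q8_0: 34816 B"
}
```

Per element, packed Q8_0 costs 34/32 bytes against 4 bytes for FP32, so a tensor kept packed occupies a little over a quarter of its dequantized size.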

## [0.30.0] — 2026-06-14

Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
@@ -489,6 +530,7 @@ Version-aligned with **SKaiNET 0.21.0**.
Last published transformers release before the engine-aligned version line.
See `git log v0.16.0..0.18.0` for details.

[0.31.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.31.0
[0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
[0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
[0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
35 changes: 25 additions & 10 deletions README.md
@@ -103,21 +103,20 @@ Honest status — see the project-status note at the top of this README.

## Current release

The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
window without a tagged release. The headline is that **Q5_K weights now stay
packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
decodes byte-identically across the FP32 baseline and both packed paths
(`GemmaQ5KPackedParityTest`).
The current release is **0.31.0** — version-aligned with **SKaiNET 0.31.0**.
The headline is that the eager `NATIVE_OPTIMIZED` Gemma path now keeps the
**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
`maxInferenceLen` to cap the KV cache for constrained devices — together
dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).

The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.

```kotlin
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))

// Versions resolved from the BOM:
implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -194,6 +193,22 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n

See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.

## What's new in 0.31.0

- **Tied Q8_0 lm_head stays packed (eager `NATIVE_OPTIMIZED`).** FunctionGemma's
`token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked` was dequantizing
*both* `token_embd` and `output` to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
SL2610. `output`/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
`token_embd` stays FP32 (it's gathered) but is wrapped no-copy. Footprint
~1.34 GB → ~0.76 GB; byte-identical decode (`GemmaQ5KPackedParityTest`),
stable ~1.06 GB load on the SL2610.
- **`GemmaNetworkLoader.load(maxInferenceLen = …)`** — cap the context so the KV
cache + RoPE tables stay tiny on constrained devices (default
`min(contextLength, 4096)`).
- **Engine pin `skainet 0.30.0 → 0.31.0`** — picks up `ops.transpose`'s
lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the
packed lm_head transposes through `linearProject` instead of throwing
`ClassCastException`.
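
The reason `token_embd` does not need a packed kernel while the lm_head does can be illustrated with a plain-FP32 sketch of the two access patterns on one tied table. `gatherRow` and `lmHeadLogits` are hypothetical names for illustration, not SKaiNET functions, and the real lm_head runs on the packed Q8_0 kernel rather than on FP32 data.

```kotlin
// Embedding lookup: copy one row of the [vocab x hidden] table. No arithmetic.
fun gatherRow(table: FloatArray, hidden: Int, tokenId: Int): FloatArray =
    table.copyOfRange(tokenId * hidden, (tokenId + 1) * hidden)

// lm_head projection: dot product of the hidden state with every row of the
// same table, which is the matmul-shaped work a packed kernel accelerates.
fun lmHeadLogits(table: FloatArray, hidden: Int, state: FloatArray): FloatArray {
    val vocab = table.size / hidden
    return FloatArray(vocab) { v ->
        var acc = 0f
        for (i in 0 until hidden) acc += table[v * hidden + i] * state[i]
        acc
    }
}

fun main() {
    // Toy tied table: 3 tokens, hidden size 2, stored row-major.
    val table = floatArrayOf(1f, 0f, 0f, 1f, 1f, 1f)
    println(gatherRow(table, 2, 2).toList())                       // [1.0, 1.0]
    println(lmHeadLogits(table, 2, floatArrayOf(2f, 3f)).toList()) // [2.0, 3.0, 5.0]
}
```

The gather touches a single row per token, whereas the projection reads the whole table on every decode step, so the projection is where the packed layout and kernel matter.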

## What's new in 0.30.0

- **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to
4 changes: 2 additions & 2 deletions docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
[source,kotlin]
----
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))

implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
<dependency>
<groupId>sk.ainet.transformers</groupId>
<artifactId>skainet-transformers-bom</artifactId>
<version>0.30.0</version>
<version>0.31.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
[source,kotlin]
----
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
11 changes: 10 additions & 1 deletion gradle.properties
@@ -1,5 +1,5 @@
GROUP=sk.ainet.transformers
VERSION_NAME=0.30.0
VERSION_NAME=0.31.0

POM_DESCRIPTION=SKaiNET-transformers

@@ -33,6 +33,15 @@ kotlin.mpp.enableCInteropCommonization=true
#Android
android.useAndroidX=true
android.nonTransitiveRClass=true
# AGP's DependencyResolutionChecks fails the build when a configuration resolves
# at configuration time. KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm
# `*NpmAggregated` configs at config time (we have JS npm deps: ktor-client-js,
# kotlinx-browser), so `assemble`/`allTests` throw `Configuration 'jsNpmAggregated'
# was resolved during configuration time` (gradle#31483) — a false positive against
# KGP's known behaviour. Downgrade AGP's check from fail to warn. NOTE: AGP reads
# this option only from the project gradle.properties — NOT from -P or the CI's
# ~/.gradle/gradle.properties.
android.dependencyResolutionAtConfigurationTime.disallow=false

kotlin.mpp.stability.nowarn=true

2 changes: 1 addition & 1 deletion gradle/libs.versions.toml
@@ -1,5 +1,5 @@
[versions]
skainet = "0.30.0"
skainet = "0.31.0"
agp = "9.2.1"
jacksonDatabind = "2.22.0"
jsonSchemaValidator = "3.0.4"
3 changes: 2 additions & 1 deletion llm-inference/gemma/api/jvm/gemma.api
@@ -862,7 +862,8 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoader$WeightsProvider$Safe
}

public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt {
public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module;
public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;)Lsk/ainet/lang/nn/Module;
public static synthetic fun applyWeightsToNetworkNonReified$default (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;ILjava/lang/Object;)Lsk/ainet/lang/nn/Module;
}

public final class sk/ainet/models/gemma/GemmaPackedWeightsKt {
@@ -8,6 +8,7 @@ import sk.ainet.context.DirectCpuExecutionContext
import sk.ainet.io.gguf.GGMLQuantizationType
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
import sk.ainet.lang.tensor.data.Q8_0BlockTensorData
import sk.ainet.lang.types.FP32
import sk.ainet.lang.types.Int8

@@ -55,8 +56,17 @@ class GemmaQuantLayoutTest {
}

@Test
fun pack_non_kquant_returns_null() {
assertNull(packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
fun pack_q8_0_produces_block_tensor() {
// Q8_0 is now packed (32 elems / 34 B per block) so a tied Q8_0 lm_head
// stays packed and runs on the Q8_0 kernel instead of dequanting to FP32.
val td = packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32))
assertTrue(td is Q8_0BlockTensorData, "Q8_0 should pack to Q8_0BlockTensorData")
}

@Test
fun pack_unsupported_quant_returns_null() {
// A quant type with no packed kernel (e.g. Q4_1) falls back to FP32 dequant.
assertNull(packGemmaKQuant<FP32>(ByteArray(20), GGMLQuantizationType.Q4_1, Shape(1, 32)))
}

@Test