106 changes: 106 additions & 0 deletions CHANGELOG.md
Expand Up @@ -7,6 +7,110 @@ version line is kept in lock-step with the underlying SKaiNET engine
The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.30.0] — 2026-06-14

Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
tracked the engine internally across that window (the in-progress Q5_K kernel
shipped as a local `0.29.1`) without a tagged release. The headline is
**Q5_K stays packed in the eager Gemma runtime** and the **Gemma
`NATIVE_OPTIMIZED` packed-weight path is now Kotlin/Native–ready** — the board
binary can keep K-quant weights packed without the JVM's `java.lang.foreign`
MemSeg path.

### Added

- **Q5_K packed in-kernel dequant in the eager Gemma runtime.** FunctionGemma-270M
ships as `Q5_K_M`, but `GemmaMemSegConverter` previously dequantized Q5_K
weights to FP32 on load ("no native matmul kernel yet for Q5_K"), giving up
both the memory saving and the in-kernel dequant. SKaiNET 0.30.0 provides a
first-class Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`:
scalar / Panama / native), so the converter now relayouts the GGUF bytes to
block-major and wraps them as `Q5_KBlockTensorData` (176 B/block). Dispatch and
the lazy transpose reach the kernel through `DefaultCpuOps`. Verified by
`GemmaQ5KPackedParityTest` (`-PincludeIntegration`): the Q5_K packed path
decodes FunctionGemma byte-identically to the FP32 baseline —
`[262146, 236769, 3255, 718, 498, 1373, 262152, 106]` →
`<tool_0>(state="on")<end>` for *"Turn the light on."*
- **Kotlin/Native–ready Gemma packed-weight path.** The `NATIVE_OPTIMIZED`
packed conversion was `jvmMain`-only (it built `MemSeg`/`Arena`-backed tensors
via `java.lang.foreign`), so the Kotlin/Native board binary couldn't keep
K-quant weights packed. The platform-neutral pieces now live in `commonMain`:
- **`GemmaQuantLayout.kt`** (`commonMain`) — `logicalShapeFor`,
`relayoutKSeriesRowMajorToBlockMajor` (KMP-safe `copyInto`), and
`packGemmaKQuant<T>()`, which builds heap-packed Q4_K/Q5_K/Q6_K
`BlockTensorData` directly with no `MemSeg`/`Arena`.
- **`GemmaPackedWeights.kt`** (`commonMain`) — `convertGemmaWeightsPacked`
packs Q4/Q5/Q6_K matmul weights to heap `Q*_KBlockTensorData`, dequants
`token_embd`/`output` to FP32 (gathered, no transpose) and any other quant
type to FP32 `[out, in]`. `extractRawBytes` reads the loader's bytes back
across both backings (JVM `IntArrayTensorData` / native `Byte`-typed).
- **`GemmaNetworkLoader.load()`** now runs `convertGemmaWeightsPacked` before
`applyWeightsToNetwork` under `NATIVE_OPTIMIZED`, so `load(NATIVE_OPTIMIZED)`
yields a runnable network on the board *and* the JVM (previously it could not
be built from raw-byte weights at all). `GemmaMemSegConverter` (`jvmMain`)
now shares the `commonMain` helpers; only the `MemSeg`/FFM conversion and the
FP32 fallbacks stay JVM-only.
Verified on JVM and `linuxX64` (`GemmaQuantLayoutTest`): relayout, packing, and
the native byte-extraction round-trip run on every target, and
`GemmaQ5KPackedParityTest` confirms all three paths (FP32 baseline, `jvmMain`
MemSeg-packed, `load()` packed) produce the identical token sequence.
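
The relayout step described above can be pictured as a transpose of the block grid. This is a minimal sketch, assuming "block-major" means that blocks sharing the same block-column index are stored contiguously across rows. The name `relayoutBlocks` and its parameters are hypothetical and are not the signature of `relayoutKSeriesRowMajorToBlockMajor`.

```kotlin
// Hypothetical helper for illustration; not the library's API.
// Assumes the source holds rows * blocksPerRow fixed-size blocks in row-major order.
const val Q5_K_BLOCK_BYTES = 176 // bytes per Q5_K block, as stated in the changelog

fun relayoutBlocks(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int = Q5_K_BLOCK_BYTES,
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) {
        "expected ${rows * blocksPerRow * blockBytes} bytes, got ${src.size}"
    }
    val dst = ByteArray(src.size)
    for (r in 0 until rows) {
        for (b in 0 until blocksPerRow) {
            val from = (r * blocksPerRow + b) * blockBytes // row-major block index
            val to = (b * rows + r) * blockBytes           // transposed block index
            // copyInto is part of the common Kotlin stdlib, so this runs on every target.
            src.copyInto(dst, destinationOffset = to, startIndex = from, endIndex = from + blockBytes)
        }
    }
    return dst
}
```

Each block is copied whole, so the bytes inside a block keep their order and only the block positions change.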

### Changed

- **`gradle/libs.versions.toml` `skainet` pin: 0.28.1 → 0.30.0.** Picks up the
released Q5_K packed matmul, the NEON native kernels, and the Kotlin/Native
cinterop. Downstream consumers get the upstream SKaiNET BOM transparently via
`:llm-bom`, so no per-consumer migration is needed.
- **`gradle.properties` `VERSION_NAME=0.30.0`.** Lock-step with the engine.
- **`settings.gradle.kts` reverts the `mavenLocal()`-first dev shim.** The
ordering added while consuming the in-progress local SKaiNET `0.29.1` is no
longer needed now that 0.30.0 is on Maven Central; the release resolves the
engine purely from Central. The opt-in `-PuseLocalSkainet` composite build is
unchanged for local engine work.
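
For orientation, a Central-only resolution with an opt-in composite build might look roughly like the fragment below. It is a sketch only: the `../SKaiNET` path is a placeholder, and the way the repository actually reads `-PuseLocalSkainet` is an assumption.

```kotlin
// settings.gradle.kts (illustrative sketch; "../SKaiNET" is a placeholder path)
dependencyResolutionManagement {
    repositories {
        mavenCentral() // released engine artifacts resolve from Maven Central
    }
}

// Opt-in: `./gradlew build -PuseLocalSkainet` swaps in a local engine checkout.
if (providers.gradleProperty("useLocalSkainet").isPresent) {
    includeBuild("../SKaiNET")
}
```

With no `mavenLocal()` entry ahead of Central, a stale local `0.29.1` cannot shadow the released `0.30.0`.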

### Fixed

- **``fix(gemma): dequant kernel-less quant types in `NATIVE_OPTIMIZED` instead of
leaving raw bytes``.** Loading a Gemma GGUF whose attention/FFN weights used a
quant type with no packed SIMD kernel (e.g. Q5_1) under
`QuantPolicy.NATIVE_OPTIMIZED` crashed at the first decode step
(`Transpose requires at least 2 dimensions` in `MultiHeadAttention` →
`linearProject`): `GemmaMemSegConverter.convertOne` left every unhandled quant
type as raw 1-D bytes. Kernel-less types now dequantize to a correct FP32
`[out, in]` weight via a new `dequantPackedToFp32` helper (mirroring the proven
`Gemma4WeightLoader.createTensor` column-major → row-major transpose). The
supported packed types (Q4_0/Q8_0/Q4_K/Q6_K) keep their fast SIMD form; only
kernel-less types pay the FP32 dequant.
- **`` fix(llama): dequantize Q4_1 (and all non-packed quant types) in
`DecoderGgufMemSegConverter` ``.** The converter handled only Q4_0/Q8_0 (packed)
and Q4_K/Q5_K/Q6_K (dequant); every other quant type fell through an `else`
branch that logged a warning and passed the raw quant bytes through unchanged,
crashing deep inside matmul (e.g. `unsupported quant type Q4_1 for
blk.0.ffn_down.weight` on Q4_1 Qwen3 models). The `else` branch now routes
through `DequantOps.dequantFromBytes` to FP32, covering Q4_1, Q5_0, Q5_1, Q8_1,
IQ4_NL/XS, TQ1/2_0, etc.; genuinely unknown types now fail explicitly at load
time instead of crashing later inside matmul. Closes
[#654](https://github.com/SKaiNET-developers/SKaiNET-transformers/issues/654).
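
Both fixes follow the same shape: keep types that have a packed kernel, dequantize everything else to a 2-D FP32 weight, and fail at load time when a type cannot be decoded. The sketch below illustrates that dispatch with hypothetical types (`Quant`, `PackedWeight`, `Fp32Weight`, `convertWeight`). None of these mirror the SKaiNET API, and the dequantizer is passed in as a plain function rather than calling `DequantOps.dequantFromBytes`.

```kotlin
// Hypothetical types for illustration; they do not mirror the SKaiNET API.
enum class Quant { Q4_0, Q8_0, Q4_K, Q6_K, Q4_1, Q5_1, UNKNOWN }

sealed interface Weight
class PackedWeight(val quant: Quant, val bytes: ByteArray) : Weight
class Fp32Weight(val shape: List<Int>, val values: FloatArray) : Weight

// Packed set taken from the Gemma fix entry above.
val packedKernels = setOf(Quant.Q4_0, Quant.Q8_0, Quant.Q4_K, Quant.Q6_K)

fun convertWeight(
    name: String,
    quant: Quant,
    raw: ByteArray,
    outDim: Int,
    inDim: Int,
    dequant: (Quant, ByteArray, Int) -> FloatArray?, // returns null for types it cannot decode
): Weight = when {
    quant in packedKernels -> PackedWeight(quant, raw)
    else -> {
        // Fail here, at load time, rather than later inside matmul.
        val values = dequant(quant, raw, outDim * inDim)
            ?: error("unsupported quant type $quant for $name")
        check(values.size == outDim * inDim) { "bad element count for $name" }
        Fp32Weight(listOf(outDim, inDim), values) // always rank 2, so transpose is safe
    }
}
```

The point of the design is that no branch returns raw 1-D bytes, so a later transpose never sees a rank-1 tensor.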

### Tests / CI

- **`GemmaQ5KPackedParityTest`** — byte-identical decode parity across the FP32
baseline, the `jvmMain` MemSeg-packed path, and the `load(NATIVE_OPTIMIZED)`
`commonMain` packed path.
- **`GemmaQuantLayoutTest`** (`commonTest`) — block-transpose relayout, packing,
and the byte-extraction round-trip; runs on JVM and `linuxX64`.
- **`DecoderGgufMemSegConverterTest`** — regression that a Q4_1 weight is
dequantized to its logical 2-D FP32 shape rather than passed through as 1-D
bytes.
- **`` fix(gemma): macosArm64 target for `gemma-iree` ``** and CI parity fixes:
MLIR-dump tests write to a portable build dir instead of a hardcoded local
path; browser Mocha gets a 60 s timeout (parity with the engine repo).
- **`test(gemma): repoint stale FunctionGemma GGUF path`** — six real-model
integration tests now point at the in-repo
`sl2610-function-calling/models/` location, matching
`GemmaQ5KPackedParityTest`; all pass against the published SKaiNET 0.30.0
(`-PincludeIntegration`).
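
The parity claim reduces to comparing token-ID sequences element by element. A minimal sketch of such a check, using the token IDs quoted earlier in this entry; `divergingPaths` is a hypothetical helper and is not taken from `GemmaQ5KPackedParityTest`:

```kotlin
// Token IDs quoted in this changelog entry for "Turn the light on."
val expectedTokens = intArrayOf(262146, 236769, 3255, 718, 498, 1373, 262152, 106)

// Hypothetical helper: names of the paths whose token IDs differ from the baseline.
fun divergingPaths(baseline: IntArray, candidates: Map<String, IntArray>): List<String> =
    candidates.filterNot { (_, tokens) -> tokens.contentEquals(baseline) }.keys.toList()
```

An empty result means every candidate path decoded the same sequence as the baseline.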

## [0.28.1] — 2026-06-06

Version-aligned with **SKaiNET 0.28.1**. Skips 0.26.x / 0.27.x —
Expand Down Expand Up @@ -385,6 +489,8 @@ Version-aligned with **SKaiNET 0.21.0**.
Last published transformers release before the engine-aligned version line.
See `git log v0.16.0..0.18.0` for details.

[0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
[0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
[0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
[0.21.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.21.1
[0.21.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.21.0
42 changes: 31 additions & 11 deletions README.md
Expand Up @@ -103,22 +103,21 @@ Honest status — see the project-status note at the top of this README.

## Current release

The current release is **0.28.1** — version-aligned with **SKaiNET 0.28.1**.
Skips 0.26.x / 0.27.x: SKaiNET-transformers tracked the engine internally across
that window without a tagged release. The headline is that the engine's
**Kotlin DSL → StableHLO → IREE export path is now complete** — a full gemma3
graph traces and lowers to StableHLO that `iree-compile`s to a `vmfb`
(`GemmaMlirDumpTest` / `GemmaTraceTest` are green against 0.28.1). SKaiNET
0.28.0/0.28.1 fixed the remaining export bugs: result-type inference for
`reshape`/`matmul`/`concatenate` ([#673](https://github.com/SKaiNET-developers/SKaiNET/issues/673))
and `conv1d`/`gather`/pooling/`flatten` shapes plus the `reduce_window` emission
form ([#675](https://github.com/SKaiNET-developers/SKaiNET/issues/675)).
The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
window without a tagged release. The headline is that **Q5_K weights now stay
packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
decodes byte-identically across the FP32 baseline and both packed paths
(`GemmaQ5KPackedParityTest`).

The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.

```kotlin
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))

// Versions resolved from the BOM:
implementation("sk.ainet.transformers:skainet-transformers-core")
Expand Down Expand Up @@ -195,6 +194,27 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n

See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.

## What's new in 0.30.0

- **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to
dequantize Q5_K weights to FP32 on load; SKaiNET 0.30.0 provides a first-class
Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`), so the converter
now relayouts the GGUF bytes to block-major and keeps them packed (176 B/block).
FunctionGemma-270M (`Q5_K_M`) decodes byte-identically to the FP32 baseline
(`GemmaQ5KPackedParityTest`).
- **Gemma `NATIVE_OPTIMIZED` path is Kotlin/Native–ready.** The reusable layout +
packing helpers (`GemmaQuantLayout.kt`, `GemmaPackedWeights.kt`) moved to
`commonMain`, and `GemmaNetworkLoader.load()` now runs `convertGemmaWeightsPacked`
under `NATIVE_OPTIMIZED` — so the board binary keeps K-quant weights packed with
no `java.lang.foreign` MemSeg dependency. Verified on JVM and `linuxX64`.
- **Engine pin `skainet 0.28.1 → 0.30.0`** — released Q5_K packed matmul, NEON
native kernels, and Kotlin/Native cinterop. The `mavenLocal()`-first dev shim is
reverted; the release resolves the engine from Maven Central.
- **Fixes.** Kernel-less quant types under `NATIVE_OPTIMIZED` now dequant to FP32
`[out, in]` instead of crashing on a rank-1 transpose; `DecoderGgufMemSegConverter`
dequantizes Q4_1 and every other non-packed quant type instead of passing raw
bytes through to a matmul crash ([#654](https://github.com/SKaiNET-developers/SKaiNET-transformers/issues/654)).
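
To make the memory saving concrete, the arithmetic behind the 176 B/block figure can be sketched as follows. It assumes the standard GGML Q5_K super-block of 256 weights (two FP16 scale fields, 12 bytes of sub-block scales, 32 bytes of high bits, 128 bytes of low nibbles). The function names are hypothetical.

```kotlin
const val QK_K = 256                          // weights per K-quant super-block (assumed GGML layout)
const val Q5_K_BYTES = 2 + 2 + 12 + 32 + 128  // d, dmin, scales, high bits, low nibbles = 176

// Bytes needed to hold `weights` values packed as Q5_K.
fun q5kPackedBytes(weights: Long): Long {
    require(weights % QK_K == 0L) { "weight count must be a multiple of $QK_K" }
    return weights / QK_K * Q5_K_BYTES
}

// Bytes needed for the same values after dequantizing to FP32.
fun fp32Bytes(weights: Long): Long = weights * 4
```

That works out to 5.5 bits per weight packed against 32 bits in FP32, which is the saving the previous dequantize-on-load path gave up.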

## What's new in 0.28.1

- **Engine pin `skainet 0.27.0 → 0.28.1`.** Picks up the completed Kotlin DSL →
4 changes: 2 additions & 2 deletions docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
Expand Up @@ -25,7 +25,7 @@ In your `build.gradle.kts`:
[source,kotlin]
----
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))

implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
Expand All @@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
<dependency>
<groupId>sk.ainet.transformers</groupId>
<artifactId>skainet-transformers-bom</artifactId>
<version>0.28.1</version>
<version>0.30.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
Expand Up @@ -52,7 +52,7 @@ The pieces you need live in three modules:
[source,kotlin]
----
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))

implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
2 changes: 1 addition & 1 deletion gradle.properties
@@ -1,5 +1,5 @@
GROUP=sk.ainet.transformers
VERSION_NAME=0.28.1
VERSION_NAME=0.30.0

POM_DESCRIPTION=SKaiNET-transformers

2 changes: 1 addition & 1 deletion gradle/libs.versions.toml
@@ -1,5 +1,5 @@
[versions]
skainet = "0.28.1"
skainet = "0.30.0"
agp = "9.2.1"
jacksonDatabind = "2.22.0"
jsonSchemaValidator = "3.0.3"
6 changes: 4 additions & 2 deletions llm-agent/api/jvm/llm-agent.api
@@ -1,6 +1,6 @@
public final class sk/ainet/apps/kllama/agent/GenerateExtensionsKt {
public static final fun generateUntilStop (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;)Lsk/ainet/apps/kllama/agent/GenerateResult;
public static synthetic fun generateUntilStop$default (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;ILjava/lang/Object;)Lsk/ainet/apps/kllama/agent/GenerateResult;
public static final fun generateUntilStop (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function2;)Lsk/ainet/apps/kllama/agent/GenerateResult;
public static synthetic fun generateUntilStop$default (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function2;ILjava/lang/Object;)Lsk/ainet/apps/kllama/agent/GenerateResult;
public static final fun sampleFromLogits (Lsk/ainet/lang/tensor/Tensor;FLkotlin/random/Random;)I
public static synthetic fun sampleFromLogits$default (Lsk/ainet/lang/tensor/Tensor;FLkotlin/random/Random;ILjava/lang/Object;)I
}
Expand Down Expand Up @@ -45,6 +45,7 @@ public final class sk/ainet/apps/kllama/chat/AgentConfig {
public abstract interface class sk/ainet/apps/kllama/chat/AgentListener {
public fun onAssistantMessage (Ljava/lang/String;)V
public fun onComplete (Ljava/lang/String;)V
public fun onPrefillProgress (II)V
public fun onThinking (Ljava/lang/String;)V
public fun onToken (Ljava/lang/String;)V
public fun onToolCallValidationFailed (Lsk/ainet/apps/kllama/chat/ToolCall;Ljava/lang/String;)V
Expand All @@ -55,6 +56,7 @@ public abstract interface class sk/ainet/apps/kllama/chat/AgentListener {
public final class sk/ainet/apps/kllama/chat/AgentListener$DefaultImpls {
public static fun onAssistantMessage (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
public static fun onComplete (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
public static fun onPrefillProgress (Lsk/ainet/apps/kllama/chat/AgentListener;II)V
public static fun onThinking (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
public static fun onToken (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
public static fun onToolCallValidationFailed (Lsk/ainet/apps/kllama/chat/AgentListener;Lsk/ainet/apps/kllama/chat/ToolCall;Ljava/lang/String;)V