Merged
19 changes: 19 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,25 @@

## [Unreleased]

## [0.27.0] - 2026-06-04

### Added

- **Full gemma3 network lowers to StableHLO with zero gaps.** A batch of new core converters closes every remaining op gap on the Kotlin DSL → MLIR StableHLO path, so a complete gemma3 graph traces and lowers end-to-end (verified by `GemmaTraceTest` over the composite build: 140 nodes → 255 lines, 0 unsupported, 0 arity errors):
- **`scaledDotProductAttention` converter** (`AttentionOperationsConverter`) — lowers the atomic SDPA op to the standard StableHLO subgraph: `scores = Q·Kᵀ` (`dot_general`, contract `head_dim`), `* scale` (arg or `1/sqrt(head_dim)`), numerically stable softmax over key length, `out = attn·V`. Batched `[..,S,D]` with all leading dims as batching dims. SDPA is a core `TensorOps` op, so its converter lives in core.
- **Causal mask for SDPA** — when the node's `causal` attr is set, emits an additive `-inf` mask before softmax (`iota`/`compare GE`/`select`) so each query attends only to keys at or before it. Validated EXACT against a NumPy causal reference and accepted by `iree-compile`.
- **Explicit SDPA mask operand** — `AttentionOperationsConverter` now consumes `operands[3]` (the additive mask), broadcasting it trailing-aligned to the scores shape before softmax. Fixes gemma sliding-window layers (`causal=false` + explicit causal+window mask) that previously exported unmasked and attended to future tokens. (PR #661)
- **`permute` + `narrow` converters** — `permute` registered as an arbitrary-axis alias routed to the existing transpose path; `narrow` slicing support.
- **Multi-output converter support + `split` converter** — per-`(nodeId, outputPort)` SSA naming in `ConversionContext`, operand resolution that walks incoming edges by `destinationInputIndex` and resolves each by the edge's `sourceOutputIndex`, and `split`/`chunk` lowering to N `stablehlo.slice` ops each registered on its own output port (lowers the RoPE `split` gap).
- **Boxing-free `FloatArray` weight externalization for `.irpa` baking.** `finalize()` now stores resolved weights as the primitive `FloatArray` instead of `.toList()` (boxing a real LLM weight — e.g. a 262153×640 embedding → ~2.7 GB `List<Float>` — OOMed the trace). `ConstantOperationsConverter` externalizes `FloatArray` directly (`floatArrayToLittleEndianBytes` + `tryMaterializeExternalFloats`, inlining small/`InlineAlways` tensors via `asList()`), and `IrpaWriter` writes byte ranges in one shot. With this, the real Gemma-270M function bakes: 1 func arg (tokens) + 360 weights externalized to `util.global #flow.parameter.named`.
- **DSL prescribes element dtype for placeholder weights.** The DSL can now specify the element dtype for placeholder weights during tracing.
- **Numerical validation harness for SDPA lowering.** Dumps a small `scaledDotProductAttention` StableHLO graph, then compiles and runs it with `iree-compile` + `iree-run-module`; the output matches a NumPy reference to 5 decimal places, confirming the attention converter is numerically correct, not just structurally valid.
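
The SDPA entries above describe the lowered computation as `Q·Kᵀ`, scale, an optional causal mask, a numerically stable softmax over the key axis, and `attn·V`. A minimal single-head reference for that sequence might look like the sketch below. It is pure Python for illustration only, not the project's converter or its NumPy harness; the name `sdpa_reference` and the nested-list layout are assumptions.

```python
import math


def sdpa_reference(q, k, v, scale=None, causal=False):
    """Single-head scaled dot-product attention on nested lists.

    q: [S_q][D], k: [S_k][D], v: [S_k][D_v]. Hypothetical helper, for illustration.
    """
    head_dim = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(head_dim)  # default scale: 1/sqrt(head_dim)
    out = []
    for i, q_row in enumerate(q):
        # scores = Q·Kᵀ, contracting over head_dim, then scaled.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if causal:
            # Keep key j only when query index i >= j; later keys get -inf.
            scores = [s if i >= j else float("-inf") for j, s in enumerate(scores)]
        # Numerically stable softmax: subtract the row maximum before exp.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # out = attn·V
        out.append([sum(w * v_row[c] for w, v_row in zip(attn, v))
                    for c in range(len(v[0]))])
    return out
```

With `causal=True`, the first query can only see the first key, so its output row equals the first row of `v`.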

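The weight-externalization entry above depends on keeping floats in a primitive buffer and writing them as little-endian bytes in one pass. As a rough Python analogue of that idea (not the Kotlin `floatArrayToLittleEndianBytes` itself, whose implementation is not shown here), the sketch below packs 32-bit floats from a contiguous `array` rather than a list of boxed objects. The function name is hypothetical.

```python
import sys
from array import array


def floats_to_little_endian_bytes(values):
    """Pack floats as contiguous 4-byte IEEE-754 values in little-endian order."""
    buf = array("f", values)  # primitive 32-bit float buffer, no per-element objects
    if sys.byteorder == "big":
        buf.byteswap()  # normalize to little-endian on big-endian hosts
    return buf.tobytes()
```

Each element costs exactly 4 bytes in the output, so the byte length is `4 * len(values)`.
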
### Fixed

- **IREE-valid StableHLO syntax — full gemma3 compiles to `vmfb`.** Aligned converter emission to what `iree-compile`'s StableHLO parser accepts (verified by compiling the full gemma3 graph end-to-end): `gather` uses the generic MLIR form, `slice`/`narrow`/`split` use the canonical bracket form via a shared `sliceLine()` helper, `concatenate` emits the full functional type, and batch matmul derives batch dims as `min(lhsRank,rhsRank)-2` (fixes 3D-activation @ 2D-weight Linear projections). Result: SKaiNET gemma3 DSL → StableHLO → `iree-compile` (llvm-cpu; +neon aarch64) → `vmfb` for both host x64 and aarch64 targets.
- **`VoidTensorOps.gather` output shape for multi-dim indices.** The void/tracing gather collapsed the gathered axis to `indices.shape[0]`, so a `[vocab,emb]` table with `[batch,seq]` indices traced to `[batch,emb]` instead of `[batch,seq,emb]` (breaking the embedding's downstream reshape during weight-free tracing). Now replaces the axis with the full indices shape, matching `DefaultCpuOps.gather`, unblocking tracing of full transformer (gemma3) graphs.
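
The two shape rules in the Fixed entries above can be written out as small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the code only restates the rules from the text (replace the gathered axis with the full indices shape; batch dims are `min(lhsRank, rhsRank) - 2`).

```python
def gather_output_shape(table_shape, indices_shape, axis=0):
    """Replace the gathered axis of the table with the full indices shape."""
    return list(table_shape[:axis]) + list(indices_shape) + list(table_shape[axis + 1:])


def batch_dim_count(lhs_rank, rhs_rank):
    """Number of leading batch dims for a batched matmul."""
    return min(lhs_rank, rhs_rank) - 2


# A [vocab, emb] table gathered with [batch, seq] indices gives [batch, seq, emb].
print(gather_output_shape([1000, 64], [2, 5]))
# A 3D activation times a 2D weight has no batch dims.
print(batch_dim_count(3, 2))
```

The pre-fix behavior described above corresponds to using only `indices_shape[0]`, which would give `[2, 64]` for the same inputs.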

## [0.26.0] - 2026-05-30

### Added
13 changes: 7 additions & 6 deletions README.md
@@ -35,7 +35,7 @@ Add the core dependencies (Gradle Kotlin DSL):
```kotlin
dependencies {
// Recommended: import the umbrella BOM and drop versions on the engine modules.
-implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+implementation(platform("sk.ainet:skainet-bom:0.27.0"))

implementation("sk.ainet.core:skainet-lang-core")
implementation("sk.ainet.core:skainet-backend-cpu")
@@ -193,15 +193,16 @@ deployment, the StableHLO path for native and edge targets.

---

-## What's New in 0.26.0
+## What's New in 0.27.0

-- **Q4_0 is now a first-class quantized format.** The older GGML 4-bit format joins Q8_0 / Q4_K across the full provider stack: a heap `Q4_0TensorData` any loader can produce, a `Q4_0MatmulKernel` SPI with scalar / Panama-Vector / native-FFM implementations auto-selected by `KernelRegistry`, and a `Q4_0Quantizer` to pack dense FP32 weights into canonical ggml Q4_0 without going through GGUF. (PRs #648–#651)
-- **`tanh` is now a first-class activation primitive.** Promoted from a `NotImplementedError` stub to a fully wired `@Diff @ActivationDsl` op — `TensorOps` interface, `Tensor.tanh()` extension, CPU backend, recording decorator, and autograd backward (`1 - output^2`) — so downstream consumers no longer re-derive the `2*sigmoid(2x)-1` polyfill. Pinned end-to-end by a micrograd tanh-MLP training test on the moons dataset. (Issue #630, PR #631)
-- **CPU tensor `convert` op.** Dtype conversion now has a real CPU backend implementation. (PR #636)
-- Plus test, build, and CI hygiene: portable KMP `@Ignore` for common tests, restored BatchNorm coverage, Gradle build-warning cleanup, and narrower feature-PR CI triggers. (PRs #633, #634, #638, #640, #645)
+- **A full gemma3 network now lowers to StableHLO and compiles to an IREE `vmfb`.** A batch of new core converters closes every remaining op gap on the Kotlin DSL → MLIR StableHLO path, so a complete gemma3 graph traces and lowers end-to-end with zero gaps (verified by `GemmaTraceTest`: 140 nodes → 255 lines, 0 unsupported), then compiles through `iree-compile` (llvm-cpu; +neon aarch64) to a `vmfb` for both host x64 and aarch64.
+- **`scaledDotProductAttention` converter.** Lowers the atomic SDPA op to the standard StableHLO subgraph (`Q·Kᵀ` → scale → stable softmax → `attn·V`), with causal-mask emission and explicit additive-mask operand support (fixes gemma sliding-window layers). Numerically validated EXACT against a NumPy reference via `iree-run-module`.
+- **`permute`, `narrow`, and multi-output `split` converters.** Per-`(nodeId, outputPort)` SSA naming and edge-accurate operand resolution let a consumer of a multi-output op (e.g. RoPE's `split`) get the right output port.
+- **Boxing-free `FloatArray` weight externalization for `.irpa` baking.** Resolved weights stay primitive `FloatArray` (no `List<Float>` boxing that OOMed multi-GB embeddings); the real Gemma-270M function bakes its 360 weights to `util.global #flow.parameter.named`.

### Recent releases

+- **0.26.0** — Q4_0 promoted to a first-class quantized format across the provider stack, `tanh` as a first-class activation primitive, and a CPU tensor `convert` op, plus test/build/CI hygiene. (PRs #648–#651, #631, #636)
- **0.25.0** — BF16 and Q8_0 matmul kernels end-to-end across the provider stack, autograd completeness for `pow`/`log` and the conv/pool/upsample/split family, the hybrid adaptive dtype-constraint DSL, the `@DarcValidated` operator-doc flag, and the SentencePiece special-token splitter. (PRs #595, #605–#628)
- **0.23.0** — Real-model GGUFs no longer OOM at network construction (lazy `TensorDataFactory.placeholder(...)`); Kotlin/Native can finally load GGUFs over 2 GiB via the new POSIX-`pread`-backed `PosixPreadRandomAccessSource`. (Issues #587, #589; PRs #588, #591)
- **0.22.2** — `sk.ainet:skainet-bom` now resolves from Maven Central (earlier versions shipped at the wrong coordinates). (Issue #584)
4 changes: 2 additions & 2 deletions docs/modules/ROOT/pages/how-to/io-readers.adoc
@@ -20,7 +20,7 @@ Add the following dependencies to your `build.gradle.kts`:
[source,kotlin]
----
dependencies {
-implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+implementation(platform("sk.ainet:skainet-bom:0.27.0"))

implementation("sk.ainet.core:skainet-io-gguf")
implementation("org.jetbrains.kotlinx:kotlinx-io-core:0.8.2")
@@ -32,7 +32,7 @@ dependencies {
[source,kotlin]
----
dependencies {
-implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+implementation(platform("sk.ainet:skainet-bom:0.27.0"))

implementation("sk.ainet.core:skainet-io-onnx")
implementation("org.jetbrains.kotlinx:kotlinx-io-core:0.8.2")
2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/how-to/java-model-training.adoc
@@ -23,7 +23,7 @@ This guide covers building neural networks, defining loss functions and optimize
<dependency>
<groupId>sk.ainet</groupId>
<artifactId>skainet-bom</artifactId>
-<version>0.26.0</version>
+<version>0.27.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
@@ -1,6 +1,6 @@
= AI-NET Operators Reference

-Generated from version `0.26.0` on 2026-05-30
+Generated from version `0.27.0` on 2026-06-04

== Operators by Modality

2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/reference/ops-status-matrix.adoc
@@ -1,7 +1,7 @@
= Operator Coverage Matrix
:description: Cross-backend status for every operator function in SKaiNET.

-Generated from `operators.json` version `0.26.0` on 2026-05-30.
+Generated from `operators.json` version `0.27.0` on 2026-06-04.

Rows are `Operator.function` pairs. The `Validated` column shows whether the function's documentation has been DARC-validated by a reviewer (see xref:contributing/darc-workflow.adoc[DARC workflow]). Remaining columns are backends that appear in any function's `statusByBackend` map — a missing entry means the backend makes no claim about the function (treat it as "unknown", not "not supported").

4 changes: 2 additions & 2 deletions docs/modules/ROOT/pages/tutorials/java-getting-started.adoc
@@ -46,7 +46,7 @@ The `skainet-bom` manages all SKaiNET module versions so you never have to keep
----
<project>
<properties>
-<skainet.version>0.26.0</skainet.version>
+<skainet.version>0.27.0</skainet.version>
</properties>

<dependencyManagement>
@@ -144,7 +144,7 @@ repositories {

dependencies {
// Import BOM for version alignment
-implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+implementation(platform("sk.ainet:skainet-bom:0.27.0"))

// Core tensor library
implementation("sk.ainet:skainet-lang-core-jvm")
2 changes: 1 addition & 1 deletion gradle.properties
@@ -1,5 +1,5 @@
GROUP=sk.ainet.core
-VERSION_NAME=0.26.0
+VERSION_NAME=0.27.0
POM_DESCRIPTION=SKaiNET

POM_URL=https://github.com/SKaiNET-developers/skainet/