For architecture details see ARCHITECTURE.md.
SKaiNET is a Kotlin Multiplatform AI framework. New here? Choose the path that matches what you want to try first.
| Goal | Start here | Time |
|---|---|---|
| Run tensor operations | Quickstart (below) | 2–5 min |
| Build and train a neural net | Hello Neural Net (below) | 5 min |
| Run a local GGUF model | SKaiNET Transformers starter | 5 min after model setup |
| Export a secure MCU bundle | Minerva getting started | 10 min without firmware flashing |
Working in Java? SKaiNET ships first-class Java support — see the Java getting-started guide.
Use the version shown in this README as the source of truth for first-run snippets. If another page shows a different version, please open an issue or PR.
Add the core dependencies (Gradle Kotlin DSL):
```kotlin
dependencies {
    // Recommended: import the umbrella BOM and drop versions on the engine modules.
    implementation(platform("sk.ainet:skainet-bom:0.31.0"))
    implementation("sk.ainet.core:skainet-lang-core")
    implementation("sk.ainet.core:skainet-backend-cpu")
}
```

The BOM was first correctly published to Maven Central in 0.22.2; earlier versions shipped at the wrong coordinates and could not be imported. Pin versions directly if you need an older release.
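
For a release older than 0.22.2, a minimal sketch of pinning versions directly might look like the following. The module coordinates are the ones listed above; the version `0.22.1` is only an illustrative choice, and it is an assumption that both modules were published at that version.

```kotlin
dependencies {
    // No BOM here: each module carries its own explicit version.
    // "0.22.1" is an example of a pre-0.22.2 release; substitute the one you need.
    implementation("sk.ainet.core:skainet-lang-core:0.22.1")
    implementation("sk.ainet.core:skainet-backend-cpu:0.22.1")
}
```

Keeping both modules on the same version avoids mixing engine releases, which the BOM would otherwise guarantee for you.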
Hello Neural Net:

```kotlin
val model = nn {
    input(28 * 28)
    dense(out = 128)
    relu()
    dense(out = 10)
}
```

Quickstart (tensor operations):

```kotlin
val a = tensor(shape(2, 2)) { float(1f, 2f, 3f, 4f) }
val b = tensor(shape(2, 2)) { float(5f, 6f, 7f, 8f) }
val c = a matMul b
val d = c.relu()
```

Read a GGUF file:

```kotlin
// Recommended: streaming reader (memory-efficient, supports quantized types)
val source = JvmRandomAccessSource.open("model.gguf")
StreamingGGUFReader.open(source).use { reader ->
    println("Tensors: ${reader.tensorCount}")
    // Load specific tensor on demand (no whole-file loading)
    val bytes = reader.loadTensor("token_embd.weight")
    // Or get a TensorStorage descriptor with encoding/placement metadata
    val storage = reader.loadTensorStorage("token_embd.weight")
}
```

More examples: SKaiNET-examples | SKaiNET-notebook
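
To check the tensor quickstart by hand, the sketch below reproduces the same arithmetic (a 2×2 matrix product followed by ReLU) in plain Kotlin. It uses only the standard library and deliberately does not call SKaiNET; `matMul` and `relu` here are hypothetical helpers over nested lists, not the library's operators.

```kotlin
// Plain-Kotlin sketch (standard library only, not SKaiNET's API).
// Matrices are lists of rows; both helpers are hypothetical.

fun matMul(a: List<List<Float>>, b: List<List<Float>>): List<List<Float>> {
    require(a.isNotEmpty() && b.isNotEmpty()) { "matrices must be non-empty" }
    require(a[0].size == b.size) { "inner dimensions must match" }
    return a.map { row ->
        b[0].indices.map { j ->
            // Dot product of one row of a with column j of b.
            row.indices.fold(0f) { acc, k -> acc + row[k] * b[k][j] }
        }
    }
}

// Element-wise max(x, 0).
fun relu(m: List<List<Float>>): List<List<Float>> =
    m.map { row -> row.map { maxOf(it, 0f) } }

fun main() {
    val a = listOf(listOf(1f, 2f), listOf(3f, 4f))
    val b = listOf(listOf(5f, 6f), listOf(7f, 8f))
    val c = matMul(a, b)
    val d = relu(c)
    println(c) // [[19.0, 22.0], [43.0, 50.0]]
    println(d) // same as c, because every entry is already positive
}
```

With these inputs ReLU is a no-op; flipping the sign of any entry in `a` is a quick way to see it clamp negative results to zero.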
SKaiNET is a modular ecosystem. While this repository contains the core engine, specialized high-level libraries are maintained in standalone repositories:
| Project | Description |
|---|---|
| SKaiNET-transformers | Pre-built transformer architectures and layers |
| SKaiNET-examples | Sample projects and integration demos |
| Goal | Start here |
|---|---|
| Examples and sample projects | SKaiNET-examples |
| Interactive notebooks | SKaiNET-notebook |
| Eager backends & kernels (what runs where) | Backends & kernels mindmap |
SKaiNET ships an official Phoronix-Test-Suite-compatible benchmark
program for the compute engine. See the
methodology and replay docs,
the release manifest, and the
CI workflow. Smoke runs fire
on every PR via ubuntu-latest; full publishable runs fire on a
self-hosted Linux x86 runner on release.
Quick local replay:
```bash
./gradlew :skainet-backends:benchmarks:jvm-cpu-publish:shadowJar
./scripts/run_engine_smoke.sh
```

SKaiNET is built around one path: a model is defined once in the Kotlin DSL, then either compiled to native code or executed eagerly, without rewriting it.
- Define the model with the DSL (`nn { }` / `dag { }`).
- Capture it as a tape (traced execution) or a DAG (explicit graph).
- Run it one of two ways:
  - Compile: lower the graph to MLIR / StableHLO (`HloGenerator`) and compile to native code (IREE-compatible) for native / edge targets.
  - Eager: execute directly on an available backend. On the JVM this is the primary, go-to path.
```mermaid
flowchart LR
    DSL["Model (Kotlin DSL)"] --> Graph["Tape / DAG"]
    Graph --> HLO["MLIR / StableHLO"]
    Graph --> Eager["Eager backend (JVM, …)"]
    HLO --> Native["Native code"]
```
The same DSL model feeds both paths — eager execution for development and JVM deployment, the StableHLO path for native and edge targets.
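
The "define once, run two ways" idea can be illustrated with a toy expression graph that has two interpreters: one evaluates immediately, the other emits text for a later compiler stage. This is a conceptual sketch only. `Node`, `evalEager`, and `lowerToText` are hypothetical names, none of this is SKaiNET's API, and the emitted text is pseudo-IR rather than real StableHLO.

```kotlin
// Illustrative only: one graph definition, two ways to consume it.

sealed interface Node
data class Input(val name: String) : Node
data class Const(val value: Float) : Node
data class Add(val lhs: Node, val rhs: Node) : Node
data class Mul(val lhs: Node, val rhs: Node) : Node
data class Relu(val x: Node) : Node

// "Eager" path: walk the graph and compute a value right away.
fun evalEager(node: Node, env: Map<String, Float>): Float = when (node) {
    is Input -> env.getValue(node.name)
    is Const -> node.value
    is Add -> evalEager(node.lhs, env) + evalEager(node.rhs, env)
    is Mul -> evalEager(node.lhs, env) * evalEager(node.rhs, env)
    is Relu -> maxOf(evalEager(node.x, env), 0f)
}

// "Compile" path: walk the same graph and emit pseudo-IR text instead.
fun lowerToText(node: Node): String = when (node) {
    is Input -> "%" + node.name
    is Const -> node.value.toString()
    is Add -> "add(" + lowerToText(node.lhs) + ", " + lowerToText(node.rhs) + ")"
    is Mul -> "mul(" + lowerToText(node.lhs) + ", " + lowerToText(node.rhs) + ")"
    is Relu -> "relu(" + lowerToText(node.x) + ")"
}

fun main() {
    // Defined once: relu(x * 2 + b)
    val graph = Relu(Add(Mul(Input("x"), Const(2f)), Input("b")))
    println(evalEager(graph, mapOf("x" to 3f, "b" to -10f))) // 0.0
    println(lowerToText(graph)) // relu(add(mul(%x, 2.0), %b))
}
```

The point of the design is that the graph value is plain data: adding a third consumer (for example, a shape checker) would not require touching the model definition.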
SKaiNET now includes a Minerva export backend for secure MCU deployment. It is a sibling to StableHLO and Arduino/C99 export: it starts from a supported ComputeGraph, lowers static MLPs to a Minerva compiler input, invokes libminerva when configured, and packages generated weights, host fixtures, firmware skeletons, and a fingerprinted manifest.json.
Start here:
- Minerva getting started — run the maintained tiny MLP dry sample, then the real libminerva runtime profile.
- Minerva export how-to — configure compiler paths, keys, calibration, CMake/CTest host verification, and troubleshooting.
- How Minerva secure MCU export fits — understand why Minerva is not an Arduino replacement and when to choose StableHLO instead.
Runnable examples:
```bash
./gradlew :skainet-compile:skainet-compile-minerva:runMinervaSecureMcuExamples
./gradlew :skainet-compile:skainet-compile-minerva:runMinervaSecureMcuExamples \
  -Pminerva.example=sensor-classifier
```

- Targets: JVM, macOS (Native), JS, WASM (Browser + WasmWasi)
- Single codebase shared across all platforms via Kotlin Multiplatform
- ComputeGraphExecutor: Optimized engine with fusion passes and trace-to-DAG bridging.
- SDPA & Gather: High-performance Scaled Dot-Product Attention and indexing operations.
- TurboQuant: Runtime KV-cache compression (~8x at 4-bit) for long-context LLM inference. Presets: `safe-lowbit`, `balanced`, `experimental-max`. See `TurboQuantUsage` for the integration guide.
- Sequential: `nn { input(); dense(); relu(); dense() }`
- DAG / Graph: arbitrary wiring with `dag { }` for ResNet, YOLO-style architectures
- Layers: Dense, Conv1d/2d/3d, MaxPool, AvgPool, BatchNorm, Dropout, LeakyReLU, ELU
- KAN (Kolmogorov–Arnold Networks) layer (experimental)
- Autograd engine with reverse-mode gradients, SGD and Adam/AdamW optimizers
- Built-in loaders: MNIST, Fashion-MNIST, CIFAR-10
- Formats: GGUF, ONNX, SafeTensors, JSON, Image (JPEG, PNG)
- Type-safe transform DSL: resize, crop, normalize, toTensor
- Export trained models to standalone, optimized C99 with static memory allocation
- Ready-to-use Arduino library output
- Export supported static MLP graphs to Minerva project bundles for secure MCU inference
- Emits compiler NPZ input, libminerva weights, a fingerprinted manifest, host harness, firmware example, and host verification results
- Start with the Minerva getting started guide
- Lower Kotlin DSL to MLIR StableHLO dialect
- Optimization passes: constant folding, operation fusion, dead code elimination
- Valid IREE-compilable output with streaming API and public `HloGenerator`
- Use StableHLO when you want portable MLIR/IREE-compatible graphs for native, accelerator, or ecosystem compiler flows.
- Use Arduino / C99 export when you want standalone generated C with static memory allocation and no external secure runtime.
- Use Minerva export when you need a secure MCU project bundle that goes through libminerva packaging and host verification.
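
The "~8x at 4-bit" figure quoted for TurboQuant in the feature list above is consistent with going from 32 bits to 4 bits per cached value. The back-of-the-envelope sketch below works that out. The FP32 baseline is an assumption (the source does not state it), the model dimensions are hypothetical, and the per-block scale overhead is shown only to illustrate why such a ratio is usually approximate; it is not a description of TurboQuant's actual storage layout.

```kotlin
// Rough KV-cache sizing. Dimensions and the FP32 baseline are assumptions.

// Keys + values: 2 tensors x layers x kvHeads x headDim x tokens, at bitsPerValue each.
fun kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int, tokens: Int, bitsPerValue: Double): Double {
    val values = 2.0 * layers * kvHeads * headDim * tokens
    return values * bitsPerValue / 8.0
}

fun main() {
    val fp32 = kvCacheBytes(32, 8, 128, 32_768, 32.0)
    val int4 = kvCacheBytes(32, 8, 128, 32_768, 4.0)
    // Hypothetical overhead: one 16-bit scale per block of 32 values adds 0.5 bit per value.
    val int4Scaled = kvCacheBytes(32, 8, 128, 32_768, 4.0 + 16.0 / 32.0)

    println(fp32 / (1 shl 30))  // 8.0 (GiB at 32 bits per value)
    println(fp32 / int4)        // 8.0
    println(fp32 / int4Scaled)  // about 7.1
}
```

Against an FP16 baseline the same arithmetic would give roughly 4x, so the baseline matters when comparing numbers across projects.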
- `ops.transpose` lazily handles every packed matmul dtype. The CPU backend rewraps packed bytes with a flipped shape (metadata-only "lazy transpose") so a packed weight survives `linearProject`'s `matmul(x, transpose(W))` instead of inflating to FP32, but Q8_0 and Q4_0 were missing and threw `Byte → Float ClassCastException`. Now the full dispatch set (Q4_K/Q5_K/Q6_K/Q5_0/Q5_1/Q8_0/Q4_0) transposes lazily, so a packed Q8_0/Q4_0 matmul weight (e.g. a tied Q8_0 `lm_head`) stays packed end-to-end on its NEON/SIMD kernel. Regression-tested across all seven packed types. (PRs #736, #737)
- Dependency: `com.networknt:json-schema-validator` → 3.0.4. (PR #733)
- 0.30.0: First-class Q5_K packed in-kernel dequant-matmul across the CPU backends (`Q5_KBlockTensorData` + `Q5KMatmulKernel` SPI: scalar / Panama Vector / native-C), hand-written ARM NEON kernels (fp32/q8_0/q4k/q5k, `-march=armv8.2-a+fp16+dotprod`), and Kotlin/Native consumption of the C kernels via cinterop (`skainet-backend-native-cpu` static archive + `linuxX64`/`linuxArm64` `KernelProvider`). (PR #734)
- 0.29.1: `sk.ainet.core:skainet-compile-minerva` now publishes to Maven Central (packaging fix for the Minerva export module shipped in 0.29.0).
- 0.29.0: Minerva secure-MCU export module: an end-to-end pipeline that lowers a SKaiNET model through shared graph-export contracts → Minerva IR → an `.npz` compiler input → a libminerva-packaged secure MCU project bundle, with host-side runtime verification and fingerprinted manifest artifacts (runnable sample, examples, ONNX workflow, getting-started docs). Plus packed-quant matmul kernels with Kotlin/Native parity (Q5_0/Q5_1/Q4_K/Q6_K: commonMain scalar + SPI, packed-quant dispatch in `DefaultCpuOpsBase`, Panama Vector for Q5_1/Q5_0 and Q6_K via the `KernelRegistry`), and an auto-generated, CI-gated kernel × platform support matrix. (PRs #697–#726)
- 0.28.1: Kotlin DSL → StableHLO → IREE is green end-to-end for the whole conformance suite (7/7 models, 27/27 ops compile to a `vmfb`): `inferDagOutputSpecs` now infers correct output shapes for shape-changing ops, and `reduce_window` (pooling) emits IREE's generic region form. (PRs #674, #676)
- 0.28.0: Four StableHLO export bugs fixed (reshape #666, concatenate #667, constants/reductions #663, `HloGenerator` tracing #668) plus non-JVM image runtime support (#671). (PRs #664, #670, #671)
- 0.27.0: A full gemma3 network lowers to StableHLO and compiles to an IREE `vmfb` (zero op gaps, verified by `GemmaTraceTest`): new `scaledDotProductAttention` (with causal + explicit additive mask), `permute`, `narrow`, and multi-output `split` converters, plus boxing-free `FloatArray` weight externalization for `.irpa` baking. (PRs #661 et al.)
- 0.26.0: Q4_0 promoted to a first-class quantized format across the provider stack, `tanh` as a first-class activation primitive, and a CPU tensor `convert` op, plus test/build/CI hygiene. (PRs #648–#651, #631, #636)
- 0.25.0: BF16 and Q8_0 matmul kernels end-to-end across the provider stack, autograd completeness for `pow`/`log` and the conv/pool/upsample/split family, the hybrid adaptive dtype-constraint DSL, the `@DarcValidated` operator-doc flag, and the SentencePiece special-token splitter. (PRs #595, #605–#628)
- 0.23.0: Real-model GGUFs no longer OOM at network construction (lazy `TensorDataFactory.placeholder(...)`); Kotlin/Native can finally load GGUFs over 2 GiB via the new POSIX-pread-backed `PosixPreadRandomAccessSource`. (Issues #587, #589; PRs #588, #591)
- 0.22.2: `sk.ainet:skainet-bom` now resolves from Maven Central (earlier versions shipped at the wrong coordinates). (Issue #584)
- 0.22.1: `StreamingShardedSafeTensorsReader.loadTensorStorageMapped` for zero-copy reads of multi-shard tensors above the 2 GB JVM `ByteArray` limit. (PR #582)
- 0.22.0: Native (FFM) CPU kernel provider: 4–6× faster Q4_K matmul, 1.5–1.8× FP32 SGEMM vs Panama Vector; auto-selected via `KernelRegistry.bestAvailable()`. (PR #571)

See CHANGELOG.md for the full release history.
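
The metadata-only "lazy transpose" mentioned in the first release note above can be pictured with a small sketch: transposing returns a new view over the same bytes with the shape flipped, and indexing maps back to the original layout. `PackedMatrix` is a hypothetical type for illustration, not SKaiNET's implementation, and it treats each byte as one element, whereas real packed formats such as Q8_0 or Q4_0 store values in blocks.

```kotlin
// Sketch of a metadata-only transpose over shared storage (hypothetical type).

class PackedMatrix(
    val bytes: ByteArray,          // shared storage, never copied by transpose()
    val rows: Int,
    val cols: Int,
    val transposed: Boolean = false
) {
    // Flip the logical shape and the flag; the byte buffer is reused as-is.
    fun transpose() = PackedMatrix(bytes, cols, rows, !transposed)

    // Read logical element (r, c), mapping back to the original row-major layout.
    operator fun get(r: Int, c: Int): Byte {
        require(r in 0 until rows && c in 0 until cols) { "index out of bounds" }
        return if (transposed) bytes[c * rows + r] else bytes[r * cols + c]
    }
}

fun main() {
    val w = PackedMatrix(byteArrayOf(1, 2, 3, 4, 5, 6), rows = 2, cols = 3)
    val wt = w.transpose()
    println(wt.bytes === w.bytes) // true: same buffer, no copy
    println(wt[2, 0])             // 3, which is w[0, 2]
}
```

Because no data moves, the cost of the transpose is independent of the matrix size; the trade-off is that every consumer has to honor the flag when it reads.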
- Q1 2026: Comprehensive documentation ✅
- Q2 2026: TurboQuant KV-cache compression ✅ (shipped in 0.18.0); Qwen/LLaMA tokenizers ✅ (shipped in 0.20.0)
- Q3 2026: Agentic AI enhancements ✅ (tool calling shipped in 0.13.0; ongoing)
- Q4 2026: Federated learning support for multi-device training
We love contributions! Whether it's a new operator, documentation, or a bug fix:
- Read our Contribution Guide.
- Check the Good First Issues.
- Open a discussion or issue on GitHub.
Browse the full codebase documentation on DeepWiki.
- Dhia Chemingui (@dhiaspaner) — Android KMP plugin migration (#385, #386)
MIT — see LICENCE.
