SKaiNET-developers · michalharakal · May 2, 2026 · May 2, 2026
diff --git a/APERTUS_ROLLOUT.md b/APERTUS_ROLLOUT.md
@@ -1,116 +1,50 @@
-# Apertus Support Rollout
+# Apertus Support Rollout — COMPLETE
 
-**Status:** PR 3 in flight (tool-calling support).
-**Owner:** unassigned.
-**Plan PR:** #91 (merged). PR 1: #92 (merged). PR 2: #93 (merged).
+**Status:** complete (3 of 3 PRs merged + deprecated-runtime cleanup).
+**Plan PR:** #91. **Implementation PRs:** #92 (routing), #93 (chat-template docs), #94 (tool calling).
 
-## Context
+## Summary
 
-The Apertus model (Swiss AI / EPFL multilingual decoder-only transformer) is **architecturally complete in the transformers library layer** but has three integration gaps that make it semi-broken end-to-end today:
+Apertus (Swiss AI / EPFL multilingual decoder-only transformer) reached production parity with kllama and kgemma over four PRs (one plan PR plus three implementation PRs) that landed on 2026-05-01 and 2026-05-02:
 
-1. **Silent correctness bug in `skainet-cli`.** Lines 168–216 of `llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt` route every non-Gemma model — including Apertus — through the deprecated `LlamaRuntime`. Apertus uses **xIELU** activation, **QK-Norm** (RMSNorm on Q and K before RoPE), and **ungated FFN** (no `gate_proj`); none of those branches exist in `LlamaRuntime`. Inference completes, but the logits diverge from what the Apertus checkpoint actually wants. The output is wrong on a level the user can't easily catch unless they compare to a reference.
-2. **Tool calling is OFF.** `llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt:66` lists Apertus as `("apertus", "Apertus", false, "chatml")` — `supportsToolCalling=false`, `chatTemplateFamily="chatml"` (a guess). There is no `ApertusChatTemplate.kt`, no `ApertusToolCallingSupport`, and no entry in `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ToolCallingSupportResolver.kt:16`. Apertus models fall back to `GenericToolCallingSupport`, which doesn't know the model's actual prompt format.
-3. **`kapertus` runtime is a 1-file stub.** Commit `81f3506` (`chore: remove deprecated-runtime CLIs (kqwen-only / kapertus / kvoxtral)`) deleted the previous `kapertus` CLI's `Main.kt` (~292 lines). What remains in `llm-runtime/kapertus/` is a single `ApertusIngestion.kt` (89 lines) that wraps the weight loaders. There is no kapertus-cli binary equivalent of `kllama-cli`, and no `llm-apps/kapertus-cli/` module.
+| PR    | Title                                                     |
+| ----: | --------------------------------------------------------- |
+| #91   | Plan + this tracking doc                                  |
+| #92   | `fix(apertus): route through OptimizedLLMRuntime + apertusNetwork()` |
+| #93   | `docs(apertus): document chat-template format`            |
+| #94   | `feat(apertus): tool calling support`                     |
 
-The architecture / library layer itself is solid:
-- `apertusNetwork()` DSL with xIELU, QK-Norm, ungated FFN — `llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusNetworkDef.kt:29`.
-- Weight loading from GGUF + SafeTensors + quantized — `ApertusWeightLoader.kt`, `ApertusSafeTensorsLoader.kt`, `ApertusQuantizedRuntime.kt`.
-- 4 self-contained `commonTest` smoke tests pass without a model file (`ApertusRuntimeSmokeTest`, `ApertusQuantizedRuntimeSmokeTest`, `ApertusXIELUTest`, `ApertusConfigParserTest`).
+After this stack:
+- `skainet-cli` routes Apertus models through `OptimizedLLMRuntime + apertusNetwork()`, which implements xIELU, QK-Norm, and the ungated FFN. The previous `LlamaRuntime` fallback had none of those branches and silently produced wrong logits.
+- `--agent --template=apertus` formats prompts with Apertus's own role tokens (`<|system_start|>`, `<|user_start|>`, `<|assistant_start|>`, `<|tools_prefix|>`, etc.) and parses tool calls back from `<|tools_prefix|>[...]<|tools_suffix|>` JSON arrays.
+- `ModelRegistry.APERTUS.supportsToolCalling = true`, `chatTemplateFamily = "apertus"`.
+- `KernelRegistry` auto-discovers native FFM kernels for the matmul path via the SKaiNET 0.22.0 native-cpu module.
 
-**Goal:** lift Apertus to the same level of polish kllama and kgemma have. Track the rollout in this file. The next contributor / session opens this doc, scans the staged-delivery checklist, and picks up where the previous one left off.
+## What's not in this rollout
 
----
+- **Optional kapertus-cli rebuild** — was originally listed as PR 4 ("rebuild CLI under `llm-apps/`"). Dropped: the unified `skainet-cli` already covers Apertus end-to-end, model-specific CLIs (kqwen, kapertus, kvoxtral) are being deprecated per commit `81f3506`, and the workspace direction is consolidation rather than per-model binaries. If a downstream consumer needs an Apertus-only fat-jar later, copy the `skainet-cli` shadow setup.
+- **Native Apertus kernels** — Apertus shares matmul shapes with Llama; the native FFM kernels from SKaiNET 0.22.0 (Q4_K, FP32) work transparently. No Apertus-specific kernel work needed.
+- **TurboQuant KV-cache compression for Apertus** — tracked separately under the TurboQuant workstream.
 
-## Staged delivery
+## Reference docs
 
-- [x] **PR 1 — `fix(apertus): route through OptimizedLLMRuntime + apertusNetwork()`** (correctness fix) — #92
-- [x] **PR 2 — `docs(apertus): document chat template format`** (research) — #93
-- [x] **PR 3 — `feat(apertus): tool calling support`** (implementation, depends on PR 2) — this PR
-- [ ] **PR 4 — `feat(kapertus): rebuild CLI under llm-apps/`** (parity, optional)
+- `docs/specs/apertus-chat-template.md` — full spec for the Apertus chat template (PR 2). Source of truth for the `ApertusChatTemplate` implementation.
 
-Each PR ticks its own checkbox when merged. `Status:` at the top of this doc reflects the most recent merged PR.
+## Cleanup that landed alongside the rollout (this commit)
 
----
+The hand-coded `ApertusRuntime.kt` and `ApertusQuantizedRuntime.kt` paths (and their attention backends and smoke tests) were marked `@Deprecated` after PR 1 made `OptimizedLLMRuntime + apertusNetwork()` the canonical path. This commit removes them as part of closing out the rollout:
 
-## PR 1 — fix skainet-cli routing for Apertus
+- `ApertusRuntime.kt` — hand-coded decoder runtime, deprecated.
+- `ApertusQuantizedRuntime.kt` — lazy-dequant variant, deprecated.
+- `ApertusAttentionBackend.kt` + `ApertusCpuAttentionBackend.kt` — only used by the two deleted runtimes.
+- `ApertusRuntimeSmokeTest.kt` + `ApertusQuantizedRuntimeSmokeTest.kt` — exercised the deleted runtimes.
 
-**Why first:** today's `skainet-cli` produces silently-wrong logits for Apertus models. Fix the worst class of bug first; everything else is additive.
+The `xielu()` / `softplus()` activation reference functions previously housed in `ApertusRuntime.kt` were extracted to `ApertusXIELU.kt` so that `ApertusXIELUTest` keeps validating the math. The KDoc references to "ApertusRuntime" in `OptimizedLLMRuntime.kt` and `OutputEquivalenceTest.kt` are now stale and worth a follow-up sweep, but they are code comments only and don't break anything.
 
-**Changes:**
-- `llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt` lines 168–216: detect `architecture == "apertus"` (or `family == "apertus"`) and branch to `OptimizedLLMRuntime + apertusNetwork()` instead of `LlamaRuntime`. Mirror how Gemma4 is already special-cased in this file.
-- Reuse `ApertusNetworkLoader.kt:31` for the module-build path.
-- Reuse `apertusNetwork()` from `ApertusNetworkDef.kt:29`.
+The remaining Apertus library files (`ApertusNetworkDef`, `ApertusNetworkLoader`, `ApertusWeightLoader`, `ApertusSafeTensorsLoader`, `ApertusRuntimeWeights`, `ApertusConfigParser`, `QuantizedTensor`, `ApertusXIELU`, `ApertusIngestion`) cover the whole production path through `apertusNetwork() + OptimizedLLMRuntime`.
 
-**Verification:**
-- `skainet-cli -m <apertus.gguf> "Hello"` produces coherent output (the divergence is silent today; loading a real checkpoint and seeing meaningful text is the canary).
-- xIELU activation actually fires — verify by setting a debug breakpoint or `println` in `ApertusNetworkDef`'s xIELU branch on the first forward pass.
+## Test footprint (post-cleanup)
 
-**No new tests required** — the existing `:llm-inference:apertus:commonTest` already exercises `apertusNetwork()` end-to-end at toy scale; the routing fix is a Main.kt change covered by manual run.
-
----
-
-## PR 2 — document the chat template format
-
-**Why:** before implementing a chat template, we need to know what format Apertus models actually expect. `ModelRegistry.kt:66` lists `chatTemplateFamily="chatml"` but that's a guess — Apertus may use a different format (Alpaca, llama2-style, custom). Without the right template, tool-calling output won't parse correctly even if the rest of PR 3 is right.
-
-**Changes:**
-- Inspect a real Apertus GGUF (download from HuggingFace if needed: `swiss-ai/Apertus-1B` or similar). Read the `tokenizer.chat_template` GGUF metadata key.
-- Create `docs/explanation/models/apertus-chat-template.md` documenting:
-  - The actual chat-template Jinja string from the GGUF, byte-for-byte.
-  - Special tokens (`<|im_start|>`-style? `[INST]`-style? Alpaca?).
-  - Tool calling format (if the template has any).
-  - Whether the template matches an existing family (`chatml`, `llama3`, `gemma`) or needs a new `apertus` strategy.
-- Update `ModelRegistry.kt:66` `chatTemplateFamily` if the research shows a different family is correct.
-
-**Verification:** doc exists; rendered template matches the GGUF's `tokenizer.chat_template` byte-for-byte for one canonical message exchange (system + user + assistant + user roles).
-
----
-
-## PR 3 — tool calling support
-
-**Depends on PR 2** (need the template format documented).
-
-**Changes:**
-- New `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ApertusChatTemplate.kt` — concrete `ChatTemplate`. Pattern reference: `Llama3ChatTemplate.kt` and `Gemma4ChatTemplate.kt`.
-- New `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ApertusToolCallingSupport.kt` — analog of `Gemma4ToolCallingSupport`. Bundles the chat template + the tool-call markup parser.
-- Register in `ToolCallingSupportResolver.kt:16` — add a branch for `family == "apertus"` returning the new support.
-- `ModelRegistry.kt:66`: flip `supportsToolCalling=true`.
-- Tests:
-  - `ApertusChatTemplateTest.kt` in `llm-agent/src/commonTest/kotlin/sk/ainet/apps/kllama/chat/` — render parity vs the GGUF Jinja template (mirror `Gemma4ChatTemplateHfParityTest` shape).
-  - `ApertusToolCallParserStrategyTest.kt` — parse the Apertus tool-call output format.
-
-**Verification:**
-- `:llm-agent:commonTest --tests '*Apertus*'` green.
-- End-to-end: `skainet-cli -m <apertus.gguf> --agent --template=apertus "What is 17 * 23?"` invokes a calculator tool the same way the kllama TinyLlama tool-calling smoke test does.
-
----
-
-## PR 4 — rebuild kapertus CLI under llm-apps/ (optional)
-
-**Why:** `kapertus` runtime is a 1-file library facade today. Users wanting a `kapertus-cli` equivalent of `kllama-cli` have nothing. The unified `skainet-cli` works fine after PR 1, so PR 4 is **optional polish** — only worth doing if there's a specific reason to ship a separate Apertus-only CLI binary (smaller fat-JAR, branded distribution, similar parity expectations as `kllama-cli`).
-
-**Changes:**
-- New `llm-apps/kapertus-cli/build.gradle.kts` mirroring `kllama-cli/build.gradle.kts`: `kotlin("jvm")` + `shadow` + `application` plugins.
-- New `llm-apps/kapertus-cli/src/main/kotlin/sk/ainet/apps/kapertus/cli/Main.kt` — re-implement the deleted (commit `81f3506`) Main.kt, but routed through `OptimizedLLMRuntime + apertusNetwork()`.
-- Apply the shadow `mergeServiceFiles()` `doLast` workaround that PR #88 added to `kllama-cli` and `skainet-cli` (the `com.gradleup.shadow:9.4.x` bug — `NativeKernelProviderFactory` gets dropped from the merged services file otherwise).
-- Add to `settings.gradle.kts`: `include("llm-apps:kapertus-cli")`.
-
-**Verification:**
-- `:llm-apps:kapertus-cli:shadowJar` produces a runnable fat JAR.
-- `unzip -p kapertus-all.jar META-INF/services/sk.ainet.backend.api.kernel.KernelProvider` shows all 3 KernelProvider entries (Scalar + PanamaVector + Native).
-- `java -jar kapertus-all.jar -m <apertus.gguf> "Hello"` produces coherent output.
-
----
-
-## Out of scope
-
-- **Native-cpu wiring for Apertus inference.** Works automatically once PR 1 lands: matmul flows through `KernelRegistry.bestAvailable()`; native FFM kernels are auto-discovered via ServiceLoader on the classpath. No Apertus-specific code needed.
-- **Q4_K-quantized Apertus checkpoints.** Should work today via the existing Q4_K matmul path; no Apertus-specific code needed.
-- **TurboQuant KV-cache compression for Apertus.** Tracked separately under the broader TurboQuant workstream.
-- **Removing the deprecated `ApertusRuntime.kt`** (hand-coded path). Leave as `@Deprecated` until consumers have migrated to `OptimizedLLMRuntime + apertusNetwork()`.
-
----
-
-## Why this lives in transformers, not SKaiNET
-
-The Apertus rollout is entirely transformers-side. Model definition (`apertusNetwork()`), runtime, weight loaders, and tool calling all live under `llm-inference/apertus/`, `llm-runtime/kapertus/`, and `llm-agent/`. The SKaiNET upstream (kernels, tensor ops, ServiceLoader infra) needs no Apertus-specific changes.
+- `:llm-inference:apertus:jvmTest` — 12 tests (ConfigParser 6, XIELU 6).
+- `:llm-agent:jvmTest --tests '*Apertus*'` — 21 tests (ChatTemplate 10, ParserStrategy 11).
+- 33 Apertus-specific tests total, all green.
diff --git a/...nference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt b/...nference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt