From 8a7e0ff3196e2744db485d1a53b5ba15657aa807 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 2 May 2026 11:32:22 +0200
Subject: [PATCH] =?UTF-8?q?chore(apertus):=20close=20out=20rollout=20?=
 =?UTF-8?q?=E2=80=94=20remove=20deprecated=20runtimes?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes the Apertus rollout (APERTUS_ROLLOUT.md) with all three
implementation PRs merged (routing #92, chat-template docs #93,
tool calling #94) on top of the plan PR #91. Drops the optional
PR 4 (rebuild kapertus-cli): the unified skainet-cli already
covers Apertus end-to-end after #92, and the workspace direction
(per `81f3506` deleting the kqwen / kvoxtral / kapertus CLIs) is
consolidation rather than per-model binaries.

Also lands the deprecated-runtime cleanup that the rollout
deferred. After PR 1 (#92) made `OptimizedLLMRuntime + apertusNetwork()`
the canonical path, the hand-coded `ApertusRuntime` and its
quantized variant had no production callers; both were marked
@Deprecated after #92 merged. They are removed now rather than
carried as stale code through a separate deprecation cycle.

Deleted:

- llm-inference/apertus/.../ApertusRuntime.kt — hand-coded
  decoder runtime. Replaced by OptimizedLLMRuntime + apertusNetwork()
  per #92.
- llm-inference/apertus/.../ApertusQuantizedRuntime.kt — lazy-dequant
  variant. Same canonical replacement path; QuantPolicy.NATIVE_OPTIMIZED
  through the unified loader covers the same memory profile.
- llm-inference/apertus/.../ApertusAttentionBackend.kt — interface
  used only by the two deleted runtimes.
- llm-inference/apertus/.../ApertusCpuAttentionBackend.kt —
  implementation, only used by the two deleted runtimes.
- llm-inference/apertus/.../ApertusRuntimeSmokeTest.kt — exercised
  the deleted ApertusRuntime.
- llm-inference/apertus/.../ApertusQuantizedRuntimeSmokeTest.kt —
  exercised the deleted ApertusQuantizedRuntime.

Extracted (kept):

- llm-inference/apertus/.../ApertusXIELU.kt (new) — `xielu()` and
  `softplus()` were public reference activation helpers in
  ApertusRuntime.kt and remain useful as a numerical reference:
  ApertusXIELUTest validates the math, and future xIELU
  implementations can treat this file as the golden reference.
  They are pulled out into a standalone activation module so the
  test keeps compiling after ApertusRuntime is gone.

Untouched (still on the production path):

- ApertusNetworkDef.kt — the apertusNetwork() DSL with xIELU op,
  QK-Norm, ungated FFN.
- ApertusNetworkLoader.kt — module-build entry point.
- ApertusWeightLoader.kt + ApertusSafeTensorsLoader.kt — GGUF +
  SafeTensors ingestion. Still used by both ApertusNetworkLoader
  and (in transition) ApertusIngestion's loadQuantized* methods.
- ApertusRuntimeWeights.kt — data classes (ApertusModelMetadata,
  ApertusLayerWeights, ApertusXIELUParams). Used by the network
  path.
- ApertusIngestion.kt (kapertus runtime) — thin facade. Its
  loadQuantized* methods reference ApertusQuantizedRuntimeWeights,
  which lives in the (still-extant) ApertusWeightLoader codepath.
  Verified that it compiles cleanly after this PR.

Stale code-comment references to "ApertusRuntime" remain in
OptimizedLLMRuntime.kt and llm-core's OutputEquivalenceTest.kt
kdocs — they describe the migration history. Not load-bearing;
left for a future docs sweep.

APERTUS_ROLLOUT.md is rewritten as a closure document (status:
complete, summary of the four merged PRs, why PR 4 was dropped,
post-cleanup test footprint).

Verification:

- `:llm-inference:apertus:jvmTest` — 12/12 (ConfigParser 6, XIELU 6).
- `:llm-agent:jvmTest --tests '*Apertus*'` — 21/21 (ChatTemplate 10,
  ParserStrategy 11).
- `:llm-runtime:kapertus:compileKotlinJvm`, `:llm-apps:skainet-cli:compileKotlin`,
  `:llm-core:compileTestKotlinJvm` — all green after the deletes.

Total Apertus test footprint after this commit: 33 tests, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 APERTUS_ROLLOUT.md                            | 134 +++------
 .../models/apertus/ApertusAttentionBackend.kt |  51 ----
 .../apertus/ApertusCpuAttentionBackend.kt     | 242 ---------------
 .../models/apertus/ApertusQuantizedRuntime.kt | 185 ------------
 .../sk/ainet/models/apertus/ApertusRuntime.kt | 242 ---------------
 .../sk/ainet/models/apertus/ApertusXIELU.kt   |  51 ++++
 .../ApertusQuantizedRuntimeSmokeTest.kt       | 275 ------------------
 .../models/apertus/ApertusRuntimeSmokeTest.kt | 175 -----------
 8 files changed, 85 insertions(+), 1270 deletions(-)
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt
 create mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt
 delete mode 100644 llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt
 delete mode 100644 llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt

diff --git a/APERTUS_ROLLOUT.md b/APERTUS_ROLLOUT.md
index 2e782ee6..742973b1 100644
--- a/APERTUS_ROLLOUT.md
+++ b/APERTUS_ROLLOUT.md
@@ -1,116 +1,50 @@
-# Apertus Support Rollout
+# Apertus Support Rollout — COMPLETE
 
-**Status:** PR 3 in flight (tool-calling support).
-**Owner:** unassigned.
-**Plan PR:** #91 (merged). PR 1: #92 (merged). PR 2: #93 (merged).
+**Status:** complete (3 of 3 implementation PRs merged + deprecated-runtime cleanup).
+**Plan PR:** #91. **Implementation PRs:** #92 (routing), #93 (chat-template docs), #94 (tool calling).
 
-## Context
+## Summary
 
-The Apertus model (Swiss AI / EPFL multilingual decoder-only transformer) is **architecturally complete in the transformers library layer** but has three integration gaps that make it semi-broken end-to-end today:
+Apertus (Swiss AI / EPFL multilingual decoder-only transformer) reached production parity with kllama and kgemma over four PRs (the plan PR plus three implementation PRs) landed on 2026-05-01 and 2026-05-02:
 
-1. **Silent correctness bug in `skainet-cli`.** Lines 168–216 of `llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt` route every non-Gemma model — including Apertus — through the deprecated `LlamaRuntime`. Apertus uses **xIELU** activation, **QK-Norm** (RMSNorm on Q and K before RoPE), and **ungated FFN** (no `gate_proj`); none of those branches exist in `LlamaRuntime`. Inference completes, but the logits diverge from what the Apertus checkpoint actually wants. The output is wrong on a level the user can't easily catch unless they compare to a reference.
-2. **Tool calling is OFF.** `llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt:66` lists Apertus as `("apertus", "Apertus", false, "chatml")` — `supportsToolCalling=false`, `chatTemplateFamily="chatml"` (a guess). There is no `ApertusChatTemplate.kt`, no `ApertusToolCallingSupport`, and no entry in `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ToolCallingSupportResolver.kt:16`. Apertus models fall back to `GenericToolCallingSupport`, which doesn't know the model's actual prompt format.
-3. **`kapertus` runtime is a 1-file stub.** Commit `81f3506` (`chore: remove deprecated-runtime CLIs (kqwen-only / kapertus / kvoxtral)`) deleted the previous `kapertus` CLI's `Main.kt` (~292 lines). What remains in `llm-runtime/kapertus/` is a single `ApertusIngestion.kt` (89 lines) that wraps the weight loaders. There is no kapertus-cli binary equivalent of `kllama-cli`, and no `llm-apps/kapertus-cli/` module.
+| PR    | Title                                                     |
+| ----: | --------------------------------------------------------- |
+| #91   | Plan + this tracking doc                                  |
+| #92   | `fix(apertus): route through OptimizedLLMRuntime + apertusNetwork()` |
+| #93   | `docs(apertus): document chat-template format`            |
+| #94   | `feat(apertus): tool calling support`                     |
 
-The architecture / library layer itself is solid:
-- `apertusNetwork()` DSL with xIELU, QK-Norm, ungated FFN — `llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusNetworkDef.kt:29`.
-- Weight loading from GGUF + SafeTensors + quantized — `ApertusWeightLoader.kt`, `ApertusSafeTensorsLoader.kt`, `ApertusQuantizedRuntime.kt`.
-- 4 self-contained `commonTest` smoke tests pass without a model file (`ApertusRuntimeSmokeTest`, `ApertusQuantizedRuntimeSmokeTest`, `ApertusXIELUTest`, `ApertusConfigParserTest`).
+After this stack:
+- `skainet-cli` routes Apertus models through `OptimizedLLMRuntime + apertusNetwork()` (xIELU + QK-Norm + ungated FFN — the previous `LlamaRuntime` fallback silently produced wrong logits).
+- `--agent --template=apertus` formats prompts with Apertus's own role tokens (`<|system_start|>`, `<|user_start|>`, `<|assistant_start|>`, `<|tools_prefix|>`, etc.) and parses tool calls back from `<|tools_prefix|>[...]<|tools_suffix|>` JSON arrays.
+- `ModelRegistry.APERTUS.supportsToolCalling = true`, `chatTemplateFamily = "apertus"`.
+- `KernelRegistry` auto-discovers native FFM kernels for the matmul path via the 0.22.0 native-cpu module.
 
-**Goal:** lift Apertus to the same level of polish kllama and kgemma have. Track the rollout in this file. The next contributor / session opens this doc, scans the staged-delivery checklist, and picks up where the previous one left off.
+## What's not in this rollout
 
----
+- **Optional kapertus-cli rebuild** — was originally listed as PR 4 ("rebuild CLI under `llm-apps/`"). Dropped: the unified `skainet-cli` already covers Apertus end-to-end, model-specific CLIs (kqwen, kapertus, kvoxtral) are being deprecated per commit `81f3506`, and the workspace direction is consolidation rather than per-model binaries. If a downstream consumer needs an Apertus-only fat-jar later, copy the `skainet-cli` shadow setup.
+- **Native Apertus kernels** — Apertus shares matmul shapes with Llama; the native FFM kernels from SKaiNET 0.22.0 (Q4_K, FP32) work transparently. No Apertus-specific kernel work needed.
+- **TurboQuant KV-cache compression for Apertus** — tracked separately under the TurboQuant workstream.
 
-## Staged delivery
+## Reference docs
 
-- [x] **PR 1 — `fix(apertus): route through OptimizedLLMRuntime + apertusNetwork()`** (correctness fix) — #92
-- [x] **PR 2 — `docs(apertus): document chat template format`** (research) — #93
-- [x] **PR 3 — `feat(apertus): tool calling support`** (implementation, depends on PR 2) — this PR
-- [ ] **PR 4 — `feat(kapertus): rebuild CLI under llm-apps/`** (parity, optional)
+- `docs/specs/apertus-chat-template.md` — full spec for the Apertus chat template (PR 2). Source of truth for the `ApertusChatTemplate` implementation.
 
-Each PR ticks its own checkbox when merged. `Status:` at the top of this doc reflects the most recent merged PR.
+## Cleanup that landed alongside the rollout (this commit)
 
----
+The hand-coded `ApertusRuntime.kt` and `ApertusQuantizedRuntime.kt` paths (and their attention backends + smoke tests) were marked `@Deprecated` after PR 1 made `OptimizedLLMRuntime + apertusNetwork()` the canonical path. Removed in this commit alongside the rollout closure:
 
-## PR 1 — fix skainet-cli routing for Apertus
+- `ApertusRuntime.kt` — hand-coded decoder runtime, deprecated.
+- `ApertusQuantizedRuntime.kt` — lazy-dequant variant, deprecated.
+- `ApertusAttentionBackend.kt` + `ApertusCpuAttentionBackend.kt` — only used by the two deleted runtimes.
+- `ApertusRuntimeSmokeTest.kt` + `ApertusQuantizedRuntimeSmokeTest.kt` — exercised the deleted runtimes.
 
-**Why first:** today's `skainet-cli` produces silently-wrong logits for Apertus models. Fix the worst class of bug first; everything else is additive.
+The `xielu()` / `softplus()` activation reference functions previously housed in `ApertusRuntime.kt` were extracted to `ApertusXIELU.kt` so `ApertusXIELUTest` keeps validating the math. The kdoc references in `OptimizedLLMRuntime.kt` and `OutputEquivalenceTest.kt` to "ApertusRuntime" are now stale and worth a follow-up sweep, but they're code comments only and don't break anything.
 
-**Changes:**
-- `llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt` lines 168–216: detect `architecture == "apertus"` (or `family == "apertus"`) and branch to `OptimizedLLMRuntime + apertusNetwork()` instead of `LlamaRuntime`. Mirror how Gemma4 is already special-cased in this file.
-- Reuse `ApertusNetworkLoader.kt:31` for the module-build path.
-- Reuse `apertusNetwork()` from `ApertusNetworkDef.kt:29`.
+The remaining apertus library files (`ApertusNetworkDef`, `ApertusNetworkLoader`, `ApertusWeightLoader`, `ApertusSafeTensorsLoader`, `ApertusRuntimeWeights`, `ApertusConfigParser`, `QuantizedTensor`, `ApertusXIELU`, `ApertusIngestion`) cover the whole production path through `apertusNetwork() + OptimizedLLMRuntime`.
 
-**Verification:**
-- `skainet-cli -m <apertus.gguf> "Hello"` produces coherent output (the divergence is silent today; loading a real checkpoint and seeing meaningful text is the canary).
-- xIELU activation actually fires — verify by setting a debug breakpoint or `println` in `ApertusNetworkDef`'s xIELU branch on the first forward pass.
+## Test footprint (post-cleanup)
 
-**No new tests required** — the existing `:llm-inference:apertus:commonTest` already exercises `apertusNetwork()` end-to-end at toy scale; the routing fix is a Main.kt change covered by manual run.
-
----
-
-## PR 2 — document the chat template format
-
-**Why:** before implementing a chat template, we need to know what format Apertus models actually expect. `ModelRegistry.kt:66` lists `chatTemplateFamily="chatml"` but that's a guess — Apertus may use a different format (Alpaca, llama2-style, custom). Without the right template, tool-calling output won't parse correctly even if the rest of PR 3 is right.
-
-**Changes:**
-- Inspect a real Apertus GGUF (download from HuggingFace if needed: `swiss-ai/Apertus-1B` or similar). Read the `tokenizer.chat_template` GGUF metadata key.
-- Create `docs/explanation/models/apertus-chat-template.md` documenting:
-  - The actual chat-template Jinja string from the GGUF, byte-for-byte.
-  - Special tokens (`<|im_start|>`-style? `[INST]`-style? Alpaca?).
-  - Tool calling format (if the template has any).
-  - Whether the template matches an existing family (`chatml`, `llama3`, `gemma`) or needs a new `apertus` strategy.
-- Update `ModelRegistry.kt:66` `chatTemplateFamily` if the research shows a different family is correct.
-
-**Verification:** doc exists; rendered template matches the GGUF's `tokenizer.chat_template` byte-for-byte for one canonical message exchange (system + user + assistant + user roles).
-
----
-
-## PR 3 — tool calling support
-
-**Depends on PR 2** (need the template format documented).
-
-**Changes:**
-- New `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ApertusChatTemplate.kt` — concrete `ChatTemplate`. Pattern reference: `Llama3ChatTemplate.kt` and `Gemma4ChatTemplate.kt`.
-- New `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ApertusToolCallingSupport.kt` — analog of `Gemma4ToolCallingSupport`. Bundles the chat template + the tool-call markup parser.
-- Register in `ToolCallingSupportResolver.kt:16` — add a branch for `family == "apertus"` returning the new support.
-- `ModelRegistry.kt:66`: flip `supportsToolCalling=true`.
-- Tests:
-  - `ApertusChatTemplateTest.kt` in `llm-agent/src/commonTest/kotlin/sk/ainet/apps/kllama/chat/` — render parity vs the GGUF Jinja template (mirror `Gemma4ChatTemplateHfParityTest` shape).
-  - `ApertusToolCallParserStrategyTest.kt` — parse the Apertus tool-call output format.
-
-**Verification:**
-- `:llm-agent:commonTest --tests '*Apertus*'` green.
-- End-to-end: `skainet-cli -m <apertus.gguf> --agent --template=apertus "What is 17 * 23?"` invokes a calculator tool the same way the kllama TinyLlama tool-calling smoke test does.
-
----
-
-## PR 4 — rebuild kapertus CLI under llm-apps/ (optional)
-
-**Why:** `kapertus` runtime is a 1-file library facade today. Users wanting a `kapertus-cli` equivalent of `kllama-cli` have nothing. The unified `skainet-cli` works fine after PR 1, so PR 4 is **optional polish** — only worth doing if there's a specific reason to ship a separate Apertus-only CLI binary (smaller fat-JAR, branded distribution, similar parity expectations as `kllama-cli`).
-
-**Changes:**
-- New `llm-apps/kapertus-cli/build.gradle.kts` mirroring `kllama-cli/build.gradle.kts`: `kotlin("jvm")` + `shadow` + `application` plugins.
-- New `llm-apps/kapertus-cli/src/main/kotlin/sk/ainet/apps/kapertus/cli/Main.kt` — re-implement the deleted (commit `81f3506`) Main.kt, but routed through `OptimizedLLMRuntime + apertusNetwork()`.
-- Apply the shadow `mergeServiceFiles()` `doLast` workaround that PR #88 added to `kllama-cli` and `skainet-cli` (the `com.gradleup.shadow:9.4.x` bug — `NativeKernelProviderFactory` gets dropped from the merged services file otherwise).
-- Add to `settings.gradle.kts`: `include("llm-apps:kapertus-cli")`.
-
-**Verification:**
-- `:llm-apps:kapertus-cli:shadowJar` produces a runnable fat JAR.
-- `unzip -p kapertus-all.jar META-INF/services/sk.ainet.backend.api.kernel.KernelProvider` shows all 3 KernelProvider entries (Scalar + PanamaVector + Native).
-- `java -jar kapertus-all.jar -m <apertus.gguf> "Hello"` produces coherent output.
-
----
-
-## Out of scope
-
-- **Native-cpu wiring for Apertus inference.** Works automatically once PR 1 lands: matmul flows through `KernelRegistry.bestAvailable()`; native FFM kernels are auto-discovered via ServiceLoader on the classpath. No Apertus-specific code needed.
-- **Q4_K-quantized Apertus checkpoints.** Should work today via the existing Q4_K matmul path; no Apertus-specific code needed.
-- **TurboQuant KV-cache compression for Apertus.** Tracked separately under the broader TurboQuant workstream.
-- **Removing the deprecated `ApertusRuntime.kt`** (hand-coded path). Leave as `@Deprecated` until consumers have migrated to `OptimizedLLMRuntime + apertusNetwork()`.
-
----
-
-## Why this lives in transformers, not SKaiNET
-
-The Apertus rollout is entirely transformers-side. Model definition (`apertusNetwork()`), runtime, weight loaders, and tool calling all live under `llm-inference/apertus/`, `llm-runtime/kapertus/`, and `llm-agent/`. The SKaiNET upstream (kernels, tensor ops, ServiceLoader infra) needs no Apertus-specific changes.
+- `:llm-inference:apertus:jvmTest` — 12 tests (ConfigParser 6, XIELU 6).
+- `:llm-agent:jvmTest --tests '*Apertus*'` — 21 tests (ChatTemplate 10, ParserStrategy 11).
+- 33 Apertus-specific tests total, all green.
diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt
deleted file mode 100644
index 1980fd5b..00000000
--- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt
+++ /dev/null
@@ -1,51 +0,0 @@
-package sk.ainet.models.apertus
-
-import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.types.DType
-
-/**
- * Strategy interface for Apertus attention computation.
- *
- * Similar to LLaMA's AttentionBackend but receives Q/K after QK-norm has been applied.
- * Applies RoPE encoding, KV cache management, and GQA attention scoring.
- *
- * Contract:
- * - Input: q [1, dim], k [1, kvDim], v [1, kvDim], layerIdx, position
- * - Output: attention output [1, dim]
- */
-public interface ApertusAttentionBackend<T : DType> {
-
-    /**
-     * Compute attention for one token at the given position.
-     *
-     * Q and K have already been QK-normed by the caller.
-     * This method applies RoPE, stores k/v in the KV cache,
-     * and returns the attention-weighted output.
-     */
-    public fun attention(
-        q: Tensor<T, Float>,
-        k: Tensor<T, Float>,
-        v: Tensor<T, Float>,
-        layerIdx: Int,
-        position: Int
-    ): Tensor<T, Float>
-
-    /**
-     * Compute attention for a batch of tokens starting at [startPos].
-     *
-     * Returns null if the backend does not support batch attention,
-     * in which case the runtime falls back to sequential processing.
-     */
-    public fun batchAttention(
-        q: Tensor<T, Float>,
-        k: Tensor<T, Float>,
-        v: Tensor<T, Float>,
-        layerIdx: Int,
-        startPos: Int,
-    ): Tensor<T, Float>? = null
-
-    /**
-     * Reset internal state (KV caches, position tracking, etc.).
-     */
-    public fun reset()
-}
diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt
deleted file mode 100644
index 664410a5..00000000
--- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt
+++ /dev/null
@@ -1,242 +0,0 @@
-package sk.ainet.models.apertus
-
-import kotlin.math.cos
-import kotlin.math.sin
-import kotlin.math.sqrt
-import sk.ainet.apps.llm.KvCache
-import sk.ainet.apps.llm.HeapKvCache
-import sk.ainet.apps.llm.applyRopeRotation
-import sk.ainet.apps.llm.softmaxInPlace
-import sk.ainet.context.ExecutionContext
-import sk.ainet.lang.tensor.Shape
-import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.tensor.data.FloatArrayTensorData
-import sk.ainet.lang.types.DType
-import kotlin.reflect.KClass
-
-/**
- * CPU-based attention backend for Apertus.
- *
- * Applies RoPE with high base theta (12M default), stores into KV cache,
- * and computes GQA attention with explicit loops.
- *
- * Q and K are expected to be QK-normed before being passed to this backend.
- */
-public class ApertusCpuAttentionBackend<T : DType> private constructor(
-    private val ctx: ExecutionContext,
-    private val dtype: KClass<T>,
-    private val dim: Int,
-    private val seqLen: Int,
-    private val nLayers: Int,
-    private val nHeads: Int,
-    private val nKvHeads: Int,
-    private val headSize: Int,
-    private val kvDim: Int,
-    private val ropeDim: Int,
-    private val nHeadsPerKv: Int,
-    private val ropeFreqBase: Float,
-    private val cache: KvCache,
-    private val precomputedRopeFreqs: FloatArray?
-) : ApertusAttentionBackend<T> {
-
-    /**
-     * Primary constructor from full runtime weights.
-     */
-    public constructor(
-        ctx: ExecutionContext,
-        weights: ApertusRuntimeWeights<T>,
-        dtype: KClass<T>,
-        kvCache: KvCache? = null,
-        ropeFreqBase: Float = weights.metadata.ropeTheta
-    ) : this(
-        ctx = ctx,
-        dtype = dtype,
-        dim = weights.metadata.embeddingLength,
-        seqLen = weights.metadata.contextLength,
-        nLayers = weights.metadata.blockCount,
-        nHeads = weights.metadata.headCount,
-        nKvHeads = weights.metadata.kvHeadCount,
-        headSize = weights.metadata.embeddingLength / weights.metadata.headCount,
-        kvDim = weights.metadata.kvHeadCount * (weights.metadata.embeddingLength / weights.metadata.headCount),
-        ropeDim = weights.metadata.ropeDimensionCount
-            ?: (weights.metadata.embeddingLength / weights.metadata.headCount),
-        nHeadsPerKv = weights.metadata.headCount / weights.metadata.kvHeadCount,
-        ropeFreqBase = ropeFreqBase,
-        cache = kvCache ?: HeapKvCache(
-            weights.metadata.blockCount,
-            weights.metadata.contextLength,
-            weights.metadata.kvHeadCount * (weights.metadata.embeddingLength / weights.metadata.headCount)
-        ),
-        precomputedRopeFreqs = weights.ropeFreqs?.let { tensor ->
-            val data = tensor.data
-            if (data is FloatArrayTensorData<*>) data.buffer.copyOf()
-            else data.copyToFloatArray()
-        }
-    )
-
-    /**
-     * Constructor from metadata and optional rope frequencies (for quantized runtime).
-     */
-    public constructor(
-        ctx: ExecutionContext,
-        metadata: ApertusModelMetadata,
-        dtype: KClass<T>,
-        ropeFreqs: FloatArray? = null,
-        kvCache: KvCache? = null,
-        ropeFreqBase: Float = metadata.ropeTheta
-    ) : this(
-        ctx = ctx,
-        dtype = dtype,
-        dim = metadata.embeddingLength,
-        seqLen = metadata.contextLength,
-        nLayers = metadata.blockCount,
-        nHeads = metadata.headCount,
-        nKvHeads = metadata.kvHeadCount,
-        headSize = metadata.embeddingLength / metadata.headCount,
-        kvDim = metadata.kvHeadCount * (metadata.embeddingLength / metadata.headCount),
-        ropeDim = metadata.ropeDimensionCount
-            ?: (metadata.embeddingLength / metadata.headCount),
-        nHeadsPerKv = metadata.headCount / metadata.kvHeadCount,
-        ropeFreqBase = ropeFreqBase,
-        cache = kvCache ?: HeapKvCache(
-            metadata.blockCount,
-            metadata.contextLength,
-            metadata.kvHeadCount * (metadata.embeddingLength / metadata.headCount)
-        ),
-        precomputedRopeFreqs = ropeFreqs
-    )
-
-    private val scoreBuffer = FloatArray(seqLen)
-
-    override fun attention(
-        q: Tensor<T, Float>,
-        k: Tensor<T, Float>,
-        v: Tensor<T, Float>,
-        layerIdx: Int,
-        position: Int
-    ): Tensor<T, Float> {
-        val qBuf = q.expectFloatBuffer()
-        val kBuf = k.expectFloatBuffer()
-        val vBuf = v.expectFloatBuffer()
-
-        applyRopeGqa(qBuf, kBuf, position)
-        cache.store(layerIdx, position, kBuf, 0, vBuf, 0)
-
-        val attnOutRaw = attentionGqa(layerIdx, qBuf, position)
-        return ctx.fromFloatArray<T, Float>(Shape(1, dim), dtype, attnOutRaw)
-    }
-
-    override fun batchAttention(
-        q: Tensor<T, Float>,
-        k: Tensor<T, Float>,
-        v: Tensor<T, Float>,
-        layerIdx: Int,
-        startPos: Int,
-    ): Tensor<T, Float> {
-        val batchSize = q.shape[0]
-        val qAll = q.expectFloatBuffer()
-        val kAll = k.expectFloatBuffer()
-        val vAll = v.expectFloatBuffer()
-
-        val result = FloatArray(batchSize * dim)
-
-        for (i in 0 until batchSize) {
-            val pos = startPos + i
-
-            val qBuf = qAll.copyOfRange(i * dim, (i + 1) * dim)
-            val kBuf = kAll.copyOfRange(i * kvDim, (i + 1) * kvDim)
-            val vBuf = vAll.copyOfRange(i * kvDim, (i + 1) * kvDim)
-
-            applyRopeGqa(qBuf, kBuf, pos)
-            cache.store(layerIdx, pos, kBuf, 0, vBuf, 0)
-
-            val attnOut = attentionGqa(layerIdx, qBuf, pos)
-            attnOut.copyInto(result, i * dim)
-        }
-
-        return ctx.fromFloatArray<T, Float>(Shape(batchSize, dim), dtype, result)
-    }
-
-    override fun reset() {
-        cache.reset()
-    }
-
-    private fun applyRopeGqa(qBuf: FloatArray, kBuf: FloatArray, pos: Int) {
-        require(headSize % 2 == 0) { "RoPE requires even head size; got $headSize" }
-
-        if (precomputedRopeFreqs != null) {
-            applyRopeWithFreqs(qBuf, nHeads, headSize, pos, precomputedRopeFreqs)
-            applyRopeWithFreqs(kBuf, nKvHeads, headSize, pos, precomputedRopeFreqs)
-        } else {
-            val ropeStride = headSize / 2
-            applyRopeRotation(qBuf, nHeads, headSize, ropeDim, pos, ropeFreqBase, null, null, ropeStride)
-            applyRopeRotation(kBuf, nKvHeads, headSize, ropeDim, pos, ropeFreqBase, null, null, ropeStride)
-        }
-    }
-
-    /**
-     * Apply RoPE using precomputed inverse frequencies.
-     *
-     * For each pair index `p`, the angle is `pos * freqs[p]`.
-     * Rotation: out[2p] = in[2p] * cos(θ) - in[2p+1] * sin(θ)
-     *           out[2p+1] = in[2p] * sin(θ) + in[2p+1] * cos(θ)
-     */
-    private fun applyRopeWithFreqs(
-        buf: FloatArray,
-        numHeads: Int,
-        headSize: Int,
-        pos: Int,
-        freqs: FloatArray
-    ) {
-        val nPairs = freqs.size
-        for (h in 0 until numHeads) {
-            val headOffset = h * headSize
-            for (pair in 0 until nPairs) {
-                val angle = pos * freqs[pair]
-                val fcr = cos(angle)
-                val fci = sin(angle)
-                val i = pair * 2
-                val v0 = buf[headOffset + i]
-                val v1 = buf[headOffset + i + 1]
-                buf[headOffset + i] = (v0 * fcr - v1 * fci).toFloat()
-                buf[headOffset + i + 1] = (v0 * fci + v1 * fcr).toFloat()
-            }
-        }
-    }
-
-    private fun attentionGqa(layerIdx: Int, qBuf: FloatArray, pos: Int): FloatArray {
-        val out = FloatArray(dim)
-        val scale = 1f / sqrt(headSize.toDouble()).toFloat()
-        val scores = scoreBuffer
-
-        for (h in 0 until nHeads) {
-            val qHeadOffset = h * headSize
-            val kvHeadIdx = h / nHeadsPerKv
-            val kvHeadOffset = kvHeadIdx * headSize
-
-            for (t in 0..pos) {
-                var score = 0f
-                for (i in 0 until headSize) {
-                    score += qBuf[qHeadOffset + i] * cache.getKey(layerIdx, t, kvHeadOffset, i)
-                }
-                scores[t] = score * scale
-            }
-
-            softmaxInPlace(scores, pos + 1)
-
-            for (t in 0..pos) {
-                val weight = scores[t]
-                for (i in 0 until headSize) {
-                    out[qHeadOffset + i] += weight * cache.getValue(layerIdx, t, kvHeadOffset, i)
-                }
-            }
-        }
-        return out
-    }
-
-    private fun Tensor<T, Float>.expectFloatBuffer(): FloatArray {
-        val data = this.data
-        if (data is FloatArrayTensorData<*>) return data.buffer
-        return data.copyToFloatArray()
-    }
-}
diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt
deleted file mode 100644
index 9117f2e9..00000000
--- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt
+++ /dev/null
@@ -1,185 +0,0 @@
-package sk.ainet.models.apertus
-
-import kotlin.math.sqrt
-import kotlin.random.Random
-import sk.ainet.apps.llm.DecoderRuntime
-import sk.ainet.context.ExecutionContext
-import sk.ainet.io.gguf.dequant.DequantOps
-import sk.ainet.lang.nn.layers.Embedding
-import sk.ainet.lang.nn.normalization.RMSNormalization
-import sk.ainet.lang.tensor.Shape
-import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.tensor.matmul
-import sk.ainet.lang.tensor.plus
-import sk.ainet.lang.tensor.t
-import sk.ainet.lang.types.FP32
-
-/**
- * Apertus decoder runtime with **lazy dequantization**.
- *
- * Weight matrices (wq, wk, wv, wo, ffnUp, ffnDown, outputWeight) are stored
- * in their original quantized form ([QuantizedTensor]). Each layer dequantizes
- * its weights to FP32 on the fly during [runLayer], then discards the temporary.
- *
- * Memory profile for a 7B Q4_0 model:
- * - Resident: ~3.5 GB (quantized) + norms/embeddings in FP32 (~100 MB)
- * - Per-layer temporary: ~50 MB (one projection matrix at a time)
- * - vs. eager FP32: ~28 GB
- *
- * Trade-off: each token pays a dequantization cost per-layer. This is the same
- * approach used by llama.cpp and is well worth the 4-8x memory savings.
- */
-public class ApertusQuantizedRuntime(
-    private val ctx: ExecutionContext,
-    val weights: ApertusQuantizedRuntimeWeights,
-    private val attentionBackend: ApertusAttentionBackend<FP32>,
-    private val eps: Float = weights.metadata.rmsNormEps,
-    random: Random = Random.Default
-) : DecoderRuntime<FP32>(random) {
-
-    override val dim: Int = weights.metadata.embeddingLength
-    override val seqLen: Int = weights.metadata.contextLength
-    override val vocabSize: Int = weights.metadata.vocabSize
-    override val nLayers: Int = weights.layers.size
-    override val bosToken: Int = weights.metadata.bosTokenId
-
-    private val nHeads = weights.metadata.headCount
-    private val headDim = dim / nHeads
-    private val nKvHeads = weights.metadata.kvHeadCount
-
-    private val embedding = Embedding(
-        numEmbeddings = vocabSize,
-        embeddingDim = dim,
-        initWeight = weights.tokenEmbedding,
-        name = "token_embd"
-    )
-
-    private val outputNormLayer = RMSNormalization<FP32, Float>(
-        normalizedShape = intArrayOf(dim),
-        eps = eps.toDouble(),
-        name = "output_norm",
-        initWeight = weights.outputNorm
-    )
-
-    private val attnNorms = weights.layers.mapIndexed { i, layer ->
-        RMSNormalization<FP32, Float>(
-            normalizedShape = intArrayOf(dim),
-            eps = eps.toDouble(),
-            name = "layer_$i.attn_norm",
-            initWeight = layer.attnNorm
-        )
-    }
-
-    private val ffnNorms = weights.layers.mapIndexed { i, layer ->
-        RMSNormalization<FP32, Float>(
-            normalizedShape = intArrayOf(dim),
-            eps = eps.toDouble(),
-            name = "layer_$i.ffn_norm",
-            initWeight = layer.ffnNorm
-        )
-    }
-
-    override fun embedToken(tokenId: Int): Tensor<FP32, Float> =
-        embedding.forward(intArrayOf(tokenId), ctx)
-
-    override fun runLayer(layerIdx: Int, x: Tensor<FP32, Float>): Tensor<FP32, Float> {
-        val layer = weights.layers[layerIdx]
-
-        // 1. Attention norm
-        val attnNorm = attnNorms[layerIdx].forward(x, ctx)
-
-        // 2. QKV projections — dequant weight, matmul, discard temp
-        val q = attnNorm.matmul(dequant2D(layer.wq).t())
-        val k = attnNorm.matmul(dequant2D(layer.wk).t())
-        val v = attnNorm.matmul(dequant2D(layer.wv).t())
-
-        // 3. QK-norm: per-head RMSNorm on Q and K
-        val qNormed = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm)
-        val kNormed = applyPerHeadRMSNorm(k, nKvHeads, headDim, layer.kNorm)
-
-        // 4. Attention (RoPE + KV cache + GQA)
-        val attnOut = attentionBackend.attention(qNormed, kNormed, v, layerIdx, position)
-
-        // 5. Output projection + residual
-        val afterAttn = x + attnOut.matmul(dequant2D(layer.wo).t())
-
-        // 6. FFN norm
-        val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-
-        // 7. Ungated MLP: up → xIELU → down
-        val up = ffnNorm.matmul(dequant2D(layer.ffnUp).t())
-        val activated = applyXIELU(up, layer.xieluParams)
-        val ffnOut = activated.matmul(dequant2D(layer.ffnDown).t())
-
-        // 8. Residual
-        return afterAttn + ffnOut
-    }
-
-    override fun outputNorm(x: Tensor<FP32, Float>): Tensor<FP32, Float> =
-        outputNormLayer.forward(x, ctx)
-
-    override fun outputProject(x: Tensor<FP32, Float>): Tensor<FP32, Float> =
-        x.matmul(dequant2D(weights.outputWeight).t())
-
-    override fun resetState() {
-        attentionBackend.reset()
-    }
-
-    // ---- Dequantization ----
-
-    /**
-     * Dequantize a 2D quantized tensor to FP32.
-     *
-     * GGUF stores 2D tensors column-major with shape [out, in].
-     * We transpose to row-major [in, out] so that `.t()` in the caller
-     * gives the correct matmul orientation.
-     */
-    private fun dequant2D(qt: QuantizedTensor): Tensor<FP32, Float> {
-        val floats = qt.dequantToFloat()
-        return if (qt.shape.rank == 2) {
-            val rows = qt.shape[0]
-            val cols = qt.shape[1]
-            val transposed = DequantOps.transposeColumnMajorToRowMajor(floats, rows, cols)
-            ctx.fromFloatArray<FP32, Float>(Shape(cols, rows), FP32::class, transposed)
-        } else {
-            ctx.fromFloatArray<FP32, Float>(qt.shape, FP32::class, floats)
-        }
-    }
-
-    // ---- Apertus-specific helpers (same as ApertusRuntime) ----
-
-    private fun applyPerHeadRMSNorm(
-        x: Tensor<FP32, Float>,
-        numHeads: Int,
-        headDim: Int,
-        weight: Tensor<FP32, Float>
-    ): Tensor<FP32, Float> {
-        val buf = x.expectFloatBuffer().copyOf()
-        val w = weight.expectFloatBuffer()
-        val totalDim = numHeads * headDim
-        val batchSize = if (x.shape.rank == 2) x.shape[0] else 1
-
-        for (b in 0 until batchSize) {
-            val batchOffset = b * totalDim
-            for (h in 0 until numHeads) {
-                val headOffset = batchOffset + h * headDim
-                var sumSq = 0f
-                for (i in 0 until headDim) {
-                    val v = buf[headOffset + i]
-                    sumSq += v * v
-                }
-                val rms = sqrt(sumSq / headDim + eps)
-                for (i in 0 until headDim) {
-                    buf[headOffset + i] = (buf[headOffset + i] / rms) * w[i]
-                }
-            }
-        }
-        return ctx.fromFloatArray<FP32, Float>(x.shape, FP32::class, buf)
-    }
-
-    private fun applyXIELU(x: Tensor<FP32, Float>, params: ApertusXIELUParams): Tensor<FP32, Float> {
-        val buf = x.expectFloatBuffer().copyOf()
-        xielu(buf, params)
-        return ctx.fromFloatArray<FP32, Float>(x.shape, FP32::class, buf)
-    }
-}
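Review note on the deleted `dequant2D`: the layout conversion it relied on can be illustrated in isolation. The sketch below is a minimal stand-alone version, assuming plain `FloatArray` buffers; `toColumnMajor` and `toRowMajor` are hypothetical names for illustration and are not the `DequantOps` API used by the removed runtime.

```kotlin
// Hypothetical helpers for illustration only; not the DequantOps API.

/** Row-major [rows, cols] -> column-major buffer (element (r, c) lands at c * rows + r). */
fun toColumnMajor(rowMajor: FloatArray, rows: Int, cols: Int): FloatArray {
    require(rowMajor.size == rows * cols) { "buffer size must equal rows * cols" }
    val out = FloatArray(rowMajor.size)
    for (r in 0 until rows) {
        for (c in 0 until cols) {
            out[c * rows + r] = rowMajor[r * cols + c]
        }
    }
    return out
}

/** Inverse of [toColumnMajor]: column-major buffer -> row-major [rows, cols]. */
fun toRowMajor(colMajor: FloatArray, rows: Int, cols: Int): FloatArray {
    require(colMajor.size == rows * cols) { "buffer size must equal rows * cols" }
    val out = FloatArray(colMajor.size)
    for (r in 0 until rows) {
        for (c in 0 until cols) {
            out[r * cols + c] = colMajor[c * rows + r]
        }
    }
    return out
}

fun main() {
    val rowMajor = floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f) // [[1, 2, 3], [4, 5, 6]]
    val colMajor = toColumnMajor(rowMajor, rows = 2, cols = 3)
    println(colMajor.joinToString())                    // 1.0, 4.0, 2.0, 5.0, 3.0, 6.0
    println(toRowMajor(colMajor, rows = 2, cols = 3).contentEquals(rowMajor)) // true
}
```

The round trip mirrors what the deleted smoke test did when it built GGUF-layout weights from eager row-major data and expected both runtimes to see the same matrix.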
diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt
deleted file mode 100644
index fafea08a..00000000
--- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt
+++ /dev/null
@@ -1,242 +0,0 @@
-package sk.ainet.models.apertus
-
-import kotlin.math.exp
-import kotlin.math.ln
-import kotlin.math.min
-import kotlin.math.sqrt
-import kotlin.random.Random
-import sk.ainet.apps.llm.DecoderRuntime
-import sk.ainet.context.ExecutionContext
-import sk.ainet.lang.nn.layers.Embedding
-import sk.ainet.lang.tensor.Shape
-import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.tensor.matmul
-import sk.ainet.lang.tensor.plus
-import sk.ainet.lang.tensor.t
-import sk.ainet.lang.tensor.data.FloatArrayTensorData
-import sk.ainet.lang.nn.normalization.RMSNormalization
-import sk.ainet.lang.types.DType
-import kotlin.reflect.KClass
-
-/**
- * Apertus decoder runtime with pluggable attention backend.
- *
- * Key differences from LLaMA:
- * - **xIELU activation** with per-layer learned scalar parameters (replaces SiLU)
- * - **Ungated MLP** — only up_proj + down_proj, no gate_proj
- * - **QK-norm** — per-head RMSNorm on Q and K before RoPE
- *
- * Extends [DecoderRuntime] for shared forward/generate/sample logic.
- */
-@Deprecated(
-    message = "Use OptimizedLLMRuntime with apertusNetwork() instead. " +
-        "See docs/optimizable-LLM-NNs-DAG.md for migration guide.",
-    replaceWith = ReplaceWith(
-        "OptimizedLLMRuntime.create(apertusNetwork(config), tensors, resolver, ctx)",
-        "sk.ainet.apps.llm.OptimizedLLMRuntime"
-    )
-)
-public class ApertusRuntime<T : DType>(
-    private val ctx: ExecutionContext,
-    val weights: ApertusRuntimeWeights<T>,
-    private val attentionBackend: ApertusAttentionBackend<T>,
-    private val dtype: KClass<T>,
-    private val eps: Float = weights.metadata.rmsNormEps,
-    random: Random = Random.Default
-) : DecoderRuntime<T>(random) {
-
-    // ---- DecoderRuntime abstract properties ----
-    override val dim: Int = weights.metadata.embeddingLength
-    override val seqLen: Int = weights.metadata.contextLength
-    override val vocabSize: Int = weights.metadata.vocabSize
-    override val nLayers: Int = weights.layers.size
-    override val bosToken: Int = weights.metadata.bosTokenId
-
-    private val nHeads = weights.metadata.headCount
-    private val headDim = dim / nHeads
-    private val nKvHeads = weights.metadata.kvHeadCount
-    private val kvDim = nKvHeads * headDim
-
-    private val embedding = Embedding(
-        numEmbeddings = vocabSize,
-        embeddingDim = dim,
-        initWeight = weights.tokenEmbedding,
-        name = "token_embd"
-    )
-
-    private val outputNormLayer = RMSNormalization<T, Float>(
-        normalizedShape = intArrayOf(dim),
-        eps = eps.toDouble(),
-        name = "output_norm",
-        initWeight = weights.outputNorm
-    )
-
-    private val attnNorms = weights.layers.mapIndexed { i, layer ->
-        RMSNormalization<T, Float>(
-            normalizedShape = intArrayOf(dim),
-            eps = eps.toDouble(),
-            name = "layer_$i.attn_norm",
-            initWeight = layer.attnNorm
-        )
-    }
-
-    private val ffnNorms = weights.layers.mapIndexed { i, layer ->
-        RMSNormalization<T, Float>(
-            normalizedShape = intArrayOf(dim),
-            eps = eps.toDouble(),
-            name = "layer_$i.ffn_norm",
-            initWeight = layer.ffnNorm
-        )
-    }
-
-    private val outputWeightT: Tensor<T, Float> = weights.outputWeight.t()
-
-    // ---- DecoderRuntime template methods ----
-
-    override fun embedToken(tokenId: Int): Tensor<T, Float> =
-        embedding.forward(intArrayOf(tokenId), ctx)
-
-    override fun runLayer(layerIdx: Int, x: Tensor<T, Float>): Tensor<T, Float> {
-        val layer = weights.layers[layerIdx]
-
-        // 1. Attention norm
-        val attnNorm = attnNorms[layerIdx].forward(x, ctx)
-
-        // 2. QKV projections (transpose on the fly to avoid double-memory peak)
-        val q = attnNorm.matmul(layer.wq.t())
-        val k = attnNorm.matmul(layer.wk.t())
-        val v = attnNorm.matmul(layer.wv.t())
-
-        // 3. QK-norm: per-head RMSNorm on Q and K
-        val qNormed = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm)
-        val kNormed = applyPerHeadRMSNorm(k, nKvHeads, headDim, layer.kNorm)
-
-        // 4. Attention (RoPE + KV cache + GQA) — backend receives QK-normed tensors
-        val attnOut = attentionBackend.attention(qNormed, kNormed, v, layerIdx, position)
-
-        // 5. Output projection + residual
-        val afterAttn = x + attnOut.matmul(layer.wo.t())
-
-        // 6. FFN norm
-        val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-
-        // 7. Ungated MLP: up → xIELU → down
-        val up = ffnNorm.matmul(layer.ffnUp.t())
-        val activated = applyXIELU(up, layer.xieluParams)
-        val ffnOut = activated.matmul(layer.ffnDown.t())
-
-        // 8. Residual
-        return afterAttn + ffnOut
-    }
-
-    override fun outputNorm(x: Tensor<T, Float>): Tensor<T, Float> =
-        outputNormLayer.forward(x, ctx)
-
-    override fun outputProject(x: Tensor<T, Float>): Tensor<T, Float> =
-        x.matmul(outputWeightT)
-
-    override fun resetState() {
-        attentionBackend.reset()
-    }
-
-    // ---- Apertus-specific helpers ----
-
-    /**
-     * Apply per-head RMSNorm to Q or K tensor.
-     *
-     * Input shape: [batch, nHeads * headDim]
-     * Weight shape: [nHeads * headDim] (per-head, so each head has `headDim` weight values)
-     */
-    private fun applyPerHeadRMSNorm(
-        x: Tensor<T, Float>,
-        numHeads: Int,
-        headDim: Int,
-        weight: Tensor<T, Float>
-    ): Tensor<T, Float> {
-        val buf = x.expectFloatBuffer().copyOf()
-        val w = weight.expectFloatBuffer()
-        val totalDim = numHeads * headDim
-
-        // Handle batched input
-        val batchSize = if (x.shape.rank == 2) x.shape[0] else 1
-
-        for (b in 0 until batchSize) {
-            val batchOffset = b * totalDim
-            for (h in 0 until numHeads) {
-                val headOffset = batchOffset + h * headDim
-
-                // Compute RMS for this head
-                var sumSq = 0f
-                for (i in 0 until headDim) {
-                    val v = buf[headOffset + i]
-                    sumSq += v * v
-                }
-                val rms = sqrt(sumSq / headDim + eps)
-
-                // Normalize and scale (weight is per-head, shared across all heads)
-                for (i in 0 until headDim) {
-                    buf[headOffset + i] = (buf[headOffset + i] / rms) * w[i]
-                }
-            }
-        }
-
-        val shape = x.shape
-        return ctx.fromFloatArray<T, Float>(shape, dtype, buf)
-    }
-
-    /**
-     * Apply xIELU activation element-wise.
-     *
-     * xIELU formula:
-     * ```
-     * alpha_p_eff = softplus(alpha_p)       // ln(1 + exp(stored_alpha_p))
-     * alpha_n_eff = beta + softplus(alpha_n)
-     * if x > 0:  alpha_p_eff * x^2 + beta * x
-     * if x <= 0: (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x
-     * ```
-     */
-    private fun applyXIELU(x: Tensor<T, Float>, params: ApertusXIELUParams): Tensor<T, Float> {
-        val buf = x.expectFloatBuffer().copyOf()
-        xielu(buf, params)
-        return ctx.fromFloatArray<T, Float>(x.shape, dtype, buf)
-    }
-
-}
-
-// ========== xIELU Implementation ==========
-
-/**
- * In-place xIELU activation on a float buffer.
- *
- * Public for unit testing.
- */
-public fun xielu(buf: FloatArray, params: ApertusXIELUParams) {
-    val alphaPEff = softplus(params.alphaP)
-    val alphaNEff = params.beta + softplus(params.alphaN)
-
-    for (i in buf.indices) {
-        val x = buf[i]
-        buf[i] = if (x > 0f) {
-            alphaPEff * x * x + params.beta * x
-        } else {
-            val clamped = min(x, params.eps)
-            (expm1(clamped) - x) * alphaNEff + params.beta * x
-        }
-    }
-}
-
-/**
- * softplus(x) = ln(1 + exp(x))
- *
- * Uses a numerically stable formulation:
- * - For large x: softplus(x) ≈ x
- * - For small x: use the exact formula
- */
-public fun softplus(x: Float): Float {
-    return if (x > 20f) x else ln(1f + exp(x))
-}
-
-/**
- * exp(x) - 1, avoiding catastrophic cancellation near zero.
- */
-private fun expm1(x: Float): Float = kotlin.math.expm1(x.toDouble()).toFloat()
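Review note on the deleted `applyPerHeadRMSNorm`: the per-head normalization can be sketched without the tensor types. This is a minimal version on a flat `FloatArray`, assuming (as the loop body and the smoke tests' `Shape(headDim)` weights suggest) that one weight vector of length `headDim` is shared by all heads. The function name `perHeadRmsNorm` is hypothetical.

```kotlin
import kotlin.math.sqrt

/**
 * Per-head RMSNorm over a flat buffer laid out as [rows, numHeads * headDim].
 * Assumption: `weight` has length headDim and is shared across heads.
 */
fun perHeadRmsNorm(
    x: FloatArray,
    numHeads: Int,
    headDim: Int,
    weight: FloatArray,
    eps: Float
): FloatArray {
    val rowWidth = numHeads * headDim
    require(x.size % rowWidth == 0) { "input length must be a multiple of numHeads * headDim" }
    require(weight.size == headDim) { "weight must have headDim entries" }
    val out = x.copyOf()
    for (row in 0 until x.size / rowWidth) {
        for (h in 0 until numHeads) {
            val offset = row * rowWidth + h * headDim
            // Mean of squares over this head only, then the RMS with eps for stability.
            var sumSq = 0f
            for (i in 0 until headDim) sumSq += out[offset + i] * out[offset + i]
            val rms = sqrt(sumSq / headDim + eps)
            for (i in 0 until headDim) out[offset + i] = out[offset + i] / rms * weight[i]
        }
    }
    return out
}

fun main() {
    // Two heads of width 2; the second head is all zeros and stays zero thanks to eps.
    val y = perHeadRmsNorm(floatArrayOf(3f, 4f, 0f, 0f), 2, 2, floatArrayOf(1f, 1f), 1e-5f)
    println(y.joinToString())
}
```

With unit weights, each non-zero head comes out with a mean square close to 1, which is a convenient invariant to assert when porting this logic to the graph-based path.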
diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt
new file mode 100644
index 00000000..5c9a7afc
--- /dev/null
+++ b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt
@@ -0,0 +1,51 @@
+package sk.ainet.models.apertus
+
+import kotlin.math.exp
+import kotlin.math.ln
+import kotlin.math.min
+
+/**
+ * xIELU activation reference implementation.
+ *
+ * Mutates [buf] in place applying:
+ *
+ *   alpha_p_eff = softplus(alpha_p)
+ *   alpha_n_eff = beta + softplus(alpha_n)
+ *
+ *   if x > 0:  alpha_p_eff * x*x + beta * x
+ *   else:      (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x
+ *
+ * Apertus models carry per-layer scalar `alpha_p`, `alpha_n`, `beta`, `eps`
+ * weights (see [ApertusXIELUParams]). The production network uses an
+ * equivalent op emitted by `apertusNetwork()` through SKaiNET's compute
+ * graph; this Kotlin reference exists so callers (notably tests) can
+ * verify the math without spinning up a runtime, and so future xIELU
+ * implementations have a single golden reference to point at.
+ *
+ * Public for unit testing.
+ */
+public fun xielu(buf: FloatArray, params: ApertusXIELUParams) {
+    val alphaPEff = softplus(params.alphaP)
+    val alphaNEff = params.beta + softplus(params.alphaN)
+
+    for (i in buf.indices) {
+        val x = buf[i]
+        buf[i] = if (x > 0f) {
+            alphaPEff * x * x + params.beta * x
+        } else {
+            val clamped = min(x, params.eps)
+            (expm1(clamped) - x) * alphaNEff + params.beta * x
+        }
+    }
+}
+
+/**
+ * `softplus(x) = ln(1 + exp(x))`. For `x > 20` it returns `x` directly: the
+ * two agree to within Float precision there, and `exp` cannot overflow.
+ */
+public fun softplus(x: Float): Float {
+    return if (x > 20f) x else ln(1f + exp(x))
+}
+
+/** `exp(x) - 1` without catastrophic cancellation near zero. */
+private fun expm1(x: Float): Float = kotlin.math.expm1(x.toDouble()).toFloat()
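Review note on the extracted reference: a caller could check the math along the lines below. This is a self-contained sketch, so it re-states the arithmetic rather than importing the module. `XieluParams`, `softplusRef` and `xieluRef` are hypothetical stand-ins for `ApertusXIELUParams`, `softplus` and `xielu`, and the copy-returning signature is an illustrative choice (the real function mutates its buffer in place).

```kotlin
import kotlin.math.abs
import kotlin.math.exp
import kotlin.math.expm1
import kotlin.math.ln
import kotlin.math.min

// Hypothetical stand-in for ApertusXIELUParams: only the four fields xielu() reads.
data class XieluParams(val alphaP: Float, val alphaN: Float, val beta: Float, val eps: Float)

fun softplusRef(x: Float): Float = if (x > 20f) x else ln(1f + exp(x))

// Same arithmetic as the reference above, but returns a new array instead of mutating.
fun xieluRef(input: FloatArray, p: XieluParams): FloatArray {
    val alphaPEff = softplusRef(p.alphaP)
    val alphaNEff = p.beta + softplusRef(p.alphaN)
    return FloatArray(input.size) { i ->
        val x = input[i]
        if (x > 0f) {
            alphaPEff * x * x + p.beta * x
        } else {
            val clamped = min(x, p.eps)
            (expm1(clamped.toDouble()).toFloat() - x) * alphaNEff + p.beta * x
        }
    }
}

fun main() {
    // Parameter values borrowed from the deleted smoke tests.
    val p = XieluParams(alphaP = -0.5f, alphaN = -0.3f, beta = 0.8f, eps = -5.0f)
    val out = xieluRef(floatArrayOf(-6f, -1f, 0f, 2f), p)
    println(out.joinToString())
    // Positive branch must match the closed form alpha_p_eff * x^2 + beta * x.
    println(abs(out[3] - (softplusRef(p.alphaP) * 4f + p.beta * 2f)) < 1e-5f)
}
```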
diff --git a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt b/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt
deleted file mode 100644
index e375641a..00000000
--- a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt
+++ /dev/null
@@ -1,275 +0,0 @@
-package sk.ainet.models.apertus
-
-import sk.ainet.context.DefaultDataExecutionContext
-import sk.ainet.io.gguf.GGMLQuantizationType
-import sk.ainet.lang.tensor.Shape
-import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.types.FP32
-import kotlin.test.Test
-import kotlin.test.assertEquals
-import kotlin.test.assertTrue
-
-/**
- * Smoke test: builds a tiny Apertus model with quantized (simulated) weights
- * and verifies the lazy-dequant runtime produces finite logits and generates tokens.
- *
- * Uses F32-typed QuantizedTensors stored in GGUF column-major format as an identity
- * dequantization, so we can verify correctness against the eager FP32 runtime.
- */
-class ApertusQuantizedRuntimeSmokeTest {
-
-    private val dim = 8
-    private val ffDim = 16
-    private val vocabSize = 16
-    private val nHeads = 2
-    private val kvHeads = 2
-    private val headDim = dim / nHeads
-    private val kvDim = kvHeads * headDim
-
-    private val ctx = DefaultDataExecutionContext()
-
-    private fun ones(shape: Shape): Tensor<FP32, Float> {
-        val values = FloatArray(shape.volume) { 0.01f }
-        return ctx.fromFloatArray(shape, FP32::class, values)
-    }
-
-    private fun randnArray(shape: Shape, seed: Int): FloatArray {
-        val rng = kotlin.random.Random(seed)
-        return FloatArray(shape.volume) { (rng.nextFloat() - 0.5f) * 0.1f }
-    }
-
-    private fun randn(shape: Shape, seed: Int = 42): Tensor<FP32, Float> =
-        ctx.fromFloatArray(shape, FP32::class, randnArray(shape, seed))
-
-    private fun floatArrayToBytes(data: FloatArray): ByteArray {
-        val bytes = ByteArray(data.size * 4)
-        for (i in data.indices) {
-            val bits = data[i].toRawBits()
-            bytes[i * 4] = (bits and 0xFF).toByte()
-            bytes[i * 4 + 1] = ((bits shr 8) and 0xFF).toByte()
-            bytes[i * 4 + 2] = ((bits shr 16) and 0xFF).toByte()
-            bytes[i * 4 + 3] = ((bits shr 24) and 0xFF).toByte()
-        }
-        return bytes
-    }
-
-    /**
-     * Create a QuantizedTensor in GGUF column-major format from an eager row-major weight.
-     *
-     * For 2D tensors, GGUF stores shape [ne0, ne1] where ne0 = cols (fast dim),
-     * ne1 = rows, with data in column-major order. [dequant2D] will transpose this
-     * back to row-major [ne1, ne0] and then the runtime uses `.t()`.
-     *
-     * The eager runtime stores shape [rows, cols] in row-major and also uses `.t()`.
-     * Both paths produce the same effective matrix in the matmul.
-     *
-     * @param eagerShape The shape the eager runtime uses (row-major [rows, cols])
-     * @param rowMajorData The eager row-major float data
-     */
-    private fun asQuantizedGGUF(eagerShape: Shape, rowMajorData: FloatArray): QuantizedTensor {
-        if (eagerShape.rank == 2) {
-            val rows = eagerShape[0]
-            val cols = eagerShape[1]
-            // Convert row-major [rows, cols] to column-major [cols, rows] (GGUF layout)
-            val colMajor = FloatArray(rowMajorData.size)
-            for (r in 0 until rows) {
-                for (c in 0 until cols) {
-                    colMajor[c * rows + r] = rowMajorData[r * cols + c]
-                }
-            }
-            return QuantizedTensor(
-                data = floatArrayToBytes(colMajor),
-                quantType = GGMLQuantizationType.F32,
-                shape = Shape(cols, rows), // GGUF shape: [ne0=cols, ne1=rows]
-                nElements = eagerShape.volume
-            )
-        }
-        return QuantizedTensor(
-            data = floatArrayToBytes(rowMajorData),
-            quantType = GGMLQuantizationType.F32,
-            shape = eagerShape,
-            nElements = eagerShape.volume
-        )
-    }
-
-    /** Convenience: generate random data and create a GGUF-format QuantizedTensor. */
-    private fun randnQuantized(eagerShape: Shape, seed: Int): QuantizedTensor =
-        asQuantizedGGUF(eagerShape, randnArray(eagerShape, seed))
-
-    private fun buildMetadata() = ApertusModelMetadata(
-        architecture = "apertus",
-        embeddingLength = dim,
-        contextLength = 32,
-        blockCount = 1,
-        headCount = nHeads,
-        kvHeadCount = kvHeads,
-        feedForwardLength = ffDim,
-        ropeDimensionCount = headDim,
-        vocabSize = vocabSize,
-        ropeTheta = 12000000f,
-        qkNorm = true,
-        hiddenAct = "xielu",
-        tiedEmbeddings = false
-    )
-
-    private val xieluParams = ApertusXIELUParams(-0.5f, -0.3f, 0.8f, -5.0f)
-
-    @Test
-    fun quantizedForwardPassProducesFiniteLogits() {
-        val metadata = buildMetadata()
-        val layer = ApertusQuantizedLayerWeights(
-            attnNorm = ones(Shape(dim)),
-            qNorm = ones(Shape(headDim)),
-            kNorm = ones(Shape(headDim)),
-            ffnNorm = ones(Shape(dim)),
-            xieluParams = xieluParams,
-            wq = randnQuantized(Shape(dim, dim), seed = 1),
-            wk = randnQuantized(Shape(kvDim, dim), seed = 2),
-            wv = randnQuantized(Shape(kvDim, dim), seed = 3),
-            wo = randnQuantized(Shape(dim, dim), seed = 4),
-            ffnUp = randnQuantized(Shape(ffDim, dim), seed = 6),
-            ffnDown = randnQuantized(Shape(dim, ffDim), seed = 5)
-        )
-        val weights = ApertusQuantizedRuntimeWeights(
-            metadata = metadata,
-            tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10),
-            layers = listOf(layer),
-            outputNorm = ones(Shape(dim)),
-            outputWeight = randnQuantized(Shape(vocabSize, dim), seed = 11)
-        )
-        val backend = ApertusCpuAttentionBackend<FP32>(
-            ctx = ctx, metadata = metadata, dtype = FP32::class, ropeFreqBase = 12000000f
-        )
-        val runtime = ApertusQuantizedRuntime(
-            ctx = ctx, weights = weights, attentionBackend = backend
-        )
-
-        val logits = runtime.forward(1)
-        assertEquals(2, logits.shape.rank, "logits should be 2D")
-        assertEquals(vocabSize, logits.shape[1], "logits dim should match vocab size")
-        val buf = logits.data.copyToFloatArray()
-        for (i in buf.indices) {
-            assertTrue(buf[i].isFinite(), "logit[$i] = ${buf[i]} is not finite")
-        }
-    }
-
-    @Test
-    fun quantizedAndEagerProduceSameLogits() {
-        val metadata = buildMetadata()
-        val tokenEmb = randn(Shape(vocabSize, dim), seed = 10)
-        val outputNorm = ones(Shape(dim))
-
-        // Generate weight data arrays (same seeds for both runtimes)
-        val wqData = randnArray(Shape(dim, dim), seed = 1)
-        val wkData = randnArray(Shape(kvDim, dim), seed = 2)
-        val wvData = randnArray(Shape(kvDim, dim), seed = 3)
-        val woData = randnArray(Shape(dim, dim), seed = 4)
-        val ffnDownData = randnArray(Shape(dim, ffDim), seed = 5)
-        val ffnUpData = randnArray(Shape(ffDim, dim), seed = 6)
-        val outputWData = randnArray(Shape(vocabSize, dim), seed = 11)
-
-        // --- Eager FP32 runtime ---
-        val eagerLayer = ApertusLayerWeights(
-            attnNorm = ones(Shape(dim)),
-            wq = ctx.fromFloatArray(Shape(dim, dim), FP32::class, wqData.copyOf()),
-            wk = ctx.fromFloatArray(Shape(kvDim, dim), FP32::class, wkData.copyOf()),
-            wv = ctx.fromFloatArray(Shape(kvDim, dim), FP32::class, wvData.copyOf()),
-            wo = ctx.fromFloatArray(Shape(dim, dim), FP32::class, woData.copyOf()),
-            qNorm = ones(Shape(headDim)),
-            kNorm = ones(Shape(headDim)),
-            ffnNorm = ones(Shape(dim)),
-            ffnDown = ctx.fromFloatArray(Shape(dim, ffDim), FP32::class, ffnDownData.copyOf()),
-            ffnUp = ctx.fromFloatArray(Shape(ffDim, dim), FP32::class, ffnUpData.copyOf()),
-            xieluParams = xieluParams
-        )
-        val eagerWeights = ApertusRuntimeWeights(
-            metadata = metadata, tokenEmbedding = tokenEmb,
-            layers = listOf(eagerLayer), outputNorm = outputNorm,
-            outputWeight = ctx.fromFloatArray(Shape(vocabSize, dim), FP32::class, outputWData.copyOf())
-        )
-        val eagerBackend = ApertusCpuAttentionBackend<FP32>(
-            ctx = ctx, weights = eagerWeights, dtype = FP32::class
-        )
-        val eagerRuntime = ApertusRuntime(
-            ctx = ctx, weights = eagerWeights,
-            attentionBackend = eagerBackend, dtype = FP32::class
-        )
-
-        // --- Quantized runtime (F32 bytes = identity dequant, GGUF column-major layout) ---
-        val qLayer = ApertusQuantizedLayerWeights(
-            attnNorm = ones(Shape(dim)),
-            qNorm = ones(Shape(headDim)),
-            kNorm = ones(Shape(headDim)),
-            ffnNorm = ones(Shape(dim)),
-            xieluParams = xieluParams,
-            wq = asQuantizedGGUF(Shape(dim, dim), wqData.copyOf()),
-            wk = asQuantizedGGUF(Shape(kvDim, dim), wkData.copyOf()),
-            wv = asQuantizedGGUF(Shape(kvDim, dim), wvData.copyOf()),
-            wo = asQuantizedGGUF(Shape(dim, dim), woData.copyOf()),
-            ffnUp = asQuantizedGGUF(Shape(ffDim, dim), ffnUpData.copyOf()),
-            ffnDown = asQuantizedGGUF(Shape(dim, ffDim), ffnDownData.copyOf())
-        )
-        val qWeights = ApertusQuantizedRuntimeWeights(
-            metadata = metadata, tokenEmbedding = tokenEmb,
-            layers = listOf(qLayer), outputNorm = outputNorm,
-            outputWeight = asQuantizedGGUF(Shape(vocabSize, dim), outputWData.copyOf())
-        )
-        val qBackend = ApertusCpuAttentionBackend<FP32>(
-            ctx = ctx, metadata = metadata, dtype = FP32::class
-        )
-        val qRuntime = ApertusQuantizedRuntime(
-            ctx = ctx, weights = qWeights, attentionBackend = qBackend
-        )
-
-        // Compare logits
-        val eagerLogits = eagerRuntime.forward(1).data.copyToFloatArray()
-        val quantLogits = qRuntime.forward(1).data.copyToFloatArray()
-
-        assertEquals(eagerLogits.size, quantLogits.size, "logit sizes should match")
-        for (i in eagerLogits.indices) {
-            val diff = kotlin.math.abs(eagerLogits[i] - quantLogits[i])
-            assertTrue(diff < 1e-4f, "logit[$i] diff=$diff: eager=${eagerLogits[i]} vs quant=${quantLogits[i]}")
-        }
-    }
-
-    @Test
-    fun quantizedGenerateProducesTokens() {
-        val metadata = buildMetadata()
-        val layer = ApertusQuantizedLayerWeights(
-            attnNorm = ones(Shape(dim)),
-            qNorm = ones(Shape(headDim)),
-            kNorm = ones(Shape(headDim)),
-            ffnNorm = ones(Shape(dim)),
-            xieluParams = xieluParams,
-            wq = randnQuantized(Shape(dim, dim), seed = 1),
-            wk = randnQuantized(Shape(kvDim, dim), seed = 2),
-            wv = randnQuantized(Shape(kvDim, dim), seed = 3),
-            wo = randnQuantized(Shape(dim, dim), seed = 4),
-            ffnUp = randnQuantized(Shape(ffDim, dim), seed = 6),
-            ffnDown = randnQuantized(Shape(dim, ffDim), seed = 5)
-        )
-        val weights = ApertusQuantizedRuntimeWeights(
-            metadata = metadata,
-            tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10),
-            layers = listOf(layer),
-            outputNorm = ones(Shape(dim)),
-            outputWeight = randnQuantized(Shape(vocabSize, dim), seed = 11)
-        )
-        val backend = ApertusCpuAttentionBackend<FP32>(
-            ctx = ctx, metadata = metadata, dtype = FP32::class
-        )
-        val runtime = ApertusQuantizedRuntime(
-            ctx = ctx, weights = weights, attentionBackend = backend
-        )
-
-        val generated = mutableListOf<Int>()
-        runtime.generate(
-            prompt = intArrayOf(1, 5, 3), steps = 4, temperature = 1.0f
-        ) { generated.add(it) }
-
-        assertEquals(4, generated.size, "Should generate exactly 4 tokens")
-        for (tokenId in generated) {
-            assertTrue(tokenId in 0 until vocabSize, "Token $tokenId should be in [0, $vocabSize)")
-        }
-    }
-}
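Review note on the deleted test's `floatArrayToBytes`: the "F32 as identity dequantization" trick depends on a lossless little-endian round trip. The sketch below shows both directions. The decoding side is an assumption for illustration, since the internals of `QuantizedTensor.dequantToFloat()` are not part of this diff, and both function names are hypothetical.

```kotlin
/** Encode floats as little-endian IEEE 754 bytes, as the deleted test helper did. */
fun floatsToLittleEndianBytes(data: FloatArray): ByteArray {
    val bytes = ByteArray(data.size * 4)
    for (i in data.indices) {
        val bits = data[i].toRawBits()
        bytes[i * 4] = (bits and 0xFF).toByte()
        bytes[i * 4 + 1] = ((bits shr 8) and 0xFF).toByte()
        bytes[i * 4 + 2] = ((bits shr 16) and 0xFF).toByte()
        bytes[i * 4 + 3] = ((bits shr 24) and 0xFF).toByte()
    }
    return bytes
}

/** Assumed inverse: decode little-endian bytes back to floats. */
fun littleEndianBytesToFloats(bytes: ByteArray): FloatArray {
    require(bytes.size % 4 == 0) { "byte count must be a multiple of 4" }
    return FloatArray(bytes.size / 4) { i ->
        // Mask each byte to undo sign extension before reassembling the 32-bit pattern.
        val b0 = bytes[i * 4].toInt() and 0xFF
        val b1 = bytes[i * 4 + 1].toInt() and 0xFF
        val b2 = bytes[i * 4 + 2].toInt() and 0xFF
        val b3 = bytes[i * 4 + 3].toInt() and 0xFF
        Float.fromBits(b0 or (b1 shl 8) or (b2 shl 16) or (b3 shl 24))
    }
}

fun main() {
    val original = floatArrayOf(1.0f, -0.25f, 0f)
    val bytes = floatsToLittleEndianBytes(original)
    println(littleEndianBytesToFloats(bytes).contentEquals(original)) // true
}
```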
diff --git a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt b/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt
deleted file mode 100644
index bf9381f0..00000000
--- a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt
+++ /dev/null
@@ -1,175 +0,0 @@
-package sk.ainet.models.apertus
-
-import sk.ainet.context.DefaultDataExecutionContext
-import sk.ainet.lang.tensor.Shape
-import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.types.FP32
-import kotlin.test.Test
-import kotlin.test.assertEquals
-import kotlin.test.assertTrue
-
-/**
- * Smoke test: builds a tiny Apertus model (dim=8, 1 layer, vocab=16)
- * with random weights and verifies that forward pass produces finite logits.
- */
-class ApertusRuntimeSmokeTest {
-
-    private val dim = 8
-    private val ffDim = 16
-    private val vocabSize = 16
-    private val nHeads = 2
-    private val kvHeads = 2
-    private val headDim = dim / nHeads
-    private val kvDim = kvHeads * headDim
-
-    private val ctx = DefaultDataExecutionContext()
-
-    private fun ones(shape: Shape): Tensor<FP32, Float> {
-        val values = FloatArray(shape.volume) { 0.01f }
-        return ctx.fromFloatArray(shape, FP32::class, values)
-    }
-
-    private fun randn(shape: Shape, seed: Int = 42): Tensor<FP32, Float> {
-        val rng = kotlin.random.Random(seed)
-        val values = FloatArray(shape.volume) { (rng.nextFloat() - 0.5f) * 0.1f }
-        return ctx.fromFloatArray(shape, FP32::class, values)
-    }
-
-    @Test
-    fun forwardPassProducesFiniteLogits() {
-        val metadata = ApertusModelMetadata(
-            architecture = "apertus",
-            embeddingLength = dim,
-            contextLength = 32,
-            blockCount = 1,
-            headCount = nHeads,
-            kvHeadCount = kvHeads,
-            feedForwardLength = ffDim,
-            ropeDimensionCount = headDim,
-            vocabSize = vocabSize,
-            ropeTheta = 12000000f,
-            qkNorm = true,
-            hiddenAct = "xielu",
-            tiedEmbeddings = false
-        )
-
-        val layer = ApertusLayerWeights(
-            attnNorm = ones(Shape(dim)),
-            wq = randn(Shape(dim, dim), seed = 1),
-            wk = randn(Shape(kvDim, dim), seed = 2),
-            wv = randn(Shape(kvDim, dim), seed = 3),
-            wo = randn(Shape(dim, dim), seed = 4),
-            qNorm = ones(Shape(headDim)),
-            kNorm = ones(Shape(headDim)),
-            ffnNorm = ones(Shape(dim)),
-            ffnDown = randn(Shape(dim, ffDim), seed = 5),
-            ffnUp = randn(Shape(ffDim, dim), seed = 6),
-            xieluParams = ApertusXIELUParams(
-                alphaP = -0.5f,
-                alphaN = -0.3f,
-                beta = 0.8f,
-                eps = -5.0f
-            )
-        )
-
-        val weights = ApertusRuntimeWeights(
-            metadata = metadata,
-            tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10),
-            layers = listOf(layer),
-            outputNorm = ones(Shape(dim)),
-            outputWeight = randn(Shape(vocabSize, dim), seed = 11)
-        )
-
-        val backend = ApertusCpuAttentionBackend<FP32>(
-            ctx = ctx,
-            weights = weights,
-            dtype = FP32::class,
-            ropeFreqBase = 12000000f
-        )
-
-        val runtime = ApertusRuntime(
-            ctx = ctx,
-            weights = weights,
-            attentionBackend = backend,
-            dtype = FP32::class
-        )
-
-        // Forward pass with token ID 1 (BOS)
-        val logits = runtime.forward(1)
-
-        // Verify shape: should be [1, vocabSize]
-        assertEquals(2, logits.shape.rank, "logits should be 2D")
-        assertEquals(vocabSize, logits.shape[1], "logits dim should match vocab size")
-
-        // Verify all values are finite
-        val buf = logits.data.copyToFloatArray()
-        for (i in buf.indices) {
-            assertTrue(buf[i].isFinite(), "logit[$i] = ${buf[i]} is not finite")
-        }
-    }
-
-    @Test
-    fun generateProducesTokens() {
-        val metadata = ApertusModelMetadata(
-            architecture = "apertus",
-            embeddingLength = dim,
-            contextLength = 32,
-            blockCount = 1,
-            headCount = nHeads,
-            kvHeadCount = kvHeads,
-            feedForwardLength = ffDim,
-            ropeDimensionCount = headDim,
-            vocabSize = vocabSize,
-            ropeTheta = 12000000f
-        )
-
-        val layer = ApertusLayerWeights(
-            attnNorm = ones(Shape(dim)),
-            wq = randn(Shape(dim, dim), seed = 1),
-            wk = randn(Shape(kvDim, dim), seed = 2),
-            wv = randn(Shape(kvDim, dim), seed = 3),
-            wo = randn(Shape(dim, dim), seed = 4),
-            qNorm = ones(Shape(headDim)),
-            kNorm = ones(Shape(headDim)),
-            ffnNorm = ones(Shape(dim)),
-            ffnDown = randn(Shape(dim, ffDim), seed = 5),
-            ffnUp = randn(Shape(ffDim, dim), seed = 6),
-            xieluParams = ApertusXIELUParams(-0.5f, -0.3f, 0.8f, -5.0f)
-        )
-
-        val weights = ApertusRuntimeWeights(
-            metadata = metadata,
-            tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10),
-            layers = listOf(layer),
-            outputNorm = ones(Shape(dim)),
-            outputWeight = randn(Shape(vocabSize, dim), seed = 11)
-        )
-
-        val backend = ApertusCpuAttentionBackend<FP32>(
-            ctx = ctx,
-            weights = weights,
-            dtype = FP32::class
-        )
-
-        val runtime = ApertusRuntime(
-            ctx = ctx,
-            weights = weights,
-            attentionBackend = backend,
-            dtype = FP32::class
-        )
-
-        val generated = mutableListOf<Int>()
-        runtime.generate(
-            prompt = intArrayOf(1, 5, 3),
-            steps = 4,
-            temperature = 1.0f
-        ) { tokenId ->
-            generated.add(tokenId)
-        }
-
-        assertEquals(4, generated.size, "Should generate exactly 4 tokens")
-        for (tokenId in generated) {
-            assertTrue(tokenId in 0 until vocabSize, "Token $tokenId should be in [0, $vocabSize)")
-        }
-    }
-}