From 8a7e0ff3196e2744db485d1a53b5ba15657aa807 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 2 May 2026 11:32:22 +0200
Subject: [PATCH] =?UTF-8?q?chore(apertus):=20close=20out=20rollout=20?=
 =?UTF-8?q?=E2=80=94=20remove=20deprecated=20runtimes?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes the Apertus rollout (APERTUS_ROLLOUT.md) at 3-of-3 implementation
PRs (routing #92, chat-template docs #93, tool calling #94), as planned
in #91. Drops the optional PR 4 (rebuild kapertus-cli): the unified
skainet-cli already covers Apertus end-to-end after #92, and the
workspace direction (per `81f3506` deleting kqwen / kvoxtral / kapertus
CLIs) is consolidation rather than per-model binaries.

Also lands the deprecated-runtime cleanup that the rollout deferred.
After PR 1 made `OptimizedLLMRuntime + apertusNetwork()` the canonical
path, the hand-coded `ApertusRuntime` and its quantized variant served
no production callers; both were flagged @Deprecated in the wake of #92.
Removing them now rather than maintaining stale code through a separate
deprecation cycle.

Deleted:
- llm-inference/apertus/.../ApertusRuntime.kt: hand-coded decoder
  runtime. Replaced by OptimizedLLMRuntime + apertusNetwork() per #92.
- llm-inference/apertus/.../ApertusQuantizedRuntime.kt: lazy-dequant
  variant. Same canonical replacement path; QuantPolicy.NATIVE_OPTIMIZED
  through the unified loader covers the same memory profile.
- llm-inference/apertus/.../ApertusAttentionBackend.kt: interface used
  only by the two deleted runtimes.
- llm-inference/apertus/.../ApertusCpuAttentionBackend.kt:
  implementation, used only by the two deleted runtimes.
- llm-inference/apertus/.../ApertusRuntimeSmokeTest.kt: exercised the
  deleted ApertusRuntime.
- llm-inference/apertus/.../ApertusQuantizedRuntimeSmokeTest.kt:
  exercised the deleted ApertusQuantizedRuntime.

Extracted (kept):
- llm-inference/apertus/.../ApertusXIELU.kt (new): the `xielu()` and
  `softplus()` reference activation helpers were public functions in
  ApertusRuntime.kt and are still useful as a numerical reference
  (ApertusXIELUTest validates the math, and future xIELU implementations
  can point at this file as the golden reference). Pulled out as a
  standalone activation module so the test keeps compiling after
  ApertusRuntime is gone.

Untouched (still on the production path):
- ApertusNetworkDef.kt: the apertusNetwork() DSL with xIELU op, QK-Norm,
  ungated FFN.
- ApertusNetworkLoader.kt: module-build entry point.
- ApertusWeightLoader.kt + ApertusSafeTensorsLoader.kt: GGUF +
  SafeTensors ingestion. Still used by both ApertusNetworkLoader and
  (in transition) ApertusIngestion's loadQuantized* methods.
- ApertusRuntimeWeights.kt: data classes (ApertusModelMetadata,
  ApertusLayerWeights, ApertusXIELUParams). Used by the network path.
- ApertusIngestion.kt (kapertus runtime): thin facade. Its
  loadQuantized* methods reference ApertusQuantizedRuntimeWeights, which
  lives in the (still-extant) ApertusWeightLoader codepath. Verified to
  compile cleanly after this PR.

Stale code-comment references to "ApertusRuntime" remain in the kdocs of
OptimizedLLMRuntime.kt and llm-core's OutputEquivalenceTest.kt. They
describe the migration history and are not load-bearing; left for a
future docs sweep.

APERTUS_ROLLOUT.md rewritten as a closure document (status: complete,
summary of the four merged PRs, what was dropped from PR 4, post-cleanup
test footprint).

Verification:
- `:llm-inference:apertus:jvmTest`: 12/12 (ConfigParser 6, XIELU 6).
- `:llm-agent:jvmTest --tests '*Apertus*'`: 21/21 (ChatTemplate 10,
  ParserStrategy 11).
- `:llm-runtime:kapertus:compileKotlinJvm`,
  `:llm-apps:skainet-cli:compileKotlin`,
  `:llm-core:compileTestKotlinJvm`: all green after the deletes.

Total Apertus test footprint after this commit: 33 tests, all green.
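
Why a numerical reference for `softplus()` is worth keeping can be
illustrated with a small standalone sketch. This is not the code in
ApertusXIELU.kt; the function names below are hypothetical and the only
assumption is the textbook definition softplus(x) = ln(1 + e^x). It
compares the direct formula with an algebraically equivalent form that
never exponentiates a positive number.

```kotlin
import kotlin.math.abs
import kotlin.math.exp
import kotlin.math.ln
import kotlin.math.ln1p
import kotlin.math.max

// Hypothetical helpers for illustration only; they are NOT the
// functions extracted into ApertusXIELU.kt.

/** Textbook softplus, ln(1 + e^x). In Float, e^x overflows for large x. */
fun naiveSoftplus(x: Float): Float = ln(1f + exp(x))

/**
 * Equivalent rewrite: max(x, 0) + ln(1 + e^(-|x|)).
 * The exponent is never positive, so exp() cannot overflow.
 */
fun stableSoftplus(x: Float): Float = max(x, 0f) + ln1p(exp(-abs(x)))

fun main() {
    // Print both variants side by side over a range of inputs.
    for (x in floatArrayOf(-100f, -1f, 0f, 1f, 100f)) {
        println("x=$x naive=${naiveSoftplus(x)} stable=${stableSoftplus(x)}")
    }
}
```

A test against a reference like this catches an optimized activation
that silently saturates or overflows on extreme inputs, which is
presumably the role ApertusXIELUTest plays for the extracted helpers.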
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 APERTUS_ROLLOUT.md                            | 134 +++------
 .../models/apertus/ApertusAttentionBackend.kt |  51 ----
 .../apertus/ApertusCpuAttentionBackend.kt     | 242 ---------------
 .../models/apertus/ApertusQuantizedRuntime.kt | 185 ------------
 .../sk/ainet/models/apertus/ApertusRuntime.kt | 242 ---------------
 .../sk/ainet/models/apertus/ApertusXIELU.kt   |  51 ++++
 .../ApertusQuantizedRuntimeSmokeTest.kt       | 275 ------------------
 .../models/apertus/ApertusRuntimeSmokeTest.kt | 175 -----------
 8 files changed, 85 insertions(+), 1270 deletions(-)
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt
 delete mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt
 create mode 100644 llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt
 delete mode 100644 llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt
 delete mode 100644 llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt

diff --git a/APERTUS_ROLLOUT.md b/APERTUS_ROLLOUT.md
index 2e782ee6..742973b1 100644
--- a/APERTUS_ROLLOUT.md
+++ b/APERTUS_ROLLOUT.md
@@ -1,116 +1,50 @@
-# Apertus Support Rollout
+# Apertus Support Rollout — COMPLETE
 
-**Status:** PR 3 in flight (tool-calling support).
-**Owner:** unassigned.
-**Plan PR:** #91 (merged). PR 1: #92 (merged). PR 2: #93 (merged).
+**Status:** complete (3 of 3 PRs merged + deprecated-runtime cleanup).
+**Plan PR:** #91. **Implementation PRs:** #92 (routing), #93 (chat-template docs), #94 (tool calling).
-## Context +## Summary -The Apertus model (Swiss AI / EPFL multilingual decoder-only transformer) is **architecturally complete in the transformers library layer** but has three integration gaps that make it semi-broken end-to-end today: +Apertus (Swiss AI / EPFL multilingual decoder-only transformer) reached production parity with kllama and kgemma over four PRs landed 2026-05-01 and 2026-05-02: -1. **Silent correctness bug in `skainet-cli`.** Lines 168–216 of `llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt` route every non-Gemma model — including Apertus — through the deprecated `LlamaRuntime`. Apertus uses **xIELU** activation, **QK-Norm** (RMSNorm on Q and K before RoPE), and **ungated FFN** (no `gate_proj`); none of those branches exist in `LlamaRuntime`. Inference completes, but the logits diverge from what the Apertus checkpoint actually wants. The output is wrong on a level the user can't easily catch unless they compare to a reference. -2. **Tool calling is OFF.** `llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt:66` lists Apertus as `("apertus", "Apertus", false, "chatml")` — `supportsToolCalling=false`, `chatTemplateFamily="chatml"` (a guess). There is no `ApertusChatTemplate.kt`, no `ApertusToolCallingSupport`, and no entry in `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ToolCallingSupportResolver.kt:16`. Apertus models fall back to `GenericToolCallingSupport`, which doesn't know the model's actual prompt format. -3. **`kapertus` runtime is a 1-file stub.** Commit `81f3506` (`chore: remove deprecated-runtime CLIs (kqwen-only / kapertus / kvoxtral)`) deleted the previous `kapertus` CLI's `Main.kt` (~292 lines). What remains in `llm-runtime/kapertus/` is a single `ApertusIngestion.kt` (89 lines) that wraps the weight loaders. There is no kapertus-cli binary equivalent of `kllama-cli`, and no `llm-apps/kapertus-cli/` module. 
+| PR | Title | +| ----: | --------------------------------------------------------- | +| #91 | Plan + this tracking doc | +| #92 | `fix(apertus): route through OptimizedLLMRuntime + apertusNetwork()` | +| #93 | `docs(apertus): document chat-template format` | +| #94 | `feat(apertus): tool calling support` | -The architecture / library layer itself is solid: -- `apertusNetwork()` DSL with xIELU, QK-Norm, ungated FFN — `llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusNetworkDef.kt:29`. -- Weight loading from GGUF + SafeTensors + quantized — `ApertusWeightLoader.kt`, `ApertusSafeTensorsLoader.kt`, `ApertusQuantizedRuntime.kt`. -- 4 self-contained `commonTest` smoke tests pass without a model file (`ApertusRuntimeSmokeTest`, `ApertusQuantizedRuntimeSmokeTest`, `ApertusXIELUTest`, `ApertusConfigParserTest`). +After this stack: +- `skainet-cli` routes Apertus models through `OptimizedLLMRuntime + apertusNetwork()` (xIELU + QK-Norm + ungated FFN — the previous `LlamaRuntime` fallback silently produced wrong logits). +- `--agent --template=apertus` formats prompts with Apertus's own role tokens (`<|system_start|>`, `<|user_start|>`, `<|assistant_start|>`, `<|tools_prefix|>`, etc.) and parses tool calls back from `<|tools_prefix|>[...]<|tools_suffix|>` JSON arrays. +- `ModelRegistry.APERTUS.supportsToolCalling = true`, `chatTemplateFamily = "apertus"`. +- `KernelRegistry` auto-discovers native FFM kernels for the matmul path via the 0.22.0 native-cpu module. -**Goal:** lift Apertus to the same level of polish kllama and kgemma have. Track the rollout in this file. The next contributor / session opens this doc, scans the staged-delivery checklist, and picks up where the previous one left off. +## What's not in this rollout ---- +- **Optional kapertus-cli rebuild** — was originally listed as PR 4 ("rebuild CLI under `llm-apps/`"). 
Dropped: the unified `skainet-cli` already covers Apertus end-to-end, model-specific CLIs (kqwen, kapertus, kvoxtral) are being deprecated per commit `81f3506`, and the workspace direction is consolidation rather than per-model binaries. If a downstream consumer needs an Apertus-only fat-jar later, copy the `skainet-cli` shadow setup. +- **Native Apertus kernels** — Apertus shares matmul shapes with Llama; the native FFM kernels from SKaiNET 0.22.0 (Q4_K, FP32) work transparently. No Apertus-specific kernel work needed. +- **TurboQuant KV-cache compression for Apertus** — tracked separately under the TurboQuant workstream. -## Staged delivery +## Reference docs -- [x] **PR 1 — `fix(apertus): route through OptimizedLLMRuntime + apertusNetwork()`** (correctness fix) — #92 -- [x] **PR 2 — `docs(apertus): document chat template format`** (research) — #93 -- [x] **PR 3 — `feat(apertus): tool calling support`** (implementation, depends on PR 2) — this PR -- [ ] **PR 4 — `feat(kapertus): rebuild CLI under llm-apps/`** (parity, optional) +- `docs/specs/apertus-chat-template.md` — full spec for the Apertus chat template (PR 2). Source of truth for the `ApertusChatTemplate` implementation. -Each PR ticks its own checkbox when merged. `Status:` at the top of this doc reflects the most recent merged PR. +## Cleanup that landed alongside the rollout (this commit) ---- +The hand-coded `ApertusRuntime.kt` and `ApertusQuantizedRuntime.kt` paths (and their attention backends + smoke tests) were marked `@Deprecated` after PR 1 made `OptimizedLLMRuntime + apertusNetwork()` the canonical path. Removed in this commit alongside the rollout closure: -## PR 1 — fix skainet-cli routing for Apertus +- `ApertusRuntime.kt` — hand-coded decoder runtime, deprecated. +- `ApertusQuantizedRuntime.kt` — lazy-dequant variant, deprecated. +- `ApertusAttentionBackend.kt` + `ApertusCpuAttentionBackend.kt` — only used by the two deleted runtimes. 
+- `ApertusRuntimeSmokeTest.kt` + `ApertusQuantizedRuntimeSmokeTest.kt` — exercised the deleted runtimes. -**Why first:** today's `skainet-cli` produces silently-wrong logits for Apertus models. Fix the worst class of bug first; everything else is additive. +The `xielu()` / `softplus()` activation reference functions previously housed in `ApertusRuntime.kt` were extracted to `ApertusXIELU.kt` so `ApertusXIELUTest` keeps validating the math. The kdoc references in `OptimizedLLMRuntime.kt` and `OutputEquivalenceTest.kt` to "ApertusRuntime" are now stale and worth a follow-up sweep, but they're code comments only and don't break anything. -**Changes:** -- `llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt` lines 168–216: detect `architecture == "apertus"` (or `family == "apertus"`) and branch to `OptimizedLLMRuntime + apertusNetwork()` instead of `LlamaRuntime`. Mirror how Gemma4 is already special-cased in this file. -- Reuse `ApertusNetworkLoader.kt:31` for the module-build path. -- Reuse `apertusNetwork()` from `ApertusNetworkDef.kt:29`. +The remaining apertus library files (`ApertusNetworkDef`, `ApertusNetworkLoader`, `ApertusWeightLoader`, `ApertusSafeTensorsLoader`, `ApertusRuntimeWeights`, `ApertusConfigParser`, `QuantizedTensor`, `ApertusXIELU`, `ApertusIngestion`) cover the whole production path through `apertusNetwork() + OptimizedLLMRuntime`. -**Verification:** -- `skainet-cli -m "Hello"` produces coherent output (the divergence is silent today; loading a real checkpoint and seeing meaningful text is the canary). -- xIELU activation actually fires — verify by setting a debug breakpoint or `println` in `ApertusNetworkDef`'s xIELU branch on the first forward pass. +## Test footprint (post-cleanup) -**No new tests required** — the existing `:llm-inference:apertus:commonTest` already exercises `apertusNetwork()` end-to-end at toy scale; the routing fix is a Main.kt change covered by manual run. 
- ---- - -## PR 2 — document the chat template format - -**Why:** before implementing a chat template, we need to know what format Apertus models actually expect. `ModelRegistry.kt:66` lists `chatTemplateFamily="chatml"` but that's a guess — Apertus may use a different format (Alpaca, llama2-style, custom). Without the right template, tool-calling output won't parse correctly even if the rest of PR 3 is right. - -**Changes:** -- Inspect a real Apertus GGUF (download from HuggingFace if needed: `swiss-ai/Apertus-1B` or similar). Read the `tokenizer.chat_template` GGUF metadata key. -- Create `docs/explanation/models/apertus-chat-template.md` documenting: - - The actual chat-template Jinja string from the GGUF, byte-for-byte. - - Special tokens (`<|im_start|>`-style? `[INST]`-style? Alpaca?). - - Tool calling format (if the template has any). - - Whether the template matches an existing family (`chatml`, `llama3`, `gemma`) or needs a new `apertus` strategy. -- Update `ModelRegistry.kt:66` `chatTemplateFamily` if the research shows a different family is correct. - -**Verification:** doc exists; rendered template matches the GGUF's `tokenizer.chat_template` byte-for-byte for one canonical message exchange (system + user + assistant + user roles). - ---- - -## PR 3 — tool calling support - -**Depends on PR 2** (need the template format documented). - -**Changes:** -- New `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ApertusChatTemplate.kt` — concrete `ChatTemplate`. Pattern reference: `Llama3ChatTemplate.kt` and `Gemma4ChatTemplate.kt`. -- New `llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ApertusToolCallingSupport.kt` — analog of `Gemma4ToolCallingSupport`. Bundles the chat template + the tool-call markup parser. -- Register in `ToolCallingSupportResolver.kt:16` — add a branch for `family == "apertus"` returning the new support. -- `ModelRegistry.kt:66`: flip `supportsToolCalling=true`. 
-- Tests: - - `ApertusChatTemplateTest.kt` in `llm-agent/src/commonTest/kotlin/sk/ainet/apps/kllama/chat/` — render parity vs the GGUF Jinja template (mirror `Gemma4ChatTemplateHfParityTest` shape). - - `ApertusToolCallParserStrategyTest.kt` — parse the Apertus tool-call output format. - -**Verification:** -- `:llm-agent:commonTest --tests '*Apertus*'` green. -- End-to-end: `skainet-cli -m --agent --template=apertus "What is 17 * 23?"` invokes a calculator tool the same way the kllama TinyLlama tool-calling smoke test does. - ---- - -## PR 4 — rebuild kapertus CLI under llm-apps/ (optional) - -**Why:** `kapertus` runtime is a 1-file library facade today. Users wanting a `kapertus-cli` equivalent of `kllama-cli` have nothing. The unified `skainet-cli` works fine after PR 1, so PR 4 is **optional polish** — only worth doing if there's a specific reason to ship a separate Apertus-only CLI binary (smaller fat-JAR, branded distribution, similar parity expectations as `kllama-cli`). - -**Changes:** -- New `llm-apps/kapertus-cli/build.gradle.kts` mirroring `kllama-cli/build.gradle.kts`: `kotlin("jvm")` + `shadow` + `application` plugins. -- New `llm-apps/kapertus-cli/src/main/kotlin/sk/ainet/apps/kapertus/cli/Main.kt` — re-implement the deleted (commit `81f3506`) Main.kt, but routed through `OptimizedLLMRuntime + apertusNetwork()`. -- Apply the shadow `mergeServiceFiles()` `doLast` workaround that PR #88 added to `kllama-cli` and `skainet-cli` (the `com.gradleup.shadow:9.4.x` bug — `NativeKernelProviderFactory` gets dropped from the merged services file otherwise). -- Add to `settings.gradle.kts`: `include("llm-apps:kapertus-cli")`. - -**Verification:** -- `:llm-apps:kapertus-cli:shadowJar` produces a runnable fat JAR. -- `unzip -p kapertus-all.jar META-INF/services/sk.ainet.backend.api.kernel.KernelProvider` shows all 3 KernelProvider entries (Scalar + PanamaVector + Native). -- `java -jar kapertus-all.jar -m "Hello"` produces coherent output. 
- ---- - -## Out of scope - -- **Native-cpu wiring for Apertus inference.** Works automatically once PR 1 lands: matmul flows through `KernelRegistry.bestAvailable()`; native FFM kernels are auto-discovered via ServiceLoader on the classpath. No Apertus-specific code needed. -- **Q4_K-quantized Apertus checkpoints.** Should work today via the existing Q4_K matmul path; no Apertus-specific code needed. -- **TurboQuant KV-cache compression for Apertus.** Tracked separately under the broader TurboQuant workstream. -- **Removing the deprecated `ApertusRuntime.kt`** (hand-coded path). Leave as `@Deprecated` until consumers have migrated to `OptimizedLLMRuntime + apertusNetwork()`. - ---- - -## Why this lives in transformers, not SKaiNET - -The Apertus rollout is entirely transformers-side. Model definition (`apertusNetwork()`), runtime, weight loaders, and tool calling all live under `llm-inference/apertus/`, `llm-runtime/kapertus/`, and `llm-agent/`. The SKaiNET upstream (kernels, tensor ops, ServiceLoader infra) needs no Apertus-specific changes. +- `:llm-inference:apertus:jvmTest` — 12 tests (ConfigParser 6, XIELU 6). +- `:llm-agent:jvmTest --tests '*Apertus*'` — 21 tests (ChatTemplate 10, ParserStrategy 11). +- 33 Apertus-specific tests total, all green. diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt deleted file mode 100644 index 1980fd5b..00000000 --- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusAttentionBackend.kt +++ /dev/null @@ -1,51 +0,0 @@ -package sk.ainet.models.apertus - -import sk.ainet.lang.tensor.Tensor -import sk.ainet.lang.types.DType - -/** - * Strategy interface for Apertus attention computation. - * - * Similar to LLaMA's AttentionBackend but receives Q/K after QK-norm has been applied. - * Applies RoPE encoding, KV cache management, and GQA attention scoring. 
- * - * Contract: - * - Input: q [1, dim], k [1, kvDim], v [1, kvDim], layerIdx, position - * - Output: attention output [1, dim] - */ -public interface ApertusAttentionBackend { - - /** - * Compute attention for one token at the given position. - * - * Q and K have already been QK-normed by the caller. - * This method applies RoPE, stores k/v in the KV cache, - * and returns the attention-weighted output. - */ - public fun attention( - q: Tensor, - k: Tensor, - v: Tensor, - layerIdx: Int, - position: Int - ): Tensor - - /** - * Compute attention for a batch of tokens starting at [startPos]. - * - * Returns null if the backend does not support batch attention, - * in which case the runtime falls back to sequential processing. - */ - public fun batchAttention( - q: Tensor, - k: Tensor, - v: Tensor, - layerIdx: Int, - startPos: Int, - ): Tensor? = null - - /** - * Reset internal state (KV caches, position tracking, etc.). - */ - public fun reset() -} diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt deleted file mode 100644 index 664410a5..00000000 --- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusCpuAttentionBackend.kt +++ /dev/null @@ -1,242 +0,0 @@ -package sk.ainet.models.apertus - -import kotlin.math.cos -import kotlin.math.sin -import kotlin.math.sqrt -import sk.ainet.apps.llm.KvCache -import sk.ainet.apps.llm.HeapKvCache -import sk.ainet.apps.llm.applyRopeRotation -import sk.ainet.apps.llm.softmaxInPlace -import sk.ainet.context.ExecutionContext -import sk.ainet.lang.tensor.Shape -import sk.ainet.lang.tensor.Tensor -import sk.ainet.lang.tensor.data.FloatArrayTensorData -import sk.ainet.lang.types.DType -import kotlin.reflect.KClass - -/** - * CPU-based attention backend for Apertus. 
- * - * Applies RoPE with high base theta (12M default), stores into KV cache, - * and computes GQA attention with explicit loops. - * - * Q and K are expected to be QK-normed before being passed to this backend. - */ -public class ApertusCpuAttentionBackend private constructor( - private val ctx: ExecutionContext, - private val dtype: KClass, - private val dim: Int, - private val seqLen: Int, - private val nLayers: Int, - private val nHeads: Int, - private val nKvHeads: Int, - private val headSize: Int, - private val kvDim: Int, - private val ropeDim: Int, - private val nHeadsPerKv: Int, - private val ropeFreqBase: Float, - private val cache: KvCache, - private val precomputedRopeFreqs: FloatArray? -) : ApertusAttentionBackend { - - /** - * Primary constructor from full runtime weights. - */ - public constructor( - ctx: ExecutionContext, - weights: ApertusRuntimeWeights, - dtype: KClass, - kvCache: KvCache? = null, - ropeFreqBase: Float = weights.metadata.ropeTheta - ) : this( - ctx = ctx, - dtype = dtype, - dim = weights.metadata.embeddingLength, - seqLen = weights.metadata.contextLength, - nLayers = weights.metadata.blockCount, - nHeads = weights.metadata.headCount, - nKvHeads = weights.metadata.kvHeadCount, - headSize = weights.metadata.embeddingLength / weights.metadata.headCount, - kvDim = weights.metadata.kvHeadCount * (weights.metadata.embeddingLength / weights.metadata.headCount), - ropeDim = weights.metadata.ropeDimensionCount - ?: (weights.metadata.embeddingLength / weights.metadata.headCount), - nHeadsPerKv = weights.metadata.headCount / weights.metadata.kvHeadCount, - ropeFreqBase = ropeFreqBase, - cache = kvCache ?: HeapKvCache( - weights.metadata.blockCount, - weights.metadata.contextLength, - weights.metadata.kvHeadCount * (weights.metadata.embeddingLength / weights.metadata.headCount) - ), - precomputedRopeFreqs = weights.ropeFreqs?.let { tensor -> - val data = tensor.data - if (data is FloatArrayTensorData<*>) data.buffer.copyOf() - else 
data.copyToFloatArray() - } - ) - - /** - * Constructor from metadata and optional rope frequencies (for quantized runtime). - */ - public constructor( - ctx: ExecutionContext, - metadata: ApertusModelMetadata, - dtype: KClass, - ropeFreqs: FloatArray? = null, - kvCache: KvCache? = null, - ropeFreqBase: Float = metadata.ropeTheta - ) : this( - ctx = ctx, - dtype = dtype, - dim = metadata.embeddingLength, - seqLen = metadata.contextLength, - nLayers = metadata.blockCount, - nHeads = metadata.headCount, - nKvHeads = metadata.kvHeadCount, - headSize = metadata.embeddingLength / metadata.headCount, - kvDim = metadata.kvHeadCount * (metadata.embeddingLength / metadata.headCount), - ropeDim = metadata.ropeDimensionCount - ?: (metadata.embeddingLength / metadata.headCount), - nHeadsPerKv = metadata.headCount / metadata.kvHeadCount, - ropeFreqBase = ropeFreqBase, - cache = kvCache ?: HeapKvCache( - metadata.blockCount, - metadata.contextLength, - metadata.kvHeadCount * (metadata.embeddingLength / metadata.headCount) - ), - precomputedRopeFreqs = ropeFreqs - ) - - private val scoreBuffer = FloatArray(seqLen) - - override fun attention( - q: Tensor, - k: Tensor, - v: Tensor, - layerIdx: Int, - position: Int - ): Tensor { - val qBuf = q.expectFloatBuffer() - val kBuf = k.expectFloatBuffer() - val vBuf = v.expectFloatBuffer() - - applyRopeGqa(qBuf, kBuf, position) - cache.store(layerIdx, position, kBuf, 0, vBuf, 0) - - val attnOutRaw = attentionGqa(layerIdx, qBuf, position) - return ctx.fromFloatArray(Shape(1, dim), dtype, attnOutRaw) - } - - override fun batchAttention( - q: Tensor, - k: Tensor, - v: Tensor, - layerIdx: Int, - startPos: Int, - ): Tensor { - val batchSize = q.shape[0] - val qAll = q.expectFloatBuffer() - val kAll = k.expectFloatBuffer() - val vAll = v.expectFloatBuffer() - - val result = FloatArray(batchSize * dim) - - for (i in 0 until batchSize) { - val pos = startPos + i - - val qBuf = qAll.copyOfRange(i * dim, (i + 1) * dim) - val kBuf = kAll.copyOfRange(i 
* kvDim, (i + 1) * kvDim) - val vBuf = vAll.copyOfRange(i * kvDim, (i + 1) * kvDim) - - applyRopeGqa(qBuf, kBuf, pos) - cache.store(layerIdx, pos, kBuf, 0, vBuf, 0) - - val attnOut = attentionGqa(layerIdx, qBuf, pos) - attnOut.copyInto(result, i * dim) - } - - return ctx.fromFloatArray(Shape(batchSize, dim), dtype, result) - } - - override fun reset() { - cache.reset() - } - - private fun applyRopeGqa(qBuf: FloatArray, kBuf: FloatArray, pos: Int) { - require(headSize % 2 == 0) { "RoPE requires even head size; got $headSize" } - - if (precomputedRopeFreqs != null) { - applyRopeWithFreqs(qBuf, nHeads, headSize, pos, precomputedRopeFreqs) - applyRopeWithFreqs(kBuf, nKvHeads, headSize, pos, precomputedRopeFreqs) - } else { - val ropeStride = headSize / 2 - applyRopeRotation(qBuf, nHeads, headSize, ropeDim, pos, ropeFreqBase, null, null, ropeStride) - applyRopeRotation(kBuf, nKvHeads, headSize, ropeDim, pos, ropeFreqBase, null, null, ropeStride) - } - } - - /** - * Apply RoPE using precomputed inverse frequencies. - * - * For each pair index `p`, the angle is `pos * freqs[p]`. 
- * Rotation: out[2p] = in[2p] * cos(θ) - in[2p+1] * sin(θ) - * out[2p+1] = in[2p] * sin(θ) + in[2p+1] * cos(θ) - */ - private fun applyRopeWithFreqs( - buf: FloatArray, - numHeads: Int, - headSize: Int, - pos: Int, - freqs: FloatArray - ) { - val nPairs = freqs.size - for (h in 0 until numHeads) { - val headOffset = h * headSize - for (pair in 0 until nPairs) { - val angle = pos * freqs[pair] - val fcr = cos(angle) - val fci = sin(angle) - val i = pair * 2 - val v0 = buf[headOffset + i] - val v1 = buf[headOffset + i + 1] - buf[headOffset + i] = (v0 * fcr - v1 * fci).toFloat() - buf[headOffset + i + 1] = (v0 * fci + v1 * fcr).toFloat() - } - } - } - - private fun attentionGqa(layerIdx: Int, qBuf: FloatArray, pos: Int): FloatArray { - val out = FloatArray(dim) - val scale = 1f / sqrt(headSize.toDouble()).toFloat() - val scores = scoreBuffer - - for (h in 0 until nHeads) { - val qHeadOffset = h * headSize - val kvHeadIdx = h / nHeadsPerKv - val kvHeadOffset = kvHeadIdx * headSize - - for (t in 0..pos) { - var score = 0f - for (i in 0 until headSize) { - score += qBuf[qHeadOffset + i] * cache.getKey(layerIdx, t, kvHeadOffset, i) - } - scores[t] = score * scale - } - - softmaxInPlace(scores, pos + 1) - - for (t in 0..pos) { - val weight = scores[t] - for (i in 0 until headSize) { - out[qHeadOffset + i] += weight * cache.getValue(layerIdx, t, kvHeadOffset, i) - } - } - } - return out - } - - private fun Tensor.expectFloatBuffer(): FloatArray { - val data = this.data - if (data is FloatArrayTensorData<*>) return data.buffer - return data.copyToFloatArray() - } -} diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt deleted file mode 100644 index 9117f2e9..00000000 --- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntime.kt +++ /dev/null @@ -1,185 +0,0 @@ -package 
sk.ainet.models.apertus - -import kotlin.math.sqrt -import kotlin.random.Random -import sk.ainet.apps.llm.DecoderRuntime -import sk.ainet.context.ExecutionContext -import sk.ainet.io.gguf.dequant.DequantOps -import sk.ainet.lang.nn.layers.Embedding -import sk.ainet.lang.nn.normalization.RMSNormalization -import sk.ainet.lang.tensor.Shape -import sk.ainet.lang.tensor.Tensor -import sk.ainet.lang.tensor.matmul -import sk.ainet.lang.tensor.plus -import sk.ainet.lang.tensor.t -import sk.ainet.lang.types.FP32 - -/** - * Apertus decoder runtime with **lazy dequantization**. - * - * Weight matrices (wq, wk, wv, wo, ffnUp, ffnDown, outputWeight) are stored - * in their original quantized form ([QuantizedTensor]). Each layer dequantizes - * its weights to FP32 on the fly during [runLayer], then discards the temporary. - * - * Memory profile for a 7B Q4_0 model: - * - Resident: ~3.5 GB (quantized) + norms/embeddings in FP32 (~100 MB) - * - Per-layer temporary: ~50 MB (one projection matrix at a time) - * - vs. eager FP32: ~28 GB - * - * Trade-off: each token pays a dequantization cost per-layer. This is the same - * approach used by llama.cpp and is well worth the 4-8x memory savings. 
- */ -public class ApertusQuantizedRuntime( - private val ctx: ExecutionContext, - val weights: ApertusQuantizedRuntimeWeights, - private val attentionBackend: ApertusAttentionBackend, - private val eps: Float = weights.metadata.rmsNormEps, - random: Random = Random.Default -) : DecoderRuntime(random) { - - override val dim: Int = weights.metadata.embeddingLength - override val seqLen: Int = weights.metadata.contextLength - override val vocabSize: Int = weights.metadata.vocabSize - override val nLayers: Int = weights.layers.size - override val bosToken: Int = weights.metadata.bosTokenId - - private val nHeads = weights.metadata.headCount - private val headDim = dim / nHeads - private val nKvHeads = weights.metadata.kvHeadCount - - private val embedding = Embedding( - numEmbeddings = vocabSize, - embeddingDim = dim, - initWeight = weights.tokenEmbedding, - name = "token_embd" - ) - - private val outputNormLayer = RMSNormalization( - normalizedShape = intArrayOf(dim), - eps = eps.toDouble(), - name = "output_norm", - initWeight = weights.outputNorm - ) - - private val attnNorms = weights.layers.mapIndexed { i, layer -> - RMSNormalization( - normalizedShape = intArrayOf(dim), - eps = eps.toDouble(), - name = "layer_$i.attn_norm", - initWeight = layer.attnNorm - ) - } - - private val ffnNorms = weights.layers.mapIndexed { i, layer -> - RMSNormalization( - normalizedShape = intArrayOf(dim), - eps = eps.toDouble(), - name = "layer_$i.ffn_norm", - initWeight = layer.ffnNorm - ) - } - - override fun embedToken(tokenId: Int): Tensor = - embedding.forward(intArrayOf(tokenId), ctx) - - override fun runLayer(layerIdx: Int, x: Tensor): Tensor { - val layer = weights.layers[layerIdx] - - // 1. Attention norm - val attnNorm = attnNorms[layerIdx].forward(x, ctx) - - // 2. 
QKV projections — dequant weight, matmul, discard temp - val q = attnNorm.matmul(dequant2D(layer.wq).t()) - val k = attnNorm.matmul(dequant2D(layer.wk).t()) - val v = attnNorm.matmul(dequant2D(layer.wv).t()) - - // 3. QK-norm: per-head RMSNorm on Q and K - val qNormed = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm) - val kNormed = applyPerHeadRMSNorm(k, nKvHeads, headDim, layer.kNorm) - - // 4. Attention (RoPE + KV cache + GQA) - val attnOut = attentionBackend.attention(qNormed, kNormed, v, layerIdx, position) - - // 5. Output projection + residual - val afterAttn = x + attnOut.matmul(dequant2D(layer.wo).t()) - - // 6. FFN norm - val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx) - - // 7. Ungated MLP: up → xIELU → down - val up = ffnNorm.matmul(dequant2D(layer.ffnUp).t()) - val activated = applyXIELU(up, layer.xieluParams) - val ffnOut = activated.matmul(dequant2D(layer.ffnDown).t()) - - // 8. Residual - return afterAttn + ffnOut - } - - override fun outputNorm(x: Tensor): Tensor = - outputNormLayer.forward(x, ctx) - - override fun outputProject(x: Tensor): Tensor = - x.matmul(dequant2D(weights.outputWeight).t()) - - override fun resetState() { - attentionBackend.reset() - } - - // ---- Dequantization ---- - - /** - * Dequantize a 2D quantized tensor to FP32. - * - * GGUF stores 2D tensors column-major with shape [out, in]. - * We transpose to row-major [in, out] so that `.t()` in the caller - * gives the correct matmul orientation. 
- */ - private fun dequant2D(qt: QuantizedTensor): Tensor { - val floats = qt.dequantToFloat() - return if (qt.shape.rank == 2) { - val rows = qt.shape[0] - val cols = qt.shape[1] - val transposed = DequantOps.transposeColumnMajorToRowMajor(floats, rows, cols) - ctx.fromFloatArray(Shape(cols, rows), FP32::class, transposed) - } else { - ctx.fromFloatArray(qt.shape, FP32::class, floats) - } - } - - // ---- Apertus-specific helpers (same as ApertusRuntime) ---- - - private fun applyPerHeadRMSNorm( - x: Tensor, - numHeads: Int, - headDim: Int, - weight: Tensor - ): Tensor { - val buf = x.expectFloatBuffer().copyOf() - val w = weight.expectFloatBuffer() - val totalDim = numHeads * headDim - val batchSize = if (x.shape.rank == 2) x.shape[0] else 1 - - for (b in 0 until batchSize) { - val batchOffset = b * totalDim - for (h in 0 until numHeads) { - val headOffset = batchOffset + h * headDim - var sumSq = 0f - for (i in 0 until headDim) { - val v = buf[headOffset + i] - sumSq += v * v - } - val rms = sqrt(sumSq / headDim + eps) - for (i in 0 until headDim) { - buf[headOffset + i] = (buf[headOffset + i] / rms) * w[i] - } - } - } - return ctx.fromFloatArray(x.shape, FP32::class, buf) - } - - private fun applyXIELU(x: Tensor, params: ApertusXIELUParams): Tensor { - val buf = x.expectFloatBuffer().copyOf() - xielu(buf, params) - return ctx.fromFloatArray(x.shape, FP32::class, buf) - } -} diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt deleted file mode 100644 index fafea08a..00000000 --- a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusRuntime.kt +++ /dev/null @@ -1,242 +0,0 @@ -package sk.ainet.models.apertus - -import kotlin.math.exp -import kotlin.math.ln -import kotlin.math.min -import kotlin.math.sqrt -import kotlin.random.Random -import sk.ainet.apps.llm.DecoderRuntime -import 
sk.ainet.context.ExecutionContext -import sk.ainet.lang.nn.layers.Embedding -import sk.ainet.lang.tensor.Shape -import sk.ainet.lang.tensor.Tensor -import sk.ainet.lang.tensor.matmul -import sk.ainet.lang.tensor.plus -import sk.ainet.lang.tensor.t -import sk.ainet.lang.tensor.data.FloatArrayTensorData -import sk.ainet.lang.nn.normalization.RMSNormalization -import sk.ainet.lang.types.DType -import kotlin.reflect.KClass - -/** - * Apertus decoder runtime with pluggable attention backend. - * - * Key differences from LLaMA: - * - **xIELU activation** with per-layer learned scalar parameters (replaces SiLU) - * - **Ungated MLP** — only up_proj + down_proj, no gate_proj - * - **QK-norm** — per-head RMSNorm on Q and K before RoPE - * - * Extends [DecoderRuntime] for shared forward/generate/sample logic. - */ -@Deprecated( - message = "Use OptimizedLLMRuntime with apertusNetwork() instead. " + - "See docs/optimizable-LLM-NNs-DAG.md for migration guide.", - replaceWith = ReplaceWith( - "OptimizedLLMRuntime.create(apertusNetwork(config), tensors, resolver, ctx)", - "sk.ainet.apps.llm.OptimizedLLMRuntime" - ) -) -public class ApertusRuntime( - private val ctx: ExecutionContext, - val weights: ApertusRuntimeWeights, - private val attentionBackend: ApertusAttentionBackend, - private val dtype: KClass, - private val eps: Float = weights.metadata.rmsNormEps, - random: Random = Random.Default -) : DecoderRuntime(random) { - - // ---- DecoderRuntime abstract properties ---- - override val dim: Int = weights.metadata.embeddingLength - override val seqLen: Int = weights.metadata.contextLength - override val vocabSize: Int = weights.metadata.vocabSize - override val nLayers: Int = weights.layers.size - override val bosToken: Int = weights.metadata.bosTokenId - - private val nHeads = weights.metadata.headCount - private val headDim = dim / nHeads - private val nKvHeads = weights.metadata.kvHeadCount - private val kvDim = nKvHeads * headDim - - private val embedding = Embedding( - 
numEmbeddings = vocabSize, - embeddingDim = dim, - initWeight = weights.tokenEmbedding, - name = "token_embd" - ) - - private val outputNormLayer = RMSNormalization( - normalizedShape = intArrayOf(dim), - eps = eps.toDouble(), - name = "output_norm", - initWeight = weights.outputNorm - ) - - private val attnNorms = weights.layers.mapIndexed { i, layer -> - RMSNormalization( - normalizedShape = intArrayOf(dim), - eps = eps.toDouble(), - name = "layer_$i.attn_norm", - initWeight = layer.attnNorm - ) - } - - private val ffnNorms = weights.layers.mapIndexed { i, layer -> - RMSNormalization( - normalizedShape = intArrayOf(dim), - eps = eps.toDouble(), - name = "layer_$i.ffn_norm", - initWeight = layer.ffnNorm - ) - } - - private val outputWeightT: Tensor = weights.outputWeight.t() - - // ---- DecoderRuntime template methods ---- - - override fun embedToken(tokenId: Int): Tensor = - embedding.forward(intArrayOf(tokenId), ctx) - - override fun runLayer(layerIdx: Int, x: Tensor): Tensor { - val layer = weights.layers[layerIdx] - - // 1. Attention norm - val attnNorm = attnNorms[layerIdx].forward(x, ctx) - - // 2. QKV projections (transpose on the fly to avoid double-memory peak) - val q = attnNorm.matmul(layer.wq.t()) - val k = attnNorm.matmul(layer.wk.t()) - val v = attnNorm.matmul(layer.wv.t()) - - // 3. QK-norm: per-head RMSNorm on Q and K - val qNormed = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm) - val kNormed = applyPerHeadRMSNorm(k, nKvHeads, headDim, layer.kNorm) - - // 4. Attention (RoPE + KV cache + GQA) — backend receives QK-normed tensors - val attnOut = attentionBackend.attention(qNormed, kNormed, v, layerIdx, position) - - // 5. Output projection + residual - val afterAttn = x + attnOut.matmul(layer.wo.t()) - - // 6. FFN norm - val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx) - - // 7. 
Ungated MLP: up → xIELU → down - val up = ffnNorm.matmul(layer.ffnUp.t()) - val activated = applyXIELU(up, layer.xieluParams) - val ffnOut = activated.matmul(layer.ffnDown.t()) - - // 8. Residual - return afterAttn + ffnOut - } - - override fun outputNorm(x: Tensor): Tensor = - outputNormLayer.forward(x, ctx) - - override fun outputProject(x: Tensor): Tensor = - x.matmul(outputWeightT) - - override fun resetState() { - attentionBackend.reset() - } - - // ---- Apertus-specific helpers ---- - - /** - * Apply per-head RMSNorm to Q or K tensor. - * - * Input shape: [batch, nHeads * headDim] - * Weight shape: [nHeads * headDim] (per-head, so each head has `headDim` weight values) - */ - private fun applyPerHeadRMSNorm( - x: Tensor, - numHeads: Int, - headDim: Int, - weight: Tensor - ): Tensor { - val buf = x.expectFloatBuffer().copyOf() - val w = weight.expectFloatBuffer() - val totalDim = numHeads * headDim - - // Handle batched input - val batchSize = if (x.shape.rank == 2) x.shape[0] else 1 - - for (b in 0 until batchSize) { - val batchOffset = b * totalDim - for (h in 0 until numHeads) { - val headOffset = batchOffset + h * headDim - - // Compute RMS for this head - var sumSq = 0f - for (i in 0 until headDim) { - val v = buf[headOffset + i] - sumSq += v * v - } - val rms = sqrt(sumSq / headDim + eps) - - // Normalize and scale (weight is per-head, shared across all heads) - for (i in 0 until headDim) { - buf[headOffset + i] = (buf[headOffset + i] / rms) * w[i] - } - } - } - - val shape = x.shape - return ctx.fromFloatArray(shape, dtype, buf) - } - - /** - * Apply xIELU activation element-wise. 
- * - * xIELU formula: - * ``` - * alpha_p_eff = softplus(alpha_p) // ln(1 + exp(stored_alpha_p)) - * alpha_n_eff = beta + softplus(alpha_n) - * if x > 0: alpha_p_eff * x^2 + beta * x - * if x <= 0: (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x - * ``` - */ - private fun applyXIELU(x: Tensor, params: ApertusXIELUParams): Tensor { - val buf = x.expectFloatBuffer().copyOf() - xielu(buf, params) - return ctx.fromFloatArray(x.shape, dtype, buf) - } - -} - -// ========== xIELU Implementation ========== - -/** - * In-place xIELU activation on a float buffer. - * - * Public for unit testing. - */ -public fun xielu(buf: FloatArray, params: ApertusXIELUParams) { - val alphaPEff = softplus(params.alphaP) - val alphaNEff = params.beta + softplus(params.alphaN) - - for (i in buf.indices) { - val x = buf[i] - buf[i] = if (x > 0f) { - alphaPEff * x * x + params.beta * x - } else { - val clamped = min(x, params.eps) - (expm1(clamped) - x) * alphaNEff + params.beta * x - } - } -} - -/** - * softplus(x) = ln(1 + exp(x)) - * - * Uses a numerically stable formulation: - * - For large x: softplus(x) ≈ x - * - For small x: use the exact formula - */ -public fun softplus(x: Float): Float { - return if (x > 20f) x else ln(1f + exp(x)) -} - -/** - * exp(x) - 1, avoiding catastrophic cancellation near zero. - */ -private fun expm1(x: Float): Float = kotlin.math.expm1(x.toDouble()).toFloat() diff --git a/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt new file mode 100644 index 00000000..5c9a7afc --- /dev/null +++ b/llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt @@ -0,0 +1,51 @@ +package sk.ainet.models.apertus + +import kotlin.math.exp +import kotlin.math.ln +import kotlin.math.min + +/** + * xIELU activation reference implementation. 
+ * + * Mutates [buf] in place applying: + * + * alpha_p_eff = softplus(alpha_p) + * alpha_n_eff = beta + softplus(alpha_n) + * + * if x > 0: alpha_p_eff * x*x + beta * x + * else: (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x + * + * Apertus models carry per-layer scalar `alpha_p`, `alpha_n`, `beta`, `eps` + * weights (see [ApertusXIELUParams]). The production network uses an + * equivalent op emitted by `apertusNetwork()` through SKaiNET's compute + * graph; this Kotlin reference exists so callers (notably tests) can + * verify the math without spinning up a runtime, and so future xIELU + * implementations have a single golden reference to point at. + * + * Public for unit testing. + */ +public fun xielu(buf: FloatArray, params: ApertusXIELUParams) { + val alphaPEff = softplus(params.alphaP) + val alphaNEff = params.beta + softplus(params.alphaN) + + for (i in buf.indices) { + val x = buf[i] + buf[i] = if (x > 0f) { + alphaPEff * x * x + params.beta * x + } else { + val clamped = min(x, params.eps) + (expm1(clamped) - x) * alphaNEff + params.beta * x + } + } +} + +/** + * `softplus(x) = ln(1 + exp(x))`, with the standard large-x asymptotic + * shortcut to avoid `exp` overflow. + */ +public fun softplus(x: Float): Float { + return if (x > 20f) x else ln(1f + exp(x)) +} + +/** `exp(x) - 1` without catastrophic cancellation near zero. 
*/ +private fun expm1(x: Float): Float = kotlin.math.expm1(x.toDouble()).toFloat() diff --git a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt b/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt deleted file mode 100644 index e375641a..00000000 --- a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusQuantizedRuntimeSmokeTest.kt +++ /dev/null @@ -1,275 +0,0 @@ -package sk.ainet.models.apertus - -import sk.ainet.context.DefaultDataExecutionContext -import sk.ainet.io.gguf.GGMLQuantizationType -import sk.ainet.lang.tensor.Shape -import sk.ainet.lang.tensor.Tensor -import sk.ainet.lang.types.FP32 -import kotlin.test.Test -import kotlin.test.assertEquals -import kotlin.test.assertTrue - -/** - * Smoke test: builds a tiny Apertus model with quantized (simulated) weights - * and verifies the lazy-dequant runtime produces finite logits and generates tokens. - * - * Uses F32-typed QuantizedTensors stored in GGUF column-major format as an identity - * dequantization, so we can verify correctness against the eager FP32 runtime. 
- */ -class ApertusQuantizedRuntimeSmokeTest { - - private val dim = 8 - private val ffDim = 16 - private val vocabSize = 16 - private val nHeads = 2 - private val kvHeads = 2 - private val headDim = dim / nHeads - private val kvDim = kvHeads * headDim - - private val ctx = DefaultDataExecutionContext() - - private fun ones(shape: Shape): Tensor { - val values = FloatArray(shape.volume) { 0.01f } - return ctx.fromFloatArray(shape, FP32::class, values) - } - - private fun randnArray(shape: Shape, seed: Int): FloatArray { - val rng = kotlin.random.Random(seed) - return FloatArray(shape.volume) { (rng.nextFloat() - 0.5f) * 0.1f } - } - - private fun randn(shape: Shape, seed: Int = 42): Tensor = - ctx.fromFloatArray(shape, FP32::class, randnArray(shape, seed)) - - private fun floatArrayToBytes(data: FloatArray): ByteArray { - val bytes = ByteArray(data.size * 4) - for (i in data.indices) { - val bits = data[i].toRawBits() - bytes[i * 4] = (bits and 0xFF).toByte() - bytes[i * 4 + 1] = ((bits shr 8) and 0xFF).toByte() - bytes[i * 4 + 2] = ((bits shr 16) and 0xFF).toByte() - bytes[i * 4 + 3] = ((bits shr 24) and 0xFF).toByte() - } - return bytes - } - - /** - * Create a QuantizedTensor in GGUF column-major format from an eager row-major weight. - * - * For 2D tensors, GGUF stores shape [ne0, ne1] where ne0 = cols (fast dim), - * ne1 = rows, with data in column-major order. [dequant2D] will transpose this - * back to row-major [ne1, ne0] and then the runtime uses `.t()`. - * - * The eager runtime stores shape [rows, cols] in row-major and also uses `.t()`. - * Both paths produce the same effective matrix in the matmul. 
- * - * @param eagerShape The shape the eager runtime uses (row-major [rows, cols]) - * @param rowMajorData The eager row-major float data - */ - private fun asQuantizedGGUF(eagerShape: Shape, rowMajorData: FloatArray): QuantizedTensor { - if (eagerShape.rank == 2) { - val rows = eagerShape[0] - val cols = eagerShape[1] - // Convert row-major [rows, cols] to column-major [cols, rows] (GGUF layout) - val colMajor = FloatArray(rowMajorData.size) - for (r in 0 until rows) { - for (c in 0 until cols) { - colMajor[c * rows + r] = rowMajorData[r * cols + c] - } - } - return QuantizedTensor( - data = floatArrayToBytes(colMajor), - quantType = GGMLQuantizationType.F32, - shape = Shape(cols, rows), // GGUF shape: [ne0=cols, ne1=rows] - nElements = eagerShape.volume - ) - } - return QuantizedTensor( - data = floatArrayToBytes(rowMajorData), - quantType = GGMLQuantizationType.F32, - shape = eagerShape, - nElements = eagerShape.volume - ) - } - - /** Convenience: generate random data and create a GGUF-format QuantizedTensor. 
*/ - private fun randnQuantized(eagerShape: Shape, seed: Int): QuantizedTensor = - asQuantizedGGUF(eagerShape, randnArray(eagerShape, seed)) - - private fun buildMetadata() = ApertusModelMetadata( - architecture = "apertus", - embeddingLength = dim, - contextLength = 32, - blockCount = 1, - headCount = nHeads, - kvHeadCount = kvHeads, - feedForwardLength = ffDim, - ropeDimensionCount = headDim, - vocabSize = vocabSize, - ropeTheta = 12000000f, - qkNorm = true, - hiddenAct = "xielu", - tiedEmbeddings = false - ) - - private val xieluParams = ApertusXIELUParams(-0.5f, -0.3f, 0.8f, -5.0f) - - @Test - fun quantizedForwardPassProducesFiniteLogits() { - val metadata = buildMetadata() - val layer = ApertusQuantizedLayerWeights( - attnNorm = ones(Shape(dim)), - qNorm = ones(Shape(headDim)), - kNorm = ones(Shape(headDim)), - ffnNorm = ones(Shape(dim)), - xieluParams = xieluParams, - wq = randnQuantized(Shape(dim, dim), seed = 1), - wk = randnQuantized(Shape(kvDim, dim), seed = 2), - wv = randnQuantized(Shape(kvDim, dim), seed = 3), - wo = randnQuantized(Shape(dim, dim), seed = 4), - ffnUp = randnQuantized(Shape(ffDim, dim), seed = 6), - ffnDown = randnQuantized(Shape(dim, ffDim), seed = 5) - ) - val weights = ApertusQuantizedRuntimeWeights( - metadata = metadata, - tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10), - layers = listOf(layer), - outputNorm = ones(Shape(dim)), - outputWeight = randnQuantized(Shape(vocabSize, dim), seed = 11) - ) - val backend = ApertusCpuAttentionBackend( - ctx = ctx, metadata = metadata, dtype = FP32::class, ropeFreqBase = 12000000f - ) - val runtime = ApertusQuantizedRuntime( - ctx = ctx, weights = weights, attentionBackend = backend - ) - - val logits = runtime.forward(1) - assertEquals(2, logits.shape.rank, "logits should be 2D") - assertEquals(vocabSize, logits.shape[1], "logits dim should match vocab size") - val buf = logits.data.copyToFloatArray() - for (i in buf.indices) { - assertTrue(buf[i].isFinite(), "logit[$i] = ${buf[i]} 
is not finite") - } - } - - @Test - fun quantizedAndEagerProduceSameLogits() { - val metadata = buildMetadata() - val tokenEmb = randn(Shape(vocabSize, dim), seed = 10) - val outputNorm = ones(Shape(dim)) - - // Generate weight data arrays (same seeds for both runtimes) - val wqData = randnArray(Shape(dim, dim), seed = 1) - val wkData = randnArray(Shape(kvDim, dim), seed = 2) - val wvData = randnArray(Shape(kvDim, dim), seed = 3) - val woData = randnArray(Shape(dim, dim), seed = 4) - val ffnDownData = randnArray(Shape(dim, ffDim), seed = 5) - val ffnUpData = randnArray(Shape(ffDim, dim), seed = 6) - val outputWData = randnArray(Shape(vocabSize, dim), seed = 11) - - // --- Eager FP32 runtime --- - val eagerLayer = ApertusLayerWeights( - attnNorm = ones(Shape(dim)), - wq = ctx.fromFloatArray(Shape(dim, dim), FP32::class, wqData.copyOf()), - wk = ctx.fromFloatArray(Shape(kvDim, dim), FP32::class, wkData.copyOf()), - wv = ctx.fromFloatArray(Shape(kvDim, dim), FP32::class, wvData.copyOf()), - wo = ctx.fromFloatArray(Shape(dim, dim), FP32::class, woData.copyOf()), - qNorm = ones(Shape(headDim)), - kNorm = ones(Shape(headDim)), - ffnNorm = ones(Shape(dim)), - ffnDown = ctx.fromFloatArray(Shape(dim, ffDim), FP32::class, ffnDownData.copyOf()), - ffnUp = ctx.fromFloatArray(Shape(ffDim, dim), FP32::class, ffnUpData.copyOf()), - xieluParams = xieluParams - ) - val eagerWeights = ApertusRuntimeWeights( - metadata = metadata, tokenEmbedding = tokenEmb, - layers = listOf(eagerLayer), outputNorm = outputNorm, - outputWeight = ctx.fromFloatArray(Shape(vocabSize, dim), FP32::class, outputWData.copyOf()) - ) - val eagerBackend = ApertusCpuAttentionBackend( - ctx = ctx, weights = eagerWeights, dtype = FP32::class - ) - val eagerRuntime = ApertusRuntime( - ctx = ctx, weights = eagerWeights, - attentionBackend = eagerBackend, dtype = FP32::class - ) - - // --- Quantized runtime (F32 bytes = identity dequant, GGUF column-major layout) --- - val qLayer = ApertusQuantizedLayerWeights( - 
attnNorm = ones(Shape(dim)), - qNorm = ones(Shape(headDim)), - kNorm = ones(Shape(headDim)), - ffnNorm = ones(Shape(dim)), - xieluParams = xieluParams, - wq = asQuantizedGGUF(Shape(dim, dim), wqData.copyOf()), - wk = asQuantizedGGUF(Shape(kvDim, dim), wkData.copyOf()), - wv = asQuantizedGGUF(Shape(kvDim, dim), wvData.copyOf()), - wo = asQuantizedGGUF(Shape(dim, dim), woData.copyOf()), - ffnUp = asQuantizedGGUF(Shape(ffDim, dim), ffnUpData.copyOf()), - ffnDown = asQuantizedGGUF(Shape(dim, ffDim), ffnDownData.copyOf()) - ) - val qWeights = ApertusQuantizedRuntimeWeights( - metadata = metadata, tokenEmbedding = tokenEmb, - layers = listOf(qLayer), outputNorm = outputNorm, - outputWeight = asQuantizedGGUF(Shape(vocabSize, dim), outputWData.copyOf()) - ) - val qBackend = ApertusCpuAttentionBackend( - ctx = ctx, metadata = metadata, dtype = FP32::class - ) - val qRuntime = ApertusQuantizedRuntime( - ctx = ctx, weights = qWeights, attentionBackend = qBackend - ) - - // Compare logits - val eagerLogits = eagerRuntime.forward(1).data.copyToFloatArray() - val quantLogits = qRuntime.forward(1).data.copyToFloatArray() - - assertEquals(eagerLogits.size, quantLogits.size, "logit sizes should match") - for (i in eagerLogits.indices) { - val diff = kotlin.math.abs(eagerLogits[i] - quantLogits[i]) - assertTrue(diff < 1e-4f, "logit[$i] diff=$diff: eager=${eagerLogits[i]} vs quant=${quantLogits[i]}") - } - } - - @Test - fun quantizedGenerateProducesTokens() { - val metadata = buildMetadata() - val layer = ApertusQuantizedLayerWeights( - attnNorm = ones(Shape(dim)), - qNorm = ones(Shape(headDim)), - kNorm = ones(Shape(headDim)), - ffnNorm = ones(Shape(dim)), - xieluParams = xieluParams, - wq = randnQuantized(Shape(dim, dim), seed = 1), - wk = randnQuantized(Shape(kvDim, dim), seed = 2), - wv = randnQuantized(Shape(kvDim, dim), seed = 3), - wo = randnQuantized(Shape(dim, dim), seed = 4), - ffnUp = randnQuantized(Shape(ffDim, dim), seed = 6), - ffnDown = randnQuantized(Shape(dim, 
ffDim), seed = 5) - ) - val weights = ApertusQuantizedRuntimeWeights( - metadata = metadata, - tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10), - layers = listOf(layer), - outputNorm = ones(Shape(dim)), - outputWeight = randnQuantized(Shape(vocabSize, dim), seed = 11) - ) - val backend = ApertusCpuAttentionBackend( - ctx = ctx, metadata = metadata, dtype = FP32::class - ) - val runtime = ApertusQuantizedRuntime( - ctx = ctx, weights = weights, attentionBackend = backend - ) - - val generated = mutableListOf<Int>() - runtime.generate( - prompt = intArrayOf(1, 5, 3), steps = 4, temperature = 1.0f - ) { generated.add(it) } - - assertEquals(4, generated.size, "Should generate exactly 4 tokens") - for (tokenId in generated) { - assertTrue(tokenId in 0 until vocabSize, "Token $tokenId should be in [0, $vocabSize)") - } - } -} diff --git a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt b/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt deleted file mode 100644 index bf9381f0..00000000 --- a/llm-inference/apertus/src/commonTest/kotlin/sk/ainet/models/apertus/ApertusRuntimeSmokeTest.kt +++ /dev/null @@ -1,175 +0,0 @@ -package sk.ainet.models.apertus - -import sk.ainet.context.DefaultDataExecutionContext -import sk.ainet.lang.tensor.Shape -import sk.ainet.lang.tensor.Tensor -import sk.ainet.lang.types.FP32 -import kotlin.test.Test -import kotlin.test.assertEquals -import kotlin.test.assertTrue - -/** - * Smoke test: builds a tiny Apertus model (dim=8, 1 layer, vocab=16) - * with random weights and verifies that forward pass produces finite logits. 
- */ -class ApertusRuntimeSmokeTest { - - private val dim = 8 - private val ffDim = 16 - private val vocabSize = 16 - private val nHeads = 2 - private val kvHeads = 2 - private val headDim = dim / nHeads - private val kvDim = kvHeads * headDim - - private val ctx = DefaultDataExecutionContext() - - private fun ones(shape: Shape): Tensor { - val values = FloatArray(shape.volume) { 0.01f } - return ctx.fromFloatArray(shape, FP32::class, values) - } - - private fun randn(shape: Shape, seed: Int = 42): Tensor { - val rng = kotlin.random.Random(seed) - val values = FloatArray(shape.volume) { (rng.nextFloat() - 0.5f) * 0.1f } - return ctx.fromFloatArray(shape, FP32::class, values) - } - - @Test - fun forwardPassProducesFiniteLogits() { - val metadata = ApertusModelMetadata( - architecture = "apertus", - embeddingLength = dim, - contextLength = 32, - blockCount = 1, - headCount = nHeads, - kvHeadCount = kvHeads, - feedForwardLength = ffDim, - ropeDimensionCount = headDim, - vocabSize = vocabSize, - ropeTheta = 12000000f, - qkNorm = true, - hiddenAct = "xielu", - tiedEmbeddings = false - ) - - val layer = ApertusLayerWeights( - attnNorm = ones(Shape(dim)), - wq = randn(Shape(dim, dim), seed = 1), - wk = randn(Shape(kvDim, dim), seed = 2), - wv = randn(Shape(kvDim, dim), seed = 3), - wo = randn(Shape(dim, dim), seed = 4), - qNorm = ones(Shape(headDim)), - kNorm = ones(Shape(headDim)), - ffnNorm = ones(Shape(dim)), - ffnDown = randn(Shape(dim, ffDim), seed = 5), - ffnUp = randn(Shape(ffDim, dim), seed = 6), - xieluParams = ApertusXIELUParams( - alphaP = -0.5f, - alphaN = -0.3f, - beta = 0.8f, - eps = -5.0f - ) - ) - - val weights = ApertusRuntimeWeights( - metadata = metadata, - tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10), - layers = listOf(layer), - outputNorm = ones(Shape(dim)), - outputWeight = randn(Shape(vocabSize, dim), seed = 11) - ) - - val backend = ApertusCpuAttentionBackend( - ctx = ctx, - weights = weights, - dtype = FP32::class, - ropeFreqBase = 
12000000f - ) - - val runtime = ApertusRuntime( - ctx = ctx, - weights = weights, - attentionBackend = backend, - dtype = FP32::class - ) - - // Forward pass with token ID 1 (BOS) - val logits = runtime.forward(1) - - // Verify shape: should be [1, vocabSize] - assertEquals(2, logits.shape.rank, "logits should be 2D") - assertEquals(vocabSize, logits.shape[1], "logits dim should match vocab size") - - // Verify all values are finite - val buf = logits.data.copyToFloatArray() - for (i in buf.indices) { - assertTrue(buf[i].isFinite(), "logit[$i] = ${buf[i]} is not finite") - } - } - - @Test - fun generateProducesTokens() { - val metadata = ApertusModelMetadata( - architecture = "apertus", - embeddingLength = dim, - contextLength = 32, - blockCount = 1, - headCount = nHeads, - kvHeadCount = kvHeads, - feedForwardLength = ffDim, - ropeDimensionCount = headDim, - vocabSize = vocabSize, - ropeTheta = 12000000f - ) - - val layer = ApertusLayerWeights( - attnNorm = ones(Shape(dim)), - wq = randn(Shape(dim, dim), seed = 1), - wk = randn(Shape(kvDim, dim), seed = 2), - wv = randn(Shape(kvDim, dim), seed = 3), - wo = randn(Shape(dim, dim), seed = 4), - qNorm = ones(Shape(headDim)), - kNorm = ones(Shape(headDim)), - ffnNorm = ones(Shape(dim)), - ffnDown = randn(Shape(dim, ffDim), seed = 5), - ffnUp = randn(Shape(ffDim, dim), seed = 6), - xieluParams = ApertusXIELUParams(-0.5f, -0.3f, 0.8f, -5.0f) - ) - - val weights = ApertusRuntimeWeights( - metadata = metadata, - tokenEmbedding = randn(Shape(vocabSize, dim), seed = 10), - layers = listOf(layer), - outputNorm = ones(Shape(dim)), - outputWeight = randn(Shape(vocabSize, dim), seed = 11) - ) - - val backend = ApertusCpuAttentionBackend( - ctx = ctx, - weights = weights, - dtype = FP32::class - ) - - val runtime = ApertusRuntime( - ctx = ctx, - weights = weights, - attentionBackend = backend, - dtype = FP32::class - ) - - val generated = mutableListOf<Int>() - runtime.generate( - prompt = intArrayOf(1, 5, 3), - steps = 4, - 
temperature = 1.0f - ) { tokenId -> - generated.add(tokenId) - } - - assertEquals(4, generated.size, "Should generate exactly 4 tokens") - for (tokenId in generated) { - assertTrue(tokenId in 0 until vocabSize, "Token $tokenId should be in [0, $vocabSize)") - } - } -}