From 43e997623b9d4459e30d470c5205e50ddc980902 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Thu, 4 Jun 2026 20:50:16 +0200
Subject: [PATCH 1/2] chore(release): prepare 0.27.0

Bump VERSION_NAME to 0.27.0, add the 0.27.0 CHANGELOG section
(StableHLO/HLO converter work: full gemma3 lowers to StableHLO and
compiles to vmfb), update install snippets and README "What's New" to
0.27.0, regenerate operator reference docs, and remove rfc.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                                  |  19 ++
 README.md                                     |  13 +-
 .../modules/ROOT/pages/how-to/io-readers.adoc |   4 +-
 .../pages/how-to/java-model-training.adoc     |   2 +-
 .../reference/operators/generated/index.adoc  |   2 +-
 .../pages/reference/ops-status-matrix.adoc    |   2 +-
 .../pages/tutorials/java-getting-started.adoc |   4 +-
 gradle.properties                             |   2 +-
 rfc.md                                        | 228 ------------------
 9 files changed, 34 insertions(+), 242 deletions(-)
 delete mode 100644 rfc.md
diff --git a/CHANGELOG.md b/CHANGELOG.md
index ef7648af..f1310a26 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,25 @@
 
 ## [Unreleased]
 
+## [0.27.0] - 2026-06-04
+
+### Added
+
+- **Full gemma3 network lowers to StableHLO with zero gaps.** A batch of new core converters closes every remaining op gap on the Kotlin DSL → MLIR StableHLO path, so a complete gemma3 graph traces and lowers end-to-end (verified by `GemmaTraceTest` over the composite build: 140 nodes → 255 lines, 0 unsupported, 0 arity errors):
+  - **`scaledDotProductAttention` converter** (`AttentionOperationsConverter`) — lowers the atomic SDPA op to the standard StableHLO subgraph: `scores = Q·Kᵀ` (`dot_general`, contract `head_dim`), `* scale` (arg or `1/sqrt(head_dim)`), numerically stable softmax over key length, `out = attn·V`. Batched `[..,S,D]` with all leading dims as batching dims. SDPA is a core `TensorOps` op, so its converter lives in core.
+  - **Causal mask for SDPA** — when the node's `causal` attr is set, emits an additive `-inf` mask before softmax (`iota`/`compare GE`/`select`) so each query attends only to keys at or before it. Validated as an exact match against a NumPy causal reference and accepted by `iree-compile`.
+  - **Explicit SDPA mask operand** — `AttentionOperationsConverter` now consumes `operands[3]` (the additive mask), broadcasting it trailing-aligned to the scores shape before softmax. Fixes gemma sliding-window layers (`causal=false` + explicit causal+window mask) that previously exported unmasked and attended to future tokens. (PR #661)
+  - **`permute` + `narrow` converters** — `permute` is registered as an arbitrary-axis alias routed to the existing transpose path; `narrow` gains a slicing converter.
+  - **Multi-output converter support + `split` converter** — per-`(nodeId, outputPort)` SSA naming in `ConversionContext`, operand resolution that walks incoming edges by `destinationInputIndex` and resolves each by the edge's `sourceOutputIndex`, and `split`/`chunk` lowering to N `stablehlo.slice` ops each registered on its own output port (closes the RoPE `split` gap).
+- **Boxing-free `FloatArray` weight externalization for `.irpa` baking.** `finalize()` now stores resolved weights as the primitive `FloatArray` instead of `.toList()` (boxing a real LLM weight — e.g. a 262153×640 embedding → ~2.7 GB `List<Float>` — OOMed the trace). `ConstantOperationsConverter` externalizes `FloatArray` directly (`floatArrayToLittleEndianBytes` + `tryMaterializeExternalFloats`, inlining small/`InlineAlways` tensors via `asList()`), and `IrpaWriter` writes byte ranges in one shot. With this, the real Gemma-270M function bakes: 1 func arg (tokens) + 360 weights externalized to `util.global #flow.parameter.named`.
+- **DSL prescribes element dtype for placeholder weights.** The DSL can now specify the element dtype for placeholder weights during tracing.
+- **Numerical validation harness for SDPA lowering.** Dumps a small `scaledDotProductAttention` StableHLO graph; the `iree-compile` + `iree-run-module` output matches a NumPy reference to 5 decimal places, confirming the attention converter is numerically correct, not just structurally valid.
+
+### Fixed
+
+- **IREE-valid StableHLO syntax — full gemma3 compiles to `vmfb`.** Aligned converter emission to what `iree-compile`'s StableHLO parser accepts (verified by compiling the full gemma3 graph end-to-end): `gather` uses the generic MLIR form, `slice`/`narrow`/`split` use the canonical bracket form via a shared `sliceLine()` helper, `concatenate` emits the full functional type, and batch matmul derives batch dims as `min(lhsRank,rhsRank)-2` (fixes 3D-activation @ 2D-weight Linear projections). Result: SKaiNET gemma3 DSL → StableHLO → `iree-compile` (llvm-cpu; +neon aarch64) → `vmfb` for both host x64 and aarch64 targets.
+- **`VoidTensorOps.gather` output shape for multi-dim indices.** The void/tracing gather collapsed the gathered axis to `indices.shape[0]`, so a `[vocab,emb]` table with `[batch,seq]` indices traced to `[batch,emb]` instead of `[batch,seq,emb]` (breaking the embedding's downstream reshape during weight-free tracing). Now replaces the axis with the full indices shape, matching `DefaultCpuOps.gather`, unblocking tracing of full transformer (gemma3) graphs.
+
 ## [0.26.0] - 2026-05-30
 
 ### Added
diff --git a/README.md b/README.md
index cfcd7dda..a53759fa 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Add the core dependencies (Gradle Kotlin DSL):
 ```kotlin
 dependencies {
     // Recommended: import the umbrella BOM and drop versions on the engine modules.
-    implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+    implementation(platform("sk.ainet:skainet-bom:0.27.0"))
 
     implementation("sk.ainet.core:skainet-lang-core")
     implementation("sk.ainet.core:skainet-backend-cpu")
@@ -193,15 +193,16 @@ deployment, the StableHLO path for native and edge targets.
 
 ---
 
-## What's New in 0.26.0
+## What's New in 0.27.0
 
-- **Q4_0 is now a first-class quantized format.** The older GGML 4-bit format joins Q8_0 / Q4_K across the full provider stack: a heap `Q4_0TensorData` any loader can produce, a `Q4_0MatmulKernel` SPI with scalar / Panama-Vector / native-FFM implementations auto-selected by `KernelRegistry`, and a `Q4_0Quantizer` to pack dense FP32 weights into canonical ggml Q4_0 without going through GGUF. (PRs #648–#651)
-- **`tanh` is now a first-class activation primitive.** Promoted from a `NotImplementedError` stub to a fully wired `@Diff @ActivationDsl` op — `TensorOps` interface, `Tensor.tanh()` extension, CPU backend, recording decorator, and autograd backward (`1 - output^2`) — so downstream consumers no longer re-derive the `2*sigmoid(2x)-1` polyfill. Pinned end-to-end by a micrograd tanh-MLP training test on the moons dataset. (Issue #630, PR #631)
-- **CPU tensor `convert` op.** Dtype conversion now has a real CPU backend implementation. (PR #636)
-- Plus test, build, and CI hygiene: portable KMP `@Ignore` for common tests, restored BatchNorm coverage, Gradle build-warning cleanup, and narrower feature-PR CI triggers. (PRs #633, #634, #638, #640, #645)
+- **A full gemma3 network now lowers to StableHLO and compiles to an IREE `vmfb`.** A batch of new core converters closes every remaining op gap on the Kotlin DSL → MLIR StableHLO path, so a complete gemma3 graph traces and lowers end-to-end with zero gaps (verified by `GemmaTraceTest`: 140 nodes → 255 lines, 0 unsupported), then compiles through `iree-compile` (llvm-cpu; +neon aarch64) to a `vmfb` for both host x64 and aarch64.
+- **`scaledDotProductAttention` converter.** Lowers the atomic SDPA op to the standard StableHLO subgraph (`Q·Kᵀ` → scale → stable softmax → `attn·V`), with causal-mask emission and explicit additive-mask operand support (fixes gemma sliding-window layers). Numerically validated as an exact match against a NumPy reference via `iree-run-module`.
+- **`permute`, `narrow`, and multi-output `split` converters.** Per-`(nodeId, outputPort)` SSA naming and edge-accurate operand resolution let a consumer of a multi-output op (e.g. RoPE's `split`) get the right output port.
+- **Boxing-free `FloatArray` weight externalization for `.irpa` baking.** Resolved weights stay primitive `FloatArray` (no `List<Float>` boxing that OOMed multi-GB embeddings); the real Gemma-270M function bakes its 360 weights to `util.global #flow.parameter.named`.
 
 ### Recent releases
 
+- **0.26.0** — Q4_0 promoted to a first-class quantized format across the provider stack, `tanh` as a first-class activation primitive, and a CPU tensor `convert` op, plus test/build/CI hygiene. (PRs #648–#651, #631, #636)
 - **0.25.0** — BF16 and Q8_0 matmul kernels end-to-end across the provider stack, autograd completeness for `pow`/`log` and the conv/pool/upsample/split family, the hybrid adaptive dtype-constraint DSL, the `@DarcValidated` operator-doc flag, and the SentencePiece special-token splitter. (PRs #595, #605–#628)
 - **0.23.0** — Real-model GGUFs no longer OOM at network construction (lazy `TensorDataFactory.placeholder(...)`); Kotlin/Native can finally load GGUFs over 2 GiB via the new POSIX-`pread`-backed `PosixPreadRandomAccessSource`. (Issues #587, #589; PRs #588, #591)
 - **0.22.2** — `sk.ainet:skainet-bom` now resolves from Maven Central (earlier versions shipped at the wrong coordinates). (Issue #584)
diff --git a/docs/modules/ROOT/pages/how-to/io-readers.adoc b/docs/modules/ROOT/pages/how-to/io-readers.adoc
index 33a5eab8..c98cea37 100644
--- a/docs/modules/ROOT/pages/how-to/io-readers.adoc
+++ b/docs/modules/ROOT/pages/how-to/io-readers.adoc
@@ -20,7 +20,7 @@ Add the following dependencies to your `build.gradle.kts`:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+    implementation(platform("sk.ainet:skainet-bom:0.27.0"))
 
     implementation("sk.ainet.core:skainet-io-gguf")
     implementation("org.jetbrains.kotlinx:kotlinx-io-core:0.8.2")
@@ -32,7 +32,7 @@ dependencies {
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+    implementation(platform("sk.ainet:skainet-bom:0.27.0"))
 
     implementation("sk.ainet.core:skainet-io-onnx")
     implementation("org.jetbrains.kotlinx:kotlinx-io-core:0.8.2")
diff --git a/docs/modules/ROOT/pages/how-to/java-model-training.adoc b/docs/modules/ROOT/pages/how-to/java-model-training.adoc
index 5fbde986..4a673f7e 100644
--- a/docs/modules/ROOT/pages/how-to/java-model-training.adoc
+++ b/docs/modules/ROOT/pages/how-to/java-model-training.adoc
@@ -23,7 +23,7 @@ This guide covers building neural networks, defining loss functions and optimize
         <dependency>
             <groupId>sk.ainet</groupId>
             <artifactId>skainet-bom</artifactId>
-            <version>0.26.0</version>
+            <version>0.27.0</version>
             <type>pom</type>
             <scope>import</scope>
         </dependency>
diff --git a/docs/modules/ROOT/pages/reference/operators/generated/index.adoc b/docs/modules/ROOT/pages/reference/operators/generated/index.adoc
index 0b065e09..111112b8 100644
--- a/docs/modules/ROOT/pages/reference/operators/generated/index.adoc
+++ b/docs/modules/ROOT/pages/reference/operators/generated/index.adoc
@@ -1,6 +1,6 @@
 = AI-NET Operators Reference
 
-Generated from version `0.26.0` on 2026-05-30
+Generated from version `0.27.0` on 2026-06-04
 
 == Operators by Modality
 
diff --git a/docs/modules/ROOT/pages/reference/ops-status-matrix.adoc b/docs/modules/ROOT/pages/reference/ops-status-matrix.adoc
index 945f3893..e12c5532 100644
--- a/docs/modules/ROOT/pages/reference/ops-status-matrix.adoc
+++ b/docs/modules/ROOT/pages/reference/ops-status-matrix.adoc
@@ -1,7 +1,7 @@
 = Operator Coverage Matrix
 :description: Cross-backend status for every operator function in SKaiNET.
 
-Generated from `operators.json` version `0.26.0` on 2026-05-30.
+Generated from `operators.json` version `0.27.0` on 2026-06-04.
 
 Rows are `Operator.function` pairs. The `Validated` column shows whether the function's documentation has been DARC-validated by a reviewer (see xref:contributing/darc-workflow.adoc[DARC workflow]). Remaining columns are backends that appear in any function's `statusByBackend` map — a missing entry means the backend makes no claim about the function (treat it as "unknown", not "not supported").
 
diff --git a/docs/modules/ROOT/pages/tutorials/java-getting-started.adoc b/docs/modules/ROOT/pages/tutorials/java-getting-started.adoc
index 5d6bdc6c..0b4139bb 100644
--- a/docs/modules/ROOT/pages/tutorials/java-getting-started.adoc
+++ b/docs/modules/ROOT/pages/tutorials/java-getting-started.adoc
@@ -46,7 +46,7 @@ The `skainet-bom` manages all SKaiNET module versions so you never have to keep
 ----
 <project>
     <properties>
-        <skainet.version>0.26.0</skainet.version>
+        <skainet.version>0.27.0</skainet.version>
     </properties>
 
     <dependencyManagement>
@@ -144,7 +144,7 @@ repositories {
 
 dependencies {
     // Import BOM for version alignment
-    implementation(platform("sk.ainet:skainet-bom:0.26.0"))
+    implementation(platform("sk.ainet:skainet-bom:0.27.0"))
 
     // Core tensor library
     implementation("sk.ainet:skainet-lang-core-jvm")
diff --git a/gradle.properties b/gradle.properties
index 42bd4bef..1a99b287 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.core
-VERSION_NAME=0.26.0
+VERSION_NAME=0.27.0
 POM_DESCRIPTION=SKaiNET
 
 POM_URL=https://github.com/SKaiNET-developers/skainet/
diff --git a/rfc.md b/rfc.md
deleted file mode 100644
index aefc8aba..00000000
--- a/rfc.md
+++ /dev/null
@@ -1,228 +0,0 @@
-# The SKaiNET DType Model
-
-> **Status**: shipped in [#615](https://github.com/SKaiNET-developers/SKaiNET/issues/615) / [#616](https://github.com/SKaiNET-developers/SKaiNET/pull/616). This document was originally the **RFC** that proposed the hybrid adaptive DSL with optional dtype constraints; now that the design is implemented, the page explains *how the model works* and *what to use when*.
->
-> For the maintainer-facing reference (every concept mapped to its SKaiNET file path), see [`docs/modules/ROOT/pages/contributing/dtype-model.adoc`](docs/modules/ROOT/pages/contributing/dtype-model.adoc).
-
-## TL;DR
-
-SKaiNET is **architecture-first** by default — your DSL describes the model, dtype follows whatever the file actually stored. When some op or backend genuinely *requires* a specific dtype (NPU int8, a fused BF16 attention kernel, …), you attach a small `DTypePolicy` instead of rewriting the model. The loader or the constraint-resolution pass either satisfies the policy or **fails before forward execution** — never silently during it.
-
-Four moving parts:
-
-1. **`DTypePolicy`** — the four-arm sealed type (`Any` / `Require` / `Prefer` / `OneOf`) you attach to loaders, ops, or graph nodes.
-2. **Loaders** — `SafeTensorsParametersLoader.withPolicy(policy)` and `StreamingGgufParametersLoader.withPolicy(policy)` enforce the policy at load time.
-3. **`DTypeConstraintResolutionPass`** — runs inside the graph optimization pipeline before fusion; enforces per-node policies and produces a `ResolvedComputeGraph`.
-4. **`KernelStrictness`** + `KernelProvider.supports(...)` — runtime fail-fast for cases where graph-prep didn't run.
-
-## The four dtype concepts
-
-Every tensor in SKaiNET carries dtype information at four conceptual stages of its life. Each stage is implemented somewhere concrete.
-
-```mermaid
-flowchart LR
-    File[(Model file<br/>.gguf / .safetensors)]
-    File -->|"GGMLQuantizationType<br/>SafeTensors DataType"| Source["source dtype<br/>(what the file stores)"]
-    Source -->|"loader picks TensorData subtype<br/>per source dtype + policy"| Logical["logical dtype<br/>Tensor.dtype: KClass&lt;T&gt;<br/>(what the engine sees)"]
-    Logical -->|"DSL declares per-op constraint<br/>via dtypePolicy(...)"| Required["required dtype<br/>(what the op/backend needs)"]
-    Required -->|"constraint resolution +<br/>KernelRegistry.bestAvailable"| Lowered["lowered dtype<br/>(what the kernel actually gets)"]
-    Lowered --> Kernel[(SIMD kernel<br/>Panama / scalar / native)]
-```
-
-| Stage | Lives in | Notes |
-|---|---|---|
-| source dtype | `GGMLQuantizationType`, SafeTensors `DataType` | what's on disk (`F32`, `BF16`, `Q4_K`, `Q8_0`, …) |
-| logical dtype | `Tensor<T : DType, V>.dtype: KClass<T>` | explicit metadata, never inferred from packed-byte shape |
-| required dtype | `DTypePolicy.Require(dt)` etc. on DSL node `attributes["dtype_policy"]` | optional; absent = adaptive |
-| lowered dtype | whatever `KernelRegistry.bestAvailable()?.matmul*()` returns | post-resolution; matches a registered kernel |
-
-The whole point of the four-stage split is to keep the loader's job (what does the file say?) separate from the op's job (what dtype do I need?) separate from the runtime's job (what kernel do I actually have?). Each can change independently.
-
-## When to use which `DTypePolicy`
-
-Use this decision tree:
-
-```mermaid
-flowchart TD
-    Q["I'm declaring a tensor or op —<br/>what DTypePolicy do I attach?"]
-    Q --> Q1{"Does my code work<br/>with any dtype the file<br/>happens to provide?"}
-    Q1 -->|yes — this is the common case| Any["DTypePolicy.Any<br/><br/>(or omit entirely —<br/>Any is the default)"]
-    Q1 -->|no| Q2{"Is there exactly<br/>one acceptable dtype?"}
-    Q2 -->|yes| Q3{"Hard requirement<br/>or soft preference?"}
-    Q3 -->|hard| Require["DTypePolicy.Require(dt)<br/><br/>fail-fast at load/compile<br/>if dtype can't be made available"]
-    Q3 -->|soft| Prefer["DTypePolicy.Prefer(dt)<br/><br/>use dt if cheap,<br/>otherwise warn + fall through"]
-    Q2 -->|"no — small set"| OneOf["DTypePolicy.OneOf(set)<br/><br/>accept any dtype in the set;<br/>convert from outside if possible"]
-```
-
-Concrete examples:
-
-| Situation | Policy |
-|---|---|
-| "Load this GGUF however it ships." | `DTypePolicy.Any` — adaptive default; same model definition loads Q4_K, Q8_0, or FP16. |
-| "This SafeTensors file *must* keep BF16 native because my matmul kernel routes on it." | `DTypePolicy.Require(BF16)` |
-| "I'd prefer BF16 to avoid the 2× memory cost, but FP32 is fine if BF16 isn't available." | `DTypePolicy.Prefer(BF16)` |
-| "My attention kernel accepts either FP32 or BF16, nothing else." | `DTypePolicy.OneOf(setOf(FP32, BF16))` |
-| "NPU backend only runs int8; reject anything else at load." | `DTypePolicy.Require(Int8)` (fails fast today — no Int8 cast kernel ships in #615) |
-
-## Loader workflow: file → policy → tensor
-
-Both loaders (SafeTensors and GGUF) accept the same `DTypePolicy` shape. They validate it eagerly at construction time, then enforce it per-tensor as they iterate the file.
-
-```mermaid
-flowchart TD
-    Start([Open model file]) --> Build["SafeTensorsParametersLoader.withPolicy(policy)<br/>or<br/>StreamingGgufParametersLoader.withPolicy(policy)"]
-    Build --> Validate{Policy<br/>satisfiable<br/>by this loader?}
-    Validate -->|no — e.g. Require(FP16) on GGUF| FailEarly[/IllegalArgumentException<br/>before any tensor is read/]
-    Validate -->|yes| Iter[Iterate tensors]
-    Iter --> Source{Source dtype<br/>vs policy}
-    Source -->|"Any, or match"| Native["Native TensorData subtype<br/>Q4_KBlockTensorData /<br/>Q8_0BlockTensorData /<br/>Bf16DenseTensorData /<br/>FloatArrayTensorData"]
-    Source -->|"Require mismatch +<br/>no cast kernel"| FailLoad[/IllegalArgumentException<br/>fail at load/]
-    Source -->|"Prefer mismatch"| Soft[Warn + dequant to fallback]
-    Native --> Tensor([Tensor with explicit<br/>logical shape + dtype])
-    Soft --> Tensor
-```
-
-Key property: **logical shape is set from the file header, not from the packed-byte length**. A Q4_K tensor's `Q4_KBlockTensorData.shape` is its multi-dimensional logical shape; its `packedData: ByteArray` is the implementation detail. The graph sees the logical shape.
-
-## Graph workflow: DSL → policy → resolved graph → HLO
-
-Once a tensor is in the engine, the DSL lets you attach per-op or per-node policies. The constraint-resolution pass enforces them at graph-prep time, then the resolved graph flows into the HLO converter (and any future backend).
-
-```mermaid
-flowchart TD
-    DSL["dag {<br/>  val mm = op(<br/>    matmul,<br/>    inputs = listOf(x, w),<br/>    dtypePolicy = DTypePolicy.Require(BF16)<br/>  )<br/>}"]
-    DSL -->|"writes attributes['dtype_policy']"| Program[GraphProgram]
-    Program -->|"GraphProgramCompiler<br/>preserves attributes → metadata"| CG[ComputeGraph]
-    CG --> Pipeline[GraphOptimizationPipeline]
-    Pipeline -->|"first pass —<br/>before fusion"| Pass[DTypeConstraintResolutionPass]
-    Pass --> Visit{Node policy<br/>vs input dtype}
-    Visit -->|Any / match| Mark["mark metadata<br/>dtype_resolved = true"]
-    Visit -->|Require mismatch| Throw[/DtypeConstraintViolationException<br/>before forward execution/]
-    Visit -->|Prefer mismatch| Warn["diagnostic in<br/>GraphOptimizationResult"]
-    Mark --> Fusion["fusion passes see<br/>dtype-resolved nodes"]
-    Warn --> Fusion
-    Fusion --> Resolved[ResolvedComputeGraph wrapper]
-    Resolved -->|"validate() check —<br/>requireValid()"| HLO[toStableHlo<br/>byte-identical output<br/>to ComputeGraph overload]
-```
-
-The `dtype_resolved` marker is the proof that the pass ran. The `ResolvedComputeGraph` wrapper's `validate()` checks for it; the `toStableHlo(ResolvedComputeGraph)` overload calls `validate()` by default.
-
-## Runtime kernel dispatch + fail-fast
-
-Inside `ctx.ops.matmul(a, b)`, the runtime walks the registered providers by priority. If nothing matches and strict mode is on, you get a clean error instead of a silent scalar fallback.
-
-```mermaid
-flowchart LR
-    Call["ctx.ops.matmul(a, b)"] --> Ops["DefaultCpuOpsJvm.matmul<br/>(dtype dispatch)"]
-    Ops --> Q[chooseQuantizedMatmul]
-    Q -->|"recognized quantized<br/>data class match"| Hit1[Run quantized SPI kernel]
-    Q -->|no match| F32[chooseMatmul → fp32MatmulKernel]
-    F32 -->|"always non-null<br/>(falls back to scalar)"| Hit2[Run FP32 SPI kernel]
-    F32 -->|"impossible today<br/>(but tracked for future)"| Strict{strict mode?<br/>-Dskainet.strict.kernels=true}
-    Strict -->|on| Bang[/NoSuchKernelException/]
-    Strict -->|off — default| Silent["super.matmul<br/>(silent scalar fallback)"]
-```
-
-```mermaid
-flowchart TD
-    subgraph Reg["KernelRegistry (sorted by priority)"]
-        P100["NativeKernelProvider — priority 100<br/>(planned, native FFM)"]
-        P50["PanamaVectorKernelProvider — priority 50<br/>(JDK 21+ Vector API)"]
-        P0["ScalarKernelProvider — priority 0<br/>(always available)"]
-    end
-    Ask["For (matmul, [Float32, Q8_0]):<br/>walk providers, ask<br/>provider.matmulQ8_0() != null"]
-    Ask --> P100
-    P100 -->|"isAvailable() && matmulQ8_0() != null"| Win[picked]
-    P100 -->|null| P50
-    P50 -->|"matmulQ8_0() != null"| Win
-    P50 -->|null| P0
-    P0 -->|"null for Q8_0"| None["no kernel —<br/>fail-fast (strict) or<br/>silent fallback (default)"]
-```
-
-`KernelProvider.supports(opName, dtypeKeys)` is the introspection query the resolution pass uses to decide whether a `Require` constraint can be satisfied via an existing kernel.
-
-## End-to-end: putting it all together
-
-A worked example showing all four layers in one inference session:
-
-```kotlin
-import sk.ainet.context.DirectCpuExecutionContext
-import sk.ainet.io.RandomAccessSource
-import sk.ainet.io.safetensors.SafeTensorsParametersLoader
-import sk.ainet.lang.dag.dag
-import sk.ainet.lang.dag.op
-import sk.ainet.lang.tensor.ops.MatmulOperation
-import sk.ainet.lang.tensor.ops.TensorSpec
-import sk.ainet.lang.types.BF16
-import sk.ainet.lang.types.DTypePolicy
-import sk.ainet.lang.types.FP32
-
-// 1. LOAD with an explicit dtype policy
-val ctx = DirectCpuExecutionContext.create()
-val loader = SafeTensorsParametersLoader.withPolicy(
-    sourceProvider = { RandomAccessSource.open("model.safetensors") },
-    policy = DTypePolicy.Require(BF16),       // keep BF16 native, fail if file lacks it
-)
-loader.load(ctx, BF16::class) { name, tensor ->
-    // tensor.dtype == BF16::class
-    // tensor.data is Bf16DenseTensorData with explicit logical shape
-    registerWeight(name, tensor)
-}
-
-// 2. DECLARE the graph with a per-op policy
-val program = dag {
-    val input = input<FP32>("input", TensorSpec("input", listOf(1, 4096), "FP32"))
-    val weight = parameter<BF16, Float>("attn_proj") { shape(4096, 4096) { ones() } }
-    val projection = op(
-        operation = MatmulOperation<FP32, Float>(),
-        inputs = listOf(input, weight),
-        dtypePolicy = DTypePolicy.Require(BF16),   // attn projection must run BF16
-    )
-    output(projection.first())
-}
-
-// 3. COMPILE — constraint resolution runs before fusion
-val graph = GraphProgramCompiler().compile(program)        // ComputeGraph
-val resolved = GraphOptimizationPipeline.createDefault()
-    .optimize(graph)                                       // includes DTypeConstraintResolutionPass
-    .graph                                                 // throws DtypeConstraintViolationException if mismatch
-
-// 4. EXECUTE — runtime fail-fast as a backstop
-System.setProperty("skainet.strict.kernels", "true")       // optional: surface missing kernels
-val output = ctx.ops.matmul(inputTensor, weightTensor)     // dispatch via KernelRegistry
-```
-
-Each layer enforces the contract for the layer below:
-
-- The loader guarantees every produced tensor has the right *source*-loaded dtype.
-- The resolution pass guarantees every graph node has the right *required* dtype on its inputs (or fails).
-- The runtime dispatch guarantees the right *lowered* kernel runs (or fails if strict mode is on).
-
-## Where the implementation lives
-
-| Piece | Path |
-|---|---|
-| `DTypePolicy` sealed type | `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/types/DTypePolicy.kt` |
-| `SafeTensorsParametersLoader.withPolicy(...)` | `skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/SafeTensorsParametersLoader.kt` |
-| `StreamingGgufParametersLoader.withPolicy(...)` | `skainet-io/skainet-io-gguf/src/commonMain/kotlin/sk/ainet/io/gguf/StreamingGgufParametersLoader.kt` |
-| `dag { ... dtypePolicy(...) }` DSL extension | `skainet-lang/skainet-lang-dag/src/commonMain/kotlin/sk/ainet/lang/dag/DtypePolicyDsl.kt` |
-| `DTypeConstraintResolutionPass` | `skainet-compile/skainet-compile-opt/src/commonMain/kotlin/sk/ainet/compile/opt/passes/DTypeConstraintResolutionPass.kt` |
-| `ResolvedComputeGraph` | `skainet-compile/skainet-compile-dag/src/commonMain/kotlin/sk/ainet/lang/graph/ResolvedComputeGraph.kt` |
-| `toStableHlo(ResolvedComputeGraph)` overload | `skainet-compile/skainet-compile-hlo/src/commonMain/kotlin/sk/ainet/compile/hlo/dag2hlo.kt` |
-| `KernelProvider.supports(...)` capability query | `skainet-backends/skainet-backend-api/src/commonMain/kotlin/sk/ainet/backend/api/kernel/KernelProvider.kt` |
-| `KernelStrictness` system-property fail-fast | `skainet-backends/skainet-backend-api/src/jvmMain/kotlin/sk/ainet/backend/api/kernel/KernelStrictness.kt` |
-| Runtime check in `ctx.ops.matmul` | `skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/DefaultCpuOpsJvm.kt` |
-
-## What's intentionally not here
-
-Three categories of work that the model is *shaped for* but doesn't ship today:
-
-- **Cast kernels** (Q4_K → Int8, FP32 → BF16, …). When a `Require` constraint needs a cast that isn't registered, the resolution pass fails fast — exactly what the RFC prescribed. Concrete casts are bound up with precision / lossy-conversion policy and live in their own track.
-- **Layout-aware capability queries** on `KernelProvider`. The `supports(opName, dtypeKeys)` API is dtype-aware only; future layout-aware variants are a follow-up.
-- **NPU backend and MLIR / native code lowering**. The compiled path terminates at StableHLO today.
-
-## Related
-
-- [`docs/.../contributing/dtype-model.adoc`](docs/modules/ROOT/pages/contributing/dtype-model.adoc) — maintainer-facing reference: every concept's file path, the loader audit tables, the anti-patterns the model prevents.
-- [`docs/.../contributing/benchmarks.adoc`](docs/modules/ROOT/pages/contributing/benchmarks.adoc) — engine benchmark program that exercises the kernel SPI the dispatch chain calls into.
-- [Issue #615](https://github.com/SKaiNET-developers/SKaiNET/issues/615) / [PR #616](https://github.com/SKaiNET-developers/SKaiNET/pull/616) — implementation history.

From 9de8634e534acfbef8c8f3d38e4b7ff26563f6f2 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Thu, 4 Jun 2026 21:50:26 +0200
Subject: [PATCH 2/2] fix(hlo test): write SDPA MLIR dump to a portable temp
 path

SdpaNumericDumpTest defaulted its output to a hardcoded developer path
(/home/miso/projects/coral/build-mlir/sdpa.mlir), causing a
FileNotFoundException on any other machine. Default to
skainet-mlir/sdpa.mlir under the JVM temp dir (java.io.tmpdir) while
still honoring the sdpaMlirOut system property override.
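
A minimal sketch of the fallback as a standalone helper, for
illustration only. The name resolveDumpFile is hypothetical and is not
part of this patch; the test inlines the equivalent expression.

```kotlin
import java.io.File

// Hypothetical helper (not in the patch): prefer the system property
// override, otherwise fall back to a path under the JVM temp directory.
fun resolveDumpFile(propertyName: String, defaultRelativePath: String): File {
    val override = System.getProperty(propertyName)
    return if (override != null) {
        File(override)
    } else {
        File(System.getProperty("java.io.tmpdir"), defaultRelativePath)
    }
}

fun main() {
    val out = resolveDumpFile("sdpaMlirOut", "skainet-mlir/sdpa.mlir")
    out.parentFile?.mkdirs()               // create skainet-mlir/ if missing
    out.writeText("// placeholder MLIR\n") // stand-in for the converter output
    println("WROTE_SDPA ${out.absolutePath}")
}
```

Setting the sdpaMlirOut system property on the test JVM still redirects
the dump; leaving it unset keeps the output under java.io.tmpdir.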

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../kotlin/sk/ainet/compile/hlo/SdpaNumericDumpTest.kt       | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/skainet-compile/skainet-compile-hlo/src/jvmTest/kotlin/sk/ainet/compile/hlo/SdpaNumericDumpTest.kt b/skainet-compile/skainet-compile-hlo/src/jvmTest/kotlin/sk/ainet/compile/hlo/SdpaNumericDumpTest.kt
index 3aac9553..5c3d9b23 100644
--- a/skainet-compile/skainet-compile-hlo/src/jvmTest/kotlin/sk/ainet/compile/hlo/SdpaNumericDumpTest.kt
+++ b/skainet-compile/skainet-compile-hlo/src/jvmTest/kotlin/sk/ainet/compile/hlo/SdpaNumericDumpTest.kt
@@ -44,7 +44,10 @@ class SdpaNumericDumpTest {
         g.addEdge(GraphEdge("e2", v, sdpa, 0, 2, v.outputs[0]))
 
         val mlir = StableHloConverterFactory.createBasic().convert(g, "sdpa").content
-        val out = File(System.getProperty("sdpaMlirOut") ?: "/home/miso/projects/coral/build-mlir/sdpa.mlir")
+        val out = File(
+            System.getProperty("sdpaMlirOut")
+                ?: File(System.getProperty("java.io.tmpdir"), "skainet-mlir/sdpa.mlir").path,
+        )
         out.parentFile?.mkdirs()
         out.writeText(mlir)
         println("WROTE_SDPA ${out.absolutePath}")