Eager NATIVE_OPTIMIZED: keep Q8_0 matmul weights packed (pre-transpose marker) so gemma fits + runs fast on the SL2610

## Summary
The eager (`OptimizedLLMRuntime` DIRECT) Kotlin/Native path can't keep **Q8_0** matmul weights packed, because `linearProject` transposes the weight and **`ops.transpose` handles the K-series (Q4_K/Q5_K/Q6_K) but not Q8_0**. This forces FunctionGemma's tied **Q8_0 `token_embd`/`output`** to be dequantized to FP32 (2 × ~0.67 GB), which OOMs the 1.9 GB Astra Machina **SL2610** board. Result: the pure-SKaiNET eager LLM can't run on-device; only the IREE f16 vmfb path can.

This is the "pre-transposed marker" follow-up already noted in `llm-core/.../transformer/LinearProjection.kt` (Solution C / `ISSUE-skainet-8b-oom.md`).

## Repro
Run `GemmaQ5KPackedParityTest` (host, `-PincludeIntegration`) with `output`/lm_head packed as `Q8_0BlockTensorData`. To do that, drop `output` from the `isEmbed` FP32 branch in `GemmaPackedWeights.convertGemmaWeightsPacked` and add Q8_0 to `packGemmaKQuant`. The run fails with:

```
java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Float
  at sk.ainet.lang.tensor.data.DenseTensorDataFactory.init(DenseTensorDataFactory.kt:539)
  at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:632)
  at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
  at sk.ainet.lang.nn.transformer.MultiHeadAttention.attentionImpl(...)   # also Q8_0 attn weights
```

`linearProject = ops.matmul(input, ops.transpose(weight))`. Transposing a Byte-backed packed tensor falls into the generic FP32 `DenseTensorDataFactory` path and crashes with the cast error above. The K-series formats survive because `ops.transpose` handles them; Q8_0 has no such path.

## Root cause
- `ops.transpose` (engine, `DefaultCpuOps`) has no path for `Q8_0TensorData` (Byte-backed). More broadly, `linearProject` should not transpose a packed weight at all when the packed layout is already `[out, in]` block-major (the kernel's expected order).
- `GemmaPackedWeights.convertGemmaWeightsPacked` therefore dequantizes `token_embd`/`output` to FP32; with tying, that's two ~0.67 GB FP32 matrices in opposite orientations.

## Proposed fix (the actual Tier 2)
1. **Pre-transposed marker on packed weights**: mark a `NATIVE_OPTIMIZED`-packed weight as already `[out, in]` block-major; `linearProject` reads the marker and **skips `ops.transpose`**, dispatching straight through `DefaultCpuOps.chooseQuantizedMatmulHeap` (which already routes `Q8_0TensorData` → Q8_0 kernel). (Alternatively, or in addition, add Q8_0 support to `ops.transpose`; that depends on an engine change.)
2. **Pack Q8_0 matmul weights** in `packGemmaKQuant` (generalize the relayout to 32-elem/34-byte blocks → `Q8_0BlockTensorData`); drop `output` from the `isEmbed` FP32 branch.
3. **Row-dequant the MAIN `token_embd` gather** so the embedding also stays Q8_0 (today only the per-layer PLE embed implements `RowDequantSource`; `ops.gather` on a packed tensor otherwise materializes full FP32). Then the tied embed/lm_head share one ~178 MB Q8_0 tensor.
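The marker idea in step 1 could look roughly like the toy below. This is a minimal sketch, not the SKaiNET API: `Weight`, `DenseWeight`, `PackedQ8Weight`, `preTransposed`, `packQ8`, `denseProject`, and `packedProject` are all hypothetical names, the real dispatch would go through `DefaultCpuOps.chooseQuantizedMatmulHeap`, and the block scale is kept as a `Float` here for brevity where real Q8_0 stores an fp16 scale.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

const val BLOCK = 32 // Q8_0 quantizes in blocks of 32 elements

// Hypothetical stand-ins for the real tensor types.
sealed interface Weight { val outDim: Int; val inDim: Int }

// Dense FP32 weight, row-major [out, in]; the existing path transposes it.
class DenseWeight(override val outDim: Int, override val inDim: Int, val data: FloatArray) : Weight

// Toy packed weight, already [out, in] block-major. `preTransposed` is the hypothetical marker.
class PackedQ8Weight(
    override val outDim: Int,
    override val inDim: Int,
    val scales: FloatArray, // one scale per 32-element block (Float here, fp16 in real Q8_0)
    val quants: ByteArray,  // int8 quants, same element order as the dense data
    val preTransposed: Boolean = true,
) : Weight

// Quantize each 32-element block with scale = max|x| / 127.
fun packQ8(w: DenseWeight): PackedQ8Weight {
    require(w.inDim % BLOCK == 0) { "rows must be whole blocks" }
    val nBlocks = w.data.size / BLOCK
    val scales = FloatArray(nBlocks)
    val quants = ByteArray(w.data.size)
    for (b in 0 until nBlocks) {
        var amax = 0f
        for (i in 0 until BLOCK) amax = maxOf(amax, abs(w.data[b * BLOCK + i]))
        val d = amax / 127f
        scales[b] = d
        for (i in 0 until BLOCK) {
            val q = if (d == 0f) 0 else (w.data[b * BLOCK + i] / d).roundToInt()
            quants[b * BLOCK + i] = q.coerceIn(-127, 127).toByte()
        }
    }
    return PackedQ8Weight(w.outDim, w.inDim, scales, quants)
}

// Existing-style dense path: explicit transpose to [in, out], then x · Wt.
fun denseProject(x: FloatArray, w: DenseWeight): FloatArray {
    val t = FloatArray(w.data.size)
    for (o in 0 until w.outDim) for (i in 0 until w.inDim) t[i * w.outDim + o] = w.data[o * w.inDim + i]
    return FloatArray(w.outDim) { o ->
        var s = 0f
        for (i in 0 until w.inDim) s += x[i] * t[i * w.outDim + o]
        s
    }
}

// Packed path: each output is a dot product against one packed row, no transpose.
fun packedProject(x: FloatArray, w: PackedQ8Weight): FloatArray = FloatArray(w.outDim) { o ->
    var s = 0f
    for (i in 0 until w.inDim) {
        val idx = o * w.inDim + i
        s += x[i] * w.scales[idx / BLOCK] * w.quants[idx].toFloat()
    }
    s
}

// The marker check: packed + pre-transposed weights never reach the transpose.
fun linearProject(x: FloatArray, w: Weight): FloatArray = when (w) {
    is PackedQ8Weight ->
        if (w.preTransposed) packedProject(x, w)
        else error("packed weight without the marker would need a packed-aware transpose")
    is DenseWeight -> denseProject(x, w)
}
```

Both branches compute the same `[out]` vector up to quantization error, so a host parity test can compare `linearProject(x, dense)` against `linearProject(x, packQ8(dense))` with a small tolerance.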

Footprint drops from ~1.34 GB to ~178 MB, and the lm_head runs on the NEON Q8_0 kernel.
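The footprint figures can be checked with a few lines of arithmetic. The sketch below assumes an embedding shape of 262,144 × 640; that shape is not stated in this issue and is only chosen because it reproduces the ~0.67 GB FP32 figure. The function and constant names are illustrative.

```kotlin
// ASSUMPTION: embedding shape [vocab, hidden]; not taken from the issue text.
const val VOCAB = 262_144L
const val HIDDEN = 640L

const val Q8_0_BLOCK_ELEMS = 32L // elements per Q8_0 block
const val Q8_0_BLOCK_BYTES = 34L // 2-byte fp16 scale + 32 int8 quants

// FP32 stores 4 bytes per element.
fun fp32Bytes(elems: Long): Long = elems * 4L

// Q8_0 stores 34 bytes per 32 elements (1.0625 bytes per element).
fun q80Bytes(elems: Long): Long {
    require(elems % Q8_0_BLOCK_ELEMS == 0L) { "Q8_0 needs whole 32-element blocks" }
    return elems / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
}
```

Under that assumed shape, `2 * fp32Bytes(VOCAB * HIDDEN)` is 1,342,177,280 bytes (two FP32 copies, ~1.34 GB) and `q80Bytes(VOCAB * HIDDEN)` is 178,257,920 bytes (one shared Q8_0 tensor, ~178 MB).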

## Target / proof it's right
The upstream Synaptics `sl2610-examples/Function_calling` runs the **identical** GGUF via llama.cpp at **~9.7 tok/s** on the 2-core Cortex-A55. It does so by keeping the embedding packed as Q8_0 (row-dequant gather + quantized lm_head). That's the goal for the eager path.
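A row-dequant gather over raw Q8_0 blocks can be sketched as follows. This assumes the usual Q8_0 block layout (a little-endian fp16 scale followed by 32 int8 quants, 34 bytes total) stored row-major by block; `halfToFloat` and `gatherRow` are hypothetical helper names, not the `RowDequantSource` interface itself.

```kotlin
import kotlin.math.pow

// Decode IEEE 754 binary16 bits (low 16 bits of `bits`) into a Float.
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 1
    val exp = (bits ushr 10) and 0x1F
    val man = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> man * 2f.pow(-24) // zero or subnormal: man * 2^-24
        31 -> if (man == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> (1f + man / 1024f) * 2f.pow(exp - 15)
    }
    return if (sign == 1) -magnitude else magnitude
}

// Dequantize only row `row` of a packed [rows, cols] Q8_0 matrix.
// Nothing outside that row's blocks is read, so the full FP32 matrix is never materialized.
fun gatherRow(packed: ByteArray, cols: Int, row: Int): FloatArray {
    require(cols % 32 == 0) { "cols must be a multiple of the 32-element block" }
    val blocksPerRow = cols / 32
    val out = FloatArray(cols)
    for (b in 0 until blocksPerRow) {
        val base = (row * blocksPerRow + b) * 34 // 34 bytes per block
        val scaleBits = (packed[base].toInt() and 0xFF) or
            ((packed[base + 1].toInt() and 0xFF) shl 8)
        val d = halfToFloat(scaleBits)
        for (i in 0 until 32) out[b * 32 + i] = d * packed[base + 2 + i].toInt()
    }
    return out
}
```

Each token lookup then costs one row of FP32 scratch (`cols` floats) rather than the whole table, which is what lets the tied embed/lm_head tensor stay packed.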

## Notes for whoever picks this up
- Validate on the **host** `GemmaQ5KPackedParityTest` first (compiles from source; prints flush; no 1.9 GB OOM) before any board cycle.
- Kotlin/Native caches klibs by **coordinate**, so republishing the same version with `publishToMavenLocal` links STALE klibs. Clear `~/.konan/*/klib/cache` + `~/.m2/.../<ver>` + `~/.gradle/caches/modules-2/.../<ver>` between republishes, or use a unique version. Verify with `grep -a <marker> <binary>.kexe` (the `.kexe` is an uncompressed ELF, so grep is valid; the `.klib` is a zip, so grep on it is not).
- Cortex-A55: NEON+fp16+dotprod, **no i8mm** (don't add `+i8mm` → SIGILL).
- Board harness ready in `SKaiNET-embedded/sl2610-function-calling`: `voicecc gen-native` (eager `load(NATIVE_OPTIMIZED)` + greedy decode, parity vs llama-ref), linuxArm64 deps + `linkerOpts(native/prebuilt/libskainet_kernels.aarch64.a)` (board-built NEON archive). It builds, links, and loads today; it only OOMs at weight load with the unpatched converter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
