Summary
The eager (OptimizedLLMRuntime DIRECT) Kotlin/Native path can't keep Q8_0 matmul weights packed, because linearProject transposes the weight and ops.transpose handles the K-series (Q4_K/Q5_K/Q6_K) but not Q8_0. This forces FunctionGemma's tied Q8_0 token_embd/output to dequant to FP32 (2×~0.67 GB), which OOMs the 1.9 GB Astra Machina SL2610 board. Result: the pure-SKaiNET eager LLM can't run on-device; only the IREE f16 vmfb path does.
This is the "pre-transposed marker" follow-up already noted in llm-core/.../transformer/LinearProjection.kt (Solution C / ISSUE-skainet-8b-oom.md).
Repro
GemmaQ5KPackedParityTest (host, -PincludeIntegration) with output/lm_head packed as Q8_0BlockTensorData (drop it from the isEmbed FP32 branch in GemmaPackedWeights.convertGemmaWeightsPacked + add Q8_0 to packGemmaKQuant):
java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Float
at sk.ainet.lang.tensor.data.DenseTensorDataFactory.init(DenseTensorDataFactory.kt:539)
at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:632)
at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
at sk.ainet.lang.nn.transformer.MultiHeadAttention.attentionImpl(...) # also Q8_0 attn weights
linearProject = ops.matmul(input, ops.transpose(weight)) — transposing a Byte-backed packed tensor falls into the generic FP32 DenseTensorDataFactory path and crashes. K-series survive because they have a handled transpose; Q8_0 does not.
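The failure mode in the trace can be reproduced in isolation. The sketch below is not the SKaiNET code; `transposeAsFloat` is a hypothetical stand-in for a generic transpose that assumes every element is a Float, which is roughly what the trace suggests happens once a Byte-backed tensor reaches the dense FP32 path.

```kotlin
// Hypothetical stand-in for a generic dense transpose (not SKaiNET's
// DenseTensorDataFactory): it casts each element to Float unconditionally.
fun transposeAsFloat(src: List<Any>, rows: Int, cols: Int): FloatArray {
    val out = FloatArray(rows * cols)
    for (r in 0 until rows) {
        for (c in 0 until cols) {
            // This cast is fine for Float elements and throws for Byte elements.
            out[c * rows + r] = src[r * cols + c] as Float
        }
    }
    return out
}

fun main() {
    // FP32-backed data: the generic path works.
    val floats: List<Any> = listOf(1f, 2f, 3f, 4f, 5f, 6f)
    println(transposeAsFloat(floats, 2, 3).toList())

    // Byte-backed (packed) data: the same path fails on the first element.
    val packed: List<Any> = List(6) { it.toByte() }
    val failure = runCatching { transposeAsFloat(packed, 2, 3) }.exceptionOrNull()
    println(failure is ClassCastException)
}
```

A transpose that understands the packed layout (or skipping the transpose entirely) avoids the cast altogether.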
Root cause
ops.transpose (engine, DefaultCpuOps) has no path for Q8_0TensorData (Byte-backed). More broadly, linearProject should not be transposing a packed weight at all when the packed layout is already [out, in] block-major (the kernel's expected order).
GemmaPackedWeights.convertGemmaWeightsPacked therefore FP32s token_embd/output; with tying, that's two ~0.67 GB FP32 matrices in opposite orientations.
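The memory figures can be checked with a few lines of arithmetic. The sketch below assumes the tied matrix is 262,144 × 640 (vocab × hidden); that shape is not stated in this issue and is an assumption, chosen because it matches the ~0.67 GB figure. The 32-element / 34-byte Q8_0 block size is the one quoted in the proposed fix.

```kotlin
// ASSUMPTION: tied token_embd/output shape, not stated in the issue.
const val VOCAB = 262_144L
const val HIDDEN = 640L

// Q8_0 block: 32 int8 quants plus one 2-byte fp16 scale = 34 bytes.
const val Q8_0_BLOCK_ELEMS = 32L
const val Q8_0_BLOCK_BYTES = 34L

fun fp32Bytes(elems: Long): Long = elems * 4

fun q8_0Bytes(elems: Long): Long = elems / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES

fun main() {
    val elems = VOCAB * HIDDEN
    println("FP32, one copy  : ${fp32Bytes(elems)} bytes")      // ~0.67 GB
    println("FP32, two copies: ${2 * fp32Bytes(elems)} bytes")  // ~1.34 GB
    println("Q8_0, one shared: ${q8_0Bytes(elems)} bytes")      // ~178 MB
}
```

Under that assumed shape, the two FP32 copies alone take about 70% of the board's 1.9 GB before any other weight, KV cache, or runtime allocation.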
Proposed fix (the actual Tier 2)
- Pre-transposed marker on packed weights: mark a NATIVE_OPTIMIZED-packed weight as already [out, in] block-major; linearProject reads the marker and skips ops.transpose, dispatching straight through DefaultCpuOps.chooseQuantizedMatmulHeap (which already routes Q8_0TensorData → Q8_0 kernel). (Alternative/also: add Q8_0 support to ops.transpose, which is an engine dependency.)
- Pack Q8_0 matmul weights in packGemmaKQuant (generalize the relayout to 32-elem/34-byte blocks → Q8_0BlockTensorData); drop output from the isEmbed FP32 branch.
- Row-dequant the MAIN token_embd gather so the embedding also stays Q8_0 (today only the per-layer PLE embed implements RowDequantSource; ops.gather on a packed tensor otherwise materializes full FP32). Then the tied embed/lm_head share one ~178 MB Q8_0 tensor.
Footprint ~1.34 GB → ~178 MB, and the lm_head runs on the NEON Q8_0 kernel.
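As a rough illustration of the first item, the marker only has to change which path linearProject picks. Everything in this sketch is hypothetical (`Weight`, `PackedWeight`, `preTransposed`, `planLinearProject` are made-up names, not SKaiNET types); it shows the dispatch decision only, not the matmul itself.

```kotlin
// Hypothetical stand-ins; none of these are SKaiNET's real types.
sealed interface Weight {
    val rows: Int
    val cols: Int
}

class DenseWeight(
    override val rows: Int,
    override val cols: Int,
    val data: FloatArray,
) : Weight

// Packed weight stored [out, in] block-major; preTransposed is the proposed marker.
class PackedWeight(
    override val rows: Int, // out features
    override val cols: Int, // in features
    val bytes: ByteArray,
    val preTransposed: Boolean,
) : Weight

sealed interface Plan
object TransposeThenMatmul : Plan   // the path every weight takes today
object QuantizedMatmulDirect : Plan // proposed: no transpose, straight to the packed kernel

// Decide how a linear projection should consume the weight.
fun planLinearProject(w: Weight): Plan = when {
    w is PackedWeight && w.preTransposed -> QuantizedMatmulDirect
    else -> TransposeThenMatmul
}
```

Keeping the decision on the weight rather than on the call site means dense FP32 weights and unmarked packed weights keep their current behavior.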
Target / proof it's right
The upstream Synaptics sl2610-examples/Function_calling runs the identical GGUF via llama.cpp at ~9.7 tok/s on the 2-core Cortex-A55 — keeping the embedding Q8_0 packed (row-dequant gather + quantized lm_head). That's the goal for the eager path.
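A row-dequant gather only needs to expand the one row being looked up. The sketch below assumes the GGUF-style Q8_0 block layout (a little-endian fp16 scale followed by 32 int8 quants, 34 bytes per block) and a row-major [vocab, hidden] packed buffer. `dequantRow` and `halfToFloat` are illustrative names, not the RowDequantSource interface.

```kotlin
const val QK = 32           // elements per Q8_0 block
const val BLOCK_BYTES = 34  // 2-byte fp16 scale + 32 int8 quants

// Convert IEEE 754 half-precision bits to Float.
fun halfToFloat(h: Int): Float {
    val sign = (h and 0x8000) shl 16
    val exp = (h ushr 10) and 0x1F
    val man = h and 0x3FF
    val bits = when {
        exp == 0 && man == 0 -> sign                          // signed zero
        exp == 0 -> {                                         // subnormal: man * 2^-24
            val v = man * (1.0f / (1 shl 24))
            return if (sign != 0) -v else v
        }
        exp == 31 -> sign or 0x7F800000 or (man shl 13)       // inf / NaN
        else -> sign or ((exp + 112) shl 23) or (man shl 13)  // normal, rebias 15 -> 127
    }
    return Float.fromBits(bits)
}

// Dequantize a single row of a packed Q8_0 matrix into FP32.
fun dequantRow(packed: ByteArray, row: Int, cols: Int): FloatArray {
    require(cols % QK == 0) { "cols must be a multiple of $QK" }
    val blocksPerRow = cols / QK
    val out = FloatArray(cols)
    var off = row * blocksPerRow * BLOCK_BYTES
    for (b in 0 until blocksPerRow) {
        val lo = packed[off].toInt() and 0xFF
        val hi = packed[off + 1].toInt() and 0xFF
        val scale = halfToFloat(lo or (hi shl 8))
        for (i in 0 until QK) {
            out[b * QK + i] = scale * packed[off + 2 + i].toInt()
        }
        off += BLOCK_BYTES
    }
    return out
}
```

Each lookup then allocates only `cols` floats per token instead of the full FP32 matrix, and the same packed buffer can back the quantized lm_head.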
Notes for whoever picks this up
- Validate on the host GemmaQ5KPackedParityTest first (compiles from source; prints flush; no 1.9 GB OOM) before any board cycle.
- Kotlin/Native caches klibs by coordinate, so same-version publishToMavenLocal republishes link STALE klibs. Clear ~/.konan/*/klib/cache + ~/.m2/.../<ver> + ~/.gradle/caches/modules-2/.../<ver> between republishes, or use a unique version. Verify with grep -a <marker> <binary>.kexe (the .kexe is an uncompressed ELF, so grep is valid; the .klib is a zip, so grep on it is not).
- Cortex-A55: NEON+fp16+dotprod, no i8mm (don't add +i8mm → SIGILL).
- Board harness ready in SKaiNET-embedded/sl2610-function-calling: voicecc gen-native (eager load(NATIVE_OPTIMIZED) + greedy decode, parity vs llama-ref), linuxArm64 deps + linkerOpts(native/prebuilt/libskainet_kernels.aarch64.a) (board-built NEON archive). It builds/links/loads today; it only OOMs at weight load on the unpatched converter.
🤖 Generated with Claude Code