Eager NATIVE_OPTIMIZED: keep Q8_0 matmul weights packed (pre-transpose marker) so gemma fits + runs fast on the SL2610 #178

@michalharakal

Summary

The eager (OptimizedLLMRuntime DIRECT) Kotlin/Native path can't keep Q8_0 matmul weights packed, because linearProject transposes the weight and ops.transpose handles the K-series (Q4_K/Q5_K/Q6_K) but not Q8_0. This forces FunctionGemma's tied Q8_0 token_embd/output to be dequantized to FP32 (2 × ~0.67 GB), which OOMs the 1.9 GB Astra Machina SL2610 board. Result: the pure-SKaiNET eager LLM can't run on-device; only the IREE f16 vmfb path does.

This is the "pre-transposed marker" follow-up already noted in llm-core/.../transformer/LinearProjection.kt (Solution C / ISSUE-skainet-8b-oom.md).

Repro

Run GemmaQ5KPackedParityTest (host, -PincludeIntegration) with output/lm_head packed as Q8_0BlockTensorData. To set that up, drop output from the isEmbed FP32 branch in GemmaPackedWeights.convertGemmaWeightsPacked and add Q8_0 to packGemmaKQuant. The test then fails with:

java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Float
  at sk.ainet.lang.tensor.data.DenseTensorDataFactory.init(DenseTensorDataFactory.kt:539)
  at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:632)
  at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
  at sk.ainet.lang.nn.transformer.MultiHeadAttention.attentionImpl(...)   # also Q8_0 attn weights

linearProject = ops.matmul(input, ops.transpose(weight)). Transposing a Byte-backed packed tensor falls into the generic FP32 DenseTensorDataFactory path and crashes. The K-series types survive because ops.transpose has a dedicated path for them; Q8_0 does not.
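The failure mode can be illustrated in isolation. The sketch below is not the SKaiNET code: `transposeAsFloat` is a hypothetical stand-in for a generic transpose that assumes every element is a Float, which is the kind of assumption that would produce the ClassCastException in the trace when the backing data is Byte.

```kotlin
// Hypothetical stand-in for a generic FP32 transpose; not the SKaiNET implementation.
// It treats every element as Float, so Byte-backed data fails at the cast.
fun transposeAsFloat(data: List<Any>, rows: Int, cols: Int): FloatArray {
    require(data.size == rows * cols) { "data size must equal rows * cols" }
    val out = FloatArray(rows * cols)
    for (r in 0 until rows) {
        for (c in 0 until cols) {
            // Throws ClassCastException when the element is a Byte.
            out[c * rows + r] = data[r * cols + c] as Float
        }
    }
    return out
}

// Returns true if transposing the given data throws ClassCastException.
fun transposeFails(data: List<Any>, rows: Int, cols: Int): Boolean =
    try {
        transposeAsFloat(data, rows, cols)
        false
    } catch (e: ClassCastException) {
        true
    }
```

FP32 data transposes normally, while a Byte-backed list of the same shape fails at the first element, which matches the shape of the reported crash.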

Root cause

  • ops.transpose (engine, DefaultCpuOps) has no path for Q8_0TensorData (Byte-backed). More broadly, linearProject should not transpose a packed weight at all when the packed layout is already [out, in] block-major (the kernel's expected order).
  • GemmaPackedWeights.convertGemmaWeightsPacked therefore converts token_embd/output to FP32; with tying, that's two ~0.67 GB FP32 matrices in opposite orientations.

Proposed fix (the actual Tier 2)

  1. Pre-transposed marker on packed weights: mark a NATIVE_OPTIMIZED-packed weight as already [out, in] block-major; linearProject reads the marker and skips ops.transpose, dispatching straight through DefaultCpuOps.chooseQuantizedMatmulHeap (which already routes Q8_0TensorData → Q8_0 kernel). (Alternative/also: add Q8_0 support to ops.transpose — engine dependency.)
  2. Pack Q8_0 matmul weights in packGemmaKQuant (generalize the relayout to 32-elem/34-byte blocks → Q8_0BlockTensorData); drop output from the isEmbed FP32 branch.
  3. Row-dequant the MAIN token_embd gather so the embedding also stays Q8_0 (today only the per-layer PLE embed implements RowDequantSource; ops.gather on a packed tensor otherwise materializes full FP32). Then the tied embed/lm_head share one ~178 MB Q8_0 tensor.
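Step 1 can be sketched with toy types. Everything here is hypothetical (`Weight`, `DenseWeight`, `PackedWeight`, the `preTransposed` flag, and `linearProjectSketch` are illustration names, not SKaiNET APIs), and the row-wise dot product merely stands in for the quantized kernel. The point is only the dispatch: a dense weight goes through transpose + matmul, while a marked packed weight skips the transpose and is consumed in its stored [out, in] order.

```kotlin
// Toy stand-ins for illustration only; none of these are SKaiNET types.
sealed interface Weight {
    val outFeatures: Int
    val inFeatures: Int
}

// Dense FP32 weight stored row-major as [out, in].
class DenseWeight(
    override val outFeatures: Int,
    override val inFeatures: Int,
    val data: FloatArray,
) : Weight

// Packed weight, also [out, in]. `preTransposed` is the hypothetical marker;
// `dequantRow` stands in for whatever the quantized kernel reads per output row.
class PackedWeight(
    override val outFeatures: Int,
    override val inFeatures: Int,
    val preTransposed: Boolean,
    val dequantRow: (Int) -> FloatArray,
) : Weight

fun dot(a: FloatArray, b: FloatArray): Float {
    var acc = 0f
    for (i in a.indices) acc += a[i] * b[i]
    return acc
}

fun linearProjectSketch(input: FloatArray, weight: Weight): FloatArray {
    require(input.size == weight.inFeatures) { "input size must match inFeatures" }
    return when (weight) {
        is DenseWeight -> {
            // Dense path: transpose [out, in] -> [in, out], then matmul.
            val t = FloatArray(weight.data.size)
            for (o in 0 until weight.outFeatures) {
                for (i in 0 until weight.inFeatures) {
                    t[i * weight.outFeatures + o] = weight.data[o * weight.inFeatures + i]
                }
            }
            FloatArray(weight.outFeatures) { o ->
                var acc = 0f
                for (i in 0 until weight.inFeatures) acc += input[i] * t[i * weight.outFeatures + o]
                acc
            }
        }
        is PackedWeight -> {
            // Marker path: no transpose; each output is a dot product with one stored row.
            check(weight.preTransposed) { "unmarked packed weight would need a packed transpose" }
            FloatArray(weight.outFeatures) { o -> dot(input, weight.dequantRow(o)) }
        }
    }
}
```

Both branches compute the same projection, so a parity check between a dense weight and its packed, marked counterpart is a natural first host-side test.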

The footprint of the tied embedding/lm_head drops from ~1.34 GB (two FP32 copies) to ~178 MB (one shared Q8_0 tensor), and the lm_head runs on the NEON Q8_0 kernel.
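The footprint numbers follow from the 32-element/34-byte Q8_0 block size mentioned above. The sketch below reproduces the arithmetic; the 262,144 × 640 embedding shape is an assumption that is not stated in this issue but is consistent with the ~0.67 GB FP32 figure, and the constant and function names are hypothetical.

```kotlin
// Q8_0 block: 32 elements stored in 34 bytes (2-byte scale + 32 one-byte quants).
const val Q8_0_BLOCK_ELEMS = 32L
const val Q8_0_BLOCK_BYTES = 34L

// ASSUMPTION: embedding shape, chosen because it matches the ~0.67 GB FP32 figure.
const val ASSUMED_VOCAB = 262_144L
const val ASSUMED_HIDDEN = 640L

fun fp32Bytes(elements: Long): Long = elements * 4L

fun q8_0Bytes(elements: Long): Long {
    require(elements % Q8_0_BLOCK_ELEMS == 0L) { "element count must be a multiple of 32" }
    return elements / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
}
```

Under that assumed shape there are 167,772,160 elements, so one FP32 copy is 671,088,640 bytes (~0.67 GB), two copies are 1,342,177,280 bytes (~1.34 GB), and a single Q8_0 tensor is 178,257,920 bytes (~178 MB).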

Target / proof it's right

The upstream Synaptics sl2610-examples/Function_calling runs the identical GGUF via llama.cpp at ~9.7 tok/s on the 2-core Cortex-A55 while keeping the embedding packed as Q8_0 (row-dequant gather + quantized lm_head). That's the goal for the eager path.
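A row-dequant gather only needs to decode the blocks of the requested rows. The sketch below assumes the GGML Q8_0 block layout (a little-endian fp16 scale followed by 32 signed 8-bit quants, value = scale × quant) in plain row-major block order; whether Q8_0BlockTensorData uses exactly this byte order after relayout is an assumption, and `halfToFloat`, `dequantRowQ8_0`, and `gatherRowsQ8_0` are hypothetical names.

```kotlin
// Decode an IEEE 754 binary16 value (given as its 16 raw bits) to Float.
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 0x1
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * (1.0f / (1 shl 24)) // zero and subnormals: mant * 2^-24
        31 -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13)) // rebias 15 -> 127
    }
    return if (sign == 1) -magnitude else magnitude
}

// Dequantize one row of a Q8_0 matrix with `cols` columns.
// ASSUMED layout: per 32 elements, 2 bytes fp16 scale (little-endian) + 32 int8 quants.
fun dequantRowQ8_0(packed: ByteArray, row: Int, cols: Int): FloatArray {
    require(cols % 32 == 0) { "cols must be a multiple of 32" }
    val blocksPerRow = cols / 32
    val out = FloatArray(cols)
    var offset = row * blocksPerRow * 34
    for (b in 0 until blocksPerRow) {
        val lo = packed[offset].toInt() and 0xFF
        val hi = packed[offset + 1].toInt() and 0xFF
        val scale = halfToFloat(lo or (hi shl 8))
        for (i in 0 until 32) {
            out[b * 32 + i] = scale * packed[offset + 2 + i].toInt()
        }
        offset += 34
    }
    return out
}

// Gather: materialize only the rows for the requested token ids.
fun gatherRowsQ8_0(packed: ByteArray, tokenIds: IntArray, cols: Int): Array<FloatArray> =
    Array(tokenIds.size) { n -> dequantRowQ8_0(packed, tokenIds[n], cols) }
```

With this shape of gather, the FP32 working set is one row per token rather than the whole embedding matrix, which is what lets the tied tensor stay packed.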

Notes for whoever picks this up

  • Validate on the host GemmaQ5KPackedParityTest first (compiles from source; prints flush; no 1.9 GB OOM) before any board cycle.
  • Kotlin/Native caches klibs by coordinate, so a same-version publishToMavenLocal republish links STALE klibs. Clear ~/.konan/*/klib/cache + ~/.m2/.../<ver> + ~/.gradle/caches/modules-2/.../<ver> between republishes, or use a unique version. Verify with grep -a <marker> <binary>.kexe (the .kexe is an uncompressed ELF, so grep is valid; the .klib is a zip, so grepping it is not).
  • Cortex-A55: NEON+fp16+dotprod, no i8mm (don't add +i8mm → SIGILL).
  • Board harness ready in SKaiNET-embedded/sl2610-function-calling: voicecc gen-native (eager load(NATIVE_OPTIMIZED) + greedy decode, parity vs llama-ref), linuxArm64 deps + linkerOpts(native/prebuilt/libskainet_kernels.aarch64.a) (board-built NEON archive). It builds/links/loads today; only OOMs at weight load on the unpatched converter.

🤖 Generated with Claude Code

Labels: enhancement (New feature or request)
