fix(gemma): keep tied Q8_0 lm_head packed in eager NATIVE_OPTIMIZED path (#178) by michalharakal · Pull Request #179 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-06-15T11:41:18Z

Closes part of #178 (the lm_head half).

Problem

FunctionGemma's token_embd is Q8_0 and tied (there is no separate output.weight), so convertGemmaWeightsPacked dequantized both token_embd and output to FP32 (2 × ~0.67 GB), which runs out of memory on the 1.9 GB SL2610. But output/lm_head is a real matmul weight, not an embedding, so it can stay packed like the other matmul weights.

Fix

- packGemmaKQuant: add Q8_0 (32-elem/34 B blocks → Q8_0BlockTensorData); generalize the row-major→block-major relayout with a blockSize param.
- convertGemmaWeightsPacked: drop OUTPUT_WEIGHT from the isEmbed FP32 branch so it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel. token_embd stays FP32 (it's gathered) but is now wrapped no-copy via DenseFloatArrayTensorData instead of ctx.fromFloatArray (which allocates a second ~0.67 GB buffer).
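For readers unfamiliar with the "32-elem/34 B" figure, here is a minimal sketch of decoding one Q8_0 block. It assumes the usual GGUF layout (a little-endian fp16 scale followed by 32 signed 8-bit quants); all function and constant names are hypothetical and are not the SKaiNET API or the Q8_0BlockTensorData layout.

```kotlin
const val Q8_0_BLOCK_ELEMS = 32   // elements per block (from the PR text)
const val Q8_0_BLOCK_BYTES = 34   // assumed: 2-byte fp16 scale + 32 int8 quants

/** Convert an IEEE 754 half-precision bit pattern (low 16 bits) to Float. */
fun halfToFloat(bits: Int): Float {
    val sign = (bits and 0x8000) shl 16
    val exp = (bits shr 10) and 0x1F
    val frac = bits and 0x3FF
    return when (exp) {
        // zero or subnormal: magnitude is frac * 2^-24
        0 -> { val mag = frac / 16777216f; if (sign != 0) -mag else mag }
        // infinity or NaN
        31 -> Float.fromBits(sign or 0x7F800000 or (frac shl 13))
        // normal number: rebias exponent from 15 to 127
        else -> Float.fromBits(sign or ((exp + 112) shl 23) or (frac shl 13))
    }
}

/** Bytes needed to store `elements` values as Q8_0 blocks. */
fun q8_0PackedBytes(elements: Long): Long {
    require(elements % Q8_0_BLOCK_ELEMS == 0L) { "element count must be a multiple of 32" }
    return elements / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
}

/** Dequantize block `blockIndex` of `packed` into 32 floats at `out[outOffset..]`. */
fun dequantizeQ8_0Block(packed: ByteArray, blockIndex: Int, out: FloatArray, outOffset: Int) {
    val base = blockIndex * Q8_0_BLOCK_BYTES
    val scaleBits = (packed[base].toInt() and 0xFF) or ((packed[base + 1].toInt() and 0xFF) shl 8)
    val scale = halfToFloat(scaleBits)
    for (i in 0 until Q8_0_BLOCK_ELEMS) {
        out[outOffset + i] = scale * packed[base + 2 + i].toInt()
    }
}

fun main() {
    val block = ByteArray(Q8_0_BLOCK_BYTES)
    block[0] = 0x00; block[1] = 0x38                 // fp16 0.5, little-endian
    for (i in 0 until 32) block[2 + i] = (i - 16).toByte()
    val out = FloatArray(32)
    dequantizeQ8_0Block(block, 0, out, 0)
    println(out[0])    // -8.0
    println(out[31])   // 7.5
}
```

Each block costs 34 bytes for 32 values, versus 128 bytes for the same values in FP32, which is why keeping lm_head packed saves memory.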

Tied embed/lm_head footprint: ~1.34 GB → ~0.76 GB.
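The difference between wrapping a buffer and copying it can be illustrated with a small sketch. WrappedFloats below is a hypothetical stand-in, not DenseFloatArrayTensorData, and the copying variant only mimics the behaviour the description attributes to ctx.fromFloatArray.

```kotlin
/** Hypothetical tensor-data holder; not a SKaiNET class. */
class WrappedFloats(val data: FloatArray)

/** Shares the caller's array: no additional allocation. */
fun wrapNoCopy(src: FloatArray): WrappedFloats = WrappedFloats(src)

/** Allocates a second array of the same size and copies into it. */
fun wrapWithCopy(src: FloatArray): WrappedFloats = WrappedFloats(src.copyOf())

fun main() {
    val weights = FloatArray(4) { it.toFloat() }
    val shared = wrapNoCopy(weights)
    val copied = wrapWithCopy(weights)
    println(shared.data === weights)   // true: same backing array
    println(copied.data === weights)   // false: a second buffer exists
}
```

With a ~0.67 GB embedding table, the copying path briefly or permanently holds two such buffers, whereas the wrapping path holds one.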

Depends on

SKaiNET#736 (fix/q8_0-lazy-transpose), which adds the Q8_0 case to the engine's ops.transpose so that linearProject can transpose the packed weight without the Byte→Float ClassCastException. Merge and publish that first.
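As an illustration of the general idea behind a lazy transpose (not the SKaiNET#736 implementation, which is not shown here), a transpose can return a view that swaps the logical shape and leaves the packed bytes untouched, so nothing ever tries to treat them as floats. PackedMatrix and its members are hypothetical names.

```kotlin
/** Hypothetical packed matrix: raw quantized bytes plus a logical shape. */
class PackedMatrix(
    val bytes: ByteArray,
    val rows: Int,
    val cols: Int,
    val transposed: Boolean = false,
) {
    val logicalRows: Int get() = if (transposed) cols else rows
    val logicalCols: Int get() = if (transposed) rows else cols

    /** Returns a view over the same bytes with the transposed flag flipped. */
    fun transpose(): PackedMatrix = PackedMatrix(bytes, rows, cols, !transposed)
}

fun main() {
    val m = PackedMatrix(ByteArray(68), rows = 1, cols = 64)
    val t = m.transpose()
    println(t.bytes === m.bytes)                   // true: no bytes copied or cast
    println("${t.logicalRows} x ${t.logicalCols}") // 64 x 1
}
```

A kernel consuming such a view would read the flag and index the original storage accordingly.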

Verification

GemmaQ5KPackedParityTest via composite -PuseLocalSkainet=true (both repos from source): eager load(NATIVE_OPTIMIZED) decodes byte-identically to the FP32 baseline; lm_head packed Q8_0; no crash.

Remaining (#178, separate)

Row-dequant the main token_embd gather (today only the per-layer PLE embed implements RowDequantSource) to drop the last ~0.67 GB for full board fit.
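A minimal sketch of what a row-dequant gather could look like: only the rows for the requested token IDs are dequantized, so the full FP32 table never has to exist. It assumes the GGUF Q8_0 block layout and that each row's blocks are stored contiguously; Q8_0Rows and gather are hypothetical names and do not reflect the RowDequantSource interface.

```kotlin
/** Convert an IEEE 754 half-precision bit pattern (low 16 bits) to Float. */
fun halfToFloat(bits: Int): Float {
    val sign = (bits and 0x8000) shl 16
    val exp = (bits shr 10) and 0x1F
    val frac = bits and 0x3FF
    return when (exp) {
        0 -> { val mag = frac / 16777216f; if (sign != 0) -mag else mag }
        31 -> Float.fromBits(sign or 0x7F800000 or (frac shl 13))
        else -> Float.fromBits(sign or ((exp + 112) shl 23) or (frac shl 13))
    }
}

/** Hypothetical packed embedding table: rows of Q8_0 blocks, stored row by row. */
class Q8_0Rows(val packed: ByteArray, val rowLen: Int) {
    init { require(rowLen % 32 == 0) { "row length must be a multiple of 32" } }
    private val bytesPerRow = rowLen / 32 * 34

    /** Dequantize a single row into out[outOffset until outOffset + rowLen]. */
    fun dequantRow(row: Int, out: FloatArray, outOffset: Int = 0) {
        var src = row * bytesPerRow
        var dst = outOffset
        repeat(rowLen / 32) {
            val lo = packed[src].toInt() and 0xFF
            val hi = packed[src + 1].toInt() and 0xFF
            val scale = halfToFloat(lo or (hi shl 8))
            for (i in 0 until 32) out[dst + i] = scale * packed[src + 2 + i].toInt()
            src += 34
            dst += 32
        }
    }
}

/** Gather embeddings for tokenIds, dequantizing only the rows that are needed. */
fun gather(table: Q8_0Rows, tokenIds: IntArray): FloatArray {
    val out = FloatArray(tokenIds.size * table.rowLen)
    tokenIds.forEachIndexed { i, id -> table.dequantRow(id, out, i * table.rowLen) }
    return out
}

fun main() {
    val packed = ByteArray(2 * 34)
    packed[1] = 0x3C                                  // row 0: scale 1.0
    for (i in 0 until 32) packed[2 + i] = 1
    packed[35] = 0x38                                 // row 1: scale 0.5
    for (i in 0 until 32) packed[36 + i] = 4
    val out = gather(Q8_0Rows(packed, 32), intArrayOf(1, 0))
    println(out[0])    // 2.0 (row 1)
    println(out[32])   // 1.0 (row 0)
}
```

The output buffer scales with the number of tokens gathered rather than with the vocabulary size.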

🤖 Generated with Claude Code

…ath (#178)

FunctionGemma's token_embd is Q8_0 and tied, so convertGemmaWeightsPacked was dequanting BOTH token_embd AND output to FP32 (2×~0.67 GB), causing OOM on the 1.9 GB SL2610. `output`/lm_head is a real matmul weight, not an embedding:

- packGemmaKQuant: add Q8_0 (32-elem/34B blocks → Q8_0BlockTensorData); generalize the row-major→block-major relayout with a blockSize param.
- convertGemmaWeightsPacked: drop OUTPUT_WEIGHT from the isEmbed FP32 branch so it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel. token_embd stays FP32 (it's gathered) but is now wrapped no-copy via DenseFloatArrayTensorData instead of ctx.fromFloatArray (which allocates a second ~0.67 GB buffer).

Footprint for the tied embed/lm_head drops ~1.34 GB → ~0.67 GB (embed FP32) + ~0.09 GB (packed Q8_0 lm_head).

Requires the engine Q8_0 case in ops.transpose (SKaiNET fix/q8_0-lazy-transpose) so linearProject can transpose the packed weight.

Verified: GemmaQ5KPackedParityTest (composite -PuseLocalSkainet): eager load(NATIVE_OPTIMIZED) decodes byte-identically to the FP32 baseline; lm_head packed as Q8_0.

(token_embd row-dequant gather to drop the last ~0.67 GB is the remaining follow-up in #178.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e.properties; fix stale Q8_0 test

Real fix for the Build & Test failure (was masked by, then surfaced after, the JS NPM config-time issue):

1. `gradle.properties`: set `android.dependencyResolutionAtConfigurationTime.disallow=false`. AGP's DependencyResolutionChecks fails the build when KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm `*NpmAggregated` configs at configuration time (we have JS npm deps: ktor-client-js, kotlinx-browser). `assemble`/`allTests` threw `Configuration 'jsNpmAggregated' was resolved during configuration time` (gradle#31483), a false positive against KGP's known behaviour. AGP reads this option ONLY from the project gradle.properties, NOT from `-P` or the CI's ~/.gradle/gradle.properties (which is why the earlier attempts didn't take). Reverted those no-op attempts (build.yml/publish.yml `-P`, ci-gradle.properties).
2. `GemmaQuantLayoutTest`: `pack_non_kquant_returns_null` asserted Q8_0 packs to null, but #179 added Q8_0 packing, so it now returns Q8_0BlockTensorData. Replace with `pack_q8_0_produces_block_tensor` + a true-null case (Q4_1).

Verified locally: `clean assemble allTests --no-configuration-cache` is GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

michalharakal merged commit 689a283 into develop Jun 15, 2026
0 of 2 checks passed

michalharakal deleted the fix/gemma-board-embed-nocopy branch June 15, 2026 11:45

michalharakal merged 1 commit into develop from fix/gemma-board-embed-nocopy
