fix(gemma): keep tied Q8_0 lm_head packed in eager NATIVE_OPTIMIZED path (#178) #179
Merged
…ath (#178)

FunctionGemma's token_embd is Q8_0 and tied, so convertGemmaWeightsPacked was dequanting BOTH token_embd AND output to FP32 (2 × ~0.67 GB), which OOMs on the 1.9 GB SL2610. `output`/lm_head is a real matmul weight, not an embedding:

- packGemmaKQuant: add Q8_0 (32-elem/34 B blocks → Q8_0BlockTensorData); generalize the row-major→block-major relayout with a blockSize param.
- convertGemmaWeightsPacked: drop OUTPUT_WEIGHT from the isEmbed FP32 branch so it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel. token_embd stays FP32 (it's gathered) but is now wrapped no-copy via DenseFloatArrayTensorData instead of ctx.fromFloatArray (which allocates a second ~0.67 GB buffer).

Footprint for the tied embed/lm_head drops ~1.34 GB → ~0.67 GB (embed FP32) + ~0.09 GB (packed Q8_0 lm_head).

Requires the engine Q8_0 case in ops.transpose (SKaiNET fix/q8_0-lazy-transpose) so linearProject can transpose the packed weight.

Verified: GemmaQ5KPackedParityTest (composite -PuseLocalSkainet): eager load(NATIVE_OPTIMIZED) decodes byte-identically to the FP32 baseline; lm_head packed as Q8_0.

(token_embd row-dequant gather to drop the last ~0.67 GB is the remaining follow-up in #178.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 15, 2026
michalharakal added a commit that referenced this pull request Jun 15, 2026
…e.properties; fix stale Q8_0 test

Real fix for the Build & Test failure (was masked by, then surfaced after, the JS NPM config-time issue):

1. `gradle.properties`: set `android.dependencyResolutionAtConfigurationTime.disallow=false`. AGP's DependencyResolutionChecks fails the build when KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm `*NpmAggregated` configs at configuration time (we have JS npm deps: ktor-client-js, kotlinx-browser). `assemble`/`allTests` threw `Configuration 'jsNpmAggregated' was resolved during configuration time` (gradle#31483), a false positive against KGP's known behaviour. AGP reads this option ONLY from the project gradle.properties, NOT from `-P` or the CI's ~/.gradle/gradle.properties (which is why the earlier attempts didn't take). Reverted those no-op attempts (build.yml/publish.yml `-P`, ci-gradle.properties).
2. `GemmaQuantLayoutTest`: `pack_non_kquant_returns_null` asserted Q8_0 packs to null, but #179 added Q8_0 packing, so it now returns Q8_0BlockTensorData. Replace with `pack_q8_0_produces_block_tensor` + a true-null case (Q4_1).

Verified locally: `clean assemble allTests --no-configuration-cache` is GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes part of #178 (the lm_head half).
Problem
FunctionGemma's `token_embd` is Q8_0 and tied (no separate `output.weight`), so `convertGemmaWeightsPacked` dequanted both `token_embd` and `output` to FP32 (2 × ~0.67 GB), which OOMs on the 1.9 GB SL2610. `output`/lm_head is a real matmul weight, not an embedding.

Fix
- `packGemmaKQuant`: add Q8_0 (32-elem/34 B blocks → `Q8_0BlockTensorData`); generalize the row-major→block-major relayout with a `blockSize` param.
- `convertGemmaWeightsPacked`: drop `OUTPUT_WEIGHT` from the `isEmbed` FP32 branch so it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel. `token_embd` stays FP32 (it's gathered) but is now wrapped no-copy via `DenseFloatArrayTensorData` instead of `ctx.fromFloatArray` (which allocates a second ~0.67 GB buffer).

Tied embed/lm_head footprint: ~1.34 GB → ~0.76 GB.
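
For readers unfamiliar with the "32-elem/34 B blocks" figure: in the GGML Q8_0 format each block stores one little-endian fp16 scale followed by 32 signed 8-bit quants, and an element is recovered as `scale * quant`. The following is a minimal standalone Kotlin sketch of that layout. All names in it (`halfToFloat`, `q8_0PackedBytes`, `dequantizeQ8_0Block`) are hypothetical illustrations and are not the SKaiNET or `packGemmaKQuant` API; the relayout step from this PR is not shown.

```kotlin
// Illustrative sketch only: hypothetical names, not the SKaiNET API.
const val Q8_0_BLOCK_ELEMS = 32 // elements per block
const val Q8_0_BLOCK_BYTES = 34 // 2-byte fp16 scale + 32 int8 quants

/** Converts an IEEE 754 half-precision bit pattern (low 16 bits of [bits]) to Float. */
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 0x1
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * (1.0f / 16777216f) // zero or subnormal: mant * 2^-24
        0x1F -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13)) // rebias 15 -> 127
    }
    return if (sign == 1) -magnitude else magnitude
}

/** Packed size in bytes of [elements] values stored as Q8_0 blocks. */
fun q8_0PackedBytes(elements: Long): Long {
    require(elements % Q8_0_BLOCK_ELEMS == 0L) { "element count must be a multiple of 32" }
    return elements / Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
}

/** Dequantizes the single block starting at [srcOffset] into 32 floats at [dstOffset]. */
fun dequantizeQ8_0Block(src: ByteArray, srcOffset: Int, dst: FloatArray, dstOffset: Int) {
    val lo = src[srcOffset].toInt() and 0xFF
    val hi = src[srcOffset + 1].toInt() and 0xFF
    val scale = halfToFloat(lo or (hi shl 8)) // little-endian fp16 scale
    for (i in 0 until Q8_0_BLOCK_ELEMS) {
        // Byte.toInt() sign-extends, so quants are read as signed int8.
        dst[dstOffset + i] = scale * src[srcOffset + 2 + i].toInt()
    }
}
```

A packed matmul kernel consumes these blocks directly, which is why keeping lm_head packed avoids materializing the FP32 copy.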
Depends on
SKaiNET#736 (`fix/q8_0-lazy-transpose`): the engine `ops.transpose` Q8_0 case, so `linearProject` can transpose the packed weight without the Byte→Float `ClassCastException`. Merge/publish that first.

Verification
`GemmaQ5KPackedParityTest` via composite `-PuseLocalSkainet=true` (both repos from source): eager `load(NATIVE_OPTIMIZED)` decodes byte-identically to the FP32 baseline; lm_head packed Q8_0; no crash.

Remaining (#178, separate)
Row-dequant the main `token_embd` gather (today only the per-layer PLE embed implements `RowDequantSource`) to drop the last ~0.67 GB for full board fit.

🤖 Generated with Claude Code
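
To illustrate the remaining row-dequant idea: instead of holding the whole embedding table in FP32, a gather can dequantize only the requested row from the packed Q8_0 bytes. The sketch below assumes a plain row-major Q8_0 buffer (each row is `dim / 32` consecutive 34-byte blocks). The class `PackedQ8EmbeddingTable` and its methods are hypothetical; the actual `RowDequantSource` interface is not shown in this PR and may look quite different.

```kotlin
// Illustrative sketch only: hypothetical class, not the SKaiNET RowDequantSource API.
class PackedQ8EmbeddingTable(
    private val packed: ByteArray, // row-major Q8_0 blocks (assumed layout)
    val vocabSize: Int,
    val dim: Int,
) {
    private val blocksPerRow = dim / 32

    init {
        require(dim % 32 == 0) { "dim must be a multiple of the Q8_0 block size (32)" }
        require(packed.size.toLong() == vocabSize.toLong() * blocksPerRow * 34L) {
            "packed buffer size does not match vocabSize x dim"
        }
    }

    /** Dequantizes only the row for [tokenId] into [dst]; the table itself stays packed. */
    fun gatherRow(tokenId: Int, dst: FloatArray = FloatArray(dim)): FloatArray {
        require(tokenId in 0 until vocabSize) { "tokenId out of range" }
        require(dst.size >= dim) { "dst too small" }
        var src = tokenId * blocksPerRow * 34 // byte offset of the row's first block
        for (b in 0 until blocksPerRow) {
            val lo = packed[src].toInt() and 0xFF
            val hi = packed[src + 1].toInt() and 0xFF
            val scale = halfBitsToFloat(lo or (hi shl 8)) // little-endian fp16 scale
            for (i in 0 until 32) {
                dst[b * 32 + i] = scale * packed[src + 2 + i].toInt()
            }
            src += 34
        }
        return dst
    }

    /** IEEE 754 half-precision bit pattern to Float. */
    private fun halfBitsToFloat(bits: Int): Float {
        val exp = (bits ushr 10) and 0x1F
        val mant = bits and 0x3FF
        val magnitude = when (exp) {
            0 -> mant * (1.0f / 16777216f) // zero or subnormal: mant * 2^-24
            0x1F -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
            else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13))
        }
        return if ((bits ushr 15) and 1 == 1) -magnitude else magnitude
    }
}
```

With this shape, the per-lookup cost is one row of dequantization (`dim` multiplies), while resident memory stays at the packed size rather than the FP32 size.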