feat(gemma): optional maxInferenceLen on load() to cap KV cache on constrained devices (#178) by michalharakal · Pull Request #180 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-06-15T13:43:13Z

Follow-up to #179. Adds an optional maxInferenceLen to GemmaNetworkLoader.load() (threaded through applyWeightsToNetwork[NonReified] → gemmaNetwork).
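A minimal sketch of the threading described above, for readers who want to see its shape. The types `SketchConfig`, `SketchNetwork`, and `SketchLoader` are hypothetical stand-ins, the function signatures are simplified (the real ones in SKaiNET-transformers take weights and other arguments), and resolving the `null` default inside `gemmaNetwork` is an assumption:

```kotlin
import kotlin.math.min

// Hypothetical stand-ins for illustration only; not the library's real types.
data class SketchConfig(val contextLength: Int)
class SketchNetwork(val maxInferenceLen: Int)

// Assumed resolution point: null falls back to the existing default,
// min(contextLength, 4096), as stated in the PR description.
fun gemmaNetwork(config: SketchConfig, maxInferenceLen: Int? = null): SketchNetwork =
    SketchNetwork(maxInferenceLen ?: min(config.contextLength, 4096))

// Each intermediate layer only forwards the optional cap unchanged.
fun applyWeightsToNetworkNonReified(config: SketchConfig, maxInferenceLen: Int? = null): SketchNetwork =
    gemmaNetwork(config, maxInferenceLen)

fun applyWeightsToNetwork(config: SketchConfig, maxInferenceLen: Int? = null): SketchNetwork =
    applyWeightsToNetworkNonReified(config, maxInferenceLen)

object SketchLoader {
    // Optional parameter with a null default, so existing callers are unaffected.
    fun load(config: SketchConfig, maxInferenceLen: Int? = null): SketchNetwork =
        applyWeightsToNetwork(config, maxInferenceLen)
}

fun main() {
    val config = SketchConfig(contextLength = 32_768) // assumed value for the example
    println(SketchLoader.load(config).maxInferenceLen)                       // default path: 4096
    println(SketchLoader.load(config, maxInferenceLen = 32).maxInferenceLen) // capped path: 32
}
```

Because the parameter defaults to `null` at every level, callers that never pass it keep the previous sizing.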

Why

The eager network sizes its KV cache + RoPE tables for maxInferenceLen (default min(contextLength, 4096)). On the 1.9 GB SL2610, #179 dropped the weight footprint to ~1.06 GB resident (packed Q8_0 lm_head), but the first forward still allocates the ~0.4 GB KV cache for a 4096-token context and OOMs the board, even though a tool-call prompt is only ~13 tokens.

Capping maxInferenceLen (e.g. 32) shrinks the KV cache ~100× (4096 / 32 = 128, since the cache scales linearly with sequence length), so the eager decode fits. Default null preserves existing behaviour.
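The scaling can be checked with a rough size estimate. This is a sketch under stated assumptions: the formula is the usual "K and V per layer" estimate, and the layer count, KV head count, head dimension, and element size are illustrative placeholders, not FunctionGemma's actual configuration. Only the ratio between the two sequence lengths matters here:

```kotlin
// Rough KV cache size: one K and one V tensor per layer,
// each holding seqLen * kvHeads * headDim elements.
fun kvCacheBytes(seqLen: Int, layers: Int, kvHeads: Int, headDim: Int, bytesPerElem: Int): Long =
    2L * layers * kvHeads * headDim * seqLen * bytesPerElem

fun main() {
    // Placeholder model dimensions (assumptions for illustration only).
    val layers = 18
    val kvHeads = 1
    val headDim = 256
    val bytesPerElem = 4 // assuming FP32 cache entries

    val uncapped = kvCacheBytes(4096, layers, kvHeads, headDim, bytesPerElem)
    val capped = kvCacheBytes(32, layers, kvHeads, headDim, bytesPerElem)

    // The model dimensions cancel out, leaving 4096 / 32.
    println("reduction: ${uncapped / capped}x") // prints "reduction: 128x"
}
```

Whatever the real dimensions are, they appear in both terms, so the reduction depends only on the ratio of the two context lengths.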

On-board evidence

A composite build of #736+#737+#179 loaded FunctionGemma to a stable 1.06 GB on the SL2610 (vs the prior 1.5 GB OOM-at-load), confirming the packed-Q8_0-lm_head fix works on hardware; the remaining OOM was the uncapped KV cache this param addresses.

Part of #178.

🤖 Generated with Claude Code

The eager network sizes its KV cache + RoPE tables for maxInferenceLen (= min(contextLength, 4096) by default). On the 1.9 GB SL2610 that ~0.4 GB KV cache (allocated at the first forward) OOMs the board even after the packed Q8_0 lm_head dropped the weight footprint to ~1.06 GB resident.

Thread an optional `maxInferenceLen: Int? = null` through load() -> applyWeightsToNetwork -> applyWeightsToNetworkNonReified -> gemmaNetwork so a constrained-device consumer can cap the context (e.g. 32 for a short tool-call prompt), shrinking the KV cache ~100x. Default null preserves the existing min(contextLength, 4096) behaviour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

michalharakal mentioned this pull request Jun 15, 2026

Eager NATIVE_OPTIMIZED: keep Q8_0 matmul weights packed (pre-transpose marker) so gemma fits + runs fast on the SL2610 #178

Open

michalharakal merged commit 19d62d4 into develop Jun 15, 2026
0 of 2 checks passed

michalharakal deleted the fix/gemma-board-embed-nocopy branch June 15, 2026 15:47

michalharakal mentioned this pull request Jun 15, 2026

chore(release): prepare SKaiNET-transformers 0.31.0 #181

Merged
