feat(gemma): optional maxInferenceLen on load() to cap KV cache on constrained devices (#178) #180

Merged
michalharakal merged 1 commit into develop from fix/gemma-board-embed-nocopy on Jun 15, 2026

@michalharakal
Contributor

Follow-up to #179. Adds an optional maxInferenceLen parameter to GemmaNetworkLoader.load(), threaded through applyWeightsToNetwork -> applyWeightsToNetworkNonReified -> gemmaNetwork.
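A minimal sketch of the threading pattern, under heavily simplified, hypothetical signatures: the real load(), applyWeightsToNetwork, applyWeightsToNetworkNonReified and gemmaNetwork also take weights and model config, and `NetworkConfig`, `GemmaNetworkLoaderSketch` and the `require` check are illustrative additions, not project code. Only the pass-through of a nullable cap and the `null` -> `min(contextLength, 4096)` fallback are taken from this PR.

```kotlin
// HYPOTHETICAL, simplified stand-ins for the loader chain named in this PR.
// Real signatures also take weights/config; only the optional cap is modelled.

// Hypothetical result: the length the eager network would size its buffers for.
data class NetworkConfig(val contextLength: Int, val maxInferenceLen: Int)

const val DEFAULT_MAX_INFERENCE_LEN = 4096

fun gemmaNetwork(contextLength: Int, maxInferenceLen: Int? = null): NetworkConfig {
    // Sketch-only validation: reject non-positive caps.
    require(maxInferenceLen == null || maxInferenceLen > 0) { "maxInferenceLen must be positive" }
    // null keeps the pre-existing default: min(contextLength, 4096).
    val effective = maxInferenceLen ?: minOf(contextLength, DEFAULT_MAX_INFERENCE_LEN)
    return NetworkConfig(contextLength, effective)
}

// Intermediate layers forward the nullable cap unchanged.
fun applyWeightsToNetworkNonReified(contextLength: Int, maxInferenceLen: Int? = null): NetworkConfig =
    gemmaNetwork(contextLength, maxInferenceLen)

fun applyWeightsToNetwork(contextLength: Int, maxInferenceLen: Int? = null): NetworkConfig =
    applyWeightsToNetworkNonReified(contextLength, maxInferenceLen)

object GemmaNetworkLoaderSketch {
    fun load(contextLength: Int, maxInferenceLen: Int? = null): NetworkConfig =
        applyWeightsToNetwork(contextLength, maxInferenceLen)
}

fun main() {
    // Existing callers: no argument, so the default min(contextLength, 4096) applies.
    println(GemmaNetworkLoaderSketch.load(contextLength = 32768))
    // Constrained-device caller: cap the context for a short tool-call prompt.
    println(GemmaNetworkLoaderSketch.load(contextLength = 32768, maxInferenceLen = 32))
}
```

Using a nullable parameter with a `null` default, rather than a sentinel such as 0 or -1, keeps every existing call site source-compatible and makes "no cap requested" explicit.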

Why

The eager network sizes its KV cache and RoPE tables for maxInferenceLen (default min(contextLength, 4096)). On the 1.9 GB SL2610, #179 dropped the weight footprint to ~1.06 GB resident (packed Q8_0 lm_head), but the first forward pass still allocates the ~0.4 GB KV cache for a 4096-token context and OOMs the board, even though a tool-call prompt is only ~13 tokens.

Capping maxInferenceLen (e.g. to 32) shrinks the KV cache ~100× (4096 / 32 = 128), so the eager decode fits. The default (null) preserves the existing behaviour.
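The sizing argument can be made concrete with a small sketch. It assumes the standard layout of one K and one V buffer per layer with maxLen × kvHeads × headDim elements, plus precomputed RoPE cos/sin tables of maxLen × headDim/2 entries each; `AttnDims` and its numbers are hypothetical placeholders, not FunctionGemma's configuration or the project's actual allocation code.

```kotlin
// Illustrative KV-cache / RoPE-table sizing under an ASSUMED standard layout.
// The dimensions used in main() are HYPOTHETICAL placeholders.

data class AttnDims(val layers: Int, val kvHeads: Int, val headDim: Int, val bytesPerElem: Int)

// K and V each hold layers * maxLen * kvHeads * headDim elements.
fun kvCacheBytes(d: AttnDims, maxLen: Int): Long =
    2L * d.layers * maxLen * d.kvHeads * d.headDim * d.bytesPerElem

// Precomputed RoPE cos and sin tables: maxLen rows of headDim / 2 entries each.
fun ropeTableBytes(d: AttnDims, maxLen: Int): Long =
    2L * maxLen * (d.headDim / 2) * d.bytesPerElem

fun main() {
    val dims = AttnDims(layers = 16, kvHeads = 2, headDim = 128, bytesPerElem = 2)
    for (len in listOf(4096, 32)) {
        println("maxInferenceLen=$len: KV=${kvCacheBytes(dims, len)} B, RoPE=${ropeTableBytes(dims, len)} B")
    }
    // Both buffers are linear in maxLen, so the ratio is 4096 / 32 regardless of dims.
    println("KV shrink factor: ${kvCacheBytes(dims, 4096) / kvCacheBytes(dims, 32)}x")
}
```

Because both allocations are linear in the sequence length, whatever the real model dimensions are, going from 4096 to 32 positions divides them by 128, consistent with the ~100× figure above.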

On-board evidence

A composite build of #736 + #737 + #179 loaded FunctionGemma to a stable 1.06 GB on the SL2610 (vs. the previous 1.5 GB OOM at load), confirming on hardware that the packed-Q8_0 lm_head fix works. The remaining OOM was the uncapped KV cache that this parameter addresses.

Part of #178.

🤖 Generated with Claude Code


The eager network sizes its KV cache + RoPE tables for maxInferenceLen
(= min(contextLength, 4096) by default). On the 1.9 GB SL2610 that ~0.4 GB
KV cache (allocated at the first forward) OOMs the board even after the
packed Q8_0 lm_head dropped the weight footprint to ~1.06 GB resident.

Thread an optional `maxInferenceLen: Int? = null` through
load() -> applyWeightsToNetwork -> applyWeightsToNetworkNonReified ->
gemmaNetwork so a constrained-device consumer can cap the context (e.g. 32
for a short tool-call prompt), shrinking the KV cache ~100x. Default null
preserves the existing min(contextLength, 4096) behaviour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
michalharakal merged commit 19d62d4 into develop on Jun 15, 2026
0 of 2 checks passed
michalharakal deleted the fix/gemma-board-embed-nocopy branch on June 15, 2026 at 15:47


1 participant