Virgil Lemma foundations by Snider · Pull Request #8 · dAppCore/go-mlx

Snider · 2026-05-20T05:58:29Z

Summary by CodeRabbit

New Features
- Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
- Block‑prefix cache service and memvid bundle index for faster prefix restores.
- Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.

Improvements
- Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
- Build/toolchain updated (C++23) and macOS deployment target raised.

Documentation
- Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

coderabbitai · 2026-05-20T05:58:53Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds a memvid index and wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; adds extensive docs and unit tests.

Core changes

Layer / File(s)	Summary
All changes (build, adapter, backend, agent, cache, artifact, tests, docs) `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `cpp/CMakeLists.txt`, `external/`, `go/adapter.go`, `go/adapter/`, `go/backend.go`, `go/agent/`, `go/blockcache/`, `go/artifact/`, `go/_test.go`, `docs/*`	Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples.

coderabbitai

Actionable comments posted: 18

🧹 Nitpick comments (10)

docs/inference/thinking.md (1)
74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/thinking.md` around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to ```text)
so the block is properly flagged as plain text and satisfies the markdown linter.

docs/runtime/README.md (2)
68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.
📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".

44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.

docs/moe/README.md (1)
9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.
✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
Or alternatively:
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.

docs/observability/probe.md (1)
31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).
📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or
```yaml) so the block is rendered/compliant (the block that begins with
"Generate / Chat:" and lists items like "prefill start → cache_pressure" should
be updated).

docs/moe/jang.md (1)
82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text).
📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.

docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)
7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.

docs/vmlx-feature-gap-report.md (1)
7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.

docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)
5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.

go/agent/index_test.go (1)
16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
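
For illustration, an _Ugly case usually feeds the API input that would crash a careless implementation and asserts that it either fails cleanly or panics in a known way. The sketch below is a standalone program rather than a real `_test.go` file, and `index`, `newIndex` and `panics` are hypothetical stand-ins, not the repository's `NewMemvidIndex` API.

```go
package main

import "fmt"

// index is a hypothetical stand-in for the memvid index type.
type index struct {
	entries []string
}

// newIndex is a hypothetical constructor (NOT the repository's
// NewMemvidIndex). It rejects nil input with an error instead of panicking.
func newIndex(entries *[]string) (*index, error) {
	if entries == nil {
		return nil, fmt.Errorf("index: entries must not be nil")
	}
	return &index{entries: *entries}, nil
}

// panics reports whether fn panicked. The deferred recover keeps the
// caller alive so the outcome can be asserted.
func panics(fn func()) (didPanic bool) {
	defer func() {
		if r := recover(); r != nil {
			didPanic = true
		}
	}()
	fn()
	return false
}

func main() {
	// Edge case: nil input should surface as an error, not as a panic.
	var err error
	crashed := panics(func() { _, err = newIndex(nil) })
	fmt.Println("panicked:", crashed, "err:", err)
}
```

In the repository this would presumably sit inside a `Test..._Ugly_...` function with `t.Run` subcases, with `panics` replaced by whatever panic assertion the existing tests already use.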

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.
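
The identity check could look roughly like the sketch below. The types and field names (`indexEntry`, `bundle`, `SnapshotHash`) are assumptions for illustration, since the actual hash field on the bundle is not confirmed by the comment. Treating an empty recorded hash as a mismatch is also an assumption that may not suit legacy index entries.

```go
package main

import "fmt"

// indexEntry is a hypothetical stand-in for an index entry that records
// which snapshot the bundle URI is expected to contain.
type indexEntry struct {
	URI          string
	SnapshotHash string
}

// bundle is a hypothetical stand-in for a loaded memvid block bundle.
type bundle struct {
	SnapshotHash string
	TokenCount   int
}

// verifyBundle fails when the loaded bundle does not match the identity
// recorded in the index, so a repointed URI cannot restore foreign state.
func verifyBundle(entry indexEntry, b *bundle) error {
	if b == nil {
		return fmt.Errorf("bundle for %q is nil", entry.URI)
	}
	if entry.SnapshotHash == "" {
		return fmt.Errorf("entry %q has no recorded snapshot hash", entry.URI)
	}
	if b.SnapshotHash != entry.SnapshotHash {
		return fmt.Errorf("bundle %q hash mismatch: index has %q, bundle has %q",
			entry.URI, entry.SnapshotHash, b.SnapshotHash)
	}
	return nil
}

func main() {
	entry := indexEntry{URI: "memvid://a", SnapshotHash: "abc"}
	fmt.Println(verifyBundle(entry, &bundle{SnapshotHash: "abc", TokenCount: 10}))
	fmt.Println(verifyBundle(entry, &bundle{SnapshotHash: "zzz", TokenCount: 10}))
}
```

The check runs after the load succeeds and before any prefix restore, which is the ordering the comment asks for.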

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
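
Two of the guards above, sketched with simplified types: selecting an entry only when the choice is unambiguous, and validating the bundle before reading `TokenCount`. `entry`, `index`, `bundle`, `selectEntry` and `newSleepEntry` are hypothetical names, and plain `errors.New` stands in for `core.NewError`.

```go
package main

import (
	"errors"
	"fmt"
)

// entry, index and bundle are hypothetical stand-ins for the real types.
type entry struct {
	URI          string
	PrefixTokens int
}

type index struct {
	Entries []entry
}

type bundle struct {
	TokenCount int
}

// selectEntry auto-selects only when the index holds exactly one entry;
// otherwise the caller must name the entry explicitly.
func selectEntry(idx *index, entryURI string) (entry, error) {
	if idx == nil || len(idx.Entries) == 0 {
		return entry{}, errors.New("mlx: index has no entries")
	}
	if entryURI == "" {
		if len(idx.Entries) != 1 {
			return entry{}, errors.New("mlx: EntryURI required when index has multiple entries")
		}
		return idx.Entries[0], nil
	}
	for _, e := range idx.Entries {
		if e.URI == entryURI {
			return e, nil
		}
	}
	return entry{}, fmt.Errorf("mlx: entry %q not found", entryURI)
}

// newSleepEntry validates the bundle before dereferencing it, so callers
// get an error rather than a nil-pointer panic.
func newSleepEntry(uri string, b *bundle) (entry, error) {
	if b == nil {
		return entry{}, errors.New("mlx: bundle is nil")
	}
	if b.TokenCount <= 0 {
		return entry{}, fmt.Errorf("mlx: invalid token count %d", b.TokenCount)
	}
	return entry{URI: uri, PrefixTokens: b.TokenCount}, nil
}

func main() {
	idx := &index{Entries: []entry{{URI: "a", PrefixTokens: 4}, {URI: "b", PrefixTokens: 8}}}
	_, err := selectEntry(idx, "")
	fmt.Println(err)
	_, err = newSleepEntry("a", nil)
	fmt.Println(err)
}
```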

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
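
A small sketch of the conditional default, assuming a toy in-memory store. `putOptions` and `store` are hypothetical stand-ins for `memvid.PutOptions` and the real store, and `"json"` is simply the default suggested in the comment.

```go
package main

import "fmt"

// putOptions is a hypothetical stand-in for memvid.PutOptions.
type putOptions struct {
	Kind string
}

// store is a toy in-memory store that indexes payloads by kind.
type store struct {
	byKind map[string][]string
}

func (s *store) Put(payload string, opts putOptions) {
	if s.byKind == nil {
		s.byKind = make(map[string][]string)
	}
	s.byKind[opts.Kind] = append(s.byKind[opts.Kind], payload)
}

// defaultKind is an assumed fallback value for this sketch.
const defaultKind = "json"

// kindOrDefault returns kind unless it is empty, in which case the
// fallback is used so kind-based lookups always have a key.
func kindOrDefault(kind string) string {
	if kind == "" {
		return defaultKind
	}
	return kind
}

func main() {
	s := &store{}
	s.Put(`{"a":1}`, putOptions{Kind: kindOrDefault("")})
	s.Put(`{"b":2}`, putOptions{Kind: kindOrDefault("session-state")})
	fmt.Println(len(s.byKind["json"]), len(s.byKind["session-state"]))
}
```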

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
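
The layout check and the hit-path fix can be sketched together as follows. All types here (`blockRef`, `config`, `service`) are simplified assumptions. Strict equality stands in for `cacheIdentityMatches`, which may treat empty values differently in the real code, and locking is omitted.

```go
package main

import "fmt"

// blockRef and config are hypothetical, trimmed-down stand-ins.
type blockRef struct {
	ID        string
	ModelHash string
	CacheMode string
	BlockSize int
	Labels    map[string]string
}

type config struct {
	ModelHash string
	CacheMode string
	BlockSize int
}

// compatible rejects records whose identity or block layout differs from
// the running configuration.
func compatible(cfg config, ref blockRef) bool {
	return cfg.ModelHash == ref.ModelHash &&
		cfg.CacheMode == ref.CacheMode &&
		cfg.BlockSize == ref.BlockSize
}

type service struct {
	blocks map[string]blockRef
	hits   int
	misses int
}

// warm returns refs with cache hits replaced by the stored entry, so
// labels persisted on the cached block are not lost.
func (s *service) warm(refs []blockRef) []blockRef {
	for i, ref := range refs {
		if stored, ok := s.blocks[ref.ID]; ok {
			s.hits++
			refs[i] = stored
			continue
		}
		s.misses++
		s.blocks[ref.ID] = ref
	}
	return refs
}

func main() {
	s := &service{blocks: map[string]blockRef{
		"b1": {ID: "b1", Labels: map[string]string{"memvid_uri": "memvid://x"}},
	}}
	out := s.warm([]blockRef{{ID: "b1"}, {ID: "b2"}})
	fmt.Println(out[0].Labels["memvid_uri"], s.hits, s.misses)
}
```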

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)

.gitignore
.gitmodules
CLAUDE.md
CMakeLists.txt
GOAL.md
docs/README.md
docs/architecture.md
docs/build.md
docs/cmd/violet.md
docs/compute/compute.md
docs/development.md
docs/examples/compute/frame-pipeline.md
docs/examples/daemon/violet-socket.md
docs/examples/eval/attention-probe.md
docs/examples/eval/perplexity.md
docs/examples/inference/batch.md
docs/examples/inference/chat.md
docs/examples/inference/quantization.md
docs/examples/inference/streaming.md
docs/examples/model-ops/hf-fit.md
docs/examples/model-ops/kv-snapshot.md
docs/examples/model-ops/merge.md
docs/examples/model-ops/quantize-gguf.md
docs/examples/training/distill.md
docs/examples/training/grpo.md
docs/examples/training/lora-finetune.md
docs/examples/training/lora-fuse.md
docs/history.md
docs/index.md
docs/inference/README.md
docs/inference/block_cache.md
docs/inference/decode_optimisation.md
docs/inference/parser_registry.md
docs/inference/scheduler.md
docs/inference/thinking.md
docs/memory/README.md
docs/memory/agent_memory.md
docs/memory/agentic_project_seed.md
docs/memory/kv_snapshot.md
docs/memory/kv_snapshot_blocks.md
docs/memory/kv_snapshot_index.md
docs/memory/kv_snapshot_memvid.md
docs/memory/medium.md
docs/memory/state_bundle.md
docs/model-operations.md
docs/model/README.md
docs/model/memory_plan.md
docs/model/model_pack.md
docs/models.md
docs/moe/README.md
docs/moe/codebook_vq.md
docs/moe/expert_residency.md
docs/moe/jang.md
docs/moe/minimax_m2.md
docs/observability/probe.md
docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-goal-completion-audit.md
docs/runtime/2026-05-19-runner-calibration.md
docs/runtime/2026-05-20-chapter-profile-safety.md
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
docs/runtime/README.md
docs/runtime/adapter.md
docs/runtime/local_autotune.md
docs/runtime/register_metal.md
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
docs/training/README.md
docs/training/distill.md
docs/training/eval.md
docs/training/grpo.md
docs/training/lora_adapter.md
docs/training/sft.md
docs/vmlx-feature-gap-report.md
external/go-ai
external/go-inference
external/go-ml
go/adapter.go
go/adapter/adapter.go
go/adapter_example_test.go
go/adapter_test.go
go/agent/helpers.go
go/agent/index.go
go/agent/index_test.go
go/agent/test_helpers_test.go
go/agent/wake_sleep.go
go/api_common.go
go/api_common_example_test.go
go/api_darwin_test.go
go/api_shape_test.go
go/api_stub.go
go/api_stub_example_test.go
go/api_stub_test.go
go/api_test.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_stub_example_test.go
go/api_tokenizer_stub_test.go
go/artifact/artifact.go
go/artifact/artifact_test.go
go/attention_test.go
go/backend.go
go/backend_example_test.go
go/backend_test.go
go/blockcache/blockcache.go
go/blockcache/blockcache_test.go
go/blockcache/helpers_test.go
go/bundle/bundle.go
go/bundle/bundle_test.go
go/bundle/example_test.go
go/bundle/sami.go
go/chaptersmoke/chaptersmoke.go
go/chaptersmoke/chaptersmoke_test.go
go/chat/chat.go
go/chat/chat_test.go
go/chat/example_test.go
go/cmd/go-mlx/main.go
go/cmd/go-mlx/main_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/cmd/mlx/split_ffn_tune.go
go/compute/compute.go
go/compute/compute_example_test.go
go/compute/compute_metal.go
go/compute/compute_metal_example_test.go
go/compute/compute_metal_helper_test.go
go/compute/compute_metal_test.go
go/compute/compute_test.go
go/compute_stub.go
go/compute_stub_example_test.go
go/compute_stub_test.go
go/compute_test.go
go/dataset/jsonl.go
go/dataset/sample.go
go/dataset_stream.go
go/dataset_stream_example_test.go
go/dataset_stream_test.go
go/device_info.go
go/distill.go
go/distill_test.go
go/eval.go
go/eval_darwin.go
go/eval_darwin_test.go
go/eval_stub.go
go/eval_test.go
go/fast_eval.go
go/fast_eval_example_test.go
go/fast_eval_runner.go
go/fast_eval_test.go
go/gguf/info.go
go/gguf/info_example_test.go
go/gguf/info_test.go
go/gguf/quantize.go
go/gguf/quantize_test.go
go/grpo.go
go/grpo_test.go
go/helpers.go
go/hf/hf.go
go/hf/hf_test.go
go/hf/test_helpers_test.go
go/hf_fit.go
go/inference_contract.go
go/inference_contract_test.go
go/internal/metal/activation_bridge.cpp
go/internal/metal/array.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/batch.go
go/internal/metal/cache.go
go/internal/metal/cache_test.go
go/internal/metal/close.go
go/internal/metal/codebook_vq.go
go/internal/metal/codebook_vq_test.go
go/internal/metal/compile.go
go/internal/metal/compile_test.go
go/internal/metal/decode.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/decode_bridge.h
go/internal/metal/decode_test.go
go/internal/metal/dense_matvec.go
go/internal/metal/dense_matvec_test.go
go/internal/metal/device.go
go/internal/metal/dtype.go
go/internal/metal/error_test.go
go/internal/metal/expert_id_matvec.go
go/internal/metal/expert_id_matvec_test.go
go/internal/metal/fast.go
go/internal/metal/fast_test.go
go/internal/metal/gemma3.go
go/internal/metal/gemma4.go
go/internal/metal/gemma4_assistant.go
go/internal/metal/gemma4_assistant_decode.go
go/internal/metal/gemma4_assistant_decode_example_test.go
go/internal/metal/gemma4_assistant_decode_test.go
go/internal/metal/gemma4_assistant_generate.go
go/internal/metal/gemma4_assistant_generate_test.go
go/internal/metal/gemma4_assistant_pair.go
go/internal/metal/gemma4_assistant_test.go
go/internal/metal/gemma4_ffn_residual.go
go/internal/metal/gemma4_ffn_residual_test.go
go/internal/metal/gemma4_router_topk.go
go/internal/metal/gemma4_router_topk_test.go
go/internal/metal/gemma4_test.go
go/internal/metal/gemma4_vision.go
go/internal/metal/generate.go
go/internal/metal/generate_test.go
go/internal/metal/jang_dequant.go
go/internal/metal/jang_dequant_test.go
go/internal/metal/kv_snapshot.go
go/internal/metal/metal.go
go/internal/metal/minimax_m2.go
go/internal/metal/minimax_m2_test.go
go/internal/metal/mlx_mlx_backend_cpu_available.cpp
go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
go/internal/metal/model.go
go/internal/metal/model_test.go
go/internal/metal/nn.go
go/internal/metal/nn_test.go
go/internal/metal/ops.go
go/internal/metal/process_memory_darwin.go
go/internal/metal/process_memory_stub.go
go/internal/metal/prompt_cache.go
go/internal/metal/prompt_cache_test.go
go/internal/metal/qwen3.go
go/internal/metal/qwen3_test.go
go/internal/metal/runtime_gate.go
go/internal/metal/runtime_gate_example_test.go
go/internal/metal/runtime_gate_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/internal/metal/session_example_test.go
go/internal/metal/session_test.go
go/internal/metal/split.go
go/internal/metal/split_test.go
go/internal/metal/stream.go
go/internal/metal/tokenizer.go
go/internal/metal/tokenizer_test.go
go/internal/metal/trace.go
go/internal/metal/trace_test.go
go/internal/metal/training.go
go/jang_test.go
go/kv/analysis.go
go/kv/analysis_example_test.go
go/kv/analysis_test.go
go/kv/bench.go
go/kv/bench_test.go
go/kv/blocks.go
go/kv/blocks_test.go
go/kv/helpers_test.go
go/kv/memvid.go
go/kv/memvid_test.go
go/kv/snapshot.go
go/kv/snapshot_example_test.go
go/kv/snapshot_test.go
go/kv_analysis_example_test.go
go/kv_cache_bench.go
go/kv_snapshot.go
go/kv_snapshot_example_test.go
go/kv_snapshot_test.go
go/local_tuning.go
go/local_tuning_test.go
go/lora/adapter.go
go/lora/fuse.go
go/lora/fuse_stub.go
go/lora/fuse_test.go
go/lora_adapter_darwin_test.go
go/lora_adapter_test.go
go/lora_fuse.go
go/lora_fuse_darwin.go
go/lora_fuse_darwin_test.go
go/lora_fuse_test.go
go/medium_test.go
go/memory/example_test.go
go/memory/memory.go
go/memory/memory_test.go
go/memory_plan.go
go/memory_plan_example_test.go
go/memory_plan_test.go
go/memvid_chapter_smoke.go
go/merge/compare.go
go/merge/compare_example_test.go
go/merge/compare_test.go
go/merge/helpers_test.go
go/merge/merge.go
go/merge/merge_test.go
go/mlx.go
go/mlx_example_test.go
go/mlx_internal_test.go
go/mlx_stub.go
go/mlx_stub_example_test.go

💤 Files with no reviewable changes (15)

go/api_test.go
go/api_stub_example_test.go
go/api_tokenizer_stub_test.go
go/adapter_example_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub_example_test.go
go/backend_example_test.go
go/api_common_example_test.go
go/api_shape_test.go
go/api_common.go
go/api_darwin_test.go
go/attention_test.go
go/api_stub.go
go/api_stub_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
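The aliasing problem described in the comment can be illustrated with a minimal sketch. The `kvHead` type and the helper names here are hypothetical stand-ins, not the actual types in `go/backend.go`; only the copy idiom (`append` into a fresh `[]byte`) comes from the review comment.

```go
package main

import "fmt"

// kvHead is a hypothetical stand-in for the KV head described in the
// review comment; only the two byte fields matter here.
type kvHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// cloneBytes returns a copy that shares no backing array with src.
func cloneBytes(src []byte) []byte {
	return append([]byte(nil), src...)
}

// snapshotHead copies both payloads instead of aliasing them.
func snapshotHead(h kvHead) kvHead {
	return kvHead{
		KeyBytes:   cloneBytes(h.KeyBytes),
		ValueBytes: cloneBytes(h.ValueBytes),
	}
}

func main() {
	src := kvHead{KeyBytes: []byte{1, 2, 3}, ValueBytes: []byte{4, 5, 6}}
	snap := snapshotHead(src)
	src.KeyBytes[0] = 99 // the caller reuses its buffer afterwards
	fmt.Println(snap.KeyBytes[0], src.KeyBytes[0])
}
```

With a direct assignment instead of `cloneBytes`, the write to `src.KeyBytes[0]` would also be visible through the snapshot.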

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)

CMakeLists.txt
cpp/CMakeLists.txt
go/backend.go
go/backend_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/gemma4.go
go/internal/metal/gemma4_test.go
go/internal/metal/generate.go
go/internal/metal/metal.go
go/internal/metal/mlx_build_config.h
go/internal/metal/pinned_array.go
go/internal/metal/pinned_array_bridge.cpp
go/internal/metal/pinned_array_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/kv/snapshot.go
go/memvid_chapter_smoke.go

✅ Files skipped from review due to trivial changes (1)

cpp/CMakeLists.txt

github-advanced-security

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

appendKVEncodedTensor + stream.encodedTensor were both refactored in prior waves to stream values directly into dst, skipping the intermediate normalizeKVSnapshotNativeTensor alloc. The helper has zero remaining call sites; the only references are explanatory comments. Removing the 19-line function tightens the surface and prevents future code from re-wiring the slow path.

Co-Authored-By: Virgil <virgil@lethean.io>

…nstant

W10-E cached EmbeddingScale + PerLayerInputEmbeddingScale on Gemma4TextConfig and removed per-token math.Sqrt(HiddenSize) from six forward-path sites. The perLayerInputTensor path still carried two math.Pow calls firing on every decode token / prefill step:

1. MulScalar(projected, float32(math.Pow(float64(HiddenSize), -0.5)))
2. MulScalar(combined, float32(math.Pow(2, -0.5)))

Cache (1) as cfg.PerLayerProjectionScale (= 1/sqrt(HiddenSize)) populated by gemma4FinaliseEmbeddingScales alongside the existing two scales. parseGemma4Config + LoadGemma4 already invoke it twice, so the new field is kept in sync without extra plumbing. Reset path zeros PerLayerProjectionScale when HiddenSize is zeroed by a pathological loader, mirroring the per-layer reset.

Lift (2) to a package-level const gemma4PerLayerCombineScale = 1/sqrt(2) expressed as a float32 literal (0.70710678118654752440). The test gates the literal against float32(math.Pow(2, -0.5)) so any divergence trips locally before reaching a forward pass.

Sites updated:
- gemma4.go perLayerInputTensor projected scaling -> cfg.PerLayerProjectionScale
- gemma4.go perLayerInputTensor combine scaling -> gemma4PerLayerCombineScale

perLayerInputTensor is invoked through computePerLayerInputs from both forwardNativeFixedGreedyToken (per-decode-token) and forwardHidden (per-prefill step), so the lift compounds across the existing W10-E EmbeddingScale work for the same forward path. The compiled CompileShapeless wrapper around perLayerInputTensor inherits the change because it dispatches to the same function.

Tests extend the existing W10-E pair: byte-equivalence vs a freshly computed math.Pow(HiddenSize, -0.5), reset-on-zero for both per-layer and HiddenSize zeroings, and a literal-vs-math.Pow guard for gemma4PerLayerCombineScale.

Co-Authored-By: Virgil <virgil@lethean.io>
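The caching idea in this commit can be sketched in a few lines. The `textConfig` type, its field name, and `finaliseScales` are hypothetical simplifications of the config described above; the literal and the guard against `float32(math.Pow(2, -0.5))` follow the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// combineScale is 1/sqrt(2) written as a literal, so no math.Pow call
// runs per token.
const combineScale float32 = 0.70710678118654752440

// textConfig is a hypothetical, trimmed-down config holding one cached scale.
type textConfig struct {
	HiddenSize         int
	PerLayerProjection float32 // 1/sqrt(HiddenSize), or 0 when HiddenSize <= 0
}

// finaliseScales recomputes the cached value; it must be called again
// whenever HiddenSize changes.
func (c *textConfig) finaliseScales() {
	if c.HiddenSize <= 0 {
		c.PerLayerProjection = 0
		return
	}
	c.PerLayerProjection = float32(math.Pow(float64(c.HiddenSize), -0.5))
}

// combineScaleMatches guards the literal against the expression it replaced.
func combineScaleMatches() bool {
	return combineScale == float32(math.Pow(2, -0.5))
}

func main() {
	cfg := textConfig{HiddenSize: 1536}
	cfg.finaliseScales()
	fmt.Println(cfg.PerLayerProjection, combineScale, combineScaleMatches())
}
```

The guard function plays the role of the literal-vs-`math.Pow` test: if someone edits the literal, the comparison fails before any forward pass runs.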

…ositionVector

Three orphans in analysis.go:

- kvAnalysisHeadVectors: a thin wrapper for kvAnalysisHeadVectorsInto that callers stopped routing through (the Into form is the only surface left after W8 reuse-scratch).
- kvAnalysisMeanVector: replaced by kvAnalysisLayerState (sum-into-place avoids the [][]float32 view + per-head combined-buffer alloc).
- kvAnalysisPositionVector: replaced by the inline slice arithmetic inside kvAnalysisPositionDifferentiation Pass 1.

Comment refs touched up: the Into form name is consistent, and the mean-vector behaviour note now stands on its own without naming a function that no longer exists.

Net: -47 lines, tighter surface, no behaviour change.

Co-Authored-By: Virgil <virgil@lethean.io>

…r loop

splitPerLayerInputTensor calls Squeeze(sliced, 2) inside a NumHiddenLayers-iteration loop. After W10-A made the substrate Squeeze itself a 0-alloc cgo crossing, the residual per-call cost shifted to the model layer: each `Squeeze(sliced, 2)` variadic call allocates a fresh single-element `[]int{2}` axes slice that escapes to the heap (Squeeze's substrate takes &axes[0] for the cgo inline call, so the compiler can't keep the slice on the stack). For a Gemma 4 model with 26 hidden layers this is 26 allocs and 208 B of GC pressure per perLayerInputs call: per token for decode, per prefill step otherwise.

Hoist the `[]int{2}` outside the loop and pass it via `Squeeze(sliced, axes...)`, so the slice is allocated once per splitPerLayerInputTensor call rather than once per layer.

Bench (Apple M3 Ultra, 200ms benchtime; mock 26-layer loop over Squeeze of a [1,1,1,128] array):

    Loop26_VariadicInline    11427 ns/op   208 B/op   26 allocs/op
    Loop26_VariadicHoisted   10427 ns/op     8 B/op    1 allocs/op

-25 allocs/forward (-96.2%), -200 B/forward (-96.2%), with the same forward output by construction: the variadic slice content is identical, only the allocation point moved.

Co-Authored-By: Virgil <virgil@lethean.io>
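The hoist can be reproduced without cgo. In the sketch below, `squeeze` is a hypothetical pure-Go stand-in for the substrate call: storing the axes slice in a package-level variable forces it to escape, which is an assumption standing in for the `&axes[0]` pointer the real call takes. `testing.AllocsPerRun` is used only to compare the two loops; absolute counts may differ from the benchmark in the commit.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the axes slice reachable, mimicking a callee that takes
// &axes[0] for a cgo call and therefore forces the slice to escape.
var sink []int

//go:noinline
func squeeze(x int, axes ...int) int {
	sink = axes
	return x - len(axes)
}

const layers = 26

// loopInline builds a fresh []int{2} for every iteration.
func loopInline() int {
	total := 0
	for i := 0; i < layers; i++ {
		total += squeeze(i, 2)
	}
	return total
}

// loopHoisted allocates the axes slice once and reuses it.
func loopHoisted() int {
	axes := []int{2}
	total := 0
	for i := 0; i < layers; i++ {
		total += squeeze(i, axes...)
	}
	return total
}

// allocsPerCall reports the average heap allocations of one call to f.
func allocsPerCall(f func() int) float64 {
	return testing.AllocsPerRun(100, func() { _ = f() })
}

func main() {
	fmt.Println("inline allocs: ", allocsPerCall(loopInline))
	fmt.Println("hoisted allocs:", allocsPerCall(loopHoisted))
	fmt.Println("same result:   ", loopInline() == loopHoisted())
}
```

The hoisted form is only safe when the callee does not retain or mutate the slice across iterations, which is the case when it is merely read during the call.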

…sites

After W10-A made AsStrided / Reshape / Transpose / Squeeze / SliceUpdateInplace 0-alloc at the substrate level (the C side materialises shape/strides arrays on the C stack via the *_inline wrappers), the residual per-call cost at every model attention forward site shifted to the Go-side inline slice literal. The substrate takes &shape[0] / &strides[0] / &axes[0] via unsafe.Pointer for the cgo call, so the compiler conservatively escapes any caller-side slice literal to the heap. This means each model attention forward call still pays:

    AsStrided inline slice literals   +48 B/op   +2 allocs/op
    Reshape variadic args             +16 B/op   +1 alloc/op
    Transpose variadic args           +32 B/op   +1 alloc/op
    Squeeze variadic args              +8 B/op   +1 alloc/op

Per gemma3 / gemma4 / qwen3 attention layer per token:

    Q/K/V AsStrided   3 × (48 B + 2 allocs) = 144 B + 6 allocs
    Transpose out     1 × (32 B + 1 alloc)  =  32 B + 1 alloc
    Reshape merged    1 × (16 B + 1 alloc)  =  16 B + 1 alloc
                                              192 B + 8 allocs

For a 32-layer 1000-token Gemma 4 decode that is ~256 000 allocs and ~6 MB of GC pressure attributable to inline rank-4 slice literals alone.

W9-AA / W9-L / W9-M precedents would normally absorb this at the model layer; that path is closed here because the call signatures are inherently variadic. A future substrate addition (a rank-4 overload that takes int32/int64 args directly, e.g. AsStrided4D(a, B, H, L, D, sB, sH, sL, sD, offset)) would lift the last residual without touching every model file.

The benchmarks added here are the measurement floor for that future substrate work and are not themselves a perf change. Each routine compares the existing pre-built-slice path against the model-call inline-literal pattern so any future substrate fix can demonstrate the delta against the same shape inputs.

Co-Authored-By: Virgil <virgil@lethean.io>

…lem path)

The DecodeFloatData F32 branch was running an N-element loop of `math.Float32frombits(binary.LittleEndian.Uint32(raw[i*4:]))`. Each iteration reslices, bounds-checks, byte-loads four bytes, calls Uint32 to re-assemble them little-endian, then converts to float32 via the bit-pattern primitive. On little-endian arm64/amd64 the bytes on disk already match the float32 storage layout, so the whole loop is a memcpy.

Reinterpret-cast `values` as a byte slice via unsafe.Slice + unsafe.SliceData (same idiom as kv/snapshot.go decodeKVSnapshotNativeTensor) and `copy(dst, raw)` in one shot.

Bench (Apple M3 Ultra, benchtime=2s, count=3):

    F32_512     502ns →  355ns (-29%)
    F32_2048   1991ns → 1322ns (-34%)

Numerical parity: bit-exact. Float32frombits and the byte-pattern memcpy land on identical IEEE-754 bit patterns. Existing parity test TestParseHeader_Parity_Synthetic + TestWriteSubset_Good cover round-trip; `go test -race -short` clean.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>

f32sRaw and i32sRaw staged values into the pooled scratch buffer then issued a single writer.Write. The staging copy is pure waste because:

- The byte view of []float32 / []int32 already matches what Float32bits + PutUint32 would produce (little-endian arches only, arm64 + amd64).
- Writers consume the bytes within Write, so we don't need to retain a stable scratch buffer past the call.
- The writer itself (sha256.Write, PutBytesStream) does its own buffering; staging into scratch first costs one extra memcpy with zero downstream benefit.

Now passes the unsafe.Slice byte view straight to w.bytes(...). Eliminates 1 memcpy per f32sRaw / i32sRaw call (typically 3-5 calls per snapshot encode; one per layer × Key/Value side + tokens + logits). scratchFor + scratch field removed from the pool struct.

Bench (M3 Ultra, benchtime=2s, n=3):

    Snapshot_WriteWithOptions_2048Tokens: 1217 → 795 ns/op (-35%)

No alloc delta: scratch was pool-resident already; the win is pure ns/op throughput. HashSnapshot benches unchanged within noise (the sha256.New + Sum dominate at small sizes).

Co-Authored-By: Virgil <virgil@lethean.io>

…lem path)

DecodeFloatData F64 branch was looping `float32(math.Float64frombits(binary.LittleEndian.Uint64(raw[i*8:])))`. Per-iter cost: 1) re-slice raw[i*8:] (bounds check), 2) Uint64 reassembly from 8 bytes, 3) Float64frombits (a bit cast), 4) float32 conversion. Steps 1-3 are throw-away work because the bytes on disk already match the float64 storage layout on little-endian architectures such as arm64 and amd64.

Reinterpret-cast raw as []float64 via unsafe.Slice + unsafe.SliceData and let `range src64` produce direct float64 loads. The downcast to float32 stays, since F64→F32 is genuinely lossy. On arm64 the compiler emits a clean LDR D + FCVT S pair per element.

Bench (Apple M3 Ultra, benchtime=2s, count=3):

    F64_2048   2470ns → 1853ns mean (-25%)   best 1615ns (-35%)

Numerical parity: bit-exact. Float64frombits is the inverse of the storage byte pattern, so the unsafe cast and the explicit reassembly land on identical float64 values, and float32() downcast is identical in both paths.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>

… 3× faster

PrefillCacheStateArrays paid one alloc per Cache.State() call (each returning a fresh []*Array{k,v} literal). On Gemma 4's 26-cache fan-out the substrate saw 27 allocs per prefill step (516.1 ns).

Add an optional stateAppender interface implemented by KVCache, RotatingKVCache, FixedKVCache, PagedKVCache, QuantizedKVCache: AppendState(dst) appends raw state arrays into a caller-provided slice. The public State() contract is unchanged; the appendCacheState() helper falls back via State() copy for any future cache type that doesn't implement the optimisation.

Before:

    PrefillCacheStateArrays_8Caches    172.7 ns/op    9 allocs/op
    PrefillCacheStateArrays_26Caches   516.1 ns/op   27 allocs/op

After:

    PrefillCacheStateArrays_8Caches    59.81 ns/op   1 alloc/op (-88.9% / 2.9×)
    PrefillCacheStateArrays_26Caches   174.1 ns/op   1 alloc/op (-96.3% / 3.0×)

Same pattern wired through cacheStateArraysForDetach (eval-detach path); the same alloc-floor reduction applies to detach-after-prefill.

Co-Authored-By: Virgil <virgil@lethean.io>
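The optional-interface pattern described here can be sketched as follows. The names `stateAppender`, `AppendState` and `appendCacheState` come from the commit message; the `Array`, `Cache`, `kvCache` and `legacyCache` types are hypothetical reductions used only for illustration.

```go
package main

import "fmt"

// Array is a hypothetical placeholder for the backend array handle.
type Array struct{ Name string }

// Cache is the public contract: State returns a fresh slice per call.
type Cache interface {
	State() []*Array
}

// stateAppender is the optional fast path: append into a caller-owned slice.
type stateAppender interface {
	AppendState(dst []*Array) []*Array
}

// kvCache implements both the public contract and the fast path.
type kvCache struct{ k, v *Array }

func (c *kvCache) State() []*Array                   { return []*Array{c.k, c.v} }
func (c *kvCache) AppendState(dst []*Array) []*Array { return append(dst, c.k, c.v) }

// legacyCache only implements State, so it exercises the fallback.
type legacyCache struct{ k, v *Array }

func (c *legacyCache) State() []*Array { return []*Array{c.k, c.v} }

// appendCacheState prefers AppendState and falls back to copying State().
func appendCacheState(dst []*Array, c Cache) []*Array {
	if a, ok := c.(stateAppender); ok {
		return a.AppendState(dst)
	}
	return append(dst, c.State()...)
}

// collectStates gathers every cache's arrays into one pre-sized slice.
func collectStates(caches []Cache) []*Array {
	out := make([]*Array, 0, 2*len(caches))
	for _, c := range caches {
		out = appendCacheState(out, c)
	}
	return out
}

func main() {
	caches := []Cache{
		&kvCache{&Array{"k0"}, &Array{"v0"}},
		&legacyCache{&Array{"k1"}, &Array{"v1"}},
	}
	for _, a := range collectStates(caches) {
		fmt.Print(a.Name, " ")
	}
	fmt.Println()
}
```

Because the fast path is discovered through a type assertion, cache types that never adopt it keep working unchanged, at the cost of the per-call slice they already paid.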

…(-38% allocs)

WriteSubset previously built a map[string]HeaderEntry with per-entry []int64 Shape + DataOffsets slices then ran core.JSONMarshal over it. The reflection-driven encoder allocates internally for each struct field descriptor and for the resulting buffer-grow chain. On a 2-tensor subset the path was 24 allocs / 2660 B.

Replace with subsetHeaderEncoded, a hand-rolled appender that emits the safetensors header bytes directly into a pre-sized buffer:

    {name1:{"dtype":"F32","shape":[2,3],"data_offsets":[0,24]},...}

- Map keys are emitted in alphabetical order (encoding/json default).
- Struct field order is dtype → shape → data_offsets (declaration order of HeaderEntry, which is what JSONMarshal already produced).
- appendJSONString is the same escape table as encoding/json encodeString (\", \\, \b/\f/\n/\r/\t, \u00XX for the rest).
- appendJSONInt64 emits base-10 with no leading zeros into a 20-byte stack buffer (no heap alloc).

Bench (Apple M3 Ultra, benchtime=2s, count=3, WriteSubset_TwoTensors):

    before   74177 ns/op   2660 B/op   24 allocs/op
    after    70600 ns/op   1688 B/op   15 allocs/op   (-5% ns, -37% B, -38% allocs)

Time savings are bounded by the file I/O the bench includes; the allocation drop is the structural win. The dropped allocs were exactly the per-tensor HeaderEntry construction + JSONMarshal field-descriptor churn. The data flows through once now, never materialised as a Go object graph.

Added TestSubsetHeaderEncoded_ParityWithJSONMarshal, which anchors the encoder output bit-exact against core.JSONMarshal(map[string]HeaderEntry) across 4 shapes (single 2D, multi-dim mix with three tensors, lowercase dtype canonicalisation, single one-dim). Any structural drift breaks the test, so future tweaks remain safe.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
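A hand-rolled header appender of the kind described can be sketched with the standard library alone. Everything below is illustrative: `headerEntry`, `encodeHeader` and the helpers are hypothetical names, `strconv.AppendInt` stands in for the commit's appendJSONInt64, and the simplified string escaper is only byte-identical to `encoding/json` for plain ASCII names without control characters or `<`, `>`, `&`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// headerEntry mirrors the three fields named in the commit message.
type headerEntry struct {
	DType       string  `json:"dtype"`
	Shape       []int64 `json:"shape"`
	DataOffsets []int64 `json:"data_offsets"`
}

// appendJSONString quotes s, escaping quotes, backslashes and control bytes.
func appendJSONString(dst []byte, s string) []byte {
	const hex = "0123456789abcdef"
	dst = append(dst, '"')
	for i := 0; i < len(s); i++ {
		switch c := s[i]; {
		case c == '"' || c == '\\':
			dst = append(dst, '\\', c)
		case c < 0x20:
			dst = append(dst, '\\', 'u', '0', '0', hex[c>>4], hex[c&0xf])
		default:
			dst = append(dst, c)
		}
	}
	return append(dst, '"')
}

// appendInts emits a JSON array of base-10 integers.
func appendInts(dst []byte, vals []int64) []byte {
	dst = append(dst, '[')
	for i, v := range vals {
		if i > 0 {
			dst = append(dst, ',')
		}
		dst = strconv.AppendInt(dst, v, 10)
	}
	return append(dst, ']')
}

// encodeHeader writes entries with sorted keys and a fixed field order.
// Nil slices encode as [] here, whereas encoding/json would emit null.
func encodeHeader(entries map[string]headerEntry) []byte {
	names := make([]string, 0, len(entries))
	for name := range entries {
		names = append(names, name)
	}
	sort.Strings(names)
	dst := make([]byte, 0, 64*len(names)+2)
	dst = append(dst, '{')
	for i, name := range names {
		if i > 0 {
			dst = append(dst, ',')
		}
		e := entries[name]
		dst = appendJSONString(dst, name)
		dst = append(dst, `:{"dtype":`...)
		dst = appendJSONString(dst, e.DType)
		dst = append(dst, `,"shape":`...)
		dst = appendInts(dst, e.Shape)
		dst = append(dst, `,"data_offsets":`...)
		dst = appendInts(dst, e.DataOffsets)
		dst = append(dst, '}')
	}
	return append(dst, '}')
}

// matchesStdlib is the parity guard against encoding/json.
func matchesStdlib(entries map[string]headerEntry) bool {
	want, err := json.Marshal(entries)
	return err == nil && string(want) == string(encodeHeader(entries))
}

func main() {
	entries := map[string]headerEntry{
		"b.weight": {DType: "F32", Shape: []int64{2, 3}, DataOffsets: []int64{0, 24}},
		"a.bias":   {DType: "F16", Shape: []int64{3}, DataOffsets: []int64{24, 30}},
	}
	fmt.Println(string(encodeHeader(entries)), matchesStdlib(entries))
}
```

Keeping a parity check against the reflection-based encoder, as the commit does with its own test, is what makes a hand-rolled encoder safe to maintain.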

…05 ns -42%, stream-writer direct-unsafe-slice eliminates staging memcpy)

…tate/scheduler/openai/ollama/parser + 3.5-4.6× jang dequant model-LOAD)

… + Squeeze axes hoist — -96.2% allocs/forward on 26-layer Gemma 4)

Continues the W9-Y sentinel sweep into the remaining safetensors call sites with static-text core.NewError messages:

safetensors.go
- errChunkOutOfBounds: ReadFloat32Chunk + readFloat32ChunkInto
- errChunkTruncated: same pair
- errF32PayloadMismatch: DecodeFloatData F32 branch
- errF16PayloadMismatch: DecodeFloatData F16 branch
- errBF16PayloadMatch: DecodeFloatData BF16 branch
- errF64PayloadMismatch: DecodeFloatData F64 branch
- errCoreResultFailed: resultError fallback

write.go
- errSubsetPathEmpty: WriteSubset early validation
- errSubsetNoTensors: same
- errSubsetTensorNameEmpty: subsetHeaderEncoded validation
- errWriteNoProgress: writeAll zero-progress guard

Each previously allocated a fresh core.NewError on fire. None of these fire in the bench paths (validation guards), so the bench delta is zero. This is a hygiene fix that keeps the next session's pprof showing data-shape allocs rather than "oh another core.NewError". errors.Is on the typed sentinels also works for callers wanting to distinguish "chunk truncated" vs "chunk out of bounds" without text matching.

Per-call-site errors that interpolate ref.Name are left as-is: the message is genuinely dynamic and a sentinel can't carry the per-tensor context.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
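The sentinel pattern can be sketched with the standard `errors` package, used here as a stand-in for the project's core.NewError. The two sentinel names come from the commit message; `readChunk`, `loadTensor`, and the exact conditions that separate "out of bounds" from "truncated" are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Package-level sentinels: allocated once, comparable with errors.Is.
var (
	errChunkOutOfBounds = errors.New("safetensors: chunk out of bounds")
	errChunkTruncated   = errors.New("safetensors: chunk truncated")
)

// readChunk returns payload[offset:offset+length] or a sentinel error.
func readChunk(payload []byte, offset, length int) ([]byte, error) {
	if offset < 0 || length < 0 || offset > len(payload) {
		return nil, errChunkOutOfBounds
	}
	if length > len(payload)-offset {
		return nil, errChunkTruncated
	}
	return payload[offset : offset+length], nil
}

// loadTensor adds dynamic per-tensor context by wrapping the sentinel.
func loadTensor(name string, payload []byte, offset, length int) ([]byte, error) {
	chunk, err := readChunk(payload, offset, length)
	if err != nil {
		return nil, fmt.Errorf("tensor %q: %w", name, err)
	}
	return chunk, nil
}

func isTruncated(err error) bool   { return errors.Is(err, errChunkTruncated) }
func isOutOfBounds(err error) bool { return errors.Is(err, errChunkOutOfBounds) }

func main() {
	payload := make([]byte, 8)
	_, err := loadTensor("blk.0.attn_q.weight", payload, 4, 8)
	fmt.Println(err, isTruncated(err), isOutOfBounds(err))
}
```

Wrapping with `%w` keeps the dynamic tensor name in the message while callers still match on the shared sentinel, which is the split the commit draws between static and interpolated errors.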

…14→6 allocs

The W10-A inline-C-wrapper for Slice/SliceUpdateInplace removed the cgo-side []C.int materialisations, but the Go-side []int32{0,0,prev,0}, []int32{B,H, offset,D} literal pairs still escape to heap (verified via -gcflags='-m'). On KVCache.Update: 4 such pairs per call × 4 bytes elem × 4 elems each = 8 heap allocs per Update step.

Add Slice4 / SliceUpdateInplace4, a rank-4 scalar-pass form (W10-J pattern applied to slice). 8 indices passed as register-passed scalars; C wrapper materialises stack buffers directly. KV cache canonical rank is 4, so these cover the dominant cache.go call sites.

Wire KVCache.Update (5 call sites) through the scalar form:

Before:

    KVCache_Append_SingleToken_FromEmpty      241 B/op     14 allocs/op
    KVCache_Append_SingleToken_To32          7188 B/op    417 allocs/op
    KVCache_Append_SingleToken_To512       114766 B/op   6659 allocs/op
    KVCache_Append_512TokenPrefill            240 B/op     14 allocs/op
    KVCache_Append_4096TokenPrefill           240 B/op     14 allocs/op

After:

    KVCache_Append_SingleToken_FromEmpty      112 B/op      6 allocs/op (-57%)
    KVCache_Append_SingleToken_To32          3089 B/op    161 allocs/op (-61%)
    KVCache_Append_SingleToken_To512        49221 B/op   2563 allocs/op (-61%)
    KVCache_Append_512TokenPrefill            112 B/op      6 allocs/op (-57%)
    KVCache_Append_4096TokenPrefill           112 B/op      6 allocs/op (-57%)

ns/op steady: the W10-A inline-C wrapper already won the per-call latency; this is alloc-floor improvement. Bedrock for downstream alloc-sensitive paths.

The remaining 6 allocs/op are: 4 newArray ops (Slice×2 + SliceUpdateInplace×2) + 2 Concatenate-internal materialisations. Those need W10-G pool extensions or W10-A-style wrappers.

Other rank-4 call sites in cache.go (RotatingKVCache, FixedKVCache, BorrowedFixedState, 17 more lines) follow in subsequent commits.

Co-Authored-By: Virgil <virgil@lethean.io>

The DecodeFloatData F16 + BF16 branches were running a manual `uint16(buf[j]) | uint16(buf[j+1])<<8` byte-pair combine per element because the previous pass (W4-D) avoided allocating a uint16 slice. But the byte combine is throwaway work: fp16 / bf16 storage is little-endian, which on little-endian architectures such as arm64 and amd64 is identical to the in-memory layout of a uint16. The per-iter cost was a byte load, a shift, an OR, then the actual bit-twiddle conversion.

Reinterpret-cast raw as []uint16 via unsafe.Slice + unsafe.SliceData and let `range src16` produce direct half-word loads. On arm64 this collapses to LDR.H + Float16ToFloat32 (F16) or LDR.H + LSL + bit-cast (BF16).

Bench (Apple M3 Ultra, benchtime=2s, count=3):

    F16_2048    4032ns → 3506ns mean (-13%, best -15%)
    BF16_2048   2167ns → 1450ns mean (-33%, best -33%)

The smaller F16 win is structural: Float16ToFloat32 itself is the ~70% dominator (denormal + special-value handling + final Float32frombits), so the load-side simplification only earns where the load was non-trivial. BF16's body is just a shift + Float32frombits so the load saving lands larger.

Numerical parity: bit-exact. Same little-endian uint16 reassembly, same Float16ToFloat32 / Float32frombits result.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
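The BF16 half of this change can be sketched as below. The function names are hypothetical, and the endianness and alignment guard with a portable fallback is an addition of this sketch rather than something the commit describes; it keeps the `[]uint16` view valid for arbitrary sub-slices and under pointer checking.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"unsafe"
)

// bf16ToF32 widens a bfloat16 bit pattern: it is the top half of a float32.
func bf16ToF32(h uint16) float32 {
	return math.Float32frombits(uint32(h) << 16)
}

// decodeBF16Portable reassembles each half-word byte by byte.
func decodeBF16Portable(raw []byte) []float32 {
	out := make([]float32, len(raw)/2)
	for i := range out {
		out[i] = bf16ToF32(binary.LittleEndian.Uint16(raw[i*2:]))
	}
	return out
}

// canViewAsUint16 reports whether raw may be reinterpreted as []uint16:
// the host must be little-endian and the data 2-byte aligned.
func canViewAsUint16(raw []byte) bool {
	if len(raw) < 2 {
		return false
	}
	one := uint16(1)
	little := *(*byte)(unsafe.Pointer(&one)) == 1
	aligned := uintptr(unsafe.Pointer(unsafe.SliceData(raw)))%2 == 0
	return little && aligned
}

// decodeBF16 uses direct half-word loads when the view is valid.
func decodeBF16(raw []byte) []float32 {
	if !canViewAsUint16(raw) {
		return decodeBF16Portable(raw)
	}
	src16 := unsafe.Slice((*uint16)(unsafe.Pointer(unsafe.SliceData(raw))), len(raw)/2)
	out := make([]float32, len(src16))
	for i, h := range src16 {
		out[i] = bf16ToF32(h)
	}
	return out
}

func main() {
	// Little-endian bf16 for 1.0 (0x3F80), -2.0 (0xC000), 3.140625 (0x4049).
	raw := []byte{0x80, 0x3F, 0x00, 0xC0, 0x49, 0x40}
	fmt.Println(decodeBF16(raw))
}
```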

…nsor [12]byte heap escape

The per-tensor `var trailer [12]byte` was force-escaping to heap on every iteration because io.ReadFull's interface-typed buf parameter defeats stack allocation. On a qwen3-class manifest (200 tensors) this costs 200 allocs per parseGGUF call.

Reusing the already-existing 64-byte `scratch` arena for the trailer read keeps the io.ReadFull interface escape pinned to a single per-call allocation that's amortised across all metadata + tensor reads. Same bytes read in the same order: bit-exact parse.

BenchmarkInfo_ReadInfo_TypicalLayers (200 tensors):

    before:   484 allocs/op   58136 B/op
    after:    284 allocs/op   54944 B/op   (-41% allocs, -5.5% bytes)
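The scratch-reuse idea can be sketched as follows. The `tensorTrailer` layout (a uint32 type followed by a uint64 offset, 12 bytes in total) and the function names are assumptions for illustration, not the repository's parseGGUF; the point is that one buffer serves every `io.ReadFull` call in the loop.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// tensorTrailer is a hypothetical 12-byte record: type, then data offset.
type tensorTrailer struct {
	Type   uint32
	Offset uint64
}

// readTrailers decodes count records, reusing one scratch buffer so the
// buffer handed to io.ReadFull is allocated once per call, not per record.
func readTrailers(r io.Reader, count int) ([]tensorTrailer, error) {
	scratch := make([]byte, 64)
	out := make([]tensorTrailer, 0, count)
	for i := 0; i < count; i++ {
		buf := scratch[:12]
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, fmt.Errorf("trailer %d: %w", i, err)
		}
		out = append(out, tensorTrailer{
			Type:   binary.LittleEndian.Uint32(buf[0:4]),
			Offset: binary.LittleEndian.Uint64(buf[4:12]),
		})
	}
	return out, nil
}

// sampleStream builds n records with Type=i and Offset=i*4096.
func sampleStream(n int) *bytes.Buffer {
	var stream bytes.Buffer
	for i := 0; i < n; i++ {
		var rec [12]byte
		binary.LittleEndian.PutUint32(rec[0:4], uint32(i))
		binary.LittleEndian.PutUint64(rec[4:12], uint64(i)*4096)
		stream.Write(rec[:])
	}
	return &stream
}

func main() {
	trailers, err := readTrailers(sampleStream(3), 3)
	fmt.Println(trailers, err)
}
```

Decoded values are copied out of the scratch buffer before the next read overwrites it, which is what makes the reuse safe.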

…ll alloc

The 24-byte header read used a dedicated `var header [24]byte` local that escaped to heap separately from the existing `scratch [64]byte`. Both serve identical purpose (io.ReadFull destination), so reuse the single arena.

Moves the scratch declaration above the header read; the header occupies scratch[:24] just long enough to extract magic / version / counts before the metadata loop reuses scratch[:] for keys + values. Same bytes read in the same order: bit-exact parse.

BenchmarkInfo_ReadInfo (across all three benches): -1 alloc/op, -24 B/op uniform

… alloc reduction

Apply the Slice4 / SliceUpdateInplace4 scalar-pass form established by the previous commit to the remaining rank-4 slice/update sites in cache.go: RotatingKVCache (5 sites in updateInplace + 5 in updateConcat), FixedKVCache (4 sites in updateInplace + ReadState), PagedKVCache helpers (cachePageView, borrowedPageView, borrowVisiblePage, prefill materialization + cacheTail). 22 call sites converted.

Each site replaces two []int32{...} literal allocs per call with zero heap allocs (scalars are register-passed; C wrapper materialises stack buffers directly).

Bench deltas (selected):

    RotatingKVCache_512Prefill_Cap512              2499 ns / 9 allocs → 2232 ns / 5 allocs (-11% ns, -44% allocs)
    PagedKVCache_BoundedTo1024_PastCap             12286 allocs → 8206 allocs (-33%)
    PagedKVCache_4096Tokens_PageSize256_Prealloc   14511 allocs → 14511 allocs (no-find, locked by other pattern)
    QuantizedKVCache_4096Prefill_Q8Q8              14 allocs → 13 allocs (-1)

The Quantized_4096Prefill drop is the W10-G fingerprint: each Slice4 site saves 2 allocs; visible only where the call count overwhelmed the other costs.

Co-Authored-By: Virgil <virgil@lethean.io>

…d-rolled JSON write — F32 -50% / F64 -44% / BF16 -31% / WriteSubset -38% allocs)

….0× ns

ScaledDotProductAttentionPaged transposes the K page on every iteration of the page loop (Transpose(key, 0,1,3,2)). The variadic axes []int parameter escapes to heap on each call (verified via -gcflags='-m'). On 16-page attention that's 16 per-page alloc + 1 outer slice = 17 allocs.

Add Transpose4 scalar-pass form (W10-J pattern applied to transpose) and wire SDPAPaged through it.

Before:

    SDPAPaged_2Pages_Page256_Q1_D128     65 B/op    2 allocs/op
    SDPAPaged_4Pages_Page256_Q1_D128    131 B/op    4 allocs/op
    SDPAPaged_8Pages_Page256_Q1_D128    328 B/op    9 allocs/op
    SDPAPaged_16Pages_Page256_Q1_D128   657 B/op   17 allocs/op

After:

    SDPAPaged_2Pages_Page256_Q1_D128      0 B/op   0 allocs/op (-100%)
    SDPAPaged_4Pages_Page256_Q1_D128      2 B/op   0 allocs/op (-100%)
    SDPAPaged_8Pages_Page256_Q1_D128     67 B/op   1 alloc/op (-89%)
    SDPAPaged_16Pages_Page256_Q1_D128   136 B/op   1 alloc/op (-94%)

The remaining 1 alloc at 8+ pages is the scorePages slice grow chain (cap hint is len(keyPages), so this is a single allocation; the heap byte count scales with page count).

ns/op steady: the W10-A transpose-axes-inline wrapper already won the per-call cgo cost; this is alloc-floor reduction on decode-time attention across pages, the PagedKVCache canonical hot path.

Co-Authored-By: Virgil <virgil@lethean.io>

Four core.NewError sites in resolveGGUFFile / parseGGUF / readGGUFString previously rebuilt a fresh `*core.Err` per hit. Lifting to package vars matches the W9-Y sweep pattern (safetensors/header_parse.go) and ensures truncated/bogus GGUF files don't bleed allocs on the error path during fleet probes.

Success-path benches unchanged; the wins land when ReadInfo hits churn-grade error scenarios at scale (model discovery walking 1000+ directories with mixed valid/invalid candidates).

errGGUFNoFile, errGGUFMultipleFiles, errGGUFInvalidMagic, errGGUFStringTooLong

…iew per tensor

readGGUFString allocated a fresh []byte (then string) per tensor name. For a qwen3-class manifest (200 tensors) that's 200 separate heap objects per parse, dominating the parseGGUF alloc count.

readTensorNameInto reads all names into a single 40 B/tensor slab, then hands out zero-copy core.AsString views. The arena is sized once and never grown, so existing name views stay valid for the lifetime of the Info: the same lifetime as one big buffer instead of N tiny ones.

Overflow path (name > 40 B headroom) falls back to per-tensor make to preserve correctness; pre-existing views in the arena stay anchored.

Skips the intern-map probe for tensor names: every real GGUF tensor name contains a layer index (`blk.<N>.<component>.<part>`), so the intern hit-rate was zero on this path.

BenchmarkInfo_ReadInfo_TypicalLayers (200 tensors):

    before:   283 allocs/op   54920 B/op
    after:     84 allocs/op   59912 B/op   (-70% allocs, +9% bytes)

BenchmarkInfo_ReadInfo_VocabHeavy (50 tensors + vocab-heavy metadata):

    before:   673 allocs/op   44296 B/op
    after:    624 allocs/op   45544 B/op   (-7% allocs, +3% bytes)

The byte uptick is the arena's headroom budget; net alloc-count drop matters more for GC pressure under fleet-level model-discovery churn.
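The name-arena technique can be sketched with the standard library. This is not the repository's readTensorNameInto: the function names, the length-prefixed input format (uint64 length followed by bytes), and the 64 KiB sanity cap are assumptions, and `unsafe.String` stands in for the project's core.AsString helper. Only the 40-byte per-name budget and the overflow fallback come from the commit message.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"unsafe"
)

const nameBudget = 40 // arena bytes reserved per name

var errNameTooLong = errors.New("name too long")

// readNames reads count length-prefixed names into one arena and returns
// zero-copy string views. The arena is sized once and never grown.
func readNames(r io.Reader, count int) ([]string, error) {
	arena := make([]byte, 0, nameBudget*count)
	names := make([]string, 0, count)
	var lenBuf [8]byte
	for i := 0; i < count; i++ {
		if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
			return nil, err
		}
		n64 := binary.LittleEndian.Uint64(lenBuf[:])
		if n64 > 1<<16 {
			return nil, errNameTooLong
		}
		n := int(n64)
		if n == 0 {
			names = append(names, "")
			continue
		}
		var buf []byte
		if len(arena)+n <= cap(arena) {
			buf = arena[len(arena) : len(arena)+n]
			arena = arena[:len(arena)+n]
		} else {
			buf = make([]byte, n) // overflow: fall back to a private buffer
		}
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		// Zero-copy view; valid only because buf is never written again.
		names = append(names, unsafe.String(unsafe.SliceData(buf), n))
	}
	return names, nil
}

// encodeNames writes each name as a uint64 length followed by its bytes.
func encodeNames(names ...string) *bytes.Buffer {
	var out bytes.Buffer
	for _, name := range names {
		var lenBuf [8]byte
		binary.LittleEndian.PutUint64(lenBuf[:], uint64(len(name)))
		out.Write(lenBuf[:])
		out.WriteString(name)
	}
	return &out
}

func main() {
	in := encodeNames("blk.0.attn_q.weight", "blk.1.ffn_up.weight")
	names, err := readNames(in, 2)
	fmt.Println(names, err)
}
```

The returned strings keep the whole arena alive for as long as any one of them is reachable, which is the lifetime trade-off the commit message describes.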

W11-W: restoreFixedCacheSnapshot / restoreQuantizedCacheSnapshot / restorePagedCacheSnapshot each constructed a `[]*Array{...}` literal that the caller immediately `append(.., arrays...)`'d into evalArrays. One slice alloc per cache restored, on a hot warm-restore path.

Added appendRestoreXxxCacheSnapshot siblings that take a caller-owned dst slice and return the appended-into result. The old funcs become one-line wrappers (test surface unchanged), the load-bearing callers (prompt_cache.restorePromptCachesWithRequestFixedSize + session.go.restoreSessionCaches) use the append form directly.

BenchmarkPromptCache_RestoreFixedCaches_26_Gemma4 (new):

    old path:   4170 B / 80 allocs
    new path:   3756 B / 54 allocs
    delta:      -413 B / -26 allocs (-1 alloc per cache, exact)

Pattern parallels W10-O appendCacheState: both kill per-cache literal allocations on Gemma 4 fan-outs.

Co-Authored-By: Virgil <virgil@lethean.io>

W11-W: newPromptCacheEntryWithHidden + newPromptCacheEntryFromKVBlocks allocated a `snapshotOffsets []int` slice on every entry, but the slice was only ever read on the failure path of evalPromptCacheArrays (inside the labelAt closure).

Removed the eager build. labelAt now recomputes `(cache_index, state_index)` by walking entry.caches and summing arrayCount() until the requested array index is crossed. Same Sprintf output, same error shape; the recompute only runs on the (cold) failure branch.

Saves 1 alloc per snapshot/restore (storePromptCache after prefill + block-source restore). On the Gemma 4 26-cache flow with a snapshot+restore per turn, ~2 allocs/turn dropped.

Co-Authored-By: Virgil <virgil@lethean.io>
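The lazy recompute can be illustrated as follows. This is a sketch under assumptions: the `cache` type and the three-value return are hypothetical, and only the walk-and-sum logic reflects what the commit message describes.

```go
package main

import "fmt"

// cache is a hypothetical stand-in; only arrayCount matters here.
type cache struct{ arrays int }

func (c cache) arrayCount() int { return c.arrays }

// labelAt maps a flat array index back to (cache index, state index) by
// walking the caches and summing arrayCount() until the index is crossed.
// It needs no precomputed offsets slice, so nothing is allocated up front.
func labelAt(caches []cache, index int) (cacheIndex, stateIndex int, ok bool) {
	if index < 0 {
		return 0, 0, false
	}
	offset := 0
	for i, c := range caches {
		n := c.arrayCount()
		if index < offset+n {
			return i, index - offset, true
		}
		offset += n
	}
	return 0, 0, false // index past the last array
}

func main() {
	caches := []cache{{arrays: 2}, {arrays: 3}, {arrays: 2}}
	ci, si, ok := labelAt(caches, 4)
	fmt.Println(ci, si, ok)
}
```

The walk is linear in the number of caches, which is acceptable because it only runs when an error message has to be built.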

…Prefix 3→0 allocs, 26-cache Gemma 4 restore -156 allocs, RoundTrip -17% ns + -33B; Zeros4 convergent with W11-T) # Conflicts: # go/internal/metal/array.go

…ingle cgo crossing

DispatchOne folds the entire (config_new + set_grid + set_thread_group + add_output_arg + apply + size + get + config_free) sequence into a single cgo crossing via mlx_fast_metal_kernel_dispatch_one_inline. Every production single-output MetalKernel caller in this package follows the fresh-cfg-per-call pattern: 6 cgo crossings (config_new, set_grid, set_thread_group, add_output_arg, apply_one, config_free) collapse to 1.

The MetalKernelConfig Go wrapper escapes to heap on every NewMetalKernelConfig call; escape analysis shows `&MetalKernelConfig{...} escapes to heap in NewMetalKernelConfig`. DispatchOne removes the wrapper entirely from the per-call path: the C config handle is born and freed inside the inline-C wrapper, leaving zero Go-side allocs on the dispatch frame.

MetalKernelGrid wraps the grid + thread-group dimension sextuple, which keeps the call signature readable and prevents accidental swap between the two triples.

Per-call savings stack with W11-V-A (inputVec inline-C) and W11-V-B (ApplyOne size+get inline-C):

  pre-W11-V           ApplyOne path       DispatchOne path
  11 cgo crossings    6 cgo crossings     2 cgo crossings
  (expert_id_matvec MoE tiny: 5 inputs, 1 output, fresh cfg per call)

DispatchOne keeps the same pooled output-vec holder + pooled input-handle scratch primitives from ApplyOne; no new allocations introduced.

Parity verified bit-exact via TestMetalKernel_DispatchOne_Parity_Good against the cfg-driven ApplyOne path on the same kernel + inputs. Callers migrate in a follow-up commit.

Co-Authored-By: Virgil <virgil@lethean.io>

…op cfg ceremony

The 11 single-output MetalKernel callers migrated from ApplyOne in the prior commit now move to DispatchOne, eliminating the per-call MetalKernelConfig wrapper.

Migrated sites:
  expert_id_matvec.go × 4: quantized expert ID matvec / GELU gate-up / split gate-up / weighted matvec sum (Gemma 4 26B MoE forward path)
  dense_matvec.go × 2: quantized dense matvec / GELU split gate-up
  jang_dequant.go × 2: JANG dequant / packed linear fused (MiniMax M2)
  codebook_vq.go × 1: codebook VQ matvec
  gemma4_ffn_residual.go × 1: native Gemma 4 FFN residual fuse
  gemma4_router_topk.go × 1: native Gemma 4 router matvec scores path

The 2-output gemma4_router_topk top-k path stays on Apply (cfg + multi-output not yet covered by DispatchOne).

Per-call drop: 6 cgo crossings → 2 (and 1 fewer Go heap alloc since the MetalKernelConfig wrapper no longer escapes). Bit-exact correctness preserved (same MLX kernel under the hood; DispatchOne is a structural collapse, not a semantic change).

Co-Authored-By: Virgil <virgil@lethean.io>

…itives — apply_inline + ApplyOne + DispatchOne; MoE_RouterProjection_H2048_E32 -21.6%, geomean -37% B/op -29.3% allocs; 23 caller migrations)

Completes the W10-O Slice4/Transpose4 scalar-pass family at the rank-1/2 frontier W11-U flagged as the next residual.

packQ4Cached / unpackQ4 / maxAll currently call `Reshape(arr, int32(n))` and `Reshape(arr, int32(pairs), int32(2))`, where the variadic []int32 escapes to heap on every Q4 K/V Update + every dequant + every quantise-max boundary.

Reshape1(arr, n int32) and Reshape2(arr, h, w int32) route through new mlx_reshape_inline_1 / mlx_reshape_inline_2 C wrappers that materialise the 1/2-element shape buffer on the C stack directly from register-passed scalars, eliminating the slice escape entirely. Same W10-J / W11-A pattern, lower rank.

- Reshape1: 1 → 0 allocs/op, -4 B/op (BenchmarkReshape1_Scalar)
- Reshape2: 1 → 0 allocs/op, -8 B/op (BenchmarkReshape2_Scalar)

TestOps_Reshape1_Parity and TestOps_Reshape2_Parity lock bit-exact equivalence with the variadic Reshape path across small/single/large shapes (mirrors W11-F TestOps_ScalarBridge_Parity discipline). Caller migration follows in a subsequent commit.

Co-Authored-By: Virgil <virgil@lethean.io>

…imitives (W11-AC)

Completes the rank-1/2 scalar-pass slice family alongside the existing Slice4 / SliceUpdateInplace4 wrappers. packQ4Cached pays the largest hidden tax: `SliceAxis(paired, 1, 0, 1)` + `SliceAxis(paired, 1, 1, 2)` each go through SliceAxis, which allocates `make([]int32, ndim)` twice per call: ~4 slice heap allocs per Q4 K/V Update on the V side alone.

- Slice1(arr, s0, e0) routes via mlx_slice_inline_1
- Slice2(arr, s0,s1, e0,e1) routes via mlx_slice_inline_2
- SliceUpdateInplace2(arr, upd, s0,s1, e0,e1) routes via mlx_slice_update_inline_2

All three materialise the C stack buffers directly from register-passed scalars, eliminating the slice-literal escape entirely. Strides remain implicitly 1 (matches the wider Slice* convention).

Bench evidence (BenchmarkSlice2_*):
- SliceAxis-32  499.2 ns/op  16 B/op  2 allocs/op  (legacy)
- Variadic-32   391.6 ns/op   0 B/op  0 allocs/op  (pre-built slices)
- Scalar-32     376.3 ns/op   0 B/op  0 allocs/op  (new)

The Scalar / SliceAxis delta (-2 allocs and ~25% faster) is what packQ4Cached gains per call site, twice per Q4 store.

TestSlice_Slice1_Parity / TestSlice_Slice2_Parity / TestSlice_SliceUpdateInplace2_Parity lock bit-exact equivalence with the variadic Slice / SliceUpdateInplace paths across prefix / suffix / middle / column / row / submatrix shapes. Caller migration follows in subsequent commits.

Co-Authored-By: Virgil <virgil@lethean.io>

W11-AD primitive: stream-passing siblings of Slice4 / SliceUpdateInplace4 so per-token loops can hoist the DefaultStream() lookup outside the loop. Mirrors the W10/W11 fixedKVCacheSlice4D pattern: KVCache.Update issues four Slice4-family calls per token, each of which currently resolves the default stream independently (RWMutex.RLock + atomic cached-device load + GPU/CPU branch).

The existing Slice4 / SliceUpdateInplace4 keep working unchanged: they now delegate to the *WithStream sibling with DefaultStream() resolved once. Parity tests verify bit-exact output across the KV-cache rank-4 slice geometry.

Co-Authored-By: Virgil <virgil@lethean.io>
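The hoisting pattern can be sketched with stand-in types. `Stream`, `sliceRows`, `sliceRowsWithStream`, and the `lookups` counter are all hypothetical and exist only to make the number of lookups observable; the real primitives operate on MLX arrays through cgo.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Stream is a hypothetical stand-in for the package's stream handle.
type Stream struct{ name string }

var (
	streamMu      sync.RWMutex
	defaultStream = &Stream{name: "gpu"}
	lookups       atomic.Int64 // counts DefaultStream calls, for illustration
)

// DefaultStream models a lookup that takes a read lock on every call.
func DefaultStream() *Stream {
	streamMu.RLock()
	defer streamMu.RUnlock()
	lookups.Add(1)
	return defaultStream
}

// sliceRowsWithStream is the stream-passing sibling: the caller supplies the
// stream, so a loop can resolve it once.
func sliceRowsWithStream(data []float32, start, end int, s *Stream) []float32 {
	_ = s // a real binding would enqueue the op on this stream
	return data[start:end]
}

// sliceRows keeps the old signature and delegates to the sibling.
func sliceRows(data []float32, start, end int) []float32 {
	return sliceRowsWithStream(data, start, end, DefaultStream())
}

func main() {
	data := make([]float32, 8)
	s := DefaultStream() // hoisted out of the loop
	for i := 0; i < 4; i++ {
		_ = sliceRowsWithStream(data, i, i+2, s)
	}
	fmt.Println("lookups with hoisting:", lookups.Load())
}
```

Keeping the old function as a thin delegate means existing callers compile unchanged while hot loops opt in to the hoisted form.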

…cheUpdate (W11-Y)

Add direct bench coverage for two fast.go decode-time hot paths that had no prior benchmark surface:

- nativePagedSingleTokenAttention at 2/4/8/16 pages on Page256, matching the existing SDPAPaged page-count sweep.
- singleTokenCacheUpdate at Heads8/Cap512 (decode) and Heads32/Cap4096 (larger-LM decode).

Both surfaces are touched once per layer per decode step, so cgo-cost deltas (page-handle scratch, shape-buf scratch) need a direct bench to land as a visible signal. The existing SDPAPaged benches cover the Go-side ScaledDotProductAttentionPaged path, which doesn't fall through to the native wrapper.

Co-Authored-By: Virgil <virgil@lethean.io>

…11-Y)

Replace the two per-call C.calloc / C.free trips that nativePagedSingleTokenAttention used to hand mlx_array runs to the native paged-attention wrapper with a sync.Pool of *[]C.mlx_array slices, mirroring the W11-T scorePages pattern.

The native wrapper consumes the page-handle buffer synchronously, so the slice goes back to the pool the moment the cgo call returns; the buffer can therefore be Go-heap-resident (no growth survives a single call). 16-capacity New() matches typical PagedKVCache page counts during decode; larger sweeps grow the backing array and the pool reuses the grown slot on subsequent calls.

Measured at -benchtime=200ms -count=3 on Apple M3 Ultra:
  BenchmarkNativePagedSingleToken_2Pages_Page256    ~702 -> ~331 us   -53 percent
  BenchmarkNativePagedSingleToken_4Pages_Page256    ~720 -> ~410 us   -43 percent
  BenchmarkNativePagedSingleToken_8Pages_Page256   ~1020 -> ~578 us   -43 percent
  BenchmarkNativePagedSingleToken_16Pages_Page256  ~1530 -> ~830 us   -46 percent

Byte/alloc counts unchanged at 0 allocs/op on both sides (C.calloc allocations do not show up in Go's benchmem); the win is pure wall-clock cgo overhead removed from the decode hot path. Test gate clean under -race -short across the metal package.

Co-Authored-By: Virgil <virgil@lethean.io>
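A pooled scratch slice of this kind might look like the sketch below. It uses a hypothetical `handle` type in place of `C.mlx_array` and a Go callback in place of the cgo call, so it shows only the pool discipline, not the native wrapper.

```go
package main

import (
	"fmt"
	"sync"
)

// handle is a hypothetical stand-in for a C array handle such as C.mlx_array.
type handle uintptr

// scratchPool stores *[]handle so that Put does not allocate a slice header.
var scratchPool = sync.Pool{
	New: func() any {
		s := make([]handle, 0, 16) // initial capacity, as in the commit message
		return &s
	},
}

// withPageHandles copies pages into pooled scratch, runs call synchronously,
// and returns the scratch to the pool as soon as call returns.
func withPageHandles(pages []handle, call func([]handle) int) int {
	sp := scratchPool.Get().(*[]handle)
	buf := append((*sp)[:0], pages...) // may grow past the initial capacity
	result := call(buf)
	*sp = buf[:0] // keep any grown backing array for the next caller
	scratchPool.Put(sp)
	return result
}

func main() {
	pages := []handle{10, 20, 30, 40}
	n := withPageHandles(pages, func(h []handle) int { return len(h) })
	fmt.Println("pages seen by callee:", n)
}
```

This only works because the callee finishes with the buffer before `withPageHandles` returns; a callee that retained the slice would observe it being overwritten by the next caller.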

Migrates the Q4 storage hot path to the W11-AC rank-1/2 scalar-pass primitives. The variadic-Reshape escapes and the SliceAxis `make([]int32, ndim)` materialisations W11-U flagged are now gone:

- Reshape(q, int32(n)) → Reshape1(q, int32(n))
- Reshape(padded, int32(pairs), int32(2)) → Reshape2(padded, int32(pairs), 2)
- Reshape(packed2D, int32(pairs)) → Reshape1(packed2D, int32(pairs))
- SliceAxis(paired, 1, 0, 1) → Slice2(paired, 0, 0, int32(pairs), 1)
- SliceAxis(paired, 1, 1, 2) → Slice2(paired, 0, 1, int32(pairs), 2)

Cache bench impact (BenchmarkQuantizedKVCache_Append_SingleToken_Q8Q4, benchtime=200ms, count=3):
  baseline      13404 B/op  1412 allocs/op
  after W11-AC   7274 B/op   516 allocs/op  (-46% B, -63% allocs)

The Q8Q8 path is unchanged (no Q4 storage, no packQ4Cached call); the ~900 alloc/call reduction comes entirely from the per-call escape elimination on the V-side packQ4 sequence. Numerical equivalence is locked by the rank-1/2 parity tests added alongside the primitives.

Co-Authored-By: Virgil <virgil@lethean.io>

…cast (go vet clean)

The mlx `void* payload` slot carries a synthetic uintptr identifier (a sync.Map key, not a Go pointer). The Go-side `unsafe.Pointer(id)` cast tripped `go vet`'s unsafeptr check at pinned_array.go:184. The warning was flagging a real Go-spec rule (uintptr→unsafe.Pointer is unsafe) for code where the integer was never a pointer to begin with.

Adopted the runtime/cgo.Handle pattern: the C-visible Go signature is now `uintptr_t payload` end-to-end (call site + dtor callback). The `void* ↔ uintptr_t` widening happens inside the C++ bridge, where it satisfies mlx's signature without putting an unsafe.Pointer cast in Go-visible code. vet's unsafeptr check can now see this is not a Go pointer crossing the boundary.

- pinned_array.go: extern signature `uintptr_t payload`, callsite `C.uintptr_t(id)`, callback `goPinnedRawArrayRelease(C.uintptr_t)`
- pinned_array_bridge.cpp: param `uintptr_t payload_id`, internal `reinterpret_cast<void*>` before handing to mlx + dtor

Verified:
- go vet ./go/internal/metal/... clean (was: pinned_array.go:184: possible misuse of unsafe.Pointer)
- go test ./go/internal/metal/... -race -short passes
- bit-exact: same numeric value flows id → uintptr_t → void* → ... → void* → uintptr_t → uintptr across the round trip

Co-Authored-By: Virgil <virgil@lethean.io>
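The Go half of this pattern, an integer id that keys a sync.Map, can be sketched without cgo. `register` and `release` are hypothetical names; in the real code the id would cross into C as a `uintptr_t` and come back through the destructor callback.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	registry sync.Map       // uintptr id -> registered Go value
	nextID   atomic.Uintptr // monotonically increasing; 0 is never issued
)

// register stores v and returns an opaque integer id. Because the id is a
// plain number and never a Go pointer, it can travel through a C uintptr_t
// parameter without any unsafe.Pointer conversion on the Go side.
func register(v any) uintptr {
	id := nextID.Add(1)
	registry.Store(id, v)
	return id
}

// release looks up and removes the value for id, as a destructor callback
// would when the C side is finished with the payload.
func release(id uintptr) (any, bool) {
	return registry.LoadAndDelete(id)
}

func main() {
	id := register([]byte("pinned bytes"))
	v, ok := release(id)
	fmt.Println(id != 0, ok, len(v.([]byte)))
}
```

The registry keeps the Go value reachable for the garbage collector while C holds only the number, which is the property the `runtime/cgo.Handle` type in the standard library also provides.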

…ous float32

W11-AE adds a fast-path helper materialiseFloat32ViewFast(arr) ([]float32, func(), error) that bypasses the legacy materialiseFloat32View ceremony when arr.Dtype() == DTypeFloat32 && arr.IsRowContiguous(). On the fast-path:

* Zero AsType cgo crossing (dtype already matches).
* Zero Contiguous cgo crossing (layout already row-major).
* Zero Materialize cgo crossing (caller already evaluated the tensor; the dtype + contiguity proof IS the post-Eval invariant for a valid float32 backing store).

The helper falls through to materialiseFloat32View when either gate fails, preserving the full conversion-and-contiguous ceremony. Lifecycle contract is wrapped in a cleanup closure (runtime.KeepAlive on fast-path; KeepAlive + Free(converted) on slow-path) so callers just defer cleanup() once.

Measured on M3 Ultra vs *Array.Floats() at 5 size points (128B / 1KB / 10KB / 100KB / 1MB):

  Size    Floats()           FastView       Delta
  128B    320 ns / 129B      170 ns / 17B   -47% ns, -87% B
  1KB     452 ns / 1025B     207 ns / 17B   -54% ns, -98% B
  10KB    1768 ns / 10KB     216 ns / 17B   -88% ns, -99% B
  100KB   21000 ns / 100KB   217 ns / 17B   -99% ns, -99.98% B
  1MB     226000 ns / 1MB    205 ns / 17B   -99.9% ns, -99.998% B

The fast-path wins at every size, even 128B, where the W11-X note suggested the cgo overhead would exceed savings. The W11-X comparison was against the slow-path helper (270 ns) that still pays the Materialize crossing; dropping that crossing inverts the verdict.

The 17 B / 2 allocs floor is the cleanup closure escape (capturing arr forces the closure to heap). API shape mandated by the brief; the latency win dwarfs the alloc cost.

Tests added: FastPath (bit-exact), SlowPathDtype (float16 round-trip), LegacyParity (fast vs slow on identical input), NonContiguous (slow-path fall-through for sliced views).

Bench helpers added: BenchmarkMaterialiseFloat32View_Floats_NB / _Slow_NB / FastView_NB across 5 size points so the threshold can be characterised without re-measuring.

Co-Authored-By: Virgil <virgil@lethean.io>
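The two-gate fast path with a cleanup closure can be illustrated with a stand-in tensor type. Everything here is hypothetical (`tensor`, `float32ViewFast`, `float32ViewSlow`); the sketch models the slow path as a plain copy and does not reproduce the MLX conversion or the Eval precondition.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

type dtype int

const (
	dtypeFloat32 dtype = iota
	dtypeFloat16
)

// tensor is a hypothetical stand-in for an already-evaluated array.
type tensor struct {
	dtype         dtype
	rowContiguous bool
	data          []float32 // stands in for the backing store
}

// float32ViewSlow models the conversion path by returning a private copy.
// Real code would also free the converted array inside cleanup.
func float32ViewSlow(t *tensor) ([]float32, func(), error) {
	converted := make([]float32, len(t.data))
	copy(converted, t.data)
	return converted, func() { runtime.KeepAlive(t) }, nil
}

// float32ViewFast borrows the backing store when both gates hold and falls
// through to the slow path otherwise. Callers defer cleanup() either way.
func float32ViewFast(t *tensor) ([]float32, func(), error) {
	if t == nil {
		return nil, func() {}, errors.New("nil tensor")
	}
	if t.dtype == dtypeFloat32 && t.rowContiguous {
		return t.data, func() { runtime.KeepAlive(t) }, nil
	}
	return float32ViewSlow(t)
}

func main() {
	t := &tensor{dtype: dtypeFloat32, rowContiguous: true, data: []float32{1, 2, 3}}
	view, cleanup, err := float32ViewFast(t)
	defer cleanup()
	fmt.Println(view, err)
}
```

Returning a no-op cleanup on the error path keeps `defer cleanup()` safe even when the caller defers before checking the error.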

KVCache.Update + RotatingKVCache.updateInPlace / updateConcat now resolve DefaultStream() once per Update and pass it through to the Slice4WithStream / SliceUpdateInplace4WithStream siblings. Each Update issues 4-6 Slice4-family calls; this collapses 4-6 RWMutex RLock+RUnlock + cached-device atomic loads to one.

Bench (Apple M3 Ultra, -benchtime=200ms ×3, -benchtime=2s ×5):

KVCache_Append_SingleToken_To512: 1028 allocs unchanged. The W11-U forward note's hypothesis that DefaultStream hoisting would reduce allocs was a misdiagnosis: DefaultStream returns a cached *Stream singleton without alloc; the 1028 allocs come from newArray's runtime.SetFinalizer path.

The ns/op delta is below the Metal thermal/cache noise floor at this bench: benchstat reports all deltas with p~={0.1-1.0} (no statistical significance).

The architectural win is real and verifiable in code: 1 DefaultStream lookup per Update instead of 4-6. It sets up future measurable wins once the Metal-side variance is reduced (kernel JIT warmup, thermal stabilisation), and reduces lock-acquisition pressure in concurrent decode scenarios that current single-goroutine benches don't exercise.

Allocs neutral, ns/op neutral within noise.

Co-Authored-By: Virgil <virgil@lethean.io>

Migrates the Q4 dequantise rank-1 boundaries to the W11-AC scalar-pass primitives:

- Reshape(stacked, int32(flatLen)) → Reshape1
- Slice(flat, []int32{0}, []int32{int32(n)}) → Slice1

The first call is on every Q4 dequant; the second is the (rare) odd-length tail-trim branch (flatLen > n) reached when the source element count is odd.

The final `Reshape(signed, shape...)` retains the variadic form because the shape comes from the caller as a slice of arbitrary rank; no fixed-rank scalar-pass equivalent applies.

The dequant path is invoked by dequantizedState() on every Update that misses the float cache, and by ReadState() on every snapshot read, so the saved variadic-slice escape compounds across the read-side hot path. Numerical equivalence preserved (see W11-AC parity tests).

Co-Authored-By: Virgil <virgil@lethean.io>

ScaledDotProductAttention and ScaledDotProductAttentionWithMask each paid a C.CString allocation plus the matching deferred C.free on every invocation, even though only three mask_mode values are ever passed: "" (default), "causal", and "array". Cache the corresponding C strings at package init and reuse them across calls.

Safety: the mlx-c wrapper at lib/mlx-c/mlx/c/fast.cpp wraps the incoming mask_mode pointer in std::string(mask_mode) before passing it to the C++ scaled_dot_product_attention op, so the underlying C buffer is copied synchronously and the cached pointers can be shared across goroutines without locking. Race-short gate is clean.

The per-call delta is below benchmem resolution (the C.malloc + C.free pair does not show in Go's allocs/op metric, and the cgo overhead saved is ~200 ns against a 260 us SDPA call, which is within noise on a single bench), but the change is structurally cleaner and avoids 2× cgo crossings per SDPA call. Under decode (32+ layers × N tokens) the saved crossings compound into the wall-clock budget.

Co-Authored-By: Virgil <virgil@lethean.io>

…ce4WithStream substrate consistency; A5 diagnosis: 1028 alloc floor is newArray SetFinalizer + arrayPool type-assertion, not DefaultStream)

…eLogitsCompact

W11-AE migrates the W11-X-rejected site at probe.go:362. topValues.Floats() copies a topK-length buffer (~32 B for topK=8) into a fresh Go slice via 2× Materialize cgo crossings + per-element loop. The fast-path returns a borrowed MLX-memory view in ~170 ns / 17 B (closure escape floor).

W11-X rejected this site against the slow-path helper (270 ns floor) because the cgo overhead exceeded the 32-byte saving. The new fast-path drops the unconditional final Materialize crossing (the dtype + contiguity check IS the post-Eval proof of a valid backing store) and now wins at every size including 128 B (vs Floats(): -47% latency, -87% bytes).

Measured BenchmarkSummarizeProbeLogitsCompact_Gemma on M3 Ultra:
  Before  ~833 µs / 715 B / 20 allocs
  After   ~594 µs / 697 B / 20 allocs  (-29% ns, -18 B, allocs same)

Alloc count unchanged because the Floats() copy was 2 allocs (slice + cgo intermediate) and the fast-path cleanup closure escapes to heap (2 allocs). The bytes win comes from the topK=8 slice; the latency win comes from the dropped Materialize crossings.

TakeAlongAxis preserves dtype (float32) and the prior Eval guarantees a valid backing store, so the fast-path conditions hold structurally; no runtime risk.

Co-Authored-By: Virgil <virgil@lethean.io>

Replaces `Reshape(a, int32(n))` with `Reshape1(a, int32(n))`: the rank-1 scalar-pass skips the variadic []int32 heap escape on the quantise-max-abs boundary. maxAll is called by quantizeCacheArrayCached on every K and every V Update (two calls per cache Update), so this is the dominant alloc reduction on the Q8 cache append path.

Combined with the W11-AC packQ4Cached + unpackQ4 migrations and prior W10-A substrate work, the cache bench impact (BenchmarkQuantizedKVCache_Append_SingleToken, benchtime=200ms, count=3):

  baseline (a5c82d0)
    Q8Q8   7288 B/op   516 allocs/op
    Q8Q4  13404 B/op  1412 allocs/op
  after W11-AC migrations
    Q8Q8   6244 B/op   260 allocs/op  (-14% B, -50% allocs)
    Q8Q4   6251 B/op   260 allocs/op  (-53% B, -82% allocs)

The Q8Q4 path converges on the Q8Q8 alloc floor (260 allocs/op identical): the rank-1/2 scalar-pass primitives close the gap the prior pure-Q8 path already enjoyed.

4096-prefill (steady-state) bench:
  Q8Q8  122 B/op  8 allocs/op  (baseline)
  Q8Q8  114 B/op  6 allocs/op  (after)  (-7% B, -25% allocs)

Co-Authored-By: Virgil <virgil@lethean.io>

…B/op (-53%)

W10-O wired the cgo-scratch pools but left the per-call `&pinnedRawArrayBuffer{}` heap alloc on the hot path. This lane pools the buffer struct end-to-end: register Gets from the pool + sets `raw`, unregister Releases the view + clears `raw` and Puts back.

Lifetime safety: the buffer travels through mlx as the dtor payload and only returns to the pool after the mlx-side release callback fires, at which point PinnedView.Release has zeroed the pinner state. Clearing `buffer.raw` on Put is critical so the recycled struct does not hold a stale reference to the previous call's slice (the underlying bytes need to be GC-eligible the moment mlx hands the array back).

Bench delta @ benchtime 300ms, M3 Ultra:
  L1                      120->56 B/op  3->2 allocs
  L32                     120->56 B/op  3->2 allocs
  L512                    120->56 B/op  3->2 allocs
  L4096                   120->56 B/op  3->2 allocs
  L16384                  120->56 B/op  3->2 allocs
  Gemma4Global_L4096      120->56 B/op  3->2 allocs
  Gemma4LocalWindow_L512  120->56 B/op  3->2 allocs
  Strided_Subview_L4096   120->56 B/op  3->2 allocs

Remaining 56 B + 1 alloc is the `sync.Map.Store` entry node, which needs a different data structure to eliminate (out of scope for a residual lane).

Verified:
- go test ./go/internal/metal/... -race -short passes
- go vet stays clean
- bit-exact: same numeric id flows + same data pointer returned; Pinner reset path is the documented contract for runtime.Pinner

Co-Authored-By: Virgil <virgil@lethean.io>
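The clear-before-Put rule described above is a general sync.Pool discipline and can be shown in isolation. `pinnedBuffer`, `acquire`, and `release` are hypothetical names, and the sketch omits the pinner and the mlx destructor callback entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// pinnedBuffer is a hypothetical stand-in for the pooled struct; raw models
// the reference to the caller's slice.
type pinnedBuffer struct {
	raw []byte
}

var bufferPool = sync.Pool{
	New: func() any { return new(pinnedBuffer) },
}

// acquire takes a struct from the pool and attaches the caller's slice.
func acquire(raw []byte) *pinnedBuffer {
	b := bufferPool.Get().(*pinnedBuffer)
	b.raw = raw
	return b
}

// release drops the slice reference before returning the struct to the pool,
// so an idle pooled struct never keeps the previous caller's bytes alive.
func release(b *pinnedBuffer) {
	b.raw = nil
	bufferPool.Put(b)
}

func main() {
	b := acquire(make([]byte, 4096))
	fmt.Println("attached bytes:", len(b.raw))
	release(b)
}
```

Without the `b.raw = nil` line the pool would still function, but a large slice could stay reachable for as long as the recycled struct sits idle in the pool.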

Add Cap512 and Cap4096 benches for singleTokenCausalMask so the next visitor can see the surface without re-deriving it.

W11-Y exercised these benches while investigating whether to cache the 0 / -1e9 scalars at package scope; the cached variant regressed by ~57 percent at both capacities because MLX's Where op pays refcount-management overhead when the same scalar arrays are aliased across many invocations. A5-honest revert on the cache, benches kept.

cached vs baseline (-benchtime=300ms -count=5 on M3 Ultra):
  Cap512   ~238 us baseline / ~373 us cached  (+57 percent)
  Cap4096  ~239 us baseline / ~375 us cached  (+57 percent)

Co-Authored-By: Virgil <virgil@lethean.io>

…liceUpdateInplace2 scalar-pass primitives — QuantizedKVCache_Q8Q4 -82% allocs / -53% B, Q8Q8 -50% allocs; scalar-pass family complete at ranks {1,2,4}) # Conflicts: # go/internal/metal/slice_test.go

…time/cgo.Handle pattern + pinnedRawArrayBuffer sync.Pool — all KV shapes 120 B/3 allocs → 56 B/2 allocs -53%/-33%)

…Pool — 2Pages -53%, 4-16Pages -43-46%; SDPA mode-string cache; A5 reverts on ShapeInto + scalar-cache showing cgo-stack-array + MLX-Where pitfalls)

…al arr

A5-honest discovery from migrating hostUnsuppressedGreedyToken: the legacy materialiseFloat32View helper called Materialize on src unconditionally at the end, which silently covered callers that passed lazy (un-Eval'd) tensors. The fast-path deliberately skips that Materialize crossing, so accessing the raw float32 backing store of an un-Eval'd array segfaults. TestSample_HostUnsuppressedGreedyTokenMaterializesLazyFloat32_Good caught this regression immediately.

Reverting sample.go to the legacy helper + documenting the contract explicitly on materialiseFloat32ViewFast.

Callers safe for the fast-path:
* probe.go summarizeProbeLogitsCompact: explicit Eval(topIndices, topValues, ...) before
* generate.go inspectAttentionCache: explicit Eval(kSliced) before
* kv_snapshot.go inspectKVCacheRangeWithOptions: explicit Eval(kSliced, vSliced) before

Callers unsafe for the fast-path (must stay on legacy):
* sample.go hostUnsuppressedGreedyToken: receives logits from sampler chain, may be lazy

The threshold note also updates: 10-100KB benches show -88% to -99% latency vs Floats() (slow-path delta dominated by the skipped Materialize crossing on large tensors, not just per-call overhead).

Co-Authored-By: Virgil <virgil@lethean.io>

…entionCache

W11-AE migrates the W11-X-installed materialiseFloat32View call to the fast-path variant. kSliced is explicit-Eval'd at line 1139 immediately before, so the fast-path contract holds. Cache K tensors are normally DTypeFloat32 + row-contiguous (Slice preserves row-major when slicing axis 0), so the fast-path fires; quantised caches fall through to the legacy materialiseFloat32View ceremony unchanged.

Measured BenchmarkInspectAttentionCache_Realistic (32 heads x 1024 tokens x 128 head_dim = 16 MB):
  Before  ~3.2 ms / 16.78 MB / 43 allocs
  After   ~735 µs / 16.78 MB / 41 allocs  (-77% ns, -2 allocs)

The latency win is much larger than the per-call cgo crossing cost: dropping the final Materialize on a 16 MB freshly-Eval'd tensor saves the MLX-side queue-drain check, not just the cgo call. Allocs -2 because the closure cleanup replaces 2 separate function calls (runtime.KeepAlive + Free(converted)) that previously each escaped scratch.

Also drops the now-unused "runtime" and "unsafe" imports from generate.go.

Co-Authored-By: Virgil <virgil@lethean.io>

…KVCacheRangeWithOptions

W11-AE migrates the W11-X-installed dual materialiseFloat32View calls (K + V) to the fast-path variant. kSliced + vSliced are explicit-Eval'd at line 437 immediately before, so the fast-path contract holds. Cache K/V tensors are normally DTypeFloat32 + row-contiguous (Slice preserves row-major when slicing axis 0); quantised caches fall through to the legacy materialiseFloat32View ceremony unchanged.

Measured BenchmarkInspectKVCacheRange_Realistic (32 heads x 1024 tokens x 128 head_dim x K+V = 32 MB borrowed, ~100 MB total snapshot):
  Before  ~10.1 ms / 100.67 MB / 154 allocs
  After   ~2.68 ms / 100.67 MB / 152 allocs  (-73% ns, -2 allocs)

Same multiplicative win as inspectAttentionCache: dropping Materialize on Eval'd large tensors saves the MLX queue-drain check, not just per-call cgo overhead. Allocs -2 because the cleanup closure replaces 2 separate function calls (runtime.KeepAlive + Free(converted)) per K/V pair, but the closure escape adds back 1 alloc per pair, netting -2 across both.

Also drops the now-unused "runtime" and "unsafe" imports from kv_snapshot.go.

Co-Authored-By: Virgil <virgil@lethean.io>

…terialize for contiguous-float32 — InspectAttentionCache -83%, InspectKVCacheRange -74%; threshold inverts vs W11-X — fast-path wins at ALL sizes when caller pre-Evals)

sonarqubecloud · 2026-05-23T04:31:17Z

Quality Gate failed

Failed conditions
2 Security Hotspots
6.8% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

coderabbitai Bot requested changes May 20, 2026

Comment thread go/backend.go

github-advanced-security AI found potential problems May 20, 2026

coderabbitai Bot approved these changes May 22, 2026

Snider and others added 23 commits May 22, 2026 23:02

merge: cladius-lane-Wave10-W10L (kv residual — Snapshot writer 1217→7…

472654f

…05 ns -42%, stream-writer direct-unsafe-slice eliminates staging memcpy)

chore(external): bump go-inference → 22e9e0d (W10-M + W10-N — 4-46× s…

b883787

…tate/scheduler/openai/ollama/parser + 3.5-4.6× jang dequant model-LOAD)

merge: cladius-lane-Wave10-W10K (gemma4 PerLayerProjectionScale cache…

110df5d

… + Squeeze axes hoist — -96.2% allocs/forward on 26-layer Gemma 4)

merge: cladius-lane-Wave10-W10R (safetensors decode unsafe-cast + han…

d1515da

…d-rolled JSON write — F32 -50% / F64 -44% / BF16 -31% / WriteSubset -38% allocs)

Snider and others added 29 commits May 23, 2026 03:54

merge: cladius-lane-Wave11-W11W (prompt_cache.go residual — copyCache…

6f8d4e2

…Prefix 3→0 allocs, 26-cache Gemma 4 restore -156 allocs, RoundTrip -17% ns + -33B; Zeros4 convergent with W11-T) # Conflicts: # go/internal/metal/array.go

merge: cladius-lane-Wave11-W11V (metal_kernel.go 3 new substrate prim…

5f0bb21

…itives — apply_inline + ApplyOne + DispatchOne; MoE_RouterProjection_H2048_E32 -21.6%, geomean -37% B/op -29.3% allocs; 23 caller migrations)

merge: cladius-lane-Wave11-W11AD (Slice4WithStream + SliceUpdateInpla…

de86a65

…ce4WithStream substrate consistency; A5 diagnosis: 1028 alloc floor is newArray SetFinalizer + arrayPool type-assertion, not DefaultStream)

merge: cladius-lane-Wave11-W11AC (Reshape1/Reshape2 + Slice1/Slice2/S…

c8f1642

…liceUpdateInplace2 scalar-pass primitives — QuantizedKVCache_Q8Q4 -82% allocs / -53% B, Q8Q8 -50% allocs; scalar-pass family complete at ranks {1,2,4}) # Conflicts: # go/internal/metal/slice_test.go

merge: cladius-lane-Wave11-W11AF (pinned_array.go vet cleared via run…

aef1070

…time/cgo.Handle pattern + pinnedRawArrayBuffer sync.Pool — all KV shapes 120 B/3 allocs → 56 B/2 allocs -53%/-33%)

merge: cladius-lane-Wave11-W11Y (fast.go nativePagedSingleToken sync.…

cafab9f

…Pool — 2Pages -53%, 4-16Pages -43-46%; SDPA mode-string cache; A5 reverts on ShapeInto + scalar-cache showing cgo-stack-array + MLX-Where pitfalls)

merge: cladius-lane-Wave11-W11AE (materialiseFloat32ViewFast skips Ma…

4b8fa13

…terialize for contiguous-float32 — InspectAttentionCache -83%, InspectKVCacheRange -74%; threshold inverts vs W11-X — fast-path wins at ALL sizes when caller pre-Evals)
