
Virgil Lemma foundations#8

Open
Snider wants to merge 1297 commits into main from dev


@Snider
Contributor

@Snider Snider commented May 20, 2026

@coderabbitai summary

Summary by CodeRabbit

  • New Features

    • Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
    • Block‑prefix cache service and memvid bundle index for faster prefix restores.
    • Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
  • Improvements

    • Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
    • Build/toolchain updated (C++23) and macOS deployment target raised.
  • Documentation

    • Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.


@coderabbitai
Contributor

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index + wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; extensive docs and unit tests added.

Core changes

| Layer / File(s) | Summary |
| --- | --- |
| All changes (build, adapter, backend, agent, cache, artifact, tests, docs): `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `cpp/CMakeLists.txt`, `external/*`, `go/adapter.go`, `go/adapter/*`, `go/backend.go`, `go/agent/*`, `go/blockcache/*`, `go/artifact/*`, `go/*_test.go`, `docs/*` | Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (10)
docs/inference/thinking.md (1)

74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).

📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/thinking.md` around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to ```text)
so the block is properly flagged as plain text and satisfies the markdown linter.
docs/runtime/README.md (2)

68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.

📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".

44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).

📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.
docs/moe/README.md (1)

9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.

✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).

Or alternatively:

-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.
docs/observability/probe.md (1)

31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).

📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or
```yaml) so the block is rendered/compliant (the block that begins with
"Generate / Chat:" and lists items like "prefill start → cache_pressure" should
be updated).
docs/moe/jang.md (1)

82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or leave empty but specify).

📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)

7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
docs/vmlx-feature-gap-report.md (1)

7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)

5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
go/agent/index_test.go (1)

16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
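
One way an _Ugly case could assert a panic without a third-party assertion library is a small recover-based helper. The sketch below is illustrative only: `didPanic` and `lookupBlock` are hypothetical names, and `lookupBlock` merely stands in for whichever public index function is expected to panic on malformed input.

```go
package main

import "fmt"

// didPanic runs fn and reports whether it panicked. The deferred recover
// overwrites the named result after fn has unwound.
func didPanic(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

// lookupBlock is a hypothetical function with no bounds check, used only
// so the example has something that can panic.
func lookupBlock(blocks []string, i int) string {
	return blocks[i]
}

func main() {
	// Edge case: indexing a nil slice panics at runtime.
	fmt.Println(didPanic(func() { _ = lookupBlock(nil, 0) }))
	// Happy path for comparison: no panic.
	fmt.Println(didPanic(func() { _ = lookupBlock([]string{"a"}, 0) }))
}
```

Inside a real `_test.go` file the same helper could be called from `t.Run` subcases, failing with `t.Fatalf` when the result differs from the expectation.
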
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.
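
The nil-context fallback described above can be sketched as follows. This is a minimal illustration, not the PR's code: `attentionInspector`, `fakeModel`, and the trimmed `Adapter` are hypothetical stand-ins, and the real `inference.AttentionInspector` signature may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// attentionInspector is a hypothetical stand-in for inference.AttentionInspector.
type attentionInspector interface {
	InspectAttention(ctx context.Context, prompt string) (string, error)
}

// fakeModel fails loudly if a nil context ever reaches it.
type fakeModel struct{}

func (fakeModel) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		return "", errors.New("nil context reached the model")
	}
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "inspected: " + prompt, nil
}

// Adapter is a trimmed-down, hypothetical adapter that wraps any model value.
type Adapter struct {
	model any
}

// InspectAttention normalises a nil context before asserting the model type.
func (a *Adapter) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	inspector, ok := a.model.(attentionInspector)
	if !ok {
		return "", errors.New("model does not support attention inspection")
	}
	return inspector.InspectAttention(ctx, prompt)
}

func main() {
	a := &Adapter{model: fakeModel{}}
	var ctx context.Context // deliberately nil
	out, err := a.InspectAttention(ctx, "hello")
	fmt.Println(out, err)
}
```

Placing the fallback first means every later call, including the type assertion path, sees a usable context.
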

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.
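
A minimal sketch of the identity check, assuming the index entry and the loaded bundle each expose a comparable snapshot hash. The types `indexEntry` and `blockBundle`, their field names, and the placeholder URI are hypothetical; the real `kv` bundle type may name its hash field differently.

```go
package main

import (
	"errors"
	"fmt"
)

// indexEntry is a hypothetical view of one index entry.
type indexEntry struct {
	URI          string
	SnapshotHash string
}

// blockBundle is a hypothetical view of a loaded block bundle.
type blockBundle struct {
	SnapshotHash string
	TokenCount   int
}

// verifyBundleIdentity returns an error unless the bundle is the one the
// index recorded. An entry with no recorded hash cannot be verified, so
// this sketch rejects it rather than trusting the URI alone.
func verifyBundleIdentity(entry indexEntry, bundle *blockBundle) error {
	if bundle == nil {
		return errors.New("bundle is nil")
	}
	if entry.SnapshotHash == "" {
		return fmt.Errorf("entry %q has no recorded snapshot hash", entry.URI)
	}
	if bundle.SnapshotHash != entry.SnapshotHash {
		return fmt.Errorf("bundle at %q has hash %q, index expects %q",
			entry.URI, bundle.SnapshotHash, entry.SnapshotHash)
	}
	return nil
}

func main() {
	entry := indexEntry{URI: "memvid://bundle-1", SnapshotHash: "abc123"}
	fmt.Println(verifyBundleIdentity(entry, &blockBundle{SnapshotHash: "abc123", TokenCount: 128}))
	fmt.Println(verifyBundleIdentity(entry, &blockBundle{SnapshotHash: "ffff00", TokenCount: 128}))
}
```

The check would run after the bundle loads and before any prefix restore, so a repointed URI fails early instead of restoring foreign KV state.
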

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
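
The first two wake/sleep findings (guard the bundle before dereferencing it, and refuse to guess an entry when several exist) could look roughly like this. All type and function names here are hypothetical, and plain `errors`/`fmt` errors stand in for `core.NewError`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, minimal views of the index and bundle types.
type indexEntry struct{ URI string }
type memvidIndex struct{ Entries []indexEntry }
type blockBundle struct{ TokenCount int }

// selectEntry auto-selects only when the index holds exactly one entry;
// otherwise an explicit URI is required.
func selectEntry(index memvidIndex, entryURI string) (indexEntry, error) {
	if entryURI == "" {
		switch len(index.Entries) {
		case 0:
			return indexEntry{}, errors.New("mlx: index has no entries")
		case 1:
			return index.Entries[0], nil
		default:
			return indexEntry{}, errors.New("mlx: EntryURI required when index has multiple entries")
		}
	}
	for _, e := range index.Entries {
		if e.URI == entryURI {
			return e, nil
		}
	}
	return indexEntry{}, fmt.Errorf("mlx: entry %q not found in index", entryURI)
}

// bundleTokenCount validates the bundle before reading TokenCount, so a
// nil bundle yields an error instead of a runtime panic.
func bundleTokenCount(bundle *blockBundle) (int, error) {
	if bundle == nil {
		return 0, errors.New("mlx: bundle is nil")
	}
	if bundle.TokenCount <= 0 {
		return 0, fmt.Errorf("mlx: bundle token count %d is not positive", bundle.TokenCount)
	}
	return bundle.TokenCount, nil
}

func main() {
	multi := memvidIndex{Entries: []indexEntry{{URI: "a"}, {URI: "b"}}}
	_, err := selectEntry(multi, "")
	fmt.Println(err)
	_, err = bundleTokenCount(nil)
	fmt.Println(err)
}
```

Treating the positive-count check as part of the guard is an assumption of this sketch; the review only requires the nil check.
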

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
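
As a sketch of the conditional default, assuming `"json"` is an acceptable fallback (the review offers it only as one option): `putOptions` and `buildPutOptions` below are hypothetical stand-ins for `memvid.PutOptions` and the call-site logic.

```go
package main

import (
	"fmt"
	"strings"
)

// putOptions is a hypothetical stand-in for memvid.PutOptions.
type putOptions struct {
	Kind string
}

// defaultKind is an assumed fallback value, not one confirmed by the PR.
const defaultKind = "json"

// buildPutOptions fills in the fallback when the caller left Kind blank.
func buildPutOptions(kind string) putOptions {
	if strings.TrimSpace(kind) == "" {
		kind = defaultKind
	}
	return putOptions{Kind: kind}
}

func main() {
	fmt.Println(buildPutOptions("").Kind)
	fmt.Println(buildPutOptions("session-state").Kind)
}
```
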

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.
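
The context-forwarding fix can be illustrated with a small, self-contained sketch. The `generator` type, `joinChunks`, `generateChunksFallback`, `slowGenerate`, and `cancelledContext` are all hypothetical names; `joinChunks` only stands in for `promptChunksToString`, whose real behaviour is not shown in this review.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// generator is a hypothetical context-aware generate function.
type generator func(ctx context.Context, prompt string) (string, error)

// joinChunks stands in for promptChunksToString.
func joinChunks(chunks []string) string {
	return strings.Join(chunks, "")
}

// generateChunksFallback forwards the caller's context instead of
// creating context.Background(), so deadlines and cancellation survive.
func generateChunksFallback(ctx context.Context, gen generator, chunks []string) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return gen(ctx, joinChunks(chunks))
}

// slowGenerate simulates work that honours cancellation.
func slowGenerate(ctx context.Context, prompt string) (string, error) {
	select {
	case <-ctx.Done():
		return "", ctx.Err()
	case <-time.After(20 * time.Millisecond):
		return "done: " + prompt, nil
	}
}

// cancelledContext returns a context that is already cancelled.
func cancelledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	_, err := generateChunksFallback(cancelledContext(), slowGenerate, []string{"a", "b"})
	fmt.Println(err)
	out, err := generateChunksFallback(context.Background(), slowGenerate, []string{"a", "b"})
	fmt.Println(out, err)
}
```

With a hard-coded `context.Background()` in the helper, the first call would run to completion even though the caller had already cancelled.
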

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
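
Two of the block-cache findings lend themselves to a short sketch: extending the compatibility check to cache mode and block size, and returning the stored ref on a cache hit so persisted labels survive. The `blockRef` and `cacheConfig` types, their fields, and the plain equality comparison (used here in place of `cacheIdentityMatches`, whose semantics are not shown) are assumptions.

```go
package main

import "fmt"

// blockRef is a hypothetical view of a cached block reference.
type blockRef struct {
	ID            string
	ModelHash     string
	AdapterHash   string
	TokenizerHash string
	CacheMode     string
	BlockSize     int
	Labels        map[string]string
}

// cacheConfig is a hypothetical view of the service configuration.
type cacheConfig struct {
	ModelHash     string
	AdapterHash   string
	TokenizerHash string
	CacheMode     string
	BlockSize     int
}

// recordCompatible rejects records whose identity hashes, cache mode,
// or block size differ from the running configuration.
func recordCompatible(cfg cacheConfig, ref blockRef) bool {
	return cfg.ModelHash == ref.ModelHash &&
		cfg.AdapterHash == ref.AdapterHash &&
		cfg.TokenizerHash == ref.TokenizerHash &&
		cfg.CacheMode == ref.CacheMode &&
		cfg.BlockSize == ref.BlockSize
}

// reuseStored swaps freshly built refs for their stored counterparts on
// a hit, keeping persisted labels, and returns the number of hits.
func reuseStored(stored map[string]blockRef, refs []blockRef) int {
	hits := 0
	for i, ref := range refs {
		if cached, ok := stored[ref.ID]; ok {
			refs[i] = cached
			hits++
		}
	}
	return hits
}

func main() {
	cfg := cacheConfig{ModelHash: "m", CacheMode: "paged", BlockSize: 256}
	fmt.Println(recordCompatible(cfg, blockRef{ModelHash: "m", CacheMode: "paged", BlockSize: 128}))

	stored := map[string]blockRef{"b1": {ID: "b1", Labels: map[string]string{"memvid_uri": "x"}}}
	refs := []blockRef{{ID: "b1"}, {ID: "b2"}}
	fmt.Println(reuseStored(stored, refs), refs[0].Labels["memvid_uri"])
}
```

In the real service both operations would run under the service lock, as the findings above note.
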

---

Nitpick comments:
In `@docs/inference/thinking.md`:
- Around line 74-78: The fenced code block containing the token categorisation
lines (ThinkingShow, ThinkingHide, ThinkingCapture) lacks a language specifier
and triggers MD040; update the triple-backtick fence to include a language
identifier (e.g., change ``` to ```text) so the block is properly flagged as
plain text and satisfies the markdown linter.

In `@docs/moe/jang.md`:
- Around line 82-90: Add a language specifier to the fenced code block that
lists the profile names (the block containing "JANG_2M — 2-bit mid-tier",
"JANG_3M — 3-bit mid-tier", etc.); replace the opening triple-backtick with one
that specifies a language identifier (e.g., text) so the block becomes a fenced
code block with a language label for consistent Markdown rendering.

In `@docs/moe/README.md`:
- Line 9: The sentence "Pre-dates this sprint were dense models (Gemma 3/4
dense, Qwen 3, Llama 3);" is grammatically awkward—replace it with a clearer
phrasing that conveys those dense models existed before this sprint, for
example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen 3, Llama 3)
were supported." Edit the README line in the vMLX parity Phase 1 paragraph to
use this clearer wording so the relationship between prior dense models and the
new sparse-expert work is unambiguous.

In `@docs/observability/probe.md`:
- Around line 31-46: The fenced code block in the emission points section lacks
a language specifier; update the opening triple-backticks to include a language
(for example change ``` to ```text or ```yaml) so the block is
rendered/compliant (the block that begins with "Generate / Chat:" and lists
items like "prefill start → cache_pressure" should be updated).

In `@docs/runtime/README.md`:
- Line 68: Update the link text in docs/runtime/README.md that currently reads
"[../model/model_pack.md] — pre-load validation" to use the single-word form
"preload" (i.e., change "pre-load validation" to "preload validation") so the
description next to the model_pack.md link uses the conventional computing term;
locate the occurrence of "pre-load validation" and replace it with "preload
validation".
- Around line 44-62: The fenced code block showing the boot flow (starting with
"package init time:") lacks a language specifier, causing MD040 lint failures;
update the opening backticks to include a language tag (e.g., add "text" so the
block begins with ```text) in README.md near the boot flow that references
register_metal.go init(), inference.Register(&metalbackend{}),
inference.LoadModel, metal.LoadAndInit, and metaladapter usage to satisfy the
markdown linter.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md`:
- Around line 7-9: Replace the machine-specific absolute paths in the plan
document (the two occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Around line 5-6: The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.

In `@docs/vmlx-feature-gap-report.md`:
- Around line 7-8: Replace the hard-coded absolute filesystem path and the full
external URL in the report text with more maintainable references: change the
absolute path string to a relative or generic placeholder (e.g., "cloned locally
at <local-clone-path>" or "<audit-clone-path>") and move the external repository
URL to a footnote, appendix, or a single "References" section, or replace it
with a short identifier combined with a reference list; update the text around
the original literal mentions so it reads the same but without embedding
environment-specific paths.

In `@go/agent/index_test.go`:
- Around line 16-304: Add a new test with the _Ugly suffix in this file that
completes the Good/Bad/Ugly triplet for the public index API surface;
specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_* that triggers and
asserts panic/edge behaviors for the public functions (e.g., NewMemvidIndex,
SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)
  • .gitignore
  • .gitmodules
  • CLAUDE.md
  • CMakeLists.txt
  • GOAL.md
  • docs/README.md
  • docs/architecture.md
  • docs/build.md
  • docs/cmd/violet.md
  • docs/compute/compute.md
  • docs/development.md
  • docs/examples/compute/frame-pipeline.md
  • docs/examples/daemon/violet-socket.md
  • docs/examples/eval/attention-probe.md
  • docs/examples/eval/perplexity.md
  • docs/examples/inference/batch.md
  • docs/examples/inference/chat.md
  • docs/examples/inference/quantization.md
  • docs/examples/inference/streaming.md
  • docs/examples/model-ops/hf-fit.md
  • docs/examples/model-ops/kv-snapshot.md
  • docs/examples/model-ops/merge.md
  • docs/examples/model-ops/quantize-gguf.md
  • docs/examples/training/distill.md
  • docs/examples/training/grpo.md
  • docs/examples/training/lora-finetune.md
  • docs/examples/training/lora-fuse.md
  • docs/history.md
  • docs/index.md
  • docs/inference/README.md
  • docs/inference/block_cache.md
  • docs/inference/decode_optimisation.md
  • docs/inference/parser_registry.md
  • docs/inference/scheduler.md
  • docs/inference/thinking.md
  • docs/memory/README.md
  • docs/memory/agent_memory.md
  • docs/memory/agentic_project_seed.md
  • docs/memory/kv_snapshot.md
  • docs/memory/kv_snapshot_blocks.md
  • docs/memory/kv_snapshot_index.md
  • docs/memory/kv_snapshot_memvid.md
  • docs/memory/medium.md
  • docs/memory/state_bundle.md
  • docs/model-operations.md
  • docs/model/README.md
  • docs/model/memory_plan.md
  • docs/model/model_pack.md
  • docs/models.md
  • docs/moe/README.md
  • docs/moe/codebook_vq.md
  • docs/moe/expert_residency.md
  • docs/moe/jang.md
  • docs/moe/minimax_m2.md
  • docs/observability/probe.md
  • docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
  • docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
  • docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
  • docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
  • docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
  • docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
  • docs/runtime/2026-05-19-goal-completion-audit.md
  • docs/runtime/2026-05-19-runner-calibration.md
  • docs/runtime/2026-05-20-chapter-profile-safety.md
  • docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
  • docs/runtime/README.md
  • docs/runtime/adapter.md
  • docs/runtime/local_autotune.md
  • docs/runtime/register_metal.md
  • docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
  • docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
  • docs/training/README.md
  • docs/training/distill.md
  • docs/training/eval.md
  • docs/training/grpo.md
  • docs/training/lora_adapter.md
  • docs/training/sft.md
  • docs/vmlx-feature-gap-report.md
  • external/go-ai
  • external/go-inference
  • external/go-ml
  • go/adapter.go
  • go/adapter/adapter.go
  • go/adapter_example_test.go
  • go/adapter_test.go
  • go/agent/helpers.go
  • go/agent/index.go
  • go/agent/index_test.go
  • go/agent/test_helpers_test.go
  • go/agent/wake_sleep.go
  • go/api_common.go
  • go/api_common_example_test.go
  • go/api_darwin_test.go
  • go/api_shape_test.go
  • go/api_stub.go
  • go/api_stub_example_test.go
  • go/api_stub_test.go
  • go/api_test.go
  • go/api_tokenizer_darwin_test.go
  • go/api_tokenizer_stub.go
  • go/api_tokenizer_stub_example_test.go
  • go/api_tokenizer_stub_test.go
  • go/artifact/artifact.go
  • go/artifact/artifact_test.go
  • go/attention_test.go
  • go/backend.go
  • go/backend_example_test.go
  • go/backend_test.go
  • go/blockcache/blockcache.go
  • go/blockcache/blockcache_test.go
  • go/blockcache/helpers_test.go
  • go/bundle/bundle.go
  • go/bundle/bundle_test.go
  • go/bundle/example_test.go
  • go/bundle/sami.go
  • go/chaptersmoke/chaptersmoke.go
  • go/chaptersmoke/chaptersmoke_test.go
  • go/chat/chat.go
  • go/chat/chat_test.go
  • go/chat/example_test.go
  • go/cmd/go-mlx/main.go
  • go/cmd/go-mlx/main_test.go
  • go/cmd/mlx/main.go
  • go/cmd/mlx/main_test.go
  • go/cmd/mlx/split_ffn_tune.go
  • go/compute/compute.go
  • go/compute/compute_example_test.go
  • go/compute/compute_metal.go
  • go/compute/compute_metal_example_test.go
  • go/compute/compute_metal_helper_test.go
  • go/compute/compute_metal_test.go
  • go/compute/compute_test.go
  • go/compute_stub.go
  • go/compute_stub_example_test.go
  • go/compute_stub_test.go
  • go/compute_test.go
  • go/dataset/jsonl.go
  • go/dataset/sample.go
  • go/dataset_stream.go
  • go/dataset_stream_example_test.go
  • go/dataset_stream_test.go
  • go/device_info.go
  • go/distill.go
  • go/distill_test.go
  • go/eval.go
  • go/eval_darwin.go
  • go/eval_darwin_test.go
  • go/eval_stub.go
  • go/eval_test.go
  • go/fast_eval.go
  • go/fast_eval_example_test.go
  • go/fast_eval_runner.go
  • go/fast_eval_test.go
  • go/gguf/info.go
  • go/gguf/info_example_test.go
  • go/gguf/info_test.go
  • go/gguf/quantize.go
  • go/gguf/quantize_test.go
  • go/grpo.go
  • go/grpo_test.go
  • go/helpers.go
  • go/hf/hf.go
  • go/hf/hf_test.go
  • go/hf/test_helpers_test.go
  • go/hf_fit.go
  • go/inference_contract.go
  • go/inference_contract_test.go
  • go/internal/metal/activation_bridge.cpp
  • go/internal/metal/array.go
  • go/internal/metal/backend.go
  • go/internal/metal/backend_test.go
  • go/internal/metal/batch.go
  • go/internal/metal/cache.go
  • go/internal/metal/cache_test.go
  • go/internal/metal/close.go
  • go/internal/metal/codebook_vq.go
  • go/internal/metal/codebook_vq_test.go
  • go/internal/metal/compile.go
  • go/internal/metal/compile_test.go
  • go/internal/metal/decode.go
  • go/internal/metal/decode_bridge.cpp
  • go/internal/metal/decode_bridge.h
  • go/internal/metal/decode_test.go
  • go/internal/metal/dense_matvec.go
  • go/internal/metal/dense_matvec_test.go
  • go/internal/metal/device.go
  • go/internal/metal/dtype.go
  • go/internal/metal/error_test.go
  • go/internal/metal/expert_id_matvec.go
  • go/internal/metal/expert_id_matvec_test.go
  • go/internal/metal/fast.go
  • go/internal/metal/fast_test.go
  • go/internal/metal/gemma3.go
  • go/internal/metal/gemma4.go
  • go/internal/metal/gemma4_assistant.go
  • go/internal/metal/gemma4_assistant_decode.go
  • go/internal/metal/gemma4_assistant_decode_example_test.go
  • go/internal/metal/gemma4_assistant_decode_test.go
  • go/internal/metal/gemma4_assistant_generate.go
  • go/internal/metal/gemma4_assistant_generate_test.go
  • go/internal/metal/gemma4_assistant_pair.go
  • go/internal/metal/gemma4_assistant_test.go
  • go/internal/metal/gemma4_ffn_residual.go
  • go/internal/metal/gemma4_ffn_residual_test.go
  • go/internal/metal/gemma4_router_topk.go
  • go/internal/metal/gemma4_router_topk_test.go
  • go/internal/metal/gemma4_test.go
  • go/internal/metal/gemma4_vision.go
  • go/internal/metal/generate.go
  • go/internal/metal/generate_test.go
  • go/internal/metal/jang_dequant.go
  • go/internal/metal/jang_dequant_test.go
  • go/internal/metal/kv_snapshot.go
  • go/internal/metal/metal.go
  • go/internal/metal/minimax_m2.go
  • go/internal/metal/minimax_m2_test.go
  • go/internal/metal/mlx_mlx_backend_cpu_available.cpp
  • go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
  • go/internal/metal/model.go
  • go/internal/metal/model_test.go
  • go/internal/metal/nn.go
  • go/internal/metal/nn_test.go
  • go/internal/metal/ops.go
  • go/internal/metal/process_memory_darwin.go
  • go/internal/metal/process_memory_stub.go
  • go/internal/metal/prompt_cache.go
  • go/internal/metal/prompt_cache_test.go
  • go/internal/metal/qwen3.go
  • go/internal/metal/qwen3_test.go
  • go/internal/metal/runtime_gate.go
  • go/internal/metal/runtime_gate_example_test.go
  • go/internal/metal/runtime_gate_test.go
  • go/internal/metal/sample.go
  • go/internal/metal/sample_test.go
  • go/internal/metal/session.go
  • go/internal/metal/session_example_test.go
  • go/internal/metal/session_test.go
  • go/internal/metal/split.go
  • go/internal/metal/split_test.go
  • go/internal/metal/stream.go
  • go/internal/metal/tokenizer.go
  • go/internal/metal/tokenizer_test.go
  • go/internal/metal/trace.go
  • go/internal/metal/trace_test.go
  • go/internal/metal/training.go
  • go/jang_test.go
  • go/kv/analysis.go
  • go/kv/analysis_example_test.go
  • go/kv/analysis_test.go
  • go/kv/bench.go
  • go/kv/bench_test.go
  • go/kv/blocks.go
  • go/kv/blocks_test.go
  • go/kv/helpers_test.go
  • go/kv/memvid.go
  • go/kv/memvid_test.go
  • go/kv/snapshot.go
  • go/kv/snapshot_example_test.go
  • go/kv/snapshot_test.go
  • go/kv_analysis_example_test.go
  • go/kv_cache_bench.go
  • go/kv_snapshot.go
  • go/kv_snapshot_example_test.go
  • go/kv_snapshot_test.go
  • go/local_tuning.go
  • go/local_tuning_test.go
  • go/lora/adapter.go
  • go/lora/fuse.go
  • go/lora/fuse_stub.go
  • go/lora/fuse_test.go
  • go/lora_adapter_darwin_test.go
  • go/lora_adapter_test.go
  • go/lora_fuse.go
  • go/lora_fuse_darwin.go
  • go/lora_fuse_darwin_test.go
  • go/lora_fuse_test.go
  • go/medium_test.go
  • go/memory/example_test.go
  • go/memory/memory.go
  • go/memory/memory_test.go
  • go/memory_plan.go
  • go/memory_plan_example_test.go
  • go/memory_plan_test.go
  • go/memvid_chapter_smoke.go
  • go/merge/compare.go
  • go/merge/compare_example_test.go
  • go/merge/compare_test.go
  • go/merge/helpers_test.go
  • go/merge/merge.go
  • go/merge/merge_test.go
  • go/mlx.go
  • go/mlx_example_test.go
  • go/mlx_internal_test.go
  • go/mlx_stub.go
  • go/mlx_stub_example_test.go
💤 Files with no reviewable changes (15)
  • go/api_test.go
  • go/api_stub_example_test.go
  • go/api_tokenizer_stub_test.go
  • go/adapter_example_test.go
  • go/api_tokenizer_stub.go
  • go/api_tokenizer_darwin_test.go
  • go/api_tokenizer_stub_example_test.go
  • go/backend_example_test.go
  • go/api_common_example_test.go
  • go/api_shape_test.go
  • go/api_common.go
  • go/api_darwin_test.go
  • go/attention_test.go
  • go/api_stub.go
  • go/api_stub_test.go

Comment thread docs/memory/kv_snapshot_blocks.md
Comment thread go/artifact/artifact.go Outdated
Comment thread go/backend.go
Comment thread go/blockcache/blockcache.go
Comment thread go/blockcache/blockcache.go
Comment thread go/blockcache/blockcache.go
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
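
The defensive copy requested above could look roughly like the sketch below. The `kvHead` type and the `snapshotHead` and `cloneBytes` helpers are hypothetical stand-ins, not the actual types in go/backend.go; only the field names KeyBytes and ValueBytes come from the comment.

```go
package main

import "fmt"

// kvHead is a hypothetical stand-in for the snapshot head type.
type kvHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// cloneBytes returns a copy that shares no backing array with src.
// A nil input stays nil; a non-nil input yields a non-nil copy.
func cloneBytes(src []byte) []byte {
	if src == nil {
		return nil
	}
	return append([]byte{}, src...)
}

// snapshotHead copies both byte fields instead of aliasing them.
func snapshotHead(in kvHead) kvHead {
	return kvHead{
		KeyBytes:   cloneBytes(in.KeyBytes),
		ValueBytes: cloneBytes(in.ValueBytes),
	}
}

func main() {
	caller := kvHead{KeyBytes: []byte{1, 2, 3}, ValueBytes: []byte{4, 5, 6}}
	snap := snapshotHead(caller)
	caller.KeyBytes[0] = 99 // the caller mutates its own buffer afterwards
	fmt.Println(snap.KeyBytes, snap.ValueBytes)
}
```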

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)
  • CMakeLists.txt
  • cpp/CMakeLists.txt
  • go/backend.go
  • go/backend_test.go
  • go/cmd/mlx/main.go
  • go/cmd/mlx/main_test.go
  • go/internal/metal/backend.go
  • go/internal/metal/backend_test.go
  • go/internal/metal/decode_bridge.cpp
  • go/internal/metal/gemma4.go
  • go/internal/metal/gemma4_test.go
  • go/internal/metal/generate.go
  • go/internal/metal/metal.go
  • go/internal/metal/mlx_build_config.h
  • go/internal/metal/pinned_array.go
  • go/internal/metal/pinned_array_bridge.cpp
  • go/internal/metal/pinned_array_test.go
  • go/internal/metal/sample.go
  • go/internal/metal/sample_test.go
  • go/internal/metal/session.go
  • go/kv/snapshot.go
  • go/memvid_chapter_smoke.go
✅ Files skipped from review due to trivial changes (1)
  • cpp/CMakeLists.txt

Comment thread go/backend.go

@github-advanced-security github-advanced-security AI left a comment

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Snider and others added 23 commits May 22, 2026 23:02
appendKVEncodedTensor + stream.encodedTensor were both refactored in
prior waves to stream values directly into dst, skipping the
intermediate normalizeKVSnapshotNativeTensor alloc. The helper has
zero remaining call sites — the only references are explanatory
comments. Removing the 19-line function tightens the surface and
prevents future code from re-wiring the slow path.

Co-Authored-By: Virgil <virgil@lethean.io>
…nstant

W10-E cached EmbeddingScale + PerLayerInputEmbeddingScale on
Gemma4TextConfig and removed per-token math.Sqrt(HiddenSize) from six
forward-path sites. The perLayerInputTensor path still carried two
math.Pow calls firing on every decode token / prefill step:

  1. MulScalar(projected, float32(math.Pow(float64(HiddenSize), -0.5)))
  2. MulScalar(combined,  float32(math.Pow(2, -0.5)))

Cache (1) as cfg.PerLayerProjectionScale (= 1/sqrt(HiddenSize)) populated
by gemma4FinaliseEmbeddingScales alongside the existing two scales —
parseGemma4Config + LoadGemma4 already invoke it twice so the new field
is kept in sync without extra plumbing. Reset path zeros
PerLayerProjectionScale when HiddenSize is zeroed by a pathological
loader, mirroring the per-layer reset.

Lift (2) to a package-level const gemma4PerLayerCombineScale = 1/sqrt(2)
expressed as a float32 literal (0.70710678118654752440). The test gates
the literal against float32(math.Pow(2, -0.5)) so any divergence trips
locally before reaching a forward pass.

Sites updated:
- gemma4.go perLayerInputTensor projected scaling -> cfg.PerLayerProjectionScale
- gemma4.go perLayerInputTensor combine scaling   -> gemma4PerLayerCombineScale

perLayerInputTensor is invoked through computePerLayerInputs from both
forwardNativeFixedGreedyToken (per-decode-token) and forwardHidden
(per-prefill step), so the lift compounds across the existing
W10-E EmbeddingScale work for the same forward path. The compiled
CompileShapeless wrapper around perLayerInputTensor inherits the change
because it dispatches to the same function.

Tests extend the existing W10-E pair: byte-equivalence vs a freshly
computed math.Pow(HiddenSize, -0.5), reset-on-zero for both per-layer
and HiddenSize zeroings, and a literal-vs-math.Pow guard for
gemma4PerLayerCombineScale.
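
A minimal sketch of the caching idea and the literal-vs-math.Pow guard, under stated assumptions: `textConfig`, `finaliseScales`, `combineScale` and `scalesMatchPow` are illustrative names, not the identifiers in gemma4.go; only the field name PerLayerProjectionScale and the literal are taken from the message above.

```go
package main

import (
	"fmt"
	"math"
)

// combineScale is 1/sqrt(2) written as a literal, rounded to float32 by the compiler.
const combineScale float32 = 0.70710678118654752440

// textConfig is a hypothetical, trimmed-down config.
type textConfig struct {
	HiddenSize              int32
	PerLayerProjectionScale float32
}

// finaliseScales caches 1/sqrt(HiddenSize) once.
// A non-positive HiddenSize resets the cached field to zero.
func finaliseScales(cfg *textConfig) {
	if cfg.HiddenSize <= 0 {
		cfg.PerLayerProjectionScale = 0
		return
	}
	cfg.PerLayerProjectionScale = float32(1 / math.Sqrt(float64(cfg.HiddenSize)))
}

// scalesMatchPow reports whether the cached values equal the per-call math.Pow forms.
func scalesMatchPow(hidden int32) bool {
	cfg := textConfig{HiddenSize: hidden}
	finaliseScales(&cfg)
	perCall := float32(math.Pow(float64(hidden), -0.5))
	return cfg.PerLayerProjectionScale == perCall &&
		combineScale == float32(math.Pow(2, -0.5))
}

func main() {
	fmt.Println(scalesMatchPow(2048), scalesMatchPow(1536))
}
```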

Co-Authored-By: Virgil <virgil@lethean.io>
…ositionVector

Three orphans in analysis.go:

  - kvAnalysisHeadVectors: a thin wrapper for kvAnalysisHeadVectorsInto
    that callers stopped routing through (the Into form is the only
    surface left after W8 reuse-scratch).
  - kvAnalysisMeanVector: replaced by kvAnalysisLayerState (sum-into-
    place avoids the [][]float32 view + per-head combined-buffer alloc).
  - kvAnalysisPositionVector: replaced by the inline slice arithmetic
    inside kvAnalysisPositionDifferentiation Pass 1.

Comment refs touched up — the Into form name is consistent, and the
mean-vector behaviour note now stands on its own without naming a
function that no longer exists.

Net: -47 lines, tighter surface, no behaviour change.

Co-Authored-By: Virgil <virgil@lethean.io>
…r loop

splitPerLayerInputTensor calls Squeeze(sliced, 2) inside a
NumHiddenLayers-iteration loop.  After W10-A made the substrate
Squeeze itself a 0-alloc cgo crossing, the residual per-call cost
shifted to the model layer: each `Squeeze(sliced, 2)` variadic call
allocates a fresh single-element `[]int{2}` axes slice that escapes to
the heap (Squeeze's substrate takes &axes[0] for the cgo inline call,
so the compiler can't keep the slice on the stack).

For a Gemma 4 model with 26 hidden layers this is 26 allocs and 208 B
of GC pressure per perLayerInputs call — per token for decode, per
prefill step otherwise.  Hoist the `[]int{2}` outside the loop and
pass it via `Squeeze(sliced, axes...)` — the slice is allocated once
per splitPerLayerInputTensor call rather than once per layer.

Bench (Apple M3 Ultra, 200ms benchtime; mock 26-layer loop over
Squeeze of a [1,1,1,128] array):

  Loop26_VariadicInline    11427 ns/op   208 B/op   26 allocs/op
  Loop26_VariadicHoisted   10427 ns/op     8 B/op    1 allocs/op

-25 allocs/forward (-96.2%), -200 B/forward (-96.2%), with the same
forward output by construction — the variadic slice content is
identical, only the allocation point moved.
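
The hoist can be pictured with a pure-Go mock. `squeeze` here is a hypothetical stand-in that operates on shapes only (no cgo), so the sketch shows the call pattern rather than reproducing the measured allocation counts.

```go
package main

import "fmt"

// squeeze is a hypothetical variadic-axes op: it drops the listed size-1 axes from shape.
func squeeze(shape []int, axes ...int) []int {
	out := make([]int, 0, len(shape))
	for i, d := range shape {
		drop := false
		for _, ax := range axes {
			if ax == i && d == 1 {
				drop = true
			}
		}
		if !drop {
			out = append(out, d)
		}
	}
	return out
}

// splitInline passes the axis as a variadic literal: each call builds an
// implicit []int{2}; whether it reaches the heap depends on escape analysis.
func splitInline(shape []int, layers int) [][]int {
	outs := make([][]int, 0, layers)
	for i := 0; i < layers; i++ {
		outs = append(outs, squeeze(shape, 2))
	}
	return outs
}

// splitHoisted builds the axes slice once and reuses it for every layer.
func splitHoisted(shape []int, layers int) [][]int {
	axes := []int{2}
	outs := make([][]int, 0, layers)
	for i := 0; i < layers; i++ {
		outs = append(outs, squeeze(shape, axes...))
	}
	return outs
}

// sameShapes reports whether two shape lists are element-wise identical.
func sameShapes(a, b [][]int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if len(a[i]) != len(b[i]) {
			return false
		}
		for j := range a[i] {
			if a[i][j] != b[i][j] {
				return false
			}
		}
	}
	return true
}

func main() {
	shape := []int{1, 1, 1, 128}
	fmt.Println(sameShapes(splitInline(shape, 26), splitHoisted(shape, 26)))
}
```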

Co-Authored-By: Virgil <virgil@lethean.io>
…sites

After W10-A made AsStrided / Reshape / Transpose / Squeeze /
SliceUpdateInplace 0-alloc at the substrate level (the C side
materialises shape/strides arrays on the C stack via the *_inline
wrappers), the residual per-call cost at every model attention forward
site shifted to the Go-side inline slice literal.

The substrate takes &shape[0] / &strides[0] / &axes[0] via
unsafe.Pointer for the cgo call, so the compiler conservatively
escapes any caller-side slice literal to the heap.  This means each
model attention forward call still pays:

  AsStrided inline slice literals     +48 B/op +2 allocs/op
  Reshape variadic args               +16 B/op +1 alloc/op
  Transpose variadic args             +32 B/op +1 alloc/op
  Squeeze  variadic args               +8 B/op +1 alloc/op

Per gemma3 / gemma4 / qwen3 attention layer per token:

  Q/K/V AsStrided                3 × (48 B + 2 allocs) = 144 B + 6 allocs
  Transpose out                  1 × (32 B + 1 alloc)  =  32 B + 1 alloc
  Reshape merged                 1 × (16 B + 1 alloc)  =  16 B + 1 alloc
                                                       192 B + 8 allocs

For a 32-layer 1000-token Gemma 4 decode that is ~256 000 allocs and
~6 MB of GC pressure attributable to inline rank-4 slice literals
alone.  W9-AA / W9-L / W9-M precedents would normally absorb this at
the model layer; that path is closed here because the call signatures
are inherently variadic.  A future substrate addition (a rank-4
overload that takes int32/int64 args directly, e.g.
AsStrided4D(a, B, H, L, D, sB, sH, sL, sD, offset)) would lift the
last residual without touching every model file.

The benchmarks added here are the measurement floor for that future
substrate work and are not themselves a perf change.  Each routine
compares the existing pre-built-slice path against the model-call
inline-literal pattern so any future substrate fix can demonstrate the
delta against the same shape inputs.

Co-Authored-By: Virgil <virgil@lethean.io>
…lem path)

The DecodeFloatData F32 branch was running an N-element loop of
`math.Float32frombits(binary.LittleEndian.Uint32(raw[i*4:]))`. Each
iteration reslices, bounds-checks, byte-loads four bytes, calls Uint32
to re-assemble them little-endian, then converts to float32 via the
bit-pattern primitive. On little-endian arm64/amd64 the bytes on disk
already match the float32 storage layout, so the whole loop is a memcpy.

Reinterpret-cast `values` as a byte slice via unsafe.Slice +
unsafe.SliceData (same idiom as kv/snapshot.go decodeKVSnapshotNativeTensor)
and `copy(dst, raw)` in one shot.
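
As an illustration of the technique (not the repository's DecodeFloatData), the per-element loop and the bulk copy can be compared side by side. The function names are hypothetical, and the host-endianness fallback is an addition of this sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"unsafe"
)

// hostIsLittleEndian reports the byte order of the running machine.
func hostIsLittleEndian() bool {
	x := uint16(1)
	return *(*byte)(unsafe.Pointer(&x)) == 1
}

// decodeF32Loop reassembles each little-endian float32 one element at a time.
func decodeF32Loop(raw []byte) []float32 {
	out := make([]float32, len(raw)/4)
	for i := range out {
		out[i] = math.Float32frombits(binary.LittleEndian.Uint32(raw[i*4:]))
	}
	return out
}

// decodeF32Bulk views the destination as bytes and copies in one shot.
// It falls back to the loop on big-endian hosts.
func decodeF32Bulk(raw []byte) []float32 {
	n := len(raw) / 4
	if n == 0 || !hostIsLittleEndian() {
		return decodeF32Loop(raw)
	}
	out := make([]float32, n)
	dst := unsafe.Slice((*byte)(unsafe.Pointer(unsafe.SliceData(out))), n*4)
	copy(dst, raw[:n*4])
	return out
}

// equalBits compares two slices by IEEE-754 bit pattern (NaN-safe).
func equalBits(a, b []float32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if math.Float32bits(a[i]) != math.Float32bits(b[i]) {
			return false
		}
	}
	return true
}

func main() {
	raw := []byte{0x00, 0x00, 0x80, 0x3f, 0x00, 0x00, 0x00, 0xc0}
	fmt.Println(equalBits(decodeF32Loop(raw), decodeF32Bulk(raw)))
}
```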

Bench (Apple M3 Ultra, benchtime=2s, count=3):

  F32_512   502ns → 355ns   (-29%)
  F32_2048  1991ns → 1322ns (-34%)

Numerical parity: bit-exact — Float32frombits and the byte-pattern memcpy
land on identical IEEE-754 bit patterns. Existing parity test
TestParseHeader_Parity_Synthetic + TestWriteSubset_Good cover round-trip;
`go test -race -short` clean.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
f32sRaw and i32sRaw staged values into the pooled scratch buffer then
issued a single writer.Write. The staging copy is pure waste because:

  - The byte view of []float32 / []int32 already matches what
    Float32bits + PutUint32 would produce (little-endian arches only,
    arm64 + amd64).
  - Writers consume the bytes within Write, so we don't need to retain
    a stable scratch buffer past the call.
  - The writer itself (sha256.Write, PutBytesStream) does its own
    buffering; staging into scratch first costs one extra memcpy with
    zero downstream benefit.

Now passes the unsafe.Slice byte view straight to w.bytes(...).
Eliminates 1 memcpy per f32sRaw / i32sRaw call (typically 3-5 calls
per snapshot encode; one per layer × Key/Value side + tokens + logits).

scratchFor + scratch field removed from the pool struct.

Bench (M3 Ultra, benchtime=2s, n=3):
  Snapshot_WriteWithOptions_2048Tokens: 1217 → 795 ns/op  (-35%)

No alloc delta — scratch was pool-resident already; the win is pure
ns/op throughput. HashSnapshot benches unchanged within noise (the
sha256.New + Sum dominate at small sizes).
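
The write-side counterpart can be sketched the same way. `writeF32Staged` and `writeF32Direct` are hypothetical names, and the sketch relies on the io.Writer contract that Write must not retain the slice it is given.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"io"
	"math"
	"unsafe"
)

// littleEndianHost reports the byte order of the running machine.
func littleEndianHost() bool {
	x := uint16(1)
	return *(*byte)(unsafe.Pointer(&x)) == 1
}

// writeF32Staged encodes into a scratch buffer first, then writes once.
func writeF32Staged(w io.Writer, values []float32) error {
	scratch := make([]byte, len(values)*4)
	for i, v := range values {
		binary.LittleEndian.PutUint32(scratch[i*4:], math.Float32bits(v))
	}
	_, err := w.Write(scratch)
	return err
}

// writeF32Direct hands the writer a byte view of the values themselves.
// It falls back to staging on big-endian hosts.
func writeF32Direct(w io.Writer, values []float32) error {
	if len(values) == 0 {
		return nil
	}
	if !littleEndianHost() {
		return writeF32Staged(w, values)
	}
	view := unsafe.Slice((*byte)(unsafe.Pointer(unsafe.SliceData(values))), len(values)*4)
	_, err := w.Write(view)
	return err
}

// digestsMatch hashes both encodings and compares the SHA-256 sums.
func digestsMatch(values []float32) bool {
	a, b := sha256.New(), sha256.New()
	if writeF32Staged(a, values) != nil || writeF32Direct(b, values) != nil {
		return false
	}
	return string(a.Sum(nil)) == string(b.Sum(nil))
}

func main() {
	fmt.Println(digestsMatch([]float32{1, -2, 0.5, 3.25}))
}
```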

Co-Authored-By: Virgil <virgil@lethean.io>
…lem path)

The DecodeFloatData F64 branch was looping
`float32(math.Float64frombits(binary.LittleEndian.Uint64(raw[i*8:])))`.
Per-iteration cost: 1) re-slice raw[i*8:] (bounds check), 2) Uint64 reassembly
from 8 bytes, 3) Float64frombits (a bit cast), 4) float32 conversion.
Steps 1-3 are throw-away work because the bytes on disk already match
the float64 storage layout on little-endian targets (arm64, amd64).

Reinterpret-cast raw as []float64 via unsafe.Slice + unsafe.SliceData
and let `range src64` produce direct float64 loads. The downcast to
float32 stays — F64→F32 is genuinely lossy. On arm64 the compiler
emits a clean LDR D + FCVT S pair per element.

Bench (Apple M3 Ultra, benchtime=2s, count=3):

  F64_2048  2470ns → 1853ns mean (-25%)
            best  1615ns        (-35%)

Numerical parity: bit-exact — Float64frombits is the inverse of the
storage byte pattern, so the unsafe cast and the explicit reassembly
land on identical float64 values, and float32() downcast is identical
in both paths.

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
… 3× faster

PrefillCacheStateArrays paid one alloc per Cache.State() call (each returning
a fresh []*Array{k,v} literal). On Gemma 4's 26-cache fan-out the substrate
saw 27 allocs per prefill step (516.1 ns).

Add an optional stateAppender interface implemented by KVCache, RotatingKVCache,
FixedKVCache, PagedKVCache, QuantizedKVCache: AppendState(dst) appends raw
state arrays into a caller-provided slice. The public State() contract is
unchanged — appendCacheState() helper falls back via State() copy for any
future cache type that doesn't implement the optimisation.

Before:
  PrefillCacheStateArrays_8Caches    172.7 ns/op    9 allocs/op
  PrefillCacheStateArrays_26Caches   516.1 ns/op   27 allocs/op
After:
  PrefillCacheStateArrays_8Caches    59.81 ns/op    1 alloc/op   (-88.9% / 2.9×)
  PrefillCacheStateArrays_26Caches   174.1 ns/op    1 alloc/op   (-96.3% / 3.0×)

Same pattern wired through cacheStateArraysForDetach (eval-detach path) —
same alloc-floor reduction applies to detach-after-prefill.
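
The optional-interface pattern described above might be sketched as follows. The `array`, `cache`, `kvCache` and `legacyCache` types are simplified stand-ins; only the names stateAppender, AppendState, State and appendCacheState come from the message.

```go
package main

import "fmt"

// array is a hypothetical tensor handle.
type array struct{ id int }

// cache is the public contract: State returns a fresh slice on every call.
type cache interface{ State() []*array }

// stateAppender is the optional fast path: append into a caller-owned slice.
type stateAppender interface {
	AppendState(dst []*array) []*array
}

// kvCache implements both the public contract and the fast path.
type kvCache struct{ k, v *array }

func (c *kvCache) State() []*array                   { return []*array{c.k, c.v} }
func (c *kvCache) AppendState(dst []*array) []*array { return append(dst, c.k, c.v) }

// legacyCache implements only State, so it exercises the fallback.
type legacyCache struct{ k, v *array }

func (c *legacyCache) State() []*array { return []*array{c.k, c.v} }

// appendCacheState uses AppendState when available and copies State() otherwise.
func appendCacheState(dst []*array, c cache) []*array {
	if a, ok := c.(stateAppender); ok {
		return a.AppendState(dst)
	}
	return append(dst, c.State()...)
}

// collectStates gathers all state arrays into one pre-sized slice.
func collectStates(caches []cache) []*array {
	out := make([]*array, 0, 2*len(caches))
	for _, c := range caches {
		out = appendCacheState(out, c)
	}
	return out
}

func main() {
	caches := []cache{
		&kvCache{k: &array{1}, v: &array{2}},
		&legacyCache{k: &array{3}, v: &array{4}},
	}
	for _, a := range collectStates(caches) {
		fmt.Println(a.id)
	}
}
```

The pre-sized `out` slice assumes two state arrays per cache; a cache type that returns more would simply trigger a normal append growth.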

Co-Authored-By: Virgil <virgil@lethean.io>
…(-38% allocs)

WriteSubset previously built a map[string]HeaderEntry with per-entry
[]int64 Shape + DataOffsets slices then ran core.JSONMarshal over it.
The reflection-driven encoder allocates internally for each struct field
descriptor and for the resulting buffer-grow chain. On a 2-tensor subset
the path was 24 allocs / 2660 B.

Replace with subsetHeaderEncoded — a hand-rolled appender that emits
the safetensors header bytes directly into a pre-sized buffer:

  {"name1":{"dtype":"F32","shape":[2,3],"data_offsets":[0,24]},...}

- Map keys are emitted in alphabetical order (encoding/json default).
- Struct field order is dtype → shape → data_offsets (declaration order
  of HeaderEntry — what JSONMarshal already produced).
- appendJSONString is the same escape table as encoding/json
  encodeString (\", \\, \b/\f/\n/\r/\t, \u00XX for the rest).
- appendJSONInt64 emits base-10 with no leading zeros into a 20-byte
  stack buffer (no heap alloc).

Bench (Apple M3 Ultra, benchtime=2s, count=3, WriteSubset_TwoTensors):

  before  74177 ns/op  2660 B/op  24 allocs/op
  after   70600 ns/op  1688 B/op  15 allocs/op  (-5% ns, -37% B, -38% allocs)

Time savings are bounded by the file I/O the bench includes; the
allocation drop is the structural win. The dropped allocs were exactly
the per-tensor HeaderEntry construction + JSONMarshal field-descriptor
churn — the data flows through once now, never materialised as a Go
object graph.

Added TestSubsetHeaderEncoded_ParityWithJSONMarshal — anchors the
encoder output bit-exact against core.JSONMarshal(map[string]HeaderEntry)
across 4 shapes (single 2D, multi-dim mix with three tensors, lowercase
dtype canonicalisation, single one-dim). Any structural drift breaks
the test, so future tweaks remain safe.
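
A simplified sketch of such an appender, with a parity check against the standard library. It assumes encoding/json in place of core.JSONMarshal, escapes only quotes and backslashes (so it is valid only for plain ASCII names), and expects non-nil Shape and DataOffsets slices; all function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// headerEntry mirrors the three fields named in the message above.
type headerEntry struct {
	DType       string  `json:"dtype"`
	Shape       []int64 `json:"shape"`
	DataOffsets []int64 `json:"data_offsets"`
}

// appendInts emits a JSON array of base-10 integers.
func appendInts(dst []byte, vals []int64) []byte {
	dst = append(dst, '[')
	for i, v := range vals {
		if i > 0 {
			dst = append(dst, ',')
		}
		dst = strconv.AppendInt(dst, v, 10)
	}
	return append(dst, ']')
}

// appendPlainString quotes s, escaping only '"' and '\\' (simplified on purpose).
func appendPlainString(dst []byte, s string) []byte {
	dst = append(dst, '"')
	for i := 0; i < len(s); i++ {
		if s[i] == '"' || s[i] == '\\' {
			dst = append(dst, '\\')
		}
		dst = append(dst, s[i])
	}
	return append(dst, '"')
}

// encodeHeader writes entries in sorted key order, fields in declaration order.
func encodeHeader(entries map[string]headerEntry) []byte {
	names := make([]string, 0, len(entries))
	for n := range entries {
		names = append(names, n)
	}
	sort.Strings(names)
	dst := make([]byte, 0, 64*len(names)+2)
	dst = append(dst, '{')
	for i, n := range names {
		if i > 0 {
			dst = append(dst, ',')
		}
		e := entries[n]
		dst = appendPlainString(dst, n)
		dst = append(dst, `:{"dtype":`...)
		dst = appendPlainString(dst, e.DType)
		dst = append(dst, `,"shape":`...)
		dst = appendInts(dst, e.Shape)
		dst = append(dst, `,"data_offsets":`...)
		dst = appendInts(dst, e.DataOffsets)
		dst = append(dst, '}')
	}
	return append(dst, '}')
}

// matchesStdlib reports whether the hand-rolled bytes equal json.Marshal's output.
func matchesStdlib(entries map[string]headerEntry) bool {
	want, err := json.Marshal(entries)
	return err == nil && string(want) == string(encodeHeader(entries))
}

func main() {
	entries := map[string]headerEntry{
		"b.bias":   {DType: "F32", Shape: []int64{3}, DataOffsets: []int64{24, 36}},
		"a.weight": {DType: "F32", Shape: []int64{2, 3}, DataOffsets: []int64{0, 24}},
	}
	fmt.Println(string(encodeHeader(entries)), matchesStdlib(entries))
}
```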

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
…05 ns -42%, stream-writer direct-unsafe-slice eliminates staging memcpy)
…tate/scheduler/openai/ollama/parser + 3.5-4.6× jang dequant model-LOAD)
… + Squeeze axes hoist — -96.2% allocs/forward on 26-layer Gemma 4)
Continues the W9-Y sentinel sweep into the remaining safetensors call
sites with static-text core.NewError messages:

  safetensors.go
    errChunkOutOfBounds    — ReadFloat32Chunk + readFloat32ChunkInto
    errChunkTruncated      — same pair
    errF32PayloadMismatch  — DecodeFloatData F32 branch
    errF16PayloadMismatch  — DecodeFloatData F16 branch
    errBF16PayloadMatch    — DecodeFloatData BF16 branch
    errF64PayloadMismatch  — DecodeFloatData F64 branch
    errCoreResultFailed    — resultError fallback

  write.go
    errSubsetPathEmpty       — WriteSubset early validation
    errSubsetNoTensors       — same
    errSubsetTensorNameEmpty — subsetHeaderEncoded validation
    errWriteNoProgress       — writeAll zero-progress guard

Each previously allocated a fresh core.NewError on fire. None of these
fire in the bench paths (validation guards), so the bench delta is zero
— this is a hygiene fix that keeps the next session's pprof showing
data-shape allocs rather than "oh another core.NewError". errors.Is
on the typed sentinels also works for callers wanting to distinguish
"chunk truncated" vs "chunk out of bounds" without text matching.

Per-call-site errors that interpolate ref.Name are left as-is — the
message is genuinely dynamic and a sentinel can't carry the per-tensor
context.
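
A minimal sketch of the sentinel pattern, using the standard library's errors.New as an assumed substitute for core.NewError. The message strings, `readChunk`, `readNamed` and `isTruncated` are hypothetical; only the two sentinel names are taken from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// Package-level sentinels: allocated once, comparable with errors.Is.
var (
	errChunkOutOfBounds = errors.New("safetensors: chunk out of bounds")
	errChunkTruncated   = errors.New("safetensors: chunk truncated")
)

// readChunk returns a sub-slice or one of the static sentinels.
func readChunk(data []byte, offset, length int) ([]byte, error) {
	if offset < 0 || length < 0 || offset > len(data) {
		return nil, errChunkOutOfBounds
	}
	if offset+length > len(data) {
		return nil, errChunkTruncated
	}
	return data[offset : offset+length], nil
}

// readNamed shows the dynamic case: the tensor name is interpolated,
// and %w keeps the sentinel reachable for errors.Is.
func readNamed(name string, data []byte, offset, length int) ([]byte, error) {
	chunk, err := readChunk(data, offset, length)
	if err != nil {
		return nil, fmt.Errorf("tensor %q: %w", name, err)
	}
	return chunk, nil
}

// isTruncated distinguishes the failure mode without text matching.
func isTruncated(err error) bool { return errors.Is(err, errChunkTruncated) }

// isOutOfBounds does the same for the bounds sentinel.
func isOutOfBounds(err error) bool { return errors.Is(err, errChunkOutOfBounds) }

func main() {
	_, err := readNamed("layer0.weight", make([]byte, 8), 4, 8)
	fmt.Println(err, isTruncated(err), isOutOfBounds(err))
}
```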

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
…14→6 allocs

The W10-A inline-C-wrapper for Slice/SliceUpdateInplace removed the cgo-side
[]C.int materialisations, but the Go-side []int32{0,0,prev,0}, []int32{B,H,
offset,D} literal pairs still escape to heap (verified via -gcflags='-m').
On KVCache.Update there are 4 such pairs per call (two slices per pair, each
holding 4 four-byte elements), i.e. 8 heap allocs per Update step.

Add Slice4 / SliceUpdateInplace4 — rank-4 scalar-pass form (W10-J pattern
applied to slice). 8 indices passed as register-passed scalars; C wrapper
materialises stack buffers directly. KV cache canonical rank is 4, so these
cover the dominant cache.go call sites.

Wire KVCache.Update (5 call sites) through the scalar form:

Before:
  KVCache_Append_SingleToken_FromEmpty       241 B/op   14 allocs/op
  KVCache_Append_SingleToken_To32           7188 B/op  417 allocs/op
  KVCache_Append_SingleToken_To512        114766 B/op 6659 allocs/op
  KVCache_Append_512TokenPrefill             240 B/op   14 allocs/op
  KVCache_Append_4096TokenPrefill            240 B/op   14 allocs/op

After:
  KVCache_Append_SingleToken_FromEmpty       112 B/op    6 allocs/op  (-57%)
  KVCache_Append_SingleToken_To32           3089 B/op  161 allocs/op  (-61%)
  KVCache_Append_SingleToken_To512         49221 B/op 2563 allocs/op  (-61%)
  KVCache_Append_512TokenPrefill             112 B/op    6 allocs/op  (-57%)
  KVCache_Append_4096TokenPrefill            112 B/op    6 allocs/op  (-57%)

ns/op steady — the W10-A inline-C wrapper already won the per-call latency;
this is alloc-floor improvement. Bedrock for downstream alloc-sensitive paths.

The remaining 6 allocs/op are: 4 newArray ops (Slice×2 + SliceUpdateInplace×2)
+ 2 Concatenate-internal materialisations. Those need W10-G pool extensions
or W10-A-style wrappers.

Other rank-4 call sites in cache.go (RotatingKVCache, FixedKVCache,
BorrowedFixedState — 17 more lines) follow in subsequent commits.

Co-Authored-By: Virgil <virgil@lethean.io>
The DecodeFloatData F16 + BF16 branches were running a manual
`uint16(buf[j]) | uint16(buf[j+1])<<8` byte-pair combine per element
because the previous pass (W4-D) avoided allocating a uint16 slice.
But the byte combine is throwaway work: fp16 / bf16 storage is
little-endian, which on little-endian targets (arm64, amd64) is identical
to the in-memory layout of a uint16. The per-iter cost was a byte load,
a shift, an OR, then the actual bit-twiddle conversion.

Reinterpret-cast raw as []uint16 via unsafe.Slice + unsafe.SliceData
and let `range src16` produce direct half-word loads. On arm64 this
collapses to LDR.H + Float16ToFloat32 (F16) or LDR.H + LSL + bit-cast
(BF16).

Bench (Apple M3 Ultra, benchtime=2s, count=3):

  F16_2048   4032ns → 3506ns mean (-13%, best -15%)
  BF16_2048  2167ns → 1450ns mean (-33%, best -33%)

The smaller F16 win is structural — Float16ToFloat32 itself is the
~70% dominator (denormal + special-value handling + final
Float32frombits), so the load-side simplification only earns where
the load was non-trivial. BF16's body is just a shift + Float32frombits
so the load saving lands larger.

Numerical parity: bit-exact — same little-endian uint16 reassembly,
same Float16ToFloat32 / Float32frombits result.
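
For the BF16 branch, the reinterpret-cast idea could be sketched as below. The function names are hypothetical; the endianness and 2-byte alignment fallbacks are additions of this sketch and are not stated in the message above.

```go
package main

import (
	"fmt"
	"math"
	"unsafe"
)

// leHost reports the byte order of the running machine.
func leHost() bool {
	x := uint16(1)
	return *(*byte)(unsafe.Pointer(&x)) == 1
}

// decodeBF16Loop combines each byte pair by hand, then widens to float32.
func decodeBF16Loop(raw []byte) []float32 {
	out := make([]float32, len(raw)/2)
	for i := range out {
		bits := uint16(raw[2*i]) | uint16(raw[2*i+1])<<8
		out[i] = math.Float32frombits(uint32(bits) << 16)
	}
	return out
}

// decodeBF16View reads the payload through a []uint16 view of the same bytes.
// It falls back to the loop on big-endian hosts or misaligned input.
func decodeBF16View(raw []byte) []float32 {
	n := len(raw) / 2
	if n == 0 || !leHost() {
		return decodeBF16Loop(raw)
	}
	p := unsafe.Pointer(unsafe.SliceData(raw))
	if uintptr(p)%2 != 0 {
		return decodeBF16Loop(raw)
	}
	src16 := unsafe.Slice((*uint16)(p), n)
	out := make([]float32, n)
	for i, bits := range src16 {
		// bf16 is the top 16 bits of a float32.
		out[i] = math.Float32frombits(uint32(bits) << 16)
	}
	return out
}

// sameBits compares two slices by IEEE-754 bit pattern.
func sameBits(a, b []float32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if math.Float32bits(a[i]) != math.Float32bits(b[i]) {
			return false
		}
	}
	return true
}

func main() {
	raw := []byte{0x80, 0x3f, 0x00, 0xc0} // 1.0 and -2.0 in bf16, little-endian
	fmt.Println(decodeBF16View(raw), sameBits(decodeBF16Loop(raw), decodeBF16View(raw)))
}
```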

Filed-by: cladius
Co-Authored-By: Virgil <virgil@lethean.io>
…nsor [12]byte heap escape

The per-tensor `var trailer [12]byte` was force-escaping to heap on
every iteration because io.ReadFull hands buf to an interface-typed
io.Reader, which defeats stack allocation. On a qwen3-class manifest
(200 tensors) this costs 200 allocs per parseGGUF call.

Reusing the already-existing 64-byte `scratch` arena for the trailer
read keeps the io.ReadFull interface escape pinned to a single per-call
allocation that's amortised across all metadata + tensor reads.

Same bytes read in the same order — bit-exact parse.

BenchmarkInfo_ReadInfo_TypicalLayers (200 tensors):
  before: 484 allocs/op  58136 B/op
  after:  284 allocs/op  54944 B/op  (-41% allocs, -5.5% bytes)
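
The scratch-reuse idea can be sketched in isolation. This assumes a 12-byte record made of a little-endian uint32 type followed by a uint64 offset; `tensorTrailer`, `readTrailers` and the helpers are hypothetical names, not the parseGGUF internals.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// tensorTrailer is a hypothetical fixed-size record: type tag plus data offset.
type tensorTrailer struct {
	Type   uint32
	Offset uint64
}

// readTrailers reads n 12-byte records through one shared scratch array,
// so the buffer is set up at most once per call rather than once per record.
func readTrailers(r io.Reader, n int) ([]tensorTrailer, error) {
	var scratch [64]byte
	out := make([]tensorTrailer, 0, n)
	for i := 0; i < n; i++ {
		buf := scratch[:12]
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		out = append(out, tensorTrailer{
			Type:   binary.LittleEndian.Uint32(buf[0:4]),
			Offset: binary.LittleEndian.Uint64(buf[4:12]),
		})
	}
	return out, nil
}

// encodeTrailers builds the byte stream the reader expects.
func encodeTrailers(ts []tensorTrailer) []byte {
	var buf bytes.Buffer
	var rec [12]byte
	for _, t := range ts {
		binary.LittleEndian.PutUint32(rec[0:4], t.Type)
		binary.LittleEndian.PutUint64(rec[4:12], t.Offset)
		buf.Write(rec[:])
	}
	return buf.Bytes()
}

// parseTrailers wraps raw bytes in a reader and decodes n records.
func parseTrailers(raw []byte, n int) ([]tensorTrailer, error) {
	return readTrailers(bytes.NewReader(raw), n)
}

func main() {
	in := []tensorTrailer{{Type: 0, Offset: 0}, {Type: 8, Offset: 4096}}
	out, err := parseTrailers(encodeTrailers(in), len(in))
	fmt.Println(out, err)
}
```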
…ll alloc

The 24-byte header read used a dedicated `var header [24]byte` local
that escaped to heap separately from the existing `scratch [64]byte`.
Both serve identical purpose (io.ReadFull destination), so reuse the
single arena.

Moves the scratch declaration above the header read; the header
occupies scratch[:24] just long enough to extract magic / version /
counts before the metadata loop reuses scratch[:] for keys + values.

Same bytes read in the same order — bit-exact parse.

BenchmarkInfo_ReadInfo (across all three benches):
  -1 alloc/op, -24 B/op uniform
… alloc reduction

Apply the Slice4 / SliceUpdateInplace4 scalar-pass form established in the
previous commit to the remaining rank-4 slice/update sites in cache.go:
RotatingKVCache (5 sites in updateInplace + 5 in updateConcat),
FixedKVCache (4 sites in updateInplace + ReadState), and the PagedKVCache
helpers (cachePageView, borrowedPageView, borrowVisiblePage, prefill
materialization + cacheTail). 22 call sites converted in total.

Each site replaces two []int32{...} literal allocs per call with zero
heap allocs (scalars are register-passed; C wrapper materialises stack
buffers directly).

Bench deltas (selected):
  RotatingKVCache_512Prefill_Cap512     2499 ns / 9 allocs  →  2232 ns / 5 allocs  (-11% ns, -44% allocs)
  PagedKVCache_BoundedTo1024_PastCap                12286 allocs  →   8206 allocs  (-33%)
  PagedKVCache_4096Tokens_PageSize256_Prealloc      14511 allocs  →  14511 allocs  (no-find — locked by other pattern)
  QuantizedKVCache_4096Prefill_Q8Q8                    14 allocs  →     13 allocs  (-1)

The Quantized_4096Prefill drop is the W10-G fingerprint: each Slice4 site
saves 2 allocs; visible only where the call count overwhelmed the other
costs.

Co-Authored-By: Virgil <virgil@lethean.io>
…d-rolled JSON write — F32 -50% / F64 -44% / BF16 -31% / WriteSubset -38% allocs)
….0× ns

ScaledDotProductAttentionPaged transposes the K page on every iteration of
the page loop (Transpose(key, 0,1,3,2)). The variadic axes []int parameter
escapes to heap on each call (verified via -gcflags='-m'). On 16-page
attention that's 16 per-page alloc + 1 outer slice = 17 allocs.

Add Transpose4 scalar-pass form (W10-J pattern applied to transpose) and
wire SDPAPaged through it.

Before:
  SDPAPaged_2Pages_Page256_Q1_D128       65 B/op    2 allocs/op
  SDPAPaged_4Pages_Page256_Q1_D128      131 B/op    4 allocs/op
  SDPAPaged_8Pages_Page256_Q1_D128      328 B/op    9 allocs/op
  SDPAPaged_16Pages_Page256_Q1_D128     657 B/op   17 allocs/op

After:
  SDPAPaged_2Pages_Page256_Q1_D128        0 B/op    0 allocs/op  (-100%)
  SDPAPaged_4Pages_Page256_Q1_D128        2 B/op    0 allocs/op  (-100%)
  SDPAPaged_8Pages_Page256_Q1_D128       67 B/op    1 alloc/op   (-89%)
  SDPAPaged_16Pages_Page256_Q1_D128     136 B/op    1 alloc/op   (-94%)

The remaining 1 alloc at 8+ pages is the scorePages slice grow chain (cap
hint is len(keyPages) so this is a single allocation — the heap byte count
scales with page count).

ns/op steady — the W10-A transpose-axes-inline wrapper already won the
per-call cgo cost; this is alloc-floor reduction on decode-time attention
across pages, the PagedKVCache canonical hot path.

Co-Authored-By: Virgil <virgil@lethean.io>
Four core.NewError sites in resolveGGUFFile / parseGGUF / readGGUFString
previously rebuilt a fresh `*core.Err` per hit. Lifting to package
vars matches the W9-Y sweep pattern (safetensors/header_parse.go)
and ensures truncated/bogus GGUF files don't bleed allocs on the
error path during fleet probes.

Success-path benches unchanged; the wins land when ReadInfo hits
churn-grade error scenarios at scale (model discovery walking 1000+
directories with mixed valid/invalid candidates).

  errGGUFNoFile, errGGUFMultipleFiles, errGGUFInvalidMagic,
  errGGUFStringTooLong
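
As a rough illustration of the package-var lift, the sketch below uses the standard library's `errors.New` in place of `core.NewError` (a substitution, since that API is not shown here). The check functions and the `maxStringLen` cap are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinels are built once at package init and reused on every failure,
// so the error path performs no per-hit allocation.
var (
	errInvalidMagic  = errors.New("gguf: invalid magic")
	errStringTooLong = errors.New("gguf: string too long")
)

const (
	ggufMagic    = 0x46554747 // "GGUF" read as a little-endian uint32
	maxStringLen = 1 << 20    // hypothetical cap for this sketch
)

func checkMagic(magic uint32) error {
	if magic != ggufMagic {
		return errInvalidMagic
	}
	return nil
}

func checkStringLen(n uint64) error {
	if n > maxStringLen {
		return errStringTooLong
	}
	return nil
}

func main() {
	first := checkMagic(0)
	second := checkMagic(1)
	// Both failures return the identical sentinel value.
	fmt.Println(errors.Is(first, errInvalidMagic), first == second)
}
```

Callers can still compare with `errors.Is`, so the change is invisible to code that only inspects the error identity.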
…iew per tensor

readGGUFString allocated a fresh []byte (then string) per tensor name.
For a qwen3-class manifest (200 tensors) that's 200 separate heap
objects per parse, dominating the parseGGUF alloc count.

readTensorNameInto reads all names into a single 40 B/tensor slab,
then hands out zero-copy core.AsString views. The arena is sized once
and never grown, so existing name views stay valid for the lifetime
of the Info — same lifetime as one big buffer instead of N tiny ones.

Overflow path (name > 40 B headroom) falls back to per-tensor make to
preserve correctness — pre-existing views in the arena stay anchored.

Skips the intern-map probe for tensor names: every real GGUF tensor
name contains a layer index (`blk.<N>.<component>.<part>`), so the
intern hit-rate was zero on this path.

BenchmarkInfo_ReadInfo_TypicalLayers (200 tensors):
  before: 283 allocs/op  54920 B/op
  after:   84 allocs/op  59912 B/op  (-70% allocs, +9% bytes)

BenchmarkInfo_ReadInfo_VocabHeavy (50 tensors + vocab-heavy metadata):
  before: 673 allocs/op  44296 B/op
  after:  624 allocs/op  45544 B/op  (-7% allocs, +3% bytes)

The byte uptick is the arena's headroom budget; net alloc-count drop
matters more for GC pressure under fleet-level model-discovery churn.
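
The arena-plus-view approach can be sketched as follows. This assumes `unsafe.String` (Go 1.20+) as a stand-in for `core.AsString`; `nameArena` and `intern` are hypothetical names.

```go
package main

import (
	"fmt"
	"unsafe"
)

// nameArena hands out string views into one pre-sized byte slab.
type nameArena struct {
	buf []byte // len grows toward cap; cap is fixed at construction
}

func newNameArena(tensors, bytesPerTensor int) *nameArena {
	return &nameArena{buf: make([]byte, 0, tensors*bytesPerTensor)}
}

// intern copies name into the slab and returns a zero-copy string view.
// When the slab has no room left it falls back to an ordinary string
// allocation, so the slab is never reallocated and earlier views stay valid.
func (a *nameArena) intern(name []byte) string {
	if len(name) == 0 {
		return ""
	}
	if len(a.buf)+len(name) > cap(a.buf) {
		return string(name) // overflow path: one allocation for this name
	}
	start := len(a.buf)
	a.buf = append(a.buf, name...) // within capacity, so no reallocation
	return unsafe.String(&a.buf[start], len(name))
}

func main() {
	arena := newNameArena(2, 8)
	names := []string{
		arena.intern([]byte("blk.0.q")),
		arena.intern([]byte("blk.1.q")),
		arena.intern([]byte("blk.2.attn_output.weight")), // overflows the slab
	}
	fmt.Println(names, cap(arena.buf))
}
```

The views are only safe because the written bytes are never modified afterwards; any code that rewrote the slab in place would corrupt the strings.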
Snider and others added 29 commits May 23, 2026 03:54
W11-W: restoreFixedCacheSnapshot / restoreQuantizedCacheSnapshot /
restorePagedCacheSnapshot each constructed a `[]*Array{...}` literal
that the caller immediately `append(.., arrays...)`'d into evalArrays.
One slice alloc per cache restored, on a hot warm-restore path.

Added appendRestoreXxxCacheSnapshot siblings that take a caller-owned
dst slice and return the appended-into result. The old funcs become
one-line wrappers (test surface unchanged), the load-bearing callers
(prompt_cache.restorePromptCachesWithRequestFixedSize +
session.go.restoreSessionCaches) use the append form directly.

BenchmarkPromptCache_RestoreFixedCaches_26_Gemma4 (new):
  old path: 4170 B / 80 allocs
  new path: 3756 B / 54 allocs
  delta:    -414 B / -26 allocs (-1 alloc per cache, exact)

Pattern parallels W10-O appendCacheState — both kill per-cache literal
allocations on Gemma 4 fan-outs.
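
A small sketch of the append-into-dst shape described above, with toy types. `array`, `cacheSnapshot`, `appendRestored` and `restoreAll` are hypothetical stand-ins for the package's own functions.

```go
package main

import "fmt"

type array struct{ id int }

type cacheSnapshot struct{ keys, values *array }

// appendRestored appends the snapshot's arrays to a caller-owned slice
// and returns the extended slice, so no temporary literal is built.
func appendRestored(dst []*array, s cacheSnapshot) []*array {
	return append(dst, s.keys, s.values)
}

// restored keeps the old signature as a one-line wrapper.
func restored(s cacheSnapshot) []*array {
	return appendRestored(nil, s)
}

// restoreAll sizes the destination once and lets every cache append into it.
func restoreAll(snaps []cacheSnapshot) []*array {
	evalArrays := make([]*array, 0, 2*len(snaps))
	for _, s := range snaps {
		evalArrays = appendRestored(evalArrays, s)
	}
	return evalArrays
}

func main() {
	snaps := []cacheSnapshot{
		{&array{0}, &array{1}},
		{&array{2}, &array{3}},
	}
	fmt.Println(len(restoreAll(snaps)), len(restored(snaps[0])))
}
```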

Co-Authored-By: Virgil <virgil@lethean.io>
W11-W: newPromptCacheEntryWithHidden + newPromptCacheEntryFromKVBlocks
allocated a `snapshotOffsets []int` slice on every entry, but the
slice was only ever read on the failure path of evalPromptCacheArrays
(inside the labelAt closure).

Removed the eager build. labelAt now recomputes `(cache_index,
state_index)` by walking entry.caches and summing arrayCount() until
the requested array index is crossed. Same Sprintf output, same error
shape — the recompute only runs on the (cold) failure branch.

Save: 1 alloc per snapshot/restore (storePromptCache after prefill +
block-source restore). On Gemma 4 26-cache flow with a snapshot+restore
per turn, ~2 allocs/turn dropped.
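
The failure-path recompute could look roughly like this. It assumes each cache reports how many arrays it contributes; `cacheEntry`, `locate` and the label format are hypothetical.

```go
package main

import "fmt"

// cacheEntry records how many state arrays one cache contributes.
type cacheEntry struct{ arrays int }

// locate maps a flat array index to (cacheIndex, stateIndex) by walking
// the caches and summing their array counts until the index is crossed.
func locate(caches []cacheEntry, flat int) (cacheIndex, stateIndex int, ok bool) {
	if flat < 0 {
		return 0, 0, false
	}
	offset := 0
	for i, c := range caches {
		if flat < offset+c.arrays {
			return i, flat - offset, true
		}
		offset += c.arrays
	}
	return 0, 0, false
}

// labelAt is only called when an eval fails, so the walk stays off the hot path.
func labelAt(caches []cacheEntry, flat int) string {
	ci, si, ok := locate(caches, flat)
	if !ok {
		return fmt.Sprintf("array %d", flat)
	}
	return fmt.Sprintf("cache %d state %d", ci, si)
}

func main() {
	caches := []cacheEntry{{2}, {4}, {2}}
	fmt.Println(labelAt(caches, 5))
}
```

The walk is linear in the number of caches, which is acceptable because it runs only while formatting an error.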

Co-Authored-By: Virgil <virgil@lethean.io>
…Prefix 3→0 allocs, 26-cache Gemma 4 restore -156 allocs, RoundTrip -17% ns + -33B; Zeros4 convergent with W11-T)

# Conflicts:
#	go/internal/metal/array.go
…ingle cgo crossing

DispatchOne folds the entire (config_new + set_grid + set_thread_group +
add_output_arg + apply + size + get + config_free) sequence into a single
cgo crossing via mlx_fast_metal_kernel_dispatch_one_inline.  Every
production single-output MetalKernel caller in this package follows the
fresh-cfg-per-call pattern: 6 cgo crossings (config_new, set_grid,
set_thread_group, add_output_arg, apply_one, config_free) collapse to 1.

The MetalKernelConfig Go wrapper escapes to heap on every NewMetalKernelConfig
call — escape analysis shows `&MetalKernelConfig{...} escapes to heap in
NewMetalKernelConfig`.  DispatchOne removes the wrapper entirely from the
per-call path: the C config handle is born and freed inside the inline-C
wrapper, leaving zero Go-side allocs on the dispatch frame.

MetalKernelGrid wraps the grid + thread-group dimension sextuple — keeps
the call signature readable and prevents accidental swap between the two
triples.

Per-call savings stack with W11-V-A (inputVec inline-C) and W11-V-B
(ApplyOne size+get inline-C):
  pre-W11-V          ApplyOne path      DispatchOne path
  11 cgo crossings   6 cgo crossings    2 cgo crossings
  (expert_id_matvec MoE tiny: 5 inputs, 1 output, fresh cfg per call)

DispatchOne keeps the same pooled output-vec holder + pooled input-handle
scratch primitives from ApplyOne — no new allocations introduced.

Parity verified bit-exact via TestMetalKernel_DispatchOne_Parity_Good
against the cfg-driven ApplyOne path on the same kernel + inputs.

Callers migrate in a follow-up commit.

Co-Authored-By: Virgil <virgil@lethean.io>
…op cfg ceremony

The 11 single-output MetalKernel callers migrated from ApplyOne in the
prior commit now move to DispatchOne, eliminating the per-call MetalKernelConfig
wrapper.

Migrated sites:
  expert_id_matvec.go × 4 — quantized expert ID matvec / GELU gate-up /
    split gate-up / weighted matvec sum (Gemma 4 26B MoE forward path)
  dense_matvec.go × 2 — quantized dense matvec / GELU split gate-up
  jang_dequant.go × 2 — JANG dequant / packed linear fused (MiniMax M2)
  codebook_vq.go × 1 — codebook VQ matvec
  gemma4_ffn_residual.go × 1 — native Gemma 4 FFN residual fuse
  gemma4_router_topk.go × 1 — native Gemma 4 router matvec scores path

The 2-output gemma4_router_topk top-k path stays on Apply (cfg + multi-output
not yet covered by DispatchOne).

Per-call drop: 6 cgo crossings → 2 (and 1 fewer Go heap alloc since the
MetalKernelConfig wrapper no longer escapes).

Bit-exact correctness preserved (same MLX kernel under the hood;
DispatchOne is a structural collapse, not a semantic change).

Co-Authored-By: Virgil <virgil@lethean.io>
…itives — apply_inline + ApplyOne + DispatchOne; MoE_RouterProjection_H2048_E32 -21.6%, geomean -37% B/op -29.3% allocs; 23 caller migrations)
Completes the W10-O Slice4/Transpose4 scalar-pass family at the rank-1/2
frontier W11-U flagged as the next residual. packQ4Cached / unpackQ4 /
maxAll currently call `Reshape(arr, int32(n))` and
`Reshape(arr, int32(pairs), int32(2))` where the variadic []int32
escapes to heap on every Q4 K/V Update + every dequant + every
quantise-max boundary.

Reshape1(arr, n int32) and Reshape2(arr, h, w int32) route through new
mlx_reshape_inline_1 / mlx_reshape_inline_2 C wrappers that materialise
the 1/2-element shape buffer on the C stack directly from
register-passed scalars — eliminating the slice escape entirely. Same
W10-J / W11-A pattern, lower rank.

- Reshape1: 1 → 0 allocs/op, -4 B/op (BenchmarkReshape1_Scalar)
- Reshape2: 1 → 0 allocs/op, -8 B/op (BenchmarkReshape2_Scalar)

TestOps_Reshape1_Parity and TestOps_Reshape2_Parity lock bit-exact
equivalence with the variadic Reshape path across small/single/large
shapes (mirrors W11-F TestOps_ScalarBridge_Parity discipline).

Caller migration follows in a subsequent commit.

Co-Authored-By: Virgil <virgil@lethean.io>
…imitives (W11-AC)

Completes the rank-1/2 scalar-pass slice family alongside the existing
Slice4 / SliceUpdateInplace4 wrappers. packQ4Cached pays the largest
hidden tax: `SliceAxis(paired, 1, 0, 1)` + `SliceAxis(paired, 1, 1, 2)`
each go through SliceAxis which allocates `make([]int32, ndim)` twice
per call — ~4 slice heap allocs per Q4 K/V Update on the V side alone.

- Slice1(arr, s0, e0) routes via mlx_slice_inline_1
- Slice2(arr, s0,s1, e0,e1) routes via mlx_slice_inline_2
- SliceUpdateInplace2(arr, upd, s0,s1, e0,e1) routes via
  mlx_slice_update_inline_2

All three materialise the C stack buffers directly from
register-passed scalars, eliminating the slice-literal escape entirely.
strides remain implicitly 1 (matches the wider Slice* convention).

Bench evidence (BenchmarkSlice2_*):
- SliceAxis-32      499.2 ns/op  16 B/op  2 allocs/op   (legacy)
- Variadic-32       391.6 ns/op   0 B/op  0 allocs/op   (pre-built slices)
- Scalar-32         376.3 ns/op   0 B/op  0 allocs/op   (new)

The Scalar / SliceAxis delta — -2 allocs and ~25% faster — is what
packQ4Cached gains per call site, twice per Q4 store.

TestSlice_Slice1_Parity / TestSlice_Slice2_Parity /
TestSlice_SliceUpdateInplace2_Parity lock bit-exact equivalence with
the variadic Slice / SliceUpdateInplace paths across prefix / suffix /
middle / column / row / submatrix shapes.

Caller migration follows in subsequent commits.

Co-Authored-By: Virgil <virgil@lethean.io>
W11-AD primitive: stream-passing siblings of Slice4 /
SliceUpdateInplace4 so per-token loops can hoist the DefaultStream()
lookup outside the loop.  Mirrors the W10/W11 fixedKVCacheSlice4D
pattern: KVCache.Update issues four Slice4-family calls per token,
each of which currently resolves the default stream independently
(RWMutex.RLock + atomic cached-device load + GPU/CPU branch).

The existing Slice4 / SliceUpdateInplace4 keep working unchanged —
they now delegate to the *WithStream sibling with DefaultStream()
resolved once.  Parity tests verify bit-exact output across the
KV-cache rank-4 slice geometry.

Co-Authored-By: Virgil <virgil@lethean.io>
…cheUpdate (W11-Y)

Add direct bench coverage for two fast.go decode-time hot paths that
had no prior benchmark surface:

- nativePagedSingleTokenAttention at 2/4/8/16 pages on Page256,
  matching the existing SDPAPaged page-count sweep.
- singleTokenCacheUpdate at Heads8/Cap512 (decode) and Heads32/Cap4096
  (larger-LM decode).

Both surfaces are touched once per layer per decode step, so cgo-cost
deltas (page-handle scratch, shape-buf scratch) need a direct bench to
land as a visible signal — the existing SDPAPaged benches cover the
Go-side ScaledDotProductAttentionPaged path which doesn't fall through
to the native wrapper.

Co-Authored-By: Virgil <virgil@lethean.io>
…11-Y)

Replace the two per-call C.calloc / C.free trips that
nativePagedSingleTokenAttention used to hand mlx_array runs to the
native paged-attention wrapper with a sync.Pool of *[]C.mlx_array
slices, mirroring the W11-T scorePages pattern.

The native wrapper consumes the page-handle buffer synchronously, so
the slice goes back to the pool the moment the cgo call returns; the
buffer can therefore be Go-heap-resident (no growth survives a single
call). 16-capacity New() matches typical PagedKVCache page counts
during decode; larger sweeps grow the backing array and the pool
reuses the grown slot on subsequent calls.

Measured at -benchtime=200ms -count=3 on Apple M3 Ultra:

  BenchmarkNativePagedSingleToken_2Pages_Page256   ~702 -> ~331 us  -53 percent
  BenchmarkNativePagedSingleToken_4Pages_Page256   ~720 -> ~410 us  -43 percent
  BenchmarkNativePagedSingleToken_8Pages_Page256  ~1020 -> ~578 us  -43 percent
  BenchmarkNativePagedSingleToken_16Pages_Page256 ~1530 -> ~830 us  -46 percent

Byte/alloc counts unchanged at 0 allocs/op on both sides (C.calloc
allocations do not show up in Go's benchmem); the win is pure
wall-clock cgo overhead removed from the decode hot path. Test gate
clean under -race -short across the metal package.
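
A pure-Go sketch of the pooled scratch pattern, assuming the consumer uses the buffer synchronously. `handle` stands in for `C.mlx_array`, and `withHandles` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync"
)

type handle uintptr // stand-in for a C array handle

// handlePool stores pointers to slices so Put does not box a slice header.
var handlePool = sync.Pool{
	New: func() any {
		s := make([]handle, 0, 16)
		return &s
	},
}

// withHandles borrows a scratch slice, fills it, runs fn synchronously,
// then returns the (possibly grown) slice to the pool.
func withHandles(pages []handle, fn func([]handle) int) int {
	sp := handlePool.Get().(*[]handle)
	buf := append((*sp)[:0], pages...)
	result := fn(buf)
	*sp = buf[:0] // keep any grown backing array for the next caller
	handlePool.Put(sp)
	return result
}

func main() {
	sum := func(hs []handle) int {
		total := 0
		for _, h := range hs {
			total += int(h)
		}
		return total
	}
	fmt.Println(withHandles([]handle{1, 2, 3}, sum))
}
```

This is only safe when `fn` does not retain the slice after returning, which matches the synchronous-consumption condition stated above.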

Co-Authored-By: Virgil <virgil@lethean.io>
Migrates the Q4 storage hot path to the W11-AC rank-1/2 scalar-pass
primitives — the variadic-Reshape escapes and the SliceAxis
`make([]int32, ndim)` materialisations W11-U flagged are now gone:

- Reshape(q, int32(n))                  → Reshape1(q, int32(n))
- Reshape(padded, int32(pairs), int32(2)) → Reshape2(padded, int32(pairs), 2)
- Reshape(packed2D, int32(pairs))       → Reshape1(packed2D, int32(pairs))
- SliceAxis(paired, 1, 0, 1)            → Slice2(paired, 0, 0, int32(pairs), 1)
- SliceAxis(paired, 1, 1, 2)            → Slice2(paired, 0, 1, int32(pairs), 2)

Cache bench impact (BenchmarkQuantizedKVCache_Append_SingleToken_Q8Q4,
benchtime=200ms, count=3):

  baseline       13404 B/op  1412 allocs/op
  after W11-AC    7274 B/op   516 allocs/op   (-46% B, -63% allocs)

The Q8Q8 path is unchanged (no Q4 storage, no packQ4Cached call) — the
~900 alloc/call reduction comes entirely from the per-call escape
elimination on the V-side packQ4 sequence. Numerical equivalence is
locked by the rank-1/2 parity tests added alongside the primitives.
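
To make the SliceAxis to Slice2 rewrites listed above concrete, here is a toy row-major model with no MLX involved. `matrix`, `slice2` and `sliceAxis` are hypothetical and follow the `(s0, s1, e0, e1)` argument order used in this message.

```go
package main

import "fmt"

// matrix is a toy row-major rank-2 array.
type matrix struct {
	rows, cols int
	data       []int32
}

// slice2 copies rows [s0,e0) and columns [s1,e1) using scalar bounds.
func slice2(m matrix, s0, s1, e0, e1 int) matrix {
	out := matrix{rows: e0 - s0, cols: e1 - s1}
	out.data = make([]int32, 0, out.rows*out.cols)
	for r := s0; r < e0; r++ {
		out.data = append(out.data, m.data[r*m.cols+s1:r*m.cols+e1]...)
	}
	return out
}

// sliceAxis restricts one axis to [start,end) and keeps the other whole.
// It builds full start/stop vectors first, as the legacy path does.
func sliceAxis(m matrix, axis, start, end int) matrix {
	starts := []int{0, 0}
	stops := []int{m.rows, m.cols}
	starts[axis], stops[axis] = start, end
	return slice2(m, starts[0], starts[1], stops[0], stops[1])
}

func equal(a, b matrix) bool {
	if a.rows != b.rows || a.cols != b.cols || len(a.data) != len(b.data) {
		return false
	}
	for i := range a.data {
		if a.data[i] != b.data[i] {
			return false
		}
	}
	return true
}

func main() {
	paired := matrix{rows: 3, cols: 2, data: []int32{1, 2, 3, 4, 5, 6}}
	pairs := paired.rows
	first := slice2(paired, 0, 0, pairs, 1)  // column 0 of every pair
	second := slice2(paired, 0, 1, pairs, 2) // column 1 of every pair
	fmt.Println(first.data, second.data)
	fmt.Println(equal(first, sliceAxis(paired, 1, 0, 1)))
}
```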

Co-Authored-By: Virgil <virgil@lethean.io>
…cast (go vet clean)

The mlx `void* payload` slot carries a synthetic uintptr identifier (a
sync.Map key, not a Go pointer). The Go-side `unsafe.Pointer(id)` cast
tripped `go vet`'s unsafeptr check at pinned_array.go:184 — the
warning was flagging a real Go-spec rule (uintptr→unsafe.Pointer is
unsafe) for code where the integer was never a pointer to begin with.

Adopted the runtime/cgo.Handle pattern: the C-visible Go signature is
now `uintptr_t payload` end-to-end (call site + dtor callback). The
`void* ↔ uintptr_t` widening happens inside the C++ bridge where it
satisfies mlx's signature without putting an unsafe.Pointer cast in
Go-visible code. vet's unsafeptr check can now see this is not a Go
pointer crossing the boundary.

  - pinned_array.go: extern signature `uintptr_t payload`, callsite
    `C.uintptr_t(id)`, callback `goPinnedRawArrayRelease(C.uintptr_t)`
  - pinned_array_bridge.cpp: param `uintptr_t payload_id`, internal
    `reinterpret_cast<void*>` before handing to mlx + dtor

Verified:
  - go vet ./go/internal/metal/... clean (was: pinned_array.go:184:
    possible misuse of unsafe.Pointer)
  - go test ./go/internal/metal/... -race -short passes
  - bit-exact: same numeric value flows id → uintptr_t → void* → ... →
    void* → uintptr_t → uintptr across the round trip
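
The Go side of the integer-handle idea can be sketched without cgo. `register` and `release` are hypothetical; the real code passes the id through C as `uintptr_t`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	registry sync.Map // uintptr id -> payload
	nextID   uintptr
)

// register stores payload and returns an opaque integer id. The id is a
// map key, never the address of a Go object, so it needs no unsafe.Pointer.
func register(payload any) uintptr {
	id := atomic.AddUintptr(&nextID, 1)
	registry.Store(id, payload)
	return id
}

// release is what a destructor callback would call with the id it was given.
// It removes the entry and reports whether the id was still registered.
func release(id uintptr) (any, bool) {
	return registry.LoadAndDelete(id)
}

func main() {
	id := register([]byte("pinned bytes"))
	payload, ok := release(id)
	fmt.Println(string(payload.([]byte)), ok)
	_, again := release(id)
	fmt.Println(again)
}
```

Because the id is a plain integer, the garbage collector never treats it as a reference; the registry entry is what keeps the payload alive until release.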

Co-Authored-By: Virgil <virgil@lethean.io>
…ous float32

W11-AE adds a fast-path helper materialiseFloat32ViewFast(arr) ([]float32, func(), error) that bypasses the legacy materialiseFloat32View ceremony when arr.Dtype() == DTypeFloat32 && arr.IsRowContiguous(). On the fast-path:

  * Zero AsType cgo crossing (dtype already matches).
  * Zero Contiguous cgo crossing (layout already row-major).
  * Zero Materialize cgo crossing (caller already evaluated the tensor; the dtype + contiguity proof IS the post-Eval invariant for a valid float32 backing store).

The helper falls through to materialiseFloat32View when either gate fails, preserving the full conversion-and-contiguous ceremony. Lifecycle contract is wrapped in a cleanup closure (runtime.KeepAlive on fast-path; KeepAlive + Free(converted) on slow-path) so callers just defer cleanup() once.

Measured on M3 Ultra vs *Array.Floats() at 5 size points (128B / 1KB / 10KB / 100KB / 1MB):

  Floats()        FastView        Delta
  320 ns / 129B   170 ns / 17B    -47% ns, -87% B (128B)
  452 ns / 1025B  207 ns / 17B    -54% ns, -98% B (1KB)
  1768 ns / 10KB  216 ns / 17B    -88% ns, -99% B (10KB)
  21000 ns/100KB  217 ns / 17B    -99% ns, -99.98% B (100KB)
  226000 ns/1MB   205 ns / 17B    -99.9% ns, -99.998% B (1MB)

The fast-path wins at every size — even 128B, where the W11-X note suggested the cgo overhead would exceed savings. The W11-X comparison was against the slow-path helper (270 ns) that still pays the Materialize crossing; dropping that crossing inverts the verdict.

The 17 B / 2 allocs floor is the cleanup closure escape (capturing arr forces the closure to heap). API shape mandated by the brief; the latency win dwarfs the alloc cost.

Tests added: FastPath (bit-exact), SlowPathDtype (float16 round-trip), LegacyParity (fast vs slow on identical input), NonContiguous (slow-path fall-through for sliced views).

Bench helpers added: BenchmarkMaterialiseFloat32View_Floats_NB / _Slow_NB / FastView_NB across 5 size points so the threshold can be characterised without re-measuring.
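
A toy model of the fast-path / slow-path split and the single cleanup closure. This sketch assumes the tensor is already evaluated and does not model Eval; `tensor`, `toContiguousFloat32` and `float32ViewFast` are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// tensor is a toy stand-in for an evaluated array.
type tensor struct {
	isFloat32     bool
	rowContiguous bool
	data          []float32
	freed         bool
}

// toContiguousFloat32 models the slow-path conversion by copying the data.
func toContiguousFloat32(t *tensor) *tensor {
	copied := append([]float32(nil), t.data...)
	return &tensor{isFloat32: true, rowContiguous: true, data: copied}
}

// float32ViewFast returns a view plus a cleanup func the caller defers once.
// Fast path: borrow the backing store. Slow path: convert, free on cleanup.
func float32ViewFast(t *tensor) ([]float32, func(), error) {
	if t == nil {
		return nil, func() {}, errors.New("nil tensor")
	}
	if t.isFloat32 && t.rowContiguous {
		return t.data, func() { runtime.KeepAlive(t) }, nil
	}
	converted := toContiguousFloat32(t)
	cleanup := func() {
		converted.freed = true // stands in for freeing the converted copy
		runtime.KeepAlive(t)
	}
	return converted.data, cleanup, nil
}

func main() {
	t := &tensor{isFloat32: true, rowContiguous: true, data: []float32{1, 2, 3}}
	view, cleanup, err := float32ViewFast(t)
	if err != nil {
		panic(err)
	}
	defer cleanup()
	fmt.Println(len(view), &view[0] == &t.data[0])
}
```

The caller sees one signature regardless of which path ran, which is the point of wrapping the lifecycle in a closure.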

Co-Authored-By: Virgil <virgil@lethean.io>
KVCache.Update + RotatingKVCache.updateInPlace / updateConcat now
resolve DefaultStream() once per Update and pass it through to the
Slice4WithStream / SliceUpdateInplace4WithStream siblings.  Each
Update issues 4-6 Slice4-family calls; this collapses 4-6 RWMutex
RLock+RUnlock + cached-device atomic loads to one.

Bench (Apple M3 Ultra, -benchtime=200ms ×3, -benchtime=2s ×5):
KVCache_Append_SingleToken_To512: 1028 allocs unchanged (the W11-U
forward note's hypothesis that DefaultStream hoisting would reduce
allocs was misdiagnosed — DefaultStream returns a cached *Stream
singleton without alloc; the 1028 allocs come from newArray's
runtime.SetFinalizer path).  ns/op delta is below the Metal
thermal/cache noise floor at this bench: benchstat reports all
deltas with p~={0.1-1.0} (no statistical significance).

Architectural win is real and verifiable in code: 1 DefaultStream
lookup per Update instead of 4-6.  Sets up future measurable wins
once the Metal-side variance is reduced (kernel JIT warmup,
thermal stabilisation), and reduces lock-acquisition pressure in
concurrent decode scenarios that current single-goroutine benches
don't exercise.  Allocs neutral, ns/op neutral within noise.
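
The hoisting can be pictured with a small stand-in. The lookup counter exists only to make the effect observable; `defaultStream`, `sliceRowsWithStream` and `update` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type stream struct{ name string }

var (
	streamMu      sync.RWMutex
	currentStream = &stream{name: "gpu"}
	lookups       atomic.Int64 // counts lookups, for illustration only
)

// defaultStream models a lock-guarded lookup of the active stream.
func defaultStream() *stream {
	lookups.Add(1)
	streamMu.RLock()
	defer streamMu.RUnlock()
	return currentStream
}

// sliceRowsWithStream is the stream-passing sibling.
func sliceRowsWithStream(rows [][]float32, start, end int, s *stream) [][]float32 {
	_ = s // a real backend would enqueue the op on s
	return rows[start:end]
}

// sliceRows keeps the old signature and resolves the stream itself.
func sliceRows(rows [][]float32, start, end int) [][]float32 {
	return sliceRowsWithStream(rows, start, end, defaultStream())
}

// update issues four slice ops but resolves the stream only once.
func update(rows [][]float32) int {
	s := defaultStream()
	total := 0
	for i := 0; i < 4; i++ {
		total += len(sliceRowsWithStream(rows, 0, i+1, s))
	}
	return total
}

func lookupCount() int64 { return lookups.Load() }
func resetLookups()      { lookups.Store(0) }

func main() {
	rows := make([][]float32, 4)
	resetLookups()
	fmt.Println(update(rows), lookupCount())
}
```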

Co-Authored-By: Virgil <virgil@lethean.io>
Migrates the Q4 dequantise rank-1 boundaries to the W11-AC scalar-pass
primitives:

- Reshape(stacked, int32(flatLen))                  → Reshape1
- Slice(flat, []int32{0}, []int32{int32(n)})        → Slice1

The first call is on every Q4 dequant; the second is the (rare)
odd-length tail-trim branch (flatLen > n) reached when the source
element count is odd.

The final `Reshape(signed, shape...)` retains the variadic form because
the shape comes from the caller as a slice of arbitrary rank — no
fixed-rank scalar-pass equivalent applies.

The dequant path is invoked by dequantizedState() on every Update
that misses the float cache, and by ReadState() on every snapshot read,
so the saved variadic-slice escape compounds across the read-side hot
path. Numerical equivalence preserved (see W11-AC parity tests).
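
For readers unfamiliar with the odd-length tail trim, here is a plain-Go model of packing signed 4-bit values two per byte. The nibble order (first value in the low nibble) is an assumption of this sketch, not a statement about the cache's layout.

```go
package main

import "fmt"

// packQ4 packs signed 4-bit values in [-8, 7] two per byte. An odd-length
// input leaves the final high nibble as zero padding.
func packQ4(vals []int8) []byte {
	pairs := (len(vals) + 1) / 2
	out := make([]byte, pairs)
	for i, v := range vals {
		nib := byte(v) & 0x0F
		if i%2 == 0 {
			out[i/2] |= nib
		} else {
			out[i/2] |= nib << 4
		}
	}
	return out
}

// signExtend widens a 4-bit two's-complement nibble to int8.
func signExtend(nib byte) int8 {
	return int8(nib<<4) >> 4
}

// unpackQ4 expands every byte into two values, then trims the padded tail
// when the flat length exceeds the original element count n.
func unpackQ4(packed []byte, n int) []int8 {
	flat := make([]int8, 0, 2*len(packed))
	for _, b := range packed {
		flat = append(flat, signExtend(b&0x0F), signExtend(b>>4))
	}
	if len(flat) > n {
		flat = flat[:n] // odd-length tail trim
	}
	return flat
}

func main() {
	vals := []int8{-8, -1, 0, 7, 3}
	packed := packQ4(vals)
	fmt.Println(len(packed), unpackQ4(packed, len(vals)))
}
```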

Co-Authored-By: Virgil <virgil@lethean.io>
ScaledDotProductAttention and ScaledDotProductAttentionWithMask each
paid a C.CString allocation plus the matching deferred C.free on
every invocation, even though only three mask_mode values are ever
passed: "" (default), "causal", and "array". Cache the corresponding
C strings at package init and reuse them across calls.

Safety: the mlx-c wrapper at lib/mlx-c/mlx/c/fast.cpp wraps the
incoming mask_mode pointer in std::string(mask_mode) before passing
it to the C++ scaled_dot_product_attention op, so the underlying C
buffer is copied synchronously and the cached pointers can be shared
across goroutines without locking. Race-short gate is clean.

Per-call delta is below benchmem resolution (the C.malloc + C.free
pair don't show in Go's allocs/op metric, and the cgo overhead saved
is ~200 ns against a 260 us SDPA call — within noise on a single
bench), but the change is structurally cleaner and avoids 2× cgo
crossings per SDPA call. Under decode (32+ layers × N tokens) the
saved crossings compound into the wall-clock budget.

Co-Authored-By: Virgil <virgil@lethean.io>
…ce4WithStream substrate consistency; A5 diagnosis: 1028 alloc floor is newArray SetFinalizer + arrayPool type-assertion, not DefaultStream)
…eLogitsCompact

W11-AE migrates the W11-X-rejected site at probe.go:362 — topValues.Floats() copies a topK-length buffer (~32 B for topK=8) into a fresh Go slice via 2× Materialize cgo crossings + per-element loop.  The fast-path returns a borrowed MLX-memory view in ~170 ns / 17 B (closure escape floor).

W11-X rejected this site against the slow-path helper (270 ns floor) because the cgo overhead exceeded the 32-byte saving.  The new fast-path drops the unconditional final Materialize crossing — the dtype + contiguity check IS the post-Eval proof of a valid backing store — and now wins at every size including 128 B (vs Floats(): -47% latency, -87% bytes).

Measured BenchmarkSummarizeProbeLogitsCompact_Gemma on M3 Ultra:

  Before  ~833 µs / 715 B / 20 allocs
  After   ~594 µs / 697 B / 20 allocs   (-29% ns, -18 B, allocs same)

Alloc count unchanged because the Floats() copy was 2 allocs (slice + cgo intermediate) and the fast-path cleanup closure escapes to heap (2 allocs). The byte win comes from the topK=8 slice; the latency win comes from the dropped Materialize crossings.

TakeAlongAxis preserves dtype (float32) and the prior Eval guarantees a valid backing store, so the fast-path conditions hold structurally — no runtime risk.

Co-Authored-By: Virgil <virgil@lethean.io>
Replaces `Reshape(a, int32(n))` with `Reshape1(a, int32(n))` — rank-1
scalar-pass skips the variadic []int32 heap escape on the
quantise-max-abs boundary.

maxAll is called by quantizeCacheArrayCached on every K and every V
Update (two calls per cache Update), so this is the dominant alloc
reduction on the Q8 cache append path. Combined with the W11-AC
packQ4Cached + unpackQ4 migrations and prior W10-A substrate work, the
cache bench impact (BenchmarkQuantizedKVCache_Append_SingleToken,
benchtime=200ms, count=3):

  baseline (a5c82d0)
    Q8Q8   7288 B/op   516 allocs/op
    Q8Q4  13404 B/op  1412 allocs/op

  after W11-AC migrations
    Q8Q8   6244 B/op   260 allocs/op   (-14% B, -50% allocs)
    Q8Q4   6251 B/op   260 allocs/op   (-53% B, -82% allocs)

The Q8Q4 path converges on the Q8Q8 alloc floor (260 allocs/op
identical) — the rank-1/2 scalar-pass primitives close the gap the
prior pure-Q8 path already enjoyed.

4096-prefill (steady-state) bench:
    Q8Q8    122 B/op   8 allocs/op (baseline)
    Q8Q8    114 B/op   6 allocs/op (after)   (-7% B, -25% allocs)

Co-Authored-By: Virgil <virgil@lethean.io>
…B/op (-53%)

W10-O wired the cgo-scratch pools but left the per-call
`&pinnedRawArrayBuffer{}` heap alloc on the hot path. This lane pools
the buffer struct end-to-end: register Gets from the pool + sets `raw`,
unregister Releases the view + clears `raw` and Puts back.

Lifetime safety: the buffer travels through mlx as the dtor payload and
only returns to the pool after the mlx-side release callback fires — at
which point PinnedView.Release has zeroed the pinner state. Clearing
`buffer.raw` on Put is critical so the recycled struct does not hold a
stale reference to the previous call's slice (the underlying bytes
need to be GC-eligible the moment mlx hands the array back).

Bench delta @ benchtime 300ms, M3 Ultra:

  L1     120->56 B/op   3->2 allocs
  L32    120->56 B/op   3->2 allocs
  L512   120->56 B/op   3->2 allocs
  L4096  120->56 B/op   3->2 allocs
  L16384 120->56 B/op   3->2 allocs
  Gemma4Global_L4096    120->56 B/op  3->2 allocs
  Gemma4LocalWindow_L512 120->56 B/op  3->2 allocs
  Strided_Subview_L4096  120->56 B/op  3->2 allocs

Remaining 56 B + 1 alloc is the `sync.Map.Store` entry node, which
needs a different data structure to eliminate (out of scope for a
residual lane).

Verified:
  - go test ./go/internal/metal/... -race -short passes
  - go vet stays clean
  - bit-exact: same numeric id flows + same data pointer returned;
    Pinner reset path is the documented contract for runtime.Pinner

Co-Authored-By: Virgil <virgil@lethean.io>
Add Cap512 and Cap4096 benches for singleTokenCausalMask so the
next visitor can see the surface without re-deriving it. W11-Y
exercised these benches while investigating whether to cache the 0 /
-1e9 scalars at package scope; the cached variant regressed by
~57 percent at both capacities because MLX's Where op pays
refcount-management overhead when the same scalar arrays are
aliased across many invocations. A5-honest revert on the cache,
benches kept.

  cached vs baseline (-benchtime=300ms -count=5 on M3 Ultra):
    Cap512  ~238 us baseline / ~373 us cached  (+57 percent)
    Cap4096 ~239 us baseline / ~375 us cached  (+57 percent)

Co-Authored-By: Virgil <virgil@lethean.io>
…liceUpdateInplace2 scalar-pass primitives — QuantizedKVCache_Q8Q4 -82% allocs / -53% B, Q8Q8 -50% allocs; scalar-pass family complete at ranks {1,2,4})

# Conflicts:
#	go/internal/metal/slice_test.go
…time/cgo.Handle pattern + pinnedRawArrayBuffer sync.Pool — all KV shapes 120 B/3 allocs → 56 B/2 allocs -53%/-33%)
…Pool — 2Pages -53%, 4-16Pages -43-46%; SDPA mode-string cache; A5 reverts on ShapeInto + scalar-cache showing cgo-stack-array + MLX-Where pitfalls)
…al arr

A5-honest discovery from migrating hostUnsuppressedGreedyToken: the legacy materialiseFloat32View helper called Materialize on src unconditionally at the end, which silently covered callers that passed lazy (un-Eval'd) tensors.  The fast-path deliberately skips that Materialize crossing — so accessing the raw float32 backing store of an un-Eval'd array segfaults.

TestSample_HostUnsuppressedGreedyTokenMaterializesLazyFloat32_Good caught this regression immediately.  Reverting sample.go to the legacy helper + documenting the contract explicitly on materialiseFloat32ViewFast.

Callers safe for the fast-path:
  * probe.go summarizeProbeLogitsCompact — explicit Eval(topIndices, topValues, ...) before
  * generate.go inspectAttentionCache — explicit Eval(kSliced) before
  * kv_snapshot.go inspectKVCacheRangeWithOptions — explicit Eval(kSliced, vSliced) before
Callers unsafe for the fast-path (must stay on legacy):
  * sample.go hostUnsuppressedGreedyToken — receives logits from sampler chain, may be lazy

The threshold note also updates: 10-100KB benches show -88% to -99% latency vs Floats() (slow-path delta dominated by the skipped Materialize crossing on large tensors, not just per-call overhead).

Co-Authored-By: Virgil <virgil@lethean.io>
…entionCache

W11-AE migrates the W11-X-installed materialiseFloat32View call to the fast-path variant.  kSliced is explicit-Eval'd at line 1139 immediately before, so the fast-path contract holds.  Cache K tensors are normally DTypeFloat32 + row-contiguous (Slice preserves row-major when slicing axis 0), so the fast-path fires; quantised caches fall through to the legacy materialiseFloat32View ceremony unchanged.

Measured BenchmarkInspectAttentionCache_Realistic (32 heads x 1024 tokens x 128 head_dim = 16 MB):

  Before  ~3.2 ms / 16.78 MB / 43 allocs
  After   ~735 µs / 16.78 MB / 41 allocs   (-77% ns, -2 allocs)

The latency win is much larger than the per-call cgo crossing cost — dropping the final Materialize on a 16 MB freshly-Eval'd tensor saves the MLX-side queue-drain check, not just the cgo call.

Allocs -2 because the closure cleanup replaces 2 separate function calls (runtime.KeepAlive + Free(converted)) that previously each escaped scratch.

Also drops the now-unused "runtime" and "unsafe" imports from generate.go.

Co-Authored-By: Virgil <virgil@lethean.io>
…KVCacheRangeWithOptions

W11-AE migrates the W11-X-installed dual materialiseFloat32View calls (K + V) to the fast-path variant.  kSliced + vSliced are explicit-Eval'd at line 437 immediately before, so the fast-path contract holds.  Cache K/V tensors are normally DTypeFloat32 + row-contiguous (Slice preserves row-major when slicing axis 0); quantised caches fall through to the legacy materialiseFloat32View ceremony unchanged.

Measured BenchmarkInspectKVCacheRange_Realistic (32 heads x 1024 tokens x 128 head_dim x K+V = 32 MB borrowed, ~100 MB total snapshot):

  Before  ~10.1 ms / 100.67 MB / 154 allocs
  After   ~2.68 ms / 100.67 MB / 152 allocs   (-73% ns, -2 allocs)

Same multiplicative win as inspectAttentionCache — dropping Materialize on Eval'd large tensors saves the MLX queue-drain check, not just per-call cgo overhead.

Allocs -2 because the cleanup closure replaces 2 separate function calls (runtime.KeepAlive + Free(converted)) per K/V pair, but the closure escape adds back 1 alloc per pair, netting -2 across both.

Also drops the now-unused "runtime" and "unsafe" imports from kv_snapshot.go.

Co-Authored-By: Virgil <virgil@lethean.io>
…terialize for contiguous-float32 — InspectAttentionCache -83%, InspectKVCacheRange -74%; threshold inverts vs W11-X — fast-path wins at ALL sizes when caller pre-Evals)
@sonarqubecloud

Quality Gate failed

Failed conditions
2 Security Hotspots
6.8% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud
