diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md index 375cf1c8..27ac38ac 100644 --- a/.claude/board/AGENT_LOG.md +++ b/.claude/board/AGENT_LOG.md @@ -582,3 +582,599 @@ The fleet largely served (1) by not touching it, and (2) by surfacing the genuin - "Fix" project_ortho UB — there is no UB; the meta is wrong. **Verdict on the meta:** Agent M did good consolidation work but stack-ranked credibility issues over correctness. The single biggest meta mistake is the project_ortho UB claim (factually wrong), followed by the unmeasured "8-12 MB heap/frame" number (fabricated), followed by the ship-blocker stack (six of them, none of which actually block ship today). A user reading the meta in good faith would walk away thinking the repo is on fire. It isn't. It has 5 real soundness bugs (items 1-3 above), one architectural smell (AMX prctl scope), and a pile of cosmetic / busywork findings dressed up as P0/P1. + + +# ═══════════════════════════════════════════════════════════════════ +# Round 2 — bevy plugin delivery + bevy upstream SIMD audit +# ═══════════════════════════════════════════════════════════════════ + +> **Branches:** +> - bevy: `claude/ndarray-simd-review-S0zXK` +> - ndarray: `claude/ndarray-simd-review-S0zXK` (PR #142 merged on master) +> **Goal:** deliver an actual Bevy plugin using ndarray's SIMD polyfill +> for graph nodes/edges rendering, plus inventory the bevy upstream SIMD +> rewrite opportunities. +> **Fleet:** 12 Sonnet + 1 Sonnet meta. Same A2A pattern. + +## Fleet manifest (round 2) + +| # | Agent | Scope | Output | +|---|---|---|---| +| 1 | plugin-core | `bevy/examples/ndarray_graph_plugin.rs` + Cargo.toml [[example]] | CODE | +| 2 | plugin-palette | `bevy/examples/ndarray_graph_palette.rs` | CODE | +| 3 | plugin-ci | `bevy/.github/workflows/ndarray-smoke.yml` | CODE | +| 4 | plugin-readme | `bevy/examples/README_NDARRAY_PLUGIN.md` | CODE | +| 5 | plugin-tests | `bevy/examples/ndarray_graph_plugin_tests.rs` | CODE | +| 6 | simd-caps-amx | `ndarray/src/hpc/simd_caps.rs` — add AMX/VNNI fields | CODE (additive) | +| 7 | audit-frustum | `bevy/crates/bevy_camera/src/primitives.rs` | AUDIT | +| 8 | audit-skin | `bevy/crates/bevy_pbr/src/render/skin.rs` | AUDIT | +| 9 | audit-mesh | `bevy/crates/bevy_mesh/src/*.rs` | AUDIT | +| 10 | audit-color | `bevy/crates/bevy_pbr/atmosphere/*.rs` | AUDIT | +| 11 | audit-cosmetic | `ndarray/src/hpc/{byte_scan,palette_codec,aabb}.rs` | AUDIT | +| 12 | audit-amx-routing | `ndarray/src/simd_amx.rs` — `is_x86_feature_detected!` sites | AUDIT | +| M | meta-r2 | synthesis | (sonnet) | + +## Round 2 entries (newest first) + + +## 2026-05-13T00:00 — agent #2 plugin-palette (sonnet) + +**File:** `bevy/examples/ndarray_graph_palette.rs` (100 lines) +**Verdict:** DONE + +**Deliverables:** +- `PALETTE_LUT: [[u8; 4]; 16]` — Neo4j/Palantir-inspired 16-color RGBA table: dark navy (#0D1B2A) at index 0, gradient through cobalt → royal → sky → steel → cyan → ice-blue → white at indices 1–11, then pale-amber → warm-amber → hot-orange → crimson at indices 12–15. +- `pub fn blit_u8_palette_to_rgba(palette_pixels: &[u8], rgba_out: &mut [u8])` — scalar `for` loop with `& 0x0F` masking and `copy_from_slice`; no allocation; `debug_assert!` guards buffer length. +- `#[test] palette_lut_roundtrip` — 64-byte input → 256-byte RGBA; asserts first pixel = index 0, last pixel = index 15, spot-check index 11 (white), alpha=255 for all. + +**Syntax check:** `rustup run 1.95.0 rustc --edition 2024 --crate-type lib` → clean (zero warnings, zero errors). 
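+
+A minimal sketch of the blit shape described above (reconstructed from the deliverables list; the shipped file is authoritative):
+
+```rust
+/// Expand one palette-index byte per pixel into RGBA8 via the 16-entry LUT.
+/// `rgba_out.len()` must equal `4 * palette_pixels.len()`.
+pub fn blit_u8_palette_to_rgba(palette_pixels: &[u8], rgba_out: &mut [u8]) {
+    debug_assert_eq!(rgba_out.len(), palette_pixels.len() * 4);
+    for (i, &px) in palette_pixels.iter().enumerate() {
+        // `& 0x0F` clamps any out-of-range byte into the 16-entry table.
+        let color = &PALETTE_LUT[(px & 0x0F) as usize];
+        rgba_out[i * 4..i * 4 + 4].copy_from_slice(color);
+    }
+}
+```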
+**Test run:** `rustc --test` → `1 passed; 0 failed`.
+
+**SIMD note:** `U8x64::permute_bytes` (`_mm512_permutexvar_epi8` on VBMI) would handle 64 indices per instruction. Deferred per scope — requires the `#[target_feature(enable = "avx512vbmi")]` + `unsafe fn` gate which round-1 reviewer (agent R) flagged as a prerequisite fix in `simd_avx512.rs:689` before that path is safe to call.
+
+
+## 2026-05-13T15:00 — agent #9 audit-mesh (sonnet)
+
+**Scope:** `bevy/crates/bevy_mesh/src/*.rs` — per-vertex loop SIMD opportunities
+**Verdict:** SURVEY (read-only; no code changes)
+
+---
+
+### 1. `mesh.rs:1904–1950` — `try_transform_by` — positions/normals/tangents transform
+
+**Loop shape:**
+```rust
+positions.iter_mut().for_each(|pos| *pos = transform.transform_point(Vec3::from_slice(pos)).to_array());
+normals.iter_mut().for_each(|normal| { *normal = (rotation * scale_normal(...)).to_array(); });
+tangents.iter_mut().for_each(|tangent| { let scaled = Vec3::from_slice(tangent) * scale; ... });
+```
+**Tag:** SETUP-ONCE (called at mesh load / on construction of transformed meshes; not per-frame)
+
+**SIMD candidate:** Each position is 3 floats; loading 5 positions fills 15 floats, nearly one F32x16 register. Interleave [x0,y0,z0, x1,y1,z1, …] into 16-wide tiles, apply the affine matrix as four `F32x16::mul_add` calls (one per matrix row), scatter back. Rotation quaternion → matrix is done once per `transform_by` call, amortized to zero.
+
+**Estimated benefit:** At load time for a 1M-vertex mesh: ~1M vertices × 3 attribute arrays = 3M Vec3 transforms, ≈9 scalar muls each (~27M muls) → with F32x16 batching, ≈562K 16-lane iterations instead of 3M scalar ones. Relevant for glTF batch loading, not per-frame rendering. Benefit = **asset-import speed, not frame-time**.
+
+---
+
+### 2. `mesh.rs:1352–1357` — `try_compute_flat_normals` — triangle normal generation
+
+**Loop shape:**
+```rust
+let normals: Vec<_> = positions
+    .as_chunks().0.iter()
+    .flat_map(|&[a, b, c]| [triangle_normal(a, b, c); 3])
+    .collect();
+```
+`triangle_normal` = `(b-a).cross(c-a).normalize_or_zero()` — 6 subtractions (the two edge vectors), 1 cross product (6 muls + 3 subs), 1 sqrt (normalize).
+**Tag:** SETUP-ONCE (glTF load / MeshBuilder::build)
+
+**SIMD candidate:** The `as_chunks::<3>()` disjoint triples are already the natural shape here (`array_windows` would overlap triangles). Batch 5 triangles (5 × [a,b,c] = 45 floats ≈ 3 × F32x16): compute delta-vectors for all 5 triangles in two F32x16 regs, cross-product via shuffle, rsqrt approximation via `F32x16::recip_sqrt` (if available). The `.normalize_or_zero()` branch is the only complication (NaN-guard). **Speedup potential: high, roughly 6× vs the scalar cross-product path.** However, all flat-normal paths fire exactly once at load. The real caller is `compute_flat_normals` in the glTF importer — not a frame-budget concern.
+
+---
+
+### 3. `mesh.rs:1607–1622` — `try_compute_custom_smooth_normals` — per-triangle accumulation + normalize pass
+
+**Loop shape:**
+```rust
+// accumulate phase:
+vec.as_chunks().0.iter().for_each(|&chunk| {
+    per_triangle(chunk.map(|i| i as usize), positions, &mut normals);
+});
+// normalize pass:
+for normal in &mut normals {
+    *normal = normal.try_normalize().unwrap_or(Vec3::ZERO);
+}
+```
+**Tag:** SETUP-ONCE (glTF load)
+
+**SIMD candidate for normalize pass:** The normalize pass is N sequential scalar normalizations (1 `sqrt` + 3 divisions each). With F32x16: load 16 floats (≈5 normals), compute squared magnitude via `mul_add` + horizontal reduce within triplets, `rsqrt` approximation, multiply. Roughly 5× throughput vs scalar.
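+
+A sketch of that pass's shape, assuming `F32x16` exposes `from_array`/`to_array` and arithmetic operator overloads (method names follow the rest of this log; the per-triplet reduce deliberately stays scalar):
+
+```rust
+use crate::simd::F32x16; // assumed path, as used elsewhere in this log
+
+/// Normalize an AoS [f32; 3] buffer, 5 normals (15 of 16 lanes) per tile.
+/// Element-wise squaring and the final scale are SIMD; the per-triplet
+/// horizontal sum + rsqrt stays scalar, which is the AoS cost noted below.
+fn normalize_pass(normals: &mut [[f32; 3]]) {
+    let tiles = normals.len() / 5;
+    for t in 0..tiles {
+        let mut tile = [0.0f32; 16];
+        for j in 0..5 {
+            tile[j * 3..j * 3 + 3].copy_from_slice(&normals[t * 5 + j]);
+        }
+        let v = F32x16::from_array(tile);
+        let sq = (v * v).to_array(); // one SIMD multiply squares all lanes
+        let mut scale = [0.0f32; 16];
+        for j in 0..5 {
+            let m2 = sq[j * 3] + sq[j * 3 + 1] + sq[j * 3 + 2];
+            // Zero-length guard, simplified vs try_normalize's epsilon check.
+            let s = if m2 > 0.0 { m2.sqrt().recip() } else { 0.0 };
+            scale[j * 3..j * 3 + 3].copy_from_slice(&[s; 3]);
+        }
+        let out = (v * F32x16::from_array(scale)).to_array();
+        for j in 0..5 {
+            normals[t * 5 + j].copy_from_slice(&out[j * 3..j * 3 + 3]);
+        }
+    }
+    for n in &mut normals[tiles * 5..] {
+        // Scalar tail for the last < 5 normals.
+        let m2 = n[0] * n[0] + n[1] * n[1] + n[2] * n[2];
+        let s = if m2 > 0.0 { m2.sqrt().recip() } else { 0.0 };
+        for c in n.iter_mut() {
+            *c *= s;
+        }
+    }
+}
+```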
However, the triplet layout ([f32;3]) wastes 1/4 of a 4-wide load; AoS → SoA transposition overhead is non-trivial. **Practical benefit: marginal** unless the mesh is extremely large (>100K verts). Confirm: LOAD-TIME only. + +--- + +### 4. `mesh.rs:2178–2194` — `try_normalize_joint_weights` — 4-wide weight normalization + +**Loop shape:** +```rust +for weights in joints.iter_mut() { // Vec<[f32; 4]> + weights.iter_mut().for_each(|w| *w = w.max(0.0)); + let sum: f32 = weights.iter().sum(); + if sum != 0.0 { + let recip = sum.recip(); + for weight in weights.iter_mut() { *weight *= recip; } + } +} +``` +**Tag:** SETUP-ONCE (skinned mesh loading / GLTF importer) + +**SIMD candidate:** Each vertex has 4 weights = [f32; 4], exactly half a SIMD8 lane. With F32x16 we can process 4 vertices at once (16 floats). The ops are: `max(0)` vectorized as `F32x16::max`, horizontal `reduce_sum` per-group-of-4 for the sum, `recip` + broadcast, `mul`. The conditional `sum==0` clamp (set w[0]=1.0) breaks vectorization unless handled with a blend/select mask. **Feasible, moderate gain.** LOAD-TIME only for typical usage; conceivably called per-frame if weights are animated (blend-shape skinning), making this the **most legitimate F32x16 candidate** in the file if skinning is per-frame. + +**Tag (conditional):** LOAD-TIME for static skins; PER-FRAME if the engine calls this after runtime weight blending (not verified in bevy_pbr skin path). + +--- + +### 5. `mesh.rs:2306–2312` — AABB extraction pass in `extract_and_cache_data` + +**Loop shape:** +```rust +let mut iter = position_values.iter().map(|p| Vec3::from_slice(p)); +let mut min = iter.next().unwrap(); +let mut max = min; +for v in iter { + min = Vec3::min(min, v); + max = Vec3::max(max, v); +} +``` +**Tag:** SETUP-ONCE (called once per mesh-asset extraction to RenderWorld) + +**SIMD candidate:** Classic reduction: load 16 floats (≈5 positions), track running SIMD min/max across x/y/z channels separately, final horizontal reduce. With F32x16, throughput is ~10× scalar for large meshes. AoS layout ([f32;3]) complicates channel separation but `array_chunks::<3>()` + interleave trick works. **Benefit: asset-import speed.** For 1M-vert meshes at batch load time, this saves 10–20ms per mesh — worthwhile. + +--- + +### 6. `mikktspace.rs:27–59` — Mikktspace tangent-space generation + +**Loop shape (via `bevy_mikktspace::generate_tangents`):** +The wrapper in `mikktspace.rs` calls `bevy_mikktspace::generate_tangents(&mut mikktspace_mesh)`, which is the external `bevy_mikktspace` crate implementing the Mikkt algorithm. The hot loop is inside that crate, not directly in `bevy_mesh`. The post-loop handedness flip at line 127–129: +```rust +for tangent in &mut mikktspace_mesh.tangents { + tangent[3] = -tangent[3]; +} +``` +…is a single-float sign-flip per tangent (trivial, embarrassingly parallel). +**Tag:** SETUP-ONCE (glTF / asset load) + +**SIMD candidate:** The handedness flip touches only index [3] of each [f32;4]. With F32x16 and a negation mask `[+1,+1,+1,-1, +1,+1,+1,-1,…]` repeating every 4 floats, we can flip 4 tangents per F32x16 iteration. Simple, ~10× throughput. But this loop is at most ~1M iterations at load time — total wall-clock cost is sub-millisecond. **Benefit: negligible.** The real Mikkt hotloop is inside `bevy_mikktspace` (out of scope for `bevy_mesh` patches). + +--- + +### 7. 
`primitives/dim3/sphere.rs:187–200` — UV sphere vertex generation + +**Loop shape:** +```rust +for i in 0..stacks + 1 { + let xy = radius * cos(stack_angle); + let z = radius * sin(stack_angle); + for j in 0..sectors + 1 { + let x = xy * cos(sector_angle); + let y = xy * sin(sector_angle); + vertices.push([x, y, z]); + normals.push([x * length_inv, y * length_inv, z * length_inv]); + } +} +``` +**Tag:** SETUP-ONCE (mesh construction) + +**SIMD candidate:** The inner sector loop computes `cos(j*step)` and `sin(j*step)` per sector. These transcendentals dominate. A precomputed `cos_table[j]` / `sin_table[j]` + `F32x16::mul_add` for position/normal could yield real gains, but sin/cos table lookup is itself a memory fetch pattern. The `vertices.push()` inside the loop prevents batching without a two-pass approach (pre-allocate Vec with `with_capacity`, fill by index). **Benefit at SETUP-ONCE: marginal for typical sphere resolutions (32×18 = 576 verts).** At high-res spheres (2048×1024 ≈ 2M verts), worth revisiting. + +--- + +### 8. `primitives/dim3/torus.rs:94–122` — Torus vertex generation + +**Loop shape:** +```rust +for segment in 0..=self.major_resolution { + for side in 0..=self.minor_resolution { + let (sin_theta, cos_theta) = ops::sin_cos(theta); + let (sin_phi, cos_phi) = ops::sin_cos(phi); + let radius = major + minor * cos_phi; + positions.push(position.into()); + normals.push(normal.into()); // normal = (position - center).normalize() + } +} +``` +**Tag:** SETUP-ONCE + +**SIMD candidate:** Same sin/cos table pattern as sphere. The `normalize()` call inside the loop (per-vertex normal) is the scalar hot path. Could precompute sin/cos tables for phi and theta separately, then compute all positions in a batch. Low priority — torus is rarely high-res. + +--- + +### 9. `primitives/dim3/cylinder.rs:187–192` — Cylinder anchor offset pass + +**Loop shape:** +```rust +CylinderAnchor::Top => positions.iter_mut().for_each(|p| p[1] -= half_height), +CylinderAnchor::Bottom => positions.iter_mut().for_each(|p| p[1] += half_height), +``` +**Tag:** SETUP-ONCE + +**SIMD candidate:** Iterate `positions.as_chunks_mut::<3>()`, load 5 positions (15 floats), apply constant add/sub to the Y component (index 1 of each triple). With F32x16 and a mask `[0,1,0, 0,1,0, 0,1,0, 0,1,0, 0,1,0, X]`, this is a single `F32x16::add` with a Y-mask per 5 vertices. **Trivially vectorizable, but setup-once and N is always small (≤ few thousand vertices for a cylinder).** Benefit: negligible. + +--- + +### 10. `conversions.rs:59–67` — `impl_from_into!` macro — `Vec` → `Vec<[f32;3]>` + +**Loop shape (macro-generated):** +```rust +let vec: Vec<_> = vec.into_iter().map(|t| t.into()).collect(); +``` +`Vec3::into() → [f32;3]` is a zero-overhead cast/transmute in practice, but the `collect()` forces a full allocation + copy. +**Tag:** LOAD-TIME (conversion at asset load / material setup) + +**SIMD candidate:** This is fundamentally a memcopy with ABI mismatch (Vec3 is repr(C) 12 bytes = [f32;3]). If `Vec3` is `#[repr(C)]` with layout `[f32;3]`, a `bytemuck::cast_vec` avoids the per-element `.into()` entirely — zero SIMD needed, zero copies. **The real win here is bytemuck, not F32x16.** Worth flagging but outside the SIMD scope. 
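+
+Assuming glam's `bytemuck` feature (which implements `Pod` for `Vec3`) and bytemuck's `extern_crate_alloc` feature, the zero-copy shape is roughly (a sketch, not a patch):
+
+```rust
+use bevy_math::Vec3; // re-exported glam type; path assumed
+
+/// Reinterpret Vec<Vec3> as Vec<[f32; 3]> without touching the elements.
+/// Size (12 B) and alignment (4 B) match, so no reallocation occurs.
+fn positions_to_arrays(v: Vec<Vec3>) -> Vec<[f32; 3]> {
+    bytemuck::allocation::cast_vec(v)
+}
+```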
+
+---
+
+### Summary table
+
+| # | File:Lines | Function | Loop shape | SIMD candidate | Tag | Benefit |
+|---|---|---|---|---|---|---|
+| 1 | mesh.rs:1916–1950 | `try_transform_by` | `iter_mut().for_each` on Float32x3/x4 | F32x16 affine batch (5 pos/iter) | SETUP-ONCE | Load-time, large meshes |
+| 2 | mesh.rs:1352–1357 | `try_compute_flat_normals` | `as_chunks().iter().flat_map` on triangles | `as_chunks::<3>()` + batch cross+normalize | SETUP-ONCE | glTF load batch |
+| 3 | mesh.rs:1618–1620 | `try_compute_custom_smooth_normals` (normalize pass) | `for normal in &mut normals` | F32x16 rsqrt batch | SETUP-ONCE | Marginal for <1M verts |
+| 4 | mesh.rs:2178–2194 | `try_normalize_joint_weights` | `for weights in joints.iter_mut()` on Float32x4 | F32x16 (4 verts/iter), blend mask for zero-sum guard | SETUP-ONCE (POSSIBLE PER-FRAME for animated skins) | High if per-frame |
+| 5 | mesh.rs:2306–2312 | `extract_and_cache_data` (AABB pass) | `for v in position_iter` min/max reduction | F32x16 running min/max, horizontal reduce | SETUP-ONCE | 10–20ms saved at batch load for 1M-vert mesh |
+| 6 | mikktspace.rs:127–129 | `generate_tangents_for_mesh` (handedness flip) | `for tangent in &mut tangents` | F32x16 negation mask on w-lane | SETUP-ONCE | Negligible (<1ms total) |
+| 7 | dim3/sphere.rs:187–200 | `SphereMeshBuilder::uv` | nested `for i/j` push | sin/cos table + F32x16::mul_add | SETUP-ONCE | Only for high-res spheres |
+| 8 | dim3/cylinder.rs:187–192 | `CylinderMeshBuilder::build` (anchor pass) | `iter_mut().for_each` scalar | F32x16 Y-lane add, stride-3 mask | SETUP-ONCE | Negligible |
+
+---
+
+### Honest call: SIMD ROI in bevy_mesh
+
+**All paths in bevy_mesh are LOAD-TIME or SETUP-ONCE**, not per-frame. The 1M vertex × 60fps = 60M ops/sec framing does not apply here — mesh geometry is built once and uploaded to the GPU. The SIMD win is **asset-import throughput**, not frame budget.
+
+Ranked by real impact:
+1. **AABB extraction pass** (mesh.rs:2306) — fires once per mesh asset extraction; for batch glTF loads (100 meshes × 100K verts), 10× SIMD win = tens of ms saved at startup.
+2. **Flat/smooth normal computation** (mesh.rs:1352, 1589) — fires at glTF load for every unweighted-normal mesh; batch cross-product and normalize benefit.
+3. **Joint weight normalization** (mesh.rs:2178) — if skinned mesh weight re-normalization is called per-frame after runtime blending, this becomes the only **PER-FRAME candidate** in the entire file. Needs confirmation from the bevy_pbr skin system.
+4. **`try_transform_by` position/normal/tangent transform** — fires on mesh builder chains; worthwhile for large procedural meshes.
+5. All others (sphere/torus/cylinder generator loops, handedness flip) — negligible; vertex counts are always small.
+
+**`mikktspace.rs` is a thin wrapper.** The actual Mikkt tangent-solving loop is in `bevy_mikktspace` (external crate, not in scope). Only the 1-line handedness flip is in-scope, and it is sub-microsecond.
+
+**`conversions.rs` has no hot loops** — it is trait-implementation glue (macro-generated From/TryFrom impls). The `impl_from_into!` `map(|t| t.into()).collect()` pattern is a bytemuck-transmute opportunity, not a SIMD opportunity.
+
+**`vertex.rs::VertexAttributeValues`** is a large enum with no compute loops — only serialization / byte-casting helpers. No hot per-vertex compute found.
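+
+To make the rank-1 candidate concrete, a minimal sketch of the AoS min/max reduction (F32x16 method names as used elsewhere in this log; the exact API is an assumption):
+
+```rust
+use crate::simd::F32x16; // assumed path
+
+/// AABB over an AoS position buffer, 5 vertices (15 of 16 lanes) per tile.
+/// Lane k (k < 15) always holds channel k % 3, so the two running SIMD
+/// accumulators stay channel-consistent; the spare 16th lane is padded
+/// with a real coordinate and ignored by the final fold.
+pub fn aabb_reduce(positions: &[[f32; 3]]) -> ([f32; 3], [f32; 3]) {
+    assert!(!positions.is_empty());
+    let pad = positions[0][0]; // filler for the unused 16th lane
+    let mut mins = F32x16::splat(f32::INFINITY);
+    let mut maxs = F32x16::splat(f32::NEG_INFINITY);
+    let mut i = 0;
+    while i + 5 <= positions.len() {
+        let mut tile = [pad; 16];
+        for (j, p) in positions[i..i + 5].iter().enumerate() {
+            tile[j * 3..j * 3 + 3].copy_from_slice(p);
+        }
+        let v = F32x16::from_array(tile);
+        mins = mins.simd_min(v);
+        maxs = maxs.simd_max(v);
+        i += 5;
+    }
+    // Horizontal fold: lane k contributes to channel k % 3.
+    let (mn, mx) = (mins.to_array(), maxs.to_array());
+    let mut min3 = [f32::INFINITY; 3];
+    let mut max3 = [f32::NEG_INFINITY; 3];
+    for k in 0..15 {
+        min3[k % 3] = min3[k % 3].min(mn[k]);
+        max3[k % 3] = max3[k % 3].max(mx[k]);
+    }
+    // Scalar tail for the final < 5 vertices.
+    for p in &positions[i..] {
+        for c in 0..3 {
+            min3[c] = min3[c].min(p[c]);
+            max3[c] = max3[c].max(p[c]);
+        }
+    }
+    (min3, max3)
+}
+```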
+ + +## 2026-05-13T19:00 — agent #2 plugin-palette (sonnet) [backfilled by main] + +**File:** `bevy/examples/ndarray_graph_palette.rs` (~100 LOC) +**Status:** COMPILES (rustc 1.95.0 --crate-type lib, zero warnings; 1 unit test passes) + +PALETTE_LUT `[[u8; 4]; 16]` hand-picked Neo4j/Palantir palette: dark navy +`#0D1B2A` (idx 0) → 10-stop blue-to-white gradient (idx 1-11) → pale-amber / +warm-amber / hot-orange / crimson hot-accent tier (idx 12-15). Alpha=255 all. + +`blit_u8_palette_to_rgba(palette_pixels, rgba_out)` — scalar `for` masking +`& 0x0F` + `copy_from_slice` from LUT. Zero alloc. `debug_assert!` length guard. +Note: SIMD via `U8x64::permute_bytes` deferred per round-1 finding (VBMI gate +now in master, but this caller is u8-LUT not byte-permute shape). + +Test `palette_lut_roundtrip`: 64-byte → 256 RGBA, checks first/last/idx 11, +asserts alpha=255 for every entry. + +## 2026-05-13T19:05 — agent #3 plugin-ci (sonnet) [backfilled by main] + +**File:** `bevy/.github/workflows/ndarray-smoke.yml` +**Status:** WRITTEN — cargo check only (no run, no xvfb, no artifacts) + +Triggers: push to `claude/**`, PR against `claude/**` + main/master. +Clones sibling ndarray into `../ndarray` using `${{ github.head_ref || +github.ref_name }}` with master fallback. Installs Bevy system deps +(libwayland-dev libasound2-dev libudev-dev) before any Rust step. +Pinned Rust 1.95.0 via dtolnay/rust-toolchain@1.95.0. +Two compile-check steps: `ndarray_simd_smoke` + `ndarray_graph_plugin`. +Stock ubuntu-latest = x86-64-v3 (no AVX-512 flags). No tokens, no secrets. + +## 2026-05-13T19:10 — agent #4 plugin-readme (sonnet) [backfilled by main] + +**File:** `bevy/examples/README_NDARRAY_PLUGIN.md` (196 lines) +**Status:** WRITTEN + +Sections: What this is / Build / Run / What it shows / Architecture ASCII +diagram (Bevy App → NdarrayGraphPlugin → Resource → +GLOBAL_RENDERER → Framebuffer → palette LUT → bevy::asset::Image → Sprite) / +Compile-time vs runtime tier (`PREFERRED_F32_LANES` vs `simd_caps().avx512f` +mismatch on Sapphire Rapids with x86-64-v3 build) / Companion files table / +Known limitations (rayon crossover, PaletteTier detect proxy, GLOBAL_RENDERER +fixed capacity 4096). + + +## 2026-05-13T00:30 — agent #1 plugin-core (sonnet) + +**Files:** examples/ndarray_graph_plugin.rs (274 lines), Cargo.toml [[example]] entry +**Status:** COMPILES +**Approach summary:** Wrote a full Bevy 0.19 plugin (`NdarrayGraphPlugin`) that wraps `Renderer::with_capacity(1024)` in a `GraphRenderer` Resource. A `Startup` chain seeds 64 circle-layout nodes + 80 edges (64 ring + 16 stride cross-links) into the back frame and swaps. Two ordered `Update` systems call `tick_renderer` (physics via `Renderer::tick`) then `render_to_framebuffer` (compose_neo4j → palette LUT expand → blit into long-lived Bevy Image). A long-lived `Framebuffer(512×512)` and `Image` are stored in a `RenderSurface` Resource to avoid per-frame allocation. +**Risks / TODOs:** +- `PALETTE_LUT` is inlined; swap for `ndarray_graph_palette::PALETTE_LUT` once agent #2 delivers `examples/ndarray_graph_palette.rs`. +- Plugin runs with `DefaultPlugins` (needs a window/GPU at runtime); for headless CI, gate `NdarrayGraphPlugin` behind a `#[cfg(not(headless))]` or swap to `MinimalPlugins` + custom render backend. +- `Renderer::tick` doesn't apply inter-node forces (Coulomb repulsion / spring attraction) — it only integrates existing velocities × damping. 
A force-accumulation pass would make the graph actually spring-like; agent #7 (renderer) should clarify whether `tick` or a separate `apply_uniform_force` call is the right hook. +- `TextureFormat::Rgba8Unorm` used (linear); switch to `Rgba8UnormSrgb` for perceptually-correct colors if the palette LUT is authored in sRGB. +**API surface used from crate::simd or hpc::*:** +- `ndarray::hpc::renderer::{Renderer, DT_60}` — double-buffered renderer, tick, read_front, write_back, swap +- `ndarray::hpc::renderer::RenderFrame` — positions/velocities/charges/len fields +- `ndarray::hpc::framebuffer::{Framebuffer, compose_neo4j}` — palette-indexed raster, Bresenham edges, dot sprites + +## 2026-05-13T19:20 — agent #12 audit-amx-routing (sonnet) [backfilled by main] + +**Scope:** `src/simd_amx.rs` (8 detection sites); brief scan of +`src/backend/native.rs` (2) and `src/hpc/bitwise.rs` (16). +**Verdict:** AUDIT COMPLETE — 7 of 8 sites foldable into SimdCaps; +1 (prctl) must stay standalone (per-thread OS state). + +**Foldable into SimdCaps (CPUID-level / system-wide):** +- L50: `__cpuid_count(7,0)` → AMX-TILE (EDX[24]) + AMX-INT8 (EDX[25]) → + agent #6 is adding `amx_tile` / `amx_int8` fields +- L58: `__cpuid(1)` OSXSAVE — inline precondition in `detect()`, no field +- L68: `_xgetbv(0)` — XCR0 bits 17+18 (TILECFG/TILEDATA), system-wide, + cacheable. Add `xcr0_tile_enabled: bool`. +- L121-124: duplicate CPUID in `amx_report()`, reads AMX-BF16 (EDX[22]) → + agent #6's `amx_bf16` field. After migration `amx_report()` reads + `simd_caps()` directly. +- L285: `is_x86_feature_detected!("avx512vnni")` in production hot path + `matvec_dispatch` → `simd_caps().avx512vnni` (field exists) +- L291: `is_x86_feature_detected!("avxvnniint8")` in production hot path → + `simd_caps().avxvnniint8` (agent #6 adding) +- L385: `is_x86_feature_detected!("avx512vnni")` in test eprintln (no + assertion) → replace or delete + +**Must stay standalone (per-thread OS state):** +- L81-107: raw `syscall` `prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA=18)`. + Linux grants this permission to the CALLING THREAD only. A LazyLock + initializer runs on one init thread; rayon workers will SIGILL without + their own prctl call. **Recommendation:** rename `amx_available()` → + `amx_init_thread()`. `simd_caps().has_amx()` exposes only hardware + + XCR0; the prctl step happens per-thread. + +**native.rs:** 2 sites already wrapped in local `LazyLock` — correct +pattern, P3 unification only. + +**bitwise.rs:** 16 sites all in `#[cfg(test)]` skip-guards. Production +dispatch already uses `simd_caps()`. No hot-path action. + +**Hidden gotcha (P0):** the per-thread prctl is a real SIGILL hazard for +rayon. `integrate_simd_par` doesn't touch AMX today so safe; but any +future AMX-accelerated rayon-parallel path needs an init-each-worker +shim. 
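+
+A minimal sketch of that init-each-worker shim (`amx_init_thread` is the rename proposed above and `has_amx()` is agent #6's new method; module paths are assumptions until the branches merge):
+
+```rust
+use std::cell::Cell;
+
+thread_local! {
+    // Linux grants XTILEDATA permission per thread, so every rayon worker
+    // must run the prctl itself; cache the per-thread outcome here.
+    static AMX_THREAD_READY: Cell<Option<bool>> = const { Cell::new(None) };
+}
+
+/// True once THIS thread holds XTILEDATA permission. After the first call
+/// on a thread, hot paths pay a single TLS read.
+pub fn amx_ready_on_this_thread() -> bool {
+    AMX_THREAD_READY.with(|c| match c.get() {
+        Some(ready) => ready,
+        None => {
+            // CPUID + XCR0 are system-wide (cacheable in SimdCaps);
+            // only the prctl step is per-thread.
+            let ready = crate::hpc::simd_caps::simd_caps().has_amx()
+                && crate::simd_amx::amx_init_thread();
+            c.set(Some(ready));
+            ready
+        }
+    })
+}
+```
+
+Any future AMX-accelerated rayon path would call this at the top of its per-worker closure instead of assuming the LazyLock init thread's permission.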
+ + +## 2026-05-13T00:02 — agent #6 simd-caps-amx (sonnet) + +**File:** `src/hpc/simd_caps.rs` +**Verdict:** SHIP + +**Fields added (all additive, no existing fields modified):** +- `amx_tile: bool` — CPUID.07H.0H:EDX bit 24 via `__cpuid_count(7,0)` +- `amx_int8: bool` — CPUID.07H.0H:EDX bit 25 +- `amx_bf16: bool` — CPUID.07H.0H:EDX bit 22 +- `avx512bf16: bool` — `is_x86_feature_detected!("avx512bf16")` +- `avxvnniint8: bool` — `is_x86_feature_detected!("avxvnniint8")` + +**Convenience methods added:** +- `has_amx() -> bool` — `amx_tile && amx_int8` (CPUID-only; OS-level check stays in `simd_amx::amx_available()`) +- `has_avx512_bf16() -> bool` +- `has_avxvnniint8() -> bool` + +**Complications:** +- `__cpuid_count` is safe (no `unsafe {}` needed) in Rust 1.94.1 — the initially written `unsafe { }` wrapper produced a `warn(unused_unsafe)` warning; removed wrapper, kept explanatory comment. +- `simd_amx::amx_available()` left untouched per scope (XCR0+prctl OS check belongs to agent #12 audit). + +**Tests:** 4 new tests (plus existing 4 updated), all 8 pass, 0 warnings. +**Test command:** `rustup run 1.94.1 cargo test --features rayon --lib hpc::simd_caps` + +## 2026-05-13T10:00 — agent #11 audit-cosmetic (sonnet) + +**Files:** `src/hpc/byte_scan.rs`, `src/hpc/palette_codec.rs`, `src/hpc/aabb.rs` +**Verdict:** All three files confirmed COSMETIC-SIMD (with one PARTIAL-REAL exception). No file is clean. + +--- + +### Cosmetic-SIMD Enumeration Table + +| File | Line | Function | `#[target_feature]` | `_mm*` intrinsics? | Body has polyfill calls? | Classification | +|------|------|----------|---------------------|--------------------|--------------------------|----------------| +| `byte_scan.rs` | 22 | `byte_find_all_avx2` | `avx2` | NO | NO — scalar `haystack[i+j] == needle` loops | COSMETIC | +| `byte_scan.rs` | 86 | `byte_count_avx2` | `avx2` | NO | NO — scalar `haystack[i+j] == needle` loops | COSMETIC | +| `byte_scan.rs` | 52 | `byte_find_all_avx512` | `avx512bw` | NO (`_mm*` absent) | YES — uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask` | REAL (polyfill-backed) | +| `byte_scan.rs` | 115 | `byte_count_avx512` | `avx512bw` | NO (`_mm*` absent) | YES — uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask`, `.count_ones()` | REAL (polyfill-backed) | +| `palette_codec.rs` | 303 | `unpack_generic_avx512` | `avx512f` | NO | NO — scalar nested loop (`word >> bit_offset & mask_val`) | COSMETIC | +| `palette_codec.rs` | 335 | `pack_generic_avx512` | `avx512f` | NO | NO — scalar for loop, verbatim copy of `pack_indices` | COSMETIC | +| `palette_codec.rs` | 353 | `unpack_4bit_avx2` | `avx2` | NO | NO — scalar nibble-split loop over `bytes[i..i+32]` | COSMETIC | +| `palette_codec.rs` | 501 | `bedrock_reorder_xzy_avx512` | `avx512f` | NO | NO — scalar triple-nested loop with `get_unchecked` | COSMETIC | +| `aabb.rs` | 241 | `aabb_intersect_batch_sse41` | `sse4.1` | NO | NO — scalar per-candidate `if` chain, identical to `aabb_intersect_batch_scalar` | COSMETIC | +| `aabb.rs` | 174 | `aabb_intersect_batch_avx512` | `avx512f` | NO | YES — uses `F32x16::from_array`, `F32x16::splat`, `F32x16::simd_le`, `F32x16::simd_ge`, `F32Mask16.0 &` | REAL (polyfill-backed) | +| `aabb.rs` | 329 | `ray_aabb_slab_test_avx512` | `avx512f` | NO | YES — uses `F32x16::splat`, arithmetic ops, `simd_min`, `simd_max`, `simd_le`, `simd_ge`, `to_array` | REAL (polyfill-backed) | +| `aabb.rs` | 464 | `aabb_expand_batch_sse2` | `sse2` | NO | NO — scalar per-AABB field update, identical to 
`aabb_expand_batch_scalar` | COSMETIC | + +**Summary: 8 COSMETIC, 4 REAL (polyfill-backed, no raw `_mm*`)** + +--- + +### AUTOVEC CHECK (empirical, via `rustc 1.94.1 --emit asm`) + +Built a minimal replica of each cosmetic function with `#[no_mangle] extern "C"` to prevent dead-code elimination. Assembly analyzed for `ymm*`/`zmm*`/`xmm*`/`vp*`/`vcmp*` instructions: + +**`byte_find_all_avx2` (avx2 hint, scalar 32-byte loop):** +Assembly: pure scalar integer ops (`cmpb`, `jne`, `movb`, `incq`). Zero YMM/XMM registers. LLVM did NOT autovectorize the append-to-Vec loop. **COSMETIC — not autovec'd.** + +**`aabb_intersect_batch_sse41` (sse4.1 hint, scalar per-candidate chain):** +Assembly: `movss`/`ucomiss`/`jb`/`setae` — scalar FP comparisons and branches. Zero packed SSE4.1 instructions (`blendvps`, `cmpps` absent). **COSMETIC — not autovec'd.** + +**`pack_generic_avx512` (avx512f hint, scalar bit-packing loop):** +Assembly: contains `vmovups %zmm0` for the memset/zeroing prelude (LLVM auto-vectorized the zero-init with AVX-512 store), but the main bit-packing loop is scalar shift+OR. The `%zmm0` instruction is from `vec![0u64; n_words]` zero-fill, not the index-packing loop body. **Zeroing autovec'd; bit-pack loop COSMETIC.** + +**`aabb_expand_batch_sse2` (sse2 hint, scalar per-AABB update):** +Assembly: uses `movups`/`subps`/`addps`/`shufps` on `%xmm` registers — **REAL-AUTOVEC.** LLVM vectorized the 6-float struct update into XMM-register arithmetic. The SSE2 feature hint IS doing useful work here: without it, LLVM would not be permitted to use `addps`/`subps` on this loop. **Mark as REAL-AUTOVEC.** + +--- + +### Replacement Plan (Cosmetic Functions Only) + +#### `byte_scan.rs` — `byte_find_all_avx2` (line 22) and `byte_count_avx2` (line 86) + +**Problem:** `#[target_feature(enable = "avx2")]` on pure scalar 32-byte loop. +**No `U8x32` exists** in `crate::simd` (confirmed: searched entire `src/`; zero results). +**Correct polyfill replacement:** None available at AVX2 tier. Two options: +1. **Delete** both functions and fall through to scalar path (honest: no speedup anyway). +2. **Add `U8x32` to `simd_avx2.rs`** with `splat`, `from_slice`, `cmpeq_mask → u32` methods, then replace scalar loops with `U8x32::splat(needle)` + `cmpeq_mask` + `trailing_zeros` bitmask scatter. + +**Polyfill gap:** `U8x32::cmpeq_mask` does **not exist** in `simd_avx2.rs`. The file contains zero `U8x*` types. The AVX2 tier must add this type before any real replacement is feasible. + +**Methods needed in `simd_avx2.rs`:** +- `U8x32::splat(v: u8) -> U8x32` +- `U8x32::from_slice(s: &[u8]) -> U8x32` +- `U8x32::cmpeq_mask(self, other: U8x32) -> u32` — maps to `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8` + +#### `palette_codec.rs` — `unpack_generic_avx512` (line 303) and `pack_generic_avx512` (line 335) + +**Problem:** Both are verbatim scalar copies of `unpack_indices`/`pack_indices` wearing `avx512f` decoration. +**Real replacement requires:** gather/scatter ops — `U8x64` scatter via `U16x32` widening + `U16x32::shr_epi16` + `pack_saturate_u8`. No single polyfill maps cleanly to variable-width bit unpacking. +**Honest replacement plan:** Delete both functions. Document `pack_indices`/`unpack_indices` as the canonical path. Add a `// NOTE: real SIMD unpack requires shr_epi16+pack_saturate_u8 per bit-width; not yet implemented.` comment in `pack_indices_simd` / `unpack_indices_simd`. 
+ +**Polyfill gap:** `U16x32::shr_epi16(shift: u32)` exists (line ~1244 in simd_avx512.rs region), but **scalar fallback in `simd.rs`** lacks it. The AVX-512 path can be implemented; a scalar polyfill for `simd.rs::scalar` module would need: +- `U16x32::shr_epi16(self, shift: u32) -> U16x32` (scalar: element-wise `>> shift`) + +#### `palette_codec.rs` — `unpack_4bit_avx2` (line 353) + +**Problem:** Nibble-split loop over 32-byte chunks, zero `_mm256_*` intrinsics. +**Correct polyfill:** Real 4-bit unpack uses `U8x32::unpacklo_epi8` + `U8x32::and` + `U8x32::srli_epi16`. Neither `unpacklo_epi8` nor `srli_epi16` exists on the AVX2 tier. +**Methods needed in `simd_avx2.rs`:** +- `U8x32::unpacklo_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpacklo_epi8`) +- `U8x32::unpackhi_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpackhi_epi8`) +- `U8x32::srli_epi16(self, imm: i32) -> U8x32` (maps to `_mm256_srli_epi16`) +- Or equivalently: `U8x32::and(self, mask: U8x32) -> U8x32` (maps to `_mm256_and_si256`) + +#### `palette_codec.rs` — `bedrock_reorder_xzy_avx512` (line 501) + +**Problem:** Scalar triple-loop permutation using `get_unchecked`, zero SIMD. +**Correct polyfill:** Real AVX-512 version would use `U16x32::gather` with computed indices. No gather primitive exists in `crate::simd` for `u16`. +**Honest replacement plan:** Delete the function; route `bedrock_reorder_xzy` directly to the scalar path. Add comment: `// AVX-512 gather on u16 requires widening to u32; not yet in polyfill.` +**Methods needed (if implemented):** +- `U32x16::gather_u16(base: *const u16, vindex: U32x16) -> U32x16` — not present; would wrap `_mm512_i32gather_epi32` with 2-byte scale. + +#### `aabb.rs` — `aabb_intersect_batch_sse41` (line 241) + +**Problem:** Scalar per-candidate loop, AUTOVEC confirmed: zero SSE4.1 instructions emitted. +**The `aabb_expand_batch_sse2` function IS REAL-AUTOVEC** (SSE2 feature hint causes `addps`/`subps` emission); SSE4.1 hint on the intersection function does NOT produce `blendvps` or `cmpps`. +**Correct polyfill:** Use `F32x4` (SSE2-width) comparison. No `F32x4` type exists in `crate::simd`. Alternatively, use `F32x8` (AVX2) for 2-candidate-at-once processing, or simply rename to `aabb_intersect_batch_scalar_hint` and document the annotation as a scheduling hint only. + +**Methods needed in `simd_avx2.rs` (for real SSE4.1 replacement):** +- `F32x4::from_array([f32; 4]) -> F32x4` — type does not exist +- OR accept that 1-candidate-at-a-time is scalar-only and rename the function honestly. 
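+
+To make the foundational gap concrete, a minimal sketch of the proposed AVX2-tier type (every name here is a proposal from this plan, not existing code; the intrinsic mappings match the table below):
+
+```rust
+#[cfg(target_arch = "x86_64")]
+mod u8x32_sketch {
+    use core::arch::x86_64::*;
+
+    /// Proposed 256-bit byte vector for the AVX2 tier.
+    #[derive(Clone, Copy)]
+    pub struct U8x32(__m256i);
+
+    impl U8x32 {
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub unsafe fn splat(v: u8) -> Self {
+            Self(_mm256_set1_epi8(v as i8))
+        }
+
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub unsafe fn from_slice(s: &[u8]) -> Self {
+            debug_assert!(s.len() >= 32);
+            // loadu: no alignment requirement on the source.
+            Self(_mm256_loadu_si256(s.as_ptr() as *const __m256i))
+        }
+
+        /// Bit i of the result is set iff self[i] == other[i].
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub unsafe fn cmpeq_mask(self, other: Self) -> u32 {
+            _mm256_movemask_epi8(_mm256_cmpeq_epi8(self.0, other.0)) as u32
+        }
+    }
+}
+```
+
+With this type in place, `byte_find_all_avx2` reduces to splat-needle → `cmpeq_mask` per 32-byte window → walk the set bits via `trailing_zeros`, and the `#[target_feature] unsafe fn` shape matches the gating discipline round 1 required for the AVX-512 tier.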
+ +--- + +### Polyfill Methods Needed in `simd_avx2.rs` (and scalar fallback) + +To make the above replacements fully feasible, these methods must be added: + +| Method | Type | Wraps (AVX2) | Scalar fallback | +|--------|------|--------------|-----------------| +| `U8x32::splat(v: u8)` | `simd_avx2.rs` | `_mm256_set1_epi8` | element-wise fill | +| `U8x32::from_slice(s: &[u8])` | `simd_avx2.rs` | `_mm256_loadu_si256` | copy 32 bytes | +| `U8x32::cmpeq_mask(self, other: U8x32) -> u32` | `simd_avx2.rs` | `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8` | `element-wise == as bitmask` | +| `U8x32::unpacklo_epi8(self, other: U8x32)` | `simd_avx2.rs` | `_mm256_unpacklo_epi8` | interleave lo halves | +| `U8x32::unpackhi_epi8(self, other: U8x32)` | `simd_avx2.rs` | `_mm256_unpackhi_epi8` | interleave hi halves | +| `U8x32::and(self, mask: U8x32)` | `simd_avx2.rs` | `_mm256_and_si256` | element-wise `&` | +| `U8x32::srli_epi16(self, imm: i32)` | `simd_avx2.rs` | `_mm256_srli_epi16` | element-wise `>> imm` | +| `U16x32::shr_epi16(self, shift: u32)` | scalar in `simd.rs` | already in `simd_avx512.rs:~1275` | element-wise `>> shift` | + +The `U8x32` type itself (the 256-bit byte vector) is entirely absent from `simd_avx2.rs` — all 7 methods above require first creating the type. This is the foundational gap for the AVX2-tier byte scan and nibble unpack paths. + +--- + +### Key Finding: `aabb_expand_batch_sse2` is REAL-AUTOVEC + +This function was previously listed as cosmetic by earlier agents. ASM confirms otherwise: the SSE2 feature annotation on the `[f32; 3] min/max subtract+add` loop causes LLVM to emit `movups`/`subps`/`addps`/`shufps` on XMM registers. Without the annotation, the same code compiles to scalar. This one function in `aabb.rs` is a legitimate use of `#[target_feature]` as an LLVM autovectorization hint. Do not remove it. + + +## 2026-05-13T19:35 — agent #10 audit-color (sonnet) [backfilled by main] + +**Files:** bevy_pbr/atmosphere/{resources,environment}.rs + +light_probe/generate.rs + ssao/mod.rs + bevy_image/{image,ktx2}.rs +**Verdict:** **0 of 10 sites worth converting.** All NOT-WORTH. + +Root causes: +1. All atmosphere / light-probe / SSAO f16 textures are GPU-only — CPU only + sets the wgpu `TextureFormat` descriptor. GPU compute shaders fill them. +2. `Image::convert` does NOT support `Rgba16Float` as a target (returns + `None` at image.rs:1550). No bulk f32→f16 path exists today. +3. `set_color_at` / `get_color_at` are single-pixel-per-call APIs. Only + caller is `bevy_sprite/picking_backend.rs` (1 px per pointer event). +4. KTX2 copies half-float bytes verbatim — no decode loop. + +The "500-20000× BF16 batch" claim from ndarray's `f32_to_bf16_batch_rne` +docs is real but unreachable in Bevy as-shipped. The Bevy CPU never +touches f16/bf16 data in bulk. + +**Latent opportunity (not in codebase today):** if `Image::convert` were +extended to support `Rgba16Float` as a destination, a bulk +Rgba8Unorm → Rgba16Float path would touch W·H·4 f32→f16 values (33M at +4K) — genuine `cast_f32_to_f16_batch` candidate. Would have to ship the +Image::convert extension AND the SIMD path together. + + +## 2026-05-13T19:45 — agent #5 plugin-tests (sonnet) [backfilled by main] + +**Files:** `bevy/examples/ndarray_graph_plugin_tests.rs` (308 lines) + +Cargo.toml `[[example]]` entry +**Status:** ALL 5 TESTS PASS (dual mode: `cargo run` exits nonzero on +failure, `cargo test --example` also works) + +Tests: +1. 
plugin_initializes_global_renderer_resource — `GraphRenderer` resource
+   present after plugin build; `GLOBAL_RENDERER.tick_count() == 0`
+2. startup_seeds_nodes_and_edges — front.len=2, edges.len=1 after first
+   app.update()
+3. tick_advances_position_via_integrate_simd — position 10.0 → 10.016666
+   (= 1.0 * DT_60 + 10.0, exact). Confirms F32x16::mul_add polyfill ran
+4. compose_neo4j_emits_pixels_to_framebuffer — 106 non-zero bytes in
+   128×128 buffer (threshold=50)
+5. polyfill_runtime_tier_matches_expectation — confirms avx512f=true
+   AND avx2=true on Sapphire Rapids; PREFERRED_F32_LANES=8 (the smoke
+   test's catch — compile-time AVX2 path on AVX-512 hardware)
+
+**Duplication risk:** the test file defines `NdarrayGraphPlugin` + `GraphRenderer`
+INLINE because agent #5 ran in parallel with agent #1 and couldn't import.
+Main thread will consolidate after fleet completion: either (a) test file
+imports from agent #1's plugin file, or (b) move the plugin types into a
+shared `examples/ndarray_graph_lib.rs` module that both import.
+
+
+## 2026-05-13T19:55 — agent #8 audit-skin (sonnet) [backfilled by main]
+
+**File:** `bevy_pbr/src/render/skin.rs` (515 lines)
+**Verdict:** **NOT-WORTH**
+
+Bevy's skinning is GPU-side WGSL. `skin.rs` is a CPU staging step that
+computes one final `Mat4` per joint and writes it to a wgpu buffer for
+upload. Four candidate hot paths:
+
+1. `extract_joints_for_skin` (L399-413) — per-frame joint matrix update.
+   ECS change-detection gate at L406 → irregular skip pattern. Can't
+   batch for GEMM. M=N=K=4 GEMM is overhead-dominated anyway.
+2. `add_skin` (L452-474) — initial population on visibility change.
+   Contiguous loop, no skip — the ONLY uninterrupted math path. But
+   fires ~0 times/sec in stable scenes. Cold path.
+3. `prepare_skins` (L176-244) — pure DMA via `bytemuck::must_cast_slice`.
+   No arithmetic.
+4. Per-vertex weighted blend — **not in this file**. GPU-side WGSL.
+
+Numbers: MAX_JOINTS=256, full-rig scalar cost ~16 µs/mesh/frame. AVX-512
+at 8× would save 14 µs/mesh/frame. GPU skinning noise floor is 0.5-2 ms.
+SIMD savings disappear below GPU baseline.
+
+**ndarray API surface needed: NONE.** Skin is not a SIMD-polyfill
+integration candidate. The performance levers are GPU shader
+optimization + wgpu buffer bandwidth — outside ndarray's scope.
+
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index ccac38f0..8eac19a0 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -184,7 +184,16 @@ jobs:
     - run: ./scripts/miri-tests.sh

   cross_test:
-    #if: ${{ github.event_name == 'merge_group' }}
+    # Gated on merge_group only — cross-compile via docker (cross-rs) for
+    # s390x / i686 is slow, flaky in the s390x docker image's toolchain
+    # resolution (rust-toolchain.toml's 1.94.1 pin doesn't resolve cleanly
+    # inside the s390x cross container), and any breakage it would surface
+    # is reliably caught by the `tests/{stable,beta,1.94.0}` jobs on every
+    # PR push. Reserve cross validation for the merge queue, where it can
+    # fail loudly without gating individual PRs on infra flakiness. The
+    # commented `if:` was the original intent (per the pre-existing
+    # comment); uncommented per the PR #143 codex thread that surfaced this flakiness consistently.
+ if: ${{ github.event_name == 'merge_group' }} runs-on: ubuntu-latest strategy: matrix: diff --git a/src/hpc/simd_caps.rs b/src/hpc/simd_caps.rs index 28279630..2789ba88 100644 --- a/src/hpc/simd_caps.rs +++ b/src/hpc/simd_caps.rs @@ -22,7 +22,14 @@ use std::sync::LazyLock; /// Pi Zero 2 W / Pi 3 (A53, v8.0): neon only /// Pi 4 (A72, v8.0): neon only (but 2× throughput) /// Pi 5 (A76, v8.2): neon + dotprod + fp16 + aes + sha2 +/// +/// `#[non_exhaustive]` per codex P2 on PR #143: future capability fields +/// can be added without source-breaking downstream crates that construct +/// `SimdCaps` directly via struct literal (e.g. mocks, tests, custom +/// capability values). Downstream code must use `simd_caps()` or the +/// public constructor instead of struct-literal init. #[derive(Debug, Clone, Copy)] +#[non_exhaustive] pub struct SimdCaps { // ── x86_64 ── /// AVX2 (256-bit integer/FP SIMD). @@ -49,6 +56,21 @@ pub struct SimdCaps { /// Skylake-X / Cascade Lake / Ice Lake-SP — calling VBMI intrinsics on /// those CPUs SIGILLs even though `avx512f` is true. pub avx512vbmi: bool, + /// AMX-TILE: tile register file present (CPUID.07H.0H:EDX bit 24). + /// Sapphire Rapids, Granite Rapids, Meteor Lake, Arrow Lake. + pub amx_tile: bool, + /// AMX-INT8: `TDPBUSD` u8×i8→i32 tile dot product (CPUID.07H.0H:EDX bit 25). + pub amx_int8: bool, + /// AMX-BF16: `TDPBF16PS` BF16×BF16→f32 tile dot product (CPUID.07H.0H:EDX bit 22). + pub amx_bf16: bool, + /// AVX-512 BF16: `VCVTNE2PS2BF16` / `VDPBF16PS` 512-bit BF16 math + /// (`is_x86_feature_detected!("avx512bf16")`). + /// Present on Cooper Lake, Sapphire Rapids, Zen 4. + pub avx512bf16: bool, + /// AVX-VNNI-INT8: 256-bit `VPDPBSSD`/`VPDPBUUD` (non-AVX-512) VNNI + /// (`is_x86_feature_detected!("avxvnniint8")`). + /// Present on Arrow Lake, Lunar Lake, NUC 14 (Meteor Lake-H). + pub avxvnniint8: bool, // ── aarch64 (ARM) ── /// NEON 128-bit SIMD (mandatory on aarch64, always true). @@ -81,6 +103,14 @@ impl SimdCaps { /// Detect CPU capabilities at runtime. #[cfg(target_arch = "x86_64")] fn detect() -> Self { + // `__cpuid_count` is safe on x86_64 (Rust 1.87+): CPUID is always + // available on x86_64 (guaranteed by the ABI) and has no side effects + // beyond reading CPU registers. 
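+        // CPUID leaf 7, sub-leaf 0: the AMX bits live in EDX
+        // (bit 22 = BF16, bit 24 = TILE, bit 25 = INT8).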
+ let cpuid7 = core::arch::x86_64::__cpuid_count(7, 0); + let amx_tile = (cpuid7.edx >> 24) & 1 == 1; + let amx_int8 = (cpuid7.edx >> 25) & 1 == 1; + let amx_bf16 = (cpuid7.edx >> 22) & 1 == 1; + Self { avx2: is_x86_feature_detected!("avx2"), avx512f: is_x86_feature_detected!("avx512f"), @@ -92,6 +122,11 @@ impl SimdCaps { fma: is_x86_feature_detected!("fma"), avx512vnni: is_x86_feature_detected!("avx512vnni"), avx512vbmi: is_x86_feature_detected!("avx512vbmi"), + amx_tile, + amx_int8, + amx_bf16, + avx512bf16: is_x86_feature_detected!("avx512bf16"), + avxvnniint8: is_x86_feature_detected!("avxvnniint8"), // ARM fields: all false on x86 neon: false, asimd_dotprod: false, @@ -119,6 +154,11 @@ impl SimdCaps { fma: false, avx512vnni: false, avx512vbmi: false, + amx_tile: false, + amx_int8: false, + amx_bf16: false, + avx512bf16: false, + avxvnniint8: false, // ARM fields: runtime detection neon: true, // mandatory on aarch64 asimd_dotprod: std::arch::is_aarch64_feature_detected!("dotprod"), @@ -143,6 +183,11 @@ impl SimdCaps { fma: false, avx512vnni: false, avx512vbmi: false, + amx_tile: false, + amx_int8: false, + amx_bf16: false, + avx512bf16: false, + avxvnniint8: false, neon: false, asimd_dotprod: false, fp16: false, @@ -171,6 +216,32 @@ impl SimdCaps { self.avx512f && self.avx512vnni } + /// True if AMX is available at the CPUID level (`amx_tile && amx_int8`). + /// + /// Note: CPUID presence does **not** guarantee OS enablement. The full + /// OS-level check (XCR0 bits 17+18, prctl ARCH_REQ_XCOMP_PERM) lives in + /// `simd_amx::amx_available()`. This method is a lightweight CPUID-only + /// probe suitable for capability reporting and coarse dispatch decisions. + #[inline(always)] + pub fn has_amx(self) -> bool { + self.amx_tile && self.amx_int8 + } + + /// True if AVX-512 BF16 is available (`VCVTNE2PS2BF16` / `VDPBF16PS`). + /// Present on Cooper Lake, Sapphire Rapids, Zen 4. + #[inline(always)] + pub fn has_avx512_bf16(self) -> bool { + self.avx512bf16 + } + + /// True if AVX-VNNI-INT8 (256-bit `VPDPBSSD`/`VPDPBUUD`) is available. + /// Present on Arrow Lake, Lunar Lake, NUC 14 (Meteor Lake-H). + /// This is the non-AVX-512 VNNI path — does NOT require `avx512f`. + #[inline(always)] + pub fn has_avxvnniint8(self) -> bool { + self.avxvnniint8 + } + // ── ARM convenience methods ── /// True if running on aarch64 with NEON (always true on aarch64). @@ -298,6 +369,12 @@ mod tests { let _ = caps.avx2; let _ = caps.avx512f; let _ = caps.neon; + // New AMX / BF16 / VNNI fields must also be accessible without panic. + let _ = caps.amx_tile; + let _ = caps.amx_int8; + let _ = caps.amx_bf16; + let _ = caps.avx512bf16; + let _ = caps.avxvnniint8; } #[test] @@ -335,6 +412,58 @@ mod tests { let _ = caps.has_crypto(); } + #[test] + fn new_amx_bf16_vnni_convenience_methods_do_not_panic() { + let caps = simd_caps(); + let amx = caps.has_amx(); + let bf16 = caps.has_avx512_bf16(); + let vnni = caps.has_avxvnniint8(); + // Semantic invariants: has_amx() requires both tile and int8. + assert_eq!(amx, caps.amx_tile && caps.amx_int8); + // has_avx512_bf16() mirrors the raw field. + assert_eq!(bf16, caps.avx512bf16); + // has_avxvnniint8() mirrors the raw field. + assert_eq!(vnni, caps.avxvnniint8); + } + + #[test] + fn amx_fields_false_on_non_x86() { + // On non-x86_64, all AMX and BF16 fields must be false because + // the detect() fallback / aarch64 branch sets them to false. 
+ #[cfg(not(target_arch = "x86_64"))] + { + let caps = simd_caps(); + assert!(!caps.amx_tile); + assert!(!caps.amx_int8); + assert!(!caps.amx_bf16); + assert!(!caps.avx512bf16); + assert!(!caps.avxvnniint8); + assert!(!caps.has_amx()); + assert!(!caps.has_avx512_bf16()); + assert!(!caps.has_avxvnniint8()); + } + // On x86_64 we can only check that the call doesn't panic; the + // actual values depend on the hardware running the test. + #[cfg(target_arch = "x86_64")] + { + let caps = simd_caps(); + let _ = caps.has_amx(); + let _ = caps.has_avx512_bf16(); + let _ = caps.has_avxvnniint8(); + } + } + + #[test] + fn simd_caps_deterministic_new_fields() { + let a = simd_caps(); + let b = simd_caps(); + assert_eq!(a.amx_tile, b.amx_tile); + assert_eq!(a.amx_int8, b.amx_int8); + assert_eq!(a.amx_bf16, b.amx_bf16); + assert_eq!(a.avx512bf16, b.avx512bf16); + assert_eq!(a.avxvnniint8, b.avxvnniint8); + } + #[test] fn arm_profile_consistent() { let caps = simd_caps();