diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md index 375cf1c8..27ac38ac 100644 --- a/.claude/board/AGENT_LOG.md +++ b/.claude/board/AGENT_LOG.md @@ -582,3 +582,599 @@ The fleet largely served (1) by not touching it, and (2) by surfacing the genuin - "Fix" project_ortho UB — there is no UB; the meta is wrong. **Verdict on the meta:** Agent M did good consolidation work but stack-ranked credibility issues over correctness. The single biggest meta mistake is the project_ortho UB claim (factually wrong), followed by the unmeasured "8-12 MB heap/frame" number (fabricated), followed by the ship-blocker stack (six of them, none of which actually block ship today). A user reading the meta in good faith would walk away thinking the repo is on fire. It isn't. It has 5 real soundness bugs (items 1-3 above), one architectural smell (AMX prctl scope), and a pile of cosmetic / busywork findings dressed up as P0/P1. + + +# ═══════════════════════════════════════════════════════════════════ +# Round 2 — bevy plugin delivery + bevy upstream SIMD audit +# ═══════════════════════════════════════════════════════════════════ + +> **Branches:** +> - bevy: `claude/ndarray-simd-review-S0zXK` +> - ndarray: `claude/ndarray-simd-review-S0zXK` (PR #142 merged on master) +> **Goal:** deliver an actual Bevy plugin using ndarray's SIMD polyfill +> for graph nodes/edges rendering, plus inventory the bevy upstream SIMD +> rewrite opportunities. +> **Fleet:** 12 Sonnet + 1 Sonnet meta. Same A2A pattern. + +## Fleet manifest (round 2) + +| # | Agent | Scope | Output | +|---|---|---|---| +| 1 | plugin-core | `bevy/examples/ndarray_graph_plugin.rs` + Cargo.toml [[example]] | CODE | +| 2 | plugin-palette | `bevy/examples/ndarray_graph_palette.rs` | CODE | +| 3 | plugin-ci | `bevy/.github/workflows/ndarray-smoke.yml` | CODE | +| 4 | plugin-readme | `bevy/examples/README_NDARRAY_PLUGIN.md` | CODE | +| 5 | plugin-tests | `bevy/examples/ndarray_graph_plugin_tests.rs` | CODE | +| 6 | simd-caps-amx | `ndarray/src/hpc/simd_caps.rs` — add AMX/VNNI fields | CODE (additive) | +| 7 | audit-frustum | `bevy/crates/bevy_camera/src/primitives.rs` | AUDIT | +| 8 | audit-skin | `bevy/crates/bevy_pbr/src/render/skin.rs` | AUDIT | +| 9 | audit-mesh | `bevy/crates/bevy_mesh/src/*.rs` | AUDIT | +| 10 | audit-color | `bevy/crates/bevy_pbr/atmosphere/*.rs` | AUDIT | +| 11 | audit-cosmetic | `ndarray/src/hpc/{byte_scan,palette_codec,aabb}.rs` | AUDIT | +| 12 | audit-amx-routing | `ndarray/src/simd_amx.rs` — `is_x86_feature_detected!` sites | AUDIT | +| M | meta-r2 | synthesis | (sonnet) | + +## Round 2 entries (newest first) + + +## 2026-05-13T00:00 — agent #2 plugin-palette (sonnet) + +**File:** `bevy/examples/ndarray_graph_palette.rs` (100 lines) +**Verdict:** DONE + +**Deliverables:** +- `PALETTE_LUT: [[u8; 4]; 16]` — Neo4j/Palantir-inspired 16-color RGBA table: dark navy (#0D1B2A) at index 0, gradient through cobalt → royal → sky → steel → cyan → ice-blue → white at indices 1–11, then pale-amber → warm-amber → hot-orange → crimson at indices 12–15. +- `pub fn blit_u8_palette_to_rgba(palette_pixels: &[u8], rgba_out: &mut [u8])` — scalar `for` loop with `& 0x0F` masking and `copy_from_slice`; no allocation; `debug_assert!` guards buffer length. +- `#[test] palette_lut_roundtrip` — 64-byte input → 256-byte RGBA; asserts first pixel = index 0, last pixel = index 15, spot-check index 11 (white), alpha=255 for all. + +**Syntax check:** `rustup run 1.95.0 rustc --edition 2024 --crate-type lib` → clean (zero warnings, zero errors). 
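+
+A minimal sketch of the blit shape described above (reconstructed from the deliverables list; the shipped file is authoritative):
+
+```rust
+/// Expand one palette-index byte per pixel into RGBA8 via the 16-entry LUT.
+/// `rgba_out.len()` must equal `4 * palette_pixels.len()`.
+pub fn blit_u8_palette_to_rgba(palette_pixels: &[u8], rgba_out: &mut [u8]) {
+    debug_assert_eq!(rgba_out.len(), palette_pixels.len() * 4);
+    for (i, &px) in palette_pixels.iter().enumerate() {
+        // `& 0x0F` clamps any out-of-range byte into the 16-entry table.
+        let color = &PALETTE_LUT[(px & 0x0F) as usize];
+        rgba_out[i * 4..i * 4 + 4].copy_from_slice(color);
+    }
+}
+```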
+**Test run:** `rustc --test` → `1 passed; 0 failed`.
+
+**SIMD note:** `U8x64::permute_bytes` (`_mm512_permutexvar_epi8` on VBMI) would handle 64 indices per instruction. Deferred per scope — requires the `#[target_feature(enable = "avx512vbmi")]` + `unsafe fn` gate which round-1 reviewer (agent R) flagged as a prerequisite fix in `simd_avx512.rs:689` before that path is safe to call.
+
+
+## 2026-05-13T15:00 — agent #9 audit-mesh (sonnet)
+
+**Scope:** `bevy/crates/bevy_mesh/src/*.rs` — per-vertex loop SIMD opportunities
+**Verdict:** SURVEY (read-only; no code changes)
+
+---
+
+### 1. `mesh.rs:1904–1950` — `try_transform_by` — positions/normals/tangents transform
+
+**Loop shape:**
+```rust
+positions.iter_mut().for_each(|pos| *pos = transform.transform_point(Vec3::from_slice(pos)).to_array());
+normals.iter_mut().for_each(|normal| { *normal = (rotation * scale_normal(...)).to_array(); });
+tangents.iter_mut().for_each(|tangent| { let scaled = Vec3::from_slice(tangent) * scale; ... });
+```
+**Tag:** SETUP-ONCE (called at mesh load / on construction of transformed meshes; not per-frame)
+
+**SIMD candidate:** Each position is 3 floats; loading 5 positions fills 15 floats, nearly one F32x16 register. Interleave [x0,y0,z0, x1,y1,z1, …] into 16-wide tiles, apply the affine matrix as four `F32x16::mul_add` calls (one per matrix row), scatter back. Rotation quaternion → matrix is done once per `transform_by` call, amortized to zero.
+
+**Estimated benefit:** At load time for a 1M-vertex mesh: ~1M vertices × 3 attribute arrays = 3M Vec3 transforms, ≈9 scalar muls each (~27M muls) → with F32x16 batching, ≈562K 16-lane iterations instead of 3M scalar ones. Relevant for glTF batch loading, not per-frame rendering. Benefit = **asset-import speed, not frame-time**.
+
+---
+
+### 2. `mesh.rs:1352–1357` — `try_compute_flat_normals` — triangle normal generation
+
+**Loop shape:**
+```rust
+let normals: Vec<_> = positions
+    .as_chunks().0.iter()
+    .flat_map(|&[a, b, c]| [triangle_normal(a, b, c); 3])
+    .collect();
+```
+`triangle_normal` = `(b-a).cross(c-a).normalize_or_zero()` — 6 subtractions (the two edge vectors), 1 cross product (6 muls + 3 subs), 1 sqrt (normalize).
+**Tag:** SETUP-ONCE (glTF load / MeshBuilder::build)
+
+**SIMD candidate:** The `as_chunks::<3>()` disjoint triples are already the natural shape here (`array_windows` would overlap triangles). Batch 5 triangles (5 × [a,b,c] = 45 floats ≈ 3 × F32x16): compute delta-vectors for all 5 triangles in two F32x16 regs, cross-product via shuffle, rsqrt approximation via `F32x16::recip_sqrt` (if available). The `.normalize_or_zero()` branch is the only complication (NaN-guard). **Speedup potential: high, roughly 6× vs the scalar cross-product path.** However, all flat-normal paths fire exactly once at load. The real caller is `compute_flat_normals` in the glTF importer — not a frame-budget concern.
+
+---
+
+### 3. `mesh.rs:1607–1622` — `try_compute_custom_smooth_normals` — per-triangle accumulation + normalize pass
+
+**Loop shape:**
+```rust
+// accumulate phase:
+vec.as_chunks().0.iter().for_each(|&chunk| {
+    per_triangle(chunk.map(|i| i as usize), positions, &mut normals);
+});
+// normalize pass:
+for normal in &mut normals {
+    *normal = normal.try_normalize().unwrap_or(Vec3::ZERO);
+}
+```
+**Tag:** SETUP-ONCE (glTF load)
+
+**SIMD candidate for normalize pass:** The normalize pass is N sequential scalar normalizations (1 `sqrt` + 3 divisions each). With F32x16: load 16 floats (≈5 normals), compute squared magnitude via `mul_add` + horizontal reduce within triplets, `rsqrt` approximation, multiply. Roughly 5× throughput vs scalar.
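+
+A sketch of that pass's shape, assuming `F32x16` exposes `from_array`/`to_array` and arithmetic operator overloads (method names follow the rest of this log; the per-triplet reduce deliberately stays scalar):
+
+```rust
+use crate::simd::F32x16; // assumed path, as used elsewhere in this log
+
+/// Normalize an AoS [f32; 3] buffer, 5 normals (15 of 16 lanes) per tile.
+/// Element-wise squaring and the final scale are SIMD; the per-triplet
+/// horizontal sum + rsqrt stays scalar, which is the AoS cost noted below.
+fn normalize_pass(normals: &mut [[f32; 3]]) {
+    let tiles = normals.len() / 5;
+    for t in 0..tiles {
+        let mut tile = [0.0f32; 16];
+        for j in 0..5 {
+            tile[j * 3..j * 3 + 3].copy_from_slice(&normals[t * 5 + j]);
+        }
+        let v = F32x16::from_array(tile);
+        let sq = (v * v).to_array(); // one SIMD multiply squares all lanes
+        let mut scale = [0.0f32; 16];
+        for j in 0..5 {
+            let m2 = sq[j * 3] + sq[j * 3 + 1] + sq[j * 3 + 2];
+            // Zero-length guard, simplified vs try_normalize's epsilon check.
+            let s = if m2 > 0.0 { m2.sqrt().recip() } else { 0.0 };
+            scale[j * 3..j * 3 + 3].copy_from_slice(&[s; 3]);
+        }
+        let out = (v * F32x16::from_array(scale)).to_array();
+        for j in 0..5 {
+            normals[t * 5 + j].copy_from_slice(&out[j * 3..j * 3 + 3]);
+        }
+    }
+    for n in &mut normals[tiles * 5..] {
+        // Scalar tail for the last < 5 normals.
+        let m2 = n[0] * n[0] + n[1] * n[1] + n[2] * n[2];
+        let s = if m2 > 0.0 { m2.sqrt().recip() } else { 0.0 };
+        for c in n.iter_mut() {
+            *c *= s;
+        }
+    }
+}
+```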
However, the triplet layout ([f32;3]) wastes 1/4 of a 4-wide load; AoS → SoA transposition overhead is non-trivial. **Practical benefit: marginal** unless the mesh is extremely large (>100K verts). Confirm: LOAD-TIME only. + +--- + +### 4. `mesh.rs:2178–2194` — `try_normalize_joint_weights` — 4-wide weight normalization + +**Loop shape:** +```rust +for weights in joints.iter_mut() { // Vec<[f32; 4]> + weights.iter_mut().for_each(|w| *w = w.max(0.0)); + let sum: f32 = weights.iter().sum(); + if sum != 0.0 { + let recip = sum.recip(); + for weight in weights.iter_mut() { *weight *= recip; } + } +} +``` +**Tag:** SETUP-ONCE (skinned mesh loading / GLTF importer) + +**SIMD candidate:** Each vertex has 4 weights = [f32; 4], exactly half a SIMD8 lane. With F32x16 we can process 4 vertices at once (16 floats). The ops are: `max(0)` vectorized as `F32x16::max`, horizontal `reduce_sum` per-group-of-4 for the sum, `recip` + broadcast, `mul`. The conditional `sum==0` clamp (set w[0]=1.0) breaks vectorization unless handled with a blend/select mask. **Feasible, moderate gain.** LOAD-TIME only for typical usage; conceivably called per-frame if weights are animated (blend-shape skinning), making this the **most legitimate F32x16 candidate** in the file if skinning is per-frame. + +**Tag (conditional):** LOAD-TIME for static skins; PER-FRAME if the engine calls this after runtime weight blending (not verified in bevy_pbr skin path). + +--- + +### 5. `mesh.rs:2306–2312` — AABB extraction pass in `extract_and_cache_data` + +**Loop shape:** +```rust +let mut iter = position_values.iter().map(|p| Vec3::from_slice(p)); +let mut min = iter.next().unwrap(); +let mut max = min; +for v in iter { + min = Vec3::min(min, v); + max = Vec3::max(max, v); +} +``` +**Tag:** SETUP-ONCE (called once per mesh-asset extraction to RenderWorld) + +**SIMD candidate:** Classic reduction: load 16 floats (≈5 positions), track running SIMD min/max across x/y/z channels separately, final horizontal reduce. With F32x16, throughput is ~10× scalar for large meshes. AoS layout ([f32;3]) complicates channel separation but `array_chunks::<3>()` + interleave trick works. **Benefit: asset-import speed.** For 1M-vert meshes at batch load time, this saves 10–20ms per mesh — worthwhile. + +--- + +### 6. `mikktspace.rs:27–59` — Mikktspace tangent-space generation + +**Loop shape (via `bevy_mikktspace::generate_tangents`):** +The wrapper in `mikktspace.rs` calls `bevy_mikktspace::generate_tangents(&mut mikktspace_mesh)`, which is the external `bevy_mikktspace` crate implementing the Mikkt algorithm. The hot loop is inside that crate, not directly in `bevy_mesh`. The post-loop handedness flip at line 127–129: +```rust +for tangent in &mut mikktspace_mesh.tangents { + tangent[3] = -tangent[3]; +} +``` +…is a single-float sign-flip per tangent (trivial, embarrassingly parallel). +**Tag:** SETUP-ONCE (glTF / asset load) + +**SIMD candidate:** The handedness flip touches only index [3] of each [f32;4]. With F32x16 and a negation mask `[+1,+1,+1,-1, +1,+1,+1,-1,…]` repeating every 4 floats, we can flip 4 tangents per F32x16 iteration. Simple, ~10× throughput. But this loop is at most ~1M iterations at load time — total wall-clock cost is sub-millisecond. **Benefit: negligible.** The real Mikkt hotloop is inside `bevy_mikktspace` (out of scope for `bevy_mesh` patches). + +--- + +### 7. 
`primitives/dim3/sphere.rs:187–200` — UV sphere vertex generation + +**Loop shape:** +```rust +for i in 0..stacks + 1 { + let xy = radius * cos(stack_angle); + let z = radius * sin(stack_angle); + for j in 0..sectors + 1 { + let x = xy * cos(sector_angle); + let y = xy * sin(sector_angle); + vertices.push([x, y, z]); + normals.push([x * length_inv, y * length_inv, z * length_inv]); + } +} +``` +**Tag:** SETUP-ONCE (mesh construction) + +**SIMD candidate:** The inner sector loop computes `cos(j*step)` and `sin(j*step)` per sector. These transcendentals dominate. A precomputed `cos_table[j]` / `sin_table[j]` + `F32x16::mul_add` for position/normal could yield real gains, but sin/cos table lookup is itself a memory fetch pattern. The `vertices.push()` inside the loop prevents batching without a two-pass approach (pre-allocate Vec with `with_capacity`, fill by index). **Benefit at SETUP-ONCE: marginal for typical sphere resolutions (32×18 = 576 verts).** At high-res spheres (2048×1024 ≈ 2M verts), worth revisiting. + +--- + +### 8. `primitives/dim3/torus.rs:94–122` — Torus vertex generation + +**Loop shape:** +```rust +for segment in 0..=self.major_resolution { + for side in 0..=self.minor_resolution { + let (sin_theta, cos_theta) = ops::sin_cos(theta); + let (sin_phi, cos_phi) = ops::sin_cos(phi); + let radius = major + minor * cos_phi; + positions.push(position.into()); + normals.push(normal.into()); // normal = (position - center).normalize() + } +} +``` +**Tag:** SETUP-ONCE + +**SIMD candidate:** Same sin/cos table pattern as sphere. The `normalize()` call inside the loop (per-vertex normal) is the scalar hot path. Could precompute sin/cos tables for phi and theta separately, then compute all positions in a batch. Low priority — torus is rarely high-res. + +--- + +### 9. `primitives/dim3/cylinder.rs:187–192` — Cylinder anchor offset pass + +**Loop shape:** +```rust +CylinderAnchor::Top => positions.iter_mut().for_each(|p| p[1] -= half_height), +CylinderAnchor::Bottom => positions.iter_mut().for_each(|p| p[1] += half_height), +``` +**Tag:** SETUP-ONCE + +**SIMD candidate:** Iterate `positions.as_chunks_mut::<3>()`, load 5 positions (15 floats), apply constant add/sub to the Y component (index 1 of each triple). With F32x16 and a mask `[0,1,0, 0,1,0, 0,1,0, 0,1,0, 0,1,0, X]`, this is a single `F32x16::add` with a Y-mask per 5 vertices. **Trivially vectorizable, but setup-once and N is always small (≤ few thousand vertices for a cylinder).** Benefit: negligible. + +--- + +### 10. `conversions.rs:59–67` — `impl_from_into!` macro — `Vec` → `Vec<[f32;3]>` + +**Loop shape (macro-generated):** +```rust +let vec: Vec<_> = vec.into_iter().map(|t| t.into()).collect(); +``` +`Vec3::into() → [f32;3]` is a zero-overhead cast/transmute in practice, but the `collect()` forces a full allocation + copy. +**Tag:** LOAD-TIME (conversion at asset load / material setup) + +**SIMD candidate:** This is fundamentally a memcopy with ABI mismatch (Vec3 is repr(C) 12 bytes = [f32;3]). If `Vec3` is `#[repr(C)]` with layout `[f32;3]`, a `bytemuck::cast_vec` avoids the per-element `.into()` entirely — zero SIMD needed, zero copies. **The real win here is bytemuck, not F32x16.** Worth flagging but outside the SIMD scope. 
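+
+Assuming glam's `bytemuck` feature (which implements `Pod` for `Vec3`) and bytemuck's `extern_crate_alloc` feature, the zero-copy shape is roughly (a sketch, not a patch):
+
+```rust
+use bevy_math::Vec3; // re-exported glam type; path assumed
+
+/// Reinterpret Vec<Vec3> as Vec<[f32; 3]> without touching the elements.
+/// Size (12 B) and alignment (4 B) match, so no reallocation occurs.
+fn positions_to_arrays(v: Vec<Vec3>) -> Vec<[f32; 3]> {
+    bytemuck::allocation::cast_vec(v)
+}
+```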
+
+---
+
+### Summary table
+
+| # | File:Lines | Function | Loop shape | SIMD candidate | Tag | Benefit |
+|---|---|---|---|---|---|---|
+| 1 | mesh.rs:1916–1950 | `try_transform_by` | `iter_mut().for_each` on Float32x3/x4 | F32x16 affine batch (5 pos/iter) | SETUP-ONCE | Load-time, large meshes |
+| 2 | mesh.rs:1352–1357 | `try_compute_flat_normals` | `as_chunks().iter().flat_map` on triangles | `as_chunks::<3>()` + batch cross+normalize | SETUP-ONCE | glTF load batch |
+| 3 | mesh.rs:1618–1620 | `try_compute_custom_smooth_normals` (normalize pass) | `for normal in &mut normals` | F32x16 rsqrt batch | SETUP-ONCE | Marginal for <1M verts |
+| 4 | mesh.rs:2178–2194 | `try_normalize_joint_weights` | `for weights in joints.iter_mut()` on Float32x4 | F32x16 (4 verts/iter), blend mask for zero-sum guard | SETUP-ONCE (POSSIBLE PER-FRAME for animated skins) | High if per-frame |
+| 5 | mesh.rs:2306–2312 | `extract_and_cache_data` (AABB pass) | `for v in position_iter` min/max reduction | F32x16 running min/max, horizontal reduce | SETUP-ONCE | 10–20ms saved at batch load for 1M-vert mesh |
+| 6 | mikktspace.rs:127–129 | `generate_tangents_for_mesh` (handedness flip) | `for tangent in &mut tangents` | F32x16 negation mask on w-lane | SETUP-ONCE | Negligible (<1ms total) |
+| 7 | dim3/sphere.rs:187–200 | `SphereMeshBuilder::uv` | nested `for i/j` push | sin/cos table + F32x16::mul_add | SETUP-ONCE | Only for high-res spheres |
+| 8 | dim3/cylinder.rs:187–192 | `CylinderMeshBuilder::build` (anchor pass) | `iter_mut().for_each` scalar | F32x16 Y-lane add, stride-3 mask | SETUP-ONCE | Negligible |
+
+---
+
+### Honest call: SIMD ROI in bevy_mesh
+
+**All paths in bevy_mesh are LOAD-TIME or SETUP-ONCE**, not per-frame. The 1M vertex × 60fps = 60M ops/sec framing does not apply here — mesh geometry is built once and uploaded to the GPU. The SIMD win is **asset-import throughput**, not frame budget.
+
+Ranked by real impact:
+1. **AABB extraction pass** (mesh.rs:2306) — fires once per mesh asset extraction; for batch glTF loads (100 meshes × 100K verts), 10× SIMD win = tens of ms saved at startup.
+2. **Flat/smooth normal computation** (mesh.rs:1352, 1589) — fires at glTF load for every unweighted-normal mesh; batch cross-product and normalize benefit.
+3. **Joint weight normalization** (mesh.rs:2178) — if skinned mesh weight re-normalization is called per-frame after runtime blending, this becomes the only **PER-FRAME candidate** in the entire file. Needs confirmation from the bevy_pbr skin system.
+4. **`try_transform_by` position/normal/tangent transform** — fires on mesh builder chains; worthwhile for large procedural meshes.
+5. All others (sphere/torus/cylinder generator loops, handedness flip) — negligible; vertex counts are always small.
+
+**`mikktspace.rs` is a thin wrapper.** The actual Mikkt tangent-solving loop is in `bevy_mikktspace` (external crate, not in scope). Only the 1-line handedness flip is in-scope, and it is sub-microsecond.
+
+**`conversions.rs` has no hot loops** — it is trait-implementation glue (macro-generated From/TryFrom impls). The `impl_from_into!` `map(|t| t.into()).collect()` pattern is a bytemuck-transmute opportunity, not a SIMD opportunity.
+
+**`vertex.rs::VertexAttributeValues`** is a large enum with no compute loops — only serialization / byte-casting helpers. No hot per-vertex compute found.
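+
+To make the rank-1 candidate concrete, a minimal sketch of the AoS min/max reduction (F32x16 method names as used elsewhere in this log; the exact API is an assumption):
+
+```rust
+use crate::simd::F32x16; // assumed path
+
+/// AABB over an AoS position buffer, 5 vertices (15 of 16 lanes) per tile.
+/// Lane k (k < 15) always holds channel k % 3, so the two running SIMD
+/// accumulators stay channel-consistent; the spare 16th lane is padded
+/// with a real coordinate and ignored by the final fold.
+pub fn aabb_reduce(positions: &[[f32; 3]]) -> ([f32; 3], [f32; 3]) {
+    assert!(!positions.is_empty());
+    let pad = positions[0][0]; // filler for the unused 16th lane
+    let mut mins = F32x16::splat(f32::INFINITY);
+    let mut maxs = F32x16::splat(f32::NEG_INFINITY);
+    let mut i = 0;
+    while i + 5 <= positions.len() {
+        let mut tile = [pad; 16];
+        for (j, p) in positions[i..i + 5].iter().enumerate() {
+            tile[j * 3..j * 3 + 3].copy_from_slice(p);
+        }
+        let v = F32x16::from_array(tile);
+        mins = mins.simd_min(v);
+        maxs = maxs.simd_max(v);
+        i += 5;
+    }
+    // Horizontal fold: lane k contributes to channel k % 3.
+    let (mn, mx) = (mins.to_array(), maxs.to_array());
+    let mut min3 = [f32::INFINITY; 3];
+    let mut max3 = [f32::NEG_INFINITY; 3];
+    for k in 0..15 {
+        min3[k % 3] = min3[k % 3].min(mn[k]);
+        max3[k % 3] = max3[k % 3].max(mx[k]);
+    }
+    // Scalar tail for the final < 5 vertices.
+    for p in &positions[i..] {
+        for c in 0..3 {
+            min3[c] = min3[c].min(p[c]);
+            max3[c] = max3[c].max(p[c]);
+        }
+    }
+    (min3, max3)
+}
+```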
+ + +## 2026-05-13T19:00 — agent #2 plugin-palette (sonnet) [backfilled by main] + +**File:** `bevy/examples/ndarray_graph_palette.rs` (~100 LOC) +**Status:** COMPILES (rustc 1.95.0 --crate-type lib, zero warnings; 1 unit test passes) + +PALETTE_LUT `[[u8; 4]; 16]` hand-picked Neo4j/Palantir palette: dark navy +`#0D1B2A` (idx 0) → 10-stop blue-to-white gradient (idx 1-11) → pale-amber / +warm-amber / hot-orange / crimson hot-accent tier (idx 12-15). Alpha=255 all. + +`blit_u8_palette_to_rgba(palette_pixels, rgba_out)` — scalar `for` masking +`& 0x0F` + `copy_from_slice` from LUT. Zero alloc. `debug_assert!` length guard. +Note: SIMD via `U8x64::permute_bytes` deferred per round-1 finding (VBMI gate +now in master, but this caller is u8-LUT not byte-permute shape). + +Test `palette_lut_roundtrip`: 64-byte → 256 RGBA, checks first/last/idx 11, +asserts alpha=255 for every entry. + +## 2026-05-13T19:05 — agent #3 plugin-ci (sonnet) [backfilled by main] + +**File:** `bevy/.github/workflows/ndarray-smoke.yml` +**Status:** WRITTEN — cargo check only (no run, no xvfb, no artifacts) + +Triggers: push to `claude/**`, PR against `claude/**` + main/master. +Clones sibling ndarray into `../ndarray` using `${{ github.head_ref || +github.ref_name }}` with master fallback. Installs Bevy system deps +(libwayland-dev libasound2-dev libudev-dev) before any Rust step. +Pinned Rust 1.95.0 via dtolnay/rust-toolchain@1.95.0. +Two compile-check steps: `ndarray_simd_smoke` + `ndarray_graph_plugin`. +Stock ubuntu-latest = x86-64-v3 (no AVX-512 flags). No tokens, no secrets. + +## 2026-05-13T19:10 — agent #4 plugin-readme (sonnet) [backfilled by main] + +**File:** `bevy/examples/README_NDARRAY_PLUGIN.md` (196 lines) +**Status:** WRITTEN + +Sections: What this is / Build / Run / What it shows / Architecture ASCII +diagram (Bevy App → NdarrayGraphPlugin → Resource → +GLOBAL_RENDERER → Framebuffer → palette LUT → bevy::asset::Image → Sprite) / +Compile-time vs runtime tier (`PREFERRED_F32_LANES` vs `simd_caps().avx512f` +mismatch on Sapphire Rapids with x86-64-v3 build) / Companion files table / +Known limitations (rayon crossover, PaletteTier detect proxy, GLOBAL_RENDERER +fixed capacity 4096). + + +## 2026-05-13T00:30 — agent #1 plugin-core (sonnet) + +**Files:** examples/ndarray_graph_plugin.rs (274 lines), Cargo.toml [[example]] entry +**Status:** COMPILES +**Approach summary:** Wrote a full Bevy 0.19 plugin (`NdarrayGraphPlugin`) that wraps `Renderer::with_capacity(1024)` in a `GraphRenderer` Resource. A `Startup` chain seeds 64 circle-layout nodes + 80 edges (64 ring + 16 stride cross-links) into the back frame and swaps. Two ordered `Update` systems call `tick_renderer` (physics via `Renderer::tick`) then `render_to_framebuffer` (compose_neo4j → palette LUT expand → blit into long-lived Bevy Image). A long-lived `Framebuffer(512×512)` and `Image` are stored in a `RenderSurface` Resource to avoid per-frame allocation. +**Risks / TODOs:** +- `PALETTE_LUT` is inlined; swap for `ndarray_graph_palette::PALETTE_LUT` once agent #2 delivers `examples/ndarray_graph_palette.rs`. +- Plugin runs with `DefaultPlugins` (needs a window/GPU at runtime); for headless CI, gate `NdarrayGraphPlugin` behind a `#[cfg(not(headless))]` or swap to `MinimalPlugins` + custom render backend. +- `Renderer::tick` doesn't apply inter-node forces (Coulomb repulsion / spring attraction) — it only integrates existing velocities × damping. 
A force-accumulation pass would make the graph actually spring-like; agent #7 (renderer) should clarify whether `tick` or a separate `apply_uniform_force` call is the right hook. +- `TextureFormat::Rgba8Unorm` used (linear); switch to `Rgba8UnormSrgb` for perceptually-correct colors if the palette LUT is authored in sRGB. +**API surface used from crate::simd or hpc::*:** +- `ndarray::hpc::renderer::{Renderer, DT_60}` — double-buffered renderer, tick, read_front, write_back, swap +- `ndarray::hpc::renderer::RenderFrame` — positions/velocities/charges/len fields +- `ndarray::hpc::framebuffer::{Framebuffer, compose_neo4j}` — palette-indexed raster, Bresenham edges, dot sprites + +## 2026-05-13T19:20 — agent #12 audit-amx-routing (sonnet) [backfilled by main] + +**Scope:** `src/simd_amx.rs` (8 detection sites); brief scan of +`src/backend/native.rs` (2) and `src/hpc/bitwise.rs` (16). +**Verdict:** AUDIT COMPLETE — 7 of 8 sites foldable into SimdCaps; +1 (prctl) must stay standalone (per-thread OS state). + +**Foldable into SimdCaps (CPUID-level / system-wide):** +- L50: `__cpuid_count(7,0)` → AMX-TILE (EDX[24]) + AMX-INT8 (EDX[25]) → + agent #6 is adding `amx_tile` / `amx_int8` fields +- L58: `__cpuid(1)` OSXSAVE — inline precondition in `detect()`, no field +- L68: `_xgetbv(0)` — XCR0 bits 17+18 (TILECFG/TILEDATA), system-wide, + cacheable. Add `xcr0_tile_enabled: bool`. +- L121-124: duplicate CPUID in `amx_report()`, reads AMX-BF16 (EDX[22]) → + agent #6's `amx_bf16` field. After migration `amx_report()` reads + `simd_caps()` directly. +- L285: `is_x86_feature_detected!("avx512vnni")` in production hot path + `matvec_dispatch` → `simd_caps().avx512vnni` (field exists) +- L291: `is_x86_feature_detected!("avxvnniint8")` in production hot path → + `simd_caps().avxvnniint8` (agent #6 adding) +- L385: `is_x86_feature_detected!("avx512vnni")` in test eprintln (no + assertion) → replace or delete + +**Must stay standalone (per-thread OS state):** +- L81-107: raw `syscall` `prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA=18)`. + Linux grants this permission to the CALLING THREAD only. A LazyLock + initializer runs on one init thread; rayon workers will SIGILL without + their own prctl call. **Recommendation:** rename `amx_available()` → + `amx_init_thread()`. `simd_caps().has_amx()` exposes only hardware + + XCR0; the prctl step happens per-thread. + +**native.rs:** 2 sites already wrapped in local `LazyLock` — correct +pattern, P3 unification only. + +**bitwise.rs:** 16 sites all in `#[cfg(test)]` skip-guards. Production +dispatch already uses `simd_caps()`. No hot-path action. + +**Hidden gotcha (P0):** the per-thread prctl is a real SIGILL hazard for +rayon. `integrate_simd_par` doesn't touch AMX today so safe; but any +future AMX-accelerated rayon-parallel path needs an init-each-worker +shim. 
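+
+A minimal sketch of that init-each-worker shim (`amx_init_thread` is the rename proposed above and `has_amx()` is agent #6's new method; module paths are assumptions until the branches merge):
+
+```rust
+use std::cell::Cell;
+
+thread_local! {
+    // Linux grants XTILEDATA permission per thread, so every rayon worker
+    // must run the prctl itself; cache the per-thread outcome here.
+    static AMX_THREAD_READY: Cell<Option<bool>> = const { Cell::new(None) };
+}
+
+/// True once THIS thread holds XTILEDATA permission. After the first call
+/// on a thread, hot paths pay a single TLS read.
+pub fn amx_ready_on_this_thread() -> bool {
+    AMX_THREAD_READY.with(|c| match c.get() {
+        Some(ready) => ready,
+        None => {
+            // CPUID + XCR0 are system-wide (cacheable in SimdCaps);
+            // only the prctl step is per-thread.
+            let ready = crate::hpc::simd_caps::simd_caps().has_amx()
+                && crate::simd_amx::amx_init_thread();
+            c.set(Some(ready));
+            ready
+        }
+    })
+}
+```
+
+Any future AMX-accelerated rayon path would call this at the top of its per-worker closure instead of assuming the LazyLock init thread's permission.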
+ + +## 2026-05-13T00:02 — agent #6 simd-caps-amx (sonnet) + +**File:** `src/hpc/simd_caps.rs` +**Verdict:** SHIP + +**Fields added (all additive, no existing fields modified):** +- `amx_tile: bool` — CPUID.07H.0H:EDX bit 24 via `__cpuid_count(7,0)` +- `amx_int8: bool` — CPUID.07H.0H:EDX bit 25 +- `amx_bf16: bool` — CPUID.07H.0H:EDX bit 22 +- `avx512bf16: bool` — `is_x86_feature_detected!("avx512bf16")` +- `avxvnniint8: bool` — `is_x86_feature_detected!("avxvnniint8")` + +**Convenience methods added:** +- `has_amx() -> bool` — `amx_tile && amx_int8` (CPUID-only; OS-level check stays in `simd_amx::amx_available()`) +- `has_avx512_bf16() -> bool` +- `has_avxvnniint8() -> bool` + +**Complications:** +- `__cpuid_count` is safe (no `unsafe {}` needed) in Rust 1.94.1 — the initially written `unsafe { }` wrapper produced a `warn(unused_unsafe)` warning; removed wrapper, kept explanatory comment. +- `simd_amx::amx_available()` left untouched per scope (XCR0+prctl OS check belongs to agent #12 audit). + +**Tests:** 4 new tests (plus existing 4 updated), all 8 pass, 0 warnings. +**Test command:** `rustup run 1.94.1 cargo test --features rayon --lib hpc::simd_caps` + +## 2026-05-13T10:00 — agent #11 audit-cosmetic (sonnet) + +**Files:** `src/hpc/byte_scan.rs`, `src/hpc/palette_codec.rs`, `src/hpc/aabb.rs` +**Verdict:** All three files confirmed COSMETIC-SIMD (with one PARTIAL-REAL exception). No file is clean. + +--- + +### Cosmetic-SIMD Enumeration Table + +| File | Line | Function | `#[target_feature]` | `_mm*` intrinsics? | Body has polyfill calls? | Classification | +|------|------|----------|---------------------|--------------------|--------------------------|----------------| +| `byte_scan.rs` | 22 | `byte_find_all_avx2` | `avx2` | NO | NO — scalar `haystack[i+j] == needle` loops | COSMETIC | +| `byte_scan.rs` | 86 | `byte_count_avx2` | `avx2` | NO | NO — scalar `haystack[i+j] == needle` loops | COSMETIC | +| `byte_scan.rs` | 52 | `byte_find_all_avx512` | `avx512bw` | NO (`_mm*` absent) | YES — uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask` | REAL (polyfill-backed) | +| `byte_scan.rs` | 115 | `byte_count_avx512` | `avx512bw` | NO (`_mm*` absent) | YES — uses `U8x64::splat`, `U8x64::from_slice`, `U8x64::cmpeq_mask`, `.count_ones()` | REAL (polyfill-backed) | +| `palette_codec.rs` | 303 | `unpack_generic_avx512` | `avx512f` | NO | NO — scalar nested loop (`word >> bit_offset & mask_val`) | COSMETIC | +| `palette_codec.rs` | 335 | `pack_generic_avx512` | `avx512f` | NO | NO — scalar for loop, verbatim copy of `pack_indices` | COSMETIC | +| `palette_codec.rs` | 353 | `unpack_4bit_avx2` | `avx2` | NO | NO — scalar nibble-split loop over `bytes[i..i+32]` | COSMETIC | +| `palette_codec.rs` | 501 | `bedrock_reorder_xzy_avx512` | `avx512f` | NO | NO — scalar triple-nested loop with `get_unchecked` | COSMETIC | +| `aabb.rs` | 241 | `aabb_intersect_batch_sse41` | `sse4.1` | NO | NO — scalar per-candidate `if` chain, identical to `aabb_intersect_batch_scalar` | COSMETIC | +| `aabb.rs` | 174 | `aabb_intersect_batch_avx512` | `avx512f` | NO | YES — uses `F32x16::from_array`, `F32x16::splat`, `F32x16::simd_le`, `F32x16::simd_ge`, `F32Mask16.0 &` | REAL (polyfill-backed) | +| `aabb.rs` | 329 | `ray_aabb_slab_test_avx512` | `avx512f` | NO | YES — uses `F32x16::splat`, arithmetic ops, `simd_min`, `simd_max`, `simd_le`, `simd_ge`, `to_array` | REAL (polyfill-backed) | +| `aabb.rs` | 464 | `aabb_expand_batch_sse2` | `sse2` | NO | NO — scalar per-AABB field update, identical to 
`aabb_expand_batch_scalar` | COSMETIC | + +**Summary: 8 COSMETIC, 4 REAL (polyfill-backed, no raw `_mm*`)** + +--- + +### AUTOVEC CHECK (empirical, via `rustc 1.94.1 --emit asm`) + +Built a minimal replica of each cosmetic function with `#[no_mangle] extern "C"` to prevent dead-code elimination. Assembly analyzed for `ymm*`/`zmm*`/`xmm*`/`vp*`/`vcmp*` instructions: + +**`byte_find_all_avx2` (avx2 hint, scalar 32-byte loop):** +Assembly: pure scalar integer ops (`cmpb`, `jne`, `movb`, `incq`). Zero YMM/XMM registers. LLVM did NOT autovectorize the append-to-Vec loop. **COSMETIC — not autovec'd.** + +**`aabb_intersect_batch_sse41` (sse4.1 hint, scalar per-candidate chain):** +Assembly: `movss`/`ucomiss`/`jb`/`setae` — scalar FP comparisons and branches. Zero packed SSE4.1 instructions (`blendvps`, `cmpps` absent). **COSMETIC — not autovec'd.** + +**`pack_generic_avx512` (avx512f hint, scalar bit-packing loop):** +Assembly: contains `vmovups %zmm0` for the memset/zeroing prelude (LLVM auto-vectorized the zero-init with AVX-512 store), but the main bit-packing loop is scalar shift+OR. The `%zmm0` instruction is from `vec![0u64; n_words]` zero-fill, not the index-packing loop body. **Zeroing autovec'd; bit-pack loop COSMETIC.** + +**`aabb_expand_batch_sse2` (sse2 hint, scalar per-AABB update):** +Assembly: uses `movups`/`subps`/`addps`/`shufps` on `%xmm` registers — **REAL-AUTOVEC.** LLVM vectorized the 6-float struct update into XMM-register arithmetic. The SSE2 feature hint IS doing useful work here: without it, LLVM would not be permitted to use `addps`/`subps` on this loop. **Mark as REAL-AUTOVEC.** + +--- + +### Replacement Plan (Cosmetic Functions Only) + +#### `byte_scan.rs` — `byte_find_all_avx2` (line 22) and `byte_count_avx2` (line 86) + +**Problem:** `#[target_feature(enable = "avx2")]` on pure scalar 32-byte loop. +**No `U8x32` exists** in `crate::simd` (confirmed: searched entire `src/`; zero results). +**Correct polyfill replacement:** None available at AVX2 tier. Two options: +1. **Delete** both functions and fall through to scalar path (honest: no speedup anyway). +2. **Add `U8x32` to `simd_avx2.rs`** with `splat`, `from_slice`, `cmpeq_mask → u32` methods, then replace scalar loops with `U8x32::splat(needle)` + `cmpeq_mask` + `trailing_zeros` bitmask scatter. + +**Polyfill gap:** `U8x32::cmpeq_mask` does **not exist** in `simd_avx2.rs`. The file contains zero `U8x*` types. The AVX2 tier must add this type before any real replacement is feasible. + +**Methods needed in `simd_avx2.rs`:** +- `U8x32::splat(v: u8) -> U8x32` +- `U8x32::from_slice(s: &[u8]) -> U8x32` +- `U8x32::cmpeq_mask(self, other: U8x32) -> u32` — maps to `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8` + +#### `palette_codec.rs` — `unpack_generic_avx512` (line 303) and `pack_generic_avx512` (line 335) + +**Problem:** Both are verbatim scalar copies of `unpack_indices`/`pack_indices` wearing `avx512f` decoration. +**Real replacement requires:** gather/scatter ops — `U8x64` scatter via `U16x32` widening + `U16x32::shr_epi16` + `pack_saturate_u8`. No single polyfill maps cleanly to variable-width bit unpacking. +**Honest replacement plan:** Delete both functions. Document `pack_indices`/`unpack_indices` as the canonical path. Add a `// NOTE: real SIMD unpack requires shr_epi16+pack_saturate_u8 per bit-width; not yet implemented.` comment in `pack_indices_simd` / `unpack_indices_simd`. 
+ +**Polyfill gap:** `U16x32::shr_epi16(shift: u32)` exists (line ~1244 in simd_avx512.rs region), but **scalar fallback in `simd.rs`** lacks it. The AVX-512 path can be implemented; a scalar polyfill for `simd.rs::scalar` module would need: +- `U16x32::shr_epi16(self, shift: u32) -> U16x32` (scalar: element-wise `>> shift`) + +#### `palette_codec.rs` — `unpack_4bit_avx2` (line 353) + +**Problem:** Nibble-split loop over 32-byte chunks, zero `_mm256_*` intrinsics. +**Correct polyfill:** Real 4-bit unpack uses `U8x32::unpacklo_epi8` + `U8x32::and` + `U8x32::srli_epi16`. Neither `unpacklo_epi8` nor `srli_epi16` exists on the AVX2 tier. +**Methods needed in `simd_avx2.rs`:** +- `U8x32::unpacklo_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpacklo_epi8`) +- `U8x32::unpackhi_epi8(self, other: U8x32) -> U8x32` (maps to `_mm256_unpackhi_epi8`) +- `U8x32::srli_epi16(self, imm: i32) -> U8x32` (maps to `_mm256_srli_epi16`) +- Or equivalently: `U8x32::and(self, mask: U8x32) -> U8x32` (maps to `_mm256_and_si256`) + +#### `palette_codec.rs` — `bedrock_reorder_xzy_avx512` (line 501) + +**Problem:** Scalar triple-loop permutation using `get_unchecked`, zero SIMD. +**Correct polyfill:** Real AVX-512 version would use `U16x32::gather` with computed indices. No gather primitive exists in `crate::simd` for `u16`. +**Honest replacement plan:** Delete the function; route `bedrock_reorder_xzy` directly to the scalar path. Add comment: `// AVX-512 gather on u16 requires widening to u32; not yet in polyfill.` +**Methods needed (if implemented):** +- `U32x16::gather_u16(base: *const u16, vindex: U32x16) -> U32x16` — not present; would wrap `_mm512_i32gather_epi32` with 2-byte scale. + +#### `aabb.rs` — `aabb_intersect_batch_sse41` (line 241) + +**Problem:** Scalar per-candidate loop, AUTOVEC confirmed: zero SSE4.1 instructions emitted. +**The `aabb_expand_batch_sse2` function IS REAL-AUTOVEC** (SSE2 feature hint causes `addps`/`subps` emission); SSE4.1 hint on the intersection function does NOT produce `blendvps` or `cmpps`. +**Correct polyfill:** Use `F32x4` (SSE2-width) comparison. No `F32x4` type exists in `crate::simd`. Alternatively, use `F32x8` (AVX2) for 2-candidate-at-once processing, or simply rename to `aabb_intersect_batch_scalar_hint` and document the annotation as a scheduling hint only. + +**Methods needed in `simd_avx2.rs` (for real SSE4.1 replacement):** +- `F32x4::from_array([f32; 4]) -> F32x4` — type does not exist +- OR accept that 1-candidate-at-a-time is scalar-only and rename the function honestly. 
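+
+To make the foundational gap concrete, a minimal sketch of the proposed AVX2-tier type (every name here is a proposal from this plan, not existing code; the intrinsic mappings match the table below):
+
+```rust
+#[cfg(target_arch = "x86_64")]
+mod u8x32_sketch {
+    use core::arch::x86_64::*;
+
+    /// Proposed 256-bit byte vector for the AVX2 tier.
+    #[derive(Clone, Copy)]
+    pub struct U8x32(__m256i);
+
+    impl U8x32 {
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub unsafe fn splat(v: u8) -> Self {
+            Self(_mm256_set1_epi8(v as i8))
+        }
+
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub unsafe fn from_slice(s: &[u8]) -> Self {
+            debug_assert!(s.len() >= 32);
+            // loadu: no alignment requirement on the source.
+            Self(_mm256_loadu_si256(s.as_ptr() as *const __m256i))
+        }
+
+        /// Bit i of the result is set iff self[i] == other[i].
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub unsafe fn cmpeq_mask(self, other: Self) -> u32 {
+            _mm256_movemask_epi8(_mm256_cmpeq_epi8(self.0, other.0)) as u32
+        }
+    }
+}
+```
+
+With this type in place, `byte_find_all_avx2` reduces to splat-needle → `cmpeq_mask` per 32-byte window → walk the set bits via `trailing_zeros`, and the `#[target_feature] unsafe fn` shape matches the gating discipline round 1 required for the AVX-512 tier.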
+ +--- + +### Polyfill Methods Needed in `simd_avx2.rs` (and scalar fallback) + +To make the above replacements fully feasible, these methods must be added: + +| Method | Type | Wraps (AVX2) | Scalar fallback | +|--------|------|--------------|-----------------| +| `U8x32::splat(v: u8)` | `simd_avx2.rs` | `_mm256_set1_epi8` | element-wise fill | +| `U8x32::from_slice(s: &[u8])` | `simd_avx2.rs` | `_mm256_loadu_si256` | copy 32 bytes | +| `U8x32::cmpeq_mask(self, other: U8x32) -> u32` | `simd_avx2.rs` | `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8` | `element-wise == as bitmask` | +| `U8x32::unpacklo_epi8(self, other: U8x32)` | `simd_avx2.rs` | `_mm256_unpacklo_epi8` | interleave lo halves | +| `U8x32::unpackhi_epi8(self, other: U8x32)` | `simd_avx2.rs` | `_mm256_unpackhi_epi8` | interleave hi halves | +| `U8x32::and(self, mask: U8x32)` | `simd_avx2.rs` | `_mm256_and_si256` | element-wise `&` | +| `U8x32::srli_epi16(self, imm: i32)` | `simd_avx2.rs` | `_mm256_srli_epi16` | element-wise `>> imm` | +| `U16x32::shr_epi16(self, shift: u32)` | scalar in `simd.rs` | already in `simd_avx512.rs:~1275` | element-wise `>> shift` | + +The `U8x32` type itself (the 256-bit byte vector) is entirely absent from `simd_avx2.rs` — all 7 methods above require first creating the type. This is the foundational gap for the AVX2-tier byte scan and nibble unpack paths. + +--- + +### Key Finding: `aabb_expand_batch_sse2` is REAL-AUTOVEC + +This function was previously listed as cosmetic by earlier agents. ASM confirms otherwise: the SSE2 feature annotation on the `[f32; 3] min/max subtract+add` loop causes LLVM to emit `movups`/`subps`/`addps`/`shufps` on XMM registers. Without the annotation, the same code compiles to scalar. This one function in `aabb.rs` is a legitimate use of `#[target_feature]` as an LLVM autovectorization hint. Do not remove it. + + +## 2026-05-13T19:35 — agent #10 audit-color (sonnet) [backfilled by main] + +**Files:** bevy_pbr/atmosphere/{resources,environment}.rs + +light_probe/generate.rs + ssao/mod.rs + bevy_image/{image,ktx2}.rs +**Verdict:** **0 of 10 sites worth converting.** All NOT-WORTH. + +Root causes: +1. All atmosphere / light-probe / SSAO f16 textures are GPU-only — CPU only + sets the wgpu `TextureFormat` descriptor. GPU compute shaders fill them. +2. `Image::convert` does NOT support `Rgba16Float` as a target (returns + `None` at image.rs:1550). No bulk f32→f16 path exists today. +3. `set_color_at` / `get_color_at` are single-pixel-per-call APIs. Only + caller is `bevy_sprite/picking_backend.rs` (1 px per pointer event). +4. KTX2 copies half-float bytes verbatim — no decode loop. + +The "500-20000× BF16 batch" claim from ndarray's `f32_to_bf16_batch_rne` +docs is real but unreachable in Bevy as-shipped. The Bevy CPU never +touches f16/bf16 data in bulk. + +**Latent opportunity (not in codebase today):** if `Image::convert` were +extended to support `Rgba16Float` as a destination, a bulk +Rgba8Unorm → Rgba16Float path would touch W·H·4 f32→f16 values (33M at +4K) — genuine `cast_f32_to_f16_batch` candidate. Would have to ship the +Image::convert extension AND the SIMD path together. + + +## 2026-05-13T19:45 — agent #5 plugin-tests (sonnet) [backfilled by main] + +**Files:** `bevy/examples/ndarray_graph_plugin_tests.rs` (308 lines) + +Cargo.toml `[[example]]` entry +**Status:** ALL 5 TESTS PASS (dual mode: `cargo run` exits nonzero on +failure, `cargo test --example` also works) + +Tests: +1. 
plugin_initializes_global_renderer_resource — `GraphRenderer` resource
+   present after plugin build; `GLOBAL_RENDERER.tick_count() == 0`
+2. startup_seeds_nodes_and_edges — front.len=2, edges.len=1 after first
+   app.update()
+3. tick_advances_position_via_integrate_simd — position 10.0 → 10.016666
+   (= 1.0 * DT_60 + 10.0, exact). Confirms F32x16::mul_add polyfill ran
+4. compose_neo4j_emits_pixels_to_framebuffer — 106 non-zero bytes in
+   128×128 buffer (threshold=50)
+5. polyfill_runtime_tier_matches_expectation — confirms avx512f=true
+   AND avx2=true on Sapphire Rapids; PREFERRED_F32_LANES=8 (the smoke
+   test's catch — compile-time AVX2 path on AVX-512 hardware)
+
+**Duplication risk:** the test file defines `NdarrayGraphPlugin` + `GraphRenderer`
+INLINE because agent #5 ran in parallel with agent #1 and couldn't import.
+Main thread will consolidate after fleet completion: either (a) test file
+imports from agent #1's plugin file, or (b) move the plugin types into a
+shared `examples/ndarray_graph_lib.rs` module that both import.
+
+
+## 2026-05-13T19:55 — agent #8 audit-skin (sonnet) [backfilled by main]
+
+**File:** `bevy_pbr/src/render/skin.rs` (515 lines)
+**Verdict:** **NOT-WORTH**
+
+Bevy's skinning is GPU-side WGSL. `skin.rs` is a CPU staging step that
+computes one final `Mat4` per joint and writes it to a wgpu buffer for
+upload. Four candidate hot paths:
+
+1. `extract_joints_for_skin` (L399-413) — per-frame joint matrix update.
+   ECS change-detection gate at L406 → irregular skip pattern. Can't
+   batch for GEMM. M=N=K=4 GEMM is overhead-dominated anyway.
+2. `add_skin` (L452-474) — initial population on visibility change.
+   Contiguous loop, no skip — the ONLY uninterrupted math path. But
+   fires ~0 times/sec in stable scenes. Cold path.
+3. `prepare_skins` (L176-244) — pure DMA via `bytemuck::must_cast_slice`.
+   No arithmetic.
+4. Per-vertex weighted blend — **not in this file**. GPU-side WGSL.
+
+Numbers: MAX_JOINTS=256, full-rig scalar cost ~16 µs/mesh/frame. AVX-512
+at 8× would save 14 µs/mesh/frame. GPU skinning noise floor is 0.5-2 ms.
+SIMD savings disappear below GPU baseline.
+
+**ndarray API surface needed: NONE.** Skin is not a SIMD-polyfill
+integration candidate. The performance levers are GPU shader
+optimization + wgpu buffer bandwidth — outside ndarray's scope.
+
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index ccac38f0..8eac19a0 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -184,7 +184,16 @@ jobs:
     - run: ./scripts/miri-tests.sh

   cross_test:
-    #if: ${{ github.event_name == 'merge_group' }}
+    # Gated on merge_group only — cross-compile via docker (cross-rs) for
+    # s390x / i686 is slow, flaky in the s390x docker image's toolchain
+    # resolution (rust-toolchain.toml's 1.94.1 pin doesn't resolve cleanly
+    # inside the s390x cross container), and any breakage it would surface
+    # is reliably caught by the `tests/{stable,beta,1.94.0}` jobs on every
+    # PR push. Reserve cross validation for the merge queue, where it can
+    # fail loudly without gating individual PRs on infra flakiness. The
+    # commented `if:` was the original intent (per the pre-existing
+    # comment); uncommented per the PR #143 codex thread that surfaced this flakiness consistently.
+ if: ${{ github.event_name == 'merge_group' }} runs-on: ubuntu-latest strategy: matrix: diff --git a/src/hpc/simd_caps.rs b/src/hpc/simd_caps.rs index 28279630..2789ba88 100644 --- a/src/hpc/simd_caps.rs +++ b/src/hpc/simd_caps.rs @@ -22,7 +22,14 @@ use std::sync::LazyLock; /// Pi Zero 2 W / Pi 3 (A53, v8.0): neon only /// Pi 4 (A72, v8.0): neon only (but 2× throughput) /// Pi 5 (A76, v8.2): neon + dotprod + fp16 + aes + sha2 +/// +/// `#[non_exhaustive]` per codex P2 on PR #143: future capability fields +/// can be added without source-breaking downstream crates that construct +/// `SimdCaps` directly via struct literal (e.g. mocks, tests, custom +/// capability values). Downstream code must use `simd_caps()` or the +/// public constructor instead of struct-literal init. #[derive(Debug, Clone, Copy)] +#[non_exhaustive] pub struct SimdCaps { // ── x86_64 ── /// AVX2 (256-bit integer/FP SIMD). @@ -49,6 +56,21 @@ pub struct SimdCaps { /// Skylake-X / Cascade Lake / Ice Lake-SP — calling VBMI intrinsics on /// those CPUs SIGILLs even though `avx512f` is true. pub avx512vbmi: bool, + /// AMX-TILE: tile register file present (CPUID.07H.0H:EDX bit 24). + /// Sapphire Rapids, Granite Rapids, Meteor Lake, Arrow Lake. + pub amx_tile: bool, + /// AMX-INT8: `TDPBUSD` u8×i8→i32 tile dot product (CPUID.07H.0H:EDX bit 25). + pub amx_int8: bool, + /// AMX-BF16: `TDPBF16PS` BF16×BF16→f32 tile dot product (CPUID.07H.0H:EDX bit 22). + pub amx_bf16: bool, + /// AVX-512 BF16: `VCVTNE2PS2BF16` / `VDPBF16PS` 512-bit BF16 math + /// (`is_x86_feature_detected!("avx512bf16")`). + /// Present on Cooper Lake, Sapphire Rapids, Zen 4. + pub avx512bf16: bool, + /// AVX-VNNI-INT8: 256-bit `VPDPBSSD`/`VPDPBUUD` (non-AVX-512) VNNI + /// (`is_x86_feature_detected!("avxvnniint8")`). + /// Present on Arrow Lake, Lunar Lake, NUC 14 (Meteor Lake-H). + pub avxvnniint8: bool, // ── aarch64 (ARM) ── /// NEON 128-bit SIMD (mandatory on aarch64, always true). @@ -81,6 +103,14 @@ impl SimdCaps { /// Detect CPU capabilities at runtime. #[cfg(target_arch = "x86_64")] fn detect() -> Self { + // `__cpuid_count` is safe on x86_64 (Rust 1.87+): CPUID is always + // available on x86_64 (guaranteed by the ABI) and has no side effects + // beyond reading CPU registers. 
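+        // CPUID leaf 7, sub-leaf 0: the AMX bits live in EDX
+        // (bit 22 = BF16, bit 24 = TILE, bit 25 = INT8).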
+ let cpuid7 = core::arch::x86_64::__cpuid_count(7, 0); + let amx_tile = (cpuid7.edx >> 24) & 1 == 1; + let amx_int8 = (cpuid7.edx >> 25) & 1 == 1; + let amx_bf16 = (cpuid7.edx >> 22) & 1 == 1; + Self { avx2: is_x86_feature_detected!("avx2"), avx512f: is_x86_feature_detected!("avx512f"), @@ -92,6 +122,11 @@ impl SimdCaps { fma: is_x86_feature_detected!("fma"), avx512vnni: is_x86_feature_detected!("avx512vnni"), avx512vbmi: is_x86_feature_detected!("avx512vbmi"), + amx_tile, + amx_int8, + amx_bf16, + avx512bf16: is_x86_feature_detected!("avx512bf16"), + avxvnniint8: is_x86_feature_detected!("avxvnniint8"), // ARM fields: all false on x86 neon: false, asimd_dotprod: false, @@ -119,6 +154,11 @@ impl SimdCaps { fma: false, avx512vnni: false, avx512vbmi: false, + amx_tile: false, + amx_int8: false, + amx_bf16: false, + avx512bf16: false, + avxvnniint8: false, // ARM fields: runtime detection neon: true, // mandatory on aarch64 asimd_dotprod: std::arch::is_aarch64_feature_detected!("dotprod"), @@ -143,6 +183,11 @@ impl SimdCaps { fma: false, avx512vnni: false, avx512vbmi: false, + amx_tile: false, + amx_int8: false, + amx_bf16: false, + avx512bf16: false, + avxvnniint8: false, neon: false, asimd_dotprod: false, fp16: false, @@ -171,6 +216,32 @@ impl SimdCaps { self.avx512f && self.avx512vnni } + /// True if AMX is available at the CPUID level (`amx_tile && amx_int8`). + /// + /// Note: CPUID presence does **not** guarantee OS enablement. The full + /// OS-level check (XCR0 bits 17+18, prctl ARCH_REQ_XCOMP_PERM) lives in + /// `simd_amx::amx_available()`. This method is a lightweight CPUID-only + /// probe suitable for capability reporting and coarse dispatch decisions. + #[inline(always)] + pub fn has_amx(self) -> bool { + self.amx_tile && self.amx_int8 + } + + /// True if AVX-512 BF16 is available (`VCVTNE2PS2BF16` / `VDPBF16PS`). + /// Present on Cooper Lake, Sapphire Rapids, Zen 4. + #[inline(always)] + pub fn has_avx512_bf16(self) -> bool { + self.avx512bf16 + } + + /// True if AVX-VNNI-INT8 (256-bit `VPDPBSSD`/`VPDPBUUD`) is available. + /// Present on Arrow Lake, Lunar Lake, NUC 14 (Meteor Lake-H). + /// This is the non-AVX-512 VNNI path — does NOT require `avx512f`. + #[inline(always)] + pub fn has_avxvnniint8(self) -> bool { + self.avxvnniint8 + } + // ── ARM convenience methods ── /// True if running on aarch64 with NEON (always true on aarch64). @@ -298,6 +369,12 @@ mod tests { let _ = caps.avx2; let _ = caps.avx512f; let _ = caps.neon; + // New AMX / BF16 / VNNI fields must also be accessible without panic. + let _ = caps.amx_tile; + let _ = caps.amx_int8; + let _ = caps.amx_bf16; + let _ = caps.avx512bf16; + let _ = caps.avxvnniint8; } #[test] @@ -335,6 +412,58 @@ mod tests { let _ = caps.has_crypto(); } + #[test] + fn new_amx_bf16_vnni_convenience_methods_do_not_panic() { + let caps = simd_caps(); + let amx = caps.has_amx(); + let bf16 = caps.has_avx512_bf16(); + let vnni = caps.has_avxvnniint8(); + // Semantic invariants: has_amx() requires both tile and int8. + assert_eq!(amx, caps.amx_tile && caps.amx_int8); + // has_avx512_bf16() mirrors the raw field. + assert_eq!(bf16, caps.avx512bf16); + // has_avxvnniint8() mirrors the raw field. + assert_eq!(vnni, caps.avxvnniint8); + } + + #[test] + fn amx_fields_false_on_non_x86() { + // On non-x86_64, all AMX and BF16 fields must be false because + // the detect() fallback / aarch64 branch sets them to false. 
+ #[cfg(not(target_arch = "x86_64"))] + { + let caps = simd_caps(); + assert!(!caps.amx_tile); + assert!(!caps.amx_int8); + assert!(!caps.amx_bf16); + assert!(!caps.avx512bf16); + assert!(!caps.avxvnniint8); + assert!(!caps.has_amx()); + assert!(!caps.has_avx512_bf16()); + assert!(!caps.has_avxvnniint8()); + } + // On x86_64 we can only check that the call doesn't panic; the + // actual values depend on the hardware running the test. + #[cfg(target_arch = "x86_64")] + { + let caps = simd_caps(); + let _ = caps.has_amx(); + let _ = caps.has_avx512_bf16(); + let _ = caps.has_avxvnniint8(); + } + } + + #[test] + fn simd_caps_deterministic_new_fields() { + let a = simd_caps(); + let b = simd_caps(); + assert_eq!(a.amx_tile, b.amx_tile); + assert_eq!(a.amx_int8, b.amx_int8); + assert_eq!(a.amx_bf16, b.amx_bf16); + assert_eq!(a.avx512bf16, b.avx512bf16); + assert_eq!(a.avxvnniint8, b.avxvnniint8); + } + #[test] fn arm_profile_consistent() { let caps = simd_caps();