fix(simd): VBMI gate for permute_bytes + Inf clamp for simd_exp_f32#142
Conversation
Add integrate_simd_par gated on the existing `rayon` feature. Splits the
position/velocity buffers into BLOCK_FLOATS-sized chunks (1024 floats =
4 KB, L1-resident) and runs the existing F32x16::mul_add inner loop on
each block in parallel via par_chunks_mut + zip.
Composes 16 SIMD lanes × N rayon threads. Block size is chosen so each
sub-slice stays a multiple of 16, so the inner as_chunks_mut::<16>() tail
is always empty.
Tests:
integrate_simd_par_matches_sequential — bit-identical output vs
sequential integrate_simd (FMA + mul are deterministic).
integrate_simd_par_advances_positions_exactly — single-tick contract
x[i] += v[i] * dt holds within f32 epsilon.
Both gated behind #[cfg(feature = "rayon")]; default build is unchanged.
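The block-splitting shape described above can be sketched as plain sequential Rust — a minimal stand-in in which `integrate_blocks` is a hypothetical name and `chunks_mut`/`chunks` stand where the real code uses rayon's `par_chunks_mut`/`par_chunks`, with the scalar `mul_add` standing in for the `F32x16::mul_add` kernel:

```rust
// Illustrative sketch, not the crate's actual API. In the real code the
// outer iteration is rayon's par_chunks_mut and the inner loop is the
// F32x16 mul_add kernel; the block size keeps every chunk a multiple of 16.
const BLOCK_FLOATS: usize = 1024; // 4 KB per block, L1-resident

fn integrate_blocks(pos: &mut [f32], vel: &[f32], dt: f32) {
    assert_eq!(pos.len(), vel.len());
    pos.chunks_mut(BLOCK_FLOATS)
        .zip(vel.chunks(BLOCK_FLOATS))
        .for_each(|(p_blk, v_blk)| {
            // Inner kernel: x[i] += v[i] * dt, as a fused multiply-add so the
            // scalar model rounds the same way as the SIMD FMA path.
            for (p, v) in p_blk.iter_mut().zip(v_blk) {
                *p = v.mul_add(dt, *p);
            }
        });
}
```

Because 1024 is a multiple of 16, each sub-slice handed to a worker divides evenly into `F32x16` lanes, which is why the inner `as_chunks_mut::<16>()` tail stays empty.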
…rtho f32-to-usize cast is UB in Rust for negative / NaN / out-of-range values (reference §5.5.1).

Previously project_ortho cast `(pos*scale + offset) as usize` directly, with only a post-cast `.min(screen_w-1)` clamp — which can trigger UB on negative inputs, a real hazard once Bevy's target-cpu=x86-64-v4 enables strict provenance.

Fix: clamp in the float domain to [0, screen_dim - 1] BEFORE the cast, so the cast input is always finite, non-negative, and within usize range.

Also adds AGENT_LOG.md (CCA2A file blackboard) used by the 12-agent fleet that surfaced this bug plus the broader polyfill-violation audit.

Reported-by: agent #8 framebuffer (sonnet) in fleet review.
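A one-coordinate sketch of the clamp-before-cast shape — the function name and signature here are illustrative, not the crate's actual `project_ortho`:

```rust
// Hypothetical single-coordinate model of the fix: clamp in the float
// domain BEFORE the cast so the cast input is always finite, non-negative,
// and within range, regardless of the `as`-cast's edge-case semantics.
fn project_coord(pos: f32, scale: f32, offset: f32, screen_dim: usize) -> usize {
    let raw = pos * scale + offset;
    let max = (screen_dim - 1) as f32;
    // f32::clamp propagates NaN, so map NaN to 0 explicitly before casting.
    let clamped = if raw.is_nan() { 0.0 } else { raw.clamp(0.0, max) };
    clamped as usize
}
```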
Two soundness/correctness bugs surfaced by the 15-agent CCA2A fleet review on this branch and confirmed real by the brutally-honest reviewer (see .claude/board/AGENT_LOG.md for full fleet output).

1. permute_bytes (P0 SIGILL) — U8x64::permute_bytes called _mm512_permutexvar_epi8 (AVX-512VBMI) as a safe pub fn with no gate. SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (AVX-512F-but-no-VBMI). The doc claimed a fallback existed; none did.

   Fix: added an avx512vbmi: bool field to SimdCaps. permute_bytes now runtime-branches via the singleton — VBMI hosts use the hardware intrinsic (gated #[target_feature(enable = "avx512vbmi")] inner unsafe leaf, a Rust language requirement); non-VBMI AVX-512F hosts use a scalar fallback mirroring the AVX2-tier shape at simd_avx2.rs:1435. The #[target_feature] attribute on the inner permute_bytes_vbmi leaf stays — Rust requires it to call VBMI intrinsics from a function not compiled with VBMI globally. The user-facing permute_bytes method is safe and works on any AVX-512 CPU.

2. simd_exp_f32(Inf) (P1 silent-wrong-output) — pow2n_from_int saturated f32::INFINITY as i32 to i32::MAX; (i32::MAX + 127) wrapped, producing garbage IEEE bits via from_bits, and polynomial × garbage ≈ 0.5. exp(+Inf) silently returned ~0.5 in release and panicked in debug.

   Fix: pre-clamp the simd_exp_f32 input to [-87.336, 88.722] (the f32-representable domain of exp). Defense-in-depth: pow2n_from_int clamps ni to [-126, 127] before the bias add. NaN propagates naturally. 3 regression tests added: +Inf, -Inf, x=200 — all assert finite output.

Strictly additive — no existing intrinsic touched, no public API broken, no existing test modified, AMX inline-asm encodings untouched.

Test surface: 1786 passed, 0 failed (was 1783 — added 3 exp regression tests). cargo clippy --features rayon -- -D warnings: clean.
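A scalar, one-lane model of the two exp clamps just described — the names mirror the PR's `pow2n_from_int` / `simd_exp_f32`, but these bodies are illustrative stand-ins for the SIMD code, with `x.exp()` in place of the polynomial kernel:

```rust
// One-lane model of the exp(Inf) fix. `f32::INFINITY as i32` saturates to
// i32::MAX; without the clamp, (i32::MAX + 127) as u32 wraps and from_bits
// yields a garbage IEEE pattern.
fn pow2n_from_int(n: f32) -> f32 {
    let ni = n as i32;            // saturating for ±Inf / out-of-range
    let ni = ni.clamp(-126, 127); // defense-in-depth: keep the bias add in range
    f32::from_bits(((ni + 127) as u32) << 23) // 2^ni via the biased exponent
}

fn exp_clamped(x: f32) -> f32 {
    if x.is_nan() {
        return x; // NaN propagates naturally
    }
    // Pre-clamp to the f32-representable domain of exp.
    let x = x.clamp(-87.336, 88.722);
    x.exp() // stand-in for the SIMD polynomial
}
```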
Documentation: see .claude/board/SIMD_REVIEW_FIXES_2026_05_13.md for the full PR breakdown including findings explicitly deferred (AVX2 cfg gate, cosmetic-SIMD sweep, AMX detection consolidation, SAFETY-comment audit).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b3674bfa0
// Pre-clamp to the safe domain. Outside this band exp() is non-representable
// anyway (overflow → +Inf at ~88.7, underflow → +0 at ~-87.3) so the clamp
// is observable only at the saturation boundary.
let x = x.simd_clamp(F32x16::splat(-87.336_f32), F32x16::splat(88.722_f32));
Preserve NaNs when clamping exp inputs
For lanes where callers pass NaN, this clamp does not preserve the documented NaN behavior: the SIMD simd_clamp implementations are max(lo).min(hi), and both the scalar fallback and x86 intrinsics select one of the bounds for a NaN operand. That makes simd_exp_f32(NaN) return a small finite value near exp(-87.336) instead of propagating NaN, which can hide invalid values in activation/softmax paths that rely on exp semantics. Use comparisons that leave NaN lanes unchanged before clamping finite inputs.
Per project convention, every AVX-512 build has AVX2 codegen HARD-disabled by design — AVX-512 builds use _mm512_* exclusively (zmm registers only, no mixed-width). The 256-bit AVX2-tier types in simd_avx512.rs are unreachable in any valid build configuration where the soundness concern would apply. The reviewer's AVX2-types-ungated finding was theoretical, not real.
The pre-clamp via simd_clamp silently destroyed NaN inputs. simd_clamp is
implemented as max(lo).min(hi); _mm512_max_ps returns the SECOND operand
when the first is NaN (per Intel SDM § MAXPS), so NaN got clamped to lo
(-87.336) and exp(-87.336) ≈ 1.4e-38 — a tiny finite value pretending to
be valid.
Fix: capture NaN lanes via x.simd_ne(x) (NaN ≠ itself per IEEE 754) BEFORE
the clamp, then mask-select NaN back into those lanes after the polynomial.
NaN propagates per-lane; finite lanes are unchanged.
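The capture-then-restore shape can be modeled on a single lane — an illustrative sketch, where the real code performs the same steps per lane with `simd_ne` and a mask blend:

```rust
// One-lane model of the NaN-preserving clamp. `x != x` is the scalar
// equivalent of x.simd_ne(x): only NaN compares unequal to itself.
fn exp_nan_preserving(x: f32) -> f32 {
    let is_nan = x != x; // capture NaN BEFORE the clamp
    let clamped = if x < -87.336 {
        -87.336
    } else if x > 88.722 {
        88.722
    } else {
        x // NaN fails both comparisons and falls through here
    };
    let y = clamped.exp(); // stand-in for the polynomial
    // Mask-select NaN back into the NaN lanes after the polynomial.
    if is_nan { f32::NAN } else { y }
}
```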
Two regression tests:
simd_exp_f32_propagates_nan — full-NaN vector returns full-NaN
simd_exp_f32_propagates_nan_per_lane — mixed NaN/0.0 input; NaN lanes
propagate, finite lanes compute exp(0)=1 unaffected
1788 passed (+2 from 1786).
Reported-by: codex review on PR #142.
The keystone for the cosmetic-SIMD sweep agent #11 audited on PR #142. That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs, hpc/palette_codec.rs, and hpc/aabb.rs — `#[target_feature(enable = "avx2")]` decorating scalar bodies that gave zero speedup over plain scalar. The root cause: there was no `U8x32` type in the polyfill, so consumers couldn't write SIMD byte code at AVX2's natural width (32 bytes = one __m256i ymm register).

This PR adds U8x32 with real __m256i storage and ~26 polyfill methods mirroring `simd_avx512::U8x64`:
- Constructors: splat, from_slice, from_array, to_array, copy_to_slice
- Reductions: reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
- Min/max: simd_min, simd_max (_mm256_min_epu8, _mm256_max_epu8)
- Compare→mask: cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80), movemask → u32 (matches _mm256_movemask_epi8 width)
- Saturating: saturating_add, saturating_sub (_mm256_adds/subs_epu8)
- Avg: pairwise_avg (_mm256_avg_epu8, round-up)
- Shifts: shr_epi16, shl_epi16 (16-bit lane shifts via _mm256_srl/sll_epi16)
- Shuffles: shuffle_bytes (within-128-bit-lane, _mm256_shuffle_epi8); permute_bytes (cross-lane, scalar fallback — AVX2 has no native cross-lane byte permute; matches U8x64's behavior on AVX-512F-without-VBMI hosts); unpack_lo_epi8, unpack_hi_epi8 (_mm256_unpacklo/hi_epi8)
- Conditional: mask_blend (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
- LUT: nibble_popcount_lut
- Operators: BitAnd, BitOr, BitXor, Add (wrapping), Sub (wrapping), Debug, Default

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build tiers — U8x32 is the natural AVX2 byte width and is needed regardless of whether AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that AVX2 is the compile target (x86-64-v3 is the project baseline). The codebase already uses this pattern in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering roundtrip, sum/min/max reductions, unsigned cmp masks (incl. high-byte > 127 to verify the XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg round-up, shr_epi16 nibble extraction, permute_bytes reverse, mask_blend per-MSB selection, and nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions. clippy --features rayon -- -D warnings: clean.

Companion: this PR unblocks the round-3 consumer fleet, which will rewrite byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41 and friends to use `crate::simd::U8x32` instead of `#[target_feature]` wrappers around scalar code. Each consumer rewrite ships as its own PR in the next wave.
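The XOR-0x80 unsigned-compare trick mentioned above can be checked on a single byte — a scalar model (the real code applies it to whole __m256i registers, since x86 byte compares are signed-only):

```rust
// Scalar model of the XOR-0x80 trick: biasing both operands by 0x80 maps
// the unsigned u8 ordering onto the signed i8 ordering, so a signed
// greater-than (what _mm256_cmpgt_epi8 provides) computes an unsigned one.
fn cmpgt_unsigned_via_signed(a: u8, b: u8) -> bool {
    let a_biased = (a ^ 0x80) as i8; // 0..=255 maps to -128..=127, order-preserving
    let b_biased = (b ^ 0x80) as i8;
    a_biased > b_biased // equals a > b as unsigned
}
```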
Complete the portable-simd backend started in the scaffold commit.
12 Sonnet agents (round-3-portable-simd fleet) populated each of the
12 sub-files in `src/simd_nightly/` via the A2A blackboard pattern at
`.claude/board/AGENT_LOG.md`.
Total: ~4,022 LOC of wrapper code + 76 parity tests.
Per-file (line counts at commit):
- f32_types.rs (395) — F32x16, F32x8
- f64_types.rs (307) — F64x8, F64x4
- u8_types.rs (1043) — U8x32, U8x64 + 26 in-file tests
- u_word_types.rs (520) — U16x32, U32x16, U32x8, U64x8, U64x4
- i8_types.rs (263) — I8x32, I8x64
- i_word_types.rs (449) — I16x16, I16x32, I32x16, I64x8
- masks.rs (196) — F32Mask16, F32Mask8, F64Mask8, F64Mask4
- bf16_types.rs (248) — BF16x16, BF16x8 (scalar emulation;
core::simd has no half-precision)
- f16_types.rs (220) — F16x16 (scalar IEEE-754 binary16 emulation)
- ops.rs (265) — Add/Sub/Mul/Div/Neg + bitwise + Default
macros, applied to all 17 numeric types
- exotic_methods.rs (329) — permute_bytes / shuffle_bytes / mask_blend /
unpack_lo_epi8 / unpack_hi_epi8 scalar
fallbacks for U8x32 + U8x64 (core::simd
has no native cross-lane byte ops or
bitmask-driven blend)
- tests.rs (815) — 76 parity tests vs scalar reference
30 types total (mirrors the AVX-512 / AVX2 polyfill surface 1:1).
All re-exported flat from `crate::simd_nightly::*` via the mod.rs
aggregator.
Verification:
rustup run nightly cargo check --features nightly-simd -p ndarray --lib
→ Finished, 0 errors
rustup run nightly cargo test --features nightly-simd -p ndarray --lib simd_nightly
→ test result: ok. 153 passed; 0 failed
cargo check --lib (stable, default features, no nightly-simd)
→ Finished, 0 errors (the existing intrinsics dispatch is unchanged)
Cross-agent findings worth folding into a handover note:
- `std::simd::StdFloat` is the trait that provides mul_add/sqrt/round/
floor on core::simd float vectors. `core::simd::num::SimdFloat`
provides reduce/min/max/clamp but NOT the transcendentals.
- `core::simd::cmp::SimdOrd` is needed for simd_min/simd_max on
integer vectors (SimdPartialOrd alone is not sufficient).
- `core::simd::Mask::to_bitmask()` always returns u64 regardless of
lane count. Wrappers cast `as u8` / `as u16` / `as u32` for narrower
bitmask shapes.
- `core::simd::Simd::swizzle` is `const N: usize` — cannot take a
runtime index vector. permute_bytes / shuffle_bytes need scalar
fallback. Same shape as the AVX-512F-without-VBMI fallback path in
simd_avx512.rs added in PR #142.
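The scalar cross-lane fallback shape shared by both backends can be sketched for a 64-byte vector — an illustrative free function, whereas the real `U8x64`/`U8x32` wrappers store the result back into their vector types. Taking indices modulo the lane count matches `_mm512_permutexvar_epi8`, which uses only the low index bits:

```rust
// Sketch of the scalar cross-lane byte permute used where no native
// instruction exists (core::simd swizzle is const-only; AVX-512F lacks VBMI).
fn permute_bytes_scalar(data: &[u8; 64], idx: &[u8; 64]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        *o = data[(i as usize) % 64]; // % 64 keeps only the low 6 index bits
    }
    out
}
```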
What this enables:
Miri can execute every method here (intrinsics-based backends are
opaque to miri). Consumers who want miri-runnable SIMD tests import
from `ndarray::simd_nightly::*` explicitly. The main polyfill via
`crate::simd::*` continues to use intrinsics — the nightly-simd
feature does NOT replace the production dispatch, it provides a
parallel namespace for miri tooling.
Fleet output in .claude/board/AGENT_LOG.md (round-3-portable-simd
section). 6 of 12 agents hit the same pre-existing AGENT_LOG
write-permission block from round-2 — backfilled by the main thread.
Summary
Two soundness/correctness bugs surfaced by a 15-agent CCA2A review fleet on this branch (12 file-scoped Sonnet agents → meta-orchestrator → brutally-honest reviewer, all entries appended to `.claude/board/AGENT_LOG.md` via `tee -a`). Confirmed real by the brutally-honest reviewer, which built the workspace and ran the test suite before any change.

Fixes
`simd_avx512::U8x64::permute_bytes` called `_mm512_permutexvar_epi8` (AVX-512VBMI) as a safe `pub fn` with no gate. SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (AVX-512F-but-no-VBMI). The doc claimed a fallback existed; none did. Added an `avx512vbmi: bool` field to `SimdCaps`. `permute_bytes` runtime-branches via the singleton: VBMI hosts use the hardware intrinsic (gated `#[target_feature(enable = "avx512vbmi")]` inner unsafe leaf — a Rust language requirement); non-VBMI AVX-512F hosts use a scalar fallback mirroring `simd_avx2.rs:1435`.

`simd_exp_f32(+Inf)` silently returned ~0.5 in release / panicked in debug. `pow2n_from_int` saturated `f32::INFINITY as i32` to `i32::MAX`; `(i32::MAX + 127) as u32` wrapped, producing a garbage IEEE bit pattern via `f32::from_bits`. Pre-clamped the `simd_exp_f32` input to `[-87.336, 88.722]`. Defense-in-depth clamp in `pow2n_from_int` for `ni ∈ [-126, 127]` before the bias add. Three regression tests: `+Inf`, `-Inf`, `x=200` — all assert finite output.

Design pattern (for the VBMI fix)
The same conditional-routing shape as future AVX-512 sub-feature work (BF16, VNNI sub-paths). One LazyLock CPU detect (`simd_caps()`) is consulted at the method boundary; the inner intrinsic leaf carries the `#[target_feature]` attribute (Rust requires it to use intrinsics from a function not compiled with the feature globally — there is no other legal way). The consumer calls `U8x64::permute_bytes(idx)` and gets the right path on any AVX-512 CPU.

What this PR does NOT touch
Strictly additive — the full breakdown, including explicit deferrals (AVX2-tier cfg work, cosmetic-SIMD sweep, AMX detection consolidation), lives in `.claude/board/SIMD_REVIEW_FIXES_2026_05_13.md`, included in this PR. AMX inline-asm encodings, `_mm512_*` calls in other methods, and existing `#[target_feature]` annotations are all untouched.

Test plan
`cargo test --features rayon --lib` → 1786 passed (was 1783; +3 simd_exp_f32 regression tests)
`cargo clippy --features rayon -- -D warnings` → clean
`cargo check --lib` (default features, no rayon) → clean

Hardware test matrix
[Table not preserved in export; both rows exercise `permute_bytes`.]

Review fleet output
15 agents, 13 entries in `.claude/board/AGENT_LOG.md` (12 file-scoped Sonnet + 1 Opus meta + 1 Opus brutally-honest reviewer). The brutally-honest reviewer corrected the meta on several findings (e.g., `framebuffer::project_ortho` was a clarity improvement, not a UB fix — Rust 1.45+ saturates float→int casts) and identified the actual P0 SIGILL/correctness items shipped here.

Most other fleet findings (cosmetic-SIMD sweep, AMX detection duplication, SAFETY-comment audit) were classified as real-but-not-Bevy-blocking and deferred per the surgical-and-additive constraint.
Generated by Claude Code