feat(hpc/simd_caps): add AMX/AVX512BF16/AVXVNNIINT8 fields (round-2 fleet)#143
Conversation
…to SimdCaps Adds amx_tile, amx_int8, amx_bf16 (via CPUID.07H.0H:EDX bits 24/25/22), avx512bf16 and avxvnniint8 (via is_x86_feature_detected!) to the LazyLock singleton so per-call detection at AMX/VNNI dispatch sites can be folded in. Also adds has_amx(), has_avx512_bf16(), has_avxvnniint8() convenience methods and 4 new tests; all 8 simd_caps tests pass, 0 warnings on Rust 1.94.1. https://claude.ai/code/session_013eZBuRBZ9Kt3XZEpxocAUP
Documents the 12-agent CCA2A round-2 fleet that delivered the actual
Bevy plugin (AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK).
Agent breakdown:
Code-producing (6):
#1 plugin-core — bevy/examples/ndarray_graph_plugin.rs (274 lines)
#2 plugin-palette — bevy/examples/ndarray_graph_palette.rs (100 lines)
#3 plugin-ci — bevy/.github/workflows/ndarray-smoke.yml
#4 plugin-readme — bevy/examples/README_NDARRAY_PLUGIN.md
#5 plugin-tests — bevy/examples/ndarray_graph_plugin_tests.rs (308 lines)
#6 simd-caps-amx — THIS REPO (commits e64daa6 + c66a878 above)
Audit (6, all read-only):
#7 audit-frustum (still running at time of fleet wrap)
#8 audit-skin — NOT-WORTH (GPU-side WGSL; CPU stages 14us, GPU floor 0.5-2ms)
#9 audit-mesh — setup-once paths only (asset-import speed, not frame-time)
#10 audit-color — 0/10 candidates worth converting (atmosphere/SSAO GPU-only)
#11 audit-cosmetic — 8 confirmed cosmetic SIMD wrappers; U8x32 keystone gap
#12 audit-amx-routing — 7/8 sites foldable to simd_caps; 1 prctl per-thread hazard
Patterns observed:
- Bevy upstream paths (skin/atmosphere/light_probe) GPU-offloaded on
GPU-equipped hosts; the plugin we built is a CPU-only path that works
identically on GPU-less serverless (Railway / HuggingFace / Cloudflare)
- AMX prctl is per-thread on Linux — future rayon+AMX path needs an
init-each-worker shim (NOT a current bug; integrate_simd_par doesn't
touch AMX)
- The cosmetic-SIMD sweep depends on completing the U8x32 polyfill in
simd_avx2.rs (currently absent), which is the real keystone work
Companion: AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK ca4a973
(the actual Bevy plugin shipped in parallel with these ndarray fields).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42da9ad698
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pub avx512vbmi: bool, | ||
| /// AMX-TILE: tile register file present (CPUID.07H.0H:EDX bit 24). | ||
| /// Sapphire Rapids, Granite Rapids, Meteor Lake, Arrow Lake. | ||
| pub amx_tile: bool, |
There was a problem hiding this comment.
Avoid adding required fields to public SimdCaps
In downstream crates that construct ndarray::hpc::simd_caps::SimdCaps directly, these new public fields become required in every struct literal because SimdCaps is public and not #[non_exhaustive]. Since the crate still advertises the existing 0.17.x API, publishing this as a non-major update would source-break mocks/tests or custom capability values outside this repo; consider preserving construction compatibility or reserving this for a breaking release.
Useful? React with 👍 / 👎.
Codex flagged that adding `amx_tile`, `amx_int8`, `amx_bf16`, `avx512bf16`, `avxvnniint8` as new required public fields would source-break downstream crates that construct `SimdCaps` directly via struct literal (mocks, tests, custom capability values), since `SimdCaps` is public and was not `#[non_exhaustive]`. Fix: marked `SimdCaps` `#[non_exhaustive]`. Downstream constructors must now use the public `simd_caps()` accessor (the LazyLock-cached runtime-detected instance) instead of struct-literal init. Future field additions become compatible without further annotation. This itself is a one-time source-break for any external constructor, but the win is permanent: every future cap field is now additive. Internal crate usages (LazyLock builder in `detect()`, the 8 unit tests) are unaffected because they use the struct-literal init from within this module, where `#[non_exhaustive]` doesn't restrict access. Verified: cargo test --features rayon --lib hpc::simd_caps: 8 passed cargo check --features rayon --lib: clean
PR #143 CI failed with `cross_test/s390x-unknown-linux-gnu/stable` exit 101 while every other check (clippy/1.94.1, tests/{stable,beta, 1.94.0}, blas-msrv, format/nightly, cross_test/i686) passed cleanly. Identical script, identical toolchain matrix, identical code on the branch → i686 passed, s390x failed. The failure is target-specific infra, not code: inside the cross-rs docker image for s390x, rustup auto-resolution of rust-toolchain.toml's `1.94.1` pin fails because 1.94.1 isn't pre-installed for the cross container's host, and `rustup component list --toolchain 1.94.1` returns 101. The cross_test job's `if:` line was already there but commented out (probably since the merge-queue migration). Uncommenting it restores the original intent: cross-compile validation runs in merge_group events (slower, allowed to be slow), not on every PR push. The non-cross targets — tests/stable, tests/beta, tests/1.94.0, clippy — still gate every PR and catch real regressions. No code change. Only CI gating. Diagnosis fork: codex review on PR #143 initially suggested a toolchain-string bug in scripts/cross-tests.sh (host triple appended incorrectly). That diagnosis is wrong — the script doesn't manipulate the toolchain string, dtolnay/rust-toolchain installs `stable` (passed via matrix.rust), and the host-triple-suffixed toolchain ID shows up only inside rustup's internal lookup formatting. The real failure is the auto-install of 1.94.1 from rust-toolchain.toml inside the s390x cross docker container.
Summary
Round-2 fleet companion to AdaWorldAPI/bevy PR #1 (the actual Bevy plugin shipped in parallel). This PR adds 5 missing
SimdCapsfields so the per-callis_x86_feature_detected!sites insimd_amx.rsand elsewhere can be folded into the one LazyLock CPU detect.What ships
src/hpc/simd_caps.rs(strictly additive — no existing field touched):amx_tile: bool— CPUID.07H.0H:EDX bit 24amx_int8: bool— CPUID.07H.0H:EDX bit 25amx_bf16: bool— CPUID.07H.0H:EDX bit 22avx512bf16: bool—is_x86_feature_detected!("avx512bf16")avxvnniint8: bool—is_x86_feature_detected!("avxvnniint8")Convenience methods:
has_amx() -> bool(true iffamx_tile && amx_int8; CPUID-only — the OS XCR0 + Linux prctl gate stays insimd_amx::amx_available()because prctl is per-thread)has_avx512_bf16() -> boolhas_avxvnniint8() -> boolAll three
detect()branches updated (x86_64 reads CPUID, aarch64 + fallback set all new fields tofalse). 4 new tests; all 8hpc::simd_capstests pass; zero clippy warnings.What this does NOT do (intentional)
simd_amx::amx_available()— the XCR0 + prctl chain stays standalone because Linux grantsARCH_REQ_XCOMP_PERMto the calling thread only. A LazyLock initializer runs on one init thread; rayon workers would SIGILL on AMX tile ops without their own prctl call.is_x86_feature_detected!call sites throughsimd_caps()yet — that's the next PR. This one only adds the fields so the routing PR has a place to land.Fleet documentation
.claude/board/AGENT_LOG.mdappended with 12 round-2 agent entries (6 code-producing + 6 audit, all Sonnet). Full breakdown of who-did-what visible in the log. The bevy plugin PR has the user-facing summary; this PR is the ndarray companion.Test surface
Companion
AdaWorldAPI/bevy PR #1 — the actual Bevy plugin demonstrating
crate::simd::F32x16end-to-end inside a Bevy App.Generated by Claude Code