
feat(simd_avx2): add U8x32 — native AVX2 byte vector (round-3 keystone)#144

Merged
AdaWorldAPI merged 2 commits into master from claude/u8x32-polyfill-round3 on May 13, 2026

Conversation

@AdaWorldAPI (Owner)
Summary

Round-3 W1 (foundation) — adds U8x32, the native AVX2 byte vector type that the round-2 audit fleet (agent #11) identified as the keystone gap. Without it, the 8 cosmetic-SIMD wrappers in hpc/byte_scan.rs, hpc/palette_codec.rs, and hpc/aabb.rs (#[target_feature(enable = "avx2")] decorating scalar bodies) couldn't be rewritten to use the polyfill — there was no type to call.

This PR ships the type + 26 methods + 18 regression tests. The consumer rewrites (cosmetic-SIMD sweep) ship in a follow-up fleet of PRs.

API surface

U8x32 mirrors simd_avx512::U8x64 at the natural AVX2 width (32 bytes = one __m256i):

use ndarray::simd::U8x32;

let v = U8x32::splat(0xAA);                 // 32 lanes
let mask = v.cmpeq_mask(other);             // → u32 (one bit per lane)
let blend = U8x32::mask_blend(mask_v, a, b); // MSB-driven select
let counts = lut.shuffle_bytes(idx);        // nibble popcount via LUT
Category        Methods
Constructors    splat, from_slice, from_array, to_array, copy_to_slice
Reductions      reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
Min/max         simd_min, simd_max
Compare → mask  cmpeq_mask → u32, cmpgt_mask → u32, movemask → u32
Saturating      saturating_add, saturating_sub
Avg             pairwise_avg (round-up, _mm256_avg_epu8)
Shifts          shr_epi16, shl_epi16 (16-bit lane shifts)
Shuffles        shuffle_bytes (within-lane), permute_bytes (cross-lane scalar fallback), unpack_lo_epi8, unpack_hi_epi8
Conditional     mask_blend (MSB-driven, NOT bitmask — distinct from U8x64::mask_blend)
LUT             nibble_popcount_lut
Operators       BitAnd, BitOr, BitXor, Add (wrap), Sub (wrap), Debug, Default
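The consumer rewrites this API targets follow a standard chunked-scan shape: load 32 bytes, compare, extract a lane mask, locate the hit with trailing_zeros. A scalar sketch of that shape (the helper names `cmpeq_mask32` and `find_first` are illustrative, not the crate's API; the real type replaces the inner loop with `cmpeq_mask`):

```rust
// Scalar model of the chunked byte-scan pattern U8x32 enables.
// Bit layout mirrors _mm256_movemask_epi8: lane 0 -> bit 0.
fn cmpeq_mask32(chunk: &[u8; 32], needle: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        if b == needle {
            mask |= 1 << i;
        }
    }
    mask
}

fn find_first(haystack: &[u8], needle: u8) -> Option<usize> {
    for (chunk_idx, chunk) in haystack.chunks(32).enumerate() {
        let mut buf = [0u8; 32];
        buf[..chunk.len()].copy_from_slice(chunk);
        // Mask off padding lanes in a short tail chunk.
        let valid = if chunk.len() == 32 {
            u32::MAX
        } else {
            (1u32 << chunk.len()) - 1
        };
        let m = cmpeq_mask32(&buf, needle) & valid;
        if m != 0 {
            return Some(chunk_idx * 32 + m.trailing_zeros() as usize);
        }
    }
    None
}
```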

Design notes

Real __m256i storage, not scalar fallback. The existing simd_avx2::U8x64 uses scalar [u8; 64] because 64 bytes doesn't fit one ymm register — it's a polyfill-shape compatibility type. U8x32 is the real AVX2 byte vector; consumers wanting actual AVX2 SIMD speedup over scalar should iterate in 32-byte chunks via U8x32.

Soundness model matches the file's existing pattern: _mm256_* intrinsics are wrapped in unsafe { } inside safe pub fn, trusting AVX2 at compile time (project baseline is x86-64-v3 per the .cargo/config_ndarray_simd.toml template on the bevy side). The existing simd_avx2.rs:357 AVX2 popcount uses this same pattern.

permute_bytes cross-lane fallback is scalar. AVX2 has no native cross-lane byte permute (_mm256_permutexvar_epi8 is VBMI/AVX-512). Matches the shape of simd_avx512::U8x64::permute_bytes on AVX-512F-without-VBMI hosts (also scalar — landed in PR #142).
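What the cross-lane fallback has to do can be modeled in scalar code. A sketch (masking each index to the low 5 bits is an assumed convention, not confirmed by the PR):

```rust
// Scalar model of a full 32-byte cross-lane permute — the operation AVX2
// cannot do in one instruction (_mm256_permutexvar_epi8 is VBMI/AVX-512).
// Each output lane pulls from any of the 32 source lanes.
fn permute_bytes_model(src: [u8; 32], idx: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = src[(idx[i] & 31) as usize]; // index masked to 0..=31 (assumption)
    }
    out
}
```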

mask_blend differs from U8x64's: AVX-512 uses 64-bit bitmasks; AVX2's _mm256_blendv_epi8 is MSB-driven on a 32-byte mask vector. The signature reflects this (Self, Self, Self not u64, Self, Self).
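The MSB-driven semantics can be modeled per lane. A scalar sketch (illustrative only; the select-`b`-on-set convention here follows the U8x64 behavior described in the review comment, which notes the merged AVX2 code may differ):

```rust
// Scalar model of MSB-driven blending (the _mm256_blendv_epi8 shape):
// only bit 7 of each mask byte matters, unlike AVX-512's k-register
// bitmasks. A lane whose mask MSB is set takes `b`, otherwise `a`.
fn mask_blend_model(mask: [u8; 32], a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = if mask[i] & 0x80 != 0 { b[i] } else { a[i] };
    }
    out
}
```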

Re-exports

U8x32 is re-exported from crate::simd::* on both AVX-512 and AVX2 build tiers — it's the natural AVX2 byte width and useful regardless of whether AVX-512's U8x64 is the consumer's preferred width.

Test plan

  • 18 new tests in mod u8x32_tests covering all behavior (constructors, reductions, masks incl. unsigned high-byte cases > 127, saturating clamps, pairwise_avg round-up, nibble shifts, permute, blend, LUT)
  • All 18 pass
  • Total test count: 1786 → 1804, no regressions
  • cargo clippy --features rayon -- -D warnings: clean
  • cargo check --features rayon --lib: clean
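The nibble_popcount_lut test exercises the classic shuffle-based popcount: split each byte into two nibbles and sum two table lookups. A scalar model of the technique (the LUT layout is the standard one, assumed here rather than taken from the PR):

```rust
// Scalar model of popcount via a 16-entry nibble LUT — the trick behind
// shuffle_bytes-based popcounts. NIBBLE_POPCNT[i] = popcount(i) for i in 0..16.
const NIBBLE_POPCNT: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

fn popcount_lut(b: u8) -> u8 {
    NIBBLE_POPCNT[(b & 0x0F) as usize] + NIBBLE_POPCNT[(b >> 4) as usize]
}
```

In the vectorized version, both lookups become one `shuffle_bytes` over a splatted LUT register, 32 bytes at a time.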

What follows

The cosmetic-SIMD consumer-rewrite fleet — 12 Sonnet agents, each scoped to ONE function in hpc/byte_scan.rs / hpc/palette_codec.rs / hpc/aabb.rs, rewriting from #[target_feature]+scalar to crate::simd::U8x32::* calls. Will land as separate PRs after this one merges.


Generated by Claude Code


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b538025f26


Comment thread: src/simd_avx2.rs

#[inline(always)]
pub fn mask_blend(mask: Self, a: Self, b: Self) -> Self {
    // SAFETY: AVX2 baseline.
    Self(unsafe { _mm256_blendv_epi8(b.0, a.0, mask.0) })
}

P2: Preserve U8x64 blend argument order

U8x64::mask_blend(mask, a, b) and the scalar fallback select b when the mask bit is set, but this AVX2 implementation passes the operands to _mm256_blendv_epi8 in the opposite order, so set mask lanes return a. Any consumer porting code from the existing U8x64 API, or feeding a comparison mask as a predicate, will get the inverse result on every selected byte. Call _mm256_blendv_epi8(a.0, b.0, mask.0) instead; otherwise the new type will not mirror the documented API shape.


The keystone for the cosmetic-SIMD sweep agent #11 audited on PR #142.
That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs,
hpc/palette_codec.rs, and hpc/aabb.rs — `#[target_feature(enable = "avx2")]`
decorating scalar bodies that gave zero speedup over plain scalar. The
root cause: there was no `U8x32` type in the polyfill, so consumers
couldn't write SIMD byte code at AVX2's natural width (32 bytes = one
__m256i ymm register).

This PR adds U8x32 with real __m256i storage and 26 polyfill methods
mirroring `simd_avx512::U8x64`:

Constructors:    splat, from_slice, from_array, to_array, copy_to_slice
Reductions:      reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
Min/max:         simd_min, simd_max  (_mm256_min_epu8, _mm256_max_epu8)
Compare→mask:    cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80),
                 movemask → u32  (matches _mm256_movemask_epi8 width)
Saturating:      saturating_add, saturating_sub  (_mm256_adds/subs_epu8)
Avg:             pairwise_avg  (_mm256_avg_epu8, round-up)
Shifts:          shr_epi16, shl_epi16  (16-bit lane shifts via _mm256_srl/sll_epi16)
Shuffles:        shuffle_bytes  (within-128-bit-lane, _mm256_shuffle_epi8)
                 permute_bytes  (cross-lane, scalar fallback — AVX2 has no
                 native cross-lane byte permute; matches U8x64's behavior
                 on AVX-512F-without-VBMI hosts)
                 unpack_lo_epi8, unpack_hi_epi8  (_mm256_unpacklo/hi_epi8)
Conditional:     mask_blend  (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
LUT:             nibble_popcount_lut

Plus operators: BitAnd, BitOr, BitXor, Add (wrapping), Sub (wrapping),
Debug, Default. All ~26 methods.
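The XOR-0x80 trick mentioned for cmpgt_mask can be checked exhaustively in scalar code. A sketch (illustrative, not the crate's implementation):

```rust
// AVX2 only has signed byte compares (_mm256_cmpgt_epi8). Flipping the
// sign bit of both operands maps unsigned order onto signed order, so
// (a ^ 0x80) >signed (b ^ 0x80)  ==  a >unsigned b.
fn unsigned_gt_via_signed(a: u8, b: u8) -> bool {
    ((a ^ 0x80) as i8) > ((b ^ 0x80) as i8)
}
```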

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build
tiers — U8x32 is the natural AVX2 byte width and is needed regardless
of whether AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics
are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that
AVX2 is the compile target (x86-64-v3 is project baseline). The codebase
uses this pattern already in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max
  reductions, unsigned cmp masks (incl. high-byte > 127 to verify the
  XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg
  round-up, shr_epi16 nibble extraction, permute_bytes reverse,
  mask_blend per-MSB selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.

clippy --features rayon -- -D warnings: clean.

Companion: this PR unblocks the round-3 consumer fleet which will
rewrite byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41
and friends to use `crate::simd::U8x32` instead of `#[target_feature]`
wrappers around scalar code. Each consumer rewrite ships as its own PR
in the next wave.
@AdaWorldAPI force-pushed the claude/u8x32-polyfill-round3 branch from b538025 to 521d23f on May 13, 2026 17:13
…mt fix)

The format/nightly CI job on PR #144 flagged two sites in the U8x32
additions:

1. `nibble_popcount_lut` — 32-byte literal split into two 16-element
   rows for readability. Nightly rustfmt's chains_overflow_last_block
   + width budget collapse it to one line. Restored.

2. `permute_bytes_reverse` test — a 3-method chain was on one line.
   Nightly rustfmt wants each `.method()` on its own line under
   `chain_width = 60`. Restored.

No semantic change. `cargo +nightly fmt --all --check` clean after.
@AdaWorldAPI merged commit 3915bef into master on May 13, 2026
14 checks passed
