
feat(simd_avx2): add U8x32 — native AVX2 byte vector (round-3 keystone)#144

Merged
AdaWorldAPI merged 2 commits into master from claude/u8x32-polyfill-round3 on May 13, 2026

Conversation

@AdaWorldAPI (Owner)
Summary

Round-3 W1 (foundation) — adds U8x32, the native AVX2 byte vector type that the round-2 audit fleet (agent #11) identified as the keystone gap. Without it, the 8 cosmetic-SIMD wrappers in hpc/byte_scan.rs, hpc/palette_codec.rs, and hpc/aabb.rs (#[target_feature(enable = "avx2")] decorating scalar bodies) couldn't be rewritten to use the polyfill — there was no type to call.

This PR ships the type + 26 methods + 18 regression tests. The consumer rewrites (cosmetic-SIMD sweep) ship in a follow-up fleet of PRs.

API surface

U8x32 mirrors simd_avx512::U8x64 at the natural AVX2 width (32 bytes = one __m256i):

use ndarray::simd::U8x32;

let v = U8x32::splat(0xAA);                 // 32 lanes
let mask = v.cmpeq_mask(other);             // → u32 (one bit per lane)
let blend = U8x32::mask_blend(mask_v, a, b); // MSB-driven select
let counts = lut.shuffle_bytes(idx);        // nibble popcount via LUT
Category        Methods
Constructors    splat, from_slice, from_array, to_array, copy_to_slice
Reductions      reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
Min/max         simd_min, simd_max
Compare → mask  cmpeq_mask → u32, cmpgt_mask → u32, movemask → u32
Saturating      saturating_add, saturating_sub
Avg             pairwise_avg (round-up, _mm256_avg_epu8)
Shifts          shr_epi16, shl_epi16 (16-bit lane shifts)
Shuffles        shuffle_bytes (within-lane), permute_bytes (cross-lane scalar fallback), unpack_lo_epi8, unpack_hi_epi8
Conditional     mask_blend (MSB-driven, NOT bitmask — distinct from U8x64::mask_blend)
LUT             nibble_popcount_lut
Operators       BitAnd, BitOr, BitXor, Add (wrap), Sub (wrap), Debug, Default
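The consumer rewrites this API targets follow a standard chunked-scan shape: load 32 bytes, compare, extract a lane mask, locate the hit with trailing_zeros. A scalar sketch of that shape (the helper names `cmpeq_mask32` and `find_first` are illustrative, not the crate's API; the real type replaces the inner loop with `cmpeq_mask`):

```rust
// Scalar model of the chunked byte-scan pattern U8x32 enables.
// Bit layout mirrors _mm256_movemask_epi8: lane 0 -> bit 0.
fn cmpeq_mask32(chunk: &[u8; 32], needle: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        if b == needle {
            mask |= 1 << i;
        }
    }
    mask
}

fn find_first(haystack: &[u8], needle: u8) -> Option<usize> {
    for (chunk_idx, chunk) in haystack.chunks(32).enumerate() {
        let mut buf = [0u8; 32];
        buf[..chunk.len()].copy_from_slice(chunk);
        // Mask off padding lanes in a short tail chunk.
        let valid = if chunk.len() == 32 {
            u32::MAX
        } else {
            (1u32 << chunk.len()) - 1
        };
        let m = cmpeq_mask32(&buf, needle) & valid;
        if m != 0 {
            return Some(chunk_idx * 32 + m.trailing_zeros() as usize);
        }
    }
    None
}
```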

Design notes

Real __m256i storage, not scalar fallback. The existing simd_avx2::U8x64 uses scalar [u8; 64] because 64 bytes doesn't fit one ymm register — it's a polyfill-shape compatibility type. U8x32 is the real AVX2 byte vector; consumers wanting actual AVX2 SIMD speedup over scalar should iterate in 32-byte chunks via U8x32.

Soundness model matches the file's existing pattern: _mm256_* intrinsics are wrapped in unsafe { } inside safe pub fn, trusting AVX2 at compile time (project baseline is x86-64-v3 per the .cargo/config_ndarray_simd.toml template on the bevy side). The existing simd_avx2.rs:357 AVX2 popcount uses this same pattern.

permute_bytes cross-lane fallback is scalar. AVX2 has no native cross-lane byte permute (_mm256_permutexvar_epi8 is VBMI/AVX-512). Matches the shape of simd_avx512::U8x64::permute_bytes on AVX-512F-without-VBMI hosts (also scalar — landed in PR #142).
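What the cross-lane fallback has to do can be modeled in scalar code. A sketch (masking each index to the low 5 bits is an assumed convention, not confirmed by the PR):

```rust
// Scalar model of a full 32-byte cross-lane permute — the operation AVX2
// cannot do in one instruction (_mm256_permutexvar_epi8 is VBMI/AVX-512).
// Each output lane pulls from any of the 32 source lanes.
fn permute_bytes_model(src: [u8; 32], idx: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = src[(idx[i] & 31) as usize]; // index masked to 0..=31 (assumption)
    }
    out
}
```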

mask_blend differs from U8x64's: AVX-512 uses 64-bit bitmasks; AVX2's _mm256_blendv_epi8 is MSB-driven on a 32-byte mask vector. The signature reflects this (Self, Self, Self not u64, Self, Self).
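The MSB-driven semantics can be modeled per lane. A scalar sketch (illustrative only; the select-`b`-on-set convention here follows the U8x64 behavior described in the review comment, which notes the merged AVX2 code may differ):

```rust
// Scalar model of MSB-driven blending (the _mm256_blendv_epi8 shape):
// only bit 7 of each mask byte matters, unlike AVX-512's k-register
// bitmasks. A lane whose mask MSB is set takes `b`, otherwise `a`.
fn mask_blend_model(mask: [u8; 32], a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = if mask[i] & 0x80 != 0 { b[i] } else { a[i] };
    }
    out
}
```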

Re-exports

U8x32 is re-exported from crate::simd::* on both AVX-512 and AVX2 build tiers — it's the natural AVX2 byte width and useful regardless of whether AVX-512's U8x64 is the consumer's preferred width.

Test plan

  • 18 new tests in mod u8x32_tests covering all behavior (constructors, reductions, masks incl. unsigned high-byte cases > 127, saturating clamps, pairwise_avg round-up, nibble shifts, permute, blend, LUT)
  • All 18 pass
  • Total test count: 1786 → 1804, no regressions
  • cargo clippy --features rayon -- -D warnings: clean
  • cargo check --features rayon --lib: clean
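The nibble_popcount_lut test exercises the classic shuffle-based popcount: split each byte into two nibbles and sum two table lookups. A scalar model of the technique (the LUT layout is the standard one, assumed here rather than taken from the PR):

```rust
// Scalar model of popcount via a 16-entry nibble LUT — the trick behind
// shuffle_bytes-based popcounts. NIBBLE_POPCNT[i] = popcount(i) for i in 0..16.
const NIBBLE_POPCNT: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

fn popcount_lut(b: u8) -> u8 {
    NIBBLE_POPCNT[(b & 0x0F) as usize] + NIBBLE_POPCNT[(b >> 4) as usize]
}
```

In the vectorized version, both lookups become one `shuffle_bytes` over a splatted LUT register, 32 bytes at a time.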

What follows

The cosmetic-SIMD consumer-rewrite fleet — 12 Sonnet agents, each scoped to ONE function in hpc/byte_scan.rs / hpc/palette_codec.rs / hpc/aabb.rs, rewriting from #[target_feature]+scalar to crate::simd::U8x32::* calls. Will land as separate PRs after this one merges.


Generated by Claude Code


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b538025f26


Comment thread: src/simd_avx2.rs

#[inline(always)]
pub fn mask_blend(mask: Self, a: Self, b: Self) -> Self {
    // SAFETY: AVX2 baseline.
    Self(unsafe { _mm256_blendv_epi8(b.0, a.0, mask.0) })
}

P2: Preserve U8x64 blend argument order

U8x64::mask_blend(mask, a, b) and the scalar fallback select b when the mask bit is set, but this AVX2 implementation passes the operands to _mm256_blendv_epi8 in the opposite order, so set mask lanes return a. Any consumer porting code from the existing U8x64 API, or feeding a comparison mask as a predicate, will get the inverse result on every selected byte. Call _mm256_blendv_epi8(a.0, b.0, mask.0) instead; otherwise the new type will not mirror the documented API shape.


The keystone for the cosmetic-SIMD sweep agent #11 audited on PR #142.
That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs,
hpc/palette_codec.rs, and hpc/aabb.rs — `#[target_feature(enable = "avx2")]`
decorating scalar bodies that gave zero speedup over plain scalar. The
root cause: there was no `U8x32` type in the polyfill, so consumers
couldn't write SIMD byte code at AVX2's natural width (32 bytes = one
__m256i ymm register).

This PR adds U8x32 with real __m256i storage and 26 polyfill methods
mirroring `simd_avx512::U8x64`:

Constructors:    splat, from_slice, from_array, to_array, copy_to_slice
Reductions:      reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
Min/max:         simd_min, simd_max  (_mm256_min_epu8, _mm256_max_epu8)
Compare→mask:    cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80),
                 movemask → u32  (matches _mm256_movemask_epi8 width)
Saturating:      saturating_add, saturating_sub  (_mm256_adds/subs_epu8)
Avg:             pairwise_avg  (_mm256_avg_epu8, round-up)
Shifts:          shr_epi16, shl_epi16  (16-bit lane shifts via _mm256_srl/sll_epi16)
Shuffles:        shuffle_bytes  (within-128-bit-lane, _mm256_shuffle_epi8)
                 permute_bytes  (cross-lane, scalar fallback — AVX2 has no
                 native cross-lane byte permute; matches U8x64's behavior
                 on AVX-512F-without-VBMI hosts)
                 unpack_lo_epi8, unpack_hi_epi8  (_mm256_unpacklo/hi_epi8)
Conditional:     mask_blend  (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
LUT:             nibble_popcount_lut

Plus operators: BitAnd, BitOr, BitXor, Add (wrapping), Sub (wrapping),
Debug, Default. All ~26 methods.
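The XOR-0x80 trick mentioned for cmpgt_mask can be checked exhaustively in scalar code. A sketch (illustrative, not the crate's implementation):

```rust
// AVX2 only has signed byte compares (_mm256_cmpgt_epi8). Flipping the
// sign bit of both operands maps unsigned order onto signed order, so
// (a ^ 0x80) >signed (b ^ 0x80)  ==  a >unsigned b.
fn unsigned_gt_via_signed(a: u8, b: u8) -> bool {
    ((a ^ 0x80) as i8) > ((b ^ 0x80) as i8)
}
```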

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build
tiers — U8x32 is the natural AVX2 byte width and is needed regardless
of whether AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics
are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that
AVX2 is the compile target (x86-64-v3 is project baseline). The codebase
uses this pattern already in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max
  reductions, unsigned cmp masks (incl. high-byte > 127 to verify the
  XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg
  round-up, shr_epi16 nibble extraction, permute_bytes reverse,
  mask_blend per-MSB selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.

clippy --features rayon -- -D warnings: clean.

Companion: this PR unblocks the round-3 consumer fleet which will
rewrite byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41
and friends to use `crate::simd::U8x32` instead of `#[target_feature]`
wrappers around scalar code. Each consumer rewrite ships as its own PR
in the next wave.
@AdaWorldAPI force-pushed the claude/u8x32-polyfill-round3 branch from b538025 to 521d23f on May 13, 2026 17:13
…mt fix)

The format/nightly CI job on PR #144 flagged two sites in the U8x32
additions:

1. `nibble_popcount_lut` — 32-byte literal split into two 16-element
   rows for readability. Nightly rustfmt's chains_overflow_last_block
   + width budget collapse it to one line. Restored.

2. `permute_bytes_reverse` test — a 3-method chain was on one line.
   Nightly rustfmt wants each `.method()` on its own line under
   `chain_width = 60`. Restored.

No semantic change. `cargo +nightly fmt --all --check` clean after.
@AdaWorldAPI merged commit 3915bef into master on May 13, 2026
14 checks passed
