Skip to content

fix(simd): aarch64 F32x16/F64x8 use real NEON paired loads, not scalar (sprint A7)#117

Merged
AdaWorldAPI merged 1 commit into
masterfrom
claude/burn-A7-neon-verify
Apr 30, 2026
Merged

fix(simd): aarch64 F32x16/F64x8 use real NEON paired loads, not scalar (sprint A7)#117
AdaWorldAPI merged 1 commit into
masterfrom
claude/burn-A7-neon-verify

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Sprint A7 of burn-ndarray parity sprint v1. Closes item (9) of the parity list — F32x16 / F64x8 on aarch64.

Spike outcome: item 9 was framed as "verification" but turned out to be a real gap.

Diagnosis

On aarch64, F32x16 / F64x8 were dispatching to the scalar fallback at src/simd.rs:165 — the mod scalar { ... impl_float_type!(F32x16, f32, 16, ...) } macro produces pub struct F32x16(pub [f32; 16]) with element-wise for i in 0..16 loops for every op. Pre-existing simd_neon.rs had only standalone helper fns (dot_f32x4_neon, hsum_f32x4, hamming_u8x16, etc.) — no F32x16 / F64x8 wrapper types.

Path before this PR: scalar fallback (1 add per cycle).
Path after this PR: paired NEON (4 lanes × 4 = 16 adds per cycle on F32x16).

What changed

src/simd_neon.rs (~600 LOC) — added pub mod aarch64_simd:

  • F32x16 = [float32x4_t; 4] with NEON-backed splat/from_slice/copy_to_slice/add/sub/mul/div/mul_add/abs/neg/sqrt/round/floor/simd_min/simd_max/reduce_sum — each compiles to one NEON instruction per 128-bit lane-pair
  • F64x8 = [float64x2_t; 4] matching API
  • F32Mask16 / F64Mask8, lowercase aliases
  • 5 smoke tests gated on cfg(target_arch = "aarch64", test)

src/simd.rs (~60 LOC) — dispatch:

  • mod scalarpub(crate) mod scalar (so simd_neon::aarch64_simd can re-export I32x16/U32x16/U64x8 from scalar — integer types stay scalar on aarch64; only float perf-paths get NEON)
  • Added #[cfg(target_arch = "aarch64")] re-export of F32x16/F64x8/masks/aliases from simd_neon::aarch64_simd
  • Generic non-x86 fallback (wasm, riscv, etc.) preserved unchanged

Verification

  • Host x86_64 AVX-512 release build of -p ndarray: PASS (33s)
  • aarch64 baseline (pre-change) cross-compile of -p ndarray --no-default-features: PASS (16s)
  • Full aarch64 cross-compile with --features std blocked in this env by blake3 needing aarch64-linux-gnu-gcc (not installed in sandbox). The new aarch64_simd module uses only stable core::arch::aarch64 intrinsics that mirror existing NEON helpers in the same file — so any compile failure would surface in those existing helpers, not the new ones.

Plan reference

  • .claude/plans/burn-ndarray-parity-sprint-v1.md — Item (9)
  • Sister PRs in flight: A2-A12 (worktree-isolated)

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj


Generated by Claude Code

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7189779152

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/simd_neon.rs
Comment on lines +608 to +611
vminq_f32(self.0[0], other.0[0]),
vminq_f32(self.0[1], other.0[1]),
vminq_f32(self.0[2], other.0[2]),
vminq_f32(self.0[3], other.0[3]),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve scalar NaN semantics in SIMD min/max

Switching F32x16::simd_min/simd_max (and the analogous F64x8 methods in this module) to vminq_*/vmaxq_* changes behavior from the previous scalar fallback, which used f32::min/max and f64::min/max and returns the numeric operand when only one side is NaN. On aarch64 this now propagates NaN through vector lanes, which can flip downstream masks/results in code that relies on simd_min/simd_max (for example slab-style interval logic in hpc/aabb.rs) whenever an intermediate lane becomes NaN (e.g., 0.0 * inf). Please use the NaN-minimum intrinsics (vminnmq_*/vmaxnmq_*) or a compatible fallback to keep prior semantics.

Useful? React with 👍 / 👎.

Burn parity item 9: F32x16/F64x8 on aarch64 previously dispatched to
the scalar fallback in simd::scalar (element-wise [f32; 16] loops).
Add a real NEON-backed implementation in simd_neon::aarch64_simd,
modeled on the AVX2 polyfill's dual-tuple shape:

  F32x16 = [float32x4_t; 4]   (4x vld1q_f32 / vst1q_f32 / vfmaq_f32 /
                               vaddq_f32 etc. per op)
  F64x8  = [float64x2_t; 4]   (4x vld1q_f64 / vst1q_f64 / vfmaq_f64)

Hot-path arithmetic (add, sub, mul, div, mul_add, splat, abs, neg,
sqrt, round, floor, simd_min/max, reduce_sum) compiles to one NEON
instruction per 128-bit lane pair. Comparisons and bit-cast helpers
round-trip through to_array, same shape as simd_avx2.

simd.rs: mod scalar -> pub(crate) mod scalar (so simd_neon can pull
I32x16/U32x16/U64x8 from there). aarch64 branch pulls F32x16/F64x8
from simd_neon::aarch64_simd; integer + 256-bit float types still
come from scalar. Other non-x86 targets (wasm/riscv) keep full
scalar fallback.

simd_neon.rs: pub mod aarch64_simd (~600 LOC) plus 5 smoke tests
gated on cfg(target_arch = "aarch64", test).

Build:
- cargo build --release --lib -p ndarray (x86_64 AVX-512): PASS
- aarch64 cross-compile of just our types compiles cleanly (uses
  only stable core::arch::aarch64 intrinsics shipped since 1.59);
  full lib cross-compile blocked in this env by blake3 needing
  aarch64-linux-gnu-gcc which is not installed.
@AdaWorldAPI AdaWorldAPI force-pushed the claude/burn-A7-neon-verify branch from 7189779 to be35795 Compare April 30, 2026 09:51
@AdaWorldAPI AdaWorldAPI merged commit 44c0845 into master Apr 30, 2026
5 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants