WASM targets currently use the software fixslice implementation, which processes up to 4 blocks at a time (fixslice64). The WebAssembly simd128 proposal has been stabilized in the spec for some time now, and the corresponding intrinsics ([core::arch::wasm32::*]) have been stable in Rust since 1.54. Every major WebAssembly runtime ships simd128 support today.
We use AES extensively in projects that target browsers, so AES throughput in WASM is a major performance lever for us. Translating the existing fixslice64 implementation to 128-bit lanes (AI-assisted; I'm not sure what your policy on that is) yields the expected >2× throughput increase in V8 (Chromium) and >3× in SpiderMonkey (Firefox).
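For context, a minimal portable sketch of the idea (the `Lane128` type here emulates a 128-bit lane as two `u64` halves purely for illustration; the actual port would use `core::arch::wasm32`'s `v128` type and intrinsics such as `v128_xor`/`v128_and`, which only compile for the wasm32 target):

```rust
// Illustrative only: the bitsliced operations in the fixslice code (XOR,
// AND, shifts) map 1:1 from u64 words to 128-bit lanes, so widening the
// word size is largely mechanical. Lane128 is a stand-in for wasm32's v128.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Lane128 {
    lo: u64,
    hi: u64,
}

impl Lane128 {
    // Analogue of core::arch::wasm32::v128_xor: lane-wise XOR.
    fn xor(self, other: Lane128) -> Lane128 {
        Lane128 { lo: self.lo ^ other.lo, hi: self.hi ^ other.hi }
    }
    // Analogue of core::arch::wasm32::v128_and: lane-wise AND.
    fn and(self, other: Lane128) -> Lane128 {
        Lane128 { lo: self.lo & other.lo, hi: self.hi & other.hi }
    }
}

fn main() {
    let a = Lane128 { lo: 0x0123_4567_89ab_cdef, hi: 0xdead_beef_cafe_f00d };
    let b = Lane128 { lo: 0xffff_0000_ffff_0000, hi: 0x0f0f_0f0f_0f0f_0f0f };
    let c = a.xor(b);
    // Each 64-bit half behaves exactly like the existing fixslice64 path.
    assert_eq!(c.lo, 0xfedc_4567_7654_cdef);
    assert_eq!(c.hi, 0xd1a2_b1e0_c5f1_ff02);
    println!("lane-wise xor ok");
}
```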
(AMD Ryzen 7 5800X)

| Unit: MB/s  | wasmtime soft | wasmtime simd128 | wasmtime relative | v8 soft | v8 simd128 | v8 relative | sm soft | sm simd128 | sm relative |
|-------------|---------------|------------------|-------------------|---------|------------|-------------|---------|------------|-------------|
| AES-128 enc | 186           | 184              | 0.99×             | 258     | 604        | 2.34×       | 213     | 657        | 3.08×       |
| AES-128 dec | 185           | 122              | 0.66×             | 235     | 508        | 2.16×       | 182     | 598        | 3.29×       |
| AES-192 enc | 159           | 150              | 0.94×             | 224     | 522        | 2.33×       | 185     | 588        | 3.18×       |
| AES-192 dec | 159           | 100              | 0.63×             | 202     | 486        | 2.41×       | 153     | 577        | 3.77×       |
| AES-256 enc | 138           | 140              | 1.01×             | 192     | 455        | 2.37×       | 159     | 491        | 3.09×       |
| AES-256 dec | 135           | 89               | 0.66×             | 171     | 424        | 2.48×       | 134     | 453        | 3.38×       |
As the table shows, simd128 regresses Wasmtime decrypt performance due to poor code generation in Cranelift. I haven't found a mitigation that doesn't incur a significant throughput loss in the browser runtimes. I'm not sure whether the right approach is to accept the regression by default or to put the SIMD path behind a flag.
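If the flag route is taken, compile-time backend selection could look something like the following sketch (the `wasm-force-soft` feature name and the backend modules are illustrative assumptions, not existing crate APIs):

```rust
// Hypothetical gating sketch: default to the simd128 backend on wasm32
// targets that enable the feature, with an opt-out escape hatch for
// runtimes like Wasmtime where the SIMD decrypt path currently regresses.

mod fixslice64_backend {
    pub fn name() -> &'static str {
        "fixslice64 (portable)"
    }
}

#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod simd128_backend {
    pub fn name() -> &'static str {
        "fixslice128 (simd128)"
    }
}

// On wasm32 with simd128 enabled (and no opt-out), use the wide backend...
#[cfg(all(
    target_arch = "wasm32",
    target_feature = "simd128",
    not(feature = "wasm-force-soft")
))]
use simd128_backend as backend;

// ...otherwise fall back to the existing portable fixslice64 path.
#[cfg(not(all(
    target_arch = "wasm32",
    target_feature = "simd128",
    not(feature = "wasm-force-soft")
)))]
use fixslice64_backend as backend;

fn main() {
    println!("selected backend: {}", backend::name());
}
```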
Implementation
The benchmarks can be reproduced with wasm-harness.
I'd love to upstream this if possible; if so, I'll open a PR.