Across 014 and 016, the Metal drawable-stall flake on macOS made steady-state FPS comparisons unreliable — every version change required waiting 5-10 minutes between runs for the OS to release its hold on the swapchain. This made it impossible to attach solid numbers to V1-V4 perf claims like "32 rays/probe + importance sampling matches 32 rays/probe uniform". Visual captures (docs/perf/ticket-*-after.png) are the de-facto acceptance signal.
A proper harness would let us actually measure:
- Temporal convergence rate (how many frames to hit some noise threshold)
- Equal-quality-at-lower-rays comparison
- SSIM vs. reference on a fixed camera pose
- FPS under consistent Metal state
Scope
- A dedicated benchmark binary (can reuse the
intel-sponza example) that:
- Forces a clean Metal state on start (workaround or explicit wait)
- Runs a warmup window (~60 frames) before measurements
- Captures N frames with timestamp queries + readback
- Writes results to JSON for regression comparison
- Python SSIM/PSNR harness comparing output frames vs. a baseline capture
- CI-friendly output format (assertion-style thresholds per metric)
Stretch
- Multi-pose sweep: stand in 4-6 fixed camera positions, capture at each
- Multi-config sweep: toggle SSGI on/off, HW/SW path, quality preset 0-4
- Result CSV for quick before/after comparison during GI work
Context
Flake documentation lives informally in the V5-V14 commit messages. Originally in the 014 closure: "FPS ≥ 40 fps ... was flake-bound throughout V8-V14 testing". A repeatable harness would close that loop.
Across 014 and 016, the Metal drawable-stall flake on macOS made steady-state FPS comparisons unreliable — every version change required waiting 5-10 minutes between runs for the OS to release its hold on the swapchain. This made it impossible to attach solid numbers to V1-V4 perf claims like "32 rays/probe + importance sampling matches 32 rays/probe uniform". Visual captures (
docs/perf/ticket-*-after.png) are the de-facto acceptance signal.A proper harness would let us actually measure:
Scope
intel-sponzaexample) that:Stretch
Context
Flake documentation lives informally in the V5-V14 commit messages. Originally in the 014 closure: "FPS ≥ 40 fps ... was flake-bound throughout V8-V14 testing". A repeatable harness would close that loop.