Migrate Benchmark #731
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #731      +/-   ##
==========================================
+ Coverage   74.82%   74.93%   +0.11%
==========================================
  Files          56       56
  Lines        8095     8095
  Branches     1577     1577
==========================================
+ Hits         6057     6066       +9
+ Misses       1422     1411      -11
- Partials      616      618       +2
```
Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
```python
def _store_case_result(result: dict[str, Any], comp_size: int, comp_time: float, mem_usage: float, report_mem_usage: bool) -> None:
    if len(result["comp_size"]) == 0 or result["comp_size"][-1] != comp_size:
        result["comp_size"].append(comp_size)
        result["comp_time"].append(comp_time)
        if report_mem_usage:
            result["mem_usage"].append(mem_usage)
        return

    result["comp_time"][-1] = comp_time
    if report_mem_usage:
        result["mem_usage"][-1] = mem_usage
```
Case identity key is not unique for custom scale arrays
_store_case_result identifies a grid case by nx * ny * nz (total grid points). If two entries in the scale arrays produce the same total point count but different shapes — e.g. (2, 1) / (1, 2) with z=(1, 1) both giving 2 * start_size³ — the second case's timing will overwrite the first rather than append a new row. The default scale arrays happen to produce unique products, so this is silent with the defaults, but custom arrays passed via BenchmarkOptions can trivially trigger it and silently drop a case's result.
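A straightforward fix is to key each case on the full grid shape rather than its product. The following is a sketch only; the `shape` parameter and the extra `shapes` bookkeeping list are assumptions, not code from the PR:

```python
from typing import Any

def _store_case_result(result: dict[str, Any], shape: tuple[int, int, int],
                       comp_time: float, mem_usage: float,
                       report_mem_usage: bool) -> None:
    # Identify the case by its full (nx, ny, nz) shape; keep reporting the
    # total point count in comp_size for the output JSON.
    comp_size = shape[0] * shape[1] * shape[2]
    if len(result["shapes"]) == 0 or result["shapes"][-1] != shape:
        result["shapes"].append(shape)  # assumed extra bookkeeping list
        result["comp_size"].append(comp_size)
        result["comp_time"].append(comp_time)
        if report_mem_usage:
            result["mem_usage"].append(mem_usage)
        return

    # Same case as the previous call: overwrite the rolling values in place.
    result["comp_time"][-1] = comp_time
    if report_mem_usage:
        result["mem_usage"][-1] = mem_usage
```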
```python
def _peak_memory_bytes() -> float:
    try:
        import resource
    except ImportError:
        return float("nan")

    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if platform.system().lower() == "darwin":
        return float(usage)
    return float(usage * 1024)
```
ru_maxrss reports cumulative process-lifetime peak, not per-run peak
resource.getrusage(RUSAGE_SELF).ru_maxrss returns the highest RSS the process has ever reached since it started, not the peak during the current solver call. When report_mem_usage=True and the benchmark runs multiple grid sizes, every case after the largest one seen so far will record the earlier run's peak — the values will plateau rather than reflect per-run memory. Averaging this monotonically non-decreasing value further masks the issue.
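One per-run alternative is to reset a tracker around each solver call. This is a sketch, not the PR's code, and `tracemalloc` only tracks Python-level allocations, so native buffers allocated by a compiled solver backend would not be counted:

```python
import tracemalloc

def _peak_python_alloc_bytes(fn, *args, **kwargs) -> float:
    # Hypothetical helper: start/stop tracemalloc around a single call so the
    # reported peak belongs to this run only and cannot plateau across cases.
    tracemalloc.start()
    try:
        fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return float(peak)
```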
```python
if self.number_time_points <= 0 or self.num_averages <= 0:
    raise ValueError("number_time_points and num_averages must be positive")
if self.number_sensor_points <= 1:
    raise ValueError("number_sensor_points must be greater than 1")
```
start_size is not validated
A start_size of 0 or a negative value passes __post_init__ silently. A value of 0 produces (0, 0, 0, 0) tuples from grid_sizes and a zero-division error inside build_case when computing dx = domain_size / nx; a negative value produces equally nonsensical grid dimensions.
Suggested change:

```diff
 if self.number_time_points <= 0 or self.num_averages <= 0:
     raise ValueError("number_time_points and num_averages must be positive")
+if self.start_size <= 0:
+    raise ValueError("start_size must be positive")
 if self.number_sensor_points <= 1:
     raise ValueError("number_sensor_points must be greater than 1")
```
```python
def test_run_saves_partial_results_after_solver_error(tmp_path: Path):
    options = small_options()
    output_path = tmp_path / "benchmark.json"

    def failing_solver(*args, **kwargs):
        raise RuntimeError("solver failed")

    result = run(options, max_cases=1, output_path=output_path, solver=failing_solver)

    assert result["error_reached"] is True
    assert result["error_message"] == "solver failed"
    assert output_path.exists()
    saved = json.loads(output_path.read_text())
    assert saved["error_reached"] is True
    assert saved["error_message"] == "solver failed"
```
report_mem_usage=True path has no test coverage
The mem_usage list — conditionally added to result and to the JSON payload — is never exercised by the test suite. The _peak_memory_bytes helper, the rolling-average accumulation of loop_mem_usage, and the _save_results branch that serialises mem_usage are all untested.
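A test covering this path could look like the sketch below. It assumes `small_options()` returns a mutable `BenchmarkOptions` with a `report_mem_usage` field and that `run()` accepts the same injected solver as the existing tests:

```python
def test_run_records_mem_usage_when_requested(tmp_path: Path):
    options = small_options()
    options.report_mem_usage = True  # assumed dataclass field, per the PR summary
    output_path = tmp_path / "benchmark.json"

    def fake_solver(*args, **kwargs):
        return None  # timing/memory bookkeeping happens inside run(), not here

    result = run(options, max_cases=1, output_path=output_path, solver=fake_solver)

    # One memory sample per recorded case, and it must survive serialisation.
    assert len(result["mem_usage"]) == len(result["comp_size"])
    saved = json.loads(output_path.read_text())
    assert len(saved["mem_usage"]) == len(saved["comp_size"])
```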
```python
def peak_memory_bytes() -> float:
    try:
        import resource
    except ImportError:
        return float("nan")

    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if platform.system().lower() == "darwin":
        return float(usage)
    return float(usage * 1024)
```
nan sentinel silently produces unreadable JSON on Windows
peak_memory_bytes() returns float("nan") on Windows where the resource module isn't available. That nan flows unguarded through rolling_average and store_case_result into result["mem_usage"]. Python's json.dumps serialises nan as the bare token NaN (not valid JSON), and Python's own json.loads will then reject the file with a ValueError — so any run with --report-mem-usage on Windows silently writes a file that cannot be read back.
Consider raising early (ValueError: report_mem_usage is not supported on this platform) when resource is unavailable, or filtering/skipping the nan entry in save_results.
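The filtering option might look like the sketch below. The function body is assumed, not the PR's actual implementation, and dropping non-finite samples via `math.isfinite` is an assumption about the desired behaviour:

```python
import json
import math
from pathlib import Path
from typing import Any

def save_results(path: Path, result: dict[str, Any]) -> None:
    # Drop non-finite memory samples and forbid NaN/Infinity outright so the
    # output always stays strict, round-trippable JSON.
    payload = dict(result)
    if "mem_usage" in payload:
        payload["mem_usage"] = [m for m in payload["mem_usage"] if math.isfinite(m)]
    path.write_text(json.dumps(payload, indent=2, allow_nan=False))
```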
Migrates `benchmark.m` script with some additions. `benchmarks/README.md` shows sample usage.

Example output json file
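A minimal usage sketch for reference; the import paths and default-constructed options below are assumptions, and the actual sample usage lives in `benchmarks/README.md`:

```python
from pathlib import Path

from benchmarks.benchmark import run              # module layout assumed
from benchmarks.helpers import BenchmarkOptions   # dataclass location assumed

options = BenchmarkOptions()  # defaults; fields are described in this PR
result = run(options, output_path=Path("benchmark.json"))

print(result["comp_size"])   # total grid points per case
print(result["comp_time"])   # averaged solver time per case
```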
Greptile Summary
This PR ports MATLAB's `benchmark.m` into a new `benchmarks/` package, adding a `BenchmarkOptions` dataclass, a `run()` function with injected solver/timer for testability, incremental JSON persistence, and a CLI entry point. The implementation is well-structured with good fault-tolerance design (partial saves after each average).

- `benchmarks/helpers.py` defines the full domain setup (`build_case`), result accumulation (`store_case_result`, `rolling_average`), and JSON serialization (`save_results`).
- `benchmarks/benchmark.py` wires everything into a `run()` loop and a `main()` CLI; the solver and timer are injectable, making unit tests fast and hermetic.
- `tests/test_benchmark.py` provides solid coverage of the happy path and error path, but does not exercise the `report_mem_usage=True` branch.

Confidence Score: 4/5
Safe to merge for the typical Linux/macOS benchmark use-case; the nan-serialization defect only bites when --report-mem-usage is passed on Windows.
When report_mem_usage=True is used on Windows, peak_memory_bytes() returns float("nan"), which json.dumps writes as the bare token NaN — producing a file that Python's own json.loads rejects. The rest of the benchmark logic (timing, grid construction, incremental saves, error handling) looks correct and is well-tested.
benchmarks/helpers.py — specifically the peak_memory_bytes function and its unchecked nan return value
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    CLI["main() / CLI args"] --> RUN["run(options, backend, device, ...)"]
    RUN --> GS["grid_sizes(options)"]
    GS --> LOOP["for each (nx, ny, nz, scale)"]
    LOOP --> BC["build_case(options, nx, ny, nz, scale)"]
    BC --> AVG["for loop_num in 1..num_averages"]
    AVG --> SOLVE["solver(kgrid, medium, source, sensor, ...)"]
    SOLVE --> TIME["elapsed = timer() - start"]
    TIME --> RA["rolling_average(loop_time, elapsed, loop_num)"]
    RA --> MEM{"report_mem_usage?"}
    MEM -- yes --> PMB["peak_memory_bytes() - returns nan on Windows"]
    MEM -- no --> SCR
    PMB --> SCR["store_case_result(result, nx*ny*nz, ...)"]
    SCR --> SAVE["save_results(path, result)"]
    SAVE --> AVG
    LOOP --> EXC{"Exception?"}
    EXC -- yes --> ERR["set error_reached=True, save_results, break"]
    EXC -- no --> LOOP
    RUN --> RET["return result dict"]
```

Reviews (2): Last reviewed commit: "helpers"