System Metrics by jonasbhend · Pull Request #149 · MeteoSwiss/evalml

jonasbhend · 2026-05-06T08:30:57Z

Example dashboard

file:///M:/zue-prod/fc_development/seamless/S-RUC/evaluation/MRB-640_dashboard.html

Context

As a SRUC developer, I want to assess system metrics (GPU usage, memory footprint, …) for the inference jobs of a given experiment. Similar to model metrics, system metrics could be collected for individual timesteps, aggregated, and visualized in the dashboard in a dedicated tab.

Discussion

This is a first working prototype. Clearly, insight into system metrics from within anemoi-inference, to diagnose time spent on data I/O and compute, would be more helpful. These could be parsed from the logs, or added specifically to anemoi-inference (runner class).

The actual evalml rules are not profiled, that is we don't know where in the evaluation pipeline we spend the most time. This is something that would be useful to track, mostly for assessment of performance improvements and avoiding resource contention. This will be tackled as part of a separate PR.

Summary of changes

- Inference jobs are launched with a side-car to track GPU hours and memory usage
- Inference wall-time, GPU hours, and peak memory usage are displayed in the dashboard
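
A minimal sketch of what such a side-car could sample, assuming `nvidia-smi` is available on the compute node. The function names, the fixed sampling interval, and the utilization-weighted definition of "busy GPU hours" are illustrative assumptions, not the implementation in this PR.

```python
import subprocess

# Both fields are standard nvidia-smi --query-gpu properties.
QUERY = "utilization.gpu,memory.used"


def parse_gpu_sample(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    Returns one (utilization_percent, memory_used_mib) tuple per GPU line.
    """
    samples = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        samples.append((float(util), float(mem)))
    return samples


def summarize(samples_over_time, interval_s):
    """Reduce a list of per-interval samples to busy GPU hours and peak memory."""
    # Assumption: busy GPU hours = utilization-weighted time, summed over GPUs.
    busy_seconds = sum(
        util / 100.0 * interval_s
        for sample in samples_over_time
        for util, _ in sample
    )
    peak_mem = max(
        (mem for sample in samples_over_time for _, mem in sample), default=0.0
    )
    return {"busy_gpu_hours": busy_seconds / 3600.0, "peak_memory_mib": peak_mem}


def sample_once():
    """Take one sample from the local GPUs (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_sample(out)
```

A side-car loop would call `sample_once()` every `interval_s` seconds while the inference job runs and pass the collected samples to `summarize`.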

frazane · 2026-05-06T08:39:21Z

I like the idea of collecting these metrics, but is parsing logs the best way to go? Perhaps a better approach would be to implement the inference program profiling in anemoi-inference directly?

Also, FYI: https://anemoi.readthedocs.io/projects/inference/en/latest/usage/optimisation.html#profiling-and-troubleshooting

jonasbhend · 2026-05-06T09:01:23Z

> I like the idea of collecting these metrics, but is parsing logs the best way to go? Perhaps a better approach would be to implement the inference program profiling in anemoi-inference directly?
>
> Also, FYI: https://anemoi.readthedocs.io/projects/inference/en/latest/usage/optimisation.html#profiling-and-troubleshooting

Thanks for the hint, @frazane! But even with inference program profiling we would still be reading logs to make this information accessible in the dashboard, no?

frazane · 2026-05-06T09:16:26Z

Sorry, I didn't mean we should use what I put in the FYI, but I thought it was relevant.

I was thinking more about something like structured, machine-readable logs or profiling results, possibly using existing tools like py-spy, memory_profiler, the PyTorch profiler, etc.
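
As a rough illustration of structured, machine-readable logs, the sketch below writes one JSON record per timed stage (JSON Lines) that a dashboard could read without regex-parsing free-form log text. The helper names and the stage names `data_io` and `compute` are hypothetical, and nothing here is part of anemoi-inference.

```python
import io
import json
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(stream, stage, **extra):
    """Time the enclosed block and write one JSON record to `stream`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record = {"stage": stage, "seconds": time.perf_counter() - start, **extra}
        stream.write(json.dumps(record) + "\n")


def read_records(text):
    """Parse JSON Lines text back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Usage: an in-memory stream stands in for a log file.
log = io.StringIO()
with timed_stage(log, "data_io", step=0):
    sum(range(1000))  # stand-in for reading input data
with timed_stage(log, "compute", step=0):
    sum(range(1000))  # stand-in for a model forward pass
records = read_records(log.getvalue())
```

Each record is self-describing, so aggregating time per stage becomes a simple group-by on the `stage` key.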

jonasbhend · 2026-05-06T09:26:26Z

> Sorry, I didn't mean we should use what I put in the FYI, but I thought it was relevant.
>
> I was thinking more about something like structured, machine-readable logs or profiling results, possibly using existing tools like py-spy, memory_profiler, the PyTorch profiler, etc.

Sounds like a great alternative. Do we have experience in using such tools on our HPC systems? I clearly don't and wouldn't really know where to start. So in case this is where this should be going, I suggest someone else picks this up. @dnerini @cosunae thoughts?

jonasbhend · 2026-05-06T12:00:29Z

Here is an example dashboard (all coded by our dear friend with minimal intervention from my side):

M:/zue-prod/fc_development/seamless/S-RUC/evaluation/MRB-640_dashboard.html

Do you think this is even remotely useful (there is virtually no spread in the results with only three initializations being processed)? Would you expect to see other things (GPU usage, memory use, ...)? Is there a better way than parsing the anemoi logs?

Thanks for your feedback: @dnerini @frazane @MicheleCattaneo @icedoom888

radiradev · 2026-05-06T12:31:07Z

As @frazane is suggesting, I think it would be nice to have an extensive log using the torch profiler. I think https://github.com/gaogaotiantian/viztracer might be a good option: it seems to be lightweight and in an online format, but the default solution is for the dashboard to be self-hosted.

dnerini · 2026-05-06T12:34:17Z

Thanks @jonasbhend, this already looks very good! I think what we're looking at in terms of visualization is some sort of distribution view given all individual runs included in an experiment. We did something similar (manually) for the SDL-25 some time ago.

Concerning the profiling approach, I agree that parsing the Anemoi logs is perhaps not the best approach. I wonder if we could rely on SLURM instead to collect some basic statistics, which would also decouple it from any specific ML framework.

I could also imagine having something very lightweight and SLURM-based that provides basic statistics for all runs, while something more involved could be used in parallel on only a few test runs to collect more detailed information?
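
A minimal sketch of the lightweight SLURM-based variant, assuming job accounting is enabled and that the site tracks `gres/gpu` in `AllocTRES` (both are site-dependent). The function names are hypothetical, and GPU hours are taken here as wall time multiplied by the number of allocated GPUs.

```python
import re
import subprocess


def elapsed_to_seconds(elapsed):
    """Convert SLURM's [DD-]HH:MM:SS elapsed format to seconds."""
    days, _, clock = elapsed.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def rss_to_mib(max_rss):
    """Convert a MaxRSS value such as '2097152K' to MiB (empty -> 0.0)."""
    if not max_rss:
        return 0.0
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}
    if max_rss[-1] in units:
        return float(max_rss[:-1]) * units[max_rss[-1]]
    return float(max_rss) / 1024 ** 2  # no suffix: treated as bytes (assumption)


def summarize_job(sacct_text):
    """Summarise pipe-delimited sacct output: JobID|Elapsed|MaxRSS|AllocTRES."""
    wall_s, n_gpus, peak_mib = 0, 0, 0.0
    for line in sacct_text.strip().splitlines():
        job_id, elapsed, max_rss, tres = line.split("|")
        peak_mib = max(peak_mib, rss_to_mib(max_rss))  # MaxRSS is set on job steps
        if "." not in job_id:  # the allocation line carries wall time and TRES
            wall_s = elapsed_to_seconds(elapsed)
            match = re.search(r"gres/gpu=(\d+)", tres)
            n_gpus = int(match.group(1)) if match else 0
    return {
        "wall_minutes": wall_s / 60,
        "gpu_hours": wall_s * n_gpus / 3600,
        "peak_rss_mib": peak_mib,
    }


def query_job(job_id):
    """Run sacct for one job and summarise it (requires SLURM accounting)."""
    cmd = ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
           "--format=JobID,Elapsed,MaxRSS,AllocTRES"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return summarize_job(out)
```

Because this only reads accounting data after the job has finished, it adds no overhead to the inference run and does not depend on the ML framework.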

lclanzi · 2026-06-24T08:24:59Z

Thanks, the dashboard displays the metrics correctly and the pipeline looks OK to me.

Some comments:

- Anemoi-inference supports sharding during inference (though I’m not sure if we use it). Do we handle metrics correctly in that case? Also, if only 1 GPU per node is used, are metrics reported for the full node (e.g. 4 GPUs) or only for the active GPU?
- It might be useful to normalize some of the metrics. The most obvious one to me is peak CPU memory, normalized by the total available system memory.
- I agree that a full-fledged profiler is overkill for the dashboard. However, it might be useful to add a simple table with runtime per Snakemake rule (maybe normalized by the total execution time). The runtime per rule should be available in .snakemake/metadata via starttime/endtime. Just an idea, it may be worth discussing.
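
The per-rule runtime table could be sketched roughly as follows, assuming each file under `.snakemake/metadata` is a JSON record with `rule`, `starttime`, and `endtime` keys, as the last point suggests. The exact layout may differ between Snakemake versions, and the function names are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def runtime_per_rule(metadata_dir):
    """Sum (endtime - starttime) per rule over all JSON records in a directory."""
    totals = defaultdict(float)
    for path in Path(metadata_dir).iterdir():
        if not path.is_file():
            continue
        try:
            record = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip anything that is not a JSON record
        if not isinstance(record, dict):
            continue
        rule, start, end = (record.get(k) for k in ("rule", "starttime", "endtime"))
        if rule is not None and start is not None and end is not None:
            totals[rule] += end - start
    return dict(totals)


def as_fractions(totals):
    """Normalise per-rule runtimes by the summed runtime of all rules."""
    total = sum(totals.values())
    if not total:
        return {}
    return {rule: seconds / total for rule, seconds in totals.items()}
```

Note that the normalisation uses the sum of rule runtimes, which exceeds the wall-clock time of the workflow whenever rules run in parallel.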

dnerini · 2026-06-24T09:02:14Z

> The actual evalml rules are not profiled, that is we don't know where in the evaluation pipeline we spend the most time. This is something that would be useful to track, mostly for assessment of performance improvements and avoiding resource contention. This will be tackled as part of a separate PR.

My take on this: the point of the dashboard is to show model inference metrics, so I would strongly argue against showing metrics for other evalml rules, since profiling evalml would be out of scope in my opinion (note that snakemake already provides profiling of all rules when creating a report).

Another comment: I wonder if the system metrics would be better displayed as distributions (histograms) rather than time series?
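
To illustrate the distribution view, here is a small sketch that bins one metric (for example wall time per run) into equal-width bins using only the standard library. The function name and the default bin count are arbitrary choices for illustration.

```python
def histogram(values, n_bins=5):
    """Return (bin_edges, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts
```

With only a handful of runs per experiment most bins stay empty, so a view like this becomes informative mainly once many initializations are included.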

jonasbhend added 3 commits May 6, 2026 09:55

Initial suggestion from our dear friend

4e3fae0

fix referencing

6755cda

exclude rulegraph and dag from being tracked

d8249d5

expose individual runs

1b871a7

jonasbhend added 14 commits June 2, 2026 09:17

add slurm-based collection of system metrics for inference

58f9a6c

log accumulation for dashboard

3d06438

update dashboard

cde34d7

Merge branch 'main' into MRB-640-Add-system-metrics-to-dashboard

efabd5e

fix error in inference from merging in main

2d9f358

linting

16c4df7

convert to wall time to minutes

b9ec45e

fix getSelected error

6424430

remove duplicates

1709ed7

fix indentation error

7c9af61

reintroduce fix for n_samples

b9e1df3

use interpolator label for corresponding forecaster run

4895066

fix failing test

723a013

Merge branch 'main' into MRB-640-Add-system-metrics-to-dashboard

ef1e5dc

jonasbhend marked this pull request as ready for review June 22, 2026 07:39

dnerini requested a review from lclanzi June 23, 2026 14:25

jonasbhend wants to merge 18 commits into main from MRB-640-Add-system-metrics-to-dashboard
