fix(torchtitan): correct MFU on MI350X/MI355X via peak FLOPS patch#772

Open: WangLingxun wants to merge 1 commit into main from fix/torchtitan-mi350-peak-flops-mfu

@WangLingxun

Collaborator

TorchTitan computes MFU as num_flops_per_token * tps / gpu_peak_flops, where gpu_peak_flops comes from a hardcoded device-name table in torchtitan.tools.utils.get_peak_flops. The vendored TorchTitan does not recognize the AMD MI350 series (gfx950), so the lookup falls through to the A100 fallback (312 TFLOPS). That makes the MFU denominator ~8x too small and produces impossible values (e.g. 563%). Throughput (TFLOP/s) itself was correct; only MFU was affected. MI300X was unaffected since it is present in the upstream table.
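To make the size of the error concrete, here is a small arithmetic sketch. It assumes the 563% figure was computed against the 312 TFLOPS fallback, as described above, and rescales it to the MI355X and MI350X peaks listed in this PR. The helper names (mfu_percent, rescale_mfu) are illustrative and are not TorchTitan functions.

```python
# Peak values in FLOP/s: the fallback and the two BF16 dense peaks named in this PR.
A100_FALLBACK_PEAK = 312e12
MI355X_PEAK = 2500e12
MI350X_PEAK = 2300e12


def mfu_percent(num_flops_per_token: float, tps: float, gpu_peak_flops: float) -> float:
    """MFU as a percentage, using the formula quoted in the description."""
    return 100.0 * num_flops_per_token * tps / gpu_peak_flops


def rescale_mfu(reported_percent: float, old_peak: float, new_peak: float) -> float:
    """Convert an MFU computed against old_peak to one against new_peak.

    The achieved FLOP/s is unchanged; only the denominator differs.
    """
    return reported_percent * old_peak / new_peak


# How much too small the fallback denominator is for each device.
ratio_mi355x = MI355X_PEAK / A100_FALLBACK_PEAK
ratio_mi350x = MI350X_PEAK / A100_FALLBACK_PEAK

# The impossible 563% reading, re-expressed against the real peaks.
corrected_mi355x = rescale_mfu(563.0, A100_FALLBACK_PEAK, MI355X_PEAK)
corrected_mi350x = rescale_mfu(563.0, A100_FALLBACK_PEAK, MI350X_PEAK)

print(f"denominator ratio: MI355X {ratio_mi355x:.2f}x, MI350X {ratio_mi350x:.2f}x")
print(f"563% rescaled: MI355X {corrected_mi355x:.1f}%, MI350X {corrected_mi350x:.1f}%")
```

Which of the two devices produced the 563% reading is not stated here, so both rescalings are shown.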

Add a setup-phase patch that wraps get_peak_flops and returns the correct BF16 dense peaks from AMD's product pages:

  • MI355X: 2500 TFLOPS (matches upstream TorchTitan)
  • MI350X: 2300 TFLOPS (not yet covered upstream)

The patch runs before MetricsProcessor caches gpu_peak_flops and delegates all other devices to the original implementation, so it is safe to keep after the vendored TorchTitan is updated.
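The wrap-and-delegate approach can be sketched as follows. This is a minimal illustration, not the code in this PR: it assumes get_peak_flops takes a device-name string and that the MI350-series names contain "MI355X" or "MI350X". The helper wrap_get_peak_flops, the _mi350_patched marker, and the stand-in upstream function are all hypothetical.

```python
from typing import Callable

# BF16 dense peaks quoted in this PR, in FLOP/s.
MI350_SERIES_PEAK_FLOPS = {
    "MI355X": 2500e12,
    "MI350X": 2300e12,
}


def wrap_get_peak_flops(original: Callable[[str], float]) -> Callable[[str], float]:
    """Return a wrapper that answers for MI350-series names and delegates the rest."""
    # Avoid stacking wrappers if the patch is applied more than once.
    if getattr(original, "_mi350_patched", False):
        return original

    def patched(device_name: str) -> float:
        # Assumption: the device name contains the model string as a substring.
        for marker, peak in MI350_SERIES_PEAK_FLOPS.items():
            if marker in device_name:
                return peak
        # Every other device goes to the original lookup unchanged.
        return original(device_name)

    patched._mi350_patched = True
    return patched


def _stand_in_upstream_get_peak_flops(device_name: str) -> float:
    """Stand-in for the vendored lookup: unknown devices get the A100 fallback."""
    return 312e12


patched_lookup = wrap_get_peak_flops(_stand_in_upstream_get_peak_flops)
print(patched_lookup("AMD Instinct MI355X"))  # MI355X peak from the table
print(patched_lookup("Some Other GPU"))       # delegated to the stand-in fallback
```

In a real setup phase, the wrapper would presumably be assigned back onto torchtitan.tools.utils before MetricsProcessor is constructed. Any module that already did `from torchtitan.tools.utils import get_peak_flops` holds its own reference and would need the same reassignment.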
