Honor user --bind-to / --map-by in --mpi-params by FileSystemGuy · Pull Request #428 · mlcommons/storage

FileSystemGuy · 2026-06-11T05:41:28Z

Summary

generate_mpi_prefix_cmd unconditionally appended --bind-to none --map-by {node,socket} defaults and then appended the user's --mpi-params, producing duplicate flags that OpenMPI rejects when a user supplied their own --bind-to or --map-by.

What changed

Added _mpi_params_contain_flag() helper that recognizes --flag, -flag, --flag=val, and -flag=val token forms. (Necessary because cli_parser flattens --mpi-params into a token list via shlex.split, so a substring-on-whole-strings check would not reliably trigger.)
Suppress the --bind-to default and the --map-by default independently — overriding one no longer silently drops the other's default.
Untangled BTL transport selection/logging from the bind/map block so single-host BTL behavior is unaffected by overrides.
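
For readers who want a feel for the token-form matching described above, here is a minimal sketch. It is not the code merged in this PR: the function name `mpi_params_contain_flag` (without the leading underscore), its signature, and the convention of passing the flag name without dashes are all assumptions made for illustration. It only assumes, as the description says, that `--mpi-params` arrives as a flat token list.

```python
import shlex


def mpi_params_contain_flag(mpi_params, flag):
    """Return True if any token in mpi_params names `flag`.

    Hypothetical helper: `flag` is passed without leading dashes
    (e.g. "bind-to"), and `mpi_params` is a flat list of tokens.
    """
    names = {f"--{flag}", f"-{flag}"}
    for token in mpi_params or []:
        # Drop an attached value so "--bind-to=core" compares as "--bind-to".
        head = token.split("=", 1)[0]
        if head in names:
            return True
    return False


# Tokens as they might look after shlex.split on the raw option string.
tokens = shlex.split("--bind-to=core --oversubscribe -x FOO=bar")
print(mpi_params_contain_flag(tokens, "bind-to"))  # flag present in =val form
print(mpi_params_contain_flag(tokens, "map-by"))   # flag absent
```

Comparing whole tokens rather than substrings means a value such as `FOO=bar` or a longer flag such as `--bind-to-core` does not count as a match.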

Test plan

All 19 existing TestGenerateMpiPrefixCmd cases pass unchanged
6 new cases cover: bind-only override, map-only override, both overridden, --flag=val form, single-dash -flag form, unrelated --mpi-params flags leaving defaults intact
Full unit suite: 1,335 passed, 4 skipped

generate_mpi_prefix_cmd unconditionally appended defaults (--bind-to none, --map-by node/socket) and then appended user --mpi-params, producing duplicate flags that OpenMPI rejects. Detect each flag independently across token forms (--flag, -flag, --flag=val) and only emit the default for whichever the user did not supply, so overriding one does not silently drop the other.
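
The independent suppression of the two defaults could look roughly like the sketch below. The function `binding_args` and its parameters are hypothetical and are not taken from the repository. The choice of `node` for multi-host and `socket` for single-host runs is inferred from the validation output in this thread, not from the source code.

```python
def binding_args(mpi_params, multi_host):
    """Return default bind/map flags (where not overridden) plus user params.

    Hypothetical sketch: `mpi_params` is a flat token list and
    `multi_host` selects the assumed --map-by default.
    """

    def user_set(flag):
        # Match --flag, -flag, --flag=val and -flag=val token forms.
        names = (f"--{flag}", f"-{flag}")
        return any(tok.split("=", 1)[0] in names for tok in mpi_params)

    args = []
    # Each default is decided on its own, so overriding one keeps the other.
    if not user_set("bind-to"):
        args += ["--bind-to", "none"]
    if not user_set("map-by"):
        args += ["--map-by", "node" if multi_host else "socket"]
    return args + list(mpi_params)


print(" ".join(binding_args(["--bind-to", "core"], multi_host=False)))
# --map-by socket --bind-to core
```

With only `--bind-to` supplied, the `--map-by` default is still emitted, which is the behavior the description calls out.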

github-actions · 2026-06-11T05:41:36Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

idevasena · 2026-06-13T01:02:25Z

Validated fix:
Functional validation:

smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv run python validate_pr428.py 
PASS  bind-to override (single host)
      mpirun -n 2 -host h1:2 --map-by socket --bind-to core
PASS  map-by override (multi host)
      mpirun -n 8 -host h1:4,h2:4 --bind-to none --map-by ppr:4:node
PASS  equals form, both overridden
      mpirun -n 4 -host h1:4 --bind-to=core --map-by=numa
PASS  unrelated params keep defaults
      mpirun -n 4 -host h1:4 --bind-to none --map-by socket --oversubscribe --allow-run-as-root --mca btl tcp,self -x FOO=bar

smrc@dskbd029:~/Storage_Repo_Tests/storage$ git status
On branch pr428
nothing to commit, working tree clean
smrc@dskbd029:~/Storage_Repo_Tests/storage$ git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv run python validate_pr428.py 
FAIL  bind-to override (single host)
      mpirun -n 2 -host h1:2 --bind-to none --map-by socket --bind-to core
FAIL  map-by override (multi host)
      mpirun -n 8 -host h1:4,h2:4 --bind-to none --map-by node --map-by ppr:4:node
FAIL  equals form, both overridden
      mpirun -n 4 -host h1:4 --bind-to none --map-by socket --bind-to=core --map-by=numa
PASS  unrelated params keep defaults
      mpirun -n 4 -host h1:4 --bind-to none --map-by socket --oversubscribe --allow-run-as-root --mca btl tcp,self -x FOO=bar

End-to-end check (full log below): mpirun parsed the command line, accepted --bind-to core --map-by ppr:2:node, launched both ranks, and the real dlio_benchmark from the venv started up. The only failure is DLIO's "Not enough training dataset is found" exception, raised because /tmp/d holds only 168 leftover files versus the 36,889 that unet3d requires. There is no MPI flag error anywhere: no "unknown option" and no duplicate-flag rejection.

Inspected the run metadata — the executed command is persisted as a run artifact:

RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
Pass criteria: exactly one --bind-to (yours, core) and one --map-by (yours, ppr:2:node), and no --bind-to none.
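
The same pass criteria can be checked mechanically. The sketch below is an illustration only: `flag_values` and `meets_pass_criteria` are hypothetical names, and the command string in the usage line is a shortened version of the one recorded in the metadata file.

```python
import shlex


def flag_values(command, flag):
    """Collect every value given for --flag / -flag in a command string."""
    names = {f"--{flag}", f"-{flag}"}
    tokens = shlex.split(command)
    values = []
    for i, tok in enumerate(tokens):
        head, sep, attached = tok.partition("=")
        if head not in names:
            continue
        if sep:
            values.append(attached)       # --flag=val form
        elif i + 1 < len(tokens):
            values.append(tokens[i + 1])  # --flag val form
        else:
            values.append(None)           # flag with no value at end of line
    return values


def meets_pass_criteria(command):
    # Exactly one --bind-to (core) and one --map-by (ppr:2:node);
    # a leftover "--bind-to none" would make the first list longer.
    return (flag_values(command, "bind-to") == ["core"]
            and flag_values(command, "map-by") == ["ppr:2:node"])


cmd = "mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node dlio_benchmark"
print(meets_pass_criteria(cmd))
```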

smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage whatif training unet3d run file   --hosts localhost --client-host-memory-in-gb 64   --num-accelerators 2 --ac
Setting attr from num_accelerators to 2
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-12 23:56:27|STATUS:   benchmark:                      training
2026-06-12 23:56:27|STATUS:   command:                        run
2026-06-12 23:56:27|STATUS:   mode:                           whatif
2026-06-12 23:56:27|STATUS:   data_dir:                       /tmp/d
2026-06-12 23:56:27|STATUS:   results_dir:                    /tmp/r
2026-06-12 23:56:27|STATUS:   data_access_protocol:           file
2026-06-12 23:56:27|STATUS:   num_accelerators:               2
2026-06-12 23:56:27|STATUS:   num_processes:                  2
2026-06-12 23:56:27|STATUS:   accelerator_type:               h100
2026-06-12 23:56:27|STATUS:   client_host_memory_in_gb:       64.0
2026-06-12 23:56:27|STATUS:   hosts:                          ['localhost']
2026-06-12 23:56:27|STATUS:   exec_type:                      mpi
2026-06-12 23:56:27|STATUS:   mpi_bin:                        mpirun
2026-06-12 23:56:27|STATUS:   loops:                          1
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Environment ---
2026-06-12 23:56:27|STATUS:   MLPERF_RESULTS_DIR:             [not set]
2026-06-12 23:56:27|STATUS:   MPI_RUN_BIN:                    [not set]
2026-06-12 23:56:27|STATUS:   MPI_EXEC_BIN:                   [not set]
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|WARNING: Skipping environment validation (--skip-validation flag)
2026-06-12 23:56:27|STATUS: Benchmark results directory: /tmp/r/training/unet3d/run/20260612_235627
2026-06-12 23:56:27|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.py (persisted as run artifact)
2026-06-12 23:56:27|INFO: Running MPI collection across 1 host(s)
2026-06-12 23:56:29|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:29|INFO: Created benchmark run: training_run_unet3d_20260612_235627
2026-06-12 23:56:29|STATUS: Verifying benchmark run for training_run_unet3d_20260612_235627
2026-06-12 23:56:29|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-06-12 23:56:29|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 36889, Actual: 168)
2026-06-12 23:56:29|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='unet3d', run_datetime='20260612_235627')])
2026-06-12 23:56:29|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or o
2026-06-12 23:56:29|INFO: Creating data directory: /tmp/d/unet3d...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/train...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/valid...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/test...
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:29|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
⠙ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:01
⠙ Collecting via MPI... 0:00:012026-06-12 23:56:31|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:31|INFO: MPI BTL transport: auto (OpenMPI default selection)
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/tmp/d/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 168
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-12T23:56:34.001746 Running DLIO [Training & Checkpointing] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching 
effect!!!
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60930,1],0]
  Exit code:    1
--------------------------------------------------------------------------
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:35|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
2026-06-12 23:56:37|STATUS: Writing metadata for benchmark to: /tmp/r/training/unet3d/run/20260612_235627/training_20260612_235627_metadata.json
smrc@dskbd029:~/Storage_Repo_Tests/storage$ RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/

FileSystemGuy requested a review from a team June 11, 2026 05:41

FileSystemGuy added 6 commits June 11, 2026 12:41

Merge branch 'main' into FileSystemGuy-mpi-params

b1b02e9

Merge branch 'main' into FileSystemGuy-mpi-params

ad01f50

Merge branch 'main' into FileSystemGuy-mpi-params

0473952

Merge branch 'main' into FileSystemGuy-mpi-params

f1ea775

Merge branch 'main' into FileSystemGuy-mpi-params

5a8c03a

Merge branch 'main' into FileSystemGuy-mpi-params

d50cadd

idevasena approved these changes Jun 13, 2026

idevasena merged commit 4777814 into main Jun 13, 2026
3 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 13, 2026

FileSystemGuy deleted the FileSystemGuy-mpi-params branch June 13, 2026 01:07

idevasena merged 7 commits into main from FileSystemGuy-mpi-params
