
Honor user --bind-to / --map-by in --mpi-params #428

Merged: idevasena merged 7 commits into main from FileSystemGuy-mpi-params on Jun 13, 2026

@FileSystemGuy

Contributor

Summary

generate_mpi_prefix_cmd unconditionally appended --bind-to none --map-by {node,socket} defaults and then appended the user's --mpi-params, producing duplicate flags that OpenMPI rejects when a user supplied their own --bind-to or --map-by.

What changed

  • Added _mpi_params_contain_flag() helper that recognizes --flag, -flag, --flag=val, and -flag=val token forms. (Necessary because cli_parser flattens --mpi-params into a token list via shlex.split, so a substring-on-whole-strings check would not reliably trigger.)
  • Suppress the --bind-to default and the --map-by default independently — overriding one no longer silently drops the other's default.
  • Untangled BTL transport selection/logging from the bind/map block so single-host BTL behavior is unaffected by overrides.
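
The helper itself is not shown in this conversation. As a rough illustration of the token matching and the independent defaults described above, a minimal Python sketch might look like the following. The names `mpi_params_contain_flag` and `default_placement_args`, and their signatures, are hypothetical; the single-host `socket` / multi-host `node` choice is inferred from the validation output in this PR, not taken from the actual code.

```python
import shlex


def mpi_params_contain_flag(tokens, flag):
    """Return True if `flag` (e.g. "bind-to") is set anywhere in `tokens`.

    Hypothetical sketch: matches the --flag, -flag, --flag=val and
    -flag=val token forms by comparing the part before any "=".
    """
    names = {f"--{flag}", f"-{flag}"}
    return any(token.split("=", 1)[0] in names for token in tokens)


def default_placement_args(tokens, multi_host):
    """Build only the defaults the user did not supply (assumed behavior).

    Each flag is checked independently, so overriding --bind-to keeps
    the --map-by default and vice versa.
    """
    args = []
    if not mpi_params_contain_flag(tokens, "bind-to"):
        args += ["--bind-to", "none"]
    if not mpi_params_contain_flag(tokens, "map-by"):
        # Assumption inferred from the validation output: node for
        # multi-host runs, socket for single-host runs.
        args += ["--map-by", "node" if multi_host else "socket"]
    return args


# The user string is tokenized with shlex.split, as the PR says cli_parser does.
tokens = shlex.split("--bind-to core --oversubscribe")
print(default_placement_args(tokens, multi_host=False))  # ['--map-by', 'socket']
```

Comparing whole tokens, rather than searching for a substring, avoids false matches such as a `--bind-to` that appears inside another argument's value.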

Test plan

  • All 19 existing TestGenerateMpiPrefixCmd cases pass unchanged
  • 6 new cases cover: bind-only override, map-only override, both overridden, --flag=val form, single-dash -flag form, unrelated --mpi-params flags leaving defaults intact
  • Full unit suite: 1,335 passed, 4 skipped

generate_mpi_prefix_cmd unconditionally appended defaults
(--bind-to none, --map-by node/socket) and then appended user
--mpi-params, producing duplicate flags that OpenMPI rejects.

Detect each flag independently across token forms (--flag, -flag,
--flag=val) and only emit the default for whichever the user did
not supply, so overriding one does not silently drop the other.
@FileSystemGuy FileSystemGuy requested a review from a team June 11, 2026 05:41
@github-actions

github-actions Bot commented Jun 11, 2026


MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@idevasena

Contributor

Validated fix:
Functional validation:

smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv run python validate_pr428.py 
PASS  bind-to override (single host)
      mpirun -n 2 -host h1:2 --map-by socket --bind-to core
PASS  map-by override (multi host)
      mpirun -n 8 -host h1:4,h2:4 --bind-to none --map-by ppr:4:node
PASS  equals form, both overridden
      mpirun -n 4 -host h1:4 --bind-to=core --map-by=numa
PASS  unrelated params keep defaults
      mpirun -n 4 -host h1:4 --bind-to none --map-by socket --oversubscribe --allow-run-as-root --mca btl tcp,self -x FOO=bar

smrc@dskbd029:~/Storage_Repo_Tests/storage$ git status
On branch pr428
nothing to commit, working tree clean
smrc@dskbd029:~/Storage_Repo_Tests/storage$ git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv run python validate_pr428.py 
FAIL  bind-to override (single host)
      mpirun -n 2 -host h1:2 --bind-to none --map-by socket --bind-to core
FAIL  map-by override (multi host)
      mpirun -n 8 -host h1:4,h2:4 --bind-to none --map-by node --map-by ppr:4:node
FAIL  equals form, both overridden
      mpirun -n 4 -host h1:4 --bind-to none --map-by socket --bind-to=core --map-by=numa
PASS  unrelated params keep defaults
      mpirun -n 4 -host h1:4 --bind-to none --map-by socket --oversubscribe --allow-run-as-root --mca btl tcp,self -x FOO=bar

End-to-end validation: in the run shown below, mpirun parsed the command line, accepted --bind-to core --map-by ppr:2:node, launched both ranks, and the real dlio_benchmark from the venv started up. The only failure is DLIO's "Not enough training dataset is found" exception, raised because /tmp/d holds only 168 leftover files versus the 36,889 that unet3d requires. There is no MPI flag error anywhere: no "unknown option", no duplicate-flag rejection.

Inspected the run metadata; the executed command is persisted as a run artifact:

RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
Pass criteria: exactly one --bind-to (the user-supplied core), exactly one --map-by (the user-supplied ppr:2:node), and no --bind-to none.
smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage whatif training unet3d run file   --hosts localhost --client-host-memory-in-gb 64   --num-accelerators 2 --ac
Setting attr from num_accelerators to 2
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-12 23:56:27|STATUS:   benchmark:                      training
2026-06-12 23:56:27|STATUS:   command:                        run
2026-06-12 23:56:27|STATUS:   mode:                           whatif
2026-06-12 23:56:27|STATUS:   data_dir:                       /tmp/d
2026-06-12 23:56:27|STATUS:   results_dir:                    /tmp/r
2026-06-12 23:56:27|STATUS:   data_access_protocol:           file
2026-06-12 23:56:27|STATUS:   num_accelerators:               2
2026-06-12 23:56:27|STATUS:   num_processes:                  2
2026-06-12 23:56:27|STATUS:   accelerator_type:               h100
2026-06-12 23:56:27|STATUS:   client_host_memory_in_gb:       64.0
2026-06-12 23:56:27|STATUS:   hosts:                          ['localhost']
2026-06-12 23:56:27|STATUS:   exec_type:                      mpi
2026-06-12 23:56:27|STATUS:   mpi_bin:                        mpirun
2026-06-12 23:56:27|STATUS:   loops:                          1
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Environment ---
2026-06-12 23:56:27|STATUS:   MLPERF_RESULTS_DIR:             [not set]
2026-06-12 23:56:27|STATUS:   MPI_RUN_BIN:                    [not set]
2026-06-12 23:56:27|STATUS:   MPI_EXEC_BIN:                   [not set]
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|WARNING: Skipping environment validation (--skip-validation flag)
2026-06-12 23:56:27|STATUS: Benchmark results directory: /tmp/r/training/unet3d/run/20260612_235627
2026-06-12 23:56:27|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.py (persisted as run artifact)
2026-06-12 23:56:27|INFO: Running MPI collection across 1 host(s)
2026-06-12 23:56:29|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:29|INFO: Created benchmark run: training_run_unet3d_20260612_235627
2026-06-12 23:56:29|STATUS: Verifying benchmark run for training_run_unet3d_20260612_235627
2026-06-12 23:56:29|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-06-12 23:56:29|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 36889, Actual: 168)
2026-06-12 23:56:29|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='unet3d', run_datetime='20260612_235627')])
2026-06-12 23:56:29|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or o
2026-06-12 23:56:29|INFO: Creating data directory: /tmp/d/unet3d...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/train...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/valid...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/test...
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:29|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
⠙ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:01
⠙ Collecting via MPI... 0:00:012026-06-12 23:56:31|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:31|INFO: MPI BTL transport: auto (OpenMPI default selection)
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/tmp/d/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 168
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-12T23:56:34.001746 Running DLIO [Training & Checkpointing] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching 
effect!!!
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60930,1],0]
  Exit code:    1
--------------------------------------------------------------------------
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:35|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
2026-06-12 23:56:37|STATUS: Writing metadata for benchmark to: /tmp/r/training/unet3d/run/20260612_235627/training_20260612_235627_metadata.json
smrc@dskbd029:~/Storage_Repo_Tests/storage$ RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/
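
The pass criteria can also be checked programmatically rather than by eye. The sketch below is an illustration under stated assumptions, not part of the PR: `placement_values` is a hypothetical helper that collects every value given for --bind-to and --map-by in a command string, so a duplicated flag shows up as a list with more than one entry. The command strings are shortened from the outputs above.

```python
import shlex


def placement_values(command, flags=("bind-to", "map-by")):
    """Collect the values passed for each placement flag in `command`.

    Hypothetical helper: handles --flag val, -flag val, --flag=val
    and -flag=val. A list longer than one means the flag is duplicated.
    """
    tokens = shlex.split(command)
    found = {flag: [] for flag in flags}
    for i, token in enumerate(tokens):
        name, sep, value = token.partition("=")
        for flag in flags:
            if name in (f"--{flag}", f"-{flag}"):
                if not sep:
                    # Space-separated form: the value is the next token.
                    value = tokens[i + 1] if i + 1 < len(tokens) else ""
                found[flag].append(value)
    return found


fixed = "mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node"
broken = "mpirun -n 2 -host h1:2 --bind-to none --map-by socket --bind-to core"

print(placement_values(fixed))   # {'bind-to': ['core'], 'map-by': ['ppr:2:node']}
print(placement_values(broken))  # {'bind-to': ['none', 'core'], 'map-by': ['socket']}
```

With this shape, the pass criteria reduce to each list having exactly one entry and that entry being the user-supplied value.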

@idevasena idevasena merged commit 4777814 into main Jun 13, 2026
3 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 13, 2026
@FileSystemGuy FileSystemGuy deleted the FileSystemGuy-mpi-params branch June 13, 2026 01:07