Honor user --bind-to / --map-by in --mpi-params#428
Merged
Conversation
generate_mpi_prefix_cmd unconditionally appended defaults (--bind-to none, --map-by node/socket) and then appended user --mpi-params, producing duplicate flags that OpenMPI rejects. Detect each flag independently across token forms (--flag, -flag, --flag=val) and only emit the default for whichever the user did not supply, so overriding one does not silently drop the other.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Contributor
|
Validated fix: mpirun parsed the command line, accepted --bind-to core --map-by ppr:2:node, launched both ranks, and the real dlio_benchmark from venv started up. The failure is purely Not enough training dataset is found because /tmp/d only has 168 leftover files versus the 36,889 unet3d requires. There's no MPI flag error anywhere — no "unknown option," no duplicate-flag rejection. Inspected the run metadata — the executed command is persisted as a run artifact: |
idevasena
approved these changes
Jun 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
generate_mpi_prefix_cmdunconditionally appended--bind-to none --map-by {node,socket}defaults and then appended the user's--mpi-params, producing duplicate flags that OpenMPI rejects when a user supplied their own--bind-toor--map-by.What changed
_mpi_params_contain_flag()helper that recognizes--flag,-flag,--flag=val, and-flag=valtoken forms. (Necessary becausecli_parserflattens--mpi-paramsinto a token list viashlex.split, so a substring-on-whole-strings check would not reliably trigger.)--bind-todefault and the--map-bydefault independently — overriding one no longer silently drops the other's default.Test plan
TestGenerateMpiPrefixCmdcases pass unchanged--flag=valform, single-dash-flagform, unrelated--mpi-paramsflags leaving defaults intact