fix(cli): closed mode accepts --params; correct datasize emitter (#433)#437
Merged
FileSystemGuy merged 3 commits intoJun 13, 2026
Merged
Conversation
Closed training submissions were unable to pass dotted-key overrides like `dataset.num_files_train` because --params was gated to open/whatif mode, even though TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS explicitly lists those keys as CLOSED-allowed. The emitted hint from `datasize` was also malformed against the current CLI shape (missing mode prefix, --model= instead of positional, --param singular vs --params plural, missing storage-protocol positional, comma-joined --hosts). Changes: * Move --params/--param/-p into core training args so closed mode accepts the keys CLOSED_ALLOWED_PARAMS already permits. --param is registered as a legacy alias so older docs/strings still parse. * Rewrite generate_datagen_benchmark_command to emit the current CLI shape: mode prefix, positional model, --hosts as nargs tokens (not comma-joined), single --params group, trailing storage-protocol positional. Fix --param typo in the user-facing num_subfolders_train warning. * Move add_dlio_arguments from checkpointing open-args to core args. --dlio-bin-path is a deployment knob, not a submission tunable; training already exposed it in core args. Consistency fix surfaced by the audit done alongside #433. * Update the stale comment in common_args.add_dlio_arguments that asserted --params should not live in core; that policy is now per-builder, driven by the rules-checker allow-lists. Tests: * tests/unit/test_datagen_command_generation.py — round-trip tests that shlex-split the emitted datagen command and run it through parse_arguments(). Caught the --hosts comma-join bug before this landed. Prevents future drift between the emitter and the parser. * tests/unit/test_rules_parser_gating_consistency.py — parametrized over TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS and OPEN_ALLOWED_PARAMS, asserts every dotted key round-trips in the correct mode. Asserts --dlio-bin-path is accessible in closed and open for both training and checkpointing. * tests/unit/test_cli.py, tests/unit/test_parser_modes.py — three tests that codified the original bug ("closed must reject --params", "--params hidden in closed help") flipped to the correct behavior, with comments tying them to #433 and the rules-checker tables. Total: 1356 passed, 4 skipped, 0 failed. 18 new parametrized gating tests + 5 round-trip emitter tests.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
idevasena
approved these changes
Jun 13, 2026
idevasena
left a comment
Contributor
There was a problem hiding this comment.
Validated fix:
mpirun parsed the command line, accepted --bind-to core --map-by ppr:2:node, launched both ranks, and the real dlio_benchmark from venv started up. The failure is purely Not enough training dataset is found because /tmp/d only has 168 leftover files versus the 36,889 unet3d requires. There's no MPI flag error anywhere — no "unknown option," no duplicate-flag rejection.
Inspected the run metadata — the executed command is persisted as a run artifact:
RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
Pass criteria: exactly one --bind-to (yours, core) and one --map-by (yours, ppr:2:node), and no --bind-to none.
smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage whatif training unet3d run file --hosts localhost --client-host-memory-in-gb 64 --num-accelerators 2 --ac
Setting attr from num_accelerators to 2
2026-06-12 23:56:27|STATUS:
2026-06-12 23:56:27|STATUS: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-12 23:56:27|STATUS: benchmark: training
2026-06-12 23:56:27|STATUS: command: run
2026-06-12 23:56:27|STATUS: mode: whatif
2026-06-12 23:56:27|STATUS: data_dir: /tmp/d
2026-06-12 23:56:27|STATUS: results_dir: /tmp/r
2026-06-12 23:56:27|STATUS: data_access_protocol: file
2026-06-12 23:56:27|STATUS: num_accelerators: 2
2026-06-12 23:56:27|STATUS: num_processes: 2
2026-06-12 23:56:27|STATUS: accelerator_type: h100
2026-06-12 23:56:27|STATUS: client_host_memory_in_gb: 64.0
2026-06-12 23:56:27|STATUS: hosts: ['localhost']
2026-06-12 23:56:27|STATUS: exec_type: mpi
2026-06-12 23:56:27|STATUS: mpi_bin: mpirun
2026-06-12 23:56:27|STATUS: loops: 1
2026-06-12 23:56:27|STATUS:
2026-06-12 23:56:27|STATUS: --- Environment ---
2026-06-12 23:56:27|STATUS: MLPERF_RESULTS_DIR: [not set]
2026-06-12 23:56:27|STATUS: MPI_RUN_BIN: [not set]
2026-06-12 23:56:27|STATUS: MPI_EXEC_BIN: [not set]
2026-06-12 23:56:27|STATUS:
2026-06-12 23:56:27|WARNING: Skipping environment validation (--skip-validation flag)
2026-06-12 23:56:27|STATUS: Benchmark results directory: /tmp/r/training/unet3d/run/20260612_235627
2026-06-12 23:56:27|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.py (persisted as run artifact)
2026-06-12 23:56:27|INFO: Running MPI collection across 1 host(s)
2026-06-12 23:56:29|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:29|INFO: Created benchmark run: training_run_unet3d_20260612_235627
2026-06-12 23:56:29|STATUS: Verifying benchmark run for training_run_unet3d_20260612_235627
2026-06-12 23:56:29|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-06-12 23:56:29|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 36889, Actual: 168)
2026-06-12 23:56:29|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='unet3d', run_datetime='20260612_235627')])
2026-06-12 23:56:29|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or o
2026-06-12 23:56:29|INFO: Creating data directory: /tmp/d/unet3d...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/train...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/valid...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/test...
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:29|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
⠙ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:01
⠙ Collecting via MPI... 0:00:012026-06-12 23:56:31|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:31|INFO: MPI BTL transport: auto (OpenMPI default selection)
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] storage_type = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT] storage_root = './'
[OUTPUT] storage_options= None
[OUTPUT] data_folder = '/tmp/d/unet3d'
[OUTPUT] framework = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT] num_files_train= 168
[OUTPUT] record_length = 146600628
[OUTPUT] generate_data = False
[OUTPUT] do_train = True
[OUTPUT] do_checkpoint = True
[OUTPUT] epochs = 5
[OUTPUT] batch_size = 7
[OUTPUT] 2026-06-12T23:56:34.001746 Running DLIO [Training & Checkpointing] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching
effect!!!
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
benchmark.initialize()
File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
x = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
benchmark.initialize()
File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
x = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
raise Exception(
File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[60930,1],0]
Exit code: 1
--------------------------------------------------------------------------
Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:35|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
2026-06-12 23:56:37|STATUS: Writing metadata for benchmark to: /tmp/r/training/unet3d/run/20260612_235627/training_20260612_235627_metadata.json
smrc@dskbd029:~/Storage_Repo_Tests/storage$ RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/
Contributor
|
Verified the issue reported by Lou too #433. The fix works. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #433.
Closed-mode training submissions were unable to pass dotted-key DLIO overrides like
dataset.num_files_trainbecause--paramswas gated to open/whatif mode, even thoughTrainingRunRulesChecker.CLOSED_ALLOWED_PARAMSexplicitly lists those keys as CLOSED-allowed. The hint thatdatasizeemits for the next step was also malformed against the current CLI shape and didn't parse.This PR fixes both, plus a gating audit across all four workloads (training, checkpointing, vectordb, kvcache) to flush out any other #433-class drift between Rules.md / the rules-checker / the parser.
Changes
Immediate fix for #433
mlpstorage_py/cli/training_args.py—--params/--param/-pmoved from_add_training_open_argsinto_add_training_core_argsso closed mode accepts the dotted-key overrides thatCLOSED_ALLOWED_PARAMSalready permits.--param(singular) registered as a legacy alias so existing docs and the strings emitted bydatasizestill parse.mlpstorage_py/benchmarks/dlio.py—generate_datagen_benchmark_commandrewritten to emit the current CLI shape: mode prefix, positional<model>(not--model=),--hostsas separate tokens (not comma-joined —nargs="+"doesn't auto-split), single--params k1=v1 k2=v2group, trailing storage-protocol positional (file).--paramtypo in the user-facingnum_subfolders_trainwarning also fixed.Audit-driven consistency fix
mlpstorage_py/cli/checkpointing_args.py—add_dlio_arguments(which registers--dlio-bin-path) moved from open-only to core args.--dlio-bin-pathis a deployment knob, not a submission tunable; training already exposed it in core args, checkpointing was inconsistent.mlpstorage_py/cli/common_args.py— stale comment inadd_dlio_argumentsupdated. The old comment asserted--paramsshould never live in core; that policy is now per-builder, driven by the rules-checker allow-lists.Tests (+18 new, 3 corrected)
tests/unit/test_datagen_command_generation.py(new, 5 tests) — round-trip tests that shlex-split the emitter output and run it throughparse_arguments(). Caught the--hosts=h1,h2comma-join bug before this landed. Future drift between the emitter and the parser fails in CI.tests/unit/test_rules_parser_gating_consistency.py(new, 18 tests) — parametrized overTrainingRunRulesChecker.CLOSED_ALLOWED_PARAMSandOPEN_ALLOWED_PARAMS, asserts every dotted key round-trips through the parser in the correct mode. Also asserts--dlio-bin-pathworks in closed and open for both training and checkpointing.tests/unit/test_cli.py,tests/unit/test_parser_modes.py— three tests that codified the original bug ("closed must reject--params", "--paramshidden in closed help") flipped to the correct behavior, with comments tying them to It appears that the "--param" option does not work with "./mlpstorage closed training retinanet datagen file". #433 andCLOSED_ALLOWED_PARAMS.Audit summary (rules-checker ↔ Rules.md ↔ parser)
--paramsopen-only despite CLOSED_ALLOWED_PARAMS having dataset.* entriescheckpoint.checkpoint_folderinCLOSED_ALLOWED_PARAMSbut absent from Rules.md §3.6.2--dlio-bin-pathopen-only in checkpointing but core in training--paramsopen-only — defensible since Rules.md §4.6.3 only namescheckpoint.checkpoint_folderwhich has its own flagOn the datasize emitter
Curtis flagged a concern that the emitter looked like "mlpstorage invoking itself instead of DLIO." After tracing both paths:
execute_command→generate_dlio_command→_execute_command) invokes the DLIO binary directly. No self-invocation at runtime.generate_datagen_benchmark_commandis only called fromdatasize()and only produces a string for the user to copy-paste. It deliberately targetsmlpstorage(not raw DLIO) becausemlpstorage training datagenadds validation, metadata recording, cluster-info collection, MPI wrapping, lockfile checks, and YAML param merging — all of which a submission needs.So the architecture is correct; the bug was drift between the hint string and the real CLI shape, now prevented by the round-trip test.
Test plan
uv run --extra test pytest tests/unit -q→ 1356 passed, 4 skipped, 0 failed--param(singular) legacy alias also parses in closed modecheckpoint.checkpoint_folderdivergence (training rules-checker has it, Rules.md §3.6.2 doesn't) should stay flagged for separate human resolution, not fixed in this PR