fix(cli): closed mode accepts --params; correct datasize emitter (#433) by FileSystemGuy · Pull Request #437 · mlcommons/storage

FileSystemGuy · 2026-06-12T19:27:37Z

Summary

Closes #433.

Closed-mode training submissions were unable to pass dotted-key DLIO overrides like dataset.num_files_train because --params was gated to open/whatif mode, even though TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS explicitly lists those keys as CLOSED-allowed. The hint that datasize emits for the next step was also malformed against the current CLI shape and didn't parse.

This PR fixes both, plus a gating audit across all four workloads (training, checkpointing, vectordb, kvcache) to flush out any other #433-class drift between Rules.md / the rules-checker / the parser.

Changes

Immediate fix for #433

mlpstorage_py/cli/training_args.py — --params/--param/-p moved from _add_training_open_args into _add_training_core_args so closed mode accepts the dotted-key overrides that CLOSED_ALLOWED_PARAMS already permits. --param (singular) registered as a legacy alias so existing docs and the strings emitted by datasize still parse.
mlpstorage_py/benchmarks/dlio.py — generate_datagen_benchmark_command rewritten to emit the current CLI shape: mode prefix, positional <model> (not --model=), --hosts as separate tokens (not comma-joined — nargs="+" doesn't auto-split), single --params k1=v1 k2=v2 group, trailing storage-protocol positional (file). --param typo in the user-facing num_subfolders_train warning also fixed.

Audit-driven consistency fix

mlpstorage_py/cli/checkpointing_args.py — add_dlio_arguments (which registers --dlio-bin-path) moved from open-only to core args. --dlio-bin-path is a deployment knob, not a submission tunable; training already exposed it in core args, checkpointing was inconsistent.
mlpstorage_py/cli/common_args.py — stale comment in add_dlio_arguments updated. The old comment asserted --params should never live in core; that policy is now per-builder, driven by the rules-checker allow-lists.

Tests (+23 new, 3 corrected)

tests/unit/test_datagen_command_generation.py (new, 5 tests) — round-trip tests that shlex-split the emitter output and run it through parse_arguments(). Caught the --hosts=h1,h2 comma-join bug before this landed. Future drift between the emitter and the parser fails in CI.
tests/unit/test_rules_parser_gating_consistency.py (new, 18 tests) — parametrized over TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS and OPEN_ALLOWED_PARAMS, asserts every dotted key round-trips through the parser in the correct mode. Also asserts --dlio-bin-path works in closed and open for both training and checkpointing.
tests/unit/test_cli.py, tests/unit/test_parser_modes.py: three tests that codified the original bug ("closed must reject --params", "--params hidden in closed help") flipped to the correct behavior, with comments tying them to #433 and CLOSED_ALLOWED_PARAMS.
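The round-trip idea behind the emitter tests can be sketched as follows. Everything here is illustrative: `build_parser` is a heavily reduced flat parser, `emit_datagen_command` is a hypothetical emitter, and the token order (positionals first, then flags) is an assumption based on the invocations shown in the validation comments of this PR, not on the real `generate_datagen_benchmark_command`.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, reduced stand-in for the real mlpstorage parser."""
    parser = argparse.ArgumentParser(prog="mlpstorage")
    parser.add_argument("mode", choices=["closed", "open", "whatif"])
    parser.add_argument("benchmark", choices=["training"])
    parser.add_argument("model")
    parser.add_argument("command", choices=["datagen"])
    parser.add_argument("protocol", choices=["file"])
    parser.add_argument("--hosts", nargs="+", default=[])
    parser.add_argument("--params", "--param", "-p", nargs="+", default=[])
    return parser


def emit_datagen_command(mode, model, hosts, params) -> str:
    """Hypothetical emitter: builds the copy-paste hint as a shell-safe string."""
    tokens = ["mlpstorage", mode, "training", model, "datagen", "file"]
    if hosts:
        tokens += ["--hosts", *hosts]  # separate tokens, never comma-joined
    if params:
        tokens += ["--params", *(f"{key}={value}" for key, value in params.items())]
    return shlex.join(tokens)


def round_trip(command: str) -> argparse.Namespace:
    """Split the hint the way a shell would and feed it back to the parser."""
    argv = shlex.split(command)[1:]  # drop the program name
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    hint = emit_datagen_command(
        "closed", "retinanet", ["h1", "h2"], {"dataset.num_files_train": 20}
    )
    print(hint)
    print(round_trip(hint))
    # The comma-joined form parses, but as a single bogus host name.
    print(round_trip("mlpstorage closed training retinanet datagen file --hosts=h1,h2").hosts)
```

The last line shows why the comma-join bug is easy to miss without a round-trip assertion: `nargs="+"` accepts `h1,h2` without error and silently yields one host.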

Audit summary (rules-checker ↔ Rules.md ↔ parser)

| Workload | Finding | Action |
| --- | --- | --- |
| Training | `--params` open-only despite CLOSED_ALLOWED_PARAMS having dataset.* entries | Fixed (this PR) |
| Training | `checkpoint.checkpoint_folder` in `CLOSED_ALLOWED_PARAMS` but absent from Rules.md §3.6.2 | Flagged for human judgment; not touched |
| Checkpointing | `--dlio-bin-path` open-only in checkpointing but core in training | Fixed (this PR) |
| Checkpointing | `--params` open-only; defensible since Rules.md §4.6.3 only names `checkpoint.checkpoint_folder`, which has its own flag | No change |
| VDB | Rules.md §5.6 empty (PREVIEW workload) | No source-of-truth, no change |
| KVCache | Rules.md §6.6 empty | No source-of-truth, no change |

On the datasize emitter

Curtis flagged a concern that the emitter looked like "mlpstorage invoking itself instead of DLIO." After tracing both paths:

Real execution path (execute_command → generate_dlio_command → _execute_command) invokes the DLIO binary directly. No self-invocation at runtime.
generate_datagen_benchmark_command is only called from datasize() and only produces a string for the user to copy-paste. It deliberately targets mlpstorage (not raw DLIO) because mlpstorage training datagen adds validation, metadata recording, cluster-info collection, MPI wrapping, lockfile checks, and YAML param merging — all of which a submission needs.

So the architecture is correct; the bug was drift between the hint string and the real CLI shape, now prevented by the round-trip test.

Test plan

uv run --extra test pytest tests/unit -q → 1356 passed, 4 skipped, 0 failed
Manual round-trip of the original #433 reproducer command works in closed mode
--param (singular) legacy alias also parses in closed mode
Reviewer: confirm the checkpoint.checkpoint_folder divergence (training rules-checker has it, Rules.md §3.6.2 doesn't) should stay flagged for separate human resolution, not fixed in this PR

Closed training submissions were unable to pass dotted-key overrides like `dataset.num_files_train` because --params was gated to open/whatif mode, even though TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS explicitly lists those keys as CLOSED-allowed. The emitted hint from `datasize` was also malformed against the current CLI shape (missing mode prefix, --model= instead of positional, --param singular vs --params plural, missing storage-protocol positional, comma-joined --hosts).

Changes:

* Move --params/--param/-p into core training args so closed mode accepts the keys CLOSED_ALLOWED_PARAMS already permits. --param is registered as a legacy alias so older docs/strings still parse.
* Rewrite generate_datagen_benchmark_command to emit the current CLI shape: mode prefix, positional model, --hosts as nargs tokens (not comma-joined), single --params group, trailing storage-protocol positional. Fix --param typo in the user-facing num_subfolders_train warning.
* Move add_dlio_arguments from checkpointing open-args to core args. --dlio-bin-path is a deployment knob, not a submission tunable; training already exposed it in core args. Consistency fix surfaced by the audit done alongside #433.
* Update the stale comment in common_args.add_dlio_arguments that asserted --params should not live in core; that policy is now per-builder, driven by the rules-checker allow-lists.

Tests:

* tests/unit/test_datagen_command_generation.py: round-trip tests that shlex-split the emitted datagen command and run it through parse_arguments(). Caught the --hosts comma-join bug before this landed. Prevents future drift between the emitter and the parser.
* tests/unit/test_rules_parser_gating_consistency.py: parametrized over TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS and OPEN_ALLOWED_PARAMS, asserts every dotted key round-trips in the correct mode. Asserts --dlio-bin-path is accessible in closed and open for both training and checkpointing.
* tests/unit/test_cli.py, tests/unit/test_parser_modes.py: three tests that codified the original bug ("closed must reject --params", "--params hidden in closed help") flipped to the correct behavior, with comments tying them to #433 and the rules-checker tables.

Total: 1356 passed, 4 skipped, 0 failed. 18 new parametrized gating tests + 5 round-trip emitter tests.

github-actions · 2026-06-12T19:27:47Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

idevasena

Validated fix:
mpirun parsed the command line, accepted --bind-to core --map-by ppr:2:node, launched both ranks, and the real dlio_benchmark from the venv started up. The only failure is DLIO's "Not enough training dataset is found" exception, raised because /tmp/d holds just 168 leftover files versus the 36,889 that unet3d requires. There is no MPI flag error anywhere: no "unknown option" and no duplicate-flag rejection.

Inspected the run metadata — the executed command is persisted as a run artifact:

RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
Pass criteria: exactly one --bind-to (yours, core) and one --map-by (yours, ppr:2:node), and no --bind-to none.
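
The pass criteria above could also be checked mechanically instead of by eye. This is a rough sketch, assuming the metadata JSON stores the command under an `executed_command` key (as the datagen metadata later in this thread does); the helper names are hypothetical and the sample command strings in the usage section are abbreviated for illustration.

```python
import json
import shlex
from pathlib import Path


def load_executed_command(metadata_path: str) -> str:
    """Read the persisted command string from a run's metadata JSON."""
    return json.loads(Path(metadata_path).read_text())["executed_command"]


def flag_values(command: str, flag: str) -> list[str]:
    """Return every value that directly follows `flag` in the command."""
    tokens = shlex.split(command)
    return [tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == flag]


def meets_pass_criteria(command: str) -> bool:
    """Exactly one --bind-to (core) and exactly one --map-by (ppr:2:node)."""
    return (
        flag_values(command, "--bind-to") == ["core"]
        and flag_values(command, "--map-by") == ["ppr:2:node"]
    )


if __name__ == "__main__":
    # Abbreviated sample commands, not the full persisted strings.
    good = "mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node dlio_benchmark"
    bad = "mpirun -n 2 --bind-to none --map-by socket --bind-to core --map-by ppr:2:node dlio_benchmark"
    print(meets_pass_criteria(good), meets_pass_criteria(bad))
```

Comparing against a one-element list covers all three conditions at once: a duplicated flag or a stray `--bind-to none` makes the list longer or different, so the check fails.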

smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage whatif training unet3d run file   --hosts localhost --client-host-memory-in-gb 64   --num-accelerators 2 --ac
Setting attr from num_accelerators to 2
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-12 23:56:27|STATUS:   benchmark:                      training
2026-06-12 23:56:27|STATUS:   command:                        run
2026-06-12 23:56:27|STATUS:   mode:                           whatif
2026-06-12 23:56:27|STATUS:   data_dir:                       /tmp/d
2026-06-12 23:56:27|STATUS:   results_dir:                    /tmp/r
2026-06-12 23:56:27|STATUS:   data_access_protocol:           file
2026-06-12 23:56:27|STATUS:   num_accelerators:               2
2026-06-12 23:56:27|STATUS:   num_processes:                  2
2026-06-12 23:56:27|STATUS:   accelerator_type:               h100
2026-06-12 23:56:27|STATUS:   client_host_memory_in_gb:       64.0
2026-06-12 23:56:27|STATUS:   hosts:                          ['localhost']
2026-06-12 23:56:27|STATUS:   exec_type:                      mpi
2026-06-12 23:56:27|STATUS:   mpi_bin:                        mpirun
2026-06-12 23:56:27|STATUS:   loops:                          1
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Environment ---
2026-06-12 23:56:27|STATUS:   MLPERF_RESULTS_DIR:             [not set]
2026-06-12 23:56:27|STATUS:   MPI_RUN_BIN:                    [not set]
2026-06-12 23:56:27|STATUS:   MPI_EXEC_BIN:                   [not set]
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|WARNING: Skipping environment validation (--skip-validation flag)
2026-06-12 23:56:27|STATUS: Benchmark results directory: /tmp/r/training/unet3d/run/20260612_235627
2026-06-12 23:56:27|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.py (persisted as run artifact)
2026-06-12 23:56:27|INFO: Running MPI collection across 1 host(s)
2026-06-12 23:56:29|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:29|INFO: Created benchmark run: training_run_unet3d_20260612_235627
2026-06-12 23:56:29|STATUS: Verifying benchmark run for training_run_unet3d_20260612_235627
2026-06-12 23:56:29|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-06-12 23:56:29|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 36889, Actual: 168)
2026-06-12 23:56:29|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='unet3d', run_datetime='20260612_235627')])
2026-06-12 23:56:29|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or o
2026-06-12 23:56:29|INFO: Creating data directory: /tmp/d/unet3d...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/train...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/valid...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/test...
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:29|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
⠙ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:01
⠙ Collecting via MPI... 0:00:012026-06-12 23:56:31|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:31|INFO: MPI BTL transport: auto (OpenMPI default selection)
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/tmp/d/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 168
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-12T23:56:34.001746 Running DLIO [Training & Checkpointing] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching 
effect!!!
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60930,1],0]
  Exit code:    1
--------------------------------------------------------------------------
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:35|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
2026-06-12 23:56:37|STATUS: Writing metadata for benchmark to: /tmp/r/training/unet3d/run/20260612_235627/training_20260612_235627_metadata.json
smrc@dskbd029:~/Storage_Repo_Tests/storage$ RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/

idevasena · 2026-06-13T00:52:56Z

Also verified the issue reported by Lou in #433. The fix works.

smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage open training retinanet datagen file \
> --hosts localhost \
> --num-processes 79 \
  --data-dir /mnt/drives/nvme3n1/training \
  --results-dir /mnt/drives/nvme3n1/results \
  --params dataset.num_files_train=20 dataset.num_subfolders_train=10 \
  --debug \
  --exec-type mpi \
  --mpi-bin mpiexec
2026-06-13 00:06:36|DEBUG:history:31: Parsing history line: 171,20260612_235627,/home/smrc/Storage_Repo_Tests/storage/mlpstorage_py/main.py whatif training unet3d run

2026-06-13 00:06:36|VERBOSER:mlps_logging:101: Adding command to history: 172,20260613_000636,/home/smrc/Storage_Repo_Tests/storage/mlpstorage_py/main.py open trainin
2026-06-13 00:06:36|STATUS:mlps_logging:101: 
2026-06-13 00:06:36|STATUS:mlps_logging:101: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-13 00:06:36|STATUS:mlps_logging:101:   benchmark:                      training
2026-06-13 00:06:36|STATUS:mlps_logging:101:   command:                        datagen
2026-06-13 00:06:36|STATUS:mlps_logging:101:   mode:                           open
2026-06-13 00:06:36|STATUS:mlps_logging:101:   data_dir:                       /mnt/drives/nvme3n1/training
2026-06-13 00:06:36|STATUS:mlps_logging:101:   results_dir:                    /mnt/drives/nvme3n1/results
2026-06-13 00:06:36|STATUS:mlps_logging:101:   data_access_protocol:           file
2026-06-13 00:06:36|STATUS:mlps_logging:101:   num_accelerators:               [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   num_processes:                  79
2026-06-13 00:06:36|STATUS:mlps_logging:101:   accelerator_type:               [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   client_host_memory_in_gb:       [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   hosts:                          ['localhost']
2026-06-13 00:06:36|STATUS:mlps_logging:101:   exec_type:                      mpi
2026-06-13 00:06:36|STATUS:mlps_logging:101:   mpi_bin:                        mpiexec
2026-06-13 00:06:36|STATUS:mlps_logging:101:   loops:                          1
2026-06-13 00:06:36|STATUS:mlps_logging:101: 
2026-06-13 00:06:36|STATUS:mlps_logging:101: --- Environment ---
2026-06-13 00:06:36|STATUS:mlps_logging:101:   MLPERF_RESULTS_DIR:             [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   MPI_RUN_BIN:                    [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   MPI_EXEC_BIN:                   [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101: 
⠋ Validating environment... 0:00:002026-06-13 00:06:36|DEBUG:validation_helpers:572: Starting environment validation (OS: Linux)
2026-06-13 00:06:36|INFO:validation_helpers:686: Environment validation passed
2026-06-13 00:06:36|STATUS:mlps_logging:101: Benchmark results directory: /mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636
2026-06-13 00:06:36|DEBUG:dependency_check:276: Checking for MPI runtime (mpiexec)...
2026-06-13 00:06:36|DEBUG:dependency_check:279: Found MPI at: /usr/bin/mpiexec
2026-06-13 00:06:36|DEBUG:dependency_check:283: Checking for DLIO benchmark...
2026-06-13 00:06:36|DEBUG:dependency_check:286: Found DLIO at: /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark
2026-06-13 00:06:36|DEBUG:dlio:224: yaml params: 
{'dataset': {'data_folder': 'data/retinanet/',
             'format': 'jpeg',
             'num_files_train': 1170301,
             'num_samples_per_file': 1,
             'record_length_bytes': 322957},
 'framework': 'pytorch',
 'model': {'name': 'retinanet', 'type': 'cnn'},
 'workflow': {'checkpoint': False, 'generate_data': True, 'train': False}}
2026-06-13 00:06:36|DEBUG:dlio:225: combined params: 
{'dataset': {'data_folder': 'data/retinanet/',
             'format': 'jpeg',
             'num_files_train': '20',
             'num_samples_per_file': 1,
             'num_subfolders_train': '10',
             'record_length_bytes': 322957},
 'framework': 'pytorch',
 'model': {'name': 'retinanet', 'type': 'cnn'},
 'workflow': {'checkpoint': False, 'generate_data': True, 'train': False}}
2026-06-13 00:06:36|DEBUG:dlio:226: Instance params: 
{'_cluster_collector': None,
 '_config_name': 'retinanet_datagen',
 '_timeseries_collector': None,
 '_timeseries_data': None,
 '_validator': None,
 'args': Namespace(mode='open', benchmark='training', model='retinanet', command='datagen', hosts=['localhost'], num_processes=79, exec_type=<EXEC_TYPE.MPI: 'mpi'>, m
 'base_command': 'dlio_benchmark',
 'base_command_path': '/home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark',
 'base_path': '/home/smrc/Storage_Repo_Tests/storage/mlpstorage_py',
 'benchmark_run_verifier': None,
 'cmd_executor': <mlpstorage_py.utils.CommandExecutor object at 0x7c7e16f1fce0>,
 'command_method_map': {'configview': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'datagen': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'datasize': <bound method TrainingBenchmark.datasize of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'reportgen': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'run': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>},
 'command_output_files': [],
 'config_file': 'retinanet_datagen.yaml',
 'config_path': '/home/smrc/Storage_Repo_Tests/storage/configs/dlio',
 'debug': True,
 'logger': <Logger MLPerfStorage (DEBUG)>,
 'metadata_file_path': '/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_20260613_000636_metadata.json',
 'metadata_filename': 'training_20260613_000636_metadata.json',
 'per_host_mem_kB': None,
 'run_datetime': '20260613_000636',
 'run_number': 0,
 'run_result_output': '/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636',
 'runtime': 0,
 'timeseries_file_path': '/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_20260613_000636_timeseries.json',
 'timeseries_filename': 'training_20260613_000636_timeseries.json',
 'total_mem_kB': None,
 'verification': None}
2026-06-13 00:06:36|VERBOSER:mlps_logging:101: Instantiated the Training Benchmark...
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-06-13 00:06:36|DEBUG:base:592: Skipping start cluster collection (conditions not 
2026-06-13 00:06:36|DEBUG:base:676: Skipping time-series collection (disabled or not applicable)
2026-06-13 00:06:36|VERBOSER:mlps_logging:101: Generating DLIO command for benchmark training
2026-06-13 00:06:36|DEBUG:dlio:270: Generating MPI Command with binary "mpiexec"
2026-06-13 00:06:36|DEBUG:utils:570: Configured slots for hosts: ['localhost:79']
2026-06-13 00:06:36|INFO:utils:612: MPI BTL transport: auto (OpenMPI default selection)
2026-06-13 00:06:36|STATUS:mlps_logging:101: Running benchmark command:: mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests
2026-06-13 00:06:36|DEBUG:utils:343: DEBUG - Executing command: mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/mnt/drives/nvme3n1/training/retinanet'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 20
[OUTPUT]   record_length  = 322957
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = False
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 1
[OUTPUT] 2026-06-13T00:06:43.706265 Running DLIO [Generating data] with 79 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-13T00:06:43.821833 Starting data generation
[?25l
[2K[32m⠋[0m Generating JPEG Data [38;5;237m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m 0/20 [33m0:00:00[0m
[OUTPUT] 2026-06-13T00:06:43.883289 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
⠸ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:00:092026-06-13 00:06:46|VERBOSE:mlps_logging:101: Command stdout saved to: training_datagen.std
2026-06-13 00:06:46|VERBOSE:mlps_logging:101: Command stderr saved to: training_datagen.stderr.log
2026-06-13 00:06:46|DEBUG:base:620: Skipping end cluster collection (no start collection)
2026-06-13 00:06:46|STATUS:mlps_logging:101: Writing metadata for benchmark to: /mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_202606
{
  "benchmark_type": "training",
  "model": "retinanet",
  "command": "datagen",
  "run_datetime": "20260613_000636",
  "num_processes": 79,
  "accelerator": null,
  "result_dir": "/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636",
  "parameters": {
    "model": {
      "name": "retinanet",
      "type": "cnn"
    },
    "framework": "pytorch",
    "workflow": {
      "generate_data": true,
      "train": false,
      "checkpoint": false
    },
    "dataset": {
      "data_folder": "/mnt/drives/nvme3n1/training/retinanet",
      "format": "jpeg",
      "num_files_train": "20",
      "num_samples_per_file": 1,
      "record_length_bytes": 322957,
      "num_subfolders_train": "10"
    }
  },
  "override_parameters": {
    "dataset.num_files_train": "20",
    "dataset.num_subfolders_train": "10",
    "dataset.data_folder": "/mnt/drives/nvme3n1/training/retinanet"
  },
  "system_info": null,
  "runtime": 9.109416484832764,
  "verification": null,
  "executed_command": "mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=retinan
  "command_output_files": [
    {
      "command": "mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=retinanet_da
      "stdout": "/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_datagen.stdout.log",
      "stderr": "/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_datagen.stderr.log"
    }
  ],
  "args": {
    "mode": "open",
    "benchmark": "training",
    "model": "retinanet",
    "command": "datagen",
    "hosts": [
      "localhost"
    ],
    "num_processes": 79,
    "exec_type": "mpi",
    "mpi_bin": "mpiexec",
    "oversubscribe": false,
    "allow_run_as_root": false,
    "mpi_btl": "auto",
    "mpi_params": null,
    "data_dir": "/mnt/drives/nvme3n1/training",
    "dlio_bin_path": null,
    "results_dir": "/mnt/drives/nvme3n1/results",
    "config_file": null,
    "debug": true,
    "verbose": false,
    "stream_log_level": "INFO",
    "quiet": false,
    "dry_run": false,
    "verify_lockfile": null,
    "skip_validation": false,
    "data_access_protocol": "file",
    "loops": 1,
    "allow_invalid_params": false,
    "params": [
      "dataset.num_files_train=20",
      "dataset.num_subfolders_train=10"
    ],
    "num_client_hosts": 1
  }
}
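
The "yaml params" versus "combined params" debug output above shows the dotted-key overrides being folded into the nested YAML config, with the override values kept as strings (`'20'`, `'10'`). A minimal sketch of that kind of merge, assuming simple `key.subkey=value` tokens; `apply_overrides` is a hypothetical helper, not the actual mlpstorage implementation:

```python
import copy


def apply_overrides(base: dict, overrides: list[str]) -> dict:
    """Merge dotted KEY=VALUE overrides into a copy of a nested config dict."""
    merged = copy.deepcopy(base)  # leave the loaded YAML defaults untouched
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep or not dotted_key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = value  # stays a string, as in the debug output above
    return merged


if __name__ == "__main__":
    yaml_params = {"dataset": {"format": "jpeg", "num_files_train": 1170301}}
    combined = apply_overrides(
        yaml_params,
        ["dataset.num_files_train=20", "dataset.num_subfolders_train=10"],
    )
    print(combined)
```

Whether values should be coerced to their YAML types before being handed to DLIO is a separate design question that this sketch does not address.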

FileSystemGuy requested a review from a team June 12, 2026 19:27

Merge branch 'main' into fix/issue-433-params-gating-and-datagen-emitter

f77a4be

idevasena approved these changes Jun 13, 2026

Merge branch 'main' into fix/issue-433-params-gating-and-datagen-emitter

e558b4e

FileSystemGuy merged commit 7c28a36 into main Jun 13, 2026
3 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 13, 2026

FileSystemGuy deleted the fix/issue-433-params-gating-and-datagen-emitter branch June 13, 2026 00:06

FileSystemGuy merged 3 commits into main from fix/issue-433-params-gating-and-datagen-emitter
