
fix(cli): closed mode accepts --params; correct datasize emitter (#433) #437

Merged
FileSystemGuy merged 3 commits into main from fix/issue-433-params-gating-and-datagen-emitter
Jun 13, 2026

Conversation

@FileSystemGuy

Contributor

Summary

Closes #433.

Closed-mode training submissions were unable to pass dotted-key DLIO overrides like dataset.num_files_train because --params was gated to open/whatif mode, even though TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS explicitly lists those keys as CLOSED-allowed. The hint that datasize emits for the next step was also malformed against the current CLI shape and didn't parse.

This PR fixes both, plus a gating audit across all four workloads (training, checkpointing, vectordb, kvcache) to flush out any other #433-class drift between Rules.md / the rules-checker / the parser.

Changes

Immediate fix for #433

  • mlpstorage_py/cli/training_args.py: --params/--param/-p moved from _add_training_open_args into _add_training_core_args so closed mode accepts the dotted-key overrides that CLOSED_ALLOWED_PARAMS already permits. --param (singular) registered as a legacy alias so existing docs and the strings emitted by datasize still parse.
  • mlpstorage_py/benchmarks/dlio.py: generate_datagen_benchmark_command rewritten to emit the current CLI shape: mode prefix, positional <model> (not --model=), --hosts as separate tokens (not comma-joined, since nargs="+" doesn't auto-split), single --params k1=v1 k2=v2 group, trailing storage-protocol positional (file). --param typo in the user-facing num_subfolders_train warning also fixed.
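
A minimal argparse sketch of the two parser behaviors these bullets rely on. It uses a stand-in parser, not the real mlpstorage one (`build_parser` and its layout are assumptions for illustration): a single option can carry several spellings, and `nargs="+"` collects separate tokens but never splits one token on commas.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in parser; not the real mlpstorage CLI."""
    parser = argparse.ArgumentParser(prog="sketch")
    # One option, three spellings: all of them store into args.params.
    parser.add_argument("--params", "--param", "-p", nargs="+", dest="params", default=[])
    # nargs="+" gathers separate tokens; it does not split a token on commas.
    parser.add_argument("--hosts", nargs="+", default=[])
    return parser


parser = build_parser()

plural = parser.parse_args(
    ["--params", "dataset.num_files_train=20", "dataset.num_subfolders_train=10"]
)
legacy = parser.parse_args(["--param", "dataset.num_files_train=20"])
separate = parser.parse_args(["--hosts", "h1", "h2"])
joined = parser.parse_args(["--hosts=h1,h2"])

print(plural.params)   # ['dataset.num_files_train=20', 'dataset.num_subfolders_train=10']
print(legacy.params)   # ['dataset.num_files_train=20']
print(separate.hosts)  # ['h1', 'h2']
print(joined.hosts)    # ['h1,h2']  (one token, not two hosts)
```

The last line is the comma-join failure mode in miniature: the command parses without error, but the host list holds a single bogus entry.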

Audit-driven consistency fix

  • mlpstorage_py/cli/checkpointing_args.py: add_dlio_arguments (which registers --dlio-bin-path) moved from open-only to core args. --dlio-bin-path is a deployment knob, not a submission tunable; training already exposed it in core args, checkpointing was inconsistent.
  • mlpstorage_py/cli/common_args.py — stale comment in add_dlio_arguments updated. The old comment asserted --params should never live in core; that policy is now per-builder, driven by the rules-checker allow-lists.

Tests (+23 new, 3 corrected)

  • tests/unit/test_datagen_command_generation.py (new, 5 tests) — round-trip tests that shlex-split the emitter output and run it through parse_arguments(). Caught the --hosts=h1,h2 comma-join bug before this landed. Future drift between the emitter and the parser fails in CI.
  • tests/unit/test_rules_parser_gating_consistency.py (new, 18 tests) — parametrized over TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS and OPEN_ALLOWED_PARAMS, asserts every dotted key round-trips through the parser in the correct mode. Also asserts --dlio-bin-path works in closed and open for both training and checkpointing.
  • tests/unit/test_cli.py, tests/unit/test_parser_modes.py: three tests that codified the original bug ("closed must reject --params", "--params hidden in closed help") flipped to the correct behavior, with comments tying them to #433 and CLOSED_ALLOWED_PARAMS.
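
The gating-consistency idea can be sketched without the real code base. Everything below is a stand-in: the allow-list contents, `build_parser`, and `gating_gaps` are hypothetical, and the actual tests are parametrized over the tables on TrainingRunRulesChecker rather than looping like this.

```python
import argparse

# Placeholder allow-lists for illustration only; the real tables live on
# TrainingRunRulesChecker and are not reproduced here.
CLOSED_ALLOWED_PARAMS = {"dataset.num_files_train", "dataset.num_subfolders_train"}
OPEN_ALLOWED_PARAMS = CLOSED_ALLOWED_PARAMS | {"example.open_only_key"}


def build_parser(closed_has_params: bool = True) -> argparse.ArgumentParser:
    """Hypothetical two-mode parser; closed_has_params=False mimics the old gating."""
    parser = argparse.ArgumentParser(prog="sketch")
    modes = parser.add_subparsers(dest="mode", required=True)
    for mode in ("closed", "open"):
        sub = modes.add_parser(mode)
        if mode == "open" or closed_has_params:
            sub.add_argument("--params", "--param", "-p", nargs="+", default=[])
    return parser


def parses_in_mode(parser: argparse.ArgumentParser, mode: str, key: str) -> bool:
    """True when `<mode> --params key=1` is accepted and stored unchanged."""
    try:
        args = parser.parse_args([mode, "--params", f"{key}=1"])
    except SystemExit:  # argparse exits on unrecognized arguments
        return False
    return args.params == [f"{key}=1"]


def gating_gaps(parser: argparse.ArgumentParser) -> list[tuple[str, str]]:
    """List every (mode, key) the rules allow but the parser rejects."""
    cases = [("closed", k) for k in sorted(CLOSED_ALLOWED_PARAMS)]
    cases += [("open", k) for k in sorted(OPEN_ALLOWED_PARAMS)]
    return [(m, k) for m, k in cases if not parses_in_mode(parser, m, k)]


fixed_gaps = gating_gaps(build_parser(closed_has_params=True))
old_gaps = gating_gaps(build_parser(closed_has_params=False))
print(fixed_gaps)  # []
print(old_gaps)    # the closed-mode keys the old gating rejected
```

Driving the cases from the allow-lists, rather than from a hand-written list of flags, is what lets a check like this fail when the rules table and the parser drift apart.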

Audit summary (rules-checker ↔ Rules.md ↔ parser)

| Workload | Finding | Action |
|---|---|---|
| Training | --params open-only despite CLOSED_ALLOWED_PARAMS having dataset.* entries | Fixed (this PR) |
| Training | checkpoint.checkpoint_folder in CLOSED_ALLOWED_PARAMS but absent from Rules.md §3.6.2 | Flagged for human judgment; not touched |
| Checkpointing | --dlio-bin-path open-only in checkpointing but core in training | Fixed (this PR) |
| Checkpointing | --params open-only; defensible since Rules.md §4.6.3 only names checkpoint.checkpoint_folder, which has its own flag | No change |
| VDB | Rules.md §5.6 empty (PREVIEW workload) | No source-of-truth, no change |
| KVCache | Rules.md §6.6 empty | No source-of-truth, no change |

On the datasize emitter

Curtis flagged a concern that the emitter looked like "mlpstorage invoking itself instead of DLIO." After tracing both paths:

  • Real execution path (execute_command → generate_dlio_command → _execute_command) invokes the DLIO binary directly. No self-invocation at runtime.
  • generate_datagen_benchmark_command is only called from datasize() and only produces a string for the user to copy-paste. It deliberately targets mlpstorage (not raw DLIO) because mlpstorage training datagen adds validation, metadata recording, cluster-info collection, MPI wrapping, lockfile checks, and YAML param merging — all of which a submission needs.

So the architecture is correct; the bug was drift between the hint string and the real CLI shape, now prevented by the round-trip test.
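
A rough sketch of what such a round-trip check looks like, under stated assumptions: `emit_datagen_hint` and `build_parser` are hypothetical stand-ins for generate_datagen_benchmark_command and parse_arguments(), and the flat positional layout only approximates the real CLI. The positional order follows the verified command shape `<mode> training <model> datagen file`.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser approximating the CLI shape described in this PR."""
    parser = argparse.ArgumentParser(prog="mlpstorage-sketch")
    parser.add_argument("mode", choices=["closed", "open", "whatif"])
    parser.add_argument("benchmark", choices=["training"])
    parser.add_argument("model")
    parser.add_argument("command", choices=["datagen", "run"])
    parser.add_argument("protocol", choices=["file"])
    parser.add_argument("--hosts", nargs="+", default=[])
    parser.add_argument("--params", "--param", "-p", nargs="+", default=[])
    return parser


def emit_datagen_hint(mode: str, model: str, hosts: list[str], params: dict) -> str:
    """Build a copy-paste hint string; hosts are separate tokens, never comma-joined."""
    tokens = [mode, "training", model, "datagen", "file"]
    if hosts:
        tokens += ["--hosts", *hosts]
    if params:
        tokens += ["--params", *(f"{k}={v}" for k, v in params.items())]
    return "./mlpstorage " + shlex.join(tokens)


hint = emit_datagen_hint(
    "closed",
    "retinanet",
    ["h1", "h2"],
    {"dataset.num_files_train": 20, "dataset.num_subfolders_train": 10},
)
print(hint)

# Round trip: split the hint the way a shell would, drop the program name,
# and feed the rest to the parser. Any shape drift raises SystemExit here.
args = build_parser().parse_args(shlex.split(hint)[1:])
print(args.hosts, args.params)
```

Because the test consumes the emitter's own output, a future change to either side that breaks the shape fails immediately instead of surfacing as a hint users cannot paste.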

Test plan

Closed training submissions were unable to pass dotted-key overrides like
`dataset.num_files_train` because --params was gated to open/whatif mode,
even though TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS explicitly lists
those keys as CLOSED-allowed. The emitted hint from `datasize` was also
malformed against the current CLI shape (missing mode prefix, --model=
instead of positional, --param singular vs --params plural, missing
storage-protocol positional, comma-joined --hosts).

Changes:

* Move --params/--param/-p into core training args so closed mode
  accepts the keys CLOSED_ALLOWED_PARAMS already permits. --param is
  registered as a legacy alias so older docs/strings still parse.

* Rewrite generate_datagen_benchmark_command to emit the current CLI
  shape: mode prefix, positional model, --hosts as nargs tokens (not
  comma-joined), single --params group, trailing storage-protocol
  positional. Fix --param typo in the user-facing num_subfolders_train
  warning.

* Move add_dlio_arguments from checkpointing open-args to core args.
  --dlio-bin-path is a deployment knob, not a submission tunable;
  training already exposed it in core args. Consistency fix surfaced
  by the audit done alongside #433.

* Update the stale comment in common_args.add_dlio_arguments that
  asserted --params should not live in core; that policy is now
  per-builder, driven by the rules-checker allow-lists.

Tests:

* tests/unit/test_datagen_command_generation.py — round-trip tests that
  shlex-split the emitted datagen command and run it through
  parse_arguments(). Caught the --hosts comma-join bug before this
  landed. Prevents future drift between the emitter and the parser.

* tests/unit/test_rules_parser_gating_consistency.py — parametrized over
  TrainingRunRulesChecker.CLOSED_ALLOWED_PARAMS and OPEN_ALLOWED_PARAMS,
  asserts every dotted key round-trips in the correct mode. Asserts
  --dlio-bin-path is accessible in closed and open for both training
  and checkpointing.

* tests/unit/test_cli.py, tests/unit/test_parser_modes.py — three tests
  that codified the original bug ("closed must reject --params",
  "--params hidden in closed help") flipped to the correct behavior,
  with comments tying them to #433 and the rules-checker tables.

Total: 1356 passed, 4 skipped, 0 failed. 18 new parametrized gating
tests + 5 round-trip emitter tests.
@FileSystemGuy FileSystemGuy requested a review from a team June 12, 2026 19:27
@github-actions

github-actions Bot commented Jun 12, 2026


MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@idevasena idevasena left a comment

Contributor


Validated fix:
mpirun parsed the command line, accepted --bind-to core --map-by ppr:2:node, launched both ranks, and the real dlio_benchmark from the venv started up. The only failure is "Not enough training dataset is found", because /tmp/d has just 168 leftover files versus the 36,889 that unet3d requires. There is no MPI flag error anywhere: no "unknown option" and no duplicate-flag rejection.

Inspected the run metadata — the executed command is persisted as a run artifact:

RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
Pass criteria: exactly one --bind-to (yours, core) and one --map-by (yours, ppr:2:node), and no --bind-to none. 
smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage whatif training unet3d run file   --hosts localhost --client-host-memory-in-gb 64   --num-accelerators 2 --ac
Setting attr from num_accelerators to 2
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-12 23:56:27|STATUS:   benchmark:                      training
2026-06-12 23:56:27|STATUS:   command:                        run
2026-06-12 23:56:27|STATUS:   mode:                           whatif
2026-06-12 23:56:27|STATUS:   data_dir:                       /tmp/d
2026-06-12 23:56:27|STATUS:   results_dir:                    /tmp/r
2026-06-12 23:56:27|STATUS:   data_access_protocol:           file
2026-06-12 23:56:27|STATUS:   num_accelerators:               2
2026-06-12 23:56:27|STATUS:   num_processes:                  2
2026-06-12 23:56:27|STATUS:   accelerator_type:               h100
2026-06-12 23:56:27|STATUS:   client_host_memory_in_gb:       64.0
2026-06-12 23:56:27|STATUS:   hosts:                          ['localhost']
2026-06-12 23:56:27|STATUS:   exec_type:                      mpi
2026-06-12 23:56:27|STATUS:   mpi_bin:                        mpirun
2026-06-12 23:56:27|STATUS:   loops:                          1
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|STATUS: --- Environment ---
2026-06-12 23:56:27|STATUS:   MLPERF_RESULTS_DIR:             [not set]
2026-06-12 23:56:27|STATUS:   MPI_RUN_BIN:                    [not set]
2026-06-12 23:56:27|STATUS:   MPI_EXEC_BIN:                   [not set]
2026-06-12 23:56:27|STATUS: 
2026-06-12 23:56:27|WARNING: Skipping environment validation (--skip-validation flag)
2026-06-12 23:56:27|STATUS: Benchmark results directory: /tmp/r/training/unet3d/run/20260612_235627
2026-06-12 23:56:27|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.py (persisted as run artifact)
2026-06-12 23:56:27|INFO: Running MPI collection across 1 host(s)
2026-06-12 23:56:29|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:29|INFO: Created benchmark run: training_run_unet3d_20260612_235627
2026-06-12 23:56:29|STATUS: Verifying benchmark run for training_run_unet3d_20260612_235627
2026-06-12 23:56:29|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-06-12 23:56:29|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 36889, Actual: 168)
2026-06-12 23:56:29|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='unet3d', run_datetime='20260612_235627')])
2026-06-12 23:56:29|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or o
2026-06-12 23:56:29|INFO: Creating data directory: /tmp/d/unet3d...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/train...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/valid...
2026-06-12 23:56:29|INFO: Creating directory: /tmp/d/unet3d/test...
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:29|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
⠙ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:01
⠙ Collecting via MPI... 0:00:012026-06-12 23:56:31|INFO: MPI collection completed successfully (1 hosts reported)
2026-06-12 23:56:31|INFO: MPI BTL transport: auto (OpenMPI default selection)
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/tmp/d/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 168
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-12T23:56:34.001746 Running DLIO [Training & Checkpointing] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Error executing job with overrides: ['workload=unet3d_h100', '++workload.dataset.data_folder=/tmp/d/unet3d']
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
  File "/home/smrc/Storage_Repo_Tests/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 250, in initialize
    raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60930,1],0]
  Exit code:    1
--------------------------------------------------------------------------
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
⠋ Collecting cluster info... 0:00:002026-06-12 23:56:35|INFO: Collector script staged at /tmp/r/training/unet3d/run/20260612_235627/collector-staging/mlps_collector.p
  Processing results... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:05
2026-06-12 23:56:37|STATUS: Writing metadata for benchmark to: /tmp/r/training/unet3d/run/20260612_235627/training_20260612_235627_metadata.json
smrc@dskbd029:~/Storage_Repo_Tests/storage$ RUN_DIR=/tmp/r/training/unet3d/run/20260612_235627
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -oE 'mpirun[^"]*' $RUN_DIR/training_20260612_235627_metadata.json | head -1
mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/
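
The pass criteria above can also be checked programmatically instead of by eye. This is a sketch only: the helper names are hypothetical, the sample commands are shortened stand-ins rather than the full persisted strings, and reading the `executed_command` key assumes the run metadata carries the same field seen in other mlpstorage metadata files.

```python
import json
import shlex


def flag_values(command: str, flag: str) -> list[str]:
    """Return the value following every occurrence of `flag` in the command."""
    tokens = shlex.split(command)
    return [tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == flag]


def meets_pass_criteria(command: str, bind_to: str = "core", map_by: str = "ppr:2:node") -> bool:
    """True when each flag appears exactly once with the expected value."""
    return (
        flag_values(command, "--bind-to") == [bind_to]
        and flag_values(command, "--map-by") == [map_by]
    )


def check_metadata(path: str) -> bool:
    """Apply the check to the executed_command field of a metadata JSON file."""
    with open(path, encoding="utf-8") as fh:
        return meets_pass_criteria(json.load(fh)["executed_command"])


# Shortened sample commands for illustration only.
good = "mpirun -n 2 -host localhost:2 --bind-to core --map-by ppr:2:node dlio_benchmark workload=unet3d_h100"
dup = "mpirun -n 2 --bind-to none --bind-to core --map-by ppr:2:node dlio_benchmark"

print(meets_pass_criteria(good))  # True
print(meets_pass_criteria(dup))   # False
```

Comparing the full list of values against a one-element list covers all three conditions at once: exactly one occurrence, the expected value, and no stray `--bind-to none`.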

@FileSystemGuy FileSystemGuy merged commit 7c28a36 into main Jun 13, 2026
3 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 13, 2026
@FileSystemGuy FileSystemGuy deleted the fix/issue-433-params-gating-and-datagen-emitter branch June 13, 2026 00:06
@idevasena

Contributor

Also verified the issue reported by Lou in #433. The fix works.

smrc@dskbd029:~/Storage_Repo_Tests/storage$ ./mlpstorage open training retinanet datagen file \
> --hosts localhost \
> --num-processes 79 \
  --data-dir /mnt/drives/nvme3n1/training \
  --results-dir /mnt/drives/nvme3n1/results \
  --params dataset.num_files_train=20 dataset.num_subfolders_train=10 \
  --debug \
  --exec-type mpi \
  --mpi-bin mpiexec
2026-06-13 00:06:36|DEBUG:history:31: Parsing history line: 171,20260612_235627,/home/smrc/Storage_Repo_Tests/storage/mlpstorage_py/main.py whatif training unet3d run

2026-06-13 00:06:36|VERBOSER:mlps_logging:101: Adding command to history: 172,20260613_000636,/home/smrc/Storage_Repo_Tests/storage/mlpstorage_py/main.py open trainin
2026-06-13 00:06:36|STATUS:mlps_logging:101: 
2026-06-13 00:06:36|STATUS:mlps_logging:101: --- Run Configuration (mlpstorage 3.0.9) ---
2026-06-13 00:06:36|STATUS:mlps_logging:101:   benchmark:                      training
2026-06-13 00:06:36|STATUS:mlps_logging:101:   command:                        datagen
2026-06-13 00:06:36|STATUS:mlps_logging:101:   mode:                           open
2026-06-13 00:06:36|STATUS:mlps_logging:101:   data_dir:                       /mnt/drives/nvme3n1/training
2026-06-13 00:06:36|STATUS:mlps_logging:101:   results_dir:                    /mnt/drives/nvme3n1/results
2026-06-13 00:06:36|STATUS:mlps_logging:101:   data_access_protocol:           file
2026-06-13 00:06:36|STATUS:mlps_logging:101:   num_accelerators:               [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   num_processes:                  79
2026-06-13 00:06:36|STATUS:mlps_logging:101:   accelerator_type:               [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   client_host_memory_in_gb:       [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   hosts:                          ['localhost']
2026-06-13 00:06:36|STATUS:mlps_logging:101:   exec_type:                      mpi
2026-06-13 00:06:36|STATUS:mlps_logging:101:   mpi_bin:                        mpiexec
2026-06-13 00:06:36|STATUS:mlps_logging:101:   loops:                          1
2026-06-13 00:06:36|STATUS:mlps_logging:101: 
2026-06-13 00:06:36|STATUS:mlps_logging:101: --- Environment ---
2026-06-13 00:06:36|STATUS:mlps_logging:101:   MLPERF_RESULTS_DIR:             [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   MPI_RUN_BIN:                    [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101:   MPI_EXEC_BIN:                   [not set]
2026-06-13 00:06:36|STATUS:mlps_logging:101: 
⠋ Validating environment... 0:00:002026-06-13 00:06:36|DEBUG:validation_helpers:572: Starting environment validation (OS: Linux)
2026-06-13 00:06:36|INFO:validation_helpers:686: Environment validation passed
2026-06-13 00:06:36|STATUS:mlps_logging:101: Benchmark results directory: /mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636
2026-06-13 00:06:36|DEBUG:dependency_check:276: Checking for MPI runtime (mpiexec)...
2026-06-13 00:06:36|DEBUG:dependency_check:279: Found MPI at: /usr/bin/mpiexec
2026-06-13 00:06:36|DEBUG:dependency_check:283: Checking for DLIO benchmark...
2026-06-13 00:06:36|DEBUG:dependency_check:286: Found DLIO at: /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark
2026-06-13 00:06:36|DEBUG:dlio:224: yaml params: 
{'dataset': {'data_folder': 'data/retinanet/',
             'format': 'jpeg',
             'num_files_train': 1170301,
             'num_samples_per_file': 1,
             'record_length_bytes': 322957},
 'framework': 'pytorch',
 'model': {'name': 'retinanet', 'type': 'cnn'},
 'workflow': {'checkpoint': False, 'generate_data': True, 'train': False}}
2026-06-13 00:06:36|DEBUG:dlio:225: combined params: 
{'dataset': {'data_folder': 'data/retinanet/',
             'format': 'jpeg',
             'num_files_train': '20',
             'num_samples_per_file': 1,
             'num_subfolders_train': '10',
             'record_length_bytes': 322957},
 'framework': 'pytorch',
 'model': {'name': 'retinanet', 'type': 'cnn'},
 'workflow': {'checkpoint': False, 'generate_data': True, 'train': False}}
2026-06-13 00:06:36|DEBUG:dlio:226: Instance params: 
{'_cluster_collector': None,
 '_config_name': 'retinanet_datagen',
 '_timeseries_collector': None,
 '_timeseries_data': None,
 '_validator': None,
 'args': Namespace(mode='open', benchmark='training', model='retinanet', command='datagen', hosts=['localhost'], num_processes=79, exec_type=<EXEC_TYPE.MPI: 'mpi'>, m
 'base_command': 'dlio_benchmark',
 'base_command_path': '/home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark',
 'base_path': '/home/smrc/Storage_Repo_Tests/storage/mlpstorage_py',
 'benchmark_run_verifier': None,
 'cmd_executor': <mlpstorage_py.utils.CommandExecutor object at 0x7c7e16f1fce0>,
 'command_method_map': {'configview': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'datagen': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'datasize': <bound method TrainingBenchmark.datasize of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'reportgen': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>,
                        'run': <bound method DLIOBenchmark.execute_command of <mlpstorage_py.benchmarks.dlio.TrainingBenchmark object at 0x7c7e167c6ff0>>},
 'command_output_files': [],
 'config_file': 'retinanet_datagen.yaml',
 'config_path': '/home/smrc/Storage_Repo_Tests/storage/configs/dlio',
 'debug': True,
 'logger': <Logger MLPerfStorage (DEBUG)>,
 'metadata_file_path': '/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_20260613_000636_metadata.json',
 'metadata_filename': 'training_20260613_000636_metadata.json',
 'per_host_mem_kB': None,
 'run_datetime': '20260613_000636',
 'run_number': 0,
 'run_result_output': '/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636',
 'runtime': 0,
 'timeseries_file_path': '/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_20260613_000636_timeseries.json',
 'timeseries_filename': 'training_20260613_000636_timeseries.json',
 'total_mem_kB': None,
 'verification': None}
2026-06-13 00:06:36|VERBOSER:mlps_logging:101: Instantiated the Training Benchmark...
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-06-13 00:06:36|DEBUG:base:592: Skipping start cluster collection (conditions not 
2026-06-13 00:06:36|DEBUG:base:676: Skipping time-series collection (disabled or not applicable)
2026-06-13 00:06:36|VERBOSER:mlps_logging:101: Generating DLIO command for benchmark training
2026-06-13 00:06:36|DEBUG:dlio:270: Generating MPI Command with binary "mpiexec"
2026-06-13 00:06:36|DEBUG:utils:570: Configured slots for hosts: ['localhost:79']
2026-06-13 00:06:36|INFO:utils:612: MPI BTL transport: auto (OpenMPI default selection)
2026-06-13 00:06:36|STATUS:mlps_logging:101: Running benchmark command:: mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests
2026-06-13 00:06:36|DEBUG:utils:343: DEBUG - Executing command: mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/mnt/drives/nvme3n1/training/retinanet'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 20
[OUTPUT]   record_length  = 322957
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = False
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 1
[OUTPUT] 2026-06-13T00:06:43.706265 Running DLIO [Generating data] with 79 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-13T00:06:43.821833 Starting data generation
[?25l
[2K[32m⠋[0m Generating JPEG Data [38;5;237m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m 0/20 [33m0:00:00[0m
[OUTPUT] 2026-06-13T00:06:43.883289 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
⠸ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:00:092026-06-13 00:06:46|VERBOSE:mlps_logging:101: Command stdout saved to: training_datagen.std
2026-06-13 00:06:46|VERBOSE:mlps_logging:101: Command stderr saved to: training_datagen.stderr.log
2026-06-13 00:06:46|DEBUG:base:620: Skipping end cluster collection (no start collection)
2026-06-13 00:06:46|STATUS:mlps_logging:101: Writing metadata for benchmark to: /mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_202606
{
  "benchmark_type": "training",
  "model": "retinanet",
  "command": "datagen",
  "run_datetime": "20260613_000636",
  "num_processes": 79,
  "accelerator": null,
  "result_dir": "/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636",
  "parameters": {
    "model": {
      "name": "retinanet",
      "type": "cnn"
    },
    "framework": "pytorch",
    "workflow": {
      "generate_data": true,
      "train": false,
      "checkpoint": false
    },
    "dataset": {
      "data_folder": "/mnt/drives/nvme3n1/training/retinanet",
      "format": "jpeg",
      "num_files_train": "20",
      "num_samples_per_file": 1,
      "record_length_bytes": 322957,
      "num_subfolders_train": "10"
    }
  },
  "override_parameters": {
    "dataset.num_files_train": "20",
    "dataset.num_subfolders_train": "10",
    "dataset.data_folder": "/mnt/drives/nvme3n1/training/retinanet"
  },
  "system_info": null,
  "runtime": 9.109416484832764,
  "verification": null,
  "executed_command": "mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=retinan
  "command_output_files": [
    {
      "command": "mpiexec -n 79 -host localhost:79 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=retinanet_da
      "stdout": "/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_datagen.stdout.log",
      "stderr": "/mnt/drives/nvme3n1/results/training/retinanet/datagen/20260613_000636/training_datagen.stderr.log"
    }
  ],
  "args": {
    "mode": "open",
    "benchmark": "training",
    "model": "retinanet",
    "command": "datagen",
    "hosts": [
      "localhost"
    ],
    "num_processes": 79,
    "exec_type": "mpi",
    "mpi_bin": "mpiexec",
    "oversubscribe": false,
    "allow_run_as_root": false,
    "mpi_btl": "auto",
    "mpi_params": null,
    "data_dir": "/mnt/drives/nvme3n1/training",
    "dlio_bin_path": null,
    "results_dir": "/mnt/drives/nvme3n1/results",
    "config_file": null,
    "debug": true,
    "verbose": false,
    "stream_log_level": "INFO",
    "quiet": false,
    "dry_run": false,
    "verify_lockfile": null,
    "skip_validation": false,
    "data_access_protocol": "file",
    "loops": 1,
    "allow_invalid_params": false,
    "params": [
      "dataset.num_files_train=20",
      "dataset.num_subfolders_train=10"
    ],
    "num_client_hosts": 1
  }
}
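
The "combined params" and "override_parameters" entries in the output above show dotted keys being folded into the nested YAML params, with override values kept as strings ("20", "10"). A minimal sketch of that kind of merge follows; `merge_dotted_overrides` is a hypothetical helper written for illustration, not the mlpstorage implementation.

```python
import copy


def merge_dotted_overrides(base: dict, overrides: list[str]) -> dict:
    """Fold 'a.b=value' strings into a copy of a nested dict."""
    merged = copy.deepcopy(base)  # leave the caller's dict untouched
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            node = node.setdefault(name, {})  # create missing sections
        node[leaf] = value  # values stay strings, as in the log above
    return merged


yaml_params = {"dataset": {"format": "jpeg", "num_files_train": 1170301}}
combined = merge_dotted_overrides(
    yaml_params,
    ["dataset.num_files_train=20", "dataset.num_subfolders_train=10"],
)
print(combined["dataset"]["num_files_train"])       # 20 (as a string)
print(combined["dataset"]["num_subfolders_train"])  # 10 (as a string)
```

Whether the string values are later coerced to integers is not visible from this output, so the sketch leaves them as-is.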

Development

Successfully merging this pull request may close these issues.

It appears that the "--param" option does not work with "./mlpstorage closed training retinanet datagen file".