Submission checker: complete Rules.md coverage + 'mlpstorage validate' subcommand#432
Submission checker: complete Rules.md coverage + 'mlpstorage validate' subcommand#432FileSystemGuy wants to merge 92 commits into
Conversation
…s, rule13_violation, bad_deployment kwargs - Fix Rule 1 bug: _build_system_yaml used power_capacity_watts (invalid PSU field) and missing inlet_voltage/nameplate_power_watts; switch to cloud+local deployment which removes rack_units/power/networking requirements on nodes while keeping all Phase 1 structural properties intact (65 tests still pass) - Add system_yaml_bad_capabilities: perturb capabilities via key overwrite, remove (drop field), or add (inject key) operations - Add system_yaml_rule13_violation: set both_simultaneous=True, remap=5 to trigger Capabilities.check_remap_time (Rule 13 cross-field) - Add system_yaml_bad_deployment (Resolution A): set deployment to non-enum int 12345 to drive SCHEMA_ERROR_RULE_MAP unmapped-fallback test (loc 'system_under_test -> deployment' not in map -> 2.1.7 fallback) - All three new kwargs recognized by sealed-enum guard; unknown kwargs still raise
…s 7 classes
- TestSchemaCheck_DefaultClean: positive case (valid YAML -> zero violations)
- TestSchemaCheck_BadType: unrecognised string/int values for bool fields -> 4.7.4
NOTE: Pydantic v2 coerces truthy strings ('yes','true') and int 0/1 to bool;
tests use 'not-a-bool', int 2, 'maybe' which actually trigger type errors
- TestSchemaCheck_MissingField: remove remap_time -> 4.7.3; remove multi_host -> 4.7.4
- TestSchemaCheck_Rule13CrossField: rule13_violation fixture -> 4.7.3 via cross-field check
- TestSchemaCheck_UnmappedFallback (Resolution A): bad_deployment=12345 -> 2.1.7 fallback
- TestSchemaCheck_OpenSubtree: open/ subtree walked per D-A1
- TestSchemaCheck_AccumulateDontAbort: two errors in one YAML -> two records (QUAL-01)
- All 327 tests pass (317 pre-existing + 10 new)
…ark_api helper - Add @rule("3.3.4", "trainingSingleHostClientLimit") decorator to single_host_client_limit - Upgrade single_host_client_limit from bare self.log.error to self.log_violation (QUAL-02) - Register self.single_host_client_limit in init_checks (TRAIN-01 wire-up — was missing) - Add _get_benchmark_api() helper reading self.submissions_logs.system_file (D-B7) - Import rule decorator and _check_filesystem_separation helper at top of file - Implement mlpstorage_filesystem_check with @rule("3.4.2", ...) + helper call (TRAIN-02) - Object-API silent-pass gate via _get_benchmark_api() (D-B7) - Emit [3.4.2 trainingMlpstorageFilesystemCheck] 'df output not found' when df absent (D-B4) - Emit same violation when data_dir and results_dir share the same filesystem mount
- Import _check_filesystem_separation, _pair_checkpoint_runs, _parse_iso_gap from helpers.py - Add _get_benchmark_api() helper reading self.system_file with safe .get chain (D-B7) - Add _get_capability(field) helper for system YAML capabilities (D-A3 silent-skip support) - Register CHKPT-01..03 methods in init_checks (partial — CHKPT-04..06 follow in Task 2b) - CHKPT-01 open_mpi_processes: @rule(4.6.4) verify num_processes is multiple of TP*PP - CHKPT-02 cache_flush_validation: @rule(4.7.1) verify write->read gap <=30s; strict per spec - CHKPT-03 total_test_duration: @rule(4.7.2) happy-path via log.info (D-D1); missing-ts via log_violation - self.system_file assignment (D-D3) from Plan 02-01 preserved — NOT duplicated
…Task 2b) - CHKPT-04 remapping_time_reporting: @rule(4.7.3) cross-check observed vs declared remap interval * Silent-skip when _get_capability returns None (D-A3 — schema check owns missing-field) * 0.5x tolerance band on declared remap_time_in_seconds - CHKPT-05 simultaneous_rw_support: @rule(4.7.4) deferred runtime cross-check (TODO-002) * Schema covers structural requirements (D-A3); runtime per-host timing not in summary.json * Emits log.info with TODO-002 reference; returns True - CHKPT-06 checkpoint_filesystem_check: @rule(4.4.2) df-based filesystem separation for chkpt * Maps checkpoint_folder as data_dir analog per RESEARCH.md * Object-API silent-pass via _get_benchmark_api() (D-B7) * 'df output not found' emits violation (D-B4) - init_checks finalized: all six CHKPT-01..06 registered in locked order
… metadata + 16 new kwargs - MockLogger.infos captures info() calls for CHKPT-03/05 test assertions - _DEFAULT_METADATA module-level dict written to each timestamp directory (training/datagen, training/run, checkpointing) as metadata.json - _MOCK_DF_OUTPUT_DIFFERENT_MOUNTS + _MOCK_DF_OUTPUT_SAME_MOUNT snippets - 16 new mutation kwargs: benchmark_api, run_logfile_df_block, chkpt_logfile_df_block, run_metadata_hosts, chkpt_summary_timestamps, chkpt_split_mode, chkpt_cache_flush_gap_seconds, chkpt_model, chkpt_open_num_processes, chkpt_closed_num_processes, chkpt_remap_time_seconds, chkpt_simultaneous_flags, chkpt_checkpoint_folder, chkpt_results_dir, run_data_dir, run_results_dir - Sealed-enum guard preserved; _build_system_yaml extended with new args - All 75 Phase 1 + Plan 02-02 tests continue to pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…discover_rules - Test @rule sets __rule_id__ and __rule_name__ on decorated function - Test decorator returns function unchanged (signature + return value preserved) - Test decorated method binds self correctly on class instances - Test discover_rules returns correct count for class with 3 decorated + 2 plain methods - Test discover_rules returns rule_id -> (rule_name, method_name) tuples - Test discover_rules returns empty dict for class with no decorated methods - Test discover_rules does not raise with non-callable class attributes
…r_rules helper - @rule(rule_id, rule_name) attaches __rule_id__ and __rule_name__ to method without wrapping it, preserving inspect.signature and self-binding (D-03) - discover_rules(check_class) returns dict[rule_id -> (rule_name, method_name)] via inspect.getmembers with callable predicate and __rule_id__ filter (D-04) - Flat module with Args/Returns docstrings matching utils.py style (PATTERNS.md) - No top-level side effects; only stdlib inspect imported - 9 tests pass covering all 6 specified behaviors
…iolation - Test locked format [rule_id rule_name] path: msg emitted via log.error - Test warn_violation uses log.warning, log_violation uses log.error - Test both helpers return None and do not raise - Test percent-style formatting passthrough via *args - Test run_checks behavior unchanged (pass, fail, exception cases)
…elpers - Add log_violation(rule_id, rule_name, path, msg, *args) emitting '[<id> <name>] <path>: <msg>' via self.log.error (D-06, D-07) - Add warn_violation(rule_id, rule_name, path, msg, *args) same format via self.log.warning for STRUCT-06 / symlink rejection (D-08) - Both helpers accept *args and pass through to logging for lazy % formatting - Place helpers after __init__ and before run_checks (cross-cutting helpers near construction, not entangled with execution machinery) per PATTERNS.md - run_checks, execute, __call__, and except BaseException clause unchanged - 10 tests pass; test_rules.py 32 tests still pass (no regression)
…MP_COUNT, MD5 exclude constants, Config.get_reference_checksum - Import-error confirms REFERENCE_CHECKSUMS not yet in constants.py - 7 tests covering D-09 (constants), D-10/D-12 (Config precedence), D-13 (MD5 excludes), D-22
…5 exclude constants; extend Config - Add REFERENCE_CHECKSUMS version-keyed dict (D-09, Rules.md 2.1.6/3.6.1) - Add RUN_TIMESTAMP_COUNT = 6 (D-22, Rules.md 2.1.17) - Add MD5_EXCLUDE_PREFIXES and MD5_EXCLUDE_FILENAMES tuples (D-13) - Extend Config.__init__ with reference_checksum_override kwarg (D-10) - Add Config.get_reference_checksum(cli_override) with cli > ctor > version-dict > None precedence (D-12)
…and CLI (D-13, D-14, D-11) - 9 predicate tests: .git exclusion, __pycache__ exclusion, binary-mode, POSIX sort, symlink rejection, rename detection, .pyc exclusion, .egg-info exclusion, nonexistent root - 5 CLI integration tests (D-11): hex digest, error exit, determinism, excludes, lib-vs-cli parity - All fail: tools/ package does not yet exist
…ee_md5 predicate
- tools/__init__.py: empty package marker (mirrors checks/__init__.py)
- tools/code_checksum.py: compute_code_tree_md5(root_path, log) -> str | None
- os.walk(root_path, followlinks=False) — path-traversal prevention (T-02-01)
- Prunes excluded dirs in-place via dirnames[:] for efficient walk
- MD5_EXCLUDE_PREFIXES / MD5_EXCLUDE_FILENAMES imported from constants (D-13)
- .egg-info suffix exclusion handled in predicate (D-13)
- Binary-mode reads only (no line-ending normalization)
- POSIX-byte-sorted entries (rel_path.encode('utf-8'))
- Relative path bytes hashed before content (rename detection per D-14)
- Symlinks emit [2.1.6 codeDirectoryContents] warning, skipped (not raised)
- Returns None for nonexistent root
- All 9 predicate tests pass
…(D-11) - Mirrors main.py shape: argparse, logging.basicConfig (same format string), get_args(), main() -> int, sys.exit(main()) - Imports compute_code_tree_md5 from .code_checksum (does not duplicate predicate) - Prints 32-char hex digest to stdout; exits 1 with log.error for nonexistent path - All 5 CLI integration tests pass (determinism, excludes, library parity) - 26 total tests passing (9 predicate + 5 CLI + 12 constants/Config)
…-01..14) and fixture factory - TestFixtureFactory: asserts build_submission default tree structure - TestStruct01..14: positive + negative tests for every STRUCT rule - TestAccumulateDontAbort: two-violation proof for STRUCT-07 and STRUCT-09 - TestQual02RuleIdPrefix: programmatic [rule_id rule_name] prefix enforcement - All tests fail: SubmissionStructureCheck and conftest.py do not exist yet
…+ conftest factory - Add SubmissionStructureCheck class with @rule-decorated methods for STRUCT-01..14 (Rules.md §2.1.1–2.1.13 + 2.1.21). - Each method logs its own violation and returns bool — never raises, consistent with accumulate-don't-abort orchestration spirit. - STRUCT-04 enforces both count==1 AND name match against basename(args.input) per PLAN.md D-277. - Add tests/conftest.py with build_submission factory fixture and MockLogger. Root nested under tmp_path/<submitter>/ so basename satisfies STRUCT-04 by default. - Drop the ill-formed test_submitter_name_with_slash variant — a directory name containing '/' is unrepresentable on POSIX; STRUCT-01 negative coverage is still provided by test_submitter_name_with_space. 63/63 tests pass in mlpstorage_py/tests/test_submission_checker_structure.py.
…nce-checksum CLI (D-02) - Add --reference-checksum CLI flag (default None) — propagates to Config as reference_checksum_override (the override hook added in plan 01-02). - Instantiate SubmissionStructureCheck once BEFORE the per-benchmark loader.load() loop (per PLAN.md 01-03 D-02). Failure is appended to the existing `errors` accumulator but does NOT short-circuit the loop — every benchmark still runs its own checks (accumulate-don't-abort). - Add TestMainWiring smoke tests asserting the import and CLI flag exist. 65/65 tests pass in mlpstorage_py/tests/test_submission_checker_structure.py.
…ale-comment removal) - 5-timestamp regression: returns False, error mentions both 6 and 5 - 6-timestamp happy path: returns True, no errors - monkeypatch proves count is driven by RUN_TIMESTAMP_COUNT, not inline `6` - source assertion: stale `# v2.0 only 5 required` comment must be gone AND docstring must cite Rules.md 2.1.17 - invalid-format check still flags malformed entries (no-regression guard for the sibling regex check) Expected RED bar at this commit: 2 failures - test_count_is_driven_by_constant_not_inline_literal (impl still uses inline 6) - test_stale_comment_is_gone_and_rule_cited (stale comment + missing citation) GREEN follows in the next commit.
…to RUN_TIMESTAMP_COUNT
Mechanical fix per Plan 01-04 D-22 and PATTERNS.md "directory_checks.py (BUG-04
modification)" section. Single method modified: run_files_timestamp_check.
Changes:
- Delete the two-line stale comment that suggested v2.0 only required 5
timestamps (factually wrong vs. Rules.md 2.1.17 — the rule has always
been 6: 1 warm-up + 5 measured).
- Replace inline `6` with the named constant `RUN_TIMESTAMP_COUNT` (already
imported via the wildcard `from ..constants import *` at the top of the
module; constant was added in plan 01-02).
- Augment the method docstring to cite Rules.md 2.1.17 (runTimestamps) and
spell out the 1 warm-up + 5 measured semantic so submitters can grep the
spec from validator output.
- Update the error message format string to reference the constant via `%d`
so a future v3.0 bump to RUN_TIMESTAMP_COUNT propagates cleanly.
Out of scope (intentional carve-out per PATTERNS.md "Departures" + PITFALLS.md
#14): does NOT promote self.log.error to self.log_violation. BUG-04's scope is
the count fix only; QUAL-04 message-format upgrades apply to the new Phase 1
SubmissionStructureCheck surface, not to DirectoryCheck retrofits.
Acceptance:
- new regression tests: 6/6 pass
- full mlpstorage_py/tests suite: 238 passed, 23 pre-existing pyarrow
failures (unchanged from Plan 03's baseline)
- grep -c "v2.0 only 5 required" → 0
- grep -c "len(timestamps) != 6" → 0
- diff stat: 10 insertions, 8 deletions (well under 20-line bound)
Closes BUG-04. Phase 1 success criterion #5 satisfied.
- test_loader_metadata_refresh.py: BUG-01 regression (training run loop reuses datagen metadata_path; test confirms carryover produces num_processes=4 for all 6 run tuples when per-run metadata has num_processes=8; also pins checkpointing branch is unaffected) - test_constants_checkpoint_files.py: BUG-02 regression (CHECKPOINT_REQUIRED_FILES uses training_run.* prefix; tests assert checkpointing_run prefix and escaped dot before log for all three version keys) - test_bug03_closed_mpi_processes.py: BUG-03 regression (subset branch raises UnboundLocalError on model_key; non-subset branch missing rule-ID-tagged log_violation format) All new tests currently FAIL as expected (RED phase).
BUG-01 (loader.py): add metadata_path = self.find_metadata_path(timestamp_path) inside training run loop before metadata_file load; mirrors datagen loop line 112; checkpointing branch (lines 128-134) unchanged (already correct) BUG-02 (constants.py): replace training_run.* prefix with checkpointing_run.* in CHECKPOINT_REQUIRED_FILES for all three version keys; also escape dot before 'log' (\.log) to avoid latent over-matching; cites Rules.md 2.1.25 BUG-03 (checkpointing_checks.py): derive model_key BEFORE subset-branch check so UnboundLocalError is gone; add @rule("4.6.1", "checkpointClosedMpiProcesses") decorator; upgrade log.error to log_violation (QUAL-02 retrofit); delegate model_process_requirements lookup to self.config.get_closed_mpi_processes (D-C4 — Config method added in Task 3) D-D3 (checkpointing_checks.py): add self.system_file = submissions_logs.system_file so CHKPT-04/05/06 can read benchmark_API and capabilities
…get_closed_mpi_processes
constants.py: add CLOSED_MPI_PROCESSES dict (Rules.md Table 2 — CLOSED total
MPI processes per model: 8b=8, 70b=64, 405b=512, 1t=1024); comment cites
D-C4 and explains why DP cannot come from DLIO workload YAMLs
configuration.py: add os/yaml imports; add _parallelism_cache to __init__;
add get_model_parallelism(model_size) — lazy-loads llama3_{key}.yaml,
caches per key, returns (tp, pp) tuple; add get_closed_mpi_processes(key)
— one-line delegate to CLOSED_MPI_PROCESSES[key]
test_config_model_parallelism.py: 19 tests covering all four model keys,
caching behavior (yaml read only once), FileNotFoundError on unknown model,
KeyError on unknown closed-mpi key, Phase 1 backwards-compat assertions
… _pair_checkpoint_runs, _parse_iso_gap
helpers.py (new flat module, no circular imports, no log calls — caller emits):
- DF_HEADER_RE: anchored df header regex (D-B1); tolerates df and df-h column
variants via \S+ for second column
- _check_filesystem_separation(metadata_args, logfile_path) -> (ok, df_found):
D-B1..B5; silent-skip on missing paths (D-B3); df-not-found returns False,False
(D-B4); longest-prefix realpath mount match (D-B2); accepts data_dir OR
checkpoint_folder key (TRAIN-02 and CHKPT-06 reuse)
- _pair_checkpoint_runs(summaries) -> list[(write, read)]: D-D2; splits on
num_checkpoints_write/read; zip-truncates on unequal counts (documented)
- _parse_iso_gap(start, end) -> float: both space and T ISO forms; Python 3.10
compatibility via s.replace(' ','T') fallback
test_helpers.py: 24 tests covering all behaviors including silent-skip (D-B3),
df-not-found (D-B4), same/different mounts (D-B2), longest-prefix (D-B2),
df-h tolerance (D-B1), checkpoint_folder key (CHKPT-06), symlink resolution,
pairing/truncation/combined-mode-drop, ISO gap parsing
…E_MAP
system_yaml_schema_checks.py (new skeleton — Plan 02-02 fleshes out):
- SystemYamlSchemaCheck subclasses BaseCheck (D-A1); mirrors Phase 1's
SubmissionStructureCheck plug-in pattern (above-the-workload tier)
- Constructor accepts (log, config, root_path); assigns name, root_path,
config; calls init_checks()
- SCHEMA_ERROR_RULE_MAP (D-A2): maps four locked Pydantic loc field paths
to (rule_id, rule_name) tuples for CHKPT-04/05 violation tagging
- init_checks() sets self.checks = [] (Plan 02-02 appends
schema_validate_all_system_yamls and wires into main.py)
- Calling the instance (run_checks over empty list) returns True cleanly
No schema_validator import yet — Plan 02-02 owns that.
Smoke test: python3 -c "... check() is True ... print('OK')" passes.
…c string
- Empirically verified 2026-06-10: Pydantic v2 model_validator(mode='after')
on Capabilities fires at loc ('system_under_test','solution','capabilities')
which stringifies to 'system_under_test -> solution -> capabilities'
- Both Rule-13 trigger conditions produce the same loc string
- RESEARCH.md predicted empty string; empirical run shows Pydantic v2
propagates the container path (corrects Gray Area 4 prediction)
- Finalized map has 5 entries (4 field-level + 1 cross-field Rule-13)
- Also implemented schema_validate_all_system_yamls method and wired into
init_checks (full method implementation per Plan 02-02 Task 2)
…op (D-A1) - Add import for SystemYamlSchemaCheck - Instantiate schema_check after SubmissionStructureCheck, before for-loop - Errors accumulated into existing errors list without aborting per QUAL-01 - Mirrors Phase 1 SubmissionStructureCheck plug-in pattern (D-A1)
…N-02 coverage - TestTrain01_SingleHostClientLimit: 4 tests (positive + negative for single-host client limit check, rule 3.3.4) - TestTrain02_MlpstorageFilesystemCheck: 5 tests (different mounts pass, same mount fails, df-not-found fails, object-API silent-pass, missing data_dir silent-skip per D-B3) - TestTrain02_DfBlockNotFoundIsRuleIdTagged: 1 test (ROADMAP criterion #3 — logfile path included in violation message) - All 10 tests pass; rule-ID prefixes [3.3.4] and [3.4.2] asserted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cross-cutting tests - TestChkpt01_OpenMpiProcesses: 4 tests (positive + negative for open MPI process count check, rule 4.6.4; closed and unknown model silent-skips) - TestChkpt02_CacheFlushValidation: 4 tests (combined-mode pass, 25s gap pass, 45s gap violation, missing timestamps violation, rule 4.7.1) - TestChkpt03_TotalTestDuration: 3 tests (happy-path info emit, missing timestamps error, combined-mode pass, rule 4.7.2) - TestChkpt04_RemappingTimeReporting: 4 tests (remap=0 pass, absent field pass, too-small observed violation, default zero pass, rule 4.7.3) - TestChkpt05_SimultaneousRwSupport: 3 tests (defaults, sim_write=False info, missing field silent-pass, rule 4.7.4) - TestChkpt06_CheckpointFilesystemCheck: 4 tests (different mounts pass, same mount violation, df-not-found, object-API silent-pass, rule 4.4.2) - TestQual02RuleIdPrefix: 6 parametrized cases asserting every error starts with [rule_id rule_name] prefix (QUAL-02 enforcement) - TestAccumulateDontAbort: 10 violations from 10 timestamps prove accumulate-don't-abort (QUAL-01 / Pitfall 11) - TestChkpt05DeferredFollowUp: 2 tests pinning TODO-002 deferral — info only, returns True, mentions 'runtime cross-check' - 31 tests total, all passing; full suite at 368 passed, 0 failed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CR-01: DF_HEADER_RE required the literal "Available" so real-world `df -h`
output (which emits "Avail") never matched. TRAIN-02 / CHKPT-06 would
silently fail to detect filesystem-separation violations on any submission
captured with `df -h`. The matching test hard-coded "Available", giving
false confidence. Fix: fourth column accepts `Avail\w*`.
CR-02: `content.index('\\n', match.end())` raised ValueError when the df
header was the last line with no trailing newline; BaseException handler
swallowed it without a rule-tagged diagnostic. Fix: use str.find() and
return (False, False) so the caller emits the standard "df output not
found" violation.
Tests: rewrote test_df_h_format_tolerated to use the real "Size + Avail"
header form, added test_df_long_format_tolerated for the original "df"
form, and added test_df_header_at_eof_no_newline to lock in CR-02.
26 tests pass in test_helpers.py (was 24); 342 phase-2-scope tests pass
overall, zero regressions.
Findings from .planning/phases/02-per-run-parameter-logfile-coverage/02-REVIEW.md.
… release version The --mlperf-version default was hardcoded to 'v5.1', a stale literal that's never been in VERSIONS = ['v2.0', 'v3.0']. argparse does not validate the 'default' kwarg against 'choices', so the bad default flowed silently through every default invocation, causing every per-version dict lookup (DATAGEN_REQUIRED_FILES, RUN_REQUIRED_FILES, CHECKPOINT_REQUIRED_FILES, CHECKPOINT_REQUIRED_FOLDERS, SYSTEM_PATH) to raise KeyError. Rules referencing those dicts crashed under the recently unmasked BaseCheck traceback path. Replace the literal with DEFAULT_SPEC_VERSION, derived from the package release version's major.minor: a '3.0.x' package binds to spec 'v3.0', a '4.0.x' package will bind to 'v4.0', etc. This holds the architectural distinction the user requested — the --mlperf-version flag remains overridable so the validator can still target prior rounds — while guaranteeing the default is always a value that round's constants dicts actually define entries for. Fallback: when the package version cannot be parsed (the 'unknown' result from _resolve_version when both PEP 621 metadata and pyproject .toml are unavailable) or when the derived round isn't in VERSIONS yet (during the transition window after a major bump), the derivation falls back to the last entry of VERSIONS. Empty VERSIONS yields 'unknown'. DEFAULT_SPEC_VERSION lives in submission_checker.constants alongside VERSIONS. Both the standalone submission_checker CLI and the new mlpstorage validate shim import it; the shim's import is lazy so the fast-path top-level CLI parser construction doesn't pay the import cost. New test file test_default_spec_version.py pins both layers: the pure derivation function and the module-level constant. The module-level suite parametrizes over every per-version dict to catch the original KeyError regression — if any per-version dict ever drops the entry for DEFAULT_SPEC_VERSION, the test fails loudly at import time. Fixing the default unmasks further latent KeyErrors (e.g. KeyError: 'unet3d_a100' from a workload/accelerator lookup) that the v5.1 short circuit had been hiding. Those are surfaced via the BaseCheck traceback path and tracked separately.
…ames Three per-model dict lookups (NUM_DATASET_TRAIN_FILES, NUM_DATASET_EVAL_FILES, CHECKPOINT_FILE_MAP) crashed with KeyError whenever the workload directory name did not exactly match a known model. v2.0 submissions used non-conforming names like "unet3d_a100", "cosmoflow-20N-6PPN-A100", "resnet50-24N-14PPN-H100" that encode the accelerator and topology in the directory name. STRUCT-11 (2.1.11 trainingWorkloads) already flagged these as structural violations, but the downstream cross-check rules then crashed on the dict lookup rather than acknowledging the 2.1.11 violation and skipping cleanly. Three layered fixes: 1. Config.get_num_train_files / get_num_eval_files / get_checkpoint_file switch from d[key] to d.get(key) so unknown models return None instead of raising KeyError. This makes the Config API safe to call speculatively; the caller decides whether None means "skip" or "this was supposed to resolve". 2. TrainingCheck.run_data_matches_datasize resolves expected_train and expected_eval up front. When either is None, it emits an INFO diagnostic explaining the skip and pointing at the 2.1.11 structural violation, then returns True. Also tightens the inner loop's comparisons from `num_files_train > expected` to `elif num_files_train > expected` so a None num_files_train (already flagged with its own 3.3.1 violation above) no longer feeds a None>int comparison crash on the very next line. 3. CheckpointingCheck.closed_checkpoint_parameters (4.6.3) adds a matching None-guard on get_checkpoint_file. When the LLM workload directory name is non-conforming, the function emits an INFO diagnostic pointing at the 2.1.21 checkpointingWorkloads violation and returns True rather than feeding None into os.path.join + open. The diagnostics use INFO level (not warn_violation) so they read as "this is why the check is skipping" rather than "the spec is being violated again" — the structural rule already owns the underlying complaint, and the skip diagnostic exists to make the runtime control flow visible to reviewers.
…stamp dirs in loader Loader.load() called list_dir() unconditionally on datagen_path, run_path, and checkpoint_path. When a submission omits one of these (a structural violation that SubmissionStructureCheck reports under STRUCT-12 / 2.1.12), os.listdir() raised FileNotFoundError, which escaped run_checks() and killed the entire validate run mid-corpus — violating the Rules.md intro guarantee of best-effort continuation across failures. Guard each list_dir call with os.path.isdir; yield empty file lists when the optional subdir is missing. The structural check still flags the violation; downstream submitters are now reached. Adds 4 regression tests covering each missing-dir case plus a multi- submitter traversal continuation pin.
…ummary/metadata When summary.json or metadata.json is missing from a run timestamp, Loader returns None for the corresponding parsed dict. Six TrainingCheck methods called summary.get(...) / metadata.get(...) without guarding, producing six AttributeError tracebacks per affected submitter — all escaping run_checks() into the log. Methods fixed: recalculate_dataset_size (training_checks.py:135) run_data_matches_datasize (training_checks.py:255) accelerator_utilization_check (training_checks.py:296) single_host_simulated_accelerators (training_checks.py:325) single_host_client_limit (training_checks.py:371) identical_accelerators_per_node (training_checks.py:395) Each skips run-file tuples where the required dict is None and emits a debug-level note. The missing summary.json / metadata.json is already reported under rule 2.1.19 (runFiles) by SubmissionStructureCheck, so silent skip avoids duplicate noise while keeping the diagnostic. Adds parametrized regression tests covering each method against (1) a single None-tuple and (2) a mixed None + well-formed run_files list.
…ngTimestampGap
The shortest-run-duration tracker was initialised as ``max_gap = float("inf")``
and then compared against ``run_duration`` (a ``datetime.timedelta``). Python
raises ``TypeError: '<' not supported between instances of 'datetime.timedelta'
and 'float'`` on the first iteration. The existing
``except (ValueError, KeyError, TypeError)`` swallowed it and re-emitted the
TypeError as a spurious 2.1.24 "Failed to parse timestamp data" violation on
every well-formed checkpointing submission with a parseable summary.json.
Replace the sentinel with ``timedelta.max`` so the running minimum stays a
timedelta. Also guard against ``checkpoint_dict is None`` (missing
summary.json is already reported under 2.1.22), avoiding a TypeError on the
dict lookup below.
Adds 3 regression tests: well-formed checkpoints emit no 2.1.24 violation,
None summary entries are skipped, and the single-checkpoint case doesn't
crash the sentinel comparison.
When summary.json is missing for a run timestamp, Loader returns None. 2.1.18 then evaluated run_dict["start"] and raised TypeError: 'NoneType' object is not subscriptable. The broad except (ValueError, KeyError, TypeError) clause caught it and re-emitted as a 2.1.18 "Failed to parse timestamp data" violation — double-reporting alongside the 2.1.19 missing- file diagnostic that already covers the same structural complaint. Skip None entries silently; 2.1.19 owns the surface diagnostic. Adds 2 regression tests: single None entry mixed with well-formed runs, and all-None run_files. Both must not raise and must not produce a 2.1.18 parse-failure violation.
…y/metadata
Eleven per-timestamp methods in CheckpointingCheck called summary.get(...)
or metadata.get(...) without guarding the case where Loader returns None
for a missing summary.json / metadata.json. The v2.0 reviewer corpus run
produced 273 AttributeError tracebacks (13 methods × ~21 affected
submitters), all escaping run_checks() into the log.
Centralize the guard as a generator method _iter_valid_files() that
silently skips None entries (the underlying structural cause is already
reported under 2.1.22 / 2.1.25 by SubmissionStructureCheck). Replace
every "for summary, metadata, _ in self.submissions_logs" loop header
with "for summary, metadata, _ in self._iter_valid_files()".
Methods covered (11 total):
checkpoint_data_size_ratio 4.3.1
fsync_verification 4.3.2
model_configuration_req 4.3.3
closed_mpi_processes 4.6.1
closed_accelerators_per_host 4.6.2
aggregate_accelerator_memory 4.3.4
closed_checkpoint_parameters 4.6.3
checkpoint_path_args 4.4.1
subset_run_validation 4.3.5
open_mpi_processes 4.6.4
checkpoint_filesystem_check 4.4.2
CHKPT-02 / 03 / 04 (cache_flush_validation, total_test_duration,
remapping_time_reporting) already handled None via the (x or {}).get(...)
pattern and via _pair_checkpoint_runs's defensive None skip — left as-is.
Adds 23 regression tests: generator filter behavior + parametrized
all-None / mixed-None for each of the 11 guarded methods.
… 7 loop sites BUG-T2 covered the 6 sites surfaced by the first validate run, but the corpus-wide re-run unmasked 5 more methods (training_checks.py:88, 512, 563, 593, 642) plus datagen_minimum_size's two internal loops — all crashing with the same AttributeError pattern when metadata.json is missing. Sites guarded (matches BUG-T2's inline pattern): verify_datasize_usage 3.1.1 line 86 datagen_minimum_size 3.2.1 lines 206 + 215 closed_submission_parameters 3.6.2 line 512 open_submission_parameters 3.6.3 line 563 mlpstorage_path_args 3.4.1 line 592 mlpstorage_filesystem_check 3.4.2 line 640 Adds 12 parametrized regression tests (all-None + mixed-None) covering the 6 newly-guarded methods.
…run/ dir run_results_json_check called list_files(self.run_path) directly. When training/<workload>/run/ is absent — a structural violation already reported under 2.1.12 by SubmissionStructureCheck — utils.list_files raised FileNotFoundError that escaped run_checks() into 39 tracebacks in the v2.0 corpus run (also accounts for the 8 distinct "training/<wl>/run" FileNotFoundError messages counted separately — same root cause). Probe os.path.isdir(self.run_path) first; emit a 2.1.16 violation pointing at the missing run/ dir and continue rather than escape. Adds 3 tests: missing-run-dir surface complaint, positive path, present-but-empty run/ dir.
- test_main_warnings: patch TrainingBenchmark at its real module — it is now lazy-imported inside run_benchmark(), so the symbol no longer lives on mlpstorage_py.main. - test_benchmark_flow: --what-if was renamed to --dry-run (#412) and the fail-fast MPI/DLIO check in dlio.py keys on args.dry_run. Set dry_run on the test args, point data_dir at tmp_path so the constructor does not mkdir /data, and pass the required `command` arg to generate_command(). - test_ab_comparison: restore the library_name pytest fixture (added on another branch in 9234f76 but never merged here). - test_multi_endpoint_integration: initialize DLIOMPI before constructing S3dlioStorage — its parent storage_handler calls DLIOMPI.size() during __init__, which asserts MPI was initialized. Also skip gracefully when s3dlio is not installed.
|
Please review: @pgmpablo157321 I cloned https://github.com/mlcommons/submissions_storage_v2.0 and ran this code against that repo. |
|
NOTE: this is still using the v2.0 definitions of workloads: unet3d, cosmoflow, and resnet50. I will update the Ruled.md with the latest training workloads (unet3d, and retinanet) in a few days, including updating this code to accept those workload names. The structure of the validator will be unchanged, but the workloads and accelerators will be updated. |
…dvanced usage Refresh the README to match the closed/open/whatif mode tree, fix the validator default version (v3.0, derived from package major.minor; was incorrectly v5.1), and add an Advanced Usage section covering object storage, YAML/--params overrides, lockfile gating, history replay, reportgen, and CI exit codes.
….1 invocation structure (#435) Issue #434: ``mlpstorage closed checkpointing run`` did not expose --num-checkpoints-write / --num-checkpoints-read, so closed submitters could not split the run into a write phase and a read phase with a filesystem cache flush in between — which Rules.md 4.7.1 requires when the client can cache the entire checkpoint set. CLI (layer A): - Move both flags out of the open-gated block in checkpointing_args.py into _add_checkpointing_core_args; defaults stay at 10/10 in every mode so existing single-invocation usage is unchanged. - validate_checkpointing_arguments now rejects, in mode=closed, any value other than {10, 0} and rejects the both-zero combination. - Update help_formatter.py CK_DATASIZE_CLOSED / CK_RUN_CLOSED to list both flags and explain the two-invocation split. Rules (layer B — live BenchmarkVerifier): - Add end_datetime to BenchmarkRunData/BenchmarkRun and populate it from summary.json's "end" field in DLIOResultParser. - Add CheckpointSubmissionRulesChecker.check_invocation_structure, which (for CLOSED runs only) requires either a single invocation with writes=10/reads=10 or exactly two invocations of 10/0 then 0/10 with a non-overlapping inter-phase gap of at most 30 seconds. Standalone validate (layer C — mlpstorage validate): - Add CheckpointingCheck.checkpoint_invocation_structure alongside the existing cache_flush_validation. The new method enforces the strict structural pattern (1 vs 2 invocations, exact 10/10 single-run or 10/0 -> 0/10 pair, no-overlap), while cache_flush_validation continues to enforce the 30-second upper bound on the inter-phase gap. Both fire under rule 4.7.1. Rules.md: - Rewrite 4.7.1 so the "if a flush is needed" phrasing is unambiguous: a flush is required only when the client has enough memory to cache the entire checkpoint set written in that run (rule of thumb: total per-client checkpoint size < 3x client RAM). Fix the prior typo that said the read phase was started with --num-checkpoints-write=0. - Update Table 3 so --num-checkpoints-write/-read read "Only 10 or 0" in CLOSED and "YES" in OPEN, with a footnote pointing at 4.7.1 for the two-invocation cache-flush workflow. Tests: - Replace the two test_cli.py tests that asserted closed-mode rejection at parse time with tests covering the new layer-A behavior: closed accepts 0/10 and 10/0 at parse time; validate_checkpointing_arguments rejects 20-valued counts and the both-zero combination. Confirmed against ../../v2-results/submissions_storage_v2.0: zero structural violations on the v2.0 archive (every closed submission used either single-run 10/10 or paired 10/0 -> 0/10). 26 submissions trip the existing 30-second gap check, which is the option-(b) outcome we expected — submitters who batch-queued the two invocations on HPC clusters drove their inter-phase gaps from minutes to a full half-hour.
…ples Pydantic models now inherit from a StrictModel base with extra="forbid", so typos in submission YAML (e.g. power_capacity_watts vs. nameplate_power_watts) surface as validation errors instead of being silently dropped. Brought the bundled example_*.yaml files in line with the actual schema by adding the previously-missing inlet_voltage / nameplate_power_watts PSU fields and removing the stale power_capacity_watts field. Updated the Yamale schema.yaml to match the pydantic source of truth: swapped power_capacity_watts for inlet_voltage + nameplate_power_watts and expanded the PSU efficiency enum to the full eight-value set. Added tests/unit/test_example_system_descriptions.py to validate every example_*.yaml against the live schema, plus a TestExtraFieldsForbidden class to fence the new strict-extras behavior against regression.
System YAML schema violations previously logged only the Pydantic message
("Field required", "Input should be a valid list"), discarding the loc
path and offering no pointer into the source file. validate_file now
re-parses each YAML with yaml.compose() and resolves every error's loc
tuple to a 1-indexed source line; the schema check emits both the field
path and "(line N)" alongside the message. The logging format also drops
the noisy "[filename:lineno LEVEL]" prefix in favor of plain "[LEVEL]"
so the rule-id and path are the first things on every line.
example_drive.yaml is updated to a realistic single-drive NVMe submission
(Supermicro / Sandisk SN861) that round-trips through the schema.
2.1.22/2.1.23 in directory_checks.py iterated one level past what loader.py:103 actually yields (loader_metadata.folder is already the workload dir), so they reported wrong paths and flagged dlio_config/ as a malformed timestamp dir. Treat self.checkpointing_path as the workload dir directly. Rules.md 2.1.23 said "exactly ten timestamp directories", which contradicted Rules.md 4.7.1's "1 or 2 invocations" (one invocation == one timestamp dir). The "ten" refers to the ten checkpoint operations performed in aggregate, not to ten directories. Update both Rules.md and the rule code to require 1 or 2. 2.1.24 compared dir-name gap against per-invocation duration, which misfires on every two-invocation submission (the natural write->read gap is intentionally larger than either phase). Recast the check to compare end-of-first-invocation -> start-of-second-invocation against the slower invocation's duration. 4.7.1 still owns the 30-second upper bound. 2.1.9 used int(yaml_nameplate_GB) == int(summary_usable_GiB), which fires on every well-formed submission (usable RAM is ~95-98% of nameplate after DDR/kernel overhead). Allow +/-10%. base.py: demote BaseCheck "Starting X for: Y" and "All X checks passed for: Y" from INFO to DEBUG so a passing run produces only the final SUMMARY line. Keep "Some X Checks failed" at ERROR. Tests: - conftest.py fixture now builds 2 checkpointing timestamp dirs (the canonical write+read shape) instead of 10. - test_directory_check_retrofit.py uses checkpointing_path = the workload dir (matching the loader contract) and reflects the 1-or-2 expected count. - test_submission_checker_structure.py asserts 2 dirs. - test_definition_of_done.py anchors its loader-loop sentinel on any [3.* / [4.* rule record instead of the demoted "Starting" INFO line. scripts/fix_micron_9550_15tb.py: helper that normalises the Micron 9550_15TB sample submission so it passes the validator end-to-end (stub files, datagen dedupe, warm-up run, df blocks, host_memory_GB normalisation, plus self-healing for prior buggy script runs).
|
FYI -- @idevasena @russfellows @dslik @bbelgodere Please review and approve this monster at your earliest convenience! I'll merge in the PR for going to "unet3d, retinanet" for training as soon as this one goes in. Pablo can then clone this point in the repo history rather than head to get his testing done. |
Summary
Extends
mlpstorage_py/submission_checkerso it implements every validation rule documented inRules.mdfor the MLPerf™ Storage v2.0 benchmark, and exposes the checker through a new top-level CLI subcommand:Both subcommands slot in as siblings to
reports/history/lockfile/versionunder the closed/open/whatif restructure (PR #412), and both work on a base install — benchmark deps are now lazy-loaded.The PR retains the existing
BaseChecksubclass architecture: every new rule is bound via an@rule(rule_id, rule_name)-decorated method on the appropriateChecksubclass, and every violation is reported with theRules.mdrule ID prefix so submitters can grep the spec from any error message.What's covered
SubmissionStructureCheckrunning once before the per-benchmark loader loop; covers submitter directory naming, MLPerf version dir layout, results-dir cardinality, code-tree MD5, and theRules.md 2.1.27directoryDiagram pictorial rule (bound via a no-op@rulemethod so coverage reports it as covered structurally).TrainingCheck, includingTRAIN-01(per-host accelerator parity),TRAIN-02(filesystem separation viadf), and the deferred-stub rules 3.3.5 / 3.3.7 (Rules.md flags both as "not clear how to check").CheckpointingCheck, plus the newCHKPT-06checkpointFilesystemCheck(Rules.md 4.4.2) that was missing from the existing requirements.4.3.1is routed through a newwarn_violation(advisory) channel; the rest use the standardlog_violationchannel.SystemYamlSchemaCheck— a new top-level check that schema-validates everysystems/<name>.yamlonce before the loader loop, with theSCHEMA_ERROR_RULE_MAPfield→rule-ID dict carrying Rules.md IDs all the way through to per-field schema errors.VdbCheck/KVCacheCheck) — extension points registered withMODE_TO_CHECKERSso future Rules.md §5/§6 fill-in lands symmetrically.mlpstorage rules-coverage(and the underlyingtools/rules_coverage.py) reconciles every Rules.md §2/§3/§4 ID against the live@rulebindings +OUT_OF_SCOPE_RULES+STUB_COVERAGE+SCHEMA_ERROR_RULE_MAPregistries; exits1on any unmapped live ID.Two prerequisite bug fixes surfaced during development
mlpstorage_py/submission_checker/__main__.pysopython -m mlpstorage_py.submission_checkerworks (it previously failed silently).args.submittershandling insubmission_checker/main.py— the pre-existingstr(args.submitters).split(",")produced["None"]when--submitterswas absent, silently filtering every real submitter out of the loader loop and making the validator a no-op for the default invocation.Sample-submission shakeout fixes (commit 93b3a2f)
Validating the
Micron/9550_15TBsample submission end-to-end surfaced a handful of rule-code defects and inter-rule contradictions that this PR also fixes:directory_checks.py) — both methods iteratedlist_dir(self.checkpointing_path)and treated each timestamp dir as a workload dir, even thoughloader.py:103already yields the workload dir asloader_metadata.folder. That reported wrong paths and bogus "Invalid timestamp format 'dlio_config'" diagnostics. Now treatsself.checkpointing_pathas the workload directly.Rules.md 2.1.23said "exactly ten timestamp directories", but4.7.1mandates 1 or 2 invocations and each invocation produces exactly one timestamp dir. The "ten" referred to the ten checkpoint operations performed in aggregate, not to ten directories. Updated bothRules.mdand the rule code to require 1 or 2 (matching4.7.1).end-of-first-invocation → start-of-second-invocationagainst the slower invocation's duration.4.7.1still owns the 30-second upper bound.int(yaml_nameplate_GB) == int(summary_usable_GiB)fires on every well-formed submission (usable RAM is ~95-98% of nameplate after DDR/kernel overhead). Now uses a ±10% band: catches genuine config mismatches (e.g. 256 GB rig reporting against a 768 GB YAML — ~3× off) while letting nameplate-vs-usable through.Validator output UX
BaseCheck.__call__previously emitted[INFO] Starting <name> for: <path>and[INFO] All <name> checks passed for: <path>per check class per benchmark, generating ~16 wrapper lines per system before any actionable content. Both are demoted to DEBUG so a passing run produces just the finalSUMMARY: submission looks OKline.Some <name> Checks failed for: <path>stays at ERROR — it's a useful failure-output transition marker, not status scaffolding.Fixture and test alignment
conftest.pybuild_submissionnow builds 2 checkpointing timestamp dirs (the canonical write+read shape) instead of 10.test_directory_check_retrofit.pyusescheckpointing_path = the workload dir(matching the loader contract) and reflects the 1-or-2 expected count.test_submission_checker_structure.pyasserts 2 dirs.test_definition_of_done.pyanchors its loader-loop sentinel on any[3.*/[4.*rule record instead of the demoted "Starting" INFO line — more robust to log-level changes.Sample data:
scripts/fix_micron_9550_15tb.pyReproducible helper that normalises the
Micron/9550_15TBsample submission so the validator passes end-to-end. Operations:output.json,per_epoch_stats.json,summary.json, logs, anddlio_config/files per the §2.1.14 / §2.1.19 / §2.1.25 file contracts.run/results.json(training) and<workload>/results.json(checkpointing).run/(§2.1.17).num_files_trainto the reference value for resnet50 and unet3d (§3.3.1).dfblock to every*_run.stdout.log(§3.4.2 / §4.4.2).llama3-1t'shost_memory_GBto the dominant rig's value so a singlememory_capacityin the system YAML covers every run.summary.endvalues left over from a prior (buggy) version of the script.CLI structure
validateandrules-coverageare top-level utility siblings — not nested underclosed/open/whatif, because validation runs against an already-built submission tree rather than executing a benchmark in a particular submission mode. This matches howreports,history,lockfile, andversionare wired today.The
submission_checker.mainandtools/rules_coveragemodules now expose arun(args)public entry; the standalone CLI entries still parseargvviaget_args()and delegate torun(), sopython -m mlpstorage_py.submission_checkerand direct-tool invocations keep working.Known limitations / Phase-4 follow-ups
build_submissionfixture fires ~17 pre-existing structural noise records inherited from the Phase-1/2 retrofit expansion. Cleaning the noise floor and re-tightening to set equality is a follow-up cleanup that touches conftest rather than checker code.WR-02,WR-05,WR-08,WR-10,WR-13,IN-01) were deferred out of the one-shot code-review fix loop — rationale captured in the SUMMARY files.TODO-002(CHKPT-05 runtime cross-check) andTODO-001(runtimedfcapture in the mlpstorage CLI) remain open and are tracked separately.Test plan
python3 -m pytest mlpstorage_py/tests -q -m "not integration"— 506 passed, 1 xfailed (the W-03 4.3.1 conftest-extension lock). The 34 failures in this environment are pre-existing onorigin/main(same 34 fail without this PR's changes) — they need optional deps (pyarrow / pydantic / pymilvus) that aren't part of the base test environment.python3 -m pytest mlpstorage_py/tests -m integration— 4 passed (the Definition-of-Done end-to-end suite, including the bad-fixture origin-context binding lock).mlpstorage validate --help— usage prints correctly under the new closed/open/whatif top-level parser.mlpstorage rules-coverageagainst a realRules.md— exits0, reconciles every live ID.mlpstorage validate <dir>— runs the full checker pipeline; reported violations carry rule IDs.mlpstorage validate ../../sample_data—Micron/9550_15TBpasses after runningscripts/fix_micron_9550_15tb.py(zero[ERROR]records for that system)..planning/artifacts in the PR (verified viagit diff --name-only).