fix(checkpointing): allow --num-checkpoints-* in CLOSED + enforce 4.7.1 invocation structure (#434)#435
Merged
FileSystemGuy merged 2 commits intoJun 12, 2026
Conversation
….1 invocation structure Issue #434: ``mlpstorage closed checkpointing run`` did not expose --num-checkpoints-write / --num-checkpoints-read, so closed submitters could not split the run into a write phase and a read phase with a filesystem cache flush in between — which Rules.md 4.7.1 requires when the client can cache the entire checkpoint set. CLI (layer A): - Move both flags out of the open-gated block in checkpointing_args.py into _add_checkpointing_core_args; defaults stay at 10/10 in every mode so existing single-invocation usage is unchanged. - validate_checkpointing_arguments now rejects, in mode=closed, any value other than {10, 0} and rejects the both-zero combination. - Update help_formatter.py CK_DATASIZE_CLOSED / CK_RUN_CLOSED to list both flags and explain the two-invocation split. Rules (layer B — live BenchmarkVerifier): - Add end_datetime to BenchmarkRunData/BenchmarkRun and populate it from summary.json's "end" field in DLIOResultParser. - Add CheckpointSubmissionRulesChecker.check_invocation_structure, which (for CLOSED runs only) requires either a single invocation with writes=10/reads=10 or exactly two invocations of 10/0 then 0/10 with a non-overlapping inter-phase gap of at most 30 seconds. Standalone validate (layer C — mlpstorage validate): - Add CheckpointingCheck.checkpoint_invocation_structure alongside the existing cache_flush_validation. The new method enforces the strict structural pattern (1 vs 2 invocations, exact 10/10 single-run or 10/0 -> 0/10 pair, no-overlap), while cache_flush_validation continues to enforce the 30-second upper bound on the inter-phase gap. Both fire under rule 4.7.1. Rules.md: - Rewrite 4.7.1 so the "if a flush is needed" phrasing is unambiguous: a flush is required only when the client has enough memory to cache the entire checkpoint set written in that run (rule of thumb: total per-client checkpoint size < 3x client RAM). Fix the prior typo that said the read phase was started with --num-checkpoints-write=0. - Update Table 3 so --num-checkpoints-write/-read read "Only 10 or 0" in CLOSED and "YES" in OPEN, with a footnote pointing at 4.7.1 for the two-invocation cache-flush workflow. Tests: - Replace the two test_cli.py tests that asserted closed-mode rejection at parse time with tests covering the new layer-A behavior: closed accepts 0/10 and 10/0 at parse time; validate_checkpointing_arguments rejects 20-valued counts and the both-zero combination. Confirmed against ../../v2-results/submissions_storage_v2.0: zero structural violations on the v2.0 archive (every closed submission used either single-run 10/10 or paired 10/0 -> 0/10). 26 submissions trip the existing 30-second gap check, which is the option-(b) outcome we expected — submitters who batch-queued the two invocations on HPC clusters drove their inter-phase gaps from minutes to a full half-hour.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #434. The CLI redesign in PR #412 inadvertently hid
--num-checkpoints-write/--num-checkpoints-readfrom CLOSED mode, which meant closed submitters could not split a checkpointing run into a write phase and a read phase with a filesystem cache flush in between — the workflow Rules.md §4.7.1 requires when the client can cache the entire checkpoint set.This PR fixes the regression, enforces the strict structural pattern at three layers, and tightens the rule text.
Base branch: this PR is stacked on
FileSystemGuy-rules-validator(PR #432) because the layer-C check plugs into the standalonemlpstorage validatetool that #432 introduces. Please merge #432 first, then re-target this PR atmain.CLI (layer A — parse time)
--num-checkpoints-*flags out of the open-gated block inmlpstorage_py/cli/checkpointing_args.pyinto_add_checkpointing_core_args. Defaults stay at10/10in every mode, so existing single-invocation usage is unchanged.validate_checkpointing_argumentsnow rejects, inmode=closed, any value other than{10, 0}and rejects the both-zero combination.help_formatter.pyCK_DATASIZE_CLOSED / CK_RUN_CLOSED now list both flags and explain the two-invocation split.Live BenchmarkVerifier (layer B — end of a live run)
end_datetimetoBenchmarkRunData/BenchmarkRunand populate it fromsummary.json'sendfield inDLIOResultParser.CheckpointSubmissionRulesChecker.check_invocation_structure, which for CLOSED runs requires either a single invocation withwrites=10/reads=10or exactly two invocations of10/0then0/10with a non-overlapping inter-phase gap of at most 30 seconds.Standalone
mlpstorage validate(layer C — post-archive)CheckpointingCheck.checkpoint_invocation_structurealongside the existingcache_flush_validation. The new method enforces the strict structural pattern (1 vs 2 invocations, exact10/10single-run or10/0→0/10pair, no overlap).cache_flush_validationcontinues to enforce the 30-second upper bound on the inter-phase gap. Both fire under rule §4.7.1.Rules.md
checkpointing/README.md). Fix the prior typo that said the read phase was started with--num-checkpoints-write=0.--num-checkpoints-write/--num-checkpoints-readread"Only 10 or 0"in CLOSED and"YES"in OPEN, with a footnote pointing at §4.7.1 for the two-invocation cache-flush workflow.Tests
test_cli.pytests that asserted closed-mode argparse rejection with tests covering the new layer-A behavior: closed accepts0and10at parse time;validate_checkpointing_argumentsrejects20-valued counts and the both-zero combination. Adds positive coverage too.Validation against the v2.0 archive
Ran
./mlpstorage validate ../../v2-results/submissions_storage_v2.0after the port:10/10or paired10/0→0/10.cache-flush gap %.1fs exceeds 30-second limit) — submitters who batch-queued the two invocations on HPC clusters drove their inter-phase gaps from minutes to ~30 minutes. ANL, IBM, JNIST, Oracle, Simplyblock, TTA, UBIX, YanRongTech are among those affected. This is the expected option-(b) outcome: keep the 30-second rule strict, surface the offenders.Test plan
pytest tests/unit -k "checkpoint"— 44 passedpytest tests/unit -k "rules or benchmarkrun or extractor"— 182 passed./mlpstorage validate ../../v2-results/submissions_storage_v2.0— exit 1, 26 §4.7.1 ERRORs (gap), 0 structural ERRORsmlpstorage validateon a private staging copy if you maintain one