fix(checkpointing): allow --num-checkpoints-* in CLOSED + enforce 4.7.1 invocation structure (#434) by FileSystemGuy · Pull Request #435 · mlcommons/storage

FileSystemGuy · 2026-06-12T16:04:51Z

Summary

Fixes #434. The CLI redesign in PR #412 inadvertently hid --num-checkpoints-write / --num-checkpoints-read from CLOSED mode, which meant closed submitters could not split a checkpointing run into a write phase and a read phase with a filesystem cache flush in between — the workflow Rules.md §4.7.1 requires when the client can cache the entire checkpoint set.

This PR fixes the regression, enforces the strict structural pattern at three layers, and tightens the rule text.

Base branch: this PR is stacked on FileSystemGuy-rules-validator (PR #432) because the layer-C check plugs into the standalone mlpstorage validate tool that #432 introduces. Please merge #432 first, then re-target this PR at main.

CLI (layer A — parse time)

Move both --num-checkpoints-* flags out of the open-gated block in mlpstorage_py/cli/checkpointing_args.py into _add_checkpointing_core_args. Defaults stay at 10/10 in every mode, so existing single-invocation usage is unchanged.
validate_checkpointing_arguments now rejects, in mode=closed, any value other than {10, 0} and rejects the both-zero combination.
help_formatter.py CK_DATASIZE_CLOSED / CK_RUN_CLOSED now list both flags and explain the two-invocation split.

Live BenchmarkVerifier (layer B — end of a live run)

Add end_datetime to BenchmarkRunData/BenchmarkRun and populate it from summary.json's end field in DLIOResultParser.
Add CheckpointSubmissionRulesChecker.check_invocation_structure, which for CLOSED runs requires either a single invocation with writes=10/reads=10 or exactly two invocations of 10/0 then 0/10 with a non-overlapping inter-phase gap of at most 30 seconds.

Standalone `mlpstorage validate` (layer C — post-archive)

Add CheckpointingCheck.checkpoint_invocation_structure alongside the existing cache_flush_validation. The new method enforces the strict structural pattern (1 vs 2 invocations, exact 10/10 single-run or 10/0 → 0/10 pair, no overlap). cache_flush_validation continues to enforce the 30-second upper bound on the inter-phase gap. Both fire under rule §4.7.1.

Rules.md

Rewrite §4.7.1 so the "if a flush is needed" phrasing is unambiguous: a flush is required only when the client has enough memory to cache the entire checkpoint set written in that run (rule of thumb: total per-client checkpoint size < 3× client RAM, per checkpointing/README.md). Fix the prior typo that said the read phase was started with --num-checkpoints-write=0.
Update Table 3 so --num-checkpoints-write/--num-checkpoints-read read "Only 10 or 0" in CLOSED and "YES" in OPEN, with a footnote pointing at §4.7.1 for the two-invocation cache-flush workflow.

Tests

Replace the two test_cli.py tests that asserted closed-mode argparse rejection with tests covering the new layer-A behavior: closed accepts 0 and 10 at parse time; validate_checkpointing_arguments rejects 20-valued counts and the both-zero combination. Adds positive coverage too.

Validation against the v2.0 archive

Ran ./mlpstorage validate ../../v2-results/submissions_storage_v2.0 after the port:

0 structural-pattern violations — every CLOSED submission in v2.0 already uses either single-run 10/10 or paired 10/0 → 0/10.
26 ERROR-level §4.7.1 gap violations (cache-flush gap %.1fs exceeds 30-second limit) — submitters who batch-queued the two invocations on HPC clusters drove their inter-phase gaps from minutes to ~30 minutes. ANL, IBM, JNIST, Oracle, Simplyblock, TTA, UBIX, YanRongTech are among those affected. This is the expected option-(b) outcome: keep the 30-second rule strict, surface the offenders.

Test plan

pytest tests/unit -k "checkpoint" — 44 passed
pytest tests/unit -k "rules or benchmarkrun or extractor" — 182 passed
./mlpstorage validate ../../v2-results/submissions_storage_v2.0 — exit 1, 26 §4.7.1 ERRORs (gap), 0 structural ERRORs
Reviewers: re-run mlpstorage validate on a private staging copy if you maintain one
Reviewers: confirm the 30-second threshold is the intended bar for v3.x submissions (the v2.0 finding is informational — v2.0 is closed)

….1 invocation structure Issue #434: ``mlpstorage closed checkpointing run`` did not expose --num-checkpoints-write / --num-checkpoints-read, so closed submitters could not split the run into a write phase and a read phase with a filesystem cache flush in between — which Rules.md 4.7.1 requires when the client can cache the entire checkpoint set. CLI (layer A): - Move both flags out of the open-gated block in checkpointing_args.py into _add_checkpointing_core_args; defaults stay at 10/10 in every mode so existing single-invocation usage is unchanged. - validate_checkpointing_arguments now rejects, in mode=closed, any value other than {10, 0} and rejects the both-zero combination. - Update help_formatter.py CK_DATASIZE_CLOSED / CK_RUN_CLOSED to list both flags and explain the two-invocation split. Rules (layer B — live BenchmarkVerifier): - Add end_datetime to BenchmarkRunData/BenchmarkRun and populate it from summary.json's "end" field in DLIOResultParser. - Add CheckpointSubmissionRulesChecker.check_invocation_structure, which (for CLOSED runs only) requires either a single invocation with writes=10/reads=10 or exactly two invocations of 10/0 then 0/10 with a non-overlapping inter-phase gap of at most 30 seconds. Standalone validate (layer C — mlpstorage validate): - Add CheckpointingCheck.checkpoint_invocation_structure alongside the existing cache_flush_validation. The new method enforces the strict structural pattern (1 vs 2 invocations, exact 10/10 single-run or 10/0 -> 0/10 pair, no-overlap), while cache_flush_validation continues to enforce the 30-second upper bound on the inter-phase gap. Both fire under rule 4.7.1. Rules.md: - Rewrite 4.7.1 so the "if a flush is needed" phrasing is unambiguous: a flush is required only when the client has enough memory to cache the entire checkpoint set written in that run (rule of thumb: total per-client checkpoint size < 3x client RAM). Fix the prior typo that said the read phase was started with --num-checkpoints-write=0. - Update Table 3 so --num-checkpoints-write/-read read "Only 10 or 0" in CLOSED and "YES" in OPEN, with a footnote pointing at 4.7.1 for the two-invocation cache-flush workflow. Tests: - Replace the two test_cli.py tests that asserted closed-mode rejection at parse time with tests covering the new layer-A behavior: closed accepts 0/10 and 10/0 at parse time; validate_checkpointing_arguments rejects 20-valued counts and the both-zero combination. Confirmed against ../../v2-results/submissions_storage_v2.0: zero structural violations on the v2.0 archive (every closed submission used either single-run 10/10 or paired 10/0 -> 0/10). 26 submissions trip the existing 30-second gap check, which is the option-(b) outcome we expected — submitters who batch-queued the two invocations on HPC clusters drove their inter-phase gaps from minutes to a full half-hour.

github-actions · 2026-06-12T16:05:05Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

…-runs

FileSystemGuy requested a review from a team as a code owner June 12, 2026 16:04

Merge branch 'FileSystemGuy-rules-validator' into FileSystemGuy-chkpt…

f89e788

…-runs

FileSystemGuy merged commit 530416d into FileSystemGuy-rules-validator Jun 12, 2026
2 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 12, 2026

FileSystemGuy deleted the FileSystemGuy-chkpt-runs branch June 13, 2026 00:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(checkpointing): allow --num-checkpoints-* in CLOSED + enforce 4.7.1 invocation structure (#434)#435

fix(checkpointing): allow --num-checkpoints-* in CLOSED + enforce 4.7.1 invocation structure (#434)#435
FileSystemGuy merged 2 commits into
FileSystemGuy-rules-validatorfrom
FileSystemGuy-chkpt-runs

FileSystemGuy commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

FileSystemGuy commented Jun 12, 2026

Summary

CLI (layer A — parse time)

Live BenchmarkVerifier (layer B — end of a live run)

Standalone mlpstorage validate (layer C — post-archive)

Rules.md

Tests

Validation against the v2.0 archive

Test plan

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Standalone `mlpstorage validate` (layer C — post-archive)

github-actions Bot commented Jun 12, 2026 •

edited

Loading