Skip to content

fix(checkpointing): allow --num-checkpoints-* in CLOSED + enforce 4.7.1 invocation structure (#434)#435

Merged
FileSystemGuy merged 2 commits into
FileSystemGuy-rules-validatorfrom
FileSystemGuy-chkpt-runs
Jun 12, 2026
Merged

fix(checkpointing): allow --num-checkpoints-* in CLOSED + enforce 4.7.1 invocation structure (#434)#435
FileSystemGuy merged 2 commits into
FileSystemGuy-rules-validatorfrom
FileSystemGuy-chkpt-runs

Conversation

@FileSystemGuy

Copy link
Copy Markdown
Contributor

Summary

Fixes #434. The CLI redesign in PR #412 inadvertently hid --num-checkpoints-write / --num-checkpoints-read from CLOSED mode, which meant closed submitters could not split a checkpointing run into a write phase and a read phase with a filesystem cache flush in between — the workflow Rules.md §4.7.1 requires when the client can cache the entire checkpoint set.

This PR fixes the regression, enforces the strict structural pattern at three layers, and tightens the rule text.

Base branch: this PR is stacked on FileSystemGuy-rules-validator (PR #432) because the layer-C check plugs into the standalone mlpstorage validate tool that #432 introduces. Please merge #432 first, then re-target this PR at main.

CLI (layer A — parse time)

  • Move both --num-checkpoints-* flags out of the open-gated block in mlpstorage_py/cli/checkpointing_args.py into _add_checkpointing_core_args. Defaults stay at 10/10 in every mode, so existing single-invocation usage is unchanged.
  • validate_checkpointing_arguments now rejects, in mode=closed, any value other than {10, 0} and rejects the both-zero combination.
  • help_formatter.py CK_DATASIZE_CLOSED / CK_RUN_CLOSED now list both flags and explain the two-invocation split.

Live BenchmarkVerifier (layer B — end of a live run)

  • Add end_datetime to BenchmarkRunData/BenchmarkRun and populate it from summary.json's end field in DLIOResultParser.
  • Add CheckpointSubmissionRulesChecker.check_invocation_structure, which for CLOSED runs requires either a single invocation with writes=10/reads=10 or exactly two invocations of 10/0 then 0/10 with a non-overlapping inter-phase gap of at most 30 seconds.

Standalone mlpstorage validate (layer C — post-archive)

  • Add CheckpointingCheck.checkpoint_invocation_structure alongside the existing cache_flush_validation. The new method enforces the strict structural pattern (1 vs 2 invocations, exact 10/10 single-run or 10/00/10 pair, no overlap). cache_flush_validation continues to enforce the 30-second upper bound on the inter-phase gap. Both fire under rule §4.7.1.

Rules.md

  • Rewrite §4.7.1 so the "if a flush is needed" phrasing is unambiguous: a flush is required only when the client has enough memory to cache the entire checkpoint set written in that run (rule of thumb: total per-client checkpoint size < 3× client RAM, per checkpointing/README.md). Fix the prior typo that said the read phase was started with --num-checkpoints-write=0.
  • Update Table 3 so --num-checkpoints-write/--num-checkpoints-read read "Only 10 or 0" in CLOSED and "YES" in OPEN, with a footnote pointing at §4.7.1 for the two-invocation cache-flush workflow.

Tests

  • Replace the two test_cli.py tests that asserted closed-mode argparse rejection with tests covering the new layer-A behavior: closed accepts 0 and 10 at parse time; validate_checkpointing_arguments rejects 20-valued counts and the both-zero combination. Adds positive coverage too.

Validation against the v2.0 archive

Ran ./mlpstorage validate ../../v2-results/submissions_storage_v2.0 after the port:

  • 0 structural-pattern violations — every CLOSED submission in v2.0 already uses either single-run 10/10 or paired 10/00/10.
  • 26 ERROR-level §4.7.1 gap violations (cache-flush gap %.1fs exceeds 30-second limit) — submitters who batch-queued the two invocations on HPC clusters drove their inter-phase gaps from minutes to ~30 minutes. ANL, IBM, JNIST, Oracle, Simplyblock, TTA, UBIX, YanRongTech are among those affected. This is the expected option-(b) outcome: keep the 30-second rule strict, surface the offenders.

Test plan

  • pytest tests/unit -k "checkpoint" — 44 passed
  • pytest tests/unit -k "rules or benchmarkrun or extractor" — 182 passed
  • ./mlpstorage validate ../../v2-results/submissions_storage_v2.0 — exit 1, 26 §4.7.1 ERRORs (gap), 0 structural ERRORs
  • Reviewers: re-run mlpstorage validate on a private staging copy if you maintain one
  • Reviewers: confirm the 30-second threshold is the intended bar for v3.x submissions (the v2.0 finding is informational — v2.0 is closed)

….1 invocation structure

Issue #434: ``mlpstorage closed checkpointing run`` did not expose
--num-checkpoints-write / --num-checkpoints-read, so closed submitters
could not split the run into a write phase and a read phase with a
filesystem cache flush in between — which Rules.md 4.7.1 requires when
the client can cache the entire checkpoint set.

CLI (layer A):
- Move both flags out of the open-gated block in checkpointing_args.py
  into _add_checkpointing_core_args; defaults stay at 10/10 in every
  mode so existing single-invocation usage is unchanged.
- validate_checkpointing_arguments now rejects, in mode=closed, any
  value other than {10, 0} and rejects the both-zero combination.
- Update help_formatter.py CK_DATASIZE_CLOSED / CK_RUN_CLOSED to list
  both flags and explain the two-invocation split.

Rules (layer B — live BenchmarkVerifier):
- Add end_datetime to BenchmarkRunData/BenchmarkRun and populate it
  from summary.json's "end" field in DLIOResultParser.
- Add CheckpointSubmissionRulesChecker.check_invocation_structure,
  which (for CLOSED runs only) requires either a single invocation
  with writes=10/reads=10 or exactly two invocations of 10/0 then
  0/10 with a non-overlapping inter-phase gap of at most 30 seconds.

Standalone validate (layer C — mlpstorage validate):
- Add CheckpointingCheck.checkpoint_invocation_structure alongside the
  existing cache_flush_validation. The new method enforces the strict
  structural pattern (1 vs 2 invocations, exact 10/10 single-run or
  10/0 -> 0/10 pair, no-overlap), while cache_flush_validation
  continues to enforce the 30-second upper bound on the inter-phase
  gap. Both fire under rule 4.7.1.

Rules.md:
- Rewrite 4.7.1 so the "if a flush is needed" phrasing is unambiguous:
  a flush is required only when the client has enough memory to cache
  the entire checkpoint set written in that run (rule of thumb: total
  per-client checkpoint size < 3x client RAM). Fix the prior typo that
  said the read phase was started with --num-checkpoints-write=0.
- Update Table 3 so --num-checkpoints-write/-read read "Only 10 or 0"
  in CLOSED and "YES" in OPEN, with a footnote pointing at 4.7.1 for
  the two-invocation cache-flush workflow.

Tests:
- Replace the two test_cli.py tests that asserted closed-mode rejection
  at parse time with tests covering the new layer-A behavior: closed
  accepts 0/10 and 10/0 at parse time; validate_checkpointing_arguments
  rejects 20-valued counts and the both-zero combination.

Confirmed against ../../v2-results/submissions_storage_v2.0: zero
structural violations on the v2.0 archive (every closed submission used
either single-run 10/10 or paired 10/0 -> 0/10). 26 submissions trip
the existing 30-second gap check, which is the option-(b) outcome we
expected — submitters who batch-queued the two invocations on HPC
clusters drove their inter-phase gaps from minutes to a full half-hour.
@FileSystemGuy FileSystemGuy requested a review from a team as a code owner June 12, 2026 16:04
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy FileSystemGuy merged commit 530416d into FileSystemGuy-rules-validator Jun 12, 2026
2 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 12, 2026
@FileSystemGuy FileSystemGuy deleted the FileSystemGuy-chkpt-runs branch June 13, 2026 00:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant