Skip to content

test: Parquet merge pipeline verification suite#6369

Open
g-talbot wants to merge 15 commits intomainfrom
gtt/parquet-merge-verification
Open

test: Parquet merge pipeline verification suite#6369
g-talbot wants to merge 15 commits intomainfrom
gtt/parquet-merge-verification

Conversation

@g-talbot
Copy link
Copy Markdown
Contributor

@g-talbot g-talbot commented May 2, 2026

Summary

End-to-end verification stack for the Parquet merge pipeline. Closes the
loop from formal spec → exhaustive model → production runtime checks →
trace-conformance proof, so a divergence between production and the
verified model surfaces as a test failure or a Datadog metric, not a
silent corruption.

Verification pyramid (single source of truth)

TLA+ specs (docs/internals/specs/tla/)
        │ mirrors
Shared invariants (quickwit-dst/src/invariants/merge_pipeline.rs)  ← single Rust module
        │ used by                          │ called by
Stateright models                Production check_invariant! sites
        │                                  │
        └────── Trace conformance test ────┘

The Stateright model and production code evaluate the same Rust
predicate functions on the same state struct. Drift impossible by
construction.

What landed

TLA+ — MergePipelineShutdown.tla
Extended from the original orderly-shutdown spec to cover crash and
restart at every transition. CompleteMerge split into UploadMergeOutput

  • PublishMergeAndFeedback so a crash between upload and metastore
    replace is reachable (orphan_outputs failure mode). Multi-lifetime
    support (Restart can fire after graceful shutdown, modeling fresh
    process invocations). New invariants:
  • RowsConserved (MP-4): sum of rows in published splits = total ingested rows
  • LeakIsObjectStoreOnly (MP-10): orphans never become durable data
  • BoundedWriteAmp (MP-5), MP1_LevelHomogeneity, NoOrphanInPlanner (MP-8), NoOrphanWhenConnected (MP-9)
  • Action property RestartReSeedsAllImmature (MP-11): every Restart re-seeds correctly
  • Liveness NoPersistentOrphan: orphans always eventually clear

Two TLC configs: primary (multi-lifetime, ~16K states, 1s) and chains
(deep merge chain, ~218K states, 12s). 11 invariants + 3 temporal
properties verified.

Shared invariants — quickwit-dst/src/invariants/merge_pipeline.rs
New MergePipelineState struct mirrors the TLA+ VARIABLES block
literally; used directly by the Stateright model (no parallel
definition). 9 pure-function predicates (MP-1, MP-4..MP-11) with unit
tests for passing and failing paths.

Stateright — quickwit-dst/src/models/merge_pipeline.rs
Refactored to use the shared state and predicates. Mirrors the TLA+
multi-phase merge and multi-lifetime semantics. Three configs (small,
multi_lifetime, deep_chains) all pass.

Production runtime checks — quickwit-indexing/.../metrics_pipeline/
Eight new check_invariant! call sites, all forwarding to the existing
DogStatsD recorder (pomsky.invariant.{checked,violated}{invariant=MP-X}):

  • ParquetMergeExecutor — MP-1 (level homogeneity), MP-2 (≥2 inputs), MP-3 (scope), MP-4 (input rows == output rows)
  • ParquetMergePlanner::new — MP-11 (post-restart re-seed completeness, the formal "fetch_immature_splits is correct" claim)

Trace conformance — parquet_merge_pipeline_trace_conformance_test.rs
Production emits a MergePipelineEvent stream at every modeled
transition (Restart, PlanMerge, UploadMergeOutput,
PublishMergeAndFeedback, DisconnectMergePlanner,
RunFinalizeAndQuit, etc.). The test replays the captured events
through the same predicates Stateright verifies. Two scenarios pass:

  • Normal path: 4 splits → cascade of merges → mature output
  • Crash mid-cascade: publish failure forces respawn; production re-seeds
    cleanly across the crash boundary; rows conserved end-to-end

A divergence between production and the verified model would surface
here as a predicate failure with the offending event in the panic
message.

Original verification suite (still here)
Property test, sketch integration test, crash/restart integration test,
multi-round merge integration test — unchanged.

What's deferred

  • CS-3 (no compaction before threshold): the TLA+ model has it, but
    production ParquetMergePolicyConfig doesn't expose
    compaction_start_time yet. TODO in
    parquet_merge_planner::record_splits_if_necessary so the check lands
    alongside the feature.
  • MP-5..MP-10 lifted to runtime: currently checked only by Stateright +
    trace test; would need a periodic background task to materialize the
    full MergePipelineState for runtime evaluation. Not added because
    the trace-conformance proof already covers them.

Coverage table (23 invariants in registry)

  • 13/23 runtime-checked in production via check_invariant!
  • 6/23 (MP-5..MP-10) verified by Stateright + trace conformance
  • 4/23 (SS-3, SS-4, DM-4, DM-5) by construction (unfalsifiable runtime check)

Sesh-mode rule

Added a strong rule to .claude/skills/sesh-mode/SKILL.md: when a
verification check fails, never silently weaken or remove the invariant.
Almost always either (a) the implementation has a real bug, or (b) the
property is over-strong — and the failing trace is revealing the
weaker safety property the design does guarantee.

Test plan

  • TLC: MergePipelineShutdown.cfg passes (~16K states, 1s); _chains.cfg passes (~218K states, 12s); 11 invariants + 3 temporal properties
  • cargo nextest run -p quickwit-dst --features model-checking — 49 tests pass + 1 ignored deep_chains passes when run
  • cargo nextest run -p quickwit-indexing --features metrics metrics_pipeline — 54 tests pass (52 prior + 2 new trace conformance)
  • Clippy clean (-Dwarnings)
  • Nightly fmt clean
  • cargo doc --no-deps -p quickwit-indexing --features metrics clean
  • cargo machete clean

🤖 Generated with Claude Code

g-talbot and others added 7 commits May 1, 2026 21:53
Property test verifies that no merge task ever contains a split that
should have been filtered: mature by ops (>= max_merge_ops), mature by
size (>= target), time-matured (created_at + maturation_period <= now),
or missing a window. Generates random splits across the maturity
boundary and tests at the actor level.

Also makes test helpers pub(super) so sibling test modules (sketch,
crash, multi-round) can reuse them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that sketch splits dispatch to the correct metastore RPCs
(stage_sketch_splits, publish_sketch_splits) and that the merged output
has ParquetSplitKind::Sketches with correct metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crash/restart test:
- Injects publish failure on 2nd call → pipeline detects failure,
  kills actors, respawns
- Verifies list_metrics_splits called on respawn (re-seeding)
- Verifies pipeline generation >= 2 (respawn occurred)
- Verifies original splits eventually replaced (no data loss)

Multi-round merge test:
- 4 input splits → 2 round-1 merges → 1 round-2 merge
- Verifies num_merge_ops=2 on final output
- Verifies all original + intermediate splits replaced
- Verifies MC-1: total rows preserved across lifecycle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models the two-phase shutdown: DisconnectMergePlanner breaks feedback,
RunFinalizeMergePolicyAndQuit drains cold windows, then in-flight
merges complete.

Safety invariants:
- NoSplitLoss: every merge input is published or in-flight
- NoDuplicateMerge: no split in multiple concurrent merges
- FinalizeWithinBound: at most MaxFinalizeOps finalize operations
- ShutdownOnlyWhenDrained: shutdown requires empty in-flight set

Liveness:
- ShutdownEventuallyCompletes under weak fairness

Two configs: _small (hundreds of states) and full (larger space).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model checks 5 properties across ingest, merge, crash, and restart:
- MC-1 lifecycle: total rows preserved (no loss/duplication)
- Bounded WA: merge_ops never exceeds max_merge_ops
- No duplicate merge: no split in multiple concurrent merges
- No orphan after restart: all immature splits re-seeded
- MP-1: level homogeneity (by construction)

Small model (~instant), full model (~seconds, gated behind #[ignore]).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Document the actual tla2tools.jar path from brew cask tla+-toolbox
- Add "Run All Specs" one-liner for quick verification
- Add spec catalog with state counts and invariant names
- Remove stale references to brew formula (doesn't exist)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot requested a review from mattmkim May 2, 2026 02:04
g-talbot and others added 8 commits May 4, 2026 07:04
CI clippy lint (`clippy::question-mark`, -D warnings) rejected the
explicit match-and-return-None pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ifetime

Phase A of the verification-gap closure (PR #6369). The original spec
only modeled orderly two-phase shutdown — no Crash action, no Restart,
no row-conservation claim. The four invariants were either trivially
true by construction or static set-membership, so the spec proved
nothing about the failure modes that motivated it.

This rewrite:

- Adds Crash and Restart actions. Crash invalidates in-memory state at
  any point in the pipeline; uploaded-but-unpublished merge outputs
  become orphan_outputs. Restart re-seeds cold_windows from durable
  published_splits (models fetch_immature_splits).

- Splits CompleteMerge into UploadMergeOutput then PublishMergeAndFeedback
  so a Crash between the two phases is reachable. This is what catches
  the "leaked but invisible" failure mode — output blob in S3 with no
  metastore reference. The model proves these are object-store-only
  garbage (LeakIsObjectStoreOnly), not durable data loss.

- Adds split_rows / split_merge_ops / split_window ghost functions so
  RowsConserved expresses the strong "no data loss, no duplication"
  property: total ingested rows equal sum of rows in published_splits
  in every reachable state.

- Adds NoOrphanWhenConnected (state invariant), RestartReSeedsAllImmature
  (action property) and NoPersistentOrphan (liveness leads-to). Together
  these capture: while connected the planner sees every immature split;
  every Restart correctly recovers all immature splits; in any run with
  restart budget remaining, orphans always eventually clear.

- Extends Restart to fire after graceful shutdown too (multi-lifetime),
  so the cross-process recovery claim is explicit. Bounded by MaxRestarts.

The replaced NoOrphanAfterRestart invariant fired during initial TLC
runs — the failing trace was a *legitimate* state in production
(publish during shutdown disconnect leaves the output untracked by the
planner until the next process invocation). The fix wasn't to remove
the invariant but to split the safety claim from the recovery claim:
NoOrphanWhenConnected for the steady state, RestartReSeedsAllImmature
for the recovery transition, NoPersistentOrphan for cross-lifetime
liveness. Recorded as a sesh-mode rule: never silently weaken an
invariant that produced a counter-example.

Single config drops the small/full split — full config now runs in
~12s on a workstation (217,854 distinct states, 11 invariants + 3
temporal properties verified). The states/ directory dropped by TLC
is now gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The primary `MergePipelineShutdown.cfg` is now optimized for multi-lifetime
exercise (MaxIngests=3, MaxRestarts=2) — 15,732 states in ~1s — covering
the cross-process recovery claim added in the previous commit.

The new `MergePipelineShutdown_chains.cfg` keeps MaxIngests=4 with
MaxRestarts=1 for deeper merge-chain coverage (217,854 states, ~10s):
exercises level-0 → level-1 → level-2-mature compaction so BoundedWriteAmp
is checked across the full chain and concurrent in-flight merges interact.

Combined run is ~11s. Both share the same invariant + property set; only
the constants differ. The TLA+ run-all loop in CLAUDE.md now globs every
.cfg and resolves the matching .tla, supporting any number of configs
per spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ariants

Phase B of the verification-gap closure (PR #6369). Eliminates drift
between the merge-pipeline model and any production-side checks by
making them evaluate the *same* Rust functions on the *same* state.

quickwit-dst/src/invariants/merge_pipeline.rs (new):
- MergePipelineState struct mirroring the TLA+ VARIABLES block
  (planner_alive, in_flight_merges, cold_windows, published_splits,
  splits, orphan_outputs, lifecycle counters, total_ingested_rows ghost)
- Pure-function predicates corresponding 1:1 to TLA+ invariants:
  rows_conserved, bounded_write_amp, no_split_loss, no_duplicate_merge,
  no_orphan_in_planner, no_orphan_when_connected, leak_is_object_store_only,
  mp1_level_homogeneity, restart_re_seeds_all_immature
- Helper orphan_set mirrors the TLA+ OrphanSet operator
- 17 unit tests covering passing/failing paths for each predicate

quickwit-dst/src/invariants/registry.rs:
- Adds MP4..MP11 invariant IDs (preserved short-code naming convention)
- Each maps to a TLA+ invariant from MergePipelineShutdown.tla

quickwit-dst/src/models/merge_pipeline.rs:
- State type is now invariants::merge_pipeline::MergePipelineState
  (literal sharing — no parallel definition, no conversion layer)
- CompleteMerge split into UploadMergeOutput + PublishMergeAndFeedback
  to match the TLA+ multi-phase merge completion. A Crash between phases
  orphans the upload (orphan_outputs) without losing data.
- Restart re-seeds cold_windows from durable published_splits (mirrors
  fetch_immature_splits) and resets shutdown_complete / finalize state,
  modelling a fresh process invocation. Bounded by max_restarts.
- DrainComplete added so graceful-shutdown is a distinct terminal
  action (matching TLA+).
- properties() calls the shared predicate functions instead of inline
  closures — model and runtime checks evaluate identical Rust code.
- Three configs: small (fast iteration), multi_lifetime (3 lifetimes,
  matches MergePipelineShutdown.cfg), deep_chains (level 0→1→2 mature,
  matches MergePipelineShutdown_chains.cfg).
- Restart's next_state debug_asserts MP-11 (restart_re_seeds_all_immature)
  on its post-state, encoding the action property as a runtime check.

All 50 tests pass under `cargo nextest run -p quickwit-dst --features
model-checking` (33 invariant unit tests + 4 merge-pipeline model
configs + 13 other models). Multi-lifetime checks in 0.1s, deep-chains
in 0.6s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase C of the verification-gap closure (PR #6369). Production code now
evaluates the *same* invariant predicates that the Stateright model and
TLA+ spec verify, with results forwarded to the existing DogStatsD
recorder (already wired in quickwit-cli/src/logger.rs). Every check
emits `pomsky.invariant.checked{invariant=MP-X}` and, on failure,
`pomsky.invariant.violated{invariant=MP-X}` — usable directly as
Datadog alert criteria.

ParquetMergeExecutor (post-merge, pre-uploader):
- MP-2 (HasMinimumSplits): merge has >= 2 inputs
- MP-1 (LevelHomogeneity): all inputs share num_merge_ops level
- MP-3 (ScopeCompatibility): all inputs share sort_fields and window
- MP-4 (RowsConserved): sum of input num_rows == sum of output num_rows.
  Empty-output path checks input rows are all zero (otherwise data was
  silently dropped). The non-empty path checks the strong row-preservation
  property — the same MC-1 / RowsConserved invariant from the TLA+ spec.

ParquetMergePlanner::new (post-restart re-seed):
- MP-11 (RestartReSeedsAllImmature): every split the merge policy
  classifies as still-immature (and which has a window for compaction)
  is recorded in scoped_young_splits and known_split_ids. Filters out
  mature splits, time-expired splits, and pre-Phase-31 splits before
  comparing — these are intentional drops, not the failure mode MP-11
  protects against.

Cargo.toml: quickwit-indexing now depends on quickwit-dst for the
shared invariant module and check_invariant! macro.

Also fixes a pre-existing rustdoc warning: the parquet_merge_planner
docs referenced `[`MergePlanner`]` as an intra-doc link, but the type
lives in another crate and can't resolve. Changed to plain backticks.

All 52 metrics_pipeline tests pass; clippy + nightly fmt + machete +
cargo doc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… model

Phase D of the verification-gap closure (PR #6369). The TLA+ spec and
Stateright model are now backed by *production code that actually emits
events at every modeled transition*, and a test replays those events
through the same predicates the model verifies. A divergence between
production and the model surfaces here as a predicate failure on real
production state.

This closes the loop: TLA+ → Stateright (Phase A+B) → production checks
(Phase C) → trace conformance (Phase D).

quickwit-dst/src/events/merge_pipeline.rs (new):
- MergePipelineEvent enum with one variant per modeled action, using
  production types (String split IDs, Range<i64> windows). Notably no
  Crash variant — process death cannot emit anything; crashes are
  inferred during replay by the absence of events between actions and a
  subsequent Restart.
- Pluggable observer (fn pointer) following the existing
  set_invariant_recorder pattern. No-op single-atomic-load when no
  observer is installed; production hot path stays cheap.

Production emission sites (always-on, no feature flag):
- ParquetMergePipeline::spawn_pipeline → Restart, after fetch_immature_splits
- ParquetMergePipeline::FinishPendingMergesAndShutdownPipeline →
  DisconnectMergePlanner and RunFinalizeAndQuit
- ParquetMergePlanner::send_merge_ops → PlanMerge
- ParquetUploader::handle (post-upload, pre-publish) → UploadMergeOutput
  for merge outputs
- Publisher::handle (Parquet publish path) → IngestSplit (when no
  replaced_split_ids) or PublishMergeAndFeedback (when replacing)

quickwit-indexing/.../parquet_merge_pipeline_trace_conformance_test.rs (new):
- Two scenarios:
  1. Normal happy path: 4 splits → 2 merges → 1 mature output. Replay
     verifies MP-1, MP-4..MP-10 hold at every state. ~0.6s.
  2. Crash mid-cascade: publish failure injected on the 2nd merge
     forces respawn. Trace covers Restart → PlanMerge → Upload → crash →
     Restart → re-seed → PlanMerge → Upload → Publish. ~1.1s.
- StateMirror reconstructs MergePipelineState from events using a
  production-id → model-u32 interner, applies the same predicate
  functions used by the Stateright model.
- The mock metastore tracks `staged` and `published` separately so
  list_metrics_splits returns only what real metastores would (not
  staged-but-failed). Earlier test-side bug here would have masked
  divergences; the trace test caught it immediately.

What this catches that prior tests do not:
- The crash test only verified re-seeding *happened*. The trace test
  verifies that the post-restart state preserves all formal safety
  invariants (no orphans the planner can never reclaim, no row drift,
  no level mixing on re-merge, etc.).
- A divergence between production order/atomicity and the model's
  atomic actions surfaces as either a state-mirror error ("event can't
  be applied") or a predicate failure with the offending event in the
  panic message.

Both tests pass with the current production code: production behavior
matches the TLA+ spec for the scenarios exercised. The test
infrastructure is now ready to surface real divergences if they exist,
and to be extended with adversarial scenarios (multiple back-to-back
crashes, finalize-during-crash, concurrency stress).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CS-3 ("splits before compaction_start_time are never compacted") is
defined in the TLA+ model and the InvariantId registry, but the
production `ParquetMergePolicyConfig` does not yet expose a
`compaction_start_time` field. Adding a runtime check now would have
nothing to filter against.

Records the gap as a planner-level TODO so the check lands alongside
the feature when it's implemented (filter splits in
`record_splits_if_necessary` + verify via `check_invariant!`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant