feat: implement sequencer recovery for stale batches #12
Conversation
Force-pushed from 97d62d2 to c0c0810.
Refine TLA+ model. Add more tests.
Force-pushed from f597e0d to dcd2e86.
Force-pushed from 0bab3cf to dd3a554.
- Extract DangerDetector as its own worker; submitter is pure submission.
- Unify SchedulerRules + RecoveryParams into one ProtocolConfig in core.
- Pure decide_submit_start + decide_startup_action with exhaustive tests.
- DangerZone is a deliberate RunError variant, not a BatchSubmitterError.
Force-pushed from dcc6bff to b879624.
- Transactions use read/write closures; 11 manual sites collapsed.
- internals.rs split into convert/queries/mutations; drop load_ prefix.
- pending_batches now bakes the authoritative nonce into wire bytes.
- Extract 2000-line test block from recovery.rs into a sibling file.
- Improve flusher error handling.
Force-pushed from b879624 to 4a592fe.
stephenctw left a comment:

Looks great! I've left a few minor comments.
```rust
    });
}
Ok(_) => {} // verified
Err(e) => {
```
Could we fail startup if get_chain_id() errors here? Right now we warn+continue, which can skip chain-id validation on transient RPC issues.
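To illustrate the suggestion, here is a minimal sketch of failing startup instead of warn-and-continue. The names (`StartupError`, `verify_chain_id`) are illustrative, not the PR's, and `fetched` stands in for the result of a real `get_chain_id()` RPC call:

```rust
// Hypothetical sketch: fail fast on any chain-id problem so validation
// is never silently skipped on a transient RPC error.
#[derive(Debug)]
enum StartupError {
    ChainIdRpc(String),
    ChainIdMismatch { expected: u64, actual: u64 },
}

// `fetched` models the outcome of a `get_chain_id()` call.
fn verify_chain_id(fetched: Result<u64, String>, expected: u64) -> Result<(), StartupError> {
    match fetched {
        // Propagate transient RPC failures instead of warn+continue.
        Err(e) => Err(StartupError::ChainIdRpc(e)),
        Ok(actual) if actual != expected => {
            Err(StartupError::ChainIdMismatch { expected, actual })
        }
        Ok(_) => Ok(()), // verified
    }
}

fn main() {
    assert!(verify_chain_id(Ok(10), 10).is_ok());
    assert!(matches!(
        verify_chain_id(Err("timeout".to_string()), 10),
        Err(StartupError::ChainIdRpc(_))
    ));
    assert!(matches!(
        verify_chain_id(Ok(1), 10),
        Err(StartupError::ChainIdMismatch { expected: 10, actual: 1 })
    ));
    println!("ok");
}
```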
```rust
/// zero would make preemptive recovery indistinguishable from hard
/// staleness. Callers should catch this at startup.
pub fn danger_threshold(&self) -> u64 {
    assert!(
```
This `assert!` panics on invalid operator config. Would you consider returning a typed startup config error instead (still fail-fast, just cleaner)?
```rust
batch_submitter_address = %l1_config.batch_submitter_address,
max_wait_blocks = protocol.max_wait_blocks,
preemptive_margin_blocks = protocol.preemptive_margin_blocks,
danger_threshold = protocol.danger_threshold(),
```
Related to above: calling danger_threshold() in startup logging means invalid config panics before structured error handling. Maybe validate once up front and return a typed error.
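One way to realize validate-once-up-front, as a hedged sketch: the field names mirror the log lines above, but the subtraction used for the threshold and the `ConfigError` type are assumptions, not necessarily the PR's definitions.

```rust
// Hypothetical sketch: validate the config once at startup and return a
// typed error, so danger_threshold() can stay infallible afterwards.
#[derive(Debug, PartialEq)]
enum ConfigError {
    ZeroDangerThreshold,
}

struct ProtocolConfig {
    max_wait_blocks: u64,
    preemptive_margin_blocks: u64,
}

impl ProtocolConfig {
    // Called once at startup; a zero threshold would make preemptive
    // recovery indistinguishable from hard staleness.
    fn validate(&self) -> Result<(), ConfigError> {
        if self.max_wait_blocks <= self.preemptive_margin_blocks {
            return Err(ConfigError::ZeroDangerThreshold);
        }
        Ok(())
    }

    // Safe to call unchecked only after validate() has passed.
    fn danger_threshold(&self) -> u64 {
        self.max_wait_blocks - self.preemptive_margin_blocks
    }
}

fn main() {
    let bad = ProtocolConfig { max_wait_blocks: 5, preemptive_margin_blocks: 5 };
    assert_eq!(bad.validate(), Err(ConfigError::ZeroDangerThreshold));

    let good = ProtocolConfig { max_wait_blocks: 10, preemptive_margin_blocks: 3 };
    assert_eq!(good.validate(), Ok(()));
    assert_eq!(good.danger_threshold(), 7);
    println!("ok");
}
```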
```diff
@@ -218,33 +254,31 @@ pub async fn run_preemptive_recovery(
     tracing::info!("re-syncing L1 safe head after flush");
     input_reader.sync_to_current_safe_head().await?;
```
Correct me if I'm wrong.

Flush → sync → recover_post_flush: after `flush_and_wait`, `sync_to_current_safe_head().await?` fails on any error (including Provider errors), while step 1 only treats Provider as unreachable. So the post-flush sync can error out before `recover_post_flush`, meaning flush effects can land on L1 without the DB cascade or the recovery batch. Could you flag this as intentional, or add retries, a clearer error, or a runbook entry?
Great question! I've added a comment on the code addressing this. Here's the comment:

If this re-sync errors out, L1 has been flushed but the DB has NOT been cascaded; we exit with the InputReaderError and rely on the orchestrator to respawn.

That's safe by design:

- `flush_and_wait` is idempotent: on the next attempt it queries L1 for pending wallet-nonces, finds zero (the previous flush cleared them), and returns immediately.
- `check_danger` is stable across the failure window: `safe_block` only moves forward, and flush doesn't retroactively change closed batches' `first_frame_safe_block`, so the danger condition that fired before still fires after the restart.
- `recover_post_flush` is idempotent against the resulting DB state (verified by `after_post_recovery_crash_is_no_op` in `recovery_tests`).

So a failure here just costs an extra orchestrator respawn; correctness is preserved.
More importantly, it refuses to boot during a recovery scenario when we can't reach L1.

Sounds good?
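The idempotency argument above can be sketched as a toy model. Everything here is illustrative: `L1` models pending wallet-nonces on chain, and `flush_and_wait` is a stand-in for the real function, not its implementation.

```rust
// Hypothetical model of the retry-after-crash story: a retried flush
// that finds zero pending wallet-nonces returns immediately.
struct L1 {
    pending_nonces: Vec<u64>,
}

impl L1 {
    // Returns how many pending transactions were flushed.
    fn flush_and_wait(&mut self) -> usize {
        let flushed = self.pending_nonces.len();
        self.pending_nonces.clear(); // flush effects land on L1
        flushed
    }
}

fn main() {
    let mut l1 = L1 { pending_nonces: vec![41, 42, 43] };

    // First attempt: flush succeeds, then the post-flush sync errors
    // out and the process exits before the DB cascade.
    assert_eq!(l1.flush_and_wait(), 3);

    // Orchestrator respawn: the retried flush queries L1, finds zero
    // pending nonces, and is a no-op; correctness is preserved.
    assert_eq!(l1.flush_and_wait(), 0);
    println!("ok");
}
```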
stephenctw left a comment:

LGTM! I just left a comment; not sure if it's a real issue, but it'd be great if you could resolve my concern.
No description provided.