
feat: implement sequencer recovery for stale batches#12

Open
GCdePaula wants to merge 20 commits into main from feature/recovery

Conversation

@GCdePaula
Collaborator

No description provided.

Extract DangerDetector as its own worker; submitter is pure submission.
Unify SchedulerRules + RecoveryParams into one ProtocolConfig in core.
Pure decide_submit_start + decide_startup_action with exhaustive tests.
DangerZone is a deliberate RunError variant, not a BatchSubmitterError.
Transactions use read/write closures; 11 manual sites collapsed.
internals.rs split into convert/queries/mutations; drop load_ prefix.
pending_batches now bakes the authoritative nonce into wire bytes.
Extract 2000-line test block from recovery.rs into a sibling file.
Improve flusher error handling
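The "pure decide functions" item above can be sketched as follows. This is an illustrative, simplified sketch only: the enum, signature, and thresholds are assumptions, not the crate's real API; the point is that all inputs are plain values and the output is a plain enum, so the decision logic can be covered by exhaustive tests without any I/O.

```rust
// Hypothetical sketch of a pure submit-start decision in the spirit of
// decide_submit_start; names and thresholds are illustrative.
#[derive(Debug, PartialEq)]
enum SubmitDecision {
    Wait,   // not yet in the preemptive margin
    Submit, // inside the preemptive margin: submit now
    Danger, // stale: recovery must take over
}

/// Decide what the submitter should do given how many L1 blocks have
/// elapsed since the batch's first frame became safe.
fn decide_submit_start(blocks_elapsed: u64, max_wait: u64, margin: u64) -> SubmitDecision {
    debug_assert!(margin < max_wait);
    if blocks_elapsed >= max_wait {
        SubmitDecision::Danger
    } else if blocks_elapsed >= max_wait - margin {
        SubmitDecision::Submit
    } else {
        SubmitDecision::Wait
    }
}

fn main() {
    assert_eq!(decide_submit_start(0, 100, 10), SubmitDecision::Wait);
    assert_eq!(decide_submit_start(90, 100, 10), SubmitDecision::Submit);
    assert_eq!(decide_submit_start(100, 100, 10), SubmitDecision::Danger);
    println!("ok");
}
```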
Collaborator

@stephenctw stephenctw left a comment


Looks great! I've left a few minor comments.

Comment thread sequencer/src/runtime/mod.rs Outdated
});
}
Ok(_) => {} // verified
Err(e) => {
Collaborator


Could we fail startup if get_chain_id() errors here? Right now we warn+continue, which can skip chain-id validation on transient RPC issues.
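The suggested fix could look something like the sketch below. This is a hedged illustration only (`RuntimeError` and the closure-based fetch are assumptions standing in for the real provider call): the error from fetching the chain id is propagated with `?` so startup aborts, instead of being logged and swallowed.

```rust
// Hypothetical sketch: fail startup when the chain-id fetch errors,
// rather than warn+continue. Names are illustrative, not the crate's API.
#[derive(Debug, PartialEq)]
enum RuntimeError {
    ChainIdFetch(String),
    ChainIdMismatch { expected: u64, actual: u64 },
}

fn verify_chain_id(
    fetch: impl Fn() -> Result<u64, String>,
    expected: u64,
) -> Result<(), RuntimeError> {
    // Propagate fetch errors so a transient RPC failure aborts startup
    // instead of silently skipping chain-id validation.
    let actual = fetch().map_err(RuntimeError::ChainIdFetch)?;
    if actual != expected {
        return Err(RuntimeError::ChainIdMismatch { expected, actual });
    }
    Ok(())
}

fn main() {
    assert_eq!(verify_chain_id(|| Ok(1), 1), Ok(()));
    assert!(matches!(
        verify_chain_id(|| Err("rpc timeout".to_string()), 1),
        Err(RuntimeError::ChainIdFetch(_))
    ));
    assert!(matches!(
        verify_chain_id(|| Ok(5), 1),
        Err(RuntimeError::ChainIdMismatch { expected: 1, actual: 5 })
    ));
    println!("ok");
}
```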

Comment thread sequencer-core/src/protocol.rs Outdated
/// zero would make preemptive recovery indistinguishable from hard
/// staleness. Callers should catch this at startup.
pub fn danger_threshold(&self) -> u64 {
assert!(
Collaborator


This `assert!` panics on invalid operator config. Would you consider returning a typed startup config error instead (still fail-fast, just cleaner)?

Comment thread sequencer/src/runtime/mod.rs Outdated
batch_submitter_address = %l1_config.batch_submitter_address,
max_wait_blocks = protocol.max_wait_blocks,
preemptive_margin_blocks = protocol.preemptive_margin_blocks,
danger_threshold = protocol.danger_threshold(),
Collaborator


Related to above: calling danger_threshold() in startup logging means invalid config panics before structured error handling. Maybe validate once up front and return a typed error.
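The validate-once idea could be sketched like this. Everything here is an assumption for illustration (field names follow the log line above; `ConfigError` is invented): validity is established by a fallible constructor, so `danger_threshold()` becomes infallible and safe to call from startup logging.

```rust
// Hypothetical sketch: validate ProtocolConfig once at construction and
// return a typed error, so danger_threshold() can never panic later.
#[derive(Debug, PartialEq)]
enum ConfigError {
    MarginNotBelowMaxWait { max_wait: u64, margin: u64 },
}

struct ProtocolConfig {
    max_wait_blocks: u64,
    preemptive_margin_blocks: u64,
}

impl ProtocolConfig {
    fn new(max_wait_blocks: u64, preemptive_margin_blocks: u64) -> Result<Self, ConfigError> {
        // A zero threshold would make preemptive recovery indistinguishable
        // from hard staleness, so reject it up front with a typed error.
        if preemptive_margin_blocks >= max_wait_blocks {
            return Err(ConfigError::MarginNotBelowMaxWait {
                max_wait: max_wait_blocks,
                margin: preemptive_margin_blocks,
            });
        }
        Ok(Self { max_wait_blocks, preemptive_margin_blocks })
    }

    /// Infallible: validity was established by `new`, so no assert needed,
    /// even when called from startup logging.
    fn danger_threshold(&self) -> u64 {
        self.max_wait_blocks - self.preemptive_margin_blocks
    }
}

fn main() {
    let cfg = ProtocolConfig::new(100, 10).unwrap();
    assert_eq!(cfg.danger_threshold(), 90);
    assert!(ProtocolConfig::new(10, 10).is_err());
    println!("ok");
}
```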

Comment thread sequencer/src/recovery/mod.rs Outdated
@@ -218,33 +254,31 @@ pub async fn run_preemptive_recovery(

tracing::info!("re-syncing L1 safe head after flush");
input_reader.sync_to_current_safe_head().await?;
Collaborator

@stephenctw stephenctw May 14, 2026


Correct me if I am wrong.

Flush → sync → recover_post_flush: after flush_and_wait, sync_to_current_safe_head().await? fails on any error (including Provider errors), while step 1 only treats Provider errors as "unreachable". So the post-flush sync can error out before recover_post_flush runs, meaning the L1 flush can take effect without the DB cascade / recovery batch. Could you flag this as intentional, or add retries, a clearer error, or a runbook note?

Collaborator Author


Great question! I've added a comment on the code addressing this.

Here's the comment:


If this re-sync errors out, L1 has been flushed but the DB has NOT been cascaded — we exit with the InputReaderError and rely on the orchestrator to respawn.

That's safe by design:

  • flush_and_wait is idempotent: on the next attempt it queries L1 for pending wallet-nonces, finds zero (the previous flush cleared them), and returns immediately.
  • check_danger is stable across the failure window: safe_block only moves forward and flush doesn't retroactively change closed batches' first_frame_safe_block, so the danger condition that fired before still fires after the restart.
  • recover_post_flush is idempotent against the resulting DB state (verified by after_post_recovery_crash_is_no_op in recovery_tests).

So a failure here just costs an extra orchestrator respawn; correctness is preserved.
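The first bullet's idempotency claim can be illustrated with a minimal sketch. The trait, mock, and names below are assumptions, not the real crate: the point is only that flush_and_wait first queries L1 for pending wallet nonces and returns immediately when there are none, so re-running it after a crash-and-respawn is a no-op.

```rust
// Hypothetical sketch of the idempotent fast path described above.
trait L1Client {
    fn pending_wallet_nonces(&self) -> Result<u64, String>;
    fn flush_pending(&mut self) -> Result<(), String>;
}

fn flush_and_wait(l1: &mut impl L1Client) -> Result<(), String> {
    if l1.pending_wallet_nonces()? == 0 {
        // A previous flush already cleared everything: return immediately.
        return Ok(());
    }
    l1.flush_pending()
}

// Mock L1 that counts how many real flushes happen.
struct MockL1 {
    pending: u64,
    flushes: u32,
}

impl L1Client for MockL1 {
    fn pending_wallet_nonces(&self) -> Result<u64, String> {
        Ok(self.pending)
    }
    fn flush_pending(&mut self) -> Result<(), String> {
        self.flushes += 1;
        self.pending = 0;
        Ok(())
    }
}

fn main() {
    let mut l1 = MockL1 { pending: 3, flushes: 0 };
    flush_and_wait(&mut l1).unwrap(); // first attempt flushes
    flush_and_wait(&mut l1).unwrap(); // retry after respawn is a no-op
    assert_eq!(l1.flushes, 1);
    println!("ok");
}
```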

More importantly, it refuses to boot during a recovery scenario when we can't reach L1.


Sounds good?

Collaborator

@stephenctw stephenctw left a comment


LGTM! I just left one comment. Not sure if it's a real issue, but it'd be great if you could resolve my concern.

@GCdePaula GCdePaula marked this pull request as ready for review May 17, 2026 00:14
@GCdePaula GCdePaula requested a review from stephenctw May 17, 2026 00:14