feat: implement sequencer recovery for stale batches #12
Conversation
Force-pushed from 97d62d2 to c0c0810.
Refine TLA+ model. Add more tests.
Force-pushed from f597e0d to dcd2e86.
Force-pushed from 0bab3cf to dd3a554.
- Extract DangerDetector as its own worker; submitter is pure submission.
- Unify SchedulerRules + RecoveryParams into one ProtocolConfig in core.
- Pure decide_submit_start + decide_startup_action with exhaustive tests.
- DangerZone is a deliberate RunError variant, not a BatchSubmitterError.
Force-pushed from dcc6bff to b879624.
- Transactions use read/write closures; 11 manual sites collapsed.
- internals.rs split into convert/queries/mutations; drop load_ prefix.
- pending_batches now bakes the authoritative nonce into wire bytes.
- Extract 2000-line test block from recovery.rs into a sibling file.
- Improve flusher error handling.
Force-pushed from b879624 to 4a592fe.
stephenctw left a comment:

Looks great! I've left a few minor comments.
```rust
    });
}
Ok(_) => {} // verified
Err(e) => {
```
Could we fail startup if get_chain_id() errors here? Right now we warn+continue, which can skip chain-id validation on transient RPC issues.
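To illustrate the suggestion, here is a minimal sketch of failing startup instead of warn-and-continue. The names (`StartupError`, `verify_chain_id`) are illustrative, not the PR's, and `fetched` stands in for the result of a real `get_chain_id()` RPC call:

```rust
// Hypothetical sketch: fail fast on any chain-id problem so validation
// is never silently skipped on a transient RPC error.
#[derive(Debug)]
enum StartupError {
    ChainIdRpc(String),
    ChainIdMismatch { expected: u64, actual: u64 },
}

// `fetched` models the outcome of a `get_chain_id()` call.
fn verify_chain_id(fetched: Result<u64, String>, expected: u64) -> Result<(), StartupError> {
    match fetched {
        // Propagate transient RPC failures instead of warn+continue.
        Err(e) => Err(StartupError::ChainIdRpc(e)),
        Ok(actual) if actual != expected => {
            Err(StartupError::ChainIdMismatch { expected, actual })
        }
        Ok(_) => Ok(()), // verified
    }
}

fn main() {
    assert!(verify_chain_id(Ok(10), 10).is_ok());
    assert!(matches!(
        verify_chain_id(Err("timeout".to_string()), 10),
        Err(StartupError::ChainIdRpc(_))
    ));
    assert!(matches!(
        verify_chain_id(Ok(1), 10),
        Err(StartupError::ChainIdMismatch { expected: 10, actual: 1 })
    ));
    println!("ok");
}
```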
```rust
/// zero would make preemptive recovery indistinguishable from hard
/// staleness. Callers should catch this at startup.
pub fn danger_threshold(&self) -> u64 {
    assert!(
```
This `assert!` panics on invalid operator config. Would you consider returning a typed startup config error instead (still fail-fast, just cleaner)?
```rust
batch_submitter_address = %l1_config.batch_submitter_address,
max_wait_blocks = protocol.max_wait_blocks,
preemptive_margin_blocks = protocol.preemptive_margin_blocks,
danger_threshold = protocol.danger_threshold(),
```
Related to above: calling danger_threshold() in startup logging means invalid config panics before structured error handling. Maybe validate once up front and return a typed error.
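One way to realize validate-once-up-front, as a hedged sketch: the field names mirror the log lines above, but the subtraction used for the threshold and the `ConfigError` type are assumptions, not necessarily the PR's definitions.

```rust
// Hypothetical sketch: validate the config once at startup and return a
// typed error, so danger_threshold() can stay infallible afterwards.
#[derive(Debug, PartialEq)]
enum ConfigError {
    ZeroDangerThreshold,
}

struct ProtocolConfig {
    max_wait_blocks: u64,
    preemptive_margin_blocks: u64,
}

impl ProtocolConfig {
    // Called once at startup; a zero threshold would make preemptive
    // recovery indistinguishable from hard staleness.
    fn validate(&self) -> Result<(), ConfigError> {
        if self.max_wait_blocks <= self.preemptive_margin_blocks {
            return Err(ConfigError::ZeroDangerThreshold);
        }
        Ok(())
    }

    // Safe to call unchecked only after validate() has passed.
    fn danger_threshold(&self) -> u64 {
        self.max_wait_blocks - self.preemptive_margin_blocks
    }
}

fn main() {
    let bad = ProtocolConfig { max_wait_blocks: 5, preemptive_margin_blocks: 5 };
    assert_eq!(bad.validate(), Err(ConfigError::ZeroDangerThreshold));

    let good = ProtocolConfig { max_wait_blocks: 10, preemptive_margin_blocks: 3 };
    assert_eq!(good.validate(), Ok(()));
    assert_eq!(good.danger_threshold(), 7);
    println!("ok");
}
```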
```diff
@@ -218,33 +254,31 @@ pub async fn run_preemptive_recovery(
     tracing::info!("re-syncing L1 safe head after flush");
     input_reader.sync_to_current_safe_head().await?;
```
Correct me if I'm wrong.

Flush → sync → recover_post_flush: after `flush_and_wait`, `sync_to_current_safe_head().await?` fails on any error (including Provider errors), while step 1 only treats Provider as unreachable. So the post-flush sync can error out before `recover_post_flush`, meaning flush effects can land on L1 without the DB cascade or the recovery batch. Could you flag this as intentional, or add retries, a clearer error, or a runbook entry?
Great question! I've added a comment on the code addressing this. Here's the comment:

If this re-sync errors out, L1 has been flushed but the DB has NOT been cascaded; we exit with the InputReaderError and rely on the orchestrator to respawn.

That's safe by design:

- `flush_and_wait` is idempotent: on the next attempt it queries L1 for pending wallet-nonces, finds zero (the previous flush cleared them), and returns immediately.
- `check_danger` is stable across the failure window: `safe_block` only moves forward, and flush doesn't retroactively change closed batches' `first_frame_safe_block`, so the danger condition that fired before still fires after the restart.
- `recover_post_flush` is idempotent against the resulting DB state (verified by `after_post_recovery_crash_is_no_op` in `recovery_tests`).

So a failure here just costs an extra orchestrator respawn; correctness is preserved.
More importantly, it refuses to boot during a recovery scenario when we can't reach L1.

Sounds good?
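The idempotency argument above can be sketched as a toy model. Everything here is illustrative: `L1` models pending wallet-nonces on chain, and `flush_and_wait` is a stand-in for the real function, not its implementation.

```rust
// Hypothetical model of the retry-after-crash story: a retried flush
// that finds zero pending wallet-nonces returns immediately.
struct L1 {
    pending_nonces: Vec<u64>,
}

impl L1 {
    // Returns how many pending transactions were flushed.
    fn flush_and_wait(&mut self) -> usize {
        let flushed = self.pending_nonces.len();
        self.pending_nonces.clear(); // flush effects land on L1
        flushed
    }
}

fn main() {
    let mut l1 = L1 { pending_nonces: vec![41, 42, 43] };

    // First attempt: flush succeeds, then the post-flush sync errors
    // out and the process exits before the DB cascade.
    assert_eq!(l1.flush_and_wait(), 3);

    // Orchestrator respawn: the retried flush queries L1, finds zero
    // pending nonces, and is a no-op; correctness is preserved.
    assert_eq!(l1.flush_and_wait(), 0);
    println!("ok");
}
```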
stephenctw left a comment:

LGTM! I just left a comment; not sure if it's a real issue, but it'd be great if you could resolve my concern.
No description provided.