Fix manual failover with CLIENT PAUSE/UNPAUSE #389

Open
Paragrf wants to merge 2 commits into apache:unstable from Paragrf:pause

Paragrf commented Apr 23, 2026

Motivation

Implement Controller modifications to ensure master-replica data consistency during failover, aligning with server-side changes in apache/kvrocks#3377

Solution

Step 1 (Pause): Send CLIENT PAUSE WRITE from the controller to the current master.

Step 2 (Wait): Monitor the master-replica sequence gap until it hits zero, ensuring no data loss.

Step 3 (Metadata): Update the global topology metadata for the switchover.

Step 4 (Switch & Unpause): Promote the target and demote the old master; then explicitly call CLIENT UNPAUSE on the old master to restore its status.

Step 5 (Replicate): Reconfigure all other followers to sync from the new master.

Configuration Options

To prevent excessive blocking durations during periods of high write traffic, a maximum pause timeout has been introduced; the failover process will fail if the synchronization times out. The following parameters are added to the Controller failover configuration:

  • "force_on_timeout": false,

  • "sync_timeout_ms": 100,

  • "pause_timeout_ms": 500

Related Issues

Fixes #384

Paragrf changed the title from "fix(cli): fix manual failover with CLIENT PAUSE/UNPAUSE" to "Fix manual failover with CLIENT PAUSE/UNPAUSE" Apr 23, 2026

Paragrf commented Apr 23, 2026

In testing with the Controller and Kvrocks deployed in the same IDC, a single node sustained about 10k write QPS at 20 MB/s of throughput. The write-stop duration (stall time) during failover stayed consistently under 10 ms.

@git-hulk git-hulk self-requested a review April 24, 2026 08:42

Development

Successfully merging this pull request may close these issues.

Implement a write-stop during failover to ensure data consistency during a planned primary-secondary switch
