Skip to content

[CELEBORN-2334] Automatically restore RocksDB in case of failures#3695

Open
AmandeepSingh285 wants to merge 9 commits into
apache:mainfrom
AmandeepSingh285:auto-recover-rocks-db
Open

[CELEBORN-2334] Automatically restore RocksDB in case of failures#3695
AmandeepSingh285 wants to merge 9 commits into
apache:mainfrom
AmandeepSingh285:auto-recover-rocks-db

Conversation

@AmandeepSingh285
Copy link
Copy Markdown
Contributor

@AmandeepSingh285 AmandeepSingh285 commented May 18, 2026

What changes were proposed in this pull request?

The patch re-instantiates RocksDB in case of failures. In the current implementation, when RocksDB enters a read-only mode due to failures, Celeborn metadata operations fail and remain blocked until manual intervention or restart. This pull request adds logic to detect such RocksDB failures and re-instantiate the RocksDB instance so that metadata operations can recover automatically and continue functioning without prolonged disruption. RocksDB can enter a read-only or unusable state under scenarios such as: corruption in files, errors from underlying file system. In such cases, RocksDB prevents further writes to protect data consistency, which causes Celeborn metadata operations to fail.

Why are the changes needed?

Once RocksDB enters a read-only or error state, Celeborn metadata operations become unavailable because the existing RocksDB instance remains unusable, which could lead to failures in metadata updates.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

@SteNicholas SteNicholas changed the title [CELEBORN-2334] Auto-recover rocks db in case of failures [CELEBORN-2334] Auto-recover RocksDB in case of failures May 19, 2026
@SteNicholas SteNicholas changed the title [CELEBORN-2334] Auto-recover RocksDB in case of failures [CELEBORN-2334] Automatically restore RocksDB in case of failures May 19, 2026
@SteNicholas SteNicholas requested a review from Copilot May 19, 2026 03:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds automatic recovery for the worker's RocksDB-backed metadata store: when a DB operation throws RocksDBException, the RocksDB wrapper closes the existing native instance and reopens it via RocksDBProvider.initRockDB, coordinated by a ReentrantReadWriteLock and a generation counter so only one thread performs the reopen.

Changes:

  • RocksDB now holds a volatile db reference plus dbFile/version and exposes a recreateDBInstance(generation) helper invoked from each operation's catch block.
  • All putInternal/getInternal/deleteInternal/newIterator/close methods are wrapped in read-/write-lock scopes around the generation snapshot.
  • DBProvider.initDB passes dbFile and version through to the new RocksDB constructor.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java Adds locking, generation counter, and recreateDBInstance recovery in each DB operation.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java Threads dbFile/version into the updated RocksDB constructor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions Bot added correctness Correctness bugfix and removed correctness Correctness bugfix labels May 19, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comment thread docs/configuration/worker.md Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

@SteNicholas
Copy link
Copy Markdown
Member

@AmandeepSingh285, please update the ConfigOption to align with the changes of configuration/worker.md for failure of CI.

@AmandeepSingh285
Copy link
Copy Markdown
Contributor Author

@SteNicholas could you please help with review. Thanks!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Copy link
Copy Markdown
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AmandeepSingh285, thanks for updates. I have left minor comments. PTAL.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
@AmandeepSingh285
Copy link
Copy Markdown
Contributor Author

@AmandeepSingh285, thanks for updates. I have left minor comments. PTAL.

Thanks @SteNicholas , made changes according to the comments

@RexXiong
Copy link
Copy Markdown
Contributor

Thanks for the work on this PR — the overall design (ManagedRocksDB lifecycle wrapper, generation + ReadWriteLock for concurrent recovery dedup, stale iterator detection) is solid. A couple of observations:

1. tryRecoverDBInstance: failed reopen leaves DB in an infinite recovery loop

If reopenRocksDB throws, db still references the already-closed ManagedRocksDB, and the generation has already been incremented:

try {
    if (db != null) {
        db.close();           // old DB closed
    }
} catch (Exception e) { ... }

dbGeneration.incrementAndGet();   // generation bumped

try {
    db = RocksDBProvider.reopenRocksDB(dbFile, conf);  // if this fails...
} catch (IOException e) {
    logger.error("Safe reopen failed ...", e);
    // db still points to the CLOSED ManagedRocksDB
    // generation already incremented → dedup won't block next attempt
}

Every subsequent operation will: use the closed DB → throw → trigger a new recovery (passes the dedup check since generation advanced) → close the already-closed DB → attempt reopen → fail again. Each operation pays the cost of a write-lock acquisition + a full reopen attempt, and it never converges.

Suggestion: mark the DB as terminally failed after reopen failure so that checkState() fast-fails with a clear message (e.g., "DB recovery failed, manual intervention required") instead of looping:

} catch (IOException e) {
    logger.error("Safe reopen failed for RocksDB at {}.", dbFile, e);
    db = null;
    closed = true;  // or a dedicated recoveryFailed flag
}

2. Minor: testNewIteratorReturnsUsableIterator assertion is fragile

The test puts 2 entries ("a", "b") but asserts seen >= 3, relying on the version entry written internally by initRockDB. If the version storage mechanism changes, this test will break. Consider asserting seen >= 2 or using an exact count.

Reviewed with Claude Code

@AmandeepSingh285
Copy link
Copy Markdown
Contributor Author

Thanks @RexXiong for the review.

  1. Recovery after failed reopen - The goal of the change was to have RocksDB automatically recovered rather than entering a terminal state. In the current state as well, in case of any failures when RocksDB enters a Read only mode, a manual intervention is required to have metadata updates unblocked which I tried to solve through this PR.
  2. This is true that the test is relying on the version entry written. I can update the same with a small change if required.

@AmandeepSingh285
Copy link
Copy Markdown
Contributor Author

@RexXiong have updated the test with the commented changes. Could you please help with review

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants