[CELEBORN-2334] Automatically restore RocksDB in case of failures#3695
AmandeepSingh285 wants to merge 9 commits into
Pull request overview
Adds automatic recovery for the worker's RocksDB-backed metadata store: when a DB operation throws RocksDBException, the RocksDB wrapper closes the existing native instance and reopens it via RocksDBProvider.initRockDB, coordinated by a ReentrantReadWriteLock and a generation counter so only one thread performs the reopen.
Changes:
- `RocksDB` now holds a `volatile` db reference plus `dbFile`/`version` and exposes a `recreateDBInstance(generation)` helper invoked from each operation's `catch` block.
- All `putInternal`/`getInternal`/`deleteInternal`/`newIterator`/`close` methods are wrapped in read-/write-lock scopes around the generation snapshot.
- `DBProvider.initDB` passes `dbFile` and `version` through to the new `RocksDB` constructor.
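The generation-guarded reopen described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Handle`, `FlakyHandle`, and `RecoveringStore` are hypothetical stand-ins, the failure is modelled as an unchecked exception rather than `RocksDBException`, and the single retry after recovery is an assumption of this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical stand-in for a native DB handle; not the Celeborn or RocksDB API. */
interface Handle {
  void put(String key, String value);
  void close();
}

/** Test double: the first instance opened fails every put, later instances succeed. */
class FlakyHandle implements Handle {
  static final AtomicInteger opened = new AtomicInteger();
  static final Map<String, String> data = new ConcurrentHashMap<>();
  private final boolean broken = opened.incrementAndGet() == 1;

  public void put(String key, String value) {
    if (broken) throw new IllegalStateException("simulated read-only DB");
    data.put(key, value);
  }

  public void close() {}
}

class RecoveringStore {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final AtomicInteger generation = new AtomicInteger();
  private final Supplier<Handle> opener;
  private volatile Handle handle;

  RecoveringStore(Supplier<Handle> opener) {
    this.opener = opener;
    this.handle = opener.get();
  }

  int generation() {
    return generation.get();
  }

  void put(String key, String value) {
    lock.readLock().lock();
    int seen = generation.get(); // snapshot taken while holding the read lock
    try {
      handle.put(key, value);
      return;
    } catch (RuntimeException e) {
      // Fall through: ReentrantReadWriteLock cannot upgrade read -> write,
      // so the read lock is released (in finally) before recovery starts.
    } finally {
      lock.readLock().unlock();
    }
    recreate(seen);
    lock.readLock().lock();
    try {
      handle.put(key, value); // single retry (an assumption of this sketch)
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Reopens the handle unless another thread already did so since `seen`. */
  private void recreate(int seen) {
    lock.writeLock().lock();
    try {
      if (generation.get() != seen) {
        return; // generation moved on: someone else already reopened
      }
      handle.close();
      handle = opener.get();
      generation.incrementAndGet();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

Threads that hit the same failure all carry the same generation snapshot, so only the first one through the write lock performs the reopen; the others see a newer generation and return immediately.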
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java | Adds locking, generation counter, and recreateDBInstance recovery in each DB operation. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java | Threads dbFile/version into the updated RocksDB constructor. |
@AmandeepSingh285, please update the
@SteNicholas could you please help with review. Thanks!
SteNicholas left a comment
@AmandeepSingh285, thanks for the updates. I have left minor comments. PTAL.
Thanks @SteNicholas, made changes according to the comments.
Force-pushed from 1d92a40 to cf8d472.
Thanks for the work on this PR. The overall design (ManagedRocksDB lifecycle wrapper, generation + ReadWriteLock for concurrent recovery dedup, stale iterator detection) is solid. A couple of observations:

1. If the reopen fails:

```java
try {
  if (db != null) {
    db.close(); // old DB closed
  }
} catch (Exception e) { ... }
dbGeneration.incrementAndGet(); // generation bumped
try {
  db = RocksDBProvider.reopenRocksDB(dbFile, conf); // if this fails...
} catch (IOException e) {
  logger.error("Safe reopen failed ...", e);
  // db still points to the CLOSED ManagedRocksDB
  // generation already incremented → dedup won't block next attempt
}
```

Every subsequent operation will: use the closed DB → throw → trigger a new recovery (passes the dedup check since generation advanced) → close the already-closed DB → attempt reopen → fail again. Each operation pays the cost of a write-lock acquisition + a full reopen attempt, and it never converges.

Suggestion: mark the DB as terminally failed after reopen failure:

```java
} catch (IOException e) {
  logger.error("Safe reopen failed for RocksDB at {}.", dbFile, e);
  db = null;
  closed = true; // or a dedicated recoveryFailed flag
}
```

2. Minor: The test puts 2 entries ("a", "b") but asserts

Reviewed with Claude Code
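The fail-fast behaviour suggested in point 1 could look roughly like the sketch below. It only illustrates the idea of a terminal `recoveryFailed` flag: `Reopener` and `TerminalFailureGuard` are hypothetical names, the DB is modelled as a plain `Object`, and `synchronized` replaces the PR's read/write lock for brevity.

```java
import java.io.IOException;

/** Hypothetical reopen hook; stands in for the PR's reopen call. */
interface Reopener {
  /** Convenience instance that always fails, used to simulate a dead disk. */
  Reopener ALWAYS_FAILS = () -> {
    throw new IOException("simulated reopen failure");
  };

  Object reopen() throws IOException;
}

class TerminalFailureGuard {
  private final Reopener reopener;
  private Object db;
  private boolean recoveryFailed;
  private int reopenAttempts;

  TerminalFailureGuard(Object initialDb, Reopener reopener) {
    this.db = initialDb;
    this.reopener = reopener;
  }

  /** Attempts a reopen once; after a failed reopen, further calls are no-ops. */
  synchronized void recover() {
    if (recoveryFailed) {
      return; // terminal state: do not pay for another reopen attempt
    }
    reopenAttempts++;
    try {
      db = reopener.reopen();
    } catch (IOException e) {
      db = null;
      recoveryFailed = true; // remember the failure instead of looping
    }
  }

  /** Returns the current DB, or fails fast once recovery has failed. */
  synchronized Object dbOrThrow() {
    if (recoveryFailed) {
      throw new IllegalStateException("recovery failed; store is unusable");
    }
    return db;
  }

  synchronized int reopenAttempts() {
    return reopenAttempts;
  }
}
```

With the flag set, callers get an immediate exception rather than triggering a write-lock acquisition and a full reopen attempt on every operation.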
Thanks @RexXiong for the review.
@RexXiong I have updated the test with the suggested changes. Could you please help with the review?
What changes were proposed in this pull request?
This patch re-instantiates RocksDB after failures. In the current implementation, when RocksDB enters read-only mode due to a failure, Celeborn metadata operations fail and stay blocked until manual intervention or a restart. This pull request adds logic to detect such failures and re-instantiate the RocksDB instance so that metadata operations recover automatically and continue without prolonged disruption. RocksDB can enter a read-only or unusable state when, for example, its files are corrupted or the underlying file system returns errors. In such cases RocksDB rejects further writes to protect data consistency, which causes Celeborn metadata operations to fail.
Why are the changes needed?
Once RocksDB enters a read-only or error state, Celeborn metadata operations become unavailable because the existing RocksDB instance remains unusable, which could lead to failures in metadata updates.
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.