[CELEBORN-2348] Support end-to-end shuffle integrity check for Flink#3718
SteNicholas wants to merge 2 commits into
Extend the end-to-end integrity checks from CELEBORN-894 to Flink, on both the regular and tiered (hybrid) read paths. When the check is enabled, the write side records a per-subpartition CRC32/byte count and the driver validates it against what the reader consumed, failing the read on a mismatch.

- Write side: FlinkShuffleClientImpl hashes each push payload into PushState and reports per-subpartition CRC32/bytes at MapperEnd.
- Read side: RemoteBufferStreamReader and CelebornChannelBufferReader accumulate the read CRC32/bytes via a shared ReadIntegrityTracker, reported at the last partition's stream end.
- Driver side: MapPartitionCommitHandler.finishPartition combines the recorded write-side checksums over the consumed range, failing closed on a mismatch.

Also add zero-copy ByteBuffer overloads to CelebornCRC32, CommitMetadata and PushState, with unit and Flink integration tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ping @gauravkm, @RexXiong, @pltbkd, @xumingming.
xumingming
left a comment
Good overall, some comments inline.
Overall LGTM. The only concern is that map partition reuses
@pltbkd thanks for the review! Agreed the reducer-centric naming reads oddly now that map partitions reuse it. The good news is the rename would be wire-safe, because the protocol is keyed by numbers, not names:
And this RPC is client →
My only hesitation is mixing a protocol-wide rename (it also touches the existing Spark path) into this Flink feature PR. I'd lean toward doing the
…ition-type-agnostic

finishPartition is now reused for map partitions, so the trait doc no longer applies only to reduce partitions. Reword it to cover both and point at MapPartitionCommitHandler.finishPartition for the map-partition semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merged to main (v0.7.0).
Sorry for the late reply. Agreed.
What changes were proposed in this pull request?
This PR extends the end-to-end shuffle integrity checks introduced in CELEBORN-894 (Spark-only) to Flink workloads, covering both the regular and the tiered (hybrid) read paths. When the check is enabled, the write side records a per-subpartition CRC32 and byte count, and the driver validates them against what the reader actually consumed, failing the read on a mismatch.
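The record-then-validate idea above can be sketched roughly as follows. This is a minimal illustration built only on `java.util.zip.CRC32`; the class `SubpartitionChecksums` and its methods are hypothetical names, not Celeborn's `PushState` or `CommitMetadata` API, and it assumes the reader consumes bytes in the same order the writer hashed them.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical sketch of per-subpartition CRC32/byte accounting; not Celeborn's API. */
final class SubpartitionChecksums {
    private final CRC32[] crcs;
    private final long[] bytes;

    SubpartitionChecksums(int numSubpartitions) {
        crcs = new CRC32[numSubpartitions];
        bytes = new long[numSubpartitions];
        for (int i = 0; i < numSubpartitions; i++) {
            crcs[i] = new CRC32();
        }
    }

    /** Hashes the payload through a duplicate view, so the caller's position is untouched. */
    void record(int subpartition, ByteBuffer payload) {
        ByteBuffer view = payload.duplicate();
        bytes[subpartition] += view.remaining();
        crcs[subpartition].update(view);
    }

    long crc(int subpartition) {
        return crcs[subpartition].getValue();
    }

    long bytes(int subpartition) {
        return bytes[subpartition];
    }

    /** The check passes only if both the checksum and the byte count agree. */
    static boolean matches(SubpartitionChecksums written, SubpartitionChecksums read, int subpartition) {
        return written.crc(subpartition) == read.crc(subpartition)
            && written.bytes(subpartition) == read.bytes(subpartition);
    }
}
```

In this sketch the writer and the reader each keep their own `SubpartitionChecksums`, and a driver-like component calls `matches` per subpartition. Because a streaming CRC32 does not depend on how the bytes are chunked, the reader may see different buffer boundaries than the writer and still agree.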
- Write side: FlinkShuffleClientImpl hashes each push payload (the body after the batch header) into PushState via a zero-copy ByteBuffer view and reports the per-subpartition CRC32/bytes at MapperEnd, reusing the existing crc32PerPartition/bytesWrittenPerPartition plumbing. The constructor fails fast if the write-side BATCH_HEADER_SIZE ever diverges from the read-side BufferUtils.HEADER_LENGTH_PREFIX.
- Read side: RemoteBufferStreamReader and CelebornChannelBufferReader accumulate the read CRC32/bytes through a shared ReadIntegrityTracker and report them at the last partition's stream end. The tracker owns the per-path framing/stripping and disables itself on any unexpected buffer shape (wrong component count, or a buffer shorter than the batch header) rather than risk a false mismatch.
- Driver side: ReadReducerPartitionEnd is reused for MAP partitions, and MapPartitionCommitHandler.finishPartition combines the recorded write-side checksums over the consumed subpartition range, failing closed on a mismatch or missing metadata.
- Add zero-copy ByteBuffer overloads to CelebornCRC32, CommitMetadata and PushState.
- handleReducerPartitionEnd's failure branch and update the client config doc.

Why are the changes needed?
CELEBORN-894 added end-to-end integrity verification only for Spark. Flink workloads, including hybrid/tiered shuffle, had no equivalent guard, so silent shuffle data corruption (bit flips, truncation, mis-framing) could go undetected and surface as wrong results rather than a failed task. This PR brings the same write-vs-read checksum/byte-count validation to Flink so such corruption fails the read instead of being silently consumed.
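To make the failure modes above concrete, here is a small sketch of how a checksum plus a byte count catches a bit flip and a truncation. `CorruptionDemo` and its helpers are hypothetical and for illustration only; they do not reflect how Celeborn frames or verifies data internally.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical demo: a CRC32 and a byte count compared between written and read data. */
final class CorruptionDemo {
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns a copy of the data with a single bit flipped, simulating silent corruption. */
    static byte[] flipBit(byte[] data, int byteIndex, int bit) {
        byte[] copy = Arrays.copyOf(data, data.length);
        copy[byteIndex] ^= (byte) (1 << bit);
        return copy;
    }

    /** Passes only when both the length and the checksum of the read data match the written data. */
    static boolean verify(byte[] written, byte[] read) {
        return written.length == read.length && crc32(written) == crc32(read);
    }
}
```

A single flipped bit always changes a CRC32 value, and a truncated read changes the byte count, so `verify` returns false in both cases and the read can be failed instead of producing wrong results.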
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
The existing celeborn.client.shuffle.integrityCheck.enabled config now also applies to Flink (previously Spark-only); its documentation is updated accordingly. The default remains false, so there is no behavior change unless the check is explicitly enabled.

How was this patch tested?
Added unit and integration tests:
- CelebornCRC32Test / CommitMetadataTest: the new ByteBuffer overloads (single and split header/data), order-independence, and corruption / byte-count-mismatch detection.
- MapPartitionCommitHandlerTest: finishPartition success and all failure branches (no metadata, missing map partition, out-of-bounds range, checksum mismatch, byte-count mismatch), concurrent recording, and the expired-shuffle race.
- ReadIntegrityTrackerTest: report-once / disable semantics and per-path framing for both the regular and tiered read paths.
- RemoteBufferStreamReaderTest: the stream-end-after-close race (a failed report must not notify the failure listener on a closed channel).
- CelebornBufferStreamTest: the hasRemainingPartitions location-index boundary.
- WordCountTest (WordCountTestWithIntegrityCheck) and HybridShuffleWordCountTest: end-to-end Flink runs with the check enabled, on both the regular and hybrid shuffle paths.
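The report-once / disable semantics exercised by ReadIntegrityTrackerTest might look roughly like the sketch below. Everything here is an assumption for illustration: the class name `ReadTrackerSketch`, the 16-byte `HEADER_SIZE`, and the `reportOnce` return shape are hypothetical and are not taken from the actual ReadIntegrityTracker.

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.zip.CRC32;

/** Hypothetical read-side tracker: strips a fixed header, hashes the body, reports once. */
final class ReadTrackerSketch {
    static final int HEADER_SIZE = 16; // assumed header length, for illustration only

    private final CRC32 crc = new CRC32();
    private long bytes;
    private boolean enabled = true;
    private boolean reported;

    /** Accumulates the body after the header; disables itself on a too-short buffer. */
    void onBuffer(ByteBuffer buffer) {
        if (!enabled) {
            return;
        }
        if (buffer.remaining() < HEADER_SIZE) {
            enabled = false; // unexpected shape: stop tracking rather than risk a false mismatch
            return;
        }
        ByteBuffer body = buffer.duplicate();
        body.position(body.position() + HEADER_SIZE);
        bytes += body.remaining();
        crc.update(body);
    }

    /** Returns {crc32, byteCount} at most once, and never after the tracker disabled itself. */
    Optional<long[]> reportOnce() {
        if (!enabled || reported) {
            return Optional.empty();
        }
        reported = true;
        return Optional.of(new long[] {crc.getValue(), bytes});
    }
}
```

The design choice illustrated is that a disabled tracker reports nothing at all, so an unexpected buffer shape suppresses the check on the read side instead of reporting a checksum that could never match.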