CASSANDRA-21134: Direct I/O for background SSTable writes #4815
samueldlightfoot wants to merge 8 commits
aweisberg left a comment
Mostly just nits, but there are at least one or two things worth doing. I also shared a pastebin with you containing an analysis of the DirectCompressedSequentialWriterTest code and its boundary coverage, which is worth considering.
    @Test
    public void testInitializeBackgroundWriteDiskAccessModeRejectsZeroBufferSize()
Technically, the tests haven't exercised the path checks in isDirectIOSupported. Doing so would require intercepting that call and supplying test answers for whether each path supports Direct I/O. Maybe that is something that can be done with jimfs.
The auto path, which we discussed, might be worth changing.
I think the best we could do is a custom hook that captures which open options were requested, but not via JimFS, since getBlockSize throws an exception there. Let me know how strongly you feel about this one.
My plan with auto was to have it resolve to standard for now. If and when we decide to make direct the default, auto can check DIO support and fall back to standard if it is unavailable.
Adds an opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for data that is unlikely to be re-read soon after being written. Memtable flushes remain buffered.

Enabled via two new YAML knobs:
- background_write_disk_access_mode: standard (default) | direct
- direct_write_buffer_size: 1MiB (default; aligned up to FS block size, auto-grown to chunk_length)

The path is gated by config, table compression being enabled, and an OperationType allowlist in DataComponent. The allowlist is exhaustive: any new OperationType with writesData=true that is not classified will fail static initialization.

Operations on the DIO path: COMPACTION, MAJOR_COMPACTION, TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP, UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only).

Operations off the DIO path:
- FLUSH (policy: just-flushed data is hot, keep in page cache)
- SCRUB (correctness: tryAppend needs mark/resetAndTruncate)
- Zero-Copy Streaming (bypasses DataComponent.buildWriter)
- Uncompressed writers (only CompressedSequentialWriter has a DIO subclass in this change)

StartupChecks fails fast if 'direct' is requested on a platform/FS that does not support O_DIRECT.

patch by Sam Lightfoot; reviewed by <reviewers> for CASSANDRA-21134
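The sizing rule for direct_write_buffer_size (aligned up to the FS block size, auto-grown to chunk_length) can be illustrated with a small sketch. This is not the patch's code: DirectBufferSizing, alignUp and effectiveBufferSize are hypothetical names, and both the grow-then-align order and the rejection of non-positive sizes are assumptions made for the example.

```java
// Hypothetical helper illustrating the buffer sizing rule; not taken from the patch.
final class DirectBufferSizing
{
    /** Rounds value up to the next multiple of alignment. Both must be positive. */
    static long alignUp(long value, long alignment)
    {
        if (value <= 0 || alignment <= 0)
            throw new IllegalArgumentException("value and alignment must be positive");
        long blocks = Math.addExact(value, alignment - 1) / alignment;
        return Math.multiplyExact(blocks, alignment);
    }

    /**
     * Assumed rule: grow the configured size to at least one compression chunk,
     * then align the result up to the filesystem block size.
     */
    static long effectiveBufferSize(long configuredBytes, long chunkLengthBytes, long fsBlockSizeBytes)
    {
        if (configuredBytes <= 0)
            throw new IllegalArgumentException("direct_write_buffer_size must be positive");
        long grown = Math.max(configuredBytes, chunkLengthBytes);
        return alignUp(grown, fsBlockSizeBytes);
    }
}
```

With a 4096-byte block size, a configured size of 5000 bytes and a 1024-byte chunk would come out as 8192 bytes under these assumptions, while the 1MiB default is already aligned and stays unchanged.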
- Reuse ChecksumWriter for O_DIRECT compressed writes via DirectChecksumWriter
- Centralize Direct I/O write paths and drop redundant background-write support validation
- Rely on startup checks for engagement; remove the per-operation log and its reflective test seam (and StreamingDirectWriteTest)
- Fail fast on unsupported background_write_disk_access_mode values
- Drop "scrub" from the DIO operation docs (it is UNSUPPORTED_CORRECTNESS)
- Exercise isDirectIOSupported path checks in DatabaseDescriptorTest
- Tighten AntiCompactionTest/CompactionsTest to assert Direct I/O actually engages rather than passing spuriously
… SSTables
Early-open publishes readers before commit pads/truncates the file. Under
O_DIRECT this exposed not-yet-durable or block-padded tails, so early
readers short-read past EOF (CorruptBlockException/CorruptSSTableException).
Fix all three early-open paths inside the DIO writer subclass:
- final early-open: override syncInternal() to flush the buffered sub-block
tail, restoring channel/buffer cursors so the commit-time pad+truncate
still yields a byte-identical file.
- preemptive early-open (SSTableRewriter.maybeReopenEarly): report the
durable uncompressed offset by tracking each staged chunk's
{compressedEnd, uncompressedEnd} and advancing only over chunks whose
compressed bytes sit below fchannel.position(). No extra I/O.
- scan early-open: truncate to actualDataSize at openFinalEarly so a
scanner's snapshotted file size already equals the committed size.
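
The bookkeeping behind the preemptive early-open fix can be sketched as below. This is a simplified illustration rather than the writer's actual code: DurableOffsetTracker and its methods are hypothetical, the channel position is passed in as a plain long instead of being read from fchannel.position(), and compressedEnd is assumed to be an exclusive end offset.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of tracking the durable uncompressed offset; not the patch's code.
final class DurableOffsetTracker
{
    /** End offsets of one staged chunk (assumed exclusive), in compressed and uncompressed space. */
    private static final class StagedChunk
    {
        final long compressedEnd;
        final long uncompressedEnd;

        StagedChunk(long compressedEnd, long uncompressedEnd)
        {
            this.compressedEnd = compressedEnd;
            this.uncompressedEnd = uncompressedEnd;
        }
    }

    // Chunks are staged in file order, so a FIFO queue is enough.
    private final ArrayDeque<StagedChunk> staged = new ArrayDeque<>();
    private long durableUncompressedOffset = 0;

    /** Records a chunk that has been compressed into the aligned buffer but may not be on disk yet. */
    void onChunkStaged(long compressedEnd, long uncompressedEnd)
    {
        staged.addLast(new StagedChunk(compressedEnd, uncompressedEnd));
    }

    /**
     * Advances only over chunks whose compressed bytes lie entirely below the
     * channel position, and returns the uncompressed offset safe to expose to readers.
     */
    long durableUncompressedOffset(long channelPosition)
    {
        while (!staged.isEmpty() && staged.peekFirst().compressedEnd <= channelPosition)
            durableUncompressedOffset = staged.pollFirst().uncompressedEnd;
        return durableUncompressedOffset;
    }
}
```

The lookup only compares offsets already held in memory, which matches the "No extra I/O" property noted in the commit message.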
…nType
RELOCATE (online relocateSSTables) and UNKNOWN (offline sstablesplit) reach DataComponent.buildWriter() despite writesData==false, so keying the direct-write classification off writesData left RELOCATE unclassified and threw under background_write_disk_access_mode: direct.

Make DIRECT_WRITE_SUPPORT total over OperationType.values(): classify RELOCATE and UNKNOWN as SUPPORTED (both append-only, safe under O_DIRECT) and the nine genuine non-writers as a new NOT_A_WRITER marker. The build-time guard now asserts completeness over values(), so a newly-added OperationType fails at class-load instead of in production.
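
A classification that is total over an enum's values(), with a guard that fails at class-load time, might look like the following sketch. The Op enum is a small stand-in for Cassandra's OperationType (HYPOTHETICAL_NON_WRITER is invented for the example), and the class and method names are assumptions rather than the patch's actual structure.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of a total, class-load-checked classification; not the patch's code.
final class DirectWriteClassification
{
    // Stand-in for OperationType with only a handful of constants.
    enum Op { COMPACTION, FLUSH, SCRUB, RELOCATE, UNKNOWN, HYPOTHETICAL_NON_WRITER }

    enum Support { SUPPORTED, UNSUPPORTED_POLICY, UNSUPPORTED_CORRECTNESS, NOT_A_WRITER }

    static final Map<Op, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(Op.class);

    static
    {
        DIRECT_WRITE_SUPPORT.put(Op.COMPACTION, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
        DIRECT_WRITE_SUPPORT.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        DIRECT_WRITE_SUPPORT.put(Op.RELOCATE, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.UNKNOWN, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.HYPOTHETICAL_NON_WRITER, Support.NOT_A_WRITER);

        // Completeness guard: a newly added constant without a classification
        // makes class initialization fail instead of surfacing later at write time.
        for (Op op : Op.values())
            if (!DIRECT_WRITE_SUPPORT.containsKey(op))
                throw new AssertionError("Unclassified operation type: " + op);
    }

    static boolean usesDirectWrite(Op op)
    {
        return DIRECT_WRITE_SUPPORT.get(op) == Support.SUPPORTED;
    }
}
```

Iterating over values() rather than filtering on a flag such as writesData is what keeps operations like RELOCATE from slipping through unclassified.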
…st config
Add background_write_disk_access_mode: direct to cassandra_latest.yaml so new-install/latest configurations exercise Direct I/O by default, and keep the test latest config in sync (test/conf/latest_diff.yaml and the DTEST_JVM_DTESTS_USE_LATEST block in InstanceConfig.java). The cassandra.yaml default remains standard.

CQLTester.InMemory installs a global jimfs filesystem, which cannot do Direct I/O (no FileStore.getBlockSize(), rejects ExtendedOpenOption.DIRECT), so pin in-memory tests to buffered writes to keep the latest config green.
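
The two failure modes mentioned for jimfs (no block size, DIRECT open option rejected) suggest how a capability probe could be written against the standard JDK APIs. The sketch below is an assumption about the general approach, not Cassandra's isDirectIOSupported: DirectIoProbe and supportsDirectIo are hypothetical names, and treating every failure as "unsupported" is a simplification.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical probe for Direct I/O support on a directory; not Cassandra's implementation.
final class DirectIoProbe
{
    static boolean supportsDirectIo(Path directory)
    {
        Path probe = null;
        try
        {
            // FileStore.getBlockSize() throws UnsupportedOperationException on
            // file stores that do not report a block size.
            long blockSize = Files.getFileStore(directory).getBlockSize();
            if (blockSize <= 0)
                return false;

            probe = Files.createTempFile(directory, "dio-probe", ".tmp");
            // Opening with DIRECT fails on filesystems or providers without O_DIRECT support.
            try (FileChannel channel = FileChannel.open(probe, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT))
            {
                return true;
            }
        }
        catch (IOException | UnsupportedOperationException e)
        {
            return false;
        }
        finally
        {
            if (probe != null)
            {
                try
                {
                    Files.deleteIfExists(probe);
                }
                catch (IOException ignored)
                {
                    // Best-effort cleanup of the probe file.
                }
            }
        }
    }
}
```

A probe of this shape returns false for an in-memory filesystem, which is consistent with pinning in-memory tests to buffered writes.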
Cover checkKernelBug1057843 for the new background-write Direct I/O data paths, and tidy the cassandra_latest.yaml comment wording.
…upport
Reshuffle DirectCompressedSequentialWriterTest for congruence and add a CompressionParams.noop(chunkLength) overload for tests that need a specific chunk length.
CASSANDRA-21134: Direct I/O for background SSTable writes
Summary
Opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for write-once read-never data. Memtable flushes remain buffered (hot data benefits from the cache).

Gated by (1) config, (2) table compression enabled, (3) an OperationType allowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is central in DataComponent.buildWriter; producers are unchanged.

Performance
Benchmark results are attached to the JIRA. Significant p99 read latency improvements under throttled compaction.
Operations covered (DIO eligible)
- WRITE: CQLSSTableWriterDaemonTest (parameterised on disk mode)
- COMPACTION: CompactionsTest (parameterised on disk mode)
- MAJOR_COMPACTION: CompactionsTest.testCompactionWithSizeLimitedRewriter
- CLEANUP, GARBAGE_COLLECT, TOMBSTONE_COMPACTION, UPGRADE_SSTABLES: CompactionsTest (transitive)
- ANTICOMPACTION: AntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWrites
- STREAM: StreamingDirectWriteTest

The allowlist is exhaustive: any new OperationType with writesData == true that is not classified fails static initialization (AssertionError).

Operations NOT covered
- FLUSH (memtable): UNSUPPORTED_POLICY. Test: DataComponentDirectWriteSelectionTest
- SCRUB: UNSUPPORTED_CORRECTNESS. tryAppend needs mark()/resetAndTruncate(), which DIO cannot satisfy. Test: DataComponentDirectWriteSelectionTest
- Zero-Copy Streaming: bypasses DataComponent.buildWriter. Test: StreamingDirectWriteTest (disables ZCS)
- Uncompressed writers: only CompressedSequentialWriter has a DIO subclass. Test: DataComponentDirectWriteSelectionTest (compression gate)

Removing an UNSUPPORTED_CORRECTNESS entry requires code changes; UNSUPPORTED_POLICY is a policy decision.

Key code
- io/DirectIoSupport.java: eligibility enum (SUPPORTED / UNSUPPORTED_CORRECTNESS / UNSUPPORTED_POLICY / NOT_APPLICABLE).
- io/sstable/format/DataComponent.java: selection, allowlist, exhaustiveness check; per-op first-activation log.
- io/compress/DirectCompressedSequentialWriter.java: new writer; aligned buffers, mark()/resetAndTruncate() unsupported.
- io/compress/CompressedSequentialWriter.java: refactored so the DIO subclass can override the write-chunk path; writeChunk contract documented and asserted.
- config/Config.java, config/DatabaseDescriptor.java: new knobs, validation, startup wiring; buffer size aligned to FS block size, auto-grown to chunk length.
- service/StartupChecks.java: fails fast if direct is requested on a platform/FS that does not support O_DIRECT.

Tests introduced
- … vs. the buffered writer over compressors × chunk lengths × random payload sizes; seed-logged for repro (DirectCompressedSequentialWriterTest).
- … flushCompleteBlocks (DirectCompressedSequentialWriterTest).
- OperationType eligibility, allowlist exhaustiveness, compression gate, config-mode gate (DataComponentDirectWriteSelectionTest).
- WRITE (CQLSSTableWriterDirectWriteTest), STREAM (in-JVM dtest, StreamingDirectWriteTest), and the compaction family + ANTICOMPACTION (extended CompactionsTest, AntiCompactionTest).
- … block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries (DirectCompressedSequentialWriterTest).
- BufferPoolMXBean check that the off-heap aligned buffer is returned on close (DirectCompressedSequentialWriterTest).
- … in DatabaseDescriptorTest.

Not in scope
Reviewer notes
Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to
review.