Avoid staging local KV cache payloads#353
Open
leblancfg wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Disk KV saves currently write the session payload twice for local sessions:
ds4_session_stage_payload(session, &staged, ...).kv.tmp...cache file;ds4_session_write_staged_payload(&staged, fp, ...)rename(tmp, path)Current path:
session -> staged payload tmp -> final .kv.tmp -> rename.This PR:
session -> final .kv.tmp -> rename.For long contexts, this payload can be multi-GiB. The extra staged-file copy adds latency to cold/continued/session saves and also raises peak temporary disk usage.
In this PR
When the engine can predict the payload size with
ds4_session_payload_bytes(), write the payload directly into the final temporary cache file.The cache file is still atomic from the reader's point of view: it is written as
.tmp...and renamed only after the full header, text, payload and trailer are written successfully. If the direct write fails, the temp file is unlinked as before.Unknown-size payloads still use the old staged path. In practice this preserves the existing distributed-session behavior, since distributed payload size is not predicted by
ds4_session_payload_bytes().Implementation details:
ds4_session_save_payload_counted(), a small wrapper around the existingds4_session_save_payload()serializer;Benchmark
I used a long-context server run because this is where disk KV saves are large enough for the extra copy to matter. The benchmark does not try to show faster prefill, just measures the save path around normal long-context checkpoints.
Machine: Apple M2 Max, 64 GiB unified memory, Metal SSD streaming, 32GB routed expert cache.
Command:
./ds4-server \ -m ./ds4flash.gguf \ --ssd-streaming \ --ssd-streaming-cache-experts 32GB \ --ctx 524288 \ --kv-disk-dir "$CACHE" \ --kv-disk-space-mb 200000 \ --kv-cache-min-tokens 512 \ --kv-cache-cold-max-tokens 520192 \ --kv-cache-continued-interval-tokens 10000 \ --kv-cache-boundary-trim-tokens 32 \ --kv-cache-boundary-align-tokens 2048Request:
speed-bench/promessi_sposi.txt;model=deepseek-chat,thinking=false,temperature=0,max_tokens=1;For
main, I used a temporary timing-only log aroundds4_session_stage_payload()so the old path can be measured asserialize to temp + copy temp to cache. That instrumentation is not part of this PR.Per-checkpoint save times, excluding shutdown:
Tests
Additional live cache check on the direct-save branch, reusing the 128k-token benchmark cache:
cached_tokens=126976,cache_write_tokens=1214load=355.5 msfor the 1.69 GiB cold checkpoint