Support requantizing kvcache while model is loaded #24367
…or dequantize as needed - also expose current kvcache type name via GET /props
- remove unreachable v_trans branch - fix: change architecture check to allow all but recurrent - refac: rename endpoint to /cache/requantize - remove mtmd restriction from server task; mtmd works fine
- refactor task api to use optional types, defaulting to existing cache types - fix mem_other param when creating memory
Closing #24134 in favor of this; PR is ready for review.
ngxson left a comment
tbh I don't have a good feeling about this change. the use case of this feature is too narrow, I don't believe it's worth adding
> This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up
your logic breaks with multiple slots: one slot can be nearly full while another is not, and requantizing affects all slots.
} else {
    const size_t k_size_row_dst = ggml_row_size(k_type_dst, n_embd_k_gqa);

    std::vector<uint8_t> src_buf(cell_count * k_size_row);
    std::vector<uint8_t> dst_buf(cell_count * k_size_row_dst);

    io.read(src_buf.data(), src_buf.size());

    if (!kv_convert_rows(k_type_src, k_type_dst, src_buf.data(), dst_buf.data(), n_embd_k_gqa, cell_count)) {
        LLAMA_LOG_ERROR("%s: unable to convert between key types (layer %d)\n", __func__, il);
        return false;
    }

    if (sinfo.is_contiguous()) {
        // Fast path: contiguous cells, single memcpy
        ggml_backend_tensor_set(k, dst_buf.data(), sinfo.head() * k_size_row_dst, dst_buf.size());
    } else {
        // Slow path: scatter to non-contiguous positions
        for (uint32_t i = 0; i < cell_count; ++i) {
            const size_t dst_start = i * k_size_row_dst;
            const size_t dst_offset = sinfo.idxs[0][i] * k_size_row_dst;
            ggml_backend_tensor_set(k, dst_buf.data() + dst_start, dst_offset, k_size_row_dst);
        }
    }
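The write step in the excerpt above (one contiguous copy versus a per-cell scatter) can be illustrated in isolation. The following is a minimal, self-contained sketch, not the PR's code: `slot_info` and `write_rows` are hypothetical names, a plain byte vector stands in for the backend tensor, and `std::memcpy` stands in for `ggml_backend_tensor_set`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the slot info in the excerpt: the destination
// cell index of each restored row.
struct slot_info {
    std::vector<uint32_t> idxs;

    // True when the destination cells form one consecutive run.
    bool is_contiguous() const {
        for (size_t i = 1; i < idxs.size(); ++i) {
            if (idxs[i] != idxs[i - 1] + 1) {
                return false;
            }
        }
        return true;
    }

    uint32_t head() const { return idxs.empty() ? 0 : idxs[0]; }
};

// Copy rows that were already converted to the destination type into `cache`.
// `row_size` must be the row size of the DESTINATION type, since all offsets
// are computed in the destination layout.
void write_rows(std::vector<uint8_t> & cache, const std::vector<uint8_t> & rows,
                const slot_info & sinfo, size_t row_size) {
    const size_t cell_count = sinfo.idxs.size();
    if (cell_count == 0) {
        return;
    }
    if (sinfo.is_contiguous()) {
        // Fast path: a single copy covering every row.
        std::memcpy(cache.data() + sinfo.head() * row_size, rows.data(), cell_count * row_size);
    } else {
        // Slow path: one copy per row, each at its own cell offset.
        for (size_t i = 0; i < cell_count; ++i) {
            std::memcpy(cache.data() + sinfo.idxs[i] * row_size, rows.data() + i * row_size, row_size);
        }
    }
}
```

The point the sketch tries to make explicit is that once source and destination types differ, every offset on the write side has to use the destination row size, which is why the excerpt computes `k_size_row_dst` separately from `k_size_row`.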
this won't bring any additional benefit. the backend buffer for the KV cache is already allocated; requantizing it will shrink the active size but leave the total buffer size unchanged, leaving unused, wasted memory behind
How so? The requantize_memory method resets the kvcache pointer; does that not free the buffer?
The code you're referencing here is just cache restore once the kvcache already exists
in that case, I don't see how it's different from the current llama_state_get/set_data, which is exposed via the server slot save/restore API
the only thing that might be different is that llama_context needs to be re-created, but even so, it's not much different in terms of how it works
It's different because it quantizes the kvcache.
Before, kvcache takes up X GB VRAM because you're at 60k tokens at high precision.
After calling requantize to q8_0 k/v, kvcache takes up roughly X/2 GB VRAM, but you're still at 60k tokens and now at q8_0 precision.
You can continue with the same context without having to reprocess the prompt. And you were able to do the first part of generation at high precision.
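To put rough numbers on the "roughly X/2" figure: assuming "high precision" means f16 (2 bytes per element), and using ggml's q8_0 layout (blocks of 32 elements, each stored as a 2-byte scale plus 32 int8 values, so 34 bytes per block), a q8_0 row takes 34/64 of the space of an f16 row. The sketch below works that through. The model shape (1024 K/V elements per token per layer, 32 layers) is a made-up example, not Qwen's actual configuration, and the helper names are hypothetical; only the 60k token count comes from the comment above.

```cpp
#include <cstdint>

// Bytes for one row of n elements stored as f16: 2 bytes per element.
constexpr uint64_t row_size_f16(uint64_t n) { return n * 2; }

// Bytes for one row of n elements stored as q8_0: blocks of 32 elements,
// each block holding a 2-byte scale plus 32 int8 values (34 bytes).
// Assumes n is a multiple of 32.
constexpr uint64_t row_size_q8_0(uint64_t n) { return (n / 32) * 34; }

// Total bytes for K and V together, assuming both use the same row size.
constexpr uint64_t kv_bytes(uint64_t row_size, uint64_t n_tokens, uint64_t n_layers) {
    return 2 * row_size * n_tokens * n_layers;
}

// Hypothetical shape: 1024 elements per row, 32 layers, 60k tokens of context.
constexpr uint64_t bytes_f16  = kv_bytes(row_size_f16(1024),  60000, 32); // 7,864,320,000
constexpr uint64_t bytes_q8_0 = kv_bytes(row_size_q8_0(1024), 60000, 32); // 4,177,920,000
```

Under these assumptions the requantized cache is 17/32 (about 53%) of the original size, slightly more than half because of the per-block scale.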
The current server slot save/restore API doesn't free the backend buffers, so I don't see any actual gain if I use that path. (It also doesn't accept restoring a slot from a different quantization level; that's part of this PR)
If I want to free the backend buffers to use the smaller ones, I would need to completely unload the model and reload it with the new config.
Maybe this just doesn't make sense as an API endpoint? Have a look at my comment below describing the CLI usecase.
Does it make sense when it's implemented like that? I would be happy to implement the CLI version instead.
> The current server slot save/restore API doesn't free the backend buffers
you can free the whole llama_context and create a new one? or just save the state to disk --> create a new server instance --> load it?
I might be harsh here, but let me get this straight: I'm not convinced that this change is needed. it's just too much maintenance burden for a specific use case that can already be done via existing APIs
unless other maintainers say otherwise, I refuse to comment further on this subject
> you can free the whole llama_context and create a new one?
that's what requantize_memory does. there's no api for this otherwise
> or just save the state to disk --> create a new server instance --> load it?
well yes, but as i described below, that's extremely slow.
> I'm not convinced that this change is needed. it's just too much maintenance burden for a specific use case that can already be done via existing APIs
okay, fair enough. thanks for your time.
As an HTTP endpoint, I think I agree. You would need to integrate it into whatever harness/app you're using, which people may not want to do. As a CLI flag, I feel like a lot of people would want to use this - at least I personally really want this!
What about this is broken? requantize is meant to target the entire kvcache - just like when you start the server, the entire kvcache is at a set ctk/ctv. Would you prefer to see a per-slot requantize? I could look into that, but the implementation would look very different, given the server doesn't currently support setting ctk/ctv per-slot (unless i missed that?)
FWIW as a CLI flag, I envision something that slots into how The idea would be that doing So, e.g. I feel like the usecase for that is "anyone that's willing to quantize their kvcache"
what about letting users create different servers with different ctk/ctv, then transferring state via the slot save/restore function?
I would need to think about this more, but:
I think there's something I'm not making entirely clear here: for me, unloading/reloading Qwen3.5 27B takes 3-5 seconds, and re-processing 100k tokens can take over 10 seconds. If I want to requantize my kvcache currently, that's 13-15 seconds of waiting. With this method, it's about half a second. And I'm on a fairly nice GPU (5090). On other devices, the time saved is even greater.
Overview
This PR adds a LLAMA_API method llama_requantize_memory, which allows a kvcache to be quantized without having to unload the current model.
llama_requantize_memory(ggml_type ctk, ggml_type ctv) takes the cache types ctk and ctv.
To support this method, I modified llama_kv_cache::state_read_data to convert between k/v types as a slot is restored.
Additionally, I exposed this API via an HTTP endpoint, POST /cache/requantize, which accepts values for ctk, ctv, ctkd, and ctvd to be applied to the server's current kvcache.
Note: I have tested this with Qwen3 and Qwen3.5 using -kvu and -np >= 1, with and without mmproj. This works fine. What's not fine is Hadamard rotation -- f16 to q8_0 is broken with attention rotation on (works fine without). I'm not sure where to add support for this; I think I could add a step to "maybe rotate" in the slot restoration path, but that involves piping a flag through several methods. I wanted to get maintainer feedback before I make a decision here.
Additional information
Motivation
I want to be able to quantize my kvcache on the fly, so that my inference setup can start at high precision and step down as context fills, rather than sitting at low precision for the entire session.
Currently, the only way to achieve this is to unload+reload the entire model, which is slow and also requires re-processing the existing prompt. By exposing a /cache/requantize endpoint, I can unload+reload just the kvcache and avoid having to do prompt processing a second time.
This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up, but I went with HTTP for now because it was easier to test directly. If a CLI flag is preferred, I'd be happy to implement that instead.
Requirements
I reviewed, refactored, and edited all the code in this PR and accept full responsibility.