Support requantizing kvcache while model is loaded #24367

Open: wadealexc wants to merge 4 commits into ggml-org:master from wadealexc:support-requantize-memory

@wadealexc

Overview

This PR adds a LLAMA_API method llama_requantize_memory, which allows a kvcache to be quantized without having to unload the current model.

llama_requantize_memory(ggml_type ctk, ggml_type ctv):

  • reads the state from the existing kvcache, then tears it down
  • creates a new kvcache quantized to the provided ctk and ctv
  • restores the old cache's state, quantizing as needed
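
The three steps above can be sketched with a toy cache. This is only an illustration of the save, tear down, rebuild, restore flow: `toy_cache`, `toy_type`, and `toy_requantize` are hypothetical names, not llama.cpp APIs, and the toy uses one float scale per row instead of ggml's block formats.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; not part of llama.cpp.
enum class toy_type { F32, Q8 };

struct toy_cache {
    toy_type type;
    size_t   n_embd;
    size_t   n_rows = 0;
    std::vector<float>  f32;    // row-major values when type == F32
    std::vector<int8_t> q8;     // row-major quantized values when type == Q8
    std::vector<float>  scale;  // one scale per row when type == Q8

    toy_cache(toy_type t, size_t n) : type(t), n_embd(n) {}

    // Store one row, quantizing it if the cache type requires it.
    void append_row(const std::vector<float> & x) {
        if (type == toy_type::F32) {
            f32.insert(f32.end(), x.begin(), x.end());
        } else {
            float amax = 0.0f;
            for (float v : x) amax = std::max(amax, std::fabs(v));
            const float d  = amax / 127.0f;
            const float id = d != 0.0f ? 1.0f / d : 0.0f;
            for (float v : x) q8.push_back((int8_t) std::lround(v * id));
            scale.push_back(d);
        }
        ++n_rows;
    }

    // Read one row back as floats, dequantizing if needed.
    std::vector<float> read_row(size_t i) const {
        std::vector<float> out(n_embd);
        for (size_t j = 0; j < n_embd; ++j) {
            out[j] = type == toy_type::F32 ? f32[i * n_embd + j]
                                           : scale[i] * q8[i * n_embd + j];
        }
        return out;
    }

    size_t bytes() const {
        return f32.size() * sizeof(float) + q8.size() + scale.size() * sizeof(float);
    }
};

void toy_requantize(std::unique_ptr<toy_cache> & cache, toy_type new_type) {
    // 1. read the state out of the existing cache
    std::vector<std::vector<float>> saved;
    for (size_t i = 0; i < cache->n_rows; ++i) saved.push_back(cache->read_row(i));
    const size_t n_embd = cache->n_embd;

    // 2. tear the old cache down so its storage is released
    cache.reset();

    // 3. create a cache of the new type and restore, converting row by row
    cache = std::make_unique<toy_cache>(new_type, n_embd);
    for (const auto & row : saved) cache->append_row(row);
}
```

The old storage is released before the new cache is allocated, so the two never coexist; the price is a temporary copy of the state while the swap happens.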

To support this method, I modified llama_kv_cache::state_read_data to convert between k/v types as a slot is restored.

Additionally, I exposed this API via an HTTP endpoint POST /cache/requantize, which accepts values for ctk, ctv, ctkd, and ctvd to be applied to the server's current kvcache.

Note: I have tested this with Qwen3 and Qwen3.5 using -kvu and -np >= 1; with and without mmproj. This works fine. What's not fine is Hadamard rotation -- f16 to q8_0 is broken with attention rotation on (works fine without). I'm not sure where to add in support for this; I think I could add a step to "maybe rotate" in the slot restoration path, but that involves piping a flag through several methods. I wanted to get maintainer feedback before I make a decision here.

Additional information

Motivation

I want to be able to quantize my kvcache on the fly, so that my inference setup can start at high precision and step down as context fills, rather than sitting at low precision for the entire session.

Currently, the only way to achieve this is to unload+reload the entire model, which is slow and also requires re-processing the existing prompt. By exposing a /cache/requantize endpoint, I can unload+reload just the kvcache and avoid having to do prompt processing a second time.

This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up, but I went with HTTP for now because it was easier to test directly. If a CLI flag is preferred, I'd be happy to implement that instead.

Requirements

  • I have read and agree with the contributing guidelines: Yes!
  • AI usage disclosure: Yes, I used Qwen3.5-27B and Opus 4.7 to:
    • Help me research kvcache architecture and server internals
    • Help write the spec and implementation
    • Help test the implementation

I reviewed, refactored, and edited all the code in this PR and accept full responsibility.

wadealexc added 4 commits June 9, 2026 11:41
…or dequantize as needed

- also expose current kvcache type name via GET /props
- remove unreachable v_trans branch

- fix: change architecture check to allow all but recurrent
- refac: rename endpoint to /cache/requantize

- remove mtmd restriction from server task; mtmd works fine
- refactor task api to use optional types, defaulting to existing cache types
- fix mem_other param when creating memory
@wadealexc requested review from a team and ggerganov as code owners June 9, 2026 15:47
@wadealexc

Author

Closing #24134 in favor of this; PR is ready for review.

@ngxson left a comment

Collaborator

tbh I don't have a good feeling about this change. the use case of this feature is too narrow, I don't believe it's worth adding

> This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up

your logic breaks with multiple slots: one slot can be nearly full while another is not, and requantizing affects all slots.

Comment thread src/llama-kv-cache.cpp
Comment on lines +2416 to +2439
} else {
    const size_t k_size_row_dst = ggml_row_size(k_type_dst, n_embd_k_gqa);

    std::vector<uint8_t> src_buf(cell_count * k_size_row);
    std::vector<uint8_t> dst_buf(cell_count * k_size_row_dst);

    io.read(src_buf.data(), src_buf.size());

    if (!kv_convert_rows(k_type_src, k_type_dst, src_buf.data(), dst_buf.data(), n_embd_k_gqa, cell_count)) {
        LLAMA_LOG_ERROR("%s: unable to convert between key types (layer %d)\n", __func__, il);
        return false;
    }

    if (sinfo.is_contiguous()) {
        // Fast path: contiguous cells, single memcpy
        ggml_backend_tensor_set(k, dst_buf.data(), sinfo.head() * k_size_row_dst, dst_buf.size());
    } else {
        // Slow path: scatter to non-contiguous positions
        for (uint32_t i = 0; i < cell_count; ++i) {
            const size_t dst_start = i * k_size_row_dst;
            const size_t dst_offset = sinfo.idxs[0][i] * k_size_row_dst;
            ggml_backend_tensor_set(k, dst_buf.data() + dst_start, dst_offset, k_size_row_dst);
        }
    }

Collaborator

this won't bring any additional benefit. the backend buffer for the KV cache is already allocated; requantizing it will shrink the active size but leave the total buffer size unchanged, leaving unused, wasted memory behind

@wadealexc Jun 9, 2026

Author

How so? The requantize_memory method resets the kvcache pointer; does that not free the buffer?

The code you're referencing here is just cache restore once the kvcache already exists

Collaborator

in that case, I don't see how it's different from the current llama_state_get/set_data, which is exposed via the server slot save/restore API

the only thing that might be different is that llama_context needs to be re-created, but even so, it's not much different in terms of how it works

@wadealexc Jun 9, 2026

Author

It's different because it quantizes the kvcache.

Before, kvcache takes up X GB VRAM because you're at 60k tokens at high precision.

After calling requantize to q8_0 k/v, kvcache takes up roughly X/2 GB VRAM, but you're still at 60k tokens and now at q8_0 precision.

You can continue with the same context without having to reprocess the prompt. And you were able to do the first part of generation at high precision.
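
For a sense of scale, here is a minimal sketch of the arithmetic behind "roughly X/2". It relies on the ggml storage layouts: f16 uses 2 bytes per value, and q8_0 packs each block of 32 values into 34 bytes (a 2-byte scale plus 32 int8 values). The helper names and the model shape in the usage note are assumptions for illustration, not values taken from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helpers; these are not llama.cpp functions.
enum class kv_type { F16, Q8_0 };

// Bytes needed to store one row of n values (n assumed to be a multiple of 32).
constexpr uint64_t row_bytes(kv_type t, uint64_t n) {
    return t == kv_type::F16 ? n * 2          // 2 bytes per value
                             : (n / 32) * 34; // 34 bytes per block of 32 values
}

// Total K + V bytes for n_tokens cached tokens across n_layer layers.
constexpr uint64_t kv_bytes(kv_type tk, kv_type tv,
                            uint64_t n_layer,
                            uint64_t n_embd_k_gqa, uint64_t n_embd_v_gqa,
                            uint64_t n_tokens) {
    return n_layer * n_tokens * (row_bytes(tk, n_embd_k_gqa) + row_bytes(tv, n_embd_v_gqa));
}
```

With an assumed shape of 32 layers and 1024-wide K and V rows, 60,000 tokens come to 7,864,320,000 bytes at f16 and 4,177,920,000 bytes at q8_0, a ratio of 34/64 ≈ 0.53.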

@wadealexc Jun 9, 2026

Author

The current server slot save/restore API doesn't free the backend buffers, so I don't see any actual gain if I use that path. (It also doesn't accept restoring a slot from a different quantization level; that's part of this PR)

If I want to free the backend buffers to use the smaller ones, I would need to completely unload the model and reload it with the new config.

Author

Maybe this just doesn't make sense as an API endpoint? Have a look at my comment below describing the CLI usecase.

Does it make sense when it's implemented like that? I would be happy to implement the CLI version instead.

@ngxson Jun 9, 2026

Collaborator

> The current server slot save/restore API doesn't free the backend buffers

you can free the whole llama_context and create a new one? or just save the state to disk --> create a new server instance --> load it?


I might be harsh here, but let me be straight: I'm not convinced that this change is needed. it's just too much maintenance burden for a specific use case that can already be done via existing APIs

unless other maintainers say otherwise, I refuse to comment further on this subject

@wadealexc Jun 9, 2026

Author

> you can free the whole llama_context and create a new one?

that's what requantize_memory does. there's no api for this otherwise

> or just save the state to disk --> create a new server instance --> load it?

well yes, but as i described below, that's extremely slow.

> I'm not convinced that this change is needed. it's just too much maintenance burden for a specific use case that can already be done via existing APIs

okay, fair enough. thanks for your time.

@wadealexc commented Jun 9, 2026

Author

> tbh I don't have a good feeling about this change. the use case of this feature is too narrow, I don't believe it's worth adding

As an HTTP endpoint, I think I agree. You would need to integrate it into whatever harness/app you're using, which people may not want to do. As a CLI flag, I feel like a lot of people would want to use this - at least I personally really want this!

> > This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up
>
> your logic breaks with multiple slots: one slot can be nearly full while another is not, and requantizing affects all slots.

What about this is broken? requantize is meant to target the entire kvcache - just like when you start the server, the entire kvcache is at a set ctk/ctv.

Would you prefer to see a per-slot requantize? I could look into that, but the implementation would look very different, given the server doesn't currently support setting ctk/ctv per-slot (unless i missed that?)

@wadealexc commented Jun 9, 2026

Author

FWIW as a CLI flag, I envision something that slots into how --fit currently works.

The idea would be that doing --fit plus --dynamic-cache would fit model+cache into your device at high precision initially, but also calculate different context thresholds at which the cache gets automatically quantized. There's a minimum ctx size argument currently; something like a 'minimum precision' argument could have it calculate context limits at max precision, but also set up a watcher that automatically quantizes the kvcache once the context limit is reached.

So, e.g. --fit --dynamic-cache --min_k q8_0 --min_v q8_0: initially calculates we can fit 60k tokens of context at f16. Once we hit ~60k, the cache is quantized to q8_0, and we're able to expand the context window.
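
The threshold idea can be sketched as a lookup over a "precision ladder": given a fixed byte budget, pick the highest-precision cache type that still fits the current token count. Everything here is hypothetical, including the `level` struct, `pick_level`, and the per-token byte figures (which assume 32 layers with 1024-wide K and V rows); it is not how `--fit` is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one precision level, ordered from highest
// precision (largest bytes_per_token) to lowest.
struct level {
    const char * name;
    uint64_t     bytes_per_token; // K + V bytes per cached token, all layers
};

// Returns the index of the first (highest-precision) level whose cache fits
// n_tokens within budget_bytes, or -1 if even the lowest level does not fit.
int pick_level(const std::vector<level> & ladder, uint64_t budget_bytes, uint64_t n_tokens) {
    for (size_t i = 0; i < ladder.size(); ++i) {
        if (n_tokens * ladder[i].bytes_per_token <= budget_bytes) {
            return (int) i;
        }
    }
    return -1;
}
```

For example, with a ladder of `{"f16", 131072}` and `{"q8_0", 69632}` and an 8 GiB budget, the f16 level holds up to 65,536 tokens; one token past that, `pick_level` moves to q8_0, which holds up to 123,361 tokens under the same budget. A watcher would trigger a requantize whenever the returned index changes.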

I feel like the usecase for that is "anyone that's willing to quantize their kvcache"

@ngxson commented Jun 9, 2026

Collaborator

> I feel like the usecase for that is "anyone that's willing to quantize their kvcache"

what about letting the user create different servers with different ctk/ctv, then transfer state via the slot save/restore function?

@wadealexc commented Jun 9, 2026

Author

> > I feel like the usecase for that is "anyone that's willing to quantize their kvcache"
>
> what about letting the user create different servers with different ctk/ctv, then transfer state via the slot save/restore function?

I would need to think about this more, but:

  1. server slot save/restore doesn't allow restoring across quantization levels. you would need to do the quantization outside of the server
  2. using this would be way harder, because both servers want to use the same backend device. You would need to coordinate unloading from server 1 before you load onto server 2. I think this would be difficult and I'm not convinced it's faster than just killing server 1 and restarting it with a new config.
  3. both server 1 and 2 want to have the same model loaded, but that's not something they can share. I would need to tell server 1 "unload your model and your kvcache" before passing the state to server 2. at that point I should just restart server 1.

@wadealexc

Author

I think there's something I'm not making entirely clear here: requantize_memory allows me to select a new quantization level without having to unload + reload the model (and without reprocessing the prompt!).

For me, unloading/reloading Qwen3.5 27B takes 3-5 seconds, and re-processing 100k tokens can take over 10 seconds. If I want to requantize my kvcache currently, that's 13-15 seconds waiting. With this method, it's like half a second. And I'm on a fairly nice GPU (5090). For other devices, this is an even better time save.
