Support requantizing kvcache while model is loaded by wadealexc · Pull Request #24367 · ggml-org/llama.cpp

wadealexc · 2026-06-09T15:47:27Z

Overview

This PR adds a LLAMA_API method llama_requantize_memory, which allows a kvcache to be quantized without having to unload the current model.

llama_requantize_memory(ggml_type ctk, ggml_type ctv):

- reads the state from the existing kvcache, then tears it down
- creates a new kvcache quantized to the provided ctk and ctv
- restores the old cache's state, quantizing as needed
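The three steps above can be sketched with a toy cache. This is only an illustration of the read, tear down, recreate, restore ordering; `toy_cache`, `toy_state_write`, `toy_state_read`, and `toy_requantize` are hypothetical names and are not part of llama.cpp or of this PR, and the single-scale int8 format is a stand-in for a real ggml quantized type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names exist in llama.cpp.
enum class toy_type { F32, I8 };

struct toy_cache {
    toy_type            type;
    std::vector<float>  f32;          // payload when type == F32
    std::vector<int8_t> i8;           // payload when type == I8
    float               scale = 1.0f; // single scale for the I8 payload

    explicit toy_cache(toy_type t) : type(t) {}
};

// Serialize the cache contents (here: as plain floats).
std::vector<float> toy_state_write(const toy_cache & c) {
    if (c.type == toy_type::F32) {
        return c.f32;
    }
    std::vector<float> out(c.i8.size());
    for (size_t i = 0; i < out.size(); ++i) {
        out[i] = c.i8[i] * c.scale;
    }
    return out;
}

// Load serialized contents, converting to the cache's own type.
void toy_state_read(toy_cache & c, const std::vector<float> & state) {
    if (c.type == toy_type::F32) {
        c.f32 = state;
        return;
    }
    float amax = 0.0f;
    for (float v : state) {
        amax = std::max(amax, std::fabs(v));
    }
    c.scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    c.i8.resize(state.size());
    for (size_t i = 0; i < state.size(); ++i) {
        c.i8[i] = static_cast<int8_t>(std::lround(state[i] / c.scale));
    }
}

std::unique_ptr<toy_cache> toy_requantize(std::unique_ptr<toy_cache> old_cache, toy_type new_type) {
    assert(old_cache != nullptr);
    const std::vector<float> state = toy_state_write(*old_cache); // read the existing state
    old_cache.reset();                                            // tear down: old buffers are freed here
    auto fresh = std::make_unique<toy_cache>(new_type);           // create the cache at the new type
    toy_state_read(*fresh, state);                                // restore, quantizing as needed
    return fresh;
}
```

The old cache is released before the new one is allocated, so in this sketch the two payloads never coexist; only the serialized snapshot is held across the swap.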

To support this method, I modified llama_kv_cache::state_read_data to convert between k/v types as a slot is restored.
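As a rough picture of what converting rows on restore involves, here is a minimal sketch of a q8_0-style row conversion: 32 values per block, one scale per block, int8 quants. It is not the PR's conversion code and does not call ggml; `block_q8_sketch`, `quantize_row_sketch`, and `dequantize_row_sketch` are hypothetical names, and the scale is kept as a `float` for brevity where ggml's q8_0 stores it as fp16.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK = 32; // values per block, the same block length ggml uses for q8_0

// Simplified block layout (assumption: float scale instead of ggml's fp16 scale).
struct block_q8_sketch {
    float  d;      // per-block scale
    int8_t qs[QK]; // quantized values
};

// Quantize one row of n floats; n must be a multiple of QK.
std::vector<block_q8_sketch> quantize_row_sketch(const float * x, int n) {
    assert(n % QK == 0);
    std::vector<block_q8_sketch> out(n / QK);
    for (int b = 0; b < n / QK; ++b) {
        float amax = 0.0f; // largest magnitude in this block
        for (int j = 0; j < QK; ++j) {
            amax = std::max(amax, std::fabs(x[b * QK + j]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int j = 0; j < QK; ++j) {
            out[b].qs[j] = static_cast<int8_t>(std::lround(x[b * QK + j] * id));
        }
    }
    return out;
}

// Expand blocks back to floats.
std::vector<float> dequantize_row_sketch(const std::vector<block_q8_sketch> & blocks) {
    std::vector<float> y(blocks.size() * QK);
    for (size_t b = 0; b < blocks.size(); ++b) {
        for (int j = 0; j < QK; ++j) {
            y[b * QK + j] = blocks[b].qs[j] * blocks[b].d;
        }
    }
    return y;
}
```

Restoring a saved f16 or f32 row into a quantized cache amounts to running a conversion of this kind per row before writing it to the destination tensor; the reverse direction is the dequantize step.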

Additionally, I exposed this API via an HTTP endpoint POST /cache/requantize, which accepts values for ctk, ctv, ctkd, and ctvd to be applied to the server's current kvcache.

Note: I have tested this with Qwen3 and Qwen3.5 using -kvu and -np >= 1; with and without mmproj. This works fine. What's not fine is Hadamard rotation -- f16 to q8_0 is broken with attention rotation on (works fine without). I'm not sure where to add in support for this; I think I could add a step to "maybe rotate" in the slot restoration path, but that involves piping a flag through several methods. I wanted to get maintainer feedback before I make a decision here.

Additional information

Motivation

I want to be able to quantize my kvcache on the fly, so that my inference setup can start at high precision and step down as context fills, rather than sitting at low precision for the entire session.

Currently, the only way to achieve this is to unload+reload the entire model, which is slow and also requires re-processing the existing prompt. By exposing a /cache/requantize endpoint, I can unload+reload just the kvcache and avoid having to do prompt processing a second time.

This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up, but I went with HTTP for now because it was easier to test directly. If a CLI flag is preferred, I'd be happy to implement that instead.

Requirements

I have read and agree with the contributing guidelines: Yes!
AI usage disclosure: Yes, I used Qwen3.5-27B and Opus 4.7 to:
- Help me research kvcache architecture and server internals
- Help write the spec and implementation
- Help test the implementation

I reviewed, refactored, and edited all the code in this PR and accept full responsibility.

wadealexc · 2026-06-09T15:48:24Z

Closing #24134 in favor of this; PR is ready for review.

ngxson

tbh I don't have a good feeling about this change. the use case of this feature is too narrow, I don't believe it's worth adding

> This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up

your logic breaks in the case of multiple slots: one slot can be nearly full while another is not, and requantizing affects all slots.

ngxson · 2026-06-09T16:01:44Z

+        } else {
+            const size_t k_size_row_dst = ggml_row_size(k_type_dst, n_embd_k_gqa);
+
+            std::vector<uint8_t> src_buf(cell_count * k_size_row);
+            std::vector<uint8_t> dst_buf(cell_count * k_size_row_dst);
+
+            io.read(src_buf.data(), src_buf.size());
+
+            if (!kv_convert_rows(k_type_src, k_type_dst, src_buf.data(), dst_buf.data(), n_embd_k_gqa, cell_count)) {
+                LLAMA_LOG_ERROR("%s: unable to convert between key types (layer %d)\n", __func__, il);
+                return false;
+            }
+
+            if (sinfo.is_contiguous()) {
+                // Fast path: contiguous cells, single memcpy
+                ggml_backend_tensor_set(k, dst_buf.data(), sinfo.head() * k_size_row_dst, dst_buf.size());
+            } else {
+                // Slow path: scatter to non-contiguous positions
+                for (uint32_t i = 0; i < cell_count; ++i) {
+                    const size_t dst_start = i * k_size_row_dst;
+                    const size_t dst_offset = sinfo.idxs[0][i] * k_size_row_dst;
+                    ggml_backend_tensor_set(k, dst_buf.data() + dst_start, dst_offset, k_size_row_dst);
+                }
+            }
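The two write paths in this hunk can be illustrated on a plain byte buffer. This is a sketch only: `place_rows` is a hypothetical helper, it uses `std::memcpy` on host memory rather than `ggml_backend_tensor_set`, and it detects contiguity itself instead of relying on the slot info object.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy tightly packed rows into dst at the given cell indices (row i goes to cell idxs[i]).
void place_rows(std::vector<uint8_t> & dst, const std::vector<uint8_t> & rows,
                const std::vector<uint32_t> & idxs, size_t row_size) {
    assert(rows.size() == idxs.size() * row_size);
    if (idxs.empty()) {
        return;
    }
    bool contiguous = true;
    for (size_t i = 0; i < idxs.size(); ++i) {
        assert((idxs[i] + 1) * row_size <= dst.size()); // every target cell must exist
        if (i > 0 && idxs[i] != idxs[i - 1] + 1) {
            contiguous = false;
        }
    }
    if (contiguous) {
        // Fast path: the cells form one run, so a single copy covers all rows.
        std::memcpy(dst.data() + idxs[0] * row_size, rows.data(), rows.size());
    } else {
        // Slow path: scatter each row to its own cell.
        for (size_t i = 0; i < idxs.size(); ++i) {
            std::memcpy(dst.data() + idxs[i] * row_size, rows.data() + i * row_size, row_size);
        }
    }
}
```

Because the row size depends on the destination type, offsets in both paths are computed from the converted row size, not the size of the rows as they were saved.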


this won't bring any additional benefits. the backend buffer for the KV cache is already allocated; requantizing it will shrink the active size but leave the total buffer size unchanged, leaving behind unused, wasted memory

How so? The requantize_memory method resets the kvcache pointer; does that not free the buffer?

The code you're referencing here is just cache restore once the kvcache already exists

in that case, I don't see how it's different from the current llama_state_get/set_data, which is exposed via the server slot save/restore API

the only thing that might be different is that llama_context needs to be re-created, but even so, it's not much different in terms of how it works

It's different because it quantizes the kvcache.

Before, kvcache takes up X GB VRAM because you're at 60k tokens at high precision.

After calling requantize to q8_0 k/v, kvcache takes up roughly X/2 GB VRAM, but you're still at 60k tokens and now at q8_0 precision.

You can continue with the same context without having to reprocess the prompt. And you were able to do the first part of generation at high precision.
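To put rough numbers on the "roughly X/2" estimate, here is a small size calculation, assuming the high-precision cache is f16. The block layouts are ggml's (f16 is 2 bytes per value; q8_0 is 34 bytes per 32 values; q4_0 is 18 bytes per 32 values), but the model dimensions used in the usage note are made up for illustration, and `type_layout`, `row_size`, and `kv_bytes` are hypothetical names rather than llama.cpp functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes per block and values per block for a cache type.
struct type_layout {
    size_t block_bytes;
    size_t block_elems;
};

const type_layout LAYOUT_F16  = { 2, 1 };   // one fp16 value
const type_layout LAYOUT_Q8_0 = { 34, 32 }; // fp16 scale + 32 int8 values
const type_layout LAYOUT_Q4_0 = { 18, 32 }; // fp16 scale + 32 4-bit values

// Bytes needed to store one row of n_embd values.
size_t row_size(type_layout t, size_t n_embd) {
    assert(n_embd % t.block_elems == 0);
    return n_embd / t.block_elems * t.block_bytes;
}

// Total K + V bytes for n_tokens cached positions across n_layer layers.
uint64_t kv_bytes(type_layout k, type_layout v, size_t n_embd_k, size_t n_embd_v,
                  size_t n_layer, size_t n_tokens) {
    const uint64_t per_token_per_layer = row_size(k, n_embd_k) + row_size(v, n_embd_v);
    return per_token_per_layer * n_layer * n_tokens;
}
```

With an assumed 32 layers and 1024 K/V values per token per layer, `kv_bytes(LAYOUT_F16, LAYOUT_F16, 1024, 1024, 32, 60000)` gives 7,864,320,000 bytes and the same call with `LAYOUT_Q8_0` gives 4,177,920,000 bytes, a ratio of 34/64, or about 0.53.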

The current server slot save/restore API doesn't free the backend buffers, so I don't see any actual gain if I use that path. (It also doesn't accept restoring a slot from a different quantization level; that's part of this PR)

If I want to free the backend buffers to use the smaller ones, I would need to completely unload the model and reload it with the new config.

Maybe this just doesn't make sense as an API endpoint? Have a look at my comment below describing the CLI usecase.

Does it make sense when it's implemented like that? I would be happy to implement the CLI version instead.

> The current server slot save/restore API doesn't free the backend buffers

you can free the whole llama_context and create a new one? or just save the state to disk --> create a new server instance --> load it?

I might be hard here, but let me get this straight: I'm not convinced that this change is needed. it's just too much maintenance burden for a specific use case that can already be done via existing APIs

unless other maintainers say otherwise, I refuse to comment further on this subject

> you can free the whole llama_context and create a new one?

that's what requantize_memory does. there's no api for this otherwise

> or just save the state to disk --> create a new server instance --> load it?

well yes, but as i described below, that's extremely slow.

> I'm not convinced that this change is needed. it's just too much maintenance burden for a specific use case that can already be done via existing APIs

okay, fair enough. thanks for your time.

wadealexc · 2026-06-09T16:31:14Z

> tbh I don't have a good feeling about this change. the use case of this feature is too narrow, I don't believe it's worth adding

As an HTTP endpoint, I think I agree. You would need to integrate it into whatever harness/app you're using, which people may not want to do. As a CLI flag, I feel like a lot of people would want to use this - at least I personally really want this!

> > This functionality might fit better as a CLI flag that automatically quantizes your kvcache as your device fills up
>
> your logic breaks in the case of multiple slots: one slot can be nearly full while another is not, and requantizing affects all slots.

What about this is broken? requantize is meant to target the entire kvcache - just like when you start the server, the entire kvcache is at a set ctk/ctv.

Would you prefer to see a per-slot requantize? I could look into that, but the implementation would look very different, given the server doesn't currently support setting ctk/ctv per-slot (unless i missed that?)

wadealexc · 2026-06-09T16:50:08Z

FWIW as a CLI flag, I envision something that slots into how --fit currently works.

The idea would be that doing --fit plus --dynamic-cache would fit model+cache into your device at high precision initially, but also calculate different context thresholds at which the cache gets automatically quantized. There's a minimum ctx size argument currently; something like a 'minimum precision' argument could have it calculate context limits at max precision, but also set up a watcher that automatically quantizes the kvcache once the context limit is reached.

So, e.g. --fit --dynamic-cache --min_k q8_0 --min_v q8_0: initially calculates we can fit 60k tokens of context at f16. Once we hit ~60k, the cache is quantized to q8_0, and we're able to expand the context window.

I feel like the usecase for that is "anyone that's willing to quantize their kvcache"

ngxson · 2026-06-09T17:18:41Z

> I feel like the usecase for that is "anyone that's willing to quantize their kvcache"

what about letting users create different servers with different ctk/ctv, then transfer state via the slot save/restore function?

wadealexc · 2026-06-09T17:22:11Z

> > I feel like the usecase for that is "anyone that's willing to quantize their kvcache"
>
> what about letting users create different servers with different ctk/ctv, then transfer state via the slot save/restore function?

I would need to think about this more, but:

- server slot save/restore doesn't allow restoring across quantization levels. you would need to do the quantization outside of the server
- using this would be way harder, because both servers want to use the same backend device. You would need to coordinate unloading from server 1 before you load onto server 2. I think this would be difficult, and I'm not convinced it's faster than just killing server 1 and restarting it with a new config.
- both server 1 and server 2 want to have the same model loaded, but that's not something they can share. I would need to tell server 1 "unload your model and your kvcache" before passing the state to server 2. at that point I should just restart server 1.

wadealexc · 2026-06-09T17:25:49Z

I think there's something I'm not making entirely clear here: requantize_memory allows me to select a new quantization level without having to unload + reload the model (or reprocess the prompt!).

For me, unloading/reloading Qwen3.5 27B takes 3-5 seconds, and re-processing 100k tokens can take over 10 seconds. If I want to requantize my kvcache currently, that's 13-15 seconds waiting. With this method, it's like half a second. And I'm on a fairly nice GPU (5090). For other devices, this is an even better time save.

wadealexc added 4 commits June 9, 2026 11:41

feat(llama-server): when restoring from slot, automatically quantize …

4b8e60c

…or dequantize as needed - also expose current kvcache type name via GET /props

feat(llama-server): add POST /requantize_kvcache endpoint

21a0b4e

refactor: clean up implementation

4875dc7

- remove unreachable v_trans branch
- fix: change architecture check to allow all but recurrent
- refac: rename endpoint to /cache/requantize
- remove mtmd restriction from server task; mtmd works fine

feat: add support for draft models

1b8cfd8

- refactor task api to use optional types, defaulting to existing cache types
- fix mem_other param when creating memory

wadealexc requested review from a team and ggerganov as code owners June 9, 2026 15:47

github-actions Bot added examples server labels Jun 9, 2026

ngxson requested changes Jun 9, 2026
