CUDA: extend K-type validation to V-types for flash attention by sanmai · Pull Request #24403 · ggml-org/llama.cpp

sanmai · 2026-06-10T09:19:09Z

Overview

Extends the K-type validation to V-types in ggml_cuda_get_best_fattn_kernel.

Additional information

With GGML_CUDA_FA_ALL_QUANTS=ON, all K/V cache type combinations can be requested, and some unsupported combinations crash instead of falling back to the CPU backend. For example, -ctk iq4_nl -ctv q8_0 falls back to the CPU, but -ctk q8_0 -ctv iq4_nl crashes with the following backtrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ff360aa5ffe in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#0  0x00007ff360aa5ffe in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff360a9a7a4 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ff360a9a7ed in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ff360b06c07 in wait4 () from /usr/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ff360f1d973 in ggml_print_backtrace () from bin/libggml-base.so.0
#5  0x00007ff360f1dabf in ggml_abort () from bin/libggml-base.so.0
#6  0x00007ff35c609b33 in ggml_cuda_flash_attn_ext(ggml_backend_cuda_context&, ggml_tensor*) () from bin/libggml-cuda.so.0
#7  0x00007ff35c65ba26 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from bin/libggml-cuda.so.0
#8  0x00007ff35c65bfd2 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from bin/libggml-cuda.so.0
#9  0x00007ff360f39953 in ggml_backend_sched_graph_compute_async () from bin/libggml-base.so.0
#10 0x00007ff3600cb3b0 in llama_context::graph_compute(ggml_cgraph*, bool) () from bin/libllama.so.0
#11 0x00007ff3600cdfd5 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from bin/libllama.so.0
#12 0x00007ff3600d4ac5 in llama_context::decode(llama_batch const&) () from bin/libllama.so.0
#13 0x00007ff3600d66be in llama_decode () from bin/libllama.so.0
#14 0x00007ff3605ec61c in common_init_from_params(common_params&, bool) () from bin/libllama-common.so.0
#15 0x00007ff361174540 in server_context_impl::load_model(common_params&) () from bin/libllama-server-impl.so
#16 0x00007ff3610c5542 in llama_server(int, char**) () from bin/libllama-server-impl.so
#17 0x00007ff360a31f77 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#18 0x00007ff360a32027 in __libc_start_main () from /usr/lib/x86_64-linux-gnu/libc.so.6
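
The asymmetry described above can be illustrated with a minimal sketch. This is not the actual code in ggml_cuda_get_best_fattn_kernel: the enums, the supported-type set, and the function names below are hypothetical stand-ins, and the sketch assumes that returning "none" is what lets the caller fall back to another backend instead of aborting later.

```cpp
// Hypothetical stand-ins for illustration only; NOT the real ggml types or kernels.
enum class cache_type { f16, q4_0, q8_0, iq4_nl };
enum class fattn_kernel { none, vec };

// Assumption: the set of cache types the GPU kernels can handle in this sketch.
constexpr bool type_supported(cache_type t) {
    switch (t) {
        case cache_type::f16:
        case cache_type::q4_0:
        case cache_type::q8_0:
            return true;
        default:
            return false;
    }
}

// Before: only the K type is validated, so an unsupported V type slips through
// and a kernel is selected that cannot actually run.
constexpr fattn_kernel pick_kernel_k_only(cache_type k, cache_type /*v*/) {
    if (!type_supported(k)) {
        return fattn_kernel::none;
    }
    return fattn_kernel::vec;
}

// After: the same validation is applied to both K and V, so either unsupported
// type yields "none" and the caller can fall back.
constexpr fattn_kernel pick_kernel_k_and_v(cache_type k, cache_type v) {
    if (!type_supported(k) || !type_supported(v)) {
        return fattn_kernel::none;
    }
    return fattn_kernel::vec;
}

// Mirrors the two command lines from the description.
static_assert(pick_kernel_k_only(cache_type::iq4_nl, cache_type::q8_0) == fattn_kernel::none,
              "unsupported K type is rejected even by the K-only check");
static_assert(pick_kernel_k_only(cache_type::q8_0, cache_type::iq4_nl) == fattn_kernel::vec,
              "unsupported V type is missed by the K-only check");
static_assert(pick_kernel_k_and_v(cache_type::q8_0, cache_type::iq4_nl) == fattn_kernel::none,
              "unsupported V type is rejected once V is validated too");
```

In this sketch the K-only variant reproduces the reported behavior: an unsupported K type is rejected up front, while an unsupported V type is only discovered when the kernel is launched.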

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, to debug the issue

CUDA: extend K-type validation to V-types for flash attention

5aa1f98

github-actions bot added the Nvidia GPU and ggml labels Jun 10, 2026

reorder

5b41fa3

sanmai marked this pull request as ready for review June 10, 2026 09:30

sanmai requested a review from a team as a code owner June 10, 2026 09:30

sanmai wants to merge 2 commits into ggml-org:master from sanmai:fix/kv-cache-types-flash-attention-validation
