CUDA: extend K-type validation to V-types for flash attention #24403

Open
sanmai wants to merge 2 commits into ggml-org:master from sanmai:fix/kv-cache-types-flash-attention-validation
sanmai (Contributor) commented Jun 10, 2026

Overview

Extends the K-type validation to V-types in ggml_cuda_get_best_fattn_kernel.

Additional information

With GGML_CUDA_FA_ALL_QUANTS=ON, any combination of K and V cache types can be requested. An unsupported combination should fall back to the CPU backend, but because only the K type is validated, an unsupported V type leads to a crash instead. For example, -ctk iq4_nl -ctv q8_0 falls back to the CPU, but -ctk q8_0 -ctv iq4_nl crashes:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ff360aa5ffe in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#0  0x00007ff360aa5ffe in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff360a9a7a4 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ff360a9a7ed in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ff360b06c07 in wait4 () from /usr/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ff360f1d973 in ggml_print_backtrace () from bin/libggml-base.so.0
#5  0x00007ff360f1dabf in ggml_abort () from bin/libggml-base.so.0
#6  0x00007ff35c609b33 in ggml_cuda_flash_attn_ext(ggml_backend_cuda_context&, ggml_tensor*) () from bin/libggml-cuda.so.0
#7  0x00007ff35c65ba26 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from bin/libggml-cuda.so.0
#8  0x00007ff35c65bfd2 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from bin/libggml-cuda.so.0
#9  0x00007ff360f39953 in ggml_backend_sched_graph_compute_async () from bin/libggml-base.so.0
#10 0x00007ff3600cb3b0 in llama_context::graph_compute(ggml_cgraph*, bool) () from bin/libllama.so.0
#11 0x00007ff3600cdfd5 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from bin/libllama.so.0
#12 0x00007ff3600d4ac5 in llama_context::decode(llama_batch const&) () from bin/libllama.so.0
#13 0x00007ff3600d66be in llama_decode () from bin/libllama.so.0
#14 0x00007ff3605ec61c in common_init_from_params(common_params&, bool) () from bin/libllama-common.so.0
#15 0x00007ff361174540 in server_context_impl::load_model(common_params&) () from bin/libllama-server-impl.so
#16 0x00007ff3610c5542 in llama_server(int, char**) () from bin/libllama-server-impl.so
#17 0x00007ff360a31f77 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#18 0x00007ff360a32027 in __libc_start_main () from /usr/lib/x86_64-linux-gnu/libc.so.6
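
The asymmetry described above can be illustrated with a minimal, self-contained sketch. It is not the code in ggml_cuda_get_best_fattn_kernel: the enums, the function names, and the set of supported types are all hypothetical stand-ins chosen for illustration. It only shows the idea of applying the same type check to V that is already applied to K, so that an unsupported V type yields "no kernel" (and hence a fallback) rather than reaching a kernel that aborts.

```cpp
// Hypothetical stand-ins for the real ggml types; not the ggml enums.
enum class CacheType { F16, Q4_0, Q8_0, IQ4_NL };
enum class Kernel { None, Vec };

// Assumed set of types the CUDA flash-attention path can handle.
// This list is illustrative only, not taken from the PR.
constexpr bool fa_type_supported(CacheType t) {
    return t == CacheType::F16 || t == CacheType::Q4_0 || t == CacheType::Q8_0;
}

// Sketch of the behaviour before the change: only the K type is validated,
// so an unsupported V type slips through and a kernel is still selected.
constexpr Kernel pick_kernel_k_only(CacheType k, CacheType /*v*/) {
    return fa_type_supported(k) ? Kernel::Vec : Kernel::None;
}

// Sketch of the behaviour after the change: K and V are validated the same
// way, so either one being unsupported returns Kernel::None.
constexpr Kernel pick_kernel(CacheType k, CacheType v) {
    return (fa_type_supported(k) && fa_type_supported(v)) ? Kernel::Vec
                                                          : Kernel::None;
}

// -ctk iq4_nl -ctv q8_0: rejected in both versions (falls back).
static_assert(pick_kernel_k_only(CacheType::IQ4_NL, CacheType::Q8_0) == Kernel::None,
              "unsupported K is rejected even with the K-only check");
// -ctk q8_0 -ctv iq4_nl: accepted by the K-only check, rejected once V is checked.
static_assert(pick_kernel_k_only(CacheType::Q8_0, CacheType::IQ4_NL) == Kernel::Vec,
              "K-only check lets an unsupported V through");
static_assert(pick_kernel(CacheType::Q8_0, CacheType::IQ4_NL) == Kernel::None,
              "symmetric check rejects an unsupported V");
```

In this sketch, returning Kernel::None plays the role of reporting the operation as unsupported on the GPU, which is what would allow a scheduler to route it to another backend.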

Requirements

github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Jun 10, 2026
sanmai marked this pull request as ready for review on June 10, 2026 09:30
sanmai requested a review from a team as a code owner on June 10, 2026 09:30