fix: speculative verify with --mtp-draft > 2 commits wrong tokens by pandysp · Pull Request #358 · antirez/ds4

pandysp · 2026-06-08T18:06:57Z

Summary

In metal_graph_verify_suffix_tops, the multi-row branch calls ds4_gpu_indexer_topk_tensor with n_tokens and top_k swapped: it asks for the top top_rows tokens of a single row instead of the top-1 token of each of top_rows rows. The signature is (selected, scores, n_comp, n_tokens, top_k), so row_tops[i] for i > 0 come back as row 0's runners-up. At draft depth > 2 the verifier then accepts or rejects drafts against the wrong rows and commits tokens the model never produced.
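The difference between the two argument orders can be illustrated with a toy row-wise selector. This is only a sketch: toy_topk and toy_logits are hypothetical names, the selection loop is a naive CPU stand-in rather than the ds4 kernel, and the only thing borrowed from the PR is the parameter order (selected, scores, n_comp, n_tokens, top_k).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a row-wise top-k selector (hypothetical, NOT the ds4 kernel).
 * scores:   n_tokens rows of n_comp floats, row-major.
 * selected: n_tokens rows of top_k indices, row-major, best first. */
void toy_topk(int32_t *selected, const float *scores,
              int n_comp, int n_tokens, int top_k) {
    assert(top_k >= 1 && top_k <= n_comp);
    for (int t = 0; t < n_tokens; t++) {
        const float *row = scores + (size_t)t * n_comp;
        int32_t *out = selected + (size_t)t * top_k;
        for (int k = 0; k < top_k; k++) {
            int best = -1;
            for (int c = 0; c < n_comp; c++) {
                int taken = 0;
                /* skip indices already selected for this row */
                for (int j = 0; j < k; j++) if (out[j] == c) taken = 1;
                if (taken) continue;
                if (best < 0 || row[c] > row[best]) best = c;
            }
            out[k] = best;
        }
    }
}

/* Three logits rows over a 4-entry toy vocabulary (made-up values).
 * Per-row argmax: row 0 -> 1, row 1 -> 0, row 2 -> 0. */
const float toy_logits[3 * 4] = {
    0.1f, 0.9f, 0.5f, 0.2f,
    0.7f, 0.1f, 0.1f, 0.1f,
    0.8f, 0.0f, 0.1f, 0.0f,
};
```

With this toy data, toy_topk(sel, toy_logits, 4, 1, 3) (the swapped order) fills sel with {1, 2, 3}, which is row 0's argmax followed by row 0's runners-up, while toy_topk(sel, toy_logits, 4, 3, 1) fills it with {1, 0, 0}, the argmax of each row.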

It only fires above the shipped depth: at --mtp-draft 2, top_rows == 1 takes the dedicated argmax path just above this branch, so the swapped call is never reached. The bug is therefore latent and does not affect default output, but raising the draft depth silently yields wrong tokens (a spurious EOS or divergence), and nothing in the suite covered the multi-row verify path.
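To see why wrong row tops turn into wrong committed tokens, here is a minimal sketch of a generic greedy accept step. It assumes the common scheme in which a draft token is accepted only while it equals the verifier's argmax for its row; toy_accept_prefix is a hypothetical helper, and the actual ds4 verifier may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* Generic greedy verification (illustrative only, not ds4 code):
 * returns the length of the longest draft prefix whose tokens match
 * the per-row argmax tokens supplied in row_tops. */
int toy_accept_prefix(const int32_t *draft, const int32_t *row_tops, int n) {
    assert(n >= 0);
    int accepted = 0;
    while (accepted < n && draft[accepted] == row_tops[accepted]) accepted++;
    return accepted;
}
```

For a draft of {1, 0, 5}, correct row tops of {1, 0, 0} accept two tokens. If the row tops are instead {1, 2, 3} (row 0's argmax and runners-up), only one token is accepted, and any correction taken from row_tops[1] would be token 2, which is not the argmax at that position.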

Fix

Swap the last two arguments back (signature is (selected, scores, n_comp, n_tokens, top_k)):

- ds4_gpu_indexer_topk_tensor(g->comp_selected, g->spec_logits, DS4_N_VOCAB, 1, top_rows)
+ ds4_gpu_indexer_topk_tensor(g->comp_selected, g->spec_logits, DS4_N_VOCAB, top_rows, 1)

Test

Adds a --mtp-verify-depth test to tests/ds4_test.c. It runs greedy speculative decode at draft 4 over a verbatim-copy task, then teacher-forces the committed tokens back through plain decode and requires each to be a (near-)argmax at its position, which is the invariant speculative verify must preserve. Unlike comparing whole token streams, this check tolerates the benign tie divergences of near-greedy speculation.

The MTP head is not in the default test model set, so it self-skips unless DS4_TEST_MTP points at an MTP GGUF:

make ds4_test
DS4_TEST_MODEL=<base.gguf> DS4_TEST_MTP=<mtp.gguf> ./ds4_test --mtp-verify-depth

Without the fix a committed token sits ~21 logits below the argmax (worst gap 20.96 at token 145); with the fix every committed token is the argmax (gap 0.00).
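The near-argmax invariant and the logit gap quoted above can be sketched as follows. The helper names and the tolerance parameter are assumptions for illustration; the PR text does not state which threshold the real test uses.

```c
#include <assert.h>
#include <stdint.h>

/* Gap between the row's maximum logit and the committed token's logit.
 * A gap of 0 means the committed token is the argmax. */
float toy_argmax_gap(const float *logits, int n_vocab, int32_t token) {
    assert(n_vocab > 0 && token >= 0 && token < n_vocab);
    float best = logits[0];
    for (int i = 1; i < n_vocab; i++) {
        if (logits[i] > best) best = logits[i];
    }
    return best - logits[token];
}

/* Near-argmax check: tol is a hypothetical tolerance that absorbs
 * benign ties between almost-equal logits. */
int toy_is_near_argmax(const float *logits, int n_vocab,
                       int32_t token, float tol) {
    return toy_argmax_gap(logits, n_vocab, token) <= tol;
}
```

For a toy row {2.0, 5.0, 4.5} and a tolerance of 1.0, token 1 has gap 0.0 and token 2 has gap 0.5, so both pass, while token 0 has gap 3.0 and fails. A gap near 21, as reported without the fix, would fail any reasonable tolerance.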

Notes

Verified on M4 Max / Metal, DeepSeek-V4-Flash q2-q4 imatrix base + MTP-Q4K-Q8_0 head. Ran alongside --server --metal-tensor-equivalence --long-context; all pass.
The call site is backend-shared, so CUDA is affected and fixed identically, but I have not verified it on a CUDA machine.
The other four ds4_gpu_indexer_topk_tensor call sites use the correct argument order; this was the only swapped one.

metal_graph_verify_suffix_tops() asked ds4_gpu_indexer_topk_tensor() for the top-`top_rows` tokens of a single logits row (n_tokens=1, top_k=top_rows) instead of the top-1 of each of `top_rows` rows. The signature is (selected, scores, n_comp, n_tokens, top_k), so row_tops[i>0] came back as row 0 runner-ups and the verifier accepted or rejected drafts against the wrong rows, committing tokens the model never produced.

Visible only at draft depth > 2: at depth 2 top_rows == 1 and the code takes the dedicated argmax path above this branch, so the swapped call is never reached.

Adds tests/ds4_test.c --mtp-verify-depth: runs greedy speculative decode at draft 4 over a verbatim-copy task, then teacher-forces the committed tokens back through plain decode and asserts each is a (near-)argmax at its position. That is the invariant speculative verify must preserve, and unlike comparing whole token streams it tolerates the benign tie divergences of near-greedy speculation. Self-skips unless DS4_TEST_MTP points at an MTP GGUF.

Without the fix a committed token sits ~21 logits below the argmax (worst gap 20.96 at token 145); with the fix every committed token is the argmax (gap 0.00).

The shared test engine loads the MTP head only when DS4_TEST_MTP is set, and only on the fast engine, so the default suite and the quality engine are unaffected.

Verified on M4 Max / Metal, DeepSeek-V4-Flash q2-q4 imatrix base + MTP-Q4K-Q8_0 head:

    make ds4_test
    DS4_TEST_MODEL=<base> DS4_TEST_MTP=<mtp> ./ds4_test --mtp-verify-depth

Ran alongside --server --metal-tensor-equivalence --long-context: all pass.

The swapped call site is backend-shared (ds4.c), so CUDA is affected and fixed identically, but this was not verified on a CUDA machine.

pandysp wants to merge 1 commit into antirez:main from pandysp:fix-mtp-verify-topk
