Skip to content

fix: speculative verify with --mtp-draft > 2 commits wrong tokens#358

Open
pandysp wants to merge 1 commit into
antirez:mainfrom
pandysp:fix-mtp-verify-topk
Open

fix: speculative verify with --mtp-draft > 2 commits wrong tokens#358
pandysp wants to merge 1 commit into
antirez:mainfrom
pandysp:fix-mtp-verify-topk

Conversation

@pandysp

@pandysp pandysp commented Jun 8, 2026

Copy link
Copy Markdown

Summary

In metal_graph_verify_suffix_tops, the multi-row branch calls ds4_gpu_indexer_topk_tensor with n_tokens and top_k swapped: it asks for the top-top_rows tokens of a single row instead of the top-1 of each of top_rows rows. The signature is (selected, scores, n_comp, n_tokens, top_k), so row_tops[i>0] come back as row 0's runner-ups. At draft depth > 2 the verifier then accepts or rejects drafts against the wrong rows and commits tokens the model never produced.

It only fires above the shipped depth: at --mtp-draft 2, top_rows == 1 takes the dedicated argmax path just above this branch, so the swapped call is never reached. So this is latent — it does not affect default output — but raising the draft depth gives silently wrong tokens (a spurious EOS or divergence), and nothing in the suite covered the multi-row verify path.

Fix

Swap the last two arguments back (signature is (selected, scores, n_comp, n_tokens, top_k)):

- ds4_gpu_indexer_topk_tensor(g->comp_selected, g->spec_logits, DS4_N_VOCAB, 1, top_rows)
+ ds4_gpu_indexer_topk_tensor(g->comp_selected, g->spec_logits, DS4_N_VOCAB, top_rows, 1)

Test

Adds tests/ds4_test.c --mtp-verify-depth. It runs greedy speculative decode at draft 4 over a verbatim-copy task, then teacher-forces the committed tokens back through plain decode and requires each to be a (near-)argmax — the invariant speculative verify must preserve. Unlike comparing whole token streams, that tolerates the benign tie divergences of near-greedy speculation.

The MTP head is not in the default test model set, so it self-skips unless DS4_TEST_MTP points at an MTP GGUF:

make ds4_test
DS4_TEST_MODEL=<base.gguf> DS4_TEST_MTP=<mtp.gguf> ./ds4_test --mtp-verify-depth

Without the fix a committed token sits ~21 logits below the argmax (worst gap 20.96 at token 145); with the fix every committed token is the argmax (gap 0.00).

Notes

  • Verified on M4 Max / Metal, DeepSeek-V4-Flash q2-q4 imatrix base + MTP-Q4K-Q8_0 head. Ran alongside --server --metal-tensor-equivalence --long-context; all pass.
  • The call site is backend-shared, so CUDA is affected and fixed identically, but I have not verified it on a CUDA machine.
  • The other four ds4_gpu_indexer_topk_tensor call sites use the correct argument order; this was the only swapped one.

metal_graph_verify_suffix_tops() asked ds4_gpu_indexer_topk_tensor() for the
top-`top_rows` tokens of a single logits row (n_tokens=1, top_k=top_rows)
instead of the top-1 of each of `top_rows` rows. The signature is
(selected, scores, n_comp, n_tokens, top_k), so row_tops[i>0] came back as
row 0 runner-ups and the verifier accepted or rejected drafts against the
wrong rows, committing tokens the model never produced. Visible only at draft
depth > 2: at depth 2 top_rows == 1 and the code takes the dedicated argmax
path above this branch, so the swapped call is never reached.

Adds tests/ds4_test.c --mtp-verify-depth: runs greedy speculative decode at
draft 4 over a verbatim-copy task, then teacher-forces the committed tokens
back through plain decode and asserts each is a (near-)argmax at its position.
That is the invariant speculative verify must preserve, and unlike comparing
whole token streams it tolerates the benign tie divergences of near-greedy
speculation. Self-skips unless DS4_TEST_MTP points at an MTP GGUF. Without the
fix a committed token sits ~21 logits below the argmax (worst gap 20.96 at
token 145); with the fix every committed token is the argmax (gap 0.00).

The shared test engine loads the MTP head only when DS4_TEST_MTP is set, and
only on the fast engine, so the default suite and the quality engine are
unaffected.

Verified on M4 Max / Metal, DeepSeek-V4-Flash q2-q4 imatrix base + MTP-Q4K-Q8_0
head:
  make ds4_test
  DS4_TEST_MODEL=<base> DS4_TEST_MTP=<mtp> ./ds4_test --mtp-verify-depth
Ran alongside --server --metal-tensor-equivalence --long-context: all pass.
The swapped call site is backend-shared (ds4.c), so CUDA is affected and fixed
identically, but this was not verified on a CUDA machine.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant