fix: speculative verify with --mtp-draft > 2 commits wrong tokens#358
Open
pandysp wants to merge 1 commit into
Open
fix: speculative verify with --mtp-draft > 2 commits wrong tokens#358pandysp wants to merge 1 commit into
pandysp wants to merge 1 commit into
Conversation
metal_graph_verify_suffix_tops() asked ds4_gpu_indexer_topk_tensor() for the top-`top_rows` tokens of a single logits row (n_tokens=1, top_k=top_rows) instead of the top-1 of each of `top_rows` rows. The signature is (selected, scores, n_comp, n_tokens, top_k), so row_tops[i>0] came back as row 0 runner-ups and the verifier accepted or rejected drafts against the wrong rows, committing tokens the model never produced. Visible only at draft depth > 2: at depth 2 top_rows == 1 and the code takes the dedicated argmax path above this branch, so the swapped call is never reached. Adds tests/ds4_test.c --mtp-verify-depth: runs greedy speculative decode at draft 4 over a verbatim-copy task, then teacher-forces the committed tokens back through plain decode and asserts each is a (near-)argmax at its position. That is the invariant speculative verify must preserve, and unlike comparing whole token streams it tolerates the benign tie divergences of near-greedy speculation. Self-skips unless DS4_TEST_MTP points at an MTP GGUF. Without the fix a committed token sits ~21 logits below the argmax (worst gap 20.96 at token 145); with the fix every committed token is the argmax (gap 0.00). The shared test engine loads the MTP head only when DS4_TEST_MTP is set, and only on the fast engine, so the default suite and the quality engine are unaffected. Verified on M4 Max / Metal, DeepSeek-V4-Flash q2-q4 imatrix base + MTP-Q4K-Q8_0 head: make ds4_test DS4_TEST_MODEL=<base> DS4_TEST_MTP=<mtp> ./ds4_test --mtp-verify-depth Ran alongside --server --metal-tensor-equivalence --long-context: all pass. The swapped call site is backend-shared (ds4.c), so CUDA is affected and fixed identically, but this was not verified on a CUDA machine.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
In
metal_graph_verify_suffix_tops, the multi-row branch callsds4_gpu_indexer_topk_tensorwithn_tokensandtop_kswapped: it asks for the top-top_rowstokens of a single row instead of the top-1 of each oftop_rowsrows. The signature is(selected, scores, n_comp, n_tokens, top_k), sorow_tops[i>0]come back as row 0's runner-ups. At draft depth > 2 the verifier then accepts or rejects drafts against the wrong rows and commits tokens the model never produced.It only fires above the shipped depth: at
--mtp-draft 2,top_rows == 1takes the dedicated argmax path just above this branch, so the swapped call is never reached. So this is latent — it does not affect default output — but raising the draft depth gives silently wrong tokens (a spurious EOS or divergence), and nothing in the suite covered the multi-row verify path.Fix
Swap the last two arguments back (signature is
(selected, scores, n_comp, n_tokens, top_k)):Test
Adds
tests/ds4_test.c --mtp-verify-depth. It runs greedy speculative decode at draft 4 over a verbatim-copy task, then teacher-forces the committed tokens back through plain decode and requires each to be a (near-)argmax — the invariant speculative verify must preserve. Unlike comparing whole token streams, that tolerates the benign tie divergences of near-greedy speculation.The MTP head is not in the default test model set, so it self-skips unless
DS4_TEST_MTPpoints at an MTP GGUF:Without the fix a committed token sits ~21 logits below the argmax (worst gap 20.96 at token 145); with the fix every committed token is the argmax (gap 0.00).
Notes
--server --metal-tensor-equivalence --long-context; all pass.ds4_gpu_indexer_topk_tensorcall sites use the correct argument order; this was the only swapped one.