session payload: reject KV checkpoints saved against a different model/quant (v3 fingerprint)#363
Open
jamesburton wants to merge 1 commit into
Open
session payload: reject KV checkpoints saved against a different model/quant (v3 fingerprint)#363jamesburton wants to merge 1 commit into
jamesburton wants to merge 1 commit into
Conversation
The session payload header (ds4_session_save_payload / ds4_session_load_payload) validated ctx/chunk/layout/dims, but two different GGUFs of the same variant + quantization share that layout, so restoring a KV checkpoint saved against a different model silently decoded garbage. Add a cheap, stable 64-bit FNV-1a/64 model/quant fingerprint computed from structural+quant metadata and vocab bytes only (routed-expert quant bits, output/token_embd tensor quant types, per-layer ffn_gate_exps quant mix, and the vocab token byte strings) -- never the ~80 GB weights. Stamp it into payload header words 13/14 on save and reject on load with a clear message before reading any KV bytes. - Bump DS4_SESSION_PAYLOAD_VERSION 2 -> 3 and U32_FIELDS 13 -> 15; older v<3 payloads are rejected cleanly by the existing version check (no misparse). - Expose ds4_engine_model_fingerprint() so both the local payload path (ds4.c, CPU and GPU branches) and the distributed KV checkpoint path (ds4_distributed.c save header + load validate) stamp/validate the same identity symmetrically. Backend-agnostic host C; CUDA/CPU/Metal/ROCm builds unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a model/quant “fingerprint” to session/distributed KV payload headers to prevent restoring checkpoints into a different (but structurally compatible) model/quant that would otherwise decode to garbage.
Changes:
- Bumps session payload header version to v3 and appends a 64-bit fingerprint (two u32 header words).
- Introduces
ds4_engine_model_fingerprint()and stamps it into saved payloads. - Validates the fingerprint on load and rejects mismatched checkpoints before reading KV bytes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ds4_distributed.c | Stores fingerprint in distributed KV header and rejects mismatches on load. |
| ds4.h | Exposes fingerprint API and bumps payload header version/field count with documentation. |
| ds4.c | Implements fingerprint hashing and integrates stamp/validate into session save/load. |
Comments suppressed due to low confidence (3)
ds4.c:1
- The fingerprint currently mixes
0uwhenffn_gate_expsis absent. Ifds4_tensor_type(or the underlyingtypefield) can legitimately be0, then ‘missing tensor’ and ‘type==0’ become indistinguishable, weakening mismatch detection. Use an explicit presence bit (mix a boolean first) or a sentinel value that cannot collide with any valid type value.
ds4.c:1 - Fingerprinting hashes the full vocab byte content, making it O(total vocab bytes) per save/load. If payload save/load happens frequently (e.g., periodic checkpointing), consider caching the computed fingerprint in the engine after model load (or lazily on first request) and reusing it, since it should be stable for the lifetime of the engine.
ds4.c:1 - The mismatch error is correct but not very actionable. Consider including identifying details such as the saved fingerprint vs live fingerprint (e.g., as hex
0x%016llx) and/or the engine’s coarse model id/quant info, so operators can quickly determine what differed without additional logging.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+318
to
+319
| #define DS4_SESSION_PAYLOAD_VERSION UINT32_C(3) | ||
| #define DS4_SESSION_PAYLOAD_U32_FIELDS 15u |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Small robustness hardening for the session-payload format.
ds4_session_save_payload/load_payloadvalidate ctx, chunk layout, structural dims, and a coarsemodel_id(=DS4_MODEL_VARIANT) + quant bits. That catches most mismatches, but not two different GGUFs of the same variant + same quant + same layout (e.g. a re-quantized model, a different fine-tune, a swapped checkpoint): such a payload loads and silently decodes to garbage.This adds a cheap 64-bit model/quant fingerprint (FNV-1a/64) to the payload header, written on save and checked on load — rejecting a mismatch with a clear error (
KV checkpoint was saved against a different model or quantization) before any KV bytes are read. The fingerprint folds in routed-expert quant bits, theoutput/token_embdtensor quant types, every per-layerffn_gate_expsquant type, and the vocab token bytes — metadata only, never the weight data, so it's effectively free.DS4_SESSION_PAYLOAD_VERSION2 → 3 (two extra header words); v<3 payloads are rejected cleanly by the existing version gate, not misparsed.ds4.c) and distributed-KV (ds4_distributed.c) save/load paths updated symmetrically; CUDA/CPU/Metal/ROCm builds unaffected (backend-agnostic host C). Build-verified on the gfx1151 Windows HIP build.Honest scope: this is defensive hardening, not a feature — the existing validation already covers most cases, and
ds4_kvstore.chandles the persistence itself. It closes one narrow but real silent-garbage footgun. Happy to drop it if you consider it unnecessary.