Skip to content

session payload: reject KV checkpoints saved against a different model/quant (v3 fingerprint)#363

Open
jamesburton wants to merge 1 commit into
antirez:mainfrom
jamesburton:pr-payload-fingerprint
Open

session payload: reject KV checkpoints saved against a different model/quant (v3 fingerprint)#363
jamesburton wants to merge 1 commit into
antirez:mainfrom
jamesburton:pr-payload-fingerprint

Conversation

@jamesburton

Copy link
Copy Markdown

Small robustness hardening for the session-payload format.

ds4_session_save_payload / load_payload validate ctx, chunk layout, structural dims, and a coarse model_id (= DS4_MODEL_VARIANT) + quant bits. That catches most mismatches, but not two different GGUFs of the same variant + same quant + same layout (e.g. a re-quantized model, a different fine-tune, a swapped checkpoint): such a payload loads and silently decodes to garbage.

This adds a cheap 64-bit model/quant fingerprint (FNV-1a/64) to the payload header, written on save and checked on load — rejecting a mismatch with a clear error (KV checkpoint was saved against a different model or quantization) before any KV bytes are read. The fingerprint folds in routed-expert quant bits, the output/token_embd tensor quant types, every per-layer ffn_gate_exps quant type, and the vocab token bytes — metadata only, never the weight data, so it's effectively free.

  • DS4_SESSION_PAYLOAD_VERSION 2 → 3 (two extra header words); v<3 payloads are rejected cleanly by the existing version gate, not misparsed.
  • Both the local (ds4.c) and distributed-KV (ds4_distributed.c) save/load paths updated symmetrically; CUDA/CPU/Metal/ROCm builds unaffected (backend-agnostic host C). Build-verified on the gfx1151 Windows HIP build.

Honest scope: this is defensive hardening, not a feature — the existing validation already covers most cases, and ds4_kvstore.c handles the persistence itself. It closes one narrow but real silent-garbage footgun. Happy to drop it if you consider it unnecessary.

The session payload header (ds4_session_save_payload / ds4_session_load_payload)
validated ctx/chunk/layout/dims, but two different GGUFs of the same variant +
quantization share that layout, so restoring a KV checkpoint saved against a
different model silently decoded garbage.

Add a cheap, stable 64-bit FNV-1a/64 model/quant fingerprint computed from
structural+quant metadata and vocab bytes only (routed-expert quant bits,
output/token_embd tensor quant types, per-layer ffn_gate_exps quant mix, and the
vocab token byte strings) -- never the ~80 GB weights. Stamp it into payload
header words 13/14 on save and reject on load with a clear message before
reading any KV bytes.

- Bump DS4_SESSION_PAYLOAD_VERSION 2 -> 3 and U32_FIELDS 13 -> 15; older v<3
  payloads are rejected cleanly by the existing version check (no misparse).
- Expose ds4_engine_model_fingerprint() so both the local payload path (ds4.c,
  CPU and GPU branches) and the distributed KV checkpoint path (ds4_distributed.c
  save header + load validate) stamp/validate the same identity symmetrically.

Backend-agnostic host C; CUDA/CPU/Metal/ROCm builds unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 18:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a model/quant “fingerprint” to session/distributed KV payload headers to prevent restoring checkpoints into a different (but structurally compatible) model/quant that would otherwise decode to garbage.

Changes:

  • Bumps session payload header version to v3 and appends a 64-bit fingerprint (two u32 header words).
  • Introduces ds4_engine_model_fingerprint() and stamps it into saved payloads.
  • Validates the fingerprint on load and rejects mismatched checkpoints before reading KV bytes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
ds4_distributed.c Stores fingerprint in distributed KV header and rejects mismatches on load.
ds4.h Exposes fingerprint API and bumps payload header version/field count with documentation.
ds4.c Implements fingerprint hashing and integrates stamp/validate into session save/load.
Comments suppressed due to low confidence (3)

ds4.c:1

  • The fingerprint currently mixes 0u when ffn_gate_exps is absent. If ds4_tensor_type (or the underlying type field) can legitimately be 0, then ‘missing tensor’ and ‘type==0’ become indistinguishable, weakening mismatch detection. Use an explicit presence bit (mix a boolean first) or a sentinel value that cannot collide with any valid type value.
    ds4.c:1
  • Fingerprinting hashes the full vocab byte content, making it O(total vocab bytes) per save/load. If payload save/load happens frequently (e.g., periodic checkpointing), consider caching the computed fingerprint in the engine after model load (or lazily on first request) and reusing it, since it should be stable for the lifetime of the engine.
    ds4.c:1
  • The mismatch error is correct but not very actionable. Consider including identifying details such as the saved fingerprint vs live fingerprint (e.g., as hex 0x%016llx) and/or the engine’s coarse model id/quant info, so operators can quickly determine what differed without additional logging.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ds4.h
Comment on lines +318 to +319
#define DS4_SESSION_PAYLOAD_VERSION UINT32_C(3)
#define DS4_SESSION_PAYLOAD_U32_FIELDS 15u
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants