
Gemma4 mtp recover #1

Draft
fhnmor21 wants to merge 13 commits into feature/turboquant-kv-cache from gemma4-mtp-recover

@fhnmor21

Owner

Overview

Additional information

Requirements

fhnmor21 and others added 13 commits May 23, 2026 10:09
Introduces support for the Gemma 4 MTP assistant, allowing for enhanced speculative decoding. This includes new command-line options for specifying the MTP head and draft model, as well as updates to the model architecture and tensor handling. The assistant integrates with the target model, enabling efficient draft generation and improved performance in speculative tasks.

Changes include:
- New command-line options: `--mtp-head` and `--draft-block-size`.
- Updates to the model loading process to accommodate the MTP assistant.
- Enhancements in tensor management for MTP-specific operations.
- Documentation updates for usage examples and guidelines.

The aim is faster generation through speculative decoding with the MTP assistant.
…oding

This commit introduces an asynchronous MTP draft pipeline, enhancing the speculative decoding process. Key changes include:

- Updated `draft_block_size` to 3, optimizing performance based on empirical results.
- Added new APIs: `llama_decode_mtp_async` and `llama_decode_mtp_wait` for non-blocking draft requests.
- Enhanced documentation to reflect the async pipeline's functionality and usage.
- Implemented tests to ensure parity between synchronous and asynchronous draft generation.

These improvements aim to increase throughput and efficiency in handling complex tasks within the model.
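The request/wait split described above can be sketched with standard C++ futures. This is a minimal illustration, not the PR's code: `draft_tokens_sync`, `draft_tokens_async`, `draft_tokens_wait` and `draft_request` are hypothetical names, and the signatures of `llama_decode_mtp_async` and `llama_decode_mtp_wait` are not shown in this PR, so none are assumed here.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for the draft computation; NOT the llama.cpp API.
// Produces `n` deterministic "draft tokens" from the last accepted token.
static std::vector<int32_t> draft_tokens_sync(int32_t last_token, int n) {
    std::vector<int32_t> out;
    out.reserve(n);
    for (int i = 1; i <= n; ++i) {
        out.push_back(last_token + i);
    }
    return out;
}

// Hypothetical handle for an in-flight draft request.
struct draft_request {
    std::future<std::vector<int32_t>> result;
};

// Start the draft on a worker thread and return immediately.
static draft_request draft_tokens_async(int32_t last_token, int n) {
    return { std::async(std::launch::async, draft_tokens_sync, last_token, n) };
}

// Block until the draft is ready and return its tokens.
static std::vector<int32_t> draft_tokens_wait(draft_request & req) {
    return req.result.get();
}

// Parity check: the async path must yield the same tokens as the sync path.
static bool draft_paths_match(int32_t last_token, int n) {
    draft_request req = draft_tokens_async(last_token, n);
    // ... the caller could do unrelated work here ...
    return draft_tokens_wait(req) == draft_tokens_sync(last_token, n);
}
```

A parity test in this style only needs to call `draft_paths_match` for a few inputs; on some toolchains the build needs `-pthread`.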
…P handling

This commit introduces significant improvements to the speculative decoding process by implementing a pipeline depth-2 mechanism that allows MTP draft computation to overlap with target verification. Key changes include:

- Added `prepare_next` and `cancel` hooks in the `common_speculative_state` interface for better async draft management.
- Implemented logic to drain any pending MTP requests before new iterations to prevent race conditions.
- Updated documentation to reflect the new pipeline depth-2 functionality and its implications for performance.
- Enhanced the `common_speculative` API with new functions for managing async MTP work.
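
The hook pair and the drain-before-new-work rule listed above might look roughly like the following. It is a sketch under assumptions: the struct name, the `take` helper, the `drained` counter and all signatures are hypothetical, and only the hook names `prepare_next` and `cancel` come from the commit message.

```cpp
#include <cassert>
#include <future>
#include <optional>
#include <vector>

// Hypothetical sketch; hook names mirror the commit message, signatures are assumed.
struct speculative_state_sketch {
    std::optional<std::future<std::vector<int>>> pending; // in-flight draft, if any
    int drained = 0;                                      // stale drafts discarded so far

    // Kick off the draft for the next iteration while the target verifies.
    void prepare_next(int last_token, int n_draft) {
        cancel(); // drain first so two drafts are never in flight at once
        pending = std::async(std::launch::async, [last_token, n_draft] {
            std::vector<int> out;
            for (int i = 1; i <= n_draft; ++i) {
                out.push_back(last_token + i);
            }
            return out;
        });
    }

    // Wait for a pending draft to finish and discard its result.
    void cancel() {
        if (pending) {
            pending->get();
            pending.reset();
            ++drained;
        }
    }

    // Collect the prepared draft, or an empty vector if none is pending.
    std::vector<int> take() {
        std::vector<int> out;
        if (pending) {
            out = pending->get();
            pending.reset();
        }
        return out;
    }
};
```

Waiting inside `cancel` rather than abandoning the future is one simple way to avoid a worker still touching shared state when the next request starts.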

These enhancements aim to improve throughput and efficiency in speculative decoding tasks, ensuring smoother operation during concurrent processing.

This commit introduces an optional NDJSON tracer for MTP draft and accept events, controlled by the environment variable `LLAMA_MTP_ACC_TRACE`. Key changes include:

- Implementation of the `mtp_acc_tracer` class for tracing MTP events with configurable output options.
- Integration of tracing logic into the `common_speculative_state_mtp` structure, capturing relevant metrics during draft and acceptance processes.
- Updates to the MTP decoding functions to utilize in-graph argmax for improved performance and reduced data transfer overhead.
- Addition of a new shell script for running the Gemma 4 MTP server with enhanced configuration options.

These enhancements aim to provide better observability and performance in MTP operations, facilitating debugging and optimization of the speculative decoding process.
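
An environment-gated NDJSON tracer of the kind described can be sketched as below. The record fields (`event`, `n_drafted`, `n_accepted`) and both function names are hypothetical; the PR does not show the schema written by `mtp_acc_tracer` or how it interprets the variable's value.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical event record; field names are illustrative, not the PR's schema.
struct mtp_trace_event {
    const char * kind; // "draft" or "accept"
    int n_drafted;
    int n_accepted;
};

// Tracing is treated as enabled when the variable is set to a non-empty value.
// (Assumption: the real tracer may parse the value differently.)
static bool trace_enabled(const char * var = "LLAMA_MTP_ACC_TRACE") {
    const char * v = std::getenv(var);
    return v != nullptr && v[0] != '\0';
}

// Serialize one event as a single JSON object on its own line (NDJSON).
static std::string to_ndjson_line(const mtp_trace_event & ev) {
    std::ostringstream os;
    os << "{\"event\":\"" << ev.kind << "\""
       << ",\"n_drafted\":" << ev.n_drafted
       << ",\"n_accepted\":" << ev.n_accepted << "}\n";
    return os.str();
}
```

One object per line keeps the trace appendable and easy to filter with line-oriented tools.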
…cing

This commit introduces an in-graph argmax for MTP draft processing, significantly improving throughput by reducing data transfer overhead. Key changes include:

- Use of `ggml_argmax` in the graph on the final logits, so the host reads back only the selected token ID rather than the full logits.
- Addition of a diagnostic feature for per-draft acceptance tracing, enabling detailed logging of MTP events for better observability.
- Documentation updates to reflect these enhancements and provide usage examples for the new tracing functionality.
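
The transfer saving from an in-graph argmax can be illustrated in plain C++. This is a host-side stand-in, not ggml code, and the helper names are hypothetical: it contrasts reading back a full row of logits with reading back one token ID.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Greedy pick over one row of logits; stands in for what an in-graph
// argmax node would compute on the device.
static int32_t argmax_token(const std::vector<float> & logits) {
    return (int32_t) std::distance(logits.begin(),
                                   std::max_element(logits.begin(), logits.end()));
}

// Bytes the host must read back per drafted token in each scheme.
static size_t readback_bytes_full_logits(size_t n_vocab) { return n_vocab * sizeof(float); }
static size_t readback_bytes_token_id()                  { return sizeof(int32_t); }
```

For a hypothetical vocabulary of 262,144 entries with 4-byte floats, that is 1 MiB per drafted token for full logits against 4 bytes for the token ID.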

These improvements aim to optimize MTP operations and facilitate debugging in the speculative decoding process.

This commit improves the handling of tensors in the MTP process, specifically for the Gemma 4 assistant. Key changes include:

- Updated tensor conversion logic to maintain integer types for specific tensors, ensuring compatibility with centroid routing.
- Introduced handling for `mtp.centroids.weight` and `mtp.token_ordering.weight`, ensuring correct tensor shapes and types during processing.
- Enhanced documentation to clarify the new tensor structures and their implications for MTP operations.
- Added new scripts for quantizing and running the Gemma 4 Edge assistant with improved configuration options.

These enhancements aim to optimize the performance and accuracy of the MTP draft process, particularly when using ordered embeddings.

This commit introduces TurboQuant, a new family of WHT-rotated low-bit quantization formats designed for KV cache and model weight compression. Key changes include:

- Added support for KV cache types (`turbo2`, `turbo3`, `turbo4`) with significant compression ratios.
- Introduced weight quantization formats (`TQ3_1S`, `TQ4_1S`) for efficient model size reduction.
- Enhanced documentation detailing usage, backend support, and practical examples for TurboQuant integration.
- Added new command-line options for enabling TurboQuant features in the server.

These enhancements aim to optimize memory usage and improve performance in bandwidth-bound scenarios, particularly on Apple Silicon and discrete GPUs.

When the Gemma 4 assistant GGUF is loaded via llama_model_load_mtp_from_file,
its block tensors (blk.0-3.*), token_embd, output_norm and rope_freqs share
identical names with the target model's tensors. This makes it impossible to
uniquely target MTP assistant tensors via -ot rules for GPU placement.

Fix: after loading the assistant into aux, rename all tensors not already
prefixed with 'mtp.' to 'mtp.<original_name>'. This is done purely in-memory
on the tensors_by_name vector and the ggml_tensor name field — the GGUF file
and published arch names are unchanged.

After this change, all MTP assistant tensors are addressable as mtp.blk.N.*,
mtp.token_embd.weight, mtp.output_norm.weight etc, and can be pinned with:

  -ot 'mtp\..*=CUDA0'
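
The rename step can be sketched on a plain list of names. This is an illustration only: it operates on `std::string` rather than on `tensors_by_name` or `ggml_tensor`, both helper names are hypothetical, and the use of `std::regex_search` for override patterns is an assumption about matching semantics.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Prefix every name that does not already start with "mtp." (in-memory only).
static void prefix_mtp_names(std::vector<std::string> & names) {
    const std::string prefix = "mtp.";
    for (std::string & name : names) {
        if (name.compare(0, prefix.size(), prefix) != 0) {
            name = prefix + name;
        }
    }
}

// Does an override pattern such as "mtp\\..*" select this tensor name?
// Assumption: patterns are applied as a regex search over the name.
static bool pattern_selects(const std::string & pattern, const std::string & name) {
    return std::regex_search(name, std::regex(pattern));
}
```

After the rename, a pattern like `mtp\..*` selects every assistant tensor while leaving target names such as `blk.0.attn_q.weight` unselected.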
…g cache reuse

The kv_base->get_size() == kv_swa->get_size() condition in get_can_shift()
was introduced in PR ggml-org#15467 before Gemma 4 existed. Gemma 4 has 10 global
layers vs 50 SWA layers by design, so this check always returns false,
permanently blocking cache reuse for all Gemma 4 users.

The individual get_can_shift() calls on each sub-cache already guard shift
safety independently. Removing the size equality check is safe for all
existing models.
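
The effect of dropping the size check can be shown with a simplified model of the two sub-caches. The struct, both function names, and the example sizes are hypothetical, and the exact expression in `get_can_shift()` is not shown in this PR; the sketch assumes the size comparison was combined with the per-cache checks.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for one sub-cache; not the llama.cpp class.
struct sub_cache_sketch {
    uint32_t size;
    bool     can_shift;
};

// Before: shifting additionally required both sub-caches to have equal size.
static bool can_shift_before(const sub_cache_sketch & base, const sub_cache_sketch & swa) {
    return base.size == swa.size && base.can_shift && swa.can_shift;
}

// After: only the per-cache checks remain.
static bool can_shift_after(const sub_cache_sketch & base, const sub_cache_sketch & swa) {
    return base.can_shift && swa.can_shift;
}
```

With differently sized sub-caches that are each shiftable, the first form reports false and the second reports true; if either sub-cache cannot shift, both forms still report false.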

Fix also submitted to llama.cpp mainline:
ggml-org#21831 (comment)
…--cache-reuse enabled

TurboQuant (turbo2/3/4) uses kernel-level WHT rotation which is
position-invariant -- WHT preserves inner products so no RoPE correction
is needed after a KV position shift.
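
The inner-product property cited above can be checked numerically. The sketch below is a generic normalized fast Walsh-Hadamard transform, not the TurboQuant kernel, and it illustrates only that applying the same transform to two vectors leaves their dot product unchanged; it says nothing about the shift logic itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place normalized fast Walsh-Hadamard transform; size must be a power of two.
static void wht_normalized(std::vector<float> & v) {
    const size_t n = v.size();
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j]     = a + b; // butterfly: sum
                v[j + h] = a - b; // butterfly: difference
            }
        }
    }
    // Scale by 1/sqrt(n) so the transform is orthonormal.
    const float scale = 1.0f / std::sqrt((float) n);
    for (float & x : v) {
        x *= scale;
    }
}

// Plain dot product of two equally sized vectors.
static float dot(const std::vector<float> & a, const std::vector<float> & b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

Transforming both a query-like and a key-like vector and comparing `dot` before and after should agree up to floating-point rounding.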

build_graph_shift() assumed standard quantized tensors with upstream
rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at kernel
level. Building the shift graph with turbo-padded tensors causes a null
buffer assert and segfault on the second prompt.

Fix: skip build_graph_shift() layers and get_has_shift() entirely for
turbo KV types. Position tracking via seq_add() still works correctly --
only the broken RoPE re-rotation kernel is skipped.
3 participants