Gemma4 mtp recover #1
Draft
fhnmor21 wants to merge 13 commits into
Introduces support for the Gemma 4 MTP assistant, allowing for enhanced speculative decoding. This includes new command-line options for specifying the MTP head and draft model, as well as updates to the model architecture and tensor handling. The assistant integrates with the target model, enabling efficient draft generation and improved performance in speculative tasks.

Changes include:
- New command-line options: `--mtp-head` and `--draft-block-size`.
- Updates to the model loading process to accommodate the MTP assistant.
- Enhancements in tensor management for MTP-specific operations.
- Documentation updates for usage examples and guidelines.

This feature aims to improve the overall functionality and efficiency of the model in handling complex tasks.
…oding

This commit introduces an asynchronous MTP draft pipeline, enhancing the speculative decoding process.

Key changes include:
- Updated `draft_block_size` to 3, optimizing performance based on empirical results.
- Added new APIs: `llama_decode_mtp_async` and `llama_decode_mtp_wait` for non-blocking draft requests.
- Enhanced documentation to reflect the async pipeline's functionality and usage.
- Implemented tests to ensure parity between synchronous and asynchronous draft generation.

These improvements aim to increase throughput and efficiency in handling complex tasks within the model.
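The request/wait split and the sync/async parity test described above can be sketched as follows. The signatures of `llama_decode_mtp_async` and `llama_decode_mtp_wait` are not shown in this conversation, so `draft_sync`, `draft_async`, and `draft_wait` are hypothetical stand-ins built on `std::async`, and the toy token formula is purely illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for one MTP draft step: deterministically derives
// n_draft token ids from the last accepted token. Not the real model.
static std::vector<int32_t> draft_sync(int32_t last_token, int n_draft) {
    assert(n_draft >= 0);
    std::vector<int32_t> out;
    int32_t t = last_token;
    for (int i = 0; i < n_draft; ++i) {
        t = (t * 31 + 7) % 1000; // toy "prediction"
        out.push_back(t);
    }
    return out;
}

// Non-blocking request: starts the draft on another thread and returns a handle.
static std::future<std::vector<int32_t>> draft_async(int32_t last_token, int n_draft) {
    return std::async(std::launch::async, draft_sync, last_token, n_draft);
}

// Blocking wait: collects the result of a previously issued request.
static std::vector<int32_t> draft_wait(std::future<std::vector<int32_t>> & req) {
    return req.get();
}
```

A parity check in this style issues the same request through both paths and compares the resulting token vectors element by element.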
…P handling

This commit introduces significant improvements to the speculative decoding process by implementing a pipeline depth-2 mechanism that allows MTP draft computation to overlap with target verification.

Key changes include:
- Added `prepare_next` and `cancel` hooks in the `common_speculative_state` interface for better async draft management.
- Implemented logic to drain any pending MTP requests before new iterations to prevent race conditions.
- Updated documentation to reflect the new pipeline depth-2 functionality and its implications for performance.
- Enhanced the `common_speculative` API with new functions for managing async MTP work.

These enhancements aim to improve throughput and efficiency in speculative decoding tasks, ensuring smoother operation during concurrent processing.
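A minimal sketch of the overlap-and-drain idea, assuming a prefetch keyed on the token the caller expects to be accepted. The type `pipelined_drafter`, its members, and `toy_draft` are hypothetical; only the hook names `prepare_next` and `cancel` come from the commit message, and the real `common_speculative_state` interface may look quite different.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <optional>
#include <vector>

using tokens = std::vector<int32_t>;

// Hypothetical toy drafter standing in for the MTP head.
static tokens toy_draft(int32_t last, int n) {
    assert(n >= 0);
    tokens out;
    for (int i = 0; i < n; ++i) {
        last = (last * 31 + 7) % 1000;
        out.push_back(last);
    }
    return out;
}

struct pipelined_drafter {
    std::optional<std::future<tokens>> pending; // at most one request in flight
    int32_t pending_for = -1;                   // token the in-flight draft assumed

    // Start drafting for the predicted next token while verification runs.
    void prepare_next(int32_t predicted_last, int n) {
        cancel(); // never leave an older request racing with the new one
        pending_for = predicted_last;
        pending = std::async(std::launch::async, toy_draft, predicted_last, n);
    }

    // Drain the in-flight request, if any, and discard its result.
    void cancel() {
        if (pending) {
            pending->wait();
            pending.reset();
        }
    }

    // Reuse the prefetched draft when the prediction held; otherwise redo it.
    tokens draft(int32_t actual_last, int n) {
        if (pending && pending_for == actual_last) {
            tokens r = pending->get();
            pending.reset();
            return r;
        }
        cancel();
        return toy_draft(actual_last, n);
    }
};
```

Draining before every new iteration costs a wait on a mispredicted draft, but it guarantees that no stale request touches shared state after the iteration begins.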
This commit introduces an optional NDJSON tracer for MTP draft and accept events, controlled by the environment variable `LLAMA_MTP_ACC_TRACE`.

Key changes include:
- Implementation of the `mtp_acc_tracer` class for tracing MTP events with configurable output options.
- Integration of tracing logic into the `common_speculative_state_mtp` structure, capturing relevant metrics during draft and acceptance processes.
- Updates to the MTP decoding functions to utilize in-graph argmax for improved performance and reduced data transfer overhead.
- Addition of a new shell script for running the Gemma 4 MTP server with enhanced configuration options.

These enhancements aim to provide better observability and performance in MTP operations, facilitating debugging and optimization of the speculative decoding process.
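For readers unfamiliar with NDJSON, a tracer of this kind writes one JSON object per line. The sketch below is an assumption-laden illustration, not the `mtp_acc_tracer` class itself: the struct `ndjson_tracer`, the field names `event`, `n_drafted`, and `n_accepted`, and the "set and non-empty means enabled" reading of the environment variable are all hypothetical.

```cpp
#include <cstdlib>
#include <ostream>
#include <sstream>
#include <string>

struct ndjson_tracer {
    std::ostream * out = nullptr; // nullptr means tracing is disabled

    // Assumed semantics: tracing is on when the variable is set and non-empty.
    static bool enabled_from_env(const char * name = "LLAMA_MTP_ACC_TRACE") {
        const char * v = std::getenv(name);
        return v != nullptr && v[0] != '\0';
    }

    // Emit one event as a single JSON line. `kind` must be a plain
    // identifier such as "draft" or "accept"; no string escaping is done.
    void event(const std::string & kind, int n_drafted, int n_accepted) {
        if (out == nullptr) {
            return;
        }
        *out << "{\"event\":\"" << kind << "\""
             << ",\"n_drafted\":" << n_drafted
             << ",\"n_accepted\":" << n_accepted << "}\n";
    }
};
```

Because each line is an independent JSON object, the output can be appended to during a run and processed line by line afterwards.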
…cing

This commit introduces an in-graph argmax for MTP draft processing, significantly improving throughput by reducing data transfer overhead.

Key changes include:
- Implementation of `ggml_argmax` to publish final logits, allowing the host to read only the necessary token ID.
- Addition of a diagnostic feature for per-draft acceptance tracing, enabling detailed logging of MTP events for better observability.
- Documentation updates to reflect these enhancements and provide usage examples for the new tracing functionality.

These improvements aim to optimize MTP operations and facilitate debugging in the speculative decoding process.
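The transfer saving from taking the argmax before the host readback can be illustrated in plain C++. This sketch does not call ggml; `argmax_token`, `bytes_full_logits`, and `bytes_token_id` are hypothetical helpers, and it assumes greedy drafting in which only the top token is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the largest logit; the first index wins on ties.
static int32_t argmax_token(const std::vector<float> & logits) {
    assert(!logits.empty());
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return (int32_t) best;
}

// Bytes the host reads per drafted token when copying the full logit row.
static size_t bytes_full_logits(size_t n_vocab) {
    return n_vocab * sizeof(float);
}

// Bytes the host reads per drafted token when the reduction runs in-graph.
static size_t bytes_token_id() {
    return sizeof(int32_t);
}
```

With the reduction done on the device, the per-token readback shrinks from one float per vocabulary entry to a single 32-bit id, independent of vocabulary size.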
This commit improves the handling of tensors in the MTP process, specifically for the Gemma 4 assistant.

Key changes include:
- Updated tensor conversion logic to maintain integer types for specific tensors, ensuring compatibility with centroid routing.
- Introduced handling for `mtp.centroids.weight` and `mtp.token_ordering.weight`, ensuring correct tensor shapes and types during processing.
- Enhanced documentation to clarify the new tensor structures and their implications for MTP operations.
- Added new scripts for quantizing and running the Gemma 4 Edge assistant with improved configuration options.

These enhancements aim to optimize the performance and accuracy of the MTP draft process, particularly when using ordered embeddings.
This commit introduces TurboQuant, a new family of WHT-rotated low-bit quantization formats designed for KV cache and model weight compression.

Key changes include:
- Added support for KV cache types (`turbo2`, `turbo3`, `turbo4`) with significant compression ratios.
- Introduced weight quantization formats (`TQ3_1S`, `TQ4_1S`) for efficient model size reduction.
- Enhanced documentation detailing usage, backend support, and practical examples for TurboQuant integration.
- Added new command-line options for enabling TurboQuant features in the server.

These enhancements aim to optimize memory usage and improve performance in bandwidth-bound scenarios, particularly on Apple Silicon and discrete GPUs.
When the Gemma 4 assistant GGUF is loaded via llama_model_load_mtp_from_file, its block tensors (blk.0-3.*), token_embd, output_norm and rope_freqs share identical names with the target model's tensors. This makes it impossible to uniquely target MTP assistant tensors via -ot rules for GPU placement.

Fix: after loading the assistant into aux, rename all tensors not already prefixed with 'mtp.' to 'mtp.<original_name>'. This is done purely in-memory on the tensors_by_name vector and the ggml_tensor name field; the GGUF file and published arch names are unchanged.

After this change, all MTP assistant tensors are addressable as mtp.blk.N.*, mtp.token_embd.weight, mtp.output_norm.weight etc., and can be pinned with:

-ot 'mtp\..*=CUDA0'
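The rename rule can be sketched on a plain list of names. This is an illustration only: `prefix_mtp` and `matches_override` are hypothetical helpers, the real change operates on the tensors_by_name vector and the ggml_tensor name field, and the sketch assumes that -ot patterns are applied to tensor names with regex-search semantics.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Prefix every name that does not already start with "mtp." (idempotent).
static void prefix_mtp(std::vector<std::string> & names) {
    const std::string prefix = "mtp.";
    for (std::string & n : names) {
        if (n.compare(0, prefix.size(), prefix) != 0) {
            n = prefix + n;
        }
    }
}

// Assumed matching rule: the pattern may match anywhere in the tensor name.
static bool matches_override(const std::string & pattern, const std::string & name) {
    assert(!pattern.empty());
    return std::regex_search(name, std::regex(pattern));
}
```

After the rename, the pattern `mtp\..*` selects only assistant tensors, while a target-model name such as `blk.0.attn_q.weight` no longer collides with it.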
…g cache reuse

The kv_base->get_size() == kv_swa->get_size() condition in get_can_shift() was introduced in PR ggml-org#15467 before Gemma 4 existed. Gemma 4 has 10 global layers vs 50 SWA layers by design, so this check always returns false, permanently blocking cache reuse for all Gemma 4 users.

The individual get_can_shift() calls on each sub-cache already guard shift safety independently. Removing the size equality check is safe for all existing models.

Fix also submitted to llama.cpp mainline: ggml-org#21831 (comment)
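One plausible reading of the before and after logic is sketched below. The struct `toy_cache`, the functions `can_shift_old` and `can_shift_new`, and the sizes used are hypothetical; the exact form of the original condition in get_can_shift() is not quoted in full here, so the "old" version is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one sub-cache (base or SWA).
struct toy_cache {
    uint32_t size;      // what get_size() would report
    bool     can_shift; // what the sub-cache's own get_can_shift() would report
};

// Assumed old behaviour: shifting additionally requires equal sub-cache sizes.
static bool can_shift_old(const toy_cache & base, const toy_cache & swa) {
    return base.size == swa.size && base.can_shift && swa.can_shift;
}

// Behaviour after the fix: rely only on the per-cache checks.
static bool can_shift_new(const toy_cache & base, const toy_cache & swa) {
    assert(base.size > 0 && swa.size > 0);
    return base.can_shift && swa.can_shift;
}
```

When the two sub-caches differ in size, the old predicate is false regardless of the per-cache results, which matches the permanently blocked cache reuse described above.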
…--cache-reuse enabled

TurboQuant (turbo2/3/4) uses kernel-level WHT rotation, which is position-invariant: WHT preserves inner products, so no RoPE correction is needed after a KV position shift. build_graph_shift() assumed standard quantized tensors with upstream rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at kernel level. Building the shift graph with turbo-padded tensors causes a null buffer assert and segfault on the second prompt.

Fix: skip build_graph_shift() layers and get_has_shift() entirely for turbo KV types. Position tracking via seq_add() still works correctly; only the broken RoPE re-rotation kernel is skipped.
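The claim that WHT preserves inner products can be checked numerically with an orthonormal fast Walsh-Hadamard transform. This is a generic textbook sketch, not the TurboQuant kernel; `fwht` and `dot` are hypothetical helper names, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that the
// transform is orthonormal. Requires a power-of-two length.
static void fwht(std::vector<float> & v) {
    const size_t n = v.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j]     = a + b; // butterfly: sum
                v[j + h] = a - b; // butterfly: difference
            }
        }
    }
    const float s = 1.0f / std::sqrt((float) n);
    for (float & x : v) {
        x *= s;
    }
}

// Plain inner product of two equal-length vectors.
static float dot(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    float acc = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Transforming both a query-like and a key-like vector leaves their dot product unchanged up to floating-point rounding, since the scaled transform is an orthogonal matrix.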