diff --git a/README.md b/README.md
index e72113a..a0af2ae 100644
--- a/README.md
+++ b/README.md
@@ -64,17 +64,17 @@ Defuser currently supports the following `transformers>=5.3.0` `model_type` valu
 
 ### 🔄 `convert_model(model)` after load
 
-| Pattern | Supported model types | Defused op performed ⚙️ |
-| --- | --- | --- |
+| Pattern | Supported model types | Defused op performed ⚙️ |
+| --- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --- |
 | Standard routed expert tensors 🧱 | `deepseek_v2`, `dots1`, `ernie4_5_moe`, `ernie4_5_vl_moe`, `exaone_moe`, `flex_olmo`, `glm4_moe_lite`, `glm4v_moe`, `hunyuan_v1_moe`, `jamba`, `laguna`, `lfm2_moe`, `minimax`, `minimax_m2`, `olmoe`, `qwen3_vl_moe`, `solar_open` | Splits fused expert tensors or registered expert buffers into numbered expert `nn.Linear` modules with per-expert `gate_proj`, `up_proj`, and `down_proj`. |
-| Mixed sparse and shared experts | `deepseek_v3`, `glm_moe_dsa`, `qwen3_5_moe`, `qwen3_5_moe_text` | Runtime expert tensor defusion for routed experts while preserving the model's shared-expert path. |
-| Transposed or packed expert tensors | `gpt_oss`, `phimoe` | Splits transposed fused expert `gate_up_proj` tensors into per-expert `gate_proj` + `up_proj`, preserves expert bias when present, and converts expert tensors into numbered expert `nn.Linear` modules. |
-| Flattened expert layout | `dbrx` | Rebuilds the flattened DBRX expert FFN weights into numbered expert `gate_proj`, `up_proj`, and `down_proj` `nn.Linear` modules. |
-| Batched expert-input execution | `llama4` | Runtime expert tensor defusion plus preservation of the llama4 batched expert-input execution contract. |
-| Non-gated expert MLPs | `nemotron_h` | Converts routed expert tensors into numbered `up_proj` and `down_proj` `nn.Linear` modules for non-gated experts. |
-| Parallel expert blocks | `granitemoe`, `granitemoehybrid`, `granitemoeshared`, `jetmoe` | Converts packed expert weight tensors into numbered expert `linear` modules while keeping grouped expert execution intact. |
-| Routed experts with identity experts | `longcat_flash` | Defuses routed experts into numbered `gate_proj`, `up_proj`, and `down_proj` modules and preserves zero or identity experts. |
-| Fused dense `gate_up_proj` MLPs | `dia`, `glm`, `glm4`, `glm_image`, `glm_ocr`, `phi3`, `phi4_multimodal`, `zamba2` | Splits fused dense `gate_up_proj` layers into `gate_proj` + `up_proj` and updates the block `forward()` to preserve the original MLP math. |
+| Mixed sparse and shared experts | `deepseek_v3`, `deepseek_v4`, `glm_moe_dsa`, `qwen3_5_moe`, `qwen3_5_moe_text` | Runtime expert tensor defusion for routed experts while preserving the model's shared-expert path. |
+| Transposed or packed expert tensors | `gpt_oss`, `phimoe` | Splits transposed fused expert `gate_up_proj` tensors into per-expert `gate_proj` + `up_proj`, preserves expert bias when present, and converts expert tensors into numbered expert `nn.Linear` modules. |
+| Flattened expert layout | `dbrx` | Rebuilds the flattened DBRX expert FFN weights into numbered expert `gate_proj`, `up_proj`, and `down_proj` `nn.Linear` modules. |
+| Batched expert-input execution | `llama4` | Runtime expert tensor defusion plus preservation of the llama4 batched expert-input execution contract. |
+| Non-gated expert MLPs | `nemotron_h` | Converts routed expert tensors into numbered `up_proj` and `down_proj` `nn.Linear` modules for non-gated experts. |
+| Parallel expert blocks | `granitemoe`, `granitemoehybrid`, `granitemoeshared`, `jetmoe` | Converts packed expert weight tensors into numbered expert `linear` modules while keeping grouped expert execution intact. |
+| Routed experts with identity experts | `longcat_flash` | Defuses routed experts into numbered `gate_proj`, `up_proj`, and `down_proj` modules and preserves zero or identity experts. |
+| Fused dense `gate_up_proj` MLPs | `dia`, `glm`, `glm4`, `glm_image`, `glm_ocr`, `phi3`, `phi4_multimodal`, `zamba2` | Splits fused dense `gate_up_proj` layers into `gate_proj` + `up_proj` and updates the block `forward()` to preserve the original MLP math. |
 
 ## 🔁 Workflow Summary
diff --git a/defuser/defuser.py b/defuser/defuser.py
index 75c2d67..01c3ebe 100644
--- a/defuser/defuser.py
+++ b/defuser/defuser.py
@@ -75,7 +75,6 @@ def replace_fused_blocks(model_type: str) -> bool:
     try:
         orig_module = importlib.import_module(orig_module_path)
         custom_module = importlib.import_module(custom_module_path)
-        print("orig_module", orig_module, orig_class_name)
         # Validate class existence before patching
         if not hasattr(orig_module, orig_class_name):
             raise PatchError(f"Original class[{orig_class_name}] not found: {orig_module}")
diff --git a/defuser/model_registry.py b/defuser/model_registry.py
index 37d919f..2183aa1 100644
--- a/defuser/model_registry.py
+++ b/defuser/model_registry.py
@@ -25,6 +25,9 @@ class PATCH(str, Enum):
     "deepseek_v3": {
         "min_transformers_version": MIN_SUPPORTED_TRANSFORMERS_VERSION,
     },
+    "deepseek_v4": {
+        "min_transformers_version": MIN_SUPPORTED_TRANSFORMERS_VERSION,
+    },
     "dia": {
         "min_transformers_version": MIN_SUPPORTED_TRANSFORMERS_VERSION,
     },
diff --git a/defuser/modeling/replace_modules.py b/defuser/modeling/replace_modules.py
index 97b6d7c..a83715c 100644
--- a/defuser/modeling/replace_modules.py
+++ b/defuser/modeling/replace_modules.py
@@ -325,7 +325,6 @@ def _apply_custom_replacements(
                 module,
                 model.config,
             ).to(orig_dtype)
-            print("replacement", replacement)
             model.set_submodule(name, replacement)
             replaced.append((name, replacement_cls))
         else:
diff --git a/pyproject.toml b/pyproject.toml
index 50ae60b..1bfb14c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -9,7 +9,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "Defuser"
-version = "0.0.21"
+version = "0.0.22"
 description = "Model defuser helper for HF Transformers."
 readme = "README.md"
 requires-python = ">=3.9"
diff --git a/tests/test_meta_model_defusion.py b/tests/test_meta_model_defusion.py
index aa88b87..1acd8b7 100644
--- a/tests/test_meta_model_defusion.py
+++ b/tests/test_meta_model_defusion.py
@@ -331,6 +331,17 @@ def _validate_defused_module(case: dict, module) -> None:
         "validator": "experts",
         "min_targets": 2,
     },
+    {
+        "model_type": "deepseek_v4",
+        "mode": "convert",
+        "model_module": "transformers.models.deepseek_v4.modeling_deepseek_v4",
+        "model_class": "DeepseekV4ForCausalLM",
+        "config_module": "transformers.models.deepseek_v4.configuration_deepseek_v4",
+        "config_class": "DeepseekV4Config",
+        "target_class_paths": ("transformers.models.deepseek_v4.modeling_deepseek_v4.DeepseekV4Experts",),
+        "validator": "experts",
+        "min_targets": 2,
+    },
     {
         "model_type": "dia",
         "mode": "convert",