Merged
20 changes: 10 additions & 10 deletions README.md
@@ -64,17 +64,17 @@ Defuser currently supports the following `transformers>=5.3.0` `model_type` values

### 🔄 `convert_model(model)` after load

| Pattern | Supported model types | Defused op performed ⚙️ |
| --- | --- | --- |
| Standard routed expert tensors 🧱 | `deepseek_v2`, `dots1`, `ernie4_5_moe`, `ernie4_5_vl_moe`, `exaone_moe`, `flex_olmo`, `glm4_moe_lite`, `glm4v_moe`, `hunyuan_v1_moe`, `jamba`, `laguna`, `lfm2_moe`, `minimax`, `minimax_m2`, `olmoe`, `qwen3_vl_moe`, `solar_open` | Splits fused expert tensors or registered expert buffers into numbered expert `nn.Linear` modules with per-expert `gate_proj`, `up_proj`, and `down_proj`. |
| Mixed sparse and shared experts | `deepseek_v3`, `deepseek_v4`, `glm_moe_dsa`, `qwen3_5_moe`, `qwen3_5_moe_text` | Runtime expert tensor defusion for routed experts while preserving the model's shared-expert path. |
| Transposed or packed expert tensors | `gpt_oss`, `phimoe` | Splits transposed fused expert `gate_up_proj` tensors into per-expert `gate_proj` + `up_proj`, preserves expert bias when present, and converts expert tensors into numbered expert `nn.Linear` modules. |
| Flattened expert layout | `dbrx` | Rebuilds the flattened DBRX expert FFN weights into numbered expert `gate_proj`, `up_proj`, and `down_proj` `nn.Linear` modules. |
| Batched expert-input execution | `llama4` | Runtime expert tensor defusion plus preservation of the llama4 batched expert-input execution contract. |
| Non-gated expert MLPs | `nemotron_h` | Converts routed expert tensors into numbered `up_proj` and `down_proj` `nn.Linear` modules for non-gated experts. |
| Parallel expert blocks | `granitemoe`, `granitemoehybrid`, `granitemoeshared`, `jetmoe` | Converts packed expert weight tensors into numbered expert `linear` modules while keeping grouped expert execution intact. |
| Routed experts with identity experts | `longcat_flash` | Defuses routed experts into numbered `gate_proj`, `up_proj`, and `down_proj` modules and preserves zero or identity experts. |
| Fused dense `gate_up_proj` MLPs | `dia`, `glm`, `glm4`, `glm_image`, `glm_ocr`, `phi3`, `phi4_multimodal`, `zamba2` | Splits fused dense `gate_up_proj` layers into `gate_proj` + `up_proj` and updates the block `forward()` to preserve the original MLP math. |
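The fused-to-split conversions in the table all come down to the same tensor surgery. As a minimal sketch (not Defuser's actual code — the helper name `split_gate_up` and the gate-rows-then-up-rows stacking order are assumptions), splitting one fused dense `gate_up_proj` layer might look like:

```python
import torch
from torch import nn


def split_gate_up(fused: nn.Linear) -> "tuple[nn.Linear, nn.Linear]":
    """Split a fused gate_up_proj nn.Linear into separate gate_proj and up_proj.

    Assumes the fused weight stacks all gate rows first, then all up rows;
    real checkpoints vary, which is why Defuser handles layouts per model_type.
    """
    out_features, in_features = fused.weight.shape
    half = out_features // 2
    has_bias = fused.bias is not None
    gate = nn.Linear(in_features, half, bias=has_bias)
    up = nn.Linear(in_features, half, bias=has_bias)
    with torch.no_grad():
        # Copy each half of the fused weight (and bias) into its own module.
        gate.weight.copy_(fused.weight[:half])
        up.weight.copy_(fused.weight[half:])
        if has_bias:
            gate.bias.copy_(fused.bias[:half])
            up.bias.copy_(fused.bias[half:])
    return gate, up
```

Under that layout assumption, concatenating the two split outputs reproduces the fused output exactly, which is what "preserve the original MLP math" requires.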

## 🔁 Workflow Summary

1 change: 0 additions & 1 deletion defuser/defuser.py
```diff
@@ -75,7 +75,6 @@ def replace_fused_blocks(model_type: str) -> bool:
     try:
         orig_module = importlib.import_module(orig_module_path)
         custom_module = importlib.import_module(custom_module_path)
-        print("orig_module", orig_module, orig_class_name)
         # Validate class existence before patching
         if not hasattr(orig_module, orig_class_name):
             raise PatchError(f"Original class[{orig_class_name}] not found: {orig_module}")
```
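The hunk above removes a leftover debug `print` from the import-and-validate step. The surrounding pattern — import both modules, verify the class exists, then rebind the name — can be sketched roughly like this (an assumption-laden stand-in for `replace_fused_blocks`, not Defuser's code; it assumes the custom module exposes a class of the same name):

```python
import importlib


class PatchError(RuntimeError):
    """Raised when a fused block cannot be swapped for its defused variant."""


def swap_class(orig_module_path: str, orig_class_name: str, custom_module_path: str) -> None:
    # Import both modules up front so a bad path fails before anything is patched.
    orig_module = importlib.import_module(orig_module_path)
    custom_module = importlib.import_module(custom_module_path)
    # Validate class existence before patching.
    if not hasattr(orig_module, orig_class_name):
        raise PatchError(f"Original class[{orig_class_name}] not found: {orig_module}")
    # Rebind the original name to the custom (defused) implementation.
    setattr(orig_module, orig_class_name, getattr(custom_module, orig_class_name))
```

Validating with `hasattr` before `setattr` turns a silent no-op patch (e.g. after a `transformers` rename) into an explicit `PatchError`.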
3 changes: 3 additions & 0 deletions defuser/model_registry.py
```diff
@@ -25,6 +25,9 @@ class PATCH(str, Enum):
     "deepseek_v3": {
         "min_transformers_version": MIN_SUPPORTED_TRANSFORMERS_VERSION,
     },
+    "deepseek_v4": {
+        "min_transformers_version": MIN_SUPPORTED_TRANSFORMERS_VERSION,
+    },
     "dia": {
         "min_transformers_version": MIN_SUPPORTED_TRANSFORMERS_VERSION,
     },
```
1 change: 0 additions & 1 deletion defuser/modeling/replace_modules.py
```diff
@@ -325,7 +325,6 @@ def _apply_custom_replacements(
                     module,
                     model.config,
                 ).to(orig_dtype)
-                print("replacement", replacement)
                 model.set_submodule(name, replacement)
                 replaced.append((name, replacement_cls))
             else:
```
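`model.set_submodule(name, replacement)` above is PyTorch's dotted-path submodule setter; on torch versions that lack it, an equivalent can be written by hand (a portable sketch under that assumption, not Defuser's code — Defuser also casts the replacement with `.to(orig_dtype)` first so dtypes match):

```python
from torch import nn


def set_submodule(model: nn.Module, name: str, replacement: nn.Module) -> None:
    """Replace the submodule at dotted path `name`, e.g. "layers.0.mlp"."""
    parent_path, _, child = name.rpartition(".")
    # Resolve the parent module; an empty parent path means the root model.
    parent = model.get_submodule(parent_path) if parent_path else model
    # Registering via setattr updates parent._modules, so state_dict,
    # parameters(), and forward() all see the replacement.
    setattr(parent, child, replacement)
```

Replacing modules through the parent's attribute (rather than mutating weights in place) is what lets `_apply_custom_replacements` swap a fused block for a structurally different defused one.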
2 changes: 1 addition & 1 deletion pyproject.toml
```diff
@@ -9,7 +9,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "Defuser"
-version = "0.0.21"
+version = "0.0.22"
 description = "Model defuser helper for HF Transformers."
 readme = "README.md"
 requires-python = ">=3.9"
```
11 changes: 11 additions & 0 deletions tests/test_meta_model_defusion.py
```diff
@@ -331,6 +331,17 @@ def _validate_defused_module(case: dict, module) -> None:
         "validator": "experts",
         "min_targets": 2,
     },
+    {
+        "model_type": "deepseek_v4",
+        "mode": "convert",
+        "model_module": "transformers.models.deepseek_v4.modeling_deepseek_v4",
+        "model_class": "DeepseekV4ForCausalLM",
+        "config_module": "transformers.models.deepseek_v4.configuration_deepseek_v4",
+        "config_class": "DeepseekV4Config",
+        "target_class_paths": ("transformers.models.deepseek_v4.modeling_deepseek_v4.DeepseekV4Experts",),
+        "validator": "experts",
+        "min_targets": 2,
+    },
     {
         "model_type": "dia",
         "mode": "convert",
```