feat: MoE gate topology + expert clustering + scaffold cross-reference #58
…ss-ref

extract_gate_topology(): pulls ffn_gate_inp Base17 rows from bgz7, one row per expert. Each row IS the expert's structural identity.

cluster_experts(): pairwise L1 between experts within each block, connected-component grouping of structurally interchangeable experts. At threshold=500, Maverick's 123,000× compression predicts >90% redundancy.

cross_reference_gate_scaffold(): links attention scaffold blocks (Q+O shifted from Qwen3.5 diff) with gate redundancy per block. Routing-dominated blocks = reasoning changes work through the router, not through the expert weights.

Tests:
- test_maverick_gate_topology: load all 18 Maverick bgz7 shards
- test_cross_reference_gate_scaffold: full pipeline connecting Qwen3.5 attention diff with Maverick gate structure
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54ed7eedf3
    if !t.name.contains("gate_inp") && !t.name.contains("gate.weight") {
        continue;
Restrict gate tensor matching to router-only names
Narrowing logic here is too broad: matching "gate.weight" pulls in dense FFN gate tensors (e.g., blk.{i}.ffn_gate.weight is a SiLU MLP gate, not a router gate), so extract_gate_topology will treat thousands of FFN rows as experts and feed them into cluster_experts. That corrupts redundancy conclusions and can make the O(n²) adjacency allocation/computation explode for normal dense blocks, especially when running the Maverick shard pipeline.
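One way to narrow the match, sketched under assumptions: the helper name `is_router_gate` is hypothetical, and the naming conventions it relies on (GGUF-style `blk.{i}.ffn_gate_inp.weight` for routers, and a trailing `.gate.weight` component for routers in some Hugging Face style checkpoints) should be checked against the tensor names actually present in the bgz7 shards.

```rust
/// Hypothetical helper: returns true only for names that look like MoE router tensors.
///
/// Assumptions (not confirmed by the PR):
/// - GGUF-style router tensors contain "ffn_gate_inp".
/// - Some checkpoints name the router "<prefix>.gate.weight", where "gate" is a
///   full dotted component; the dense SiLU gate "ffn_gate.weight" does not end
///   with ".gate.weight" because "gate" is preceded by '_' there.
fn is_router_gate(name: &str) -> bool {
    name.contains("ffn_gate_inp") || name.ends_with(".gate.weight")
}
```

With a predicate like this, the loop would skip a tensor when `!is_router_gate(&t.name)`, so dense `blk.{i}.ffn_gate.weight` rows never reach cluster_experts.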
What
Extends causal_diff.rs with MoE gate topology analysis, the other half of the reasoning reverse-engineering pipeline.
New functions
extract_gate_topology(bgz7_path)
Finds ffn_gate_inp tensors in a bgz7 file. Each row = one expert's activation fingerprint as Base17. For Maverick: 128 rows per MoE block.
cluster_experts(fingerprints, threshold)
Pairwise L1 between all experts within each block. Connected-component grouping finds structurally interchangeable expert groups. At 123,000× compression on expert weights, we expect >90% of pairs to be redundant.
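The clustering step described above can be sketched as pairwise L1 distances plus a union-find pass over one block. This is a minimal illustration, not the code in causal_diff.rs: the names `l1`, `find`, and `cluster_block` are hypothetical, a fingerprint is modelled as a plain `Vec<i32>` standing in for a Base17 row, and treating "distance <= threshold" as the edge rule is an assumption.

```rust
use std::collections::BTreeMap;

/// L1 (Manhattan) distance between two equal-length fingerprints.
/// Assumption: a fingerprint is a vector of integers (stand-in for a Base17 row).
fn l1(a: &[i32], b: &[i32]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (*x as i64 - *y as i64).unsigned_abs())
        .sum()
}

/// Union-find root lookup with path halving.
fn find(parent: &mut [usize], mut i: usize) -> usize {
    while parent[i] != i {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    i
}

/// Groups the experts of one block into connected components, linking two
/// experts whenever their L1 distance is at most `threshold` (assumed rule).
/// Returns the expert indices of each group.
fn cluster_block(fps: &[Vec<i32>], threshold: u64) -> Vec<Vec<usize>> {
    let n = fps.len();
    let mut parent: Vec<usize> = (0..n).collect();
    // O(n^2) pairwise comparison within the block.
    for i in 0..n {
        for j in (i + 1)..n {
            if l1(&fps[i], &fps[j]) <= threshold {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                if ri != rj {
                    parent[rj] = ri;
                }
            }
        }
    }
    // Collect members by root; BTreeMap keeps the group order deterministic.
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..n {
        let root = find(&mut parent, i);
        groups.entry(root).or_default().push(i);
    }
    groups.into_values().collect()
}
```

Because grouping is by connected component, membership is transitive: two experts can land in the same group through a chain of close neighbours even when their own distance exceeds the threshold. The pairwise loop is quadratic in the number of rows, which is why the row count per block matters.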
cross_reference_gate_scaffold(clusters, scaffold_blocks)
The key insight connector:
Tests
- test_maverick_gate_topology: loads all 18 Maverick bgz7 shards, extracts gates, clusters
- test_cross_reference_gate_scaffold: full pipeline: Qwen3.5 diff → scaffold blocks → Maverick gates → routing dominance check
The loop that closes