feat: HiDream-I1 diffusion indexing — first cross-domain MoE validation #61
Conversation
Adds 5 test functions for the first diffusion model indexing:

- test_stream_index_hidream_transformer (7 shards, 35 GB DiT+MoE)
- test_stream_index_hidream_text_encoders (CLIP-L + CLIP-G + Llama-3.1-8B)
- test_stream_index_llama31_8b_base (4 shards, ungated via unsloth)
- test_hidream_llama_diff (cross-domain: language→vision attention shift)

Adds an index_safetensors_shards() helper that handles HEAD size detection, HTTP range reading, shard iteration, and skip-if-exists logic. Syntax verified with rustc 1.94.1.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5bf4d45d82
```rust
if std::fs::metadata(&out_path).is_ok() {
    eprintln!("SKIP {} (exists)", out_path);
    continue;
```
Preserve stats when cached shard outputs are reused
When an output shard file already exists, this branch continues without adding any IndexStats, but test_stream_index_hidream_transformer later sums stats and asserts total_tensors > 50. On a rerun where all /tmp/hidream_transformer_shard*.bgz7 files already exist, stats is empty and the assertion fails even though indexing already succeeded, making the test non-idempotent.
```rust
let pairs = [
    ("/tmp/llama31_8b_base_shard01.bgz7", "/tmp/hidream_llama_enc_shard01.bgz7", "shard 1"),
    ("/tmp/llama31_8b_base_shard02.bgz7", "/tmp/hidream_llama_enc_shard02.bgz7", "shard 2"),
];
```
Include all base shards in HiDream-vs-base diff
The base index test writes four shard files (model-...-of-00004), but this diff pair list compares only shard 1 and shard 2, so tensors that land in base shards 3 and 4 are never analyzed. That means the reported aggregate shift metrics are computed from a partial model and can materially skew the cross-domain conclusions.
What
Session prompt + test functions for indexing HiDream-I1-Full, a 17B DiT+MoE diffusion model (MIT license, ungated).
This is the first cross-domain validation: do image generation MoE experts show the same structural redundancy as LLM MoE experts?
Tests added
- test_stream_index_hidream_transformer
- test_stream_index_hidream_text_encoders
- test_stream_index_llama31_8b_base
- test_hidream_llama_diff

Cross-domain hypothesis
Bonus: what seeing adds to reading
HiDream uses Llama-3.1-8B as text encoder 3, fine-tuned for image conditioning.
Diffing against base Llama-3.1-8B reveals which attention heads re-routed when
the model learned to condition image generation. Same NARS pipeline as the
Qwen3.5 reasoning diff, different capability injection.
Files
- .claude/prompts/SESSION_HIDREAM_DIFFUSION.md — session prompt
- src/hpc/safetensors.rs — added index_safetensors_shards() helper + 5 test functions