[FlyDSL AOT] Parallelize standalone main() compile drivers#3769

Open
zhiding512 wants to merge 2 commits into main from zhimding/flydsl_aot_multithread

@zhiding512


The per-module `python -m aiter.aot.flydsl.<op>` drivers compiled kernels in serial for-loops, even though the setup.py build path already fans out across a ProcessPoolExecutor via start_aot/wait_aot. Consolidate the pool logic into a shared `run_jobs_parallel(worker, jobs)` helper in common.py and route all four main()s (moe, gemm, grouped_moe, chunk_gdn_h) through it.

  • Extract `_resolve_max_workers` so start_aot and the main() drivers honour the same AITER_FLYDSL_AOT_WORKERS knob and never desync on pool size.
  • moe stage1+stage2 and gemm hgemm+preshuffle share one pool each (the compiles are independent; stage2 does not read stage1's cache artifact).
  • Per-kernel worker crashes are caught and tallied so one bad kernel does not abort the batch; crash label falls back to a job repr for kinds with no kernel_name (chunk_gdn_h).
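
A minimal sketch of what a helper along these lines could look like, for readers who want the shape of the change without opening the diff. Only the names `run_jobs_parallel(worker, jobs)`, `AITER_FLYDSL_AOT_WORKERS`, and `kernel_name` come from the description above. The function bodies, the `resolve_max_workers` and `job_label` helpers, the dict-shaped jobs, the CPU-count fallback, and the `max_workers` / `executor_cls` parameters are assumptions made for illustration and are not the code in common.py.

```python
import os
from concurrent import futures


def resolve_max_workers():
    """Pool size from AITER_FLYDSL_AOT_WORKERS, else the CPU count (assumed fallback)."""
    raw = os.environ.get("AITER_FLYDSL_AOT_WORKERS", "").strip()
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return os.cpu_count() or 1


def job_label(job):
    """Prefer kernel_name; fall back to repr(job) for job kinds that have none."""
    if isinstance(job, dict) and job.get("kernel_name"):
        return job["kernel_name"]
    return repr(job)


def run_jobs_parallel(worker, jobs, max_workers=None,
                      executor_cls=futures.ProcessPoolExecutor):
    """Run worker(job) for each job in a pool.

    Returns (results, crashed), where crashed holds the labels of jobs whose
    worker raised. executor_cls is a hypothetical seam so the tally logic can
    be exercised without spawning processes.
    """
    jobs = list(jobs)
    results, crashed = [], []
    if not jobs:
        return results, crashed
    workers = min(max_workers or resolve_max_workers(), len(jobs))
    with executor_cls(max_workers=workers) as pool:
        pending = {pool.submit(worker, job): job for job in jobs}
        for fut in futures.as_completed(pending):
            try:
                results.append(fut.result())
            except Exception:
                # One bad kernel is recorded and the rest of the batch continues.
                crashed.append(job_label(pending[fut]))
    return results, crashed
```

With a process pool, the worker has to be a picklable top-level function, and the call belongs under an `if __name__ == "__main__":` guard so that child processes started with the spawn or forkserver methods do not re-run the driver.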

Must be multi-process, not multi-thread: compile_one_config mutates process-global state (ARCH / FLYDSL_GPU_ARCH env overrides, FakeTensorMode), which threads would corrupt.
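
The threading hazard can be shown in a few lines. The sketch below uses a hypothetical stand-in for `override_env` (the real helper's implementation is not shown in this PR page) and assumes it works by writing to `os.environ`. Because `os.environ` is shared by every thread in a process, an override that one thread applies for its own compile is visible to all the others.

```python
import os
import threading
from contextlib import contextmanager


@contextmanager
def override_env(name, value):
    """Hypothetical stand-in: set os.environ[name], restore the old value on exit."""
    missing = object()
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


def arch_seen_by_other_thread():
    """Read FLYDSL_GPU_ARCH from a second thread and return what it saw."""
    seen = []
    reader = threading.Thread(
        target=lambda: seen.append(os.environ.get("FLYDSL_GPU_ARCH"))
    )
    reader.start()
    reader.join()
    return seen[0]


before = os.environ.get("FLYDSL_GPU_ARCH")
with override_env("FLYDSL_GPU_ARCH", "gfx942"):
    # The override was applied by the main thread, yet the other thread sees it.
    leaked = arch_seen_by_other_thread()
after = os.environ.get("FLYDSL_GPU_ARCH")
```

Here `leaked` holds the overridden value even though the reading thread never asked for it. Two threads compiling for different architectures would overwrite each other's setting in the same way, whereas separate worker processes each get their own copy of the environment.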

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@zhiding512 zhiding512 requested review from a team and Copilot June 17, 2026 06:42
@github-actions


🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3769 --add-label <label>

Copilot AI left a comment


Pull request overview

This PR parallelizes the standalone FlyDSL AOT compilation drivers (run via python -m aiter.aot.flydsl.<op>) by consolidating shared ProcessPoolExecutor logic into aiter/aot/flydsl/common.py, aligning worker-count resolution across both the setup/build AOT path and the per-op CLI drivers.

Changes:

  • Added shared run_jobs_parallel(worker, jobs) and _resolve_max_workers() in common.py, and updated start_aot() to reuse the same worker-count resolution.
  • Updated moe.py, gemm.py, grouped_moe.py, and chunk_gdn_h.py CLIs to compile kernels via the shared multiprocessing helper (with per-job crash handling).
  • Simplified per-op drivers by replacing serial loops with pooled execution and consistent summaries/exit codes.
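
As a rough illustration of "consistent summaries/exit codes", a driver could reduce its tallies to one summary line and a process exit status. The `summarize` function, its parameters, and the message format below are hypothetical and are not taken from the PR.

```python
def summarize(op, compiled, failed, crashed):
    """Return (summary_line, exit_code) for one driver run.

    compiled / failed / crashed are counts of kernels that built, that reported
    a compile failure, and whose worker raised. Any failure or crash gives a
    non-zero exit code so calling scripts can detect a bad batch.
    """
    total = compiled + failed + crashed
    line = f"[{op}] {compiled}/{total} compiled, {failed} failed, {crashed} crashed"
    exit_code = 0 if failed == 0 and crashed == 0 else 1
    return line, exit_code
```

A `main()` would then print the line and pass the code to `sys.exit`.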

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| aiter/aot/flydsl/common.py | Adds shared parallel job runner + shared max-worker resolution; start_aot() now uses the shared knob logic. |
| aiter/aot/flydsl/moe.py | Routes stage1+stage2 compilation through run_jobs_parallel. |
| aiter/aot/flydsl/gemm.py | Routes hgemm+preshuffle compilation through run_jobs_parallel. |
| aiter/aot/flydsl/grouped_moe.py | Routes grouped-MoE compilation through run_jobs_parallel and adds a summary/exit code. |
| aiter/aot/flydsl/chunk_gdn_h.py | Routes chunk-gdn-h compilation through run_jobs_parallel. |


Comment thread aiter/aot/flydsl/moe.py
Comment on lines +800 to +804
with (
    override_env("ARCH", aot_arch),
    override_env("FLYDSL_GPU_ARCH", aot_arch),
    FakeTensorMode(),
):

Comment thread aiter/aot/flydsl/gemm.py
Comment on lines +376 to +380
with (
    override_env("ARCH", aot_arch),
    override_env("FLYDSL_GPU_ARCH", aot_arch),
    FakeTensorMode(),
):

Comment on lines +329 to +333
with (
    override_env("ARCH", aot_arch),
    override_env("FLYDSL_GPU_ARCH", aot_arch),
    FakeTensorMode(),
):

Comment on lines +203 to +207
with (
    compile_only_env(),
    override_env("FLYDSL_GPU_ARCH", aot_arch),
    FakeTensorMode(),
):

Set HIP_VISIBLE_DEVICES and AITER_AOT_IMPORT only when the module runs
directly as the compile driver, before the aiter imports that read them
at import time. Importing these modules as a library no longer mutates
the environment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
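
The pattern this second commit describes can be sketched as follows. The `configure_driver_env` helper and the values it assigns are placeholders, and the use of `setdefault` is an assumption; the PR page does not show what the drivers actually set. The sketch only illustrates the two properties the commit message names: the variables are set before the imports that read them, and only when the module is the entry point.

```python
import os


def configure_driver_env(module_name, env=os.environ):
    """Set compile-driver variables only when the module is run as the entry point.

    Returns True if the variables were applied, False if the module was
    imported as a library and the environment was left untouched.
    """
    if module_name != "__main__":
        return False
    # Placeholder values, not taken from the PR.
    env.setdefault("HIP_VISIBLE_DEVICES", "")
    env.setdefault("AITER_AOT_IMPORT", "1")
    return True


# This call has to come before any import that reads the variables at import time.
configure_driver_env(__name__)
# The imports that read those variables would follow here.
```

Passing `__name__` explicitly keeps the check testable: a library import sees the module's dotted name and skips the mutation, while `python -m` execution sees `"__main__"`.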