[FlyDSL AOT] Parallelize standalone main() compile drivers #3769
Open
zhiding512 wants to merge 2 commits into
The per-module `python -m aiter.aot.flydsl.<op>` drivers compiled kernels in serial for-loops, even though the setup.py build path already fans out across a ProcessPoolExecutor via start_aot/wait_aot. Consolidate the pool logic into a shared `run_jobs_parallel(worker, jobs)` helper in common.py and route all four main()s (moe, gemm, grouped_moe, chunk_gdn_h) through it.

- Extract `_resolve_max_workers` so start_aot and the main() drivers honour the same AITER_FLYDSL_AOT_WORKERS knob and never desync on pool size.
- moe stage1+stage2 and gemm hgemm+preshuffle share one pool each (the compiles are independent; stage2 does not read stage1's cache artifact).
- Per-kernel worker crashes are caught and tallied so one bad kernel does not abort the batch; the crash label falls back to a job repr for kinds with no kernel_name (chunk_gdn_h).

Must be multi-process, not multi-thread: compile_one_config mutates process-global state (ARCH / FLYDSL_GPU_ARCH env overrides, FakeTensorMode), which threads would corrupt.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pull request overview
This PR parallelizes the standalone FlyDSL AOT compilation drivers (run via python -m aiter.aot.flydsl.<op>) by consolidating shared ProcessPoolExecutor logic into aiter/aot/flydsl/common.py, aligning worker-count resolution across both the setup/build AOT path and the per-op CLI drivers.
Changes:
- Added shared `run_jobs_parallel(worker, jobs)` and `_resolve_max_workers()` in `common.py`, and updated `start_aot()` to reuse the same worker-count resolution.
- Updated `moe.py`, `gemm.py`, `grouped_moe.py`, and `chunk_gdn_h.py` CLIs to compile kernels via the shared multiprocessing helper (with per-job crash handling).
- Simplified per-op drivers by replacing serial loops with pooled execution and consistent summaries/exit codes.
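The "consistent summaries/exit codes" point can be illustrated with a small sketch. The function name `summarize`, its arguments, and the message format are hypothetical; the page does not show what the drivers actually print. The only behaviour assumed here is that a driver exits non-zero when any kernel failed or crashed.

```python
def summarize(kind, ok, failed, crashed):
    """Print a one-line tally for a driver and return its process exit code."""
    total = ok + failed + len(crashed)
    print(f"[{kind}] {ok}/{total} compiled, {failed} failed, {len(crashed)} crashed")
    for label in crashed:
        # Crashed jobs are listed individually so they can be re-run by name.
        print(f"  crashed: {label}")
    # Assumption: any failure or crash makes the whole driver exit non-zero.
    return 0 if failed == 0 and not crashed else 1
```

A driver's `main()` would then end with something like `sys.exit(summarize("moe", ok, failed, crashed))`, so all four CLIs report results the same way.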
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| aiter/aot/flydsl/common.py | Adds shared parallel job runner + shared max-worker resolution; start_aot() now uses the shared knob logic. |
| aiter/aot/flydsl/moe.py | Routes stage1+stage2 compilation through run_jobs_parallel. |
| aiter/aot/flydsl/gemm.py | Routes hgemm+preshuffle compilation through run_jobs_parallel. |
| aiter/aot/flydsl/grouped_moe.py | Routes grouped-MoE compilation through run_jobs_parallel and adds a summary/exit code. |
| aiter/aot/flydsl/chunk_gdn_h.py | Routes chunk-gdn-h compilation through run_jobs_parallel. |
Comment on lines +800 to +804

    with (
        override_env("ARCH", aot_arch),
        override_env("FLYDSL_GPU_ARCH", aot_arch),
        FakeTensorMode(),
    ):
Comment on lines +376 to +380

    with (
        override_env("ARCH", aot_arch),
        override_env("FLYDSL_GPU_ARCH", aot_arch),
        FakeTensorMode(),
    ):
Comment on lines +329 to +333

    with (
        override_env("ARCH", aot_arch),
        override_env("FLYDSL_GPU_ARCH", aot_arch),
        FakeTensorMode(),
    ):
Comment on lines +203 to +207

    with (
        compile_only_env(),
        override_env("FLYDSL_GPU_ARCH", aot_arch),
        FakeTensorMode(),
    ):
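The implementation of `override_env` is not shown on this page. As a rough illustration of why the description calls for processes rather than threads, here is a hypothetical stand-in that sets an environment variable for the duration of a `with` block and restores it afterwards. The helper `arch_seen_by_other_thread` and the example value `gfx942` are made up for this sketch.

```python
import os
import threading
from contextlib import contextmanager


@contextmanager
def override_env(name, value):
    """Hypothetical stand-in: set os.environ[name] for the duration of the block."""
    missing = object()
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state, removing the variable if it was unset before.
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


def arch_seen_by_other_thread():
    """Read FLYDSL_GPU_ARCH from a second thread and return what it sees."""
    seen = []
    reader = threading.Thread(
        target=lambda: seen.append(os.environ.get("FLYDSL_GPU_ARCH"))
    )
    reader.start()
    reader.join()
    return seen[0]


if __name__ == "__main__":
    with override_env("FLYDSL_GPU_ARCH", "gfx942"):
        # The override is visible to every thread in this process.
        print(arch_seen_by_other_thread())
```

Because `os.environ` is shared by every thread in a process, two threads compiling for different architectures would overwrite each other's override, and whichever exits first would restore a value the other is still relying on. Separate worker processes each get their own copy of the environment, which avoids that interference.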
Set HIP_VISIBLE_DEVICES and AITER_AOT_IMPORT only when the module runs directly as the compile driver, before the aiter imports that read them at import time. Importing these modules as a library no longer mutates the environment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
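The pattern this commit describes can be sketched as follows. The function `apply_driver_env`, the `DRIVER_ENV` table, and in particular the values assigned to the two variables are placeholders; the page does not show what the drivers actually set, nor whether they overwrite values already present.

```python
import os

# Placeholder values: the page does not show what the drivers actually set.
DRIVER_ENV = {"HIP_VISIBLE_DEVICES": "", "AITER_AOT_IMPORT": "1"}


def apply_driver_env(module_name, environ=os.environ):
    """Set the compile-driver variables only when running as a script."""
    if module_name != "__main__":
        return False  # imported as a library: leave the environment alone
    for key, value in DRIVER_ENV.items():
        # Assumption: values already set by the caller take precedence.
        environ.setdefault(key, value)
    return True


# At the top of a driver module, before any import that reads these variables:
apply_driver_env(__name__)
# import aiter  # would follow here, after the environment is prepared
```

The ordering is the point: if a package reads the variables at import time, they have to be in place before that import statement executes, which is why the call sits above the import rather than inside `main()`.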