refactor: clean up code organization and refactor workflows to be extensible via image config files.#6250
Open
sirutBuasai wants to merge 111 commits into
Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
…into refactor/workflows
* fix(autocurrency): migrate vLLM/SGLang currency to refactored config schema
The workflow refactor moved + reshaped the image configs but left the
autocurrency scripts pointing at the old layout, silently breaking upstream
release detection and auto-PRs for vLLM and SGLang.
- tracker + agent-context: flat paths (vllm-ec2.yml) -> nested ubuntu variants
(vllm/ec2-ubuntu.yml). Only the ubuntu variants are tracked, matching current
currency scope.
- schema: .common.* -> .metadata.framework_version / .metadata.prod_image /
.metadata.os_version and .build.cuda_version / .build.python_version.
- value format: detect-versions now emits raw major.minor (12.9) and raw python
(3.12) to match the new build.* fields, instead of cu129/py312 short forms.
cuda compares on major.minor so the config's patch segment (13.0.2) is not
clobbered every run.
- agent-fix fallback path updated to the nested config location.
- tests: paths/fields/formats migrated; pr-title assertions updated to the
current "[Docs Update] ..." format and de-duplicated the version literal.
Suite: 41/41 passing.
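The major.minor comparison described in this commit can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the repository's implementation: the helper names `major_minor` and `needs_cuda_update` are hypothetical, and only the version strings (12.9, 13.0.2) come from the commit message.

```python
def major_minor(version: str) -> str:
    """Return the first two dot-separated segments, e.g. '13.0.2' -> '13.0'."""
    parts = version.split(".")
    return ".".join(parts[:2])


def needs_cuda_update(config_version: str, detected_version: str) -> bool:
    """True only when the detected major.minor differs from the config's."""
    return major_minor(config_version) != major_minor(detected_version)


# The config may carry a patch segment; detection reports major.minor only.
print(needs_cuda_update("13.0.2", "13.0"))  # same major.minor: leave 13.0.2 alone
print(needs_cuda_update("12.9", "13.0"))    # different major.minor: a real bump
```

Comparing the raw strings instead ("13.0.2" vs "13.0") would report a difference on every run, which is the clobbering the commit avoids.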
* fix(autocurrency): correct docs-pr.sh REPO_ROOT after script relocation
Scripts moved scripts/autocurrency/ -> scripts/ci/autocurrency/ (one level
deeper), but docs-pr.sh still computed REPO_ROOT with ../.. — resolving to
<repo>/scripts instead of the repo root. That broke the tracker lookup
(TRACKER), the docs output path (OUTPUT_FILE), and the git add path at release
time. Use ../../.. to match the new depth.
docs-pr.sh consumes the generated release spec (not the raw config), and the
release-spec generator already emits short-form python_version/cuda_version
(py312/cu130), so the tag-building logic needs no schema change.
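The depth mismatch behind the REPO_ROOT bug can be reproduced with plain path arithmetic. The sketch below is illustrative only: `/repo` is a placeholder checkout location and `repo_root_from` is a hypothetical helper, not a function in docs-pr.sh.

```python
from pathlib import PurePosixPath


def repo_root_from(script_path: str, levels_up: int) -> PurePosixPath:
    """Walk `levels_up` directories up from the directory containing the script."""
    root = PurePosixPath(script_path).parent
    for _ in range(levels_up):
        root = root.parent
    return root


# "/repo" is a placeholder for wherever the repository is checked out.
old_location = "/repo/scripts/autocurrency/docs-pr.sh"
new_location = "/repo/scripts/ci/autocurrency/docs-pr.sh"

print(repo_root_from(old_location, 2))  # ../.. was correct before the move
print(repo_root_from(new_location, 2))  # ../.. now stops one level short
print(repo_root_from(new_location, 3))  # ../../.. reaches the repo root again
```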
* fix(autocurrency): update currency-fix agent to renamed PR workflows
The agent-fix workflow gates on a hardcoded TRACKED regex of PR workflow
names. The refactor renamed the vLLM/SGLang PR workflows, so the old names
("PR - vLLM EC2", "PR - SGLang SageMaker", ...) matched nothing — the agent
saw zero tracked runs and silently skipped on every auto-update PR.
- _prcheck.currency-fix.yml: TRACKED -> "PR - vLLM Ubuntu|PR - SGLang Ubuntu",
matching the workflows that an ubuntu-only currency bump actually triggers.
- update-configs.sh: refresh stale docstring examples to the nested config
paths and renamed autorelease workflow filenames (comments only).
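The silent skip is easy to see once the name filter is written out. The following is a rough sketch, assuming the gate is a regex search over workflow run names. `OLD_TRACKED` is an illustrative stand-in built from the two old names quoted above (the real old pattern is not shown here), and the run names in `runs` are hypothetical.

```python
import re

# New pattern, quoted from the commit message.
TRACKED = re.compile(r"PR - vLLM Ubuntu|PR - SGLang Ubuntu")

# Illustrative stand-in for the old pattern (assumption, not the real regex).
OLD_TRACKED = re.compile(r"PR - vLLM EC2|PR - SGLang SageMaker")


def tracked_runs(pattern: re.Pattern, workflow_names: list[str]) -> list[str]:
    """Keep only the workflow names the pattern matches."""
    return [name for name in workflow_names if pattern.search(name)]


# Hypothetical run names on an auto-update PR after the rename.
runs = ["PR - vLLM Ubuntu", "PR - SGLang Ubuntu", "PR - PyTorch"]

print(tracked_runs(OLD_TRACKED, runs))  # nothing matches, so the agent skips
print(tracked_runs(TRACKED, runs))      # the two Ubuntu workflows are tracked
```

Because an empty match list looks the same as "nothing to fix", a hardcoded name filter like this fails quietly whenever workflows are renamed.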
* chore(autocurrency): delete stale test suite; fix stale workflow references
- Remove scripts/ci/autocurrency/tests/run-tests.sh — outdated, not wired into
CI, and only covered sourced helper functions (never the main execution
paths), so it gave false confidence (missed the docs-pr.sh REPO_ROOT bug).
- check-upstream-releases.sh: the auto-update PR body linked to the old
workflow filenames (scheduled-check-upstream-releases.yml,
prcheck-detect-versions.yml) which no longer exist after the rename — would
render as dead links in every auto-update PR. Point them at the renamed
_scheduled.check-upstream-releases.yml / _prcheck.detect-versions.yml.
- docs-pr.sh: fix stale usage-comment path (scripts/autocurrency -> scripts/ci/autocurrency).
* chore: remove accidentally committed local settings file
* fix(autocurrency): restore merge-triggered autorelease for vLLM/SGLang
The refactor dropped the push-on-merge trigger and the autocurrency-pr-gate
job from the tracked ubuntu autorelease workflows, leaving them cron +
workflow_dispatch only. As a result, merging an [Auto-Update] PR no longer
kicked off a release — it would wait until the next scheduled cron, breaking
the upstream-release -> PR -> merge -> autorelease -> docs-pr chain.
main has both the `push: branches: [main]` trigger and a gate job that only
proceeds for merged auto-update commits. Restore that pattern on the four
tracked ubuntu autorelease workflows (vllm/sglang x ec2/sagemaker):
- add `push: branches: [main]` to `on:`
- add `autocurrency-pr-gate` job: passes for non-push events, and for push
only when head_commit message contains "[Auto-Update] <fw>" authored by
aws-deep-learning-containers-ci[bot]
- gate the `discover` job on it (needs: [autocurrency-pr-gate]) so the whole
matrix is skipped for unrelated pushes to main
amzn2023 variants stay cron-only (not currency-tracked), matching main.
Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
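The decision made by the `autocurrency-pr-gate` job in the last commit above can be expressed as a small predicate. This is a sketch of the stated rule only, not the workflow's actual expression: `gate_passes` is a hypothetical name, the framework label "vLLM" stands in for the `<fw>` placeholder, and the example commit messages are made up.

```python
BOT_AUTHOR = "aws-deep-learning-containers-ci[bot]"


def gate_passes(event_name: str, framework: str,
                commit_message: str = "", commit_author: str = "") -> bool:
    """Mirror the rule: non-push events pass; pushes need a bot auto-update commit."""
    if event_name != "push":
        return True  # schedule and workflow_dispatch always proceed
    marker = f"[Auto-Update] {framework}"
    return marker in commit_message and commit_author == BOT_AUTHOR


# Scheduled run: passes regardless of commit details.
print(gate_passes("schedule", "vLLM"))
# Merge of an auto-update PR by the bot (made-up message): passes.
print(gate_passes("push", "vLLM", "[Auto-Update] vLLM bump", BOT_AUTHOR))
# Unrelated push to main: blocked, so `discover` and its matrix are skipped.
print(gate_passes("push", "vLLM", "docs: fix typo", "someone"))
```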
CI/CD Workflow Refactor
This PR refactors the entire CI/CD system from 70+ bespoke workflows into a composable, layered architecture.
What changed
Three-layer workflow composition:
- pr-*, autorelease-*: handle triggers, schedules, and change detection. They decide what to build and when, but contain zero test logic.
- *.pipeline.yml: one per framework. Orchestrate the build → test → release sequence. They decide which tests to run based on boolean inputs from the caller.
- _reusable.*, *.tests-*: single-purpose building blocks that run one test suite. They accept a config file + image URI, derive everything else internally.
Config-driven builds:
Simplified administration:
Code organization:
- scripts/docker/: scripts that run inside containers (COPY'd into images)
- scripts/ci/: scripts that run on CI runners (build hooks, wheel compilation, config parsing)
- .github/scripts/, scripts/, repo root, and inline in workflows.
Dot-namespaced workflow files:
- sglang.pr-amzn2023.yml, vllm.autorelease-ec2-amzn2023.yml, pytorch.tests-multi-gpu.yml
- _.
Unix-style actions:
- build-image builds, resolve-image-uri resolves, check-image-exists probes, discover-configs globs. Simple inputs, simple outputs.
- build-image action accepted 15+ inputs and handled wheel caching, sccache, tag computation, and Docker build all in one. Now those are separate concerns (build hooks, dedicated scripts, action steps).
Key benefits
- .github/release_schedule.md for the full picture. Staggered to avoid GPU fleet contention.
What to look at
- .github/config/image/: all 21 image configs. One file = one image.
- *.pipeline.yml: one per framework. This is where the build → test → release graph lives.
- _reusable.sanity-tests.yml: example of how shared tests derive everything from config.
- pr-*.yml: example of how callers are now just triggers + change detection + pipeline call.
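To make the "one file = one image" layout and the globbing role of `discover-configs` concrete, here is a rough Python sketch. It assumes configs sit at `<framework>/<variant>.yml` under `.github/config/image/`, which combines the directory named above with the nested paths from the commit message. The function `discover_configs` and the `sglang/ec2-ubuntu.yml` file name are hypothetical, and the real action is not shown in this PR text.

```python
import tempfile
from pathlib import Path


def discover_configs(config_root: Path, framework: str = "*") -> list[str]:
    """Return config paths relative to config_root, sorted for a stable matrix."""
    return sorted(
        p.relative_to(config_root).as_posix()
        for p in config_root.glob(f"{framework}/*.yml")
    )


# Build a throwaway tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".github" / "config" / "image"
    for rel in ("vllm/ec2-ubuntu.yml", "sglang/ec2-ubuntu.yml"):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()

    print(discover_configs(root))          # every image config found
    print(discover_configs(root, "vllm"))  # only one framework's configs
```

With this shape, adding an image means adding a file. The caller's matrix grows without any workflow edit, which is the extensibility the PR title refers to.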