refactor: clean up code organization and refactor workflows to be extensible via image config files. #6250

Open
sirutBuasai wants to merge 111 commits into main from refactor/workflows

sirutBuasai (Member) commented Jun 15, 2026

CI/CD Workflow Refactor

This PR refactors the entire CI/CD system from 70+ bespoke workflows into a composable, layered architecture.

What changed

Three-layer workflow composition:

  • Caller workflows (pr-*, autorelease-*): handle triggers, schedules, and change detection. They decide what to build and when, but contain zero test logic.
  • Pipeline workflows (*.pipeline.yml): one per framework. They orchestrate the build → test → release sequence and decide which tests to run based on boolean inputs from the caller.
  • Reusable test workflows (_reusable.*, *.tests-*): single-purpose building blocks that each run one test suite. They accept a config file and an image URI, and derive everything else internally.
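
To make the layering concrete, here is a minimal Python model of how a pipeline might turn the caller's boolean inputs into a build → test → release sequence. It is a sketch only: the real logic lives in the workflow YAML, and the input names (`run_sanity`, `run_efa`, ...) and stage labels are hypothetical, not taken from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineInputs:
    """Boolean switches a caller passes to a pipeline (field names are hypothetical)."""
    run_sanity: bool = True
    run_security: bool = True
    run_telemetry: bool = True
    run_efa: bool = False


def select_test_suites(inputs: PipelineInputs) -> list[str]:
    """Return the names of the enabled test suites, in a fixed order."""
    switches = [
        ("sanity", inputs.run_sanity),
        ("security", inputs.run_security),
        ("telemetry", inputs.run_telemetry),
        ("efa", inputs.run_efa),
    ]
    return [name for name, enabled in switches if enabled]


def run_pipeline(config_file: str, image_uri: str, inputs: PipelineInputs) -> list[str]:
    """Model build -> test -> release as an ordered list of stage labels."""
    stages = [f"build:{config_file}"]
    # Each reusable test receives only the config file and the image URI.
    stages += [f"test:{suite}:{image_uri}" for suite in select_test_suites(inputs)]
    stages.append(f"release:{image_uri}")
    return stages
```

In this model, disabling a suite is a one-field change on the caller side, which mirrors the "flip one boolean" claim without touching any test code.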

Config-driven builds:

  • One YAML config file = one released image variant. All version pins, Dockerfile paths, and metadata live in that file.
  • Adding a new image variant = drop a config file. The discover-configs matrix automatically picks it up — no workflow changes needed.
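
As a rough illustration of the discover-configs idea, the sketch below globs a config directory and emits a JSON matrix. It is not the action's actual implementation: the `**/*.yml` pattern, the `config` matrix key, and the function names are assumptions made for the example.

```python
import json
from pathlib import Path


def discover_configs(config_root: Path, pattern: str = "**/*.yml") -> list[str]:
    """Return sorted, root-relative POSIX paths of config files matching the pattern."""
    return sorted(
        path.relative_to(config_root).as_posix()
        for path in config_root.glob(pattern)
        if path.is_file()
    )


def to_matrix(config_paths: list[str]) -> str:
    """Serialize the discovered paths as a JSON object usable as a job matrix."""
    # The "config" key is a hypothetical matrix dimension name.
    return json.dumps({"config": config_paths})
```

Because the matrix is computed from whatever files exist, dropping a new file under the config root changes the output with no edit to the calling code.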

Simplified administration:

  • Changing a release schedule = edit one cron line in a 30-line caller file.
  • Enabling/disabling a test suite = flip one boolean in the caller's pipeline inputs.
  • Adding a new framework = copy the SGLang pipeline as a template, create configs, done.

Code organization:

  • scripts/docker/ — scripts that run inside containers (COPY'd into images)
  • scripts/ci/ — scripts that run on CI runners (build hooks, wheel compilation, config parsing)
  • Previously these were scattered across .github/scripts/, scripts/, repo root, and inline in workflows.

Dot-namespaced workflow files:

  • sglang.pr-amzn2023.yml, vllm.autorelease-ec2-amzn2023.yml, pytorch.tests-multi-gpu.yml
  • Frameworks are visually grouped in the file list. Cross-framework utilities are prefixed with _.
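
The naming convention can be read mechanically. The helper below is a hypothetical sketch (not part of this PR) that splits a workflow filename into its namespace and the remainder, and flags `_`-prefixed cross-framework utilities.

```python
from collections import defaultdict


def split_workflow_filename(filename: str) -> tuple[str, str, bool]:
    """Split '<namespace>.<name>.yml' into (namespace, name, is_cross_framework)."""
    stem = filename.removesuffix(".yml")
    namespace, sep, name = stem.partition(".")
    if not sep or not namespace or not name:
        raise ValueError(f"not a dot-namespaced workflow file: {filename!r}")
    return namespace, name, namespace.startswith("_")


def group_by_namespace(filenames: list[str]) -> dict[str, list[str]]:
    """Group workflow names by namespace, mirroring the grouping in the file list."""
    groups: dict[str, list[str]] = defaultdict(list)
    for filename in filenames:
        namespace, name, _ = split_workflow_filename(filename)
        groups[namespace].append(name)
    return dict(groups)
```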

Unix-style actions:

  • Each composite action does one thing: build-image builds, resolve-image-uri resolves, check-image-exists probes, discover-configs globs. Simple inputs, simple outputs.
  • The old build-image action accepted 15+ inputs and handled wheel caching, sccache, tag computation, and Docker build all in one. Now those are separate concerns (build hooks, dedicated scripts, action steps).

Key benefits

  • PR workflow management is trivial. Choosing which images to test, which tests to skip, or which paths trigger a build all happens in a short caller file with no test logic to wade through.
  • Release scheduling is centralized. See .github/release_schedule.md for the full picture. Staggered to avoid GPU fleet contention.
  • Framework migration is mechanical. The pattern is proven across all 7 frameworks. Adding PyTorch 2.12 = add configs + update the glob pattern.
  • Test logic is framework-agnostic where possible. Sanity, security, telemetry, and EFA tests work identically across all image types via the config-file interface.

What to look at

  1. .github/config/image/ — all 21 image configs. One file = one image.
  2. *.pipeline.yml — one per framework. This is where the build → test → release graph lives.
  3. _reusable.sanity-tests.yml — example of how shared tests derive everything from config.
  4. Any pr-*.yml — example of how callers are now just triggers + change detection + pipeline call.

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
sirutBuasai and others added 30 commits June 17, 2026 14:12
* fix(autocurrency): migrate vLLM/SGLang currency to refactored config schema

The workflow refactor moved + reshaped the image configs but left the
autocurrency scripts pointing at the old layout, silently breaking upstream
release detection and auto-PRs for vLLM and SGLang.

- tracker + agent-context: flat paths (vllm-ec2.yml) -> nested ubuntu variants
  (vllm/ec2-ubuntu.yml). Only the ubuntu variants are tracked, matching current
  currency scope.
- schema: .common.* -> .metadata.framework_version / .metadata.prod_image /
  .metadata.os_version and .build.cuda_version / .build.python_version.
- value format: detect-versions now emits raw major.minor (12.9) and raw python
  (3.12) to match the new build.* fields, instead of cu129/py312 short forms.
  cuda compares on major.minor so the config's patch segment (13.0.2) is not
  clobbered every run.
- agent-fix fallback path updated to the nested config location.
- tests: paths/fields/formats migrated; pr-title assertions updated to the
  current "[Docs Update] ..." format and de-duplicated the version literal.
  Suite: 41/41 passing.
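
The value-format change can be sketched in Python as follows. This is an illustration, not the detect-versions script itself: the function names are hypothetical, and the parsing assumes the CUDA minor version is the last digit of the short form and the Python major version is a single digit.

```python
import re


def cuda_short_to_raw(short: str) -> str:
    """Convert a short form such as 'cu129' to raw major.minor ('12.9')."""
    match = re.fullmatch(r"cu(\d+)(\d)", short)
    if match is None:
        raise ValueError(f"unrecognized CUDA short form: {short!r}")
    return f"{match.group(1)}.{match.group(2)}"


def python_short_to_raw(short: str) -> str:
    """Convert a short form such as 'py312' to raw major.minor ('3.12')."""
    match = re.fullmatch(r"py(\d)(\d+)", short)
    if match is None:
        raise ValueError(f"unrecognized Python short form: {short!r}")
    return f"{match.group(1)}.{match.group(2)}"


def major_minor(version: str) -> str:
    """Drop any patch segment: '13.0.2' -> '13.0'."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor: {version!r}")
    return ".".join(parts[:2])


def cuda_needs_update(config_version: str, detected_version: str) -> bool:
    """Compare on major.minor only, so a patch pin in the config is left alone."""
    return major_minor(config_version) != major_minor(detected_version)
```

With this comparison, a config pinned to 13.0.2 is not rewritten when the detected version is 13.0, which is the "not clobbered every run" behavior described above.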

* fix(autocurrency): correct docs-pr.sh REPO_ROOT after script relocation

Scripts moved scripts/autocurrency/ -> scripts/ci/autocurrency/ (one level
deeper), but docs-pr.sh still computed REPO_ROOT with ../.. — resolving to
<repo>/scripts instead of the repo root. That broke the tracker lookup
(TRACKER), the docs output path (OUTPUT_FILE), and the git add path at release
time. Use ../../.. to match the new depth.

docs-pr.sh consumes the generated release spec (not the raw config), and the
release-spec generator already emits short-form python_version/cuda_version
(py312/cu130), so the tag-building logic needs no schema change.
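
The depth error is easy to check with path arithmetic. The sketch below models it in Python rather than shell; the `/repo` prefix and the `resolve_up` helper are placeholders for illustration.

```python
from pathlib import PurePosixPath


def resolve_up(script_path: str, levels: int) -> PurePosixPath:
    """Directory reached by applying '..' `levels` times to the script's own directory."""
    script_dir = PurePosixPath(script_path).parent
    if levels == 0:
        return script_dir
    # parents[0] is one level up from script_dir, parents[1] is two levels up, and so on.
    return script_dir.parents[levels - 1]


OLD_LOCATION = "/repo/scripts/autocurrency/docs-pr.sh"
NEW_LOCATION = "/repo/scripts/ci/autocurrency/docs-pr.sh"

print(resolve_up(OLD_LOCATION, 2))  # /repo: '../..' was correct before the move
print(resolve_up(NEW_LOCATION, 2))  # /repo/scripts: '../..' is one level short after the move
print(resolve_up(NEW_LOCATION, 3))  # /repo: '../../..' matches the new depth
```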

* fix(autocurrency): update currency-fix agent to renamed PR workflows

The agent-fix workflow gates on a hardcoded TRACKED regex of PR workflow
names. The refactor renamed the vLLM/SGLang PR workflows, so the old names
("PR - vLLM EC2", "PR - SGLang SageMaker", ...) matched nothing — the agent
saw zero tracked runs and silently skipped on every auto-update PR.

- _prcheck.currency-fix.yml: TRACKED -> "PR - vLLM Ubuntu|PR - SGLang Ubuntu",
  matching the workflows that an ubuntu-only currency bump actually triggers.
- update-configs.sh: refresh stale docstring examples to the nested config
  paths and renamed autorelease workflow filenames (comments only).
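
A small Python model of the failure mode, assuming the gate does a regex search over workflow run names (the actual matching mechanism in the workflow may differ). `OLD_TRACKED` contains only the two old names quoted above, so it is partial, and "Lint" is a hypothetical unrelated workflow.

```python
import re

# Partial: only the two old names quoted in the commit message.
OLD_TRACKED = r"PR - vLLM EC2|PR - SGLang SageMaker"
NEW_TRACKED = r"PR - vLLM Ubuntu|PR - SGLang Ubuntu"


def tracked_runs(workflow_names: list[str], tracked_pattern: str) -> list[str]:
    """Return the workflow run names that match the TRACKED alternation."""
    pattern = re.compile(tracked_pattern)
    return [name for name in workflow_names if pattern.search(name)]


# Run names as they might appear on an auto-update PR after the rename.
runs = ["PR - vLLM Ubuntu", "PR - SGLang Ubuntu", "Lint"]

print(tracked_runs(runs, OLD_TRACKED))  # [] -> the agent sees nothing to act on
print(tracked_runs(runs, NEW_TRACKED))  # ['PR - vLLM Ubuntu', 'PR - SGLang Ubuntu']
```

An empty match list is indistinguishable from "nothing failed", which is why the stale regex caused a silent skip rather than an error.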

* chore(autocurrency): delete stale test suite; fix stale workflow references

- Remove scripts/ci/autocurrency/tests/run-tests.sh — outdated, not wired into
  CI, and only covered sourced helper functions (never the main execution
  paths), so it gave false confidence (missed the docs-pr.sh REPO_ROOT bug).
- check-upstream-releases.sh: the auto-update PR body linked to the old
  workflow filenames (scheduled-check-upstream-releases.yml,
  prcheck-detect-versions.yml) which no longer exist after the rename — would
  render as dead links in every auto-update PR. Point them at the renamed
  _scheduled.check-upstream-releases.yml / _prcheck.detect-versions.yml.
- docs-pr.sh: fix stale usage-comment path (scripts/autocurrency -> scripts/ci/autocurrency).

* chore: remove accidentally committed local settings file

* fix(autocurrency): restore merge-triggered autorelease for vLLM/SGLang

The refactor dropped the push-on-merge trigger and the autocurrency-pr-gate
job from the tracked ubuntu autorelease workflows, leaving them cron +
workflow_dispatch only. As a result, merging an [Auto-Update] PR no longer
kicked off a release — it would wait until the next scheduled cron, breaking
the upstream-release -> PR -> merge -> autorelease -> docs-pr chain.

main has both the `push: branches: [main]` trigger and a gate job that only
proceeds for merged auto-update commits. Restore that pattern on the four
tracked ubuntu autorelease workflows (vllm/sglang x ec2/sagemaker):

- add `push: branches: [main]` to `on:`
- add `autocurrency-pr-gate` job: passes for non-push events, and for push
  only when head_commit message contains "[Auto-Update] <fw>" authored by
  aws-deep-learning-containers-ci[bot]
- gate the `discover` job on it (needs: [autocurrency-pr-gate]) so the whole
  matrix is skipped for unrelated pushes to main

amzn2023 variants stay cron-only (not currency-tracked), matching main.
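
The gate condition described above can be expressed as a small predicate. This is a Python sketch of the decision only, not the workflow's actual `if:` expression; the parameter names and the exact framework label passed in (for example "vLLM") are assumptions.

```python
BOT_AUTHOR = "aws-deep-learning-containers-ci[bot]"


def gate_passes(
    event_name: str,
    framework: str,
    head_commit_message: str = "",
    head_commit_author: str = "",
) -> bool:
    """Decide whether the autorelease matrix should run for this event."""
    if event_name != "push":
        # Cron and workflow_dispatch runs always proceed.
        return True
    # Push events proceed only for merged auto-update commits authored by the CI bot.
    marker = f"[Auto-Update] {framework}"
    return marker in head_commit_message and head_commit_author == BOT_AUTHOR
```

Since the discover job depends on the gate, a `False` result here corresponds to the whole matrix being skipped for unrelated pushes to main.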
Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>