From 4cfb82177fe342f312493ea3f41a43fa644f0e1b Mon Sep 17 00:00:00 2001
From: Joshua Temple
Date: Fri, 26 Jun 2026 15:08:55 -0400
Subject: [PATCH] docs: document the staged, selective, nightly-gated fleet
 orchestration

Add a release-orchestration page covering the staged fan-out (sequenced
lanes holding peak concurrency near two), the repos selector for
single-member runs, and the nightly-gated release with its force and
dry-run dispatch vector. Update the coverage matrix to eleven fleet repos
with the live repository_dispatch rollback row.

Part of #373; closes #377.

Signed-off-by: Joshua Temple
---
 docs/astro.config.mjs                         |   1 +
 docs/src/content/docs/coverage-matrix.md      |  12 +-
 .../src/content/docs/release-orchestration.md | 129 ++++++++++++++++++
 docs/src/content/docs/testing.md              |   3 +
 4 files changed, 140 insertions(+), 5 deletions(-)
 create mode 100644 docs/src/content/docs/release-orchestration.md

diff --git a/docs/astro.config.mjs b/docs/astro.config.mjs
index 0f7c616..6bb6a88 100644
--- a/docs/astro.config.mjs
+++ b/docs/astro.config.mjs
@@ -86,6 +86,7 @@ export default defineConfig({
         { label: 'Architecture', link: '/architecture/' },
         { label: 'How it is tested', link: '/testing/' },
         { label: 'Feature coverage matrix', link: '/coverage-matrix/' },
+        { label: 'Release orchestration', link: '/release-orchestration/' },
         { label: 'Security & Hardening', link: '/security/hardening/' },
         { label: 'Versioning & Schema', link: '/versioning/' },
       ],
diff --git a/docs/src/content/docs/coverage-matrix.md b/docs/src/content/docs/coverage-matrix.md
index bdcbea2..f401005 100644
--- a/docs/src/content/docs/coverage-matrix.md
+++ b/docs/src/content/docs/coverage-matrix.md
@@ -17,10 +17,10 @@ covered at that layer by design, not by omission. The "why both layers" section
 below explains those choices.
 
 :::tip[Last validated]
-This matrix was last validated against a fully green live-fleet run on `v0.5.0-rc.10`:
-all ten example repos (primary, artifact-a, artifact-b, single-env, 2env, 3env, 4env,
-release-only, no-env, callbacks) passed every probe, and the shared fail-closed reconcile
-gate accounted for every run in each scenario window.
+This matrix was last validated against the fully green live-fleet run behind `v0.5.1`:
+all eleven example repos (primary, artifact-a, artifact-b, single-env, 2env, 3env, 4env,
+release-only, no-env, callbacks, rollback-dispatch) passed every probe, and the shared
+fail-closed reconcile gate accounted for every run in each scenario window.
 :::
 
 ## Why two layers, restated for this matrix
@@ -62,13 +62,14 @@ only under real installation tokens on the fleet, never in the token-free harnes
 
 | Feature | act plus gitea scenario | Live-fleet probe (repo) | Unit | What the layer proves |
 |---|---|---|---|---|
-| Orchestrate trunk build to release candidate | `01`, `02`, `03`, `04`, `34-extra-orchestrate-triggers` | every repo, orchestrate-on-merge (all 10) | `internal/orchestrate` | A trunk merge mints an RC draft and writes state, across every topology, on real Actions |
+| Orchestrate trunk build to release candidate | `01`, `02`, `03`, `04`, `34-extra-orchestrate-triggers` | every repo, orchestrate-on-merge (all 11) | `internal/orchestrate` | A trunk merge mints an RC draft and writes state, across every topology, on real Actions |
 | Default promotion (env to next env) | `04`, `promote/cascade-deploy-enabled` | `promote-staging` (2env, 3env, primary) | `internal/promote` | One promotion step copies source state into the target on a real release object |
 | Cascade-mode promotion (atomic multi-step) | `04-cascade-promotion` | `lifecycle` dev to prod (4env) | `internal/promote` | The full ladder advances through intermediates and publishes at the top |
 | Standalone release lane (draft, prerelease, publish) | `05-publish-callback`, `37`, `38` | dispatch prerelease then release (single-env); `release-only` | `internal/release` | A real release transitions draft to prerelease to published with RC reaping |
 | Hotfix clean apply | `hotfix/hotfix-clean-apply`, `hotfix-multi-commit-clean`, `hotfix-multi-env-clean`, `hotfix-rejoin` | hotfix plan, apply, PR merge, finalize (3env) | `internal/hotfix` | A pinned-env fix lands, diverges state, and rejoins on real branches and PRs |
 | Hotfix cherry-pick conflict and halt | `hotfix/hotfix-conflict-resolution`, `hotfix-multi-env-conflict-halt` | `probe_hotfix_conflict` (4env) | `internal/hotfix` | A guaranteed conflict raises the conflict label and halts the downstream lane |
 | Rollback to prior version or SHA | `rollback/*` (8 scenarios) | `probe_rollback` (4env), `rollback-check` (2env) | `internal/rollback` | An env rewinds, is marked diverged, and the ring snapshot advances |
+| External rollback via `repository_dispatch` | | repository_dispatch rollback, state revert asserted (rollback-dispatch) | `internal/rollback` | A real `repository_dispatch` payload drives the automated rollback entry point and the target env's state is read back reverted |
 | Drift check and comment | `22-verify-drift`, `27-verify-orphan`, `28-drift-check` | `probe_drift` (4env) | `internal/verify`, `internal/generate` | Generated-vs-committed drift is detected and surfaced on a real run |
 | Validate gate | `14-validate-check`, `17-validate-callback` | `probe_validate` (4env); pre-build validate gate (3env) | `internal/generate` | A validate callback gates the build before it proceeds |
 | Merge queue | `15-merge-queue` | `probe_merge_queue` (4env) | `internal/generate` | The merge-queue lane is emitted and runs (harness covers the no-configured-queue case) |
@@ -136,6 +137,7 @@ from a general case.
 | Release-only (no deploy environments) | `cascade-example-release-only` | none specific (release lane via `37`, `38`) |
 | No-environment library shape | covered in harness only by design | `01-no-env-repo` |
 | Primary plus artifact satellites (cross-repo graph) | `cascade-example-primary`, `cascade-example-artifact-a`, `cascade-example-artifact-b` | `multi-repo/*`, `21-cross-repo-callback` |
+| External rollback entry point (`repository_dispatch`) | `cascade-example-rollback-dispatch` | `rollback/*` |
 
 The no-environment library shape is covered in the act plus gitea harness; a live
 `cascade-example-no-env` suite also asserts that orchestrate goes straight from a
diff --git a/docs/src/content/docs/release-orchestration.md b/docs/src/content/docs/release-orchestration.md
new file mode 100644
index 0000000..b8673d6
--- /dev/null
+++ b/docs/src/content/docs/release-orchestration.md
@@ -0,0 +1,129 @@
+---
+title: Release orchestration
+description: How cascade validates and ships its own releases. The live fleet fans out in sequenced lanes so peak concurrency on one shared token stays low, a repos selector runs a single lane for development, and a nightly gate cuts and promotes a release only when a fully green fleet agrees and main has accumulated release-worthy changes.
+---
+
+This page describes how cascade releases itself. It is maintainer CI: hand-written
+tooling that lives in cascade's own repository, alongside `fleet-e2e.yaml`,
+`auto-promote.yaml`, and `nightly-release.yaml`. None of it is part of cascade's
+generated output. If you are adopting cascade for your own pipelines, this is
+background on how the project proves and ships each version, not a feature you
+configure.
+
+The release chain is four workflows in sequence. Orchestrate cuts a release
+candidate and pushes its tag. Release builds and publishes that tag's assets. The
+fleet fans out across every example repository to validate the published binary.
+Auto-promote publishes the final version, but only when the entire fleet is green.
+
+## The staged fleet fan-out
+
+The fleet ([`.github/workflows/fleet-e2e.yaml`](https://github.com/stablekernel/cascade/blob/main/.github/workflows/fleet-e2e.yaml))
+revalidates the downstream `cascade-example-*` fleet on live GitHub. Every example
+repository dispatches its own `scenario-suite.yaml` under one shared fleet token. A
+green run means this cascade version validated across all eleven example repositories,
+each running its own scenario suite in its own repository context.
+
+Dispatching all eleven repositories at once tripped transient GitHub API failures
+(401, 403, and 500 responses) on a rotating repository each run, because they all
+draw on the same token. The fan-out is therefore split into sequenced lanes that
+hold peak live concurrency near two repositories at a time. A `gh()` transient-retry
+wrapper inside each suite remains the per-call backstop; the staging fixes the
+structural burst that the wrapper alone could not absorb.
+
+```mermaid
+flowchart LR
+  plan[plan] --> resolve[resolve]
+  resolve --> repin[repin]
+  repin --> primary[primary]
+  primary --> dependents[dependents x2]
+  dependents --> heavy[4env alone]
+  heavy --> remainder[remainder, max 2]
+  remainder --> aggregate[Fleet gate]
+```
+
+| Stage | What it does |
+|---|---|
+| `plan` | Parses the `repos` selector once and emits the lane gates and matrices every fan-out job keys off. This is the single place the fleet roster lives. |
+| `resolve` | Gates the run and resolves the cascade version under test, then writes `version-under-test.txt` and a `full-run.txt` completeness marker for auto-promote to read. |
+| `repin` | Pins every example repository to the candidate, regenerates its workflows, and pushes the pin to each repository's main. It always covers the full roster regardless of the selector, because pinning is cheap, idempotent, and sequential, so it adds nothing to live fan-out concurrency. Every suite job gates on a green repin so none runs against a stale pin. |
+| `primary` | Runs first and must pass before its dependents start. |
+| `dependents` | `artifact-a` and `artifact-b` mutate the primary's shared external state, so they run only after the primary is green. The two run together, which is the lane that defines the fleet's peak of about two repositories. |
+| `heavy` | `4env` is the heaviest and most fragile repository, so it runs alone in its own job, sequenced after the dependents lane so the two never stack. |
+| `remainder` | The light repositories (`3env`, `2env`, `single-env`, `release-only`, `no-env`, `callbacks`, `rollback-dispatch`) run in a matrix capped at two in flight via `max-parallel`, sequenced after the heavy lane. |
+| `aggregate` | The Fleet gate. It needs every lane, so a green gate means every selected repository passed. Auto-promote keys off this conclusion. |
+
+The fleet triggers on completion of the Release workflow (the dependable signal that
+a candidate tag's assets actually reached the releases page) and on manual dispatch.
+
+## Running a single lane with the repos selector
+
+A full fan-out is the right gate for a release, but it is heavy for developing one
+example repository's suite. The `workflow_dispatch` path accepts a `repos` selector
+that runs a subset of lanes:
+
+```sh
+gh workflow run fleet-e2e.yaml -f repos=4env
+```
+
+The selector accepts a single short name, or a comma- or space-separated list. The
+default (no input, which is also the value on the Release-triggered path) is `all`,
+which runs the full fleet. The `repin` stage always covers the full roster; only the
+suite lanes honor the selector. A lane the selector skips reports `skipped` and the
+gate treats it as satisfied, so a subset run still produces a meaningful verdict over
+exactly the lanes that ran.
+
+A selective run never auto-promotes. The `plan` stage sets `full_run=true` only when
+the selector resolves to `all`, the `resolve` stage records that marker in the
+`full-run.txt` artifact, and auto-promote refuses to publish from anything other than
+a full run. Only a complete fleet validation is a safe release signal.
+
+## The nightly-gated release
+
+Cascade's orchestrate workflow is dispatch-only, set through `release_trigger: dispatch`
+in `.github/manifest.yaml`. A trunk merge no longer cuts a release candidate on its
+own, which removes the per-merge candidate churn. The single gate that decides whether
+to release is [`nightly-release.yaml`](https://github.com/stablekernel/cascade/blob/main/.github/workflows/nightly-release.yaml).
+
+It runs on a schedule (07:00 UTC daily, off-peak, after late-day merges settle) and
+owns only two jobs, `decide` and `dispatch`. Everything from Release onward is the
+existing chain, reused unchanged.
+
+`decide` measures whether main has accumulated release-worthy changes since the last
+published release:
+
+- The diff base is the latest final release tag, matching `vX.Y.Z` exactly so that a
+  candidate (`-rc.`) or a leftover dry-run (`-dryrun.`) tag can never become the base.
+  With no final release yet, or an unresolvable ref, it fails open and proceeds rather
+  than silently skipping a real release.
+- It diffs the base against `origin/main` and classifies each changed path. Code and
+  the shipped action surface (`cmd/**`, `internal/**`, `go.mod`, `go.sum`,
+  `.github/actions/**`) count as release-worthy. The manifest counts only when its
+  non-state subtree changed, so a routine state commit alone is not release-worthy.
+  Documentation, Markdown, and similar paths never trigger a release on their own.
+- If nothing release-worthy changed, the run skips. A missed night just defers: the
+  diff is always measured against the last release, so accumulated changes still
+  release on the next run.
+
+When `decide` says to proceed, `dispatch` triggers orchestrate using the
+`CASCADE_STATE_TOKEN`, so the candidate tag push fires Release and the chain continues.
+Orchestrate cuts the candidate, Release publishes its assets, the full fleet fans out,
+and auto-promote publishes the final version only on a green full run.
+
+### On-demand inputs: force and dry_run
+
+`nightly-release.yaml` also runs on `workflow_dispatch` with two inputs for testing
+the path on demand:
+
+- `force` bypasses the change-since-last-release skip, so an unchanged main still
+  cuts a candidate. It lives entirely inside `decide` and changes nothing downstream.
+- `dry_run` rehearses the whole path without publishing. The candidate is cut as a
+  `vX.Y.Z-dryrun.N` prerelease instead of an `-rc.` candidate. The fleet's `resolve`
+  gate accepts `-dryrun.` tags, so a dry run fans out across the full fleet and writes
+  its artifacts exactly like a real candidate. Auto-promote's publish gate stays
+  `-rc.`-only, so a dry-run tag can validate end to end yet is frozen out of
+  publication. The `full_run` guard is a second, independent backstop.
+
+A `force` plus `dry_run` dispatch therefore exercises every component of the real
+path (the change gate bypass, the candidate cut, Release, the full fleet, the artifact
+handoff, and the auto-promote wiring) while proving, by tag identity alone, that
+nothing publishes.
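
Reviewer sketch (not part of the patch): the `decide` rules documented in `release-orchestration.md` can be modelled in a few lines, which may help when checking the prose against the workflow. This is a minimal illustration under stated assumptions, not the code in `nightly-release.yaml`. The function names are hypothetical, "latest" is taken to mean the highest `vX.Y.Z` by numeric ordering, and the manifest's non-state-subtree check is left out because it needs file contents rather than paths.

```python
import re
from typing import Iterable, Optional

# A final release tag is exactly vX.Y.Z; -rc. and -dryrun. tags never match.
FINAL_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")

# Release-worthy paths as listed on the page. The manifest rule is omitted
# here (assumption: it needs a content diff, not just a path).
RELEASE_WORTHY_PREFIXES = ("cmd/", "internal/", ".github/actions/")
RELEASE_WORTHY_FILES = {"go.mod", "go.sum"}


def latest_final_tag(tags: Iterable[str]) -> Optional[str]:
    """Return the highest vX.Y.Z tag, or None when no final release exists."""
    finals = []
    for tag in tags:
        match = FINAL_TAG.fullmatch(tag)
        if match:
            version = tuple(int(part) for part in match.groups())
            finals.append((version, tag))
    return max(finals)[1] if finals else None


def is_release_worthy(path: str) -> bool:
    """Classify one changed path from the diff against origin/main."""
    return path in RELEASE_WORTHY_FILES or path.startswith(RELEASE_WORTHY_PREFIXES)


def decide(tags: Iterable[str], changed_paths: Iterable[str], force: bool = False) -> bool:
    """Return True when a candidate should be cut."""
    if force:
        return True  # force bypasses the change-since-last-release skip
    if latest_final_tag(tags) is None:
        return True  # no usable base: fail open rather than skip a real release
    return any(is_release_worthy(path) for path in changed_paths)
```

With a final release present, a diff that touches only documentation returns `False` (the night is skipped), while the same diff with `force=True`, or with no final tag at all, returns `True`.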
diff --git a/docs/src/content/docs/testing.md b/docs/src/content/docs/testing.md
index edf02eb..1e02ae6 100644
--- a/docs/src/content/docs/testing.md
+++ b/docs/src/content/docs/testing.md
@@ -30,6 +30,8 @@ The fleet (`.github/workflows/fleet-e2e.yaml`) fans out to a set of purpose-buil
 
 The fleet proves the things that only real GitHub can prove: a real release object transitioning from draft to prerelease to published, real release-candidate tags being reaped on publish, the Contents API state-write path, cross-repo dispatch between real repositories, and real branch protection being written through a scoped token.
 
+The fleet fans out in sequenced lanes so peak live concurrency on its one shared token stays low, accepts a `repos` selector for running a single lane during development, and is cut and promoted under a nightly gate that releases only on a fully green fleet. The [Release orchestration](/cascade/release-orchestration/) page documents that machinery in full.
+
 ### Unit tests (pure logic)
 
 Underneath both layers, the Go packages carry conventional unit tests for the pure logic: version calculation, change detection, changelog assembly, manifest parsing and validation. These run on every build and are the fastest feedback loop.
@@ -57,6 +59,7 @@ The example repositories span the supported pipeline shapes, so each topology is
 | `cascade-example-release-only` | Release-only repository (no deploy environments), changelog and contributor assembly |
 | `cascade-example-primary` | Primary repository receiving external-update and notify handoffs from satellites |
 | `cascade-example-artifact-a`, `cascade-example-artifact-b` | Satellite repositories in an artifact-dependency graph that notify the primary |
+| `cascade-example-rollback-dispatch` | The automated rollback entry point, where a real `repository_dispatch` payload drives a rollback and the reverted state is read back |
 
 The `no-environment` library shape is covered today in the act plus gitea harness; the other topologies above are validated in both layers.
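
Reviewer sketch (not part of the patch): the `repos` selector that both `testing.md` and `release-orchestration.md` now mention can be modelled as below, as a way to sanity-check the documented behaviour. It is a hedged illustration, not the `plan` job in `fleet-e2e.yaml`. The lane membership is taken from the stage table in the patch; the function name, the return shape, the rejection of unknown names, and the choice to treat only an empty or literal `all` selector as a full run are assumptions.

```python
import re

# Lane membership as described in the stage table of the patch.
LANES = {
    "primary": ["primary"],
    "dependents": ["artifact-a", "artifact-b"],
    "heavy": ["4env"],
    "remainder": [
        "3env", "2env", "single-env", "release-only",
        "no-env", "callbacks", "rollback-dispatch",
    ],
}
ROSTER = [name for members in LANES.values() for name in members]


def plan(selector: str = "") -> dict:
    """Resolve a comma- or space-separated selector into per-lane matrices."""
    names = [name for name in re.split(r"[,\s]+", selector.strip()) if name]
    # Assumption: only an empty selector or the literal "all" is a full run.
    full_run = not names or names == ["all"]
    selected = ROSTER if full_run else names
    unknown = sorted(set(selected) - set(ROSTER))
    if unknown:
        # Assumption: unknown short names are an error rather than ignored.
        raise ValueError(f"unknown repos in selector: {unknown}")
    lanes = {
        lane: [member for member in members if member in selected]
        for lane, members in LANES.items()
    }
    return {"full_run": full_run, "lanes": lanes}
```

For example, `plan("4env")` leaves every lane empty except `heavy` and reports `full_run` as `False`, which corresponds to the documented rule that a selective run never auto-promotes. An empty lane here stands in for a lane that reports `skipped` in the real workflow.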