From e1d75da5bc25641640756ce13eaa6ba51541d93b Mon Sep 17 00:00:00 2001
From: JacobPEvans <20714140+JacobPEvans@users.noreply.github.com>
Date: Sun, 24 May 2026 12:51:56 -0400
Subject: [PATCH] chore(infra-standards)!: drop self-hosted-runners skill; trim infrastructure-standards

The self-hosted-runners content is workflow standard, not AI-agent
guidance, and already lives in
ai-assistant-instructions/agentsmd/rules/ci-cd-policy.md
(companion PR #654).

The infrastructure-standards skill is trimmed to the two tables agents
actually need at edit time: VMID/IP assignment ranges and the
Terraform-to-Ansible inventory contract. Everything else (general
principles, deployment pipeline, dev shells, SOPS/Doppler) is canonical
on docs.jacobpevans.com and the config-secrets/secrets-policy org rules.

BREAKING CHANGE: /self-hosted-runners skill removed.

Assisted-by: Claude
---
 infra-standards/.claude-plugin/plugin.json    |   9 +-
 infra-standards/README.md                     |   8 +-
 .../skills/infrastructure-standards/SKILL.md  |  85 +-----------
 .../skills/self-hosted-runners/SKILL.md       | 128 ------------------
 4 files changed, 12 insertions(+), 218 deletions(-)
 delete mode 100644 infra-standards/skills/self-hosted-runners/SKILL.md

diff --git a/infra-standards/.claude-plugin/plugin.json b/infra-standards/.claude-plugin/plugin.json
index 6a07ba1..4b78ce2 100644
--- a/infra-standards/.claude-plugin/plugin.json
+++ b/infra-standards/.claude-plugin/plugin.json
@@ -1,15 +1,14 @@
 {
   "name": "infra-standards",
-  "version": "1.7.0",
-  "description": "Infrastructure standards for Proxmox, Terraform, Ansible including deployment pipeline, IP addressing, and secrets management",
+  "version": "1.8.0",
+  "description": "Infrastructure standards for Proxmox, Terraform, Ansible: VMID/IP assignment ranges and the Terraform-to-Ansible inventory contract",
   "author": {
     "name": "JacobPEvans"
   },
   "license": "MIT",
   "repository": "https://github.com/JacobPEvans/claude-code-plugins",
-  "keywords": ["infrastructure", "terraform", "ansible", "proxmox", "nix", "devshell", "runs-on", "github-actions"],
+  "keywords": ["infrastructure", "terraform", "ansible", "proxmox"],
   "skills": [
-    "./skills/infrastructure-standards",
-    "./skills/self-hosted-runners"
+    "./skills/infrastructure-standards"
   ]
 }
diff --git a/infra-standards/README.md b/infra-standards/README.md
index 2d74eaa..23fdaf7 100644
--- a/infra-standards/README.md
+++ b/infra-standards/README.md
@@ -1,13 +1,10 @@
 # infra-standards
 
-Infrastructure standards for Proxmox, Terraform, Ansible including deployment pipeline, IP addressing, and secrets management.
+Infrastructure standards for Proxmox, Terraform, Ansible: VMID/IP assignment ranges and the Terraform-to-Ansible inventory contract.
 
 ## Skills
 
-- **`/infrastructure-standards`** - Deployment pipeline, VMID/IP mapping, dev shells, Doppler/SOPS, Terraform inventory
-- **`/self-hosted-runners`** - When to target RunsOn vs github-hosted runners in
-  `.github/workflows/*.yml`, the v3 label catalog used across the org, the required
-  `github.run_id` segment, and the GitHub App allowlist prereq
+- **`/infrastructure-standards`** - VMID/IP assignment ranges and the Terraform-to-Ansible inventory contract
 
 ## Installation
 
@@ -19,7 +16,6 @@ claude plugins add jacobpevans-cc-plugins/infra-standards
 
 ```text
 /infrastructure-standards
-/self-hosted-runners
 ```
 
 ## License
diff --git a/infra-standards/skills/infrastructure-standards/SKILL.md b/infra-standards/skills/infrastructure-standards/SKILL.md
index 96c36ce..9be3476 100644
--- a/infra-standards/skills/infrastructure-standards/SKILL.md
+++ b/infra-standards/skills/infrastructure-standards/SKILL.md
@@ -1,27 +1,15 @@
 ---
 name: infrastructure-standards
-description: Use when working on infrastructure repos (terraform, ansible, kubernetes, proxmox, nix devShells)
+description: Use when editing Proxmox/Terraform/Ansible inventory — VMID/IP assignment ranges and the Terraform-to-Ansible inventory contract.
 ---
 
 # Infrastructure Standards
 
-## General Principles
-
-- **Idempotency**: All IaC must produce same result on repeated runs.
-- **Modularity**: Organize into reusable modules.
-- **State management**: Remote state with locking (DynamoDB for AWS).
-- **Security**: Least privilege, encrypt at rest and in transit.
-- **Cost**: Right-size resources, tag everything, set budget alerts.
-
-## Deployment Pipeline
-
-```text
-terraform-proxmox -> ansible-proxmox -> ansible-proxmox-apps -> ansible-splunk
-(provision VMs)      (configure host)   (configure apps)        (configure Splunk)
-```
-
-Not every change needs the full pipeline. App config: ansible-proxmox-apps
-only. Splunk config: ansible-splunk only. New VM: full pipeline.
+For general IaC principles, the deployment pipeline diagram, dev-shell templates,
+and the SOPS-vs-Doppler decision tree, see
+[docs.jacobpevans.com/infrastructure](https://docs.jacobpevans.com/infrastructure)
+and the `config-secrets` / `secrets-policy` org rules. This skill carries the
+operational tables an agent needs at edit time without leaving the editor.
 
 ## VMID & IP Addressing
 
@@ -38,67 +26,6 @@ IPs use pattern `192.168.0.{vmid}` (for VMIDs under 256).
 | 200-299 | VMs | splunk-vm (200) |
 | 9000-9999 | Templates | Not running, no IP |
 
-## Dev Shell Architecture
-
-Every repo owns its own dev shell. No central registry.
-
-```text
-repo/
-├── flake.nix          <- defines devShells.default
-├── flake.lock         <- pins nixpkgs independently
-├── .envrc             <- `use flake` (ALWAYS committed)
-└── .direnv/           <- ALWAYS in .gitignore
-```
-
-| Repo | Template | Key Tools |
-| --- | --- | --- |
-| ansible-proxmox | `github:JacobPEvans/nix-devenv?dir=shells/ansible` | ansible, molecule, sops, age |
-| ansible-proxmox-apps | `github:JacobPEvans/nix-devenv?dir=shells/ansible` | + SOPS_AGE_KEY_FILE |
-| terraform-proxmox | `github:JacobPEvans/nix-devenv?dir=shells/terraform` | terraform, terragrunt, tfsec, trivy |
-| terraform-aws | `github:JacobPEvans/nix-devenv?dir=shells/terraform` | same as terraform-proxmox |
-| kubernetes-monitoring | `github:JacobPEvans/nix-devenv?dir=shells/kubernetes` | kubectl, helm, helmfile, k9s, kind |
-| splunk | `github:JacobPEvans/nix-devenv?dir=shells/splunk-dev` | uv (Python 3.9 on-demand) |
-
-## Secrets Management
-
-### SOPS vs Doppler Decision
-
-| Scenario | Tool |
-| --- | --- |
-| Runtime injection (env vars) | Doppler |
-| Secrets committed to git (encrypted) | SOPS |
-| Terraform state encryption | SOPS |
-| Ansible vault replacement | SOPS |
-| CI/CD pipeline secrets | Doppler |
-
-**Rule**: If it must exist in a git-tracked file, use SOPS. If injectable
-at runtime, use Doppler.
-
-### Doppler Usage
-
-```bash
-# Terraform
-doppler run --name-transformer tf-var -- terragrunt plan
-# With AWS
-aws-vault exec terraform -- doppler run --name-transformer tf-var -- terragrunt apply
-# Ansible
-doppler run -- ansible-playbook -i inventory/hosts.yml playbooks/site.yml
-```
-
-### SOPS Configuration
-
-`.sops.yaml` at repo root:
-
-```yaml
-creation_rules:
-  - path_regex: (\.enc\.ya?ml$|secrets/.*\.ya?ml$)
-    age: >-
-      age1your-public-key-here
-```
-
-Naming: encrypted = `.enc.yml`, plaintext = `.yml` (in `.gitignore`).
-Precedence: when both provide same secret, Doppler (runtime) wins.
-
 ## Terraform Inventory Contract
 
 Terraform outputs feed Ansible dynamic inventory:
diff --git a/infra-standards/skills/self-hosted-runners/SKILL.md b/infra-standards/skills/self-hosted-runners/SKILL.md
deleted file mode 100644
index 878aecd..0000000
--- a/infra-standards/skills/self-hosted-runners/SKILL.md
+++ /dev/null
@@ -1,128 +0,0 @@
----
-name: self-hosted-runners
-description: >-
-  Use when editing GitHub Actions workflow files (.github/workflows/*.yml)
-  in JacobPEvans repos. Documents when to target self-hosted RunsOn runners
-  vs GitHub-hosted runners, the v3 label catalog used across the org, the
-  required github.run_id segment, and the GitHub App allowlist prereq.
----
-
-# Self-Hosted Runners (RunsOn)
-
-JacobPEvans repos use self-hosted RunsOn runners deployed by
-[terraform-runs-on](https://github.com/JacobPEvans/terraform-runs-on) for
-Linux GitHub Actions jobs. The control plane has a fixed monthly cost
-whether or not jobs run (App Runner + CloudWatch — see
-[terraform-runs-on/README.md](https://github.com/JacobPEvans/terraform-runs-on/blob/main/README.md)
-for the current estimate). Workflows that stay on `ubuntu-latest` waste
-GitHub Actions minutes that don't need to be spent. Migrate any Linux job
-in the org that isn't covered by the **GitHub-hosted** rows in the decision
-table below.
-
-## When to target RunsOn
-
-| Workload | Decision |
-| --- | --- |
-| Linux job (lint, validate, build, test) | **RunsOn** — almost always |
-| Job that needs `nix flake check --all-systems` | **RunsOn** with more RAM (see catalog) |
-| Job that runs on `macos-latest` | **GitHub-hosted** — RunsOn EC2 Mac has a 24-hour minimum allocation, costs more than `macos-latest` for short jobs |
-| Job that runs on `windows-latest` | **RunsOn** supports Windows; treat case-by-case |
-| Job generated by `gh-aw compile` (`*.lock.yml`) | **GitHub-hosted** — lock file is regenerated; runner label must flow through the `.md` companion (gh-aw doesn't expose this yet) |
-| Job with disabled `schedule:` (manual dispatch only, rarely runs) | **GitHub-hosted** — migration saves nothing |
-| Job in a repo that hasn't been added to the RunsOn GitHub App allowlist | **GitHub-hosted** — install the app first |
-
-## Runner label catalog
-
-Use the single-string format. The leading `runs-on=${{ github.run_id }}`
-segment is **required** so the RunsOn control plane can correlate the
-GitHub Actions `workflow_job` webhook back to the originating run.
-Omitting it makes the job hang in `queued` forever.
-
-| Workload | Label string |
-| --- | --- |
-| Standard step (lint, validate, small build) | `runs-on=${{ github.run_id }}/runner=2cpu-linux-x64` |
-| Nix `flake check` (Linux only) | `runs-on=${{ github.run_id }}/cpu=4/ram=16/family=m7+c7/extras=s3-cache` |
-| Build with large dependency cache | `runs-on=${{ github.run_id }}/cpu=4/ram=16/volume=80gb:gp3:500mbs:4000iops/extras=s3-cache` |
-| Heavy CPU (terraform plan over many modules) | `runs-on=${{ github.run_id }}/cpu=8/ram=32/family=c7a` |
-| GPU (Hugging Face, MLX cross-eval) | `runs-on=${{ github.run_id }}/family=g4dn.xlarge/image=ubuntu22-gpu-x64` |
-
-## Pattern in YAML
-
-```yaml
-jobs:
-  validate:
-    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
-    steps:
-      - uses: actions/checkout@v6
-      - run: ...
-```
-
-For reusable workflows in `JacobPEvans/.github`, callers pass the label
-through the `runner_label` input (default `ubuntu-latest`):
-
-```yaml
-jobs:
-  markdown-lint:
-    uses: JacobPEvans/.github/.github/workflows/_markdown-lint.yml@main
-    with:
-      runner_label: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
-```
-
-Each reusable workflow's `runs-on:` line then expands the input:
-
-```yaml
-jobs:
-  lint:
-    runs-on: ${{ inputs.runner_label }}
-```
-
-This keeps `ubuntu-latest` as the safe default for consumers that haven't
-opted in.
-
-## Prerequisites for a new repo
-
-1. The RunsOn CloudFormation stack must be applied (`terraform-runs-on/main`).
-2. The RunsOn GitHub App must be installed on the target repo (either
-   organization-wide with "All repositories" selected, or the repo added
-   individually under the App settings).
-3. The first migrated workflow should be a low-risk canary (a lint or
-   validate job, not the whole `Merge Gate`). Watch one run end-to-end
-   before migrating the rest.
-
-## Identifying RunsOn vs github-hosted in a run
-
-In the GitHub Actions UI, expand the `Set up runner` group on any step.
-A RunsOn run prints:
-
-```text
-RUNS_ON_VERSION: v3.x.x
-RUNS_ON_INSTANCE_ID: i-...
-RUNS_ON_INSTANCE_TYPE: m8i.large
-RUNS_ON_INSTANCE_LIFECYCLE: spot
-```
-
-If those variables are missing despite the `runs-on=...` label, the job
-didn't land on RunsOn. GitHub does **not** silently fall back to github-hosted
-when a custom label is unmatched — the job sits in `queued` state waiting
-for a runner that never picks it up. Most common causes: the repo isn't in
-the RunsOn GitHub App allowlist, the `${{ github.run_id }}` segment is
-missing from the label so the control plane can't correlate the
-`workflow_job` webhook, or AWS spot capacity for the requested family is
-briefly exhausted (RunsOn v3's spot circuit breaker handles this but a
-queue stall can still happen during the fallback). The `_ci-gate.yml`
-watchdog in `JacobPEvans/.github` cancels any job stuck in `queued` after
-`queue_timeout_minutes` so the merge gate isn't blocked indefinitely.
-
-## Cost allocation
-
-Every RunsOn-launched EC2 instance is tagged with `runs-on=...`. AWS Cost
-Explorer can be filtered by that tag group to attribute spend per repo,
-workflow, and job. No per-workflow setup is needed — the tag is applied
-by the RunsOn control plane.
-
-## Related
-
-- [Migration guide](https://github.com/JacobPEvans/terraform-runs-on/blob/main/docs/migration-guide.md) — the canonical playbook for migrating a single repo
-- [RunsOn v3 job labels](https://runs-on.com/configuration/job-labels/) — upstream label spec
-- [terraform-runs-on README](https://github.com/JacobPEvans/terraform-runs-on/blob/main/README.md) — infra deployment
-- `ci-cd-policy` rule (auto-loaded org rule) — billing/runner policy