From d9c832ba33ecdae4a4d6766a7a99f9cb25516479 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sat, 23 May 2026 22:55:17 +0800
Subject: [PATCH 01/16] issue #101: path-conditional auto-deploy of test broker
 via SSM
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds two new harness-ci.yml jobs that re-deploy the test broker EC2
when a PR touches broker-affecting paths, so harness-e2e validates the
PR's actual broker code instead of whatever stale binary the EC2
happens to be running.

- detect-changes (dorny/paths-filter@v3) computes broker_changed
- deploy-test-broker assumes a new OIDC role and drives
  setup-broker-host.sh --test --yes on the EC2 via aws ssm send-command
- scripts/provision-ci-deploy-role.sh provisions the IAM role with a
  trust policy scoped to repo:litentry/agentKeys:* and an inline policy
  scoped to one EC2 instance ARN (separation of duties from the
  existing TEST_OIDC_AWS_ROLE_ARN e2e role)
- harness-e2e now runs AFTER deploy-test-broker (deviation from the
  issue's `needs: harness-e2e` spec, documented inline) so broker bugs
  introduced by a PR fail that PR's harness — not the next one's

Auto-deploy is fully opt-in: skipped silently unless both
OIDC_AWS_ROLE_ARN_DEPLOY and TEST_BROKER_INSTANCE_ID secrets are set.
A workflow_dispatch input force_deploy_broker enables dry-run
validation without a broker-path change.

Out of scope for this PR (rollout plan step 7 in issue #101):
auto-deploy of the test Heima EVM contracts. Defers to a follow-up
because it needs the SECRETS_REWRITE_PAT token to update six
TEST_*_ADDRESS_HEIMA secrets after each redeploy.

Prod broker auto-deploy stays explicitly out of scope per CLAUDE.md
"Remote broker host (single entry point)" — manual via
bash scripts/setup-broker-host.sh --upgrade only.

Docs: docs/ci-setup.md gains §7 with the provisioning recipe, secret
list, dry-run procedure, and disarm path.
---
 .github/workflows/harness-ci.yml    | 285 +++++++++++++++++++++++++-
 docs/ci-setup.md                    |  91 +++++++++
 scripts/provision-ci-deploy-role.sh | 304 ++++++++++++++++++++++++++++
 3 files changed, 677 insertions(+), 3 deletions(-)
 create mode 100755 scripts/provision-ci-deploy-role.sh

diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml
index 82606ac..891419f 100644
--- a/.github/workflows/harness-ci.yml
+++ b/.github/workflows/harness-ci.yml
@@ -59,9 +59,33 @@ name: harness CI (no LLM)
 #   TEST_P256_VERIFIER_ADDRESS_HEIMA        per test-environment refresh.
 #   TEST_K11_VERIFIER_ADDRESS_HEIMA
 #
+# Additional secrets for the optional path-conditional auto-deploy of the
+# test broker EC2 (issue #101 — see docs/ci-setup.md §7):
+#
+#   OIDC_AWS_ROLE_ARN_DEPLOY  IAM role assumed by deploy-test-broker. Trust
+#                             policy: federated on GitHub Actions OIDC,
+#                             conditioned on repo:litentry/agentKeys:*.
+#                             Inline policy: ssm:SendCommand on
+#                             document/AWS-RunShellScript +
+#                             one EC2 instance ARN (= TEST_BROKER_INSTANCE_ID).
+#                             Provisioned by scripts/provision-ci-deploy-role.sh.
+#                             SEPARATE from TEST_OIDC_AWS_ROLE_ARN by design:
+#                             e2e role exercises the workload (sts:AssumeRole
+#                             on data roles, S3 verify), deploy role drives
+#                             the broker re-deploy on EC2. Separation of
+#                             duties — a compromise of one doesn't grant
+#                             the other's capability.
+#   TEST_BROKER_INSTANCE_ID   EC2 instance ID (i-xxxxxxxxxxxxxxxxx) hosting
+#                             test-broker.${ZONE}. Pinned in the deploy role's
+#                             inline SSM policy so a leaked session cred
+#                             cannot SendCommand on any other EC2.
+#
 # Gating: until TEST_OIDC_AWS_ROLE_ARN is set, the workflow's preflight
 # job surfaces a ::warning:: skip and exits clean — safe to merge before
-# the operator activates the test infra.
+# the operator activates the test infra. The auto-deploy gate is a
+# distinct check (OIDC_AWS_ROLE_ARN_DEPLOY + TEST_BROKER_INSTANCE_ID
+# both present) so harness validation can be activated without
+# auto-deploy, and vice versa.
 #
 # WebAuthn: never invoked. harness/v2-stage1-demo.sh defaults to
 # WEBAUTHN_MODE=0 (line 131), v2-stage2-demo.sh accepts --stub, neither
@@ -90,6 +114,12 @@ on:
         default: "all"
         type: choice
         options: ["1", "2", "3", "all"]
+      force_deploy_broker:
+        description: "Force deploy-test-broker even if no broker paths changed (dry-run validation)"
+        required: false
+        default: "false"
+        type: choice
+        options: ["false", "true"]
 
 concurrency:
   group: harness-ci-${{ github.ref }}
@@ -97,6 +127,7 @@ concurrency:
 
 permissions:
   id-token: write   # GitHub Actions OIDC → assume TEST_OIDC_AWS_ROLE_ARN
+                    # (and OIDC_AWS_ROLE_ARN_DEPLOY for deploy-test-broker)
   contents: read
 
 jobs:
@@ -126,6 +157,44 @@ jobs:
       # map — same convention as the existing @claude review workflow.
       - run: cargo test --workspace -- --test-threads=1
 
+  detect-changes:
+    # Issue #101: path-conditional triggers for auto-deploy of the test broker.
+    # Computes `broker_changed` so deploy-test-broker can skip when a PR only
+    # touches docs/harness/test infra — saves ~3 min cargo rebuild + ssm wait
+    # per CI run, and avoids touching the test EC2 from PRs that don't need to.
+    #
+    # Path-filter false-negative caveats (see issue #101 "Trade-offs"):
+    #   - workspace-shared crates (agentkeys-types, agentkeys-signer-protocol)
+    #     ripple into the broker → listed in the filter conservatively.
+    #   - Cargo.lock changes → also listed (a transitive dep bump can affect
+    #     broker behavior at runtime).
+    name: detect changed paths (broker / contracts)
+    runs-on: ubuntu-latest
+    outputs:
+      broker_changed: ${{ steps.f.outputs.broker }}
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          # paths-filter needs the merge-base to diff against; default fetch
+          # is shallow. fetch-depth=0 ⇒ full history (cheap on a small repo).
+          fetch-depth: 0
+      - uses: dorny/paths-filter@v3
+        id: f
+        with:
+          filters: |
+            broker:
+              - 'crates/agentkeys-broker-server/**'
+              - 'crates/agentkeys-worker-*/**'
+              - 'crates/agentkeys-signer-protocol/**'
+              - 'crates/agentkeys-types/**'
+              - 'crates/agentkeys-core/**'
+              - 'scripts/setup-broker-host.sh'
+              - 'scripts/setup-broker-host.sh.d/**'
+              - 'scripts/broker.env'
+              - 'scripts/broker.test.env'
+              - 'Cargo.toml'
+              - 'Cargo.lock'
+
   preflight:
     # Gate the harness jobs on the test infra credentials being present.
     # Until the operator sets TEST_OIDC_AWS_ROLE_ARN, the harness jobs
@@ -135,6 +204,7 @@ jobs:
     needs: rust-checks
     outputs:
       should_run: ${{ steps.gate.outputs.should_run }}
+      deploy_ready: ${{ steps.gate.outputs.deploy_ready }}
     steps:
       - id: gate
         run: |
@@ -145,11 +215,220 @@ jobs:
             echo "should_run=false" >> "$GITHUB_OUTPUT"
             echo "::warning::TEST_OIDC_AWS_ROLE_ARN unset — harness E2E skipped. See workflow header for operator setup."
           fi
+          # deploy_ready: both deploy-side secrets must be present. Independent
+          # of should_run so an operator can opt INTO harness validation
+          # without enabling auto-deploy (e.g. while still vetting the deploy
+          # role's blast radius).
+          if [ -n "${{ secrets.OIDC_AWS_ROLE_ARN_DEPLOY }}" ] && [ -n "${{ secrets.TEST_BROKER_INSTANCE_ID }}" ]; then
+            echo "deploy_ready=true" >> "$GITHUB_OUTPUT"
+            echo "deploy secrets present; auto-deploy eligible"
+          else
+            echo "deploy_ready=false" >> "$GITHUB_OUTPUT"
+            echo "::notice::OIDC_AWS_ROLE_ARN_DEPLOY or TEST_BROKER_INSTANCE_ID unset — auto-deploy skipped. See docs/ci-setup.md §7."
+          fi
+
+  deploy-test-broker:
+    # Issue #101: drives `setup-broker-host.sh --test --yes` on the test broker
+    # EC2 via AWS SSM whenever a PR/push changes broker-affecting paths.
+    #
+    # Why deploy BEFORE harness-e2e (vs the issue's `needs: harness-e2e`):
+    # the failure mode this fixes is "harness scripts at version B vs broker
+    # binary at version A → spurious pass or confusing failure". Deploying
+    # first means harness-e2e validates the SAME revision the PR proposes —
+    # so a broker bug introduced by the PR is caught in the same PR, not
+    # leaked to whoever pushes next. Trade-off: a broker bug that crashes on
+    # startup will fail the deploy and skip harness-e2e (which is also the
+    # right signal — there's nothing to test).
+    #
+    # Concurrency: cross-PR races on the test EC2 are possible (PR-A deploys
+    # version A, PR-B deploys version B mid-flight, PR-A's harness sees B).
+    # Mitigation deferred to the followup PR — first cut accepts the race
+    # since concurrent broker-touching PRs are rare and the test EC2 is
+    # disposable. To add later: `concurrency: group: test-broker-deploy`
+    # with `cancel-in-progress: false` so deploys queue.
+    name: deploy broker to test EC2 (path-conditional)
+    needs: [preflight, detect-changes]
+    if: |
+      needs.preflight.outputs.should_run == 'true' &&
+      needs.preflight.outputs.deploy_ready == 'true' &&
+      (needs.detect-changes.outputs.broker_changed == 'true' ||
+       (github.event_name == 'workflow_dispatch' && inputs.force_deploy_broker == 'true'))
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    permissions:
+      id-token: write
+      contents: read
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Configure AWS credentials via OIDC (deploy role)
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          role-to-assume: ${{ secrets.OIDC_AWS_ROLE_ARN_DEPLOY }}
+          aws-region: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }}
+          # Session name shows up in CloudTrail — distinct from the e2e
+          # role's session-name pattern so the deploy invocations are
+          # filterable separately.
+          role-session-name: gh-deploy-${{ github.run_id }}
+
+      - name: Sanity-check the test broker EC2 is SSM-managed
+        # Fail fast with a clear remediation path if the instance isn't
+        # registered with SSM (instance profile missing AmazonSSMManagedInstanceCore,
+        # agent down, or wrong region).
+        env:
+          REGION: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }}
+          INSTANCE_ID: ${{ secrets.TEST_BROKER_INSTANCE_ID }}
+        run: |
+          set -euo pipefail
+          state=$(aws ssm describe-instance-information \
+            --region "$REGION" \
+            --filters "Key=InstanceIds,Values=$INSTANCE_ID" \
+            --query 'InstanceInformationList[0].PingStatus' \
+            --output text 2>/dev/null || echo "None")
+          case "$state" in
+            Online)
+              echo "::notice::SSM agent online on $INSTANCE_ID"
+              ;;
+            None|"")
+              echo "::error::$INSTANCE_ID is not SSM-managed. Run scripts/provision-ci-deploy-role.sh to diagnose."
+              exit 1
+              ;;
+            *)
+              echo "::error::SSM agent state = $state on $INSTANCE_ID (expected Online)"
+              exit 1
+              ;;
+          esac
+
+      - name: Compute deploy ref (PR head or push branch)
+        # GitHub provides GITHUB_HEAD_REF for PRs (source branch) and
+        # GITHUB_REF_NAME for push events. Falling through to "evm" as a
+        # safety net for manual workflow_dispatch on the default branch.
+        # The test EC2 fetches + checks out this ref before re-running
+        # setup-broker-host.sh, so the deployed binary matches the PR.
+        run: |
+          set -euo pipefail
+          ref="${GITHUB_HEAD_REF:-${GITHUB_REF_NAME:-evm}}"
+          if [ -z "$ref" ]; then
+            echo "::error::could not derive a ref to deploy"
+            exit 1
+          fi
+          # Refuse refs that contain shell metacharacters (defense-in-depth
+          # — GitHub already validates branch names, but the value is
+          # interpolated into a remote shell snippet below).
+          if printf '%s' "$ref" | grep -qE '[^A-Za-z0-9._/-]'; then
+            echo "::error::ref '$ref' contains unsupported characters"
+            exit 1
+          fi
+          echo "DEPLOY_REF=$ref" >> "$GITHUB_ENV"
+          echo "::notice::will deploy ref: $ref"
+
+      - name: SendCommand — fetch + checkout + setup-broker-host.sh --test --yes
+        env:
+          REGION: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }}
+          INSTANCE_ID: ${{ secrets.TEST_BROKER_INSTANCE_ID }}
+        run: |
+          set -euo pipefail
+          # Compose the remote shell script. `$DEPLOY_REF` is interpolated by
+          # the runner's shell (GHA env block makes it visible here); the
+          # remote SSM-driven shell sees the literal branch name. The remote
+          # shell runs as root (SSM-default on Ubuntu AMIs); git ops use
+          # `sudo -u ubuntu` so the working tree stays ubuntu-owned.
+          read -r -d '' deploy_script <<EOF || true
+          set -euo pipefail
+          cd /home/ubuntu/agentKeys
+          sudo -u ubuntu git fetch --prune origin
+          sudo -u ubuntu git checkout "$DEPLOY_REF" || sudo -u ubuntu git checkout "origin/$DEPLOY_REF"
+          sudo -u ubuntu git pull --ff-only origin "$DEPLOY_REF" 2>/dev/null || true
+          bash scripts/setup-broker-host.sh --test --yes --non-interactive
+          EOF
+
+          # jq --arg passes the multi-line script outside of shell parameter
+          # expansion (no modifier bugs per CLAUDE.md heredoc-trap rule).
+          params=$(jq -n --arg script "$deploy_script" '{
+            commands: [$script],
+            executionTimeout: ["900"]
+          }')
+
+          cmd_id=$(aws ssm send-command \
+            --region "$REGION" \
+            --instance-ids "$INSTANCE_ID" \
+            --document-name "AWS-RunShellScript" \
+            --comment "gh-ci deploy ${GITHUB_RUN_ID} ref=${DEPLOY_REF}" \
+            --parameters "$params" \
+            --query 'Command.CommandId' \
+            --output text)
+          echo "SSM_COMMAND_ID=$cmd_id" >> "$GITHUB_ENV"
+          echo "::notice::SSM SendCommand queued: $cmd_id"
+
+      - name: Poll SSM command until completion
+        env:
+          REGION: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }}
+          INSTANCE_ID: ${{ secrets.TEST_BROKER_INSTANCE_ID }}
+        run: |
+          set -euo pipefail
+          # Poll every 10s for up to 15 min. The command runs setup-broker-host.sh
+          # which rebuilds + restarts broker/signer/4 workers; cold cargo cache
+          # can be ~10min, warm ~3min.
+          for i in $(seq 1 90); do
+            sleep 10
+            status=$(aws ssm get-command-invocation \
+              --region "$REGION" \
+              --command-id "$SSM_COMMAND_ID" \
+              --instance-id "$INSTANCE_ID" \
+              --query 'Status' \
+              --output text 2>/dev/null || echo "Pending")
+            echo "iter=$i status=$status"
+            case "$status" in
+              Success)
+                aws ssm get-command-invocation \
+                  --region "$REGION" \
+                  --command-id "$SSM_COMMAND_ID" \
+                  --instance-id "$INSTANCE_ID" \
+                  --query 'StandardOutputContent' \
+                  --output text | tail -200
+                echo "::notice::deploy ok (ssm command $SSM_COMMAND_ID)"
+                exit 0
+                ;;
+              Failed|Cancelled|TimedOut)
+                echo "::error::SSM command terminal status: $status"
+                aws ssm get-command-invocation \
+                  --region "$REGION" \
+                  --command-id "$SSM_COMMAND_ID" \
+                  --instance-id "$INSTANCE_ID" \
+                  --query '{stdout:StandardOutputContent,stderr:StandardErrorContent}' \
+                  --output json
+                exit 1
+                ;;
+              Pending|InProgress|Delayed)
+                continue
+                ;;
+              *)
+                echo "::warning::unexpected status: $status"
+                ;;
+            esac
+          done
+          echo "::error::SSM command $SSM_COMMAND_ID did not complete within 15min"
+          exit 1
 
   harness-e2e:
     name: harness/v2-stage*-demo.sh on Heima mainnet (test deployer)
-    needs: preflight
-    if: needs.preflight.outputs.should_run == 'true'
+    needs: [preflight, deploy-test-broker]
+    # Run when:
+    #   - preflight gates green (test infra is set up)
+    #   - AND either:
+    #       (a) deploy-test-broker succeeded (PR re-deployed the broker
+    #           to test EC2, validating fresh broker code), OR
+    #       (b) deploy-test-broker was skipped (no broker paths changed
+    #           OR deploy_ready=false — the EC2's existing binary still
+    #           covers the harness contract).
+    # always() forces evaluation even when the upstream `if:` skips
+    # deploy-test-broker (GHA treats `needs:` deps with skipped jobs as
+    # failing the implicit `success()` filter without always()).
+    if: |
+      always() &&
+      needs.preflight.outputs.should_run == 'true' &&
+      (needs.deploy-test-broker.result == 'success' ||
+       needs.deploy-test-broker.result == 'skipped')
     runs-on: ubuntu-latest
     timeout-minutes: 60
 
diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index 005d77b..569ed73 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -365,6 +365,97 @@ gh workflow run harness-ci.yml --repo litentry/agentKeys --field stage=3
 
 When the workflow passes against the test stack, CI is live. Every subsequent push to a PR triggers it; you're done.
 
+### 7. (Optional) Wire auto-deploy of the test broker (issue [#101](https://github.com/litentry/agentKeys/issues/101))
+
+Without this step, the workflow validates against the **already-deployed** test broker. If a PR changes broker code (`crates/agentkeys-broker-server/**`, `crates/agentkeys-worker-*/**`, `crates/agentkeys-signer-protocol/**`, `scripts/setup-broker-host.sh*`, or any workspace-shared crate the broker links against), the test broker binary silently drifts from the PR's source tree — the harness then exercises *old* broker code against *new* harness scripts, producing either spurious passes or confusing failures.
+
+Step 7 wires a second OIDC role (`github-actions-agentkeys-deploy`) plus two new GitHub secrets. When activated, the workflow's `detect-changes` job sees broker-affecting paths in the diff, the `deploy-test-broker` job assumes that role, and `aws ssm send-command` drives `setup-broker-host.sh --test --yes` on the test EC2 — re-deploying the broker so `harness-e2e` validates the PR's actual code. The deploy job is **gated three ways**:
+
+1. `paths-filter` boolean (no broker code changed → skip).
+2. Both deploy secrets present (`OIDC_AWS_ROLE_ARN_DEPLOY` + `TEST_BROKER_INSTANCE_ID`).
+3. `preflight.outputs.should_run == 'true'` (test infra fully wired).
+
+If any gate fails, the deploy job is **skipped, not failed** — `harness-e2e` still runs against the existing broker binary. So this step is fully opt-in; partial activation is safe.
+
+#### 7.1 Run the provisioning script
+
+```bash
+awsp agentkeys-admin
+# Look up the test broker EC2 instance ID (one-shot — pin it once):
+TEST_BROKER_INSTANCE_ID=$(aws ec2 describe-instances \
+  --region "$REGION" \
+  --filters "Name=ip-address,Values=$(curl -sS "https://dns.google/resolve?name=$BROKER_HOST&type=A" | jq -r '.Answer[0].data')" \
+  --query 'Reservations[0].Instances[0].InstanceId' --output text)
+echo "$TEST_BROKER_INSTANCE_ID"   # → i-xxxxxxxxxxxxxxxxx
+
+# Idempotent provisioning — safe to re-run:
+bash scripts/provision-ci-deploy-role.sh \
+  --test-broker-instance-id "$TEST_BROKER_INSTANCE_ID" \
+  --env-file scripts/operator-workstation.test.env
+```
+
+The script:
+
+- Creates / refreshes the `github-actions-agentkeys-deploy` IAM role with a federated trust policy on the GitHub Actions OIDC provider, scoped to `repo:litentry/agentKeys:*` (any branch in this repo can trigger; the workflow's path filter + preflight gate further restrict when the role is actually used).
+- Attaches an inline policy `agentkeys-ci-deploy-ssm` with:
+  - `ssm:SendCommand` on `document/AWS-RunShellScript` + the one instance ARN (so even if the role's session creds leaked, the worst a third party can do is re-run setup-broker-host.sh on the test EC2 — a destructive op there is `terraform apply`-style: idempotent, recoverable, and contained to the test environment).
+  - `ssm:GetCommandInvocation` / `ssm:ListCommandInvocations` for status polling.
+  - `ec2:DescribeInstances` scoped to the one instance ID, for the workflow's pre-deploy sanity check.
+- Verifies the test EC2 is registered with SSM (`PingStatus = Online`). If not, prints concrete remediation: attach `AmazonSSMManagedInstanceCore` to the instance profile and / or `systemctl restart amazon-ssm-agent`.
+
+#### 7.2 Set the two new repo secrets
+
+```bash
+# Print the deploy role ARN you just provisioned (script also prints this):
+role_arn=$(aws iam get-role --role-name github-actions-agentkeys-deploy \
+  --query 'Role.Arn' --output text)
+
+gh secret set OIDC_AWS_ROLE_ARN_DEPLOY --repo litentry/agentKeys --body "$role_arn"
+gh secret set TEST_BROKER_INSTANCE_ID  --repo litentry/agentKeys --body "$TEST_BROKER_INSTANCE_ID"
+```
+
+| Secret | Purpose |
+|---|---|
+| `OIDC_AWS_ROLE_ARN_DEPLOY` | ARN of `github-actions-agentkeys-deploy` — assumed by the `deploy-test-broker` job via GitHub Actions OIDC. |
+| `TEST_BROKER_INSTANCE_ID` | EC2 instance ID (`i-…`) hosting `test-broker.${ZONE}`. The deploy role's inline policy is scoped to *this single instance*. |
+
+#### 7.3 Dry-run validate
+
+Trigger the workflow manually with `force_deploy_broker=true` so the deploy fires regardless of whether the latest commit touched broker paths:
+
+```bash
+gh workflow run harness-ci.yml --repo litentry/agentKeys \
+  --field stage=1 \
+  --field force_deploy_broker=true
+```
+
+Then in the run logs:
+
+- `deploy-test-broker` should show `SSM agent online on i-…` (sanity check passed).
+- The `SendCommand` step prints the command ID; the next step polls until `Success`.
+- On success: the tail of `StandardOutputContent` shows `setup-broker-host.sh` finishing cleanly (`ok systemd unit … active`, `ok nginx running`, etc.).
+- On failure: stdout + stderr are dumped to the GHA log. The most common cause is `git checkout` failing on the EC2 because the source tree doesn't have the PR branch fetched — fix by ssh-ing into the box and running `sudo -u ubuntu git fetch --prune origin` once.
+
+#### 7.4 Disable / disarm
+
+Remove either secret to disarm — the workflow's `preflight.outputs.deploy_ready` will flip to `false` and the deploy job silently skips:
+
+```bash
+gh secret delete OIDC_AWS_ROLE_ARN_DEPLOY --repo litentry/agentKeys
+# or
+gh secret delete TEST_BROKER_INSTANCE_ID --repo litentry/agentKeys
+```
+
+The IAM role can stay provisioned indefinitely — without the secret it can't be assumed by GHA, and the inline SSM perms are scoped to one instance.
+
+#### Out of scope for issue #101
+
+Per [issue #101](https://github.com/litentry/agentKeys/issues/101) "Out of scope":
+
+- **Prod broker auto-deploy** — never. The prod broker EC2 stays manual via `bash scripts/setup-broker-host.sh --upgrade` from the operator laptop, per CLAUDE.md "Remote broker host (single entry point)".
+- **Auto-deploy of test Heima EVM contracts** — deferred to a follow-up PR (issue #101 rollout plan step 7). Contract redeploys mint new addresses and require the `SECRETS_REWRITE_PAT` token to update six `TEST_*_ADDRESS_HEIMA` secrets — more risk than the broker deploy, so it ships separately.
+- **Mainnet prod contract redeploy** — never automatic. Manual via `bash scripts/setup-heima.sh` only.
+
 ## What the workflow does on every run
 
 1. Restores submodules + Rust toolchain + Foundry + cargo cache.
diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
new file mode 100755
index 0000000..018fc8c
--- /dev/null
+++ b/scripts/provision-ci-deploy-role.sh
@@ -0,0 +1,304 @@
+#!/usr/bin/env bash
+# scripts/provision-ci-deploy-role.sh — idempotent creation of the
+# `github-actions-agentkeys-deploy` IAM role that lets the no-LLM CI
+# workflow drive `setup-broker-host.sh --test --yes` on the test broker
+# EC2 via AWS Systems Manager (SSM).
+#
+# Per arch.md trust posture (issue #101): the role is reachable ONLY
+# via GitHub Actions OIDC from the `litentry/agentKeys` repo, and its
+# inline policy is scoped to:
+#   - `ssm:SendCommand` on document/AWS-RunShellScript + the ONE test
+#     broker instance ARN — so even if the role were stolen, the worst
+#     it can do is queue a shell command on that single EC2.
+#   - `ssm:GetCommandInvocation` + `ssm:ListCommandInvocations` for
+#     status polling (no resource scope, read-only).
+#   - `ec2:DescribeInstances` so the workflow can sanity-check the
+#     instance is reachable before sending the command.
+#
+# Why a separate role from `github-actions-agentkeys-e2e`:
+#   - The e2e role's perms (sts:AssumeRole on test data roles + S3
+#     verify) are read/write into the test environment AS the workload.
+#   - The deploy role's perms (ssm:SendCommand on the broker EC2) are
+#     control-plane: it tells the EC2 to re-deploy the broker binary.
+#   - Separation of duties: a compromise of CI's e2e creds cannot
+#     trigger a broker re-deploy, and vice versa.
+#
+# Out of scope (stays manual per CLAUDE.md "Remote broker host (single
+# entry point)" + "Idempotent remote-setup rule (CLOUD)"):
+#   - The PROD broker EC2 (broker.litentry.org) — no auto-deploy ever.
+#   - The Heima EVM PROD contract redeploy — never automatic.
+#
+# Required env (sourced from $ENV_FILE):
+#   - ACCOUNT_ID
+#   - REGION
+# Required CLI flags:
+#   - --test-broker-instance-id i-xxxxxxxxx (the EC2 hosting the test broker)
+# Optional CLI flags:
+#   - --repo litentry/agentKeys (default; pinned in OIDC sub condition)
+#   - --role-name github-actions-agentkeys-deploy (default)
+#   - --env-file scripts/operator-workstation.test.env (default)
+#   - --dry-run (print planned changes; no AWS calls that mutate state)
+#
+# Required AWS profile: agentkeys-admin (the script checks caller ARN).
+#
+# Outcomes per step (matches the idempotent-remote-setup rule shape):
+#   - `ok proceeding` → mutation applied
+#   - `skip <reason>` → no-op (e.g. role already present + trust matches)
+#   - `fail <reason>` → hard error, exit non-zero
+
+set -euo pipefail
+
+# ─── CLI parse ────────────────────────────────────────────────────────────────
+DRY_RUN=0
+TEST_BROKER_INSTANCE_ID=""
+REPO_SLUG="litentry/agentKeys"
+ROLE_NAME="github-actions-agentkeys-deploy"
+SSM_POLICY_NAME="agentkeys-ci-deploy-ssm"
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+ENV_FILE="${ENV_FILE:-$REPO_ROOT/scripts/operator-workstation.test.env}"
+
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --test-broker-instance-id) TEST_BROKER_INSTANCE_ID="$2"; shift 2 ;;
+    --repo)                    REPO_SLUG="$2"; shift 2 ;;
+    --role-name)               ROLE_NAME="$2"; shift 2 ;;
+    --env-file)                ENV_FILE="$2"; shift 2 ;;
+    --dry-run)                 DRY_RUN=1; shift ;;
+    --help|-h)
+      sed -n '2,/^set -euo/p' "$0" | sed 's/^# \{0,1\}//' | sed '$d'; exit 0 ;;
+    *) echo "unknown flag: $1 (try --help)" >&2; exit 2 ;;
+  esac
+done
+
+# ─── Logging primitives (mirrors provision-vault-role.sh) ─────────────────────
+if [ -t 2 ]; then
+  C_HEAD='\033[1;36m'; C_OK='\033[1;32m'; C_SKIP='\033[1;33m'
+  C_WARN='\033[1;33m'; C_ERR='\033[1;31m'; C_RESET='\033[0m'
+else
+  C_HEAD=''; C_OK=''; C_SKIP=''; C_WARN=''; C_ERR=''; C_RESET=''
+fi
+log()  { printf "${C_HEAD}==>${C_RESET} %s\n" "$*" >&2; }
+ok()   { printf "    ${C_OK}ok${C_RESET}   %s\n" "$*" >&2; }
+skip() { printf "    ${C_SKIP}skip${C_RESET} %s\n" "$*" >&2; }
+warn() { printf "    ${C_WARN}warn${C_RESET} %s\n" "$*" >&2; }
+die()  { printf "    ${C_ERR}fail${C_RESET} %s\n" "$*" >&2; exit 1; }
+
+# ─── Preconditions ────────────────────────────────────────────────────────────
+[ -f "$ENV_FILE" ] || die "missing $ENV_FILE (pass --env-file <path> to override)"
+set -a; . "$ENV_FILE"; set +a
+
+ACCOUNT_ID="${ACCOUNT_ID:?ACCOUNT_ID required in $ENV_FILE}"
+REGION="${REGION:?REGION required in $ENV_FILE}"
+
+[ -n "$TEST_BROKER_INSTANCE_ID" ] \
+  || die "missing --test-broker-instance-id (look up via: aws ec2 describe-instances --region $REGION --filters 'Name=tag:Name,Values=agentkeys-test-broker' --query 'Reservations[0].Instances[0].InstanceId')"
+
+[[ "$TEST_BROKER_INSTANCE_ID" =~ ^i-[0-9a-f]{8,17}$ ]] \
+  || die "instance ID shape invalid: $TEST_BROKER_INSTANCE_ID (expected i-<8-17 hex chars>)"
+
+[[ "$REPO_SLUG" =~ ^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$ ]] \
+  || die "repo slug shape invalid: $REPO_SLUG (expected owner/repo)"
+
+command -v jq >/dev/null  || die "jq not found in PATH (brew install jq)"
+command -v aws >/dev/null || die "aws CLI not found in PATH"
+
+# Caller identity must be agentkeys-admin (matches the rest of the provision-*
+# scripts; lowercase compare because the live IAM user is `agentKeys-admin`).
+caller_arn=$(aws sts get-caller-identity --query Arn --output text 2>&1) \
+  || die "aws sts get-caller-identity failed: $caller_arn"
+arn_lc=$(printf '%s' "$caller_arn" | tr '[:upper:]' '[:lower:]')
+case "$arn_lc" in
+  *":user/agentkeys-admin"*) ok "caller is admin: $caller_arn" ;;
+  *) die "caller is $caller_arn — needs agentkeys-admin (try: awsp agentkeys-admin)" ;;
+esac
+
+# ─── Step 1: ensure the GitHub Actions OIDC provider exists in the account ───
+log "OIDC provider: token.actions.githubusercontent.com"
+gha_provider_arn="arn:aws:iam::${ACCOUNT_ID}:oidc-provider/token.actions.githubusercontent.com"
+if aws iam get-open-id-connect-provider --open-id-connect-provider-arn "$gha_provider_arn" >/dev/null 2>&1; then
+  skip "GHA OIDC provider already registered"
+else
+  if [ "$DRY_RUN" = "1" ]; then
+    log "DRY RUN — would create-open-id-connect-provider for token.actions.githubusercontent.com"
+  else
+    # Thumbprint per GitHub's published cert (matches docs/ci-setup.md §4 note).
+    # If the cert chain rolls, this needs a refresh; AWS rejects mismatches.
+    aws iam create-open-id-connect-provider \
+      --url https://token.actions.githubusercontent.com \
+      --client-id-list sts.amazonaws.com \
+      --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1 \
+      >/dev/null \
+      || die "create-open-id-connect-provider failed"
+    ok "GHA OIDC provider registered"
+  fi
+fi
+
+# ─── Step 2: trust policy ─────────────────────────────────────────────────────
+# Federated on the GHA OIDC provider, scoped to the litentry/agentKeys repo.
+# `StringLike` on `sub` lets PR branches AND `refs/heads/*` push events
+# trigger; the workflow itself is the second gate (path filter + concurrency).
+#
+# To tighten further later (e.g. main-branch-only deploys), change the StringLike
+# pattern to `repo:litentry/agentKeys:ref:refs/heads/evm` or similar.
+trust_policy=$(jq -n \
+  --arg provider "$gha_provider_arn" \
+  --arg sub_pattern "repo:${REPO_SLUG}:*" \
+  '{
+    Version: "2012-10-17",
+    Statement: [{
+      Effect: "Allow",
+      Principal: { Federated: $provider },
+      Action: "sts:AssumeRoleWithWebIdentity",
+      Condition: {
+        StringEquals: {
+          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
+        },
+        StringLike: {
+          "token.actions.githubusercontent.com:sub": $sub_pattern
+        }
+      }
+    }]
+  }')
+
+# ─── Step 3: role existence ──────────────────────────────────────────────────
+log "Role existence: $ROLE_NAME"
+if aws iam get-role --role-name "$ROLE_NAME" >/dev/null 2>&1; then
+  skip "role already exists"
+  if [ "$DRY_RUN" = "1" ]; then
+    log "DRY RUN — would update-assume-role-policy with: $trust_policy"
+  else
+    log "Refreshing trust policy (idempotent — overwrites with $sub_pattern shape)"
+    aws iam update-assume-role-policy \
+      --role-name "$ROLE_NAME" \
+      --policy-document "$trust_policy" \
+      || die "update-assume-role-policy failed"
+    ok "trust policy refreshed"
+  fi
+else
+  if [ "$DRY_RUN" = "1" ]; then
+    log "DRY RUN — would create-role $ROLE_NAME with trust: $trust_policy"
+  else
+    log "Creating role $ROLE_NAME"
+    aws iam create-role \
+      --role-name "$ROLE_NAME" \
+      --assume-role-policy-document "$trust_policy" \
+      --description "CI deploy role — drives setup-broker-host.sh on the test EC2 via SSM (issue #101)" \
+      >/dev/null \
+      || die "create-role failed"
+    ok "role created"
+  fi
+fi
+
+# ─── Step 4: inline SSM policy ───────────────────────────────────────────────
+# Narrow on purpose: SendCommand limited to the document + the ONE instance
+# ARN. Even a compromised role can only re-run setup-broker-host.sh on the
+# test broker; nothing in prod, nothing on other EC2s.
+instance_arn="arn:aws:ec2:${REGION}:${ACCOUNT_ID}:instance/${TEST_BROKER_INSTANCE_ID}"
+ssm_document_arn="arn:aws:ssm:${REGION}::document/AWS-RunShellScript"
+
+inline_policy=$(jq -n \
+  --arg doc_arn "$ssm_document_arn" \
+  --arg inst_arn "$instance_arn" \
+  --arg inst_id  "$TEST_BROKER_INSTANCE_ID" \
+  '{
+    Version: "2012-10-17",
+    Statement: [
+      {
+        Sid: "SendShellCommandToTestBrokerOnly",
+        Effect: "Allow",
+        Action: "ssm:SendCommand",
+        Resource: [$doc_arn, $inst_arn]
+      },
+      {
+        Sid: "PollCommandStatus",
+        Effect: "Allow",
+        Action: [
+          "ssm:GetCommandInvocation",
+          "ssm:ListCommandInvocations"
+        ],
+        Resource: "*"
+      },
+      {
+        Sid: "DescribeTestBrokerInstanceOnly",
+        Effect: "Allow",
+        Action: "ec2:DescribeInstances",
+        Resource: "*",
+        Condition: {
+          StringEquals: {
+            "ec2:InstanceId": [$inst_id]
+          }
+        }
+      }
+    ]
+  }')
+
+log "Inline policy: $SSM_POLICY_NAME"
+if [ "$DRY_RUN" = "1" ]; then
+  log "DRY RUN — would put-role-policy: $inline_policy"
+else
+  aws iam put-role-policy \
+    --role-name "$ROLE_NAME" \
+    --policy-name "$SSM_POLICY_NAME" \
+    --policy-document "$inline_policy" \
+    || die "put-role-policy failed"
+  ok "inline policy applied ($(echo "$inline_policy" | jq '.Statement | length') statements; SendCommand scoped to $TEST_BROKER_INSTANCE_ID)"
+fi
+
+# ─── Step 5: verify the test broker EC2 is SSM-managed ───────────────────────
+# If the instance lacks AmazonSSMManagedInstanceCore (via its instance profile)
+# OR the SSM Agent isn't running, SendCommand will queue the command and time
+# out without delivering it. Fail fast here with a clear remediation path.
+log "Verify SSM agent reachable: $TEST_BROKER_INSTANCE_ID"
+if [ "$DRY_RUN" = "1" ]; then
+  log "DRY RUN — would query ssm describe-instance-information for $TEST_BROKER_INSTANCE_ID"
+else
+  ssm_state=$(aws ssm describe-instance-information \
+    --region "$REGION" \
+    --filters "Key=InstanceIds,Values=$TEST_BROKER_INSTANCE_ID" \
+    --query 'InstanceInformationList[0].PingStatus' \
+    --output text 2>/dev/null || echo "None")
+
+  case "$ssm_state" in
+    Online)
+      ok "SSM agent online — workflow can SendCommand"
+      ;;
+    ConnectionLost|Inactive)
+      warn "SSM agent state = $ssm_state — workflow SendCommand may stall"
+      warn "Remediation: ssh into EC2, run 'sudo systemctl restart amazon-ssm-agent' and re-check"
+      ;;
+    None|"")
+      die "$TEST_BROKER_INSTANCE_ID is not registered with SSM. Likely causes:
+    1. EC2 instance profile is missing AmazonSSMManagedInstanceCore. Fix:
+         aws ec2 describe-instances --region $REGION --instance-ids $TEST_BROKER_INSTANCE_ID \\
+           --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn'
+       Then attach the policy to the role behind that instance profile:
+         aws iam attach-role-policy --role-name <role-from-above> \\
+           --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
+       Reboot the EC2 (or restart amazon-ssm-agent) to pick up new perms.
+    2. SSM Agent not installed/running. Fix (Ubuntu 22.04+ ships it):
+         ssh test-broker 'sudo systemctl enable --now amazon-ssm-agent'
+    3. Instance is in a private VPC subnet without an SSM VPC endpoint.
+       (Unlikely for a public-IP broker, but worth a glance at the routing.)"
+      ;;
+    *)
+      warn "SSM agent state = $ssm_state (unexpected) — proceed with caution"
+      ;;
+  esac
+fi
+
+# ─── Final: print the ARN so the operator can paste it into the GHA secret ──
+role_arn=$(aws iam get-role --role-name "$ROLE_NAME" --query 'Role.Arn' --output text 2>/dev/null || echo "?")
+ok "deploy role ready: $role_arn"
+cat <<EOF >&2
+
+Next:
+  # 1. Set the two GitHub secrets (idempotent — overwrites existing values):
+  gh secret set OIDC_AWS_ROLE_ARN_DEPLOY --repo $REPO_SLUG --body "$role_arn"
+  gh secret set TEST_BROKER_INSTANCE_ID  --repo $REPO_SLUG --body "$TEST_BROKER_INSTANCE_ID"
+
+  # 2. Trigger a workflow_dispatch with broker_changed=true to dry-run the
+  #    deploy path on the test EC2 (see docs/ci-setup.md §7).
+
+EOF
+
+echo "$role_arn"

From 16eac7a471474d116671dc9df3e5bf0cf736fe0c Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sat, 23 May 2026 23:08:08 +0800
Subject: [PATCH 02/16] fix(provision-ci-deploy-role): strip non-ASCII from
 --description
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

IAM CreateRole rejects descriptions outside [\t\n\r\x20-\x7e\xa1-\xff]
with 'Value at description failed to satisfy constraint'. The em-dash
in the original description string tripped this regex at provisioning
time. Replace with an ASCII hyphen and add an inline warning comment
so a future editor doesn't reintroduce Unicode here.

Reported by operator running docs/ci-setup.md §7.1.
---
 scripts/provision-ci-deploy-role.sh | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
index 018fc8c..32ae718 100755
--- a/scripts/provision-ci-deploy-role.sh
+++ b/scripts/provision-ci-deploy-role.sh
@@ -179,10 +179,14 @@ else
     log "DRY RUN — would create-role $ROLE_NAME with trust: $trust_policy"
   else
     log "Creating role $ROLE_NAME"
+    # IAM CreateRole --description allows only printable ASCII + Latin-1
+    # (regex [\t\n\r\x20-\x7e\xa1-\xff]*). Em-dash / en-dash / arrows trip
+    # "Value at 'description' failed to satisfy constraint" at AWS-call time.
+    # Keep this string ASCII-only.
     aws iam create-role \
       --role-name "$ROLE_NAME" \
       --assume-role-policy-document "$trust_policy" \
-      --description "CI deploy role — drives setup-broker-host.sh on the test EC2 via SSM (issue #101)" \
+      --description "CI deploy role - drives setup-broker-host.sh on the test EC2 via SSM (issue #101)" \
       >/dev/null \
       || die "create-role failed"
     ok "role created"

From a724745222cd1a7ff29cd80da50bdf22015c1e55 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sat, 23 May 2026 23:13:37 +0800
Subject: [PATCH 03/16] fix(provision-ci-deploy-role): --fix-ssm auto-attaches
 SSM policy + folds into runbook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator hit the second failure mode in docs/ci-setup.md §7.1: the test
broker EC2 was not registered with SSM (PingStatus=None), so the script
exited before SendCommand could ever work. The fix had to be one round-
trip per CLAUDE.md runbook-fix-fold-back policy: a sanity check upgrade
that catches the same case for the next operator AND a manual override.

Script changes:
- New --fix-ssm flag. When passed AND PingStatus != Online, the script:
    1. Looks up the EC2's IamInstanceProfile via DescribeInstances.
    2. Walks profile -> role via iam:GetInstanceProfile.
    3. Attaches arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
       (idempotent — aws iam attach-role-policy no-ops on re-attach).
    4. Polls describe-instance-information up to 18x (~3 min) waiting
       for the agent to refresh creds.
    5. If still offline after 3 min: prints both manual escape hatches
       (ssh + systemctl restart amazon-ssm-agent, OR aws ec2 reboot-instances).
- Without --fix-ssm: same diagnostic message as before, plus a one-line
  hint pointing at --fix-ssm. No IAM mutation; safe default.
- Handles the edge case of an instance with NO instance profile at all:
  prints associate-iam-instance-profile command, exits 1.

Docs (docs/ci-setup.md §7.1):
- Standard invocation now includes --fix-ssm on the first run.
- New SSM remediation table maps each failure mode to what --fix-ssm
  covers vs what the operator must still do by hand (agent restart,
  reboot, install agent, VPC endpoint).

Reported by operator after re-running the em-dash-fixed script;
PingStatus=None on i-0135a8b2c53d14941.
---
 docs/ci-setup.md                    |  21 +++-
 scripts/provision-ci-deploy-role.sh | 147 ++++++++++++++++++++++++----
 2 files changed, 148 insertions(+), 20 deletions(-)

diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index 569ed73..f2afe87 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -388,10 +388,13 @@ TEST_BROKER_INSTANCE_ID=$(aws ec2 describe-instances \
   --query 'Reservations[0].Instances[0].InstanceId' --output text)
 echo "$TEST_BROKER_INSTANCE_ID"   # → i-xxxxxxxxxxxxxxxxx
 
-# Idempotent provisioning — safe to re-run:
+# Idempotent provisioning — safe to re-run. Use --fix-ssm on the FIRST run
+# so the script auto-attaches AmazonSSMManagedInstanceCore to the broker EC2's
+# instance profile if it's missing (a fresh EC2 commonly lacks this policy).
 bash scripts/provision-ci-deploy-role.sh \
   --test-broker-instance-id "$TEST_BROKER_INSTANCE_ID" \
-  --env-file scripts/operator-workstation.test.env
+  --env-file scripts/operator-workstation.test.env \
+  --fix-ssm
 ```
 
 The script:
@@ -401,7 +404,19 @@ The script:
   - `ssm:SendCommand` on `document/AWS-RunShellScript` + the one instance ARN (so even if the role's session creds leaked, the worst a third party can do is re-run setup-broker-host.sh on the test EC2 — a destructive op there is `terraform apply`-style: idempotent, recoverable, and contained to the test environment).
   - `ssm:GetCommandInvocation` / `ssm:ListCommandInvocations` for status polling.
   - `ec2:DescribeInstances` scoped to the one instance ID, for the workflow's pre-deploy sanity check.
-- Verifies the test EC2 is registered with SSM (`PingStatus = Online`). If not, prints concrete remediation: attach `AmazonSSMManagedInstanceCore` to the instance profile and / or `systemctl restart amazon-ssm-agent`.
+- Verifies the test EC2 is registered with SSM (`PingStatus = Online`). With `--fix-ssm`, auto-remediates the common "instance profile is missing AmazonSSMManagedInstanceCore" case by attaching the policy and polling for up to 3 min for the SSM agent to refresh its creds. Without `--fix-ssm`, just reports the failure with manual fix instructions.
+
+**SSM remediation modes (what `--fix-ssm` covers, what it doesn't):**
+
+| Failure | What `--fix-ssm` does | What it CAN'T fix automatically |
+|---|---|---|
+| Instance profile missing `AmazonSSMManagedInstanceCore` | Attaches the policy, polls for Online | (handled) |
+| Policy already attached, agent process running with stale creds | Polls until agent refreshes (~1-3 min typical) | If poll times out: SSH + `sudo systemctl restart amazon-ssm-agent`, OR `aws ec2 reboot-instances …` |
+| Instance has NO instance profile at all | Exits with `associate-iam-instance-profile` command to run first | Operator runs the printed command, then re-invokes with `--fix-ssm` |
+| SSM Agent not installed | Reports state; can't reach the box to install | SSH + `sudo systemctl enable --now amazon-ssm-agent` (Ubuntu 22.04+ ships it) |
+| Private VPC subnet without an SSM VPC endpoint | Reports state | Operator wires the VPC endpoint (unlikely for a public-IP broker, but possible) |
+
+Re-running the script after any of the operator-side fixes is safe (idempotent — every step is `get-*` pre-checked before any mutation).
 
 #### 7.2 Set the two new repo secrets
 
diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
index 32ae718..a728d6e 100755
--- a/scripts/provision-ci-deploy-role.sh
+++ b/scripts/provision-ci-deploy-role.sh
@@ -37,6 +37,12 @@
 #   - --repo litentry/agentKeys (default; pinned in OIDC sub condition)
 #   - --role-name github-actions-agentkeys-deploy (default)
 #   - --env-file scripts/operator-workstation.test.env (default)
+#   - --fix-ssm  Auto-attach AmazonSSMManagedInstanceCore to the broker EC2's
+#                instance profile role if the SSM agent is offline, then poll
+#                for up to 3 min waiting for the agent to refresh creds.
+#                Safe to pass on every run (idempotent: aws iam attach-role-policy
+#                no-ops on re-attach, and the auto-attach is gated on PingStatus
+#                != Online so a healthy EC2 is untouched).
 #   - --dry-run (print planned changes; no AWS calls that mutate state)
 #
 # Required AWS profile: agentkeys-admin (the script checks caller ARN).
@@ -50,6 +56,7 @@ set -euo pipefail
 
 # ─── CLI parse ────────────────────────────────────────────────────────────────
 DRY_RUN=0
+FIX_SSM=0
 TEST_BROKER_INSTANCE_ID=""
 REPO_SLUG="litentry/agentKeys"
 ROLE_NAME="github-actions-agentkeys-deploy"
@@ -63,6 +70,7 @@ while [ $# -gt 0 ]; do
     --repo)                    REPO_SLUG="$2"; shift 2 ;;
     --role-name)               ROLE_NAME="$2"; shift 2 ;;
     --env-file)                ENV_FILE="$2"; shift 2 ;;
+    --fix-ssm)                 FIX_SSM=1; shift ;;
     --dry-run)                 DRY_RUN=1; shift ;;
     --help|-h)
       sed -n '2,/^set -euo/p' "$0" | sed 's/^# \{0,1\}//' | sed '$d'; exit 0 ;;
@@ -252,6 +260,90 @@ fi
 # If the instance lacks AmazonSSMManagedInstanceCore (via its instance profile)
 # OR the SSM Agent isn't running, SendCommand will queue the command and time
 # out without delivering it. Fail fast here with a clear remediation path.
+#
+# With --fix-ssm, the script attempts auto-remediation:
+#   - Looks up the EC2's instance profile via DescribeInstances
+#   - Extracts the role name behind the profile
+#   - Attaches AmazonSSMManagedInstanceCore (idempotent: AWS no-ops on re-attach)
+#   - Re-polls PingStatus for up to 3 min waiting for the agent to refresh creds
+#   - If still offline after 3 min: tells operator to reboot or restart the agent
+#
+# The auto-attach is safe because the operator is already running as
+# agentkeys-admin (verified above) — they HAVE iam:AttachRolePolicy. Without
+# --fix-ssm the script just reports + exits (no IAM mutation, no surprises).
+attach_ssm_managed_policy_if_missing() {
+  # Returns 0 if policy was attached or already present; non-zero on hard error.
+  local instance_id="$1"
+  local profile_arn role_name policy_arn already_attached
+
+  policy_arn="arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
+
+  profile_arn=$(aws ec2 describe-instances \
+    --region "$REGION" \
+    --instance-ids "$instance_id" \
+    --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn' \
+    --output text 2>/dev/null || echo "None")
+
+  if [ -z "$profile_arn" ] || [ "$profile_arn" = "None" ] || [ "$profile_arn" = "null" ]; then
+    warn "instance $instance_id has NO IAM instance profile attached — auto-remediation is blocked."
+    warn "Fix: associate an instance profile via:"
+    warn "  aws ec2 associate-iam-instance-profile --instance-id $instance_id \\"
+    warn "    --iam-instance-profile Name=<existing-profile-name> --region $REGION"
+    warn "Then re-run this script with --fix-ssm."
+    return 1
+  fi
+
+  # Profile ARN shape: arn:aws:iam::ACCT:instance-profile/<NAME>
+  local profile_name="${profile_arn##*/}"
+  log "instance profile: $profile_name"
+
+  role_name=$(aws iam get-instance-profile \
+    --instance-profile-name "$profile_name" \
+    --query 'InstanceProfile.Roles[0].RoleName' \
+    --output text 2>/dev/null || echo "None")
+
+  if [ -z "$role_name" ] || [ "$role_name" = "None" ]; then
+    warn "instance profile $profile_name has no role attached — auto-remediation is blocked."
+    return 1
+  fi
+  log "role behind profile: $role_name"
+
+  already_attached=$(aws iam list-attached-role-policies \
+    --role-name "$role_name" \
+    --query "AttachedPolicies[?PolicyArn=='$policy_arn'].PolicyArn" \
+    --output text 2>/dev/null || echo "")
+
+  if [ -n "$already_attached" ]; then
+    ok "AmazonSSMManagedInstanceCore already attached to $role_name"
+    return 0
+  fi
+
+  log "Attaching AmazonSSMManagedInstanceCore to $role_name"
+  aws iam attach-role-policy \
+    --role-name "$role_name" \
+    --policy-arn "$policy_arn" \
+    || { warn "attach-role-policy failed"; return 1; }
+  ok "AmazonSSMManagedInstanceCore attached to $role_name"
+  return 0
+}
+
+poll_ssm_online() {
+  local instance_id="$1" max_iters="$2" state
+  for _ in $(seq 1 "$max_iters"); do
+    state=$(aws ssm describe-instance-information \
+      --region "$REGION" \
+      --filters "Key=InstanceIds,Values=$instance_id" \
+      --query 'InstanceInformationList[0].PingStatus' \
+      --output text 2>/dev/null || echo "None")
+    case "$state" in
+      Online) printf '%s' "$state"; return 0 ;;
+    esac
+    sleep 10
+  done
+  printf '%s' "${state:-None}"
+  return 1
+}
+
 log "Verify SSM agent reachable: $TEST_BROKER_INSTANCE_ID"
 if [ "$DRY_RUN" = "1" ]; then
   log "DRY RUN — would query ssm describe-instance-information for $TEST_BROKER_INSTANCE_ID"
@@ -266,23 +358,44 @@ else
     Online)
       ok "SSM agent online — workflow can SendCommand"
       ;;
-    ConnectionLost|Inactive)
-      warn "SSM agent state = $ssm_state — workflow SendCommand may stall"
-      warn "Remediation: ssh into EC2, run 'sudo systemctl restart amazon-ssm-agent' and re-check"
-      ;;
-    None|"")
-      die "$TEST_BROKER_INSTANCE_ID is not registered with SSM. Likely causes:
-    1. EC2 instance profile is missing AmazonSSMManagedInstanceCore. Fix:
-         aws ec2 describe-instances --region $REGION --instance-ids $TEST_BROKER_INSTANCE_ID \\
-           --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn'
-       Then attach the policy to the role behind that instance profile:
-         aws iam attach-role-policy --role-name <role-from-above> \\
-           --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
-       Reboot the EC2 (or restart amazon-ssm-agent) to pick up new perms.
-    2. SSM Agent not installed/running. Fix (Ubuntu 22.04+ ships it):
-         ssh test-broker 'sudo systemctl enable --now amazon-ssm-agent'
-    3. Instance is in a private VPC subnet without an SSM VPC endpoint.
-       (Unlikely for a public-IP broker, but worth a glance at the routing.)"
+    ConnectionLost|Inactive|None|"")
+      if [ "$FIX_SSM" = "1" ]; then
+        log "Auto-remediating (--fix-ssm): attach AmazonSSMManagedInstanceCore + poll"
+        if attach_ssm_managed_policy_if_missing "$TEST_BROKER_INSTANCE_ID"; then
+          log "Polling SSM PingStatus for up to 3 min (agent refresh window)"
+          final_state=$(poll_ssm_online "$TEST_BROKER_INSTANCE_ID" 18) || true
+          if [ "$final_state" = "Online" ]; then
+            ok "SSM agent now online"
+          else
+            warn "SSM agent still $final_state after 3 min — policy attached, but the"
+            warn "agent process hasn't picked up the refreshed creds. Pick ONE:"
+            warn "  a) SSH and bounce the agent:"
+            warn "     ssh test-broker 'sudo systemctl restart amazon-ssm-agent'"
+            warn "  b) Reboot the EC2 (heavier):"
+            warn "     aws ec2 reboot-instances --instance-ids $TEST_BROKER_INSTANCE_ID --region $REGION"
+            warn "Then re-run this script (no flags) to confirm Online."
+            exit 1
+          fi
+        else
+          exit 1
+        fi
+      else
+        die "$TEST_BROKER_INSTANCE_ID is not registered with SSM (state=$ssm_state). Re-run with --fix-ssm
+to attempt auto-remediation (attaches AmazonSSMManagedInstanceCore to the
+EC2's instance profile role, then polls until the SSM agent refreshes).
+Or remediate manually:
+  1. EC2 instance profile is missing AmazonSSMManagedInstanceCore. Fix:
+       aws ec2 describe-instances --region $REGION --instance-ids $TEST_BROKER_INSTANCE_ID \\
+         --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn'
+     Then attach the policy to the role behind that instance profile:
+       aws iam attach-role-policy --role-name <role-from-above> \\
+         --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
+     Reboot the EC2 (or restart amazon-ssm-agent) to pick up new perms.
+  2. SSM Agent not installed/running. Fix (Ubuntu 22.04+ ships it):
+       ssh test-broker 'sudo systemctl enable --now amazon-ssm-agent'
+  3. Instance is in a private VPC subnet without an SSM VPC endpoint.
+     (Unlikely for a public-IP broker, but worth a glance at the routing.)"
+      fi
       ;;
     *)
       warn "SSM agent state = $ssm_state (unexpected) — proceed with caution"

From d6ce74b494720e406c88384ac3bb9fae42313d9e Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sat, 23 May 2026 23:21:33 +0800
Subject: [PATCH 04/16] fix(provision-ci-deploy-role): unbound $sub_pattern in
 idempotent log line

set -u tripped on the role-already-exists branch because the log line
referenced $sub_pattern as a shell variable, but it only exists as a
jq --arg inside the trust-policy heredoc. Replace with ${REPO_SLUG}
which is a real shell var.

Latent since the first commit; surfaced now that the previous
em-dash fix let the operator reach this branch on re-run.
---
 scripts/provision-ci-deploy-role.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
index a728d6e..8472c24 100755
--- a/scripts/provision-ci-deploy-role.sh
+++ b/scripts/provision-ci-deploy-role.sh
@@ -175,7 +175,7 @@ if aws iam get-role --role-name "$ROLE_NAME" >/dev/null 2>&1; then
   if [ "$DRY_RUN" = "1" ]; then
     log "DRY RUN — would update-assume-role-policy with: $trust_policy"
   else
-    log "Refreshing trust policy (idempotent — overwrites with $sub_pattern shape)"
+    log "Refreshing trust policy (idempotent; sub pattern: repo:${REPO_SLUG}:*)"
     aws iam update-assume-role-policy \
       --role-name "$ROLE_NAME" \
       --policy-document "$trust_policy" \

From b4b38b4d8ceaffa772baf6daa32cd6a655f02ea8 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sat, 23 May 2026 23:27:14 +0800
Subject: [PATCH 05/16] fix(provision-ci-deploy-role): --fix-ssm auto-creates
 instance profile when EC2 has none
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator re-ran with --fix-ssm; auto-remediation hit the third failure
mode: the test broker EC2 has NO IAM instance profile attached at all.
A common state on test brokers spun up by setup-cloud.sh --test — the
broker process authenticates to AWS via static creds in
/etc/agentkeys/broker.env, so an instance profile was never wired up.

Script changes:
- New create_and_associate_ssm_profile() called when DescribeInstances
  reports no IamInstanceProfile.Arn. Idempotent end-to-end:
    1. iam get-role agentkeys-test-broker-ssm → create if missing
       (EC2 service trust policy, AmazonSSMManagedInstanceCore attached).
    2. iam get-instance-profile agentkeys-test-broker-ssm → create if
       missing.
    3. iam get-instance-profile (.Roles[0]) → add-role-to-instance-profile
       if empty; refuse to swap if the profile already holds a different
       role (operator must reconcile manually).
    4. 15s sleep for IAM eventual consistency (per AWS docs).
    5. ec2 describe-iam-instance-profile-associations → associate-iam-instance-profile
       if no existing association.
- attach_ssm_managed_policy_if_missing() now dispatches to
  create_and_associate_ssm_profile() when no profile is present, instead
  of exiting 1 with manual instructions.

Why this is safe to add to a running broker:
- The broker app reads AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY from
  broker.env explicitly; static creds always win over IMDS-served creds.
- Adding an IMDS instance profile cannot reduce capability — only the
  SSM agent (and not the broker app) will read from IMDS.

Runbook fold-back (CLAUDE.md policy): docs/ci-setup.md §7.1 SSM
remediation table now reads '(handled)' for the no-profile row,
describing the dedicated role/profile that gets created.
---
 docs/ci-setup.md                    |   2 +-
 scripts/provision-ci-deploy-role.sh | 134 ++++++++++++++++++++++++++--
 2 files changed, 129 insertions(+), 7 deletions(-)

diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index f2afe87..62bc047 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -412,7 +412,7 @@ The script:
 |---|---|---|
 | Instance profile missing `AmazonSSMManagedInstanceCore` | Attaches the policy, polls for Online | (handled) |
 | Policy already attached, agent process running with stale creds | Polls until agent refreshes (~1-3 min typical) | If poll times out: SSH + `sudo systemctl restart amazon-ssm-agent`, OR `aws ec2 reboot-instances …` |
-| Instance has NO instance profile at all | Exits with `associate-iam-instance-profile` command to run first | Operator runs the printed command, then re-invokes with `--fix-ssm` |
+| Instance has NO instance profile at all | Creates a dedicated `agentkeys-test-broker-ssm` role + instance profile (EC2 trust + `AmazonSSMManagedInstanceCore`) and associates it with the EC2. IMDS surfaces the new creds within ~30s. Safe because the broker's app-layer AWS access uses static creds from `broker.env`, not IMDS — adding IMDS-served creds can only ADD capability for the SSM agent, not displace anything. | (handled) |
 | SSM Agent not installed | Reports state; can't reach the box to install | SSH + `sudo systemctl enable --now amazon-ssm-agent` (Ubuntu 22.04+ ships it) |
 | Private VPC subnet without an SSM VPC endpoint | Reports state | Operator wires the VPC endpoint (unlikely for a public-IP broker, but possible) |
 
diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
index 8472c24..5171eed 100755
--- a/scripts/provision-ci-deploy-role.sh
+++ b/scripts/provision-ci-deploy-role.sh
@@ -271,6 +271,131 @@ fi
 # The auto-attach is safe because the operator is already running as
 # agentkeys-admin (verified above) — they HAVE iam:AttachRolePolicy. Without
 # --fix-ssm the script just reports + exits (no IAM mutation, no surprises).
+# Creates the dedicated SSM-only instance profile + role and associates
+# it with the EC2 instance. Used when the EC2 has NO profile attached at
+# all — common on test brokers spun up by setup-cloud.sh --test (the
+# broker process authenticates via static creds in /etc/agentkeys/broker.env,
+# so the EC2 was never given an instance profile).
+#
+# Why this is safe to add to an already-running broker:
+#   - The broker's app-layer AWS calls use AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
+#     from broker.env explicitly; the static creds take precedence over IMDS.
+#   - Adding an IMDS-served instance profile cannot reduce capability — it only
+#     ADDS a credential source for processes that don't already have static creds
+#     (which on the broker EC2 = the SSM agent and not much else).
+#
+# Names:
+#   - Role:    agentkeys-test-broker-ssm
+#   - Profile: agentkeys-test-broker-ssm (same — conventional)
+#
+# Idempotent: every step is get-* pre-checked. Safe to call repeatedly.
+SSM_INSTANCE_ROLE_NAME="agentkeys-test-broker-ssm"
+SSM_INSTANCE_PROFILE_NAME="agentkeys-test-broker-ssm"
+
+create_and_associate_ssm_profile() {
+  local instance_id="$1"
+  local policy_arn="arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
+
+  # ── Role ──
+  if aws iam get-role --role-name "$SSM_INSTANCE_ROLE_NAME" >/dev/null 2>&1; then
+    skip "role $SSM_INSTANCE_ROLE_NAME already exists"
+  else
+    log "Creating role $SSM_INSTANCE_ROLE_NAME (EC2 trust)"
+    local ec2_trust
+    ec2_trust=$(jq -n '{
+      Version: "2012-10-17",
+      Statement: [{
+        Effect: "Allow",
+        Principal: { Service: "ec2.amazonaws.com" },
+        Action: "sts:AssumeRole"
+      }]
+    }')
+    aws iam create-role \
+      --role-name "$SSM_INSTANCE_ROLE_NAME" \
+      --assume-role-policy-document "$ec2_trust" \
+      --description "Lets the test broker EC2 register with AWS SSM (issue #101)" \
+      >/dev/null \
+      || { warn "create-role failed"; return 1; }
+    ok "role $SSM_INSTANCE_ROLE_NAME created"
+  fi
+
+  # ── Managed policy attach (idempotent — AWS no-ops on re-attach) ──
+  local already_attached
+  already_attached=$(aws iam list-attached-role-policies \
+    --role-name "$SSM_INSTANCE_ROLE_NAME" \
+    --query "AttachedPolicies[?PolicyArn=='$policy_arn'].PolicyArn" \
+    --output text 2>/dev/null || echo "")
+  if [ -n "$already_attached" ]; then
+    skip "AmazonSSMManagedInstanceCore already attached to $SSM_INSTANCE_ROLE_NAME"
+  else
+    aws iam attach-role-policy \
+      --role-name "$SSM_INSTANCE_ROLE_NAME" \
+      --policy-arn "$policy_arn" \
+      || { warn "attach-role-policy failed"; return 1; }
+    ok "AmazonSSMManagedInstanceCore attached to $SSM_INSTANCE_ROLE_NAME"
+  fi
+
+  # ── Instance profile ──
+  if aws iam get-instance-profile --instance-profile-name "$SSM_INSTANCE_PROFILE_NAME" >/dev/null 2>&1; then
+    skip "instance profile $SSM_INSTANCE_PROFILE_NAME already exists"
+  else
+    log "Creating instance profile $SSM_INSTANCE_PROFILE_NAME"
+    aws iam create-instance-profile \
+      --instance-profile-name "$SSM_INSTANCE_PROFILE_NAME" \
+      >/dev/null \
+      || { warn "create-instance-profile failed"; return 1; }
+    ok "instance profile $SSM_INSTANCE_PROFILE_NAME created"
+  fi
+
+  # ── Add role to profile ──
+  local profile_role
+  profile_role=$(aws iam get-instance-profile \
+    --instance-profile-name "$SSM_INSTANCE_PROFILE_NAME" \
+    --query 'InstanceProfile.Roles[0].RoleName' \
+    --output text 2>/dev/null || echo "None")
+  if [ "$profile_role" = "$SSM_INSTANCE_ROLE_NAME" ]; then
+    skip "role already added to instance profile"
+  else
+    if [ "$profile_role" != "None" ] && [ -n "$profile_role" ]; then
+      warn "instance profile $SSM_INSTANCE_PROFILE_NAME currently holds role $profile_role (expected $SSM_INSTANCE_ROLE_NAME)"
+      warn "Refusing to swap — operator should reconcile manually."
+      return 1
+    fi
+    aws iam add-role-to-instance-profile \
+      --instance-profile-name "$SSM_INSTANCE_PROFILE_NAME" \
+      --role-name "$SSM_INSTANCE_ROLE_NAME" \
+      || { warn "add-role-to-instance-profile failed"; return 1; }
+    ok "added $SSM_INSTANCE_ROLE_NAME to instance profile"
+    # IAM is eventually consistent — newly-attached role may not show up in
+    # the EC2 associate API for a few seconds. Brief sleep here is the
+    # documented pattern (AWS docs: "may take up to 30s to propagate").
+    log "Waiting 15s for IAM eventual consistency"
+    sleep 15
+  fi
+
+  # ── Associate profile with EC2 ──
+  local current_profile_arn
+  current_profile_arn=$(aws ec2 describe-iam-instance-profile-associations \
+    --region "$REGION" \
+    --filters "Name=instance-id,Values=$instance_id" \
+    --query 'IamInstanceProfileAssociations[?State==`associated` || State==`associating`].IamInstanceProfile.Arn' \
+    --output text 2>/dev/null || echo "")
+  if [ -n "$current_profile_arn" ] && [ "$current_profile_arn" != "None" ]; then
+    skip "instance already has profile associated: $current_profile_arn"
+  else
+    log "Associating $SSM_INSTANCE_PROFILE_NAME with $instance_id"
+    aws ec2 associate-iam-instance-profile \
+      --region "$REGION" \
+      --instance-id "$instance_id" \
+      --iam-instance-profile "Name=$SSM_INSTANCE_PROFILE_NAME" \
+      >/dev/null \
+      || { warn "associate-iam-instance-profile failed"; return 1; }
+    ok "profile associated; EC2 IMDS will surface new creds within ~30s"
+  fi
+
+  return 0
+}
+
 attach_ssm_managed_policy_if_missing() {
   # Returns 0 if policy was attached or already present; non-zero on hard error.
   local instance_id="$1"
@@ -285,12 +410,9 @@ attach_ssm_managed_policy_if_missing() {
     --output text 2>/dev/null || echo "None")
 
   if [ -z "$profile_arn" ] || [ "$profile_arn" = "None" ] || [ "$profile_arn" = "null" ]; then
-    warn "instance $instance_id has NO IAM instance profile attached — auto-remediation is blocked."
-    warn "Fix: associate an instance profile via:"
-    warn "  aws ec2 associate-iam-instance-profile --instance-id $instance_id \\"
-    warn "    --iam-instance-profile Name=<existing-profile-name> --region $REGION"
-    warn "Then re-run this script with --fix-ssm."
-    return 1
+    log "instance $instance_id has NO IAM instance profile — creating + associating one"
+    create_and_associate_ssm_profile "$instance_id" || return 1
+    return 0
   fi
 
   # Profile ARN shape: arn:aws:iam::ACCT:instance-profile/<NAME>

From e0f92a3803df13276dbc19f510e8581b581ec8db Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sat, 23 May 2026 23:48:34 +0800
Subject: [PATCH 06/16] fix(setup-broker-host): install amazon-ssm-agent at
 bootstrap (issue #101 root cause)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator hit the SSM-agent-not-installed failure mode after --fix-ssm
created + associated the instance profile: 'Unit amazon-ssm-agent.service
not found.' Some Ubuntu AMIs downstream of the AWS Marketplace base ship
without amazon-ssm-agent. Without the agent, no IAM policy on earth lets
the EC2 register with SSM, and the CI auto-deploy (issue #101) hangs.

Per CLAUDE.md "Runbook-fix-fold-back policy": the cure for an
operator-encountered failure is to upgrade the script that owns the
broken step, not the script that surfaces the symptom. setup-broker-host.sh
is the canonical entry point for the broker EC2 — the SSM agent install
belongs there.

Script changes (scripts/setup-broker-host.sh):
- Idempotent SSM-agent install block right after the ec2-instance-connect
  block (same shape: ssm_unit_active() pre-check, install only on miss).
- Two install paths in priority order:
    1. snap install amazon-ssm-agent --classic
       (AWS-blessed on Ubuntu 22.04+; unit:
        snap.amazon-ssm-agent.amazon-ssm-agent.service)
    2. .deb fallback from
       https://s3.$REGION.amazonaws.com/amazon-ssm-$REGION/latest/
       (older / non-snap images; unit: amazon-ssm-agent.service)
- Both paths converge on ssm_unit_active() returning true; subsequent
  --upgrade re-runs skip after that.

Runbook fold-back (docs/ci-setup.md §7.1):
- 'SSM Agent not installed' row of the remediation table now points
  operators at setup-broker-host.sh --test --yes for the structural fix,
  with a snap one-liner for one-shot manual recovery.

Reported by operator after re-running provision-ci-deploy-role.sh
--fix-ssm: the script created the profile + associated it, but the
3-min poll timed out because no SSM agent was running on the EC2.
---
 docs/ci-setup.md             |  2 +-
 scripts/setup-broker-host.sh | 61 ++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index 62bc047..5caeaca 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -413,7 +413,7 @@ The script:
 | Instance profile missing `AmazonSSMManagedInstanceCore` | Attaches the policy, polls for Online | (handled) |
 | Policy already attached, agent process running with stale creds | Polls until agent refreshes (~1-3 min typical) | If poll times out: SSH + `sudo systemctl restart amazon-ssm-agent`, OR `aws ec2 reboot-instances …` |
 | Instance has NO instance profile at all | Creates a dedicated `agentkeys-test-broker-ssm` role + instance profile (EC2 trust + `AmazonSSMManagedInstanceCore`) and associates it with the EC2. IMDS surfaces the new creds within ~30s. Safe because the broker's app-layer AWS access uses static creds from `broker.env`, not IMDS — adding IMDS-served creds can only ADD capability for the SSM agent, not displace anything. | (handled) |
-| SSM Agent not installed | Reports state; can't reach the box to install | SSH + `sudo systemctl enable --now amazon-ssm-agent` (Ubuntu 22.04+ ships it) |
+| SSM Agent not installed (no `amazon-ssm-agent` unit) | Reports state; can't reach the box to install (operator's laptop has no SSH-into-EC2 capability from the provision script) | Re-run `bash scripts/setup-broker-host.sh --test --yes` on the EC2 — it now installs `amazon-ssm-agent` (snap preferred, .deb fallback) as part of broker bootstrap. One-shot manual recovery if you don't want to re-run the full setup: `ssh test-broker 'sudo snap install amazon-ssm-agent --classic && sudo systemctl enable --now snap.amazon-ssm-agent.amazon-ssm-agent.service'` |
 | Private VPC subnet without an SSM VPC endpoint | Reports state | Operator wires the VPC endpoint (unlikely for a public-IP broker, but possible) |
 
 Re-running the script after any of the operator-side fixes is safe (idempotent — every step is `get-*` pre-checked before any mutation).
diff --git a/scripts/setup-broker-host.sh b/scripts/setup-broker-host.sh
index 44b471e..d82a81b 100755
--- a/scripts/setup-broker-host.sh
+++ b/scripts/setup-broker-host.sh
@@ -790,6 +790,67 @@ EOF
   sudo systemctl reload ssh 2>/dev/null || sudo systemctl reload sshd 2>/dev/null || warn "sshd reload failed — restart manually"
 fi
 
+# ─── AWS SSM Agent (idempotent install) ───────────────────────────────────────
+# Required by harness-ci.yml deploy-test-broker job (issue #101): the GitHub
+# Actions workflow drives `setup-broker-host.sh --test --yes` on the EC2 via
+# `aws ssm send-command`. That path needs amazon-ssm-agent installed AND
+# active here.
+#
+# Some Ubuntu AMIs (including some Canonical / Multipass-derived images
+# downstream of the AWS Marketplace base) ship without amazon-ssm-agent.
+# When that's the case, `systemctl restart amazon-ssm-agent` errors with
+# "Unit amazon-ssm-agent.service not found" — the failure mode the operator
+# hit on 2026-05-23. Fold the install into broker-host bootstrap so every
+# new test broker is SSM-ready out of the box.
+#
+# Two install paths, in priority order:
+#   1) snap (AWS-blessed on Ubuntu 22.04+; service: snap.amazon-ssm-agent.amazon-ssm-agent.service)
+#   2) deb fallback (older / non-snap images; service: amazon-ssm-agent.service)
+#
+# Both produce a unit named `amazon-ssm-agent` in our systemctl alias check
+# below, so subsequent `setup-broker-host.sh --upgrade` re-runs skip.
+ssm_unit_active() {
+  systemctl is-active snap.amazon-ssm-agent.amazon-ssm-agent.service >/dev/null 2>&1 \
+    || systemctl is-active amazon-ssm-agent.service >/dev/null 2>&1
+}
+
+if ssm_unit_active; then
+  log "amazon-ssm-agent already active — skipping install"
+else
+  log "Installing amazon-ssm-agent (required for CI auto-deploy per issue #101)"
+  if command -v snap >/dev/null 2>&1; then
+    # snap install is idempotent — re-running on an already-installed agent
+    # exits 0 with a "snap already installed" message.
+    sudo snap install amazon-ssm-agent --classic >/dev/null \
+      || warn "snap install amazon-ssm-agent failed — falling back to deb"
+    sudo systemctl enable --now snap.amazon-ssm-agent.amazon-ssm-agent.service \
+      >/dev/null 2>&1 || true
+  fi
+
+  if ! ssm_unit_active; then
+    # Snap path didn't take — fall back to the .deb from AWS.
+    REGION_FOR_SSM="${REGION:-us-east-1}"
+    SSM_DEB_URL="https://s3.${REGION_FOR_SSM}.amazonaws.com/amazon-ssm-${REGION_FOR_SSM}/latest/debian_amd64/amazon-ssm-agent.deb"
+    SSM_TMP_DEB=$(mktemp /tmp/amazon-ssm-agent.XXXXXX.deb)
+    if curl -sSfL "$SSM_DEB_URL" -o "$SSM_TMP_DEB"; then
+      sudo dpkg -i "$SSM_TMP_DEB" >/dev/null \
+        || warn "dpkg install amazon-ssm-agent.deb failed"
+      sudo systemctl enable --now amazon-ssm-agent.service \
+        >/dev/null 2>&1 || warn "amazon-ssm-agent enable/start failed"
+    else
+      warn "could not download amazon-ssm-agent.deb from $SSM_DEB_URL"
+    fi
+    rm -f "$SSM_TMP_DEB"
+  fi
+
+  if ssm_unit_active; then
+    log "amazon-ssm-agent installed and active"
+  else
+    warn "amazon-ssm-agent install did not produce an active unit — CI auto-deploy will fail until this is resolved"
+    warn "Manual recovery: sudo snap install amazon-ssm-agent --classic && sudo systemctl enable --now snap.amazon-ssm-agent.amazon-ssm-agent.service"
+  fi
+fi
+
 if [[ "$CRED_MODE" == "profile" ]]; then
   sudo install -d -m 0700 -o agentkeys -g agentkeys /var/lib/agentkeys/.aws
   if [[ ! -f /var/lib/agentkeys/.aws/credentials ]]; then

From afb6e02b381af94c8881a88c0594b8fd0d55625d Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 00:03:49 +0800
Subject: [PATCH 07/16] fix(provision-ci-deploy-role): distinguish AccessDenied
 from instance-not-registered
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The SSM verify block has been masking caller-permission gaps as
'instance not registered with SSM' (state=None) because of the
2>/dev/null || echo None silent fallback. Result: 4 rounds of phantom
remediation against the EC2 (em-dash fix, --fix-ssm flag, auto-create
instance profile, install amazon-ssm-agent on the EC2) — none of which
were addressing the actual cause, which was that the operator's admin
group lacks ssm:DescribeInstanceInformation.

Fix:
- Capture stderr into a tmpfile.
- Grep for 'AccessDenied' specifically; on hit, die() with the exact
  one-liner the operator needs to attach AmazonSSMReadOnlyAccess to
  the AgentKeyAdmin group.
- Empty stdout (no AccessDenied in stderr) = genuinely not registered;
  proceeds to the existing remediation paths.

Diagnosed by running aws ssm describe-instance-information directly
against i-0135a8b2c53d14941 as agentkeys-admin and seeing the
AccessDenied that the script had been swallowing all along.

Lesson (CLAUDE.md fold-back): when a sanity check uses 2>/dev/null,
make sure the discarded stderr can't be the answer to the question
the check is asking.
---
 scripts/provision-ci-deploy-role.sh | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
index 5171eed..7436290 100755
--- a/scripts/provision-ci-deploy-role.sh
+++ b/scripts/provision-ci-deploy-role.sh
@@ -470,11 +470,31 @@ log "Verify SSM agent reachable: $TEST_BROKER_INSTANCE_ID"
 if [ "$DRY_RUN" = "1" ]; then
   log "DRY RUN — would query ssm describe-instance-information for $TEST_BROKER_INSTANCE_ID"
 else
+  # Capture stderr separately so AccessDenied doesn't get silently mapped to
+  # "None" (instance-not-registered). They're distinct failure modes:
+  #   - AccessDenied → caller (agentkeys-admin) lacks ssm:DescribeInstanceInformation.
+  #     Fix the caller's IAM, not the EC2.
+  #   - Empty/None → instance genuinely not registered with SSM. Remediate the EC2.
+  ssm_stderr=$(mktemp /tmp/ssm-describe.XXXXXX.err)
   ssm_state=$(aws ssm describe-instance-information \
     --region "$REGION" \
     --filters "Key=InstanceIds,Values=$TEST_BROKER_INSTANCE_ID" \
     --query 'InstanceInformationList[0].PingStatus' \
-    --output text 2>/dev/null || echo "None")
+    --output text 2>"$ssm_stderr" || echo "")
+  if grep -q "AccessDenied" "$ssm_stderr"; then
+    rm -f "$ssm_stderr"
+    die "caller lacks ssm:DescribeInstanceInformation. This is the upstream
+of every 'PingStatus=None' loop — without read perms, the script cannot tell
+'instance not registered with SSM' from 'I have no permission to look'. Fix
+by attaching AmazonSSMReadOnlyAccess to the admin group ONCE:
+    aws iam attach-group-policy \\
+      --group-name AgentKeyAdmin \\
+      --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
+Then re-run this script."
+  fi
+  # Empty state = no record found (genuinely not registered).
+  [ -z "$ssm_state" ] && ssm_state="None"
+  rm -f "$ssm_stderr"
 
   case "$ssm_state" in
     Online)

From 7fae303ef0154843b37dcb951905efd2541604e6 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 00:12:34 +0800
Subject: [PATCH 08/16] =?UTF-8?q?docs(ci-setup=20=C2=A77.3):=20require=20-?=
 =?UTF-8?q?-ref=20on=20pre-merge=20gh=20workflow=20run=20dispatch?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator hit 'HTTP 422: Unexpected inputs provided: [force_deploy_broker]'
on first dry-run dispatch. Root cause is GHA's 'workflows are registered
from the default branch' rule — same trap already documented in §6
('Common first-run failure modes'), but I didn't repeat it in §7.3, so
the operator hit it again.

Fix:
- §7.3 dispatch command now includes --ref <pr-branch>.
- Distinguish pre-merge (--ref required, input lives on PR branch) from
  post-merge (--ref optional, input is on main).
- Show the git rev-parse trick to look up the local branch name.

Per CLAUDE.md runbook-fix-fold-back: every operator-encountered
failure makes the runbook strictly more robust.
---
 docs/ci-setup.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index 5caeaca..9d72336 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -436,14 +436,21 @@ gh secret set TEST_BROKER_INSTANCE_ID  --repo litentry/agentKeys --body "$TEST_B
 
 #### 7.3 Dry-run validate
 
-Trigger the workflow manually with `force_deploy_broker=true` so the deploy fires regardless of whether the latest commit touched broker paths:
+Trigger the workflow manually with `force_deploy_broker=true` so the deploy fires regardless of whether the latest commit touched broker paths.
+
+**Pre-merge — `--ref` is required.** `gh workflow run` reads the workflow definition from the *default branch* (`main`) unless you tell it otherwise. Since the `force_deploy_broker` input lives on the PR branch, dispatching without `--ref` fails with `HTTP 422: Unexpected inputs provided: ["force_deploy_broker"]`. Pass `--ref` so GHA reads the workflow YAML (and its inputs) from the PR branch instead:
 
 ```bash
 gh workflow run harness-ci.yml --repo litentry/agentKeys \
+  --ref claude/adoring-bell-1b9ca8 \
   --field stage=1 \
   --field force_deploy_broker=true
 ```
 
+Replace `claude/adoring-bell-1b9ca8` with your actual PR branch name (`git rev-parse --abbrev-ref HEAD` if you're on it locally).
+
+**Post-merge — `--ref` is optional.** Once this PR is on `main`, dispatching without `--ref` will work because the input is part of the default-branch workflow definition. (The `--ref` form still works and lets you target any branch.)
+
 Then in the run logs:
 
 - `deploy-test-broker` should show `SSM agent online on i-…` (sanity check passed).

From f666e756315e2eaf7891944985bc0d8df85adeb2 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 00:21:41 +0800
Subject: [PATCH 09/16] fix(ci): grant ssm:DescribeInstanceInformation to
 deploy role + distinguish AccessDenied in workflow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Deploy-test-broker's sanity-check step failed in the first dry-run with
'i-XXX is not SSM-managed'. Root cause: same swallowed-stderr trap as
the local script, now in the workflow. The deploy role's inline policy
granted SendCommand + GetCommandInvocation + ListCommandInvocations,
but NOT DescribeInstanceInformation. AccessDenied was silently mapped
to 'None', which the workflow interpreted as 'not SSM-managed'.

Three fixes:
1. provision-ci-deploy-role.sh: PollCommandStatus statement now includes
   ssm:DescribeInstanceInformation. put-role-policy is idempotent so
   re-running the script refreshes the existing role's inline policy
   in place.
2. harness-ci.yml sanity-check: captures stderr separately, greps for
   AccessDenied, prints actionable remediation. Empty state (no
   AccessDenied) still means genuinely-not-registered.
3. docs/ci-setup.md §7.1: lists DescribeInstanceInformation in the
   inline-policy bullet + notes 'already provisioned? re-run; idempotent'.

Per CLAUDE.md runbook-fix-fold-back: every operator-encountered failure
makes the runbook + scripts strictly more robust. The defensive workflow
step catches this in the future if the policy template ever drifts.
---
 .github/workflows/harness-ci.yml    | 30 +++++++++++++++++++++++------
 docs/ci-setup.md                    |  4 +++-
 scripts/provision-ci-deploy-role.sh |  3 ++-
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml
index 891419f..c1a7ae5 100644
--- a/.github/workflows/harness-ci.yml
+++ b/.github/workflows/harness-ci.yml
@@ -272,25 +272,43 @@ jobs:
           role-session-name: gh-deploy-${{ github.run_id }}
 
       - name: Sanity-check the test broker EC2 is SSM-managed
-        # Fail fast with a clear remediation path if the instance isn't
-        # registered with SSM (instance profile missing AmazonSSMManagedInstanceCore,
-        # agent down, or wrong region).
+        # Fail fast with a clear remediation path. Three failure modes are
+        # distinguished:
+        #   - AccessDenied → deploy role lacks ssm:DescribeInstanceInformation.
+        #     Operator re-runs provision-ci-deploy-role.sh on their laptop;
+        #     the inline policy is idempotently refreshed to include it.
+        #   - Empty/None  → instance genuinely not registered (no agent, no
+        #     profile, wrong region). Operator SSH-debugs or re-runs
+        #     setup-broker-host.sh which auto-installs amazon-ssm-agent.
+        #   - Other state → unexpected; fail loud with the value for triage.
         env:
           REGION: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }}
           INSTANCE_ID: ${{ secrets.TEST_BROKER_INSTANCE_ID }}
         run: |
           set -euo pipefail
+          stderr_file=$(mktemp)
           state=$(aws ssm describe-instance-information \
             --region "$REGION" \
             --filters "Key=InstanceIds,Values=$INSTANCE_ID" \
             --query 'InstanceInformationList[0].PingStatus' \
-            --output text 2>/dev/null || echo "None")
+            --output text 2>"$stderr_file" || echo "")
+          if grep -q "AccessDenied" "$stderr_file"; then
+            echo "::error::Deploy role lacks ssm:DescribeInstanceInformation."
+            echo "::error::Fix: re-run scripts/provision-ci-deploy-role.sh on the operator laptop —"
+            echo "::error::the inline policy is now refreshed with the missing perm (idempotent)."
+            rm -f "$stderr_file"
+            exit 1
+          fi
+          rm -f "$stderr_file"
+          [ -z "$state" ] && state="None"
           case "$state" in
             Online)
               echo "::notice::SSM agent online on $INSTANCE_ID"
               ;;
-            None|"")
-              echo "::error::$INSTANCE_ID is not SSM-managed. Run scripts/provision-ci-deploy-role.sh to diagnose."
+            None)
+              echo "::error::$INSTANCE_ID is not SSM-managed (state=$state)."
+              echo "::error::SSH into the broker EC2 and run scripts/setup-broker-host.sh --test --yes —"
+              echo "::error::it auto-installs amazon-ssm-agent. See docs/ci-setup.md §7.1."
               exit 1
               ;;
             *)
diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index 9d72336..b941b40 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -402,8 +402,10 @@ The script:
 - Creates / refreshes the `github-actions-agentkeys-deploy` IAM role with a federated trust policy on the GitHub Actions OIDC provider, scoped to `repo:litentry/agentKeys:*` (any branch in this repo can trigger; the workflow's path filter + preflight gate further restrict when the role is actually used).
 - Attaches an inline policy `agentkeys-ci-deploy-ssm` with:
   - `ssm:SendCommand` on `document/AWS-RunShellScript` + the one instance ARN (so even if the role's session creds leaked, the worst a third party can do is re-run setup-broker-host.sh on the test EC2 — a destructive op there is `terraform apply`-style: idempotent, recoverable, and contained to the test environment).
-  - `ssm:GetCommandInvocation` / `ssm:ListCommandInvocations` for status polling.
+  - `ssm:GetCommandInvocation` / `ssm:ListCommandInvocations` / `ssm:DescribeInstanceInformation` for status polling + the workflow's pre-deploy sanity check.
   - `ec2:DescribeInstances` scoped to the one instance ID, for the workflow's pre-deploy sanity check.
+
+> Already provisioned the role before `ssm:DescribeInstanceInformation` was added to the policy template? Re-run the provisioning script. `put-role-policy` is idempotent — it overwrites the inline policy with the current source-of-truth shape, picking up any added permissions.
 - Verifies the test EC2 is registered with SSM (`PingStatus = Online`). With `--fix-ssm`, auto-remediates the common "instance profile is missing AmazonSSMManagedInstanceCore" case by attaching the policy and polling for up to 3 min for the SSM agent to refresh its creds. Without `--fix-ssm`, just reports the failure with manual fix instructions.
 
 **SSM remediation modes (what `--fix-ssm` covers, what it doesn't):**
diff --git a/scripts/provision-ci-deploy-role.sh b/scripts/provision-ci-deploy-role.sh
index 7436290..66b3475 100755
--- a/scripts/provision-ci-deploy-role.sh
+++ b/scripts/provision-ci-deploy-role.sh
@@ -226,7 +226,8 @@ inline_policy=$(jq -n \
         Effect: "Allow",
         Action: [
           "ssm:GetCommandInvocation",
-          "ssm:ListCommandInvocations"
+          "ssm:ListCommandInvocations",
+          "ssm:DescribeInstanceInformation"
         ],
         Resource: "*"
       },

From e1b4f408d9153d00d1f37eb16780abfd312481d1 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 00:45:09 +0800
Subject: [PATCH 10/16] fix(deploy-test-broker): auto-discover agentKeys repo
 path on EC2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Deploy job's SSM script failed with 'cd: can't cd to /home/ubuntu/agentKeys'
on the operator's test broker. The hardcoded path assumed the
ubuntu-user clone layout, but the operator's box has the repo at a
different location (the broker EC2 may have been bootstrapped from a
non-default user or path).

Fix:
- Auto-discover loop tries TEST_BROKER_REPO_DIR override (new optional
  secret), then 7 common candidates (/home/ubuntu/agentKeys, /opt/agentkeys,
  /srv/agentkeys, /root/agentKeys, etc.). First candidate containing
  scripts/setup-broker-host.sh wins.
- stat -c '%U' to discover the actual tree owner instead of hardcoding
  'ubuntu' — covers the agentkeys / root / custom-user cases.
- Fail loud with the override secret name if no candidate matches.

Docs (docs/ci-setup.md §7.2): TEST_BROKER_REPO_DIR added to the secrets
table with a note that it's optional + only needed when auto-discover
prints 'could not locate'.

Diagnosed via SSM command stderr after the upstream AccessDenied + perm
gaps were resolved earlier in this PR.
---
 .github/workflows/harness-ci.yml | 37 +++++++++++++++++++++++++++-----
 docs/ci-setup.md                 |  1 +
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml
index c1a7ae5..b8d651b 100644
--- a/.github/workflows/harness-ci.yml
+++ b/.github/workflows/harness-ci.yml
@@ -344,19 +344,46 @@ jobs:
         env:
           REGION: ${{ secrets.TEST_AWS_REGION || 'us-east-1' }}
           INSTANCE_ID: ${{ secrets.TEST_BROKER_INSTANCE_ID }}
+          # Operator-pinnable override; the auto-discover loop below covers the
+          # common candidates when this isn't set.
+          REPO_DIR_OVERRIDE: ${{ secrets.TEST_BROKER_REPO_DIR }}
         run: |
           set -euo pipefail
           # Compose the remote shell script. `$DEPLOY_REF` is interpolated by
           # the runner's shell (GHA env block makes it visible here); the
           # remote SSM-driven shell sees the literal branch name. The remote
           # shell runs as root (SSM-default on Ubuntu AMIs); git ops use
-          # `sudo -u ubuntu` so the working tree stays ubuntu-owned.
+          # `sudo -u <owner>` so the working tree stays owned by whoever
+          # originally cloned it (typically ubuntu, sometimes agentkeys / root).
+          #
+          # Repo location auto-discovery: try TEST_BROKER_REPO_DIR override
+          # first, then common candidates. Fail fast with a clear remediation
+          # path if no candidate has the repo. Avoids the 'cd: can\'t cd to
+          # /home/ubuntu/agentKeys' failure mode when the operator cloned to
+          # a non-default path.
           read -r -d '' deploy_script <<EOF || true
           set -euo pipefail
-          cd /home/ubuntu/agentKeys
-          sudo -u ubuntu git fetch --prune origin
-          sudo -u ubuntu git checkout "$DEPLOY_REF" || sudo -u ubuntu git checkout "origin/$DEPLOY_REF"
-          sudo -u ubuntu git pull --ff-only origin "$DEPLOY_REF" 2>/dev/null || true
+          REPO_DIR=""
+          for candidate in "$REPO_DIR_OVERRIDE" /home/ubuntu/agentKeys /home/ubuntu/agentkeys /home/ubuntu/agentkey /opt/agentkeys /srv/agentkeys /root/agentKeys /root/agentkeys; do
+            [ -n "\$candidate" ] || continue
+            if [ -f "\$candidate/scripts/setup-broker-host.sh" ]; then
+              REPO_DIR=\$candidate
+              break
+            fi
+          done
+          if [ -z "\$REPO_DIR" ]; then
+            echo "could not locate the agentKeys checkout on this EC2" >&2
+            echo "candidates tried: \$REPO_DIR_OVERRIDE /home/ubuntu/agentKeys ..." >&2
+            echo "Fix: pin the path via the TEST_BROKER_REPO_DIR repo secret." >&2
+            exit 2
+          fi
+          echo "using repo at \$REPO_DIR"
+          REPO_OWNER=\$(stat -c '%U' "\$REPO_DIR")
+          echo "tree is owned by \$REPO_OWNER"
+          cd "\$REPO_DIR"
+          sudo -u "\$REPO_OWNER" git fetch --prune origin
+          sudo -u "\$REPO_OWNER" git checkout "$DEPLOY_REF" || sudo -u "\$REPO_OWNER" git checkout "origin/$DEPLOY_REF"
+          sudo -u "\$REPO_OWNER" git pull --ff-only origin "$DEPLOY_REF" 2>/dev/null || true
           bash scripts/setup-broker-host.sh --test --yes --non-interactive
           EOF
 
diff --git a/docs/ci-setup.md b/docs/ci-setup.md
index b941b40..0e270bc 100644
--- a/docs/ci-setup.md
+++ b/docs/ci-setup.md
@@ -435,6 +435,7 @@ gh secret set TEST_BROKER_INSTANCE_ID  --repo litentry/agentKeys --body "$TEST_B
 |---|---|
 | `OIDC_AWS_ROLE_ARN_DEPLOY` | ARN of `github-actions-agentkeys-deploy` — assumed by the `deploy-test-broker` job via GitHub Actions OIDC. |
 | `TEST_BROKER_INSTANCE_ID` | EC2 instance ID (`i-…`) hosting `test-broker.${ZONE}`. The deploy role's inline policy is scoped to *this single instance*. |
+| `TEST_BROKER_REPO_DIR` | **Optional.** Absolute path of the agentKeys git checkout on the EC2 (e.g. `/home/ubuntu/agentKeys`). The deploy workflow auto-discovers across common candidates (`/home/ubuntu/agentKeys`, `/home/ubuntu/agentkeys`, `/opt/agentkeys`, `/srv/agentkeys`, `/root/agentKeys`), so this only needs to be set when the operator cloned to a non-standard path and the workflow's auto-discover step prints `could not locate the agentKeys checkout`. |
 
 #### 7.3 Dry-run validate
 

From 72b43c89411c7638baa4a02e63806e71873f7349 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 01:04:44 +0800
Subject: [PATCH 11/16] fix(deploy-test-broker): add /home/agentkey paths +
 safe-default REPO_DIR_OVERRIDE under set -u
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two regressions caught by the second dispatch on the operator's box:

1. Auto-discover didn't find the repo. Operator confirmed the checkout
   lives at /home/agentkey/agentKeys — not in my original 8 candidates.
   Added /home/agentkey/agentKeys, /home/agentkey/agentkeys, and
   /home/agentkeys/agentKeys (covering the variations of the broker app
   user name).

2. Diagnostic echo referenced \$REPO_DIR_OVERRIDE under remote-shell
   set -u, which fires 'parameter not set' when the secret is unset.
   Fixed with a one-line default at the top of the remote script:
       REPO_DIR_OVERRIDE="${REPO_DIR_OVERRIDE:-...}"
   That makes subsequent references safe under set -u while still
   honoring an operator-set override.
---
 .github/workflows/harness-ci.yml | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml
index b8d651b..b3969c8 100644
--- a/.github/workflows/harness-ci.yml
+++ b/.github/workflows/harness-ci.yml
@@ -363,8 +363,9 @@ jobs:
           # a non-default path.
           read -r -d '' deploy_script <<EOF || true
           set -euo pipefail
+          REPO_DIR_OVERRIDE="\${REPO_DIR_OVERRIDE:-$REPO_DIR_OVERRIDE}"
           REPO_DIR=""
-          for candidate in "$REPO_DIR_OVERRIDE" /home/ubuntu/agentKeys /home/ubuntu/agentkeys /home/ubuntu/agentkey /opt/agentkeys /srv/agentkeys /root/agentKeys /root/agentkeys; do
+          for candidate in "\$REPO_DIR_OVERRIDE" /home/ubuntu/agentKeys /home/ubuntu/agentkeys /home/ubuntu/agentkey /home/agentkey/agentKeys /home/agentkey/agentkeys /home/agentkeys/agentKeys /opt/agentkeys /srv/agentkeys /root/agentKeys /root/agentkeys; do
             [ -n "\$candidate" ] || continue
             if [ -f "\$candidate/scripts/setup-broker-host.sh" ]; then
               REPO_DIR=\$candidate
@@ -373,7 +374,7 @@ jobs:
           done
           if [ -z "\$REPO_DIR" ]; then
             echo "could not locate the agentKeys checkout on this EC2" >&2
-            echo "candidates tried: \$REPO_DIR_OVERRIDE /home/ubuntu/agentKeys ..." >&2
+            echo "candidates tried: \$REPO_DIR_OVERRIDE /home/ubuntu/agentKeys /home/agentkey/agentKeys /opt/agentkeys /srv/agentkeys /root/agentKeys etc." >&2
             echo "Fix: pin the path via the TEST_BROKER_REPO_DIR repo secret." >&2
             exit 2
           fi

From 0e8afde1c76cc168787a69b6217ba14771408560 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 01:09:34 +0800
Subject: [PATCH 12/16] fix(setup-broker-host): default HOME so SSM-driven
 invocations work under set -u
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After the auto-discover + repo path fix landed, the SSM-driven deploy
got past clone, fetch, summary, apt deps, and rustup install — then hit
'HOME: unbound variable' at the rustup-env source line. SSM-driven
remote shells (AWS-RunShellScript document) don't export HOME for the
default user; setup-broker-host.sh uses 'set -euo pipefail', so the
unset reference aborts.

Fix: 'export HOME=${HOME:-$(getent passwd $(id -u) | cut -d: -f6)}'
right after 'set -euo pipefail'. Resolves the running user's home dir
from /etc/passwd when the env var is missing — portable across interactive
ssh sessions (HOME already set) and SSM SendCommand (HOME unset).

Same root cause family as the earlier IamInstanceProfile + agent-install
fixes: bootstrap paths assume an interactive operator shell, but the CI
auto-deploy path is the structural test for those assumptions.
---
 scripts/setup-broker-host.sh | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/scripts/setup-broker-host.sh b/scripts/setup-broker-host.sh
index d82a81b..166d01c 100755
--- a/scripts/setup-broker-host.sh
+++ b/scripts/setup-broker-host.sh
@@ -21,6 +21,13 @@
 
 set -euo pipefail
 
+# AWS SSM-driven invocations (harness-ci.yml deploy-test-broker, issue #101)
+# don't export HOME on the remote shell. Under set -u that hits 'HOME: unbound
+# variable' at the rustup `source "$HOME/.cargo/env"` line. Resolve HOME from
+# /etc/passwd if missing so the script is callable from both interactive ssh
+# sessions and SSM SendCommand.
+export HOME="${HOME:-$(getent passwd "$(id -u)" | cut -d: -f6)}"
+
 REPO_ROOT="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/.." && pwd)"
 
 # ─── Defaults ─────────────────────────────────────────────────────────────────

From 143f1df42cc8e32f9dad4cc12b87a1fefc199469 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 01:28:45 +0800
Subject: [PATCH 13/16] fix(harness): heima-test-deployer nonce contention
 (codex adversarial findings)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex adversarial review (PR #102) confirmed the harness-e2e failure
'replacement transaction underpriced' is NOT caused by this PR. The
broker/workers have no chain-write code paths reachable from the shipped
feature set: audit-evm is feature-gated (Phase C, unshipped), the
worker-audit's auto-flush only LOGS 'ready for on-chain appendRoot'
without submitting txs, and setup-broker-host.sh has zero deployer-key
access.

The actual mechanism is harness-side nonce contention: concurrent
harness-e2e runs (PR branch + workflow_dispatch + re-triggers) share
ONE Heima test deployer wallet, and 'cast send' without --nonce defaults
to 'latest' nonce derivation — which collides with pending mempool txs
from a prior run.

Two-layer fix:

1. .github/workflows/harness-ci.yml — second concurrency group on
   harness-e2e, scoped to 'heima-test-deployer-nonce' (not the ref),
   with cancel-in-progress: false so queued runs wait rather than
   cancel. The outer 'harness-ci-${{ github.ref }}' lock only
   serializes per-branch; this one serializes globally for the shared
   deployer wallet.

2. scripts/heima-fund-account.sh + scripts/heima-agent-create.sh — pass
   '--nonce $(cast nonce ADDR --block pending)' so cast computes the
   nonce against the PENDING block, not the latest confirmed. This
   defends against a stuck mempool tx that survives the previous run's
   exit (concurrency lock alone doesn't help — the tx is in the
   mempool, not in another job).

Both layers also add a specific error message when the underpriced
case fires, telling the operator to wait ~1min for the stuck tx to
confirm or drop.

Codex investigation log (1.4M tokens): scanned setup-broker-host.sh,
broker-server, all 4 workers, env files, harness scripts, and workflow
YAML. Found zero chain-write paths reachable from the deployed broker
binary. Specific evidence cited in codex's response (crates/agentkeys-
broker-server/src/handlers/cap.rs uses eth_call reads only; worker-audit
main.rs:71 logs intent but doesn't submit; broker.env has no deployer
key).
---
 .github/workflows/harness-ci.yml | 15 +++++++++++++++
 scripts/heima-agent-create.sh    | 16 +++++++++++++++-
 scripts/heima-fund-account.sh    | 25 ++++++++++++++++++++++++-
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml
index b3969c8..a5add53 100644
--- a/.github/workflows/harness-ci.yml
+++ b/.github/workflows/harness-ci.yml
@@ -459,6 +459,21 @@ jobs:
   harness-e2e:
     name: harness/v2-stage*-demo.sh on Heima mainnet (test deployer)
     needs: [preflight, deploy-test-broker]
+    # Codex adversarial review (PR #102) confirmed: the harness's chain-mutating
+    # scripts (heima-fund-account.sh + heima-agent-create.sh) share ONE Heima
+    # test deployer wallet. The outer `concurrency: harness-ci-${{ github.ref }}`
+    # only cancels in-flight runs on the SAME ref — concurrent runs on DIFFERENT
+    # refs (PR branch + manual dispatch, two PRs, etc.) share the deployer and
+    # collide on nonce in the Heima mempool, surfacing as
+    # `replacement transaction underpriced`.
+    #
+    # This second concurrency group, scoped to the deployer (not the ref),
+    # serializes harness-e2e runs globally. `cancel-in-progress: false` queues
+    # subsequent runs instead of cancelling them — so a long-running harness
+    # doesn't lose work to a newer push.
+    concurrency:
+      group: heima-test-deployer-nonce
+      cancel-in-progress: false
     # Run when:
     #   - preflight gates green (test infra is set up)
     #   - AND either:
diff --git a/scripts/heima-agent-create.sh b/scripts/heima-agent-create.sh
index b8c1859..4848b60 100755
--- a/scripts/heima-agent-create.sh
+++ b/scripts/heima-agent-create.sh
@@ -200,13 +200,27 @@ if [ "$DRY_RUN" = "1" ]; then
   exit 0
 fi
 
+# Resolve PENDING nonce for the master wallet — same protection as the
+# heima-fund-account.sh fix in PR #102. If the prior run's registerAgentDevice
+# tx is still in the mempool, the default `latest` nonce derivation collides.
+PENDING_NONCE=$(cast nonce "$MASTER_ADDR" --rpc-url "$RPC_HTTP" --block pending 2>/dev/null || echo "")
+if [ -n "$PENDING_NONCE" ]; then
+  log "pending nonce for master = $PENDING_NONCE"
+  CAST_ARGS+=(--nonce "$PENDING_NONCE")
+fi
+
 log "Submitting registerAgentDevice tx via cast send …"
 set +e
 CAST_OUT=$(cast "${CAST_ARGS[@]}" 2>&1)
 CAST_RC=$?
 set -e
 if [ "$CAST_RC" != "0" ]; then
-  echo "    cast send FAILED (exit $CAST_RC). Output:" >&2
+  if printf '%s\n' "$CAST_OUT" | grep -qi "replacement transaction underpriced"; then
+    echo "    cast send FAILED: prior tx with same nonce is pending in Heima mempool." >&2
+    echo "    Wait ~1 minute and re-run. Output:" >&2
+  else
+    echo "    cast send FAILED (exit $CAST_RC). Output:" >&2
+  fi
   echo "$CAST_OUT" >&2
   exit 1
 fi
diff --git a/scripts/heima-fund-account.sh b/scripts/heima-fund-account.sh
index 55fd01a..aaa102d 100755
--- a/scripts/heima-fund-account.sh
+++ b/scripts/heima-fund-account.sh
@@ -125,15 +125,38 @@ if [ "$DRY_RUN" = "1" ]; then
   exit 0
 fi
 
+# Resolve PENDING nonce (defends against the race where a prior run's funding
+# tx is still in the mempool — cast's default `latest` nonce derivation would
+# collide with the stuck pending tx, surfacing as
+# `replacement transaction underpriced`. PR #102 / codex adversarial review.)
+log "Resolving pending nonce for $DEPLOYER_ADDR"
+PENDING_NONCE=$(cast nonce "$DEPLOYER_ADDR" --rpc-url "$RPC_HTTP" --block pending 2>/dev/null || echo "")
+if [ -z "$PENDING_NONCE" ]; then
+  warn "could not resolve pending nonce — proceeding without explicit --nonce (cast will use latest)"
+  NONCE_ARGS=()
+else
+  ok "pending nonce = $PENDING_NONCE"
+  NONCE_ARGS=(--nonce "$PENDING_NONCE")
+fi
+
 log "Submitting transfer via cast send …"
 set +e
 SEND_OUT=$(cast send "$TO_ADDR" --value "$AMOUNT_WEI" \
   --rpc-url "$RPC_HTTP" --chain-id "$LIVE_CHAIN_ID" \
+  "${NONCE_ARGS[@]}" \
   --private-key "$DEPLOYER_KEY" 2>&1)
 SEND_RC=$?
 set -e
 if [ "$SEND_RC" != "0" ]; then
-  echo "    cast send FAILED (exit $SEND_RC). Output:" >&2
+  # Surface the underpriced-replacement case with a specific remediation —
+  # the broader workflow-level concurrency lock SHOULD prevent this from
+  # firing for parallel runs, but a stuck mempool tx still trips it.
+  if printf '%s\n' "$SEND_OUT" | grep -qi "replacement transaction underpriced"; then
+    echo "    cast send FAILED: prior tx with same nonce is pending in Heima mempool." >&2
+    echo "    Wait ~1 minute for it to confirm or drop, then re-run. Output:" >&2
+  else
+    echo "    cast send FAILED (exit $SEND_RC). Output:" >&2
+  fi
   echo "$SEND_OUT" >&2
   exit 1
 fi

From d5b7c4954ac09225088a65fe01ef8f5c542dfe22 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 01:56:21 +0800
Subject: [PATCH 14/16] fix(ci): add pull-requests:read for dorny/paths-filter
 on PR events
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The detect-changes job fails on pull_request triggers with
'Error: Bad credentials' from dorny/paths-filter@v3. Root cause: the
workflow's explicit 'permissions:' block grants only id-token + contents,
which sets every other scope (including pull-requests) to 'none'.
paths-filter on PR events always queries the REST API
(/repos/.../pulls/N/files) — without pull-requests:read, the token is
rejected.

Earlier workflow_dispatch + push triggers passed because dispatch + push
don't take the PR-API code path (paths-filter does local git diff
against the previous push).
---
 .github/workflows/harness-ci.yml | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/harness-ci.yml b/.github/workflows/harness-ci.yml
index a5add53..5a07c1f 100644
--- a/.github/workflows/harness-ci.yml
+++ b/.github/workflows/harness-ci.yml
@@ -126,9 +126,15 @@ concurrency:
   cancel-in-progress: true
 
 permissions:
-  id-token: write   # GitHub Actions OIDC → assume TEST_OIDC_AWS_ROLE_ARN
-                    # (and OIDC_AWS_ROLE_ARN_DEPLOY for deploy-test-broker)
+  id-token: write       # GitHub Actions OIDC → assume TEST_OIDC_AWS_ROLE_ARN
+                        # (and OIDC_AWS_ROLE_ARN_DEPLOY for deploy-test-broker)
   contents: read
+  pull-requests: read   # dorny/paths-filter@v3 on pull_request events queries
+                        # the GitHub REST API (/repos/.../pulls/N/files) to list
+                        # changed paths. Without this, the API returns
+                        # 'Bad credentials' and the detect-changes job fails.
+                        # Required only on PR triggers; workflow_dispatch +
+                        # push triggers don't need it (no PR to query).
 
 jobs:
   rust-checks:

From 3c3dff6136e5c2e4103a23fe2638c93c701e5a7f Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 10:14:18 +0800
Subject: [PATCH 15/16] docs: add broker + local operator dev guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New docs/spec/broker-and-operator-dev-guide.md focused on the inner
edit-build-test loop:

- The 7-process local stack (mock-server :8090, broker :8091, signer
  :8092, 4 workers :9092-:9095) with the exact ports + crates + env
  vars each one reads.
- First-time keypair generation (one-shot keygen for the broker's
  ES256 OIDC + session keypairs).
- Inner loop A — edit broker code: scripts/broker.dev.env template,
  the --features auth-email-link footgun, three-terminal foreground
  flow, hot-reload pattern.
- Inner loop B — edit operator scripts: scripts/operator-workstation.dev.env
  template, the --from-step/--to-step/--only-step primitive, anvil for
  fully-local chain dev.
- Inner loop C — CI auto-deploy (issue #101 / PR #102): which paths
  trigger the auto-deploy + how to dry-run via workflow_dispatch.
- Config-file map distinguishing broker.env vs operator-workstation.env
  vs broker.test.env so the most common 'I sourced the wrong file' bug
  is debuggable from the guide.
- Debugging cheatsheet — RUST_LOG, port collisions, the 5 most common
  broker-boot-fail shapes with their fixes.
- Chain profile selection (anvil vs heima-paseo vs heima).

Distinct from docs/dev-setup.md (environment bootstrap) and
docs/operator-runbook-stage7.md (deploy-to-real-host) — those are
the 'first machine' / 'first broker' docs. This is the 'I'm iterating
on the broker right now' doc.

Linked from README.md Development section.
---
 README.md                                  |   2 +
 docs/spec/broker-and-operator-dev-guide.md | 336 +++++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 docs/spec/broker-and-operator-dev-guide.md

diff --git a/README.md b/README.md
index 8807068..3096c98 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,8 @@ cargo test -p agentkeys-provisioner
 
 Staged build plan in [`docs/spec/plans/development-stages.md`](docs/spec/plans/development-stages.md). Each stage has a `harness/stage-N-done.sh` gate that must exit 0 before the stage is marked complete. Contributor workflow: [`CLAUDE.md`](CLAUDE.md).
 
+**Inner-loop guide for broker + operator-side dev:** [`docs/spec/broker-and-operator-dev-guide.md`](docs/spec/broker-and-operator-dev-guide.md) — how to run the broker, signer, and mock-server locally, point the operator scripts at them, and use `harness/v2-stage*-demo.sh` for edit-build-test cycles.
+
 Version control uses [jj (Jujutsu)](https://github.com/jj-vcs/jj), not raw git.
 
 ## License
diff --git a/docs/spec/broker-and-operator-dev-guide.md b/docs/spec/broker-and-operator-dev-guide.md
new file mode 100644
index 0000000..88dcae8
--- /dev/null
+++ b/docs/spec/broker-and-operator-dev-guide.md
@@ -0,0 +1,336 @@
+# Broker + Local Operator Dev Guide
+
+**Audience:** developers iterating on the broker, the workers, or the operator-side scripts (`harness/`, `scripts/heima-*.sh`).
+**Scope:** the inner edit-build-test loop — running the broker stack on your laptop, exercising it with operator scripts, and knowing which knob to turn when something breaks.
+
+This guide is **not** the environment bootstrap doc (see [`docs/dev-setup.md`](../dev-setup.md)) or the deploy-to-real-host runbook (see [`docs/operator-runbook-stage7.md`](../operator-runbook-stage7.md)). Read those first if you have a fresh machine or you're standing up a new broker EC2.
+
+---
+
+## 1. The local stack at a glance
+
+The deployed broker runs five processes on one EC2. For local dev you run the same five processes on `localhost`, on the same ports, with the same env contract. Same code path — only the env values change.
+
+| Process | Default port | Crate | Purpose | Local-dev role |
+|---|---|---|---|---|
+| `agentkeys-mock-server` | `:8090` | `agentkeys-mock-server` | v0 backend; mirrors the Heima parachain extrinsic surface | Stand-in for the chain RPC + the legacy session-validation backend |
+| `agentkeys-broker-server` | `:8091` | `agentkeys-broker-server` | The credential broker — auth, cap-mint, OIDC issuer | The component you're most often editing |
+| `agentkeys-signer` (dev_key_service) | `:8092` | `agentkeys-broker-server` (same binary, different listener) | EVM keypair derivation from `omni_account` via HKDF | Stub for the future TEE signer (see [`signer-protocol.md`](./signer-protocol.md)) |
+| `agentkeys-worker-audit` | `:9092` | `agentkeys-worker-audit` | Merkle-root batching for credential audit | Only matters if you're touching audit code |
+| `agentkeys-worker-email` | `:9093` | `agentkeys-worker-email` | Inbound email handler (SES → cap-mint trigger) | Only matters for email-link auth |
+| `agentkeys-worker-creds` | `:9094` | `agentkeys-worker-creds` | Credential store — STS + S3 PrincipalTag-scoped | The data plane the cap-mint flow leads to |
+| `agentkeys-worker-memory` | `:9095` | `agentkeys-worker-memory` | Memory store — STS + S3 (per-actor isolation) | Symmetric with creds |
+
+In the deployed stack `nginx` fronts the broker + signer + 4 workers on `:443` with public hostnames. Locally you talk to the ports directly — no nginx, no TLS.
+
+---
+
+## 2. First-time local-stack bring-up
+
+After [`docs/dev-setup.md`](../dev-setup.md) §1–§2 (rust, jj, node, `cargo build --workspace --release`), generate the broker's two ES256 keypairs once:
+
+```bash
+mkdir -p ~/.agentkeys/broker
+cargo run -q --release -p agentkeys-broker-server -- keygen --purpose oidc    --out ~/.agentkeys/broker/oidc-keypair.json
+cargo run -q --release -p agentkeys-broker-server -- keygen --purpose session --out ~/.agentkeys/broker/session-keypair.json
+chmod 600 ~/.agentkeys/broker/{oidc,session}-keypair.json
+```
+
+These are the only persistent local state the broker needs. Treat them like any other dev secret — kept under `~/.agentkeys/`, gitignored at the home-directory level, never copied off your laptop. Regenerating them invalidates every previously-derived wallet that depended on the matching session pubkey, so don't `rm` them mid-session.
+
+---
+
+## 3. Inner loop A — edit broker code
+
+The broker reads its config from env vars and the two keypair files. Source a dev env file once per shell, then iterate with `cargo run`.
+
+### 3.1 The dev env
+
+Create `scripts/broker.dev.env` (gitignored — copy + edit from `scripts/broker.env`):
+
+```bash
+# Local-dev broker env — everything points at localhost.
+ACCOUNT_ID=000000000000                                    # placeholder; AWS calls go to mock backend
+BROKER_DATA_ROLE_ARN=arn:aws:iam::000000000000:role/dev    # never assumed in local dev
+BROKER_AWS_REGION=us-east-1                                # any region; not actually hit
+BROKER_OIDC_ISSUER=http://127.0.0.1:8091                   # matches --bind/--port below
+BROKER_OIDC_KEYPAIR_PATH=$HOME/.agentkeys/broker/oidc-keypair.json
+BROKER_SESSION_KEYPAIR_PATH=$HOME/.agentkeys/broker/session-keypair.json
+BROKER_AUTH_METHODS=wallet_sig,email_link
+BROKER_AUDIT_ANCHORS=sqlite                                # sqlite store; never writes to chain
+BROKER_EMAIL_SENDER=stub                                   # in-memory; no SES, no AWS creds needed
+BROKER_EMAIL_FROM_ADDRESS=dev@localhost
+BROKER_BACKEND_URL=http://127.0.0.1:8090                   # points at the local mock-server below
+
+# dev_key_service signer (issue #74 step 1b)
+DEV_KEY_SERVICE_MASTER_SECRET=local-dev-secret-32-bytes-min-length-please
+```
+
+Three lines matter most for local dev:
+
+- `BROKER_EMAIL_SENDER=stub` — skips SES; magic-link tokens land in an in-process `Vec` that you read back via the test harness or a `curl`-driven `/v1/auth/email/list-pending` endpoint (broker test feature).
+- `BROKER_AUDIT_ANCHORS=sqlite` — every audit row lands in a local SQLite file; nothing hits the chain. Set to `evm_testnet` ONLY when you've built with `--features audit-evm` AND you actually want to test the on-chain anchor path (Phase C, not shipped as of PR #102).
+- `BROKER_BACKEND_URL` — the broker calls a "backend" for legacy session validation (the v0 mock-server, or a real chain backend in v0.2+). In local dev this points at `agentkeys-mock-server :8090` started in §3.3 below.
+
+### 3.2 Build the broker with the right features
+
+`cargo run` defaults to debug + workspace default features. The broker MUST be built with `--features auth-email-link` if `BROKER_AUTH_METHODS` includes `email_link` (which the dev env above does) — otherwise the broker boot-fails with `BROKER_AUTH_METHODS="email_link": unknown or feature-gated-out auth method`.
+
+```bash
+# Iteration build (~10s warm, ~3min cold):
+cargo build -p agentkeys-broker-server --features auth-email-link
+
+# Or release for cycle-accurate testing (~30s warm, ~5min cold):
+cargo build --release -p agentkeys-broker-server --features auth-email-link
+```
+
+Cargo footgun (per [`scripts/setup-broker-host.sh:547`](../../scripts/setup-broker-host.sh)): never combine `-p agentkeys-broker-server -p agentkeys-mock-server --features auth-email-link` — cargo silently drops the feature flag. Always build the two binaries in separate `cargo build` invocations.
+
+### 3.3 Run the three foreground processes
+
+Three terminals. Source the dev env in each; pass `--bind 127.0.0.1 --port <p>`:
+
+```bash
+# Terminal 1 — mock-server (v0 backend the broker talks to)
+set -a; source scripts/broker.dev.env; set +a
+cargo run --release -p agentkeys-mock-server -- --bind 127.0.0.1 --port 8090
+
+# Terminal 2 — broker (your usual edit target)
+set -a; source scripts/broker.dev.env; set +a
+RUST_LOG=info,agentkeys_broker_server=debug \
+  cargo run --release -p agentkeys-broker-server --features auth-email-link -- \
+    --bind 127.0.0.1 --port 8091
+
+# Terminal 3 — signer (dev_key_service; serves /dev/derive-address + /dev/sign-*)
+set -a; source scripts/broker.dev.env; set +a
+cargo run --release -p agentkeys-broker-server -- \
+  --bind 127.0.0.1 --port 8092 --signer-only
+```
+
+The signer is the SAME binary as the broker (`agentkeys-broker-server`) with `--signer-only` — it serves only `/dev/*` + `/healthz` and shares the keypair files with the broker process on `:8091`.
+
+Skip workers (`agentkeys-worker-{audit,email,creds,memory}` on `:9092-:9095`) until you're editing them — the broker's hot path doesn't require them for most flows.
+
+### 3.4 Sanity check
+
+```bash
+curl -s http://127.0.0.1:8091/healthz                                # → "ok"
+curl -s http://127.0.0.1:8091/.well-known/openid-configuration | jq . # OIDC discovery doc
+curl -s http://127.0.0.1:8091/.well-known/jwks.json | jq .            # broker's JWKS
+```
+
+If healthz returns `ok` but the JWKS is empty, the keypair files aren't being read — check the paths in your dev env. If the broker boot-fails with `BROKER_AUTH_METHODS=email_link: unknown`, you forgot `--features auth-email-link` on the cargo build.
+
+### 3.5 Hot-reload loop
+
+There's no `cargo watch` in the workspace, but the dev loop is fast enough without it:
+
+1. Edit Rust in `crates/agentkeys-broker-server/src/...`.
+2. `Ctrl-C` Terminal 2's broker.
+3. Re-run the `cargo run -p agentkeys-broker-server ...` command from §3.3 (shell history is your friend).
+4. The first re-run rebuilds the broker (~10s incremental); subsequent runs reuse the artifact.
+
+For a tighter loop while editing a single module, write a unit test next to the module and use `cargo test -p agentkeys-broker-server <test_name>` — typically <2s per iteration.
+
+---
+
+## 4. Inner loop B — edit operator scripts
+
+The operator-side scripts (`harness/v2-stage{1,2,3}-demo.sh`, `scripts/heima-*.sh`, `scripts/agentkeys-*-demo.sh`) are the dev loop for the *operator workflow*: cap-mint, identity bootstrap, scope grants, S3 isolation tests. They run on your laptop and call the broker (local or remote) via plain HTTP + `cast` + `aws`.
+
+### 4.1 Point the operator env at the local broker
+
+Create `scripts/operator-workstation.dev.env` (gitignored — copy + edit from `scripts/operator-workstation.env`):
+
+```bash
+# Local-dev operator env — points the harness scripts at localhost
+ACCOUNT_ID=000000000000
+REGION=us-east-1
+BROKER_HOST=127.0.0.1:8091
+OIDC_ISSUER=http://127.0.0.1:8091
+AGENTKEYS_SIGNER_URL=http://127.0.0.1:8092
+BACKEND_URL=http://127.0.0.1:8090
+
+# Local-stack workers (skip these until you wire them up — broker hot path doesn't need them)
+AGENTKEYS_WORKER_AUDIT_URL=http://127.0.0.1:9092
+AGENTKEYS_WORKER_EMAIL_URL=http://127.0.0.1:9093
+AGENTKEYS_WORKER_CRED_URL=http://127.0.0.1:9094
+AGENTKEYS_WORKER_MEMORY_URL=http://127.0.0.1:9095
+
+# Local chain backbone — pick ONE based on what you're testing:
+#   anvil          — fully local (forge anvil running on 127.0.0.1:8545); fastest
+#   heima-paseo    — Heima testnet; real chain, no real money
+#   heima          — Heima mainnet (production); use with care
+AGENTKEYS_CHAIN=anvil
+```
+
+### 4.2 Run the canonical inner-loop demo
+
+[`harness/v2-stage1-demo.sh`](../../harness/v2-stage1-demo.sh) is the end-to-end exerciser most operator edits land against. It's a 13-step script: install CLI → email-link init → identity bootstrap → S3 envelope smoke test → chain bring-up → device register → agent create → scope grant → K11 enroll → cap-mint roundtrip.
+
+```bash
+set -a; source scripts/operator-workstation.dev.env; set +a
+
+# Full demo against local stack:
+bash harness/v2-stage1-demo.sh --chain anvil
+
+# Re-run just one step you're iterating on:
+bash harness/v2-stage1-demo.sh --only-step 7
+
+# Skip the slow bits (CLI build, chain deploy, S3 provisioning):
+bash harness/v2-stage1-demo.sh --skip-build --skip-deploy --skip-provision
+
+# Stop after a specific step (useful when bisecting a regression):
+bash harness/v2-stage1-demo.sh --to-step 5
+```
+
+The `--from-step N` / `--to-step N` / `--only-step N` triad is the inner-loop primitive — every step prints `[step N/M]` to stderr, every step is idempotent. If step 7 fails after a script edit, fix the script, re-run with `--from-step 7`, you keep the work from steps 1–6.
+
+### 4.3 Anvil for fully-local chain dev
+
+When you don't want to talk to Heima at all, run [foundry](https://book.getfoundry.sh/anvil/) anvil locally:
+
+```bash
+# Terminal 4 — local EVM (anvil) on :8545
+anvil --chain-id 31337 --port 8545
+```
+
+Then `AGENTKEYS_CHAIN=anvil` in your operator env makes every `cast send` hit anvil instead of Heima. The deployer wallet is whichever anvil-prefunded key you point at via `HEIMA_DEPLOYER_KEY` / `HEIMA_DEPLOYER_KEY_FILE`. Anvil's mempool is single-tenant — none of the [PR #102 nonce-contention issues](./plans/issue-101-ci-auto-deploy.md) bite locally.
+
+### 4.4 Editing `setup-broker-host.sh`
+
+`scripts/setup-broker-host.sh` is the canonical "single entry point" for the broker EC2 (per CLAUDE.md "Remote broker host (single entry point)" policy). When you change it, the unit-test is to dry-run it on a throwaway VM, but the practical inner loop is:
+
+1. Edit the script.
+2. `bash -n scripts/setup-broker-host.sh` — syntax check.
+3. SSH into the test broker EC2 (`bash scripts/ssh-broker.sh`), `cd ~/agentKeys`, `git pull`, `bash scripts/setup-broker-host.sh --test --yes` — exercise the full path.
+4. **Or** push to your PR branch and let the [CI auto-deploy](#5-inner-loop-c--ci-auto-deploy-issue-101) (PR #102) drive it on the test EC2.
+
+Step 4 is usually faster — no SSH, you get fresh logs in the GHA run, and the harness validates the deploy end-to-end.
+
+---
+
+## 5. Inner loop C — CI auto-deploy (issue #101)
+
+Per [PR #102](https://github.com/litentry/agentKeys/pull/102), pushing broker-affecting changes to a PR branch auto-deploys to the test EC2 via SSM and runs the full harness against the freshly-deployed broker. You see broker bugs in your own PR, not the next operator's.
+
+What counts as "broker-affecting" — the path-filter list in [`.github/workflows/harness-ci.yml`](../../.github/workflows/harness-ci.yml):
+
+```
+crates/agentkeys-broker-server/**
+crates/agentkeys-worker-*/**
+crates/agentkeys-signer-protocol/**
+crates/agentkeys-types/**
+crates/agentkeys-core/**
+scripts/setup-broker-host.sh
+scripts/setup-broker-host.sh.d/**
+scripts/broker.env
+scripts/broker.test.env
+Cargo.toml
+Cargo.lock
+```
+
+Untouched + auto-deploy is opt-in (gated on `OIDC_AWS_ROLE_ARN_DEPLOY` + `TEST_BROKER_INSTANCE_ID` repo secrets — see [`docs/ci-setup.md`](../ci-setup.md) §7).
+
+To dry-run the deploy without a broker code change, dispatch manually with the override:
+
+```bash
+gh workflow run harness-ci.yml --repo litentry/agentKeys \
+  --ref <your-branch> \
+  --field stage=1 \
+  --field force_deploy_broker=true
+```
+
+---
+
+## 6. Config-file map — which file controls what
+
+Three files, three audiences. The "is the broker reading the right thing" debug usually comes down to which one you sourced.
+
+| File | Where it lives | Who reads it | Local-dev override |
+|---|---|---|---|
+| [`scripts/broker.env`](../../scripts/broker.env) | **Broker host** (EC2 or your laptop's broker process) | `agentkeys-broker-server` (every entry has a matching constant in `crates/agentkeys-broker-server/src/env.rs`) | `scripts/broker.dev.env` (gitignored, copied from `broker.env`, swap hosts to `127.0.0.1`) |
+| [`scripts/operator-workstation.env`](../../scripts/operator-workstation.env) | **Operator laptop** | Every `harness/` + `scripts/heima-*.sh` script | `scripts/operator-workstation.dev.env` (gitignored, swap hosts to `127.0.0.1:809x`) |
+| [`scripts/broker.test.env`](../../scripts/broker.test.env) | **Test broker host** (CI auto-deploy target) | `agentkeys-broker-server` running on the test EC2 | Same shape as `broker.env`; CI workflow materializes per-run values into this on the runner |
+
+Mixing them on the wrong host is the most common config bug. The broker host should NEVER source `operator-workstation.env` — that file has AWS admin tooling vars (BUCKET, OIDC_PROVIDER_ARN) that don't exist as broker-server env vars and would silently shadow what the broker actually reads.
+
+---
+
+## 7. Debugging cheatsheet
+
+### 7.1 Logs
+
+The broker uses `tracing_subscriber` with `EnvFilter` ([`crates/agentkeys-broker-server/src/main.rs:73`](../../crates/agentkeys-broker-server/src/main.rs)). Control via `RUST_LOG`:
+
+```bash
+# Default — only INFO and above
+cargo run -p agentkeys-broker-server -- ...
+
+# Verbose for the broker, quiet for everything else
+RUST_LOG=info,agentkeys_broker_server=debug cargo run -p agentkeys-broker-server -- ...
+
+# Trace-level for one specific module
+RUST_LOG=info,agentkeys_broker_server::handlers::cap=trace cargo run -p agentkeys-broker-server -- ...
+```
+
+On the deployed broker, logs go to systemd journal:
+
+```bash
+ssh broker journalctl -u agentkeys-broker --since '5 min ago' -f
+ssh broker journalctl -u agentkeys-signer --since '5 min ago' -f
+```
+
+### 7.2 Port collisions
+
+If `cargo run` errors with `Address already in use`, find the stuck process:
+
+```bash
+lsof -nP -iTCP:8091 -sTCP:LISTEN     # broker
+lsof -nP -iTCP:8090 -sTCP:LISTEN     # mock-server
+lsof -nP -iTCP:8092 -sTCP:LISTEN     # signer
+```
+
+Kill by PID (the only `kill -9` you should reach for during dev) or by name: `pkill -f agentkeys-broker-server`.
+
+### 7.3 The broker boots, then immediately exits
+
+Common shapes:
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| `BROKER_AUTH_METHODS="email_link": unknown or feature-gated-out auth method` | Built without `--features auth-email-link` | Re-build with the feature; see §3.2 |
+| `failed to read OIDC keypair: No such file` | `BROKER_OIDC_KEYPAIR_PATH` doesn't exist | Re-run the `keygen` from §2 |
+| `BROKER_BACKEND_URL=http://127.0.0.1:8090: connection refused` | Mock-server isn't running on `:8090` | Start it (Terminal 1 in §3.3) |
+| Broker logs are silent | `RUST_LOG` unset and the default filter is too quiet for what you want | Add `RUST_LOG=debug` to your `cargo run` command |
+| `SES GetEmailIdentity: AccessDenied` | `BROKER_EMAIL_SENDER=ses` but no AWS creds in the shell | Set `BROKER_EMAIL_SENDER=stub` for local dev |
+
+### 7.4 The harness fails at a specific step
+
+Re-run with `--from-step N` to keep prior progress, OR `--only-step N` to test one step in isolation. Every step is idempotent — re-running a passed step is a no-op. If `--only-step 7` fails the same way as the full run, the bug is in that step's script; if it passes, the bug is cross-step state that the previous steps mutated.
+
+---
+
+## 8. Chain profile selection
+
+`AGENTKEYS_CHAIN` controls which RPC + which contract addresses every harness script talks to. Default in `v2-stage1-demo.sh` is `heima-paseo`; common alternates:
+
+| Profile | RPC | When to use | Cost |
+|---|---|---|---|
+| `anvil` | `http://127.0.0.1:8545` | Fully local; fastest iteration; no real-world side effects | Free |
+| `heima-paseo` | Heima testnet | Real-chain semantics without real-money cost; default for `v2-stage1-demo.sh` | Testnet HEI (free from faucet) |
+| `heima` | Heima mainnet | The canonical chain; matches what CI's harness-e2e runs against | Real HEI — small per-run cost |
+
+Switch with `--chain` on any harness script. Contract addresses for `heima` and `heima-paseo` live in [`scripts/operator-workstation.env`](../../scripts/operator-workstation.env); add `anvil` ones by running `bash scripts/setup-heima.sh --chain anvil --from-step 4 --to-step 8` after starting your local anvil.
+
+---
+
+## 9. Related docs
+
+- [`docs/arch.md`](../arch.md) — single source of truth for component inventory + trust boundaries.
+- [`docs/dev-setup.md`](../dev-setup.md) — first-time machine bootstrap (rust, jj, node, AWS CLI, browser).
+- [`docs/operator-runbook-stage7.md`](../operator-runbook-stage7.md) — deploy-to-real-EC2 walkthrough (manual; not for local dev).
+- [`docs/ci-setup.md`](../ci-setup.md) — no-LLM CI + auto-deploy of test broker (issue #101 / PR #102).
+- [`docs/spec/signer-protocol.md`](./signer-protocol.md) — wire contract for the signer (TEE swap-in target).
+- [`docs/spec/credential-backend-interface.md`](./credential-backend-interface.md) — the `CredentialBackend` trait; what the broker's storage plug-ins must implement.
+- [`docs/spec/plans/development-stages.md`](./plans/development-stages.md) — the staged build plan + harness gates.

From 99e251943d2ce177d43ed23ba18ab5e398f1b0b2 Mon Sep 17 00:00:00 2001
From: Hanwen Cheng <heawen.cheng@gmail.com>
Date: Sun, 24 May 2026 10:28:11 +0800
Subject: [PATCH 16/16] docs(readme): split into 'For humans' + 'For AI coding
 agents' sections

Top: project name, one-line description, status, arch.md link (shared).

For humans:
- What it does (4 component bullets)
- Workspace layout
- Build & test commands
- First-machine setup (link to dev-setup.md)
- Inner-loop dev (link to broker-and-operator-dev-guide.md)
- License

For AI coding agents:
- Mandatory reading table (CLAUDE.md, arch.md, development-stages, execution-plan, dev guide)
- Hard rules condensed from CLAUDE.md (jj usage, branch push policy, diagnose-before-edit, land-the-fix, runbook-fix-fold-back, no-hardcoded-values, idempotent-remote-setup, plan-completion, terminology-source-of-truth)
- Per-session protocol (4 steps)
- Single entry points (setup-broker-host.sh, setup-heima.sh)

The split makes the README usable as the AI agent's session-start
briefing AND as the human's project intro, without either side wading
through content meant for the other. All 7 link targets verified
present in the repo.
---
 README.md | 66 ++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 3096c98..75d499e 100644
--- a/README.md
+++ b/README.md
@@ -4,16 +4,20 @@ Credential broker for AI agents. A master (human) delegates scoped, revocable ac
 
 Status: pre-v0. Stage 5 in progress (see `harness/progress.json`).
 
-## What it does
+Architecture, language choices, trust boundaries: [`docs/arch.md`](docs/arch.md).
+
+---
+
+## 👤 For humans
+
+### What it does
 
 - **Master CLI** (`agentkeys`) — runs on your laptop; owns a session key in the OS keychain; approves pair/recover/scope-change requests.
 - **Sandbox daemon** (`agentkeys-daemon`) — runs inside the agent sandbox; brokers credential reads over MCP + a Unix socket; never exposes raw keys to the agent.
 - **Provisioner** (`agentkeys-provisioner` + `provisioner-scripts`) — Rust orchestrator drives TypeScript/Playwright scrapers to sign up for services and hand the resulting API key back through the trust boundary.
 - **Mock backend** (`agentkeys-mock-server`) — v0-only; mirrors the Heima parachain API so we can build end-to-end before the chain integration lands.
 
-Architecture, language choices, trust boundaries: [`docs/arch.md`](docs/arch.md).
-
-## Workspace layout
+### Workspace layout
 
 ```
 crates/
@@ -31,7 +35,7 @@ harness/                     stage-gated build harness + progress
 
 ~80% Rust, 100% of the security-critical path in Rust. TypeScript is confined to browser automation and (post-MVP) the Web GUI frontend.
 
-## Build & test
+### Build & test
 
 ```
 cargo build
@@ -50,14 +54,56 @@ cargo test -p agentkeys-daemon -p agentkeys-mcp
 cargo test -p agentkeys-provisioner
 ```
 
-## Development
+### First-machine setup
 
-Staged build plan in [`docs/spec/plans/development-stages.md`](docs/spec/plans/development-stages.md). Each stage has a `harness/stage-N-done.sh` gate that must exit 0 before the stage is marked complete. Contributor workflow: [`CLAUDE.md`](CLAUDE.md).
+Fresh laptop? Start with [`docs/dev-setup.md`](docs/dev-setup.md) — it walks you through rustup, jj, Node, AWS CLI, browser, and runs the workspace smoke tests.
 
-**Inner-loop guide for broker + operator-side dev:** [`docs/spec/broker-and-operator-dev-guide.md`](docs/spec/broker-and-operator-dev-guide.md) — how to run the broker, signer, and mock-server locally, point the operator scripts at them, and use `harness/v2-stage*-demo.sh` for edit-build-test cycles.
+### Inner-loop dev
 
-Version control uses [jj (Jujutsu)](https://github.com/jj-vcs/jj), not raw git.
+Iterating on the broker, signer, mock-server, or operator-side scripts? [`docs/spec/broker-and-operator-dev-guide.md`](docs/spec/broker-and-operator-dev-guide.md) covers the local edit-build-test loop: which process to run on which port, how to point harness scripts at `localhost`, how to use `harness/v2-stage*-demo.sh` for resumable step-by-step testing.
 
-## License
+### License
 
 Dual-licensed under **MIT OR Apache-2.0**, at your choice.
+
+---
+
+## 🤖 For AI coding agents
+
+**You must read these before making any change.** They override defaults from your training data and cover the project-specific guardrails.
+
+| Read | Why |
+|---|---|
+| [`CLAUDE.md`](CLAUDE.md) | Project-specific rules: docs layout, /create-pr workflow in worktrees, terminology-source-of-truth, branch push policy, idempotent-remote-setup invariants, runbook-fix-fold-back policy. **Read first, every session.** |
+| [`docs/arch.md`](docs/arch.md) | Single source of truth for component inventory (K1–K11), trust boundaries, HDKD actor tree, per-actor binding ceremonies. When the per-doc detail outgrows arch.md, link outward — never duplicate. |
+| [`docs/spec/plans/development-stages.md`](docs/spec/plans/development-stages.md) | The 8-stage build plan. Each stage has a `harness/stage-N-done.sh` gate; never self-grade — run the gate. |
+| [`docs/spec/plans/execution-plan.md`](docs/spec/plans/execution-plan.md) | Orchestration runbook (ralph, team, ultraqa workflows). |
+| [`docs/spec/broker-and-operator-dev-guide.md`](docs/spec/broker-and-operator-dev-guide.md) | Inner edit-build-test loop for broker + operator-side code. Use this before suggesting changes to the broker's run-time behavior. |
+
+### Hard rules (from CLAUDE.md)
+
+These are non-negotiable. Violating them produces broken PRs / corrupted state.
+
+- **Use `jj` (Jujutsu), never raw `git`.** Common mappings in CLAUDE.md. The one exception: inside a Claude Code `.claude/worktrees/<name>/` worktree, the initial commit must use `git` (jj can't colocate in a git-worktree); then `cd` to the main repo and push via `jj git push`. Never include `Co-Authored-By:` lines in those commits.
+- **Branch `evm` pushes immediately.** On `evm`, push after every `jj describe` — the remote broker host pulls from `origin/evm` to redeploy. "I'll push at the end" silently breaks deploys.
+- **Diagnose before edit.** Reproduce the failure locally first; isolate the layer (shell / client / doc / broker code / network). If the cause is local to the operator's shell, respond with the one-line fix — don't edit the repo.
+- **Land the fix everywhere.** Once a local repro proves a fix is correct, land it the same turn — search the repo for every affected file, commit, push to `origin/evm`. Don't stop at "verified locally" or "fixed one file."
+- **Runbook fix fold-back.** When an operator hits a runbook failure, two things land in the same turn: (1) the targeted fix, (2) a revision to the runbook so the next operator doesn't hit the same trap.
+- **No hardcoded values.** Use env var + default, CLI flag + default, or a config file. If you must hardcode temporarily, log it in [`hardcoded.md`](hardcoded.md) with file:line + reason + what would unblock dynamic.
+- **Idempotent remote setup.** Every script that mutates remote state (AWS / Heima / CI / VM / DNS) must exit 0 on re-run without re-applying. Pre-check with `get-*` before mutating; log `ok | skip <reason> | fail <reason>`.
+- **Plan completion is all-or-nothing.** When implementing a plan, every numbered step must be done — or the PR summary's "What did NOT land" section must explicitly list what was skipped and why.
+- **Terminology source of truth.** Never invent a new name for a concept arch.md already names. If you find divergence, fix it in the same commit or document the alias in arch.md's "Canonical names" section.
+
+### Per-session protocol
+
+1. `jj log --limit 10 && cat harness/progress.json && bash harness/init.sh $(jq -r .current_stage harness/progress.json)`
+2. Read the stage contract for the current stage in `docs/spec/plans/development-stages.md`.
+3. Pick the HIGHEST-PRIORITY incomplete deliverable from `harness/features.json`.
+4. Implement ONE deliverable, run `cargo test -p <crate>`, `jj describe`, update `harness/features.json`, `jj new`.
+
+### Single entry points
+
+Don't reach for ad-hoc `systemctl`, `scp`, or `forge script` — these are wrapped:
+
+- **Remote broker host** (binary upgrades, systemd, nginx, env tweaks): `bash scripts/setup-broker-host.sh`
+- **Heima chain bring-up** (deploy, binding ceremonies, scope grants, K11 enroll, audit-row append, worker smoke): `bash scripts/setup-heima.sh`