Skip to content

issue #101: path-conditional auto-deploy of test broker via SSM#102

Open
hanwencheng wants to merge 14 commits into
mainfrom
claude/adoring-bell-1b9ca8
Open

issue #101: path-conditional auto-deploy of test broker via SSM#102
hanwencheng wants to merge 14 commits into
mainfrom
claude/adoring-bell-1b9ca8

Conversation

@hanwencheng
Copy link
Copy Markdown
Member

Closes #101 (broker-only scope; contract auto-deploy deferred to a follow-up per the issue's rollout plan step 7).

Summary

  • New detect-changes job in .github/workflows/harness-ci.yml uses dorny/paths-filter@v3 to compute broker_changed on every PR/push. The filter conservatively includes all workspace-shared crates the broker links against (agentkeys-types, agentkeys-core, agentkeys-signer-protocol, broker + worker crates), setup-broker-host.sh*, broker.env*, and Cargo.lock — false-negative-resistant per the path-filter caveat in the issue.
  • New deploy-test-broker job assumes a new OIDC role github-actions-agentkeys-deploy and drives setup-broker-host.sh --test --yes on the test broker EC2 via aws ssm send-command. SendCommand is followed by a 10s-poll loop (15min cap) that surfaces stdout / stderr on success or failure.
  • New scripts/provision-ci-deploy-role.sh provisions the role idempotently per CLAUDE.md "Idempotent remote-setup rule (CLOUD)": pre-check role existence, refresh trust policy on re-run, attach agentkeys-ci-deploy-ssm inline policy with SendCommand scoped to one document + one EC2 instance ARN, verify the EC2's SSM agent is PingStatus=Online with concrete remediation hints if not.
  • harness-e2e now needs deploy-test-broker and runs always() && (deploy success || deploy skipped) — so when deploy is skipped (no broker path changes, or auto-deploy not activated) the harness still runs against the existing broker binary.

Design deviation from the issue spec

The issue's example YAML has deploy-test-broker: needs: harness-e2e (deploy AFTER harness validates the current broker). This PR inverts that to deploy FIRST, then harness-e2e, because the failure mode the issue describes — "harness scripts at version B vs broker binary at version A → spurious pass or confusing failure" — is only caught if the harness runs against the freshly-deployed broker. Deploying second means a broker bug introduced by PR-N leaks to PR-(N+1)'s harness, which is what the issue is trying to prevent.

Trade-off: a broker bug that crashes on startup fails the deploy and skips the harness — also the right signal, since there's nothing meaningful to test against a broken broker.

The inline workflow comment documents the divergence so a reviewer can flip it back if they prefer the issue's ordering — both are one-line changes to the needs: graph.

Out of scope (per issue #101)

  • deploy-test-contracts job — deferred to a follow-up PR per the issue's rollout plan step 7. Contract redeploys mint new addresses and require a SECRETS_REWRITE_PAT token to update six TEST_*_ADDRESS_HEIMA secrets after each deploy. More risk than the broker case → separate PR.
  • Prod broker auto-deploy — stays manual via `bash scripts/setup-broker-host.sh --upgrade` from the operator laptop per CLAUDE.md "Remote broker host (single entry point)" policy.
  • Mainnet prod contract redeploy — never automatic; manual via `bash scripts/setup-heima.sh` only (operator approval is the safety gate for production-affecting changes).

What landed

  • .github/workflows/harness-ci.yml — added `detect-changes` + `deploy-test-broker` jobs; updated header docstring with the two new secrets; updated `harness-e2e` `needs:` graph.
  • scripts/provision-ci-deploy-role.sh — idempotent IAM role + SSM policy provisioner. Verifies SSM agent reachability with actionable remediation messages.
  • docs/ci-setup.md — new §7 "Wire auto-deploy of the test broker" with provisioning recipe, secret list, dry-run procedure, and disarm instructions.

What did NOT land

  • `deploy-test-contracts` job (rollout plan step 7) — deferred to follow-up PR per issue scope.

Activation steps (for the operator after this PR merges)

  1. `awsp agentkeys-admin`
  2. Look up the test broker EC2 instance ID:
    ```
    TEST_BROKER_INSTANCE_ID=$(aws ec2 describe-instances --region "$REGION" \
    --filters "Name=tag:Name,Values=agentkeys-test-broker" \
    --query 'Reservations[0].Instances[0].InstanceId' --output text)
    ```
  3. Provision the role idempotently:
    ```
    bash scripts/provision-ci-deploy-role.sh \
    --test-broker-instance-id "$TEST_BROKER_INSTANCE_ID" \
    --env-file scripts/operator-workstation.test.env
    ```
  4. Set the two new repo secrets (the script prints both values):
    ```
    gh secret set OIDC_AWS_ROLE_ARN_DEPLOY --body "$(aws iam get-role --role-name github-actions-agentkeys-deploy --query 'Role.Arn' --output text)"
    gh secret set TEST_BROKER_INSTANCE_ID --body "$TEST_BROKER_INSTANCE_ID"
    ```
  5. Dry-run via `gh workflow run harness-ci.yml --field stage=1 --field force_deploy_broker=true` and confirm the deploy job lands cleanly.

Test plan

  • `bash -n scripts/provision-ci-deploy-role.sh` — syntax OK.
  • `scripts/provision-ci-deploy-role.sh --help` renders the embedded docstring.
  • `scripts/provision-ci-deploy-role.sh --bogus` rejects unknown flags.
  • `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/harness-ci.yml'))"` — workflow YAML parses; all 5 jobs registered (`rust-checks`, `detect-changes`, `preflight`, `deploy-test-broker`, `harness-e2e`).
  • Operator runs the activation steps above + triggers a `force_deploy_broker=true` dry-run dispatch; confirm the SSM command lands on the test EC2 and `setup-broker-host.sh --test --yes` finishes cleanly.
  • Operator merges a no-op broker PR (e.g. a comment-only change in `crates/agentkeys-broker-server/`) and confirms (a) `detect-changes` flips `broker_changed=true`, (b) the deploy fires, (c) `harness-e2e` runs against the freshly-deployed broker.
  • Operator removes one of the two deploy secrets and confirms `deploy-test-broker` silently skips on the next PR, `harness-e2e` still runs.

Adds two new harness-ci.yml jobs that re-deploy the test broker EC2
when a PR touches broker-affecting paths, so harness-e2e validates the
PR's actual broker code instead of whatever stale binary the EC2
happens to be running.

- detect-changes (dorny/paths-filter@v3) computes broker_changed
- deploy-test-broker assumes a new OIDC role and drives
  setup-broker-host.sh --test --yes on the EC2 via aws ssm send-command
- scripts/provision-ci-deploy-role.sh provisions the IAM role with a
  trust policy scoped to repo:litentry/agentKeys:* and an inline policy
  scoped to one EC2 instance ARN (separation of duties from the
  existing TEST_OIDC_AWS_ROLE_ARN e2e role)
- harness-e2e now runs AFTER deploy-test-broker (deviation from the
  issue's `needs: harness-e2e` spec, documented inline) so broker bugs
  introduced by a PR fail that PR's harness — not the next one's

Auto-deploy is fully opt-in: skipped silently unless both
OIDC_AWS_ROLE_ARN_DEPLOY and TEST_BROKER_INSTANCE_ID secrets are set.
A workflow_dispatch input force_deploy_broker enables dry-run
validation without a broker-path change.

Out of scope for this PR (rollout plan step 7 in issue #101):
auto-deploy of the test Heima EVM contracts. Defers to a follow-up
because it needs the SECRETS_REWRITE_PAT token to update six
TEST_*_ADDRESS_HEIMA secrets after each redeploy.

Prod broker auto-deploy stays explicitly out of scope per CLAUDE.md
"Remote broker host (single entry point)" — manual via
bash scripts/setup-broker-host.sh --upgrade only.

Docs: docs/ci-setup.md gains §7 with the provisioning recipe, secret
list, dry-run procedure, and disarm path.
IAM CreateRole rejects descriptions outside [\t\n\r\x20-\x7e\xa1-\xff]
with 'Value at description failed to satisfy constraint'. The em-dash
in the original description string tripped this regex at provisioning
time. Replace with an ASCII hyphen and add an inline warning comment
so a future editor doesn't reintroduce Unicode here.

Reported by operator running docs/ci-setup.md §7.1.
…olds into runbook

Operator hit the second failure mode in docs/ci-setup.md §7.1: the test
broker EC2 was not registered with SSM (PingStatus=None), so the script
exited before SendCommand could ever work. The fix had to be one round-
trip per CLAUDE.md runbook-fix-fold-back policy: a sanity check upgrade
that catches the same case for the next operator AND a manual override.

Script changes:
- New --fix-ssm flag. When passed AND PingStatus != Online, the script:
    1. Looks up the EC2's IamInstanceProfile via DescribeInstances.
    2. Walks profile -> role via iam:GetInstanceProfile.
    3. Attaches arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
       (idempotent — aws iam attach-role-policy no-ops on re-attach).
    4. Polls describe-instance-information up to 18x (~3 min) waiting
       for the agent to refresh creds.
    5. If still offline after 3 min: prints both manual escape hatches
       (ssh + systemctl restart amazon-ssm-agent, OR aws ec2 reboot-instances).
- Without --fix-ssm: same diagnostic message as before, plus a one-line
  hint pointing at --fix-ssm. No IAM mutation; safe default.
- Handles the edge case of an instance with NO instance profile at all:
  prints associate-iam-instance-profile command, exits 1.

Docs (docs/ci-setup.md §7.1):
- Standard invocation now includes --fix-ssm on the first run.
- New SSM remediation table maps each failure mode to what --fix-ssm
  covers vs what the operator must still do by hand (agent restart,
  reboot, install agent, VPC endpoint).

Reported by operator after re-running the em-dash-fixed script;
PingStatus=None on i-0135a8b2c53d14941.
… line

set -u tripped on the role-already-exists branch because the log line
referenced $sub_pattern as a shell variable, but it only exists as a
jq --arg inside the trust-policy heredoc. Replace with ${REPO_SLUG}
which is a real shell var.

Latent since the first commit; surfaced now that the previous
em-dash fix let the operator reach this branch on re-run.
…e when EC2 has none

Operator re-ran with --fix-ssm; auto-remediation hit the third failure
mode: the test broker EC2 has NO IAM instance profile attached at all.
A common state on test brokers spun up by setup-cloud.sh --test — the
broker process authenticates to AWS via static creds in
/etc/agentkeys/broker.env, so an instance profile was never wired up.

Script changes:
- New create_and_associate_ssm_profile() called when DescribeInstances
  reports no IamInstanceProfile.Arn. Idempotent end-to-end:
    1. iam get-role agentkeys-test-broker-ssm → create if missing
       (EC2 service trust policy, AmazonSSMManagedInstanceCore attached).
    2. iam get-instance-profile agentkeys-test-broker-ssm → create if
       missing.
    3. iam get-instance-profile (.Roles[0]) → add-role-to-instance-profile
       if empty; refuse to swap if the profile already holds a different
       role (operator must reconcile manually).
    4. 15s sleep for IAM eventual consistency (per AWS docs).
    5. ec2 describe-iam-instance-profile-associations → associate-iam-instance-profile
       if no existing association.
- attach_ssm_managed_policy_if_missing() now dispatches to
  create_and_associate_ssm_profile() when no profile is present, instead
  of exiting 1 with manual instructions.

Why this is safe to add to a running broker:
- The broker app reads AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY from
  broker.env explicitly; static creds always win over IMDS-served creds.
- Adding an IMDS instance profile cannot reduce capability — only the
  SSM agent (and not the broker app) will read from IMDS.

Runbook fold-back (CLAUDE.md policy): docs/ci-setup.md §7.1 SSM
remediation table now reads '(handled)' for the no-profile row,
describing the dedicated role/profile that gets created.
…101 root cause)

Operator hit the SSM-agent-not-installed failure mode after --fix-ssm
created + associated the instance profile: 'Unit amazon-ssm-agent.service
not found.' Some Ubuntu AMIs downstream of the AWS Marketplace base ship
without amazon-ssm-agent. Without the agent, no IAM policy on earth lets
the EC2 register with SSM, and the CI auto-deploy (issue #101) hangs.

Per CLAUDE.md "Runbook-fix-fold-back policy": the cure for an
operator-encountered failure is to upgrade the script that owns the
broken step, not the script that surfaces the symptom. setup-broker-host.sh
is the canonical entry point for the broker EC2 — the SSM agent install
belongs there.

Script changes (scripts/setup-broker-host.sh):
- Idempotent SSM-agent install block right after the ec2-instance-connect
  block (same shape: ssm_unit_active() pre-check, install only on miss).
- Two install paths in priority order:
    1. snap install amazon-ssm-agent --classic
       (AWS-blessed on Ubuntu 22.04+; unit:
        snap.amazon-ssm-agent.amazon-ssm-agent.service)
    2. .deb fallback from
       https://s3.$REGION.amazonaws.com/amazon-ssm-$REGION/latest/
       (older / non-snap images; unit: amazon-ssm-agent.service)
- Both paths converge on ssm_unit_active() returning true; subsequent
  --upgrade re-runs skip after that.

Runbook fold-back (docs/ci-setup.md §7.1):
- 'SSM Agent not installed' row of the remediation table now points
  operators at setup-broker-host.sh --test --yes for the structural fix,
  with a snap one-liner for one-shot manual recovery.

Reported by operator after re-running provision-ci-deploy-role.sh
--fix-ssm: the script created the profile + associated it, but the
3-min poll timed out because no SSM agent was running on the EC2.
…-not-registered

The SSM verify block has been masking caller-permission gaps as
'instance not registered with SSM' (state=None) because of the
2>/dev/null || echo None silent fallback. Result: 4 rounds of phantom
remediation against the EC2 (em-dash fix, --fix-ssm flag, auto-create
instance profile, install amazon-ssm-agent on the EC2) — none of which
were addressing the actual cause, which was that the operator's admin
group lacks ssm:DescribeInstanceInformation.

Fix:
- Capture stderr into a tmpfile.
- Grep for 'AccessDenied' specifically; on hit, die() with the exact
  one-liner the operator needs to attach AmazonSSMReadOnlyAccess to
  the AgentKeyAdmin group.
- Empty stdout (no AccessDenied in stderr) = genuinely not registered;
  proceeds to the existing remediation paths.

Diagnosed by running aws ssm describe-instance-information directly
against i-0135a8b2c53d14941 as agentkeys-admin and seeing the
AccessDenied that the script had been swallowing all along.

Lesson (CLAUDE.md fold-back): when a sanity check uses 2>/dev/null,
make sure the discarded stderr can't be the answer to the question
the check is asking.
Operator hit 'HTTP 422: Unexpected inputs provided: [force_deploy_broker]'
on first dry-run dispatch. Root cause is GHA's 'workflows are registered
from the default branch' rule — same trap already documented in §6
('Common first-run failure modes'), but I didn't repeat it in §7.3, so
the operator hit it again.

Fix:
- §7.3 dispatch command now includes --ref <pr-branch>.
- Distinguish pre-merge (--ref required, input lives on PR branch) from
  post-merge (--ref optional, input is on main).
- Show the git rev-parse trick to look up the local branch name.

Per CLAUDE.md runbook-fix-fold-back: every operator-encountered
failure makes the runbook strictly more robust.
…nguish AccessDenied in workflow

Deploy-test-broker's sanity-check step failed in the first dry-run with
'i-XXX is not SSM-managed'. Root cause: same swallowed-stderr trap as
the local script, now in the workflow. The deploy role's inline policy
granted SendCommand + GetCommandInvocation + ListCommandInvocations,
but NOT DescribeInstanceInformation. AccessDenied was silently mapped
to 'None', which the workflow interpreted as 'not SSM-managed'.

Three fixes:
1. provision-ci-deploy-role.sh: PollCommandStatus statement now includes
   ssm:DescribeInstanceInformation. put-role-policy is idempotent so
   re-running the script refreshes the existing role's inline policy
   in place.
2. harness-ci.yml sanity-check: captures stderr separately, greps for
   AccessDenied, prints actionable remediation. Empty state (no
   AccessDenied) still means genuinely-not-registered.
3. docs/ci-setup.md §7.1: lists DescribeInstanceInformation in the
   inline-policy bullet + notes 'already provisioned? re-run; idempotent'.

Per CLAUDE.md runbook-fix-fold-back: every operator-encountered failure
makes the runbook + scripts strictly more robust. The defensive workflow
step catches this in the future if the policy template ever drifts.
Deploy job's SSM script failed with 'cd: can't cd to /home/ubuntu/agentKeys'
on the operator's test broker. The hardcoded path assumed the
ubuntu-user clone layout, but the operator's box has the repo at a
different location (the broker EC2 may have been bootstrapped from a
non-default user or path).

Fix:
- Auto-discover loop tries TEST_BROKER_REPO_DIR override (new optional
  secret), then 7 common candidates (/home/ubuntu/agentKeys, /opt/agentkeys,
  /srv/agentkeys, /root/agentKeys, etc.). First candidate containing
  scripts/setup-broker-host.sh wins.
- stat -c '%U' to discover the actual tree owner instead of hardcoding
  'ubuntu' — covers the agentkeys / root / custom-user cases.
- Fail loud with the override secret name if no candidate matches.

Docs (docs/ci-setup.md §7.2): TEST_BROKER_REPO_DIR added to the secrets
table with a note that it's optional + only needed when auto-discover
prints 'could not locate'.

Diagnosed via SSM command stderr after the upstream AccessDenied + perm
gaps were resolved earlier in this PR.
…_DIR_OVERRIDE under set -u

Two regressions caught by the second dispatch on the operator's box:

1. Auto-discover didn't find the repo. Operator confirmed the checkout
   lives at /home/agentkey/agentKeys — not in my original 8 candidates.
   Added /home/agentkey/agentKeys, /home/agentkey/agentkeys, and
   /home/agentkeys/agentKeys (covering the variations of the broker app
   user name).

2. Diagnostic echo referenced \$REPO_DIR_OVERRIDE under remote-shell
   set -u, which fires 'parameter not set' when the secret is unset.
   Fixed with a one-line default at the top of the remote script:
       REPO_DIR_OVERRIDE="${REPO_DIR_OVERRIDE:-...}"
   That makes subsequent references safe under set -u while still
   honoring an operator-set override.
…nder set -u

After the auto-discover + repo path fix landed, the SSM-driven deploy
got past clone, fetch, summary, apt deps, and rustup install — then hit
'HOME: unbound variable' at the rustup-env source line. SSM-driven
remote shells (AWS-RunShellScript document) don't export HOME for the
default user; setup-broker-host.sh uses 'set -euo pipefail', so the
unset reference aborts.

Fix: 'export HOME=${HOME:-$(getent passwd $(id -u) | cut -d: -f6)}'
right after 'set -euo pipefail'. Resolves the running user's home dir
from /etc/passwd when the env var is missing — portable across interactive
ssh sessions (HOME already set) and SSM SendCommand (HOME unset).

Same root cause family as the earlier IamInstanceProfile + agent-install
fixes: bootstrap paths assume an interactive operator shell, but the CI
auto-deploy path is the structural test for those assumptions.
… findings)

Codex adversarial review (PR #102) confirmed the harness-e2e failure
'replacement transaction underpriced' is NOT caused by this PR. The
broker/workers have no chain-write code paths reachable from the shipped
feature set: audit-evm is feature-gated (Phase C, unshipped), the
worker-audit's auto-flush only LOGS 'ready for on-chain appendRoot'
without submitting txs, and setup-broker-host.sh has zero deployer-key
access.

The actual mechanism is harness-side nonce contention: concurrent
harness-e2e runs (PR branch + workflow_dispatch + re-triggers) share
ONE Heima test deployer wallet, and 'cast send' without --nonce defaults
to 'latest' nonce derivation — which collides with pending mempool txs
from a prior run.

Two-layer fix:

1. .github/workflows/harness-ci.yml — second concurrency group on
   harness-e2e, scoped to 'heima-test-deployer-nonce' (not the ref),
   with cancel-in-progress: false so queued runs wait rather than
   cancel. The outer 'harness-ci-${{ github.ref }}' lock only
   serializes per-branch; this one serializes globally for the shared
   deployer wallet.

2. scripts/heima-fund-account.sh + scripts/heima-agent-create.sh — pass
   '--nonce $(cast nonce ADDR --block pending)' so cast computes the
   nonce against the PENDING block, not the latest confirmed. This
   defends against a stuck mempool tx that survives the previous run's
   exit (concurrency lock alone doesn't help — the tx is in the
   mempool, not in another job).

Both layers also add a specific error message when the underpriced
case fires, telling the operator to wait ~1min for the stuck tx to
confirm or drop.

Codex investigation log (1.4M tokens): scanned setup-broker-host.sh,
broker-server, all 4 workers, env files, harness scripts, and workflow
YAML. Found zero chain-write paths reachable from the deployed broker
binary. Specific evidence cited in codex's response (crates/agentkeys-
broker-server/src/handlers/cap.rs uses eth_call reads only; worker-audit
main.rs:71 logs intent but doesn't submit; broker.env has no deployer
key).
The detect-changes job fails on pull_request triggers with
'Error: Bad credentials' from dorny/paths-filter@v3. Root cause: the
workflow's explicit 'permissions:' block grants only id-token + contents,
which sets every other scope (including pull-requests) to 'none'.
paths-filter on PR events always queries the REST API
(/repos/.../pulls/N/files) — without pull-requests:read, the token is
rejected.

Earlier workflow_dispatch + push triggers passed because dispatch + push
don't take the PR-API code path (paths-filter does local git diff
against the previous push).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci: auto-deploy to test broker + test Heima contracts on path-conditional triggers

1 participant