issue #101: path-conditional auto-deploy of test broker via SSM by hanwencheng · Pull Request #102 · litentry/agentKeys

hanwencheng · 2026-05-23T14:56:47Z

Closes #101 (broker-only scope; contract auto-deploy deferred to a follow-up per the issue's rollout plan step 7).

Summary

New detect-changes job in .github/workflows/harness-ci.yml uses dorny/paths-filter@v3 to compute broker_changed on every PR/push. The filter conservatively includes all workspace-shared crates the broker links against (agentkeys-types, agentkeys-core, agentkeys-signer-protocol, broker + worker crates), setup-broker-host.sh*, broker.env*, and Cargo.lock — false-negative-resistant per the path-filter caveat in the issue.
New deploy-test-broker job assumes a new OIDC role github-actions-agentkeys-deploy and drives setup-broker-host.sh --test --yes on the test broker EC2 via aws ssm send-command. SendCommand is followed by a 10s-poll loop (15min cap) that surfaces stdout / stderr on success or failure.
New scripts/provision-ci-deploy-role.sh provisions the role idempotently per CLAUDE.md "Idempotent remote-setup rule (CLOUD)": pre-check role existence, refresh trust policy on re-run, attach agentkeys-ci-deploy-ssm inline policy with SendCommand scoped to one document + one EC2 instance ARN, verify the EC2's SSM agent is PingStatus=Online with concrete remediation hints if not.
harness-e2e now needs deploy-test-broker and runs always() && (deploy success || deploy skipped) — so when deploy is skipped (no broker path changes, or auto-deploy not activated) the harness still runs against the existing broker binary.

Design deviation from the issue spec

The issue's example YAML has deploy-test-broker: needs: harness-e2e (deploy AFTER harness validates the current broker). This PR inverts that to deploy FIRST, then harness-e2e, because the failure mode the issue describes — "harness scripts at version B vs broker binary at version A → spurious pass or confusing failure" — is only caught if the harness runs against the freshly-deployed broker. Deploying second means a broker bug introduced by PR-N leaks to PR-(N+1)'s harness, which is what the issue is trying to prevent.

Trade-off: a broker bug that crashes on startup fails the deploy and skips the harness — also the right signal, since there's nothing meaningful to test against a broken broker.

The inline workflow comment documents the divergence so a reviewer can flip it back if they prefer the issue's ordering — both are one-line changes to the needs: graph.

Out of scope (per issue #101)

deploy-test-contracts job — deferred to a follow-up PR per the issue's rollout plan step 7. Contract redeploys mint new addresses and require a SECRETS_REWRITE_PAT token to update six TEST_*_ADDRESS_HEIMA secrets after each deploy. More risk than the broker case → separate PR.
Prod broker auto-deploy — stays manual via `bash scripts/setup-broker-host.sh --upgrade` from the operator laptop per CLAUDE.md "Remote broker host (single entry point)" policy.
Mainnet prod contract redeploy — never automatic; manual via `bash scripts/setup-heima.sh` only (operator approval is the safety gate for production-affecting changes).

What landed

.github/workflows/harness-ci.yml — added `detect-changes` + `deploy-test-broker` jobs; updated header docstring with the two new secrets; updated `harness-e2e` `needs:` graph.
scripts/provision-ci-deploy-role.sh — idempotent IAM role + SSM policy provisioner. Verifies SSM agent reachability with actionable remediation messages.
docs/ci-setup.md — new §7 "Wire auto-deploy of the test broker" with provisioning recipe, secret list, dry-run procedure, and disarm instructions.

What did NOT land

`deploy-test-contracts` job (rollout plan step 7) — deferred to follow-up PR per issue scope.

Activation steps (for the operator after this PR merges)

`awsp agentkeys-admin`
Look up the test broker EC2 instance ID:
```
TEST_BROKER_INSTANCE_ID=$(aws ec2 describe-instances --region "$REGION" \
--filters "Name=tag:Name,Values=agentkeys-test-broker" \
--query 'Reservations[0].Instances[0].InstanceId' --output text)
```
Provision the role idempotently:
```
bash scripts/provision-ci-deploy-role.sh \
--test-broker-instance-id "$TEST_BROKER_INSTANCE_ID" \
--env-file scripts/operator-workstation.test.env
```
Set the two new repo secrets (the script prints both values):
```
gh secret set OIDC_AWS_ROLE_ARN_DEPLOY --body "$(aws iam get-role --role-name github-actions-agentkeys-deploy --query 'Role.Arn' --output text)"
gh secret set TEST_BROKER_INSTANCE_ID --body "$TEST_BROKER_INSTANCE_ID"
```
Dry-run via `gh workflow run harness-ci.yml --field stage=1 --field force_deploy_broker=true` and confirm the deploy job lands cleanly.

Test plan

`bash -n scripts/provision-ci-deploy-role.sh` — syntax OK.
`scripts/provision-ci-deploy-role.sh --help` renders the embedded docstring.
`scripts/provision-ci-deploy-role.sh --bogus` rejects unknown flags.
`python3 -c "import yaml; yaml.safe_load(open('.github/workflows/harness-ci.yml'))"` — workflow YAML parses; all 5 jobs registered (`rust-checks`, `detect-changes`, `preflight`, `deploy-test-broker`, `harness-e2e`).
Operator runs the activation steps above + triggers a `force_deploy_broker=true` dry-run dispatch; confirm the SSM command lands on the test EC2 and `setup-broker-host.sh --test --yes` finishes cleanly.
Operator merges a no-op broker PR (e.g. a comment-only change in `crates/agentkeys-broker-server/`) and confirms (a) `detect-changes` flips `broker_changed=true`, (b) the deploy fires, (c) `harness-e2e` runs against the freshly-deployed broker.
Operator removes one of the two deploy secrets and confirms `deploy-test-broker` silently skips on the next PR, `harness-e2e` still runs.

Adds two new harness-ci.yml jobs that re-deploy the test broker EC2 when a PR touches broker-affecting paths, so harness-e2e validates the PR's actual broker code instead of whatever stale binary the EC2 happens to be running. - detect-changes (dorny/paths-filter@v3) computes broker_changed - deploy-test-broker assumes a new OIDC role and drives setup-broker-host.sh --test --yes on the EC2 via aws ssm send-command - scripts/provision-ci-deploy-role.sh provisions the IAM role with a trust policy scoped to repo:litentry/agentKeys:* and an inline policy scoped to one EC2 instance ARN (separation of duties from the existing TEST_OIDC_AWS_ROLE_ARN e2e role) - harness-e2e now runs AFTER deploy-test-broker (deviation from the issue's `needs: harness-e2e` spec, documented inline) so broker bugs introduced by a PR fail that PR's harness — not the next one's Auto-deploy is fully opt-in: skipped silently unless both OIDC_AWS_ROLE_ARN_DEPLOY and TEST_BROKER_INSTANCE_ID secrets are set. A workflow_dispatch input force_deploy_broker enables dry-run validation without a broker-path change. Out of scope for this PR (rollout plan step 7 in issue #101): auto-deploy of the test Heima EVM contracts. Defers to a follow-up because it needs the SECRETS_REWRITE_PAT token to update six TEST_*_ADDRESS_HEIMA secrets after each redeploy. Prod broker auto-deploy stays explicitly out of scope per CLAUDE.md "Remote broker host (single entry point)" — manual via bash scripts/setup-broker-host.sh --upgrade only. Docs: docs/ci-setup.md gains §7 with the provisioning recipe, secret list, dry-run procedure, and disarm path.

IAM CreateRole rejects descriptions outside [\t\n\r\x20-\x7e\xa1-\xff] with 'Value at description failed to satisfy constraint'. The em-dash in the original description string tripped this regex at provisioning time. Replace with an ASCII hyphen and add an inline warning comment so a future editor doesn't reintroduce Unicode here. Reported by operator running docs/ci-setup.md §7.1.

…olds into runbook Operator hit the second failure mode in docs/ci-setup.md §7.1: the test broker EC2 was not registered with SSM (PingStatus=None), so the script exited before SendCommand could ever work. The fix had to be one round- trip per CLAUDE.md runbook-fix-fold-back policy: a sanity check upgrade that catches the same case for the next operator AND a manual override. Script changes: - New --fix-ssm flag. When passed AND PingStatus != Online, the script: 1. Looks up the EC2's IamInstanceProfile via DescribeInstances. 2. Walks profile -> role via iam:GetInstanceProfile. 3. Attaches arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore (idempotent — aws iam attach-role-policy no-ops on re-attach). 4. Polls describe-instance-information up to 18x (~3 min) waiting for the agent to refresh creds. 5. If still offline after 3 min: prints both manual escape hatches (ssh + systemctl restart amazon-ssm-agent, OR aws ec2 reboot-instances). - Without --fix-ssm: same diagnostic message as before, plus a one-line hint pointing at --fix-ssm. No IAM mutation; safe default. - Handles the edge case of an instance with NO instance profile at all: prints associate-iam-instance-profile command, exits 1. Docs (docs/ci-setup.md §7.1): - Standard invocation now includes --fix-ssm on the first run. - New SSM remediation table maps each failure mode to what --fix-ssm covers vs what the operator must still do by hand (agent restart, reboot, install agent, VPC endpoint). Reported by operator after re-running the em-dash-fixed script; PingStatus=None on i-0135a8b2c53d14941.

… line set -u tripped on the role-already-exists branch because the log line referenced $sub_pattern as a shell variable, but it only exists as a jq --arg inside the trust-policy heredoc. Replace with ${REPO_SLUG} which is a real shell var. Latent since the first commit; surfaced now that the previous em-dash fix let the operator reach this branch on re-run.

…e when EC2 has none Operator re-ran with --fix-ssm; auto-remediation hit the third failure mode: the test broker EC2 has NO IAM instance profile attached at all. A common state on test brokers spun up by setup-cloud.sh --test — the broker process authenticates to AWS via static creds in /etc/agentkeys/broker.env, so an instance profile was never wired up. Script changes: - New create_and_associate_ssm_profile() called when DescribeInstances reports no IamInstanceProfile.Arn. Idempotent end-to-end: 1. iam get-role agentkeys-test-broker-ssm → create if missing (EC2 service trust policy, AmazonSSMManagedInstanceCore attached). 2. iam get-instance-profile agentkeys-test-broker-ssm → create if missing. 3. iam get-instance-profile (.Roles[0]) → add-role-to-instance-profile if empty; refuse to swap if the profile already holds a different role (operator must reconcile manually). 4. 15s sleep for IAM eventual consistency (per AWS docs). 5. ec2 describe-iam-instance-profile-associations → associate-iam-instance-profile if no existing association. - attach_ssm_managed_policy_if_missing() now dispatches to create_and_associate_ssm_profile() when no profile is present, instead of exiting 1 with manual instructions. Why this is safe to add to a running broker: - The broker app reads AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY from broker.env explicitly; static creds always win over IMDS-served creds. - Adding an IMDS instance profile cannot reduce capability — only the SSM agent (and not the broker app) will read from IMDS. Runbook fold-back (CLAUDE.md policy): docs/ci-setup.md §7.1 SSM remediation table now reads '(handled)' for the no-profile row, describing the dedicated role/profile that gets created.

…101 root cause) Operator hit the SSM-agent-not-installed failure mode after --fix-ssm created + associated the instance profile: 'Unit amazon-ssm-agent.service not found.' Some Ubuntu AMIs downstream of the AWS Marketplace base ship without amazon-ssm-agent. Without the agent, no IAM policy on earth lets the EC2 register with SSM, and the CI auto-deploy (issue #101) hangs. Per CLAUDE.md "Runbook-fix-fold-back policy": the cure for an operator-encountered failure is to upgrade the script that owns the broken step, not the script that surfaces the symptom. setup-broker-host.sh is the canonical entry point for the broker EC2 — the SSM agent install belongs there. Script changes (scripts/setup-broker-host.sh): - Idempotent SSM-agent install block right after the ec2-instance-connect block (same shape: ssm_unit_active() pre-check, install only on miss). - Two install paths in priority order: 1. snap install amazon-ssm-agent --classic (AWS-blessed on Ubuntu 22.04+; unit: snap.amazon-ssm-agent.amazon-ssm-agent.service) 2. .deb fallback from https://s3.$REGION.amazonaws.com/amazon-ssm-$REGION/latest/ (older / non-snap images; unit: amazon-ssm-agent.service) - Both paths converge on ssm_unit_active() returning true; subsequent --upgrade re-runs skip after that. Runbook fold-back (docs/ci-setup.md §7.1): - 'SSM Agent not installed' row of the remediation table now points operators at setup-broker-host.sh --test --yes for the structural fix, with a snap one-liner for one-shot manual recovery. Reported by operator after re-running provision-ci-deploy-role.sh --fix-ssm: the script created the profile + associated it, but the 3-min poll timed out because no SSM agent was running on the EC2.

…-not-registered The SSM verify block has been masking caller-permission gaps as 'instance not registered with SSM' (state=None) because of the 2>/dev/null || echo None silent fallback. Result: 4 rounds of phantom remediation against the EC2 (em-dash fix, --fix-ssm flag, auto-create instance profile, install amazon-ssm-agent on the EC2) — none of which were addressing the actual cause, which was that the operator's admin group lacks ssm:DescribeInstanceInformation. Fix: - Capture stderr into a tmpfile. - Grep for 'AccessDenied' specifically; on hit, die() with the exact one-liner the operator needs to attach AmazonSSMReadOnlyAccess to the AgentKeyAdmin group. - Empty stdout (no AccessDenied in stderr) = genuinely not registered; proceeds to the existing remediation paths. Diagnosed by running aws ssm describe-instance-information directly against i-0135a8b2c53d14941 as agentkeys-admin and seeing the AccessDenied that the script had been swallowing all along. Lesson (CLAUDE.md fold-back): when a sanity check uses 2>/dev/null, make sure the discarded stderr can't be the answer to the question the check is asking.

Operator hit 'HTTP 422: Unexpected inputs provided: [force_deploy_broker]' on first dry-run dispatch. Root cause is GHA's 'workflows are registered from the default branch' rule — same trap already documented in §6 ('Common first-run failure modes'), but I didn't repeat it in §7.3, so the operator hit it again. Fix: - §7.3 dispatch command now includes --ref <pr-branch>. - Distinguish pre-merge (--ref required, input lives on PR branch) from post-merge (--ref optional, input is on main). - Show the git rev-parse trick to look up the local branch name. Per CLAUDE.md runbook-fix-fold-back: every operator-encountered failure makes the runbook strictly more robust.

…nguish AccessDenied in workflow Deploy-test-broker's sanity-check step failed in the first dry-run with 'i-XXX is not SSM-managed'. Root cause: same swallowed-stderr trap as the local script, now in the workflow. The deploy role's inline policy granted SendCommand + GetCommandInvocation + ListCommandInvocations, but NOT DescribeInstanceInformation. AccessDenied was silently mapped to 'None', which the workflow interpreted as 'not SSM-managed'. Three fixes: 1. provision-ci-deploy-role.sh: PollCommandStatus statement now includes ssm:DescribeInstanceInformation. put-role-policy is idempotent so re-running the script refreshes the existing role's inline policy in place. 2. harness-ci.yml sanity-check: captures stderr separately, greps for AccessDenied, prints actionable remediation. Empty state (no AccessDenied) still means genuinely-not-registered. 3. docs/ci-setup.md §7.1: lists DescribeInstanceInformation in the inline-policy bullet + notes 'already provisioned? re-run; idempotent'. Per CLAUDE.md runbook-fix-fold-back: every operator-encountered failure makes the runbook + scripts strictly more robust. The defensive workflow step catches this in the future if the policy template ever drifts.

Deploy job's SSM script failed with 'cd: can't cd to /home/ubuntu/agentKeys' on the operator's test broker. The hardcoded path assumed the ubuntu-user clone layout, but the operator's box has the repo at a different location (the broker EC2 may have been bootstrapped from a non-default user or path). Fix: - Auto-discover loop tries TEST_BROKER_REPO_DIR override (new optional secret), then 7 common candidates (/home/ubuntu/agentKeys, /opt/agentkeys, /srv/agentkeys, /root/agentKeys, etc.). First candidate containing scripts/setup-broker-host.sh wins. - stat -c '%U' to discover the actual tree owner instead of hardcoding 'ubuntu' — covers the agentkeys / root / custom-user cases. - Fail loud with the override secret name if no candidate matches. Docs (docs/ci-setup.md §7.2): TEST_BROKER_REPO_DIR added to the secrets table with a note that it's optional + only needed when auto-discover prints 'could not locate'. Diagnosed via SSM command stderr after the upstream AccessDenied + perm gaps were resolved earlier in this PR.

…_DIR_OVERRIDE under set -u Two regressions caught by the second dispatch on the operator's box: 1. Auto-discover didn't find the repo. Operator confirmed the checkout lives at /home/agentkey/agentKeys — not in my original 8 candidates. Added /home/agentkey/agentKeys, /home/agentkey/agentkeys, and /home/agentkeys/agentKeys (covering the variations of the broker app user name). 2. Diagnostic echo referenced \$REPO_DIR_OVERRIDE under remote-shell set -u, which fires 'parameter not set' when the secret is unset. Fixed with a one-line default at the top of the remote script: REPO_DIR_OVERRIDE="${REPO_DIR_OVERRIDE:-...}" That makes subsequent references safe under set -u while still honoring an operator-set override.

…nder set -u After the auto-discover + repo path fix landed, the SSM-driven deploy got past clone, fetch, summary, apt deps, and rustup install — then hit 'HOME: unbound variable' at the rustup-env source line. SSM-driven remote shells (AWS-RunShellScript document) don't export HOME for the default user; setup-broker-host.sh uses 'set -euo pipefail', so the unset reference aborts. Fix: 'export HOME=${HOME:-$(getent passwd $(id -u) | cut -d: -f6)}' right after 'set -euo pipefail'. Resolves the running user's home dir from /etc/passwd when the env var is missing — portable across interactive ssh sessions (HOME already set) and SSM SendCommand (HOME unset). Same root cause family as the earlier IamInstanceProfile + agent-install fixes: bootstrap paths assume an interactive operator shell, but the CI auto-deploy path is the structural test for those assumptions.

… findings) Codex adversarial review (PR #102) confirmed the harness-e2e failure 'replacement transaction underpriced' is NOT caused by this PR. The broker/workers have no chain-write code paths reachable from the shipped feature set: audit-evm is feature-gated (Phase C, unshipped), the worker-audit's auto-flush only LOGS 'ready for on-chain appendRoot' without submitting txs, and setup-broker-host.sh has zero deployer-key access. The actual mechanism is harness-side nonce contention: concurrent harness-e2e runs (PR branch + workflow_dispatch + re-triggers) share ONE Heima test deployer wallet, and 'cast send' without --nonce defaults to 'latest' nonce derivation — which collides with pending mempool txs from a prior run. Two-layer fix: 1. .github/workflows/harness-ci.yml — second concurrency group on harness-e2e, scoped to 'heima-test-deployer-nonce' (not the ref), with cancel-in-progress: false so queued runs wait rather than cancel. The outer 'harness-ci-${{ github.ref }}' lock only serializes per-branch; this one serializes globally for the shared deployer wallet. 2. scripts/heima-fund-account.sh + scripts/heima-agent-create.sh — pass '--nonce $(cast nonce ADDR --block pending)' so cast computes the nonce against the PENDING block, not the latest confirmed. This defends against a stuck mempool tx that survives the previous run's exit (concurrency lock alone doesn't help — the tx is in the mempool, not in another job). Both layers also add a specific error message when the underpriced case fires, telling the operator to wait ~1min for the stuck tx to confirm or drop. Codex investigation log (1.4M tokens): scanned setup-broker-host.sh, broker-server, all 4 workers, env files, harness scripts, and workflow YAML. Found zero chain-write paths reachable from the deployed broker binary. Specific evidence cited in codex's response (crates/agentkeys- broker-server/src/handlers/cap.rs uses eth_call reads only; worker-audit main.rs:71 logs intent but doesn't submit; broker.env has no deployer key).

The detect-changes job fails on pull_request triggers with 'Error: Bad credentials' from dorny/paths-filter@v3. Root cause: the workflow's explicit 'permissions:' block grants only id-token + contents, which sets every other scope (including pull-requests) to 'none'. paths-filter on PR events always queries the REST API (/repos/.../pulls/N/files) — without pull-requests:read, the token is rejected. Earlier workflow_dispatch + push triggers passed because dispatch + push don't take the PR-API code path (paths-filter does local git diff against the previous push).

hanwencheng added 14 commits May 23, 2026 22:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

issue #101: path-conditional auto-deploy of test broker via SSM#102

issue #101: path-conditional auto-deploy of test broker via SSM#102
hanwencheng wants to merge 14 commits into
mainfrom
claude/adoring-bell-1b9ca8

hanwencheng commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hanwencheng commented May 23, 2026

Summary

Design deviation from the issue spec

Out of scope (per issue #101)

What landed

What did NOT land

Activation steps (for the operator after this PR merges)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant