issue #101: path-conditional auto-deploy of test broker via SSM#102
Open
hanwencheng wants to merge 14 commits into
Open
issue #101: path-conditional auto-deploy of test broker via SSM#102hanwencheng wants to merge 14 commits into
hanwencheng wants to merge 14 commits into
Conversation
Adds two new harness-ci.yml jobs that re-deploy the test broker EC2 when a PR touches broker-affecting paths, so harness-e2e validates the PR's actual broker code instead of whatever stale binary the EC2 happens to be running. - detect-changes (dorny/paths-filter@v3) computes broker_changed - deploy-test-broker assumes a new OIDC role and drives setup-broker-host.sh --test --yes on the EC2 via aws ssm send-command - scripts/provision-ci-deploy-role.sh provisions the IAM role with a trust policy scoped to repo:litentry/agentKeys:* and an inline policy scoped to one EC2 instance ARN (separation of duties from the existing TEST_OIDC_AWS_ROLE_ARN e2e role) - harness-e2e now runs AFTER deploy-test-broker (deviation from the issue's `needs: harness-e2e` spec, documented inline) so broker bugs introduced by a PR fail that PR's harness — not the next one's Auto-deploy is fully opt-in: skipped silently unless both OIDC_AWS_ROLE_ARN_DEPLOY and TEST_BROKER_INSTANCE_ID secrets are set. A workflow_dispatch input force_deploy_broker enables dry-run validation without a broker-path change. Out of scope for this PR (rollout plan step 7 in issue #101): auto-deploy of the test Heima EVM contracts. Defers to a follow-up because it needs the SECRETS_REWRITE_PAT token to update six TEST_*_ADDRESS_HEIMA secrets after each redeploy. Prod broker auto-deploy stays explicitly out of scope per CLAUDE.md "Remote broker host (single entry point)" — manual via bash scripts/setup-broker-host.sh --upgrade only. Docs: docs/ci-setup.md gains §7 with the provisioning recipe, secret list, dry-run procedure, and disarm path.
IAM CreateRole rejects descriptions outside [\t\n\r\x20-\x7e\xa1-\xff] with 'Value at description failed to satisfy constraint'. The em-dash in the original description string tripped this regex at provisioning time. Replace with an ASCII hyphen and add an inline warning comment so a future editor doesn't reintroduce Unicode here. Reported by operator running docs/ci-setup.md §7.1.
…olds into runbook
Operator hit the second failure mode in docs/ci-setup.md §7.1: the test
broker EC2 was not registered with SSM (PingStatus=None), so the script
exited before SendCommand could ever work. The fix had to be one round-
trip per CLAUDE.md runbook-fix-fold-back policy: a sanity check upgrade
that catches the same case for the next operator AND a manual override.
Script changes:
- New --fix-ssm flag. When passed AND PingStatus != Online, the script:
1. Looks up the EC2's IamInstanceProfile via DescribeInstances.
2. Walks profile -> role via iam:GetInstanceProfile.
3. Attaches arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
(idempotent — aws iam attach-role-policy no-ops on re-attach).
4. Polls describe-instance-information up to 18x (~3 min) waiting
for the agent to refresh creds.
5. If still offline after 3 min: prints both manual escape hatches
(ssh + systemctl restart amazon-ssm-agent, OR aws ec2 reboot-instances).
- Without --fix-ssm: same diagnostic message as before, plus a one-line
hint pointing at --fix-ssm. No IAM mutation; safe default.
- Handles the edge case of an instance with NO instance profile at all:
prints associate-iam-instance-profile command, exits 1.
Docs (docs/ci-setup.md §7.1):
- Standard invocation now includes --fix-ssm on the first run.
- New SSM remediation table maps each failure mode to what --fix-ssm
covers vs what the operator must still do by hand (agent restart,
reboot, install agent, VPC endpoint).
Reported by operator after re-running the em-dash-fixed script;
PingStatus=None on i-0135a8b2c53d14941.
… line
set -u tripped on the role-already-exists branch because the log line
referenced $sub_pattern as a shell variable, but it only exists as a
jq --arg inside the trust-policy heredoc. Replace with ${REPO_SLUG}
which is a real shell var.
Latent since the first commit; surfaced now that the previous
em-dash fix let the operator reach this branch on re-run.
…e when EC2 has none
Operator re-ran with --fix-ssm; auto-remediation hit the third failure
mode: the test broker EC2 has NO IAM instance profile attached at all.
A common state on test brokers spun up by setup-cloud.sh --test — the
broker process authenticates to AWS via static creds in
/etc/agentkeys/broker.env, so an instance profile was never wired up.
Script changes:
- New create_and_associate_ssm_profile() called when DescribeInstances
reports no IamInstanceProfile.Arn. Idempotent end-to-end:
1. iam get-role agentkeys-test-broker-ssm → create if missing
(EC2 service trust policy, AmazonSSMManagedInstanceCore attached).
2. iam get-instance-profile agentkeys-test-broker-ssm → create if
missing.
3. iam get-instance-profile (.Roles[0]) → add-role-to-instance-profile
if empty; refuse to swap if the profile already holds a different
role (operator must reconcile manually).
4. 15s sleep for IAM eventual consistency (per AWS docs).
5. ec2 describe-iam-instance-profile-associations → associate-iam-instance-profile
if no existing association.
- attach_ssm_managed_policy_if_missing() now dispatches to
create_and_associate_ssm_profile() when no profile is present, instead
of exiting 1 with manual instructions.
Why this is safe to add to a running broker:
- The broker app reads AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY from
broker.env explicitly; static creds always win over IMDS-served creds.
- Adding an IMDS instance profile cannot reduce capability — only the
SSM agent (and not the broker app) will read from IMDS.
Runbook fold-back (CLAUDE.md policy): docs/ci-setup.md §7.1 SSM
remediation table now reads '(handled)' for the no-profile row,
describing the dedicated role/profile that gets created.
…101 root cause) Operator hit the SSM-agent-not-installed failure mode after --fix-ssm created + associated the instance profile: 'Unit amazon-ssm-agent.service not found.' Some Ubuntu AMIs downstream of the AWS Marketplace base ship without amazon-ssm-agent. Without the agent, no IAM policy on earth lets the EC2 register with SSM, and the CI auto-deploy (issue #101) hangs. Per CLAUDE.md "Runbook-fix-fold-back policy": the cure for an operator-encountered failure is to upgrade the script that owns the broken step, not the script that surfaces the symptom. setup-broker-host.sh is the canonical entry point for the broker EC2 — the SSM agent install belongs there. Script changes (scripts/setup-broker-host.sh): - Idempotent SSM-agent install block right after the ec2-instance-connect block (same shape: ssm_unit_active() pre-check, install only on miss). - Two install paths in priority order: 1. snap install amazon-ssm-agent --classic (AWS-blessed on Ubuntu 22.04+; unit: snap.amazon-ssm-agent.amazon-ssm-agent.service) 2. .deb fallback from https://s3.$REGION.amazonaws.com/amazon-ssm-$REGION/latest/ (older / non-snap images; unit: amazon-ssm-agent.service) - Both paths converge on ssm_unit_active() returning true; subsequent --upgrade re-runs skip after that. Runbook fold-back (docs/ci-setup.md §7.1): - 'SSM Agent not installed' row of the remediation table now points operators at setup-broker-host.sh --test --yes for the structural fix, with a snap one-liner for one-shot manual recovery. Reported by operator after re-running provision-ci-deploy-role.sh --fix-ssm: the script created the profile + associated it, but the 3-min poll timed out because no SSM agent was running on the EC2.
…-not-registered The SSM verify block has been masking caller-permission gaps as 'instance not registered with SSM' (state=None) because of the 2>/dev/null || echo None silent fallback. Result: 4 rounds of phantom remediation against the EC2 (em-dash fix, --fix-ssm flag, auto-create instance profile, install amazon-ssm-agent on the EC2) — none of which were addressing the actual cause, which was that the operator's admin group lacks ssm:DescribeInstanceInformation. Fix: - Capture stderr into a tmpfile. - Grep for 'AccessDenied' specifically; on hit, die() with the exact one-liner the operator needs to attach AmazonSSMReadOnlyAccess to the AgentKeyAdmin group. - Empty stdout (no AccessDenied in stderr) = genuinely not registered; proceeds to the existing remediation paths. Diagnosed by running aws ssm describe-instance-information directly against i-0135a8b2c53d14941 as agentkeys-admin and seeing the AccessDenied that the script had been swallowing all along. Lesson (CLAUDE.md fold-back): when a sanity check uses 2>/dev/null, make sure the discarded stderr can't be the answer to the question the check is asking.
Operator hit 'HTTP 422: Unexpected inputs provided: [force_deploy_broker]'
on first dry-run dispatch. Root cause is GHA's 'workflows are registered
from the default branch' rule — same trap already documented in §6
('Common first-run failure modes'), but I didn't repeat it in §7.3, so
the operator hit it again.
Fix:
- §7.3 dispatch command now includes --ref <pr-branch>.
- Distinguish pre-merge (--ref required, input lives on PR branch) from
post-merge (--ref optional, input is on main).
- Show the git rev-parse trick to look up the local branch name.
Per CLAUDE.md runbook-fix-fold-back: every operator-encountered
failure makes the runbook strictly more robust.
…nguish AccessDenied in workflow Deploy-test-broker's sanity-check step failed in the first dry-run with 'i-XXX is not SSM-managed'. Root cause: same swallowed-stderr trap as the local script, now in the workflow. The deploy role's inline policy granted SendCommand + GetCommandInvocation + ListCommandInvocations, but NOT DescribeInstanceInformation. AccessDenied was silently mapped to 'None', which the workflow interpreted as 'not SSM-managed'. Three fixes: 1. provision-ci-deploy-role.sh: PollCommandStatus statement now includes ssm:DescribeInstanceInformation. put-role-policy is idempotent so re-running the script refreshes the existing role's inline policy in place. 2. harness-ci.yml sanity-check: captures stderr separately, greps for AccessDenied, prints actionable remediation. Empty state (no AccessDenied) still means genuinely-not-registered. 3. docs/ci-setup.md §7.1: lists DescribeInstanceInformation in the inline-policy bullet + notes 'already provisioned? re-run; idempotent'. Per CLAUDE.md runbook-fix-fold-back: every operator-encountered failure makes the runbook + scripts strictly more robust. The defensive workflow step catches this in the future if the policy template ever drifts.
Deploy job's SSM script failed with 'cd: can't cd to /home/ubuntu/agentKeys' on the operator's test broker. The hardcoded path assumed the ubuntu-user clone layout, but the operator's box has the repo at a different location (the broker EC2 may have been bootstrapped from a non-default user or path). Fix: - Auto-discover loop tries TEST_BROKER_REPO_DIR override (new optional secret), then 7 common candidates (/home/ubuntu/agentKeys, /opt/agentkeys, /srv/agentkeys, /root/agentKeys, etc.). First candidate containing scripts/setup-broker-host.sh wins. - stat -c '%U' to discover the actual tree owner instead of hardcoding 'ubuntu' — covers the agentkeys / root / custom-user cases. - Fail loud with the override secret name if no candidate matches. Docs (docs/ci-setup.md §7.2): TEST_BROKER_REPO_DIR added to the secrets table with a note that it's optional + only needed when auto-discover prints 'could not locate'. Diagnosed via SSM command stderr after the upstream AccessDenied + perm gaps were resolved earlier in this PR.
…_DIR_OVERRIDE under set -u
Two regressions caught by the second dispatch on the operator's box:
1. Auto-discover didn't find the repo. Operator confirmed the checkout
lives at /home/agentkey/agentKeys — not in my original 8 candidates.
Added /home/agentkey/agentKeys, /home/agentkey/agentkeys, and
/home/agentkeys/agentKeys (covering the variations of the broker app
user name).
2. Diagnostic echo referenced \$REPO_DIR_OVERRIDE under remote-shell
set -u, which fires 'parameter not set' when the secret is unset.
Fixed with a one-line default at the top of the remote script:
REPO_DIR_OVERRIDE="${REPO_DIR_OVERRIDE:-...}"
That makes subsequent references safe under set -u while still
honoring an operator-set override.
…nder set -u
After the auto-discover + repo path fix landed, the SSM-driven deploy
got past clone, fetch, summary, apt deps, and rustup install — then hit
'HOME: unbound variable' at the rustup-env source line. SSM-driven
remote shells (AWS-RunShellScript document) don't export HOME for the
default user; setup-broker-host.sh uses 'set -euo pipefail', so the
unset reference aborts.
Fix: 'export HOME=${HOME:-$(getent passwd $(id -u) | cut -d: -f6)}'
right after 'set -euo pipefail'. Resolves the running user's home dir
from /etc/passwd when the env var is missing — portable across interactive
ssh sessions (HOME already set) and SSM SendCommand (HOME unset).
Same root cause family as the earlier IamInstanceProfile + agent-install
fixes: bootstrap paths assume an interactive operator shell, but the CI
auto-deploy path is the structural test for those assumptions.
… findings) Codex adversarial review (PR #102) confirmed the harness-e2e failure 'replacement transaction underpriced' is NOT caused by this PR. The broker/workers have no chain-write code paths reachable from the shipped feature set: audit-evm is feature-gated (Phase C, unshipped), the worker-audit's auto-flush only LOGS 'ready for on-chain appendRoot' without submitting txs, and setup-broker-host.sh has zero deployer-key access. The actual mechanism is harness-side nonce contention: concurrent harness-e2e runs (PR branch + workflow_dispatch + re-triggers) share ONE Heima test deployer wallet, and 'cast send' without --nonce defaults to 'latest' nonce derivation — which collides with pending mempool txs from a prior run. Two-layer fix: 1. .github/workflows/harness-ci.yml — second concurrency group on harness-e2e, scoped to 'heima-test-deployer-nonce' (not the ref), with cancel-in-progress: false so queued runs wait rather than cancel. The outer 'harness-ci-${{ github.ref }}' lock only serializes per-branch; this one serializes globally for the shared deployer wallet. 2. scripts/heima-fund-account.sh + scripts/heima-agent-create.sh — pass '--nonce $(cast nonce ADDR --block pending)' so cast computes the nonce against the PENDING block, not the latest confirmed. This defends against a stuck mempool tx that survives the previous run's exit (concurrency lock alone doesn't help — the tx is in the mempool, not in another job). Both layers also add a specific error message when the underpriced case fires, telling the operator to wait ~1min for the stuck tx to confirm or drop. Codex investigation log (1.4M tokens): scanned setup-broker-host.sh, broker-server, all 4 workers, env files, harness scripts, and workflow YAML. Found zero chain-write paths reachable from the deployed broker binary. Specific evidence cited in codex's response (crates/agentkeys- broker-server/src/handlers/cap.rs uses eth_call reads only; worker-audit main.rs:71 logs intent but doesn't submit; broker.env has no deployer key).
The detect-changes job fails on pull_request triggers with 'Error: Bad credentials' from dorny/paths-filter@v3. Root cause: the workflow's explicit 'permissions:' block grants only id-token + contents, which sets every other scope (including pull-requests) to 'none'. paths-filter on PR events always queries the REST API (/repos/.../pulls/N/files) — without pull-requests:read, the token is rejected. Earlier workflow_dispatch + push triggers passed because dispatch + push don't take the PR-API code path (paths-filter does local git diff against the previous push).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #101 (broker-only scope; contract auto-deploy deferred to a follow-up per the issue's rollout plan step 7).
Summary
detect-changesjob in.github/workflows/harness-ci.ymlusesdorny/paths-filter@v3to computebroker_changedon every PR/push. The filter conservatively includes all workspace-shared crates the broker links against (agentkeys-types,agentkeys-core,agentkeys-signer-protocol, broker + worker crates),setup-broker-host.sh*,broker.env*, andCargo.lock— false-negative-resistant per the path-filter caveat in the issue.deploy-test-brokerjob assumes a new OIDC rolegithub-actions-agentkeys-deployand drivessetup-broker-host.sh --test --yeson the test broker EC2 viaaws ssm send-command. SendCommand is followed by a 10s-poll loop (15min cap) that surfaces stdout / stderr on success or failure.scripts/provision-ci-deploy-role.shprovisions the role idempotently per CLAUDE.md "Idempotent remote-setup rule (CLOUD)": pre-check role existence, refresh trust policy on re-run, attachagentkeys-ci-deploy-ssminline policy with SendCommand scoped to one document + one EC2 instance ARN, verify the EC2's SSM agent isPingStatus=Onlinewith concrete remediation hints if not.harness-e2enow needsdeploy-test-brokerand runsalways() && (deploy success || deploy skipped)— so when deploy is skipped (no broker path changes, or auto-deploy not activated) the harness still runs against the existing broker binary.Design deviation from the issue spec
The issue's example YAML has
deploy-test-broker: needs: harness-e2e(deploy AFTER harness validates the current broker). This PR inverts that to deploy FIRST, then harness-e2e, because the failure mode the issue describes — "harness scripts at version B vs broker binary at version A → spurious pass or confusing failure" — is only caught if the harness runs against the freshly-deployed broker. Deploying second means a broker bug introduced by PR-N leaks to PR-(N+1)'s harness, which is what the issue is trying to prevent.Trade-off: a broker bug that crashes on startup fails the deploy and skips the harness — also the right signal, since there's nothing meaningful to test against a broken broker.
The inline workflow comment documents the divergence so a reviewer can flip it back if they prefer the issue's ordering — both are one-line changes to the
needs:graph.Out of scope (per issue #101)
deploy-test-contractsjob — deferred to a follow-up PR per the issue's rollout plan step 7. Contract redeploys mint new addresses and require aSECRETS_REWRITE_PATtoken to update sixTEST_*_ADDRESS_HEIMAsecrets after each deploy. More risk than the broker case → separate PR.What landed
What did NOT land
Activation steps (for the operator after this PR merges)
```
TEST_BROKER_INSTANCE_ID=$(aws ec2 describe-instances --region "$REGION" \
--filters "Name=tag:Name,Values=agentkeys-test-broker" \
--query 'Reservations[0].Instances[0].InstanceId' --output text)
```
```
bash scripts/provision-ci-deploy-role.sh \
--test-broker-instance-id "$TEST_BROKER_INSTANCE_ID" \
--env-file scripts/operator-workstation.test.env
```
```
gh secret set OIDC_AWS_ROLE_ARN_DEPLOY --body "$(aws iam get-role --role-name github-actions-agentkeys-deploy --query 'Role.Arn' --output text)"
gh secret set TEST_BROKER_INSTANCE_ID --body "$TEST_BROKER_INSTANCE_ID"
```
Test plan