16 commits
d9c832b
issue #101: path-conditional auto-deploy of test broker via SSM
hanwencheng May 23, 2026
16eac7a
fix(provision-ci-deploy-role): strip non-ASCII from --description
hanwencheng May 23, 2026
a724745
fix(provision-ci-deploy-role): --fix-ssm auto-attaches SSM policy + f…
hanwencheng May 23, 2026
d6ce74b
fix(provision-ci-deploy-role): unbound $sub_pattern in idempotent log…
hanwencheng May 23, 2026
b4b38b4
fix(provision-ci-deploy-role): --fix-ssm auto-creates instance profil…
hanwencheng May 23, 2026
e0f92a3
fix(setup-broker-host): install amazon-ssm-agent at bootstrap (issue …
hanwencheng May 23, 2026
afb6e02
fix(provision-ci-deploy-role): distinguish AccessDenied from instance…
hanwencheng May 23, 2026
7fae303
docs(ci-setup §7.3): require --ref on pre-merge gh workflow run dispatch
hanwencheng May 23, 2026
f666e75
fix(ci): grant ssm:DescribeInstanceInformation to deploy role + disti…
hanwencheng May 23, 2026
e1b4f40
fix(deploy-test-broker): auto-discover agentKeys repo path on EC2
hanwencheng May 23, 2026
72b43c8
fix(deploy-test-broker): add /home/agentkey paths + safe-default REPO…
hanwencheng May 23, 2026
0e8afde
fix(setup-broker-host): default HOME so SSM-driven invocations work u…
hanwencheng May 23, 2026
143f1df
fix(harness): heima-test-deployer nonce contention (codex adversarial…
hanwencheng May 23, 2026
d5b7c49
fix(ci): add pull-requests:read for dorny/paths-filter on PR events
hanwencheng May 23, 2026
3c3dff6
docs: add broker + local operator dev guide
hanwencheng May 24, 2026
99e2519
docs(readme): split into 'For humans' + 'For AI coding agents' sections
hanwencheng May 24, 2026
354 changes: 350 additions & 4 deletions .github/workflows/harness-ci.yml

Large diffs are not rendered by default.

66 changes: 57 additions & 9 deletions README.md
@@ -4,16 +4,20 @@ Credential broker for AI agents. A master (human) delegates scoped, revocable ac

Status: pre-v0. Stage 5 in progress (see `harness/progress.json`).

## What it does
Architecture, language choices, trust boundaries: [`docs/arch.md`](docs/arch.md).

---

## 👤 For humans

### What it does

- **Master CLI** (`agentkeys`) — runs on your laptop; owns a session key in the OS keychain; approves pair/recover/scope-change requests.
- **Sandbox daemon** (`agentkeys-daemon`) — runs inside the agent sandbox; brokers credential reads over MCP + a Unix socket; never exposes raw keys to the agent.
- **Provisioner** (`agentkeys-provisioner` + `provisioner-scripts`) — Rust orchestrator drives TypeScript/Playwright scrapers to sign up for services and hand the resulting API key back through the trust boundary.
- **Mock backend** (`agentkeys-mock-server`) — v0-only; mirrors the Heima parachain API so we can build end-to-end before the chain integration lands.

Architecture, language choices, trust boundaries: [`docs/arch.md`](docs/arch.md).

## Workspace layout
### Workspace layout

```
crates/
```
@@ -31,7 +35,7 @@ harness/ stage-gated build harness + progress

~80% Rust, 100% of the security-critical path in Rust. TypeScript is confined to browser automation and (post-MVP) the Web GUI frontend.

## Build & test
### Build & test

```
cargo build
@@ -50,12 +54,56 @@ cargo test -p agentkeys-daemon -p agentkeys-mcp
cargo test -p agentkeys-provisioner
```

## Development
### First-machine setup

Fresh laptop? Start with [`docs/dev-setup.md`](docs/dev-setup.md) — it walks you through rustup, jj, Node, AWS CLI, browser, and runs the workspace smoke tests.

Staged build plan in [`docs/spec/plans/development-stages.md`](docs/spec/plans/development-stages.md). Each stage has a `harness/stage-N-done.sh` gate that must exit 0 before the stage is marked complete. Contributor workflow: [`CLAUDE.md`](CLAUDE.md).
### Inner-loop dev

Version control uses [jj (Jujutsu)](https://github.com/jj-vcs/jj), not raw git.
Iterating on the broker, signer, mock-server, or operator-side scripts? [`docs/spec/broker-and-operator-dev-guide.md`](docs/spec/broker-and-operator-dev-guide.md) covers the local edit-build-test loop: which process to run on which port, how to point harness scripts at `localhost`, how to use `harness/v2-stage*-demo.sh` for resumable step-by-step testing.

## License
### License

Dual-licensed under **MIT OR Apache-2.0**, at your choice.

---

## 🤖 For AI coding agents

**You must read these before making any change.** They override defaults from your training data and cover the project-specific guardrails.

| Read | Why |
|---|---|
| [`CLAUDE.md`](CLAUDE.md) | Project-specific rules: docs layout, /create-pr workflow in worktrees, terminology-source-of-truth, branch push policy, idempotent-remote-setup invariants, runbook-fix-fold-back policy. **Read first, every session.** |
| [`docs/arch.md`](docs/arch.md) | Single source of truth for component inventory (K1–K11), trust boundaries, HDKD actor tree, per-actor binding ceremonies. When the per-doc detail outgrows arch.md, link outward — never duplicate. |
| [`docs/spec/plans/development-stages.md`](docs/spec/plans/development-stages.md) | The 8-stage build plan. Each stage has a `harness/stage-N-done.sh` gate; never self-grade — run the gate. |
| [`docs/spec/plans/execution-plan.md`](docs/spec/plans/execution-plan.md) | Orchestration runbook (ralph, team, ultraqa workflows). |
| [`docs/spec/broker-and-operator-dev-guide.md`](docs/spec/broker-and-operator-dev-guide.md) | Inner edit-build-test loop for broker + operator-side code. Use this before suggesting changes to the broker's run-time behavior. |

### Hard rules (from CLAUDE.md)

These are non-negotiable. Violating them produces broken PRs / corrupted state.

- **Use `jj` (Jujutsu), never raw `git`.** Common mappings in CLAUDE.md. The one exception: inside a Claude Code `.claude/worktrees/<name>/` worktree, the initial commit must use `git` (jj can't colocate in a git-worktree); then `cd` to the main repo and push via `jj git push`. Never include `Co-Authored-By:` lines in those commits.
- **Branch `evm` pushes immediately.** On `evm`, push after every `jj describe` — the remote broker host pulls from `origin/evm` to redeploy. "I'll push at the end" silently breaks deploys.
- **Diagnose before edit.** Reproduce the failure locally first; isolate the layer (shell / client / doc / broker code / network). If the cause is local to the operator's shell, respond with the one-line fix — don't edit the repo.
- **Land the fix everywhere.** Once a local repro proves a fix is correct, land it the same turn — search the repo for every affected file, commit, push to `origin/evm`. Don't stop at "verified locally" or "fixed one file."
- **Runbook fix fold-back.** When an operator hits a runbook failure, two things land in the same turn: (1) the targeted fix, (2) a revision to the runbook so the next operator doesn't hit the same trap.
- **No hardcoded values.** Use env var + default, CLI flag + default, or a config file. If you must hardcode temporarily, log it in [`hardcoded.md`](hardcoded.md) with file:line + reason + what would unblock dynamic.
- **Idempotent remote setup.** Every script that mutates remote state (AWS / Heima / CI / VM / DNS) must exit 0 on re-run without re-applying. Pre-check with `get-*` before mutating; log `ok | skip <reason> | fail <reason>`.
- **Plan completion is all-or-nothing.** When implementing a plan, every numbered step must be done — or the PR summary's "What did NOT land" section must explicitly list what was skipped and why.
- **Terminology source of truth.** Never invent a new name for a concept arch.md already names. If you find divergence, fix it in the same commit or document the alias in arch.md's "Canonical names" section.
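As an illustration of the "Idempotent remote setup" and "No hardcoded values" rules above, here is a minimal sketch of the pre-check-then-mutate pattern. Everything in it is hypothetical: `get_setting`, `put_setting`, `ensure_setting`, and the `STATE_FILE` variable are stand-ins, with a local file playing the role of remote state. A real script would call a `get-*` API before any mutating call.

```bash
# Hypothetical stand-in for remote state: env var + default, never hardcoded.
STATE_FILE="${STATE_FILE:-/tmp/agentkeys-demo-state}"

# Stand-in for a read-only `get-*` call; fails when nothing is set yet.
get_setting() { [ -f "$STATE_FILE" ] && cat "$STATE_FILE"; }

# Stand-in for the mutating call.
put_setting() { printf '%s\n' "$1" > "$STATE_FILE"; }

# Pre-check, mutate only on a difference, and log ok | skip | fail.
ensure_setting() {
  want="$1"
  have="$(get_setting || true)"
  if [ "$have" = "$want" ]; then
    echo "skip already set to $want"
    return 0
  fi
  if put_setting "$want"; then
    echo "ok set to $want"
  else
    echo "fail could not write $want"
    return 1
  fi
}
```

Calling `ensure_setting on` twice mutates once and skips the second time, and both calls exit 0, which is the property the rule asks for.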

### Per-session protocol

1. `jj log --limit 10 && cat harness/progress.json && bash harness/init.sh $(jq -r .current_stage harness/progress.json)`
2. Read the stage contract for the current stage in `docs/spec/plans/development-stages.md`.
3. Pick the HIGHEST-PRIORITY incomplete deliverable from `harness/features.json`.
4. Implement ONE deliverable, run `cargo test -p <crate>`, `jj describe`, update `harness/features.json`, `jj new`.

### Single entry points

Don't reach for ad-hoc `systemctl`, `scp`, or `forge script` — these are wrapped:

- **Remote broker host** (binary upgrades, systemd, nginx, env tweaks): `bash scripts/setup-broker-host.sh`
- **Heima chain bring-up** (deploy, binding ceremonies, scope grants, K11 enroll, audit-row append, worker smoke): `bash scripts/setup-heima.sh`
116 changes: 116 additions & 0 deletions docs/ci-setup.md
@@ -365,6 +365,122 @@ gh workflow run harness-ci.yml --repo litentry/agentKeys --field stage=3

When the workflow passes against the test stack, CI is live. Every subsequent push to a PR triggers it; you're done.

### 7. (Optional) Wire auto-deploy of the test broker (issue [#101](https://github.com/litentry/agentKeys/issues/101))

Without this step, the workflow validates against the **already-deployed** test broker. If a PR changes broker code (`crates/agentkeys-broker-server/**`, `crates/agentkeys-worker-*/**`, `crates/agentkeys-signer-protocol/**`, `scripts/setup-broker-host.sh*`, or any workspace-shared crate the broker links against), the test broker binary silently drifts from the PR's source tree — the harness then exercises *old* broker code against *new* harness scripts, producing either spurious passes or confusing failures.
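To make the path list concrete, the following is a rough sketch of a broker-path check, not the workflow's actual filter (the workflow uses `dorny/paths-filter`, whose filter list is not shown here). The function names `is_broker_path` and `broker_changed` are hypothetical, and the "workspace-shared crate" case is left out because those crates are not enumerated above.

```bash
# Hypothetical helper: succeeds when a changed path is broker-affecting.
# The globs mirror the paths named in the paragraph above.
is_broker_path() {
  case "$1" in
    crates/agentkeys-broker-server/*) return 0 ;;
    crates/agentkeys-worker-*/*) return 0 ;;
    crates/agentkeys-signer-protocol/*) return 0 ;;
    scripts/setup-broker-host.sh*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads changed paths on stdin, one per line; prints true or false.
broker_changed() {
  while IFS= read -r changed; do
    if is_broker_path "$changed"; then
      echo true
      return 0
    fi
  done
  echo false
}
```

Feeding it a list of changed files, for example `printf '%s\n' README.md scripts/setup-broker-host.sh | broker_changed`, yields the boolean that a deploy gate could consume.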

Step 7 wires a second OIDC role (`github-actions-agentkeys-deploy`) plus two new GitHub secrets. When activated, the workflow's `detect-changes` job sees broker-affecting paths in the diff, the `deploy-test-broker` job assumes that role, and `aws ssm send-command` drives `setup-broker-host.sh --test --yes` on the test EC2 — re-deploying the broker so `harness-e2e` validates the PR's actual code. The deploy job is **gated three ways**:

1. `paths-filter` boolean (no broker code changed → skip).
2. Both deploy secrets present (`OIDC_AWS_ROLE_ARN_DEPLOY` + `TEST_BROKER_INSTANCE_ID`).
3. `preflight.outputs.should_run == 'true'` (test infra fully wired).

If any gate fails, the deploy job is **skipped, not failed** — `harness-e2e` still runs against the existing broker binary. So this step is fully opt-in; partial activation is safe.
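The three gates can be sketched as a single decision function. This is an illustration under assumptions, not the workflow's YAML: `deploy_decision` and its argument order are hypothetical, and the real gating lives in job-level `if:` conditions that are not shown here.

```bash
# Hypothetical sketch of the three gates.
#   $1 = paths-filter boolean (true/false)
#   $2 = deploy role ARN (empty when the secret is unset)
#   $3 = test broker instance ID (empty when the secret is unset)
#   $4 = preflight should_run (true/false)
deploy_decision() {
  if [ "$1" != "true" ]; then
    echo "skip no broker paths changed"
    return 0
  fi
  if [ -z "$2" ] || [ -z "$3" ]; then
    echo "skip deploy secrets missing"
    return 0
  fi
  if [ "$4" != "true" ]; then
    echo "skip preflight not ready"
    return 0
  fi
  echo "deploy"
}
```

Every branch returns 0, which mirrors the "skipped, not failed" behavior: a closed gate never turns the run red.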

#### 7.1 Run the provisioning script

```bash
awsp agentkeys-admin
# Look up the test broker EC2 instance ID (one-shot — pin it once):
TEST_BROKER_INSTANCE_ID=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=ip-address,Values=$(curl -sS "https://dns.google/resolve?name=$BROKER_HOST&type=A" | jq -r '.Answer[0].data')" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)
echo "$TEST_BROKER_INSTANCE_ID" # → i-xxxxxxxxxxxxxxxxx

# Idempotent provisioning — safe to re-run. Use --fix-ssm on the FIRST run
# so the script auto-attaches AmazonSSMManagedInstanceCore to the broker EC2's
# instance profile if it's missing (a fresh EC2 commonly lacks this policy).
bash scripts/provision-ci-deploy-role.sh \
  --test-broker-instance-id "$TEST_BROKER_INSTANCE_ID" \
  --env-file scripts/operator-workstation.test.env \
  --fix-ssm
```

The script:

- Creates / refreshes the `github-actions-agentkeys-deploy` IAM role with a federated trust policy on the GitHub Actions OIDC provider, scoped to `repo:litentry/agentKeys:*` (any branch in this repo can trigger; the workflow's path filter + preflight gate further restrict when the role is actually used).
- Attaches an inline policy `agentkeys-ci-deploy-ssm` with:
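- `ssm:SendCommand` on `document/AWS-RunShellScript` + the one instance ARN. If the role's session creds leaked, a third party could reach only the test EC2. Because `AWS-RunShellScript` runs whatever commands it is given, that reach is shell access on that one instance rather than only a re-run of `setup-broker-host.sh`; the damage stays contained to the test environment, and the broker itself is recoverable by re-running the idempotent, `terraform apply`-style `setup-broker-host.sh`.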
- `ssm:GetCommandInvocation` / `ssm:ListCommandInvocations` / `ssm:DescribeInstanceInformation` for status polling + the workflow's pre-deploy sanity check.
- `ec2:DescribeInstances` scoped to the one instance ID, for the workflow's pre-deploy sanity check.

> Already provisioned the role before `ssm:DescribeInstanceInformation` was added to the policy template? Re-run the provisioning script. `put-role-policy` is idempotent — it overwrites the inline policy with the current source-of-truth shape, picking up any added permissions.
- Verifies the test EC2 is registered with SSM (`PingStatus = Online`). With `--fix-ssm`, auto-remediates the common "instance profile is missing AmazonSSMManagedInstanceCore" case by attaching the policy and polling for up to 3 min for the SSM agent to refresh its creds. Without `--fix-ssm`, just reports the failure with manual fix instructions.

**SSM remediation modes (what `--fix-ssm` covers, what it doesn't):**

| Failure | What `--fix-ssm` does | What it CAN'T fix automatically |
|---|---|---|
| Instance profile missing `AmazonSSMManagedInstanceCore` | Attaches the policy, polls for Online | (handled) |
| Policy already attached, agent process running with stale creds | Polls until agent refreshes (~1-3 min typical) | If poll times out: SSH + `sudo systemctl restart amazon-ssm-agent`, OR `aws ec2 reboot-instances …` |
| Instance has NO instance profile at all | Creates a dedicated `agentkeys-test-broker-ssm` role + instance profile (EC2 trust + `AmazonSSMManagedInstanceCore`) and associates it with the EC2. IMDS surfaces the new creds within ~30s. Safe because the broker's app-layer AWS access uses static creds from `broker.env`, not IMDS — adding IMDS-served creds can only ADD capability for the SSM agent, not displace anything. | (handled) |
| SSM Agent not installed (no `amazon-ssm-agent` unit) | Reports state; can't reach the box to install (operator's laptop has no SSH-into-EC2 capability from the provision script) | Re-run `bash scripts/setup-broker-host.sh --test --yes` on the EC2 — it now installs `amazon-ssm-agent` (snap preferred, .deb fallback) as part of broker bootstrap. One-shot manual recovery if you don't want to re-run the full setup: `ssh test-broker 'sudo snap install amazon-ssm-agent --classic && sudo systemctl enable --now snap.amazon-ssm-agent.amazon-ssm-agent.service'` |
| Private VPC subnet without an SSM VPC endpoint | Reports state | Operator wires the VPC endpoint (unlikely for a public-IP broker, but possible) |

Re-running the script after any of the operator-side fixes is safe (idempotent — every step is `get-*` pre-checked before any mutation).
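The "poll until Online" behavior described above could look roughly like the sketch below. It is not the provisioning script's code: `wait_for_online`, `ping_status`, and the `SSM_POLL_TRIES` / `SSM_POLL_DELAY` variables are hypothetical names, and the defaults (18 checks, 10 s apart) are just one way to get close to the 3-minute budget.

```bash
# One possible status probe using the AWS CLI; defined here, not called.
ping_status() {
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$TEST_BROKER_INSTANCE_ID" \
    --query 'InstanceInformationList[0].PingStatus' --output text
}

# Hypothetical poller: $1 is a command that prints the current PingStatus.
# Tries and delay come from arguments, then env vars, then defaults.
wait_for_online() {
  check_cmd="$1"
  tries="${2:-${SSM_POLL_TRIES:-18}}"
  delay="${3:-${SSM_POLL_DELAY:-10}}"
  i=1
  while [ "$i" -le "$tries" ]; do
    ping="$($check_cmd 2>/dev/null || true)"
    if [ "$ping" = "Online" ]; then
      echo "ok online after $i checks"
      return 0
    fi
    # Sleep only between checks, not after the last one.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "fail still not Online after $tries checks"
  return 1
}
```

With valid credentials, `wait_for_online ping_status` would poll the real instance; passing the probe as an argument keeps the loop testable with a stub.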

#### 7.2 Set the two new repo secrets

```bash
# Print the deploy role ARN you just provisioned (script also prints this):
role_arn=$(aws iam get-role --role-name github-actions-agentkeys-deploy \
  --query 'Role.Arn' --output text)

gh secret set OIDC_AWS_ROLE_ARN_DEPLOY --repo litentry/agentKeys --body "$role_arn"
gh secret set TEST_BROKER_INSTANCE_ID --repo litentry/agentKeys --body "$TEST_BROKER_INSTANCE_ID"
```

| Secret | Purpose |
|---|---|
| `OIDC_AWS_ROLE_ARN_DEPLOY` | ARN of `github-actions-agentkeys-deploy` — assumed by the `deploy-test-broker` job via GitHub Actions OIDC. |
| `TEST_BROKER_INSTANCE_ID` | EC2 instance ID (`i-…`) hosting `test-broker.${ZONE}`. The deploy role's inline policy is scoped to *this single instance*. |
| `TEST_BROKER_REPO_DIR` | **Optional.** Absolute path of the agentKeys git checkout on the EC2 (e.g. `/home/ubuntu/agentKeys`). The deploy workflow auto-discovers across common candidates (`/home/ubuntu/agentKeys`, `/home/ubuntu/agentkeys`, `/opt/agentkeys`, `/srv/agentkeys`, `/root/agentKeys`), so this only needs to be set when the operator cloned to a non-standard path and the workflow's auto-discover step prints `could not locate the agentKeys checkout`. |
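The auto-discovery described for `TEST_BROKER_REPO_DIR` can be sketched as follows. This is an assumption-laden illustration rather than the workflow's actual step: `find_repo_dir` is a hypothetical name, and treating "contains a `.git` entry" as the test for a checkout is a guess at how a candidate is validated.

```bash
# Hypothetical sketch: $1 is an explicit override (may be empty); the
# remaining arguments are candidate directories, tried in order.
find_repo_dir() {
  override="$1"
  shift
  if [ -n "$override" ]; then
    echo "$override"
    return 0
  fi
  for dir in "$@"; do
    # Assumption: a checkout is recognized by its .git entry.
    if [ -e "$dir/.git" ]; then
      echo "$dir"
      return 0
    fi
  done
  echo "could not locate the agentKeys checkout" >&2
  return 1
}
```

A call such as `find_repo_dir "${TEST_BROKER_REPO_DIR:-}" /home/ubuntu/agentKeys /home/ubuntu/agentkeys /opt/agentkeys /srv/agentkeys /root/agentKeys` prefers the secret when it is set and falls back to the candidate list otherwise.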

#### 7.3 Dry-run validate

Trigger the workflow manually with `force_deploy_broker=true` so the deploy fires regardless of whether the latest commit touched broker paths.

**Pre-merge — `--ref` is required.** `gh workflow run` reads the workflow definition from the *default branch* (`main`) unless you tell it otherwise. Since the `force_deploy_broker` input lives on the PR branch, dispatching without `--ref` fails with `HTTP 422: Unexpected inputs provided: ["force_deploy_broker"]`. Pass `--ref` so GHA reads the workflow YAML (and its inputs) from the PR branch instead:

```bash
gh workflow run harness-ci.yml --repo litentry/agentKeys \
  --ref claude/adoring-bell-1b9ca8 \
  --field stage=1 \
  --field force_deploy_broker=true
```

Replace `claude/adoring-bell-1b9ca8` with your actual PR branch name (`git rev-parse --abbrev-ref HEAD` if you're on it locally).

**Post-merge — `--ref` is optional.** Once this PR is on `main`, dispatching without `--ref` will work because the input is part of the default-branch workflow definition. (The `--ref` form still works and lets you target any branch.)

Then in the run logs:

- `deploy-test-broker` should show `SSM agent online on i-…` (sanity check passed).
- The `SendCommand` step prints the command ID; the next step polls until `Success`.
- On success: the tail of `StandardOutputContent` shows `setup-broker-host.sh` finishing cleanly (`ok systemd unit … active`, `ok nginx running`, etc.).
- On failure: stdout + stderr are dumped to the GHA log. The most common cause is `git checkout` failing on the EC2 because the source tree doesn't have the PR branch fetched — fix by ssh-ing into the box and running `sudo -u ubuntu git fetch --prune origin` once.
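For readers who want to reason about the polling step, here is a small sketch that maps an SSM command status to a next action. The status names are the ones `GetCommandInvocation` reports; the function name `classify_ssm_status` and the three action words are hypothetical, and treating `Cancelling` as a failure is a judgment call rather than something the workflow is known to do.

```bash
# Hypothetical helper: map a GetCommandInvocation Status to an action.
classify_ssm_status() {
  case "$1" in
    Success) echo done ;;
    Pending|InProgress|Delayed) echo wait ;;
    Cancelled|TimedOut|Failed|Cancelling) echo failed ;;
    *) echo unknown ;;
  esac
}
```

A polling loop would keep going while the result is `wait`, dump stdout and stderr on `failed`, and stop cleanly on `done`.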

#### 7.4 Disable / disarm

Remove either secret to disarm — the workflow's `preflight.outputs.deploy_ready` will flip to `false` and the deploy job silently skips:

```bash
gh secret delete OIDC_AWS_ROLE_ARN_DEPLOY --repo litentry/agentKeys
# or
gh secret delete TEST_BROKER_INSTANCE_ID --repo litentry/agentKeys
```

The IAM role can stay provisioned indefinitely — without the secret it can't be assumed by GHA, and the inline SSM perms are scoped to one instance.

#### Out of scope for issue #101

Per [issue #101](https://github.com/litentry/agentKeys/issues/101) "Out of scope":

- **Prod broker auto-deploy** — never. The prod broker EC2 stays manual via `bash scripts/setup-broker-host.sh --upgrade` from the operator laptop, per CLAUDE.md "Remote broker host (single entry point)".
- **Auto-deploy of test Heima EVM contracts** — deferred to a follow-up PR (issue #101 rollout plan step 7). Contract redeploys mint new addresses and require the `SECRETS_REWRITE_PAT` token to update six `TEST_*_ADDRESS_HEIMA` secrets — more risk than the broker deploy, so it ships separately.
- **Mainnet prod contract redeploy** — never automatic. Manual via `bash scripts/setup-heima.sh` only.

## What the workflow does on every run

1. Restores submodules + Rust toolchain + Foundry + cargo cache.