feat(broker): egress-auth broker (runner side) #56

Merged
ysyneu merged 5 commits into main from feat/egress-broker on Jun 8, 2026

@ysyneu (Collaborator) commented Jun 7, 2026

Egress-auth broker (runner side)

Turns the runner into a local authenticating broker so the fduty CLI's outbound API calls authenticate without the app_key ever entering the bash environment.

What changes

  • New environment/broker_linux.go (//go:build linux). Each bash invocation gets its own socketpair(AF_UNIX, SOCK_SEQPACKET|SOCK_CLOEXEC) control channel, and bash inherits the child end as fd 3 via cmd.ExtraFiles. A control goroutine answers each handshake by minting a dedicated SOCK_STREAM connection and passing it back via SCM_RIGHTS, then serves an httputil.ReverseProxy on that connection that overwrites the sentinel ?app_key= with the real per-person key and forwards to the upstream. PR_SET_DUMPABLE=0 makes the runner non-dumpable, which protects its in-memory keys from core dumps and same-uid ptrace attach.
  • environment/broker_other.go (//go:build !linux): BrokerSupported=false + stubs. darwin/windows don't carry the broker and fall back to the legacy env-key path.
  • executeBashCommand accepts *protocol.BashCredential; in broker mode it sets FLASHDUTY_CRED_FD=3 + an http placeholder base URL and never puts the key in cmd.Env.
  • protocol.BashCredential{Key, BaseURL} + BashArgs.Credential.
  • ws/client.go: advertises &broker=1 on WS connect iff BrokerSupported.
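
As an illustration of the key-rewrite step in the first bullet, here is a minimal sketch rather than the PR's actual code. It assumes Go 1.20+ (for ReverseProxy.Rewrite); the names newKeyRewritingProxy and seenUpstreamKey, the path /channel/list, and the key values are hypothetical. It runs over ordinary TCP test servers, so the socketpair and SCM_RIGHTS plumbing is deliberately left out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newKeyRewritingProxy (hypothetical name) returns a reverse proxy that
// forwards to upstream and overwrites whatever app_key the client sent
// with realKey, so the client only ever needs a sentinel value.
func newKeyRewritingProxy(upstream *url.URL, realKey string) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			pr.SetURL(upstream) // route to the upstream and rewrite Host
			q := pr.Out.URL.Query()
			q.Set("app_key", realKey) // replace the sentinel with the real key
			pr.Out.URL.RawQuery = q.Encode()
		},
	}
}

// seenUpstreamKey sends one request carrying clientKey through the proxy
// and reports which app_key the (fake) upstream actually received.
func seenUpstreamKey(clientKey, realKey string) (string, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.URL.Query().Get("app_key"))
	}))
	defer upstream.Close()

	target, err := url.Parse(upstream.URL)
	if err != nil {
		return "", err
	}
	proxy := httptest.NewServer(newKeyRewritingProxy(target, realKey))
	defer proxy.Close()

	resp, err := http.Get(proxy.URL + "/channel/list?app_key=" + url.QueryEscape(clientKey))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	got, err := seenUpstreamKey("SENTINEL", "real-per-person-key")
	fmt.Println(got, err)
}
```

The client's query string only ever contains the sentinel. The real key exists solely inside the proxy's closure, which is the property the broker relies on.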

Robustness (control channel)

  • Validates the control request byte; a malformed datagram (e.g. a confused in-sandbox process that inherited fd 3) is refused with 0xFF and never mints a connection.
  • The fd dispatch wraps and verifies the runner's own end with net.FileConn before handing the peer its fd, so a net.FileConn failure refuses cleanly instead of stranding the peer with a connection that never gets served.
  • Clones http.DefaultTransport (guarded type assertion) with a ResponseHeaderTimeout backstop; BaseContext ties each per-conn server to broker shutdown so teardown promptly cancels in-flight upstream requests.
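
The request-byte validation in the first bullet can be sketched as a pure function. The constant names ctrlReqDial, ctrlRespOK and ctrlRespErr come from the commit message, and 0xFF is the refusal byte stated above. The values 0x01 and 0x00, the one-byte datagram length, and the function name answerControl are assumptions made for this example.

```go
package main

import "fmt"

const (
	ctrlReqDial byte = 0x01 // assumed value: "please mint a connection"
	ctrlRespOK  byte = 0x00 // assumed value: success reply
	ctrlRespErr byte = 0xFF // refusal byte, as stated in the PR
)

// answerControl (hypothetical name) decides the reply for one control
// datagram and whether a new connection may be minted for the peer.
// Anything other than a single ctrlReqDial byte is refused.
func answerControl(datagram []byte) (resp byte, mint bool) {
	if len(datagram) != 1 || datagram[0] != ctrlReqDial {
		return ctrlRespErr, false
	}
	return ctrlRespOK, true
}

func main() {
	for _, d := range [][]byte{{ctrlReqDial}, {0x7f}, {}, {ctrlReqDial, ctrlReqDial}} {
		resp, mint := answerControl(d)
		fmt.Printf("datagram=%v resp=%#x mint=%v\n", d, resp, mint)
	}
}
```

Keeping the decision separate from the socket I/O makes the refusal path easy to unit test without a socketpair.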

Why

The app_key in the bash env leaks via printenv and /proc/<sibling>/environ on shared-uid hosts. The broker keeps the key in runner memory only; attribution is by-construction (the runner is the parent of each bash, fd 3 is private to that invocation). Linux-only by construction (SOCK_CLOEXEC/PR_SET_DUMPABLE are undefined on darwin); capability is negotiated so a heterogeneous BYOC fleet stays correct.
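
To make the "key never enters the bash environment" claim concrete, here is a hedged sketch of how a broker-mode environment could be assembled. FLASHDUTY_CRED_FD and FLASHDUTY_APP_KEY appear in this PR. The base-URL variable name FLASHDUTY_BASE_URL, the placeholder URL, and the helper names brokerEnv and newBrokeredBash are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

const (
	credFDEnv    = "FLASHDUTY_CRED_FD"  // named in the PR
	legacyKeyEnv = "FLASHDUTY_APP_KEY"  // named in the PR's live-test note
	baseURLEnv   = "FLASHDUTY_BASE_URL" // hypothetical variable name
)

// brokerEnv copies parent, drops any legacy key or stale broker variables,
// and appends the fd number plus a placeholder base URL. The real key is
// never an input to this function, so it cannot end up in the result.
func brokerEnv(parent []string, placeholderBaseURL string) []string {
	env := make([]string, 0, len(parent)+2)
	for _, kv := range parent {
		name, _, _ := strings.Cut(kv, "=")
		if name == legacyKeyEnv || name == credFDEnv || name == baseURLEnv {
			continue
		}
		env = append(env, kv)
	}
	return append(env, credFDEnv+"=3", baseURLEnv+"="+placeholderBaseURL)
}

// newBrokeredBash prepares (but does not start) a bash command whose fd 3
// is childEnd: entry i of ExtraFiles becomes descriptor 3+i in the child.
func newBrokeredBash(script string, childEnd *os.File, parentEnv []string) *exec.Cmd {
	cmd := exec.Command("bash", "-c", script)
	cmd.Env = brokerEnv(parentEnv, "http://broker.invalid")
	cmd.ExtraFiles = []*os.File{childEnd}
	return cmd
}

func main() {
	env := brokerEnv([]string{"PATH=/usr/bin", "FLASHDUTY_APP_KEY=secret"}, "http://broker.invalid")
	fmt.Println(env)
}
```

With an environment built this way, printenv inside the sandbox shows only the fd number and a placeholder URL, neither of which is a credential.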

Verification

  • Unit: TestServeBrokerControl_RewritesKeyAndProxies (Linux, in container) — control channel + SCM_RIGHTS dispatch + key rewrite + the 0xFF refusal path; passes under -count=3 with no fd/goroutine leak.
  • E2E (TestBrokerE2E_RealFduty, opt-in): real fduty through the broker → real local pgy → authenticated real data; app_key absent from bash env (FLASHDUTY_CRED_FD=3 only); two concurrent fduty both authenticate, no byte interleave.
  • Live: this runner connected to a local Safari, advertised broker=1 (Safari logged broker=true); an AI-SRE session ran fduty channel list through the broker and returned real channels with exit_code:0. Container had no FLASHDUTY_APP_KEY — the only auth path was the broker.
  • Builds clean for linux, darwin, windows (amd64/arm64).

ysyneu added 5 commits June 8, 2026 00:24
…down

- Validate the control-channel request byte (constants ctrlReqDial /
  ctrlRespOK / ctrlRespErr); refuse a malformed datagram with 0xFF so a
  confused in-sandbox process that inherited fd 3 fails fast and never
  mints a connection off-spec.
- Reorder the fd dispatch: wrap and verify the runner's own end with
  net.FileConn BEFORE handing the peer its fd via SCM_RIGHTS, so a
  net.FileConn failure refuses cleanly instead of stranding the peer with
  a connection that never gets served (its request would hang to timeout).
- Clone http.DefaultTransport (guarded type assertion, matching
  safeHTTPTransport) instead of sharing the process-global one, and add a
  ResponseHeaderTimeout backstop.
- BaseContext ties each per-conn server to broker shutdown so cancel()
  promptly cancels in-flight upstream requests.
- Test: fix fd leaks (childFD + received fd) and assert the refusal path.

Note: deliberately do NOT join the control goroutine in the dispatch
WaitGroup — it exits only when its Recvmsg peer fully closes (childEnd is
closed by the caller after stop()), so joining it would deadlock.
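
The transport and BaseContext points in the commit message above might look roughly like the following. This is a sketch under assumptions: the timeout values and the names newBrokerTransport and newConnServer are invented for illustration, and how the real code serves a single passed connection is not shown.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

const (
	responseHeaderTimeout = 30 * time.Second // assumed value
	readHeaderTimeout     = 10 * time.Second // assumed value (addresses gosec G112)
)

// newBrokerTransport clones the default transport instead of sharing the
// process-global one. The type assertion is guarded because another package
// may have replaced http.DefaultTransport with a non-*http.Transport.
func newBrokerTransport() *http.Transport {
	var t *http.Transport
	if base, ok := http.DefaultTransport.(*http.Transport); ok {
		t = base.Clone()
	} else {
		t = &http.Transport{Proxy: http.ProxyFromEnvironment}
	}
	t.ResponseHeaderTimeout = responseHeaderTimeout // backstop for a silent upstream
	return t
}

// newConnServer builds the per-connection server. Every request context
// derives from brokerCtx, so cancelling the broker also cancels in-flight
// upstream requests made with that request context.
func newConnServer(brokerCtx context.Context, h http.Handler) *http.Server {
	return &http.Server{
		Handler:           h,
		ReadHeaderTimeout: readHeaderTimeout,
		BaseContext:       func(net.Listener) context.Context { return brokerCtx },
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	tr := newBrokerTransport()
	srv := newConnServer(ctx, http.NotFoundHandler())
	fmt.Println(tr.ResponseHeaderTimeout, srv.ReadHeaderTimeout)
}
```

Cloning matters because setting ResponseHeaderTimeout on the shared default transport would silently change behaviour for every other HTTP client in the runner process.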
This is the broker code's first PR, so the linter surfaces issues across
the original commits too. Fixes:
- errcheck: wrap deferred syscall.Close / f.Close in the test.
- gosec G112: ReadHeaderTimeout on the per-conn http.Server.
- govet shadow: reuse the outer err in the test instead of re-declaring.
- noctx: http.NewRequestWithContext in the test.
- gosec G115 (int->uintptr on fds): exclude by text + path, matching the
  repo's existing G706/G703 handling for rules absent from the pinned CI
  golangci-lint v2.4 (fds from socketpair/ParseUnixRights never overflow).

Verified: golangci-lint run (GOOS=linux) reports 0 issues; config verify
passes; the broker unit test still passes in the linux container.
@ysyneu ysyneu merged commit cc1f21d into main Jun 8, 2026
10 checks passed