Skip to content

Harden driver IPC and registry client robustness#33

Merged
TeoSlayer merged 1 commit into
mainfrom
fix/common-robustness-bugs
Jun 22, 2026
Merged

Harden driver IPC and registry client robustness#33
TeoSlayer merged 1 commit into
mainfrom
fix/common-robustness-bugs

Conversation

@TeoSlayer

Copy link
Copy Markdown
Contributor

Summary

Robustness/correctness hardening for the driver IPC layer and the registry/client dialer. No wire-protocol or behavior changes for existing callers beyond added safety bounds.

Fixes

1. Stale IPC reply mis-correlation (client-side mitigation) — driver/ipc.go

The IPC wire protocol has no request IDs and the daemon dispatches requests concurrently, so a reply that arrives after its request timed out could be consumed by a later, unrelated request (wrong conn_id/result). Previously all replies funneled into one shared pending buffer that the next caller drained.

Now each sendAndWait registers a private, single-use waiter slot. readLoop delivers a reply only to the currently active waiter, and only when the reply cmd matches what that waiter expects (cmdError is always accepted). On timeout/disconnect the waiter is abandoned, so:

  • a late reply arriving between requests finds no active waiter and is dropped;
  • a stale reply arriving while a different request is in flight is dropped on cmd mismatch (the dangerous case — e.g. a dial reply landing on a health request).

Deferred: full cross-process ordering correctness needs daemon-side request IDs echoed in every reply (a coordinated daemon change + version bump). That is intentionally not done here and is marked TODO(PILOT). This client-side scheme only guarantees a late reply is not mis-delivered; it cannot recover the abandoned request's actual result, and two same-cmd replies remain indistinguishable without request IDs.

Regression test: TestLateReplyAfterTimeoutNotMisdelivered — confirmed to fail without the waiter cmd-match guard ("unexpected reply 0x0E") and pass with it.

2. DialAddr could hang forever — driver/driver.go

DialAddr used sendAndWait (no timeout). It now delegates to DialAddrTimeout with a 30s defaultDialTimeout. The other bounded sendAndWait callers — Listen and Broadcast — get the same bound. jsonRPC-based control ops keep the unbounded path because WaitForTrust legitimately blocks in the daemon.

3. Conn net.Conn conformance — driver/conn.go

  • recvBuf is now guarded by a dedicated readMu so concurrent readers can't corrupt it (single-reader contract documented).
  • SetWriteDeadline was a silent no-op; it is now implemented, backed by a writeDeadline checked before/between Write chunks. SetDeadline sets both read and write deadlines.
  • Write godoc clarified: success means the bytes were enqueued to the local daemon over IPC, not transmitted/acknowledged by the peer.

4. Registry client dial timeouts — registry/client/client.go

Every initial net.Dial/tls.Dial (Dial, DialPool, DialTLS, DialTLSPool, initPool, DialTLSPinned) now uses a 5s dialTimeout (matching the existing reconnect paths) so startup/registry ops can't hang on an unreachable host.

5. Datagram send semantics — driver/driver.go

SendTo godoc made explicit: fire-and-forget; a nil return only means the frame was enqueued to the local daemon, not delivered. Semantics unchanged.

Validation

  • GOWORK=off go build ./... — clean
  • GOWORK=off go vet ./... — clean
  • gofmt — all changed files clean
  • GOWORK=off go test -race -parallel 4 ./... — all 15 packages pass
  • gosec ./badgeverify/... — 0 issues (CI scope)
  • govulncheck ./... — no vulnerabilities
  • gitleaks detect — no leaks
  • All fuzz targets (protocol, badgeverify, registry/wire) — pass (bounded runs)

Deferred work

Full request-ID protocol for IPC (daemon + driver, version-bumped) to make cross-process reply ordering provably correct. Tracked via TODO(PILOT) in driver/ipc.go.

driver/ipc.go: drop late IPC replies after timeout instead of
mis-correlating them with the next request. Each sendAndWait now
registers a private, single-use waiter slot; readLoop delivers only to
the active waiter and only when the reply cmd matches (cmdError always
allowed). On timeout/disconnect the slot is abandoned, so a reply for an
already-timed-out request finds no matching waiter and is dropped rather
than handed to an unrelated caller. Full cross-process correctness still
needs daemon-side request IDs (coordinated change + version bump),
documented as a TODO.

driver/driver.go: DialAddr now applies a 30s default timeout via
DialAddrTimeout so a non-responsive daemon can't block forever; Listen
and Broadcast get the same bound. jsonRPC paths (incl. WaitForTrust,
which blocks in the daemon by design) keep the unbounded path. Document
SendTo as fire-and-forget (nil means local IPC enqueue, not delivery).

driver/conn.go: serialize Read with a dedicated mutex so recvBuf can't
be corrupted by concurrent readers; implement SetWriteDeadline (was a
silent no-op) backed by a writeDeadline checked in Write; SetDeadline
now sets both read and write deadlines; document Write success as local
IPC enqueue, not remote delivery.

registry/client/client.go: bound every initial TCP/TLS dial with a 5s
timeout (matching the reconnect paths) so registry ops can't hang on an
unreachable host.

Tests: add TestLateReplyAfterTimeoutNotMisdelivered (verified to fail
without the waiter cmd-match guard) and a real SetWriteDeadline behavior
test; update fixtures for the removed shared pending buffer.
@TeoSlayer TeoSlayer merged commit 862ff31 into main Jun 22, 2026
10 checks passed
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.10345% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
driver/conn.go 89.47% 1 Missing and 1 partial ⚠️
driver/driver.go 66.66% 0 Missing and 1 partial ⚠️
driver/ipc.go 96.55% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@matthew-pilot matthew-pilot deleted the fix/common-robustness-bugs branch June 22, 2026 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants