
fix(listener): serialize membership orders per stack #205

Draft
Dav-14 wants to merge 1 commit into main from fix/serialize-membership-orders-per-stack

Dav-14 (Contributor) commented Jun 12, 2026
Summary

Two membership orders for the same stack could land on different worker-pool goroutines and race on the K8s API, so the slower handler's PATCH silently overwrote the faster one. This was observed repeatedly when two ExistingStack orders arrived in close succession and the second update was undone by the first.

This change routes every stack-scoped order (ExistingStack / DeletedStack / EnabledStack / DisabledStack) through a per-stack mailbox: at most one order per stack is in flight at a time, pending orders for the same stack run FIFO, and orders for distinct stacks remain concurrent on the existing pond pool.

Cancellation matrix

| in-flight ↓ / incoming → | ExistingStack | DeletedStack | Enabled / Disabled |
| --- | --- | --- | --- |
| ExistingStack | cancel + coalesce (newer snapshot wins) | cancel (don't create what we'll tear down) | queue |
| DeletedStack | queue | queue | queue |
| Enabled / Disabled | queue | cancel | queue |

Rules in words:

  • A fresher ExistingStack cancels an in-flight ExistingStack — the older snapshot is stale and replaying it would overwrite the newer state.
  • A DeletedStack cancels any non-DeletedStack in flight — no point creating resources we are about to tear down.
  • An in-flight DeletedStack is never cancelled.
  • Cancellation only fires when a successor is already queued, so the cluster is guaranteed to reconcile to the latest desired state.
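
The matrix and rules above can be written as a single predicate. The sketch below is an illustration only: the test plan names `TestShouldCancelInFlight`, but the `OrderKind` type, its constants, and the signature used here are assumptions, not the PR's actual code.

```go
package main

import "fmt"

// OrderKind is a hypothetical enum for the four stack-scoped order types.
type OrderKind int

const (
	ExistingStack OrderKind = iota
	DeletedStack
	EnabledStack
	DisabledStack
)

// shouldCancelInFlight reports whether an incoming order should cancel the
// handler currently running for the same stack (assumed signature).
func shouldCancelInFlight(inFlight, incoming OrderKind) bool {
	switch {
	case inFlight == DeletedStack:
		// An in-flight delete is never cancelled.
		return false
	case incoming == DeletedStack:
		// No point finishing work that is about to be torn down.
		return true
	case inFlight == ExistingStack && incoming == ExistingStack:
		// The older snapshot is stale; the newer one wins.
		return true
	default:
		// Every other combination simply queues behind the in-flight order.
		return false
	}
}

func main() {
	fmt.Println(shouldCancelInFlight(ExistingStack, ExistingStack)) // true
	fmt.Println(shouldCancelInFlight(EnabledStack, DeletedStack))   // true
	fmt.Println(shouldCancelInFlight(DeletedStack, ExistingStack))  // false
	fmt.Println(shouldCancelInFlight(ExistingStack, DisabledStack)) // false
}
```

Checking the in-flight DeletedStack case first keeps the "never cancelled" rule from being shadowed by the incoming-delete rule.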

syncExistingStack now checks ctx.Err() between its four sub-syncs so a superseding order takes effect promptly instead of waiting for all of them.
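
A minimal sketch of that pattern, assuming the sub-syncs can be modelled as a list of functions. `syncStep`, `runSteps`, and `demoSteps` are hypothetical names; the real `syncExistingStack` may be structured differently.

```go
package main

import (
	"context"
	"fmt"
)

// syncStep is a hypothetical stand-in for one sub-sync.
type syncStep func(ctx context.Context) error

// runSteps runs the steps in order and checks ctx.Err() before each one,
// so a cancelled order stops at the next step boundary.
func runSteps(ctx context.Context, steps ...syncStep) error {
	for _, step := range steps {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := step(ctx); err != nil {
			return err
		}
	}
	return nil
}

// demoSteps cancels the context during the second of four steps and
// returns how many steps ran, along with the resulting error.
func demoSteps() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ran := 0
	plain := func(context.Context) error { ran++; return nil }
	superseded := func(context.Context) error { ran++; cancel(); return nil }

	err := runSteps(ctx, plain, superseded, plain, plain)
	return ran, err
}

func main() {
	ran, err := demoSteps()
	fmt.Println(ran, err) // 2 context canceled
}
```

A step that is already running is not interrupted by this check; it only stops early if it observes `ctx` itself, for example through a context-aware API call.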

Design notes

The dispatcher lives in internal/stack_dispatcher.go with a handler-injection seam (orderHandler func). The queue mechanics are unit-tested under -race with no envtest dependency — stack_dispatcher_test.go covers:

  • per-stack FIFO,
  • cross-stack parallelism,
  • consecutive ExistingStack coalescing,
  • Delete cancels in-flight Existing,
  • Delete is never cancelled,
  • parent ctx propagation,
  • the full supersession matrix,
  • unknown / non-stack orders bypass the mailbox.

Cross-stack throughput is unchanged: the same pond.New(5, 5) pool still bounds concurrent work, and the dispatcher submits one drain goroutine per active stack rather than one goroutine per order.
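
The mailbox mechanics can be sketched as follows. This is a simplified illustration, not the code in internal/stack_dispatcher.go: the `order` and `dispatcher` types and their fields are assumptions, plain goroutines stand in for the pond pool, and cancellation is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// order is a hypothetical, simplified membership order.
type order struct {
	stack string
	seq   int
}

// dispatcher serializes orders per stack while stacks run concurrently.
type dispatcher struct {
	mu      sync.Mutex
	pending map[string][]order // a present key means a drain goroutine is active
	wg      sync.WaitGroup
	handle  func(order) // injected handler, in the spirit of the PR's orderHandler seam
}

func newDispatcher(handle func(order)) *dispatcher {
	return &dispatcher{pending: make(map[string][]order), handle: handle}
}

// Dispatch enqueues o and starts a drain goroutine if the stack has none.
func (d *dispatcher) Dispatch(o order) {
	d.mu.Lock()
	defer d.mu.Unlock()
	queue, active := d.pending[o.stack]
	d.pending[o.stack] = append(queue, o)
	if !active {
		d.wg.Add(1)
		go d.drain(o.stack)
	}
}

// drain handles one stack's orders one at a time, in FIFO order, and exits
// once the mailbox is empty.
func (d *dispatcher) drain(stack string) {
	defer d.wg.Done()
	for {
		d.mu.Lock()
		queue := d.pending[stack]
		if len(queue) == 0 {
			delete(d.pending, stack) // mark the stack as idle
			d.mu.Unlock()
			return
		}
		next := queue[0]
		d.pending[stack] = queue[1:]
		d.mu.Unlock()
		d.handle(next) // run outside the lock so other stacks can progress
	}
}

// demoDispatch sends interleaved orders for two stacks and records the
// order in which each stack's handler saw them.
func demoDispatch() map[string][]int {
	var mu sync.Mutex
	seen := make(map[string][]int)
	d := newDispatcher(func(o order) {
		mu.Lock()
		seen[o.stack] = append(seen[o.stack], o.seq)
		mu.Unlock()
	})
	for i := 1; i <= 3; i++ {
		d.Dispatch(order{stack: "a", seq: i})
		d.Dispatch(order{stack: "b", seq: i})
	}
	d.wg.Wait()
	return seen
}

func main() {
	seen := demoDispatch()
	fmt.Println(seen["a"], seen["b"]) // [1 2 3] [1 2 3]
}
```

Because a stack's key is deleted only under the lock, and only after its last order has been handled, a new drain goroutine can never overlap with an old one for the same stack.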

Test plan

  • go test ./internal/... -run 'TestStackDispatcher|TestShouldCancelInFlight' -race -count=50 — all pass
  • just lint — clean
  • go vet ./... — clean
  • Manual: deploy to staging, replay a back-to-back ExistingStack sequence and confirm the second snapshot's state survives
  • Manual: trigger an ExistingStack then immediately a DeletedStack and confirm the create is interrupted (no orphaned modules)

Notes for reviewers

  • Start() is now a two-line read-from-channel + Dispatch loop; all the per-message handling moved into handleOrder() on the listener.
  • The original closure had a subtle data race (ctx = … reassigning the captured outer ctx variable across worker goroutines); the refactor naturally fixes it because handleOrder is a method where ctx is a parameter.
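
To illustrate why passing ctx as a parameter removes the race, here is a hedged sketch. The PR elides what the original closure assigned to ctx, so `context.WithValue` is only a stand-in, and `listener`, `handleOrder`'s signature, and `demoHandle` are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type ctxKey struct{}

// listener is a hypothetical stand-in for the PR's listener type.
type listener struct{}

// handleOrder takes ctx as a parameter, so rebinding it below only changes
// this call's local variable, not a variable shared by every worker.
func (l *listener) handleOrder(ctx context.Context, id int) int {
	ctx = context.WithValue(ctx, ctxKey{}, id) // stand-in for the elided "ctx = …"
	return ctx.Value(ctxKey{}).(int)
}

// demoHandle calls handleOrder from several goroutines that share one
// parent context and returns the value each call read back.
func demoHandle() []int {
	l := &listener{}
	parent := context.Background()
	out := make([]int, 4)

	var wg sync.WaitGroup
	for i := range out {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = l.handleOrder(parent, i) // each goroutine writes its own slot
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(demoHandle()) // [0 1 2 3]
}
```

Had the assignment targeted a single ctx variable captured by a closure shared across workers, concurrent goroutines would be writing to the same variable, which is the kind of race `-race` reports.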

Two membership orders for the same stack could land on different worker
pool goroutines and race on the K8s API, so the slower handler's PATCH
silently overwrote the faster one — repeatedly observed when an
ExistingStack arrived in close succession to another and the second
update was undone by the first.

Route every stack-scoped order (ExistingStack / DeletedStack /
EnabledStack / DisabledStack) through a per-stack mailbox. At most one
order per stack is in flight at a time; pending orders for the same
stack run FIFO once the previous one finishes. Orders for distinct
stacks remain concurrent, still bounded by the same pond worker pool —
so we keep cross-stack throughput.

Two cancellation cases interrupt the in-flight handler when continuing
it would be wasteful or wrong:

  - a fresher ExistingStack supersedes an in-flight ExistingStack (the
    older snapshot is stale; replaying it overwrites the newer state),
  - a DeletedStack supersedes any non-Deleted in-flight order (no point
    creating resources we are about to tear down).

An in-flight DeletedStack is never cancelled. Cancellation only fires
when a successor is already queued, so the cluster is guaranteed to
reconcile to the latest desired state. syncExistingStack now checks
ctx between sync steps so a superseding order takes effect promptly
instead of waiting for all four sub-syncs to finish.

The dispatcher is in its own file with a handler-injection seam so the
queue mechanics are unit-tested with -race (no envtest dependency).
Coverage includes per-stack FIFO, cross-stack parallelism,
ExistingStack coalescing, Delete-cancels-Existing, Delete-never-
cancelled, parent-ctx propagation, and the full cancellation matrix.

coderabbitai Bot commented Jun 12, 2026

Review skipped

Draft detected.
