fix: ensure relay recovery loop can reclaim stuck messages before expiry #29
Merged
DefaultWebhookExpiry was 30s but pendingIdleTimeout was 60s, meaning messages stuck as pending after a server pod restart would always expire before the recovery loop (which runs every 30s) could reclaim them. Fix the invariant: DefaultWebhookExpiry (2m) > pendingIdleTimeout (30s) + recoveryInterval (30s), so stuck messages are reclaimed with ~60s to spare.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Docker Images Built
Images are available for testing:

    # gatekeeperd
    docker pull ghcr.io/tight-line/gatekeeperd:pr-29-7786b5e
    # gatekeeper-relay
    docker pull ghcr.io/tight-line/gatekeeper-relay:pr-29-7786b5e

docker-compose.yml

    GATEKEEPERD_IMAGE=ghcr.io/tight-line/gatekeeperd:pr-29-7786b5e \
    RELAY_IMAGE=ghcr.io/tight-line/gatekeeper-relay:pr-29-7786b5e \
    docker-compose --profile relay up

Helm (values override)

    image:
      repository: ghcr.io/tight-line/gatekeeperd # or gatekeeper-relay
      tag: "pr-29-7786b5e"

Images expire ~15 days after PR closes.

Summary
Fixes a timing invariant bug in the relay subsystem where messages stuck as `pending` after a pod restart would always expire before the recovery loop could reclaim them. `DefaultWebhookExpiry` was 30s but `pendingIdleTimeout` was 60s, so a stuck message would expire after 30s, before the recovery loop (running every 30s) ever had a chance to pick it up.

Changes
- `DefaultWebhookExpiry` from 30s to 2 minutes
- `pendingIdleTimeout` from 60s to 30s
- `DefaultWebhookExpiry > pendingIdleTimeout + recoveryInterval` now holds with ~60s to spare

Testing
Existing tests in `redis_manager_test.go` updated to reflect the new expiry window. `make test` passes.

Checklist
- `make test` passes locally
- `make lint` passes locally