fix: ensure relay recovery loop can reclaim stuck messages before expiry#29

Merged
nickmarden merged 2 commits into main from fix/relay-recovery-expiry on Jun 16, 2026
@nickmarden

Summary

Fixes a timing invariant bug in the relay subsystem where messages stuck as pending after a pod restart would always expire before the recovery loop could reclaim them.

DefaultWebhookExpiry was 30s but pendingIdleTimeout was 60s, so a stuck message expired after 30s, before it had been idle long enough (60s) for the recovery loop, which runs every 30s, to reclaim it.

Changes

  • Raised DefaultWebhookExpiry from 30s to 2 minutes
  • Lowered pendingIdleTimeout from 60s to 30s
  • The invariant DefaultWebhookExpiry > pendingIdleTimeout + recoveryInterval now holds with ~60s to spare
  • Updated test assertions for the new expiry window
  • Added CHANGELOG entry

Testing

Existing tests in redis_manager_test.go updated to reflect the new expiry window. make test passes.

Checklist

  • make test passes locally
  • make lint passes locally
  • Documentation updated (if behavior or configuration changed)
  • CHANGELOG updated (for user-visible changes)

DefaultWebhookExpiry was 30s but pendingIdleTimeout was 60s, meaning
messages stuck as pending after a server pod restart would always expire
before the recovery loop (which runs every 30s) could reclaim them.

Fix the invariant: DefaultWebhookExpiry (2m) > pendingIdleTimeout (30s) +
recoveryInterval (30s), so stuck messages are reclaimed with ~60s to spare.
@codecov

codecov Bot commented Jun 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@github-actions

Docker Images Built

Images are available for testing:

# gatekeeperd
docker pull ghcr.io/tight-line/gatekeeperd:pr-29-7786b5e

# gatekeeper-relay
docker pull ghcr.io/tight-line/gatekeeper-relay:pr-29-7786b5e

docker-compose.yml

GATEKEEPERD_IMAGE=ghcr.io/tight-line/gatekeeperd:pr-29-7786b5e \
RELAY_IMAGE=ghcr.io/tight-line/gatekeeper-relay:pr-29-7786b5e \
docker-compose --profile relay up

Helm (values override)

image:
  repository: ghcr.io/tight-line/gatekeeperd  # or gatekeeper-relay
  tag: "pr-29-7786b5e"

Images expire ~15 days after PR closes.

@nickmarden merged commit 0d41066 into main on Jun 16, 2026
10 checks passed
@nickmarden deleted the fix/relay-recovery-expiry branch on June 16, 2026 17:49