Fix openStream race between timeout and stream closure#436
Fix openStream race between timeout and stream closure#436ChrisSchinnerl wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ce68fabc9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR adjusts RHPv4 stream timeout handling to avoid a race between net.Conn deadlines and context cancellation, ensuring callers can reliably attribute failures to context-driven aborts vs transport timeouts.
Changes:
- Update
openStreamto only apply aSetDeadlinefallback when the context has no deadline, and otherwise rely on context cancellation to close the stream. - Simplify the context/close goroutine to always close the stream when triggered.
- Update the timeout-related test expectations and add a changeset entry.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
rhp/v4/rpc.go |
Changes how stream deadlines are applied and how streams are closed on context completion. |
rhp/v4/rpc_test.go |
Adjusts timeout error detection in TestRPCTimeout to account for different error strings. |
.changeset/only_set_default_timeout_in_openstream_when_the_context_has_no_deadline_and_handle_context_deadlines_via_goroutine.md |
Documents the behavioral change as a patch changeset. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
While digging through indexd logs due to a constant flow of failed sector integrity checks I noticed that a fair number of these failed sectors happen due to timeouts. After adding some logging it turns out that we are running into the following race:
openStreamusingSetDeadlineopenStreamspins up a goroutine to close the stream ifctxis closedI/o timeoutbutif ctx.Err() != nildoesn't trigger because the deadline triggers before the context is closedTo avoid this race (because we rely on the context to determine whether an RPC failed due to us aborting it versus something else) this PR updates
openStreamto only rely on the goroutine for all timeouts. Only if no timeout is specified on the context we useSetDeadlineon the stream with a sane default. Which should never be the case in indexd.I have deployed this on Zeus for testing and haven't seen any false positives yet.