fix: drain bidi stream response side in publisher to prevent deadlock #75
ColinkaMir wants to merge 1 commit into
The publisher CLIs (cmd/multi-publish, cmd/single in publish mode) call
client.ListenCommands(ctx) and only invoke stream.Send() in a loop. They
never call stream.Recv(). Because ListenCommands is a bidirectional gRPC
stream and the server emits trace-event Response messages for every
published Request, the per-stream HTTP/2 flow-control window fills up,
the server's response handler blocks, and stream.Send() on the client
eventually blocks too. Result: a publisher that completes 7-8 messages
at payloads >=100KB and then hangs indefinitely.
Fix: spawn a goroutine that drains stream.Recv() until the stream
closes. This unblocks the response side without changing the proto or
the response handling on the client (the trace events are discarded,
which matches existing behaviour at the user-visible level).
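
A minimal sketch of the drain pattern, assuming only that the client stream exposes a `Recv()` method returning a message and an error. `Response`, `receiver`, `drain` and `fakeStream` are hypothetical stand-ins so the example runs without gRPC or the generated proto code; the message counter exists only to make the demo observable.

```go
package main

import (
	"fmt"
	"io"
)

// Response stands in for the generated proto message type (assumption).
type Response struct{ Payload []byte }

// receiver is the only part of the client stream the drain needs.
type receiver interface {
	Recv() (*Response, error)
}

// drain reads and discards responses until Recv returns an error
// (io.EOF or a connection error). The returned channel yields the number
// of discarded messages once the goroutine has exited.
func drain(stream receiver) <-chan int {
	exited := make(chan int, 1)
	go func() {
		n := 0
		for {
			if _, err := stream.Recv(); err != nil {
				exited <- n
				return
			}
			n++
		}
	}()
	return exited
}

// fakeStream is a test double: Recv blocks on a channel and reports
// io.EOF once the channel is closed.
type fakeStream struct{ ch chan *Response }

func (f *fakeStream) Recv() (*Response, error) {
	r, ok := <-f.ch
	if !ok {
		return nil, io.EOF
	}
	return r, nil
}

func main() {
	s := &fakeStream{ch: make(chan *Response)}
	exited := drain(s)
	for i := 0; i < 5; i++ {
		s.ch <- &Response{}
	}
	close(s.ch) // plays the role of the connection closing
	fmt.Println("discarded:", <-exited)
}
```

In the publisher, the equivalent goroutine would be started right after `client.ListenCommands(ctx)` returns and would end when the connection is closed.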
Verified on getoptimum/p2pnode:v0.0.1-rc16 with the docker-compose-optimum
stack:
  before fix: -count=40 -datasize=102400 hung after 7 publications.
  after fix:  -count=40 -datasize=102400 completes in ~8s.
              -count=20 -datasize=1048576 completes in ~8s.
              -count=10 -datasize=2516582 completes in ~6s.
Independent benchmark of Optimum mump2p RLNC propagation against plain
GossipSub, run on a local 8-node Docker mesh with MeshDegreeMax=4 so that
messages transit through intermediate nodes (i.e. so RLNC's recoding has
a place to do work).
32 datapoints total: optimum + gossipsub modes × {1KB, 100KB, 1MB, 2.4MB}
× {0, 5, 10, 20}% packet loss.
Headline finding (full details in results-n8/combined/report-n8.md):
optimum delivers 100% in 11 of 16 (size, loss) conditions; gossipsub
delivers 100% in 0 of 16 at the same topology. Clearest single point
is 1 MB / 10% loss: 100% vs 14%.
Methodology:
- REST publish via the Optimum proxy (sidesteps the multi-publish bidi
  deadlock, patch in upstream PR getoptimum/optimum-dev-setup-guide#75)
- Subscribers via direct gRPC sidecar on all 8 nodes
- Packet loss scoped to peer-to-peer mesh traffic only (tc netem prio
  qdisc with u32 filter matching dst 172.28.0.12/30 and 172.28.0.16/29),
  proxy control plane untouched
- Latency per (msg_id, subscriber) from mump2p trace TSV
- Bandwidth from Prometheus process_network_transmit_bytes_total deltas
Closes #74
Summary
The publisher CLIs (`cmd/multi-publish` and `cmd/single` in publish mode) open the `ListenCommands` bidi stream and only call `stream.Send()` in a loop. They never call `stream.Recv()`. The server emits trace-event `Response` messages on every publish; without a drain on the client side, the per-stream HTTP/2 flow-control window fills, the server handler blocks, and `stream.Send` on the client eventually blocks too. Result: the publisher hangs after ~7-8 publications at payloads >=100 KB. Full root-cause analysis in the linked issue.

This PR adds a drain goroutine in each of the two affected call sites. Trace events are discarded (they were not used by the publisher anyway).
Changes
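Both call sites add the same pattern immediately after `client.ListenCommands(ctx)` succeeds: a goroutine that calls `stream.Recv()` in a loop and discards the result until the call returns an error.

The goroutine exits when the connection closes (the `defer conn.Close()` already in `sendMessages` runs at the end of the function and triggers the exit). No leak.

Why this approach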
- Consistent with `cmd/multi-subscribe` and `cmd/single` in subscribe mode, which do `stream.Recv` in their main loop.
- The alternative is to move `Publish` to a unary RPC, but that requires a proto change and breaks compatibility with the current `Request`/`Response` typing used by both subscribe and publish on the same `ListenCommands` stream.

Verification
Run on the local Docker stack from this repo, image `getoptimum/p2pnode:v0.0.1-rc16`. Before applying the patch the OLD multi-publish hangs:

    ./grpc_p2p_client/p2p-multi-publish -topic=t -ipfile=ips.txt \
      -start-index=0 -end-index=1 -count=40 -datasize=102400 -sleep=200ms
    # 7 publications, then hangs indefinitely.

After this patch, with rebuilt binaries:

- `-count=40 -datasize=102400 -sleep=200ms` (100 KB × 40)
- `-count=20 -datasize=1048576 -sleep=400ms` (1 MB × 20)
- `-count=10 -datasize=2516582 -sleep=600ms` (2.4 MB × 10)
- `-count=50 -datasize=1024 -sleep=100ms` (1 KB × 50, the size that worked before)

Sample output from the 100 KB run (post-fix):
Notes for reviewers
- The drain goroutine does not log `stream.Recv` errors. The errors here are the natural shutdown signal (stream closed because connection closed) and logging them would be noise.
- `cmd/multi-subscribe` and `cmd/single` in subscribe mode are not touched; they already drain in their main loop.
- The trace events could be surfaced instead of discarded (e.g. a `-trace` flag to multi-publish symmetrical to multi-subscribe). That seemed out of scope for a bug fix. Happy to do it as a follow-up if useful.

Test environment
- `main` at `bd8c0b8`.
- `docker-compose-optimum.yml`, 4 p2pnodes + 2 proxies, defaults.