fix(broker): allow link stealing when old connection is stopping #1965
Draft
blackb1rd wants to merge 1 commit into apache:main
Signed-off-by: blackb1rd <blackb1rd.mov@gmail.com>
Author
Feel free to merge this code if it really solves the issue. We have a production system hitting this issue on both the client and the server.
Proposed PR Description
Title: Fix race condition causing InvalidClientIDException on rapid client reconnects

Summary
This PR resolves a Time-of-Check to Time-of-Use (TOCTOU) race condition in RegionBroker.addConnection() that occurs when a client rapidly reconnects after a network drop. Previously, this race condition resulted in an unwarranted InvalidClientIDException.

The Problem & Root Cause
When a client connection drops and the client immediately reconnects, the following sequence occurs:

1. TransportConnection.stopAsync() is called on the old connection. This immediately sets stopping = true.
2. The cleanup (processRemoveConnection, which calls broker.removeConnection()) is scheduled to run asynchronously on a separate thread.
3. The client reconnects before that cleanup has run, and RegionBroker.addConnection() finds a stale entry for the client in the clientIdSet.
4. Because isAllowLinkStealing() defaults to false for TCP/OpenWire, the broker rejects the perfectly valid reconnect attempt and throws an InvalidClientIDException.

The Fix
Modified RegionBroker.addConnection() to inspect the state of the existing connection before throwing the exception.

Inside the synchronized (clientIdSet) block, we now check whether the existing connection reports isStopping() == true. If it is already in the process of stopping, we allow the new connection to proceed, treating it the same as link stealing, since the old connection is already shutting down and is only awaiting its asynchronous cleanup.

Safety & Side-Effect Analysis
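Before weighing the side effects, it may help to see the shape of the check described under The Fix. The following is a minimal, self-contained sketch and not the actual RegionBroker code: the class ClientIdRegistrySketch, the nested Conn type, and the use of IllegalStateException in place of InvalidClientIDException are all assumptions made so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the broker's client ID bookkeeping.
// Hypothetical illustration only; NOT the real RegionBroker.
public class ClientIdRegistrySketch {

    // Hypothetical minimal connection: only the stopping flag matters here.
    public static final class Conn {
        private volatile boolean stopping;

        public boolean isStopping() {
            return stopping;
        }

        public void markStopping() {
            stopping = true;
        }
    }

    private final Map<String, Conn> clientIdSet = new HashMap<>();
    private final boolean allowLinkStealing;

    public ClientIdRegistrySketch(boolean allowLinkStealing) {
        this.allowLinkStealing = allowLinkStealing;
    }

    public void addConnection(String clientId, Conn incoming) {
        synchronized (clientIdSet) {
            Conn existing = clientIdSet.get(clientId);
            // Reject only when the existing entry is still live. An entry whose
            // connection is already stopping is treated like link stealing.
            if (existing != null && !allowLinkStealing && !existing.isStopping()) {
                // The real broker throws InvalidClientIDException; a JDK
                // exception is used here to keep the sketch dependency-free.
                throw new IllegalStateException("Client ID already connected: " + clientId);
            }
            clientIdSet.put(clientId, incoming);
        }
    }

    public Conn lookup(String clientId) {
        synchronized (clientIdSet) {
            return clientIdSet.get(clientId);
        }
    }
}
```

In this model, a second addConnection() for the same client ID is rejected while the first connection is live, and accepted once markStopping() has been called on it, which mirrors the behaviour this PR aims for.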
This change is narrowly scoped and should not introduce regressions, for the following reasons:

- The stopAsync() call on the old connection is harmless. It uses compareAndSet(false, true), meaning calling it on an already-stopping connection is a safe no-op.
- The delayed removeConnection() cleanup relies on a guard (oldValue == context). When the delayed task finally executes, it will not accidentally remove the newly registered connection context.

Fixes #[Insert Issue Number]
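The two safety properties listed in the analysis can be illustrated with a small model. This is a sketch under stated assumptions rather than the broker's implementation: StaleCleanupSketch, Context, and register() are hypothetical names, and only the compareAndSet(false, true) call and the oldValue == context identity guard are taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the two guards discussed in the safety analysis.
public class StaleCleanupSketch {

    // Hypothetical stand-in for a connection and its context.
    public static final class Context {
        private final AtomicBoolean stopping = new AtomicBoolean(false);

        // Returns true only for the call that actually flips the flag;
        // every later call is a no-op and returns false.
        public boolean stopAsync() {
            return stopping.compareAndSet(false, true);
        }
    }

    private final Map<String, Context> clientIdSet = new HashMap<>();

    public void register(String clientId, Context context) {
        synchronized (clientIdSet) {
            clientIdSet.put(clientId, context);
        }
    }

    // Removes the entry only if it still belongs to the given context.
    public boolean removeConnection(String clientId, Context context) {
        synchronized (clientIdSet) {
            Context oldValue = clientIdSet.get(clientId);
            if (oldValue == context) {
                clientIdSet.remove(clientId);
                return true;
            }
            // A newer connection owns this client ID; leave it alone.
            return false;
        }
    }

    public Context lookup(String clientId) {
        synchronized (clientIdSet) {
            return clientIdSet.get(clientId);
        }
    }
}
```

In this model, a delayed removeConnection() issued for the old context after a new context has been registered leaves the new entry in place, and a repeated stopAsync() on the old context changes nothing.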