Skip to content

fix(broker): allow link stealing when old connection is stopping#1965

Draft
blackb1rd wants to merge 1 commit intoapache:mainfrom
blackb1rd:fix/broker-refuse-connection-for-stopping-connection
Draft

fix(broker): allow link stealing when old connection is stopping#1965
blackb1rd wants to merge 1 commit intoapache:mainfrom
blackb1rd:fix/broker-refuse-connection-for-stopping-connection

Conversation

@blackb1rd
Copy link
Copy Markdown

Proposed PR Description

Title: Fix race condition causing InvalidClientIDException on rapid client reconnects

Summary
This PR resolves a Time-of-Check to Time-of-Use (TOCTOU) race condition in RegionBroker.addConnection() that occurs when a client rapidly reconnects after a network drop. Previously, this race condition resulted in an unwarranted InvalidClientIDException.

The Problem & Root Cause
When a client connection drops and immediately reconnects, the following sequence occurs:

  1. The network drops, triggering TransportConnection.stopAsync() on the old connection. This immediately sets stopping = true.
  2. The actual cleanup task (processRemoveConnection, which calls broker.removeConnection()) is scheduled to run asynchronously on a separate thread.
  3. If the client reconnects before that asynchronous cleanup completes, RegionBroker.addConnection() finds a stale entry for the client in the clientIdSet.
  4. Because isAllowLinkStealing() defaults to false for TCP/OpenWire, the broker rejects the perfectly valid reconnect attempt and throws an InvalidClientIDException.

The Fix
Modified RegionBroker.addConnection() to inspect the state of the existing connection before throwing the exception.

Inside the synchronized (clientIdSet) block, we now check if the existing connection has isStopping() == true. If it is already in the process of stopping, we allow the new connection to proceed—effectively treating it the same as link-stealing, since the old connection is already dead and just awaiting garbage collection.

Safety & Side-Effect Analysis
This change is clean and introduces no regression risks for the following reasons:

  • Idempotency: The subsequent stopAsync() call on the old connection is completely harmless. It uses compareAndSet(false, true), meaning calling it on an already-stopping connection is a safe no-op.
  • Cleanup Thread Safety: The asynchronous removeConnection() cleanup relies on a guard (oldValue == context). When the delayed task finally executes, it will not accidentally remove the newly registered connection context.

Fixes #[Insert Issue Number]

Signed-off-by: blackb1rd <blackb1rd.mov@gmail.com>
@blackb1rd blackb1rd marked this pull request as draft April 25, 2026 16:17
@blackb1rd
Copy link
Copy Markdown
Author

feel free to merge this code if this is really solve the issue.

we have production which hitting this issue but on both client and server.
but our case, we have too many client connect to activemq server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant