Skip to content

[AMQ-7452] Fix "Timer already cancelled" on new connections when shared inactivity timer dies#1963

Open
jbonofre wants to merge 1 commit intoapache:activemq-6.2.xfrom
jbonofre:fix/inactivity-monitor-cancelled-timer
Open

[AMQ-7452] Fix "Timer already cancelled" on new connections when shared inactivity timer dies#1963
jbonofre wants to merge 1 commit intoapache:activemq-6.2.xfrom
jbonofre:fix/inactivity-monitor-cancelled-timer

Conversation

@jbonofre
Copy link
Copy Markdown
Member

The static READ_CHECK_TIMER and WRITE_CHECK_TIMER in AbstractInactivityMonitor are shared across all transport instances. Java's Timer permanently marks itself cancelled when its internal thread exits due to an unhandled Error thrown by a TimerTask. SchedulerTimerTask rethrows Error subclasses, so an OutOfMemoryError in any timer task kills the shared timer thread while the static field remains non-null. All subsequent schedule() calls on new connections then throw IllegalStateException ("Timer already cancelled"), which propagates to TransportConnector.onAcceptError() and is logged as WARN "Could not accept connection: Timer already cancelled."

Here's my fix proposal: wrap the schedule() calls in startConnectCheckTask() and startMonitorThreads() in try/catch(IllegalStateException). On catch, the dead timer is cancelled (to let its thread exit), replaced with a fresh instance, and the task is rescheduled. The guard condition in startConnectCheckTask() is also extended with || READ_CHECK_TIMER == null to cover the case where the field is unexpectedly null when CHECKER_COUNTER > 0.

I added three regression tests to InactivityMonitorTest covering:

  • READ_CHECK_TIMER cancelled with CHECKER_COUNTER == 0 (startConnectCheckTask path)
  • READ_CHECK_TIMER cancelled with CHECKER_COUNTER > 0 (production scenario)
  • WRITE_CHECK_TIMER cancelled (startMonitorThreads path)

Also, tests use Wait.waitFor() instead of Thread.sleep() for reliability on slow CI.

…ed inactivity timer dies

The static READ_CHECK_TIMER and WRITE_CHECK_TIMER in AbstractInactivityMonitor are
shared across all transport instances. Java's Timer permanently marks itself cancelled
when its internal thread exits due to an unhandled Error thrown by a TimerTask.
SchedulerTimerTask rethrows Error subclasses, so an OutOfMemoryError in any timer task
kills the shared timer thread while the static field remains non-null. All subsequent
schedule() calls on new connections then throw IllegalStateException ("Timer already
cancelled"), which propagates to TransportConnector.onAcceptError() and is logged as
WARN "Could not accept connection: Timer already cancelled."

Fix: wrap the schedule() calls in startConnectCheckTask() and startMonitorThreads()
in try/catch(IllegalStateException). On catch, the dead timer is cancelled (to let its
thread exit), replaced with a fresh instance, and the task is rescheduled. The guard
condition in startConnectCheckTask() is also extended with || READ_CHECK_TIMER == null
to cover the case where the field is unexpectedly null when CHECKER_COUNTER > 0.

Add three regression tests to InactivityMonitorTest covering:
- READ_CHECK_TIMER cancelled with CHECKER_COUNTER == 0 (startConnectCheckTask path)
- READ_CHECK_TIMER cancelled with CHECKER_COUNTER > 0 (production scenario)
- WRITE_CHECK_TIMER cancelled (startMonitorThreads path)

Tests use Wait.waitFor() instead of Thread.sleep() for reliability on slow CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant