RUBY-3822 Stop publishing SDAM events from the Monitor's RTT checks #3069
Open
comandeo-mongo wants to merge 2 commits into
When the streaming protocol is active the PushMonitor streams hello responses as the authoritative SDAM source, and the polling Monitor only measures RTT on its connection. Per the Server Monitoring spec, an RTT command must not publish events or update the topology, and its errors must not mark the server Unknown. Until now the polling Monitor still ran the cluster SDAM flow and published heartbeat events for these RTT checks.

Gate the polling Monitor: when a connection is established and the PushMonitor is running, scan! skips heartbeat publishing and the cluster SDAM flow (on both success and error) while still pacing the next check. The initial handshake and reconnect paths are unaffected, since the gate is false when there is no connection yet, so topology discovery stays prompt. Polling mode is unaffected: with no running PushMonitor the gate is always false.

Marking a server Unknown on a streaming-check failure now belongs to the PushMonitor, which previously only stopped and relied on the polling Monitor to notice. The PushMonitor connection also performs a metadata handshake before streaming, mirroring the polling Monitor and the spec; without it the server never learns the connection's appName, so appName-scoped behaviour (including the failCommand fail points used by the SDAM spec tests) never applies to the streaming hello.

Refresh the stale interruptInUse-pool-clear fixture from the spec (times: 1 -> times: 4). The RTT hello now consumes fail point budget alongside the streaming check, which is exactly why upstream uses times: 4.
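The gate described in this commit message can be sketched roughly as follows. This is a simplified, self-contained model and not the driver's code: `SketchMonitor`, its constructor arguments, the `check` lambda, and the symbolic `events` log are hypothetical stand-ins. Only the names `scan!`, `do_scan`, and `rtt_measurement_only?` are taken from this PR.

```ruby
# Simplified model of the polling Monitor's gate (illustrative only).
class SketchMonitor
  attr_reader :events

  # connected:            the monitor already has an established connection
  # push_monitor_running: a streaming PushMonitor is active
  # check:                hypothetical stand-in for the hello round trip
  def initialize(connected:, push_monitor_running:, check: -> { { 'ok' => 1 } })
    @connected = connected
    @push_monitor_running = push_monitor_running
    @check = check
    @events = []
  end

  # True when this check is only an RTT measurement.
  def rtt_measurement_only?
    @connected && @push_monitor_running
  end

  def scan!
    rtt_only = rtt_measurement_only? # decided before the check runs
    @events << :heartbeat_started unless rtt_only
    begin
      result = do_scan
      @events.push(:heartbeat_succeeded, :sdam_flow) unless rtt_only
      result
    rescue StandardError
      # An RTT-only failure publishes nothing and runs no SDAM flow.
      @events.push(:heartbeat_failed, :sdam_flow) unless rtt_only
      nil
    ensure
      @events << :next_check_scheduled # pacing happens in every mode
    end
  end

  private

  def do_scan
    @check.call
  end
end

streaming = SketchMonitor.new(connected: true, push_monitor_running: true)
streaming.scan!
p streaming.events # => [:next_check_scheduled]

polling = SketchMonitor.new(connected: true, push_monitor_running: false)
polling.scan!
p polling.events
# => [:heartbeat_started, :heartbeat_succeeded, :sdam_flow, :next_check_scheduled]
```

In this model a monitor with `connected: false` behaves like the polling case, which mirrors the claim that the handshake and reconnect paths are unaffected.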
…g mode

The RTT-only gate suppressed the polling Monitor's check whenever the PushMonitor was running, including while the server was Unknown. A server marked Unknown by an operation error or a streaming failure then had to wait for the next streaming response to recover, which could be up to heartbeatFrequencyMS away - long enough for server selection to fail with NoServerAvailable and cascade across later operations.

Restrict the suppression to known servers: when the server is Unknown the polling Monitor runs a full check again, restoring the prompt recovery it performed before this change. Steady-state RTT suppression (the ticket's goal) still applies while the server is known and the PushMonitor is the authoritative streaming source.
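The restriction to known servers amounts to one extra condition on the gate. A minimal sketch, assuming the three inputs are available as plain booleans (the keyword argument names here are hypothetical, not the driver's API):

```ruby
# Hypothetical standalone predicate; the real check lives inside the Monitor.
def rtt_measurement_only?(connected:, push_monitor_running:, server_unknown:)
  # Suppress events only in steady state: an Unknown server gets a full check.
  connected && push_monitor_running && !server_unknown
end

[true, false].each do |unknown|
  gate = rtt_measurement_only?(
    connected: true,
    push_monitor_running: true,
    server_unknown: unknown
  )
  puts "server_unknown=#{unknown} -> rtt_only=#{gate}"
end
# server_unknown=true -> rtt_only=false
# server_unknown=false -> rtt_only=true
```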
jamis approved these changes on Jun 26, 2026
What

In the streaming protocol the `PushMonitor` streams `hello` responses as the authoritative SDAM source, while the polling `Monitor` only measures RTT on its connection. Per the Server Monitoring spec (§Measuring RTT / §SDAM Monitoring), an RTT command MUST NOT publish events, MUST NOT update the topology, and its errors MUST NOT mark the server Unknown. The polling `Monitor` was violating this: it ran the cluster SDAM flow and published `ServerHeartbeat*` / `ServerDescriptionChanged` events for every steady-state RTT check, producing the duplicate non-awaited/awaited sequences seen in RUBY-3456.

Changes

Core (`lib/mongo/server/`)

- `monitor.rb`: new `rtt_measurement_only?` (connection established and PushMonitor running), computed in `scan!` before `do_scan`. When true, the scan skips heartbeat publishing and the cluster SDAM flow (success and error) while still scheduling the next check. The handshake/reconnect path (no connection yet) is unaffected, so discovery stays prompt; polling mode is unaffected (no running PushMonitor → gate always false).
- `push_monitor.rb`: marking a server Unknown on a streaming-check failure now belongs to the `PushMonitor` (it previously only stopped and relied on the polling Monitor). The PushMonitor connection also performs a metadata handshake before streaming; without it the server never learns the connection's `appName`, so `appName`-scoped `failCommand` fail points (used by the SDAM spec tests) never applied to the streaming hello.

Tests

- `monitor_spec.rb`: RTT-only scan runs no SDAM flow and publishes no heartbeat; polling-mode scan still does both (guards against the suppression leaking into polling).
- `push_monitor_spec.rb`: streaming-hello error marks the server Unknown; the connection handshakes before streaming.

Fixture

- `interruptInUse-pool-clear.yml`: refreshed from the spec (`times: 1` → `times: 4`). The RTT hello now consumes fail point budget alongside the streaming check, which is exactly why upstream uses `times: 4`.

Test plan

All run against a local replica set (`localhost:27017,27018,27019`).

- `monitor_spec` + `push_monitor_spec` + `monitor/connection_spec`: 44 examples, 0 failures
- `sdam_unified` + `sdam_monitoring` + `sdam_spec`: 4692 examples, 0 failures, 12 pending (the 4 previously-failing `hello-command-error` / `interruptInUse-pool-clear` tests now pass)
- `sdam_events` + `sdam_prose` + `awaited_ismaster` + `heartbeat_events` + `server_spec`: 44 examples, 0 failures
- `serverMonitoringMode: poll` unified fixtures (network-error/timeout/backpressure) pass

Self-review

Reviewed by the spec consultant (no MUST/MUST NOT violations; fully spec-compliant) and a Ruby-driver review. Notes:

- `PushMonitor` success path already runs `monitor.run_sdam_flow(awaited: true)` → cluster flow every iteration; the error path uses the identical call and releases `@sdam_mutex` before `stop!`.
- `start_stop_srv_monitor` does not call back into monitor lifecycle.
- `Unknown → Unknown` `ServerDescriptionChanged` is possible if the polling Monitor's reconnect also fails after the PushMonitor marked Unknown; this is expected (matches Go/pymongo) and not a spec violation.

Jira: https://jira.mongodb.org/browse/RUBY-3822