RUBY-3822 Stop publishing SDAM events from the Monitor's RTT checks by comandeo-mongo · Pull Request #3069 · mongodb/mongo-ruby-driver

comandeo-mongo · 2026-06-25T14:47:46Z

What

In the streaming protocol the PushMonitor streams hello responses as the authoritative SDAM source, while the polling Monitor only measures RTT on its connection. Per the Server Monitoring spec (§Measuring RTT / §SDAM Monitoring), an RTT command MUST NOT publish events, MUST NOT update the topology, and its errors MUST NOT mark the server Unknown. The polling Monitor was violating this: it ran the cluster SDAM flow and published ServerHeartbeat*/ServerDescriptionChanged events for every steady-state RTT check, producing the duplicate non-awaited/awaited sequences seen in RUBY-3456.

Changes

Core (lib/mongo/server/)

monitor.rb: new rtt_measurement_only? (connection established and PushMonitor running), computed in scan! before do_scan. When true, the scan skips heartbeat publishing and the cluster SDAM flow (success and error) while still scheduling the next check. The handshake/reconnect path (no connection yet) is unaffected, so discovery stays prompt; polling mode is unaffected (no running PushMonitor → gate always false).
push_monitor.rb: marking a server Unknown on a streaming-check failure now belongs to the PushMonitor (it previously only stopped and relied on the polling Monitor). The PushMonitor connection also performs a metadata handshake before streaming — without it the server never learns the connection's appName, so appName-scoped failCommand fail points (used by the SDAM spec tests) never applied to the streaming hello.
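
The gate described for monitor.rb can be sketched roughly as follows. This is a minimal, self-contained model and not the driver's code: `SketchMonitor`, its constructor argument, and the `events` array are hypothetical stand-ins, and only the names `rtt_measurement_only?`, `scan!`, and `do_scan` come from the description above. The sketch assumes that `do_scan` is what establishes the connection, which would explain why the gate is computed before it.

```ruby
# Hypothetical stand-in for the polling Monitor; not the driver's actual class.
class SketchMonitor
  attr_reader :events

  def initialize(push_monitor_running:)
    @push_monitor_running = push_monitor_running
    @connected = false # no connection until the first scan handshakes
    @events = []       # records what a scan would publish or run
  end

  # True only when a connection exists and the PushMonitor is streaming.
  def rtt_measurement_only?
    @connected && @push_monitor_running
  end

  def scan!
    # Computed before do_scan, so the scan that performs the handshake
    # still counts as a full check.
    rtt_only = rtt_measurement_only?

    @events << :heartbeat_started unless rtt_only
    do_scan
    @events.push(:heartbeat_succeeded, :sdam_flow) unless rtt_only

    # The next check is paced regardless of the gate.
    @events << :next_check_scheduled
  end

  private

  # Assumption for this sketch: the scan itself establishes the connection.
  def do_scan
    @connected = true
  end
end
```

In this model the first `scan!` on a streaming monitor records heartbeat and SDAM-flow entries, later scans record only `:next_check_scheduled`, and a monitor built with `push_monitor_running: false` records the full sequence on every scan.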

Tests

monitor_spec.rb: RTT-only scan runs no SDAM flow and publishes no heartbeat; polling-mode scan still does both (guards against the suppression leaking into polling).
push_monitor_spec.rb: streaming-hello error marks the server Unknown; the connection handshakes before streaming.

Fixture

Refreshed the stale interruptInUse-pool-clear.yml from the spec (times: 1 → times: 4). The RTT hello now consumes fail point budget alongside the streaming check, which is exactly why upstream uses times: 4.
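
The budget arithmetic behind the `times` change can be illustrated with a toy model. Everything here is hypothetical (`SketchFailPoint`, `streaming_check_fails?`, and the assumption that RTT hellos reach the server before the streaming check); it only shows how a shared `times` budget can be exhausted by RTT hellos before the streaming hello is affected.

```ruby
# Hypothetical model of a fail point configured with `times: n`.
class SketchFailPoint
  def initialize(times:)
    @remaining = times
  end

  # Fails a hello command while budget remains, consuming one unit each time.
  def fail?(command)
    return false unless command == 'hello' && @remaining.positive?

    @remaining -= 1
    true
  end
end

# Returns true if the streaming hello still hits the fail point after
# `rtt_hellos_before` RTT hellos have already consumed part of the budget.
def streaming_check_fails?(times:, rtt_hellos_before:)
  fail_point = SketchFailPoint.new(times: times)
  rtt_hellos_before.times { fail_point.fail?('hello') }
  fail_point.fail?('hello')
end
```

With `times: 1` and a single preceding RTT hello, the streaming check in this model no longer fails; with `times: 4` it still does.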

Test plan

All run against a local replica set (localhost:27017,27018,27019).

monitor_spec + push_monitor_spec + monitor/connection_spec — 44 examples, 0 failures
sdam_unified + sdam_monitoring + sdam_spec — 4692 examples, 0 failures, 12 pending (the 4 previously-failing hello-command-error / interruptInUse-pool-clear tests now pass)
sdam_events + sdam_prose + awaited_ismaster + heartbeat_events + server_spec — 44 examples, 0 failures
Polling mode verified: unit guard + the 7 serverMonitoringMode: poll unified fixtures (network-error/timeout/backpressure) pass
RuboCop clean on changed files

Self-review

Reviewed by the spec consultant (no MUST/MUST NOT violations; fully spec-compliant) and a Ruby-driver review. Notes:

No new deadlock: the PushMonitor success path already runs monitor.run_sdam_flow(awaited: true) → cluster flow every iteration; the error path uses the identical call and releases @sdam_mutex before stop!. start_stop_srv_monitor does not call back into monitor lifecycle.
A second Unknown → Unknown ServerDescriptionChanged is possible if the polling Monitor's reconnect also fails after the PushMonitor marked Unknown — this is expected (matches Go/pymongo) and not a spec violation.

Jira: https://jira.mongodb.org/browse/RUBY-3822

When the streaming protocol is active the PushMonitor streams hello responses as the authoritative SDAM source, and the polling Monitor only measures RTT on its connection. Per the Server Monitoring spec, an RTT command must not publish events or update the topology, and its errors must not mark the server Unknown. Until now the polling Monitor still ran the cluster SDAM flow and published heartbeat events for these RTT checks.

Gate the polling Monitor: when a connection is established and the PushMonitor is running, scan! skips heartbeat publishing and the cluster SDAM flow (on both success and error) while still pacing the next check. The initial handshake and reconnect paths are unaffected, since the gate is false when there is no connection yet, so topology discovery stays prompt. Polling mode is unaffected: with no running PushMonitor the gate is always false.

Marking a server Unknown on a streaming-check failure now belongs to the PushMonitor, which previously only stopped and relied on the polling Monitor to notice. The PushMonitor connection also performs a metadata handshake before streaming, mirroring the polling Monitor and the spec; without it the server never learns the connection's appName, so appName-scoped behaviour (including the failCommand fail points used by the SDAM spec tests) never applies to the streaming hello.

Refresh the stale interruptInUse-pool-clear fixture from the spec (times: 1 -> times: 4). The RTT hello now consumes fail point budget alongside the streaming check, which is exactly why upstream uses times: 4.

…g mode

The RTT-only gate suppressed the polling Monitor's check whenever the PushMonitor was running, including while the server was Unknown. A server marked Unknown by an operation error or a streaming failure then had to wait for the next streaming response to recover, which could be up to heartbeatFrequencyMS away - long enough for server selection to fail with NoServerAvailable and cascade across later operations.

Restrict the suppression to known servers: when the server is Unknown the polling Monitor runs a full check again, restoring the prompt recovery it performed before this change. Steady-state RTT suppression (the ticket's goal) still applies while the server is known and the PushMonitor is the authoritative streaming source.
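
The refined gate from this commit can be sketched as a single predicate. This is an illustration under stated assumptions, not the driver's implementation: `SketchServer` and the keyword arguments are hypothetical, and the real method presumably reads this state from the monitor itself.

```ruby
# Hypothetical stand-in for a server whose description may be Unknown.
SketchServer = Struct.new(:unknown) do
  def unknown?
    unknown
  end
end

# Suppress the full check only in steady state: a connection exists, the
# PushMonitor is streaming, and the server is currently known.
def rtt_measurement_only?(connected:, push_monitor_running:, server:)
  connected && push_monitor_running && !server.unknown?
end
```

For a known server with a running PushMonitor the predicate is true, so the scan stays RTT-only; once the server is Unknown it becomes false, so the polling Monitor runs a full check and can recover the server without waiting for the next streaming response.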

comandeo-mongo added the bug label Jun 25, 2026

comandeo-mongo marked this pull request as ready for review June 26, 2026 12:44

comandeo-mongo requested a review from a team as a code owner June 26, 2026 12:44

comandeo-mongo requested review from Copilot and jamis and removed request for Copilot June 26, 2026 12:44

jamis approved these changes Jun 26, 2026

comandeo-mongo wants to merge 2 commits into mongodb:master from comandeo-mongo:ruby-3822-rtt-no-sdam-events
