RUBY-3822 Stop publishing SDAM events from the Monitor's RTT checks #3069

Open
comandeo-mongo wants to merge 2 commits into mongodb:master from comandeo-mongo:ruby-3822-rtt-no-sdam-events

Conversation

@comandeo-mongo

Contributor

What

In the streaming protocol the PushMonitor streams hello responses as the authoritative SDAM source, while the polling Monitor only measures RTT on its connection. Per the Server Monitoring spec (§Measuring RTT / §SDAM Monitoring), an RTT command MUST NOT publish events, MUST NOT update the topology, and its errors MUST NOT mark the server Unknown. The polling Monitor was violating this: it ran the cluster SDAM flow and published ServerHeartbeat*/ServerDescriptionChanged events for every steady-state RTT check, producing the duplicate non-awaited/awaited sequences seen in RUBY-3456.

Changes

Core (lib/mongo/server/)

  • monitor.rb: new rtt_measurement_only? (connection established and PushMonitor running), computed in scan! before do_scan. When true, the scan skips heartbeat publishing and the cluster SDAM flow (success and error) while still scheduling the next check. The handshake/reconnect path (no connection yet) is unaffected, so discovery stays prompt; polling mode is unaffected (no running PushMonitor → gate always false).
  • push_monitor.rb: marking a server Unknown on a streaming-check failure now belongs to the PushMonitor (it previously only stopped and relied on the polling Monitor). The PushMonitor connection also performs a metadata handshake before streaming — without it the server never learns the connection's appName, so appName-scoped failCommand fail points (used by the SDAM spec tests) never applied to the streaming hello.
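The monitor.rb gate can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `SketchMonitor`, its constructor arguments, and the event symbols are hypothetical stand-ins. It shows why the gate is computed before the check runs: a scan that has to establish the connection still takes the full SDAM path.

```ruby
# Hypothetical stand-in for the polling Monitor; not the driver's real class.
class SketchMonitor
  attr_reader :events

  def initialize(connected:, push_monitor_running:)
    @connected = connected
    @push_monitor_running = push_monitor_running
    @events = []
    @next_check_scheduled = false
  end

  def next_check_scheduled?
    @next_check_scheduled
  end

  # Gate described in the PR: connection established and PushMonitor running.
  def rtt_measurement_only?
    @connected && @push_monitor_running
  end

  def scan!
    # Computed before the check, so a handshake/reconnect scan is never gated.
    rtt_only = rtt_measurement_only?
    @events << :heartbeat_started unless rtt_only
    do_check
    unless rtt_only
      @events << :heartbeat_succeeded
      @events << :sdam_flow
    end
  ensure
    # The next check is scheduled on every path, gated or not.
    @next_check_scheduled = true
  end

  private

  # Assumption: the check establishes the connection if there is none yet.
  def do_check
    @connected = true
  end
end
```

In this sketch a monitor created with `connected: true, push_monitor_running: true` records no events, while one created with `connected: false` or `push_monitor_running: false` records the heartbeat and SDAM-flow events. All three schedule the next check.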

Tests

  • monitor_spec.rb: RTT-only scan runs no SDAM flow and publishes no heartbeat; polling-mode scan still does both (guards against the suppression leaking into polling).
  • push_monitor_spec.rb: streaming-hello error marks the server Unknown; the connection handshakes before streaming.
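The two PushMonitor behaviours covered by push_monitor_spec.rb can be sketched like this. `SketchPushMonitor`, the `stream` callable, and the log symbols are hypothetical names for illustration and do not mirror the driver's internals.

```ruby
# Hypothetical stand-in for the PushMonitor; not the driver's real class.
class SketchPushMonitor
  attr_reader :log

  # stream: a callable that returns a hello response or raises on failure.
  def initialize(stream)
    @stream = stream
    @log = []
    @handshaken = false
    @running = true
  end

  def running?
    @running
  end

  def check
    # Metadata handshake happens once, before the first streaming hello.
    unless @handshaken
      @log << :handshake
      @handshaken = true
    end
    @stream.call
    @log << :sdam_flow_awaited
  rescue StandardError
    # The PushMonitor itself marks the server Unknown, then stops.
    @log << :marked_unknown
    @running = false
  end
end
```

With a stream that raises, the log reads `[:handshake, :marked_unknown]` and `running?` becomes false. With a healthy stream, repeated checks handshake only once.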

Fixture

  • Refreshed the stale interruptInUse-pool-clear.yml from the spec (times: 1 → times: 4). The RTT hello now consumes fail point budget alongside the streaming check, which is exactly why upstream uses times: 4.
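The budget effect can be illustrated with a toy model. `FailPointBudget` and `run_hellos` are hypothetical helpers, and the interleaving of RTT and streaming hellos is an assumption made for the example, not something taken from the fixture. The model only shows that a `times: N` budget is shared by every matching command, whichever monitor sends it.

```ruby
# Toy model of a fail point configured with a `times: N` budget.
class FailPointBudget
  attr_reader :remaining

  def initialize(times)
    @remaining = times
  end

  # Returns true (the command fails) while budget remains, consuming one unit.
  def consume
    return false if @remaining.zero?

    @remaining -= 1
    true
  end
end

# Runs hellos from the given sources in order against one shared budget.
def run_hellos(budget, sources)
  sources.map { |source| [source, budget.consume ? :failed : :ok] }
end
```

Under the assumed ordering `[:rtt, :streaming]`, a budget of 1 is spent on the RTT hello and the streaming hello succeeds, so the streaming failure the test expects would not be observed. A budget of 4 leaves room for both kinds of hello to fail.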

Test plan

All run against a local replica set (localhost:27017,27018,27019).

  • monitor_spec + push_monitor_spec + monitor/connection_spec — 44 examples, 0 failures
  • sdam_unified + sdam_monitoring + sdam_spec — 4692 examples, 0 failures, 12 pending (the 4 previously-failing hello-command-error / interruptInUse-pool-clear tests now pass)
  • sdam_events + sdam_prose + awaited_ismaster + heartbeat_events + server_spec — 44 examples, 0 failures
  • Polling mode verified: unit guard + the 7 serverMonitoringMode: poll unified fixtures (network-error/timeout/backpressure) pass
  • RuboCop clean on changed files

Self-review

Checked by the spec consultant (no MUST/MUST NOT violations; fully spec-compliant) and in a Ruby-driver review. Notes:

  • No new deadlock: the PushMonitor success path already runs monitor.run_sdam_flow(awaited: true) → cluster flow every iteration; the error path uses the identical call and releases @sdam_mutex before stop!. start_stop_srv_monitor does not call back into monitor lifecycle.
  • A second Unknown → Unknown ServerDescriptionChanged is possible if the polling Monitor's reconnect also fails after the PushMonitor marked Unknown — this is expected (matches Go/pymongo) and not a spec violation.

Jira: https://jira.mongodb.org/browse/RUBY-3822

When the streaming protocol is active the PushMonitor streams hello
responses as the authoritative SDAM source, and the polling Monitor only
measures RTT on its connection. Per the Server Monitoring spec, an RTT
command must not publish events or update the topology, and its errors
must not mark the server Unknown. Until now the polling Monitor still ran
the cluster SDAM flow and published heartbeat events for these RTT checks.

Gate the polling Monitor: when a connection is established and the
PushMonitor is running, scan! skips heartbeat publishing and the cluster
SDAM flow (on both success and error) while still pacing the next check.
The initial handshake and reconnect paths are unaffected, since the gate
is false when there is no connection yet, so topology discovery stays
prompt. Polling mode is unaffected: with no running PushMonitor the gate
is always false.

Marking a server Unknown on a streaming-check failure now belongs to the
PushMonitor, which previously only stopped and relied on the polling
Monitor to notice. The PushMonitor connection also performs a metadata
handshake before streaming, mirroring the polling Monitor and the spec;
without it the server never learns the connection's appName, so
appName-scoped behaviour (including the failCommand fail points used by
the SDAM spec tests) never applies to the streaming hello.

Refresh the stale interruptInUse-pool-clear fixture from the spec
(times: 1 -> times: 4). The RTT hello now consumes fail point budget
alongside the streaming check, which is exactly why upstream uses times: 4.
…g mode

The RTT-only gate suppressed the polling Monitor's check whenever the
PushMonitor was running, including while the server was Unknown. A server
marked Unknown by an operation error or a streaming failure then had to
wait for the next streaming response to recover, which could be up to
heartbeatFrequencyMS away - long enough for server selection to fail with
NoServerAvailable and cascade across later operations.

Restrict the suppression to known servers: when the server is Unknown the
polling Monitor runs a full check again, restoring the prompt recovery it
performed before this change. Steady-state RTT suppression (the ticket's
goal) still applies while the server is known and the PushMonitor is the
authoritative streaming source.
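The refined gate can be sketched as a single predicate. The function below is an illustration only: its keyword arguments are hypothetical, and the driver's actual method takes its state from the Monitor rather than from parameters.

```ruby
# Sketch of the refined gate: suppress the full check only when the
# connection exists, the PushMonitor is running, and the server is known.
def rtt_measurement_only?(connected:, push_monitor_running:, server_unknown:)
  connected && push_monitor_running && !server_unknown
end
```

With `server_unknown: true` the predicate is false regardless of the other inputs, so the polling Monitor runs a full check and can recover the server without waiting for the next streaming response.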
@comandeo-mongo comandeo-mongo marked this pull request as ready for review June 26, 2026 12:44
@comandeo-mongo comandeo-mongo requested a review from a team as a code owner June 26, 2026 12:44
@comandeo-mongo comandeo-mongo requested review from Copilot and jamis and removed request for Copilot June 26, 2026 12:44