StreamInterface: prevent socket/reader-thread leak on handshake failure#918
StreamInterface: prevent socket/reader-thread leak on handshake failure#918nightjoker7 wants to merge 1 commit intomeshtastic:masterfrom
Conversation
…re in __init__ If connect() or waitForConfig() raises during __init__ (handshake timeout, bad stream, config error), the reader thread started by connect() keeps running and the underlying stream/socket stays open — but the caller never receives a reference to the half-initialized instance, so they cannot call close() themselves. The leak compounds on every retry from a caller's reconnect loop. Fix: wrap connect() + waitForConfig() in try/except; call self.close() on any exception before re-raising. Also guard close() against RuntimeError from joining an unstarted reader thread (happens when close() runs from a failed __init__ before connect() could spawn it). Discovered while debugging a real-world Meshtastic firmware crash where a passive logger's retrying TCPInterface() calls against a node with 250-entry NodeDB produced a reconnect storm — every retry triggered a full config+NodeDB dump on the node, compounding heap pressure, which then exposed null-deref bugs in Router::perhapsDecode / MeshService (firmware side fixed in meshtastic/firmware#10226 and #10229). The client-side leak is independent of those firmware bugs and worth fixing on its own.
There was a problem hiding this comment.
Pull request overview
This PR addresses a resource-leak failure mode in StreamInterface where exceptions during initial protocol handshake (connect()/waitForConfig()) could leave a running reader thread and an open underlying transport (serial/TCP), even though the constructor raised before returning an instance to the caller.
Changes:
- Wrap
connect()+waitForConfig()inStreamInterface.__init__with exception handling that attempts toclose()before re-raising. - Make
StreamInterface.close()tolerateRuntimeErrorfrom joining an unstarted reader thread.
| # Nothing to join; safe to ignore. | ||
| pass |
There was a problem hiding this comment.
If connect() fails before the reader thread is started (e.g., serial write raises before _rxThread.start()), this RuntimeError path will skip joining but also won’t close the underlying stream. For SerialInterface (which opens the port before calling StreamInterface.__init__), this means the serial port can remain open even though close() was invoked. Consider explicitly closing self.stream (and setting it to None) when the thread was never started / is not alive, instead of just pass.
| # Nothing to join; safe to ignore. | |
| pass | |
| # In that case the reader thread cannot run _disconnected(), | |
| # so close the stream here to avoid leaking the resource. | |
| if self.stream is not None: | |
| with contextlib.suppress(Exception): | |
| self.stream.close() | |
| self.stream = None |
| if self._rxThread != threading.current_thread(): | ||
| self._rxThread.join() # wait for it to exit | ||
| try: | ||
| self._rxThread.join() # wait for it to exit | ||
| except RuntimeError: |
There was a problem hiding this comment.
join() without a timeout can block indefinitely if the reader thread is stuck in a blocking _readBytes() (e.g., TCPInterface uses blocking socket.recv() with no timeout). With the new __init__ exception handler calling self.close(), a handshake timeout against a silent TCP peer can hang here and never re-raise the original exception. Consider using a bounded join (and letting transport-specific close() logic interrupt the read), or otherwise ensuring blocking reads are interrupted before joining.
| try: | ||
| self.connect() | ||
| if not noProto: | ||
| self.waitForConfig() | ||
| except Exception: | ||
| # Handshake failed (timeout, config error, bad stream). The caller | ||
| # never receives a reference to this half-initialized instance, so | ||
| # they cannot call close() themselves. If we don't clean up here, | ||
| # the reader thread (already started by connect()) keeps running | ||
| # and the underlying stream/socket leaks — the leak compounds on | ||
| # every retry from the caller's reconnect loop. | ||
| with contextlib.suppress(Exception): | ||
| self.close() | ||
| raise |
There was a problem hiding this comment.
New failure-path behavior (calling close() from __init__ when connect() / waitForConfig() raises) is important for leak prevention but currently isn’t covered by unit tests. Since there are existing tests for stream_interface.py, consider adding a test that forces connect() or waitForConfig() to raise and asserts cleanup occurs (e.g., close() is invoked and doesn’t raise when the thread wasn’t started).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #918 +/- ##
==========================================
- Coverage 61.57% 61.52% -0.06%
==========================================
Files 25 25
Lines 4448 4457 +9
==========================================
+ Hits 2739 2742 +3
- Misses 1709 1715 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Problem
StreamInterface.__init__callsself.connect()which spawns the reader thread and opens the stream, then callsself.waitForConfig()which blocks for the config handshake. If either raises — handshake timeout on a busy/loaded node is the common case — the partially-constructed object propagates the exception up, but:connect()keeps runningsocketonTCPInterface,serial.SerialonSerialInterface, etc.) stays openself, so they cannot callclose()themselvesIn a caller that retries on failure (passive loggers, reconnect loops, monitoring daemons), the thread + stream leak compounds with every retry. On the firmware side, every re-attempted handshake triggers a fresh config-dump + NodeDB-dump on the node, so the client-side leak directly inflates heap pressure on the radio.
Fix
Two small changes in
meshtastic/stream_interface.py:__init__: wrapconnect()+waitForConfig()intry/exceptthat callsself.close()on any exception before re-raising. This ensures the reader thread and the stream are released even when__init__cannot return normally.close(): guardself._rxThread.join()againstRuntimeError— happens whenclose()runs from a failed__init__beforeconnect()spawned the reader thread.How it was found
Hit this leak on a fleet-monitoring daemon retrying
TCPInterface(host)every 15 seconds against a Station-G2 node with NodeDB atMAX_NUM_NODES=250. Each handshake timed out, each retry leaked a reader thread + socket, and the compound reconnect pressure on the node correlated with cascading reboots. After applying this fix client-side (together with firmware null-checks in meshtastic/firmware#10226 and #10229), the same deployment has been stable.Minimal repro: run
TCPInterface('<host that accepts TCP but does not respond to meshtastic protocol>')in a retry loop and watchthreading.active_count()climb.Backwards compatibility
Happy-path behavior is unchanged. On the failure path, callers that previously received an exception from
__init__and did nothing will now see the same exception but without the orphaned thread + stream. Callers that relied on the thread staying alive after a failed__init__(unlikely — they had no reference to it) would notice, but that would be an unusual pattern.Related
Router::perhapsDecodePKI path