Fix: HUNG-dump signals forked descendants + drains pump before tail #765
Merged
ChaoWao merged 1 commit on May 13, 2026
Code Review
This pull request introduces a mechanism to collect and signal descendant processes during a session timeout, ensuring that faulthandler tracebacks from deadlocked sub-processes are captured. It adds a new utility to walk the Linux /proc tree and updates the timeout handler to drain output buffers before reporting hung jobs. Feedback includes correcting the tree traversal from DFS to the documented BFS, optimizing the process lookup complexity, reducing the join timeout for output pump threads to minimize cumulative delays, and refining the cleanup logic in the new unit tests.
…dlock

The session-timeout handler in the root conftest sends SIGUSR1 to each running dispatched pytest pid and relies on the child's installed faulthandler trampoline to dump all-thread tracebacks into the HUNG group. Two latent bugs made this useless for the common deadlock pattern in sim distributed tests (e.g. test_ep_dispatch_combine):

1. L3 ``Worker._start_hierarchical`` (python/simpler/worker.py:833,854,895) ``os.fork``s ChipWorker / SubWorker / next-level children. When the real deadlock lives in one of those forked grandchildren, signaling only the dispatched pytest pid hits a process that is calmly waiting in ``waitpid``. That is a useful frame, but the actual stuck thread (in the grandchild) sees no signal and faulthandler never fires there.
2. After the 2 s drain sleep, the handler read ``rj.output_lines[-200:]`` directly without joining the pump thread, while every other reader in parallel_scheduler.py (``_reap_one`` line 265, the cancel path line 322) does ``pump_thread.join(timeout=2.0)`` first. Bytes that had landed in the OS pipe but not yet in ``output_lines`` were dropped on the floor when SIGTERM closed the pipe right after.

Net result for the symptom that motivated this work (the ``test_ep_dispatch_combine`` hang in CI run 25773352155): an empty HUNG group body and no actionable diagnostic.

Fix:

- Add ``_collect_descendant_pids(pid)`` walking ``/proc/<pid>/task/*/children`` in BFS. Best-effort: empty list on non-Linux (macOS sim/UT) or on pid races. The handler signals the dispatched pid AND every descendant.
- Annotate the HUNG header with ``descendants=[...]`` so a reviewer can see at a glance that the dispatched pid had forked children and how many. This is useful even when the body itself is sparse.
- Call ``rj.pump_thread.join(timeout=0.5)`` before reading the tail buffer, mirroring the convention from ``_reap_one`` / cancel.

Tests (tests/ut/py/test_session_timeout.py):

- ``test_collect_descendant_pids_sees_fork_tree``: fork a child that forks a grandchild, assert both pids appear.
- ``test_collect_descendant_pids_returns_empty_for_dead_pid``: ``/proc`` walk on a reaped pid returns [], no exception.
- ``test_session_timeout_surfaces_forked_child_traceback``: end-to-end repro. Spawn a real dispatcher subprocess running a deadlocking forking pytest target, install the real ``_install_session_timeout`` handler, let SIGALRM fire, and assert the HUNG body contains the grandchild's ``_grandchild_deadlock_sentinel`` faulthandler frame and ``descendants=[...]`` in the header. Skipped on non-Linux.

Known gap not addressed here: pytest's default ``--capture=fd`` dup2's the test's stderr onto a capture pipe that is never flushed if the test hangs. The descendants fix gets faulthandler firing in the right process, but its writes still land in pytest's capture pipe unless the dispatched pytest is invoked with ``-s`` (the e2e test uses ``-s`` to isolate the descendants/pump logic). Routing faulthandler to a pre-capture dup of fd 2 is a follow-up.
ChaoWao
approved these changes
May 13, 2026
Summary
The dispatcher's session-timeout handler in the root `conftest.py` was producing empty `HUNG` groups for the common deadlock pattern in sim distributed tests. Caught in CI run 25773352155 on PR #763: `test_ep_dispatch_combine` hung for 585.9 s, and the HUNG body contained only the pytest startup banner and no faulthandler traceback, so it had zero diagnostic value.
Root cause (two bugs)
1. Signal only goes to the dispatched pid, not its descendants. `python/simpler/worker.py::_start_hierarchical` `os.fork`s ChipWorker / SubWorker / next-level children at lines 833, 854, 895. When the actual deadlock is in one of those forked grandchildren, the dispatched pytest is calmly waiting in `waitpid`. That is a useful frame for that pid, but the stuck thread (in the grandchild) sees no signal and faulthandler never dumps from where the bug lives.
2. Tail buffer is read without joining the pump thread. After `time.sleep(2.0)`, the handler did `tail = "".join(rj.output_lines[-200:])` directly. Every other path in `parallel_scheduler.py` (`_reap_one` line 265, the cancel path line 322) calls `rj.pump_thread.join(timeout=2.0)` first. Bytes that the faulthandler signal trampoline had written to the pipe but the daemon pump hadn't yet spliced into `output_lines` were lost when `SIGTERM` closed the pipe right after.
Fix
- Add `_collect_descendant_pids(pid)` in `conftest.py` that walks `/proc/<pid>/task/*/children` BFS. Best-effort: empty list on non-Linux and on pid races; no exception.
- Signal the dispatched pid and every descendant, so the forked grandchild's faulthandler fires where the deadlock actually is.
- Annotate the HUNG header with `descendants=[<pids>]`, which is useful even when the body itself is sparse.
- Call `rj.pump_thread.join(timeout=0.5)` before reading the tail buffer, mirroring the convention from `_reap_one` / cancel.
Tests
`tests/ut/py/test_session_timeout.py`:
- `test_collect_descendant_pids_sees_fork_tree`: fork a child that forks a grandchild; both pids must appear.
- `test_collect_descendant_pids_returns_empty_for_dead_pid`: `/proc` walk on a reaped pid returns `[]` (no exception).
- `test_session_timeout_surfaces_forked_child_traceback`: end-to-end with the real `_install_session_timeout` handler + a forking deadlocking pytest target; HUNG body contains `_grandchild_deadlock_sentinel` frame and header has `descendants=[...]`.
All three skip on non-Linux (no `/proc`).
Reverted the fix locally to confirm the regression tests really exercise it: without the change, all 3 fail (the importlib load fails on the helper before reaching the assertions).
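For readers without the diff open, here is a minimal sketch of the two mechanisms this PR describes: a breadth-first walk of `/proc/<pid>/task/*/children`, and joining the pump thread before slicing the tail. The names `collect_descendant_pids` and `drained_tail` and their parameters are illustrative assumptions, not the code in `conftest.py`; only the `/proc` layout and the join-before-read ordering come from the description above.

```python
import os
import threading
from collections import deque
from typing import List


def collect_descendant_pids(root_pid: int) -> List[int]:
    """Best-effort BFS over /proc/<pid>/task/*/children.

    Hypothetical stand-in for the PR's helper. Returns [] when /proc is
    missing (non-Linux) or when a pid disappears mid-walk.
    """
    found: List[int] = []
    seen = {root_pid}
    queue = deque([root_pid])
    while queue:
        pid = queue.popleft()  # FIFO pop keeps the traversal breadth-first
        try:
            tids = os.listdir(f"/proc/{pid}/task")
        except OSError:
            continue  # no /proc, or the process already exited
        for tid in tids:
            try:
                with open(f"/proc/{pid}/task/{tid}/children") as fh:
                    children = [int(tok) for tok in fh.read().split()]
            except (OSError, ValueError):
                continue  # thread vanished or unreadable entry: skip it
            for child in children:
                if child not in seen:
                    seen.add(child)
                    found.append(child)
                    queue.append(child)
    return found


def drained_tail(pump_thread: threading.Thread, output_lines: List[str],
                 join_timeout: float = 0.5, max_lines: int = 200) -> str:
    """Give the pump thread a bounded chance to finish, then slice the tail."""
    pump_thread.join(timeout=join_timeout)
    return "".join(output_lines[-max_lines:])
```

A timeout handler built on this would send the dump signal to the dispatched pid plus every pid returned by `collect_descendant_pids`, tolerating `ProcessLookupError` for pids that exit in between, and only then call `drained_tail`. The `seen` set makes each pid get visited once, so the walk stays linear in the size of the process tree.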
Known gap (not fixed here)
pytest's default `--capture=fd` `dup2`s the test's stderr onto a capture pipe that is never flushed when the test hangs. The descendants fix correctly gets faulthandler to fire in the right process, but its writes still land in pytest's capture and never reach the parent's pump unless the dispatched pytest is invoked with `-s`. The e2e test uses `-s` to isolate the descendants/pump logic.
A follow-up could route faulthandler to a pre-capture dup of fd 2 (e.g., `os.dup(2)` at conftest module load, passed as `file=fd` to `faulthandler.register`). I prototyped this but the timing of pytest's capture initialization vs. conftest module load is subtle enough that I want to verify it carefully in a separate PR rather than bundle.
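A rough sketch of the follow-up idea, assuming a hypothetical helper named `install_precapture_faulthandler`: duplicate the stderr descriptor before anything redirects it, then register faulthandler against the duplicate. It uses only `os.dup` and `faulthandler.register` as documented in the standard library, and it does not address the capture-initialization timing question raised above.

```python
import faulthandler
import os
import signal


def install_precapture_faulthandler(source_fd: int = 2,
                                    signum: int = signal.SIGUSR1) -> int:
    """Register an all-thread traceback dump on `signum`.

    The dump is written to a duplicate of `source_fd`. A later
    dup2() onto `source_fd` (which is what fd-level capture does)
    leaves the duplicate pointing at the original destination.
    Returns the duplicated fd; it must stay open while the handler
    is registered.
    """
    dump_fd = os.dup(source_fd)
    faulthandler.register(signum, file=dump_fd, all_threads=True)
    return dump_fd


if __name__ == "__main__":
    # Illustration only: dump this process's threads to the real stderr.
    fd = install_precapture_faulthandler()
    os.kill(os.getpid(), signal.SIGUSR1)
    faulthandler.unregister(signal.SIGUSR1)
    os.close(fd)
```

Whether this helps in practice depends on the helper running before pytest swaps fd 2, which is exactly the ordering left for the separate PR to verify.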
Test plan
- `python -m pytest tests/ut/py/test_session_timeout.py -v` passes locally (3/3).
- Reverted the fix locally and confirmed all 3 tests fail (the tests really exercise the fix).
- … check-headers) pass on both touched files; no `SKIP=`.