Skip to content

Fix: HUNG-dump signals forked descendants + drains pump before tail#765

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/hung-dump-descendants-and-pump-drain
May 13, 2026
Merged

Fix: HUNG-dump signals forked descendants + drains pump before tail#765
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/hung-dump-descendants-and-pump-drain

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Summary

The dispatcher's session-timeout handler in the root conftest.py was
producing empty HUNG groups for the common deadlock pattern in sim
distributed tests. Caught in CI run
25773352155
on PR #763: test_ep_dispatch_combine hung for 585.9 s, the HUNG body
contained only the pytest startup banner and no faulthandler traceback
— zero diagnostic value.

Root cause (two bugs)

  1. Signal only goes to the dispatched pid, not its descendants.
    python/simpler/worker.py::_start_hierarchical os.forks
    ChipWorker / SubWorker / next-level children at lines 833, 854, 895.
    When the actual deadlock is in one of those forked grandchildren,
    the dispatched pytest is calmly waiting in waitpid — useful frame
    for that pid, but the stuck thread (in the grandchild) sees no
    signal and faulthandler never dumps from where the bug lives.

  2. Tail buffer is read without joining the pump thread. After
    time.sleep(2.0), the handler did
    tail = "".join(rj.output_lines[-200:]) directly. Every other path
    in parallel_scheduler.py (_reap_one line 265, the cancel path
    line 322) calls rj.pump_thread.join(timeout=2.0) first. Bytes that
    the faulthandler signal trampoline had written to the pipe but the
    daemon pump hadn't yet spliced into output_lines were lost when
    SIGTERM closed the pipe right after.

Fix

  • New _collect_descendant_pids(pid) in conftest.py that walks
    /proc/<pid>/task/*/children BFS. Best-effort: empty list on
    non-Linux and on pid races; no exception.
  • The session-timeout handler now signals the dispatched pid and
    every descendant, so the forked grandchild's faulthandler fires
    where the deadlock actually is.
  • HUNG group header now includes descendants=[<pids>] — useful even
    when the body itself is sparse.
  • rj.pump_thread.join(timeout=0.5) before reading the tail buffer,
    mirroring the convention from _reap_one / cancel.

Tests

tests/ut/py/test_session_timeout.py:

Test What it verifies
test_collect_descendant_pids_sees_fork_tree Fork a child that forks a grandchild; both pids appear
test_collect_descendant_pids_returns_empty_for_dead_pid /proc walk on a reaped pid returns [] (no exception)
test_session_timeout_surfaces_forked_child_traceback End-to-end: real _install_session_timeout handler + a forking deadlocking pytest target; HUNG body contains _grandchild_deadlock_sentinel frame and header has descendants=[...]

All three skip on non-Linux (no /proc).

Reverted the fix locally to confirm the regression tests really exercise
it: without the change, all 3 fail (the importlib load fails on the
helper before reaching the assertions).

Known gap (not fixed here)

pytest's default --capture=fd dup2s the test's stderr onto a
capture pipe that is never flushed when the test hangs. The
descendants fix correctly gets faulthandler to fire in the right
process, but its writes still land in pytest's capture and never
reach the parent's pump unless the dispatched pytest is invoked with
-s. The e2e test uses -s to isolate the descendants/pump logic.

A follow-up could route faulthandler to a pre-capture dup of fd 2
(e.g., os.dup(2) at conftest module load, passed as file=fd to
faulthandler.register). I prototyped this but the timing of pytest's
capture initialization vs. conftest module load is subtle enough that
I want to verify it carefully in a separate PR rather than bundle.

Test plan

  • python -m pytest tests/ut/py/test_session_timeout.py -v passes
    locally (3/3).
  • Same suite fails 3/3 with the conftest changes reverted (proves
    the tests really exercise the fix).
  • Pre-commit hooks (ruff check, ruff format, pyright,
    check-headers) pass on both touched files; no SKIP=.
  • Full CI run on this branch.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to collect and signal descendant processes during a session timeout, ensuring that faulthandler tracebacks from deadlocked sub-processes are captured. It adds a new utility to walk the Linux /proc tree and updates the timeout handler to drain output buffers before reporting hung jobs. Feedback includes correcting the tree traversal from DFS to the documented BFS, optimizing the process lookup complexity, reducing the join timeout for output pump threads to minimize cumulative delays, and refining the cleanup logic in the new unit tests.

Comment thread conftest.py
Comment thread conftest.py Outdated
Comment thread tests/ut/py/test_session_timeout.py Outdated
…dlock

The session-timeout handler in the root conftest sends SIGUSR1 to each
running dispatched pytest pid and relies on the child's installed
faulthandler trampoline to dump all-thread tracebacks into the HUNG
group. Two latent bugs made this useless for the common deadlock
pattern in sim distributed tests (e.g. test_ep_dispatch_combine):

1. L3 ``Worker._start_hierarchical`` (python/simpler/worker.py:833,854,895)
   ``os.fork``s ChipWorker / SubWorker / next-level children. When the
   real deadlock lives in one of those forked grandchildren, signaling
   only the dispatched pytest pid hits a process that is calmly waiting
   in ``waitpid`` — useful frame, but the actual stuck thread (in the
   grandchild) sees no signal and faulthandler never fires there.

2. After the 2 s drain sleep, the handler read ``rj.output_lines[-200:]``
   directly without joining the pump thread, while every other reader
   in parallel_scheduler.py (``_reap_one`` line 265, the cancel path
   line 322) does ``pump_thread.join(timeout=2.0)`` first. Bytes that
   had landed in the OS pipe but not yet in ``output_lines`` were
   dropped on the floor when SIGTERM closed the pipe right after.

Net result for the symptom that motivated this work — the
``test_ep_dispatch_combine`` hang in CI run 25773352155 — was an empty
HUNG group body and no actionable diagnostic.

Fix:

- Add ``_collect_descendant_pids(pid)`` walking ``/proc/<pid>/task/*/children``
  in BFS. Best-effort: empty list on non-Linux (macOS sim/UT) or on
  pid races. The handler signals the dispatched pid AND every
  descendant.
- Annotate the HUNG header with ``descendants=[...]`` so a reviewer can
  see at a glance that the dispatched pid had forked children and how
  many — useful even when the body itself is sparse.
- Call ``rj.pump_thread.join(timeout=0.5)`` before reading the tail
  buffer, mirroring the convention from ``_reap_one`` / cancel.

Tests (tests/ut/py/test_session_timeout.py):

- ``test_collect_descendant_pids_sees_fork_tree`` — fork a child that
  forks a grandchild, assert both pids appear.
- ``test_collect_descendant_pids_returns_empty_for_dead_pid`` —
  ``/proc`` walk on a reaped pid returns [], no exception.
- ``test_session_timeout_surfaces_forked_child_traceback`` — end-to-end
  repro: spawn a real dispatcher subprocess running a deadlocking
  forking pytest target, install the real ``_install_session_timeout``
  handler, let SIGALRM fire, and assert the HUNG body contains the
  grandchild's ``_grandchild_deadlock_sentinel`` faulthandler frame
  and ``descendants=[...]`` in the header. Skipped on non-Linux.

Known gap not addressed here: pytest's default ``--capture=fd`` dup2's
the test's stderr onto a capture pipe that is never flushed if the
test hangs. The descendants fix gets faulthandler firing in the right
process, but its writes still land in pytest's capture pipe unless
the dispatched pytest is invoked with ``-s`` (the e2e test uses
``-s`` to isolate the descendants/pump logic). Routing faulthandler
to a pre-capture dup of fd 2 is a follow-up.
@hw-native-sys-bot hw-native-sys-bot force-pushed the fix/hung-dump-descendants-and-pump-drain branch from 59402f9 to 1c9ff75 Compare May 13, 2026 02:50
@ChaoWao ChaoWao merged commit 540d924 into hw-native-sys:main May 13, 2026
14 checks passed
@ChaoWao ChaoWao deleted the fix/hung-dump-descendants-and-pump-drain branch May 13, 2026 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants