
Fix parallel-coordinator hang when summary pipe saturates #248

Merged
GuyAv46 merged 7 commits into master from guyav-fix_cluster_hang
Apr 26, 2026

Conversation

@GuyAv46 (Contributor) commented Apr 23, 2026

Problem

When running RLTest with --parallelism > 1, the coordinator could hang indefinitely on p.join() after worker processes produced enough data to saturate the inter-process pipe buffer (~64 KiB on Linux).

The parallel coordinator used two queues: results for per-test output (drained live by the progressbar loop) and summary for per-worker aggregates (done counts and the worker's full testsFailed dict, drained only after the loop). With enough queued data — or a single large testsFailed dict — workers blocked in pipe_write on the summary pipe (either inside on_timeout's summary.join_thread() or during Python's end-of-process queue finalization) while the coordinator sat in p.join() waiting for them to exit. Classic pipe-buffer deadlock.

Observed in practice on a RediSearch run that hung for 3h53m with 16 workers all stuck in pipe_write and the coordinator parked in do_wait.
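
The failure mode reproduces outside RLTest with just the standard library. A minimal standalone sketch of the same pattern (none of the names below come from RLTest):

    # A worker's Queue feeder thread blocks in pipe_write once the pipe
    # buffer (~64 KiB on Linux) is full; the process then cannot finish its
    # end-of-process queue flush, so the parent's join() never returns.
    import multiprocessing as mp

    def worker(q):
        q.put('x' * (256 * 1024))  # far larger than the pipe buffer

    if __name__ == '__main__':
        q = mp.Queue()
        p = mp.Process(target=worker, args=(q,))
        p.start()
        p.join(timeout=5)                    # without a timeout, this hangs forever
        print('still alive:', p.is_alive())  # True: worker stuck flushing the pipe
        q.get()                              # draining the pipe releases the worker
        p.join()                             # now succeeds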

Fix

The PR is structured as five bisect-friendly commits, building from a minimal patch to a structural refactor. The final design (commits 4 and 5) is the recommended end state.

4a8a452 — Concurrent summary drain (minimal patch)

A _join_workers_with_summary_drain helper spawns a background thread to drain summary while the main thread joins workers. Breaks the deadlock by ensuring writers can always make progress. Also corrects a stale comment in on_timeout.
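
A sketch of the shape of such a helper; the helper name comes from the commit, the signature and callback are hypothetical:

    import queue
    import threading

    def _join_workers_with_summary_drain(processes, summary_q, on_summary):
        # Drain 'summary' from a background thread so worker put()s can
        # always make progress while the main thread blocks in p.join().
        done = threading.Event()

        def drain():
            while not done.is_set() or not summary_q.empty():
                try:
                    on_summary(summary_q.get(timeout=0.1))
                except queue.Empty:
                    pass

        t = threading.Thread(target=drain, daemon=True)
        t.start()
        for p in processes:
            p.join()   # safe: the drainer keeps the summary pipe from filling
        done.set()
        t.join()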

c8df5bc — Switch summary to a SimpleQueue

SimpleQueue.put() is synchronous (no feeder thread, no internal buffer), so on_timeout no longer needs close() + join_thread() to flush before the watcher's os._exit(1). The drainer thread also simplifies to a blocking get() driven by a sentinel. results stays a Queue because the progressbar loop relies on get(timeout=…).
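
A sketch of the resulting sentinel-driven drainer (the _SUMMARY_DRAIN_STOP name is from the PR; the rest is hypothetical):

    from multiprocessing import SimpleQueue

    # A string sentinel survives the queue's pickling round-trip, unlike object().
    _SUMMARY_DRAIN_STOP = '__summary_drain_stop__'

    def drain_summary(summary_q, on_summary):
        # Blocking get(): no polling timeout needed. SimpleQueue.put() is
        # synchronous, so a worker's successful put() means the bytes are
        # already in the kernel pipe; the coordinator pushes the sentinel
        # only after all workers have been joined.
        while True:
            msg = summary_q.get()
            if msg == _SUMMARY_DRAIN_STOP:
                break
            on_summary(msg)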

c0cb696 — Address review feedback

Reorder puts in on_timeout so the coordinator-visible result is delivered before the (potentially blocking) summary put.
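
Sketched below; the queue names are from the PR, everything else is hypothetical:

    import os

    def on_timeout(results, summary, test_name, output, done, tests_failed):
        # Deliver the coordinator-visible result first: the coordinator is
        # still in its results loop, so this put is always consumed.
        results.put({'test_name': test_name, 'output': output, 'done': 1})
        # The synchronous SimpleQueue put may block on a full pipe, but it
        # unblocks once the coordinator moves on to the summary drain phase.
        summary.put((done, tests_failed))
        os._exit(1)  # the watcher thread terminates the timed-out worker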

dfc9114 — Collapse results + summary into a single queue

The two queues existed for historical/semantic reasons, not technical ones. Collapse to a single results queue carrying one self-contained message per test: { test_name, output, done, failures }. The worker resets self.testsFailed = {} before each test so addFailure() writes into a fresh dict that ships verbatim; the worker keeps no cumulative state. The coordinator owns the canonical testsFailed via update() per message.

This eliminates the deadlock by construction: the only queue is drained continuously by the coordinator's progressbar loop for the entire lifetime of the workers, so worker put()s can never block on a full pipe. Removes SimpleQueue, the _SUMMARY_DRAIN_STOP sentinel, the _join_workers_with_summary_drain helper, and the threading import. on_timeout shrinks to a single put + flush.
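
A sketch of the collapsed protocol; the message fields are from the PR, the surrounding loop structure is hypothetical:

    def worker_loop(tests, results):
        for test in tests:
            failures = {}                             # fresh dict per test
            output = run_single_test(test, failures)  # hypothetical runner that
                                                      # records into 'failures'
            results.put({
                'test_name': test,
                'output': output,
                'done': 1,
                'failures': failures,   # ships verbatim; no cumulative state
            })

    def coordinator_loop(results, n_jobs):
        done, tests_failed = 0, {}      # coordinator owns the aggregated state
        while done < n_jobs:
            res = results.get()         # the real loop uses get(timeout=1)
            if res['output']:
                print(res['output'], end='')
            done += res['done']
            tests_failed.update(res['failures'])   # canonical merge per message
        return tests_failed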

3a50d5a — Add per-worker shutdown message + bounded-count drain

Failures raised during the worker's final takeEnvDown(fullShutDown=True) (e.g. "redis did not exit cleanly" when env_reuse=True) need to reach the coordinator too. Each worker now ships exactly one extra '<worker shutdown>' message after takeEnvDown, tagged 'shutdown': True.

The coordinator reads a known total of n_jobs + parallelism messages in a single bounded loop, ticking the progressbar only on per-test messages. Order-independent: the single shared queue does not preserve per-worker ordering, so a fast worker's shutdown may arrive before a slow worker's last test — the conditional bar tick handles both cases. The has_live_processor liveness guard turns a worker crash before it ships its shutdown message into a clean error instead of an indefinite hang.
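
A sketch of the bounded loop (the 'shutdown' flag and the message count come from the PR; the names are otherwise hypothetical):

    import queue

    def bounded_drain(results, processes, n_jobs, parallelism, bar, tests_failed):
        expected = n_jobs + parallelism   # one per test + one shutdown per worker
        received = 0
        while received < expected:
            try:
                res = results.get(timeout=1)
            except queue.Empty:
                # has_live_processor-style guard: a worker that crashed before
                # shipping its shutdown message becomes a clean error, not a hang.
                if not any(p.is_alive() for p in processes):
                    raise RuntimeError('workers exited without shipping all messages')
                continue
            received += 1
            if not res.get('shutdown'):
                bar.update(1)             # tick the bar only on per-test messages
            tests_failed.update(res['failures'])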

Final architecture

  • One queue, one message per test, plus one shutdown message per worker.
  • Coordinator drains continuously throughout the workers' lifetime → no pipe-saturation deadlock possible.
  • Workers are stateless event sources (no cumulative testsFailed); the coordinator is the sole owner of aggregated state.
  • No background drainer thread, no SimpleQueue, no helper functions, no time-based polling outside results.get(timeout=1).

Testing

  • New tests/unit/test_parallel_drain.py reproduces the saturation scenario: 8 worker processes each push many ~32 KiB messages (well over the 64 KiB pipe buffer) and we assert the coordinator drains them and joins all workers within a bounded time (a sketch of the pattern follows this list).
  • Full unit suite: 145 passed, 5 skipped.
  • Verified end-to-end against the RediSearch test suite (RLTEST=$(realpath ../RLTest) make pytest): 1744 passed, 0 failed.
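
A sketch of the saturation test's pattern (not the actual file contents; the real test lives in tests/unit/test_parallel_drain.py):

    import multiprocessing as mp
    import time

    def worker(q, n_msgs, size):
        for _ in range(n_msgs):
            q.put({'output': 'x' * size, 'done': 1, 'failures': {}})

    def test_parallel_drain_under_saturation():
        q = mp.Queue()
        procs = [mp.Process(target=worker, args=(q, 64, 32 * 1024))
                 for _ in range(8)]
        start = time.monotonic()
        for p in procs:
            p.start()
        for _ in range(8 * 64):
            q.get(timeout=10)      # live drain keeps the pipe from saturating
        for p in procs:
            p.join(timeout=10)
            assert p.exitcode == 0
        assert time.monotonic() - start < 60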

Pull Request opened by Augment Code with guidance from the PR author

GuyAv46 added 2 commits April 23, 2026 15:54
The coordinator drained the 'summary' queue only after joining all worker
processes. With enough queued data (or a single large testsFailed dict),
the summary-pipe buffer (~64 KiB on Linux) saturates and worker feeder
threads block in pipe_write, both inside on_timeout's join_thread() and
during Python's end-of-process queue finalization. This in turn hangs the
coordinator's p.join() indefinitely.

Introduce a module-level helper _join_workers_with_summary_drain that
joins workers while continuously draining 'summary' from a background
thread, and use it in execute(). Also correct the stale comment in the
on_timeout closure to describe the actual watcher-thread os._exit(1)
flow.
SimpleQueue.put is synchronous (no feeder thread, no internal buffer),
so a successful put() implies the bytes are already in the kernel pipe.
That removes the need for the summary.close() + summary.join_thread()
dance in on_timeout before the watcher's os._exit(1), and the comment
that explained it.

The coordinator-side drain thread is updated to a blocking get() driven
by a sentinel on shutdown, eliminating its busy-loop timeout too.

results stays a Queue because the progressbar liveness loop relies on
get(timeout=...), which SimpleQueue does not expose publicly.
GuyAv46 requested review from JoanFM and alonre24 April 23, 2026 13:10
GuyAv46 requested a review from Copilot April 23, 2026 14:29

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

GuyAv46 marked this pull request as ready for review April 23, 2026 14:34
GuyAv46 requested a review from Copilot April 23, 2026 14:34

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread RLTest/__main__.py Outdated
Comment thread RLTest/__main__.py Outdated
Comment thread RLTest/__main__.py Outdated
In the on_timeout closure, put 'results' before 'summary'. The summary
queue is a SimpleQueue with a synchronous put(), and the coordinator
only starts draining it after every result is in. Putting summary first
risked blocking on a full summary pipe while the coordinator was still
waiting on this worker's result, which would have stalled the whole
results-collection loop. Putting results first guarantees the worker's
output reaches the coordinator unconditionally; the subsequent summary
put may briefly block but always unblocks once the coordinator moves to
the drain phase.

Also drop the stale 'feeder threads' wording near the call site: the
summary queue no longer has a feeder thread.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread RLTest/__main__.py Outdated
Comment thread tests/unit/test_parallel_drain.py Outdated
The parallel coordinator used two queues: 'results' for per-test output
(read by the progressbar loop) and 'summary' for per-worker aggregates
('done' count and the worker's full testsFailed dict, read after the
loop). Workers' summary.put could block on a full pipe because the
coordinator only drained summary in a second phase, after every results
message had been received.

Collapse to a single 'results' queue carrying one self-contained message
per test: { test_name, output, done, failures }. The worker resets
self.testsFailed = {} before each test so addFailure() writes into a
fresh dict that ships verbatim; the worker keeps no cumulative state.
The coordinator owns the canonical testsFailed via update() per message.

This eliminates the deadlock by construction: the only queue is drained
continuously by the coordinator's progressbar loop for the entire
lifetime of the workers, so worker put()s can never block on a full
pipe. Removes the SimpleQueue import, the _SUMMARY_DRAIN_STOP sentinel,
and the _join_workers_with_summary_drain helper. on_timeout shrinks to
a single put + close + join_thread.

The unit test is updated to exercise the new pattern: workers push many
large per-test messages while the main thread drains them live.
meiravgri previously approved these changes Apr 26, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread RLTest/__main__.py
Comment thread RLTest/__main__.py Outdated
Comment thread RLTest/__main__.py Outdated
- on_timeout now ships both the per-test result (with shutdown=False)
  and the shutdown sentinel before the watcher thread calls os._exit(1),
  keeping the coordinator's bounded count of n_jobs + parallelism
  accurate when a worker dies on timeout. Also fixes a KeyError on the
  missing 'shutdown' key in the timeout payload.
- Grammar: 'no more processors is alive' -> 'are alive'.
- test_parallel_drain: use time.monotonic() for elapsed measurement.
GuyAv46 requested a review from meiravgri April 26, 2026 09:34

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread RLTest/__main__.py

    if res['output']:
        print('%s' % res['output'], end="")
    done += res['done']
    self.testsFailed.update(res['failures'])

Copilot AI commented Apr 26, 2026

self.testsFailed.update(res['failures']) can overwrite existing failure lists when the incoming dict contains a key that already exists (e.g., the shutdown message can carry failures keyed by self.currEnv.testName, which may match a test that already failed earlier). This risks losing earlier failure details in the final summary. Consider merging per-key lists (extend) instead of dict.update(), or ensure shutdown-phase failures are reported under a dedicated key that cannot collide with real test names.

Suggested change:

    -self.testsFailed.update(res['failures'])
    +for test_name, failures in res['failures'].items():
    +    if test_name not in self.testsFailed:
    +        self.testsFailed[test_name] = list(failures)
    +    else:
    +        self.testsFailed[test_name].extend(failures)

@GuyAv46 (Contributor, Author) replied:

Only relevant for reused envs, will be handled in the future if needed

Comment thread RLTest/__main__.py
GuyAv46 merged commit 9571043 into master Apr 26, 2026
10 checks passed
GuyAv46 deleted the guyav-fix_cluster_hang branch April 26, 2026 10:51
