
Fix parallel-coordinator hang when summary pipe saturates #248

Merged
GuyAv46 merged 7 commits into master from guyav-fix_cluster_hang
Apr 26, 2026

Conversation

@GuyAv46 (Contributor) commented Apr 23, 2026

Problem

When running RLTest with --parallelism > 1, the coordinator could hang indefinitely on p.join() after worker processes produced enough data to saturate the inter-process pipe buffer (~64 KiB on Linux).

The parallel coordinator used two queues: results for per-test output (drained live by the progressbar loop) and summary for per-worker aggregates (done counts and the worker's full testsFailed dict, drained only after the loop). With enough queued data — or a single large testsFailed dict — workers blocked in pipe_write on the summary pipe (either inside on_timeout's summary.join_thread() or during Python's end-of-process queue finalization) while the coordinator sat in p.join() waiting for them to exit. Classic pipe-buffer deadlock.

Observed in practice on a RediSearch run that hung for 3h53m with 16 workers all stuck in pipe_write and the coordinator parked in do_wait.
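
The failure mode reproduces outside RLTest with just the standard library. A minimal standalone sketch of the same pattern (none of the names below come from RLTest):

    # A worker's Queue feeder thread blocks in pipe_write once the pipe
    # buffer (~64 KiB on Linux) is full; the process then cannot finish its
    # end-of-process queue flush, so the parent's join() never returns.
    import multiprocessing as mp

    def worker(q):
        q.put('x' * (256 * 1024))  # far larger than the pipe buffer

    if __name__ == '__main__':
        q = mp.Queue()
        p = mp.Process(target=worker, args=(q,))
        p.start()
        p.join(timeout=5)                    # without a timeout, this hangs forever
        print('still alive:', p.is_alive())  # True: worker stuck flushing the pipe
        q.get()                              # draining the pipe releases the worker
        p.join()                             # now succeeds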

Fix

The PR is structured as five bisect-friendly commits, building from a minimal patch to a structural refactor. The final design (commits 4 and 5) is the recommended end state.

4a8a452 — Concurrent summary drain (minimal patch)

A _join_workers_with_summary_drain helper spawns a background thread to drain summary while the main thread joins workers. Breaks the deadlock by ensuring writers can always make progress. Also corrects a stale comment in on_timeout.
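
A sketch of the shape of such a helper; the helper name comes from the commit, the signature and callback are hypothetical:

    import queue
    import threading

    def _join_workers_with_summary_drain(processes, summary_q, on_summary):
        # Drain 'summary' from a background thread so worker put()s can
        # always make progress while the main thread blocks in p.join().
        done = threading.Event()

        def drain():
            while not done.is_set() or not summary_q.empty():
                try:
                    on_summary(summary_q.get(timeout=0.1))
                except queue.Empty:
                    pass

        t = threading.Thread(target=drain, daemon=True)
        t.start()
        for p in processes:
            p.join()   # safe: the drainer keeps the summary pipe from filling
        done.set()
        t.join()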

c8df5bc — Switch summary to a SimpleQueue

SimpleQueue.put() is synchronous (no feeder thread, no internal buffer), so on_timeout no longer needs close() + join_thread() to flush before the watcher's os._exit(1). The drainer thread also simplifies to a blocking get() driven by a sentinel. results stays a Queue because the progressbar loop relies on get(timeout=…).
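
A sketch of the resulting sentinel-driven drainer (the _SUMMARY_DRAIN_STOP name is from the PR; the rest is hypothetical):

    from multiprocessing import SimpleQueue

    # A string sentinel survives the queue's pickling round-trip, unlike object().
    _SUMMARY_DRAIN_STOP = '__summary_drain_stop__'

    def drain_summary(summary_q, on_summary):
        # Blocking get(): no polling timeout needed. SimpleQueue.put() is
        # synchronous, so a worker's successful put() means the bytes are
        # already in the kernel pipe; the coordinator pushes the sentinel
        # only after all workers have been joined.
        while True:
            msg = summary_q.get()
            if msg == _SUMMARY_DRAIN_STOP:
                break
            on_summary(msg)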

c0cb696 — Address review feedback

Reorder puts in on_timeout so the coordinator-visible result is delivered before the (potentially blocking) summary put.
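
Sketched below; the queue names are from the PR, everything else is hypothetical:

    import os

    def on_timeout(results, summary, test_name, output, done, tests_failed):
        # Deliver the coordinator-visible result first: the coordinator is
        # still in its results loop, so this put is always consumed.
        results.put({'test_name': test_name, 'output': output, 'done': 1})
        # The synchronous SimpleQueue put may block on a full pipe, but it
        # unblocks once the coordinator moves on to the summary drain phase.
        summary.put((done, tests_failed))
        os._exit(1)  # the watcher thread terminates the timed-out worker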

dfc9114 — Collapse results + summary into a single queue

The two queues existed for historical/semantic reasons, not technical ones. Collapse to a single results queue carrying one self-contained message per test: { test_name, output, done, failures }. The worker resets self.testsFailed = {} before each test so addFailure() writes into a fresh dict that ships verbatim; the worker keeps no cumulative state. The coordinator owns the canonical testsFailed via update() per message.

This eliminates the deadlock by construction: the only queue is drained continuously by the coordinator's progressbar loop for the entire lifetime of the workers, so worker put()s can never block on a full pipe. Removes SimpleQueue, the _SUMMARY_DRAIN_STOP sentinel, the _join_workers_with_summary_drain helper, and the threading import. on_timeout shrinks to a single put + flush.
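
A sketch of the collapsed protocol; the message fields are from the PR, the surrounding loop structure is hypothetical:

    def worker_loop(tests, results):
        for test in tests:
            failures = {}                             # fresh dict per test
            output = run_single_test(test, failures)  # hypothetical runner that
                                                      # records into 'failures'
            results.put({
                'test_name': test,
                'output': output,
                'done': 1,
                'failures': failures,   # ships verbatim; no cumulative state
            })

    def coordinator_loop(results, n_jobs):
        done, tests_failed = 0, {}      # coordinator owns the aggregated state
        while done < n_jobs:
            res = results.get()         # the real loop uses get(timeout=1)
            if res['output']:
                print(res['output'], end='')
            done += res['done']
            tests_failed.update(res['failures'])   # canonical merge per message
        return tests_failed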

3a50d5a — Add per-worker shutdown message + bounded-count drain

Failures raised during the worker's final takeEnvDown(fullShutDown=True) (e.g. "redis did not exit cleanly" when env_reuse=True) need to reach the coordinator too. Each worker now ships exactly one extra '<worker shutdown>' message after takeEnvDown, tagged 'shutdown': True.

The coordinator reads a known total of n_jobs + parallelism messages in a single bounded loop, ticking the progressbar only on per-test messages. Order-independent: the single shared queue does not preserve per-worker ordering, so a fast worker's shutdown may arrive before a slow worker's last test — the conditional bar tick handles both cases. The has_live_processor liveness guard turns a worker crash before it ships its shutdown message into a clean error instead of an indefinite hang.
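
A sketch of the bounded loop (the 'shutdown' flag and the message count come from the PR; the names are otherwise hypothetical):

    import queue

    def bounded_drain(results, processes, n_jobs, parallelism, bar, tests_failed):
        expected = n_jobs + parallelism   # one per test + one shutdown per worker
        received = 0
        while received < expected:
            try:
                res = results.get(timeout=1)
            except queue.Empty:
                # has_live_processor-style guard: a worker that crashed before
                # shipping its shutdown message becomes a clean error, not a hang.
                if not any(p.is_alive() for p in processes):
                    raise RuntimeError('workers exited without shipping all messages')
                continue
            received += 1
            if not res.get('shutdown'):
                bar.update(1)             # tick the bar only on per-test messages
            tests_failed.update(res['failures'])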

Final architecture

  • One queue, one message per test, plus one shutdown message per worker.
  • Coordinator drains continuously throughout the workers' lifetime → no pipe-saturation deadlock possible.
  • Workers are stateless event sources (no cumulative testsFailed); the coordinator is the sole owner of aggregated state.
  • No background drainer thread, no SimpleQueue, no helper functions, no time-based polling outside results.get(timeout=1).

Testing

  • New tests/unit/test_parallel_drain.py reproduces the saturation scenario: 8 worker processes each push many ~32 KiB messages (well over the 64 KiB pipe buffer) and we assert the coordinator drains them and joins all workers within a bounded time (a sketch of the pattern follows this list).
  • Full unit suite: 145 passed, 5 skipped.
  • Verified end-to-end against the RediSearch test suite (RLTEST=$(realpath ../RLTest) make pytest): 1744 passed, 0 failed.
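
A sketch of the saturation test's pattern (not the actual file contents; the real test lives in tests/unit/test_parallel_drain.py):

    import multiprocessing as mp
    import time

    def worker(q, n_msgs, size):
        for _ in range(n_msgs):
            q.put({'output': 'x' * size, 'done': 1, 'failures': {}})

    def test_parallel_drain_under_saturation():
        q = mp.Queue()
        procs = [mp.Process(target=worker, args=(q, 64, 32 * 1024))
                 for _ in range(8)]
        start = time.monotonic()
        for p in procs:
            p.start()
        for _ in range(8 * 64):
            q.get(timeout=10)      # live drain keeps the pipe from saturating
        for p in procs:
            p.join(timeout=10)
            assert p.exitcode == 0
        assert time.monotonic() - start < 60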

Pull Request opened by Augment Code with guidance from the PR author

GuyAv46 added 2 commits April 23, 2026 15:54
The coordinator drained the 'summary' queue only after joining all worker
processes. With enough queued data (or a single large testsFailed dict),
the summary-pipe buffer (~64 KiB on Linux) saturates and worker feeder
threads block in pipe_write, both inside on_timeout's join_thread() and
during Python's end-of-process queue finalization. This in turn hangs the
coordinator's p.join() indefinitely.

Introduce a module-level helper _join_workers_with_summary_drain that
joins workers while continuously draining 'summary' from a background
thread, and use it in execute(). Also correct the stale comment in the
on_timeout closure to describe the actual watcher-thread os._exit(1)
flow.
SimpleQueue.put is synchronous (no feeder thread, no internal buffer),
so a successful put() implies the bytes are already in the kernel pipe.
That removes the need for the summary.close() + summary.join_thread()
dance in on_timeout before the watcher's os._exit(1), and the comment
that explained it.

The coordinator-side drain thread is updated to a blocking get() driven
by a sentinel on shutdown, eliminating its busy-loop timeout too.

results stays a Queue because the progressbar liveness loop relies on
get(timeout=...), which SimpleQueue does not expose publicly.
GuyAv46 requested review from JoanFM and alonre24 April 23, 2026 13:10
GuyAv46 requested a review from Copilot April 23, 2026 14:29

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

GuyAv46 marked this pull request as ready for review April 23, 2026 14:34
GuyAv46 requested a review from Copilot April 23, 2026 14:34

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread RLTest/__main__.py Outdated
Comment thread RLTest/__main__.py Outdated
Comment thread RLTest/__main__.py Outdated
In the on_timeout closure, put 'results' before 'summary'. The summary
queue is a SimpleQueue with a synchronous put(), and the coordinator
only starts draining it after every result is in. Putting summary first
risked blocking on a full summary pipe while the coordinator was still
waiting on this worker's result, which would have stalled the whole
results-collection loop. Putting results first guarantees the worker's
output reaches the coordinator unconditionally; the subsequent summary
put may briefly block but always unblocks once the coordinator moves to
the drain phase.

Also drop the stale 'feeder threads' wording near the call site: the
summary queue no longer has a feeder thread.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread RLTest/__main__.py Outdated
Comment thread tests/unit/test_parallel_drain.py Outdated
The parallel coordinator used two queues: 'results' for per-test output
(read by the progressbar loop) and 'summary' for per-worker aggregates
('done' count and the worker's full testsFailed dict, read after the
loop). Workers' summary.put could block on a full pipe because the
coordinator only drained summary in a second phase, after every results
message had been received.

Collapse to a single 'results' queue carrying one self-contained message
per test: { test_name, output, done, failures }. The worker resets
self.testsFailed = {} before each test so addFailure() writes into a
fresh dict that ships verbatim; the worker keeps no cumulative state.
The coordinator owns the canonical testsFailed via update() per message.

This eliminates the deadlock by construction: the only queue is drained
continuously by the coordinator's progressbar loop for the entire
lifetime of the workers, so worker put()s can never block on a full
pipe. Removes the SimpleQueue import, the _SUMMARY_DRAIN_STOP sentinel,
and the _join_workers_with_summary_drain helper. on_timeout shrinks to
a single put + close + join_thread.

The unit test is updated to exercise the new pattern: workers push many
large per-test messages while the main thread drains them live.
meiravgri previously approved these changes Apr 26, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread RLTest/__main__.py
Comment thread RLTest/__main__.py Outdated
Comment thread RLTest/__main__.py Outdated
- on_timeout now ships both the per-test result (with shutdown=False)
  and the shutdown sentinel before the watcher thread calls os._exit(1),
  keeping the coordinator's bounded count of n_jobs + parallelism
  accurate when a worker dies on timeout. Also fixes a KeyError on the
  missing 'shutdown' key in the timeout payload.
- Grammar: 'no more processors is alive' -> 'are alive'.
- test_parallel_drain: use time.monotonic() for elapsed measurement.
GuyAv46 requested a review from meiravgri April 26, 2026 09:34

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread RLTest/__main__.py

    if res['output']:
        print('%s' % res['output'], end="")
    done += res['done']
    self.testsFailed.update(res['failures'])

Copilot AI commented Apr 26, 2026

self.testsFailed.update(res['failures']) can overwrite existing failure lists when the incoming dict contains a key that already exists (e.g., the shutdown message can carry failures keyed by self.currEnv.testName, which may match a test that already failed earlier). This risks losing earlier failure details in the final summary. Consider merging per-key lists (extend) instead of dict.update(), or ensure shutdown-phase failures are reported under a dedicated key that cannot collide with real test names.

Suggested change:

    -self.testsFailed.update(res['failures'])
    +for test_name, failures in res['failures'].items():
    +    if test_name not in self.testsFailed:
    +        self.testsFailed[test_name] = list(failures)
    +    else:
    +        self.testsFailed[test_name].extend(failures)

@GuyAv46 (Contributor, Author) replied:

Only relevant for reused envs, will be handled in the future if needed

Comment thread RLTest/__main__.py
GuyAv46 merged commit 9571043 into master Apr 26, 2026
10 checks passed
GuyAv46 deleted the guyav-fix_cluster_hang branch April 26, 2026 10:51
