[Fix] Fix race condition with memory consumption by using ray_init_fixture#1787
[Fix] Fix race condition with memory consumption by using ray_init_fixture#1787SumanthRH wants to merge 4 commits into
ray_init_fixture#1787Conversation
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the weight synchronization tests by moving the Mixture of Experts (MoE) model weight sync tests from test_weight_sync.py into a dedicated test file test_weight_sync_moe.py. Feedback on the new test file highlights that the vllm.AsyncLLMEngine and torch.distributed process groups are initialized but never properly shut down or destroyed, which can lead to GPU memory leaks and background process pollution. It is recommended to wrap these test routines in try...finally blocks to ensure proper resource cleanup.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Triggered a GPU CI run here |
…e_resume Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
|
I encountered some more errors while running GPU CI. I had an agent categorize them: | # | Test | Type | Root error |
| 1 | `integrations/test_pd_routing.py::test_pd_routing_verification` | FAILED | `RuntimeError: NIXL is not available` |
| 2 | `test_engine_generation.py::test_pd_generation` | FAILED | `RuntimeError: NIXL is not available` |
| 3 | `test_engine_generation.py::test_pd_generation_non_colocated[1P1D_non_colocated]` | FAILED | `RuntimeError: NIXL is not available` |
| 4 | `test_weight_sync.py::TestWeightUpdateFlow::test_update_weights_flow[pd_1P1D_non_colocated]` | ERROR (setup) | `RuntimeError: NIXL is not available` |
| 5 | `test_skyrl_gym_generator.py::test_generator_formatting_no_use_conversation_multi_turn[unsloth/Llama-3.2-1B-Instruct]` | FAILED | `assert 0 == 1` (no EOS token in loss-masked-in response) |
| 6 | `test_inference_server_group.py::TestServerGroupAndRouter::test_pause_resume` | ERROR (teardown) | `RuntimeError: There is no current event loop in thread 'MainThread'` |There were two fixes needed:
Fixes for 1. and 2. have been added in 164064c I re-ran the failing tests after these fixes: and now only the Moe forward test is failng: -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_weight_sync_moe.py::test_worker_wrap_load_weights_preserves_moe_forward - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
====== 1 failed, 33 passed, 6 skipped, 108 warnings in 1251.71s (0:20:51) ======
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute |
What does this PR do?
Quick follow-up to #1737 . For context, please see: #1737 (comment)
GPU CI fails with changes in #1737
The reason is that the new tests don't have proper init/ teardown. The previous test still occupies gpu memory while the next moe test runs. temporary buffers or temporary memory usage from the previous test is released while the second test runs profiling. vLLM doesn't permit free memory to change on the worker while it's running profiling and it errors out.
The fix is to use
ray_init_fixturefor proper init and teardown. I also separated out the tests into a standalone module because it's a specific regression test.I don't have push access to the contributor's repo and they don't they have access either. So I'm adding the fixes in this follow-up PR