DAOS-17427 control: Restart excluded rank after suicide #16279
tanabarr wants to merge 46 commits into
Conversation
Ticket title is 'Handle engine suicides by automatically restarting the engines'
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-16279/1/execution/node/1466/log
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
f1a124f to e57b0d3 (Compare)
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
e57b0d3 to af7f056 (Compare)
D_ERROR("failed to handle %u/%u event: " DF_RC "\n", src, type,
        DP_RC(rc));

rc = kill(getpid(), SIGKILL);
[Discussion] Looking at the code changes I think I'm back to the previously discussed point: whether the engine kills itself, or only informs the server, which will then decide what to do. If the engine kills itself, why do we need the RAS event? The server can simply decide whether to restart a killed engine, even without the RAS event, can't it?
[Question] If the ds_notify_rank_suicide call fails, then the engine will still terminate, but will the server restart the engine?
[Discussion] I implemented kill+RAS mainly because it guarantees the engine process will be killed; the RAS event signifies that this is a suicide as opposed to some other kind of termination. Do we really want to always restart a terminated rank regardless of reason?
[Question] Yes, if the notify call fails then the engine will be terminated but not restarted. If instead we don't terminate the engine on group map change and just send the RAS event and put the engine into a blocking state, then a failed notify call means the engine will effectively hang.
do we really want to always restart a terminated rank regardless of reason?
Hmm, it's indeed hard to say for sure. I'd understand if we'd like to begin conservatively, only restarting for certain cases.
I agree we should start conservatively. For engines that crash, or exit for some unknown reason, we don't know if the engine would be OK if we restarted it. Better to let the admin investigate and manually restart the ranks in that case.
@tanabarr, I had the same issue with the CI regarding the "Unit test beds with memcheck".
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control pool Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/5/execution/node/1303/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/5/testReport/
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
conflicts resolved, doc-only change, CI run no. 5 is still relevant and should be used for PR review
Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/23/testReport/
Functional tests for the automatic engine restart feature introduced in the control plane. These tests verify that engines automatically restart after self-termination when excluded from the system, with cases to verify disabling, rate-limiting and configuration support.
Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
* try to fix test issues
Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/25/testReport/
kjacque
left a comment
First of all: nice work, and excellent job with all the test coverage. My comments are mostly minor cleanup stuff, but the stop/start behavior is something we may want to improve, or at least note that we only get to stop it once.
func setupTestLogger(t *testing.T) (logging.Logger, *logging.LogBuffer) {
	t.Helper()
	log, buf := logging.NewTestLogger(t.Name())
	t.Cleanup(func() {
		test.ShowBufferOnFailure(t, buf)
	})

	return log, buf
}
Don't want to use test.MustLogContext/logging.FromContext to create the logger?
The problem is that this pattern doesn't enable me to expose the LogBuffer to be used in the tests for comparison purposes. Maybe I'm missing a trick here? @mjmac
Ah, I see! Never mind, I'm not sure how to extract the buffer in that case either. Might be nice to add this function to the test package. I suspect this will be useful in other tests (maybe call it NewTestLoggerWithBuffer or something like that?). Not requesting a change on this PR, just a suggestion.
if timer != nil {
	timer.Stop()
}
For timers that need to stop at the end, you could use defer after you get them.
In this case we don't know whether the timer still exists at the end of the func; we could do the nil check in a deferred func, but I don't think that buys us much.
Looks like the test asserts if the timer wasn't set in the map... So it would have to be non-nil at this point, wouldn't it? Doesn't look like we're ever resetting that variable.
Have changed to improve clarity, done.
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/25/execution/node/983/log
…gine-suicide-restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
This reverts commit bdfde05.
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
This reverts commit 049509c.
Features: control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
kjacque
left a comment
Nothing left that's blocking; just the minor typos and the Notice-level logging should be fixed. I'm fine with that happening in a follow-on.
svc.restartMgr.clearRankRestartHistory(ranks)
}
}
// clearly state history for stopped ranks, instances have already been filtered by

svc.restartMgr.clearRankRestartHistory(ranks)
}
}
// clearly state history for started ranks, instances have already been filtered by
ei.ready.SetTrue()
}
srv.runner = engine.NewTestRunner(trc, engine.MockConfig())
srv.setIndex(idx)
I guess no tests were using the index?
Nope, so I removed it.
}
mgr.pendingRestart = make(map[ranklist.Rank]*time.Timer)

close(mgr.stopChan)
As discussed, I agree it's fine as-is. I do think it'd be nice to leave a note to ourselves as a comment though.
// Record restart time and clear pending state on exit (deferred)
mgr.recordRestartTime(rank)
mgr.log.Noticef("recording rank %d", rank)
Oops, looks like this fix didn't get pushed.
Test stage Functional on EL 9 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/26/testReport/
…gine-suicide-restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: pr control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
daltonbohning
left a comment
It seems the new functional tests aren't even running in CI?
There are many lint errors in ftest that need to be resolved or possibly ignored.
def tearDown(self):
    """Clean up after each test method."""
    # Reset restart state for next test method
    # This ensures clean state between sequential tests
    try:
        self.reset_engine_restart_state()
    except Exception as error:
        self.log.error("Failed to reset engine restart state: %s", error)
        self.fail("tearDown failed to reset engine restart state: {}".format(error))
    finally:
        super().tearDown()
We have a way to handle this in the framework by calling
self.register_cleanup(reset_engine_restart_state)
after whatever operation puts the system into the invalid state. So maybe instead of defining this tearDown, we can do that after calling exclude_rank_and_wait_restart the first time.
2. Wait for rank to self-terminate
3. Verify rank automatically restarts and rejoins the system

:avocado: tags=all,pr,daily_regression
Just confirming that we really do want to add this to PR testing?
Yes, I think so; it's a significant feature.
"""
all_ranks = self.get_all_ranks()
if len(all_ranks) < 2:
    self.skipTest("Test requires at least 2 ranks")
It is better to fail, because skipping will be silent and easily ignored.
-    self.skipTest("Test requires at least 2 ranks")
+    self.fail("Test requires at least 2 ranks")
restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)

if not restarted:
nit - intentional new line?
-    restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
-
-    if not restarted:
+    restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
+    if not restarted:
final_incarnation = self.get_rank_incarnation(test_rank)
if final_incarnation is None:
    self.fail(f"failed to get final incarnation for rank {test_rank}")
It would be better if get_rank_incarnation raised an exception instead of silently returning None
scm_mount: /mnt/daos1
pool:
  size: 2G
timeout: 300
-timeout: 300
if data["status"] != 0:
    self.fail("Cmd dmg system query failed")
It's better for a helper function to raise an exception and let the test itself decide if it is a failure
-    if data["status"] != 0:
-        self.fail("Cmd dmg system query failed")
+    if data["status"] != 0:
+        raise CommandFailure("dmg system query failed")
if data.get("status") != 0:
    self.log.error("dmg system query failed for rank %s", rank)
    return None
IMO it's better to raise exception than silently skip
-    if data.get("status") != 0:
-        self.log.error("dmg system query failed for rank %s", rank)
-        return None
+    if data.get("status") != 0:
+        raise CommandFailure(f"dmg system query failed for rank {rank}")
except Exception as error:  # pylint: disable=broad-exception-caught
    # Catch all exceptions to prevent test framework crashes during rank queries
    self.log.error("Exception getting incarnation for rank %s: %s", rank, error)
    return None
This comment is indicative of a larger problem. An unhandled exception should, IMO, cause the test to ultimately fail. We should not silently skip something that is broken
self.server_managers[0].system_stop()
time.sleep(2)
self.server_managers[0].system_start()
Where does the arbitrary 2s sleep come from? This will eventually be a problem and we will have to revisit
Will remove if you don't think it's necessary for a system restart.
Test-tag: pr control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/28/execution/node/942/log
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com> Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Returns:
    tuple: (restarted, final_state) - whether rank restarted and its final state
"""
Moving register_cleanup() to the beginning of each test because not all of them call this helper.
…ndFailure exception for helpers and register cleanup in setUp for each test class Test-tag: hw,medium,dmg,control,engine_auto_restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…gine-suicide-restart Test-tag: hw,medium,dmg,control,engine_auto_restart pr Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.
The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once the engines rejoin,
the ranks will still need to be reintegrated into pools as a separate
follow-up step.
Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
Functional tests for the automatic engine restart feature are included,
with cases to verify disabling, rate-limiting and configuration support.
Features: pool control