DAOS-17427 control: Restart excluded rank after suicide by tanabarr · Pull Request #16279 · daos-stack/daos

tanabarr · 2025-04-17T09:25:01Z

When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.

The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once the engines rejoin,
the rank will still need to be reintegrated into pools as a separate
follow‑up step.

Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
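The rate-limiting rule above can be sketched as follows. This is a minimal illustration in Python, not the control-plane implementation (which is written in Go). The class `RestartLimiter` and its methods are hypothetical names; only the two config parameters and the 300-second default come from the description. Whether a too-early restart is deferred or dropped is not stated here, so the sketch only reports how long the caller would have to wait.

```python
import time
from typing import Callable, Dict, Optional


class RestartLimiter:
    """Hypothetical per-rank restart rate limiter (illustration only)."""

    def __init__(self, min_delay_s: float = 300.0, disabled: bool = False,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_delay_s = min_delay_s  # mirrors engine_auto_restart_min_delay
        self.disabled = disabled        # mirrors disable_engine_auto_restart
        self._clock = clock
        self._last_restart: Dict[int, float] = {}

    def delay_before_restart(self, rank: int) -> Optional[float]:
        """Return seconds to wait before restarting rank, or None if disabled."""
        if self.disabled:
            return None
        last = self._last_restart.get(rank)
        if last is None:
            # No restart recorded for this rank yet, so no wait is needed.
            return 0.0
        elapsed = self._clock() - last
        return max(0.0, self.min_delay_s - elapsed)

    def record_restart(self, rank: int) -> None:
        """Remember when this rank was last restarted."""
        self._last_restart[rank] = self._clock()
```

Each rank is tracked independently, so a restart of one rank does not delay the restart of another.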

Functional tests for the automatic engine restart feature are included,
with cases to verify disabling, rate-limiting and configuration support.

Features: pool control

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2025-04-17T09:25:22Z

Ticket title is 'Handle engine suicides by automatically restarting the engines'
Status is 'In Review'
Labels: 'request_for_2.8'
https://daosio.atlassian.net/browse/DAOS-17427

daosbuild1 · 2025-04-18T12:29:45Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-16279/1/execution/node/1466/log

liw

Not requesting changes.

liw · 2026-03-19T01:46:16Z

+				D_ERROR("failed to handle %u/%u event: " DF_RC "\n", src, type,
+					DP_RC(rc));
+
 			rc = kill(getpid(), SIGKILL);


[Discussion] Looking at the code changes, I think I'm back to the previously discussed point: whether the engine kills itself, or only informs the server, which then decides what to do. If the engine kills itself, why do we need the RAS event? The server can simply decide whether to restart a killed engine, even without the RAS event, can't it?

[Question] If the ds_notify_rank_suicide call fails, then the engine will still terminate, but will the server restart the engine?

[Discussion] I implemented kill+RAS mainly because it guarantees that the engine process will be killed; the RAS event signifies that the termination was a suicide as opposed to some other kind of termination. Do we really want to always restart a terminated rank regardless of reason?

[Question] Yes, if the notify call fails then the engine will be terminated but not restarted. If instead we don't terminate the engine on group map change, and just send the RAS event and put the engine into a blocking state, then a failed notify call will effectively leave the engine hung.

"Do we really want to always restart a terminated rank regardless of reason?"

Hmm, it's indeed hard to say for sure. I'd understand if we'd like to begin conservatively, only restarting for certain cases.

I agree we should start conservatively. For engines that crash, or exit for some unknown reason, we don't know if the engine would be OK if we restarted it. Better to let the admin investigate and manually restart the ranks in that case.
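The conservative policy agreed on in this thread can be sketched as a small decision function. This is a Python illustration only; `ExitCause` and `should_auto_restart` are hypothetical names, and the only value taken from the PR is the `engine_self_terminated` RAS event. Under this policy, an engine whose notify call failed would look like an unknown exit and would not be restarted, which matches the answer given above.

```python
from enum import Enum


class ExitCause(Enum):
    """Hypothetical classification of why an engine process went away."""
    SELF_TERMINATED = "engine_self_terminated"  # RAS event named in the PR description
    CRASHED = "crashed"
    UNKNOWN = "unknown"


def should_auto_restart(cause: ExitCause, auto_restart_disabled: bool = False) -> bool:
    """Restart only when the engine announced its own termination."""
    if auto_restart_disabled:
        return False
    # Crashes and unexplained exits are left for the admin to investigate.
    return cause is ExitCause.SELF_TERMINATED
```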

knard38 · 2026-03-19T08:50:23Z

@tanabarr, I had the same issue with the CI regarding the "Unit test beds with memcheck" stage.
In the end, I merged with master and restarted the CI.
This stage now runs successfully.

daosbuild3 · 2026-03-28T01:54:36Z

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/5/execution/node/1303/log

daosbuild3 · 2026-03-30T12:51:56Z

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/5/testReport/

tanabarr · 2026-03-31T10:03:01Z

The PR CI was run with Features: pool control to give extra coverage in case automatically restarting engines causes any tests to fail. The BoundaryTest and ListVerboseTest failures are unrelated.

@kjacque @knard38 @mjmac @liw can I get reviews please?

tanabarr · 2026-03-31T10:09:33Z

Conflicts resolved (doc-only change). CI run no. 5 is still relevant and should be used for PR review.

daosbuild3 · 2026-04-30T11:27:27Z

Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/23/testReport/

Functional tests for the automatic engine restart feature introduced in the control plane. These tests verify that engines automatically restart after self-termination when excluded from the system, with cases to verify disabling, rate-limiting and configuration support.

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

* try to fix test issues

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

daosbuild3 · 2026-05-06T22:58:12Z

Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/25/testReport/

kjacque

First of all: nice work, and excellent job with all the test coverage. My comments are mostly minor cleanup stuff, but the stop/start behavior is something we may want to make behave better, or at least note that we only get to stop it once.

kjacque · 2026-05-06T23:19:24Z

+func setupTestLogger(t *testing.T) (logging.Logger, *logging.LogBuffer) {
+	t.Helper()
+	log, buf := logging.NewTestLogger(t.Name())
+	t.Cleanup(func() {
+		test.ShowBufferOnFailure(t, buf)
+	})
+
+	return log, buf
+}


Don't want to use test.MustLogContext/logging.FromContext to create the logger?

The problem is that this pattern doesn't let me expose the LogBuffer for use in the tests for comparison purposes. Maybe I'm missing a trick here? @mjmac

Ah, I see! Never mind, I'm not sure how to extract the buffer in that case either. Might be nice to add this function to the test package. I suspect this will be useful in other tests (maybe call it NewTestLoggerWithBuffer or something like that?). Not requesting a change on this PR, just a suggestion.

kjacque · 2026-05-06T23:27:44Z

+	if timer != nil {
+		timer.Stop()
+	}


For timers that need to stop at the end, you could use defer after you get them.

In this case we don't know whether the timer still exists at the end of the func. We could do the nil check in a deferred func, but I don't think that buys us much.

Looks like the test asserts if the timer wasn't set in the map... So it would have to be non-nil at this point, wouldn't it? Doesn't look like we're ever resetting that variable.

Changed to improve clarity; done.

daosbuild3 · 2026-05-07T12:44:08Z

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/25/execution/node/983/log

kjacque

Nothing left that's blocking, just the minor typos and the Notice level logging should be fixed. I'm fine with that happening in a follow-on.

kjacque · 2026-05-12T01:55:07Z

-			svc.restartMgr.clearRankRestartHistory(ranks)
-		}
-	}
+	// clearly state history for stopped ranks, instances have already been filtered by


typo? clearly -> clear

kjacque · 2026-05-12T01:55:15Z

-			svc.restartMgr.clearRankRestartHistory(ranks)
-		}
-	}
+	// clearly state history for started ranks, instances have already been filtered by


typo? clearly -> clear

kjacque · 2026-05-12T02:07:57Z

+		ei.ready.SetTrue()
 	}
-	srv.runner = engine.NewTestRunner(trc, engine.MockConfig())
-	srv.setIndex(idx)


I guess no tests were using the index?

Nope, so I removed it.

kjacque · 2026-05-12T02:10:50Z

+	}
+	mgr.pendingRestart = make(map[ranklist.Rank]*time.Timer)
+
+	close(mgr.stopChan)


As discussed, I agree it's fine as-is. I do think it'd be nice to leave a note to ourselves as a comment though.

kjacque · 2026-05-12T02:13:41Z

+
+	// Record restart time and clear pending state on exit (deferred)
+	mgr.recordRestartTime(rank)
+	mgr.log.Noticef("recording rank %d", rank)


Oops, looks like this fix didn't get pushed.

daosbuild3 · 2026-05-12T03:43:32Z

Test stage Functional on EL 9 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/26/testReport/

daltonbohning

It seems the new functional tests aren't even running in CI?
There are many lint errors in ftest that need to be resolved or possibly ignored.

daltonbohning · 2026-05-12T20:27:01Z

+    def tearDown(self):
+        """Clean up after each test method."""
+        # Reset restart state for next test method
+        # This ensures clean state between sequential tests
+        try:
+            self.reset_engine_restart_state()
+        except Exception as error:
+            self.log.error("Failed to reset engine restart state: %s", error)
+            self.fail("tearDown failed to reset engine restart state: {}".format(error))
+        finally:
+            super().tearDown()


We have a way to handle this in the framework by calling

self.register_cleanup(reset_engine_restart_state)

after whatever operation puts the system into the invalid state. So maybe, instead of defining this tearDown, we can do that after calling exclude_rank_and_wait_restart the first time.
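The idea of registering cleanup right after the state-changing operation can be shown with the standard library's `unittest.TestCase.addCleanup`, used here as a stand-in for the framework's `register_cleanup`, whose exact signature is not shown in this thread. The test class, the `EVENTS` list and the stubbed helpers are hypothetical and only record what would happen.

```python
import unittest

EVENTS = []  # records the order of operations for the illustration


class EngineRestartExample(unittest.TestCase):
    """Hypothetical test: cleanup is registered only once state is dirty."""

    def exclude_rank(self, rank):
        # Stub for the operation that puts the system into an invalid state.
        EVENTS.append(f"exclude {rank}")

    def reset_engine_restart_state(self):
        # Stub for the helper that restores a clean state.
        EVENTS.append("reset")

    def test_exclude(self):
        self.exclude_rank(1)
        # Registered after the state change; runs even if a later assertion fails.
        self.addCleanup(self.reset_engine_restart_state)
        self.assertIn("exclude 1", EVENTS)


def run_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(EngineRestartExample)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Compared with a custom tearDown, the cleanup only runs for tests that actually changed the state, and there is no try/except/finally to maintain.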

daltonbohning · 2026-05-12T20:28:24Z

+            2. Wait for rank to self-terminate
+            3. Verify rank automatically restarts and rejoins the system
+
+        :avocado: tags=all,pr,daily_regression


Just confirming that we really do want to add this to PR testing?

Yes, I think so; it's a significant feature.

daltonbohning · 2026-05-12T20:29:19Z

+        """
+        all_ranks = self.get_all_ranks()
+        if len(all_ranks) < 2:
+            self.skipTest("Test requires at least 2 ranks")


It is better to fail because skipping will be silent and easily ignored

Suggested change

-            self.skipTest("Test requires at least 2 ranks")
+            self.fail("Test requires at least 2 ranks")

daltonbohning · 2026-05-12T20:29:58Z

+        restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
+
+        if not restarted:


nit - intentional new line?

Suggested change

-        restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
-
-        if not restarted:
+        restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
+        if not restarted:

daltonbohning · 2026-05-12T20:31:05Z

+        final_incarnation = self.get_rank_incarnation(test_rank)
+        if final_incarnation is None:
+            self.fail(f"failed to get final incarnation for rank {test_rank}")


It would be better if get_rank_incarnation raised an exception instead of silently returning None

daltonbohning · 2026-05-12T20:41:46Z

+          scm_mount: /mnt/daos1
+pool:
+  size: 2G
+timeout: 300


Suggested change

timeout: 300

daltonbohning · 2026-05-12T20:43:46Z

+        if data["status"] != 0:
+            self.fail("Cmd dmg system query failed")


It's better for a helper function to raise an exception and let the test itself decide if it is a failure

Suggested change

-        if data["status"] != 0:
-            self.fail("Cmd dmg system query failed")
+        if data["status"] != 0:
+            raise CommandFailure("dmg system query failed")

daltonbohning · 2026-05-12T20:45:27Z

+            if data.get("status") != 0:
+                self.log.error("dmg system query failed for rank %s", rank)
+                return None


IMO it's better to raise exception than silently skip

Suggested change

-            if data.get("status") != 0:
-                self.log.error("dmg system query failed for rank %s", rank)
-                return None
+            if data.get("status") != 0:
+                raise CommandFailure(f"dmg system query failed for rank {rank}")

daltonbohning · 2026-05-12T20:47:23Z

+        except Exception as error:  # pylint: disable=broad-exception-caught
+            # Catch all exceptions to prevent test framework crashes during rank queries
+            self.log.error("Exception getting incarnation for rank %s: %s", rank, error)
+            return None


This comment is indicative of a larger problem. An unhandled exception should, IMO, cause the test to ultimately fail. We should not silently skip something that is broken
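The "raise instead of returning None" pattern requested in the comments above might look like the sketch below. The local `CommandFailure` class is a stand-in for the framework exception named in the suggestions, and the shape of the query result (`status`, `response`, `members`, `rank`, `incarnation`) is an assumption made for illustration, not something confirmed in this thread.

```python
class CommandFailure(Exception):
    """Stand-in for the test framework's CommandFailure exception."""


def get_rank_incarnation(query_result: dict, rank: int) -> int:
    """Return the incarnation of rank, raising if it cannot be determined.

    The dictionary layout is assumed for this sketch.
    """
    if query_result.get("status") != 0:
        raise CommandFailure(f"dmg system query failed for rank {rank}")
    for member in query_result.get("response", {}).get("members", []):
        if member.get("rank") == rank:
            return member["incarnation"]
    # No silent None: the caller (the test) decides whether this is a failure.
    raise CommandFailure(f"rank {rank} not found in system query output")
```

With this shape the test body no longer needs an `is None` check; an unexpected result surfaces as an exception and fails the test unless the test explicitly handles it.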

daltonbohning · 2026-05-12T20:47:47Z

+        self.server_managers[0].system_stop()
+        time.sleep(2)
+        self.server_managers[0].system_start()


Where does the arbitrary 2s sleep come from? This will eventually be a problem and we will have to revisit

Will remove it if you don't think it's necessary for a system restart.
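If some wait does turn out to be needed, a common alternative to a fixed sleep is to poll for the expected condition with a deadline. The helper below is a generic sketch; `wait_for` and its parameters are hypothetical and not part of the test framework.

```python
import time
from typing import Callable


def wait_for(predicate: Callable[[], bool], timeout_s: float,
             interval_s: float = 0.5,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            # Condition never became true within the allowed time.
            return False
        sleep(interval_s)
```

The caller would pass a predicate that checks the real condition (for example, that all ranks report a stopped state) and fail the test if `wait_for` returns False.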

daosbuild3 · 2026-05-12T20:54:16Z

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/28/execution/node/942/log

tanabarr · 2026-05-13T10:31:07Z

+
+        Returns:
+            tuple: (restarted, final_state) - whether rank restarted and its final state
+        """


Moving register_cleanup() to the beginning of each test because not all of them call this helper.

tanabarr self-assigned this Apr 17, 2025

tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Apr 17, 2025

tanabarr changed the title ~~DAOS-17427 control: Restart evicted rank after suicide~~ DAOS-17427 control: Restart excluded rank after suicide Apr 19, 2025

DAOS-17427 control: Restart evicted rank after suicide

b94c921

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch 2 times, most recently from f1a124f to e57b0d3 Compare March 17, 2026 22:57

tanabarr requested review from kjacque, knard38, liw and mjmac March 17, 2026 22:58

implement suicide event handlers

af7f056

Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch from e57b0d3 to af7f056 Compare March 19, 2026 01:17

liw reviewed Mar 19, 2026

tanabarr added 4 commits March 19, 2026 16:36

add unit testing and documentation

4ce711f

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

fix docs and unit tests

550ef12

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

revise unit test for suicide handler

1f61b98

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

fixup tests

8a79efb

Features: control pool Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr marked this pull request as ready for review March 27, 2026 13:07

tanabarr requested review from a team as code owners March 27, 2026 13:07

tanabarr requested a review from liw March 27, 2026 13:07

Merge branch 'master' into tanabarr/control-engine-suicide-restart

f643187

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

tanabarr requested review from a team as code owners May 6, 2026 21:46

kjacque requested changes May 6, 2026

fix server package unit test helpers

bdfde05

Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr added 7 commits May 7, 2026 14:06

Merge remote-tracking branch 'origin/master' into tanabarr/control-en…

417b33b

…gine-suicide-restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Revert "fix server package unit test helpers"

c76507f

This reverts commit bdfde05.

fix server package unit test helpers

8d5a1da

Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

addressed review comments from kjacque pt1

4a2938f

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

allow restart manager to close and open again

049509c

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Revert "allow restart manager to close and open again"

bd15522

This reverts commit 049509c.

address some review comments from kjacque

06f3f6e

Features: control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

kjacque previously approved these changes May 12, 2026

tanabarr added 2 commits May 12, 2026 11:32

Merge remote-tracking branch 'origin/master' into tanabarr/control-en…

8f82660

…gine-suicide-restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

comment one start/stop per process lifetime

5feadfe

Test-tag: pr control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr dismissed kjacque’s stale review via 5feadfe May 12, 2026 11:32

address more review comments from kjacque

969c837

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

daltonbohning reviewed May 12, 2026

pylint fixes

835156b

Test-tag: pr control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr and others added 3 commits May 13, 2026 10:17

using self.register_cleanup (#18240)

95c061a

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com> Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>

Apply suggestion from @daltonbohning

9ab86a9

Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

more ftest related review comment updates

1913fd9

Test-tag: hw,medium,dmg,control,engine_auto_restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr commented May 13, 2026

tanabarr added 2 commits May 13, 2026 12:51

f-string updates and remove step comments in log_step calls use Comma…

9dc146c

…ndFailure exception for helpers and register cleanup in setUp for each test class Test-tag: hw,medium,dmg,control,engine_auto_restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Merge remote-tracking branch 'origin/master' into tanabarr/control-en…

48ae917

…gine-suicide-restart Test-tag: hw,medium,dmg,control,engine_auto_restart pr Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
