DAOS-17427 control: Restart excluded rank after suicide#16279

Open
tanabarr wants to merge 46 commits into master from tanabarr/control-engine-suicide-restart

Contributor

@tanabarr tanabarr commented Apr 17, 2025

When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.

The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once the engines rejoin,
the rank will still need to be reintegrated into pools as a separate
follow‑up step.

Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
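
The rate-limiting described above can be modelled roughly as follows. This is a minimal Python sketch of the idea, not the control plane code (which is written in Go). The class `RestartLimiter` and its methods are hypothetical names; only the two parameters they model, `disable_engine_auto_restart` and `engine_auto_restart_min_delay`, come from the description. Deferring a too-early restart rather than dropping it is an assumption of the sketch.

```python
import time


class RestartLimiter:
    """Hypothetical model of per-rank restart rate-limiting (not the DAOS code)."""

    def __init__(self, min_delay=300, disabled=False, clock=time.monotonic):
        self.min_delay = min_delay    # models engine_auto_restart_min_delay (seconds)
        self.disabled = disabled      # models disable_engine_auto_restart
        self._clock = clock           # injectable so the logic can be tested
        self._last_restart = {}       # rank -> time of the last automatic restart

    def delay_before_restart(self, rank):
        """Return how many seconds to wait before restarting the rank.

        Returns None when automatic restarts are disabled.
        """
        if self.disabled:
            return None
        last = self._last_restart.get(rank)
        if last is None:
            return 0.0  # never restarted automatically: restart immediately
        elapsed = self._clock() - last
        return max(0.0, self.min_delay - elapsed)

    def record_restart(self, rank):
        """Remember when the rank was last restarted automatically."""
        self._last_restart[rank] = self._clock()
```

Because the history is kept per rank, a restart of one rank does not delay the restart of another.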

Functional tests for the automatic engine restart feature are included,
with cases to verify disabling, rate-limiting and configuration support.

Features: pool control

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr self-assigned this Apr 17, 2025

github-actions Bot commented Apr 17, 2025

Ticket title is 'Handle engine suicides by automatically restarting the engines'
Status is 'In Review'
Labels: 'request_for_2.8'
https://daosio.atlassian.net/browse/DAOS-17427

@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Apr 17, 2025
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-16279/1/execution/node/1466/log

@tanabarr tanabarr changed the title from "DAOS-17427 control: Restart evicted rank after suicide" to "DAOS-17427 control: Restart excluded rank after suicide" Apr 19, 2025
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch 2 times, most recently from f1a124f to e57b0d3 March 17, 2026 22:57
@tanabarr tanabarr requested review from kjacque, knard38, liw and mjmac March 17, 2026 22:58
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch from e57b0d3 to af7f056 March 19, 2026 01:17
Contributor

@liw liw left a comment

Not requesting changes.

Comment thread src/engine/init.c
D_ERROR("failed to handle %u/%u event: " DF_RC "\n", src, type,
        DP_RC(rc));

rc = kill(getpid(), SIGKILL);
Contributor

[Discussion] Looking at the code changes, I think I'm back to the previously discussed point: whether the engine kills itself, or only informs the server, which then decides what to do. If the engine kills itself, why do we need the RAS event? The server can simply decide whether to restart a killed engine even without the RAS event, can't it?

[Question] If the ds_notify_rank_suicide call fails, then the engine will still terminate, but will the server restart the engine?

Contributor Author

[Discussion] I implemented kill+RAS mainly because it guarantees that the engine process will be killed. The RAS event signifies that the termination is a suicide rather than some other kind of exit. Do we really want to always restart a terminated rank regardless of reason?

[Question] Yes, if the notify call fails then the engine will be terminated but not restarted. If we instead don't terminate the engine on a group map change, and just send the RAS event and put the engine into a blocking state, then a failed notify call will effectively leave the engine hung.

Contributor

@liw liw Mar 19, 2026

> do we really want to always restart a terminated rank regardless of reason?

Hmm, it's indeed hard to say for sure. I'd understand if we'd like to begin conservatively, only restarting for certain cases.

Contributor

I agree we should start conservatively. For engines that crash, or exit for some unknown reason, we don't know if the engine would be OK if we restarted it. Better to let the admin investigate and manually restart the ranks in that case.
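
The conservative policy agreed in this thread can be summarised with a short sketch: restart only when the exit was announced by the self-termination event, and leave every other exit to the admin. This is an illustrative Python model, not the Go control plane code; `ExitCause` and `should_auto_restart` are hypothetical names.

```python
from enum import Enum


class ExitCause(Enum):
    """Hypothetical classification of why an engine process exited."""

    SELF_TERMINATED = "engine_self_terminated RAS event received before exit"
    UNKNOWN = "exit with no preceding self-termination event"


def should_auto_restart(cause, auto_restart_disabled=False):
    """Restart only engines that announced their own termination.

    Crashes and unexplained exits are left for the admin to investigate
    and restart manually.
    """
    if auto_restart_disabled:
        return False
    return cause is ExitCause.SELF_TERMINATED
```

Under this model, a failed notification is treated like any other unexplained exit, which matches the answer above: the engine terminates but is not restarted.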

@knard38
Contributor

knard38 commented Mar 19, 2026

@tanabarr, I had the same issue with the CI regarding the "Unit test beds with memcheck" stage.
In the end, I merged with master and restarted the CI.
That stage now runs successfully.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control pool

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr marked this pull request as ready for review March 27, 2026 13:07
@tanabarr tanabarr requested review from a team as code owners March 27, 2026 13:07
@tanabarr tanabarr requested a review from liw March 27, 2026 13:07
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/5/execution/node/1303/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/5/testReport/

@tanabarr
Contributor Author

The PR CI was run with Features: pool control to give extra coverage in case automatically restarting engines causes any tests to fail. The BoundaryTest and ListVerboseTest failures are unrelated.

@kjacque @knard38 @mjmac @liw can I get reviews please?

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr
Contributor Author

Conflicts resolved (doc-only change). CI run no. 5 is still relevant and should be used for PR review.

Functional tests for the automatic engine restart feature introduced
in the control plane. These tests verify that engines automatically
restart after self-termination when excluded from the system, with
cases to verify disabling, rate-limiting and configuration support.

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

* try to fix test issues

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from a team as code owners May 6, 2026 21:46
Contributor

@kjacque kjacque left a comment

First of all: nice work, and excellent job with all the test coverage. My comments are mostly minor cleanup stuff, but the stop/start behavior is something we may want to make behave better, or at least note that we only get to stop it once.

Comment thread src/control/server/ctl_ranks_rpc.go Outdated
Comment thread src/control/server/server.go Outdated
Comment thread src/control/server/server.go Outdated
Comment thread src/control/server/server_utils.go Outdated
Comment thread src/control/server/server_utils.go Outdated
Comment on lines +23 to +31
func setupTestLogger(t *testing.T) (logging.Logger, *logging.LogBuffer) {
    t.Helper()
    log, buf := logging.NewTestLogger(t.Name())
    t.Cleanup(func() {
        test.ShowBufferOnFailure(t, buf)
    })

    return log, buf
}
Contributor

Don't want to use test.MustLogContext/logging.FromContext to create the logger?

Contributor Author

@tanabarr tanabarr May 9, 2026

The problem is that this pattern doesn't let me expose the LogBuffer for use in the tests for comparison purposes. Maybe I'm missing a trick here? @mjmac

Contributor

Ah, I see! Never mind, I'm not sure how to extract the buffer in that case either. Might be nice to add this function to the test package. I suspect this will be useful in other tests (maybe call it NewTestLoggerWithBuffer or something like that?). Not requesting a change on this PR, just a suggestion.

Comment thread src/control/server/instance_restart_test.go Outdated
Comment on lines +499 to +501
if timer != nil {
    timer.Stop()
}
Contributor

For timers that need to stop at the end, you could use defer after you get them.

Contributor Author

In this case we don't know whether the timer still exists at the end of the function. We could do the nil check in a deferred func, but I don't think that buys us much.

Contributor

Looks like the test asserts if the timer wasn't set in the map... So it would have to be non-nil at this point, wouldn't it? Doesn't look like we're ever resetting that variable.

Contributor Author

have changed to improve clarity, done

Comment thread src/control/server/instance_restart_test.go Outdated
Comment thread src/control/server/server_utils_test.go
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
tanabarr added 7 commits May 7, 2026 14:06
…gine-suicide-restart

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
kjacque previously approved these changes May 12, 2026
Contributor

@kjacque kjacque left a comment

Nothing left that's blocking, just the minor typos and the Notice level logging should be fixed. I'm fine with that happening in a follow-on.

Comment thread src/control/server/ctl_ranks_rpc.go Outdated
svc.restartMgr.clearRankRestartHistory(ranks)
}
}
// clearly state history for stopped ranks, instances have already been filtered by
Contributor

typo? clearly -> clear

Contributor Author

done

Comment thread src/control/server/ctl_ranks_rpc.go Outdated
svc.restartMgr.clearRankRestartHistory(ranks)
}
}
// clearly state history for started ranks, instances have already been filtered by
Contributor

typo? clearly -> clear

Contributor Author

done

Comment thread src/control/server/server_utils_test.go
ei.ready.SetTrue()
}
srv.runner = engine.NewTestRunner(trc, engine.MockConfig())
srv.setIndex(idx)
Contributor

I guess no tests were using the index?

Contributor Author

nope, so I removed it

}
mgr.pendingRestart = make(map[ranklist.Rank]*time.Timer)

close(mgr.stopChan)
Contributor

As discussed, I agree it's fine as-is. I do think it'd be nice to leave a note to ourselves as a comment though.

Comment thread src/control/server/instance_restart.go Outdated

// Record restart time and clear pending state on exit (deferred)
mgr.recordRestartTime(rank)
mgr.log.Noticef("recording rank %d", rank)
Contributor

Oops, looks like this fix didn't get pushed.

tanabarr added 2 commits May 12, 2026 11:32
…gine-suicide-restart

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: pr control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Contributor

@daltonbohning daltonbohning left a comment

It seems the new functional tests aren't even running in CI?
There are also many lint errors in ftest that need to be resolved or possibly ignored.

Comment on lines +21 to +31
def tearDown(self):
    """Clean up after each test method."""
    # Reset restart state for next test method
    # This ensures clean state between sequential tests
    try:
        self.reset_engine_restart_state()
    except Exception as error:
        self.log.error("Failed to reset engine restart state: %s", error)
        self.fail("tearDown failed to reset engine restart state: {}".format(error))
    finally:
        super().tearDown()
Contributor

We have a way to handle this in the framework by calling

self.register_cleanup(reset_engine_restart_state)

After whatever operation puts the system into the invalid state. So maybe instead of defining this tearDown, we can do that after calling exclude_rank_and_wait_restart the first time
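
The pattern suggested here, registering the cleanup right after the operation that changes system state instead of overriding tearDown, can be sketched as below. To keep the example self-contained and runnable it uses the standard library's `unittest.TestCase.addCleanup` as a stand-in for the framework's `register_cleanup`. The class and the stub helper bodies are hypothetical and only record the order of calls.

```python
import unittest


class EngineAutoRestartSketch(unittest.TestCase):
    """Hypothetical test showing cleanup registration order."""

    calls = []  # records the order of operations for illustration

    def exclude_rank(self):
        # Stand-in for the operation that puts the system into a state
        # that later tests must not inherit.
        self.calls.append("exclude")

    def reset_engine_restart_state(self):
        # Stand-in for the reset helper discussed in the review.
        self.calls.append("reset")

    def test_restart(self):
        self.exclude_rank()
        # Register the reset only once the state has actually been changed;
        # unittest runs it after the test body, whether it passes or fails.
        self.addCleanup(self.reset_engine_restart_state)
        self.calls.append("verify")
```

Running the test records "exclude", "verify", "reset" in that order, so the reset happens without a custom tearDown.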

2. Wait for rank to self-terminate
3. Verify rank automatically restarts and rejoins the system

:avocado: tags=all,pr,daily_regression
Contributor

Just confirming that we really do want to add this to PR testing?

Contributor Author

Yes, I think so; it's a significant feature.

"""
all_ranks = self.get_all_ranks()
if len(all_ranks) < 2:
    self.skipTest("Test requires at least 2 ranks")
Contributor

It is better to fail because skipping will be silent and easily ignored

Suggested change
- self.skipTest("Test requires at least 2 ranks")
+ self.fail("Test requires at least 2 ranks")

Contributor Author

done

Comment on lines +61 to +63
restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)

if not restarted:
Contributor

nit - intentional new line?

Suggested change
- restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
-
- if not restarted:
+ restarted, final_state = self.exclude_rank_and_wait_restart(test_rank)
+ if not restarted:

Contributor Author

done

Comment on lines +68 to +70
final_incarnation = self.get_rank_incarnation(test_rank)
if final_incarnation is None:
    self.fail(f"failed to get final incarnation for rank {test_rank}")
Contributor

It would be better if get_rank_incarnation raised an exception instead of silently returning None

Contributor Author

done

scm_mount: /mnt/daos1
pool:
size: 2G
timeout: 300
Contributor

Suggested change
- timeout: 300

Contributor Author

done

Comment on lines +70 to +71
if data["status"] != 0:
    self.fail("Cmd dmg system query failed")
Contributor

It's better for a helper function to raise an exception and let the test itself decide if it is a failure

Suggested change
- if data["status"] != 0:
-     self.fail("Cmd dmg system query failed")
+ if data["status"] != 0:
+     raise CommandFailure("dmg system query failed")

Contributor Author

done

Comment on lines +155 to +157
if data.get("status") != 0:
    self.log.error("dmg system query failed for rank %s", rank)
    return None
Contributor

IMO it's better to raise exception than silently skip

Suggested change
- if data.get("status") != 0:
-     self.log.error("dmg system query failed for rank %s", rank)
-     return None
+ if data.get("status") != 0:
+     raise CommandFailure(f"dmg system query failed for rank {rank}")

Contributor Author

done
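
Taken together, the suggestions in the last few threads amount to helpers that raise and tests that decide what counts as a failure. A minimal sketch of such a helper follows. The `CommandFailure` class is a local stand-in for the framework exception named in the suggestions, and the JSON layout (`response`, `members`, `rank`, `incarnation`) is an assumption made for illustration.

```python
class CommandFailure(Exception):
    """Local stand-in for the test framework's CommandFailure exception."""


def get_rank_incarnation(data, rank):
    """Return the incarnation of a rank from a parsed system query result.

    The layout of `data` is assumed here; the point is that every error
    path raises instead of silently returning None.
    """
    if data.get("status") != 0:
        raise CommandFailure(f"dmg system query failed for rank {rank}")
    for member in data.get("response", {}).get("members", []):
        if member.get("rank") == rank:
            return member["incarnation"]
    raise CommandFailure(f"rank {rank} not found in dmg system query output")
```

A test can then call the helper directly and let an unexpected `CommandFailure` fail the test, or catch it where a failure is the expected outcome.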

Comment on lines +180 to +183
except Exception as error:  # pylint: disable=broad-exception-caught
    # Catch all exceptions to prevent test framework crashes during rank queries
    self.log.error("Exception getting incarnation for rank %s: %s", rank, error)
    return None
Contributor

This comment is indicative of a larger problem. An unhandled exception should, IMO, cause the test to ultimately fail. We should not silently skip something that is broken

Contributor Author

fixed

Comment on lines +214 to +216
self.server_managers[0].system_stop()
time.sleep(2)
self.server_managers[0].system_start()
Contributor

Where does the arbitrary 2s sleep come from? This will eventually be a problem and we will have to revisit

Contributor Author

Will remove it if you don't think it's necessary for a system restart.

Contributor Author

done
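
In this thread the sleep was simply removed. Where a test does need to wait for a state change, one common alternative to a fixed sleep is to poll a condition with a timeout. The helper below is a generic sketch of that idea and is not part of the PR; `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(predicate, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False on timeout. The clock and
    sleep functions are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The wait ends as soon as the condition holds, and a timeout surfaces as an explicit False that the test can turn into a failure.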

Test-tag: pr control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
tanabarr and others added 3 commits May 13, 2026 10:17
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Returns:
tuple: (restarted, final_state) - whether rank restarted and its final state
"""
Contributor Author

Moving register_cleanup() to the beginning of each test, because not all of them call this helper.

tanabarr added 2 commits May 13, 2026 12:51
…ndFailure exception for helpers and register cleanup in setUp for each test class

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…gine-suicide-restart

Test-tag: hw,medium,dmg,control,engine_auto_restart pr
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Labels

control-plane: work on the management infrastructure of the DAOS Control Plane
forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.

8 participants