
DAOS-16575 rebuild: set rgt DRS_COMPLETED for unretryable failure #18204

Open

liuxuezhao wants to merge 1 commit into master from lxz/rb_test_refine

Conversation

@liuxuezhao
Contributor

If a rebuild fails with an unretryable failure, set it to DRS_COMPLETED so that subsequent pool queries do not keep reporting rebuild busy, such as - Rebuild failing (state=busy, status=-2001)
After this patch, that kind of failed rebuild will report - Rebuild failed (state=done, status=-2001)
This fixes the test_rebuild_31/REBUILD31 problem.

For test_rebuild_32/REBUILD32, add DAOS_POOL_TGT_UPDATE_SKIP_RF_CHECK to bypass the DAOS_POOL_RF check when changing the pool map.

Fixes: DAOS-16575 DAOS-16576
Test-tag: pr test_rebuild_31 test_rebuild_32

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@liuxuezhao liuxuezhao requested review from a team as code owners May 8, 2026 07:42
@github-actions

github-actions Bot commented May 8, 2026

Ticket title is 'daos_test/rebuild.py:DaosCoreTestRebuild.test_rebuild_31 and test_rebuild_32 failures'
Status is 'In Review'
Labels: 'ci_master_daily,daily_test'
https://daosio.atlassian.net/browse/DAOS-16575

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18204/1/execution/node/1343/log

@liuxuezhao liuxuezhao force-pushed the lxz/rb_test_refine branch from 495d1f5 to f2dc2f6 Compare May 11, 2026 02:39
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18204/2/execution/node/1302/log

Contributor

@daltonbohning daltonbohning left a comment


ftest changes LGTM

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18204/2/testReport/

Contributor

@kccain kccain left a comment


Definitely recommend running patches with a change this significant with Features: rebuild. Also see a couple of comments/questions about when to switch to DRS_COMPLETED.

Comment thread src/rebuild/srv.c
*/
DL_CDEBUG(error, DLOG_ERR, DLOG_INFO, error, DF_UUID " opc %u/%u, retry.",
DP_UUID(task->dst_pool_uuid), task->dst_rebuild_op, task->dst_map_ver);
*opc = RB_OP_REBUILD;
Contributor


Does the code have to retry rebuilds for targets in the DOWN state, when the original rebuild failed with a non-retryable error? Setting opc = RB_OP_REBUILD here makes things really complicated once the RB_OP_FAIL_RECLAIM finishes, because it schedules another RB_OP_REBUILD that may quickly cancel itself (for drain/extend/reintegrate). Or for exclude it will actually run. This makes it difficult to use pool query rebuild status, because the results will vary depending on when the query occurs (after original RB_OP_REBUILD: busy/nonzero rs_errno ; when RB_OP_FAIL_RECLAIM is run: DRS_IN_PROGRESS/rs_errno=0 ; when RB_OP_REBUILD is launched again DRS_IN_PROGRESS/rs_errno=0 or DRS_COMPLETED/rs_errno=0).

If the code assigns *opc to RB_OP_NONE here, there could be a chance with some code changes to arrange so that when RB_OP_FAIL_RECLAIM finishes, it could set the state to DRS_COMPLETED, and set the rs_errno to the original failed rebuild error value. This might make it easier to query for a failed rebuild (pool query would show DRS_IN_PROGRESS/nonzero rs_errno after first RB_OP_REBUILD failure, then DRS_IN_PROGRESS/rs_errno=0 during RB_OP_FAIL_RECLAIM, then finally DRS_COMPLETED/nonzero rs_errno).

Also, if the code were to set RB_OP_NONE here after an exclude rebuild fails with a non-retryable error, then at a later time the administrator could manually retry to rebuild for the DOWN targets with dmg pool rebuild start command.

Contributor Author


"This makes it difficult to use pool query rebuild status, because the results will vary depending on when the query occurs"
Whenever it is queried, it looks like it can return an expected/explainable rebuild status, considering ds_rebuild_query().

As for whether to set opc to RB_OP_NONE here, one point is that if the original rebuild is mixed with reint/drain/down, then after reverting the pool map, the DOWN targets' rebuild can possibly succeed. Whether or not to retry a DOWN-only rebuild, I am not 100% sure now.
But this PR is not intended to change that part anyway.

Comment thread src/rebuild/srv.c
rgt->rgt_status.rs_state = DRS_IN_PROGRESS;
if ((rgt->rgt_opc == RB_OP_REBUILD) && (rgt->rgt_status.rs_errno != 0) &&
!rebuild_retryable_err(rgt->rgt_status.rs_errno))
rgt->rgt_status.rs_state = DRS_COMPLETED;
Contributor


I would rather have the state remain DRS_IN_PROGRESS while the RB_OP_FAIL_RECLAIM runs, and after that finishes, set state=DRS_COMPLETED (and rs_errno be assigned to the original RB_OP_REBUILD error number). This way, a test using pool query to poll for rebuild complete can wait until the reclaim/fail_reclaim steps are done.

Contributor Author

@liuxuezhao liuxuezhao May 13, 2026


When the FAIL_RECLAIM task runs or is in the queue, ds_rebuild_query() sets "status->rs_state = DRS_IN_PROGRESS", so the current code looks the same as what you described.

@liuxuezhao
Contributor Author

Definitely recommend running patches with this significant a change with Features: rebuild. Also see a couple of comments/questions about when to switch to DRS_COMPLETED.

It actually doesn't look like a really significant change, but it doesn't hurt to rerun with Features: rebuild.

If a rebuild fails with an unretryable failure, set it to DRS_COMPLETED
to avoid subsequent pool queries always reporting rebuild busy, such as -
Rebuild failing (state=busy, status=-2001)
After this patch, that kind of failed rebuild will report -
Rebuild failed (state=done, status=-2001)
This fixes the test_rebuild_31/REBUILD31 problem.

For test_rebuild_32/REBUILD32, add DAOS_POOL_TGT_UPDATE_SKIP_RF_CHECK to
bypass the DAOS_POOL_RF check when changing the pool map.

Fixes: DAOS-16575 DAOS-16576
Test-tag: pr test_rebuild_31 test_rebuild_32
Features: rebuild

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
@liuxuezhao liuxuezhao force-pushed the lxz/rb_test_refine branch from f2dc2f6 to 9c7d55d Compare May 13, 2026 03:35
@liuxuezhao
Contributor Author

Just a rebase, plus adding "Features: rebuild" to rerun.
