
DAOS-16575 rebuild: set rgt DRS_COMPLETED for unretryable failure #18204

Open

liuxuezhao wants to merge 1 commit into master from lxz/rb_test_refine

Conversation

@liuxuezhao
Contributor

If a rebuild fails with an unretryable failure, set it to DRS_COMPLETED so that subsequent pool queries do not keep reporting rebuild busy, such as - Rebuild failing (state=busy, status=-2001)
After this patch, that kind of failed rebuild will report - Rebuild failed (state=done, status=-2001)
This fixes the test_rebuild_31/REBUILD31 problem.

For test_rebuild_32/REBUILD32, add DAOS_POOL_TGT_UPDATE_SKIP_RF_CHECK to bypass the DAOS_POOL_RF check when changing the pool map.

Fixes: DAOS-16575 DAOS-16576
Test-tag: pr test_rebuild_31 test_rebuild_32

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@liuxuezhao liuxuezhao requested review from a team as code owners May 8, 2026 07:42
@github-actions

github-actions Bot commented May 8, 2026

Ticket title is 'daos_test/rebuild.py:DaosCoreTestRebuild.test_rebuild_31 and test_rebuild_32 failures'
Status is 'In Review'
Labels: 'ci_master_daily,daily_test'
https://daosio.atlassian.net/browse/DAOS-16575

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18204/1/execution/node/1343/log

@liuxuezhao liuxuezhao force-pushed the lxz/rb_test_refine branch from 495d1f5 to f2dc2f6 Compare May 11, 2026 02:39
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18204/2/execution/node/1302/log

Contributor

@daltonbohning daltonbohning left a comment


ftest changes LGTM

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18204/2/testReport/

Contributor

@kccain kccain left a comment


Definitely recommend running patches with a change this significant with Features: rebuild. Also see a couple of comments/questions about when to switch to DRS_COMPLETED.

Comment thread src/rebuild/srv.c
*/
DL_CDEBUG(error, DLOG_ERR, DLOG_INFO, error, DF_UUID " opc %u/%u, retry.",
DP_UUID(task->dst_pool_uuid), task->dst_rebuild_op, task->dst_map_ver);
*opc = RB_OP_REBUILD;
Contributor


Does the code have to retry rebuilds for targets in the DOWN state, when the original rebuild failed with a non-retryable error? Setting opc = RB_OP_REBUILD here makes things really complicated once the RB_OP_FAIL_RECLAIM finishes, because it schedules another RB_OP_REBUILD that may quickly cancel itself (for drain/extend/reintegrate). Or for exclude it will actually run. This makes it difficult to use pool query rebuild status, because the results will vary depending on when the query occurs (after original RB_OP_REBUILD: busy/nonzero rs_errno ; when RB_OP_FAIL_RECLAIM is run: DRS_IN_PROGRESS/rs_errno=0 ; when RB_OP_REBUILD is launched again DRS_IN_PROGRESS/rs_errno=0 or DRS_COMPLETED/rs_errno=0).

If the code assigns *opc to RB_OP_NONE here, there could be a chance with some code changes to arrange so that when RB_OP_FAIL_RECLAIM finishes, it could set the state to DRS_COMPLETED, and set the rs_errno to the original failed rebuild error value. This might make it easier to query for a failed rebuild (pool query would show DRS_IN_PROGRESS/nonzero rs_errno after first RB_OP_REBUILD failure, then DRS_IN_PROGRESS/rs_errno=0 during RB_OP_FAIL_RECLAIM, then finally DRS_COMPLETED/nonzero rs_errno).

Also, if the code were to set RB_OP_NONE here after an exclude rebuild fails with a non-retryable error, then at a later time the administrator could manually retry to rebuild for the DOWN targets with dmg pool rebuild start command.

Contributor Author


"This makes it difficult to use pool query rebuild status, because the results will vary depending on when the query occurs"
Whenever it is queried, it looks like it can return an expected/explainable rebuild status, considering ds_rebuild_query().

As for whether to set opc to RB_OP_NONE here, one point is that if the original rebuild is mixed with reint/drain/down, then after reverting the pool map, the DOWN targets' rebuild can possibly succeed. Whether or not to retry a DOWN-only rebuild, I am not 100% sure now.
But this PR is not intended to change that part anyway.

Comment thread src/rebuild/srv.c
rgt->rgt_status.rs_state = DRS_IN_PROGRESS;
if ((rgt->rgt_opc == RB_OP_REBUILD) && (rgt->rgt_status.rs_errno != 0) &&
!rebuild_retryable_err(rgt->rgt_status.rs_errno))
rgt->rgt_status.rs_state = DRS_COMPLETED;
Contributor


I would rather have the state remain DRS_IN_PROGRESS while the RB_OP_FAIL_RECLAIM runs, and after that finishes, set state=DRS_COMPLETED (and rs_errno be assigned to the original RB_OP_REBUILD error number). This way, a test using pool query to poll for rebuild complete can wait until the reclaim/fail_reclaim steps are done.

Contributor Author

@liuxuezhao liuxuezhao May 13, 2026


When the FAIL_RECLAIM task runs or is in the queue, ds_rebuild_query() sets "status->rs_state = DRS_IN_PROGRESS", so the current code looks the same as what you described.

@liuxuezhao
Contributor Author

Definitely recommend running patches with this significant a change with Features: rebuild. Also see a couple of comments/questions about when to switch to DRS_COMPLETED.

It actually doesn't look like a really significant change, but it doesn't hurt to rerun with Features: rebuild.

If a rebuild fails with an unretryable failure, set it to DRS_COMPLETED
to avoid subsequent pool queries always reporting rebuild busy, such as -
Rebuild failing (state=busy, status=-2001)
After this patch, that kind of failed rebuild will report -
Rebuild failed (state=done, status=-2001)
This fixes the test_rebuild_31/REBUILD31 problem.

For test_rebuild_32/REBUILD32, add DAOS_POOL_TGT_UPDATE_SKIP_RF_CHECK to
bypass the DAOS_POOL_RF check when changing the pool map.

Fixes: DAOS-16575 DAOS-16576
Test-tag: pr test_rebuild_31 test_rebuild_32
Features: rebuild

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
@liuxuezhao liuxuezhao force-pushed the lxz/rb_test_refine branch from f2dc2f6 to 9c7d55d Compare May 13, 2026 03:35
@liuxuezhao
Contributor Author

Just a rebase, plus adding "Features: rebuild" to rerun.
