DAOS-16575 rebuild: set rgt DRS_COMPLETED for unretryable failure #18204
liuxuezhao wants to merge 1 commit into
Conversation
Ticket title is 'daos_test/rebuild.py:DaosCoreTestRebuild.test_rebuild_31 and test_rebuild_32 failures'

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18204/1/execution/node/1343/log
Force-pushed 495d1f5 to f2dc2f6
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18204/2/execution/node/1302/log
daltonbohning left a comment:
ftest changes LGTM

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18204/2/testReport/
kccain left a comment:
I definitely recommend running patches with a change this significant with Features: rebuild. Also see a couple of comments/questions about when to switch to DRS_COMPLETED.
```c
	 */
	DL_CDEBUG(error, DLOG_ERR, DLOG_INFO, error, DF_UUID " opc %u/%u, retry.",
		  DP_UUID(task->dst_pool_uuid), task->dst_rebuild_op, task->dst_map_ver);
	*opc = RB_OP_REBUILD;
```
Does the code have to retry rebuilds for targets in the DOWN state when the original rebuild failed with a non-retryable error? Setting *opc = RB_OP_REBUILD here makes things really complicated once the RB_OP_FAIL_RECLAIM finishes, because it schedules another RB_OP_REBUILD that may quickly cancel itself (for drain/extend/reintegrate), or that will actually run (for exclude). This makes it difficult to use pool query rebuild status, because the results will vary depending on when the query occurs (after the original RB_OP_REBUILD: busy/nonzero rs_errno; while RB_OP_FAIL_RECLAIM is running: DRS_IN_PROGRESS/rs_errno=0; when RB_OP_REBUILD is launched again: DRS_IN_PROGRESS/rs_errno=0 or DRS_COMPLETED/rs_errno=0).
If the code assigned *opc = RB_OP_NONE here, then with some code changes it could be arranged so that when RB_OP_FAIL_RECLAIM finishes, it sets the state to DRS_COMPLETED and sets rs_errno to the original failed rebuild error value. This might make it easier to query for a failed rebuild (pool query would show DRS_IN_PROGRESS/nonzero rs_errno after the first RB_OP_REBUILD failure, then DRS_IN_PROGRESS/rs_errno=0 during RB_OP_FAIL_RECLAIM, then finally DRS_COMPLETED/nonzero rs_errno).
Also, if the code were to set RB_OP_NONE here after an exclude rebuild fails with a non-retryable error, then at a later time the administrator could manually retry the rebuild for the DOWN targets with the dmg pool rebuild start command.
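The three query-visible phases proposed above can be modeled as a small state table. This is an illustrative sketch, not DAOS code: the DRS_* state names mirror the discussion, but the phase enum, struct, and function below are invented for demonstration only.

```c
/* Illustrative model (not DAOS code) of the query-visible status sequence
 * proposed for a non-retryable exclude-rebuild failure. */
enum rebuild_state { DRS_IN_PROGRESS, DRS_COMPLETED };

/* Invented phase names for the three points in time discussed above. */
enum phase { PHASE_REBUILD_FAILED, PHASE_FAIL_RECLAIM, PHASE_DONE };

struct rebuild_status {
	enum rebuild_state rs_state;
	int                rs_errno;
};

/* What a pool query would report at each phase, per the proposal. */
static struct rebuild_status
query_status(enum phase p, int orig_err)
{
	switch (p) {
	case PHASE_REBUILD_FAILED:
		/* first RB_OP_REBUILD failed: still in progress, error visible */
		return (struct rebuild_status){DRS_IN_PROGRESS, orig_err};
	case PHASE_FAIL_RECLAIM:
		/* reverting pool map changes: error masked while reclaim runs */
		return (struct rebuild_status){DRS_IN_PROGRESS, 0};
	default:
		/* reclaim finished: terminal state carries the original error */
		return (struct rebuild_status){DRS_COMPLETED, orig_err};
	}
}
```

The key property is that the terminal DRS_COMPLETED state preserves the original rs_errno, so a polling test sees exactly one stable final answer.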
"This makes it difficult to use pool query rebuild status, because the results will vary depending on when the query occurs"
For whatever time it is queried, it looks like it can return an expected/explainable rebuild status, considering ds_rebuild_query().
As for whether or not to set *opc to RB_OP_NONE here, I think one point is that if the original rebuild is mixed with reint/drain/down, then after reverting the pool map, the DOWN targets' rebuild can possibly succeed. Whether or not to retry a DOWN-only rebuild, I am not 100% sure now.
But this PR is not intended to change this part anyway.
```c
	rgt->rgt_status.rs_state = DRS_IN_PROGRESS;
	if ((rgt->rgt_opc == RB_OP_REBUILD) && (rgt->rgt_status.rs_errno != 0) &&
	    !rebuild_retryable_err(rgt->rgt_status.rs_errno))
		rgt->rgt_status.rs_state = DRS_COMPLETED;
```
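For context, here is a minimal sketch of the shape of a retryable-error predicate like the rebuild_retryable_err() used above. The error codes and their classification are placeholders invented for illustration; the real DAOS DER_* values and the actual policy in the rebuild code differ.

```c
#include <stdbool.h>

/* Placeholder error codes; NOT the real DAOS DER_* values. */
#define ERR_TIMEDOUT      (-1011) /* assumed transient: a retry may succeed */
#define ERR_NOSPACE       (-1007) /* assumed transient */
#define ERR_UNRECOVERABLE (-2001) /* assumed permanent failure */

/* Sketch: classify a rebuild error as retryable or not. An unretryable
 * error is the case where the patch above switches rs_state to
 * DRS_COMPLETED instead of scheduling another RB_OP_REBUILD. */
static bool
rebuild_retryable_err_sketch(int err)
{
	switch (err) {
	case ERR_TIMEDOUT:
	case ERR_NOSPACE:
		return true;   /* transient: schedule another RB_OP_REBUILD */
	default:
		return false;  /* unretryable: report DRS_COMPLETED, keep rs_errno */
	}
}
```

The design point the patch relies on is that an unretryable error terminates the retry loop, so pool query stops reporting state=busy forever.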
I would rather have the state remain DRS_IN_PROGRESS while the RB_OP_FAIL_RECLAIM runs, and after that finishes, set state=DRS_COMPLETED (with rs_errno assigned the original RB_OP_REBUILD error number). That way, a test using pool query to poll for rebuild completion can wait until the reclaim/fail_reclaim steps are done.
When the FAIL_RECLAIM task is running or queued, ds_rebuild_query() will set "status->rs_state = DRS_IN_PROGRESS", so the current code looks the same as what you said.
It is actually not a really significant change, but it looks not bad to rerun with Features: rebuild.
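The query-time override described in this exchange can be sketched as follows. This is an assumption-laden model of the behavior attributed to ds_rebuild_query(), not the actual implementation: the "FAIL_RECLAIM running or queued" check is reduced to a boolean parameter, and the types are redeclared locally for the sketch.

```c
#include <stdbool.h>

enum rebuild_state { DRS_IN_PROGRESS, DRS_COMPLETED };

struct rebuild_status {
	enum rebuild_state rs_state;
	int                rs_errno;
};

/* Sketch of the query-time override: while an RB_OP_FAIL_RECLAIM task is
 * running or queued, the reported state is forced back to DRS_IN_PROGRESS,
 * even if the tracking structure already says DRS_COMPLETED. Once no
 * fail-reclaim work is pending, the stored terminal state shows through. */
static void
query_override_sketch(struct rebuild_status *status, bool fail_reclaim_pending)
{
	if (fail_reclaim_pending)
		status->rs_state = DRS_IN_PROGRESS;
}
```

Under this model, a poller only ever observes DRS_COMPLETED after the reclaim steps are done, which is the behavior both reviewers agree is desirable.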
If a rebuild failed with an unretryable failure, set it as DRS_COMPLETED to avoid subsequent pool queries always reporting rebuild busy, such as:
- Rebuild failing (state=busy, status=-2001)

After this patch, that kind of failed rebuild will report:
- Rebuild failed (state=done, status=-2001)

This fixes the test_rebuild_31/REBUILD31 problem. For test_rebuild_32/REBUILD32, add DAOS_POOL_TGT_UPDATE_SKIP_RF_CHECK to bypass the DAOS_POOL_RF check when changing the pool map.

Fixes: DAOS-16575 DAOS-16576
Test-tag: pr test_rebuild_31 test_rebuild_32
Features: rebuild
Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Force-pushed f2dc2f6 to 9c7d55d
Just a rebase, and added "Features: rebuild" to rerun.