DAOS-18950 object: migrate_cont_iter_cb() error, drop ds_pool#18224

Draft
kccain wants to merge 1 commit into release/2.6 from kccain/daos_18950_rel2p6_ds_pool_ref_fix


@kccain (Contributor) commented May 12, 2026

Before this change, if migrate_cont_iter_cb() gets an error (e.g., -DER_CONT_NONEXIST) while launching cont_fetch_start_ult(), the reference taken on struct ds_pool (held in fetch_arg->pool) is not dropped in the error handling path. Instead, migrate_cont_iter_cb() returns the error directly, so the cleanup in cont_fetch_end_ult is not run. Later, a pool destroy was observed to hang in ds_pool_stop() because the loop waiting for all references to be dropped never finished.

With this change, the error handling path jumps to the free: label, where cont_fetch_end_ult is invoked and drops the reference to the ds_pool. This is expected to avoid a lingering reference on the ds_pool and the resulting pool destroy hang.
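
The pattern described above can be illustrated with a small, self-contained mock. This is only a sketch under stated assumptions and not the DAOS code: every name here (mock_pool, mock_fetch_arg, mock_fetch_start, mock_fetch_end, mock_iter_cb, and the cleanup label standing in for free:) is hypothetical, and the reference count is a plain integer. The `fixed` flag selects between the old behavior (return the error directly) and the new behavior (jump to the shared cleanup step).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ds_pool: only a reference count. */
struct mock_pool {
	int refs;
};

/* Hypothetical stand-in for the fetch argument that holds a pool reference. */
struct mock_fetch_arg {
	struct mock_pool *pool;
};

static void mock_pool_get(struct mock_pool *p) { p->refs++; }
static void mock_pool_put(struct mock_pool *p) { p->refs--; }

/* Stand-in for the cleanup step: drop the pool reference, free the argument. */
static void mock_fetch_end(struct mock_fetch_arg *arg)
{
	assert(arg != NULL && arg->pool != NULL);
	mock_pool_put(arg->pool);
	free(arg);
}

/* Stand-in for launching the fetch; returns launch_rc to simulate a failure. */
static int mock_fetch_start(struct mock_fetch_arg *arg, int launch_rc)
{
	(void)arg;
	return launch_rc;
}

/*
 * Simplified iteration callback.
 * fixed == 0 models the old path: the error is returned directly.
 * fixed == 1 models the new path: the error jumps to the cleanup label.
 */
int mock_iter_cb(struct mock_pool *pool, int launch_rc, int fixed)
{
	struct mock_fetch_arg *arg;
	int rc;

	arg = calloc(1, sizeof(*arg));
	if (arg == NULL)
		return -1;

	mock_pool_get(pool); /* reference held through arg->pool */
	arg->pool = pool;

	rc = mock_fetch_start(arg, launch_rc);
	if (rc != 0 && !fixed) {
		/*
		 * Old path: return without the cleanup step. The mock frees
		 * arg to stay leak-free; only the missing put is modeled.
		 */
		free(arg);
		return rc;
	}
	if (rc != 0)
		goto cleanup;

	/* Success: the mock runs the cleanup step immediately for simplicity. */
cleanup:
	mock_fetch_end(arg);
	return rc;
}
```

Starting from a pool with one reference, a failing call with `fixed == 0` leaves the count at 2, which models the lingering reference that a stop loop would wait on. The same failing call with `fixed == 1` returns the count to 1.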

Features: rebuild ec_online_rebuild ec_offline_rebuild

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@github-actions

Ticket title is 'erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest - tearDown pool destroy time out'
Status is 'Open'
Labels: '2.6.5rc2,daily_test'
https://daosio.atlassian.net/browse/DAOS-18950
