Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-13812 container: fix destroy vs lookup #12757

Merged
merged 4 commits into from
Dec 21, 2023
Merged

Conversation

wangshilong
Copy link
Contributor

If container is going to be dead, set @sc_stopping firstly, and later lookup will fail, then wait existed container services exit.

Required-githooks: true

@github-actions
Copy link

github-actions bot commented Aug 2, 2023

Bug-tracker data:
Ticket title is 'container destroy fails with DER_IO after IOR Hard with EC'
Status is 'In Review'
Labels: 'acceptance,tds'
https://daosio.atlassian.net/browse/DAOS-13812

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@wangshilong wangshilong force-pushed the shilongw/DAOS-13812 branch from 0dd3879 to 5179937 Compare August 2, 2023 07:37
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12757/3/execution/node/1303/log

If container is going to be dead, set @sc_stopping firstly,
and later lookup will fail, then wait existed container services exit.

Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@wangshilong wangshilong marked this pull request as ready for review December 13, 2023 00:59
ds_cont = cont_child_obj(llink);
if (ds_cont->sc_stopping) {
cont_child_put(cache, ds_cont);
return -DER_SHUTDOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking "sc_stopping" in this internal lookup function could unintentional errors (for example, in cont_child_start() or cont_child_destroy_one()) , should we only check "sc_stopping" in the ds_cont_child_lookup()?

@@ -839,35 +840,45 @@ rebuild_container_scan_cb(daos_handle_t ih, vos_iter_entry_t *entry,
}

rc = vos_cont_open(iter_param->ip_hdl, entry->ie_couuid, &coh);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not from this patch, but I don't see why we need to call vos_cont_open() here, won't following ds_cont_child_lookup() open vos container? @wangdi1 ?

Required-githooks: true

Signed-off-by: Wang Shilong <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@wangshilong wangshilong requested a review from NiuYawei December 14, 2023 07:36
@NiuYawei NiuYawei requested a review from Nasf-Fan December 14, 2023 08:03
src/container/srv_target.c Outdated Show resolved Hide resolved
D_ERROR(DF_CONT": container is in stopping "DF_RC"\n",
DP_CONT(cont->sc_pool->spc_uuid, cont->sc_uuid), DP_RC(rc));
cont_child_put(tls->dt_cont_cache, cont);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks incorrect to me, why should container destroy return error when it's stopping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this should be removed.

@@ -1178,7 +1198,7 @@ cont_child_destroy_one(void *vin)
ABT_mutex_unlock(cont->sc_mutex);

/* Give chance to DTX reindex ULT for exit. */
if (unlikely(cont->sc_dtx_reindex))
while (unlikely(cont->sc_dtx_reindex))
ABT_thread_yield();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nasf-Fan , reindex is supposed to be very quick? Is this busy yield loop ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reindex may take long time. But when close the container, reindex ULT will exit soon when detects the "stopping" flag.

@wangshilong wangshilong requested a review from NiuYawei December 18, 2023 01:58
@@ -1178,7 +1198,7 @@ cont_child_destroy_one(void *vin)
ABT_mutex_unlock(cont->sc_mutex);

/* Give chance to DTX reindex ULT for exit. */
if (unlikely(cont->sc_dtx_reindex))
while (unlikely(cont->sc_dtx_reindex))
ABT_thread_yield();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reindex may take long time. But when close the container, reindex ULT will exit soon when detects the "stopping" flag.

@wangshilong wangshilong requested a review from a team December 21, 2023 02:29
@NiuYawei NiuYawei merged commit ea01b50 into master Dec 21, 2023
46 of 47 checks passed
@NiuYawei NiuYawei deleted the shilongw/DAOS-13812 branch December 21, 2023 02:32
wangshilong added a commit that referenced this pull request Jan 30, 2024
If container is going to be dead, set @sc_stopping firstly,
and later lookup will fail, then wait existed container services exit.

Signed-off-by: Wang Shilong <[email protected]>
jolivier23 pushed a commit that referenced this pull request Jul 2, 2024
DAOS-16039 object: fix EC aggregation wrong peer address (#14593)
DAOS-16009 rebuild: fix O_TRUNC file size related handling
DAOS-15056 rebuild: add rpt to the rgt list properly (#13862)
DAOS-15517 rebuild: refine lock handling for rpt list (#14064)
DAOS-13812 container: fix destroy vs lookup (#12757)
DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow (#14189)
DAOS-14845 rebuild: do not wait for EC agg for reclaim (#13610)

Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Mohamad Chaarawi <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Wang, Di <[email protected]>
Signed-off-by: Di Wang <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Fan Yong <[email protected]>
@jolivier23 jolivier23 mentioned this pull request Jul 2, 2024
18 tasks
jolivier23 added a commit that referenced this pull request Jul 3, 2024
DAOS-16039 object: fix EC aggregation wrong peer address (#14593)
DAOS-16009 rebuild: fix O_TRUNC file size related handling
DAOS-15056 rebuild: add rpt to the rgt list properly (#13862)
DAOS-15517 rebuild: refine lock handling for rpt list (#14064)
DAOS-13812 container: fix destroy vs lookup (#12757)
DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow (#14189)
DAOS-14845 rebuild: do not wait for EC agg for reclaim (#13610)

Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Mohamad Chaarawi <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Wang, Di <[email protected]>
Signed-off-by: Di Wang <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Fan Yong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants