DAOS-13812 container: fix destroy vs lookup #12757
LGTM. No errors found by checkpatch.
0dd3879 to 5179937
LGTM. No errors found by checkpatch.
5179937 to 5821a5e
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12757/3/execution/node/1303/log
If the container is about to be destroyed, set @sc_stopping first so that later lookups fail, then wait for the existing container services to exit.

Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>
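The ordering described in the commit message can be sketched roughly as follows. This is a simplified, single-threaded model and not the DAOS code: `struct sketch_cont`, the `sketch_*` functions, the `sc_refs` counter, and `SKETCH_ERR_SHUTDOWN` are all hypothetical names invented for illustration; only the `sc_stopping` flag comes from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ERR_SHUTDOWN (-1) /* hypothetical stand-in for -DER_SHUTDOWN */

/* Hypothetical, simplified stand-in for the per-target container child. */
struct sketch_cont {
	bool sc_stopping; /* set once destroy has begun */
	int  sc_refs;     /* references held by running container services */
};

/* Lookup: refuse to hand out new references once the container is stopping. */
static int
sketch_lookup(struct sketch_cont *cont, struct sketch_cont **out)
{
	if (cont->sc_stopping)
		return SKETCH_ERR_SHUTDOWN;
	cont->sc_refs++;
	*out = cont;
	return 0;
}

/* Drop a reference previously obtained from sketch_lookup(). */
static void
sketch_put(struct sketch_cont *cont)
{
	cont->sc_refs--;
}

/* Destroy, step 1: mark the container as stopping so later lookups fail. */
static void
sketch_destroy_begin(struct sketch_cont *cont)
{
	cont->sc_stopping = true;
}

/* Destroy, step 2: true once all existing users have dropped their references. */
static bool
sketch_destroy_drained(const struct sketch_cont *cont)
{
	return cont->sc_refs == 0;
}
```

In this model the destroyer would call `sketch_destroy_begin()` and then poll `sketch_destroy_drained()`. Because the flag is set before waiting, the reference count can only go down during the wait.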
5821a5e to 54159bd
LGTM. No errors found by checkpatch.
src/container/srv_target.c
Outdated
	ds_cont = cont_child_obj(llink);
	if (ds_cont->sc_stopping) {
		cont_child_put(cache, ds_cont);
		return -DER_SHUTDOWN;
Checking "sc_stopping" in this internal lookup function could cause unintentional errors (for example, in cont_child_start() or cont_child_destroy_one()). Should we check "sc_stopping" only in ds_cont_child_lookup()?
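One way to read this suggestion is sketched below: the internal lookup always returns the container, and only the public wrapper rejects a stopping one. This is an illustrative model, not the actual DAOS functions. All `sketch_*` names, the `sc_refs` counter, and `SKETCH_ERR_SHUTDOWN` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ERR_SHUTDOWN (-1) /* hypothetical stand-in for -DER_SHUTDOWN */

/* Hypothetical, simplified stand-in for the per-target container child. */
struct sketch_cont {
	bool sc_stopping;
	int  sc_refs;
};

/*
 * Internal lookup: always takes a reference, even while stopping, so that
 * internal callers (start and destroy paths) can still reach the container.
 */
static struct sketch_cont *
sketch_lookup_internal(struct sketch_cont *cont)
{
	cont->sc_refs++;
	return cont;
}

static void
sketch_put(struct sketch_cont *cont)
{
	cont->sc_refs--;
}

/* Public lookup: the only place that rejects a stopping container. */
static int
sketch_lookup_public(struct sketch_cont *cont, struct sketch_cont **out)
{
	struct sketch_cont *found = sketch_lookup_internal(cont);

	if (found->sc_stopping) {
		sketch_put(found); /* drop the reference taken above */
		return SKETCH_ERR_SHUTDOWN;
	}
	*out = found;
	return 0;
}
```

With this split, external callers see the shutdown error while the destroy path can still obtain the container it needs to tear down.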
@@ -839,35 +840,45 @@ rebuild_container_scan_cb(daos_handle_t ih, vos_iter_entry_t *entry,
	}

	rc = vos_cont_open(iter_param->ip_hdl, entry->ie_couuid, &coh);
It's not from this patch, but I don't see why we need to call vos_cont_open() here. Won't the following ds_cont_child_lookup() open the VOS container? @wangdi1?
Required-githooks: true Signed-off-by: Wang Shilong <[email protected]>
LGTM. No errors found by checkpatch.
src/container/srv_target.c
Outdated
		D_ERROR(DF_CONT": container is in stopping "DF_RC"\n",
			DP_CONT(cont->sc_pool->spc_uuid, cont->sc_uuid), DP_RC(rc));
		cont_child_put(tls->dt_cont_cache, cont);
	}
This looks incorrect to me. Why should container destroy return an error when the container is stopping?
Yup, this should be removed.
@@ -1178,7 +1198,7 @@ cont_child_destroy_one(void *vin)
 	ABT_mutex_unlock(cont->sc_mutex);

 	/* Give chance to DTX reindex ULT for exit. */
-	if (unlikely(cont->sc_dtx_reindex))
+	while (unlikely(cont->sc_dtx_reindex))
 		ABT_thread_yield();
@Nasf-Fan, is reindex supposed to be very quick? Is this busy yield loop OK?
Reindex may take a long time, but when the container is closed, the reindex ULT will exit soon after it detects the "stopping" flag.
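The interaction between the yield loop and the reindex ULT can be modeled without Argobots as in the sketch below. It assumes, purely for illustration, that the worker checks the flag only at batch boundaries and that each yield gives it one scheduling slice. `sketch_yield()` stands in for `ABT_thread_yield()`, and the `slice_in_batch` and `slices_per_batch` fields and all `sketch_*` functions are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-target container child. */
struct sketch_cont {
	bool sc_stopping;      /* set by the destroy/close path */
	bool sc_dtx_reindex;   /* true while the reindex worker is alive */
	int  slice_in_batch;   /* progress inside the current batch */
	int  slices_per_batch; /* slices needed to finish one batch */
};

/* One scheduling slice of the (modeled) reindex worker. */
static void
sketch_reindex_step(struct sketch_cont *cont)
{
	if (!cont->sc_dtx_reindex)
		return;
	/* Assumption: the flag is only checked at a batch boundary. */
	if (cont->slice_in_batch == 0 && cont->sc_stopping) {
		cont->sc_dtx_reindex = false; /* worker exits */
		return;
	}
	cont->slice_in_batch = (cont->slice_in_batch + 1) % cont->slices_per_batch;
}

/* Stand-in for ABT_thread_yield(): lets the worker run for one slice. */
static void
sketch_yield(struct sketch_cont *cont)
{
	sketch_reindex_step(cont);
}

/* Set the flag, then yield until the worker has exited; returns the yield count. */
static int
sketch_stop_and_wait(struct sketch_cont *cont)
{
	int yields = 0;

	cont->sc_stopping = true;
	while (cont->sc_dtx_reindex) {
		sketch_yield(cont);
		yields++;
	}
	return yields;
}
```

Under these assumptions a single yield, as in the old `if` form, may return while the worker is still mid-batch, whereas the `while` form keeps yielding until `sc_dtx_reindex` is cleared. The wait stays short because the worker stops at the next boundary instead of finishing the whole reindex.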
Required-githooks: true Signed-off-by: Wang Shilong <[email protected]>
If the container is about to be destroyed, set @sc_stopping first so that later lookups fail, then wait for the existing container services to exit.

Signed-off-by: Wang Shilong <[email protected]>
DAOS-16039 object: fix EC aggregation wrong peer address (#14593)
DAOS-16009 rebuild: fix O_TRUNC file size related handling
DAOS-15056 rebuild: add rpt to the rgt list properly (#13862)
DAOS-15517 rebuild: refine lock handling for rpt list (#14064)
DAOS-13812 container: fix destroy vs lookup (#12757)
DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow (#14189)
DAOS-14845 rebuild: do not wait for EC agg for reclaim (#13610)

Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Mohamad Chaarawi <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Wang, Di <[email protected]>
Signed-off-by: Di Wang <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Fan Yong <[email protected]>