DAOS-16876 vos: set cont parameter when deregister modification from DTX #15657
Ticket title is 'LRZ: m02r01s07dao engine coredumps with vos EMRG src/vos/ilog.c:411 ilog_open() Assertion'
734b47a to 5170ae1
Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/
 * Then destroy DTX table firstly to avoid dangling DXT records during drain
 * the container (that may yield).
 */
rc = vos_dtx_table_destroy(&pool->vp_umm, cont);
I don't see why it's moved here. Won't it then be called when draining every object? The original place looks more appropriate to me.
Because gc_drain_cont() contains this assertion:
D_ASSERT(daos_handle_is_inval(coh));
return gc_drain_btr(gc, pool, coh, item, &cont->cd_obj_root, credits, empty);
This means the coh parameter passed to dtx_deregister is always zero when draining the container, which leaves the related DTX entries dangling. Since the container is being drained anyway, its DTX table is no longer needed, so we can destroy the DTX table first. Besides, even if coh were not zero, deregistering DTX records one by one is inefficient in the drain case; destroying the whole table is more efficient.
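The reasoning in this reply can be sketched with a small self-contained model. This is only an illustration under stated assumptions: every name here (toy_cont, toy_drain, toy_dtx_table_destroy, and so on) is hypothetical and does not correspond to the real VOS data structures. It shows that draining records without a usable handle leaves references behind in the DTX entries, and that destroying the table up front avoids this in a single pass.

```c
#include <assert.h>

/* Toy model only: these names are hypothetical, not DAOS APIs. */
#define TOY_NR_DTX 3
#define TOY_NR_REC 6

struct toy_cont {
	int dtx_refs[TOY_NR_DTX]; /* records still referenced by each DTX entry */
	int rec_dtx[TOY_NR_REC];  /* owning DTX slot of each record, -1 once freed */
};

/* Spread the records evenly over the DTX entries. */
static void
toy_cont_init(struct toy_cont *cont)
{
	for (int i = 0; i < TOY_NR_DTX; i++)
		cont->dtx_refs[i] = 0;
	for (int i = 0; i < TOY_NR_REC; i++) {
		cont->rec_dtx[i] = i % TOY_NR_DTX;
		cont->dtx_refs[i % TOY_NR_DTX]++;
	}
}

/* Drop every DTX entry in one pass; the drain path no longer needs them. */
static void
toy_dtx_table_destroy(struct toy_cont *cont)
{
	for (int i = 0; i < TOY_NR_DTX; i++)
		cont->dtx_refs[i] = 0;
}

/*
 * Free all records. Without a valid handle the owning DTX entry cannot be
 * located, so its reference count is left untouched.
 */
static void
toy_drain(struct toy_cont *cont, int have_handle)
{
	for (int i = 0; i < TOY_NR_REC; i++) {
		int slot = cont->rec_dtx[i];

		assert(slot < TOY_NR_DTX);
		if (have_handle && slot >= 0 && cont->dtx_refs[slot] > 0)
			cont->dtx_refs[slot]--;
		cont->rec_dtx[i] = -1;
	}
}

/* Sum the references still held by DTX entries. */
static int
toy_dangling_refs(const struct toy_cont *cont)
{
	int total = 0;

	for (int i = 0; i < TOY_NR_DTX; i++)
		total += cont->dtx_refs[i];
	return total;
}

/* Drain with an invalid handle, optionally destroying the DTX table first. */
int
toy_drain_demo(int destroy_table_first)
{
	struct toy_cont cont;

	toy_cont_init(&cont);
	if (destroy_table_first)
		toy_dtx_table_destroy(&cont);
	toy_drain(&cont, 0);
	return toy_dangling_refs(&cont);
}
```

In this model, toy_drain_demo(0) ends with six references pointing at freed records, while toy_drain_demo(1) ends with none.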
src/vos/vos_obj_index.c
Outdated
@@ -945,7 +945,7 @@ oi_iter_aggregate(daos_handle_t ih, bool range_discard)
 if (rc != 0)
 	D_ERROR("Could not evict object "DF_UOID" "DF_RC"\n",
 		DP_UOID(oid), DP_RC(rc));
-rc = dbtree_iter_delete(oiter->oit_hdl, NULL);
+rc = dbtree_iter_delete(oiter->oit_hdl, oiter->oit_cont);
This is one of the reasons I suggested reverting the "special OI deletion for layout upgrading" change (see oi_delete_arg). It was added for the EC rotation layout upgrade and is no longer useful for 2.8. That change has lots of potential issues and will likely cause more in the future (your current PR is an example). Please revert it before your fix.
This patch will be back-ported to release/2.6, and I am not sure whether a revert would affect that. If Di's patch is really no longer useful, can we revert it directly here, or should that go in a separate, independent patch?
 return 0;

-found = lrua_lookupx(cont->vc_dtx_array, entry - DTX_LID_RESERVED,
+found = lrua_lookupx(vos_hdl2cont(coh)->vc_dtx_array, entry - DTX_LID_RESERVED,
 		     epoch, &dae);
 if (!found) {
Return error?
That may be risky. In some cases, for example when the DTX entry has been removed during error cleanup but the related DTX sponsor still caches some stale information, deregistering the DTX record may not find the entry. On the other hand, a non-existent DTX record at deregistration time is not fatal. With the current patch, we destroy the DTX table first when draining the container, which can also lead to the non-existent case.
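The "not found is not fatal" behavior argued for above can be sketched as follows. This is a minimal model under stated assumptions: toy_dae, toy_lookup and toy_deregister are hypothetical names, and the index-plus-epoch lookup only loosely mirrors the shape of the lrua_lookupx() call in the diff, not its real implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these names are hypothetical, not DAOS APIs. */
#define TOY_SLOTS 4

struct toy_dae {
	bool     used;       /* slot holds an active entry */
	uint64_t epoch;      /* epoch the entry was created with */
	int      nr_records; /* modifications registered on the entry */
};

/* Find the entry at idx, but only if it is active and its epoch matches. */
static bool
toy_lookup(struct toy_dae *tbl, uint32_t idx, uint64_t epoch, struct toy_dae **out)
{
	if (idx >= TOY_SLOTS || !tbl[idx].used || tbl[idx].epoch != epoch)
		return false;
	*out = &tbl[idx];
	return true;
}

/* Deregister one modification; a missing entry is treated as a no-op. */
static int
toy_deregister(struct toy_dae *tbl, uint32_t idx, uint64_t epoch)
{
	struct toy_dae *dae = NULL;

	if (!toy_lookup(tbl, idx, epoch, &dae))
		return 0; /* entry already gone: nothing to undo, not an error */

	assert(dae->nr_records > 0);
	dae->nr_records--;
	return 0;
}

/* Returns the remaining record count of slot 1, or -1 on any error. */
int
toy_deregister_demo(void)
{
	struct toy_dae tbl[TOY_SLOTS] = { { false, 0, 0 } };
	int            rc;

	tbl[1].used = true;
	tbl[1].epoch = 100;
	tbl[1].nr_records = 2;

	rc = toy_deregister(tbl, 1, 100);  /* found: 2 -> 1 */
	rc |= toy_deregister(tbl, 1, 200); /* epoch mismatch: ignored */
	rc |= toy_deregister(tbl, 3, 100); /* unused slot: ignored */

	return rc == 0 ? tbl[1].nr_records : -1;
}
```

Only the first call changes state; the other two hit the not-found path and return success, which is the tolerance the reply argues for.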
58e73b4 to 4261423
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/
Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/
e7d08d6 to 28e6ca6
src/vos/vos_gc.c
Outdated
@@ -791,9 +796,6 @@ gc_get_container(struct vos_pool *pool)
  */
 cont = d_list_pop_entry(&pool->vp_gc_cont, struct vos_container,
 			vc_gc_link);
-if (DAOS_FAIL_CHECK(DAOS_VOS_GC_CONT_NULL))
-	D_ASSERT(cont == NULL);
-
Why is this test code removed?
Why do we need it?
As long as the container is not destroyed, whenever a modification is deregistered from its active DTX entry (usually triggered by VOS discard or aggregation), the caller must pass the container handle to vos_dtx_deregister_record() so that the DTX entry can be located in the active DTX table. Otherwise, if the caller passes an empty container handle, a dangling reference is left in the related DTX entry, which can lead to data corruption during a subsequent DTX commit or abort.
On the other hand, if the container is being destroyed, all of its DTX entries are no longer needed. We need to destroy the DTX table first to avoid generating dangling DTX references while destroying the container.
Signed-off-by: Fan Yong <[email protected]>
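The first half of this commit message, passing the container down so the record-free path can find the DTX entry, can be sketched with a toy model. All names below (toy_rec, toy_iter_delete, toy_rec_free) are hypothetical; the sketch only assumes that the delete call forwards an opaque argument to the free callback, as the dbtree_iter_delete(hdl, args) calls in the diff suggest. Freed records are flagged rather than released so the demo stays well defined.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not DAOS APIs. */
#define TOY_NR_REC 2

struct toy_rec {
	int freed; /* set instead of calling free(), to keep the demo safe */
};

struct toy_cont {
	struct toy_rec *dtx_recs[TOY_NR_REC]; /* records tracked by one DTX entry */
};

/* Record-free callback: args is the container, or NULL if the caller has none. */
static void
toy_rec_free(struct toy_rec *rec, void *args)
{
	struct toy_cont *cont = args;

	if (cont != NULL) {
		/* Deregister: clear the DTX entry's pointer to this record. */
		for (int i = 0; i < TOY_NR_REC; i++) {
			if (cont->dtx_recs[i] == rec)
				cont->dtx_recs[i] = NULL;
		}
	}
	rec->freed = 1;
}

/* Delete a record, forwarding args to the free callback. */
static void
toy_iter_delete(struct toy_rec *rec, void *args)
{
	toy_rec_free(rec, args);
}

/* Count pointers a later commit or abort would follow into freed records. */
static int
toy_dangling(const struct toy_cont *cont)
{
	int n = 0;

	for (int i = 0; i < TOY_NR_REC; i++) {
		if (cont->dtx_recs[i] != NULL && cont->dtx_recs[i]->freed)
			n++;
	}
	return n;
}

/* Delete the first record with or without the container argument. */
int
toy_delete_demo(int pass_cont)
{
	struct toy_rec  recs[TOY_NR_REC] = { { 0 }, { 0 } };
	struct toy_cont cont = { { &recs[0], &recs[1] } };

	toy_iter_delete(&recs[0], pass_cont ? &cont : NULL);
	assert(recs[0].freed);
	return toy_dangling(&cont);
}
```

With a NULL argument the DTX entry keeps a pointer to the freed record; with the container passed in, the pointer is cleared before the record goes away.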
28e6ca6 to da3b90a
Please review this patch with higher priority; it is for a P1 ticket. Thanks!
@@ -296,6 +296,15 @@ gc_drain_cont(struct vos_gc *gc, struct vos_pool *pool, daos_handle_t coh,
 int i;
 int rc;

+/*
+ * When we prepaer to drain the container, we do not need DTX entry any long.
typo s/prepaer/prepare
@@ -2599,7 +2599,7 @@ vos_obj_iter_aggregate(daos_handle_t ih, bool range_discard)
  * be aborted. Then it will be added and handled via GC when ktr_rec_free().
  */

-rc = dbtree_iter_delete(oiter->it_hdl, NULL);
+rc = dbtree_iter_delete(oiter->it_hdl, obj->obj_cont);
this seems like the main bug here
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: