Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16876 vos: set cont parameter when deregister modification from DTX #15657

Merged
merged 1 commit into from
Dec 27, 2024

Conversation

Nasf-Fan
Copy link
Contributor

@Nasf-Fan Nasf-Fan commented Dec 21, 2024

As long as the container is not destroyed, then anytime want to deregister a modification from related active DTX entry (that is usually triggered for vos discard or aggregation), the caller needs to offer container handle to vos_dtx_deregister_record() for locating the DTX entry in active DTX table. Otherwise, if the caller offers empty container handle, then it will cause dangling reference in related DTX entry as to data corruption in subsequent DTX commit or abort.

On the other hand, if the container will be destroyed, then all related DTX entries for such container will be useless any more. We need to destroy DTX table firstly to avoid generating dangling DTX references during destroying the container.

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

Ticket title is 'LRZ: m02r01s07dao engine coredumps with vos EMRG src/vos/ilog.c:411 ilog_open() Assertion'
Status is 'In Progress'
Labels: 'LRZ'
https://daosio.atlassian.net/browse/DAOS-16876

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16876 branch from 734b47a to 5170ae1 Compare December 21, 2024 13:54
@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/2/testReport/

* Then destroy DTX table firstly to avoid dangling DXT records during drain
* the container (that may yield).
*/
rc = vos_dtx_table_destroy(&pool->vp_umm, cont);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why it's moved here, then it'll called on draining every object? The original place looks more proper to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in gc_drain_cont(), there is assertion:
D_ASSERT(daos_handle_is_inval(coh)); return gc_drain_btr(gc, pool, coh, item, &cont->cd_obj_root, credits, empty);
Means that the coh parameter for dtx_deregister will always be zero for draining the container. That will cause related DTX entry to be left as dangling. To avoid such case, since it will be drained, the DTX table is useless any longer, we can destroyed the DTX table firstly. On the other hand, even if the coh is not zero, deregistering DTX records one by one is also inefficient for drain case, destroying the table will be more efficient.

@@ -945,7 +945,7 @@ oi_iter_aggregate(daos_handle_t ih, bool range_discard)
if (rc != 0)
D_ERROR("Could not evict object "DF_UOID" "DF_RC"\n",
DP_UOID(oid), DP_RC(rc));
rc = dbtree_iter_delete(oiter->oit_hdl, NULL);
rc = dbtree_iter_delete(oiter->oit_hdl, oiter->oit_cont);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the reasons I suggested reverting the "special OI deletion for layout upgrading" (see the oi_delete_arg), this was for EC rotation layout upgrade, it's not useful for 2.8 anymore. There are lots of potential issues in this PR, and it'll likely cause more issues in the future. (your current PR is an example), please revert that change before your fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch will be back-ported to release/2.6, not sure whether it will affect. I am thinking that if Di's patch is really useless any longer, we can directly revert it? Or via another independent patch?

return 0;

found = lrua_lookupx(cont->vc_dtx_array, entry - DTX_LID_RESERVED,
found = lrua_lookupx(vos_hdl2cont(coh)->vc_dtx_array, entry - DTX_LID_RESERVED,
epoch, &dae);
if (!found) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe risky, for some case, such as the DTX entry has been removed for error cleanup but related DTX sponsor may still cache some stable information. Then deregister the DTX record may not find the entry. On the other hand, non-exist DTX record when deregister is not fatal. Consider current patch, we destroy the DTX table firstly during drain the container, it may also cause non-exist case.

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16876 branch 2 times, most recently from 58e73b4 to 4261423 Compare December 25, 2024 02:20
@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15657/4/testReport/

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16876 branch 2 times, most recently from e7d08d6 to 28e6ca6 Compare December 25, 2024 06:54
src/vos/vos_gc.c Outdated
@@ -791,9 +796,6 @@ gc_get_container(struct vos_pool *pool)
*/
cont = d_list_pop_entry(&pool->vp_gc_cont, struct vos_container,
vc_gc_link);
if (DAOS_FAIL_CHECK(DAOS_VOS_GC_CONT_NULL))
D_ASSERT(cont == NULL);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this test code is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need it?

@Nasf-Fan Nasf-Fan marked this pull request as ready for review December 26, 2024 02:14
@Nasf-Fan Nasf-Fan requested review from a team as code owners December 26, 2024 02:14
As long as the container is not destroyed, then anytime want to deregister
a modification from related active DTX entry (that is usually triggered for
vos discard or aggregation), the caller needs to offer container handle to
vos_dtx_deregister_record() for locating the DTX entry in active DTX table.
Otherwise, if the caller offers empty container handle, then it will cause
dangling reference in related DTX entry as to data corruption in subsequent
DTX commit or abort.

On the other hand, if the container will be destroyed, then all related DTX
entries for such container will be useless any more. We need to destroy DTX
table firstly to avoid generating dangling DTX references during destroying
the container.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-16876 branch from 28e6ca6 to da3b90a Compare December 27, 2024 02:38
@Nasf-Fan
Copy link
Contributor Author

Please help to review the patch with higher priority, that is for a P1 ticket, thanks!

@@ -296,6 +296,15 @@ gc_drain_cont(struct vos_gc *gc, struct vos_pool *pool, daos_handle_t coh,
int i;
int rc;

/*
* When we prepaer to drain the container, we do not need DTX entry any long.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo s/prepaer/prepare

@@ -2599,7 +2599,7 @@ vos_obj_iter_aggregate(daos_handle_t ih, bool range_discard)
* be aborted. Then it will be added and handled via GC when ktr_rec_free().
*/

rc = dbtree_iter_delete(oiter->it_hdl, NULL);
rc = dbtree_iter_delete(oiter->it_hdl, obj->obj_cont);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like the main bug here

@jolivier23 jolivier23 merged commit 09903e8 into master Dec 27, 2024
56 of 57 checks passed
@jolivier23 jolivier23 deleted the Nasf-Fan/DAOS-16876 branch December 27, 2024 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants