DAOS-14908 vos: Reduce aggregation conflicts #14143

Merged
merged 24 commits into from
May 20, 2024

jolivier23
Contributor

Rather than blocking vos_obj_discard entirely when discard or aggregation are running, let's block it only when there is an actual conflict on the object being discarded.

Features: rebuild

Required-githooks: true

Change-Id: I110dd2e37e299df25c002bba63776559d689b1cf
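The description above contrasts a coarse, container-wide block with a per-object conflict check. The following is a minimal sketch of that difference in Python. It is a toy model only: `Container`, `busy_oids`, `agg_or_discard_running`, and both `can_discard_*` helpers are hypothetical names for illustration and do not reflect the actual VOS data structures or the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy stand-in for a VOS container (hypothetical, not the real struct)."""
    # Object IDs currently being processed by aggregation or another discard.
    busy_oids: set = field(default_factory=set)
    # Coarse model: one flag that is set while anything is running.
    agg_or_discard_running: bool = False


def can_discard_global(cont: Container, oid: str) -> bool:
    """Coarse check: refuse whenever aggregation or discard is running at all."""
    return not cont.agg_or_discard_running


def can_discard_per_object(cont: Container, oid: str) -> bool:
    """Fine-grained check: refuse only if this specific object is in use."""
    return oid not in cont.busy_oids


if __name__ == "__main__":
    cont = Container(busy_oids={"oid-1"}, agg_or_discard_running=True)
    for oid in ("oid-1", "oid-2"):
        print(oid, can_discard_global(cont, oid), can_discard_per_object(cont, oid))
```

In this model, a discard of `oid-2` is rejected by the coarse check merely because aggregation is active on `oid-1`, while the per-object check lets it proceed.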

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

github-actions bot commented Apr 9, 2024

Ticket title is 'soak - online harassers with 2.4.1-2 - rebuild took over 30 minutes to complete'
Status is 'In Review'
Labels: '2.4.2rc1,2.6.0,2.6.0tb1,scrubbed,soak,triaged'
https://daosio.atlassian.net/browse/DAOS-14908

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/1/execution/node/1198/log

Rather than blocking vos_obj_discard entirely when
discard or aggregation are running, let's block it
only when there is an actual conflict on the object
being discarded.

Features: rebuild
Allow-unstable-test: true

Required-githooks: true

Change-Id: I110dd2e37e299df25c002bba63776559d689b1cf
Signed-off-by: Jeff Olivier <[email protected]>
@jolivier23 jolivier23 force-pushed the jvolivie/agg_discard branch from 31ba3d9 to 50b4d41 Compare April 10, 2024 14:33
@jolivier23 jolivier23 marked this pull request as ready for review April 10, 2024 15:33
@jolivier23 jolivier23 requested review from a team as code owners April 10, 2024 15:33
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/2/execution/node/1198/log

Contributor

@NiuYawei NiuYawei left a comment

It looks to me that "obj discard" is only used in two cases today:

  1. Discard old data on the target to be reintegrated before full reintegration. I think it's not quite correct, and Di has a pending PR that tries to use vos container destroy instead of "obj discard".
  2. Once the target is reintegrated, delete unused data on the source target to reclaim space. I don't quite see why 'obj discard' is used in this case; can't we just delete the whole obj tree to reclaim space?

I think we should review the rebuild/reintegration design first to see if "obj discard" is still relevant. What do you think?

Change-Id: I6b80380fc5cbac91390368fba60c1cb09d93f1de
Features: rebuild
Allow-unstable-test: true

Required-githooks: true

Change-Id: I13b0ee0b776ccf7c124492e5788f00c84fd9b986
Signed-off-by: Jeff Olivier <[email protected]>
@jolivier23
Contributor Author

> It looks to me that "obj discard" is only used in two cases today:
>
>   1. Discard old data on the target to be reintegrated before full reintegration. I think it's not quite correct, and Di has a pending PR that tries to use vos container destroy instead of "obj discard".
>   2. Once the target is reintegrated, delete unused data on the source target to reclaim space. I don't quite see why 'obj discard' is used in this case; can't we just delete the whole obj tree to reclaim space?
>
> I think we should review the rebuild/reintegration design first to see if "obj discard" is still relevant. What do you think?

Discussed in our last call but just to summarize the discussion.

For reintegration, container destroy/recreate precludes the need for vos_obj_discard.

For reclaim, I think it's still needed but can potentially be simplified to do as you say and delete the whole object tree. We can forego visibility checks. We still need to avoid conflict with aggregation...

Random thought, would it make any sense to get rid of vos_obj_discard and instead add an upcall in aggregation (should_remove(oid)) ?
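To make the upcall idea concrete, here is a minimal sketch in Python of an aggregation pass that asks its caller whether an object should be dropped. Only the name `should_remove(oid)` comes from the comment above; the `aggregate` function, the dictionary layout, and the "keep only the latest epoch" stand-in for merging are assumptions made for illustration, not the VOS aggregation API.

```python
from typing import Callable, Dict, List


def aggregate(objects: Dict[str, List[int]],
              should_remove: Callable[[str], bool]) -> Dict[str, List[int]]:
    """Toy aggregation pass over a mapping of oid -> record epochs.

    For each object, the caller-supplied upcall decides whether the object
    is dropped outright; otherwise its records are "merged" (modeled here
    by keeping only the highest epoch).
    """
    result: Dict[str, List[int]] = {}
    for oid, epochs in objects.items():
        if should_remove(oid):
            # The upper layer no longer wants this object: skip it entirely,
            # so no separate discard pass can conflict with aggregation.
            continue
        result[oid] = [max(epochs)] if epochs else []
    return result


if __name__ == "__main__":
    store = {"a": [1, 2, 3], "b": [4, 5]}
    # Hypothetical policy: the rebuild layer has decided that "b" moved away.
    print(aggregate(store, lambda oid: oid == "b"))
```

Because removal happens inside the same pass that merges records, there is only one walker per object in this model, which is the property the comment is after.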

@jolivier23 jolivier23 requested a review from NiuYawei April 11, 2024 14:33
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1174/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1415/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1508/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1601/log

@NiuYawei
Contributor

> > It looks to me that "obj discard" is only used in two cases today:
> >
> >   1. Discard old data on the target to be reintegrated before full reintegration. I think it's not quite correct, and Di has a pending PR that tries to use vos container destroy instead of "obj discard".
> >   2. Once the target is reintegrated, delete unused data on the source target to reclaim space. I don't quite see why 'obj discard' is used in this case; can't we just delete the whole obj tree to reclaim space?
> >
> > I think we should review the rebuild/reintegration design first to see if "obj discard" is still relevant. What do you think?
>
> Discussed in our last call but just to summarize the discussion.
>
> For reintegration, container destroy/recreate precludes the need for vos_obj_discard.
>
> For reclaim, I think it's still needed but can potentially be simplified to do as you say and delete the whole object tree. We can forego visibility checks. We still need to avoid conflict with aggregation...
>
> Random thought, would it make any sense to get rid of vos_obj_discard and instead add an upcall in aggregation (should_remove(oid)) ?

I asked Di about the "obj discard" use cases. So far it looks like it is used in three cases:

  1. Discard old data on the target being reintegrated before reintegration. This could be removed, since Di has pushed a PR that tries to discard old data via VOS container destroy.
  2. Reclaim space by discarding the old data on source targets after reintegration. This obj discard only needs to discard epoch range [A, MAX), so I think it could be replaced by a more efficient new API that deletes the whole oid tree.
  3. When a rebuild fails and is retried, discard the data already rebuilt in the first round. This obj discard needs to discard epoch range [A, B); its purpose is to avoid overwrites during the second rebuild. Given that VOS now supports overwrite, we might be able to skip obj discard in this case.

The conclusion is that obj discard is still required in the current rebuild/reintegration, but we may consider dropping it in the future.
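The two epoch ranges mentioned in cases 2 and 3 differ only in their upper bound. A minimal Python sketch of a half-open range discard follows. The names `discard_range` and `EPOCH_MAX`, the 64-bit maximum value, and the flat list of epochs are all assumptions for illustration; the real `vos_obj_discard` walks on-disk trees and is not modeled here.

```python
from typing import List

# Assumed stand-in for the maximum epoch; the real constant is not taken
# from the source.
EPOCH_MAX = 2**64 - 1


def discard_range(epochs: List[int], lo: int, hi: int = EPOCH_MAX) -> List[int]:
    """Return the record epochs that survive discarding the range [lo, hi).

    The range is half-open: a record at epoch `lo` is discarded, while a
    record at epoch `hi` is kept.
    """
    return [e for e in epochs if not (lo <= e < hi)]


if __name__ == "__main__":
    records = [5, 10, 15, 20]
    # Case 2 (reclaim): discard [A, MAX), everything from A onward.
    print(discard_range(records, 10))
    # Case 3 (rebuild retry): discard [A, B), only the failed round's writes.
    print(discard_range(records, 10, 20))
```

In case 2 nothing at or above A survives, which is why deleting the whole object tree is suggested as a cheaper equivalent when A precedes all of the object's data.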

Contributor

@NiuYawei NiuYawei left a comment

I'm wondering whether we should go further in this direction.

It looks like this fix won't solve the long rebuild reclaim time when the system has only a few giant objects. If "obj discard" is only used for reclaiming space by deleting objects after reintegration, why don't we just mark the object as "obsoleted" (which ignores snapshots) in the ilog and rely on aggregation to queue it for GC?
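The suggestion above replaces an eager walk over every record with a cheap mark that a later background pass acts on. A minimal Python sketch of that pattern follows. `Obj`, `mark_obsolete`, and `aggregation_pass` are hypothetical names, the boolean flag merely stands in for an ilog entry, and snapshot handling is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Obj:
    """Toy object: a list of records plus an 'obsolete' mark (hypothetical)."""
    records: List[int] = field(default_factory=list)
    obsolete: bool = False


def mark_obsolete(store: Dict[str, Obj], oid: str) -> None:
    """Constant-time with respect to object size: only a flag is set."""
    store[oid].obsolete = True


def aggregation_pass(store: Dict[str, Obj],
                     gc_queue: List[Tuple[str, Obj]]) -> None:
    """Hand every obsolete object to the GC queue and unlink it from the store."""
    for oid in [o for o, obj in store.items() if obj.obsolete]:
        gc_queue.append((oid, store.pop(oid)))


if __name__ == "__main__":
    store = {"giant": Obj(list(range(1_000_000))), "small": Obj([1])}
    gc_queue: List[Tuple[str, Obj]] = []
    mark_obsolete(store, "giant")      # cheap, regardless of object size
    aggregation_pass(store, gc_queue)  # deferred cleanup
    print(sorted(store), [oid for oid, _ in gc_queue])
```

The point of the design is that the caller's cost no longer scales with the size of the object; the expensive reclamation is deferred to a pass that already exists.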

Required-githooks: true

Change-Id: I5006c60caf5d1e6a1bc1851d3b14981c8cd6e72d
@jolivier23
Contributor Author

> I'm wondering whether we should go further in this direction.
>
> It looks like this fix won't solve the long rebuild reclaim time when the system has only a few giant objects. If "obj discard" is only used for reclaiming space by deleting objects after reintegration, why don't we just mark the object as "obsoleted" (which ignores snapshots) in the ilog and rely on aggregation to queue it for GC?

Not a bad idea. Let me look at what that would entail.

Required-githooks: true

Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true

Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true

Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true

Signed-off-by: Jeff Olivier <[email protected]>
@jolivier23
Contributor Author

Note to @daos-stack/daos-gatekeeper: this one is on top of #14382, so that should land first.

Base automatically changed from jvolivie/timeout to master May 17, 2024 18:16
@jolivier23 jolivier23 dismissed stale reviews from NiuYawei and liuxuezhao May 17, 2024 18:16

The base branch was changed.

@jolivier23 jolivier23 requested review from a team as code owners May 17, 2024 18:16
Contributor

@daltonbohning daltonbohning left a comment

ftest LGTM

@jolivier23
Contributor Author

Sorry for the re-review. I tried an experiment and it failed: I fixed another issue in another PR and based this one on it. That preserved the reviews, but once that PR landed, they were removed. This should be identical to before.

@jolivier23 jolivier23 merged commit c0d9109 into master May 20, 2024
52 checks passed
@jolivier23 jolivier23 deleted the jvolivie/agg_discard branch May 20, 2024 06:13
jolivier23 pushed a commit that referenced this pull request Jun 5, 2024
…ild epoch

Rebuild code change:
1. __migrate_fetch_update_parity(), fix a bug when set partial replica
   rebuild epoch for parity shard rebuild.
2. __migrate_fetch_update_bulk() should carry DIOF_FOR_MIGRATION flag,
3. migrate_fetch_update_parity() parameter fix when calling
   __migrate_fetch_update_parity().

EC aggregation change:
1. ds_obj_ec_rep_handler() and ds_obj_ec_agg_handler(), the vos_update_begin()
   should carry VOS_OF_REBUILD to avoid -DER_VOS_PARTIAL_UPDATE failure.
2. give more chance to abort EC agg when rebuild started, to save
   conflict window.

includes backports of
DAOS-15007 object: fix EC aggregation's ap_min_unagg_eph set (#13875)
DAOS-15262 vos: Fix probe issue in vos iterator (#13918)
DAOS-14908 vos: Reduce aggregation conflicts (#14143)

Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Niu Yawei <[email protected]>
jolivier23 added a commit that referenced this pull request Jun 6, 2024
…ild epoch (#14519)