DAOS-14908 vos: Reduce aggregation conflicts #14143
Ticket title is 'soak - online harassers with 2.4.1-2 - rebuild took over 30 minutes to complete'
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/1/execution/node/1198/log
Rather than blocking vos_obj_discard entirely when discard or aggregation are running, let's block it only when there is an actual conflict on the object being discarded.

Features: rebuild
Allow-unstable-test: true
Required-githooks: true
Change-Id: I110dd2e37e299df25c002bba63776559d689b1cf
Signed-off-by: Jeff Olivier <[email protected]>
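To illustrate the change described above, here is a minimal sketch of the difference between the coarse check and a per-object check. It is not the code from this PR: `struct oid`, `struct cont_state`, and both `discard_blocked_*` helpers are hypothetical names, and the sketch assumes each running operation tracks a single object at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an object ID; not the real DAOS type. */
struct oid {
    uint64_t hi;
    uint64_t lo;
};

/* Hypothetical per-container state (assumption: one object per operation). */
struct cont_state {
    bool       agg_active;      /* aggregation is running */
    struct oid agg_oid;         /* object aggregation is working on */
    bool       discard_active;  /* another discard is running */
    struct oid discard_oid;     /* object that discard is working on */
};

static bool oid_equal(struct oid a, struct oid b)
{
    return a.hi == b.hi && a.lo == b.lo;
}

/* Coarse check: any running aggregation or discard blocks the caller. */
static bool discard_blocked_coarse(const struct cont_state *cs)
{
    assert(cs != NULL);
    return cs->agg_active || cs->discard_active;
}

/* Per-object check: block only if a running operation is on the same object. */
static bool discard_blocked_fine(const struct cont_state *cs, struct oid target)
{
    assert(cs != NULL);
    if (cs->agg_active && oid_equal(cs->agg_oid, target))
        return true;
    if (cs->discard_active && oid_equal(cs->discard_oid, target))
        return true;
    return false;
}
```

With aggregation running on one object, the coarse check rejects every discard, while the per-object check rejects only a discard of that same object.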
31ba3d9 to 50b4d41
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/2/execution/node/1198/log
It looks to me like "obj discard" is only used in two cases today:
- Discarding old data on the target to be reintegrated, before full reintegration. I don't think this is quite correct, and Di has a pending PR that tries to use vos container destroy instead of "obj discard".
- Once the target is reintegrated, deleting unused data on the source target to reclaim space. I don't quite see why "obj discard" is used in this case; can't we just delete the whole obj tree to reclaim space?
I think we should review the rebuild/reintegration design first to see if "obj discard" is still relevant. What do you think?
Change-Id: I6b80380fc5cbac91390368fba60c1cb09d93f1de
Features: rebuild
Allow-unstable-test: true
Required-githooks: true
Change-Id: I13b0ee0b776ccf7c124492e5788f00c84fd9b986
Signed-off-by: Jeff Olivier <[email protected]>
We discussed this in our last call, but just to summarize the discussion: for reintegration, container destroy/recreate precludes the need for vos_obj_discard. For reclaim, I think it's still needed but can potentially be simplified to do as you say and delete the whole object tree. We can forgo visibility checks. We still need to avoid conflict with aggregation... Random thought: would it make any sense to get rid of vos_obj_discard and instead add an upcall in aggregation (should_remove(oid))?
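A rough sketch of what such a should_remove(oid) upcall might look like, purely to make the idea concrete. Everything here is hypothetical: the callback type, `aggregate_objects()`, `struct placement`, and the modulo ownership rule are illustrative assumptions, not existing VOS or aggregation APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upcall: returns true if aggregation should drop the object. */
typedef bool (*should_remove_cb)(uint64_t oid, void *arg);

struct agg_stats {
    size_t removed;     /* objects handed off for deletion */
    size_t aggregated;  /* objects aggregated as usual */
};

/*
 * Hypothetical aggregation loop. Because removal happens inside the
 * aggregation pass itself, there is no separate discard to conflict with.
 */
static struct agg_stats aggregate_objects(const uint64_t *oids, size_t n,
                                          should_remove_cb should_remove,
                                          void *arg)
{
    struct agg_stats st = { 0, 0 };

    assert(oids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (should_remove != NULL && should_remove(oids[i], arg))
            st.removed++;     /* real code would delete the object tree */
        else
            st.aggregated++;  /* real code would aggregate the object */
    }
    return st;
}

/* Illustrative ownership rule (assumption): oid maps to oid % nr_targets. */
struct placement {
    uint64_t nr_targets;
    uint64_t my_target;
};

static bool not_owned(uint64_t oid, void *arg)
{
    const struct placement *p = arg;

    return oid % p->nr_targets != p->my_target;
}
```

The caller (rebuild, in this scenario) would supply the callback, so aggregation itself would not need to know anything about placement.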
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1174/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1415/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1508/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14143/3/execution/node/1601/log
I asked Di about the "obj discard" use cases; so far it looks like it's used in three cases:
The conclusion is that obj discard is still required in the current rebuild/reintegration, but we may consider dropping it in the future.
I'm wondering whether we should go further in this direction.
It looks like this fix won't solve the long rebuild reclaim time issue when the system has only a few giant objects. If "obj discard" is only used for reclaiming space by deleting objects after reintegration, why don't we just mark the object as "obsoleted" (which ignores snapshots) in the ilog and rely on aggregation to put it up for GC?
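As a sketch of that suggestion only: reclaim would do a constant-time mark, and a later aggregation pass would queue marked objects for GC regardless of snapshots. `struct obj_rec` and both functions are hypothetical; in real VOS the state would live in the ilog rather than in a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory object record (assumption, not a VOS type). */
struct obj_rec {
    uint64_t oid;
    bool     obsolete;      /* set by reclaim after reintegration */
    bool     in_gc;         /* already handed to garbage collection */
    bool     has_snapshot;  /* deliberately ignored for obsolete objects */
};

/* Reclaim side: cost does not depend on how large the object is. */
static void obj_mark_obsolete(struct obj_rec *obj)
{
    assert(obj != NULL);
    obj->obsolete = true;
}

/*
 * Aggregation side: queue every obsolete object for GC exactly once,
 * without consulting has_snapshot. Returns the number of objects queued.
 */
static size_t agg_collect_obsolete(struct obj_rec *objs, size_t n)
{
    size_t queued = 0;

    assert(objs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (objs[i].obsolete && !objs[i].in_gc) {
            objs[i].in_gc = true;
            queued++;
        }
    }
    return queued;
}
```

The expensive tree walk moves out of the rebuild reclaim path and into background aggregation and GC, which is the point of the suggestion for systems with a few giant objects.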
Required-githooks: true Change-Id: I5006c60caf5d1e6a1bc1851d3b14981c8cd6e72d
Not a bad idea. Let me look at what that would entail.
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Note to @daos-stack/daos-gatekeeper: this one is on top of #14382, so that should land first.
The base branch was changed.
ftest LGTM
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Sorry for the re-review. I tried an experiment and it failed. I fixed another issue in another PR and based this one on it. That preserved the reviews, but after landing that one, it removed them. This should be identical to before.
…ild epoch (#14519)

Rebuild code change:
1. __migrate_fetch_update_parity(): fix a bug when setting the partial replica rebuild epoch for parity shard rebuild.
2. __migrate_fetch_update_bulk() should carry the DIOF_FOR_MIGRATION flag.
3. migrate_fetch_update_parity(): parameter fix when calling __migrate_fetch_update_parity().

EC aggregation change:
1. ds_obj_ec_rep_handler() and ds_obj_ec_agg_handler(): the vos_update_begin() call should carry VOS_OF_REBUILD to avoid -DER_VOS_PARTIAL_UPDATE failure.
2. Give EC aggregation more chances to abort once rebuild has started, to narrow the conflict window.

Includes backports of:
DAOS-15007 object: fix EC aggregation's ap_min_unagg_eph set (#13875)
DAOS-15262 vos: Fix probe issue in vos iterator (#13918)
DAOS-14908 vos: Reduce aggregation conflicts (#14143)

Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Niu Yawei <[email protected]>