DAOS-10138 pool: Improve PS reconfigurations #10121
Conversation
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-10121/1/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/1/execution/node/138/log
Force-pushed from 96bec7b to 0cd6d58 (Compare)
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-10121/2/testReport/(root)/
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/3/execution/node/872/log
Test stage Functional Hardware Small completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/3/execution/node/1013/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/3/execution/node/1102/log
LGTM. No errors found by checkpatch.
LGTM. No errors found by checkpatch.
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/5/execution/node/320/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/5/execution/node/335/log
Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/5/execution/node/323/log
Force-pushed from 5ef83e5 to a1ea680 (Compare)
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/6/execution/node/871/log
Force-pushed from a1ea680 to 33ff5b0 (Compare)
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/7/execution/node/873/log
Test stage Functional Hardware Small completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/7/execution/node/1014/log
Force-pushed from 33ff5b0 to 521779e (Compare)
A PS currently performs reconfigurations (i.e., membership changes) only upon pool map changes. If the PS leader crashes in the middle of a series of reconfigurations, the new PS leader will not plan any reconfiguration (or notify the MS of the latest list of replicas) until the pool map changes for some other reason. This patch lets a PS leader check if reconfigurations are required when it steps up.

To avoid blocking the step-up and pool map change processes, this patch performs each series of reconfigurations asynchronously in a ULT. The change allows the reconfiguration process to wait for pending events, retry upon certain errors (in the future), and wait for RPC timeouts without directly impacting normal PS operations. Hence, this patch reverts the workaround (209ba92) that skips destroying PS replicas.

Moreover, this patch adds a safety net that prevents an older rsvc leader from removing an rsvc replica created by a newer rsvc leader. Although it cannot resolve all problems in the area, the natural, term-based approach requires no RDB layout change and is simple to implement. The patch has to change rdb_test a bit to allow more than one test rsvc instance, so that a quick rsvc test can be added.

Since select_svc_ranks avoids rank 0 but ds_pool_plan_svc_reconfs does not, this patch modifies the former to remove the avoidance, so that during a PoolCreate operation the MS observes a notification from the new PS with the same ranks as the PoolCreate dRPC response. We have to change a few tests as well as the MS to make this work:

- Work around a race between mgmtSvc.PoolCreate and Database.handlePoolRepsUpdate. See the comment for the details.
- Update svc.yaml to reflect the new PS replacement policy.
- Fix the daos_obj.c assertion that PS leaders can be rank 0.

Signed-off-by: Li Wei <[email protected]>
Required-githooks: true
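As an illustration of the term-based safety net described above, here is a minimal, self-contained sketch; the struct, the replica_handle_stop() helper, and the DER_STALE value are hypothetical stand-ins, not the actual rsvc/rdb code:

/* Sketch only: models the term-based safety net with stand-in types;
 * the real logic lives in the rsvc/rdb layers. */
#include <stdio.h>

#define DER_STALE 2001  /* placeholder for the real -DER_STALE error code */

struct replica {
	long term;  /* highest leadership term this replica has observed */
};

/* Hypothetical stop/destroy handler: a caller whose term is older than
 * what the replica has already seen is rejected, so an old PS leader
 * cannot remove a replica created by a newer leader. */
static int replica_handle_stop(struct replica *r, long caller_term)
{
	if (caller_term < r->term)
		return -DER_STALE;   /* reject: caller's term is stale */
	r->term = caller_term;       /* remember the newest term seen */
	/* ... proceed to stop/destroy the local replica ... */
	return 0;
}

int main(void)
{
	struct replica r = { .term = 7 };

	printf("caller term 6 -> rc=%d (rejected)\n", replica_handle_stop(&r, 6));
	printf("caller term 8 -> rc=%d (accepted)\n", replica_handle_stop(&r, 8));
	return 0;
}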
Signed-off-by: Li Wei <[email protected]> Required-githooks: true
Signed-off-by: Li Wei <[email protected]> Required-githooks: true
Test stage NLT on EL 8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-10121/14/display/redirect
LGTM. No errors found by checkpatch.
Rebased to resolve conflicts caused by the automatic base change (i.e., the previous base branch was merged to master). No other changes were made.
LGTM. No errors found by checkpatch.
//
// The pool remains in Creating state after PoolCreate completes,
// leading to DER_AGAINs during PoolDestroy.
if p.State == system.PoolServiceStateReady && ps.State == system.PoolServiceStateCreating {
Looks good to me, thanks. This is probably not even really a workaround now, but just the correct behavior. If we are in this situation, all of the information in the "update" is probably stale anyhow.
I'm not 100% satisfied with this change, for two reasons: 1) the update may contain a valid svc rank refresh, as in the example (where the svc ranks could, in theory, differ from those reported by ds_pool_svc_dist_create), and 2) I return nil in this case (because otherwise callers would need some specific error that they could recognize). The control plane deserves a better solution eventually. :)
Required-githooks: true
LGTM. No errors found by checkpatch.
Required-githooks: true
LGTM. No errors found by checkpatch.
Mostly questions; looks very good.
src/rsvc/srv.c
Outdated
rc = rdb_ping(svc->s_db, caller_term);
if (rc != 0) {
	if (rc != -DER_STALE)
		D_ERROR("%s: failed to ping local replica\n", svc->s_name);
(Very minor) Only for developer insight when debugging problems: is it worth a D_DEBUG in the event rc == -DER_STALE, to record that this replica was asked to stop by another "stale term" replica?
With ds_rsvc_start(), it looks like a D_ERROR will be emitted there in the -DER_STALE case.
OK, let me add a DEBUG.
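As a rough illustration of the logging split being agreed on here, a minimal, self-contained sketch; fprintf stands in for D_DEBUG/D_ERROR, and the DER_STALE value is a placeholder, not the real error code:

/* Sketch only: expected stale-term rejections go to debug output,
 * anything else stays an error. */
#include <stdio.h>

#define DER_STALE 2001  /* placeholder for the real error code */

static void log_ping_result(const char *name, int rc)
{
	if (rc == 0)
		return;
	if (rc == -DER_STALE)
		/* Expected: a caller with a stale term asked us to stop. */
		fprintf(stdout, "DEBUG %s: ping rejected, stale term\n", name);
	else
		fprintf(stderr, "ERROR %s: failed to ping local replica: %d\n", name, rc);
}

int main(void)
{
	log_ping_result("pool_svc", -DER_STALE);  /* debug only */
	log_ping_result("pool_svc", -1);          /* real error */
	return 0;
}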
 * reconfigurations or the last MS notification.
 */
svc->ps_reconf.psc_force_notify = true;
pool_svc_schedule_reconf(svc);
Upon any subsequent rc != 0 errors below this point, is it worth issuing pool_svc_cancel_and_wait_reconf()?
Or, consider moving this to be one of the last steps after everything else has succeeded?
Oops, this is definitely a defect. :( Thanks a lot. Fixed.
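As a rough illustration of the cleanup under discussion, a minimal sketch assuming the fix cancels the scheduled reconfiguration when a later step-up step fails (the actual fix might instead schedule the reconfiguration last; every helper below is a hypothetical stand-in):

/* Sketch only: don't leave the reconfiguration ULT running if a later
 * part of step-up fails. */
#include <stdio.h>

static int  schedule_reconf(void)        { printf("reconf ULT scheduled\n"); return 0; }
static void cancel_and_wait_reconf(void) { printf("reconf ULT cancelled\n"); }
static int  finish_step_up(void)         { return -1; /* pretend a later step fails */ }

static int step_up(void)
{
	int rc;

	rc = schedule_reconf();
	if (rc != 0)
		return rc;

	rc = finish_step_up();
	if (rc != 0) {
		/* Error path: undo the earlier scheduling before failing. */
		cancel_and_wait_reconf();
		return rc;
	}
	return 0;
}

int main(void)
{
	printf("step_up rc=%d\n", step_up());
	return 0;
}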
	if (rdb_get_ranks(svc->ps_rsvc.s_db, &new) == 0) {
		d_rank_list_sort(current);
		d_rank_list_sort(new);

-		if (!d_rank_list_identical(new, current)) {
+		if (reconf->psc_force_notify || !d_rank_list_identical(new, current)) {
I may be getting a little lost looking at all of the parts of this change, but a question: we don't expect many instances where psc_force_notify will cause a RAS notification even when the lists are identical (i.e., no membership change), is that right? Since most of the time, if we are here, some membership change is occurring (e.g., you would be stepping up as leader because something happened to the previous leader)?
Right. I hope we won't see any besides those triggered by pool_svc_step_up_cb, which doesn't know whether we need to notify the MS or not. The ds_notify_pool_svc_update call seemed to always succeed (except for ENOMEM or bugs), as it only passes the notification to the local daos_server, regardless of whether the latter is able to pass the notification on to the MS (which is an issue itself, but I felt this PR is too big to accommodate any changes for that issue). Does it sound like I get your point, or maybe not? :)
I was thinking about a scenario where a new leader steps up but the membership has not changed - so I guess now looking closer maybe this could be the consequence of the previous leader experiencing some failure (e.g., in a step of updating the pool map) and calling rdb_resign() due to that error.
But I see your line of thinking here, that if for some very unexpected reason ds_notify_pool_svc_update() fails then reconf->psc_force_notify will remain true.
Probably no issue here, as we will want to see any leader transition reported.
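As a rough illustration of the "notify if forced or if membership changed" decision discussed in this thread, a minimal, self-contained sketch; ranks_identical(), notify_ms(), and force_notify are hypothetical stand-ins for the real rank-list and notification code:

/* Sketch only: notify the MS if notification is forced (e.g., after a
 * leader step-up) or if the replica membership actually changed, and
 * clear the force flag only after a successful notification. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_rank(const void *a, const void *b)
{
	return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

static bool ranks_identical(int *a, int *b, int n)
{
	/* Sort both lists in place (demo only) so ordering differences
	 * don't count as membership changes. */
	qsort(a, n, sizeof(*a), cmp_rank);
	qsort(b, n, sizeof(*b), cmp_rank);
	for (int i = 0; i < n; i++)
		if (a[i] != b[i])
			return false;
	return true;
}

static int notify_ms(int *ranks, int n)
{
	printf("notifying MS of %d replica ranks\n", n);
	return 0; /* pretend the local daos_server accepted the notification */
}

int main(void)
{
	int  current[] = {0, 1, 2};
	int  latest[]  = {2, 1, 0};     /* same membership, different order */
	bool force_notify = true;       /* set by step-up, cleared on success */

	if (force_notify || !ranks_identical(latest, current, 3)) {
		if (notify_ms(latest, 3) == 0)
			force_notify = false;  /* only clear after success */
	}
	return 0;
}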
Signed-off-by: Li Wei <[email protected]> Required-githooks: true
Required-githooks: true
Thanks, Ken.
LGTM. No errors found by checkpatch.