DAOS-10138 pool: Improve PS reconfigurations #10121

Merged: 7 commits merged into master from liw/ps-reconf on Sep 27, 2022
Conversation


@liw liw commented Aug 26, 2022

A PS currently performs reconfigurations (i.e., membership changes) only
upon pool map changes. If the PS leader crashes in the middle of a
series of reconfigurations, the new PS leader will not plan any
reconfiguration (or notify the MS of the latest list of replicas) until
the pool map changes for some other reason. This patch lets a PS leader
check if reconfigurations are required when it steps up.

To avoid blocking the step up and pool map change processes, this patch
performs each series of reconfigurations asynchronously in a ULT. The
change allows the reconfiguration process to wait for pending events,
retry upon certain errors (in the future), and wait for RPC timeouts
without directly impacting the normal PS operations. Hence, this patch
reverts the workaround (209ba92) that
skips destroying PS replicas.

Moreover, this patch adds a safety net that prevents an older rsvc leader
from removing an rsvc replica created by a newer rsvc leader. Although
it cannot resolve all problems in the area, the natural, term-based
approach requires no RDB layout change and is simple to implement. The
patch has to change rdb_test a bit to allow more than one test rsvc
instance, so that a quick rsvc test can be added.

Since select_svc_ranks avoids rank 0 but ds_pool_plan_svc_reconfs does
not, this patch modifies the former to remove the avoidance, so that
during a PoolCreate operation the MS observes a notification from the
new PS with the same ranks as the PoolCreate dRPC response. We have to
change a few tests as well as the MS to make this work:

  • Work around a race between mgmtSvc.PoolCreate and
    Database.handlePoolRepsUpdate. See the comment for the details.

  • Update svc.yaml to reflect the new PS replacement policy.

  • Fix the daos_obj.c assertion, since PS leaders can now be rank 0.

Signed-off-by: Li Wei [email protected]
Required-githooks: true

@github-actions

github-actions bot commented Aug 26, 2022

Bug-tracker data:
Ticket title is 'Restore Pool Service redundancy when enough engines are available'
Status is 'In Review'
Labels: 'Metadata'
https://daosio.atlassian.net/browse/DAOS-10138

@daosbuild1 daosbuild1 left a comment

src/pool/srv_pool.c (outdated; resolved)
src/pool/srv_pool.c (outdated; resolved)

@liw liw force-pushed the liw/ps-reconf branch 2 times, most recently from 96bec7b to 0cd6d58 on August 30, 2022 01:20
@daosbuild1 daosbuild1 dismissed their stale review August 30, 2022 01:22

Updated patch

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Functional on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-10121/2/testReport/(root)/

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/3/execution/node/872/log

@daosbuild1

Test stage Functional Hardware Small completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/3/execution/node/1013/log

@daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/3/execution/node/1102/log

@liw liw changed the base branch from master to liw/pool-destroy-svc August 31, 2022 08:36
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/5/execution/node/320/log

@daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/5/execution/node/335/log

@daosbuild1

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/5/execution/node/323/log

@liw liw force-pushed the liw/pool-destroy-svc branch from 5ef83e5 to a1ea680 on August 31, 2022 08:46
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/6/execution/node/871/log

@liw liw force-pushed the liw/pool-destroy-svc branch from a1ea680 to 33ff5b0 on September 1, 2022 03:48
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/7/execution/node/873/log

@daosbuild1

Test stage Functional Hardware Small completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10121/7/execution/node/1014/log

@liw liw force-pushed the liw/pool-destroy-svc branch from 33ff5b0 to 521779e on September 2, 2022 07:28
liw added 3 commits September 9, 2022 07:16
@daosbuild1

Test stage NLT on EL 8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-10121/14/display/redirect

@liw liw removed the request for review from a team September 8, 2022 23:19
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@liw liw commented Sep 8, 2022

Rebased to resolve conflicts caused by the automatic base change (i.e., the previous base branch was merged to master). No changes are made otherwise.

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

//
// The pool remains in Creating state after PoolCreate completes,
// leading to DER_AGAINs during PoolDestroy.
if p.State == system.PoolServiceStateReady && ps.State == system.PoolServiceStateCreating {
Contributor

Looks good to me, thanks. This is probably not even really a workaround now, but just the correct behavior. If we are in this situation, all of the information in the "update" is probably stale anyhow.

@liw liw commented Sep 11, 2022

I'm not 100% satisfied with this change because 1) the update may contain a valid svc rank refresh as in the example (where the svc ranks could in theory differ from those reported by ds_pool_svc_dist_create), and 2) I return nil in this case (because otherwise callers would need some specific error that they could recognize). The control plane deserves a better solution eventually. :)

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

kccain previously approved these changes Sep 22, 2022

@kccain kccain left a comment

mostly questions, looks very good

src/rsvc/srv.c (outdated):

	rc = rdb_ping(svc->s_db, caller_term);
	if (rc != 0) {
		if (rc != -DER_STALE)
			D_ERROR("%s: failed to ping local replica\n", svc->s_name);
@kccain kccain commented

(very minor) Only for developer insight when debugging problems, is it worth a D_DEBUG in the event rc == -DER_STALE to know that this replica was asked to be stopped by another "stale term" replica?

with ds_rsvc_start() it looks like a D_ERROR will be emitted there in case of -DER_STALE.

@liw liw commented

OK, let me add a DEBUG.

* reconfigurations or the last MS notification.
*/
svc->ps_reconf.psc_force_notify = true;
pool_svc_schedule_reconf(svc);
@kccain kccain commented

Upon any subsequent rc != 0 errors below this point, is it worth issuing pool_svc_cancel_and_wait_reconf()?
Or, consider moving this to be one of the last steps after everything else has succeeded?

@liw liw commented

Oops, this is definitely a defect. :( Thanks a lot. Fixed.


	if (rdb_get_ranks(svc->ps_rsvc.s_db, &new) == 0) {
		d_rank_list_sort(current);
		d_rank_list_sort(new);

-		if (!d_rank_list_identical(new, current)) {
+		if (reconf->psc_force_notify || !d_rank_list_identical(new, current)) {
@kccain kccain commented

I may be getting a little lost looking at all of the parts of this change, but a question: we don't expect many instances where the psc_force_notify will cause a RAS notification even when the lists are identical (no membership changes occur), is that right? Since most of the time if we are here, there is some membership change occurring (e.g., you would be stepping up as a leader because something happened to the previous leader)?

@liw liw commented

Right. I hope we won't see any besides those triggered by pool_svc_step_up_cb, which doesn't know whether we need to notify the MS or not. The ds_notify_pool_svc_update call seems to always succeed (except for ENOMEM or bugs), as it only passes the notification to the local daos_server, regardless of whether the latter is able to pass the notification on to the MS (which is an issue in itself, but I felt this PR is too big to accommodate any changes for it). Does it sound like I got your point, or maybe not? :)

@kccain kccain commented

I was thinking about a scenario where a new leader steps up but the membership has not changed - so I guess now looking closer maybe this could be the consequence of the previous leader experiencing some failure (e.g., in a step of updating the pool map) and calling rdb_resign() due to that error.

But I see your line of thinking here, that if for some very unexpected reason ds_notify_pool_svc_update() fails then reconf->psc_force_notify will remain true.

Probably no issue here, as we will want to see any leader transition reported.

liw added 2 commits September 23, 2022 16:02
@liw liw left a comment

Thanks, Ken.


@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@liw liw requested review from kccain and liuxuezhao September 23, 2022 13:06


@liw liw requested a review from a team September 26, 2022 23:53
@mjmac mjmac merged commit 67604a8 into master Sep 27, 2022
@mjmac mjmac deleted the liw/ps-reconf branch September 27, 2022 12:44
liw added a commit that referenced this pull request Sep 28, 2022
mjmac pushed a commit that referenced this pull request Oct 3, 2022
Labels: None yet

5 participants