
Avoid parallel reroutes in DiskThresholdMonitor #43381

Conversation

DaveCTurner
Contributor

Today the `DiskThresholdMonitor` limits the frequency with which it submits
reroute tasks, but it might still submit these tasks faster than the master can
process them if, for instance, each reroute takes over 60 seconds. This causes
a problem since the reroute task runs with priority `IMMEDIATE` and is always
scheduled when there is a node over the high watermark, so this can starve any
other pending tasks on the master.

This change avoids further updates from the monitor while its last task(s) are
still in progress, and it measures the time of each update from the completion
time of the reroute task rather than its start time, to allow a larger window
for other tasks to run.

Fixes #40174
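
In code terms, the fix amounts to a guard of roughly the following shape (a minimal sketch under assumed names: `RerouteThrottle`, `checkInProgress`, `lastRunTimeMillis`, and `submitReroute` are all illustrative, not the actual fields or methods of `DiskThresholdMonitor`):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

// Sketch of the two ideas in this change: (1) never have more than one
// reroute in flight, and (2) measure the throttling interval from the
// *completion* of the previous reroute rather than its submission.
class RerouteThrottle {
    private final AtomicBoolean checkInProgress = new AtomicBoolean();
    private volatile long lastRunTimeMillis = Long.MIN_VALUE;
    private final long intervalMillis;
    private final LongSupplier currentTimeMillisSupplier;

    RerouteThrottle(long intervalMillis, LongSupplier currentTimeMillisSupplier) {
        this.intervalMillis = intervalMillis;
        this.currentTimeMillisSupplier = currentTimeMillisSupplier;
    }

    void onNewInfo() {
        // Skip this round if the previous reroute completed too recently.
        if (currentTimeMillisSupplier.getAsLong() < lastRunTimeMillis + intervalMillis) {
            return;
        }
        // Skip this round if a reroute is still in flight.
        if (checkInProgress.compareAndSet(false, true) == false) {
            return;
        }
        submitReroute(() -> {
            // The window for other master tasks is measured from completion,
            // so a 60-second reroute still leaves a full interval of quiet.
            lastRunTimeMillis = currentTimeMillisSupplier.getAsLong();
            checkInProgress.set(false);
        });
    }

    void submitReroute(Runnable onCompletion) {
        // Stand-in for submitting the reroute task to the master; the
        // completion callback must run whether the task succeeds or fails.
        onCompletion.run();
    }
}
```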

@DaveCTurner DaveCTurner added the >bug, :Distributed Coordination/Allocation, v8.0.0, and v7.3.0 labels on Jun 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

}

final ImmutableOpenMap<String, DiskUsage> usages = info.getNodeLeastAvailableDiskUsages();
if (usages == null) {
Contributor Author

Probably best to look at this bit ignoring whitespace changes - I removed a level of indentation.

Contributor

good tip

@ywelsch ywelsch (Contributor) left a comment

I'm still not super happy that we are sending a task to be executed at priority IMMEDIATE. I would rather have this call RoutingService. In that case, we could also avoid this whole business of tracking whether there is already a call in progress (that's taken care of by RoutingService). WDYT?

@DaveCTurner
Contributor Author

I agree on the priority thing, but the RoutingService still uses HIGH priority and doesn't offer a notification on completion to keep the frequency low. I could add such a thing if you'd like?

@ywelsch
Contributor

ywelsch commented Jun 25, 2019

> I agree on the priority thing, but the RoutingService still uses HIGH priority and doesn't offer a notification on completion to keep the frequency low. I could add such a thing if you'd like?

I think HIGH priority is ok for now. I wonder why we need the notification on completion. What does it keep the frequency low of? If we're batching calls, it's fine to have multiple pending attempts?
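
For context, batching here means something like the following (a hypothetical sketch of collecting completion listeners behind a single pending reroute; `BatchingRerouter` and all its members are illustrative, not the real `RoutingService` implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: multiple pending reroute requests collapse into one scheduled
// task, and every requester is notified when that one task completes.
class BatchingRerouter {
    private final List<Consumer<Exception>> pendingListeners = new ArrayList<>();
    private boolean rerouteScheduled;

    synchronized void reroute(Consumer<Exception> listener) {
        pendingListeners.add(listener);
        if (rerouteScheduled) {
            return; // piggy-back on the already-scheduled reroute
        }
        rerouteScheduled = true;
        scheduleReroute();
    }

    private void scheduleReroute() {
        // Stand-in for submitting a single cluster-state update task;
        // here we complete immediately on the calling thread.
        onRerouteCompleted(null);
    }

    private void onRerouteCompleted(Exception failure) {
        List<Consumer<Exception>> listeners;
        synchronized (this) {
            listeners = new ArrayList<>(pendingListeners);
            pendingListeners.clear();
            rerouteScheduled = false;
        }
        // Every caller that piggy-backed on this reroute is notified once,
        // on success (failure == null) or on failure alike.
        listeners.forEach(l -> l.accept(failure));
    }
}
```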

@DaveCTurner DaveCTurner requested a review from ywelsch June 26, 2019 12:07
@DaveCTurner
Contributor Author

Ok I have added to the RoutingService the ability to listen for completion, and adjusted the DiskThresholdMonitor to make use of this. @ywelsch would you take another look?

@ywelsch ywelsch (Contributor) left a comment

I've left two small asks. Looking good otherwise.

@@ -379,7 +379,7 @@ public void clusterStatePublished(ClusterChangedEvent clusterChangedEvent) {
         if (logger.isTraceEnabled()) {
             logger.trace("{}, scheduling a reroute", reason);
         }
-        routingService.reroute(reason);
+        routingService.reroute(reason, ActionListener.wrap(() -> logger.trace("{}, reroute completed", reason)));
Contributor

this also logs the same line on an exception :/
I would prefer two different log lines, and the failure one with the exception (same for other places in this PR)

Contributor Author

Fixed in ce5946b
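
The shape of the requested fix is roughly the following (a sketch, not the actual contents of ce5946b, using the two-consumer `ActionListener.wrap` overload so that success and failure produce distinct log lines; `routingService`, `logger`, and `reason` come from the surrounding code, and `ParameterizedMessage` is from log4j2):

```java
// Sketch only: distinct log lines for success and failure, with the
// exception attached to the failure line as the reviewer asked.
routingService.reroute(reason, ActionListener.wrap(
    r -> logger.trace("{}, reroute completed", reason),
    e -> logger.debug(new ParameterizedMessage("{}, reroute failed", reason), e)));
```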

if (nodes.contains(node) == false) {
    nodeHasPassedWatermark.remove(node);
}

Contributor

assert that rerouteAction is set?

Contributor Author

Fixed in c1d6ee0
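
The requested assertion presumably takes a form like this (illustrative only; the exact placement and message in c1d6ee0 may differ):

```java
// Illustrative: fail fast in tests if the reroute action was never injected.
assert rerouteAction != null : "rerouteAction not set";
```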

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/2

@ywelsch ywelsch self-requested a review June 27, 2019 20:26
@ywelsch ywelsch (Contributor) left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 448acea into elastic:master Jun 30, 2019
@DaveCTurner DaveCTurner deleted the 2019-06-19-avoid-parallel-rerouting-in-disk-threshold-monitor branch June 30, 2019 15:45
DaveCTurner added a commit that referenced this pull request Jun 30, 2019
Today the `DiskThresholdMonitor` limits the frequency with which it submits
reroute tasks, but it might still submit these tasks faster than the master can
process them if, for instance, each reroute takes over 60 seconds. This causes
a problem since the reroute task runs with priority `IMMEDIATE` and is always
scheduled when there is a node over the high watermark, so this can starve any
other pending tasks on the master.

This change avoids further updates from the monitor while its last task(s) are
still in progress, and it measures the time of each update from the completion
time of the reroute task rather than its start time, to allow a larger window
for other tasks to run.

It also now makes use of the `RoutingService` to submit the reroute task, in
order to batch this task with any other pending reroutes. It enhances the
`RoutingService` to notify its listeners on completion.

Fixes #40174
Relates #42559
Labels
>bug, :Distributed Coordination/Allocation, v7.3.0, v8.0.0-alpha1
Successfully merging this pull request may close these issues.

MockDiskUsagesIT.testRerouteOccursOnDiskPassingHighWatermark fails in CI