
Avoid parallel reroutes in DiskThresholdMonitor #43381

Conversation

DaveCTurner
Contributor

Today the `DiskThresholdMonitor` limits the frequency with which it submits
reroute tasks, but it might still submit these tasks faster than the master can
process them if, for instance, each reroute takes over 60 seconds. This causes
a problem since the reroute task runs with priority `IMMEDIATE` and is always
scheduled when there is a node over the high watermark, so this can starve any
other pending tasks on the master.

This change avoids further updates from the monitor while its last task(s) are
still in progress, and it measures the time of each update from the completion
time of the reroute task rather than its start time, to allow a larger window
for other tasks to run.

Fixes #40174
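
In code terms, the fix amounts to a guard of roughly the following shape (a minimal sketch under assumed names: `RerouteThrottle`, `checkInProgress`, `lastRunTimeMillis`, and `submitReroute` are all illustrative, not the actual fields or methods of `DiskThresholdMonitor`):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

// Sketch of the two ideas in this change: (1) never have more than one
// reroute in flight, and (2) measure the throttling interval from the
// *completion* of the previous reroute rather than its submission.
class RerouteThrottle {
    private final AtomicBoolean checkInProgress = new AtomicBoolean();
    private volatile long lastRunTimeMillis = Long.MIN_VALUE;
    private final long intervalMillis;
    private final LongSupplier currentTimeMillisSupplier;

    RerouteThrottle(long intervalMillis, LongSupplier currentTimeMillisSupplier) {
        this.intervalMillis = intervalMillis;
        this.currentTimeMillisSupplier = currentTimeMillisSupplier;
    }

    void onNewInfo() {
        // Skip this round if the previous reroute completed too recently.
        if (currentTimeMillisSupplier.getAsLong() < lastRunTimeMillis + intervalMillis) {
            return;
        }
        // Skip this round if a reroute is still in flight.
        if (checkInProgress.compareAndSet(false, true) == false) {
            return;
        }
        submitReroute(() -> {
            // The window for other master tasks is measured from completion,
            // so a 60-second reroute still leaves a full interval of quiet.
            lastRunTimeMillis = currentTimeMillisSupplier.getAsLong();
            checkInProgress.set(false);
        });
    }

    void submitReroute(Runnable onCompletion) {
        // Stand-in for submitting the reroute task to the master; the
        // completion callback must run whether the task succeeds or fails.
        onCompletion.run();
    }
}
```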

@DaveCTurner DaveCTurner added the >bug, :Distributed Coordination/Allocation, v8.0.0, and v7.3.0 labels on Jun 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

}

final ImmutableOpenMap<String, DiskUsage> usages = info.getNodeLeastAvailableDiskUsages();
if (usages == null) {
Contributor Author

Probably best to look at this bit ignoring whitespace changes - I removed a level of indentation.

Contributor

good tip

@ywelsch ywelsch (Contributor) left a comment

I'm still not super happy that we are sending a task to be executed at priority IMMEDIATE. I would rather have this call RoutingService. In that case, we could also avoid this whole business of tracking whether there is already a call in progress (that's taken care of by RoutingService). WDYT?

@DaveCTurner
Contributor Author

I agree on the priority thing, but the RoutingService still uses HIGH priority and doesn't offer a notification on completion to keep the frequency low. I could add such a thing if you'd like?

@ywelsch
Contributor

ywelsch commented Jun 25, 2019

> I agree on the priority thing, but the RoutingService still uses HIGH priority and doesn't offer a notification on completion to keep the frequency low. I could add such a thing if you'd like?

I think HIGH priority is ok for now. I wonder why we need the notification on completion. What does it keep the frequency low of? If we're batching calls, it's fine to have multiple pending attempts?
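
For context, batching here means something like the following (a hypothetical sketch of collecting completion listeners behind a single pending reroute; `BatchingRerouter` and all its members are illustrative, not the real `RoutingService` implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: multiple pending reroute requests collapse into one scheduled
// task, and every requester is notified when that one task completes.
class BatchingRerouter {
    private final List<Consumer<Exception>> pendingListeners = new ArrayList<>();
    private boolean rerouteScheduled;

    synchronized void reroute(Consumer<Exception> listener) {
        pendingListeners.add(listener);
        if (rerouteScheduled) {
            return; // piggy-back on the already-scheduled reroute
        }
        rerouteScheduled = true;
        scheduleReroute();
    }

    private void scheduleReroute() {
        // Stand-in for submitting a single cluster-state update task;
        // here we complete immediately on the calling thread.
        onRerouteCompleted(null);
    }

    private void onRerouteCompleted(Exception failure) {
        List<Consumer<Exception>> listeners;
        synchronized (this) {
            listeners = new ArrayList<>(pendingListeners);
            pendingListeners.clear();
            rerouteScheduled = false;
        }
        // Every caller that piggy-backed on this reroute is notified once,
        // on success (failure == null) or on failure alike.
        listeners.forEach(l -> l.accept(failure));
    }
}
```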

@DaveCTurner DaveCTurner requested a review from ywelsch June 26, 2019 12:07
@DaveCTurner
Contributor Author

Ok I have added to the RoutingService the ability to listen for completion, and adjusted the DiskThresholdMonitor to make use of this. @ywelsch would you take another look?

@ywelsch ywelsch (Contributor) left a comment

I've left two small asks. Looking good otherwise.

@@ -379,7 +379,7 @@ public void clusterStatePublished(ClusterChangedEvent clusterChangedEvent) {
         if (logger.isTraceEnabled()) {
             logger.trace("{}, scheduling a reroute", reason);
         }
-        routingService.reroute(reason);
+        routingService.reroute(reason, ActionListener.wrap(() -> logger.trace("{}, reroute completed", reason)));
Contributor

this also logs the same line on an exception :/
I would prefer two different log lines, and the failure one with the exception (same for other places in this PR)

Contributor Author

Fixed in ce5946b
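
The shape of the requested fix is roughly the following (a sketch, not the actual contents of ce5946b, using the two-consumer `ActionListener.wrap` overload so that success and failure produce distinct log lines; `routingService`, `logger`, and `reason` come from the surrounding code, and `ParameterizedMessage` is from log4j2):

```java
// Sketch only: distinct log lines for success and failure, with the
// exception attached to the failure line as the reviewer asked.
routingService.reroute(reason, ActionListener.wrap(
    r -> logger.trace("{}, reroute completed", reason),
    e -> logger.debug(new ParameterizedMessage("{}, reroute failed", reason), e)));
```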

if (nodes.contains(node) == false) {
    nodeHasPassedWatermark.remove(node);
}

Contributor

assert that rerouteAction is set?

Contributor Author

Fixed in c1d6ee0
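
The requested assertion presumably takes a form like this (illustrative only; the exact placement and message in c1d6ee0 may differ):

```java
// Illustrative: fail fast in tests if the reroute action was never injected.
assert rerouteAction != null : "rerouteAction not set";
```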

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/2

@ywelsch ywelsch self-requested a review June 27, 2019 20:26
@ywelsch ywelsch (Contributor) left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 448acea into elastic:master Jun 30, 2019
@DaveCTurner DaveCTurner deleted the 2019-06-19-avoid-parallel-rerouting-in-disk-threshold-monitor branch June 30, 2019 15:45
DaveCTurner added a commit that referenced this pull request Jun 30, 2019
Today the `DiskThresholdMonitor` limits the frequency with which it submits
reroute tasks, but it might still submit these tasks faster than the master can
process them if, for instance, each reroute takes over 60 seconds. This causes
a problem since the reroute task runs with priority `IMMEDIATE` and is always
scheduled when there is a node over the high watermark, so this can starve any
other pending tasks on the master.

This change avoids further updates from the monitor while its last task(s) are
still in progress, and it measures the time of each update from the completion
time of the reroute task rather than its start time, to allow a larger window
for other tasks to run.

It also now makes use of the `RoutingService` to submit the reroute task, in
order to batch this task with any other pending reroutes. It enhances the
`RoutingService` to notify its listeners on completion.

Fixes #40174
Relates #42559
Labels
>bug, :Distributed Coordination/Allocation, v7.3.0, v8.0.0-alpha1
Successfully merging this pull request may close these issues.

MockDiskUsagesIT.testRerouteOccursOnDiskPassingHighWatermark fails in CI