Introduce max headroom for disk watermark stages
Introduce max headroom settings for the low, high, and
flood disk watermark stages, similar to the existing
max headroom setting for the flood stage of the frozen
tier. Also, convert the disk watermarks to
RelativeByteSizeValue, similar to the existing setting
for the flood stage of the frozen tier.

Introduce new max headrooms in HealthMetadata and in
ReactiveStorageDeciderService.

Add multiple tests in DiskThresholdDeciderUnitTests,
DiskThresholdDeciderTests and DiskThresholdMonitorTests.

Fixes elastic#81406
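
To make the headroom semantics concrete, here is a minimal, self-contained sketch (the class and method names are hypothetical, not code from this commit): a percentage/ratio watermark demands `total * (1 - ratio)` bytes free, and the max headroom caps that demand.

[source,java]
----
// Hypothetical sketch of the headroom semantics; not code from this commit.
public final class WatermarkHeadroomSketch {

    /**
     * Free bytes a node must keep to stay below a percentage/ratio watermark,
     * optionally capped by a max headroom (a negative headroom means "no cap").
     */
    static long requiredFreeBytes(long totalBytes, double watermarkRatio, long maxHeadroomBytes) {
        long fromRatio = (long) Math.ceil(totalBytes * (1.0 - watermarkRatio));
        return maxHeadroomBytes < 0 ? fromRatio : Math.min(fromRatio, maxHeadroomBytes);
    }

    public static void main(String[] args) {
        long total = 10L * 1024 * 1024 * 1024 * 1024; // 10 TiB disk
        long headroom = 150L * 1024 * 1024 * 1024;    // 150 GiB, the default low-stage headroom
        System.out.println(requiredFreeBytes(total, 0.85, -1));       // ~1.5 TiB free required without the cap
        System.out.println(requiredFreeBytes(total, 0.85, headroom)); // 150 GiB free required with the cap
    }
}
----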
kingherc committed Jul 20, 2022
1 parent 377ad77 commit 61d4f32
Showing 25 changed files with 2,081 additions and 958 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -604,7 +604,7 @@ threshold has been breached:

     logger.warn(
         "flood stage disk watermark [{}] exceeded on {}, all indices on this node will be marked read-only",
-        diskThresholdSettings.describeFloodStageThreshold(),
+        diskThresholdSettings.describeFloodStageThreshold(total, false),
         usage
     );

46 changes: 28 additions & 18 deletions docs/reference/how-to/fix-common-cluster-issues.asciidoc
@@ -51,8 +51,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": "100gb",
     "cluster.routing.allocation.disk.watermark.high": "95%",
-    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": "20gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5gb"
   }
 }
@@ -82,8 +87,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": null,
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": null,
     "cluster.routing.allocation.disk.watermark.high": null,
-    "cluster.routing.allocation.disk.watermark.flood_stage": null
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
   }
 }
----
@@ -674,8 +684,8 @@ for tips on diagnosing and preventing them.
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.

@@ -685,11 +695,11 @@ and long running tasks can all contribute to a backlogged task queue.

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.

You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have completed.

[source,console]
----
@@ -698,9 +708,9 @@ GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,comple

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.

[source,console]
@@ -710,9 +720,9 @@ GET /_nodes/hot_threads

**Look for long running tasks**

Long-running tasks can also cause a backlog.
You can use the <<tasks,task management>> API to get information about the tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.

[source,console]
----
@@ -723,16 +733,16 @@ GET /_tasks?filter_path=nodes.*.tasks
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog

**Increase available resources**

If tasks are progressing slowly and the queue is backing up,
you might need to take steps to <<reduce-cpu-usage>>.

In some cases, increasing the thread pool size might help.
For example, the `force_merge` thread pool defaults to a single thread.
Increasing the size to 2 might help reduce a backlog of force merge requests.
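
A sketch of that change in `elasticsearch.yml` (thread pool sizing is a static setting, so it requires a node restart; treat the exact key as an assumption and verify it for your version):

[source,yaml]
----
# Grow the fixed force_merge pool from its default of one thread.
thread_pool.force_merge.size: 2
----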

**Cancel stuck tasks**

If you find the active task's hot thread isn't progressing and there's a backlog,
consider canceling the task.
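
A sketch of cancelling a stuck task with the task management API (the task ID below is a placeholder):

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----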
3 changes: 2 additions & 1 deletion docs/reference/index-modules/blocks.asciidoc
@@ -37,7 +37,8 @@ block and makes resources available almost immediately.
+
 IMPORTANT: {es} adds and removes the read-only index block automatically when
 the disk utilization falls below the high watermark, controlled by
-<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>.
+<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>
+and <<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage.max_headroom>>.

`index.blocks.read`::

28 changes: 22 additions & 6 deletions docs/reference/modules/cluster/disk_allocator.asciidoc
@@ -72,16 +72,26 @@ Defaults to `true`. Set to `false` to disable the disk allocation decider.
// tag::cluster-routing-watermark-low-tag[]
`cluster.routing.allocation.disk.watermark.low` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
+Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can alternatively be set to a ratio value, e.g., `0.85`. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
// end::cluster-routing-watermark-low-tag[]

+`cluster.routing.allocation.disk.watermark.low.max_headroom` {ess-icon}::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the low stage watermark (in case of a percentage/ratio value).
+Defaults to 150gb when `cluster.routing.allocation.disk.watermark.low` is not explicitly set.
+This caps the amount of free space required.
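
(Illustrative arithmetic, not part of this diff: with the default `85%` low watermark and `150gb` max headroom, a node with a 10TB disk needs only 150GB of free space, rather than 1.5TB, to stay below the low watermark.)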

[[cluster-routing-watermark-high]]
// tag::cluster-routing-watermark-high-tag[]
`cluster.routing.allocation.disk.watermark.high` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
+Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can alternatively be set to a ratio value, e.g., `0.9`. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
// end::cluster-routing-watermark-high-tag[]

+`cluster.routing.allocation.disk.watermark.high.max_headroom` {ess-icon}::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the high stage watermark (in case of a percentage/ratio value).
+Defaults to 100gb when `cluster.routing.allocation.disk.watermark.high` is not explicitly set.
+This caps the amount of free space required.

`cluster.routing.allocation.disk.watermark.enable_for_single_data_node`::
(<<static-cluster-setting,Static>>)
In earlier releases, the default behaviour was to disregard disk watermarks for a single
@@ -95,10 +105,16 @@ is now `true`. The setting will be removed in a future release.
+
--
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark.
+Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark. Similarly to the low and high watermark values, it can alternatively be set to a ratio value, e.g., `0.95`, or an absolute byte value.

+`cluster.routing.allocation.disk.watermark.flood_stage.max_headroom` {ess-icon}::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the flood stage watermark (in case of a percentage/ratio value).
+Defaults to 20gb when
+`cluster.routing.allocation.disk.watermark.flood_stage` is not explicitly set.
+This caps the amount of free space required.

-NOTE: You cannot mix the usage of percentage values and byte values within
-these settings. Either all values are set to percentage values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold.
+NOTE: You cannot mix the usage of percentage/ratio values and byte values within
+the watermark settings. Either all values are set to percentage/ratio values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold. A similar check is done for the max headroom values.
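
For instance (illustrative values, not part of this diff), a consistent all-bytes configuration:

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb"
  }
}
----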

An example of resetting the read-only index block on the `my-index-000001` index:
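
The example itself is unchanged and collapsed in this diff; it presumably takes the standard form:

[source,console]
----
PUT /my-index-000001/_settings
{
  "index.blocks.read_only_allow_delete": null
}
----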

@@ -123,7 +139,7 @@ Controls the flood stage watermark for dedicated frozen nodes, which defaults to
`cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
Controls the max headroom for the flood stage watermark for dedicated frozen
-nodes. Defaults to 20GB when
+nodes. Defaults to 20gb when
`cluster.routing.allocation.disk.watermark.flood_stage.frozen` is not explicitly
set. This caps the amount of free space required on dedicated frozen nodes.

@@ -46,8 +46,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": "100gb",
     "cluster.routing.allocation.disk.watermark.high": "95%",
-    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": "20gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5gb"
   }
 }
@@ -77,8 +82,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": null,
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": null,
     "cluster.routing.allocation.disk.watermark.high": null,
-    "cluster.routing.allocation.disk.watermark.flood_stage": null
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
   }
 }
----
@@ -16,8 +16,8 @@ the operation and returns an error.
The most common causes of high CPU usage and their solutions.

<<high-jvm-memory-pressure,High JVM memory pressure>>::
High JVM memory usage can degrade cluster performance and trigger circuit
breaker errors.

<<red-yellow-cluster-status,Red or yellow cluster status>>::
A red or yellow cluster status indicates one or more shards are missing or
@@ -29,13 +29,13 @@ When {es} rejects a request, it stops the operation and returns an error with a
`429` response code.

<<task-queue-backlog,Task queue backlog>>::
A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state.

include::common-issues/disk-usage-exceeded.asciidoc[]
include::common-issues/circuit-breaker-errors.asciidoc[]
include::common-issues/high-cpu-usage.asciidoc[]
include::common-issues/high-jvm-memory-pressure.asciidoc[]
include::common-issues/red-yellow-cluster-status.asciidoc[]
include::common-issues/rejected-requests.asciidoc[]
include::common-issues/task-queue-backlog.asciidoc[]
@@ -34,8 +34,11 @@
 import java.util.Map;
 import java.util.concurrent.atomic.AtomicReference;
 
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING;
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING;
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider.CLUSTER_ROUTING_REBALANCE_ENABLE_SETTING;
@@ -99,8 +102,11 @@ public void testRerouteOccursOnDiskPassingHighWatermark() throws Exception {
     .setPersistentSettings(
         Settings.builder()
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "0b" : "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
     )
 );
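
(Note the recurring pattern in these tests: with absolute byte watermarks (`watermarkBytes`), the new max headroom settings are passed `-1`, which appears to disable the headroom cap so the byte watermark applies unmodified; with percentage watermarks they get a small byte headroom instead.)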
@@ -179,8 +185,11 @@ public void testAutomaticReleaseOfIndexBlock() throws Exception {
     .setPersistentSettings(
         Settings.builder()
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "5b" : "95%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "5b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "150ms")
     )
 );
@@ -274,6 +283,7 @@ public void testOnlyMovesEnoughShardsToDropBelowHighWatermark() throws Exception
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "0b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
     )
 );
@@ -366,7 +376,9 @@ public void testDoesNotExceedLowWatermarkWhenRebalancing() throws Exception {
     Settings.builder()
         .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "85%")
         .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), "100%")
+        .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), "0b")
         .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), "100%")
+        .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "0b")
     )
 );

@@ -451,6 +463,7 @@ public void testMovesShardsOffSpecificDataPathAboveWatermark() throws Exception
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "0b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
     )
 );