Introduce max headroom for disk watermark stages
Introduce max headroom settings for the low, high, and
flood disk watermark stages, similar to the existing
max headroom setting for the flood stage of the frozen
tier. Also, convert the disk watermarks to
RelativeByteSizeValue, similar to the existing setting
for the flood stage of the frozen tier.

Introduce new max headrooms in HealthMetadata and in
ReactiveStorageDeciderService.

Add multiple tests in DiskThresholdDeciderUnitTests,
DiskThresholdDeciderTests and DiskThresholdMonitorTests.

Fixes elastic#81406
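
To make the headroom semantics concrete, here is a minimal, self-contained sketch (the class and method names are hypothetical, not code from this commit): a percentage/ratio watermark demands `total * (1 - ratio)` bytes free, and the max headroom caps that demand.

[source,java]
----
// Hypothetical sketch of the headroom semantics; not code from this commit.
public final class WatermarkHeadroomSketch {

    /**
     * Free bytes a node must keep to stay below a percentage/ratio watermark,
     * optionally capped by a max headroom (a negative headroom means "no cap").
     */
    static long requiredFreeBytes(long totalBytes, double watermarkRatio, long maxHeadroomBytes) {
        long fromRatio = (long) Math.ceil(totalBytes * (1.0 - watermarkRatio));
        return maxHeadroomBytes < 0 ? fromRatio : Math.min(fromRatio, maxHeadroomBytes);
    }

    public static void main(String[] args) {
        long total = 10L * 1024 * 1024 * 1024 * 1024; // 10 TiB disk
        long headroom = 150L * 1024 * 1024 * 1024;    // 150 GiB, the default low-stage headroom
        System.out.println(requiredFreeBytes(total, 0.85, -1));       // ~1.5 TiB free required without the cap
        System.out.println(requiredFreeBytes(total, 0.85, headroom)); // 150 GiB free required with the cap
    }
}
----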
kingherc committed Jul 20, 2022
1 parent 377ad77 commit 61d4f32
Showing 25 changed files with 2,081 additions and 958 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -604,7 +604,7 @@ threshold has been breached:

     logger.warn(
         "flood stage disk watermark [{}] exceeded on {}, all indices on this node will be marked read-only",
-        diskThresholdSettings.describeFloodStageThreshold(),
+        diskThresholdSettings.describeFloodStageThreshold(total, false),
         usage
     );

46 changes: 28 additions & 18 deletions docs/reference/how-to/fix-common-cluster-issues.asciidoc
@@ -51,8 +51,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": "100gb",
     "cluster.routing.allocation.disk.watermark.high": "95%",
-    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": "20gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5gb"
   }
 }
@@ -82,8 +87,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": null,
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": null,
     "cluster.routing.allocation.disk.watermark.high": null,
-    "cluster.routing.allocation.disk.watermark.flood_stage": null
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
   }
 }
----
@@ -674,8 +684,8 @@ for tips on diagnosing and preventing them.
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.

@@ -685,11 +695,11 @@ and long running tasks can all contribute to a backlogged task queue.

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.

You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have completed.

[source,console]
----
@@ -698,9 +708,9 @@ GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,comple

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.

[source,console]
@@ -710,9 +720,9 @@ GET /_nodes/hot_threads

**Look for long running tasks**

Long-running tasks can also cause a backlog.
You can use the <<tasks,task management>> API to get information about the tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.

[source,console]
----
@@ -723,16 +733,16 @@ GET /_tasks?filter_path=nodes.*.tasks
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog

**Increase available resources**

If tasks are progressing slowly and the queue is backing up,
you might need to take steps to <<reduce-cpu-usage>>.

In some cases, increasing the thread pool size might help.
For example, the `force_merge` thread pool defaults to a single thread.
Increasing the size to 2 might help reduce a backlog of force merge requests.
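
A sketch of that change in `elasticsearch.yml` (thread pool sizing is a static setting, so it requires a node restart; treat the exact key as an assumption and verify it for your version):

[source,yaml]
----
# Grow the fixed force_merge pool from its default of one thread.
thread_pool.force_merge.size: 2
----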

**Cancel stuck tasks**

If you find the active task's hot thread isn't progressing and there's a backlog,
consider canceling the task.
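
A sketch of cancelling a stuck task with the task management API (the task ID below is a placeholder):

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----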
3 changes: 2 additions & 1 deletion docs/reference/index-modules/blocks.asciidoc
@@ -37,7 +37,8 @@ block and makes resources available almost immediately.
+
 IMPORTANT: {es} adds and removes the read-only index block automatically when
 the disk utilization falls below the high watermark, controlled by
-<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>.
+<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>
+and <<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage.max_headroom>>.

`index.blocks.read`::

28 changes: 22 additions & 6 deletions docs/reference/modules/cluster/disk_allocator.asciidoc
@@ -72,16 +72,26 @@ Defaults to `true`. Set to `false` to disable the disk allocation decider.
// tag::cluster-routing-watermark-low-tag[]
`cluster.routing.allocation.disk.watermark.low` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
+Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can alternatively be set to a ratio value, e.g., `0.85`. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
// end::cluster-routing-watermark-low-tag[]

+`cluster.routing.allocation.disk.watermark.low.max_headroom` {ess-icon}::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the low stage watermark (in case of a percentage/ratio value).
+Defaults to 150gb when `cluster.routing.allocation.disk.watermark.low` is not explicitly set.
+This caps the amount of free space required.
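
(Illustrative arithmetic, not part of this diff: with the default `85%` low watermark and `150gb` max headroom, a node with a 10TB disk needs only 150GB of free space, rather than 1.5TB, to stay below the low watermark.)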

[[cluster-routing-watermark-high]]
// tag::cluster-routing-watermark-high-tag[]
`cluster.routing.allocation.disk.watermark.high` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
+Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can alternatively be set to a ratio value, e.g., `0.9`. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
// end::cluster-routing-watermark-high-tag[]

+`cluster.routing.allocation.disk.watermark.high.max_headroom` {ess-icon}::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the high stage watermark (in case of a percentage/ratio value).
+Defaults to 100gb when `cluster.routing.allocation.disk.watermark.high` is not explicitly set.
+This caps the amount of free space required.

`cluster.routing.allocation.disk.watermark.enable_for_single_data_node`::
(<<static-cluster-setting,Static>>)
In earlier releases, the default behaviour was to disregard disk watermarks for a single
@@ -95,10 +105,16 @@ is now `true`. The setting will be removed in a future release.
+
--
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark.
+Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark. Similarly to the low and high watermark values, it can alternatively be set to a ratio value, e.g., `0.95`, or an absolute byte value.

+`cluster.routing.allocation.disk.watermark.flood_stage.max_headroom` {ess-icon}::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the flood stage watermark (in case of a percentage/ratio value).
+Defaults to 20gb when
+`cluster.routing.allocation.disk.watermark.flood_stage` is not explicitly set.
+This caps the amount of free space required.

-NOTE: You cannot mix the usage of percentage values and byte values within
-these settings. Either all values are set to percentage values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold.
+NOTE: You cannot mix the usage of percentage/ratio values and byte values within
+the watermark settings. Either all values are set to percentage/ratio values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold. A similar check is done for the max headroom values.
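
For instance (illustrative values, not part of this diff), a consistent all-bytes configuration:

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb"
  }
}
----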

An example of resetting the read-only index block on the `my-index-000001` index:
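
The example itself is unchanged and collapsed in this diff; it presumably takes the standard form:

[source,console]
----
PUT /my-index-000001/_settings
{
  "index.blocks.read_only_allow_delete": null
}
----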

@@ -123,7 +139,7 @@ Controls the flood stage watermark for dedicated frozen nodes, which defaults to
`cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
Controls the max headroom for the flood stage watermark for dedicated frozen
-nodes. Defaults to 20GB when
+nodes. Defaults to 20gb when
`cluster.routing.allocation.disk.watermark.flood_stage.frozen` is not explicitly
set. This caps the amount of free space required on dedicated frozen nodes.

@@ -46,8 +46,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": "100gb",
     "cluster.routing.allocation.disk.watermark.high": "95%",
-    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": "20gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5gb",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5gb"
   }
 }
@@ -77,8 +82,13 @@ PUT _cluster/settings
 {
   "persistent": {
     "cluster.routing.allocation.disk.watermark.low": null,
+    "cluster.routing.allocation.disk.watermark.low.max_headroom": null,
     "cluster.routing.allocation.disk.watermark.high": null,
-    "cluster.routing.allocation.disk.watermark.flood_stage": null
+    "cluster.routing.allocation.disk.watermark.high.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
   }
 }
----
@@ -16,8 +16,8 @@ the operation and returns an error.
The most common causes of high CPU usage and their solutions.

<<high-jvm-memory-pressure,High JVM memory pressure>>::
High JVM memory usage can degrade cluster performance and trigger circuit
breaker errors.

<<red-yellow-cluster-status,Red or yellow cluster status>>::
A red or yellow cluster status indicates one or more shards are missing or
@@ -29,13 +29,13 @@ When {es} rejects a request, it stops the operation and returns an error with a
`429` response code.

<<task-queue-backlog,Task queue backlog>>::
A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state.

include::common-issues/disk-usage-exceeded.asciidoc[]
include::common-issues/circuit-breaker-errors.asciidoc[]
include::common-issues/high-cpu-usage.asciidoc[]
include::common-issues/high-jvm-memory-pressure.asciidoc[]
include::common-issues/red-yellow-cluster-status.asciidoc[]
include::common-issues/rejected-requests.asciidoc[]
include::common-issues/task-queue-backlog.asciidoc[]
@@ -34,8 +34,11 @@
 import java.util.Map;
 import java.util.concurrent.atomic.AtomicReference;
 
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING;
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING;
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING;
 import static org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider.CLUSTER_ROUTING_REBALANCE_ENABLE_SETTING;
@@ -99,8 +102,11 @@ public void testRerouteOccursOnDiskPassingHighWatermark() throws Exception {
     .setPersistentSettings(
         Settings.builder()
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "0b" : "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
     )
 );
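
(Note the recurring pattern in these tests: with absolute byte watermarks (`watermarkBytes`), the new max headroom settings are passed `-1`, which appears to disable the headroom cap so the byte watermark applies unmodified; with percentage watermarks they get a small byte headroom instead.)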
@@ -179,8 +185,11 @@ public void testAutomaticReleaseOfIndexBlock() throws Exception {
     .setPersistentSettings(
         Settings.builder()
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "10b")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "5b" : "95%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), watermarkBytes ? "-1" : "5b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "150ms")
     )
 );
@@ -274,6 +283,7 @@ public void testOnlyMovesEnoughShardsToDropBelowHighWatermark() throws Exception
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "0b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
     )
 );
@@ -366,7 +376,9 @@ public void testDoesNotExceedLowWatermarkWhenRebalancing() throws Exception {
     Settings.builder()
         .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "85%")
         .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), "100%")
+        .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), "0b")
         .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), "100%")
+        .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "0b")
     )
 );

@@ -451,6 +463,7 @@ public void testMovesShardsOffSpecificDataPathAboveWatermark() throws Exception
             .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), "90%")
             .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "0b")
             .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
     )
 );