
Improve Node Shutdown Observability #78727

Merged

Conversation

@gwbrown (Contributor) commented Oct 5, 2021

This PR does two things:

  1. Logs a message when the Get Shutdown Status API detects that a shard is causing a node shutdown to stall. The log message isn't new; its level is raised from `trace` to `info`.
  2. Adds Allocation Explain output to the shard migration status portion of the node shutdown response when a shard is stalling the shutdown.

Example of the new Get Shutdown Status response:

```json
{
    "nodes": [
        {
            "node_id": "GszZYZ2bSZiLi5dBinWf8g",
            "type": "REMOVE",
            "reason": "idk my bff jill",
            "shutdown_startedmillis": 1633475221483,
            "status": "STALLED",
            "shard_migration": {
                "status": "STALLED",
                "shard_migrations_remaining": 1,
                "explanation": "shard [0] [primary] of index [my-index] cannot move",
                "node_allocation_decisions": {
                    "can_remain_on_current_node": "no",
                    "can_remain_decisions": [
                        {
                            "decider": "node_shutdown",
                            "decision": "NO",
                            "explanation": "node [GszZYZ2bSZiLi5dBinWf8g] is preparing to be removed from the cluster"
                        },
                        {
                            "decider": "filter",
                            "decision": "NO",
                            "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                        }
                    ],
                    "can_move_to_other_node": "no",
                    "move_explanation": "cannot move shard to another node, even though it is not allowed to remain on its current node",
                    "node_allocation_decisions": [
                        {
                            "node_id": "T6AnBS1mT2iXlD-McXSteg",
                            "node_name": "node-0",
                            "transport_address": "127.0.0.1:9301",
                            "node_attributes": {
                                "ml.machine_memory": "34359738368",
                                "ml.max_open_jobs": "512",
                                "xpack.installed": "true",
                                "ml.max_jvm_size": "17179869184"
                            },
                            "node_decision": "no",
                            "weight_ranking": 1,
                            "deciders": [
                                {
                                    "decider": "filter",
                                    "decision": "NO",
                                    "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                                }
                            ]
                        },
                        {
                            "node_id": "81mJIQ_XRrG2zK1ojhjLMg",
                            "node_name": "node-2",
                            "transport_address": "127.0.0.1:9300",
                            "node_attributes": {
                                "ml.machine_memory": "34359738368",
                                "ml.max_open_jobs": "512",
                                "xpack.installed": "true",
                                "ml.max_jvm_size": "17179869184"
                            },
                            "node_decision": "no",
                            "weight_ranking": 2,
                            "deciders": [
                                {
                                    "decider": "filter",
                                    "decision": "NO",
                                    "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                                }
                            ]
                        }
                    ]
                }
            },
            "persistent_tasks": {
                "status": "COMPLETE"
            },
            "plugins": {
                "status": "COMPLETE"
            }
        }
    ]
}
```
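For programmatic consumers, the stalled-shard details above can be read straight out of the Get Shutdown Status response without a separate Allocation Explain call. A minimal sketch in Python (not part of this PR; it assumes the response shape shown above, and the inline JSON is a trimmed version of that example):

```python
import json

def stalled_shard_deciders(response):
    """Collect (node, decider, explanation) for every NO decision blocking
    shard migration on a node shutdown whose migration status is STALLED."""
    blockers = []
    for node in response.get("nodes", []):
        migration = node.get("shard_migration", {})
        if migration.get("status") != "STALLED":
            continue
        decision = migration.get("node_allocation_decisions", {})
        # Decisions preventing the shard from remaining where it is:
        for d in decision.get("can_remain_decisions", []):
            blockers.append((node["node_id"], d["decider"], d["explanation"]))
        # Per-target-node decisions preventing it from moving elsewhere:
        for target in decision.get("node_allocation_decisions", []):
            for d in target.get("deciders", []):
                blockers.append((target["node_name"], d["decider"], d["explanation"]))
    return blockers

# Trimmed version of the response shown above.
resp = json.loads("""
{
  "nodes": [{
    "node_id": "GszZYZ2bSZiLi5dBinWf8g",
    "status": "STALLED",
    "shard_migration": {
      "status": "STALLED",
      "node_allocation_decisions": {
        "can_remain_decisions": [
          {"decider": "node_shutdown", "decision": "NO",
           "explanation": "node is preparing to be removed from the cluster"}
        ],
        "node_allocation_decisions": [
          {"node_name": "node-0",
           "deciders": [{"decider": "filter", "decision": "NO",
                         "explanation": "node does not match index filters"}]}
        ]
      }
    }
  }]
}
""")

for who, decider, why in stalled_shard_deciders(resp):
    print(f"[{who}] {decider}: {why}")
```

Note that the top-level `node_allocation_decisions` object was later renamed during review; see the discussion below.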

@gwbrown gwbrown requested a review from dakrone October 5, 2021 23:10
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Oct 5, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

```java
    Strings.toString(pair.v2())
);
}
logger.info(
```
@dakrone (Member) commented on the diff:

If we hit this every 10 seconds for 30 minutes, that's roughly 180 log lines. Are we okay with that amount of logging? It seems a little excessive to me. Perhaps it would be better to put this at DEBUG level, since we'll expose most of the info in the response now?

@dakrone (Member) commented Oct 6, 2021

Oh yeah, I wanted to leave a reminder for what we talked about regarding logging when a new shutdown is put in place. I think it would also be good to log when we remove/delete a node shutdown as well (so we can see in logs when it has been "completed")

@gwbrown (Contributor, Author) commented Oct 6, 2021

@dakrone Done, we now log STALLED at debug and PUT/DELETE node shutdown at info.
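Since the STALLED message now logs at `debug`, operators who still want it in their logs can raise the relevant logger's level dynamically via the cluster settings API. A hedged sketch (the logger name below is an illustrative guess at the shutdown package, not taken from this PR):

```json
PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.xpack.shutdown": "debug"
  }
}
```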

@gwbrown gwbrown requested a review from dakrone October 6, 2021 22:25
@dakrone (Member) left a review:
LGTM, I left two really minor comments

Comment on lines 84 to 89
```java
logger.info(
    "creating shutdown record for node [{}] of type [{}] with reason [{}]",
    request.getNodeId(),
    request.getType(),
    request.getReason()
);
```
@dakrone (Member):

Can you include all the info we have here? Maybe overriding `toString` in `SingleNodeShutdownMetadata` would work, so we can include the throttling and target node name if those are present.

@gwbrown (Contributor, Author):

Done - I did indeed delegate this to SingleNodeShutdownMetadata#toString.

```diff
@@ -61,6 +80,13 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
 if (Objects.nonNull(explanation)) {
     builder.field("explanation", explanation);
 }
 if (Objects.nonNull(allocationDecision)) {
     builder.startObject("node_allocation_decisions");
```
@dakrone (Member) commented on the diff:

Super minor nit, but since this is the top-level decision rather than a plain list (it definitely has multiple sub-parts, though), can this be `node_allocation_decision`?

@gwbrown (Contributor, Author):

Done and good catch!
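Because this top-level field was renamed after the example response above was written (and, per the backport notes further down, mixed-version clusters did exist), a defensive client might accept both spellings. A small illustrative sketch, not code from this PR:

```python
def top_level_allocation_decision(shard_migration):
    """Return the top-level allocation decision object from a shard_migration
    section, accepting both the final singular field name and the pre-rename
    plural one."""
    for key in ("node_allocation_decision", "node_allocation_decisions"):
        if key in shard_migration:
            return shard_migration[key]
    return None

# The same decision object under either field name resolves identically.
legacy = {"node_allocation_decisions": {"can_remain_on_current_node": "no"}}
current = {"node_allocation_decision": {"can_remain_on_current_node": "no"}}
print(top_level_allocation_decision(legacy) == top_level_allocation_decision(current))
```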

@gwbrown gwbrown added auto-backport-and-merge auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Oct 7, 2021
@gwbrown
Copy link
Contributor Author

gwbrown commented Oct 7, 2021

Opened #78862 for the failure.

@elasticmachine run elasticsearch-ci/part-2

@elasticsearchmachine elasticsearchmachine merged commit 65d2b09 into elastic:master Oct 8, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 8, 2021
* Increase node shutdown logging level when stalled

* Add allocation explanation to STALLED response

* Adjust logging levels per review

* Include all SingleNodeShutdownMetadata fields in logs

* Depluralize `node_shutdown_decisions` per review

* Repluralize node_allocation_decisions when we're actually reading from the Allocation Explain API
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 8, 2021
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branches: 7.x, 7.15

wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Oct 9, 2021
* master:
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  Add node REPLACE shutdown implementation (elastic#76247)
  Wrap VersionPropertiesLoader in a BuildService to decouple build logic projects (elastic#78704)
  Adjust /_cat/templates not to request all metadata (elastic#78829)
  [DOCS] Fixes ML get scheduled events API (elastic#78809)
  Enable exit on out of memory error (elastic#71542)

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 11, 2021
* upstream/master: (250 commits)
  [Transform] HLRC cleanups (elastic#78909)
  [ML] Make ML indices hidden when the node becomes master (elastic#77416)
  Introduce a Few Settings Singleton Instances (elastic#78897)
  Simplify TestCluster extraJar configuration (elastic#78837)
  Add @OverRide annotations to methods in EnrichPlugin class (elastic#76873)
  Add v7 restCompat for invalidating API key with the id field (elastic#78664)
  EQL: Refine repeatable queries (elastic#78895)
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/ingest/IngestService.java
#	server/src/test/java/org/elasticsearch/ingest/IngestServiceTests.java
@gwbrown gwbrown removed the v7.15.1 label Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 13, 2021
gwbrown added a commit that referenced this pull request Oct 13, 2021
gwbrown added a commit that referenced this pull request Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 13, 2021
elasticsearchmachine pushed a commit that referenced this pull request Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Dec 3, 2021
The version used to determine whether we should serialize the
allocation decisions as part of the Shutdown Status response
was not updated after elastic#78727 was merged. This causes very
confusing errors in mixed 8.x/7.16.x clusters.

Note: The version constant was updated in the backport to 7.16,
but not in 8.x. This is good, as at time of writing, 7.16 has
shipped, but 8.0 has not.
elasticsearchmachine pushed a commit that referenced this pull request Dec 3, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Dec 3, 2021
elasticsearchmachine pushed a commit that referenced this pull request Dec 4, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 11, 2022
Today if a node shutdown is stalled due to unmoveable shards then we say
to use the allocation explain API to find details. In fact, since elastic#78727
we include the allocation explanation in the response already so we
should tell users just to look at that instead. This commit adjusts the
message to address this.
elasticmachine pushed a commit to gmarouli/elasticsearch that referenced this pull request Aug 11, 2022
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2022
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2022

Backport of #89254
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 12, 2023
Reporting the `targetNodeName` was added to `main` in elastic#78727 but omitted
from the backport in elastic#78865. This commit adds the missing field to the
`toString()` response.
DaveCTurner added a commit that referenced this pull request Jun 12, 2023
Labels
- auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
- :Core/Infra/Node Lifecycle (Node startup, bootstrapping, and shutdown)
- >non-issue
- Team:Core/Infra (Meta label for core/infra team)
- v7.16.0
- v8.0.0-beta1
5 participants