
Improve Node Shutdown Observability #78727

Merged

Conversation

@gwbrown (Contributor) commented Oct 5, 2021

This PR does two things:

  1. Logs a message when the Get Shutdown Status API detects that a shard is causing a node shutdown to stall. The log message isn't new; its level is raised from `trace` to `info`.
  2. Adds Allocation Explain output to the shard migration status portion of the node shutdown response when a shard is stalling the shutdown.

Example of the new Get Shutdown Status response:

```json
{
    "nodes": [
        {
            "node_id": "GszZYZ2bSZiLi5dBinWf8g",
            "type": "REMOVE",
            "reason": "idk my bff jill",
            "shutdown_startedmillis": 1633475221483,
            "status": "STALLED",
            "shard_migration": {
                "status": "STALLED",
                "shard_migrations_remaining": 1,
                "explanation": "shard [0] [primary] of index [my-index] cannot move",
                "node_allocation_decisions": {
                    "can_remain_on_current_node": "no",
                    "can_remain_decisions": [
                        {
                            "decider": "node_shutdown",
                            "decision": "NO",
                            "explanation": "node [GszZYZ2bSZiLi5dBinWf8g] is preparing to be removed from the cluster"
                        },
                        {
                            "decider": "filter",
                            "decision": "NO",
                            "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                        }
                    ],
                    "can_move_to_other_node": "no",
                    "move_explanation": "cannot move shard to another node, even though it is not allowed to remain on its current node",
                    "node_allocation_decisions": [
                        {
                            "node_id": "T6AnBS1mT2iXlD-McXSteg",
                            "node_name": "node-0",
                            "transport_address": "127.0.0.1:9301",
                            "node_attributes": {
                                "ml.machine_memory": "34359738368",
                                "ml.max_open_jobs": "512",
                                "xpack.installed": "true",
                                "ml.max_jvm_size": "17179869184"
                            },
                            "node_decision": "no",
                            "weight_ranking": 1,
                            "deciders": [
                                {
                                    "decider": "filter",
                                    "decision": "NO",
                                    "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                                }
                            ]
                        },
                        {
                            "node_id": "81mJIQ_XRrG2zK1ojhjLMg",
                            "node_name": "node-2",
                            "transport_address": "127.0.0.1:9300",
                            "node_attributes": {
                                "ml.machine_memory": "34359738368",
                                "ml.max_open_jobs": "512",
                                "xpack.installed": "true",
                                "ml.max_jvm_size": "17179869184"
                            },
                            "node_decision": "no",
                            "weight_ranking": 2,
                            "deciders": [
                                {
                                    "decider": "filter",
                                    "decision": "NO",
                                    "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                                }
                            ]
                        }
                    ]
                }
            },
            "persistent_tasks": {
                "status": "COMPLETE"
            },
            "plugins": {
                "status": "COMPLETE"
            }
        }
    ]
}
```
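For programmatic consumers, the stalled-shard details above can be read straight out of the Get Shutdown Status response without a separate Allocation Explain call. A minimal sketch in Python (not part of this PR; it assumes the response shape shown above, and the inline JSON is a trimmed version of that example):

```python
import json

def stalled_shard_deciders(response):
    """Collect (node, decider, explanation) for every NO decision blocking
    shard migration on a node shutdown whose migration status is STALLED."""
    blockers = []
    for node in response.get("nodes", []):
        migration = node.get("shard_migration", {})
        if migration.get("status") != "STALLED":
            continue
        decision = migration.get("node_allocation_decisions", {})
        # Decisions preventing the shard from remaining where it is:
        for d in decision.get("can_remain_decisions", []):
            blockers.append((node["node_id"], d["decider"], d["explanation"]))
        # Per-target-node decisions preventing it from moving elsewhere:
        for target in decision.get("node_allocation_decisions", []):
            for d in target.get("deciders", []):
                blockers.append((target["node_name"], d["decider"], d["explanation"]))
    return blockers

# Trimmed version of the response shown above.
resp = json.loads("""
{
  "nodes": [{
    "node_id": "GszZYZ2bSZiLi5dBinWf8g",
    "status": "STALLED",
    "shard_migration": {
      "status": "STALLED",
      "node_allocation_decisions": {
        "can_remain_decisions": [
          {"decider": "node_shutdown", "decision": "NO",
           "explanation": "node is preparing to be removed from the cluster"}
        ],
        "node_allocation_decisions": [
          {"node_name": "node-0",
           "deciders": [{"decider": "filter", "decision": "NO",
                         "explanation": "node does not match index filters"}]}
        ]
      }
    }
  }]
}
""")

for who, decider, why in stalled_shard_deciders(resp):
    print(f"[{who}] {decider}: {why}")
```

Note that the top-level `node_allocation_decisions` object was later renamed during review; see the discussion below.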

@gwbrown gwbrown requested a review from dakrone October 5, 2021 23:10
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Oct 5, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

```java
    Strings.toString(pair.v2())
);
}
logger.info(
```
@dakrone (Member) commented on the diff:

If we hit this every 10 seconds for 30 minutes, that's roughly 180 log lines. Are we okay with that amount of logging? It seems a little excessive to me. Perhaps it would be better to put this at DEBUG level, since we'll expose most of the info in the response now?

@dakrone (Member) commented Oct 6, 2021

Oh yeah, I wanted to leave a reminder for what we talked about regarding logging when a new shutdown is put in place. I think it would also be good to log when we remove/delete a node shutdown as well (so we can see in logs when it has been "completed")

@gwbrown (Contributor, Author) commented Oct 6, 2021

@dakrone Done, we now log STALLED at debug and PUT/DELETE node shutdown at info.
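Since the STALLED message now logs at `debug`, operators who still want it in their logs can raise the relevant logger's level dynamically via the cluster settings API. A hedged sketch (the logger name below is an illustrative guess at the shutdown package, not taken from this PR):

```json
PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.xpack.shutdown": "debug"
  }
}
```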

@gwbrown gwbrown requested a review from dakrone October 6, 2021 22:25
@dakrone (Member) left a review:
LGTM, I left two really minor comments

Comment on lines 84 to 89
```java
logger.info(
    "creating shutdown record for node [{}] of type [{}] with reason [{}]",
    request.getNodeId(),
    request.getType(),
    request.getReason()
);
```
@dakrone (Member):

Can you include all the info we have here? Maybe overriding `toString` in `SingleNodeShutdownMetadata` would work, so we can include the throttling and target node name if those are present.

@gwbrown (Contributor, Author):

Done - I did indeed delegate this to SingleNodeShutdownMetadata#toString.

```diff
@@ -61,6 +80,13 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
 if (Objects.nonNull(explanation)) {
     builder.field("explanation", explanation);
 }
 if (Objects.nonNull(allocationDecision)) {
     builder.startObject("node_allocation_decisions");
```
@dakrone (Member) commented on the diff:

Super minor nit, but since this is the top-level decision rather than a plain list (it definitely has multiple sub-parts, though), can this be `node_allocation_decision`?

@gwbrown (Contributor, Author):

Done and good catch!
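Because this top-level field was renamed after the example response above was written (and, per the backport notes further down, mixed-version clusters did exist), a defensive client might accept both spellings. A small illustrative sketch, not code from this PR:

```python
def top_level_allocation_decision(shard_migration):
    """Return the top-level allocation decision object from a shard_migration
    section, accepting both the final singular field name and the pre-rename
    plural one."""
    for key in ("node_allocation_decision", "node_allocation_decisions"):
        if key in shard_migration:
            return shard_migration[key]
    return None

# The same decision object under either field name resolves identically.
legacy = {"node_allocation_decisions": {"can_remain_on_current_node": "no"}}
current = {"node_allocation_decision": {"can_remain_on_current_node": "no"}}
print(top_level_allocation_decision(legacy) == top_level_allocation_decision(current))
```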

@gwbrown gwbrown added auto-backport-and-merge auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Oct 7, 2021
@gwbrown
Copy link
Contributor Author

gwbrown commented Oct 7, 2021

Opened #78862 for the failure.

@elasticmachine run elasticsearch-ci/part-2

@elasticsearchmachine elasticsearchmachine merged commit 65d2b09 into elastic:master Oct 8, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 8, 2021
* Increase node shutdown logging level when stalled

* Add allocation explanation to STALLED response

* Adjust logging levels per review

* Include all SingleNodeShutdownMetadata fields in logs

* Depluralize `node_shutdown_decisions` per review

* Repluralize node_allocation_decisions when we're actually reading from the Allocation Explain API
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 8, 2021
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branches: 7.x, 7.15

wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Oct 9, 2021
* master:
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  Add node REPLACE shutdown implementation (elastic#76247)
  Wrap VersionPropertiesLoader in a BuildService to decouple build logic projects (elastic#78704)
  Adjust /_cat/templates not to request all metadata (elastic#78829)
  [DOCS] Fixes ML get scheduled events API (elastic#78809)
  Enable exit on out of memory error (elastic#71542)

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 11, 2021
* upstream/master: (250 commits)
  [Transform] HLRC cleanups (elastic#78909)
  [ML] Make ML indices hidden when the node becomes master (elastic#77416)
  Introduce a Few Settings Singleton Instances (elastic#78897)
  Simplify TestCluster extraJar configuration (elastic#78837)
  Add @OverRide annotations to methods in EnrichPlugin class (elastic#76873)
  Add v7 restCompat for invalidating API key with the id field (elastic#78664)
  EQL: Refine repeatable queries (elastic#78895)
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/ingest/IngestService.java
#	server/src/test/java/org/elasticsearch/ingest/IngestServiceTests.java
@gwbrown gwbrown removed the v7.15.1 label Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 13, 2021
gwbrown added a commit that referenced this pull request Oct 13, 2021
gwbrown added a commit that referenced this pull request Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Oct 13, 2021
elasticsearchmachine pushed a commit that referenced this pull request Oct 13, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Dec 3, 2021
The version used to determine whether we should serialize the
allocation decisions as part of the Shutdown Status response
was not updated after elastic#78727 was merged. This causes very
confusing errors in mixed 8.x/7.16.x clusters.

Note: The version constant was updated in the backport to 7.16,
but not in 8.x. This is good, as at time of writing, 7.16 has
shipped, but 8.0 has not.
elasticsearchmachine pushed a commit that referenced this pull request Dec 3, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Dec 3, 2021
elasticsearchmachine pushed a commit that referenced this pull request Dec 4, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 11, 2022
Today if a node shutdown is stalled due to unmoveable shards then we say
to use the allocation explain API to find details. In fact, since elastic#78727
we include the allocation explanation in the response already so we
should tell users just to look at that instead. This commit adjusts the
message to address this.
elasticmachine pushed a commit to gmarouli/elasticsearch that referenced this pull request Aug 11, 2022
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2022
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2022

Backport of #89254
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 12, 2023
Reporting the `targetNodeName` was added to `main` in elastic#78727 but omitted
from the backport in elastic#78865. This commit adds the missing field to the
`toString()` response.
DaveCTurner added a commit that referenced this pull request Jun 12, 2023
Labels
- auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
- :Core/Infra/Node Lifecycle (Node startup, bootstrapping, and shutdown)
- >non-issue
- Team:Core/Infra (Meta label for core/infra team)
- v7.16.0
- v8.0.0-beta1
5 participants