Improve Node Shutdown Observability #78727
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
```java
        Strings.toString(pair.v2())
    );
}
logger.info(
```
If we hit this every 10 seconds for 30 minutes, that's roughly 180 log lines; are we okay with that amount of logging? That seems a little excessive to me. Perhaps it would be better to put this at DEBUG level, since we'll expose most of the info in the response now?
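As a rough sketch of what the suggested change amounts to (the class, method, and message below are placeholders for illustration, not the actual PR code):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Illustrative only: the stalled-shard message is emitted on every status
// poll (roughly every 10 seconds), so DEBUG keeps default logs quiet while
// the same detail is now returned in the API response itself.
class StalledShardLoggingSketch {
    private static final Logger logger = LogManager.getLogger(StalledShardLoggingSketch.class);

    void reportStalledShard(String nodeId, String shardDescription) {
        logger.debug("shutdown of node [{}] is stalled on shard [{}]", nodeId, shardDescription);
    }
}
```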
Oh yeah, I wanted to leave a reminder about what we talked about: logging when a new shutdown is put in place. I think it would also be good to log when we remove/delete a node shutdown (so we can see in the logs when it has been "completed").
@dakrone Done, we now log when a shutdown is registered and when it is removed.
LGTM, I left two really minor comments
```java
logger.info(
    "creating shutdown record for node [{}] of type [{}] with reason [{}]",
    request.getNodeId(),
    request.getType(),
    request.getReason()
);
```
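For the removal side, a minimal sketch of the matching log line (the exact wording and surrounding class in the PR may differ; this is an assumption):

```java
// Sketch: delete-path counterpart of the creation log above, so the logs
// show when a shutdown record is removed ("completed"). Names are assumed.
logger.info("removing shutdown record for node [{}]", request.getNodeId());
```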
Can you include all the info we have here? Maybe overriding toString in SingleNodeShutdownMetadata would work, so we can include the throttling and target node name if those are present.
Done - I did indeed delegate this to SingleNodeShutdownMetadata#toString.
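As a rough sketch of that delegation (assuming the class implements ToXContentObject, which lets `Strings.toString` render it as JSON; the class and fields below are illustrative stand-ins, not the verbatim PR code):

```java
import java.io.IOException;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.xcontent.ToXContentObject;
import org.elasticsearch.common.xcontent.XContentBuilder;

// Illustrative stand-in for SingleNodeShutdownMetadata: toString() delegates
// to the XContent serialization, so every field written in toXContent
// (reason, throttling, target node name, ...) shows up in log messages too.
class ShutdownMetadataSketch implements ToXContentObject {
    private final String nodeId = "node-1";      // sample values for the sketch
    private final String reason = "maintenance";

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject();
        builder.field("node_id", nodeId);
        builder.field("reason", reason);
        return builder.endObject();
    }

    @Override
    public String toString() {
        return Strings.toString(this); // renders the object above as JSON
    }
}
```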
```java
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    if (Objects.nonNull(explanation)) {
        builder.field("explanation", explanation);
    }
    if (Objects.nonNull(allocationDecision)) {
        builder.startObject("node_allocation_decisions");
```
Super minor nit, but since this is a single top-level decision rather than a plain list (it does have multiple sub-parts, though), can this be 'node_allocation_decision'?
Done and good catch!
Opened #78862 for the failure. @elasticmachine run elasticsearch-ci/part-2
* Increase node shutdown logging level when stalled
* Add allocation explanation to STALLED response
* Adjust logging levels per review
* Include all SingleNodeShutdownMetadata fields in logs
* Depluralize `node_shutdown_decisions` per review
* Repluralize node_allocation_decisions when we're actually reading from the Allocation Explain API
* master:
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  Add node REPLACE shutdown implementation (elastic#76247)
  Wrap VersionPropertiesLoader in a BuildService to decouple build logic projects (elastic#78704)
  Adjust /_cat/templates not to request all metadata (elastic#78829)
  [DOCS] Fixes ML get scheduled events API (elastic#78809)
  Enable exit on out of memory error (elastic#71542)

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
* upstream/master: (250 commits)
  [Transform] HLRC cleanups (elastic#78909)
  [ML] Make ML indices hidden when the node becomes master (elastic#77416)
  Introduce a Few Settings Singleton Instances (elastic#78897)
  Simplify TestCluster extraJar configuration (elastic#78837)
  Add @Override annotations to methods in EnrichPlugin class (elastic#76873)
  Add v7 restCompat for invalidating API key with the id field (elastic#78664)
  EQL: Refine repeatable queries (elastic#78895)
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/ingest/IngestService.java
#	server/src/test/java/org/elasticsearch/ingest/IngestServiceTests.java
This reverts commit 5784755.
This PR does two things:

1. Logs a message when the Get Shutdown Status API detects a shard causing a node shutdown to stall. The log message isn't new, just bumped from `trace` to `info`.
2. Adds Allocation Explain output to the shard migration status portion of the node shutdown response when a shard is stalling the shutdown.

Example of the new Get Shutdown Status response:

```json
{
  "nodes": [
    {
      "node_id": "GszZYZ2bSZiLi5dBinWf8g",
      "type": "REMOVE",
      "reason": "idk my bff jill",
      "shutdown_startedmillis": 1633475221483,
      "status": "STALLED",
      "shard_migration": {
        "status": "STALLED",
        "shard_migrations_remaining": 1,
        "explanation": "shard [0] [primary] of index [my-index] cannot move",
        "node_allocation_decisions": {
          "can_remain_on_current_node": "no",
          "can_remain_decisions": [
            {
              "decider": "node_shutdown",
              "decision": "NO",
              "explanation": "node [GszZYZ2bSZiLi5dBinWf8g] is preparing to be removed from the cluster"
            },
            {
              "decider": "filter",
              "decision": "NO",
              "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
            }
          ],
          "can_move_to_other_node": "no",
          "move_explanation": "cannot move shard to another node, even though it is not allowed to remain on its current node",
          "node_allocation_decisions": [
            {
              "node_id": "T6AnBS1mT2iXlD-McXSteg",
              "node_name": "node-0",
              "transport_address": "127.0.0.1:9301",
              "node_attributes": {
                "ml.machine_memory": "34359738368",
                "ml.max_open_jobs": "512",
                "xpack.installed": "true",
                "ml.max_jvm_size": "17179869184"
              },
              "node_decision": "no",
              "weight_ranking": 1,
              "deciders": [
                {
                  "decider": "filter",
                  "decision": "NO",
                  "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                }
              ]
            },
            {
              "node_id": "81mJIQ_XRrG2zK1ojhjLMg",
              "node_name": "node-2",
              "transport_address": "127.0.0.1:9300",
              "node_attributes": {
                "ml.machine_memory": "34359738368",
                "ml.max_open_jobs": "512",
                "xpack.installed": "true",
                "ml.max_jvm_size": "17179869184"
              },
              "node_decision": "no",
              "weight_ranking": 2,
              "deciders": [
                {
                  "decider": "filter",
                  "decision": "NO",
                  "explanation": "node does not match index setting [index.routing.allocation.require] filters [include._id:\"GszZYZ2bSZiLi5dBinWf8g\"]"
                }
              ]
            }
          ]
        }
      },
      "persistent_tasks": {
        "status": "COMPLETE"
      },
      "plugins": {
        "status": "COMPLETE"
      }
    }
  ]
}
```
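For reference, a minimal sketch of fetching this status with the low-level Java REST client; the host, port, and node id are placeholders, and this is an illustration rather than part of the PR:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch: calls GET _nodes/<node-id>/shutdown and prints the JSON body,
// which includes the shard_migration explanation shown above.
public class GetShutdownStatusExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Response response = client.performRequest(
                new Request("GET", "/_nodes/GszZYZ2bSZiLi5dBinWf8g/shutdown")
            );
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```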
The version used to determine whether we should serialize the allocation decisions as part of the Shutdown Status response was not updated after #78727 was merged. This causes very confusing errors in mixed 8.x/7.16.x clusters. Note: The version constant was updated in the backport to 7.16, but not in 8.x. This is good, as at time of writing, 7.16 has shipped, but 8.0 has not.
Today if a node shutdown is stalled due to unmoveable shards then we say to use the allocation explain API to find details. In fact, since #78727 we include the allocation explanation in the response already so we should tell users just to look at that instead. This commit adjusts the message to address this.
Today if a node shutdown is stalled due to unmoveable shards then we say to use the allocation explain API to find details. In fact, since #78727 we include the allocation explanation in the response already so we should tell users just to look at that instead. This commit adjusts the message to address this. Backport of #89254
Reporting the `targetNodeName` was added to `main` in elastic#78727 but omitted from the backport in elastic#78865. This commit adds the missing field to the `toString()` response.