Track network metrics between nodes #19335
Comments
Discussed in FixItFriday. Agreed that at least some of these metrics would be good to have, but adding them would be time-consuming and tedious. Nice to have, but maybe not worth the effort? I'll mark it as adoptme and high hanging fruit.
A simpler way to get started here might be to log warnings on the node (similar to slow logs). If pinging takes longer than a (user-definable) threshold, we could for example log a warning. The same could apply to slow shard transfer rates, etc.
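As a rough illustration of the threshold idea in the comment above, here is a minimal Python sketch. It is not Elasticsearch code: the function name `check_ping`, the logger name, the message format, and the 500 ms default are all assumptions made for the example; the thread only says the threshold should be user-definable.

```python
import logging

logger = logging.getLogger("slow_ping")

# Hypothetical default; the proposal only states that the threshold
# should be user-definable, not what its value should be.
DEFAULT_THRESHOLD_MS = 500.0


def check_ping(node_id, rtt_ms, threshold_ms=DEFAULT_THRESHOLD_MS):
    """Log a warning and return True if a ping exceeded the threshold."""
    if rtt_ms > threshold_ms:
        logger.warning(
            "slow ping to node [%s]: took [%.1fms], threshold [%.1fms]",
            node_id, rtt_ms, threshold_ms,
        )
        return True
    return False
```

A caller would invoke `check_ping("node-2", measured_rtt_ms)` after each ping round trip; the same pattern could be reused for shard transfer rates with a lower-bound threshold instead of an upper-bound one.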
+1 @ywelsch
This issue has been open for a while, but not a lot has happened with it. I will close it for now because it is a high hanging fruit and there are currently no plans to work on this improvement. The alternative approach that @ywelsch suggested is also easier to get started with. Please feel free to leave feedback on the proposal (including +1s).
Typically Elasticsearch doesn't work well in cross-datacentre architectures, but how can you define that? As long as there is a reliable and ample network connection between two sites, why not? If Elasticsearch had insight into the reliability of its connections to the other nodes in the cluster, this could serve as a vital cluster health metric.
To know that, it would be great if each ES node could track metrics such as shard transfer rate, ping time, packet loss, and relative uptime against any/all other known nodes. Tracking minimum_masters stable time from each node's perspective would be useful too.
The results of the metrics could be used in diagnosing or indicating stability problems due to network issues. The availability metrics would be skewed by node restarts, etc, but it would still be highly useful. The transfer rate data would always be consistent.
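To make the proposal more concrete, the following Python sketch shows one way a node might accumulate per-peer statistics. It is an illustration only, not an Elasticsearch API: the `PeerStats` class and its field names are hypothetical, and unanswered pings are used here as a stand-in for packet loss, which is an assumption rather than something the issue specifies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PeerStats:
    """Counters one node keeps about one other node (hypothetical layout)."""
    pings_sent: int = 0
    pings_lost: int = 0
    rtt_total_ms: float = 0.0
    bytes_transferred: int = 0
    transfer_seconds: float = 0.0

    def record_ping(self, rtt_ms=None):
        # rtt_ms=None means the ping went unanswered (assumed proxy for loss).
        self.pings_sent += 1
        if rtt_ms is None:
            self.pings_lost += 1
        else:
            self.rtt_total_ms += rtt_ms

    def record_transfer(self, num_bytes, seconds):
        self.bytes_transferred += num_bytes
        self.transfer_seconds += seconds

    @property
    def mean_rtt_ms(self):
        answered = self.pings_sent - self.pings_lost
        return self.rtt_total_ms / answered if answered else None

    @property
    def loss_ratio(self):
        return self.pings_lost / self.pings_sent if self.pings_sent else 0.0

    @property
    def transfer_rate_bytes_per_s(self):
        if not self.transfer_seconds:
            return None
        return self.bytes_transferred / self.transfer_seconds


# One PeerStats entry per known peer, keyed by node id.
peers = defaultdict(PeerStats)
```

Each node would update `peers[node_id]` as pings and shard transfers complete, and the derived properties could then be exposed alongside other node statistics. Keeping raw counters rather than precomputed averages makes it cheap to reset or window the figures after a node restart.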