[stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds #8785
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Not solved.
I think this is related to this post. I'm trying solution 3, which I can try by myself. Do you know which config sets the ping timeout in
The documentation talks about
I set both. Any HTTP request to a node in the cluster only got a response after the logs below were written, and the outage was almost 2 minutes.
It seems
I tried many times after making sure all nodes have the ping_timeout config.
1m seems to be expected, since 3s * 20 tries = 1 minute. This is why I considered that option dangerous: to get a quick timeout, we would have to set this option to ~500ms (10-second timeout), which is dangerously low.
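For context, the zen fault-detection settings being discussed live in elasticsearch.yml. A minimal sketch with illustrative values only (not a tuning recommendation), assuming the 6.x defaults of ping_interval 1s, ping_timeout 30s and ping_retries 3:

```yaml
# elasticsearch.yml -- illustrative values, not a recommendation
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other (default 1s)
discovery.zen.fd.ping_timeout: 5s    # how long to wait for each ping (default 30s)
discovery.zen.fd.ping_retries: 3     # failed pings before a node is dropped (default 3)
discovery.zen.ping_timeout: 3s       # election ping timeout (default 3s), a separate setting
```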
Ahh, you're right. As master re-election is done in 3~5 seconds, I thought this was irrelevant to master election. Setting
I made a feature request: elastic/elasticsearch#36822
I wonder one thing: why do the other ES masters still try to ping the master that left (see my point 1/)?
The default value for
As per elastic/elasticsearch#36822 (comment).
Signed-off-by: Taehyun Kim <[email protected]>
I think I fixed this issue in #10687
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity. |
Running an Elasticsearch cluster v6.7.1 and having the same issue: after deleting the elasticsearch master pod, the cluster is stuck for approx. 65-70 seconds. Setting
@pavdmyt I recommend elastic's official helm chart. This issue is fixed there via elastic/helm-charts#119
@kimxogus Thanks for the reference! I should give it a try.
BUG REPORT
I have a funny issue with elasticsearch 6.4.2 when I delete the current master Pod.
The theory tells us there should be a brief downtime of about 3 to 5 seconds, but I get ~35 seconds of downtime.
If I manually kill the java process (`kubectl exec -it ... kill 1`), it causes a downtime of less than 5 seconds, which is expected.
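For anyone reproducing this, a rough sketch of the two shutdown paths being compared (pod and service names are hypothetical), with a simple health poll to measure how long requests hang:

```bash
# Path A: delete the pod -- SIGTERM, then the container and its IP disappear
kubectl delete pod es-master-2

# Path B: kill the java process inside the pod instead
kubectl exec -it es-master-2 -- kill 1

# In another terminal, poll cluster health every second to measure the outage
# (assumes a Service named "elasticsearch" listening on port 9200)
while true; do
  date +%T
  curl -s --max-time 2 "http://elasticsearch:9200/_cluster/health" | grep -o '"status":"[a-z]*"'
  sleep 1
done
```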
The reason is quite simple to understand but does not help to fix it:
1/ I ask for Pod termination
2/ SIGTERM is sent to java, causing the ES master to shut down properly and to send the proper signal to the other nodes
3/ The Docker container is stopped, and its IP is decommissioned from the network layer (Calico in my case)
4/ The new master election process starts
5/ The other nodes try to ping the dead master, but instead of getting "connection refused" as they would have in the "old" world outside of kubernetes/containers, they get nothing and time out after 30 seconds. This means no operation (not even reads) can be performed for 30 seconds.
When I kill the process instead of terminating the pod, steps 1 to 4 are the same but
5/ the other nodes try to ping the dead master, get "connection refused", remove the dead master from their state, and continue the election
Possible fixes are:
1/ Fix within elasticsearch: a master broadcasting that it is leaving a cluster should cause other nodes to remove it immediately instead of trying to ping it
2/ Wait for a few seconds AFTER java has been killed before terminating the container (and the network associated with it), so that other nodes can get "connection refused". This is what is done in https://github.com/mintel/es-image/pull/3/files, but the official ES image directly runs a script that `exec`s into the java process, so it looks like a workaround, not a proper fix (a sketch of such a wrapper is shown after this list).
3/ Reduce the timeout (dangerous?)
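For illustration, a minimal sketch of the wrapper idea in fix 2 (similar in spirit to the mintel/es-image change; the entrypoint path and the 5-second delay are assumptions, not the official image's script):

```bash
#!/usr/bin/env bash
# Wrapper entrypoint: keep this shell as PID 1 instead of exec-ing into java,
# so the container can linger for a moment after Elasticsearch exits.

/usr/local/bin/docker-entrypoint.sh eswrapper &   # start ES in the background
es_pid=$!

# On SIGTERM: forward it to Elasticsearch, wait for the JVM to exit, then
# sleep a few seconds before the container (and its pod IP) is torn down,
# so peers get "connection refused" instead of a 30s ping timeout.
on_term() {
  kill -TERM "$es_pid" 2>/dev/null
  wait "$es_pid"
  sleep 5
  exit 0
}
trap on_term TERM INT

wait "$es_pid"
```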
What do you think?
cc @simonswine @icereval @rendhalver @andrenarchy
Here is a log from another node seeing the current master "master-2" leave the cluster and then... block everything for 30 seconds (all the other nodes do the same), resulting in a complete cluster lock-down (even for reads).