Slow recovery of write availability after partition of a large cluster #28920
@elastic/es-distributed can you please take a look at whether this is still valid or whether it might have changed with 6.0?
I'd expect the recovery of a partitioned 10,000 shard cluster to be faster in 6.x, but not for the reasons described here. Please could you provide the logs from a decent proportion of the nodes (≥ 5 from each side of the partition, including a selection of both master and data nodes) so we can see what was actually happening for those 10 minutes. I can provide an email address if you struggle to attach them here. Also please describe the "optimisations" performed that reduced this from 10 minutes down to 1.
Also 5.3.1 is approaching a year old and missing a few fixes that might have an impact on this. Is it feasible to upgrade to a more recent version and retry your experiments? Ideally the latest possible version (currently 6.2.2) but it'd be useful to look at the latest 5.x version (currently 5.6.8) if that's not possible.
@ywelsch and I discussed this issue. We will work on an improvement which avoids failing the same shard multiple times.
@spinscale @DaveCTurner @dnhatn |
@djjsindy please use [email protected] and I will share with the rest of the team. I expect we will also want to summarise and discuss any analysis here, which might include things like index names, shard counts, and so on. If there is any information like this that you do not want shared in public then please say so in your email.
This change replaces the use of string concatenation with a call to String.join(). String concatenation might be quadratic, unless the compiler can optimise it away, whereas String.join() is more reliably linear. There can sometimes be a large number of pending ClusterState update tasks and elastic#28920 includes a report that this operation sometimes takes a long time.
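As an aside, here is a minimal standalone illustration (not the actual Elasticsearch code; the task names are invented) of why building the summary with `+=` can be quadratic while `String.join()` stays linear:

```java
import java.util.ArrayList;
import java.util.List;

public class PendingTaskSummary {
    public static void main(String[] args) {
        List<String> sources = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            sources.add("put-mapping [task-" + i + "]");
        }

        // Worst case quadratic: each += may copy everything built so far.
        String concatenated = "";
        for (String source : sources) {
            concatenated += ", " + source;
        }

        // Linear: the result is sized once and each element is copied once.
        String joined = String.join(", ", sources);

        System.out.println(concatenated.length() + " vs " + joined.length());
    }
}
```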
@djjsindy I received your email but it only contained a handful of log messages, showing nothing unexpected. I responded:
Copying this here in case you didn't get my response. Additionally, we just pushed 90bde12 (5.6), e0da114 (6.x), and 033a83b (master) which should make a difference to how fast the task summary is built.
Thanks for the logs. The issue you are facing is as follows. The master detects the failure of some of the nodes (but, crucially, not all of them) and publishes a cluster state update to remove them. When committing a cluster state update, each receiving node attempts to establish connections to all the nodes that are listed in the new cluster state, which includes all the failed nodes that the master has not yet detected as failed. Each such connection attempt times out after 30 seconds because of the network partition. In 5.3.x it looks like these attempts are made in sequence; #22984 (released in v5.4.0) improves this a little so they happen up to 5-at-a-time, but never to better than 30 seconds. Once all the connection attempts have timed out, the node can finish applying the cluster state update.

Meanwhile, the master node has detected the failure of the rest of the nodes and publishes the next update. Once the other nodes have finished failing to connect to all their disconnected peers and applied the first update, the second update seems to be applied reasonably quickly.

This explains why waiting before sending the first cluster state update improves the situation dramatically: if the first cluster state update is delayed for long enough to capture all of the failed nodes then its recipients do not waste any time trying to connect to their failed peers and can apply it reasonably quickly. However, this is unlikely to be the solution we choose: it means that Elasticsearch will block writes for the defined waiting time on the failure of just a single node. We need to think about this more deeply.
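To make the arithmetic concrete, here is a rough back-of-envelope sketch of the effect described above; the node count is an assumption for illustration, not taken from the logs:

```java
public class PartitionTimingEstimate {
    public static void main(String[] args) {
        int unreachableNodes = 60;       // assumption: roughly half of a 128-node cluster
        int connectTimeoutSeconds = 30;  // connect timeout mentioned above

        // 5.3.x behaviour: connection attempts made in sequence.
        int sequentialSeconds = unreachableNodes * connectTimeoutSeconds;

        // After #22984 (v5.4.0+): up to 5 attempts at a time, i.e. ceil(n / 5) rounds,
        // but never better than one 30-second round.
        int rounds = (unreachableNodes + 4) / 5;
        int parallelSeconds = Math.max(rounds * connectTimeoutSeconds, connectTimeoutSeconds);

        System.out.printf("sequential: ~%d s, 5-at-a-time: ~%d s%n",
                sequentialSeconds, parallelSeconds);
    }
}
```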
@DaveCTurner About write blocking: my data-sync process keeps retrying when it encounters an error, so the time from the start of the write block until writes can be performed normally again is even longer, because it also includes the retry time. My opinion:
Could you check whether my opinion is feasible? If my suggestion is feasible, I will try to create a pull request.
It's not clear that the shard-failed events have anything to do with this. As far as I can tell, it's just about the node disconnections being split across multiple updates.

However, I don't like the idea of trying to get all the node disconnection events to occur at the same time. It might be possible in the kind of clean partition you are simulating, but it would leave us open to the same kind of problem in more complicated scenarios. Fundamentally, there are no natural events triggered in the kind of network partition you are simulating, so we must rely on timeouts to detect node disconnection, and I think anything involving timeouts is going to have pathological behaviours similar to the one we're trying to avoid.

I think I would prefer better handling of node disconnections that are split across multiple updates instead. If applying a new cluster state did not try and synchronously connect to all the nodes listed in the new cluster state then we would be able to move onto subsequent cluster states much more quickly, removing further batches of failed nodes as their failures are detected. This sounds nontrivial to achieve, for at least two reasons:
I'm raising this for discussion with the wider team, as it'd be good to get some more ideas.
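For what it's worth, a very rough sketch of the "do not connect synchronously" idea mentioned above, using hypothetical names rather than the real cluster-state applier API:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch only; this is not Elasticsearch's actual ClusterApplierService.
class AsyncConnectOnApply {
    private final ExecutorService connectPool = Executors.newFixedThreadPool(5);

    void applyClusterState(List<String> nodesInNewState, Runnable applyState) {
        // Kick off (or verify) connections in the background instead of blocking on them.
        for (String node : nodesInNewState) {
            CompletableFuture.runAsync(() -> connectWithTimeout(node), connectPool);
        }
        // Apply the new state immediately, so a later state that removes more failed
        // nodes is not stuck behind 30-second connection timeouts to dead peers.
        applyState.run();
    }

    private void connectWithTimeout(String node) {
        // placeholder for a transport-level connect attempt with a 30s timeout
    }
}
```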
@DaveCTurner
You're welcome @djjsindy. Thank you in turn for your help in digging into the issue. Note that although we opened those issues, no work on them is currently scheduled so we have marked them with the
Sure. Something like a single-node failure a short time before a whole-rack failure would be troublesome: no matter how long you wait after the single-node failure, there's always a chance that you'd decide to proceed with the cluster-state update to remove it at exactly the wrong moment, ending up in the very situation we were trying to avoid. It'd be less likely, but in a sense that makes it worse: it'd be much more of a struggle to reproduce and diagnose it. I'm closing this issue as there's no further action required here.
Recently there have been a couple of threads on the discussion forums that look closely related to this: |
We have a very large cluster of 128 nodes with a large number of indices: about 20,000 shards in total, of which 10,000 are primaries and the rest are replicas. Primaries and replicas are located in different racks, and writes are always in progress. In a network-partition scenario, write operations are blocked because they have to wait for the replica-shard-failed cluster state to be committed, and write availability can take longer than about 10 minutes to recover.
In my opinion, the slow recovery of writes is affected by the following three factors:
In my scenario I tried to optimise based on the factors mentioned above, and the write recovery time dropped from 10 minutes to less than 1 minute, so it seems to work.
Please take a look at whether these three factors can be improved.
Elasticsearch version (bin/elasticsearch --version): 5.3.1
Plugins installed: []
JVM version (java -version): 1.8.0_112
OS version (uname -a if on a Unix-like system): 2.6.32-220.23.2.xxxxx.el6.x86_64
Description of the problem including expected versus actual behavior:
Writes are always in progress. In a network-partition scenario, write operations are blocked because they have to wait for the shard-failed cluster state to be committed, and write availability takes longer than about 10 minutes to recover.
Expected behavior: write availability recovers more quickly.
Steps to reproduce:
Provide logs (if relevant):