Slow recovery of write availability after partition of a large cluster #28920

Closed
djjsindy opened this issue Mar 7, 2018 · 14 comments
Labels
:Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source), v5.3.1, v5.6.9, v6.3.0, v7.0.0-beta1

Comments

@djjsindy

djjsindy commented Mar 7, 2018

We have a very large cluster of 128 nodes with many indices: about 20,000 shards in total, 10,000 of them primaries and 10,000 replicas. Primaries and replicas are located in different racks, and writes are always in flight. In a network-partition scenario the write operations are blocked because they have to wait for the cluster state that fails the replica shards to be committed, and write availability takes more than about 10 minutes to recover.

In my opinion, the slow recovery of writes is caused by the following three factors:

  1. Disconnect detection is independent on each node. In this network-partition scenario 64 nodes disconnect, but because of the cluster state batching mechanism the first cluster state update only removes the first disconnected node. That update's prepare and commit phases are bound to time out, because the update is sent to a node set that still includes the remaining 63 disconnected nodes.
  2. Too many shard-failed tasks make the task summary's toString very slow: with 20,000 failed shards, computing the task summary takes about 15 seconds.
  3. Shard-failed requests for the same shard, the same primary term, and the same allocation ID are not de-duplicated, because ShardEntry does not override equals and hashCode (see the sketch after this list).
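
To illustrate point 3, here is a minimal sketch of the de-duplication I have in mind, using a simplified stand-in for ShardEntry (the class and field names below are illustrative, not the actual Elasticsearch internals):

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Simplified stand-in for a shard-failed task entry; not the real ShardEntry.
final class ShardFailedEntry {
    final String shardId;
    final long primaryTerm;
    final String allocationId;

    ShardFailedEntry(String shardId, long primaryTerm, String allocationId) {
        this.shardId = shardId;
        this.primaryTerm = primaryTerm;
        this.allocationId = allocationId;
    }

    // With equals/hashCode defined on (shardId, primaryTerm, allocationId),
    // duplicate shard-failed requests collapse into a single pending task.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ShardFailedEntry)) return false;
        ShardFailedEntry other = (ShardFailedEntry) o;
        return primaryTerm == other.primaryTerm
                && shardId.equals(other.shardId)
                && allocationId.equals(other.allocationId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(shardId, primaryTerm, allocationId);
    }

    public static void main(String[] args) {
        Set<ShardFailedEntry> pending = new LinkedHashSet<>();
        pending.add(new ShardFailedEntry("[index-1][0]", 3, "alloc-abc"));
        pending.add(new ShardFailedEntry("[index-1][0]", 3, "alloc-abc")); // duplicate, dropped
        System.out.println(pending.size()); // prints 1
    }
}
```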

Based on the three factors above I tried some optimizations in my environment. Write recovery time dropped from 10 minutes to less than 1 minute, so they appear to work.

Could you please take a look at whether these three factors can be improved?

Elasticsearch version (bin/elasticsearch --version):
5.3.1
Plugins installed: []

JVM version (java -version):
1.8.0_112
OS version (uname -a if on a Unix-like system):
2.6.32-220.23.2.xxxxx.el6.x86_64
Description of the problem including expected versus actual behavior:
Writes are always in flight. In a network-partition scenario the write operations are blocked because they have to wait for the shard-failed cluster state to be committed, and write availability takes more than about 10 minutes to recover.
Expected behavior: write availability recovers more quickly.
Steps to reproduce:

my config:
cluster.routing.allocation.awareness.attributes: rack_id
node.attr.rack_id: xxxxx
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts:
     - xxxxx:9300
     - xxxxx:9300
     - xxxxx:9300
cluster.routing.allocation.awareness.force.rack_id: xxxx

network partition operation:
sudo iptables -D INPUT 1 ;sudo iptables -D OUTPUT 1 ;sudo iptables -L -n

Provide logs (if relevant):

@spinscale spinscale added the :Distributed Indexing/Recovery label Mar 7, 2018
@spinscale
Contributor

@elastic/es-distributed can you please take a look at whether this is still valid or might have changed with 6.0?

@DaveCTurner
Contributor

I'd expect the recovery of a partitioned 10,000 shard cluster to be faster in 6.x, but not for the reasons described here. Please could you provide the logs from a decent proportion of the nodes (≥ 5 from each side of the partition, including a selection of both master and data nodes) so we can see what was actually happening for those 10 minutes. I can provide an email address if you struggle to attach them here.

Also please describe the "optimisations" performed that reduced this from 10 minutes down to 1.

@DaveCTurner
Contributor

Also 5.3.1 is approaching a year old and missing a few fixes that might have an impact on this. Is it feasible to upgrade to a more recent version and retry your experiments? Ideally the latest possible version (currently 6.2.2) but it'd be useful to look at the latest 5.x version (currently 5.6.8) if that's not possible.

@dnhatn
Member

dnhatn commented Mar 7, 2018

Shard-failed requests for the same shard, the same primary term, and the same allocation ID are not de-duplicated, because ShardEntry does not override equals and hashCode.

@ywelsch and I discussed this issue. We will work on an improvement which avoids failing the same shard multiple times.

@djjsindy
Author

djjsindy commented Mar 8, 2018

@spinscale @DaveCTurner @dnhatn
Could you provide an email address?
I will send you some logs and a detailed analysis.

@DaveCTurner
Contributor

@djjsindy please use [email protected] and I will share with the rest of the team. I expect we will also want to summarise and discuss any analysis here, which might include things like index names, shard counts, and so on. If there is any information like this that you want not to be shared in public then please indicate as such in your email.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 8, 2018
This change replaces the use of string concatenation with a call to
String.join(). String concatenation might be quadratic, unless the compiler can
optimise it away, whereas String.join() is more reliably linear. There can
sometimes be a large number of pending ClusterState update tasks and elastic#28920
includes a report that this operation sometimes takes a long time.
DaveCTurner added a commit that referenced this issue Mar 9, 2018
DaveCTurner added a commit that referenced this issue Mar 9, 2018
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 9, 2018
@DaveCTurner
Contributor

@djjsindy I received your email but it only contained a handful of log messages, showing nothing unexpected. I responded:

Please could we have copies of full logs for the full duration of the outage so we can see what was taking so long to recover. The lines that you quote in your email show nothing unexpectedly slow, but only cover the very start of the outage. Ideally logs from all of the master nodes and some of the data nodes. I note that you have TRACE logging enabled for the ClusterService which is great - the logs will be quite large but this extra detail will help us a lot.

Copying this here in case you didn't get my response.

Additionally, we just pushed 90bde12 (5.6), e0da114 (6.x), and 033a83b (master) which should make a difference to how long the task summary toString() method takes. If you're able, please can you try this change out and let us know if it helps?
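
For context, a minimal illustration of the difference those commits address (a generic sketch, not the actual Elasticsearch code):

```java
import java.util.ArrayList;
import java.util.List;

public class TaskSummarySketch {
    public static void main(String[] args) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            tasks.add("shard-failed [index-" + i + "][0]");
        }

        // Repeated concatenation copies the growing string on every iteration,
        // so the total work grows quadratically with the number of tasks
        // (this loop takes noticeable time for 20,000 entries).
        String summary = "";
        for (String task : tasks) {
            summary = summary.isEmpty() ? task : summary + ", " + task;
        }

        // String.join builds the result in a single pass, so it stays linear
        // and completes almost instantly.
        String joined = String.join(", ", tasks);

        System.out.println(summary.equals(joined)); // prints true
    }
}
```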

sebasjm pushed a commit to sebasjm/elasticsearch that referenced this issue Mar 10, 2018
@DaveCTurner
Contributor

Thanks for the logs. The issue you are facing is as follows. The master detects the failure of some of the nodes (but, crucially, not all of them) and publishes a cluster state update to remove them. When committing a cluster state update, each receiving node attempts to establish connections to all the nodes that are listed in the new cluster state, which includes all the failed nodes that the master has not yet detected as failed. Each such connection attempt times out after 30 seconds because of the network partition. In 5.3.x it looks like these attempts are made in sequence; #22984 (released in v5.4.0) improves this a little so they happen up to 5-at-a-time, but never to better than 30 seconds. Once all the connection attempts have timed out, the node can finish applying the cluster state update.

Meanwhile, the master node has detected the failure of the rest of the nodes and publishes the next update. Once the other nodes have finished failing to connect to all their disconnected peers and applied the first update, the second update seems to be applied reasonably quickly.

This explains why waiting before sending the first cluster state update improves the situation dramatically: if the first cluster state update is delayed for long enough to capture all of the failed nodes then its recipients do not waste any time trying to connect to their failed peers and can apply it reasonably quickly.

However, this is unlikely to be the solution we choose: it means that Elasticsearch will block writes for the defined waiting time on the failure of just a single node. We need to think about this more deeply.
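
To put rough numbers on this, here is a back-of-the-envelope sketch: the 30-second timeout and the 5-connection concurrency come from the analysis above, and everything else is simplified.

```java
public class ReconnectMath {
    public static void main(String[] args) {
        int failedNodes = 63;     // peers on the other side of the partition
        int timeoutSeconds = 30;  // per-connection attempt timeout

        // 5.3.x: connection attempts appear to be made in sequence.
        int sequentialSeconds = failedNodes * timeoutSeconds;

        // 5.4.0+ (#22984): up to 5 attempts in flight at a time.
        int concurrency = 5;
        int batchedSeconds = (int) Math.ceil((double) failedNodes / concurrency) * timeoutSeconds;

        System.out.println("sequential:  " + sequentialSeconds + "s"); // 1890s
        System.out.println("5-at-a-time: " + batchedSeconds + "s");    // 390s
        // Either way the node is stuck applying the first cluster state for
        // minutes, and can never do better than one full 30s timeout.
    }
}
```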

@djjsindy
Author

@DaveCTurner
Indeed, it is as you analyzed: version 5.4 uses 5 management threads to create new connections. When the number of nodes in the cluster is very large this concurrency is not enough, so a data node takes a very long time to process a cluster state, which in turn delays processing of the next cluster state.

About write blocking: my data-sync process keeps retrying when it hits an error, so the time from the start of write blocking until writes can proceed normally is even longer, because it includes the time spent on write retries.

My opinion:

  • During a network partition, the master should handle the node-disconnected events before the shard-failed events. Since the node disconnection events occur at (roughly) the same time, they can all be handled in a single batch.

Could you tell me whether this is feasible? If it is, I will try to create a pull request.

@DaveCTurner
Contributor

During a network partition, the master should handle the node-disconnected events before the shard-failed events. Since the node disconnection events occur at (roughly) the same time, they can all be handled in a single batch.

It's not clear that the shard-failed events have anything to do with this. As far as I can tell, it's just about the node disconnections being split across multiple updates.

However, I don't like the idea of trying to get all the node disconnection events to occur at the same time. It might be possible in the kind of clean partition you are simulating, but it would leave us open to the same kind of problem in more complicated scenarios. Fundamentally, there are no natural events triggered in the kind of network partition you are simulating, so we must rely on timeouts to detect node disconnection, and I think anything involving timeouts is going to have pathological behaviours similar to the one we're trying to avoid.

I think I would prefer better handling of node disconnections that are split across multiple updates instead. If applying a new cluster state did not try and synchronously connect to all the nodes listed in the new cluster state then we would be able to move onto subsequent cluster states much more quickly, removing further batches of failed nodes as their failures are detected.

This sounds nontrivial to achieve, for at least two reasons:

  1. Connecting is a blocking operation at quite a low level in Elasticsearch, but I think it'd need to be made asynchronous so it doesn't consume a whole thread for each node, and so it can be cancelled safely.

  2. Fully exposing the applied cluster state before all the nodes are connected is racy: what happens if we try and talk to a node in a just-applied cluster state but we're still establishing a connection? Throwing an error is bad because we'll end up trying to remove that node from the cluster again; blocking until the connection is established is also bad because in the situation we're looking at here everything will get stuck until the connections time out; queueing the request up just defers the problem until the queue's full.

I'm raising this for discussion with the wider team, as it'd be good to get some more ideas.
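
As a very rough illustration of the asynchronous, cancellable connection idea in point 1, here is a generic Java sketch using CompletableFuture; the connectAsync method and everything else in it are made up for illustration and are not the Elasticsearch transport API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class AsyncConnectSketch {

    // Hypothetical non-blocking connect: the returned future completes when the
    // connection is established, or fails after the timeout, without holding a thread.
    static CompletableFuture<Void> connectAsync(String node) {
        CompletableFuture<Void> connected = new CompletableFuture<>();
        // A real implementation would complete this future from a non-blocking
        // channel; here it is simply left incomplete to simulate a partitioned node.
        return connected.orTimeout(30, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        List<String> disconnectedPeers = List.of("node-a", "node-b", "node-c");

        // Kick off all connection attempts without blocking the cluster state applier.
        List<CompletableFuture<Void>> attempts = disconnectedPeers.stream()
                .map(AsyncConnectSketch::connectAsync)
                .collect(Collectors.toList());

        System.out.println("cluster state applied; connection attempts in flight: " + attempts.size());

        // When a newer cluster state removes these nodes, the pending attempts
        // can simply be cancelled instead of being waited for.
        attempts.forEach(attempt -> attempt.cancel(true));
    }
}
```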

@DaveCTurner DaveCTurner changed the title recovery write operation slow in network partition Slow recovery of write availability after partition of a large cluster Mar 13, 2018
@DaveCTurner
Contributor

We had a good discussion on this subject, and came up with three clear ideas that should help improve Elasticsearch's behaviour in this situation: #29022, #29023 and #29025.

@djjsindy
Author

@DaveCTurner
Thank you for your discussion. We discussed it last week and very much appreciate this general solution.

  1. A clean partition like this is realistic for us. Our business is very large and our applications are often deployed across multiple racks and multiple cities to provide rack-level or city-level disaster recovery. In cases such as a rack outage or a network disconnection, the speed at which our applications recover is very important.
  2. "I don't like the idea of trying to get all the node disconnection events to occur at the same time. It might be possible in the kind of clean partition you are simulating, but it would leave us open to the same kind of problem in more complicated scenarios" -- can you elaborate on these more complicated scenarios?

@DaveCTurner
Contributor

You're welcome @djjsindy. Thank you in turn for your help in digging into the issue.

Note that although we opened those issues, no work on them is currently scheduled so we have marked them with the adoptme label. PRs are welcome.

" I don't like the idea of trying to get all the node disconnection events to occur at the same time. It might be possible in the kind of clean partition you are simulating, but it would leave us open to the same kind of problem in more complicated scenarios" -- Can you elaborate on this complex scenarios?

Sure. Something like a single-node failure a short time before a whole-rack failure would be troublesome: no matter how long you wait after the single-node failure, there's always a chance that you'd decide to proceed with the cluster-state update to remove it at exactly the wrong moment, ending up in the very situation we were trying to avoid. It'd be less likely, but in a sense that makes it worse: it'd be much more of a struggle to reproduce and diagnose it.

I'm closing this issue as there's no further action required here.

@DaveCTurner
Contributor

Recently there have been a couple of threads on the discussion forums that look closely related to this:
