
Cluster hangs after setting "cluster.routing.allocation.node_concurrent_recoveries" to 100 #36195

Closed
howardhuanghua opened this issue Dec 4, 2018 · 6 comments

howardhuanghua (Contributor) commented Dec 4, 2018

Elasticsearch version: 6.4.3/6.5.1

JVM version: 1.8.0.181

OS version: CentOS 7.4

Description of the problem including expected versus actual behavior:
Production environment: 15 nodes, 2700+ indices, 15000+ shards.
The cluster hangs after setting "cluster.routing.allocation.node_concurrent_recoveries" to 100.

Steps to reproduce:

  1. Set up a three-node cluster, 1 core and 2 GB per node.
  2. Create 300 indices (3000 shards), each index with 100 documents.
  3. Set these cluster settings dynamically (a Java client sketch of the same call follows the list):
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": 100,
        "indices.recovery.max_bytes_per_sec": "400mb"
      }
    }'
  4. Stop one of the nodes and remove the data from its data path.
  5. Start the stopped node again.
  6. After a while, the cluster hangs.
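
For reference, the settings update in step 3 can also be issued from Java. This is a minimal sketch using the 6.x high-level REST client; the class name and the localhost:9200 endpoint are assumptions for illustration, not part of the original report.

import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class ApplyRecoverySettings {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Same persistent settings as the curl command in step 3.
            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.persistentSettings(Settings.builder()
                    .put("cluster.routing.allocation.node_concurrent_recoveries", 100)
                    .put("indices.recovery.max_bytes_per_sec", "400mb")
                    .build());
            client.cluster().putSettings(request, RequestOptions.DEFAULT);
        }
    }
}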

We could see that each node's generic thread pool had all 128 threads in use, i.e. it was full:
[c_log@VM_128_27_centos ~/elasticsearch-6.4.3/bin]$ curl localhost:9200/_cat/thread_pool/generic?v
node_name name active queue rejected
node-3 generic 128 949 0
node-2 generic 128 1093 0
node-1 generic 128 1076 0

Lots of peer recoveries are waiting:
[screenshot of pending peer recoveries omitted]

jstack output for a hung node; all generic threads are waiting on txGet:
"elasticsearch[node-3][generic][T#128]" #179 daemon prio=5 os_prio=0 tid=0x00007fa8980c8800 nid=0x3cb9 waiting on condition [0x00007fa86ca0a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000000fbef56f0> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:251)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:44)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:32)
        at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.receiveFileInfo(RemoteRecoveryTargetHandler.java:133)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase1$6(RecoverySourceHandler.java:387)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3071/1370938617.run(Unknown Source)
        at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
        at org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:386)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104)
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

So the cluster appears to be stuck in a distributed deadlock.

Thanks,
Howard

howardhuanghua (Contributor, Author) commented:

As a workaround, we have patched the _cluster/settings REST-level API to reject values of "cluster.routing.allocation.node_concurrent_recoveries" above 50.

romseygeek added the :Distributed Indexing/Recovery label Dec 5, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

Ethan-Zhang commented:

Shall we use org.elasticsearch.transport.PlainTransportFuture#txGet(long timeout, TimeUnit unit) instead of txGet() to avoid the deadlock?
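
For illustration only, here is the difference between an unbounded and a bounded wait on a future, shown with plain java.util.concurrent rather than Elasticsearch's PlainTransportFuture (the sleep and timeout values are arbitrary). A timeout would turn the hang into a failed recovery that must be retried; it would not remove the underlying blocking, which is why the eventual fix makes the code path asynchronous instead.

import java.util.concurrent.*;

public class BoundedWaitDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a transport reply that is slow (or never) coming back.
        Future<String> reply = pool.submit(() -> {
            Thread.sleep(5_000);
            return "file info received";
        });
        try {
            // Unbounded wait, analogous to txGet(): parks the calling thread indefinitely.
            // reply.get();

            // Bounded wait, analogous to txGet(timeout, unit): gives up after one second.
            System.out.println(reply.get(1, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("gave up waiting; the recovery would have to be failed and retried");
        } finally {
            pool.shutdownNow();
        }
    }
}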

ywelsch added the >bug label Dec 11, 2018
ywelsch (Contributor) commented Dec 11, 2018

@howardhuanghua thanks for reporting this. This is indeed an issue if the number of concurrent recoveries from a node is higher than the maximum size of the GENERIC thread pool (which is some value >= 128, depending on the number of processors). That said, you typically should not have so many shards per node, and allowing such a high number of node_concurrent_recoveries will also not play well with other parts of the system (e.g. the shard balancer). Fixing this will require making this code asynchronous, which is not a small thing to do. In the meantime, we can think about adding a soft limit to the node_concurrent_recoveries setting.
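
The deadlock mechanism can be illustrated outside Elasticsearch: on a bounded pool, tasks that park their thread waiting for work that can only run on the same, already-full pool never complete. A minimal, self-contained sketch (the pool size and task count are arbitrary; this is not Elasticsearch code, where the wait additionally spans two nodes):

import java.util.concurrent.*;

public class PoolDeadlockDemo {
    public static void main(String[] args) {
        // A tiny pool standing in for the GENERIC pool (max 128 in the report).
        ExecutorService generic = Executors.newFixedThreadPool(2);

        // Submit more "recoveries" than there are threads. Each one blocks on a
        // follow-up task that needs a thread from the same pool, so once every
        // thread is parked in get(), nothing can make progress again.
        for (int i = 0; i < 3; i++) {
            final int id = i;
            generic.submit(() -> {
                Future<String> reply = generic.submit(() -> "reply " + id);
                try {
                    return reply.get(); // parks a pool thread, like txGet()
                } catch (InterruptedException | ExecutionException e) {
                    return "failed";
                }
            });
        }
        // This program never terminates. In the real cluster the parked thread is on
        // the source node and the reply needs a generic thread on the target node
        // (and vice versa), hence a "distributed" deadlock.
    }
}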

howardhuanghua (Contributor, Author) commented:

@ywelsch, thanks for your comment. Currently we limit the node_concurrent_recoveries setting to at most 50 in our production environment (based on 6.4.3), as follows.

In RestClusterUpdateSettingsAction.java, prepareRequest method:

        Settings settings = EMPTY_SETTINGS;
        if (source.containsKey(TRANSIENT)) {
            clusterUpdateSettingsRequest.transientSettings((Map) source.get(TRANSIENT));
            settings = clusterUpdateSettingsRequest.transientSettings();
        }
        if (source.containsKey(PERSISTENT)) {
            clusterUpdateSettingsRequest.persistentSettings((Map) source.get(PERSISTENT));
            settings = clusterUpdateSettingsRequest.persistentSettings();
        }
        
        // Limit node concurrent recoveries: if incoming + outgoing recoveries use up the generic thread pool, the cluster can hang.
        if (settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES
                || settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_INCOMING_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES
                || settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_OUTGOING_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES) {
            throw new IllegalArgumentException("Can't set node concurrent recoveries greater than " + MAX_CONCURRENT_RECOVERIES +".");
        }
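
MAX_CONCURRENT_RECOVERIES is referenced but not shown in the excerpt above; a class-level constant along these lines is assumed:

        // Hypothetical declaration, referenced but not shown in the excerpt above.
        private static final int MAX_CONCURRENT_RECOVERIES = 50;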

Please let us know if you have any suggestions. Thanks a lot.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Jan 2, 2019
Today we block a generic thread on the target side until the source
side has fully executed the recovery. We still execute the recovery on
the source side in a blocking fashion, but there is no reason to block
on the target side. This releases generic threads early when many
concurrent recoveries happen.

Relates to elastic#36195
s1monw added a commit that referenced this issue Jan 4, 2019
s1monw added a commit that referenced this issue Jan 4, 2019
dnhatn added a commit that referenced this issue Jan 12, 2019
Today a peer-recovery may run into a deadlock if the value of
node_concurrent_recoveries is too high. This happens because the
peer-recovery is executed in a blocking fashion. This commit attempts
to make the recovery source partially non-blocking. I will make three
follow-ups to make it fully non-blocking: (1) send translog operations,
(2) primary relocation, (3) send commit files.

Relates #36195
dnhatn added a commit that referenced this issue Jan 13, 2019
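
The direction of these changes is to replace the blocking wait between recovery steps with callback-style chaining, so no generic thread stays parked while a remote response is outstanding. A minimal, non-Elasticsearch sketch of that pattern using CompletableFuture (the step names and pool are invented for illustration):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncRecoverySketch {
    // Each step returns a future and gives its thread back immediately instead of
    // parking it until the remote side replies (contrast with txGet() above).
    static CompletableFuture<Void> sendFileInfo(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> System.out.println("file info sent"), pool);
    }

    static CompletableFuture<Void> sendFiles(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> System.out.println("files sent"), pool);
    }

    static CompletableFuture<Void> finalizeRecovery(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> System.out.println("recovery finalized"), pool);
    }

    public static void main(String[] args) {
        ExecutorService generic = Executors.newFixedThreadPool(2);
        sendFileInfo(generic)
                .thenCompose(v -> sendFiles(generic))
                .thenCompose(v -> finalizeRecovery(generic))
                .whenComplete((v, e) -> generic.shutdown());
    }
}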
dnhatn self-assigned this Mar 24, 2019
dnhatn added a commit that referenced this issue Jun 29, 2019
dnhatn added a commit that referenced this issue Jun 29, 2019
dnhatn added a commit that referenced this issue Jul 17, 2019
dnhatn added a commit that referenced this issue Jul 18, 2019
dnhatn (Member) commented Jul 18, 2019

Peer recovery is now non-blocking on both sides (except the relocation handoff step). I am closing this issue as making the handoff step async is optional. @howardhuanghua Thank you for reporting this.

dnhatn closed this as completed Jul 18, 2019