
Cluster hangs after setting "cluster.routing.allocation.node_concurrent_recoveries" to 100 #36195

Closed
howardhuanghua opened this issue Dec 4, 2018 · 6 comments

howardhuanghua (Contributor) commented Dec 4, 2018

Elasticsearch version: 6.4.3/6.5.1

JVM version: 1.8.0.181

OS version: CentOS 7.4

Description of the problem including expected versus actual behavior:
Production environment: 15 nodes, 2700+ indices, 15000+ shards.
The cluster hangs after setting "cluster.routing.allocation.node_concurrent_recoveries" to 100.

Steps to reproduce:

  1. Set up a three-node cluster, 1 core and 2 GB per node.
  2. Create 300 indices (3000 shards), each index with 100 documents.
  3. Set these cluster settings dynamically (a Java client sketch of the same call follows the list):
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": 100,
        "indices.recovery.max_bytes_per_sec": "400mb"
      }
    }'
  4. Stop one of the nodes and remove the data from its data path.
  5. Start the stopped node again.
  6. After a while, the cluster hangs.
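
For reference, the settings update in step 3 can also be issued from Java. This is a minimal sketch using the 6.x high-level REST client; the class name and the localhost:9200 endpoint are assumptions for illustration, not part of the original report.

import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class ApplyRecoverySettings {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Same persistent settings as the curl command in step 3.
            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.persistentSettings(Settings.builder()
                    .put("cluster.routing.allocation.node_concurrent_recoveries", 100)
                    .put("indices.recovery.max_bytes_per_sec", "400mb")
                    .build());
            client.cluster().putSettings(request, RequestOptions.DEFAULT);
        }
    }
}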

We could see that each node's generic thread pool had all 128 threads in use, i.e. it was full:
[c_log@VM_128_27_centos ~/elasticsearch-6.4.3/bin]$ curl localhost:9200/_cat/thread_pool/generic?v
node_name name active queue rejected
node-3 generic 128 949 0
node-2 generic 128 1093 0
node-1 generic 128 1076 0

Lots of peer recoveries are waiting:
[screenshot of pending peer recoveries omitted]

jstack output for a hung node; all generic threads are waiting on txGet:
"elasticsearch[node-3][generic][T#128]" #179 daemon prio=5 os_prio=0 tid=0x00007fa8980c8800 nid=0x3cb9 waiting on condition [0x00007fa86ca0a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000000fbef56f0> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:251)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:44)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:32)
        at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.receiveFileInfo(RemoteRecoveryTargetHandler.java:133)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase1$6(RecoverySourceHandler.java:387)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3071/1370938617.run(Unknown Source)
        at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
        at org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:386)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107)
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104)
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

So the cluster appears to be stuck in a distributed deadlock.

Thanks,
Howard

howardhuanghua (Contributor, Author) commented:

As a workaround, we have patched the _cluster/settings REST-level API to reject values of "cluster.routing.allocation.node_concurrent_recoveries" above 50.

romseygeek added the :Distributed Indexing/Recovery label Dec 5, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

Ethan-Zhang commented:

Shall we use org.elasticsearch.transport.PlainTransportFuture#txGet(long timeout, TimeUnit unit) instead of txGet() to avoid the deadlock?
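
For illustration only, here is the difference between an unbounded and a bounded wait on a future, shown with plain java.util.concurrent rather than Elasticsearch's PlainTransportFuture (the sleep and timeout values are arbitrary). A timeout would turn the hang into a failed recovery that must be retried; it would not remove the underlying blocking, which is why the eventual fix makes the code path asynchronous instead.

import java.util.concurrent.*;

public class BoundedWaitDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a transport reply that is slow (or never) coming back.
        Future<String> reply = pool.submit(() -> {
            Thread.sleep(5_000);
            return "file info received";
        });
        try {
            // Unbounded wait, analogous to txGet(): parks the calling thread indefinitely.
            // reply.get();

            // Bounded wait, analogous to txGet(timeout, unit): gives up after one second.
            System.out.println(reply.get(1, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("gave up waiting; the recovery would have to be failed and retried");
        } finally {
            pool.shutdownNow();
        }
    }
}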

ywelsch added the >bug label Dec 11, 2018
ywelsch (Contributor) commented Dec 11, 2018

@howardhuanghua thanks for reporting this. This is indeed an issue if the number of concurrent recoveries from a node is higher than the maximum size of the GENERIC thread pool (which is some value >= 128, depending on the number of processors). That said, you typically should not have so many shards per node, and allowing such a high number of node_concurrent_recoveries will also not play well with other parts of the system (e.g. the shard balancer). Fixing this will require making this code asynchronous, which is not a small thing to do. In the meantime, we can think about adding a soft limit to the node_concurrent_recoveries setting.
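
The deadlock mechanism can be illustrated outside Elasticsearch: on a bounded pool, tasks that park their thread waiting for work that can only run on the same, already-full pool never complete. A minimal, self-contained sketch (the pool size and task count are arbitrary; this is not Elasticsearch code, where the wait additionally spans two nodes):

import java.util.concurrent.*;

public class PoolDeadlockDemo {
    public static void main(String[] args) {
        // A tiny pool standing in for the GENERIC pool (max 128 in the report).
        ExecutorService generic = Executors.newFixedThreadPool(2);

        // Submit more "recoveries" than there are threads. Each one blocks on a
        // follow-up task that needs a thread from the same pool, so once every
        // thread is parked in get(), nothing can make progress again.
        for (int i = 0; i < 3; i++) {
            final int id = i;
            generic.submit(() -> {
                Future<String> reply = generic.submit(() -> "reply " + id);
                try {
                    return reply.get(); // parks a pool thread, like txGet()
                } catch (InterruptedException | ExecutionException e) {
                    return "failed";
                }
            });
        }
        // This program never terminates. In the real cluster the parked thread is on
        // the source node and the reply needs a generic thread on the target node
        // (and vice versa), hence a "distributed" deadlock.
    }
}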

howardhuanghua (Contributor, Author) commented:

@ywelsch, thanks for your comment. Currently we limit the node_concurrent_recoveries setting to at most 50 in our production environment (based on 6.4.3), as follows.

In RestClusterUpdateSettingsAction.java, prepareRequest method:

        Settings settings = EMPTY_SETTINGS;
        if (source.containsKey(TRANSIENT)) {
            clusterUpdateSettingsRequest.transientSettings((Map) source.get(TRANSIENT));
            settings = clusterUpdateSettingsRequest.transientSettings();
        }
        if (source.containsKey(PERSISTENT)) {
            clusterUpdateSettingsRequest.persistentSettings((Map) source.get(PERSISTENT));
            settings = clusterUpdateSettingsRequest.persistentSettings();
        }
        
        // Limit node concurrent recoveries: if incoming + outgoing recoveries use up the generic thread pool, the cluster can hang.
        if (settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES
                || settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_INCOMING_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES
                || settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_OUTGOING_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES) {
            throw new IllegalArgumentException("Can't set node concurrent recoveries greater than " + MAX_CONCURRENT_RECOVERIES +".");
        }
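
MAX_CONCURRENT_RECOVERIES is referenced but not shown in the excerpt above; a class-level constant along these lines is assumed:

        // Hypothetical declaration, referenced but not shown in the excerpt above.
        private static final int MAX_CONCURRENT_RECOVERIES = 50;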

Please let us know if you have any suggestions. Thanks a lot.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Jan 2, 2019
Today we block a generic thread on the target side until the source
side has fully executed the recovery. We still execute the recovery on
the source side in a blocking fashion, but there is no reason to block
on the target side. This releases generic threads early when many
concurrent recoveries happen.

Relates to elastic#36195
s1monw added a commit that referenced this issue Jan 4, 2019
s1monw added a commit that referenced this issue Jan 4, 2019
dnhatn added a commit that referenced this issue Jan 12, 2019
Today a peer-recovery may run into a deadlock if the value of
node_concurrent_recoveries is too high. This happens because the
peer-recovery is executed in a blocking fashion. This commit attempts
to make the recovery source partially non-blocking. I will make three
follow-ups to make it fully non-blocking: (1) send translog operations,
(2) primary relocation, (3) send commit files.

Relates #36195
dnhatn added a commit that referenced this issue Jan 13, 2019
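
The direction of these changes is to replace the blocking wait between recovery steps with callback-style chaining, so no generic thread stays parked while a remote response is outstanding. A minimal, non-Elasticsearch sketch of that pattern using CompletableFuture (the step names and pool are invented for illustration):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncRecoverySketch {
    // Each step returns a future and gives its thread back immediately instead of
    // parking it until the remote side replies (contrast with txGet() above).
    static CompletableFuture<Void> sendFileInfo(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> System.out.println("file info sent"), pool);
    }

    static CompletableFuture<Void> sendFiles(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> System.out.println("files sent"), pool);
    }

    static CompletableFuture<Void> finalizeRecovery(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> System.out.println("recovery finalized"), pool);
    }

    public static void main(String[] args) {
        ExecutorService generic = Executors.newFixedThreadPool(2);
        sendFileInfo(generic)
                .thenCompose(v -> sendFiles(generic))
                .thenCompose(v -> finalizeRecovery(generic))
                .whenComplete((v, e) -> generic.shutdown());
    }
}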
dnhatn self-assigned this Mar 24, 2019
dnhatn added a commit that referenced this issue Jun 29, 2019
dnhatn added a commit that referenced this issue Jun 29, 2019
dnhatn added a commit that referenced this issue Jul 17, 2019
dnhatn added a commit that referenced this issue Jul 18, 2019
dnhatn (Member) commented Jul 18, 2019

Peer recovery is now non-blocking on both sides (except the relocation handoff step). I am closing this issue as making the handoff step async is optional. @howardhuanghua Thank you for reporting this.

dnhatn closed this as completed Jul 18, 2019