
Allocating replica shards is blocked when the concurrent allocation setting is large #46411

Closed
kkewwei opened this issue Sep 5, 2019 · 1 comment


@kkewwei
Contributor

kkewwei commented Sep 5, 2019

ES_VERSION: 7.3.1
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:

  • First, I need to pre-create 320 indices on the cluster (8 data nodes) to reduce the pressure of creating shards. There are 3200 shards in total (1600 primary shards and 1600 replica shards). I want the primary shards to be allocated first, so I set the cluster settings like this:
PUT _cluster/settings
{
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "allow_rebalance": "always",
                    "cluster_concurrent_rebalance": "5",
                    "node_concurrent_recoveries": "0",
                    "node_initial_primaries_recoveries": "100"
                }
            }
        }
    }
}

Of course, the 1600 primary shards are allocated successfully, while the 1600 replica shards remain unassigned; there is no data on those indices.

  • Second, I want the 1600 replica shards to be allocated, so I update the cluster settings like this:
PUT _cluster/settings
{
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "node_concurrent_recoveries": "1000"
                }
            }
        }
    }
}

       The reason I set the value so high is that I want the replica shards to be allocated as soon as possible; since there is no data on the shards, this should put no pressure on the cluster.
       What I expect is that the 1600 replica shards are allocated successfully, but in fact they stay in the initializing state no matter how long I wait. It seems that the initialization of the 1600 replica shards is blocked.
Explain:

  • A replica shard is allocated by recovering from its peer (primary) shard, and it uses the generic threadpool to do the work: the thread initializes local variables and sends a request to the peer primary shard.
    threadPool.generic().execute(new RecoveryRunner(recoveryId));
    The primary shard receives the request and also uses the generic threadpool to transfer data, but the generic threadpool can create at most 128 threads in my case. This is the problem.
    final int genericThreadPoolMax = boundedBy(4 * availableProcessors, 128, 512);
  • When I set node_concurrent_recoveries to 1000, every data node uses the generic threadpool to create 128 threads to initialize replica shards at the same time, each sending a request to the node that holds the peer primary shard. When the primary shard receives the request, it attempts to get a thread from its own generic threadpool to transfer data, but that pool cannot create any more threads, so the transfer task is put into the pool's queue. This forms a deadlock: every recovery thread is blocked waiting for data from the peer primary node, while the transfer tasks that would unblock them sit in the queue behind those very threads.
"elasticsearch[node1][generic][T#140]" #276 daemon prio=5 os_prio=0 tid=0x00007f99dc041000 nid=0x2d7f waiting on condition [0x00007f981d3d2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000180944970> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:248)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:91)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:45)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:33)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$1(PeerRecoveryTargetService.java:229)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$$Lambda$1622/173120083.run(Unknown Source)
        at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
        at org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:222)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
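
The arithmetic and the resulting cycle can be reproduced outside Elasticsearch. The sketch below is a minimal stand-in, not ES code: `boundedBy` is assumed to clamp a value into [min, max] as described above, and the pool is shrunk to 2 threads so the demo finishes quickly, but the mechanics match the 128-thread generic pool: every worker blocks waiting on a sub-task that is queued behind it in the same pool.

```java
import java.util.concurrent.*;

public class GenericPoolDeadlockDemo {

    // Assumed shape of the clamp used for the generic pool's max size.
    static int boundedBy(int value, int min, int max) {
        return Math.min(max, Math.max(min, value));
    }

    // Saturates a fixed pool with tasks that each block on a sub-task
    // submitted to the SAME pool. Returns true only if any sub-task ran.
    static boolean attemptRecoveries() throws InterruptedException {
        int poolSize = 2; // stand-in for the 128-thread generic-pool cap
        ExecutorService generic = Executors.newFixedThreadPool(poolSize);
        CountDownLatch transfersDone = new CountDownLatch(poolSize);
        for (int i = 0; i < poolSize; i++) {
            // "Recovery" task: waits for its "transfer" sub-task,
            // which is queued behind it on the saturated pool.
            generic.submit(() -> {
                Future<?> transfer = generic.submit(transfersDone::countDown);
                try {
                    transfer.get();
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Every worker is parked in get(); no transfer ever runs.
        boolean finished = transfersDone.await(2, TimeUnit.SECONDS);
        generic.shutdownNow();
        return finished;
    }

    public static void main(String[] args) throws Exception {
        // On a 16-core node: 4 * 16 = 64, clamped up to the 128 floor.
        System.out.println("generic pool max: " + boundedBy(4 * 16, 128, 512));
        System.out.println("transfers completed: " + attemptRecoveries());
    }
}
```

With a pool of 2 the latch never reaches zero, just as the 128 generic-pool workers above never see their transfer tasks run.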

Solution:
       In my opinion, should we split recovery across two thread pools? The replica shard would use one threadpool to initialize local variables and send the request to the peer primary shard, and the primary shard would use another threadpool to transfer data to the replica shard.
       We could also add a timeout to the data-transfer request to break the cycle.
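The split-pool idea can be sketched the same way. The pool names here are hypothetical illustrations, not real ES threadpool names: when the source-side "transfer" runs on a different pool from the target-side "recovery", the blocked workers and the queued sub-tasks are no longer competing for the same threads, so the cycle cannot form.

```java
import java.util.concurrent.*;

public class SplitRecoveryPoolsDemo {

    // Hypothetical dedicated pools: one for the replica (target) side of a
    // recovery, one for the primary (source) side.
    static boolean runRecoveries() throws InterruptedException {
        ExecutorService targetPool = Executors.newFixedThreadPool(2);
        ExecutorService sourcePool = Executors.newFixedThreadPool(2);
        CountDownLatch transfersDone = new CountDownLatch(2);
        for (int i = 0; i < 2; i++) {
            targetPool.submit(() -> {
                // The blocking wait ties up a targetPool thread, but the
                // transfer runs on sourcePool, so it always gets a thread.
                Future<?> transfer = sourcePool.submit(transfersDone::countDown);
                try {
                    transfer.get();
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        boolean finished = transfersDone.await(2, TimeUnit.SECONDS);
        targetPool.shutdown();
        sourcePool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("transfers completed: " + runRecoveries());
    }
}
```

Even with both pools fully saturated, every transfer completes, because a waiting recovery thread never blocks the pool its own transfer needs.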

@jasontedor
Member

Duplicates #36195

Note that setting node_concurrent_recoveries to such a high value is not advised.
