
Allocating replica shards is blocked when the concurrent allocation setting is large #46411

Closed
kkewwei opened this issue Sep 5, 2019 · 1 comment


@kkewwei
Contributor

kkewwei commented Sep 5, 2019

ES_VERSION: 7.3.1
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:

  • First, I need to pre-create 320 indices on the cluster (8 data nodes) to reduce the pressure of creating shards. There are 3200 shards in total (1600 primary shards and 1600 replica shards). I want the primary shards to be allocated first, so I set the cluster settings like this:
PUT _cluster/settings
{
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "allow_rebalance": "always",
                    "cluster_concurrent_rebalance": "5",
                    "node_concurrent_recoveries": "0",
                    "node_initial_primaries_recoveries": "100"
                }
            }
        }
    }
}

Of course, the 1600 primary shards are allocated successfully, while the 1600 replica shards remain unassigned; there is no data on those indices.

  • Second, I want the 1600 replica shards to be allocated, so I update the cluster settings like this:
PUT _cluster/settings
{
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "node_concurrent_recoveries": "1000"
                }
            }
        }
    }
}

       The reason I set the value so high is that I want the replica shards to be allocated as soon as possible; since there is no data on the shards, this should put no pressure on the cluster.
       What I expect is that the 1600 replica shards are allocated successfully, but in fact they stay in the initializing state no matter how long I wait. It seems that the initialization of the 1600 replica shards is blocked.
Explain:

  • A replica shard is allocated by recovering from its peer (primary) shard, and it uses the generic threadpool to do the work: the thread initializes local variables and sends a request to the peer primary shard.
    threadPool.generic().execute(new RecoveryRunner(recoveryId));
    The primary shard receives the request and also uses the generic threadpool to transfer data, but the generic threadpool can create at most 128 threads in my case. This is the problem.
    final int genericThreadPoolMax = boundedBy(4 * availableProcessors, 128, 512);
  • When I set node_concurrent_recoveries to 1000, every data node uses the generic threadpool to create 128 threads to initialize replica shards at the same time, each sending a request to the node that holds the peer primary shard. When the primary shard receives the request, it attempts to get a thread from its own generic threadpool to transfer data, but that pool cannot create any more threads, so the transfer task is put into the pool's queue. This forms a deadlock: every recovery thread is blocked waiting for data from the peer primary node, while the transfer tasks that would unblock them sit in the queue behind those very threads.
"elasticsearch[node1][generic][T#140]" #276 daemon prio=5 os_prio=0 tid=0x00007f99dc041000 nid=0x2d7f waiting on condition [0x00007f981d3d2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000180944970> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:248)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:91)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:45)
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:33)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$1(PeerRecoveryTargetService.java:229)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$$Lambda$1622/173120083.run(Unknown Source)
        at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
        at org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:222)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73)
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
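
The arithmetic and the resulting cycle can be reproduced outside Elasticsearch. The sketch below is a minimal stand-in, not ES code: `boundedBy` is assumed to clamp a value into [min, max] as described above, and the pool is shrunk to 2 threads so the demo finishes quickly, but the mechanics match the 128-thread generic pool: every worker blocks waiting on a sub-task that is queued behind it in the same pool.

```java
import java.util.concurrent.*;

public class GenericPoolDeadlockDemo {

    // Assumed shape of the clamp used for the generic pool's max size.
    static int boundedBy(int value, int min, int max) {
        return Math.min(max, Math.max(min, value));
    }

    // Saturates a fixed pool with tasks that each block on a sub-task
    // submitted to the SAME pool. Returns true only if any sub-task ran.
    static boolean attemptRecoveries() throws InterruptedException {
        int poolSize = 2; // stand-in for the 128-thread generic-pool cap
        ExecutorService generic = Executors.newFixedThreadPool(poolSize);
        CountDownLatch transfersDone = new CountDownLatch(poolSize);
        for (int i = 0; i < poolSize; i++) {
            // "Recovery" task: waits for its "transfer" sub-task,
            // which is queued behind it on the saturated pool.
            generic.submit(() -> {
                Future<?> transfer = generic.submit(transfersDone::countDown);
                try {
                    transfer.get();
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Every worker is parked in get(); no transfer ever runs.
        boolean finished = transfersDone.await(2, TimeUnit.SECONDS);
        generic.shutdownNow();
        return finished;
    }

    public static void main(String[] args) throws Exception {
        // On a 16-core node: 4 * 16 = 64, clamped up to the 128 floor.
        System.out.println("generic pool max: " + boundedBy(4 * 16, 128, 512));
        System.out.println("transfers completed: " + attemptRecoveries());
    }
}
```

With a pool of 2 the latch never reaches zero, just as the 128 generic-pool workers above never see their transfer tasks run.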

Solution:
       In my opinion, should we split recovery across two thread pools? The replica shard would use one threadpool to initialize local variables and send the request to the peer primary shard, and the primary shard would use another threadpool to transfer data to the replica shard.
       We could also add a timeout to the data-transfer request to break the cycle.
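The split-pool idea can be sketched the same way. The pool names here are hypothetical illustrations, not real ES threadpool names: when the source-side "transfer" runs on a different pool from the target-side "recovery", the blocked workers and the queued sub-tasks are no longer competing for the same threads, so the cycle cannot form.

```java
import java.util.concurrent.*;

public class SplitRecoveryPoolsDemo {

    // Hypothetical dedicated pools: one for the replica (target) side of a
    // recovery, one for the primary (source) side.
    static boolean runRecoveries() throws InterruptedException {
        ExecutorService targetPool = Executors.newFixedThreadPool(2);
        ExecutorService sourcePool = Executors.newFixedThreadPool(2);
        CountDownLatch transfersDone = new CountDownLatch(2);
        for (int i = 0; i < 2; i++) {
            targetPool.submit(() -> {
                // The blocking wait ties up a targetPool thread, but the
                // transfer runs on sourcePool, so it always gets a thread.
                Future<?> transfer = sourcePool.submit(transfersDone::countDown);
                try {
                    transfer.get();
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        boolean finished = transfersDone.await(2, TimeUnit.SECONDS);
        targetPool.shutdown();
        sourcePool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("transfers completed: " + runRecoveries());
    }
}
```

Even with both pools fully saturated, every transfer completes, because a waiting recovery thread never blocks the pool its own transfer needs.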

@jasontedor
Member

Duplicates #36195

Note that setting node_concurrent_recoveries to such a high value is not advised.
