[bug report] LockObtainFailedException thrown under pressure #20876

Closed
makeyang opened this issue Oct 12, 2016 · 7 comments · Fixed by #21084
Assignees
Labels
>bug, :Distributed Coordination/Snapshot/Restore, :Distributed Indexing/Recovery, v2.4.1, v5.0.0

Comments

@makeyang
Contributor

Elasticsearch version: 2.1
Plugins installed: delete-by-query, elasticsearch-analysis-ik, repository-hdfs
JVM version: 8u60
OS version: CentOS release 6.6 (Final)
Description of the problem including expected versus actual behavior:
One of the data nodes keeps throwing the exception below:
[2016-10-12 11:34:04,769][WARN ][cluster.action.shard ] [XXXX] [indexName][2] received shard failed for [indexName][2], node[rckOYj-DT42QNoH9CCEBJQ], relocating [v2zayugFQnuMiGu-hS1vXg], [R], v[7091], s[INITIALIZING], a[id=bkpcEq2qTXaPEKHl9tOunQ, rId=xeJJijQCRyaJPcSgQa7eGg], expected_shard_size[22462872851], indexUUID [sOKz0tW9Sw-u137Swoevsw], message [failed to create shard], failure [ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [indexName][2], timed out after 5000ms]; ]
[indexName][[indexName][2]] ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [indexName][2], timed out after 5000ms];
at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:650)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:550)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:179)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:494)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [indexName][2], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:565)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:493)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:307)
... 9 more
Steps to reproduce: (not very precise, I haven't reproduced it yet)

  1. Put the cluster under a lot of pressure so that one node drops out of the cluster.
  2. Then remove the pressure; after a while the node comes back and tries to recover some shards, and it keeps throwing the exception above.
@bleskes
Contributor

bleskes commented Oct 12, 2016

This happens when some background process is still ongoing (or just failed to finish properly) and still holds the shard lock. An example of this is a recovery process, which needs access to the shard folder to copy files.

Do you have index.shard.check_on_startup set by any chance?
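For anyone who wants to double-check, here is a minimal sketch of reading that setting with the 2.x Java transport client; the cluster name, host, port, and index name below are placeholders, and a plain GET on the index settings endpoint works just as well:

```java
import java.net.InetAddress;

import org.elasticsearch.action.admin.indices.settings.get.GetSettingsResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class CheckShardCheckOnStartup {
    public static void main(String[] args) throws Exception {
        // "my-cluster", "localhost", 9300 and "indexName" are placeholders.
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "my-cluster")
                .build();
        try (TransportClient client = TransportClient.builder().settings(settings).build()
                .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300))) {
            GetSettingsResponse response = client.admin().indices().prepareGetSettings("indexName").get();
            // null means the setting is not set on the index (it then defaults to "false")
            String value = response.getSetting("indexName", "index.shard.check_on_startup");
            System.out.println("index.shard.check_on_startup = " + value);
        }
    }
}
```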

@makeyang
Contributor Author

@bleskes no, we don't set that. Would it be better to set it?

@bleskes
Contributor

bleskes commented Oct 13, 2016

@bleskes no, we don't set that. Would it be better to set it?

Oh no - this is one of those things that I know can take a long time in the background. Since 2.1 is quite old - do you have a chance to upgrade to 2.4 and try to reproduce?

@makeyang
Contributor Author

I'll try to reproduce it on 2.4 in a test environment.

@setaou

setaou commented Oct 17, 2016

We are experiencing the same problem with ES 2.4.1, Java 8u101, Ubuntu 14.04.

It has happened twice, and each time it was triggered by starting a full snapshot while one index was under heavy indexing load (3-4k docs/s). About 10 minutes after the beginning of the snapshot, some shards of this index began to throw lots of LockObtainFailedExceptions, and the situation finally got back to normal about one hour later. In the meantime, about 2-3k LockObtainFailedExceptions were thrown.

I hope a solution will be found, because currently our only option is to disable our daily snapshot while we are doing heavy indexing.

@abeyad abeyad self-assigned this Oct 18, 2016
@abeyad abeyad added the >bug, :Distributed Coordination/Snapshot/Restore, :Distributed Indexing/Recovery, v2.4.1, and v5.0.0 labels Oct 18, 2016
@abeyad

abeyad commented Oct 18, 2016

The likely reason the LockObtainFailedException keeps occurring is a scenario where the node holding the primary is under heavy load, so it is slow to respond and leaves the cluster while a snapshot is taking place. The snapshot holds a lock on the primary shard on the overloaded node. When the master node realizes that the overloaded node is not responding, it removes it from the cluster, promotes a replica copy of the shard to primary, and cancels the snapshot. When the overloaded node rejoins the cluster, the master node assigns it to hold a replica copy of the shard. When the node attempts to initialize the shard and recover from the primary, it encounters a LockObtainFailedException because the canceled snapshot process still holds a lock on the shard; the shard lock isn't released until the snapshot actually completes. We are looking into an appropriate fix for this.
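To make the mechanism concrete, here is a minimal, self-contained Java sketch of the semantics described above (it is not the actual NodeEnvironment or snapshot code; all names are illustrative): the snapshot thread holds the per-shard lock until every file is copied, while the shard-creation path tries to acquire the same lock with a 5 second timeout and gives up.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the contention described above; class and method names are
// illustrative and do not correspond to Elasticsearch internals.
public class ShardLockContentionSketch {

    // Stand-in for the per-shard lock that the node guards internally.
    private static final Semaphore shardLock = new Semaphore(1);

    public static void main(String[] args) throws Exception {
        Thread snapshotThread = new Thread(() -> {
            shardLock.acquireUninterruptibly();   // the snapshot grabs the shard lock
            try {
                copyShardFilesToRepository();      // keeps running even after the master cancels the snapshot
            } finally {
                shardLock.release();               // the lock is only released once the copy finishes
            }
        });
        snapshotThread.start();
        Thread.sleep(100);                         // let the snapshot start first

        // Shard creation on the rejoined node: try to lock the shard, give up after 5s.
        if (!shardLock.tryAcquire(5000, TimeUnit.MILLISECONDS)) {
            // In Elasticsearch this surfaces as:
            // LockObtainFailedException: Can't lock shard [indexName][2], timed out after 5000ms
            System.out.println("failed to create shard: could not obtain the shard lock within 5000ms");
        }
        snapshotThread.join();
    }

    private static void copyShardFilesToRepository() {
        try {
            Thread.sleep(30_000);                  // stands in for a long-running shard snapshot
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the cluster applies new cluster states repeatedly, the failing shard creation is retried on each update and the same timeout is hit over and over, which matches the flood of exceptions reported above.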

@makeyang
Contributor Author

@abeyad thanks for your hard work, guys.

abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Oct 23, 2016
Previously, if a node left the cluster (for example, due to a long GC),
during a snapshot, the master node would mark the snapshot as failed, but
the node itself could continue snapshotting the data on its shards to the
repository.  If the node rejoins the cluster, the master may assign it to
hold the replica shard (where it held the primary before getting kicked off
the cluster).  The initialization of the replica shard would repeatedly fail
with a ShardLockObtainFailedException until the snapshot thread finally
finishes and relinquishes the lock on the Store.

This commit resolves the situation by ensuring that the shard snapshot is
aborted when the node responsible for that shard's snapshot leaves the cluster.
When the node rejoins the cluster, it will see in the cluster state that
the snapshot for that shard is failed and abort the snapshot locally,
allowing the shard data directory to be freed for allocation of a replica
shard on the same node.

Closes elastic#20876
abeyad pushed a commit that referenced this issue Oct 26, 2016
Previously, if a node left the cluster (for example, due to a long GC),
during a snapshot, the master node would mark the snapshot as failed, but
the node itself could continue snapshotting the data on its shards to the
repository. If the node rejoins the cluster, the master may assign it to
hold the replica shard (where it held the primary before getting kicked off
the cluster). The initialization of the replica shard would repeatedly fail
with a ShardLockObtainFailedException until the snapshot thread finally
finishes and relinquishes the lock on the Store.

This commit resolves the situation by ensuring that when a shard is removed
from a node (such as when a node rejoins the cluster and realizes it no longer
holds the active shard copy), any snapshotting of the removed shards is aborted.
In the scenario above, when the node rejoins the cluster, it will see in the cluster 
state that the node no longer holds the primary shard, so IndicesClusterStateService
will remove the shard, thereby causing any snapshots of that shard to be aborted.

Closes #20876
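In the spirit of that fix, here is an illustrative-only sketch (hypothetical class and method names, not the real Elasticsearch types): the shard-removal path sets an abort flag, and the snapshot loop checks it between files, so the shard lock is released promptly instead of only after the whole snapshot finishes.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative-only sketch of "abort the shard snapshot when the shard is removed";
// the class and method names are hypothetical, not real Elasticsearch types.
class ShardSnapshotTask {

    private final AtomicBoolean aborted = new AtomicBoolean(false);

    /** Called by the cluster-state applier when this node no longer holds the shard. */
    void abort() {
        aborted.set(true);
    }

    /** Runs on the snapshot thread while holding the shard lock / store reference. */
    void snapshot(List<String> files) {
        for (String file : files) {
            if (aborted.get()) {
                // Stop early: the loop exits, the caller's finally block releases the
                // shard lock, and a replica copy can then be allocated on this node.
                return;
            }
            copyToRepository(file);
        }
    }

    private void copyToRepository(String file) {
        // ... copy a single file of the shard to the snapshot repository ...
    }
}
```

Before the fix, nothing invoked the abort path for a shard whose node had left the cluster, so the copy loop ran to completion and the lock stayed held for the duration of the snapshot.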
abeyad pushed a commit that referenced this issue Oct 26, 2016
ywelsch pushed a commit that referenced this issue Nov 22, 2016