[bug report] LockObtainFailedException thrown under pressure #20876
this happens when some background process is still ongoing (or has failed to finish properly) and still holds the shard lock. An example of this is a recovery process, which needs access to the shard folder to copy files. Do you have
@bleskes no, we don't set that. Would it be better to set it?
Oh no - this is one of those things that I know can take long in the background. Since 2.1 is quite old - do you have the chance to upgrade to 2.4 and try to reproduce?
I'll try to reproduce in 2.4 in a test env.
We are experiencing the same problem with ES 2.4.1, Java 8u101, Ubuntu 14.04. It has happened twice, and each time it was triggered by starting a full snapshot while one index was under heavy indexing load (3-4k docs/s). About 10 minutes after the beginning of the snapshot, some shards of this index began to throw lots of LockObtainFailedExceptions, and the situation finally got back to normal about one hour later. In the meantime, about 2-3k LockObtainFailedExceptions had been thrown. I hope a solution will be found, because currently our only option is to disable our daily snapshot while we are doing heavy indexing.
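For anyone trying to reproduce this, here is a minimal sketch of the scenario described above, written against the ES 2.x Java TransportClient. The cluster name, host, index name (`load_test`), and repository name (`my_backup`) are all placeholders, and the snapshot repository is assumed to already be registered:

```java
import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class SnapshotUnderLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster name and host; adjust for your environment.
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "my-cluster")
                .build();
        try (TransportClient client = TransportClient.builder().settings(settings).build()
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("localhost"), 9300))) {

            // Drive sustained indexing load from a background thread.
            Thread indexer = new Thread(() -> {
                for (int i = 0; i < 1_000_000; i++) {
                    client.prepareIndex("load_test", "doc")
                            .setSource("{\"field\":\"value\"}")
                            .get();
                }
            });
            indexer.start();

            // Kick off a full snapshot while the indexing is in flight.
            client.admin().cluster()
                    .prepareCreateSnapshot("my_backup", "snapshot_1")
                    .setIndices("load_test")
                    .setWaitForCompletion(false)
                    .get();

            indexer.join();
        }
    }
}
```

Per the report above, the LockObtainFailedExceptions would be expected to start roughly 10 minutes into the snapshot, if they appear at all.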
The likely reason why the
@abeyad thanks for your hard work, guys
Previously, if a node left the cluster during a snapshot (for example, due to a long GC), the master node would mark the snapshot as failed, but the node itself could continue snapshotting the data on its shards to the repository. If the node rejoins the cluster, the master may assign it to hold the replica shard (where it held the primary before getting kicked off the cluster). The initialization of the replica shard would repeatedly fail with a ShardLockObtainFailedException until the snapshot thread finally finishes and relinquishes the lock on the Store.

This commit resolves the situation by ensuring that the shard snapshot is aborted when the node responsible for that shard's snapshot leaves the cluster. When the node rejoins the cluster, it will see in the cluster state that the snapshot for that shard is failed and abort the snapshot locally, allowing the shard data directory to be freed for allocation of a replica shard on the same node. Closes elastic#20876
Previously, if a node left the cluster during a snapshot (for example, due to a long GC), the master node would mark the snapshot as failed, but the node itself could continue snapshotting the data on its shards to the repository. If the node rejoins the cluster, the master may assign it to hold the replica shard (where it held the primary before getting kicked off the cluster). The initialization of the replica shard would repeatedly fail with a ShardLockObtainFailedException until the snapshot thread finally finishes and relinquishes the lock on the Store.

This commit resolves the situation by ensuring that when a shard is removed from a node (such as when a node rejoins the cluster and realizes it no longer holds the active shard copy), any snapshotting of the removed shards is aborted. In the scenario above, when the node rejoins the cluster, it will see in the cluster state that the node no longer holds the primary shard, so IndicesClusterStateService will remove the shard, thereby causing any snapshots of that shard to be aborted. Closes #20876
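To make the failure mode in these commit messages concrete: each shard directory on a node is guarded by an in-process lock, and shard creation waits a bounded time (5000ms in the log below) to obtain it. The following is a deliberately simplified, hypothetical model of that pattern; it is not Elasticsearch's actual NodeEnvironment implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ShardLocks {
    // One single-permit semaphore per shard, created lazily.
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** Acquire the lock for a shard, waiting at most timeoutMillis. */
    public AutoCloseable lock(String shardId, long timeoutMillis) throws Exception {
        Semaphore mutex = locks.computeIfAbsent(shardId, id -> new Semaphore(1));
        if (!mutex.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // Corresponds to "Can't lock shard [...], timed out after 5000ms".
            throw new IllegalStateException(
                    "Can't lock shard " + shardId + ", timed out after " + timeoutMillis + "ms");
        }
        // Releasing the lock closes the handle, like releasing a Store reference.
        return mutex::release;
    }
}
```

Under this model, a snapshot thread that holds a shard's lock for an hour causes every createShard attempt for that shard in the meantime to time out after 5 seconds, which matches the 2-3k exceptions reported above.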
Elasticsearch version: 2.1
Plugins installed: [delete-by-query, elasticsearch-analysis-ik, repository-hdfs]
JVM version: 8u60
OS version: CentOS release 6.6 (Final)
Description of the problem including expected versus actual behavior:
One of the data nodes keeps throwing the exception below:
[2016-10-12 11:34:04,769][WARN ][cluster.action.shard ] [XXXX] [indexName][2] received shard failed for [indexName][2], node[rckOYj-DT42QNoH9CCEBJQ], relocating [v2zayugFQnuMiGu-hS1vXg], [R], v[7091], s[INITIALIZING], a[id=bkpcEq2qTXaPEKHl9tOunQ, rId=xeJJijQCRyaJPcSgQa7eGg], expected_shard_size[22462872851], indexUUID [sOKz0tW9Sw-u137Swoevsw], message [failed to create shard], failure [ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [indexName][2], timed out after 5000ms]; ]
[indexName][[indexName][2]] ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [indexName][2], timed out after 5000ms];
at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:650)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:550)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:179)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:494)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [indexName][2], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:565)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:493)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:307)
... 9 more
Steps to reproduce: (not very precise; I haven't reproduced it yet)