Snapshot recovery failures lock the cluster in red state and prevent additional snapshot operations from running #29423

Closed
redlus opened this issue Apr 8, 2018 · 8 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs feedback_needed

Comments

@redlus

redlus commented Apr 8, 2018

Hi,

When recovering an index from a snapshot, a failure to recover a single shard puts the cluster in red state and holds the lock for snapshot operations until a manual intervention is made.

Elasticsearch version:
6.2.3

Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3

JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
A snapshot of a single index was produced from a live elasticsearch 5.2 cluster, and recovered into a live elasticsearch 6.2 cluster. The recovery process has failed for one shard with the following message grabbed from _cluster/allocation/explain:

"failed shard on node [6NHoqfc5TXiQjDOP8wdnCg]: failed recovery, failure RecoveryFailedException[[507_newlogs_20180314-01][13]: Recovery failed on {prod-elasticsearch-data-010}{6NHoqfc5TXiQjDOP8wdnCg}{vI_sh0rXTYCKnet74Hvavw}{192.168.0.191}{192.168.0.191:9300}{box_type=L8}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [507_newlogs_20180314/0phodsMqT0ujVthywIIhNQ]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: NoSuchFileException[The specified blob does not exist.]; "

The explain API also returns this message for each data node:

"shard has failed to be restored from the snapshot [507_newlogs:507_newlogs_20180314/0phodsMqT0ujVthywIIhNQ] because of [failed shard on node [6NHoqfc5TXiQjDOP8wdnCg]: failed recovery, failure RecoveryFailedException[[507_newlogs_20180314-01][13]: Recovery failed on {prod-elasticsearch-data-010}{6NHoqfc5TXiQjDOP8wdnCg}{vI_sh0rXTYCKnet74Hvavw}{192.168.0.191}{192.168.0.191:9300}{box_type=L8}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [507_newlogs_20180314/0phodsMqT0ujVthywIIhNQ]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: NoSuchFileException[The specified blob does not exist.]; ] - manually close or delete the index [507_newlogs_20180314-01] in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard"
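The nested exception chain in these messages can be unpacked mechanically, and the innermost entry is usually the most useful one for diagnosis. A minimal sketch, assuming the message keeps the `Name[...]; nested: Name[...]` shape shown above; the helper name `exception_chain` is hypothetical and not part of any Elasticsearch client:

```python
import re
from typing import List


def exception_chain(message: str) -> List[str]:
    """Return exception class names in order of appearance (outermost first)."""
    # Each entry looks like `SomeException[details]`; only the class name is kept.
    return re.findall(r"(\w+Exception)\[", message)


# Shortened copy of the message reported above.
sample = (
    "failed recovery, failure RecoveryFailedException[[507_newlogs_20180314-01][13]: "
    "Recovery failed on {prod-elasticsearch-data-010}]; "
    "nested: IndexShardRecoveryException[failed recovery]; "
    "nested: IndexShardRestoreFailedException[restore failed]; "
    "nested: NoSuchFileException[The specified blob does not exist.]; "
)

chain = exception_chain(sample)
root_cause = chain[-1] if chain else None
print(root_cause)
```

Here the last entry, `NoSuchFileException`, points at a missing blob in the repository rather than at the node that attempted the recovery.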

Calling _cluster/reroute?retry_failed=true did not help.
At this stage the restore process is stuck, leaving the cluster in a red state and preventing any create / restore / delete snapshot operations from being made - until a manual operation is performed on the cluster (namely "manually close or delete the index...in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard").
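For reference, the manual options named in the explain output map onto a handful of REST calls. The sketch below only builds the requests (method, path, body) and does not send them. The `Request` type and helper names are hypothetical, and the repository, snapshot, index, shard and node values are read off the messages above. Note that `allocate_empty_primary` discards whatever data the shard held, which is why it requires `accept_data_loss`.

```python
import json
from typing import NamedTuple, Optional


class Request(NamedTuple):
    method: str
    path: str
    body: Optional[dict] = None


def retry_failed() -> Request:
    # Ask the cluster to retry shards whose allocation attempts were exhausted.
    return Request("POST", "/_cluster/reroute?retry_failed=true")


def delete_index(index: str) -> Request:
    # Option 1, step 1: remove the partially restored index.
    return Request("DELETE", f"/{index}")


def restore(repo: str, snapshot: str, index: str) -> Request:
    # Option 1, step 2: restore the index from the snapshot again.
    return Request("POST", f"/_snapshot/{repo}/{snapshot}/_restore", {"indices": index})


def allocate_empty_primary(index: str, shard: int, node: str) -> Request:
    # Option 2: force an empty primary for the failed shard (its data is lost).
    command = {
        "allocate_empty_primary": {
            "index": index,
            "shard": shard,
            "node": node,
            "accept_data_loss": True,
        }
    }
    return Request("POST", "/_cluster/reroute", {"commands": [command]})


index = "507_newlogs_20180314-01"
plan = [
    retry_failed(),
    delete_index(index),
    restore("507_newlogs", "507_newlogs_20180314", index),
    allocate_empty_primary(index, 13, "prod-elasticsearch-data-010"),
]
for req in plan:
    print(req.method, req.path)
    if req.body is not None:
        print(json.dumps(req.body, indent=2))
```

The delete-and-restore pair and the `allocate_empty_primary` command are alternatives, not a sequence; they are listed together here only to show the request shapes.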

First, if elasticsearch knows which shard has failed, can't it automatically try to delete this specific shard and retry to restore it from the snapshot it has just loaded? This can be much faster and more reliable than waiting for a manual operation to delete the index and re-restore.

Second, if elasticsearch can't retry to restore the shard (or if such a retry fails as well), it should release the lock on snapshot operations to prevent failure of future requests. The state of the cluster may still be red, but at least other operations can be performed in the background.

If this is the desired behavior, I'd add a user-configurable setting to release the snapshot operations lock in this specific case.

Thanks!

@redlus redlus changed the title Snapshot recovery failures locks the cluster in red state and prevents additional snapshot operations from running Snapshot recovery failures lock the cluster in red state and prevents additional snapshot operations from running Apr 8, 2018
@redlus redlus changed the title Snapshot recovery failures lock the cluster in red state and prevents additional snapshot operations from running Snapshot recovery failures lock the cluster in red state and prevent additional snapshot operations from running Apr 8, 2018
@dnhatn dnhatn added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Apr 8, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@tlrx
Member

tlrx commented Apr 16, 2018

Thanks @redlus for the detailed explanation. Here are some comments I have:

A snapshot of a single index was produced from a live elasticsearch 5.2 cluster, and recovered into a live elasticsearch 6.2 cluster.

Just a note: when multiple clusters access the same repository, it is recommended that only one cluster create snapshots and that all other clusters register the repository as read-only (in your case, the 6.2 cluster).
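As a rough illustration of that recommendation, the sketch below builds the body for registering a repository as read-only on the restoring cluster. Assumptions: the repository type `azure` is inferred from the installed `repository-azure` plugin and the blob error, the `container` value is a placeholder, the helper name is hypothetical, and `readonly` is the setting name as I understand the repository settings to spell it.

```python
import json


def readonly_repository_body(repo_type: str, settings: dict) -> dict:
    """Return a repository registration body with read-only access enforced."""
    merged = dict(settings)    # copy so the caller's dict is not mutated
    merged["readonly"] = True  # this cluster may restore but not write snapshots
    return {"type": repo_type, "settings": merged}


# Placeholder settings; the real container name is not given in this thread.
body = readonly_repository_body("azure", {"container": "example-container"})
print("PUT /_snapshot/507_newlogs")
print(json.dumps(body, indent=2))
```

The cluster that creates the snapshots would keep its own registration of the same repository without the `readonly` flag.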

Calling _cluster/reroute?retry_failed=true did not help.
At this stage the restore process is stuck, leaving the cluster in a red state and preventing any create / restore / delete snapshot operations from being made

This was true before #27493: the restore operation got stuck and prevented any other snapshot operation. Since #27493, the restore operation should no longer hang, and create/restore/delete operations should work as expected: creating a snapshot of any index is possible (but it will fail if the partially restored index is included in the snapshot; that index must be closed or deleted first), restoring a snapshot of any index is possible (but the partially restored index must be closed or deleted before trying to restore it again), and deleting a snapshot should work too.

First, if elasticsearch knows which shard has failed, can't it automatically try to delete this specific shard and retry to restore it from the snapshot it has just loaded? This can be much faster and more reliable than waiting for a manual operation to delete the index and re-restore.

I agree this would be much simpler. Restoring a shard from a snapshot uses the same internal mechanism as a "normal" shard recovery. For now it follows the same rules and stops trying to allocate a shard if it failed more than 5 times. There is an issue about automatically retrying failed allocations, see (#24530).
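To check whether a shard has hit that retry limit, one can look at the `unassigned_info` section of a `_cluster/allocation/explain` response. A minimal sketch, assuming the default limit of 5 (the `index.allocation.max_retries` setting) and a `>=` comparison; the sample response is a hypothetical, trimmed-down one, and `retries_exhausted` is not an Elasticsearch API:

```python
DEFAULT_MAX_RETRIES = 5  # assumed default of index.allocation.max_retries


def retries_exhausted(explain: dict, max_retries: int = DEFAULT_MAX_RETRIES) -> bool:
    """True if the shard is unassigned because allocation failed too many times."""
    info = explain.get("unassigned_info", {})
    return (
        info.get("reason") == "ALLOCATION_FAILED"
        and info.get("failed_allocation_attempts", 0) >= max_retries
    )


# Hypothetical, trimmed explain response for the shard in this report.
sample_explain = {
    "index": "507_newlogs_20180314-01",
    "shard": 13,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "failed_allocation_attempts": 5,
    },
}

print(retries_exhausted(sample_explain))
```

When this returns true, the allocator will not try again on its own; `_cluster/reroute?retry_failed=true` requests another round of attempts.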

Second, if elasticsearch can't retry to restore the shard (or if such a retry fails as well), it should release the lock on snapshot operations to prevent failure of future requests. The state of the cluster may still be red, but at least other operations can be performed in the background.

This should no longer be the case since #27493 (merged into 6.2.0). I could not reproduce this behavior locally: I can create/restore/delete snapshots that do not include the partially restored index.

@redlus
Author

redlus commented Apr 22, 2018

Thanks for your reply, @tlrx

Something here does not add up. We're running the latest elasticsearch 6.2 on our production and therefore expect #27493 to be included. However, the behavior is different than described: the stuck restore snapshot operation never actually fails itself and requires manual deletion of the restored indices to release the snapshot operations lock (and elastic's red cluster state). Could this be another issue altogether?

FYI, I've just opened #29649, which describes a failure to create new snapshots on v6.2.3. Linking here in case it is related in some way.

@tlrx
Member

tlrx commented Apr 23, 2018

Thanks @redlus. I'm going to look closer at this soon.

@colings86 colings86 added the >bug label Apr 24, 2018
@tlrx
Member

tlrx commented May 2, 2018

@redlus Can you please provide the elasticsearch logs that contain the exception IndexShardRestoreFailedException[Failed to recover index]?

Also, according to #29649 your repository index was corrupted. Did you reproduce this behavior with a new repository?

@tlrx tlrx added feedback_needed and removed >bug labels May 2, 2018
@redlus
Author

redlus commented May 2, 2018

Hi @tlrx
Sadly, the logs from that timeframe are no longer available :/

@tlrx
Member

tlrx commented May 3, 2018

@redlus OK. I can't do much for now. I think that the corrupted index in #29649 just broke everything.

I can't reproduce this behavior locally and we have no log traces. I'm going to close this issue; if it happens again, please reopen it and add any logs and other useful information.

@tlrx tlrx closed this as completed May 3, 2018
@redlus
Author

redlus commented May 3, 2018

I understand, this probably is related to #29649. I believe it will not happen again after solving the aforementioned issue.

5 participants