Snapshot recovery failures lock the cluster in red state and prevent additional snapshot operations from running #29423
Comments
Pinging @elastic/es-distributed
Thanks @redlus for the detailed explanation. Here are some comments I have:
Just a note: when multiple clusters access the same repository, it is recommended that only one cluster creates snapshots, and that all other clusters register the repository as readonly.
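For reference, registering a repository as read-only on the secondary clusters looks roughly like this (repository name, type and bucket are placeholders, not taken from this issue):

```
PUT _snapshot/shared_backups
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "readonly": true
  }
}
```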
This was true before #27493: the restore operation got stuck and prevented any other snapshot operation. Since #27493 the restore operation should no longer hang, and any create/restore/delete operation should work as expected:
- creating a snapshot of any index is possible (but it will fail if the partially restored index is included in the snapshot; that index must be closed or deleted first),
- restoring a snapshot of any index is possible (but the partially restored index must be closed or deleted before trying to restore it again),
- and deleting a snapshot should work too.
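As a rough sketch of the close-and-retry path described above (index, repository and snapshot names are placeholders, not taken from this issue):

```
# Close (or DELETE) the partially restored index first
POST /my_index/_close

# Then retry the restore for that index only
POST _snapshot/my_repo/my_snapshot/_restore
{
  "indices": "my_index"
}
```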
I agree this would be much simpler. Restoring a shard from a snapshot uses the same internal mechanism as a "normal" shard recovery. For now it follows the same rules and stops trying to allocate a shard once it has failed more than 5 times. There is an open issue about automatically retrying failed allocations; see #24530.
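For completeness, the five-attempt limit mentioned here corresponds, as far as I understand, to the index.allocation.max_retries setting; raising it on the affected index (name is a placeholder) and asking for another retry would look roughly like:

```
# Allow more allocation attempts for the failed shard (default is 5)
PUT /my_index/_settings
{
  "index.allocation.max_retries": 10
}

# Then ask the cluster to retry previously failed allocations
POST _cluster/reroute?retry_failed=true
```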
It should not be the case since #27493 (merged into 6.2.0). I couldn't reproduce this behavior locally: I can create/restore/delete snapshots that do not include the partially restored index.
Thanks for your reply, @tlrx. Something here does not add up: we're running the latest Elasticsearch 6.2 in production and therefore expect #27493 to be included. However, the behavior differs from what you describe: the stuck restore operation never actually fails on its own, and manual deletion of the restored indices is required to release the snapshot operations lock (and clear the cluster's red state). Could this be another issue altogether? FYI, I've just opened #29649, which describes a failure to create new snapshots on v6.2.3. Linking it here in case it is related in some way.
Thanks @redlus. I'm going to look closer at this soon.
Hi @tlrx |
I understand; this is probably related to #29649, and I believe it will not happen again once that issue is resolved.
Hi,
When recovering an index from a snapshot, a failure to recover a single shard puts the cluster in a red state and holds the lock on snapshot operations until manual intervention is performed.
Elasticsearch version:
6.2.3
Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
A snapshot of a single index was produced from a live Elasticsearch 5.2 cluster and restored into a live Elasticsearch 6.2 cluster. The recovery failed for one shard with the following message, taken from _cluster/allocation/explain:
The explain API also returns this message for each data node:
Calling _cluster/reroute?retry_failed=true did not help.
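For context, the allocation check above corresponds to a call along these lines (index name and shard number are placeholders):

```
GET _cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}
```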
At this stage the restore process is stuck, leaving the cluster in a red state and preventing any create/restore/delete snapshot operations from running until a manual operation is performed on the cluster (namely "manually close or delete the index...in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard").
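The "force the allocation of an empty primary shard" option mentioned in that message would look roughly like this (index, shard and node names are placeholders; note that allocating an empty primary discards the shard's existing data):

```
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my_index",
        "shard": 0,
        "node": "data-node-1",
        "accept_data_loss": true
      }
    }
  ]
}
```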
First, if Elasticsearch knows which shard has failed, can't it automatically delete that specific shard and retry restoring it from the snapshot it has just loaded? This would be much faster and more reliable than waiting for a manual operation to delete the index and restore it again.
Second, if Elasticsearch can't retry restoring the shard (or if such a retry fails as well), it should release the lock on snapshot operations to prevent future requests from failing. The cluster state may still be red, but at least other operations could be performed in the background.
If the current behavior is intentional, I'd suggest adding a user-configurable setting to release the snapshot operations lock in this specific case.
Thanks!