Snapshot recovery failures lock the cluster in red state and prevent additional snapshot operations from running #29423
Comments
Pinging @elastic/es-distributed
Thanks @redlus for the detailed explanation. Here are some comments I have:
Just a note: when multiple clusters access the same repository, it is recommended that only one cluster creates snapshots, and that all other clusters register the repository as readonly.
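For reference, registering a repository as read-only on the secondary clusters looks roughly like this (repository name, type and bucket are placeholders, not taken from this issue):

```
PUT _snapshot/shared_backups
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "readonly": true
  }
}
```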
This was true before #27493: the restore operation got stuck and prevented any other snapshot operation. Since #27493 the restore operation should no longer hang, and any create/restore/delete operation should work as expected:
- creating a snapshot of any index is possible (but it will fail if the partially restored index is included in the snapshot; that index must be closed or deleted first),
- restoring a snapshot of any index is possible (but the partially restored index must be closed or deleted before trying to restore it again),
- and deleting a snapshot should work too.
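As a rough sketch of the close-and-retry path described above (index, repository and snapshot names are placeholders, not taken from this issue):

```
# Close (or DELETE) the partially restored index first
POST /my_index/_close

# Then retry the restore for that index only
POST _snapshot/my_repo/my_snapshot/_restore
{
  "indices": "my_index"
}
```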
I agree this would be much simpler. Restoring a shard from a snapshot uses the same internal mechanism as a "normal" shard recovery. For now it follows the same rules and stops trying to allocate a shard once it has failed more than 5 times. There is an open issue about automatically retrying failed allocations; see #24530.
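For completeness, the five-attempt limit mentioned here corresponds, as far as I understand, to the index.allocation.max_retries setting; raising it on the affected index (name is a placeholder) and asking for another retry would look roughly like:

```
# Allow more allocation attempts for the failed shard (default is 5)
PUT /my_index/_settings
{
  "index.allocation.max_retries": 10
}

# Then ask the cluster to retry previously failed allocations
POST _cluster/reroute?retry_failed=true
```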
It should not be the case since #27493 (merged into 6.2.0). I couldn't reproduce this behavior locally: I can create/restore/delete snapshots that do not include the partially restored index.
Thanks for your reply, @tlrx. Something here does not add up: we're running the latest Elasticsearch 6.2 in production and therefore expect #27493 to be included. However, the behavior differs from what you describe: the stuck restore operation never actually fails on its own, and manual deletion of the restored indices is required to release the snapshot operations lock (and clear the cluster's red state). Could this be another issue altogether? FYI, I've just opened #29649, which describes a failure to create new snapshots on v6.2.3. Linking it here in case it is related in some way.
Thanks @redlus. I'm going to look closer at this soon.
Hi @tlrx |
I understand; this is probably related to #29649, and I believe it will not happen again once that issue is resolved.
Hi,
When recovering an index from a snapshot, a failure to recover a single shard puts the cluster in a red state and holds the lock on snapshot operations until manual intervention is performed.
Elasticsearch version:
6.2.3
Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
A snapshot of a single index was produced from a live Elasticsearch 5.2 cluster and restored into a live Elasticsearch 6.2 cluster. The recovery failed for one shard with the following message, taken from _cluster/allocation/explain:
The explain API also returns this message for each data node:
Calling _cluster/reroute?retry_failed=true did not help.
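For context, the allocation check above corresponds to a call along these lines (index name and shard number are placeholders):

```
GET _cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}
```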
At this stage the restore process is stuck, leaving the cluster in a red state and preventing any create/restore/delete snapshot operations from running until a manual operation is performed on the cluster (namely "manually close or delete the index...in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard").
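The "force the allocation of an empty primary shard" option mentioned in that message would look roughly like this (index, shard and node names are placeholders; note that allocating an empty primary discards the shard's existing data):

```
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my_index",
        "shard": 0,
        "node": "data-node-1",
        "accept_data_loss": true
      }
    }
  ]
}
```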
First, if Elasticsearch knows which shard has failed, can't it automatically delete that specific shard and retry restoring it from the snapshot it has just loaded? This would be much faster and more reliable than waiting for a manual operation to delete the index and restore it again.
Second, if Elasticsearch can't retry restoring the shard (or if such a retry fails as well), it should release the lock on snapshot operations to prevent future requests from failing. The cluster state may still be red, but at least other operations could be performed in the background.
If the current behavior is intentional, I'd suggest adding a user-configurable setting to release the snapshot operations lock in this specific case.
Thanks!