-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent duplicate snapshot names after failure #70785
Prevent duplicate snapshot names after failure #70785
Conversation
Today we prevent creating a shard snapshot whose name matches one that already exists (elastic#29634) but it's possible to end up with duplicate snapshot names anyway if the `index-N` file is unreadable and the fallback which reads all the `snap-UUID.dat` files encounters two with the same name, perhaps due to some earlier failure. This results in writing a broken `index-N` file that contains duplicate keys, which completely breaks the repository. This commit fails the shard snapshot without writing the broken `index-N` file in this situation, on the assumption that the unreadability of the existing `index-N` file is only temporary.
Pinging @elastic/es-distributed (Team:Distributed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this improves the situation for a user. While previously snapshotting had been slow (and a warning was logged), this now just breaks snapshotting. Many users are not monitoring the health of their snapshots (which isn't ideal, but just the way it is), and not doing any snapshots anymore feels trappy to me (especially if this happens after a bugfix release upgrade). I don't think this is severe enough to be fixed in 6.8, and my preference is to keep the behavior as-is. Happy to discuss though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just lurking today :) but I'm +1 to this. It's not just that a warning was logged previously and that things were slow. It's also that we would create a broken metadata blob that would cause trouble when upgrading to the next version (since it's unreadable by 7.x and then deleting the shard snapshot requires deleting all snapshots for the index name).
-> Much better to stop snapshotting ASAP than to start creating broken metadata IMO.
Also, in the rare case that this is actually caused by some concurrent access issue (dangling data node etc.), it seems that this is slightly safer than what 6.8 did previously.
I do see Yannick's point, today we will at least limp onwards and complete the snapshot if the Basically there's a risk of snapshots not working no matter how we handle this in 6.x; if we don't trust users to verify their snapshots then we're kinda stuck. |
Hmm actually that contradicts "newer versions look to be completely intolerant of a broken or nonexistent |
Yes. But at least, if you don't write a new file in this case and throw here you won't start breaking existing snapshots (assuming the index-N is the only thing broken ... which in fact is quite likely) and the users starts seeing exceptions. If you limp on you just quietly get into an even messier situation in cases like you point out where the file might come back and such. IMO there aren't really any great solutions here but I'd rather have a solution that prevents broken repos that give trouble on upgrades and restores and doesn't break existing snapshots even at the cost of more visibly broken repos. But yea ... it's 6.8 and this isn't a new issue at all so not too strong of an opinion here :) |
On reflection I think I'd rather things stopped working at the point of an upgrade to 7.x, rather than just randomly if a file in the repo breaks in 6.8. At least at an upgrade there's a hope that someone will be looking at the cluster, and after the upgrade users should have better visibility into snapshot failures thanks to SLM. Therefore I'm closing this. |
Today we prevent creating a shard snapshot whose name matches one that
already exists (#29634) but it's possible to end up with duplicate
snapshot names anyway if the
index-N
file is unreadable and thefallback which reads all the
snap-UUID.dat
files encounters two withthe same name, perhaps due to some earlier failure. This results in
writing a broken
index-N
file that contains duplicate keys, whichcompletely breaks the repository.
This commit fails the shard snapshot without writing the broken
index-N
file in this situation, on the assumption that theunreadability of the existing
index-N
file is only temporary.