Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent duplicate snapshot names after failure #70785

Conversation

DaveCTurner
Copy link
Contributor

Today we prevent creating a shard snapshot whose name matches one that
already exists (#29634) but it's possible to end up with duplicate
snapshot names anyway if the index-N file is unreadable and the
fallback which reads all the snap-UUID.dat files encounters two with
the same name, perhaps due to some earlier failure. This results in
writing a broken index-N file that contains duplicate keys, which
completely breaks the repository.

This commit fails the shard snapshot without writing the broken
index-N file in this situation, on the assumption that the
unreadability of the existing index-N file is only temporary.

Today we prevent creating a shard snapshot whose name matches one that
already exists (elastic#29634) but it's possible to end up with duplicate
snapshot names anyway if the `index-N` file is unreadable and the
fallback which reads all the `snap-UUID.dat` files encounters two with
the same name, perhaps due to some earlier failure. This results in
writing a broken `index-N` file that contains duplicate keys, which
completely breaks the repository.

This commit fails the shard snapshot without writing the broken
`index-N` file in this situation, on the assumption that the
unreadability of the existing `index-N` file is only temporary.
@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v6.8.16 labels Mar 23, 2021
@DaveCTurner DaveCTurner requested a review from ywelsch March 23, 2021 20:23
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Mar 23, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this improves the situation for a user. While previously snapshotting had been slow (and a warning was logged), this now just breaks snapshotting. Many users are not monitoring the health of their snapshots (which isn't ideal, but just the way it is), and not doing any snapshots anymore feels trappy to me (especially if this happens after a bugfix release upgrade). I don't think this is severe enough to be fixed in 6.8, and my preference is to keep the behavior as-is. Happy to discuss though.

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just lurking today :) but I'm +1 to this. It's not just that a warning was logged previously and that things were slow. It's also that we would create a broken metadata blob that would cause trouble when upgrading to the next version (since it's unreadable by 7.x and then deleting the shard snapshot requires deleting all snapshots for the index name).
-> Much better to stop snapshotting ASAP than to start creating broken metadata IMO.
Also, in the rare case that this is actually caused by some concurrent access issue (dangling data node etc.), it seems that this is slightly safer than what 6.8 did previously.

@DaveCTurner
Copy link
Contributor Author

I do see Yannick's point, today we will at least limp onwards and complete the snapshot if the index-N file is unreadable. We make a new, broken, index-N file in that case but we still manage to take restorable snapshots even after an upgrade. There's no good way to ensure that we write a non-broken index-N file, if we do nothing then the old index-N file might come back to life later which would effectively drop blobs from the new snapshots, and newer versions look to be completely intolerant of a broken or nonexistent index-N file.

Basically there's a risk of snapshots not working no matter how we handle this in 6.x; if we don't trust users to verify their snapshots then we're kinda stuck.

@DaveCTurner
Copy link
Contributor Author

we still manage to take restorable snapshots even after an upgrade

Hmm actually that contradicts "newer versions look to be completely intolerant of a broken or nonexistent index-N file". Snapshots will continue to work in early 7.x versions but more recent ones will break after an upgrade.

@original-brownbear
Copy link
Member

Basically there's a risk of snapshots not working no matter how we handle this in 6.x; if we don't trust users to verify their snapshots then we're kinda stuck.

Yes. But at least, if you don't write a new file in this case and throw here you won't start breaking existing snapshots (assuming the index-N is the only thing broken ... which in fact is quite likely) and the users starts seeing exceptions. If you limp on you just quietly get into an even messier situation in cases like you point out where the file might come back and such.

IMO there aren't really any great solutions here but I'd rather have a solution that prevents broken repos that give trouble on upgrades and restores and doesn't break existing snapshots even at the cost of more visibly broken repos. But yea ... it's 6.8 and this isn't a new issue at all so not too strong of an opinion here :)

@DaveCTurner
Copy link
Contributor Author

On reflection I think I'd rather things stopped working at the point of an upgrade to 7.x, rather than just randomly if a file in the repo breaks in 6.8. At least at an upgrade there's a hope that someone will be looking at the cluster, and after the upgrade users should have better visibility into snapshot failures thanks to SLM. Therefore I'm closing this.

@DaveCTurner DaveCTurner deleted the 2021-03-23-prevent-duplicate-snapshot-names-after-failure branch March 24, 2021 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v6.8.16
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants