Prevent duplicate snapshot names after failure #70785

DaveCTurner · 2021-03-23T20:23:19Z

Today we prevent creating a shard snapshot whose name matches one that
already exists (#29634) but it's possible to end up with duplicate
snapshot names anyway if the index-N file is unreadable and the
fallback which reads all the snap-UUID.dat files encounters two with
the same name, perhaps due to some earlier failure. This results in
writing a broken index-N file that contains duplicate keys, which
completely breaks the repository.

This commit fails the shard snapshot without writing the broken
index-N file in this situation, on the assumption that the
unreadability of the existing index-N file is only temporary.

Today we prevent creating a shard snapshot whose name matches one that already exists (elastic#29634) but it's possible to end up with duplicate snapshot names anyway if the `index-N` file is unreadable and the fallback which reads all the `snap-UUID.dat` files encounters two with the same name, perhaps due to some earlier failure. This results in writing a broken `index-N` file that contains duplicate keys, which completely breaks the repository. This commit fails the shard snapshot without writing the broken `index-N` file in this situation, on the assumption that the unreadability of the existing `index-N` file is only temporary.

elasticmachine · 2021-03-23T20:23:22Z

Pinging @elastic/es-distributed (Team:Distributed)

ywelsch

I'm not sure that this improves the situation for a user. While previously snapshotting had been slow (and a warning was logged), this now just breaks snapshotting. Many users are not monitoring the health of their snapshots (which isn't ideal, but just the way it is), and not doing any snapshots anymore feels trappy to me (especially if this happens after a bugfix release upgrade). I don't think this is severe enough to be fixed in 6.8, and my preference is to keep the behavior as-is. Happy to discuss though.

original-brownbear

Just lurking today :) but I'm +1 to this. It's not just that a warning was logged previously and that things were slow. It's also that we would create a broken metadata blob that would cause trouble when upgrading to the next version (since it's unreadable by 7.x and then deleting the shard snapshot requires deleting all snapshots for the index name).
-> Much better to stop snapshotting ASAP than to start creating broken metadata IMO.
Also, in the rare case that this is actually caused by some concurrent access issue (dangling data node etc.), it seems that this is slightly safer than what 6.8 did previously.

DaveCTurner · 2021-03-24T08:20:23Z

I do see Yannick's point, today we will at least limp onwards and complete the snapshot if the index-N file is unreadable. We make a new, broken, index-N file in that case but we still manage to take restorable snapshots even after an upgrade. There's no good way to ensure that we write a non-broken index-N file, if we do nothing then the old index-N file might come back to life later which would effectively drop blobs from the new snapshots, and newer versions look to be completely intolerant of a broken or nonexistent index-N file.

Basically there's a risk of snapshots not working no matter how we handle this in 6.x; if we don't trust users to verify their snapshots then we're kinda stuck.

DaveCTurner · 2021-03-24T08:21:41Z

we still manage to take restorable snapshots even after an upgrade

Hmm actually that contradicts "newer versions look to be completely intolerant of a broken or nonexistent index-N file". Snapshots will continue to work in early 7.x versions but more recent ones will break after an upgrade.

original-brownbear · 2021-03-24T08:37:06Z

Basically there's a risk of snapshots not working no matter how we handle this in 6.x; if we don't trust users to verify their snapshots then we're kinda stuck.

Yes. But at least, if you don't write a new file in this case and throw here you won't start breaking existing snapshots (assuming the index-N is the only thing broken ... which in fact is quite likely) and the users starts seeing exceptions. If you limp on you just quietly get into an even messier situation in cases like you point out where the file might come back and such.

IMO there aren't really any great solutions here but I'd rather have a solution that prevents broken repos that give trouble on upgrades and restores and doesn't break existing snapshots even at the cost of more visibly broken repos. But yea ... it's 6.8 and this isn't a new issue at all so not too strong of an opinion here :)

DaveCTurner · 2021-03-24T09:28:02Z

On reflection I think I'd rather things stopped working at the point of an upgrade to 7.x, rather than just randomly if a file in the repo breaks in 6.8. At least at an upgrade there's a hope that someone will be looking at the cluster, and after the upgrade users should have better visibility into snapshot failures thanks to SLM. Therefore I'm closing this.

DaveCTurner added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v6.8.16 labels Mar 23, 2021

DaveCTurner requested a review from ywelsch March 23, 2021 20:23

elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Mar 23, 2021

ywelsch reviewed Mar 24, 2021

View reviewed changes

original-brownbear approved these changes Mar 24, 2021

View reviewed changes

DaveCTurner closed this Mar 24, 2021

DaveCTurner deleted the 2021-03-23-prevent-duplicate-snapshot-names-after-failure branch March 24, 2021 09:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent duplicate snapshot names after failure #70785

Prevent duplicate snapshot names after failure #70785

DaveCTurner commented Mar 23, 2021

elasticmachine commented Mar 23, 2021

ywelsch left a comment

original-brownbear left a comment

DaveCTurner commented Mar 24, 2021

DaveCTurner commented Mar 24, 2021

original-brownbear commented Mar 24, 2021

DaveCTurner commented Mar 24, 2021

Prevent duplicate snapshot names after failure #70785

Prevent duplicate snapshot names after failure #70785

Conversation

DaveCTurner commented Mar 23, 2021

elasticmachine commented Mar 23, 2021

ywelsch left a comment

Choose a reason for hiding this comment

original-brownbear left a comment

Choose a reason for hiding this comment

DaveCTurner commented Mar 24, 2021

DaveCTurner commented Mar 24, 2021

original-brownbear commented Mar 24, 2021

DaveCTurner commented Mar 24, 2021