What happens if two writers to object storage run at the same time? #11
Indeed, the losing writer might still overwrite the data tile of the winning writer. This is bad, but as long as the old data can be recovered, it's not fatal (unlike signing two incompatible views of the tree). For S3, we recommend running with object versioning enabled. I forgot to mention this in the README. For backends that support it (currently only Tigris), we use the If-Match header.
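For illustration, here is a rough sketch of what a conditional tile upload could look like with aws-sdk-go-v2, assuming its PutObjectInput exposes the IfNoneMatch field for conditional writes; uploadTile is a hypothetical helper, not Sunlight's actual Backend code:

```go
// Sketch only: a conditional tile upload that refuses to overwrite an
// existing object. Assumes aws-sdk-go-v2's PutObjectInput exposes the
// IfNoneMatch field for conditional writes; uploadTile is a hypothetical
// helper, not part of the Sunlight codebase.
package sketch

import (
	"bytes"
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func uploadTile(ctx context.Context, client *s3.Client, bucket, key string, data []byte) error {
	_, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   bytes.NewReader(data),
		// "If-None-Match: *" makes the PUT fail with 412 Precondition Failed
		// if the object already exists, so a losing writer cannot silently
		// overwrite a tile that the winning writer already uploaded.
		IfNoneMatch: aws.String("*"),
	})
	return err
}
```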
I have a hard time coming up with a scenario where two writers end up running at the same time, so for that rare case relying on manual recovery sounds reasonable. Perhaps more plausible is the single writer crashing and then later restarting. In that case the If-Match header and the required manual intervention could be annoying. I played around a bit with making the commit protocol atomic: first stage all the tiles, then update the lock, and then upload the staged tiles. If the checkpoint doesn't match the lock, the recovery mechanism (re-)uploads the staged files. This mechanism makes the commit protocol a bit more expensive and adds some complexity, but perhaps it is worth the peace of mind? One approach puts all tiles in a single tar file (jellevandenhooff#1) and one uses individual objects (jellevandenhooff#2).
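To make the proposed ordering concrete, here is a minimal sketch of that commit sequence (stage, update the lock, publish), using hypothetical Backend/LockBackend types rather than the actual Sunlight API:

```go
// Sketch of the staged commit protocol described above, with hypothetical
// Backend/LockBackend types; this is not Sunlight's actual API.
package sketch

import "context"

type Backend interface {
	Upload(ctx context.Context, key string, data []byte) error
}

type LockBackend interface {
	// Replace swaps old for next atomically (compare-and-swap) and fails
	// if the stored checkpoint no longer equals old.
	Replace(ctx context.Context, old, next []byte) error
}

type Tile struct {
	Key  string
	Data []byte
}

func commit(ctx context.Context, b Backend, lock LockBackend, old, checkpoint []byte, tiles []Tile) error {
	// 1. Stage the tiles under a path derived from the new checkpoint, so a
	//    retry after a crash writes byte-for-byte identical objects.
	for _, t := range tiles {
		if err := b.Upload(ctx, "staged/"+t.Key, t.Data); err != nil {
			return err
		}
	}
	// 2. Commit point: advance the lock with compare-and-swap.
	if err := lock.Replace(ctx, old, checkpoint); err != nil {
		return err
	}
	// 3. Publish the tiles at their final paths. If we crash here, recovery
	//    simply re-uploads the staged copies; every write is deterministic.
	for _, t := range tiles {
		if err := b.Upload(ctx, t.Key, t.Data); err != nil {
			return err
		}
	}
	return nil
}
```

The idea is that everything after step 2 is fully determined by the committed checkpoint, so if the lock is found to be ahead of the published files, recovery just repeats step 3.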
Automatic recovery from crashes is a design goal and something that I believe the current implementation provides. It's even moderately well tested, because sequencing fails on timeouts and those do tend to happen.
Can you describe a crash scenario that is not automatically recovered?
I can think of two problematic recovery scenarios:
#12: a sequencing failure after uploading a tile but before updating the lock can cause the log to get stuck.
#13: a sequencing failure with a delayed earlier write can cause the log to get stuck.
Thank you for elaborating, this is very useful! #12 is definitely something I overlooked while adding the If-Match check. #13 is something I worried about, and it's part of why the compare-and-swap mechanism exists, as noted in the design doc.
However, I also think it's fundamentally unfixable: if a request that timed out in the past is to be considered concurrent with all future ones, any object might roll back at any point in the future. The good news is that on a closer reading of the S3 Consistency Model I think it's actually not allowed to happen.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
In our case, the request either timed out before reaching S3, in which case it can't succeed, or after, in which case it has an earlier internal timestamp and will lose to A2. I think #13 is either impossible or incredibly rare, and difficult to protect against, so manual recovery is OK.
Glad you found it useful, happy to be thinking along!
I agree that #13 should be rare, but I am not sure about impossible.
A delay anywhere in the path between PutObject in the SDK and the final write internally in S3 could cause the behavior in the test, I think, and the timeouts and sequencing interval are small enough (5 seconds, 1 second) that I could see this race happening. Anecdotally, I have seen S3 latency do weird things, and the AWS documentation does mention retrying after latency of either 2 or 4 seconds.
I do not agree the problem is unsolvable: changing the commit protocol to not write the immutable files until the lock has been updated would prevent this problem, at the cost of some complexity and more S3 operations. Perhaps still worth it? I don't know how sensitive operators will be to the extra cost or complexity.
Alternatively, if manual recovery is an acceptable outcome, perhaps it'd be worthwhile to have an automated recovery tool? With a versioned bucket, recovery should be a well-defined operation, since the lock checkpoint specifies all matching hashes and all correct files should have been uploaded. It could check all versions of problematic tiles and suggest which ones to copy.
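As a sketch of what such a tool might do (hypothetical helper using aws-sdk-go-v2; how the expected hash is derived from the lock checkpoint is assumed and not shown):

```go
// Sketch of an automated recovery helper: in a versioned bucket, list every
// version of a problematic tile and report the one whose content matches the
// expected hash. wantHash is assumed to be derived from the lock checkpoint
// elsewhere; pagination and error classification are omitted.
package sketch

import (
	"context"
	"crypto/sha256"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func findMatchingVersion(ctx context.Context, c *s3.Client, bucket, key string, wantHash [32]byte) (string, error) {
	out, err := c.ListObjectVersions(ctx, &s3.ListObjectVersionsInput{
		Bucket: aws.String(bucket),
		Prefix: aws.String(key),
	})
	if err != nil {
		return "", err
	}
	for _, v := range out.Versions {
		obj, err := c.GetObject(ctx, &s3.GetObjectInput{
			Bucket:    aws.String(bucket),
			Key:       aws.String(key),
			VersionId: v.VersionId,
		})
		if err != nil {
			return "", err
		}
		data, err := io.ReadAll(obj.Body)
		obj.Body.Close()
		if err != nil {
			return "", err
		}
		if sha256.Sum256(data) == wantHash {
			// Suggest restoring this version (e.g. by copying it over the
			// latest version of the object).
			return aws.ToString(v.VersionId), nil
		}
	}
	return "", fmt.Errorf("no version of %s matches the expected hash", key)
}
```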
I don't think that's right; the test violates last-writer-wins semantics. The S3 docs point out correctly that usually you can't rely on them, because you don't know if a request had a delay before or after it reached S3, but in our case we know that either it reached S3 before the next request (in which case it will lose under last-writer-wins) or it never reached S3 (in which case it can't succeed). That's because we don't ever sequence concurrently, so we stop sending a request before starting a new one.
Indeed, the Backend implementation races a new upload after 75ms, and that helped tail latency quite a bit.
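For illustration, a hedged upload along those lines might look roughly like this (hypothetical helper, not the actual Backend implementation):

```go
// Sketch of a hedged upload: if the first attempt has not finished within
// 75ms, race a second identical attempt and take whichever completes first.
// The upload argument is a hypothetical closure; this is not the actual
// Backend implementation.
package sketch

import (
	"context"
	"time"
)

func hedgedUpload(ctx context.Context, upload func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel the losing attempt once we have a result

	results := make(chan error, 2)
	attempt := func() { results <- upload(ctx) }

	go attempt()
	select {
	case err := <-results:
		return err
	case <-time.After(75 * time.Millisecond):
		// First attempt is slow: start a second, identical upload.
		go attempt()
	}
	// Both attempts write the same bytes, so whichever finishes first wins.
	return <-results
}
```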
I assume you mean updating the lock first (A) and then uploading the tiles (B)?
If a crash happens between A and B we'll have to roll back the LockBackend, which means an inconsistent tree was signed, even if not published, which I am uncomfortable about. I also don't think it solves the problem (assuming it exists). Imagine a crash that happens due to a timeout during B. The old tile upload becomes a zombie that happens to succeed 10s later. The writer restarts, discards the new checkpoint in the LockBackend (because it doesn't have all the tiles it needs to resume from there), sequences a new set of tiles, and then the zombie succeeds and overwrites them.
How do you know it reached S3? What if there was a delay on the network?
Ah! Sorry, I meant something like jellevandenhooff#1 or jellevandenhooff#2:
By staging the tiles and then updating the checkpoint, all uploads afterwards are deterministic and can be safely retried. Any delayed or zombie writes are perfectly fine: all writes to a backend file are guaranteed to be the same. The signing happens just the same as in the code today, only after uploading the tiles and before publishing the checkpoint.
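As a sketch of why retries are harmless under this scheme (hypothetical Store/StagedTile types, not the Sunlight API):

```go
// Sketch of why retries are harmless once the checkpoint has been committed:
// every remaining upload is fully determined by the staged content, so the
// publish step can be re-run after any crash, or even overlap with a delayed
// "zombie" write from a previous run. Store and StagedTile are hypothetical
// types, not Sunlight's actual API.
package sketch

import "context"

type Store interface {
	Upload(ctx context.Context, key string, data []byte) error
}

type StagedTile struct {
	Key  string // final path of the tile
	Data []byte // bytes read back from the staging area
}

// finishCommit is idempotent: calling it once, twice, or concurrently with a
// leftover upload from a crashed writer always produces the same objects.
func finishCommit(ctx context.Context, s Store, staged []StagedTile) error {
	for _, t := range staged {
		if err := s.Upload(ctx, t.Key, t.Data); err != nil {
			return err
		}
	}
	return nil
}
```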
Based on a design by @jellevandenhooff at jellevandenhooff#1. Fixes #11 Co-authored-by: Jelle van den Hooff <[email protected]>
What happens if two sunlight instances writing to object storage run at the same time? Only one instance will be able to advance the true head of the tree using the signed tree head compare-and-swap. But what happens to the (partial) tiles they write? It seems to me that two writers might overwrite each other's files.