Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What happens if two writers to object storage run at the same time? #11

Closed
jellevandenhooff opened this issue Mar 15, 2024 · 8 comments
Closed

Comments

@jellevandenhooff
Copy link
Contributor

What happens if two sunlight instances write to object storage run at the same time? Only one instance will be able to advance the true head of the tree using the signed tree head compare-and-swap. But what happens to the (partial) tiles they write? It seems to me that two writers might overwrite each others’ files.

@FiloSottile
Copy link
Owner

Indeed, the losing writer might still overwrite the data tile of the winning writer. This is bad, but as long as the old data can be recovered, it's not fatal (unlike signing two incompatible views of the tree).

For S3, we recommend running with object versioning enabled. I forgot to mention this in the README. For backends that support it (currently only Tigris), we use If-Match headers to avoid overwriting. See 1046953.

@jellevandenhooff
Copy link
Contributor Author

jellevandenhooff commented Mar 16, 2024

I have a hard time coming up with a scenario where two writers end up running at the same time, so for that rare case relying on manual recovery sounds reasonable.

Perhaps more plausible is the single writer crashing and then later restarting. In that case the If-Match header and the required manual intervention could be annoying.

I played around a bit with making the commit protocol atomic: First stage all the tiles, then update the lock, and then upload the staged tiles. If the checkpoint doesn't match the lock, the recovery mechanism (re-)uploads the staged files. This mechanism makes the commit protocol a bit more expensive and adds some complexity, but perhaps it is worth the peace of mind?

One approach puts all tiles in a single tar files (jellevandenhooff#1) and one uses individual objects (jellevandenhooff#2).

@FiloSottile
Copy link
Owner

FiloSottile commented Mar 16, 2024 via email

@jellevandenhooff
Copy link
Contributor Author

jellevandenhooff commented Mar 17, 2024

I can think of two problematic recovery scenarios:

#12: A sequencing failure after uploading a tile but before updating the lock can cause the log to get stuck:

  • Sequencing starts.
  • The sequencer writer tile file "A".
  • The sequencer fails to update the lock.
  • Sequencing restarts.
  • The sequencer tries to write tile file "A".
  • The sequencer fails because tile file "A" already exists.

#13: A sequencing failure with a delayed earlier write can cause the log to get stuck:

  • Sequencing starts.
  • The sequencer starts writing tile file "A" with contents "A1" (asynchronously).
  • The sequencer times out.
  • Sequencing restarts.
  • The sequencer tries to write tile file "A" with contents "A2" and succeeds.
  • The earlier write to "A" succeeds with contents "A1" replacing "A2".
  • The sequencer updates the lock to a tree with hash expecting "A" to contain "A2".
  • The log is in a bad state because "A" contains "A1" but should contain "A2".

@FiloSottile
Copy link
Owner

Thank you for elaborating, this is very useful!

#12 is definitely something I overlooked while adding the If-Match, thank you.

#13 is something I worried about and it's part of why the compare-and-swap mechanism exists. From the design doc:

The compare-and-swap backend also helps relax the criticality of the object storage consistency guarantees. For example, Amazon recommends hedging requests to reduce tail latency: if a request is taking too long, fire off a second identical request in parallel, and cancel one when the other succeeds. This is fine if cancellation is effective, but what if it isn’t? Then the “losing” request might succeed later, in theory even after a subsequent checkpoint was uploaded, rolling it back. (Or, more mundanely, operational error might lead to recovering object storage from a backup.) If the node were to restart during this rolled back state, the tree would fork if it didn’t have the compare-and-swap backend to load the correct STH from.

However, I also think it's fundamentally unfixable: if a request that timed out in the past is to be considered concurrent with all future ones, any object might rollback at any point in the future.

The good news is that at a closer reading of the S3 Consistency Model I think it's actually not allowed to happen.

Amazon S3 internally uses last-writer-wins semantics to determine which write takes precedence. However, the order in which Amazon S3 receives the requests and the order in which applications receive acknowledgments cannot be predicted because of various factors, such as network latency.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel

In our case, the request either timed out before reaching S3, in which case it can't succeed, or after, in which case it has an earlier internal timestamp and will lose to A2.

I think #13 is either impossible or incredibly rare, and difficult to protect against, so manual recovery is ok.

@jellevandenhooff
Copy link
Contributor Author

jellevandenhooff commented Mar 17, 2024 via email

@FiloSottile
Copy link
Owner

A delay anywhere in the path between PutObject in the SDK and the final write internally in S3 could cause the behavior in the test, I think, and the timeouts and sequencing interval are small enough (5 seconds, 1 second) that I could see this race happening.

I don't think that's right, the test violates last-writer-wins semantics.

The S3 docs point out correctly that usually you can't rely on them because you don't know if a request had a delay before or after it reached S3, but in out case we know that either it reached S3 before the next request (in which case it will lose last-writer-wins) or it never reached S3 (in which case it can't succeed). That's because we don't ever sequence concurrently, so we stop sending a request before starting a new one.

Anecdotally, I have seen S3 latency do weird things, and the AWS documentation does mention retrying after latency of either 2 or 4 seconds.

Indeed, the Backend implementation races a new upload after 75ms, and that helped tail latency quite a bit.

I do not agree the problem is unsolvable: Changing the commit protocol to not write the immutable files until the lock has been updated would prevent this problem, at the cost of some complexity and more S3 operations. Perhaps still worth it? I don’t know how sensitive to the extra cost or complexity operators will be.

I assume you mean

  • A. update the checkpoint in the LockBackend
  • B. upload the tiles to the Backend
  • C. upload the checkpoint to the Backend

If a crash happens between A and B we'll have to rollback the LockBackend, which means a inconsistent tree was signed, even if not published, which I am uncomfortable about.

I also don't think it solves the problem (assuming it exists). Imagine a crash that happens due to a timeout during B. The old tile upload becomes a zombie that happens to succeed 10s later. The writer restarts, discards the new checkpoint in the LockBackend (because it doesn't have all the tiles it needs to resume from there), sequences a new set of tiles, and then the zombie succeeds and overwrites them.

@jellevandenhooff
Copy link
Contributor Author

The S3 docs point out correctly that usually you can't rely on them because you don't know if a request had a delay before or after it reached S3, but in out case we know that either it reached S3 before the next request (in which case it will lose last-writer-wins) or it never reached S3 (in which case it can't succeed). That's because we don't ever sequence concurrently, so we stop sending a request before starting a new one.

How do you know it reached S3? What if there was a delay on the network?

I assume you mean

  • A. update the checkpoint in the LockBackend
  • B. upload the tiles to the Backend
  • C. upload the checkpoint to the Backend
    If a crash happens between A and B we'll have to rollback the LockBackend, which means a inconsistent tree was signed, even if not published, which I am uncomfortable about.

Ah! Sorry, I meant something like jellevandenhooff#1 or jellevandenhooff#2:

  1. Upload the tiles to a unique staging area.
  2. Update the checkpoint in the lock backend.
  3. Upload the staged tiles to the backend.
  4. Upload the checkpoint to the backend.

By staging the tiles, and then updating the checkpoint, all uploads afterwards are deterministic and can be safely retried. Any delayed or zombie writes are perfectly fine: All writes to a backend file are guaranteed to be the same. The signing happens just the same as in the code today, only after uploading the tiles and before publishing the checkpoint.

FiloSottile added a commit that referenced this issue Aug 4, 2024
FiloSottile added a commit that referenced this issue Aug 4, 2024
FiloSottile added a commit that referenced this issue Aug 4, 2024
FiloSottile added a commit that referenced this issue Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants