Fix Google Cloud races #5081
Conversation
@fastest963 FYI. I don't think your logs indicated that you hit this edge case, but just wanted to make you aware that it does exist.
@sethvargo Thanks for the heads up!
	return errwrap.Wrapf("failed to read lock for deletion: {{err}}", err)
}
if r != nil && r.Identity == l.identity {
	ctx := context.Background()
This `ctx` call is unnecessary; the `ctx` from above is already `Background`.
I've been bitten in the past during refactoring when it's unclear that a context is being used further down the method, so I'd prefer that the call to `.get` and the call to `.Delete` use separate, explicit contexts.
Looks fine; if there isn't a reason to reinstantiate `ctx`, you might as well remove it.
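To make the style under discussion concrete, here is a minimal sketch of giving each storage call its own explicitly constructed context. The `Lock` type, `lockRecord` shape, and `get` helper are illustrative stand-ins, not Vault's actual types or code.

```go
package gcslock

import (
	"context"
	"fmt"

	"cloud.google.com/go/storage"
)

// lockRecord is a hypothetical decoded lock entry.
type lockRecord struct {
	Identity string
}

// Lock is a hypothetical HA lock backed by a GCS object.
type Lock struct {
	identity string
	object   *storage.ObjectHandle
}

// get is a hypothetical helper that reads and decodes the current lock record.
func (l *Lock) get(ctx context.Context) (*lockRecord, error) {
	// ... read l.object with ctx and decode it ...
	return nil, nil
}

// cleanup deletes the lock only if we still own it. Each call gets its own
// explicit context so a later refactor can see at a glance which context
// each operation uses.
func (l *Lock) cleanup() error {
	getCtx := context.Background()
	r, err := l.get(getCtx)
	if err != nil {
		return fmt.Errorf("failed to read lock for deletion: %w", err)
	}

	if r != nil && r.Identity == l.identity {
		delCtx := context.Background()
		if err := l.object.Delete(delCtx); err != nil {
			return fmt.Errorf("failed to delete lock: %w", err)
		}
	}
	return nil
}
```

Whether to reuse one `ctx` or construct one per call is purely stylistic here; both are `context.Background()` with no cancellation attached.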
Previously we were deleting a lock without first checking whether the lock we were deleting was our own. There was a small window of time in which vault-0 would lose leadership and vault-1 would gain it: vault-0 would delete the lock key while vault-1 was writing it. If vault-0 won, there'd be another leader election, and so on. This fixes the race by using a CAS operation instead.
Thanks!
"The real issue is a real-life race condition with GCS. Let's say vault-0 is the current leader, but voluntarily steps down. If it's stepping down at the same time that vault-1 is trying to acquire leadership, the lockfile may reach the max ttl. vault-1 writes its own lockfile, but vault-0 is shutting down and deletes said lockfile as part of its shutdown. Then lock contention arises."
@sethvargo In this case, there could be two leaders at that point in time, though one of them will give up. Will it auto-recover from this situation? We had a similar case, but the behavior we saw was that vault just hangs on gaining primary status.
While debugging #4915, I was able to get the backend into a weird state. I thought it was not reproducible, but I finally figured out how to reproduce it and found the sources of the bug.
First, a silly Golang mistake on my part - I was breaking out of a `select`, not the outer `for` loop. This wasn't the root cause of the bug, but it meant that it took longer for a leader to step down properly.

The real issue is a real-life race condition with GCS. Let's say vault-0 is the current leader, but voluntarily steps down. If it's stepping down at the same time that vault-1 is trying to acquire leadership, the lockfile may reach the max ttl. vault-1 writes its own lockfile, but vault-0 is shutting down and deletes said lockfile as part of its shutdown. Then lock contention arises.
This PR uses GCS's metadata generation attributes to perform a check-and-delete operation to make sure that we only delete the lockfile when it's truly our lockfile.
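As a rough illustration of the check-and-delete idea, here is a hedged sketch using the `cloud.google.com/go/storage` client's generation preconditions. The bucket name, object key, and `identity` metadata key are illustrative assumptions, not Vault's actual values or code.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
)

// deleteLockIfOwned deletes the lock object only if it still belongs to us,
// using a generation-match precondition as the compare-and-swap.
func deleteLockIfOwned(ctx context.Context, client *storage.Client, bucket, key, identity string) error {
	obj := client.Bucket(bucket).Object(key)

	// Read the current lock object and remember its generation.
	attrs, err := obj.Attrs(ctx)
	if err == storage.ErrObjectNotExist {
		return nil // nothing to delete
	}
	if err != nil {
		return fmt.Errorf("failed to read lock for deletion: %w", err)
	}

	// Only delete if the lock identity is ours.
	if attrs.Metadata["identity"] != identity {
		return nil
	}

	// Conditional delete: the request fails with a precondition error if
	// another node re-wrote the lock (bumping its generation) after we read it.
	if err := obj.If(storage.Conditions{GenerationMatch: attrs.Generation}).Delete(ctx); err != nil {
		return fmt.Errorf("failed to delete lock: %w", err)
	}
	return nil
}

func main() {
	ctx := context.Background()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	if err := deleteLockIfOwned(ctx, client, "example-vault-ha-bucket", "core/lock", "vault-0"); err != nil {
		log.Fatal(err)
	}
}
```

In the race described above, vault-1's rewrite of the lock bumps the object's generation, so a delete conditioned on the stale generation fails instead of silently removing vault-1's lock.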
/cc @emilymye @briankassouf