Fix Google Cloud races #5081

Merged
merged 3 commits into hashicorp:master on Aug 14, 2018

Conversation

sethvargo
Contributor

While debugging #4915, I was able to get the backend into a weird state. I thought it wasn't reproducible, but I finally figured out how to reproduce it and found the sources of the bug.

First, a silly Golang mistake on my part - I was breaking out of a select, not the outer for loop. This wasn't the root cause of the bug, but it meant that it took longer for a leader to step down properly.
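
To illustrate that pitfall, here is a minimal, self-contained Go sketch (not the actual Vault code; the names are placeholders): a bare break inside a select only exits the select, so the surrounding for loop keeps running, while a labeled break exits the loop itself.

    package main

    import (
        "fmt"
        "time"
    )

    // holdLock pretends to renew a lock until stopCh is closed.
    func holdLock(stopCh <-chan struct{}) {
    OUTER:
        for {
            select {
            case <-stopCh:
                // A bare "break" here would only exit the select and the
                // loop would keep spinning; the label exits the for loop.
                break OUTER
            case <-time.After(10 * time.Millisecond):
                // pretend to renew the lock
            }
        }
        fmt.Println("stepped down")
    }

    func main() {
        stopCh := make(chan struct{})
        go func() {
            time.Sleep(50 * time.Millisecond)
            close(stopCh)
        }()
        holdLock(stopCh)
    }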

The real issue is a genuine race condition in the GCS backend. Let's say vault-0 is the current leader but voluntarily steps down. If it's stepping down at the same time that vault-1 is trying to acquire leadership, the lockfile may reach its max TTL. vault-1 writes its own lockfile, but vault-0, as part of its shutdown, deletes that lockfile. Lock contention then arises.

This PR uses GCS's metadata generation attributes to perform a check-and-delete operation to make sure that we only delete the lockfile when it's truly our lockfile.
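
As a rough sketch of the check-and-delete idea (not the actual backend code; the bucket/object names and the "identity" metadata key are placeholders), a generation-conditioned delete with the cloud.google.com/go/storage client looks roughly like this:

    package gcslock

    import (
        "context"

        "cloud.google.com/go/storage"
    )

    // deleteLockIfOwned deletes the lock object only if it still looks like
    // ours and has not been rewritten since we read it.
    func deleteLockIfOwned(ctx context.Context, client *storage.Client, bucket, lockKey, identity string) error {
        obj := client.Bucket(bucket).Object(lockKey)

        // Read the current lock and remember its generation.
        attrs, err := obj.Attrs(ctx)
        if err == storage.ErrObjectNotExist {
            return nil // no lock to delete
        }
        if err != nil {
            return err
        }

        // Only delete locks we wrote ourselves (placeholder metadata key).
        if attrs.Metadata["identity"] != identity {
            return nil
        }

        // Conditional delete: succeeds only if the object generation is still
        // the one we just read. If another node rewrote the lock in the
        // meantime, the delete fails instead of removing their lock.
        return obj.If(storage.Conditions{GenerationMatch: attrs.Generation}).Delete(ctx)
    }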

/cc @emilymye @briankassouf

@sethvargo
Contributor Author

@fastest963 FYI. I don't think your logs indicated that you hit this edge case, but just wanted to make you aware that it does exist.

@jameshartig

@sethvargo Thanks for the heads up!

return errwrap.Wrapf("failed to read lock for deletion: {{err}}", err)
}
if r != nil && r.Identity == l.identity {
ctx := context.Background()

Member

This ctx call is unnecessary, ctx from above is already Background.

Contributor Author

I've been bitten in the past during refactoring where it's unclear that a context is being used further down the method, so I'd prefer to keep the call to .get and the call to .Delete to use separate contexts that are explicit.
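
A toy sketch of that preference (placeholders only, not Vault's code): each call gets its own explicitly named context, so a later refactor can't silently reuse or drop a context created further up the method.

    package main

    import (
        "context"
        "fmt"
    )

    // readThenDelete gives the read and the delete their own named contexts.
    func readThenDelete(read func(context.Context) (string, error), del func(context.Context) error, identity string) error {
        readCtx := context.Background()
        owner, err := read(readCtx)
        if err != nil {
            return fmt.Errorf("failed to read lock for deletion: %w", err)
        }
        if owner == identity {
            deleteCtx := context.Background()
            return del(deleteCtx)
        }
        return nil
    }

    func main() {
        read := func(ctx context.Context) (string, error) { return "vault-0", nil }
        del := func(ctx context.Context) error { fmt.Println("deleted"); return nil }
        _ = readThenDelete(read, del, "vault-0")
    }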

@jefferai jefferai added this to the 0.10.5 milestone Aug 13, 2018
@jefferai
Member

Looks fine. If there isn't a reason to reinstantiate ctx, we might as well remove it.

@jefferai jefferai modified the milestones: 0.10.5 , 0.11 Aug 13, 2018
Previously we were deleting a lock without first checking if the lock we were deleting was our own. There existed a small period of time where vault-0 would lose leadership and vault-1 would get leadership. vault-0 would delete the lock key while vault-1 would write it. If vault-0 won, there'd be another leader election, etc.

This fixes the race by using a CAS operation instead.
@jefferai
Member

Thanks!

@jefferai jefferai merged commit 19f1a94 into hashicorp:master Aug 14, 2018
@sethvargo sethvargo deleted the sethvargo/gcs_race branch August 14, 2018 14:00
@LeenaSinghal0801

"The real issue is that GCS is a real-life race condition. Let's say vault-0 is the current leader, but voluntarily steps down. If it's stepping down at the same time that vault-1 is trying to acquire leadership, the lockfile may reach the max ttl. vault-1 writes its own lockfile, but vault-0 is shutting down and deletes said lockfile as part of its shutdown. Then lock contention arises."

@sethvargo In this case, there could be two leaders at that point in time, though one of them will give up. Will it auto-recover from this situation?

We had a similar case, but the behavior we see is that Vault just hangs on gaining primary status.
