Fix Google Cloud races #5081
Conversation
@fastest963 FYI. I don't think your logs indicated that you hit this edge case, but just wanted to make you aware that it does exist.
@sethvargo Thanks for the heads up!
	return errwrap.Wrapf("failed to read lock for deletion: {{err}}", err)
}
if r != nil && r.Identity == l.identity {
	ctx := context.Background()
This `ctx` call is unnecessary; the `ctx` from above is already `Background`.
I've been bitten in the past during refactoring when it's unclear that a context is being used further down the method, so I'd prefer that the call to `.get` and the call to `.Delete` use separate, explicit contexts.
Looks fine; if there isn't a reason to reinstantiate `ctx`, you might as well remove it.
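To make the style under discussion concrete, here is a minimal sketch of giving each storage call its own explicitly constructed context. The `Lock` type, `lockRecord` shape, and `get` helper are illustrative stand-ins, not Vault's actual types or code.

```go
package gcslock

import (
	"context"
	"fmt"

	"cloud.google.com/go/storage"
)

// lockRecord is a hypothetical decoded lock entry.
type lockRecord struct {
	Identity string
}

// Lock is a hypothetical HA lock backed by a GCS object.
type Lock struct {
	identity string
	object   *storage.ObjectHandle
}

// get is a hypothetical helper that reads and decodes the current lock record.
func (l *Lock) get(ctx context.Context) (*lockRecord, error) {
	// ... read l.object with ctx and decode it ...
	return nil, nil
}

// cleanup deletes the lock only if we still own it. Each call gets its own
// explicit context so a later refactor can see at a glance which context
// each operation uses.
func (l *Lock) cleanup() error {
	getCtx := context.Background()
	r, err := l.get(getCtx)
	if err != nil {
		return fmt.Errorf("failed to read lock for deletion: %w", err)
	}

	if r != nil && r.Identity == l.identity {
		delCtx := context.Background()
		if err := l.object.Delete(delCtx); err != nil {
			return fmt.Errorf("failed to delete lock: %w", err)
		}
	}
	return nil
}
```

Whether to reuse one `ctx` or construct one per call is purely stylistic here; both are `context.Background()` with no cancellation attached.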
Previously we were deleting a lock without first checking whether the lock we were deleting was our own. There was a small window of time in which vault-0 would lose leadership and vault-1 would gain it: vault-0 would delete the lock key while vault-1 was writing it. If vault-0 won, there'd be another leader election, and so on. This fixes the race by using a CAS operation instead.
Thanks!
"The real issue is a real-life race condition with GCS. Let's say vault-0 is the current leader, but voluntarily steps down. If it's stepping down at the same time that vault-1 is trying to acquire leadership, the lockfile may reach the max ttl. vault-1 writes its own lockfile, but vault-0 is shutting down and deletes said lockfile as part of its shutdown. Then lock contention arises."
@sethvargo In this case, there could be two leaders at that point in time, though one of them will give up. Will it auto-recover from this situation? We had a similar case, but the behavior we saw was that vault just hangs on gaining primary status.
While debugging #4915, I was able to get the backend into a weird state. I thought it was not reproducible, but I finally figured out how to reproduce it and found the sources of the bug.
First, a silly Golang mistake on my part - I was breaking out of a `select`, not the outer `for` loop. This wasn't the root cause of the bug, but it meant that it took longer for a leader to step down properly.

The real issue is a real-life race condition with GCS. Let's say vault-0 is the current leader, but voluntarily steps down. If it's stepping down at the same time that vault-1 is trying to acquire leadership, the lockfile may reach the max ttl. vault-1 writes its own lockfile, but vault-0 is shutting down and deletes said lockfile as part of its shutdown. Then lock contention arises.
This PR uses GCS's metadata generation attributes to perform a check-and-delete operation to make sure that we only delete the lockfile when it's truly our lockfile.
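As a rough illustration of the check-and-delete idea, here is a hedged sketch using the `cloud.google.com/go/storage` client's generation preconditions. The bucket name, object key, and `identity` metadata key are illustrative assumptions, not Vault's actual values or code.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
)

// deleteLockIfOwned deletes the lock object only if it still belongs to us,
// using a generation-match precondition as the compare-and-swap.
func deleteLockIfOwned(ctx context.Context, client *storage.Client, bucket, key, identity string) error {
	obj := client.Bucket(bucket).Object(key)

	// Read the current lock object and remember its generation.
	attrs, err := obj.Attrs(ctx)
	if err == storage.ErrObjectNotExist {
		return nil // nothing to delete
	}
	if err != nil {
		return fmt.Errorf("failed to read lock for deletion: %w", err)
	}

	// Only delete if the lock identity is ours.
	if attrs.Metadata["identity"] != identity {
		return nil
	}

	// Conditional delete: the request fails with a precondition error if
	// another node re-wrote the lock (bumping its generation) after we read it.
	if err := obj.If(storage.Conditions{GenerationMatch: attrs.Generation}).Delete(ctx); err != nil {
		return fmt.Errorf("failed to delete lock: %w", err)
	}
	return nil
}

func main() {
	ctx := context.Background()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	if err := deleteLockIfOwned(ctx, client, "example-vault-ha-bucket", "core/lock", "vault-0"); err != nil {
		log.Fatal(err)
	}
}
```

In the race described above, vault-1's rewrite of the lock bumps the object's generation, so a delete conditioned on the stale generation fails instead of silently removing vault-1's lock.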
/cc @emilymye @briankassouf