storage: persist proposed lease transfers to disk #9523
Conversation
Lease transfers are a per-range operation; blocking the entire store on restart seems unnecessarily coarse. In particular, if we start transferring all our leases away on shutdown, then the worst case of having to wait for almost the full lease duration (9s) will be common. I'd rather store this at the replica level instead of the store level.
We could almost do it by modifying our persisted version of the lease, but that is defined as replicated state, so we need to introduce a new unreplicated range-id-local key.
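For concreteness, here is a minimal, self-contained sketch of what such an unreplicated range-ID-local key could look like; the prefix bytes, infix, suffix, and helper names below are hypothetical illustrations, not the actual cockroachdb/cockroach definitions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

var (
	// Hypothetical markers: unreplicated range-ID-local keys get their own
	// infix so they are excluded from replicated state (consistency checks,
	// snapshots) while still sorting next to the range's other local keys.
	localRangeIDPrefix            = []byte{0x01, 'i'} // hypothetical prefix bytes
	localRangeIDUnreplicatedInfix = []byte("u")       // hypothetical
	localLeaseTransferSuffix      = []byte("rltr")    // hypothetical 4-character suffix
)

// rangeIDUnreplicatedKey builds prefix / rangeID / infix / suffix, mirroring
// how range-ID-local keys are namespaced per range.
func rangeIDUnreplicatedKey(rangeID uint64, suffix []byte) []byte {
	key := append([]byte(nil), localRangeIDPrefix...)
	var id [8]byte
	binary.BigEndian.PutUint64(id[:], rangeID)
	key = append(key, id[:]...)
	key = append(key, localRangeIDUnreplicatedInfix...)
	return append(key, suffix...)
}

func main() {
	fmt.Printf("%q\n", rangeIDUnreplicatedKey(42, localLeaseTransferSuffix))
}
```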
```
@@ -80,6 +80,9 @@ var (
	// localStoreGossipSuffix stores gossip bootstrap metadata for this
	// store, updated any time new gossip hosts are encountered.
	localStoreGossipSuffix = []byte("goss")
	// localStoreSafeStartSuffix stores the minimum timestamp when it's safe for
	// the store to start serving after a restart.
	localStoreSafeStartSuffix = []byte("safe-start")
```
All suffixes are supposed to be 4 characters (see localSuffixLength above).
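For illustration only, a suffix that satisfies that convention could look like the snippet below; "sfst" is a made-up value, not necessarily what the PR should use.

```go
package main

import "fmt"

// localSuffixLength matches the convention the review refers to.
const localSuffixLength = 4

// Hypothetical 4-character replacement for "safe-start".
var localStoreSafeStartSuffix = []byte("sfst")

func main() {
	fmt.Println(len(localStoreSafeStartSuffix) == localSuffixLength) // true
}
```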
```
@@ -448,6 +448,9 @@ type Store struct {
		// raft.
		droppedPlaceholders int32
	}

	// safeStartMu serializes access to the engine's "safe start" key.
	safeStartMu syncutil.Mutex
```
Why does this get a new mutex?
```
log.Infof(ctx, "waiting for engines' safe start timestamp...")
time.Sleep(
	// Add 1 nanosecond because we're ignoring the logical part.
	time.Duration(safeStart.WallTime-s.clock.PhysicalNow()+1) * time.Nanosecond)
```
Don't call clock.{Physical,}Now() more than once here.
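A minimal sketch of the suggested shape, with a made-up clock type and stand-in variables in place of the real hlc clock and the persisted safe-start value:

```go
package main

import (
	"fmt"
	"time"
)

// clock stands in for the node's hybrid-logical clock; only the physical
// (wall-time) part matters here.
type clock struct{}

func (clock) PhysicalNow() int64 { return time.Now().UnixNano() }

func main() {
	var c clock
	// Stand-in for the persisted safe-start wall time read from the engine.
	safeStartWallTime := time.Now().Add(50 * time.Millisecond).UnixNano()

	// Read the clock exactly once and reuse the value, instead of calling
	// PhysicalNow() inside the sleep expression.
	now := c.PhysicalNow()
	// Add 1 nanosecond because the logical part of the timestamp is ignored.
	if wait := time.Duration(safeStartWallTime-now+1) * time.Nanosecond; wait > 0 {
		fmt.Printf("waiting %s for the safe start timestamp\n", wait)
		time.Sleep(wait)
	}
}
```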
@bdarnell yeah, this mechanism is coarse (and this is explained in the comments in this commit), but the alternative seemed complex. Not only would we need to store per-range (or per-lease) local keys, but we'd also need a way to not use the respective individual leases after a restart (and this mechanism should probably be different than what's used by the server for not using the transferred lease before a restart - the …). I was punting on doing anything better for the time when/if non-expiring leases become a reality. In the meantime, note that we only wait out the lease duration when a server restarts immediately (which should not be common, I think). Also, a server takes a while to shut down, so depending on how we do the lease transfers on shutdown, that might also take care of some of the wait duration. So, I dunno... If you don't feel strongly, I'd go ahead with this change to get these transfers to be correct and improve things later.
Storing this per-replica doesn't seem significantly more complicated than doing it per-store. You just need a new unreplicated member of ReplicaState, which can be loaded at startup just like the previous lease is. You'd check this lease transfer timestamp at the same time you check whether the current lease covers the current timestamp. I'd much rather keep this at the replica level than introduce something entirely new at the store level that interacts with leases.
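To make that concrete, here is a rough, self-contained sketch with made-up types (not the actual ReplicaState fields): an unreplicated "transfer proposed at" timestamp kept alongside the lease and consulted in the same place the lease's coverage is checked.

```go
package main

import (
	"fmt"
	"time"
)

// timestamp is a stand-in for hlc.Timestamp; only wall time is modeled.
type timestamp struct{ wallNanos int64 }

func (t timestamp) Less(o timestamp) bool { return t.wallNanos < o.wallNanos }

type lease struct{ start, expiration timestamp }

// replicaState sketches the idea: the persisted lease plus an unreplicated
// transfer timestamp, both loaded when the replica starts up.
type replicaState struct {
	lease             lease
	pendingTransferAt timestamp // zero means no transfer has been proposed
}

// leaseUsable performs both checks in one place: the lease must cover now,
// and no transfer may have been proposed at or before now.
func (s replicaState) leaseUsable(now timestamp) bool {
	covers := !now.Less(s.lease.start) && now.Less(s.lease.expiration)
	transferPending := s.pendingTransferAt.wallNanos != 0 && !now.Less(s.pendingTransferAt)
	return covers && !transferPending
}

func main() {
	now := timestamp{time.Now().UnixNano()}
	st := replicaState{
		lease: lease{
			start:      timestamp{0},
			expiration: timestamp{now.wallNanos + 9*time.Second.Nanoseconds()},
		},
		// A transfer was proposed just before the restart.
		pendingTransferAt: timestamp{now.wallNanos - 1},
	}
	fmt.Println(st.leaseUsable(now)) // false: the proposed transfer blocks lease reuse
}
```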
Reviewed 8 of 8 files at r1.

Comments from Reviewable
Reviewed 6 of 8 files at r1.

server/server.go, line 510 at r1 (raw file):
this wrapping should happen inside the closure so we know which store.

server/server.go, line 513 at r1 (raw file):
is it the engines' or the stores'?

server/server.go, line 517 at r1 (raw file):
not sure you need the common prefix here - the server isn't running yet, so there's nothing else in the logs.

storage/replica_range_lease.go, line 292 at r1 (raw file):
"error updating safe start"

Comments from Reviewable
ok, will rewrite

Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.

Comments from Reviewable
TestStatsInit was writing stats and asserting it can read them. But it was doing so using a MultiTestContext and racing with the stats written by the ranges. Moved the test to the engines test.
This is required by the next commits because we'll start being notified when the lease changes, and a duplicate notification can throw us off.
I need to use a TestCluster in a test that can't depend on Server, so it's time to expand the shim. Also expand the TestServerInterface to give access to the stores. This is also needed in said test, and downcasting to TestServer is not an option. The stores are exposed through a shim interface.
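As an illustration of the shim idea described above, here is a toy, self-contained sketch; the interface and type names (StoresShim, TestServerShim, the fake implementations) are hypothetical and not the actual testutils API.

```go
package main

import "fmt"

// StoresShim is a hypothetical narrow interface: tests get access to the
// stores without downcasting to a concrete server type.
type StoresShim interface {
	StoreCount() int
}

// TestServerShim exposes the stores behind the shim interface.
type TestServerShim interface {
	GetStores() StoresShim
}

// fakeStores and fakeServer are toy implementations just to show usage.
type fakeStores struct{ n int }

func (s fakeStores) StoreCount() int { return s.n }

type fakeServer struct{ stores fakeStores }

func (s fakeServer) GetStores() StoresShim { return s.stores }

func main() {
	var ts TestServerShim = fakeServer{stores: fakeStores{n: 3}}
	fmt.Println(ts.GetStores().StoreCount()) // 3
}
```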
... so we don't use those leases after a restart
fixes cockroachdb#7996
91d83b3 to 81eee1d
I've rewritten the thing. Now, each replica's lease transfer state is persisted to disk in such a way that, upon restart, it can be almost perfectly reconstructed. So there's no longer a difference between how we check whether we're in the middle of a transfer during regular running versus immediately after a restart. @spencerkimball, since you insisted on the idea of converging in-memory and on-disk state, I think you get the review. But say so if you're too busy.
CC @RaduBerinde for

Review status: 0 of 27 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.

keys/constants.go, line 85 at r1 (raw file):
Reviewed 2 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 11 of 12 files at r6, 12 of 12 files at r7.

keys/constants.go, line 118 at r7 (raw file):
Move this after "rlla" below. They're alphabetized by value to make it easier to spot conflicts.

storage/client_test.go, line 317 at r7 (raw file):
This new argument is unused.

storage/client_test.go, line 357 at r7 (raw file):
This could be a log.Fatal instead and then there'd be a timestamp automatically.

storage/log_test.go, line 151 at r7 (raw file):
It's unfortunate that all these explicit casts are necessary here in the storage package, where all the types are accessible.

storage/replica_command.go, line 1747 at r7 (raw file):
Yes, I think that's expected. But why do we need to handle this specially, instead of treating it as an "extension" that doesn't actually move the expiration forward?

storage/replica_range_lease.go, line 114 at r7 (raw file):
Why would the same request both succeed and fail? There are lease index and reproposal failures, but those don't make it far enough to call SignalLeaseApplied.

storage/replica_range_lease.go, line 131 at r7 (raw file):
I don't see any tests that show that this makes a difference. What kind of failures could reach this point while the old lease is still active, and what would be the consequences of handling extensions in the same way as transfers (i.e. conservatively allowing NextLease to remain)?

storage/replica_range_lease.go, line 145 at r7 (raw file):
We have the Replica.maybeSetCorrupt method (soon to move to Store).

storage/replica_range_lease.go, line 154 at r7 (raw file):
If this can easily return a

storage/replica_range_lease.go, line 162 at r7 (raw file):
What does "atomically" mean here? Under some lock? Which one?

storage/replica_range_lease.go, line 252 at r7 (raw file):
getWaitChan doesn't have a panic.

storage/replica_range_lease.go, line 309 at r7 (raw file):
Is this still true?

Comments from Reviewable
@andreimatei do you still want me to review this?
Ben seems to be on it. Probably enough.
So something came up when discussing with @tschottdorf, on the side of his comments in #6290 about using leases after restarts in a proposer-evaluated KV world. @bdarnell @spencerkimball, any opinions?

Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.

Comments from Reviewable
I would in general be in favor of the blunt approach. Complexity scares me. And with the epoch-based range leases, it becomes less and less likely that any ranges on a restarted node will be reusing existing leases anyway. But in general, if a node is restarted, none of its ranges should ever be allowed to use pre-existing leases until all committed log entries have been applied to the finite state machine. So the command queue issue which you're worried about should be a moot point.
Yeah, I think I'd abandon this change and instead make sure that we only use leases that were requested by the current process.
I'm not sure what you're trying to say here; we certainly don't enforce this today.
@bdarnell yes, we don't enforce this today. In more detail: if you restart with the lease still unexpired and serve a read immediately for a key which has a pending, committed change in the raft log which hasn't yet been applied, you'll have a problem. I'm not entirely sure the scenario is possible, but I think it could be, unless we apply all committed commands at raft group load time. The simpler approach we're now advocating, outlawing use of leases not from the current process, will fix this potential consistency vulnerability.
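A rough sketch of the rule being advocated here ("only use leases requested by the current process"), assuming leases carry a proposal time; the types and field names below are hypothetical stand-ins, not the eventual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// processStart is recorded once when the node boots; any lease proposed
// before it must have been requested by a previous incarnation of the process.
var processStart = time.Now()

// lease is a stand-in type; the real lease carries more state.
type lease struct {
	proposedAt time.Time
	expiration time.Time
}

// usable requires both that the lease still covers now and that it was
// proposed by this process, i.e. at or after processStart.
func (l lease) usable(now time.Time) bool {
	return now.Before(l.expiration) && !l.proposedAt.Before(processStart)
}

func main() {
	now := time.Now()
	// Unexpired lease left over from before the restart: not usable.
	old := lease{proposedAt: processStart.Add(-time.Minute), expiration: now.Add(5 * time.Second)}
	fmt.Println(old.usable(now)) // false
}
```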
I've opened a new PR (#10211) instead of this one, addressing the original motivation (don't reuse transferred leases) and more.
Unassigning due to not being able to review this past 11/6.
Closing in favor of #10420.
... so we don't use those leases after a restart
fixes #7996
cc @cockroachdb/stability @petermattis