
storage: persist proposed lease transfers to disk #9523

Closed

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Sep 24, 2016

... so we don't use those leases after a restart

fixes #7996

cc @cockroachdb/stability @petermattis


This change is Reviewable

Contributor

@bdarnell bdarnell left a comment

Lease transfers are a per-range operation; blocking the entire store on restart seems unnecessarily coarse (especially if we start transferring all our leases away on shutdown, in which case having to wait out almost the full lease duration (9s) would be the common case). I'd rather store this at the replica level instead of the store level.

We could almost do it by modifying our persisted version of the lease, but that is defined as replicated state, so we need to introduce a new unreplicated range-id-local key.
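
For illustration, a minimal sketch of what such an unreplicated range-ID-local key could look like. The suffix value, the function name, and the literal prefix/marker bytes are placeholders for this sketch, not the actual CockroachDB key encoding:

    package keyssketch

    import "encoding/binary"

    // localRangeLeaseTransferSuffix is a hypothetical suffix for an unreplicated
    // range-ID-local key recording a proposed lease transfer.
    var localRangeLeaseTransferSuffix = []byte("rltr")

    // leaseTransferKey sketches the shape of such a key: a range-ID-local
    // prefix, the range ID, an "unreplicated" marker, and the suffix.
    func leaseTransferKey(rangeID uint64) []byte {
        key := append([]byte(nil), "\x01i"...) // stand-in for the range-ID-local prefix
        var buf [8]byte
        binary.BigEndian.PutUint64(buf[:], rangeID)
        key = append(key, buf[:]...)
        key = append(key, 'u') // stand-in marker separating unreplicated keys
        return append(key, localRangeLeaseTransferSuffix...)
    }

Being unreplicated, writes to such a key go straight to the local engine and never through raft, so they cannot diverge the replicated state machine.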

@@ -80,6 +80,9 @@ var (
// localStoreGossipSuffix stores gossip bootstrap metadata for this
// store, updated any time new gossip hosts are encountered.
localStoreGossipSuffix = []byte("goss")
// localStoreSafeStartSuffix stores the minimum timestamp when it's safe for
// the store to start serving after a restart.
localStoreSafeStartSuffix = []byte("safe-start")
@bdarnell
Contributor

All suffixes are supposed to be 4 characters (see localSuffixLength above).
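
For example, a hypothetical four-character replacement (the value is illustrative, not taken from the PR):

    // localStoreSafeStartSuffix stores the minimum timestamp when it's safe
    // for the store to start serving after a restart.
    localStoreSafeStartSuffix = []byte("sfst")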

@@ -448,6 +448,9 @@ type Store struct {
// raft.
droppedPlaceholders int32
}

// safeStartMu serializes access to the engine's "safe start" key.
safeStartMu syncutil.Mutex
@bdarnell
Contributor

Why does this get a new mutex?

log.Infof(ctx, "waiting for engines' safe start timestamp...")
time.Sleep(
// Add 1 nanosecond because we're ignoring the logical part.
time.Duration(safeStart.WallTime-s.clock.PhysicalNow()+1) * time.Nanosecond)
@bdarnell
Contributor

Don't call clock.{Physical,}Now() more than once here.
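
A minimal sketch of the suggested fix, reusing a single clock reading and assuming the surrounding s, ctx, and safeStart from the quoted code:

    // Read the physical clock once and use the same reading for both the
    // comparison and the sleep duration.
    now := s.clock.PhysicalNow()
    if now < safeStart.WallTime {
        log.Infof(ctx, "waiting for safe start timestamp...")
        // Add 1 nanosecond because we're ignoring the logical part.
        time.Sleep(time.Duration(safeStart.WallTime-now+1) * time.Nanosecond)
    }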

@andreimatei
Contributor Author

@bdarnell yeah, this mechanism is coarse (and this is explained in the comments in this commit), but the alternative seemed complex. Not only would we need to store per-range (or per-lease) local keys, but we'd also need a way to avoid using the respective individual leases after a restart (and this mechanism would probably have to differ from what the server uses to avoid using a transferred lease before a restart - the replica.pendingLeaseRequest struct). The idea of just marking the local lease as tainted was considered back when I originally introduced lease transfers, but, as you've said, applying a non-replicated change to this otherwise replicated state seemed like a bad idea.

I was punting on doing anything better until when/if non-expiring leases become a reality. In the meantime, note that we only wait out the lease duration when a server restarts immediately (which should not be common, I think). Also, a server takes a while to shut down, so depending on how we do the lease transfers on shutdown, that might also absorb some of the wait.

So, I dunno... If you don't feel strongly, I'd go ahead with this change to get these transfers correct, and improve things later.

@bdarnell
Contributor

Storing this per-replica doesn't seem significantly more complicated than doing it per-store. You just need a new unreplicated member of ReplicaState, which can be loaded at startup just like the previous lease is. You'd check this lease transfer timestamp at the same time you check whether the current lease covers the current timestamp. I'd much rather keep this at the replica level than introduce something entirely new at the store level that interacts with leases.
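
A rough sketch of that check, with hypothetical names; LeaseTransferredAt stands in for the new unreplicated ReplicaState member, and lease.Covers for the existing coverage check:

    // canUseLease would be consulted wherever we currently check that the
    // lease covers the request timestamp.
    func (r *Replica) canUseLease(lease roachpb.Lease, ts hlc.Timestamp) bool {
        r.mu.Lock()
        defer r.mu.Unlock()
        // LeaseTransferredAt: written when a transfer is proposed and loaded
        // again at startup, just like the lease itself.
        xfer := r.mu.state.LeaseTransferredAt
        if xfer != (hlc.Timestamp{}) && lease.Start.Less(xfer) {
            // A transfer was proposed after this lease started; the lease may
            // already belong to the new leaseholder, so don't use it.
            return false
        }
        return lease.Covers(ts)
    }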

@tbg
Member

tbg commented Sep 26, 2016

Reviewed 8 of 8 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



@tamird
Contributor

tamird commented Sep 26, 2016

Reviewed 6 of 8 files at r1.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


server/server.go, line 510 at r1 (raw file):

  })
  if err != nil {
      return errors.Wrap(err, "error reading stores' safe start")

this wrapping should happen inside the closure so we know which store.
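
Something along these lines, assuming a VisitStores-style closure; ReadSafeStart and the variable names are placeholders for whatever the PR actually uses:

    if err := stores.VisitStores(func(store *storage.Store) error {
        ss, err := store.ReadSafeStart(ctx)
        if err != nil {
            // Wrapping inside the closure lets the error name the store.
            return errors.Wrapf(err, "error reading safe start for store %s", store)
        }
        safeStart.Forward(ss)
        return nil
    }); err != nil {
        return err
    }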


server/server.go, line 513 at r1 (raw file):

  }
  if s.clock.Now().Less(safeStart) {
      log.Infof(ctx, "waiting for engines' safe start timestamp...")

is it the engines' or the stores'?


server/server.go, line 517 at r1 (raw file):

          // Add 1 nanosecond because we're ignoring the logical part.
          time.Duration(safeStart.WallTime-s.clock.PhysicalNow()+1) * time.Nanosecond)
      log.Infof(ctx, "waiting for engines' safe start timestamp... done")

not sure you need the common prefix here - the server isn't running yet, so there's nothing else in the logs.


storage/replica_range_lease.go, line 292 at r1 (raw file):

      if err := r.store.MaybeUpdateSafeStart(ctx, lease); err != nil {
          return nil, nil, errors.Wrap(err, "error persisting lease transfer")

"error updating safe start"



@andreimatei
Contributor Author

ok, will rewrite


Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.



@tbg tbg added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting) Oct 4, 2016
@tbg
Member

tbg commented Oct 4, 2016

Assigned stability label to mark for supervised merge.

TestStatsInit was writing stats and asserting it can read them. But it
was doing so using a MultiTestContext and racing with the stats written
by the ranges. Moved the test to the engines test.

This is required by the next commits because we'll start being notified
about when the lease changes and a duplicate notification can throw us
off.

I need to use a TestCluster in a test that can't depend on Server, so
it's time to expand the shim.

Also expand the TestServerInterface to give access to the stores. This
is also needed in said test, and downcasting to TestServer is not an
option. The stores are exposed through a shim interface.

... so we don't use those leases after a restart

fixes cockroachdb#7996
@andreimatei
Contributor Author

I've rewritten the thing. Now each replica's lease transfer state is persisted to disk in such a way that, upon restart, it can be almost perfectly reconstructed. So there's no longer a difference between how we check whether we're in the middle of a transfer during regular operation versus immediately after a restart.
This required an interaction between the raft machinery and the pendingLeaseRequest struct: whereas before one would request a new lease (new, extension or transfer) and then wait on the replica.Send channel to see whether the new lease applied (an ephemeral mechanism that cannot survive a restart), now one waits on a channel that's signaled by the commit trigger that applies the new lease. This arguably breaks the encapsulation of the logic in the pendingLeaseRequest struct, but, besides serving the purpose of this PR, it's also more correct (before, it was unclear what would happen if, say, someone was waiting on a lease proposal but another lease was applied in the meantime). The change is also in the spirit of the Replica having reconstructible state, which pendingLeaseRequest was going against.
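
To make that concrete, here is a minimal sketch of the apply-time notification described above; the struct shape, method signatures, and channel buffering are illustrative, not the PR's actual code:

    type leaseOrErr struct {
        lease roachpb.Lease
        err   error
    }

    type pendingLeaseRequest struct {
        mu      syncutil.Mutex
        waiters []chan leaseOrErr
    }

    // getWaitChan registers a waiter for the next applied lease. The channel
    // is buffered so the apply-time signal never blocks on a slow waiter.
    func (p *pendingLeaseRequest) getWaitChan() <-chan leaseOrErr {
        p.mu.Lock()
        defer p.mu.Unlock()
        ch := make(chan leaseOrErr, 1)
        p.waiters = append(p.waiters, ch)
        return ch
    }

    // signalLeaseApplied is called from the commit trigger that applies a
    // lease; the applied lease may or may not be the one the waiters proposed.
    func (p *pendingLeaseRequest) signalLeaseApplied(lease roachpb.Lease) {
        p.mu.Lock()
        defer p.mu.Unlock()
        for _, ch := range p.waiters {
            ch <- leaseOrErr{lease: lease}
        }
        p.waiters = nil
    }

Because the signal comes from the apply-time trigger rather than from the proposal path, it keys off state that is persisted and replayed, which is what allows the pending-transfer information to be reconstructed after a restart.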

@spencerkimball, since you insisted on the idea of converging in-memory and on-disk state, I think you get the review. But say so if you're too busy.

r7 is the main commit of interest, the rest are misc. I'll probably need to add more tests that mimic a restart, but wanna get some validation.

CC @RaduBerinde for r6 - expanding the TestCluster shim.


Review status: 0 of 27 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


keys/constants.go, line 85 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

All suffixes are supposed to be 4 characters (see localSuffixLength above).

Done.

storage/replica_range_lease.go, line 292 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"error updating safe start"

it's gone

storage/store.go, line 453 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why does this get a new mutex?

gone

server/server.go, line 510 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this wrapping should happen inside the closure so we know which store.

gone

server/server.go, line 513 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is it the engines' or the stores'?

gone

server/server.go, line 516 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Don't call clock.{Physical,}Now() more than once here.

gone

server/server.go, line 517 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not sure you need the common prefix here - the server isn't running yet, so there's nothing else in the logs.

gone


@bdarnell
Contributor

Reviewed 2 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 11 of 12 files at r6, 12 of 12 files at r7.
Review status: all files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


keys/constants.go, line 118 at r7 (raw file):

  LocalRangeLeaseSuffix = []byte("rll-")
  // LocalNextLeaseSuffix is the suffix for a range's "next lease".
  LocalNextLeaseSuffix = []byte("rnl-")

Move this after "rlla" below. They're alphabetized by value to make it easier to spot conflicts.


storage/client_test.go, line 317 at r7 (raw file):

      wg.Add(len(m.stoppers))
      for i, s := range m.stoppers {
          go func(i int, s *stop.Stopper) {

This new argument is unused.


storage/client_test.go, line 357 at r7 (raw file):

          m.t.Error("timed out during shutdown")
      } else {
          panic(fmt.Sprintf("timed out during shutdown. current time: %s.", timeutil.Now()))

This could be a log.Fatal instead and then there'd be a timestamp automatically.


storage/log_test.go, line 151 at r7 (raw file):

  // Stores (or a creative usage of VisitStores could suffice).
  testStore, err := s.(*server.TestServer).Stores().GetStore(roachpb.StoreID(1))
  store := testStore.(*storage.Store)

It's unfortunate that all these explicit casts are necessary here in the storage package, where all the types are accessible.


storage/replica_command.go, line 1747 at r7 (raw file):

  // replays.
  // TODO(andrei): I've seen replays of a LeaseRequest in a test, happening very
  // predictably. Is this expected?

Yes, I think that's expected. But why do we need to handle this specially, instead of treating it as an "extension" that doesn't actually move the expiration forward?


storage/replica_range_lease.go, line 114 at r7 (raw file):

      // request this signal corresponds too. It might not correspond to the
      // request the current waiters are waiting on, since a signal delivering a
      // success can race with a signal signaling an error for the same request.

Why would the same request both succeed and fail? There are lease index and reproposal failures, but those don't make it far enough to call SignalLeaseApplied.


storage/replica_range_lease.go, line 131 at r7 (raw file):

      // lease to be used for serving.
      if isExtension := lease.Replica.ReplicaID == p.replica.mu.replicaID; isExtension {
          p.WipeNextLease(ctx)

I don't see any tests that show that this makes a difference. What kind of failures could reach this point while the old lease is still active, and what would be the consequences of handling extensions in the same way as transfers (i.e. conservatively allowing NextLease to remain)?


storage/replica_range_lease.go, line 145 at r7 (raw file):

      ctx, p.replica.store.engine, p.replica.RangeID, roachpb.Lease{})
  if err != nil {
      // TODO(andrei): What should we do for such replica corruption errors?

We have the Replica.maybeSetCorrupt method (soon to move to Store).


storage/replica_range_lease.go, line 154 at r7 (raw file):

// applies a lease (the lease the caller has just proposed, or a completely
// different one).
func (p *pendingLeaseRequest) getWaitChan(ctx context.Context) chan leaseOrErr {

If this can easily return a <-chan, it should.


storage/replica_range_lease.go, line 162 at r7 (raw file):

// notifyWaiters sends a notification to the current batch of waiters. It then
// clears the list of waiters. Callers should also call WipeNextLease atomically

What does "atomically" mean here? Under some lock? Which one?


storage/replica_range_lease.go, line 252 at r7 (raw file):

          // NOTE(andrei): We could return an error to the caller here since we
          // haven't sent the request yet, but we can't similarly handle the same
          // kind of corruption error in GetWaitChan(), so we just panic here too

getWaitChan doesn't have a panic.


storage/replica_range_lease.go, line 309 at r7 (raw file):

// LeaderLease.
//
// Note that the pendingLeaseRequest, and hence this method, is unaware of any

Is this still true?



@spencerkimball
Member

@andreimatei do you still want me to review this?

@andreimatei
Contributor Author

Ben seems to be on it. Probably enough.

@andreimatei
Contributor Author

Something came up when discussing with @tschottdorf, alongside his comments in #6290 about using leases after restarts in a proposer-evaluated KV world.
It seems that, even today, it's incorrect to serve reads after a restart, given that the command queue is wiped on restart: the serialization of reads against writes that might still be in flight is gone, and that's a correctness problem. So it looks like, after a restart, we can't use any of the leases we had before the restart (as opposed to only avoiding the ones that were in the process of being transferred, which is what this PR is trying to do).
So I'm wondering if we should give up on this PR, or at least give up on relying on it to protect against anything on restart, and revert to the blunt approach of sitting out all old leases upon restart.

@bdarnell @spencerkimball any opinions?


Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.



@spencerkimball
Member

I would in general be in favor of the blunt approach. Complexity scares me. And with the epoch-based range leases, it becomes less and less likely that any ranges on a restarted node will be reusing existing leases anyway.

But in general, if a node is restarted, none of its ranges should ever be allowed to use pre-existing leases until all committed log entries have been applied to the finite state machine. So the command queue issue which you're worried about should be a moot point.

@bdarnell
Contributor

Yeah, I think I'd abandon this change and instead make sure that we only use leases that were requested by the current process.

But in general, if a node is restarted, none of its ranges should ever be allowed to use pre-existing leases until all committed log entries have been applied to the finite state machine. So the command queue issue which you're worried about should be a moot point.

I'm not sure what you're trying to say here; we certainly don't enforce this today.

@spencerkimball
Member

@bdarnell yes, we don't enforce this today. In more detail: if you restart with a lease still unexpired and immediately serve a read for a key that has a pending, committed change in the raft log which hasn't yet been applied, you'll have a problem. I'm not entirely sure the scenario is possible, but I think it could be, unless we apply all committed commands at raft group load time.

The simpler approach we're now advocating - outlawing the use of leases not from the current process - will fix this potential consistency vulnerability.
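
For reference, a rough sketch of that rule; ProposedTS on the lease and minLeaseProposedTS on the replica are assumptions about how the "current process" condition could be recorded, not a final design:

    // A lease is usable only if it was proposed by this process, i.e. at or
    // after the timestamp recorded when the node (re)started.
    func (r *Replica) leaseProposedByThisProcess(lease roachpb.Lease) bool {
        return lease.ProposedTS != nil && !lease.ProposedTS.Less(r.mu.minLeaseProposedTS)
    }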

@andreimatei
Contributor Author

I've opened a new PR (#10211) instead of this one, addressing the original motivation (don't reuse transferred leases) and more.
I'll now see what of this PR deserves salvaging.

@tbg
Member

tbg commented Nov 1, 2016

Unassigning due to not being able to review this past 11/6.

@tbg tbg removed their assignment Nov 1, 2016
@andreimatei
Contributor Author

closing in favor of #10420

@andreimatei andreimatei closed this Nov 2, 2016
@andreimatei andreimatei deleted the start-wait-maxtrans branch April 27, 2018 18:48