
storage: persist proposed lease transfers to disk #9523

Closed

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Sep 24, 2016

... so we don't use those leases after a restart

fixes #7996

cc @cockroachdb/stability @petermattis


This change is Reviewable

Contributor

@bdarnell bdarnell left a comment

Lease transfers are a per-range operation; blocking the entire store on restart seems unnecessarily coarse (especially if we start transferring all our leases away on shutdown, in which case having to wait out almost the full lease duration (9s) would be the common case). I'd rather store this at the replica level instead of the store level.

We could almost do it by modifying our persisted version of the lease, but that is defined as replicated state, so we need to introduce a new unreplicated range-id-local key.
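
For illustration, a minimal sketch of what such an unreplicated range-ID-local key could look like. The suffix value, the function name, and the literal prefix/marker bytes are placeholders for this sketch, not the actual CockroachDB key encoding:

    package keyssketch

    import "encoding/binary"

    // localRangeLeaseTransferSuffix is a hypothetical suffix for an unreplicated
    // range-ID-local key recording a proposed lease transfer.
    var localRangeLeaseTransferSuffix = []byte("rltr")

    // leaseTransferKey sketches the shape of such a key: a range-ID-local
    // prefix, the range ID, an "unreplicated" marker, and the suffix.
    func leaseTransferKey(rangeID uint64) []byte {
        key := append([]byte(nil), "\x01i"...) // stand-in for the range-ID-local prefix
        var buf [8]byte
        binary.BigEndian.PutUint64(buf[:], rangeID)
        key = append(key, buf[:]...)
        key = append(key, 'u') // stand-in marker separating unreplicated keys
        return append(key, localRangeLeaseTransferSuffix...)
    }

Being unreplicated, writes to such a key go straight to the local engine and never through raft, so they cannot diverge the replicated state machine.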

@@ -80,6 +80,9 @@ var (
// localStoreGossipSuffix stores gossip bootstrap metadata for this
// store, updated any time new gossip hosts are encountered.
localStoreGossipSuffix = []byte("goss")
// localStoreSafeStartSuffix stores the minimum timestamp when it's safe for
// the store to start serving after a restart.
localStoreSafeStartSuffix = []byte("safe-start")
@bdarnell
Contributor

All suffixes are supposed to be 4 characters (see localSuffixLength above).
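
For example, a hypothetical four-character replacement (the value is illustrative, not taken from the PR):

    // localStoreSafeStartSuffix stores the minimum timestamp when it's safe
    // for the store to start serving after a restart.
    localStoreSafeStartSuffix = []byte("sfst")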

@@ -448,6 +448,9 @@ type Store struct {
// raft.
droppedPlaceholders int32
}

// safeStartMu serializes access to the engine's "safe start" key.
safeStartMu syncutil.Mutex
@bdarnell
Contributor

Why does this get a new mutex?

log.Infof(ctx, "waiting for engines' safe start timestamp...")
time.Sleep(
// Add 1 nanosecond because we're ignoring the logical part.
time.Duration(safeStart.WallTime-s.clock.PhysicalNow()+1) * time.Nanosecond)
@bdarnell
Contributor

Don't call clock.{Physical,}Now() more than once here.
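
A minimal sketch of the suggested fix, reusing a single clock reading and assuming the surrounding s, ctx, and safeStart from the quoted code:

    // Read the physical clock once and use the same reading for both the
    // comparison and the sleep duration.
    now := s.clock.PhysicalNow()
    if now < safeStart.WallTime {
        log.Infof(ctx, "waiting for safe start timestamp...")
        // Add 1 nanosecond because we're ignoring the logical part.
        time.Sleep(time.Duration(safeStart.WallTime-now+1) * time.Nanosecond)
    }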

@andreimatei
Contributor Author

@bdarnell yeah, this mechanism is coarse (and this is explained in the comments in this commit), but the alternative seemed complex. Not only would we need to store per-range (or per-lease) local keys, but we'd also need a way to avoid using the respective individual leases after a restart (and this mechanism would probably have to differ from what the server uses to avoid using a transferred lease before a restart - the replica.pendingLeaseRequest struct). The idea of just marking the local lease as tainted was considered back when I originally introduced lease transfers, but, as you've said, applying a non-replicated change to this otherwise replicated state seemed like a bad idea.

I was punting on doing anything better until when/if non-expiring leases become a reality. In the meantime, note that we only wait out the lease duration when a server restarts immediately (which should not be common, I think). Also, a server takes a while to shut down, so depending on how we do the lease transfers on shutdown, that might also absorb some of the wait.

So, I dunno... If you don't feel strongly, I'd go ahead with this change to get these transfers correct, and improve things later.

@bdarnell
Contributor

Storing this per-replica doesn't seem significantly more complicated than doing it per-store. You just need a new unreplicated member of ReplicaState, which can be loaded at startup just like the previous lease is. You'd check this lease transfer timestamp at the same time you check whether the current lease covers the current timestamp. I'd much rather keep this at the replica level than introduce something entirely new at the store level that interacts with leases.
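
A rough sketch of that check, with hypothetical names; LeaseTransferredAt stands in for the new unreplicated ReplicaState member, and lease.Covers for the existing coverage check:

    // canUseLease would be consulted wherever we currently check that the
    // lease covers the request timestamp.
    func (r *Replica) canUseLease(lease roachpb.Lease, ts hlc.Timestamp) bool {
        r.mu.Lock()
        defer r.mu.Unlock()
        // LeaseTransferredAt: written when a transfer is proposed and loaded
        // again at startup, just like the lease itself.
        xfer := r.mu.state.LeaseTransferredAt
        if xfer != (hlc.Timestamp{}) && lease.Start.Less(xfer) {
            // A transfer was proposed after this lease started; the lease may
            // already belong to the new leaseholder, so don't use it.
            return false
        }
        return lease.Covers(ts)
    }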

@tbg
Member

tbg commented Sep 26, 2016

Reviewed 8 of 8 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



@tamird
Contributor

tamird commented Sep 26, 2016

Reviewed 6 of 8 files at r1.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


server/server.go, line 510 at r1 (raw file):

  })
  if err != nil {
      return errors.Wrap(err, "error reading stores' safe start")

this wrapping should happen inside the closure so we know which store.
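
Something along these lines, assuming a VisitStores-style closure; ReadSafeStart and the variable names are placeholders for whatever the PR actually uses:

    if err := stores.VisitStores(func(store *storage.Store) error {
        ss, err := store.ReadSafeStart(ctx)
        if err != nil {
            // Wrapping inside the closure lets the error name the store.
            return errors.Wrapf(err, "error reading safe start for store %s", store)
        }
        safeStart.Forward(ss)
        return nil
    }); err != nil {
        return err
    }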


server/server.go, line 513 at r1 (raw file):

  }
  if s.clock.Now().Less(safeStart) {
      log.Infof(ctx, "waiting for engines' safe start timestamp...")

is it the engines' or the stores'?


server/server.go, line 517 at r1 (raw file):

          // Add 1 nanosecond because we're ignoring the logical part.
          time.Duration(safeStart.WallTime-s.clock.PhysicalNow()+1) * time.Nanosecond)
      log.Infof(ctx, "waiting for engines' safe start timestamp... done")

not sure you need the common prefix here - the server isn't running yet, so there's nothing else in the logs.


storage/replica_range_lease.go, line 292 at r1 (raw file):

      if err := r.store.MaybeUpdateSafeStart(ctx, lease); err != nil {
          return nil, nil, errors.Wrap(err, "error persisting lease transfer")

"error updating safe start"



@andreimatei
Contributor Author

ok, will rewrite


Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.



@tbg tbg added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting) Oct 4, 2016
@tbg
Member

tbg commented Oct 4, 2016

Assigned stability label to mark for supervised merge.

TestStatsInit was writing stats and asserting it can read them. But it
was doing so using a MultiTestContext and racing with the stats written
by the ranges. Moved the test to the engines test.

This is required by the next commits because we'll start being notified
about when the lease changes and a duplicate notification can throw us
off.

I need to use a TestCluster in a test that can't depend on Server, so
it's time to expand the shim.

Also expand the TestServerInterface to give access to the stores. This
is also needed in said test, and downcasting to TestServer is not an
option. The stores are exposed through a shim interface.

... so we don't use those leases after a restart

fixes cockroachdb#7996
@andreimatei
Contributor Author

I've rewritten the thing. Now each replica's lease transfer state is persisted to disk in such a way that, upon restart, it can be almost perfectly reconstructed. So there's no longer a difference between how we check whether we're in the middle of a transfer during regular operation versus immediately after a restart.
This required an interaction between the raft machinery and the pendingLeaseRequest struct: whereas before one would request a new lease (new, extension or transfer) and then wait on the replica.Send channel to see whether the new lease applied (an ephemeral mechanism that cannot survive a restart), now one waits on a channel that's signaled by the commit trigger that applies the new lease. This arguably breaks the encapsulation of the logic in the pendingLeaseRequest struct, but, besides serving the purpose of this PR, it's also more correct (before, it was unclear what would happen if, say, someone was waiting on a lease proposal but another lease was applied in the meantime). The change is also in the spirit of the Replica having reconstructible state, which pendingLeaseRequest was going against.
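
To make that concrete, here is a minimal sketch of the apply-time notification described above; the struct shape, method signatures, and channel buffering are illustrative, not the PR's actual code:

    type leaseOrErr struct {
        lease roachpb.Lease
        err   error
    }

    type pendingLeaseRequest struct {
        mu      syncutil.Mutex
        waiters []chan leaseOrErr
    }

    // getWaitChan registers a waiter for the next applied lease. The channel
    // is buffered so the apply-time signal never blocks on a slow waiter.
    func (p *pendingLeaseRequest) getWaitChan() <-chan leaseOrErr {
        p.mu.Lock()
        defer p.mu.Unlock()
        ch := make(chan leaseOrErr, 1)
        p.waiters = append(p.waiters, ch)
        return ch
    }

    // signalLeaseApplied is called from the commit trigger that applies a
    // lease; the applied lease may or may not be the one the waiters proposed.
    func (p *pendingLeaseRequest) signalLeaseApplied(lease roachpb.Lease) {
        p.mu.Lock()
        defer p.mu.Unlock()
        for _, ch := range p.waiters {
            ch <- leaseOrErr{lease: lease}
        }
        p.waiters = nil
    }

Because the signal comes from the apply-time trigger rather than from the proposal path, it keys off state that is persisted and replayed, which is what allows the pending-transfer information to be reconstructed after a restart.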

@spencerkimball, since you insisted on the idea of converging in-memory and on-disk state, I think you get the review. But say so if you're too busy.

r7 is the main commit of interest, the rest are misc. I'll probably need to add more tests that mimic a restart, but wanna get some validation.

CC @RaduBerinde for r6 - expanding the TestCluster shim.


Review status: 0 of 27 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


keys/constants.go, line 85 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

All suffixes are supposed to be 4 characters (see localSuffixLength above).

Done.

storage/replica_range_lease.go, line 292 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"error updating safe start"

it's gone

storage/store.go, line 453 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why does this get a new mutex?

gone

server/server.go, line 510 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this wrapping should happen inside the closure so we know which store.

gone

server/server.go, line 513 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is it the engines' or the stores'?

gone

server/server.go, line 516 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Don't call clock.{Physical,}Now() more than once here.

gone

server/server.go, line 517 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not sure you need the common prefix here - the server isn't running yet, so there's nothing else in the logs.

gone


@bdarnell
Contributor

Reviewed 2 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 11 of 12 files at r6, 12 of 12 files at r7.
Review status: all files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


keys/constants.go, line 118 at r7 (raw file):

  LocalRangeLeaseSuffix = []byte("rll-")
  // LocalNextLeaseSuffix is the suffix for a range's "next lease".
  LocalNextLeaseSuffix = []byte("rnl-")

Move this after "rlla" below. They're alphabetized by value to make it easier to spot conflicts.


storage/client_test.go, line 317 at r7 (raw file):

      wg.Add(len(m.stoppers))
      for i, s := range m.stoppers {
          go func(i int, s *stop.Stopper) {

This new argument is unused.


storage/client_test.go, line 357 at r7 (raw file):

          m.t.Error("timed out during shutdown")
      } else {
          panic(fmt.Sprintf("timed out during shutdown. current time: %s.", timeutil.Now()))

This could be a log.Fatal instead and then there'd be a timestamp automatically.


storage/log_test.go, line 151 at r7 (raw file):

  // Stores (or a creative usage of VisitStores could suffice).
  testStore, err := s.(*server.TestServer).Stores().GetStore(roachpb.StoreID(1))
  store := testStore.(*storage.Store)

It's unfortunate that all these explicit casts are necessary here in the storage package, where all the types are accessible.


storage/replica_command.go, line 1747 at r7 (raw file):

  // replays.
  // TODO(andrei): I've seen replays of a LeaseRequest in a test, happening very
  // predictably. Is this expected?

Yes, I think that's expected. But why do we need to handle this specially, instead of treating it as an "extension" that doesn't actually move the expiration forward?


storage/replica_range_lease.go, line 114 at r7 (raw file):

      // request this signal corresponds too. It might not correspond to the
      // request the current waiters are waiting on, since a signal delivering a
      // success can race with a signal signaling an error for the same request.

Why would the same request both succeed and fail? There are lease index and reproposal failures, but those don't make it far enough to call SignalLeaseApplied.


storage/replica_range_lease.go, line 131 at r7 (raw file):

      // lease to be used for serving.
      if isExtension := lease.Replica.ReplicaID == p.replica.mu.replicaID; isExtension {
          p.WipeNextLease(ctx)

I don't see any tests that show that this makes a difference. What kind of failures could reach this point while the old lease is still active, and what would be the consequences of handling extensions in the same way as transfers (i.e. conservatively allowing NextLease to remain)?


storage/replica_range_lease.go, line 145 at r7 (raw file):

      ctx, p.replica.store.engine, p.replica.RangeID, roachpb.Lease{})
  if err != nil {
      // TODO(andrei): What should we do for such replica corruption errors?

We have the Replica.maybeSetCorrupt method (soon to move to Store).


storage/replica_range_lease.go, line 154 at r7 (raw file):

// applies a lease (the lease the caller has just proposed, or a completely
// different one).
func (p *pendingLeaseRequest) getWaitChan(ctx context.Context) chan leaseOrErr {

If this can easily return a <-chan, it should.


storage/replica_range_lease.go, line 162 at r7 (raw file):

// notifyWaiters sends a notification to the current batch of waiters. It then
// clears the list of waiters. Callers should also call WipeNextLease atomically

What does "atomically" mean here? Under some lock? Which one?


storage/replica_range_lease.go, line 252 at r7 (raw file):

          // NOTE(andrei): We could return an error to the caller here since we
          // haven't sent the request yet, but we can't similarly handle the same
          // kind of corruption error in GetWaitChan(), so we just panic here too

getWaitChan doesn't have a panic.


storage/replica_range_lease.go, line 309 at r7 (raw file):

// LeaderLease.
//
// Note that the pendingLeaseRequest, and hence this method, is unaware of any

Is this still true?



@spencerkimball
Member

@andreimatei do you still want me to review this?

@andreimatei
Contributor Author

Ben seems to be on it. Probably enough.

@andreimatei
Contributor Author

Something came up when discussing with @tschottdorf, alongside his comments in #6290 about using leases after restarts in a proposer-evaluated KV world.
It seems that, even today, it's incorrect to serve reads after a restart, given that the command queue is wiped on restart: the serialization of reads against writes that might still be in flight is gone, and that's a correctness problem. So it looks like, after a restart, we can't use any of the leases we had before the restart (as opposed to only avoiding the ones that were in the process of being transferred, which is what this PR is trying to do).
So I'm wondering if we should give up on this PR, or at least give up on relying on it to protect against anything on restart, and revert to the blunt approach of sitting out all old leases upon restart.

@bdarnell @spencerkimball any opinions?


Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.



@spencerkimball
Member

I would in general be in favor of the blunt approach. Complexity scares me. And with the epoch-based range leases, it becomes less and less likely that any ranges on a restarted node will be reusing existing leases anyway.

But in general, if a node is restarted, none of its ranges should ever be allowed to use pre-existing leases until all committed log entries have been applied to the finite state machine. So the command queue issue which you're worried about should be a moot point.

@bdarnell
Contributor

Yeah, I think I'd abandon this change and instead make sure that we only use leases that were requested by the current process.

But in general, if a node is restarted, none of its ranges should ever be allowed to use pre-existing leases until all committed log entries have been applied to the finite state machine. So the command queue issue which you're worried about should be a moot point.

I'm not sure what you're trying to say here; we certainly don't enforce this today.

@spencerkimball
Member

@bdarnell yes, we don't enforce this today. In more detail: if you restart with a lease still unexpired and immediately serve a read for a key that has a pending, committed change in the raft log which hasn't yet been applied, you'll have a problem. I'm not entirely sure the scenario is possible, but I think it could be, unless we apply all committed commands at raft group load time.

The simpler approach we're now advocating - outlawing the use of leases not from the current process - will fix this potential consistency vulnerability.
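
For reference, a rough sketch of that rule; ProposedTS on the lease and minLeaseProposedTS on the replica are assumptions about how the "current process" condition could be recorded, not a final design:

    // A lease is usable only if it was proposed by this process, i.e. at or
    // after the timestamp recorded when the node (re)started.
    func (r *Replica) leaseProposedByThisProcess(lease roachpb.Lease) bool {
        return lease.ProposedTS != nil && !lease.ProposedTS.Less(r.mu.minLeaseProposedTS)
    }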

@andreimatei
Contributor Author

I've opened a new PR (#10211) instead of this one, addressing the original motivation (don't reuse transferred leases) and more.
I'll now see what of this PR deserves salvaging.

@tbg
Member

tbg commented Nov 1, 2016

Unassigning due to not being able to review this past 11/6.

@tbg tbg removed their assignment Nov 1, 2016
@andreimatei
Contributor Author

closing in favor of #10420

@andreimatei andreimatei closed this Nov 2, 2016
@andreimatei andreimatei deleted the start-wait-maxtrans branch April 27, 2018 18:48