storage, server: don't reuse leases obtained before a restart or transfer

Before this patch, leases held before a restart could be used after the
restart. This is incorrect: the restart wipes the command queue, so
reads and writes are no longer properly sequenced with commands that
might still have been in flight.

There was also a problem before this patch related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous - there's no guarantee that the transfer won't still apply. In
such cases, the replica was actually breaking its promise not to use the
lease after initiating the transfer.

The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTS` and a `Lease.ProposedTS`. These fields are
used to enforce that only leases proposed after a particular time can be
used *for proposing* new commands, or for serving reads. On a transfer,
the transferring replica's field is set to the current time. On restart,
the field is set for all the replicas to the current time, after the
server has waited long enough to guarantee that its HLC now reads above
anything it might have read in previous incarnations. This ensures that,
in both the transfer and the restart case, leases that were proposed
before the transfer/restart and were still in flight when it happened
are not eligible to be used.
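
To make the rule concrete, here is a minimal, self-contained Go sketch of the
eligibility check these fields enforce. The `Timestamp`, `Lease`, and `replica`
types (and the `usable` helper) are simplified stand-ins introduced for
illustration, not the actual CockroachDB code:

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is a stand-in for util/hlc.Timestamp; only wall time is modeled.
type Timestamp struct {
	WallTime int64
}

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// Lease is a pared-down stand-in for roachpb.Lease. ProposedTS is a pointer
// because the proto field is nullable; pre-migration leases have it unset.
type Lease struct {
	ProposedTS *Timestamp
}

// replica models only the field introduced by this patch.
type replica struct {
	minLeaseProposedTS Timestamp
}

// usable sketches the eligibility rule: a lease may only be used for proposing
// commands or serving reads if it was proposed at or after minLeaseProposedTS.
// In this sketch, a lease without a ProposedTS is accepted only while the
// replica has not yet set a minimum.
func (r *replica) usable(l Lease) bool {
	if l.ProposedTS == nil {
		return r.minLeaseProposedTS.WallTime == 0
	}
	return !l.ProposedTS.Less(r.minLeaseProposedTS)
}

func main() {
	now := Timestamp{WallTime: time.Now().UnixNano()}
	oldLease := Lease{ProposedTS: &Timestamp{WallTime: now.WallTime - int64(time.Second)}}

	r := &replica{}
	fmt.Println("before restart/transfer:", r.usable(oldLease)) // true

	// A restart or an initiated transfer forward-steps the minimum; the old
	// lease immediately becomes ineligible and a new one must be requested,
	// even if a pending transfer later turns out to have failed.
	r.minLeaseProposedTS = now
	fmt.Println("after restart/transfer: ", r.usable(oldLease)) // false
}
```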

This patch also changes how we wait on startup for the server's HLC to
be guaranteed monotonic with respect to the HLC before the restart.
Before, we were waiting just before starting to serve, but after all the
stores had been started. The point of the waiting was to avoid serving
reads while believing we have a lower timestamp than before the restart.
I'm unclear about whether that was correct, considering that, for
example, a store's queues were already running and potentially doing
work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize replica.minLeaseProposedTS with a value higher than
anything the node used before the restart), I've had to do the waiting
*before* initializing the stores. I could have set that field later, but
it seems even more dangerous than before to allow queues to do work with
potentially bad leases.
We lose the amortization of this wait across the store creation
process... But note that, for empty engines (e.g. in tests), we don't do
any waiting.
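
As a rough sketch of the new startup ordering (again with hypothetical
stand-in types rather than the real server/storage code): sleep out the clock
uncertainty first, and only then initialize the stores, stamping each
replica's minLeaseProposedTS:

```go
package main

import "time"

// hlcClock, store, and replica are hypothetical stand-ins for hlc.Clock,
// storage.Store, and storage.Replica; only what this sketch needs is modeled.
type hlcClock struct{ maxOffset time.Duration }

func (c *hlcClock) MaxOffset() time.Duration { return c.maxOffset }
func (c *hlcClock) Now() int64               { return time.Now().UnixNano() }

type replica struct{ minLeaseProposedTS int64 }
type store struct{ replicas []*replica }

// startStores sketches the ordering described above: sleep first (skipped for
// brand-new engines), then initialize the stores with a minLeaseProposedTS
// that makes leases proposed by a previous incarnation ineligible.
func startStores(clock *hlcClock, stores []*store, anyStoreBootstrapped bool, startTime time.Time) {
	if anyStoreBootstrapped {
		// We couldn't have served timestamps more than MaxOffset in the
		// future, so sleeping out the remainder of MaxOffset guarantees the
		// HLC is now above anything served before the restart.
		if d := clock.MaxOffset() - time.Since(startTime); d > 0 {
			time.Sleep(d)
		}
	}
	now := clock.Now()
	for _, s := range stores {
		for _, r := range s.replicas {
			// Leases proposed before this point (i.e. before the restart) may
			// no longer be used for proposing commands or serving reads.
			r.minLeaseProposedTS = now
		}
	}
}

func main() {
	clock := &hlcClock{maxOffset: 250 * time.Millisecond}
	stores := []*store{{replicas: []*replica{{}, {}}}}
	startStores(clock, stores, true /* some store was already bootstrapped */, time.Now())
}
```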

Fixes cockroachdb#7996

THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD" RESTART.
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers applied by old servers - they would be
serialized differently and the consistency checker would flag it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the
stop-the-world restart will be serialized the same way they were by the
nodes that already applied them before the restart.
andreimatei committed Nov 7, 2016
1 parent 0b4aad8 commit 3d508a1
Showing 16 changed files with 682 additions and 175 deletions.
251 changes: 155 additions & 96 deletions pkg/roachpb/data.pb.go

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions pkg/roachpb/data.proto
@@ -331,6 +331,15 @@ message Lease {
 
   // The address of the would-be lease holder.
   optional ReplicaDescriptor replica = 3 [(gogoproto.nullable) = false];
+
+  // The current timestamp when this lease has been proposed. Used after a
+  // transfer and after a node restart to enforce that a node only uses leases
+  // proposed after the time of the said transfer or restart. This is nullable
+  // to help with the rollout (such that a lease applied by some nodes before
+  // the rollout and some nodes after the rollout is serialized the same).
+  // TODO(andrei): Make this non-nullable after the rollout.
+  optional util.hlc.Timestamp proposed_ts = 5 [(gogoproto.nullable) = true,
+      (gogoproto.customname) = "ProposedTS"];
 }
 
 // AbortCacheEntry contains information about a transaction which has
61 changes: 33 additions & 28 deletions pkg/server/server.go
@@ -552,19 +552,43 @@ func (s *Server) Start(ctx context.Context) error {
 	}
 	s.stopper.AddCloser(&s.engines)
 
-	anyStoreBootstrapped := false
-	for _, e := range s.engines {
-		if _, err := storage.ReadStoreIdent(ctx, e); err != nil {
-			// NotBootstrappedError is expected.
-			if _, ok := err.(*storage.NotBootstrappedError); !ok {
-				return err
-			}
-		} else {
-			anyStoreBootstrapped = true
-			break
-		}
-	}
-
+	// We might have to sleep a bit to protect against this node producing non-
+	// monotonic timestamps. Before restarting, its clock might have been driven
+	// by other nodes' fast clocks, but when we restarted, we lost all this
+	// information. For example, a client might have written a value at a
+	// timestamp that's in the future of the restarted node's clock, and if we
+	// don't do something, the same client's read would not return the written
+	// value. So, we wait up to MaxOffset; we couldn't have served timestamps more
+	// than MaxOffset in the future (assuming that MaxOffset was not changed, see
+	// #9733).
+	//
+	// As an optimization for tests, we don't sleep if all the stores are brand
+	// new. In this case, the node will not serve anything anyway until it
+	// synchronizes with other nodes.
+	{
+		anyStoreBootstrapped := false
+		for _, e := range s.engines {
+			if _, err := storage.ReadStoreIdent(ctx, e); err != nil {
+				// NotBootstrappedError is expected.
+				if _, ok := err.(*storage.NotBootstrappedError); !ok {
+					return err
+				}
+			} else {
+				anyStoreBootstrapped = true
+				break
+			}
+		}
+		if anyStoreBootstrapped {
+			sleepDuration := s.clock.MaxOffset() - timeutil.Since(startTime)
+			if sleepDuration > 0 {
+				log.Infof(ctx, "sleeping for %s to guarantee HLC monotonicity", sleepDuration)
+				time.Sleep(sleepDuration)
+			}
+		}
+	}
+
+	// Now that we have a monotonic HLC wrt previous incarnations of the process,
+	// init all the replicas.
 	err = s.node.start(
 		ctx,
 		unresolvedAdvertAddr,
@@ -610,25 +634,6 @@ func (s *Server) Start(ctx context.Context) error {
 		log.Infof(ctx, "starting postgres server at unix:%s", s.cfg.SocketFile)
 	}
 
-	// We might have to sleep a bit to protect against this node producing non-
-	// monotonic timestamps. Before restarting, its clock might have been driven
-	// by other nodes' fast clocks, but when we restarted, we lost all this
-	// information. For example, a client might have written a value at a
-	// timestamp that's in the future of the restarted node's clock, and if we
-	// don't do something, the same client's read would not return it. So, we
-	// wait up to MaxOffset. We assume we couldn't have served timestamps more
-	// than MaxOffset in the future.
-	//
-	// As an optimization for tests, we don't sleep if all the stores are brand
-	// new. In this case, the node will not serve anything anyway until it
-	// synchronizes with other nodes.
-	if anyStoreBootstrapped {
-		sleepDuration := s.clock.MaxOffset() - timeutil.Since(startTime)
-		if sleepDuration > 0 {
-			log.Infof(ctx, "sleeping for %s to guarantee HLC monotonicity", sleepDuration)
-			time.Sleep(sleepDuration)
-		}
-	}
 	s.stopper.RunWorker(func() {
 		netutil.FatalIfUnexpected(m.Serve())
 	})
2 changes: 1 addition & 1 deletion pkg/storage/below_raft_protos_test.go
@@ -89,7 +89,7 @@ var belowRaftGoldenProtos = map[reflect.Type]fixture{
 	reflect.TypeOf(&roachpb.Lease{}): {
 		populatedConstructor: func(r *rand.Rand) proto.Message { return roachpb.NewPopulatedLease(r, false) },
 		emptySum: 10006158318270644799,
-		populatedSum: 717371977055084394,
+		populatedSum: 17421216026521129287,
 	},
 	reflect.TypeOf(&roachpb.RaftTruncatedState{}): {
 		populatedConstructor: func(r *rand.Rand) proto.Message { return roachpb.NewPopulatedRaftTruncatedState(r, false) },
3 changes: 3 additions & 0 deletions pkg/storage/client_raft_test.go
@@ -303,6 +303,9 @@ func TestRestoreReplicas(t *testing.T) {
 	// Disable periodic gossip activities. The periodic gossiping of the first
 	// range can cause spurious lease transfers which cause this test to fail.
 	sc.TestingKnobs.DisablePeriodicGossips = true
+	// Allow a replica to use the lease it had before a restart; we don't want
+	// this test to deal with needing to acquire new leases after the restart.
+	sc.TestingKnobs.DontPreventUseOfOldLeaseOnStart = true
 	mtc := &multiTestContext{storeConfig: &sc}
 	mtc.Start(t, 2)
 	defer mtc.Stop()
56 changes: 56 additions & 0 deletions pkg/storage/client_replica_test.go
@@ -662,6 +662,62 @@ func TestRangeTransferLease(t *testing.T) {
 	wg.Wait()
 }
 
+// Test that leases held before a restart are not used after the restart.
+// See replica.mu.minLeaseProposedTS for the reasons why this isn't allowed.
+func TestLeaseNotUsedAfterRestart(t *testing.T) {
+	defer leaktest.AfterTest(t)()
+	sc := storage.TestStoreConfig(nil)
+	var leaseAcquisitionTrap atomic.Value
+	// Disable the split queue so that no ranges are split. This makes it easy
+	// below to trap any lease request and infer that it refers to the range we're
+	// interested in.
+	sc.TestingKnobs.DisableSplitQueue = true
+	sc.TestingKnobs.LeaseRequestEvent = func(ts hlc.Timestamp) {
+		val := leaseAcquisitionTrap.Load()
+		if val == nil {
+			return
+		}
+		trapCallback := val.(func(ts hlc.Timestamp))
+		if trapCallback != nil {
+			trapCallback(ts)
+		}
+	}
+	mtc := &multiTestContext{storeConfig: &sc}
+	mtc.Start(t, 1)
+	defer mtc.Stop()
+
+	// Send a read, to acquire a lease.
+	getArgs := getArgs([]byte("a"))
+	if _, err := client.SendWrapped(context.Background(), rg1(mtc.stores[0]), &getArgs); err != nil {
+		t.Fatal(err)
+	}
+
+	// Restart the mtc. Before we do that, we're installing a callback used to
+	// assert that a new lease has been requested. The callback is installed
+	// before the restart, as the lease might be requested at any time and for
+	// many reasons by background processes, even before we send the read below.
+	leaseAcquisitionCh := make(chan error)
+	var once sync.Once
+	leaseAcquisitionTrap.Store(func(_ hlc.Timestamp) {
+		once.Do(func() {
+			close(leaseAcquisitionCh)
+		})
+	})
+	mtc.restart()
+
+	// Send another read and check that the pre-existing lease has not been used.
+	// Concretely, we check that a new lease is requested.
+	if _, err := client.SendWrapped(context.Background(), rg1(mtc.stores[0]), &getArgs); err != nil {
+		t.Fatal(err)
+	}
+	// Check that the Send above triggered a lease acquisition.
+	select {
+	case <-leaseAcquisitionCh:
+	case <-time.After(time.Second):
+		t.Fatalf("read did not acquire a new lease")
+	}
+}
+
 // Test that a lease extension (a RequestLeaseRequest that doesn't change the
 // lease holder) is not blocked by ongoing reads.
 // The test relies on two things:
87 changes: 85 additions & 2 deletions pkg/storage/engine/rocksdb/cockroach/pkg/roachpb/data.pb.cc

Some generated files are not rendered by default.
