storage, server: don't reuse leases obtained before a restart or transfer

Before this patch, leases held before a restart could be used after the
restart. This is incorrect: the restart wipes the command queue, so
reads and writes are no longer properly sequenced with possible
in-flight commands.

Before this patch, there was also a problem related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous: there's no guarantee whether the transfer applied or not. In
such cases, the replica was actually breaking its promise not to use the
lease after initiating the transfer.

The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTs` and a `Lease.ProposedTs`. These fields are
used to enforce that only leases proposed after a particular time will
be used *for proposing* new commands, or for serving reads. On a
transfer, the field for the replica in question is set to the present
time. On a restart, the field is set for all the replicas to the current
time, after the Server has waited long enough to guarantee that the
current HLC is above whatever it might have been in previous
incarnations. This ensures that, in both the transfer and the restart
case, leases that were proposed before the transfer/restart and were in
flight when the transfer/restart happened are not eligible to be used.
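
As a rough sketch of the check described above (only
`replica.minLeaseProposedTs` and `Lease.ProposedTs` come from this
patch; every other name, the nil handling, and the exact comparison are
illustrative assumptions, not the actual replica code):

```go
package main

import "fmt"

// Timestamp mirrors the shape of util/hlc.Timestamp for this sketch.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less orders timestamps by wall time, then by logical ticks.
func (t Timestamp) Less(s Timestamp) bool {
	return t.WallTime < s.WallTime ||
		(t.WallTime == s.WallTime && t.Logical < s.Logical)
}

// leaseUsable sketches the new rule: a lease may be used for proposing
// commands or for serving reads only if it was proposed at or after
// minLeaseProposedTS, which is bumped on a transfer (for that replica)
// and on a restart (for all replicas).
func leaseUsable(proposedTS *Timestamp, minLeaseProposedTS Timestamp) bool {
	if proposedTS == nil {
		// Leases proposed by pre-rollout binaries carry no timestamp; how
		// the real code treats them is not shown here (assumed: unusable).
		return false
	}
	return !proposedTS.Less(minLeaseProposedTS)
}

func main() {
	min := Timestamp{WallTime: 100}
	stale := &Timestamp{WallTime: 50}  // proposed before the transfer/restart
	fresh := &Timestamp{WallTime: 150} // proposed after it
	fmt.Println(leaseUsable(stale, min), leaseUsable(fresh, min)) // false true
}
```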

This patch also changes how we wait on startup for the server's HLC to
be guaranteed monotonic wrt the HLC before the restart. Before, we were
waiting just before starting to serve, but after all the stores had been
started. The point of the waiting was to not serve reads thinking we
have a lower timestamp than before the restart. I'm unclear about
whether that was correct or not, considering that, for example, a
store's queues were already running and potentially doing work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize replica.minLeaseProposedTs with a value higher than
the node had before the restart), I've had to do the waiting *before*
initializing the stores. I could have set that field later, but it seems
even more dangerous than before to allow queues to do work with
potentially bad leases.
We lose the amortization of this wait over the store creation
process... But note that, for empty engines (e.g. in tests) we don't do
any waiting.
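
For illustration (numbers not from this patch): with a MaxOffset of
250ms and 100ms already spent in process startup, the node sleeps the
remaining 150ms before initializing the stores; after that, its HLC is
guaranteed to read above any timestamp it could have served before the
restart, so initializing `replica.minLeaseProposedTs` from the clock at
that point is safe.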

Fixes #7996

THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD"
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers being applied by old servers: they would
be serialized differently and the consistency checker would flag it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the
stop-the-world restart will be serialized the same as they have been by
the nodes that already applied them before the restart.
andreimatei committed Nov 2, 2016
1 parent fa7a0a1 commit 6cca1f8
Showing 13 changed files with 612 additions and 173 deletions.
250 changes: 154 additions & 96 deletions pkg/roachpb/data.pb.go

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions pkg/roachpb/data.proto
@@ -331,6 +331,14 @@ message Lease {
 
   // The address of the would-be lease holder.
   optional ReplicaDescriptor replica = 3 [(gogoproto.nullable) = false];
+
+  // The proposer's current timestamp at the time when this lease was
+  // proposed. Used after a transfer and after a node restart to enforce that
+  // a node only uses leases proposed after the time of the transfer or
+  // restart. This is nullable to help with the rollout (so that a lease
+  // applied by some nodes before the rollout and some after is serialized
+  // the same); it should become non-nullable after the rollout.
+  optional util.hlc.Timestamp proposed_ts = 5 [(gogoproto.nullable) = true];
 }
 
 // AbortCacheEntry contains information about a transaction which has
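
Since the regenerated `data.pb.go` is not rendered above, here is a
sketch of the Go field that `(gogoproto.nullable) = true` produces (the
struct tag and the elided fields are assumptions based on gogoproto
conventions, not copied from the diff):

```go
package roachpb_sketch

import hlc "github.com/cockroachdb/cockroach/pkg/util/hlc"

// Lease sketches the generated struct in pkg/roachpb/data.pb.go; all
// other fields are elided. Because ProposedTs is a pointer, leaving it
// nil omits tag 5 from the wire encoding entirely, which is what keeps
// a lease serialized by pre-rollout and post-rollout binaries
// byte-for-byte equal.
type Lease struct {
	ProposedTs *hlc.Timestamp `protobuf:"bytes,5,opt,name=proposed_ts" json:"proposed_ts,omitempty"`
}
```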
60 changes: 32 additions & 28 deletions pkg/server/server.go
@@ -553,19 +553,42 @@ func (s *Server) Start(ctx context.Context) error {
 	}
 	s.stopper.AddCloser(&s.engines)
 
-	anyStoreBootstrapped := false
-	for _, e := range s.engines {
-		if _, err := storage.ReadStoreIdent(ctx, e); err != nil {
-			// NotBootstrappedError is expected.
-			if _, ok := err.(*storage.NotBootstrappedError); !ok {
-				return err
-			}
-		} else {
-			anyStoreBootstrapped = true
-			break
-		}
-	}
+	// We might have to sleep a bit to protect against this node producing non-
+	// monotonic timestamps. Before restarting, its clock might have been driven
+	// by other nodes' fast clocks, but when we restarted, we lost all this
+	// information. For example, a client might have written a value at a
+	// timestamp that's in the future of the restarted node's clock, and if we
+	// don't do something, the same client's read would not return it. So, we
+	// wait up to MaxOffset. We assume we couldn't have served timestamps more
+	// than MaxOffset in the future.
+	//
+	// As an optimization for tests, we don't sleep if all the stores are brand
+	// new. In this case, the node will not serve anything anyway until it
+	// synchronizes with other nodes.
+	{
+		anyStoreBootstrapped := false
+		for _, e := range s.engines {
+			if _, err := storage.ReadStoreIdent(ctx, e); err != nil {
+				// NotBootstrappedError is expected.
+				if _, ok := err.(*storage.NotBootstrappedError); !ok {
+					return err
+				}
+			} else {
+				anyStoreBootstrapped = true
+				break
+			}
+		}
+		if anyStoreBootstrapped {
+			sleepDuration := s.clock.MaxOffset() - timeutil.Since(startTime)
+			if sleepDuration > 0 {
+				log.Infof(ctx, "sleeping for %s to guarantee HLC monotonicity", sleepDuration)
+				time.Sleep(sleepDuration)
+			}
+		}
+	}
+
+	// Now that we have a monotonic HLC wrt previous incarnations of the process,
+	// init all the replicas.
 	err = s.node.start(
 		ctx,
 		unresolvedAdvertAddr,
@@ -609,25 +632,6 @@ func (s *Server) Start(ctx context.Context) error {
 		log.Infof(ctx, "starting postgres server at unix:%s", s.cfg.SocketFile)
 	}
 
-	// We might have to sleep a bit to protect against this node producing non-
-	// monotonic timestamps. Before restarting, its clock might have been driven
-	// by other nodes' fast clocks, but when we restarted, we lost all this
-	// information. For example, a client might have written a value at a
-	// timestamp that's in the future of the restarted node's clock, and if we
-	// don't do something, the same client's read would not return it. So, we
-	// wait up to MaxOffset. We assume we couldn't have served timestamps more
-	// than MaxOffset in the future.
-	//
-	// As an optimization for tests, we don't sleep if all the stores are brand
-	// new. In this case, the node will not serve anything anyway until it
-	// synchronizes with other nodes.
-	if anyStoreBootstrapped {
-		sleepDuration := s.clock.MaxOffset() - timeutil.Since(startTime)
-		if sleepDuration > 0 {
-			log.Infof(ctx, "sleeping for %s to guarantee HLC monotonicity", sleepDuration)
-			time.Sleep(sleepDuration)
-		}
-	}
 	s.stopper.RunWorker(func() {
 		netutil.FatalIfUnexpected(m.Serve())
 	})
94 changes: 92 additions & 2 deletions pkg/storage/engine/rocksdb/cockroach/pkg/roachpb/data.pb.cc

Some generated files are not rendered by default.

60 changes: 60 additions & 0 deletions pkg/storage/engine/rocksdb/cockroach/pkg/roachpb/data.pb.h

Some generated files are not rendered by default.

