Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: lease transfer must be persisted to disk #7996

Closed
tbg opened this issue Jul 24, 2016 · 0 comments
Closed

storage: lease transfer must be persisted to disk #7996

tbg opened this issue Jul 24, 2016 · 0 comments
Assignees

Comments

@tbg
Copy link
Member

tbg commented Jul 24, 2016

Or a node which transfers its lease away might restart rapidly and use a tail end of its lease while its lease was really shortened.

The naive fix (which might well be the right one) is to persist the expiration date of the transferred lease to disk (keyed by RangeID) and, when starting the process, sitting ducks until the local HLC clock - max_offset reports a higher timestamp.

There is also the somewhat related issue that the hlc clock itself may jump backwards. For that reason alone, we should already sleep max_offset when starting the process.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 24, 2016
... so we don't use those leases after a restart

fixes cockroachdb#7996
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 11, 2016
... so we don't use those leases after a restart

fixes cockroachdb#7996
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 25, 2016
Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.
Another problem is that a lease that was in the process of being
transferred away before the restart is used, and that's no good because
it breaks the lease holder's promise to not use that lease.

This patch adds a bit of in-memory state to a replica - whether the
lease has changed since the process has been started. If it hasn't, a
lease is not used _at propose time_ and instead the holder pretends it
doesn't have a valid lease and tries to acquire one. Nodes other than
the holder do not modify their behavior.
No similar check is done at apply time (nor could it) since there we
naturally rely on Raft to order writes; so, if the proposer is still the
lease holder, there's no reason not to apply a command.

fixes cockroachdb#7996
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 25, 2016
Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.
Another problem is that a lease that was in the process of being
transferred away before the restart is used, and that's no good because
it breaks the lease holder's promise to not use that lease.

This patch adds a bit of in-memory state to a replica - whether the
lease has changed since the process has been started. If it hasn't, a
lease is not used _at propose time_ and instead the holder pretends it
doesn't have a valid lease and tries to acquire one. Nodes other than
the holder do not modify their behavior.
No similar check is done at apply time (nor could it) since there we
naturally rely on Raft to order writes; so, if the proposer is still the
lease holder, there's no reason not to apply a command.

fixes cockroachdb#7996
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 2, 2016
…sfer

Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.

There was also a problem before this patch, related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous - there's no guarantee that the transfer will still apply. In
such cases, the replica was actually breaking its promise to not use the
lease after initiating the tranfer.

The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTs` and a `Lease.ProposedTs`. These fields are
used to enforce the fact that only leases proposed after a particular
time will be used *for proposing* new commands, or for serving reads. On
a transfer, the field for the replica in question is set to the present
time. On restart, the field is set for all the replicas to the current
time, after the Server has waited such that it has a guarantee that the
current HLC is above what it might have been on previous incarnations.
This ensures that, in both the transfer and the restart case, leases
that have been proposed before the transfer/restart and were in flight
at the time when the transfer/restart happens are not eligible to be
used.

This patch also changes the way waiting on startup for a server's HLC to
be guaranteed to be monotonic wrt to the HLC before the restart works.
Before, we were waiting just before starting serving, but after all the
stores had been started. The point of the waiting was to not serve reads
thinking we have a lower timestamp than before the restart. I'm unclear
about whether that was correct or not, considering that, for example, a
store's queues were already running and potentially doing work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize replica.minLeaseProposedTs with a value higher than
the node had before the restart), I've had to do the waiting *before*
initializing the stores. I could have done the setting of that field
late, but it seems even more dangerous than before to allow queues to do
work with potentially bad leases.
We lost the amortization of this wait time with the store creation
process... But note that, for empty engines (e.g. in tests) we don't do
any waiting.

Fixes cockroachdb#7996

THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD"
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers being applied by old servers - they're
going to be serialized differently and the consistency checker will flag
it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the world
restart will be serialized the same as they have been by the nodes that
already applied them before the restart.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 3, 2016
…sfer

Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.

There was also a problem before this patch, related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous - there's no guarantee that the transfer will still apply. In
such cases, the replica was actually breaking its promise to not use the
lease after initiating the tranfer.

The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTS` and a `Lease.ProposedTS`. These fields are
used to enforce the fact that only leases proposed after a particular
time will be used *for proposing* new commands, or for serving reads. On
a transfer, the field for the replica in question is set to the present
time. On restart, the field is set for all the replicas to the current
time, after the Server has waited such that it has a guarantee that the
current HLC is above what it might have been on previous incarnations.
This ensures that, in both the transfer and the restart case, leases
that have been proposed before the transfer/restart and were in flight
at the time when the transfer/restart happens are not eligible to be
used.

This patch also changes the way waiting on startup for a server's HLC to
be guaranteed to be monotonic wrt to the HLC before the restart works.
Before, we were waiting just before starting serving, but after all the
stores had been started. The point of the waiting was to not serve reads
thinking we have a lower timestamp than before the restart. I'm unclear
about whether that was correct or not, considering that, for example, a
store's queues were already running and potentially doing work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize replica.minLeaseProposedTS with a value higher than
the node had before the restart), I've had to do the waiting *before*
initializing the stores. I could have done the setting of that field
late, but it seems even more dangerous than before to allow queues to do
work with potentially bad leases.
We lost the amortization of this wait time with the store creation
process... But note that, for empty engines (e.g. in tests) we don't do
any waiting.

Fixes cockroachdb#7996

THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD"
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers being applied by old servers - they're
going to be serialized differently and the consistency checker will flag
it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the world
restart will be serialized the same as they have been by the nodes that
already applied them before the restart.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 4, 2016
…sfer

Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.

There was also a problem before this patch, related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous - there's no guarantee that the transfer will still apply. In
such cases, the replica was actually breaking its promise to not use the
lease after initiating the tranfer.

The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTS` and a `Lease.ProposedTS`. These fields are
used to enforce the fact that only leases proposed after a particular
time will be used *for proposing* new commands, or for serving reads. On
a transfer, the field for the replica in question is set to the present
time. On restart, the field is set for all the replicas to the current
time, after the Server has waited such that it has a guarantee that the
current HLC is above what it might have been on previous incarnations.
This ensures that, in both the transfer and the restart case, leases
that have been proposed before the transfer/restart and were in flight
at the time when the transfer/restart happens are not eligible to be
used.

This patch also changes the way waiting on startup for a server's HLC to
be guaranteed to be monotonic wrt to the HLC before the restart works.
Before, we were waiting just before starting serving, but after all the
stores had been started. The point of the waiting was to not serve reads
thinking we have a lower timestamp than before the restart. I'm unclear
about whether that was correct or not, considering that, for example, a
store's queues were already running and potentially doing work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize replica.minLeaseProposedTS with a value higher than
the node had before the restart), I've had to do the waiting *before*
initializing the stores. I could have done the setting of that field
late, but it seems even more dangerous than before to allow queues to do
work with potentially bad leases.
We lost the amortization of this wait time with the store creation
process... But note that, for empty engines (e.g. in tests) we don't do
any waiting.

Fixes cockroachdb#7996

THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD"
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers being applied by old servers - they're
going to be serialized differently and the consistency checker will flag
it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the world
restart will be serialized the same as they have been by the nodes that
already applied them before the restart.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 7, 2016
…sfer

Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.

There was also a problem before this patch, related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous - there's no guarantee that the transfer will still apply. In
such cases, the replica was actually breaking its promise to not use the
lease after initiating the tranfer.

The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTS` and a `Lease.ProposedTS`. These fields are
used to enforce the fact that only leases proposed after a particular
time will be used *for proposing* new commands, or for serving reads. On
a transfer, the field for the replica in question is set to the present
time. On restart, the field is set for all the replicas to the current
time, after the Server has waited such that it has a guarantee that the
current HLC is above what it might have been on previous incarnations.
This ensures that, in both the transfer and the restart case, leases
that have been proposed before the transfer/restart and were in flight
at the time when the transfer/restart happens are not eligible to be
used.

This patch also changes the way waiting on startup for a server's HLC to
be guaranteed to be monotonic wrt to the HLC before the restart works.
Before, we were waiting just before starting serving, but after all the
stores had been started. The point of the waiting was to not serve reads
thinking we have a lower timestamp than before the restart. I'm unclear
about whether that was correct or not, considering that, for example, a
store's queues were already running and potentially doing work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize replica.minLeaseProposedTS with a value higher than
the node had before the restart), I've had to do the waiting *before*
initializing the stores. I could have done the setting of that field
late, but it seems even more dangerous than before to allow queues to do
work with potentially bad leases.
We lost the amortization of this wait time with the store creation
process... But note that, for empty engines (e.g. in tests) we don't do
any waiting.

Fixes cockroachdb#7996

THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD"
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers being applied by old servers - they're
going to be serialized differently and the consistency checker will flag
it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the world
restart will be serialized the same as they have been by the nodes that
already applied them before the restart.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants