storage, server: don't reuse leases obtained before a restart or transfer #10420
Conversation
Review status: 0 of 13 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/storage/replica.go, line 3647 at r2 (raw file):

I'm not entirely clear on when exactly we want to gossip and when we don't, and I'm not sure of the consequences of gossiping at the wrong time. Maybe @tschottdorf or someone else can suggest a better comment.

Comments from Reviewable
Needs a rebase and has some failing tests. Should we be reviewing before the tests are passing?
Yeah, please review. The tests are about the proto that changed, and … Btw, I'm also writing a test for a server restart, but that needs some new …
Reviewed 13 of 13 files at r2.

pkg/roachpb/data.proto, line 335 at r2 (raw file):

s/of the proposed//

pkg/storage/replica.go, line 618 at r2 (raw file):

This will give all the replicas (slightly) different timestamps. This could be a single value chosen by the store, although that's probably not worth the plumbing.

pkg/storage/replica.go, line 3647 at r2 (raw file):
good test (ok, a bit hard to follow along, but that's just how those tests are). Does this affect the runtime of the chaos acceptance tests at all? Those are the only tests that would still potentially eat longer unavailability because of …

Reviewed 13 of 13 files at r2.

pkg/roachpb/data.proto, line 335 at r2 (raw file):
Reviewed 1 of 1 files at r1, 10 of 13 files at r2.

pkg/server/server.go, line 562 at r2 (raw file):
Force-pushed from 6cca1f8 to 4b2a30a
still haven't written that restart test... but I will.

Review status: all files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.

pkg/roachpb/data.proto, line 335 at r2 (raw file):
Reviewed 14 of 14 files at r3, 12 of 15 files at r4.

pkg/server/server.go, line 562 at r2 (raw file):
Reviewed 1 of 14 files at r3, 15 of 15 files at r4.

pkg/server/server.go, line 562 at r2 (raw file):
Force-pushed from 4b2a30a to ea274f2
Review status: 0 of 17 files reviewed at latest revision, 9 unresolved discussions.

pkg/server/server.go, line 562 at r2 (raw file):
Reviewed 1 of 16 files at r5, 12 of 16 files at r6.

pkg/storage/client_raft_test.go, line 309 at r4 (raw file):
Before (`allocsim -n 4 -w 4`):

```
_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1448.8        0     1686   559/554     383/0     331/1     413/2
```

After:

```
_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1413.7        0     1658   406/132   422/141   410/136   420/138
```

I do see a little bit of thrashiness with the lease-transfer heuristics, though it settles down relatively quickly, so I'm not sure if it is worth addressing yet.

Depends on cockroachdb#10420
Force-pushed from ea274f2 to 3d508a1
the only chaos test that's not skipped is TestNodeRestart, which takes anywhere between 15 and 22s, with or without this PR.

Review status: 0 of 17 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.

pkg/storage/client_raft_test.go, line 309 at r4 (raw file):
Reviewed 1 of 17 files at r7, 13 of 16 files at r8.

pkg/storage/replica_range_lease.go, line 217 at r8 (raw file):

slight preference for a binding local to this

Comments from Reviewable
Before (`allocsim -n 4 -w 4`):

```
_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1472.5        0     2124   704/687     492/0     522/0     406/0
```

After:

```
_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1506.7        0     2157   520/183   548/180   547/166   542/180
```

I do see a little bit of thrashiness with the lease-transfer heuristics, though it settles down relatively quickly, so I'm not sure if it is worth addressing yet.

Depends on cockroachdb#10420
This actually highlights some open questions that I am posting this PR for.

Touches cockroachdb#10420.

Release note: None
While we were already making sure that a lease obtained before a node restart was not used after, the new requested lease would usually be an extension of the old. As such, commands proposed under both would be able to apply under the new one, which could theoretically cause consistency issues as the previous commands would not be tracked by the command queue (though it would be hard to engineer and has not been observed in practice, to the best of our knowledge). This change plugs that hole by preventing an extension of a previously held lease post restart. Touches cockroachdb#10420. Release note (bug fix): Prevent potential consistency issues when a node is stopped and restarted in rapid succession.
Before this patch, leases held before a restart could be used after the
restart. This is incorrect, since the command queue has been wiped in
the restart and so reads and writes are not properly sequenced with
possible in-flight commands.
There was also a problem before this patch, related to lease transfers:
if a transfer command returned an error to the `replica.Send()` call
that proposed it, we'd clear our "in transfer" state and allow the
replica to use the existing lease. However, such errors are generally
ambiguous - there's no guarantee that the transfer will still apply. In
such cases, the replica was actually breaking its promise to not use the
lease after initiating the transfer.
The current patch addresses both these problems by introducing a
`replica.minLeaseProposedTS` and a `Lease.ProposedTS`. These fields are
used to enforce the fact that only leases proposed after a particular
time will be used *for proposing* new commands, or for serving reads. On
a transfer, the field for the replica in question is set to the present
time. On restart, the field is set for all the replicas to the current
time, after the Server has waited such that it has a guarantee that the
current HLC is above what it might have been on previous incarnations.
This ensures that, in both the transfer and the restart case, leases
that have been proposed before the transfer/restart and were in flight
at the time when the transfer/restart happens are not eligible to be
used.
This patch also changes how we wait on startup for a server's HLC to be
guaranteed monotonic with respect to the HLC before the restart.
Before, we were waiting just before starting serving, but after all the
stores had been started. The point of the waiting was to not serve reads
thinking we have a lower timestamp than before the restart. I'm unclear
about whether that was correct or not, considering that, for example, a
store's queues were already running and potentially doing work.
Now that we also rely on that waiting for not reusing old leases (we
need to initialize `replica.minLeaseProposedTS` with a value higher than
the node had before the restart), I've had to do the waiting *before*
initializing the stores. I could have done the setting of that field
late, but it seems even more dangerous than before to allow queues to do
work with potentially bad leases.
We lost the amortization of this wait time with the store creation
process... But note that, for empty engines (e.g. in tests) we don't do
any waiting.
Fixes #7996
THIS PATCH NEEDS TO BE DEPLOYED THROUGH A "STOP THE WORLD"
Because of the field being added to the Lease proto, we can't have
leases proposed by new servers being applied by old servers - they're
going to be serialized differently and the consistency checker will flag
it.
Note that a "freeze" is not required, though. The new field is nullable,
so any lease requests that might be in flight at the time of the world
restart will be serialized the same as they have been by the nodes that
already applied them before the restart.
cc @petermattis @tschottdorf @bdarnell
@tamird for validating what I've said above about not needing the freeze
@jseldess for keeping in mind that the beta where we roll this out will need a "stop the world"