Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: understand, prevent, and recover quickly from a leaseholder needing a Raft snapshot #81561

Closed
2 tasks done
nvanbenschoten opened this issue May 19, 2022 · 2 comments
Closed
2 tasks done
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented May 19, 2022

Revival of #68539 and #61604.

A common form of prolonged availability loss is due to situations where a Range's leaseholder is in need of a Raft snapshot. During these situations, the Range's leaseholder (as defined by the replicated state machine) is not caught up on its log far enough to recognize that it holds the lease. As a result, every KV request to the leaseholder is met with a redirection to an earlier leaseholder, who in turn redirects back to the replica in need of a snapshot. However, even though the leaseholder does not recognize itself as such, it continues to heartbeat its liveness record, indirectly extending its lease so that it does not expire. The consequence of this situation is availability loss on the range until the leaseholder replica receives a snapshot.

Understand

How does this situation happen? There are two ways that a replica can become the leaseholder. The first is through a non-cooperative RequestLease, where a replica acquires a lease for a range that does not currently have a leaseholder. The second is through a cooperative LeaseTransfer, where the current leaseholder passes the lease to another replica in its range.

RequestLease requests behave by proposing a request through Raft that, when committed, performs a compare-and-swap on the previous, expired lease. This lease acquisition request is bound to fail (because the lease that it's based on is stale), but while it fails (or, rather, until the behind replica finds out that it failed) local requests are blocked. In the past (#37906), this was observed to cause outages, as a replica that was behind on its log could propose a RequestLease and then block until it heard the result of the request, which required a snapshot. We've since resolved these issues in #55148, which prevented follower replicas from proposing RequestLease requests.

An important invariant in Raft is that a leader at the time of election is never behind on its log and in need of a snapshot. However, it is possible for a replica to think that it is the leader after it has already been replaced. This does leave a small margin where a leader could propose a RequestLease after it has been voted out. This would appear to be a problem. However, in practice it is not because the locking in propBuf.FlushLockedWithRaftGroup ensures that the raftGroup is never stepped between the leadership check and the proposal. This means that the raftGroup will always try to propose the RequestLease itself instead of forwarding it to the new leader. In such cases, the proposal must be rejected by the outgoing leader's peers. So the protection in #55148 is sound and RequestLease should never create the leaseholder-in-need-of-snapshot scenario.

LeaseTransfer requests are more like normal Raft log proposals. They are proposed "under" a current lease with a lease sequence and max lease applied index. This means that they can only be proposed by the current leaseholder and will be rejected if committed out of order (e.g. after the leaseholder has been replaced). Below Raft, LeaseTransfer requests can target any replica to assume the lease. However, the outgoing leaseholder contains a series of checks in its allocator and during command evaluation that attempt to ensure that this is a "good" lease target.

These checks are flawed and do not guarantee the protection we would like for three reasons:

  1. they are incomplete. The check that the sender of the lease transfer is the leader takes place on some code paths, but not on others. This improved in a6a8d5c, but there are still gaps in this protection. For instance, an AdminTransferLeaseRequest (used by some higher-level rebalancing code paths) bypasses this protection.
  2. they are racy. While there are checks that consult the local raft status to check whether the replica is the raft leader and that the lease transfer target can catch up from this leader's log, this information may be stale by the time the lease transfer is proposed. Raft leadership may have moved by this point. Similarly, the raft log may also have been truncated.
  3. they rely on a flawed assumption that a replica that can be caught up using the log by one raft leader could be caught up using the log by another raft leader, should leadership change. This has not been true since storage: don't send historical Raft log with snapshots #35701.

A third potential avenue that could create this situation is if the raft log is truncated while already in a split leaseholder/leader situation. However, this is not possible in practice, as the leaseholder is always the replica that decides on the log truncation index, so it will never truncate above its own raft log. For added protection, we also disable Raft log truncation by leaseholders that are not also leaders.

Out of the three potential cases that could lead to a leaseholder needing a snapshot, two are guaranteed today to not create such a situation. However, lease transfers have insufficient protection against creating this situation. We should fix them.

Prevent

To prevent this situation, we need firmer guarantees during leaseholder transfers. #55148 provides a blueprint through which to think about protections that are both complete and non-racy. The protection added #55148 is built into the raft propBuf and run within the Raft state machine loop. This ensures that it applies to all Raft proposals (and re-proposals) and has an accurate understanding of Raft leadership (or else the proposal will be rejected).

We should do something similar for lease transfers. We should add a check into propBuf.FlushLockedWithRaftGroup that only allows the Raft leader to propose lease transfers and only to replicas who are 1) in StateReplicate and 2) have a Match index that is > the leaseholder's understanding of the Raft log's truncated index. Latching on the leaseholder will ensure that log truncation and lease transfers are properly synchronized, so that any log truncation request immediately before a lease transfer is accounted for in the protection.

We should also consider reverting #35701, or otherwise finding a way to ensure that a leaseholder's view of the log truncation index is an accurate upper bound of the earlier log index than any future Raft leader might contain. Otherwise, we are still susceptible to a leadership change re-introducing the need to snapshot the leaseholder.

Recover

Even with what we believe to be firm guarantees against a leaseholder getting into this state, we should still optimize for recovery from it, given the severity of any mistake. This kind of improved recovery likely includes sender-side prioritization of snapshots to leaseholders, which is underway in #80817.

It may also include:

  • receiver-side prioritization of snapshots to leaseholders
  • higher snapshot rate limits in these cases

Action Items

Jira issue: CRDB-15077

Epic CRDB-16160

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. A-kv Anything in KV that doesn't belong in a more specific category. labels May 19, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 19, 2022
@nvanbenschoten
Copy link
Member Author

In a private slack conversation, @tbg had an additional proposal to minimize the blast radius of an ill-advised lease transfer. It's quite clever and serves as an actionable recovery mechanism. The idea stems out of:

However, even though the leaseholder does not recognize itself as such, it continues to heartbeat its liveness record, indirectly extending its lease so that it does not expire.

This is a key part of the hazard here. With expiration-based leases, a leaseholder needs to periodically (every 4.5 seconds) extend the lease by 9 seconds or it will expire and be up for grabs. This requires the leaseholder to recognize itself as the leaseholder within 9 seconds after a lease transfer. However, with epoch-based leases, this lease extension is indirectly performed through the node's liveness record. This means that a newly appointed leaseholder can continue to hold on to the lease even if it doesn't recognize itself as the leaseholder for an unbounded amount of time.

Tobi's proposal is that even on the portion of the keyspace that can use epoch-based leases, lease transfers could always install an expiration-based lease. The new leaseholder would then need to learn about this lease within 9 seconds or the lease would expire. When performing its first lease extension, it would promote the lease back to an epoch-based lease. This limits the damage of a bad lease transfer to a 9-second outage. There a likely some bugs lurking here because we don't regularly switch between epoch and expiration-based leases, but doing so is meant to be possible.

One potential hazard is that this limits the potential lease candidates to those replicas which are less than 9s behind on their log. If a lease target is persistently more than 9s behind, the lease could thrash back and forth. This could partially re-introduce #38065, or an even more disruptive variant of that issue (due to the thrashing). I'm not sure whether that's a real concern, as 9s of replication lag is severe and a leaseholder should not attempt to transfer a lease to a replica that is so far behind on its log. However, it's difficult to reason through whether the quota pool provides any real protection here. We'll need to keep this hazard in mind.

I think we should explore this recovery mechanism in addition to the proposed protection mechanism presented in this issue. With those two changes, we will be in a much better place.

@tbg
Copy link
Member

tbg commented May 24, 2022

We should also consider reverting #35701, or otherwise finding a way to ensure that a leaseholder's view of the log truncation index is an accurate upper bound of the earlier log index than any future Raft leader might contain. Otherwise, we are still susceptible to a leadership change re-introducing the need to snapshot the leaseholder.

I'm opposed to simply reverting 35701. Sending the raft log with snapshots caused lots of issues. If nothing else, it causes overhead that in 99.99% of the cases just isn't needed. Also conceptually it doesn't seem right.

We can instead build these checks into a pre-transfer handshake, but it's sort of unsatisfying if you then fail one of these checks. Do you wait, do you transfer anyway? Seems dangerous to reject lease transfers, seems dangerous to wait for a snapshot that maybe won't get sent because the node is overloaded. It is hard to tell what's best without knowing a lot about the state the system is in.

Taking a step back, similar to how we have follower snapshots, we could also have follower appends (or log snapshots). This is basically "reverting 35701" but reframed into something that makes more sense to me. If a leaseholder sees a follower that it can't catch up, rather than going straight to snapshot, could it know which other follower still has enough of the log, and set up a transfer? That way, we're not constantly shipping log entries that will never again be used in the common case, but we do it when it's necessary and avoid a tight coupling of lease transfers and snapshots.

@1lann 1lann added the O-postmortem Originated from a Postmortem action item. label May 25, 2022
@lunevalex lunevalex added the N-followup Needs followup. label May 26, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 12, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

\### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristics here that attempts to determine "too much
lag" with the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

\### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

\### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 21, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

\### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristics here that attempts to determine "too much
lag" with the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

\### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

\### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
craig bot pushed a commit that referenced this issue Jun 22, 2022
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter and has thus been removed. Closes #78332.

Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071)

`@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763.
Fixes #79385.
Part of #81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log such that it will be able to apply the lease transfer through log entry application then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristics here that attempts to determine "too much lag" with the possibility of starvation of lease transfers under sustained write load and a resulting sustained replication lag. See #38065 and #42379, which removed such a heuristic. For now, we don't try to make such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in this world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let it through. 

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to etcd/raft. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in `AdminTransferLease`, which all lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease transfer in the proposal buffer after we have revoked our current lease is more disruptive than doing so earlier, before we have revoked our current lease. Best-effort protection is also able to respond more gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions because the check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the older leader could have caught the incoming leaseholder up on its log, the new leader may not be able to because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about #35701 in #81561.

This race is theoretical — we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in #81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.

83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events
were generated per-key and applied individually to the simulator state
by finding the range containing that key. This patch batches load events
on the same key, then applies the load events in ascending order, over
the range tree.

This results in a speedup of 5x on a 32 store, 32k replicas, 16k QPS
cluster.

Release note: None

Co-authored-by: Saad Ur Rahman <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

6 participants