
storage: leaseholder preferences are not enforced when the preferred replica is behind the quorum #38065

Closed
awoods187 opened this issue Jun 6, 2019 · 11 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@awoods187
Contributor

awoods187 commented Jun 6, 2019

A customer recently ran into a problem with follow-the-workload.

Background:

  • Cluster topology is 3 nodes in 3 separate regions
  • Most connections are going to Asia (using the right connection string)
  • We determined this is due to the setup of the cluster and their use case
  • The Admin UI is slow because the customer is connecting to it through the global LB

Follow-the-workload didn’t seem to work:

  • 80-90% of connections were issued in Asia, but the leaseholders did not move there
  • The leases did not move, as evidenced by SHOW RANGES output in the SQL CLI (see the sketch below)
  • Localities were enabled
  • n1 (asia-southeast1) received nearly all connections & queries
  • The system.users and wallets table leases were on n2 (us-central1)
  • Alter zone configs with leaseholder preferences return successfully but do not actually alter the zone configuration, and leases do not transfer
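
For reference, a minimal sketch of the kind of check described above, assuming v19.1+ syntax; the wallets table from the list stands in for the customer's actual table:

-- List each range of the table with its current leaseholder and leaseholder locality.
SHOW RANGES FROM TABLE wallets;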
@awoods187 awoods187 added the A-kv-distribution Relating to rebalancing and leasing. label Jun 6, 2019
@awoods187 awoods187 added the C-investigation Further steps needed to qualify. C-label will change. label Jun 6, 2019
@awoods187 awoods187 changed the title from “core: follow the workload appears to not be enabled” to “core: follow the workload appears to not be working” Jun 6, 2019
@ajwerner
Contributor

ajwerner commented Jun 6, 2019

I am not sure this is a follow-the-workload problem. There will always be a tension between optimizing for latency and balancing load. In this specific case there was an egregiously bad zone configuration: all of the indices for the table in question (as well as indices for some other tables) were set to have a leaseholder preference in Asia, but the table itself (and its primary index) did not have that leaseholder preference set. The primary index for the table was placed in Europe. Traffic to that primary index represented roughly 100% of the traffic to the node in Europe. If follow-the-workload were to move the lease to Asia, it would leave the node in Europe with zero traffic while adding to the next-most-loaded node. I think I'd chalk this specific case up as user error.
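
To make the gap concrete, here is a hedged sketch of the zone configs that would express the intended preference: it has to be set on the table itself (which covers the primary index), not only on the secondary indexes. The table and index names below are placeholders, not the customer's actual schema:

-- Preference on the table itself, which governs the primary index.
ALTER TABLE wallets CONFIGURE ZONE USING
    lease_preferences = '[[+region=gcp-asia-southeast1]]';

-- Secondary indexes carry their own zone configs, so the preference must be set on them separately.
-- (wallets_by_user_idx is a hypothetical index name.)
ALTER INDEX wallets@wallets_by_user_idx CONFIGURE ZONE USING
    lease_preferences = '[[+region=gcp-asia-southeast1]]';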

@ajwerner
Contributor

ajwerner commented Jun 6, 2019

> Alter zone configs with leaseholder preferences return successfully but do not actually alter the zone configuration, and leases do not transfer.

This I want to dig into. What do you mean?

@awoods187
Contributor Author

Deferring to @tim-o for the response on the alter zone.

@tim-o
Contributor

tim-o commented Jun 6, 2019

alter table <TableName> CONFIGURE ZONE USING lease_preferences = '[[+region=gcp-asia-southeast1]]';
CONFIGURE ZONE 1

Time: 1.011314805s

Running SHOW ZONE CONFIGURATIONS; afterwards returns no result for <TableName>, even hours after the fact.
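
A narrower check that may help here, assuming v19.1+ syntax and keeping the <TableName> placeholder from above:

-- Show the zone configuration in effect for just this table, including inherited values.
SHOW ZONE CONFIGURATION FOR TABLE <TableName>;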

@ajwerner
Contributor

ajwerner commented Jun 6, 2019

Okay that's the real bug here. Should we re-label the issue or create a new one?

@awoods187
Contributor Author

Let's relabel as you see fit @ajwerner

@vilterp
Contributor

vilterp commented Jun 6, 2019

I would say not being able to set a zone config on a table should be a new bug.

There still may be something about follow the workload not working; let's leave this bug to discuss that.

@awoods187
Contributor Author

Alright I opened #38074 for the zone config issue. @vilterp what makes you think this isn't user error?

@ajwerner
Contributor

ajwerner commented Jun 6, 2019

I think we probably need a low-severity issue to discuss the nuance between balancing load and following the workload; we can use this one for that. There should be a high-severity issue for the ALTER TABLE ... CONFIGURE ZONE problem. @tim-o you're the one who experienced it, any chance you could write up your experience? It seems like there's a related bug where we cannot load the Data Distribution & Zone Configuration page.

@ajwerner
Contributor

ajwerner commented Jun 6, 2019

We now have a better understanding of why the lease isn't being transferred, and it's pretty interesting.

The cluster here has 3 nodes, one each in Asia (n1), Europe (n2), and the US (n3). See the latency chart:

[Screenshot: Network Diagnostics debug page from the Cockroach Console showing inter-node latencies]

Most of the traffic is from Asia, and furthermore we've set leaseholder preferences to be in Asia, yet we see that the leaseholder for the primary index is stuck in Europe. Why? Shouldn't the system transfer it? Well, yes, we do want to transfer the lease, but unfortunately we're foiled by this strict logic to prune behind replicas, combined with the speed of light. In this topology, under this constant workload, the faraway node is essentially always behind. This logic is very strict, much stricter than the logic in the quotaPool, which allows replicas that are making progress but a bit behind due to replication latency to avoid blocking writes (see here).

@andreimatei andreimatei added S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting and removed C-investigation Further steps needed to qualify. C-label will change. labels Sep 8, 2019
@andreimatei andreimatei changed the title from “core: follow the workload appears to not be working” to “storage: leaseholder preferences are not enforced when the preferred replica is behind the quorum” Sep 8, 2019
@darinpp
Contributor

darinpp commented Nov 26, 2019

This is fixed in master and the fix is backported to 19.2/19.1

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 12, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristic here that attempts to determine "too much
lag" with the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 21, 2022
craig bot pushed a commit that referenced this issue Jun 22, 2022
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter, which has therefore been removed. Closes #78332.

Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071)

`@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763.
Fixes #79385.
Part of #81561.


83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events
were generated per key and applied individually to the simulator state
by finding the range containing that key. This patch batches load events
on the same key, then applies the load events in ascending key order
over the range tree.

This results in a speedup of 5x on a 32-store, 32k-replica, 16k-QPS
cluster.

Release note: None

Co-authored-by: Saad Ur Rahman <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>