Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: don't allow lease transfers to replicas in need of snapshot #82758

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Jun 10, 2022

Fixes #81763.
Fixes #79385.
Part of #81561.

Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log such that it will be able to apply the lease transfer through log entry application then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristics here that attempts to determine "too much lag" with the possibility of starvation of lease transfers under sustained write load and a resulting sustained replication lag. See #38065 and #42379, which removed such a heuristic. For now, we don't try to make such a determination.

Patch Details

However, with this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in this world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to etcd/raft. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in AdminTransferLease, which all lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease transfer in the proposal buffer after we have revoked our current lease is more disruptive than doing so earlier, before we have revoked our current lease. Best-effort protection is also able to respond more gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions because the check is performed:

  1. by the current Raft leader, else the proposal will fail
  2. while holding latches that prevent interleaving log truncation

Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the older leader could have caught the incoming leaseholder up on its log, the new leader may not be able to because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about #35701 in #81561.

This race is theoretical — we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in #81764 to limit its blast radius.


Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.

@nvanbenschoten nvanbenschoten marked this pull request as ready for review June 10, 2022 21:50
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner June 10, 2022 21:50
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leaseTransferGuard branch 4 times, most recently from b625cbd to 31d203a Compare June 11, 2022 19:27
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leaseTransferGuard branch 2 times, most recently from 3f24f2c to a374775 Compare June 12, 2022 15:57
craig bot pushed a commit that referenced this pull request Jun 13, 2022
82677: sql, pgwire: Add `SEVERITY_NONLOCALIZED` error field r=e-mbrown a=e-mbrown

Resolves #81794

Release note (sql change): We now send the Severity_Nonlocalized field
in the pgwire Notice Response.

82797: kv: extract etcd/raft utilities into raftutil library r=nvanbenschoten a=nvanbenschoten

Commit pulled from #82758.

This commit extracts some scattered logic to interpret the state of an `etcd/raft` Status into a new `raftutil` library. The library initially has two functions: `ReplicaIsBehind` and `ReplicaMayNeedSnapshot`. In the future, I expect that we'll extract more helper functions that make working with `etcd/raft` easier into this library.

82816: Upgrade Go to version 1.17.11 r=jlinder,rickystewart a=rail

* [x] Adjust version in Docker image
* [x] Adjust version in the TeamCity agent image
* [x] Rebuild and push the Docker image
* [x] Download ALL the archives (`.tar.gz`, `.zip`) for the new Go version from https://golang.org/dl/ and mirror them in the  `public-bazel-artifacts` bucket in the `Bazel artifacts` project in GCP (sub-directory `go`, next to the other Go SDK's).
* [x] Bump the version in `WORKSPACE` under `go_download_sdk`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases). Also edit the filenames listed in `sdks` and update all the hashes to match what you mirrored in the step above.
* [x] Run `./dev generate bazel` to refresh `distdir_files.bzl`, then `bazel fetch `@distdir//:archives`` to ensure you've updated all hashes to the correct value.
* [x] Bump the version in `builder.sh` accordingly
* [x] Bump the default installed version of Go in `bootstrap-debian.sh`
* [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and  [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects.
* [ ] Deploy new TeamCity agent images

Partially addresses #82808

Release note: None


Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty great tests and pretty great commentary!

Reviewed 12 of 12 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/replica_proposal_buf.go line 558 at r4 (raw file):

// - RequestLease when the proposer is not the raft leader (with caveats).
// - TransferLease when the proposer cannot guarantee that the lease transfer
//   target does not currently need a Raft snapshot and therefore will not

[mega nit] The triple negative form makes this slightly hard to read.


pkg/kv/kvserver/client_replica_test.go line 2815 at r4 (raw file):

		transferErrC := make(chan error, 1)
		if rejectAfterRevoke {
			_ = tc.Stopper().RunAsyncTask(ctx, "transfer leas", func(ctx context.Context) {

s/leas/lease

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to backport this to 22.1 (and maybe 21.2)?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)

craig bot pushed a commit that referenced this pull request Jun 21, 2022
82798: kv: clean up BatchRequest.IsLeaseRequest r=nvanbenschoten a=nvanbenschoten

Commit pulled from #82758.

This commit cleans up `BatchRequest.IsLeaseRequest` by replacing its internals with a call to `isSingleRequestWithMethod(RequestLease)` and renaming it to mirror the rest of the `BatchRequest.IsSingleXYZRequest` methods.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Pure refactor in preparation of the next commit.
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

\### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristics here that attempts to determine "too much
lag" with the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

\### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

\### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leaseTransferGuard branch from a374775 to 034611b Compare June 21, 2022 21:41
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Do we want to backport this to 22.1 (and maybe 21.2)?

Yes, I think we should. But I also think we'll want to be generous with the baking period. I'll let this sit on master for about a month before backporting. I'd be surprised if there wasn't some kind of fallout.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/kv/kvserver/replica_proposal_buf.go line 558 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[mega nit] The triple negative form makes this slightly hard to read.

I hardly found that this was not difficult to confuse. Reduced to a double negative, which is now not non-trivial to read.

ezgif-3-01abf602cf.jpeg


pkg/kv/kvserver/client_replica_test.go line 2815 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/leas/lease

Done.

@nvanbenschoten
Copy link
Member Author

bors r+

🤞 the fallout is minimal.

@craig
Copy link
Contributor

craig bot commented Jun 22, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 22, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 22, 2022

Build succeeded:

@craig craig bot merged commit 6eb3bf0 into cockroachdb:master Jun 22, 2022
@ajwerner
Copy link
Contributor

I'm interested in the backport here as well as any thoughts as to whether this is related to #82778. I'm adding the label so backboard reminds us.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 28, 2022
Fixes cockroachdb#83498.
Fixes cockroachdb#83402.
Fixes cockroachdb#83308.

This was fallout from cockroachdb#82758.

This commit adds logic to `propBuf.maybeRejectUnsafeProposalLocked` to avoid
trying to reject proposals based on the state of the raft group when the group
is not provided (e.g. when flushing the buffer). We already had this logic for
`RequestLease` (indirectly), but did not for `TransferLease`.
@nvanbenschoten
Copy link
Member Author

I'm adding the label so backboard reminds us.

Thanks. I didn't want to do this before the merge because the auto-backport will skew and fail. Now that this has merged, I'll add the other targets and use this comment to keep a list of the PRs that need to be backported together.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/leaseTransferGuard branch June 28, 2022 17:49
craig bot pushed a commit that referenced this pull request Jun 28, 2022
82352: server, sql: surface session txnCount, recent txn fingerprints, active time r=xinhaoz a=xinhaoz

Finishing up Gerardo's PR, original review here: #80717

--------------------------------------------------
Partially addresses #74257.

Previously, the status server did not provide session details such as
total number of transactions executed, transaction fingerprint
IDs, and total active time. This change adds the aforementioned session
details to the `serverpb.Session` struct.

To track recently executed transaction fingerprint IDs, a FIFO cache
`TxnFingerprintIDCache` is introduced with its corresponding cluster
setting `TxnFingerprintIDBufferCapacity` to control the capacity. The
default capacity is set at 100 fingerprints.

The total number of transactions executed is filled using the existing
`txnCounter` from the `extraTxnState` in `connExecutor`. The total active
time is calculated by introducing a `timeutil.StopWatch` to the connection
executor, which is started and stopped when a transaction is started and
finished respectively.

Release note (api change): the `serverpb.Session` struct now has three
new fields: number of transactions executed, transaction fingerprint
IDs, and total active time.

82623: backupinfo: introduce a backupinfo package r=stevendanna a=adityamaru

The backupinfo package contains logic related to interacting
with information and metadata describing the backup. After this
change we have `backupdest` depending on `backupinfo`.

Release note: None

82718: kvserver: emit MVCC range tombstones over rangefeeds r=aliher1911 a=erikgrinaker

This patch adds MVCC range tombstone support in rangefeeds. Whenever an
MVCC range tombstone is written, a new `MVCCDeleteRangeOp` logical op
is recorded and emitted across the rangefeed as a `RangeFeedDeleteRange`
event. MVCC range tombstones will only be written when the
`MVCCRangeTombstones` version gate has been enabled.

Changefeeds will emit an error for these events. We do not expect to see
these in online spans with changefeeds, since they are initially only
planned for use with schema GC and import rollbacks.

The rangefeed client library has been extended with support for these
events, but no existing callers handle them for the same reason as
changefeeds. Initial scans do not emit regular tombstones, and thus not
range tombstones either, but catchup scans will emit them if
encountered.

This patch has rudimentary testing of MVCC range tombstones in
rangefeeds. A later patch will add a data-driven test harness for
rangefeeds with more exhaustive tests.

Resolves #82449.
Touches #70433.

Release note: None

82936: sql/schemachanger: implement DROP OWNED BY r=jasonmchan a=jasonmchan

Previously, we did not support the DROP OWNED BY statement (#55381).
This commit adds partial support for DROP OWNED BY in the declarative
schema changer. Followup work is needed to support the CASCADE modifier.

Release note (sql change): Support `DROP OWNED BY`.

83229: ui: remove option 10/30 min from SQL Activity page r=maryliag a=maryliag

Note to reviewers: only 2nd commit is relevant to this PR

Previously we had the options for 10 and 30min on
SQL Activity pages, which created some confusion, since
we would always show the last 1h info.
This commit remove those 2 options.
If the user select any of those options on the Metrics
page, it will get updated to 1h on the SQL Activity
pages.

<img width="444" alt="Screen Shot 2022-06-22 at 5 43 53 PM" src="https://user-images.githubusercontent.com/1017486/175144243-2f084e0b-5e09-4874-9640-e7eea6179343.png">

https://www.loom.com/share/226e54322df6456aa2039b5c54f72eb1


Fixes #82914

Release note (ui change): Removal of the 10 and 30min options
on the SQL Activity page.

83420: ui: improve tooltip UX with text updates r=ericharmeling a=ericharmeling

Fixes #81374.
Fixes #83256.
Fixes #81248.
Fixes #79018.

Note the following:

- The updates resolving #79018 effectively revert the tooltip text for Rows Read to the original wording (which [was updated for accuracy](e379e9d#diff-492398441e971e355a687a4ce333a9766e2195287d0227682444d5dc0eb7ee1a)). I assume this is okay. `@kevin-v-ngo`
- The updates resolving #81248 do not in fact refer to the time intervals as date ranges, as this language is misleading (a 1h interval is an interval and not a date range). Instead, this update just removes the anchor and the link to the non-existent Interval Range section of https://www.cockroachlabs.com/docs/stable/ui-statements-page.html. We may want to consider updating the docs to call the "time picker" data type a time interval and not a date range. This appears to have been the case in previous releases (https://www.cockroachlabs.com/docs/v21.1/ui-statements-page#time-interval). `@stbof` 

Release note (ui change): Updated tooltips on the Statements and
Transactions pages in the DB Console for improved UX.

83428: sql: rename anonymizedStmt in sqlstats pkg to stmtNoConstants r=ericharmeling a=ericharmeling

Note that this commit does not change any files outside the sqlstats package.

Fixes #80725.

Release note: None

83468: ui: update all dates to use same format r=maryliag a=maryliag

Update all dates to use the same format.

Fixes #81159

Release note: None

83520: kv: don't try to reject lease transfer when flushing proposal buffer r=nvanbenschoten a=nvanbenschoten

Fixes #83498.
Fixes #83402.
Fixes #83308.

This was fallout from #82758.

This commit adds logic to `propBuf.maybeRejectUnsafeProposalLocked` to avoid
trying to reject proposals based on the state of the raft group when the group
is not provided (e.g. when flushing the buffer). We already had this logic for
`RequestLease` (indirectly), but did not for `TransferLease`.

Co-authored-by: Gerardo Torres <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Jason Chan <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 13, 2023
Fixes cockroachdb#83498.
Fixes cockroachdb#83402.
Fixes cockroachdb#83308.

This was fallout from cockroachdb#82758.

This commit adds logic to `propBuf.maybeRejectUnsafeProposalLocked` to avoid
trying to reject proposals based on the state of the raft group when the group
is not provided (e.g. when flushing the buffer). We already had this logic for
`RequestLease` (indirectly), but did not for `TransferLease`.
nathanstilwell pushed a commit that referenced this pull request Feb 2, 2023
Fixes #83498.
Fixes #83402.
Fixes #83308.

This was fallout from #82758.

This commit adds logic to `propBuf.maybeRejectUnsafeProposalLocked` to avoid
trying to reject proposals based on the state of the raft group when the group
is not provided (e.g. when flushing the buffer). We already had this logic for
`RequestLease` (indirectly), but did not for `TransferLease`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants