-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: merging of ranges with non-voters can be extremely disruptive to foreground traffic #63199
Comments
Hi @aayushshah15, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Thanks for writing this issue up. I agree with the GA-blocker label. Were you able to construct this scenario in a unit test like you were discussing today? |
Yup, our theory here checks out. In fact, there's a much simpler way to simulate the failure -- just add a 5 second sleep in this testing knob. This made range-merges deterministically fail for all non-collocated pairs of LHS/RHS that have non-voters. As expected, range-merges for ranges that don't have non-voters are not affected by this delay. |
Resolves cockroachdb#63199 Before this commit, we relied on the raft snapshot queue to asynchronously perform the initial upreplication of non-voting replicas. This meant that by the time `AdminChangeReplicas` (and consequently, `AdminRelocateRange`) returned to its client, non-voters were not guaranteed to have been initialized. This was a deliberate decision and was, thus far, believed to be copacetic. However, this decision subtly made range merges (of ranges that have any number of non-voters) extremely unlikely to suceed, while causing severe disruption on foreground traffic on the right hand side of a merge. This was because the `mergeQueue` will first call `AdminRelocateRange` on the right hand side range in order to collocate its replicas with the replicas of the left hand side range. If the `mergeQueue` happened to relocate any non-voting replicas, they were likely to still be waiting for their initial snapshot by the time the `AdminMerge` attempted to subsume the RHS. Essentially, this meant that we were subsuming the RHS of a merge while some of its replicas weren't even initialized. This would cause the merge to fail and, in the interim, block all traffic over the RHS range for a 5 second window. This commit fixes the unfortunate sequence of events described above by making the behavior of `AdminChangeReplicas` more symmetric for voting and non-voting replicas. Now, if `AdminChangeReplicas` successfully returns, its client can safely assume that all new replicas have at least been upreplicated via an initial snapshot. Release note: None
Resolves cockroachdb#63199 Before this commit, we relied on the raft snapshot queue to asynchronously perform the initial upreplication of non-voting replicas. This meant that by the time `AdminChangeReplicas` (and consequently, `AdminRelocateRange`) returned to its client, non-voters were not guaranteed to have been initialized. This was a deliberate decision and was, thus far, believed to be copacetic. However, this decision subtly made range merges (of ranges that have any number of non-voters) extremely unlikely to suceed, while causing severe disruption on foreground traffic on the right hand side of a merge. This was because the `mergeQueue` will first call `AdminRelocateRange` on the right hand side range in order to collocate its replicas with the replicas of the left hand side range. If the `mergeQueue` happened to relocate any non-voting replicas, they were likely to still be waiting for their initial snapshot by the time the `AdminMerge` attempted to subsume the RHS. Essentially, this meant that we were subsuming the RHS of a merge while some of its replicas weren't even initialized. This would cause the merge to fail and, in the interim, block all traffic over the RHS range for a 5 second window. This commit fixes the unfortunate sequence of events described above by making the behavior of `AdminChangeReplicas` more symmetric for voting and non-voting replicas. Now, if `AdminChangeReplicas` successfully returns, its client can safely assume that all new replicas have at least been upreplicated via an initial snapshot. Release note: None
Resolves cockroachdb#63199 Before this commit, we relied on the raft snapshot queue to asynchronously perform the initial upreplication of non-voting replicas. This meant that by the time `AdminChangeReplicas` (and consequently, `AdminRelocateRange`) returned to its client, non-voters were not guaranteed to have been initialized. This was a deliberate decision and was, thus far, believed to be copacetic. However, this decision subtly made range merges (of ranges that have any number of non-voters) extremely unlikely to suceed, while causing severe disruption on foreground traffic on the right hand side of a merge. This was because the `mergeQueue` will first call `AdminRelocateRange` on the right hand side range in order to collocate its replicas with the replicas of the left hand side range. If the `mergeQueue` happened to relocate any non-voting replicas, they were likely to still be waiting for their initial snapshot by the time the `AdminMerge` attempted to subsume the RHS. Essentially, this meant that we were subsuming the RHS of a merge while some of its replicas weren't even initialized. This would cause the merge to fail and, in the interim, block all traffic over the RHS range for a 5 second window. This commit fixes the unfortunate sequence of events described above by making the behavior of `AdminChangeReplicas` more symmetric for voting and non-voting replicas. Now, if `AdminChangeReplicas` successfully returns, its client can safely assume that all new replicas have at least been upreplicated via an initial snapshot. Release note: None
Resolves cockroachdb#63199 Before this commit, we relied on the raft snapshot queue to asynchronously perform the initial upreplication of non-voting replicas. This meant that by the time `AdminChangeReplicas` (and consequently, `AdminRelocateRange`) returned to its client, non-voters were not guaranteed to have been initialized. This was a deliberate decision and was, thus far, believed to be copacetic. However, this decision subtly made range merges (of ranges that have any number of non-voters) extremely unlikely to suceed, while causing severe disruption on foreground traffic on the right hand side of a merge. This was because the `mergeQueue` will first call `AdminRelocateRange` on the right hand side range in order to collocate its replicas with the replicas of the left hand side range. If the `mergeQueue` happened to relocate any non-voting replicas, they were likely to still be waiting for their initial snapshot by the time the `AdminMerge` attempted to subsume the RHS. Essentially, this meant that we were subsuming the RHS of a merge while some of its replicas weren't even initialized. This would cause the merge to fail and, in the interim, block all traffic over the RHS range for a 5 second window. This commit fixes the unfortunate sequence of events described above by making the behavior of `AdminChangeReplicas` more symmetric for voting and non-voting replicas. Now, if `AdminChangeReplicas` successfully returns, its client can safely assume that all new replicas have at least been upreplicated via an initial snapshot. Release note: None
Resolves cockroachdb#63199 Before this commit, we relied on the raft snapshot queue to asynchronously perform the initial upreplication of non-voting replicas. This meant that by the time `AdminChangeReplicas` (and consequently, `AdminRelocateRange`) returned to its client, non-voters were not guaranteed to have been initialized. This was a deliberate decision and was, thus far, believed to be copacetic. However, this decision subtly made range merges (of ranges that have any number of non-voters) extremely unlikely to suceed, while causing severe disruption on foreground traffic on the right hand side of a merge. This was because the `mergeQueue` will first call `AdminRelocateRange` on the right hand side range in order to collocate its replicas with the replicas of the left hand side range. If the `mergeQueue` happened to relocate any non-voting replicas, they were likely to still be waiting for their initial snapshot by the time the `AdminMerge` attempted to subsume the RHS. Essentially, this meant that we were subsuming the RHS of a merge while some of its replicas weren't even initialized. This would cause the merge to fail and, in the interim, block all traffic over the RHS range for a 5 second window. This commit fixes the unfortunate sequence of events described above by making the behavior of `AdminChangeReplicas` more symmetric for voting and non-voting replicas. Now, if `AdminChangeReplicas` successfully returns, its client can safely assume that all new replicas have at least been upreplicated via an initial snapshot. Release note: None
62827: sql: disallow dropping enum values used in default/computed columns r=ajwerner a=the-ericwang35 Previously, users could drop enum values being used in default expressions or computed columns, since we did not perform any checks beforehand. This meant that default expressions and computed columns could become corrupted after making such a drop. This patch addresses this by walking default expressions and computed columns when an enum member is dropped, and disallows the drop if it finds any usages. Partially fixes #59807 (this PR addresses the issue in default/computed columns, and #62736 addresses the issue in views). Release note: None 63215: kvserver: perform initial upreplication of non-voters synchronously r=aayushshah15 a=aayushshah15 Resolves #63199 Before this commit, we relied on the raft snapshot queue to asynchronously perform the initial upreplication of non-voting replicas. This meant that by the time `AdminChangeReplicas` (and consequently, `AdminRelocateRange`) returned to its client, non-voters were not guaranteed to have been initialized. This was a deliberate decision and was, thus far, believed to be copacetic. However, this decision subtly made range merges (of ranges that have any number of non-voters) extremely unlikely to suceed, while causing severe disruption on foreground traffic on the right hand side of a merge. This was because the `mergeQueue` will first call `AdminRelocateRange` on the right hand side range in order to collocate its replicas with the replicas of the left hand side range. If the `mergeQueue` happened to relocate any non-voting replicas, they were likely to still be waiting for their initial snapshot by the time the `AdminMerge` attempted to subsume the RHS. Essentially, this meant that we were subsuming the RHS of a merge while some of its replicas weren't even initialized. This would cause the merge to fail and, in the interim, block all traffic over the RHS range for a 5 second window. This commit fixes the unfortunate sequence of events described above by making the behavior of `AdminChangeReplicas` more symmetric for voting and non-voting replicas. Now, if `AdminChangeReplicas` successfully returns, its client can safely assume that all new replicas have at least been upreplicated via an initial snapshot. Release note: None 63388: bazel: bump timeout of `ring_test` r=rickystewart a=rickystewart This has timed out in CI. Release note: None Co-authored-by: Eric Wang <[email protected]> Co-authored-by: Aayush Shah <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
Non-voting replicas are not upreplicated through a synchronous snapshot the way learners are. We queue up new non-voters into the raft snapshot queue and rely on it to initialize newly added / rebalanced non-voters.
A range merge will try to collocate replica sets for the LHS and RHS, which may require rebalancing one or more non-voting replicas from the right hand side range to stores that have replicas for the left hand side range. An
AdminMerge
will then send aSubsumeRequest
to the RHS and wait until all replicas have caught up to the LAI of the subsume.At this point, non-voting replicas that were just rebalanced for the sake of the merge are very likely to still be waiting for, or be in the process of receiving, their initial snapshot from the raft snapshot queue (once the snapshot queue starts sending these newly rebalanced non-voters their initial snapshots, it will take roughly ~64 seconds per replica if we assume default settings for range size and snapshot rates)
So, the
AdminMerge
will hit this 5-second timeout and the merge will fail. This will prompt the merge queue to log an error and try again after about 10 minutes. Now, notice that we’re waiting for those 5 seconds after we’ve sent the subsume request. This means that RHS will not serve any traffic for those 5 seconds.The expected high-level consequence of all this interaction is that all foreground traffic on a range that has non-voting replicas could experience seemingly random 5 second blips, where all requests to such a range are blocked for those 5 seconds.
The text was updated successfully, but these errors were encountered: