Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: merging of ranges with non-voters can be extremely disruptive to foreground traffic #63199

Closed
aayushshah15 opened this issue Apr 7, 2021 · 3 comments · Fixed by #63215
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker

Comments

@aayushshah15
Copy link
Contributor

aayushshah15 commented Apr 7, 2021

Non-voting replicas are not upreplicated through a synchronous snapshot the way learners are. We queue up new non-voters into the raft snapshot queue and rely on it to initialize newly added / rebalanced non-voters.

A range merge will try to collocate replica sets for the LHS and RHS, which may require rebalancing one or more non-voting replicas from the right hand side range to stores that have replicas for the left hand side range. An AdminMerge will then send a SubsumeRequest to the RHS and wait until all replicas have caught up to the LAI of the subsume.

At this point, non-voting replicas that were just rebalanced for the sake of the merge are very likely to still be waiting for, or be in the process of receiving, their initial snapshot from the raft snapshot queue (once the snapshot queue starts sending these newly rebalanced non-voters their initial snapshots, it will take roughly ~64 seconds per replica if we assume default settings for range size and snapshot rates)

So, the AdminMerge will hit this 5-second timeout and the merge will fail. This will prompt the merge queue to log an error and try again after about 10 minutes. Now, notice that we’re waiting for those 5 seconds after we’ve sent the subsume request. This means that RHS will not serve any traffic for those 5 seconds.

The expected high-level consequence of all this interaction is that all foreground traffic on a range that has non-voting replicas could experience seemingly random 5 second blips, where all requests to such a range are blocked for those 5 seconds.

@aayushshah15 aayushshah15 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 7, 2021
@aayushshah15 aayushshah15 self-assigned this Apr 7, 2021
@blathers-crl
Copy link

blathers-crl bot commented Apr 7, 2021

Hi @aayushshah15, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@nvanbenschoten
Copy link
Member

Thanks for writing this issue up. I agree with the GA-blocker label. Were you able to construct this scenario in a unit test like you were discussing today?

@aayushshah15
Copy link
Contributor Author

Were you able to construct this scenario in a unit test like you were discussing today?

Yup, our theory here checks out. In fact, there's a much simpler way to simulate the failure -- just add a 5 second sleep in this testing knob. This made range-merges deterministically fail for all non-collocated pairs of LHS/RHS that have non-voters. As expected, range-merges for ranges that don't have non-voters are not affected by this delay.

aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 7, 2021
Resolves cockroachdb#63199

Before this commit, we relied on the raft snapshot queue to
asynchronously perform the initial upreplication of non-voting replicas.
This meant that by the time `AdminChangeReplicas` (and consequently,
`AdminRelocateRange`) returned to its client, non-voters were not
guaranteed to have been initialized. This was a deliberate decision and
was, thus far, believed to be copacetic. However, this decision subtly
made range merges (of ranges that have any number of non-voters)
extremely unlikely to suceed, while causing severe disruption on
foreground traffic on the right hand side of a merge. This was because
the `mergeQueue` will first call `AdminRelocateRange` on the right hand
side range in order to collocate its replicas with the replicas of the
left hand side range.  If the `mergeQueue` happened to relocate any
non-voting replicas, they were likely to still be waiting for their
initial snapshot by the time the `AdminMerge` attempted to subsume the
RHS. Essentially, this meant that we were subsuming the RHS of a merge
while some of its replicas weren't even initialized. This would cause
the merge to fail and, in the interim, block all traffic over the RHS
range for a 5 second window.

This commit fixes the unfortunate sequence of events described above by
making the behavior of `AdminChangeReplicas` more symmetric for voting
and non-voting replicas. Now, if `AdminChangeReplicas` successfully
returns, its client can safely assume that all new replicas have at
least been upreplicated via an initial snapshot.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 8, 2021
Resolves cockroachdb#63199

Before this commit, we relied on the raft snapshot queue to
asynchronously perform the initial upreplication of non-voting replicas.
This meant that by the time `AdminChangeReplicas` (and consequently,
`AdminRelocateRange`) returned to its client, non-voters were not
guaranteed to have been initialized. This was a deliberate decision and
was, thus far, believed to be copacetic. However, this decision subtly
made range merges (of ranges that have any number of non-voters)
extremely unlikely to suceed, while causing severe disruption on
foreground traffic on the right hand side of a merge. This was because
the `mergeQueue` will first call `AdminRelocateRange` on the right hand
side range in order to collocate its replicas with the replicas of the
left hand side range.  If the `mergeQueue` happened to relocate any
non-voting replicas, they were likely to still be waiting for their
initial snapshot by the time the `AdminMerge` attempted to subsume the
RHS. Essentially, this meant that we were subsuming the RHS of a merge
while some of its replicas weren't even initialized. This would cause
the merge to fail and, in the interim, block all traffic over the RHS
range for a 5 second window.

This commit fixes the unfortunate sequence of events described above by
making the behavior of `AdminChangeReplicas` more symmetric for voting
and non-voting replicas. Now, if `AdminChangeReplicas` successfully
returns, its client can safely assume that all new replicas have at
least been upreplicated via an initial snapshot.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 9, 2021
Resolves cockroachdb#63199

Before this commit, we relied on the raft snapshot queue to
asynchronously perform the initial upreplication of non-voting replicas.
This meant that by the time `AdminChangeReplicas` (and consequently,
`AdminRelocateRange`) returned to its client, non-voters were not
guaranteed to have been initialized. This was a deliberate decision and
was, thus far, believed to be copacetic. However, this decision subtly
made range merges (of ranges that have any number of non-voters)
extremely unlikely to suceed, while causing severe disruption on
foreground traffic on the right hand side of a merge. This was because
the `mergeQueue` will first call `AdminRelocateRange` on the right hand
side range in order to collocate its replicas with the replicas of the
left hand side range.  If the `mergeQueue` happened to relocate any
non-voting replicas, they were likely to still be waiting for their
initial snapshot by the time the `AdminMerge` attempted to subsume the
RHS. Essentially, this meant that we were subsuming the RHS of a merge
while some of its replicas weren't even initialized. This would cause
the merge to fail and, in the interim, block all traffic over the RHS
range for a 5 second window.

This commit fixes the unfortunate sequence of events described above by
making the behavior of `AdminChangeReplicas` more symmetric for voting
and non-voting replicas. Now, if `AdminChangeReplicas` successfully
returns, its client can safely assume that all new replicas have at
least been upreplicated via an initial snapshot.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 9, 2021
Resolves cockroachdb#63199

Before this commit, we relied on the raft snapshot queue to
asynchronously perform the initial upreplication of non-voting replicas.
This meant that by the time `AdminChangeReplicas` (and consequently,
`AdminRelocateRange`) returned to its client, non-voters were not
guaranteed to have been initialized. This was a deliberate decision and
was, thus far, believed to be copacetic. However, this decision subtly
made range merges (of ranges that have any number of non-voters)
extremely unlikely to suceed, while causing severe disruption on
foreground traffic on the right hand side of a merge. This was because
the `mergeQueue` will first call `AdminRelocateRange` on the right hand
side range in order to collocate its replicas with the replicas of the
left hand side range.  If the `mergeQueue` happened to relocate any
non-voting replicas, they were likely to still be waiting for their
initial snapshot by the time the `AdminMerge` attempted to subsume the
RHS. Essentially, this meant that we were subsuming the RHS of a merge
while some of its replicas weren't even initialized. This would cause
the merge to fail and, in the interim, block all traffic over the RHS
range for a 5 second window.

This commit fixes the unfortunate sequence of events described above by
making the behavior of `AdminChangeReplicas` more symmetric for voting
and non-voting replicas. Now, if `AdminChangeReplicas` successfully
returns, its client can safely assume that all new replicas have at
least been upreplicated via an initial snapshot.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 9, 2021
Resolves cockroachdb#63199

Before this commit, we relied on the raft snapshot queue to
asynchronously perform the initial upreplication of non-voting replicas.
This meant that by the time `AdminChangeReplicas` (and consequently,
`AdminRelocateRange`) returned to its client, non-voters were not
guaranteed to have been initialized. This was a deliberate decision and
was, thus far, believed to be copacetic. However, this decision subtly
made range merges (of ranges that have any number of non-voters)
extremely unlikely to suceed, while causing severe disruption on
foreground traffic on the right hand side of a merge. This was because
the `mergeQueue` will first call `AdminRelocateRange` on the right hand
side range in order to collocate its replicas with the replicas of the
left hand side range.  If the `mergeQueue` happened to relocate any
non-voting replicas, they were likely to still be waiting for their
initial snapshot by the time the `AdminMerge` attempted to subsume the
RHS. Essentially, this meant that we were subsuming the RHS of a merge
while some of its replicas weren't even initialized. This would cause
the merge to fail and, in the interim, block all traffic over the RHS
range for a 5 second window.

This commit fixes the unfortunate sequence of events described above by
making the behavior of `AdminChangeReplicas` more symmetric for voting
and non-voting replicas. Now, if `AdminChangeReplicas` successfully
returns, its client can safely assume that all new replicas have at
least been upreplicated via an initial snapshot.

Release note: None
craig bot pushed a commit that referenced this issue Apr 9, 2021
62827: sql: disallow dropping enum values used in default/computed columns r=ajwerner a=the-ericwang35

Previously, users could drop enum values
being used in default expressions or computed
columns, since we did not perform any checks beforehand.
This meant that default expressions and computed columns
could become corrupted after making such a drop.
This patch addresses this by walking default expressions
and computed columns when an enum member is dropped, and
disallows the drop if it finds any usages.

Partially fixes #59807 (this PR addresses the issue
in default/computed columns, and #62736 addresses the 
issue in views).

Release note: None

63215: kvserver: perform initial upreplication of non-voters synchronously r=aayushshah15 a=aayushshah15

Resolves #63199

Before this commit, we relied on the raft snapshot queue to
asynchronously perform the initial upreplication of non-voting replicas.
This meant that by the time `AdminChangeReplicas` (and consequently,
`AdminRelocateRange`) returned to its client, non-voters were not
guaranteed to have been initialized. This was a deliberate decision and
was, thus far, believed to be copacetic. However, this decision subtly
made range merges (of ranges that have any number of non-voters)
extremely unlikely to suceed, while causing severe disruption on
foreground traffic on the right hand side of a merge. This was because
the `mergeQueue` will first call `AdminRelocateRange` on the right hand
side range in order to collocate its replicas with the replicas of the
left hand side range.  If the `mergeQueue` happened to relocate any
non-voting replicas, they were likely to still be waiting for their
initial snapshot by the time the `AdminMerge` attempted to subsume the
RHS. Essentially, this meant that we were subsuming the RHS of a merge
while some of its replicas weren't even initialized. This would cause
the merge to fail and, in the interim, block all traffic over the RHS
range for a 5 second window.

This commit fixes the unfortunate sequence of events described above by
making the behavior of `AdminChangeReplicas` more symmetric for voting
and non-voting replicas. Now, if `AdminChangeReplicas` successfully
returns, its client can safely assume that all new replicas have at
least been upreplicated via an initial snapshot.

Release note: None

63388: bazel: bump timeout of `ring_test` r=rickystewart a=rickystewart

This has timed out in CI.
Release note: None

Co-authored-by: Eric Wang <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@craig craig bot closed this as completed in c52b01f Apr 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants