kv/kvserver: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed #73838
kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ 45390e7b59618d508a0075bae99bb015e61805b2:
Help
See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
Parameters in this failure:
kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ 3b4e180a23f5121e0d4106eb3dc5f61ebc314188:
Help
See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
Parameters in this failure:
I bisected this failure to d77bee9. I'll take a look into what's going wrong.
I think I understand what's going on here. The test is performing the following series of actions:
There's a lot going on, but it looks like the reason why d77bee9 made the test flaky is because it broke the

cockroach/pkg/kv/kvserver/split_trigger_helper.go, lines 136 to 147 in db229ca

This used to allow the new RHS range's replica 3 to eventually reject a MsgApp instead of dropping it, which allowed the leader to notice the need for a snapshot, try to send a snapshot, hit an overlapping key range error, trigger a replicaGCQueue process on the old RHS range's replica 3, clear out the old RHS range's replica 3, send another snapshot which succeeded, and catch up the new RHS range's replica 3. Now that we're not ticking uninitialized replicas, this

At first, I thought that's where this ended. But the comment in the

We're intending to pass the LHS replica through the replica GC queue to force a replica GC if necessary so that the next message does not get dropped, but instead, we accidentally pass it through the MVCC GC queue. So even though the LHS replica has been removed from its range, it never gets GCed and

Next steps:
I published a PR for the first of these next steps: #74073.

@tbg I'm curious whether you have ideas about

Also, are you 👍 on

👍🏽 on the rename and the strategy you describe.
kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ 1903b8f7e8195d92cd6ddb281b7fed764900f5da:
Help
See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
Parameters in this failure:
74073: kv: add to replicaGCQueue in replicaMsgAppDropper, not gcQueue r=tbg a=nvanbenschoten

Fixes #73838. This commit is the first of the three "next steps" identified in #73838. It fixes a case where we were accidentally adding a replica to the wrong queue. When dropping a `MsgApp` in `maybeDropMsgApp`, we want to GC the replica on the LHS of the split if it has been removed from its range. However, we were instead passing it to the MVCC GC queue, which was both irrelevant and a no-op because the LHS was not the leaseholder.

It's possible that we have seen the effects of this in roachtests like `splits/largerange`. This bug could have delayed a snapshot to the RHS of a split for up to `maxDelaySplitTriggerTicks * 200ms = 20s` in some rare cases. We've seen the logs corresponding to this issue in a few tests over the past year: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+%22would+have+dropped+incoming+MsgApp+to+wait+for+split+trigger%22+is%3Aclosed.

Co-authored-by: Nathan VanBenschoten <[email protected]>
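To make the wrong-queue bug concrete, here is a minimal sketch, not the actual kvserver code. The `replica` and `queue` types, their fields, and the constructor names are all hypothetical. The only behavior taken from the description above is that the MVCC GC queue ignores a replica that is not the leaseholder, while the replica GC queue is the one meant to collect a replica that has been removed from its range.

```go
package main

import "fmt"

// replica is a hypothetical, stripped-down stand-in for a kvserver replica.
type replica struct {
	rangeID          int
	removedFromRange bool
	isLeaseholder    bool
}

// queue is a hypothetical stand-in for a store queue. shouldProcess decides
// whether the queue does anything with a replica handed to it.
type queue struct {
	name          string
	shouldProcess func(replica) bool
	processed     []int
}

// maybeAdd hands the replica to the queue and reports whether it was accepted.
func (q *queue) maybeAdd(r replica) bool {
	if !q.shouldProcess(r) {
		return false // silently a no-op, which is how the bug stayed hidden
	}
	q.processed = append(q.processed, r.rangeID)
	return true
}

// newMVCCGCQueue models a queue that only acts on leaseholder replicas.
func newMVCCGCQueue() *queue {
	return &queue{
		name:          "mvccGC",
		shouldProcess: func(r replica) bool { return r.isLeaseholder },
	}
}

// newReplicaGCQueue models a queue that collects replicas removed from
// their range.
func newReplicaGCQueue() *queue {
	return &queue{
		name:          "replicaGC",
		shouldProcess: func(r replica) bool { return r.removedFromRange },
	}
}

func main() {
	// The LHS replica from the failing test: removed from its range and
	// not the leaseholder.
	lhs := replica{rangeID: 1, removedFromRange: true, isLeaseholder: false}
	for _, q := range []*queue{newMVCCGCQueue(), newReplicaGCQueue()} {
		fmt.Printf("%s accepted LHS replica: %t\n", q.name, q.maybeAdd(lhs))
	}
}
```

In this model the MVCC GC queue rejects the LHS replica and the replica GC queue accepts it, which mirrors why the original call did nothing useful.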
This commit renames the "GC queue" to the "MVCC GC queue" (which GC's old MVCC versions) to avoid confusion with the "replica GC queue" (which GC's abandoned replicas). We've already been using this terminology in various other contexts to avoid confusion, so this refactor updates the code to reflect this naming. This comes in response to cockroachdb#73838, which found a bug that had survived for three years and was a direct consequence of this ambiguous naming. The commit doesn't go quite as far as renaming the `pkg/kv/kvserver/gc` package, but that could be a follow-up to this commit.
Related to cockroachdb#73838. In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`. Instead, we now use the age of the replica directly. We hit the escape hatch for any replica that is older than 20s, which corresponds to the 100 ticks we used before.
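The switch from counting ticks to checking age can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the constant and function names are hypothetical, and only the 200ms tick interval, the 100-tick limit, and the resulting 20s threshold come from the messages in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical names; the values are the ones quoted in this thread.
const (
	tickInterval     = 200 * time.Millisecond
	maxDelayTicks    = 100
	maxDelaySplitAge = maxDelayTicks * tickInterval // 100 * 200ms = 20s
)

// pastEscapeHatch reports whether a replica of the given age should stop
// dropping MsgApps while waiting for the split trigger. Because it depends
// on wall-clock age rather than a tick counter, it still fires for an
// uninitialized replica that is never ticked.
func pastEscapeHatch(age time.Duration) bool {
	return age > maxDelaySplitAge
}

func main() {
	fmt.Println(maxDelaySplitAge)                  // 20s
	fmt.Println(pastEscapeHatch(5 * time.Second))  // false: keep waiting
	fmt.Println(pastEscapeHatch(21 * time.Second)) // true: stop dropping
}
```

A tick-based check in this model would compare a counter against `maxDelayTicks`; once uninitialized replicas stop being ticked, that counter stays at zero and the escape hatch never opens.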
73941: roachtest: make crdb crash on span-use-after-Finish r=andreimatei a=andreimatei

This patch makes roachtest pass an env var to crdb asking it to panic on mis-use of tracing spans. I've been battling such bugs, which become more problematic as I'm trying to introduce span reuse. In production we'll probably continue tolerating such bugs for the time being, but I want tests to yell. Unit tests are already running with this use-after-Finish detection, and so far so good. I've done a manual run of all the roachtests in this configuration and nothing crashed, so I don't expect a tragedy.

Release note: None

74109: kv: rename gcQueue to mvccGCQueue r=tbg a=nvanbenschoten

This commit renames the "GC queue" to the "MVCC GC queue" (which GC's old MVCC versions) to avoid confusion with the "replica GC queue" (which GC's abandoned replicas). We've already been using this terminology in various other contexts to avoid confusion, so this refactor updates the code to reflect this naming. This comes in response to #73838, which found a bug that had survived for three years and was a direct consequence of this ambiguous naming. The commit doesn't go quite as far as renaming the `pkg/kv/kvserver/gc` package, but that could be a follow-up to this commit.

74126: geo: move projection data to embedded compressed file r=RaduBerinde a=RaduBerinde

The geoprojbase package embeds projection info as constants, leading to a 6MB code file. Large code files are undesirable especially from the perspective of static analysis tools, IDEs, etc. This change moves the projections data to an embedded json.gz file. We define the schema of this file in a new `embeddedproj` subpackage. The data is loaded lazily.

The data file was obtained by modifying the existing constants to fill out an `embeddedproj.Data`: https://github.com/RaduBerinde/cockroach/blob/geospatial-proj-data/pkg/geo/geoprojbase/embeddedproj/data_test.go

The `generate-spatial-ref-sys` command is also updated to generate this file from the `.csv`. The `make buildshort` binary size is decreased by ~7MB.

Fixes #63969.

Release note: None

74128: cockroach: don't import randgen in binary r=RaduBerinde a=RaduBerinde

The `sql/randgen` package creates a lot of global datums, some of which use geospatial and require the loading of geospatial data. This package is meant for testing and should not be part of the cockroach binary. This change removes the non-testing uses of randgen. Tested via `go list -deps ./pkg/cmd/cockroach`. Note that the updated test is ineffectual for now (tracked by #74119).

Informs #74120.

Release note: None

74159: sql: default index recommendations to be off for logic tests r=nehageorge a=nehageorge

**sql: refactor GlobalDefault for session variables**

This commit refactors pkg/sql/vars.go to use globalFalse and globalTrue as the setting GlobalDefault where possible.

Release note: None

**sql: default index recommendations to be off for logic tests**

This commit configures index recommendations to be off for logic tests. This is to avoid flaky tests, as the index recommendation output can vary depending on the best plan chosen by the optimizer.

Fixes: #74069.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Neha George <[email protected]>
74108: kv: remove dependency on ticks from maybeDropMsgApp r=nvanbenschoten a=nvanbenschoten

Related to #73838. In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`. Instead, we now use the age of the replica directly. We hit the escape hatch for any replica that is older than 20s, which corresponds to the 100 ticks we used before.

Co-authored-by: Nathan VanBenschoten <[email protected]>
kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ f6f60c6cd9da2300540cda93422aad3e033880ed:
Help
See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
Parameters in this failure: