Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed #73838

Closed
cockroach-teamcity opened this issue Dec 15, 2021 · 7 comments · Fixed by #74073
Closed

kv/kvserver: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed #73838

cockroach-teamcity opened this issue Dec 15, 2021 · 7 comments · Fixed by #74073
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ f6f60c6cd9da2300540cda93422aad3e033880ed:

=== RUN   TestStoreRangeMergeSlowUnabandonedFollower_WithSplit
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit2913701310
    test_log_scope.go:80: use -show-logs to present logs inline
    client_merge_test.go:2774: condition failed to evaluate within 45s: rhs store does not own valid lease for rhs range
        goroutine 8287559 [running]:
        runtime/debug.Stack()
        	GOROOT/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x56e9958, 0xc017722000}, 0xb, 0x0)
        	github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x5f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x56e9958, 0xc017722000}, 0x2)
        	github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x4a
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit(0xc017722000)
        	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/client_merge_test.go:2774 +0x889
        testing.tRunner(0xc017722000, 0x4807740)
        	GOROOT/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	GOROOT/src/testing/testing.go:1306 +0x35a
    panic.go:642: -- test log scope end --
--- FAIL: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit (46.48s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)Parameters in this failure:

  • TAGS=bazel,gss,deadlock

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 15, 2021
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ 45390e7b59618d508a0075bae99bb015e61805b2:

=== RUN   TestStoreRangeMergeSlowUnabandonedFollower_WithSplit
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit3829434542
    test_log_scope.go:80: use -show-logs to present logs inline
    client_merge_test.go:2774: condition failed to evaluate within 45s: rhs store does not own valid lease for rhs range
        goroutine 193597 [running]:
        runtime/debug.Stack()
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x569de38, 0xc005c38d00}, 0xb, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x5f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x569de38, 0xc005c38d00}, 0x2)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x4a
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit(0xc005c38d00)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_merge_test.go:2774 +0x889
        testing.tRunner(0xc005c38d00, 0x47bc670)
        	/usr/local/go/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1306 +0x35a
    panic.go:642: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit3829434542
--- FAIL: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit (46.82s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)Parameters in this failure:

  • GOFLAGS=-json

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ 3b4e180a23f5121e0d4106eb3dc5f61ebc314188:

=== RUN   TestStoreRangeMergeSlowUnabandonedFollower_WithSplit
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit4081635205
    test_log_scope.go:80: use -show-logs to present logs inline
    client_merge_test.go:2776: condition failed to evaluate within 45s: rhs store does not own valid lease for rhs range
        goroutine 162546 [running]:
        runtime/debug.Stack()
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x56bd878, 0xc0066df040}, 0xb, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x5f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x56bd878, 0xc0066df040}, 0x2)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x4a
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit(0xc0066df040)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_merge_test.go:2776 +0x889
        testing.tRunner(0xc0066df040, 0x47c7f40)
        	/usr/local/go/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1306 +0x35a
    panic.go:642: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit4081635205
--- FAIL: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit (45.87s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)Parameters in this failure:

  • GOFLAGS=-parallel=4

This test on roachdash | Improve this report!

@nvanbenschoten
Copy link
Member

I bisected this failure to d77bee9. I'll take a look into what's going wrong.

@nvanbenschoten nvanbenschoten self-assigned this Dec 19, 2021
@nvanbenschoten
Copy link
Member

I think I understand what's going on here. The test is performing the following series of actions:

  1. create a new range, upreplicate to 3 replicas
  2. split range in half
  3. partition the LHS range's replica 3
  4. merge the ranges (replica 3 on either side doesn't hear through log)
  5. split the ranges again (replica 3 on either side doesn't hear through log)
  6. remove the LHS range from store 3 so that it never applies the split through its log
  7. wait for new RHS range's replica 3 to catch up on the log

There's a lot going on, but it looks like the reason why d77bee9 made the test flaky is because it broke the maxDelaySplitTriggerTicks logic in maybeDropMsgApp:

if ticks > maxDelaySplitTriggerTicks {
// This is an escape hatch in case there are other scenarios (missed in
// the above analysis) in which a split trigger just isn't coming. If
// there are, the idea is that we notice this log message and improve
// the heuristics.
log.Warningf(
ctx,
"would have dropped incoming MsgApp to wait for split trigger, "+
"but allowing due to %d (>%d) ticks",
ticks, maxDelaySplitTriggerTicks)
return false
}

This used to allow the new RHS range's replica 3 to eventually reject a MsgApp instead of dropping it, which allowed the leader to notice the need for a snapshot, try to send a snapshot, hit an overlapping key range error, trigger a replicaGCQueue process on the old RHS range's replica 3, clear out the old RHS range's replica 3, send another snapshot which succeeded, and catch up the new RHS range's replica 3.

Now that we're not ticking uninitialized replicas, this maxDelaySplitTriggerTicks condition is never hit, so we reject MsgApps indefinitely.

At first, I thought that's where this ended. But the comment in the ticks > maxDelaySplitTriggerTicks condition is interesting. It indicates that we don't expect to need to hit the escape hatch under normal conditions. So why do we need it here? It turns out that we need it because our LHS replica is never being GCed. Why? Because of this code:

lhsRepl.store.gcQueue.AddAsync(context.Background(), lhsRepl, replicaGCPriorityDefault)

We're intending to pass the LHS replica through the replica GC queue to force a replica GC if necessary so that the next message does not get dropped, but instead, we accidentally pass it through the MVCC GC queue. So even though the LHS replica has been removed from its range, it never gets GCed and maybeDropMsgApp never stops dropping messages, so we get stuck.

Next steps:

  1. replicaMsgAppDropper.ShouldDrop should add the LHS to the replicaGCQueue, not the gcQueue. This deflakes the test.
  2. figure out what to do with maxDelaySplitTriggerTicks. It's dead code right now because we no longer tick uninitialized replicas. Even with the main fix here, it still seems like a good idea to have some kind of escape hatch in maybeDropMsgApp. However, this doesn't need to be based on ticks.
  3. is now the time we finally s/gcQueue/mvccGCQueue/?

@nvanbenschoten
Copy link
Member

I published a PR for the first of these next steps: #74073.

@tbg I'm curious whether you have ideas about maxDelaySplitTriggerTicks. It still seems like a reasonable escape hatch, if only because it limits the fallout from a bug (like this!) in this area. Maybe we just do something more direct and add a creation timestamp to Replica, set it to timeutil.Now() in newUnloadedReplica, and have a maximum time that a replica will wait before allowing the snapshot through.

Also, are you 👍 on s/gcQueue/mvccGCQueue/? If so, I'll go perform the rename.

@tbg
Copy link
Member

tbg commented Dec 20, 2021

👍🏽 on the rename and the strategy you describe.

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit failed with artifacts on master @ 1903b8f7e8195d92cd6ddb281b7fed764900f5da:

=== RUN   TestStoreRangeMergeSlowUnabandonedFollower_WithSplit
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit706578554
    test_log_scope.go:80: use -show-logs to present logs inline
    client_merge_test.go:2776: condition failed to evaluate within 3m45s: rhs store does not own valid lease for rhs range
        goroutine 169366 [running]:
        runtime/debug.Stack()
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x72
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x95e9818, 0xc006322680}, 0x49961c, 0xc005c677a0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x7d
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x95e9818, 0xc006322680}, 0x2)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x5b
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestStoreRangeMergeSlowUnabandonedFollower_WithSplit(0xc006322680)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_merge_test.go:2776 +0xc33
        testing.tRunner(0xc006322680, 0x66266e8)
        	/usr/local/go/src/testing/testing.go:1259 +0x230
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1306 +0x727
    panic.go:661: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeSlowUnabandonedFollower_WithSplit706578554
--- FAIL: TestStoreRangeMergeSlowUnabandonedFollower_WithSplit (230.78s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)Parameters in this failure:

  • GOFLAGS=-json

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Dec 21, 2021
74073: kv: add to replicaGCQueue in replicaMsgAppDropper, not gcQueue r=tbg a=nvanbenschoten

Fixes #73838.

This commit is the first of the three "next steps" identified in #73838. It fixes a case where we were accidentally adding a replica to the wrong queue. When dropping a MsgApp in `maybeDropMsgApp`, we want to GC the replica on the LHS of the split if it has been removed from its range. However, we were instead passing it to the MVCC GC queue, which was both irrelevant and a no-op because the LHS was not the leaseholder.

It's possible that we have seen the effects of this in roachtests like `splits/largerange`. This but could have delayed a snapshot to the RHS of a split for up to `maxDelaySplitTriggerTicks * 200ms = 20s` in some rare cases. We've seen the logs corresponding to this issue in a few tests over the past year: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+%22would+have+dropped+incoming+MsgApp+to+wait+for+split+trigger%22+is%3Aclosed.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in f11f912 Dec 21, 2021
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2021
Fixes #73838.

This commit is the first of the three "next steps" identified in #73838.
It fixes a case where we were accidentally adding a replica to the wrong
queue. When dropping a `MsgApp` in `maybeDropMsgApp`, we want to GC the
replica on the LHS of the split if it has been removed from its range.
However, we were instead passing it to the MVCC GC queue, which was both
irrelevant and also a no-op because the LHS was not the leaseholder.

It's possible that we have seen the effects of this in roachtests like
`splits/largerange`. This but could have delayed a snapshot to the RHS
of a split for up to `maxDelaySplitTriggerTicks * 200ms = 20s` in some
rare cases. We've seen the logs corresponding to this issue in a few
tests over the past year: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+%22would+have+dropped+incoming+MsgApp+to+wait+for+split+trigger%22+is%3Aclosed.
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2021
Fixes #73838.

This commit is the first of the three "next steps" identified in #73838.
It fixes a case where we were accidentally adding a replica to the wrong
queue. When dropping a `MsgApp` in `maybeDropMsgApp`, we want to GC the
replica on the LHS of the split if it has been removed from its range.
However, we were instead passing it to the MVCC GC queue, which was both
irrelevant and also a no-op because the LHS was not the leaseholder.

It's possible that we have seen the effects of this in roachtests like
`splits/largerange`. This but could have delayed a snapshot to the RHS
of a split for up to `maxDelaySplitTriggerTicks * 200ms = 20s` in some
rare cases. We've seen the logs corresponding to this issue in a few
tests over the past year: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+%22would+have+dropped+incoming+MsgApp+to+wait+for+split+trigger%22+is%3Aclosed.
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2021
Fixes #73838.

This commit is the first of the three "next steps" identified in #73838.
It fixes a case where we were accidentally adding a replica to the wrong
queue. When dropping a `MsgApp` in `maybeDropMsgApp`, we want to GC the
replica on the LHS of the split if it has been removed from its range.
However, we were instead passing it to the MVCC GC queue, which was both
irrelevant and also a no-op because the LHS was not the leaseholder.

It's possible that we have seen the effects of this in roachtests like
`splits/largerange`. This but could have delayed a snapshot to the RHS
of a split for up to `maxDelaySplitTriggerTicks * 200ms = 20s` in some
rare cases. We've seen the logs corresponding to this issue in a few
tests over the past year: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+%22would+have+dropped+incoming+MsgApp+to+wait+for+split+trigger%22+is%3Aclosed.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 21, 2021
This commit renames the "GC queue" to the "MVCC GC queue" (which GC's old MVCC
versions) to avoid confusion with the "replica GC queue" (which GC's abandoned
replicas). We've already been using this terminology in various other contexts
to avoid confusion, so this refactor updates the code to reflect this naming.

This comes in response to cockroachdb#73838, which found a bug that had survived for three
years and was a direct consequence of this ambiguous naming.

The commit doesn't go quite as far as renaming the `pkg/kv/kvserver/gc` package,
but that could be a follow-up to this commit.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 21, 2021
Related to cockroachdb#73838.

In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use
ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`.
Instead, we now use the age of the replica directly. We hit the escape hatch for
any replica that is older than 20s, which corresponds to the 100 ticks we used
before.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 21, 2021
This commit renames the "GC queue" to the "MVCC GC queue" (which GC's old MVCC
versions) to avoid confusion with the "replica GC queue" (which GC's abandoned
replicas). We've already been using this terminology in various other contexts
to avoid confusion, so this refactor updates the code to reflect this naming.

This comes in response to cockroachdb#73838, which found a bug that had survived for three
years and was a direct consequence of this ambiguous naming.

The commit doesn't go quite as far as renaming the `pkg/kv/kvserver/gc` package,
but that could be a follow-up to this commit.
craig bot pushed a commit that referenced this issue Dec 21, 2021
73941: roachtest: make crdb crash on span-use-after-Finish r=andreimatei a=andreimatei

This patch makes roachtest pass an env var to crdb asking it to panic on
mis-use of tracing spans. I've been battling such bugs, which become
more problematic as I'm trying to introduce span reuse. In production
we'll probably continue tolerating such bugs for the time being, but I
want tests to yell. Unit tests are already running with this
use-after-Finish detection, and so far so good. I've done a manual run
of all the roachtests in this configuration and nothing crashed, so I
don't expect a tragedy.

Release note: None

74109: kv: rename gcQueue to mvccGCQueue r=tbg a=nvanbenschoten

This commit renames the "GC queue" to the "MVCC GC queue" (which GC's old MVCC
versions) to avoid confusion with the "replica GC queue" (which GC's abandoned
replicas). We've already been using this terminology in various other contexts
to avoid confusion, so this refactor updates the code to reflect this naming.

This comes in response to #73838, which found a bug that had survived for three
years and was a direct consequence of this ambiguous naming.

The commit doesn't go quite as far as renaming the `pkg/kv/kvserver/gc` package,
but that could be a follow-up to this commit.

74126: geo: move projection data to embedded compressed file r=RaduBerinde a=RaduBerinde

The geoprojbase package embeds projection info as constants, leading
to a 6MB code file. Large code files are undesirable especially from
the perspective of static analysis tools, IDEs, etc.

This change moves the projections data to an embedded json.gz file. We
define the schema of this file in a new `embeddedproj` subpackage. The
data is loaded lazily.

The data file was obtained by modifying the existing constants to fill
out an `embeddedproj.Data`:
https://github.com/RaduBerinde/cockroach/blob/geospatial-proj-data/pkg/geo/geoprojbase/embeddedproj/data_test.go

The `generate-spatial-ref-sys` command is also updated to generate
this file from the `.csv`.

The `make buildshort` binary size is decreased by ~7MB.

Fixes #63969.

Release note: None

74128: cockroach: don't import randgen in binary r=RaduBerinde a=RaduBerinde

The `sql/randgen` package creates a lot of global datums, some of
which use geospatial and require the loading of geospatial data. This
package is meant for testing and should not be part of the cockroach
binary.

This change removes the non-testing uses of randgen.

Tested via `go list -deps ./pkg/cmd/cockroach`. Note that the updated
test is ineffectual for now (tracked by #74119).

Informs #74120.

Release note: None

74159: sql: default index recommendations to be off for logic tests  r=nehageorge a=nehageorge

**sql: refactor GlobalDefault for session variables**

This commit refactors pkg/sql/vars.go to use globalFalse and
globalTrue as the setting GlobalDefault where possible.

Release note: None

**sql: default index recommendations to be off for logic tests**

This commit configures index recommendations to be off for logic tests.
This is to avoid flaky tests, as the index recommendation output can
vary depending on the best plan chosen by the optimizer.

Fixes: #74069.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Neha George <[email protected]>
craig bot pushed a commit that referenced this issue Dec 22, 2021
74108: kv: remove dependency on ticks from maybeDropMsgApp r=nvanbenschoten a=nvanbenschoten

Related to #73838.

In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`.  Instead, we now use the age of the replica directly. We hit the escape hatch for any replica that is older than 20s, which corresponds to the 100 ticks we used before.

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 22, 2021
Related to cockroachdb#73838.

In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use
ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`.
Instead, we now use the age of the replica directly. We hit the escape hatch for
any replica that is older than 20s, which corresponds to the 100 ticks we used
before.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 22, 2021
Related to cockroachdb#73838.

In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use
ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`.
Instead, we now use the age of the replica directly. We hit the escape hatch for
any replica that is older than 20s, which corresponds to the 100 ticks we used
before.
gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022
Fixes cockroachdb#73838.

This commit is the first of the three "next steps" identified in cockroachdb#73838.
It fixes a case where we were accidentally adding a replica to the wrong
queue. When dropping a `MsgApp` in `maybeDropMsgApp`, we want to GC the
replica on the LHS of the split if it has been removed from its range.
However, we were instead passing it to the MVCC GC queue, which was both
irrelevant and also a no-op because the LHS was not the leaseholder.

It's possible that we have seen the effects of this in roachtests like
`splits/largerange`. This but could have delayed a snapshot to the RHS
of a split for up to `maxDelaySplitTriggerTicks * 200ms = 20s` in some
rare cases. We've seen the logs corresponding to this issue in a few
tests over the past year: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+%22would+have+dropped+incoming+MsgApp+to+wait+for+split+trigger%22+is%3Aclosed.
gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022
This commit renames the "GC queue" to the "MVCC GC queue" (which GC's old MVCC
versions) to avoid confusion with the "replica GC queue" (which GC's abandoned
replicas). We've already been using this terminology in various other contexts
to avoid confusion, so this refactor updates the code to reflect this naming.

This comes in response to cockroachdb#73838, which found a bug that had survived for three
years and was a direct consequence of this ambiguous naming.

The commit doesn't go quite as far as renaming the `pkg/kv/kvserver/gc` package,
but that could be a follow-up to this commit.
gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022
Related to cockroachdb#73838.

In d77bee9, we stopped ticking uninitialized replicas, so we can no longer use
ticks as a proxy for the age of a replica in the escape hatch of `maybeDropMsgApp`.
Instead, we now use the age of the replica directly. We hit the escape hatch for
any replica that is older than 20s, which corresponds to the 100 ticks we used
before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants