
server: TestAdminAPITableStats failed [priority range ID already set] #75939

Closed
cockroach-teamcity opened this issue Feb 3, 2022 · 31 comments · Fixed by #78166
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-kv KV Team X-nostale Marks an issue/pr that should be ignored by the stale bot

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 3, 2022

server.TestAdminAPITableStats failed with artifacts on master @ 5c0d5132ade6d2e30a9d414163851272750d8afa:

Fatal error:

panic: priority range ID already set: old=2, new=44 [recovered]
	panic: priority range ID already set: old=2, new=44

Stack:

goroutine 270411 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc002cac360, {0x505f438, 0xc001fccae0})
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:251 +0x94
panic({0x389e8c0, 0xc00a4b5260})
	/usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*rangeIDQueue).SetPriorityID(0xc00252bf50, 0x2c)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:125 +0xb9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).SetPriorityID(0xc002a944e0, 0xb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:240 +0x58
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc00407aa80, {0x505f438, 0xc006931ce0}, 0xc0079433b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_init.go:385 +0x745
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).loadRaftMuLockedReplicaMuLocked(0xc00407aa80, 0xc0079433b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_init.go:212 +0x635
github.com/cockroachdb/cockroach/pkg/kv/kvserver.prepareRightReplicaForSplit({0x505f438, 0xc006931b30}, 0xc007943340, 0xc001eff500)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_split.go:226 +0x19e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.splitPostApply({0x505f438, 0xc006931b30}, {0x0, 0x16d04231cd60f61b, 0x0, 0xe1, 0x90, 0x3, 0x6f, 0x3, ...}, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_split.go:145 +0x4b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleSplitResult(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_result.go:247
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc001eff608, {0x505f438, 0xc006931b30}, 0xc001623040)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:1272 +0x46c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc001eff608, {0x505f438, 0xc006931b30}, {0x50c0508, 0xc001623008})
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:1154 +0x637
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f36bff819e0, 0xc001eff878}, 0xc00252d320)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:206 +0x158
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc00252d8a8, {0x505f438, 0xc0054e4150}, {0x508f328, 0xc001eff818})
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:291 +0x205
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc00252d8a8, {0x505f438, 0xc0054e4150})
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:247 +0x9a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc001eff500, {0x505f438, 0xc0054e4150}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:890 +0x161e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(0xc00252de70, {0x505f438, 0xc0054e4150}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:531 +0x15b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc004e9a900, 0x505f438)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:507 +0xd7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc002a944e0, {0x505f438, 0xc001fccae0})
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:305 +0x25d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445
Log preceding fatal error

=== RUN   TestAdminAPITableStats
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestAdminAPITableStats686845435
    test_log_scope.go:80: use -show-logs to present logs inline

Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

/cc @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-12886

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 3, 2022
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Feb 3, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 7, 2022
@cameronnunez
Contributor

cameronnunez commented Feb 7, 2022

kv bug maybe? cc @cockroachdb/kv-prs for triage

@tbg
Member

tbg commented Feb 8, 2022

Yup, looks like it!

@jtsiros jtsiros removed the T-server-and-security DB Server & Security label Feb 9, 2022
@tbg
Member

tbg commented Feb 10, 2022

Not too hard to repro using gceworker:

499 runs so far, 0 failures, over 6m55s
505 runs so far, 0 failures, over 7m0s
511 runs so far, 0 failures, over 7m5s
517 runs so far, 0 failures, over 7m10s
524 runs so far, 0 failures, over 7m15s
530 runs so far, 0 failures, over 7m20s
536 runs so far, 0 failures, over 7m25s
543 runs so far, 0 failures, over 7m30s

=== RUN   TestAdminAPITableStats
    test_log_scope.go:79: test logs captured to: /tmp/logTestAdminAPITableStats2045598821
    test_log_scope.go:80: use -show-logs to present logs inline
panic: priority range ID already set: old=2, new=45 [recovered]
        panic: priority range ID already set: old=2, new=45

goroutine 505 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000694870, {0x506a918, 0xc002f9c480})
        /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:251 +0x94
panic({0x3876f80, 0xc0044099a0})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*rangeIDQueue).SetPriorityID(0xc0013f1f20, 0x2d)
        /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:125 +0xb9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).SetPriorityID(0xc002da8f70, 0xb)
        /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:240 +0x58
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc005d52a80, {0x506a918, 0xc0076fe8a0}, 0xc0033d2490)

Guess I'll play the bisect game later.
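(For reference, the "N runs so far" counters above are the output of the stress harness; the local invocation was presumably something along the lines of the sketch below -- the exact make target and flags are an assumption here, and the roachprod variant appears verbatim in a later comment.)

# Sketch only: loop the single test under the stress harness until it fails.
# PKG/TESTS are the same variables the roachprod-stress invocation below uses.
make stress PKG=./pkg/server TESTS=TestAdminAPITableStats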

@tbg
Member

tbg commented Feb 10, 2022

git bisect start

bad: [5c0d513] Merge #74235

git bisect bad 5c0d513

bad: [d8b2ea1] kvserver: improve logging around successful qps-based rebalances

git bisect bad d8b2ea1

bad: [79e8dc7] *: use bootstrap.BootstrappedSystemIDChecker in tests

git bisect bad 79e8dc7

good: [442e5b9] kvserver: add metric for replica circuit breaker

git bisect good 442e5b9

good: [4fc0855] Merge #73883

git bisect good 4fc0855

skip: [2367979] Merge #75231 #75273

git bisect skip 2367979

good: [72c5471] Merge #74810 #74923

git bisect good 72c5471

good: [81c447d] Merge #72665 #74251 #74831 #75058 #75076 #75465 #75504

git bisect good 81c447d

good: [5216fba] Merge #75532

git bisect good 5216fba

skip: [036375e] ci: pass GITHUB_API_TOKEN down to bazel roachtest nightlies

git bisect skip 036375e

skip: [df77141] sql: update SHOW GRANTS ON TABLE to include grant options

git bisect skip df77141

skip: [90db97c] logictest: minor improvements

git bisect skip 90db97c

bad: [dcde8c1] Merge #75127 #75155

git bisect bad dcde8c1

good: [5c767a0] Merge #75277 #75535

git bisect good 5c767a0

good: [0bd031d] kvstreamer: fix the bug with corrupting request when it is put back

git bisect good 0bd031d

good: [b64fe37] Merge #75550 #75573

git bisect good b64fe37

bad: [f1c2179] sql: fix logic to detect if primary index constraint name is in use

git bisect bad f1c2179

first bad commit: [f1c2179] sql: fix logic to detect if primary index constraint name is in use

f1c2179 doesn't strike me as a likely culprit, so I might've missed a bad commit. But at least we now know that the bug was certainly present on Jan 27th. There were so many "good" runs (which I generally ran for 20+ minutes; the "bads" usually fail in under 10 minutes) that I'll stress the nearest-parent "good" commit again: 5216fba

Edit: ok 5216fba is actually bad after 35 minutes.

I think they're probably all bad; either way, it doesn't seem like bisecting further is the expedient thing to do. We'll just have to investigate the repro.

@AlexTalks AlexTalks added the S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) label Feb 15, 2022
@tbg
Member

tbg commented Feb 24, 2022

Saw this once or twice in bors too so going to take a look.

@tbg tbg self-assigned this Feb 24, 2022
@tbg tbg added S-0 Extreme impact: many users impacted and irrecoverable outages / data loss and removed S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) labels Feb 24, 2022
@tbg
Member

tbg commented Feb 24, 2022

servertest.log

What's interesting is that there is no mention of a split in this log. And yet, the panic is triggered by a split. Might be holding it wrong.

tbg added a commit to tbg/cockroach that referenced this issue Feb 24, 2022
Merge-to-master is currently pretty red, and this has popped up once or
twice. Disable the assertion (which is fine for tests, for now) while
we investigate.

Touches cockroachdb#75939

Release note: None
@tbg
Member

tbg commented Feb 24, 2022

Won't be able to figure it out today and about to head out for a long weekend, so skipping the assertion instead: #76986

@tbg tbg added the X-nostale Marks an issue/pr that should be ignored by the stale bot label Feb 24, 2022
@tbg tbg added S-1 High impact: many users impacted, serious risk of high unavailability or data loss and removed S-0 Extreme impact: many users impacted and irrecoverable outages / data loss labels Mar 1, 2022
@erikgrinaker
Contributor

Did another bisection, with 1000 runs each:

git bisect start
# bad: [dcde8c1056e4fb6dda52ac5d95b554709c8d10f0] Merge #75127 #75155
git bisect bad dcde8c1056e4fb6dda52ac5d95b554709c8d10f0
# good: [365b4da8bd02c06ee59d2130a56dec74ffc9ce21] Merge #73876
git bisect good 365b4da8bd02c06ee59d2130a56dec74ffc9ce21
# good: [2ec9cdf70389be608be2f8628c508ca4a7927b9b] kvserver: re-write TestStoreSetRangesMaxBytes
git bisect good 2ec9cdf70389be608be2f8628c508ca4a7927b9b
# good: [44b2ba84cf2a9ecea995b6196bb917e7d3e9c1d5] Merge #68488 #75271 #75293 #75303
git bisect good 44b2ba84cf2a9ecea995b6196bb917e7d3e9c1d5
# bad: [1c57fcdce5a3e67651a2fcac261ab01fa21faa35] scripts: consolidate bazel generate steps in bump-pebble.sh
git bisect bad 1c57fcdce5a3e67651a2fcac261ab01fa21faa35
# good: [507dd7125c7736f3b36213059c7baf128a009c57] Merge #74918 #75450 #75489
git bisect good 507dd7125c7736f3b36213059c7baf128a009c57
# bad: [81c447ddb47e33d66f59acbcb1f76c37f5a148f6] Merge #72665 #74251 #74831 #75058 #75076 #75465 #75504
git bisect bad 81c447ddb47e33d66f59acbcb1f76c37f5a148f6
# good: [83e519003816008c88e9cd864e59b7ebb2b002f5] pkg/sql: add `-linecomment` when building `roleoption` `stringer` file
git bisect good 83e519003816008c88e9cd864e59b7ebb2b002f5
# good: [5d5196362bd7a7b04d7a77488ec349d1d8cc3010] server: allow statements EP to optionally exclude stmt or txns stats
git bisect good 5d5196362bd7a7b04d7a77488ec349d1d8cc3010
# good: [8eaa2256b6cc2851dc657844864e97459a6dd0f1] sql: remove invalid database privileges
git bisect good 8eaa2256b6cc2851dc657844864e97459a6dd0f1
# good: [6135b2ce1d18bdd2c7402c0c6e26d226e95ef6a7] opt: add avgSize stat to statisticsBuilder
git bisect good 6135b2ce1d18bdd2c7402c0c6e26d226e95ef6a7
# good: [be0e69d833ccc0bb7249e0f14ffe63eadc0c3953] ci: add `go_transition_test` support to `bazci`
git bisect good be0e69d833ccc0bb7249e0f14ffe63eadc0c3953
# good: [37e74f0346a815ce5b31bbdd1f2df722003d8708] sql: native evaluation support for NotExpr
git bisect good 37e74f0346a815ce5b31bbdd1f2df722003d8708
# good: [87968336ccdd54f5f32616034d21ffc5dacd5302] schemachanger: columns are not always backfilled in transactions
git bisect good 87968336ccdd54f5f32616034d21ffc5dacd5302
# first bad commit: [81c447ddb47e33d66f59acbcb1f76c37f5a148f6] Merge #72665 #74251 #74831 #75058 #75076 #75465 #75504

Doesn't make sense for the merge commit to be the bad one, so I'll run another set.

@erikgrinaker
Contributor

erikgrinaker commented Apr 11, 2022

Got another hit on 5d51963. This is likely too rare to get a good signal from, but casting a wider net with longer stress runs.

@erikgrinaker
Contributor

Looks like the test timings have changed significantly here. Back on 598bd25, 1000 runs complete in about 4 minutes, while more recent commits take 8 minutes. That could be a contributing factor. Will try to pinpoint where it changed.

@tbg
Member

tbg commented Apr 11, 2022

I sometimes had to run it for 90 minutes on a 20-node 8-vCPU roachprod cluster to get a repro. This one is nasty to pin down. Sorry, I never got around to sharing the partial attempt:

$ cat bisect.sh
#!/bin/bash

f="stress-$(git rev-parse HEAD).log"

function cleanup() {
  code=$?
  if [ $code -eq 124 ]; then
    # 'timeout' aborted the process, i.e. a pass.
    exit 0
  fi
  if grep -q 'priority.range' "${f}"; then
	  exit 1
  fi
  # Skip.
  echo "skipping due to exit code ${code}"
  exit 125
}

trap "cleanup" EXIT

set -euxo pipefail

rm -f stress.log
ln -s "${f}" stress.log
timeout 90m make roachprod-stress CLUSTER=tobias-stress PKG=./pkg/server/ TESTFLAGS=-v TESTS=TestAdminAPITableStats STRESSFLAGS='-failure=priority.range -maxtime 1m' 2>&1 | tee "${f}"
$ cat log.txt
git bisect start '--term-good' 'panics' '--term-bad' 'passes'
# panics: [5c0d5132ade6d2e30a9d414163851272750d8afa] Merge #74235
git bisect panics 5c0d5132ade6d2e30a9d414163851272750d8afa
# passes: [bd5def5fef607e70ea7b6c8835d362ca1790790a] Merge #76892
git bisect passes bd5def5fef607e70ea7b6c8835d362ca1790790a
# panics: [86db5d05acc1e213e9541cd1445288c46a6b5a5d] Merge #76587
git bisect panics 86db5d05acc1e213e9541cd1445288c46a6b5a5d
# passes: [67c82774aa955e7691e4d2ec9725a048014bcccc] Merge #76012 #76215 #76358
git bisect passes 67c82774aa955e7691e4d2ec9725a048014bcccc
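(A note on bisect.sh above: its exit codes follow the protocol that git bisect run understands -- 0 and 1 classify the commit, 125 means "cannot test, skip this commit". With the inverted terms used here, where "panics" plays the role of the good/old state, exit 0 would have to mean "panicked" and exit 1 "passed", i.e. the 0/1 branches in cleanup() would need swapping before handing the loop over to something like the sketch below.)

# Sketch: after swapping the 0/1 exits in bisect.sh to match the inverted
# terms, git can drive the whole bisection unattended.
git bisect run ./bisect.sh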

@tbg
Member

tbg commented Apr 11, 2022

I had come away from it thinking it would be easier to just diagnose and fix the problem and find the commit from that.

@erikgrinaker
Contributor

I had come away from it thinking it would be easier to just diagnose and fix the problem and find the commit from that.

Yeah, that makes sense. I'll backport your stack trace change to one of the flaky commits and see what I find.

@erikgrinaker erikgrinaker changed the title server: TestAdminAPITableStats failed server: TestAdminAPITableStats failed [priority range ID already set] Apr 11, 2022
@erikgrinaker
Contributor

Of course, after adding the stack trace, this refuses to reproduce after tens of thousands of runs. Will keep at it, and try to provoke it.

@erikgrinaker
Contributor

Removing the GA blocker here, due to the rarity and the inability to reproduce in later commits. But will keep investigating.

@erikgrinaker
Contributor

Finally triggered again. Nothing surprising really -- just dumping the stacks here for now, will have a look in the morning:

	panic: priority range ID already set: old=2, new=44, first set at:

goroutine 20 [running]:
runtime/debug.Stack()
	GOROOT/src/runtime/debug/stack.go:24 +0x65
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*rangeIDQueue).SetPriorityID(0xc00148cba8, 0x2)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:131 +0x35
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).SetPriorityID(0xc00148cb40, 0xb)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:243 +0x58
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc002e2ca80, {0x509dfd8, 0xc000911320}, 0xc00056ab60)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:385 +0x745
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).loadRaftMuLockedReplicaMuLocked(0xc002e2ca80, 0xc00056ab60)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:212 +0x635
github.com/cockroachdb/cockroach/pkg/kv/kvserver.newReplica({0x509dfd8, 0xc0016889f0}, 0x0, 0x0, 0x739)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:54 +0x112
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Start.func1({0x2, {0xc000cacb80, 0xb, 0x10}, {0xc000cacbb0, 0xb, 0x10}, {0xc001545fe0, 0x1, 0x1}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:1859 +0x105
github.com/cockroachdb/cockroach/pkg/kv/kvserver.IterateRangeDescriptorsFromDisk.func1({{0xc002d7e508, 0x15, 0x78}, {{0xc002d7e527, 0x2d, 0x59}, {0x16e4ef87dd1c2bc8, 0x0, 0x0}}})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:1711 +0x1f5
github.com/cockroachdb/cockroach/pkg/storage.MVCCIterate({0x509dfd8, 0xc0016889f0}, {0x5121e00, 0xc000f80400}, {0xc000cac400, 0x9, 0xed9e67dea}, {0xc000cac440, 0xb, 0x10}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2563 +0x718
github.com/cockroachdb/cockroach/pkg/kv/kvserver.IterateRangeDescriptorsFromDisk({0x509dfd8, 0xc0016889f0}, {0x5121e00, 0xc000f80400}, 0xc00115f368)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:1717 +0x25a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Start(0xc001956000, {0x509dfd8, 0xc000f212f0}, 0xc000fd6480)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:1836 +0x848
github.com/cockroachdb/cockroach/pkg/server.(*Node).start(0xc000340000, {0x509dfd8, 0xc000f212f0}, {0x5033558, 0xc00162ffe0}, {_, _}, {0x1, {0x5, 0x36, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/server/node.go:446 +0xd93
github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart(0xc0008af000, {0x509df68, 0xc000078048})
	github.com/cockroachdb/cockroach/pkg/server/server.go:1262 +0x1e3e
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0x0, {0x509df68, 0xc000078048})
	github.com/cockroachdb/cockroach/pkg/server/server.go:850 +0x28
github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start(...)
	github.com/cockroachdb/cockroach/pkg/server/testserver.go:500
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer(_, _, {{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:486 +0x7b
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start(0xc000276580, {0x5155e90, 0xc000403a00})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:316 +0x24c
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster({_, _}, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:185 +0x85
github.com/cockroachdb/cockroach/pkg/server_test.TestAdminAPITableStats(0xc000403a00)
	github.com/cockroachdb/cockroach/pkg/server_test/pkg/server/admin_cluster_test.go:86 +0x11a
testing.tRunner(0xc000403a00, 0x41602e8)
	GOROOT/src/testing/testing.go:1259 +0x102
created by testing.(*T).Run
	GOROOT/src/testing/testing.go:1306 +0x35a


goroutine 554 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000fd6480, {0x509dfd8, 0xc002e60d80})
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:251 +0x94
panic({0x38da2a0, 0xc003313950})
	GOROOT/src/runtime/panic.go:1038 +0x215
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*rangeIDQueue).SetPriorityID(0xc00148cba8, 0x2c)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:127 +0x11c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).SetPriorityID(0xc00148cb40, 0xb)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:243 +0x58
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc004bbb500, {0x509dfd8, 0xc00655d1d0}, 0xc005d7ee30)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:385 +0x745
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).loadRaftMuLockedReplicaMuLocked(0xc004bbb500, 0xc005d7ee30)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:212 +0x635
github.com/cockroachdb/cockroach/pkg/kv/kvserver.prepareRightReplicaForSplit({0x509dfd8, 0xc00655d020}, 0xc005d7edc0, 0xc002e2c000)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:226 +0x19e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.splitPostApply({0x509dfd8, 0xc00655d020}, {0x0, 0x16e4ef88aa4c2c4c, 0x0, 0xe1, 0x90, 0x3, 0x6f, 0x3, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:145 +0x4b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleSplitResult(...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:247
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc002e2c108, {0x509dfd8, 0xc00655d020}, 0xc006e3a040)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1272 +0x46c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc002e2c108, {0x509dfd8, 0xc00655d020}, {0x50ff228, 0xc006e3a008})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1154 +0x637
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7ff3c590b840, 0xc002e2c378}, 0xc004a79320)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:206 +0x158
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc004a798a8, {0x509dfd8, 0xc001689920}, {0x50cdec8, 0xc002e2c318})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:291 +0x205
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc004a798a8, {0x509dfd8, 0xc001689920})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:247 +0x9a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc002e2c000, {0x509dfd8, 0xc001689920}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:890 +0x161e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(0xc004a79e70, {0x509dfd8, 0xc001689920}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:531 +0x15b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc001956000, 0x509dfd8)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:507 +0xd7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc00148cb40, {0x509dfd8, 0xc002e60d80})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:308 +0x25d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445

@erikgrinaker
Contributor

priority range ID already set: old=2, new=44

r1 covers /{Min-System/NodeLiveness}, r2 covers /System/NodeLiveness{-Max}. r44 is /{Table/50-Max}, i.e. the user tables.

This is probably a race where we begin splitting off the user table range before we've properly split off the system ranges, such that the LHS of r44 ends up being r1 with /{Min-System/NodeLiveness}, or something like that.

@erikgrinaker
Contributor

Ah, table 50 is actually the tenant_settings table. May or may not be significant.

@erikgrinaker
Contributor

erikgrinaker commented Apr 13, 2022

Ah, looks like we end up merging r2 into r1 early on:

I220412 22:01:21.062142 6059 kv/kvserver/pkg/kv/kvserver/replica_command.go:677  [n1,merge,s1,r1/1:/{Min-System/NodeL…}] 276  initiating a merge of r2:/System/NodeLiveness{-Max} [(n1,s1):1, next=2, gen=0] into this range (lhs+rhs has (size=2.2 KiB+219 B=2.4 KiB qps=0.00+-1.00=0.00qps) below threshold (size=128 MiB, qps=1250.00))
I220412 22:01:21.065430 411 kv/kvserver/pkg/kv/kvserver/store_remove_replica.go:153  [n1,s1,r1/1:/{Min-System/NodeL…}] 277  removing replica r2/1

Of course, r2 is still set as the priority range ID, even though it no longer exists. So when we later split off r45 for /System/NodeLiveness and (correctly) give it priority, it triggers the assertion:

I220412 22:01:23.969020 65092 kv/kvserver/pkg/kv/kvserver/replica_command.go:400  [n1,split,s1,r1/1:/{Min-System/NodeL…}] 1421  initiating a split of this range at key /System/NodeLiveness [r45] (span config)
I220412 22:01:23.975207 462 kv/kvserver/pkg/kv/kvserver/replica_init.go:330  [n1,s1,r45/1:{-}] 1429  setDescLockedRaftMuLocked: r45:/System/NodeLiveness{-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=6]
I220412 22:01:23.975318 462 kv/kvserver/pkg/kv/kvserver/replica_init.go:386  [n1,s1,r45/1:/System/NodeLiveness{-Max}] 1430  SetPriorityID: r45:/System/NodeLiveness{-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=6]
panic: priority range ID already set: old=2, new=45, first set at:

I'm guessing this might have something to do with the new system range designation system, which is async and therefore vulnerable to races. Will look into it.
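For context, the assertion that fires is the raft scheduler's set-once guard on the priority (liveness) range ID. A minimal self-contained sketch of that behavior -- simplified from scheduler.go, with the field name assumed -- looks like this:

package main

import "fmt"

// RangeID stands in for roachpb.RangeID in this sketch.
type RangeID int64

// rangeIDQueue sketches the set-once guard in pkg/kv/kvserver/scheduler.go:
// once the liveness range's ID has been registered as the priority ID,
// registering a different ID panics -- exactly what the merge-then-resplit
// sequence above trips (old=2, new=45).
type rangeIDQueue struct {
	priorityID RangeID
}

func (q *rangeIDQueue) SetPriorityID(id RangeID) {
	if q.priorityID != 0 && q.priorityID != id {
		panic(fmt.Sprintf("priority range ID already set: old=%d, new=%d", q.priorityID, id))
	}
	q.priorityID = id
}

func main() {
	var q rangeIDQueue
	q.SetPriorityID(2)  // r2, the original liveness range, claims priority at startup
	q.SetPriorityID(45) // the re-split liveness range r45 -> panics
}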

@erikgrinaker
Contributor

erikgrinaker commented Apr 13, 2022

Yeah, I think the merge should be prevented by this check:

if confReader.NeedsSplit(ctx, desc.StartKey, desc.EndKey.Next()) {
	// This range would need to be split if it extended just one key further.
	// There is thus no possible right-hand neighbor that it could be merged
	// with.
	return false, 0
}

Which used to check the static splits here:

for _, split := range staticSplits {
	if startKey.Less(split) {
		if split.Less(endKey) {
			// The split point is contained within [startKey, endKey), so we need to
			// create the split.
			return split
		}
		// [startKey, endKey) is contained between the previous split point and
		// this split point.
		return nil
	}
	// [startKey, endKey) is somewhere greater than this split point. Continue.
}

var staticSplits = []roachpb.RKey{
	roachpb.RKey(keys.NodeLivenessPrefix),           // end of meta records / start of node liveness span
	roachpb.RKey(keys.NodeLivenessKeyMax),           // end of node liveness span
	roachpb.RKey(keys.TimeseriesPrefix),             // start of timeseries span
	roachpb.RKey(keys.TimeseriesPrefix.PrefixEnd()), // end of timeseries span
	roachpb.RKey(keys.TableDataMin),                 // end of system ranges / start of system config tables
}
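Concretely, under that old logic the r1+r2 merge would have been refused, because extending r1 by one key crosses the NodeLivenessPrefix split point. A quick sketch of that reasoning (not a test in the repo):

package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Sketch: the first static split point (NodeLivenessPrefix) falls inside
// [/Min, /System/NodeLiveness + one key), so the old check reports "needs
// split" and the merge of r1 with its right-hand neighbor is not attempted.
func TestStaticSplitWouldBlockLivenessMerge(t *testing.T) {
	startKey := roachpb.RKeyMin
	endKey := roachpb.RKey(keys.NodeLivenessPrefix).Next() // r1's end key, extended by one key
	split := roachpb.RKey(keys.NodeLivenessPrefix)
	if !(startKey.Less(split) && split.Less(endKey)) {
		t.Fatal("expected the liveness split point to fall inside the extended r1 span")
	}
}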

But with the new spanconfigs it looks like it only prevents splits in the system config span:

// We don't want to split within the system config span while we're still
// also using it to disseminate zone configs.
//
// TODO(irfansharif): Once we've fully phased out the system config span, we
// can get rid of this special handling.
if keys.SystemConfigSpan.Contains(sp) {
	return nil
}
if keys.SystemConfigSpan.ContainsKey(sp.Key) {
	return roachpb.RKey(keys.SystemConfigSpan.EndKey)
}

This does not cover the liveness range (the system config span starts at 0x88, while liveness keys are at 0x04):

SystemConfigSpan = roachpb.Span{Key: SystemConfigSplitKey, EndKey: SystemConfigTableDataMax}
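A similar sketch of the gap (again, not a test in the repo): the liveness prefix sorts below the start of SystemConfigSpan, so the carve-out above never applies to the r1/r2 merge:

package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/keys"
)

// Sketch: liveness keys (0x04...) sort before SystemConfigSplitKey (0x88...),
// so the system-config-span carve-out cannot protect the liveness range.
func TestLivenessOutsideSystemConfigSpan(t *testing.T) {
	if keys.SystemConfigSpan.ContainsKey(keys.NodeLivenessPrefix) {
		t.Fatal("liveness prefix unexpectedly inside the system config span")
	}
}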

And so it goes on to check the asynchronous span configs, which I believe are racy.

@irfansharif I haven't looked into the spanconfig stuff in any depth, but can you confirm that this race is no longer possible? This bug reproduced a couple of months back, but no longer -- a link to the PR that addressed it would be great.

@tbg
Member

tbg commented Apr 14, 2022

Nice, good catch!

@irfansharif
Contributor

irfansharif commented Apr 14, 2022

What exactly is the race we're talking about here? Is the question why we merged ranges r1:/{Min-System/NodeLiveness} + r2:/System/NodeLiveness{-Max} despite not wanting to, only to later split off the liveness range r45:/System/NodeLiveness{-Max}? That's likely because of the bug #78122 fixed. Looking at the bisect logs above, 86db5d0 is the most recent SHA that experienced this failure, and it precedes that PR's merge.

Unrelated, the mechanism to prevent range splits isn't exactly in the snippet linked -- that bit is just an exception for the system config span (which doesn't encompass the liveness range either). It's a bit more hidden than our hard-coded lists unfortunately, but split keys are determined by the contents of system.span_configurations, data that's emitted by this interface:

// SQLTranslator translates SQL descriptors and their corresponding zone
// configurations to constituent spans and span configurations.
//
// Concretely, for the following zone configuration hierarchy:
//
// CREATE DATABASE db;
// CREATE TABLE db.t1();
// ALTER DATABASE db CONFIGURE ZONE USING num_replicas=7;
// ALTER TABLE db.t1 CONFIGURE ZONE USING num_voters=5;
//
// The SQLTranslator produces the following translation (represented as a diff
// against RANGE DEFAULT for brevity):
//
// Table/5{3-4} num_replicas=7 num_voters=5
type SQLTranslator interface {

That's not very helpful though, so usually when I want to figure out what split points are supposed to exist, I look at these test files:

diff
----
--- gossiped system config span (legacy)
+++ span config infrastructure (current)
@@ -1,7 +1,7 @@
-/Min ttl_seconds=3600 ignore_strict_gc=true num_replicas=5 rangefeed_enabled=true
-/System/NodeLiveness ttl_seconds=600 ignore_strict_gc=true num_replicas=5 rangefeed_enabled=true
-/System/NodeLivenessMax database system (host)
-/System/tsd database system (tenant)
-/System/"tse" database system (host)
+/Min ttl_seconds=3600 num_replicas=5
+/System/NodeLiveness ttl_seconds=600 num_replicas=5
+/System/NodeLivenessMax range system
+/System/tsd range default
+/System/"tse" range system
/Table/SystemConfigSpan/Start database system (host)
/Table/11 database system (host)
@@ -10,11 +10,11 @@
/Table/14 database system (host)
/Table/15 database system (host)
-/Table/16 database system (host)
-/Table/17 database system (host)
-/Table/18 database system (host)
+/Table/16 range system
+/Table/17 range system
+/Table/18 range system
/Table/19 database system (host)
/Table/20 database system (host)
/Table/21 database system (host)
-/Table/22 database system (host)
+/Table/22 range system
/Table/23 database system (host)
/Table/24 database system (host)
@@ -23,5 +23,5 @@
/Table/27 ttl_seconds=600 ignore_strict_gc=true num_replicas=5 rangefeed_enabled=true
/Table/28 database system (host)
-/Table/29 database system (host)
+/Table/29 range system
/NamespaceTable/30 database system (host)
/NamespaceTable/Max database system (host)
@@ -32,5 +32,5 @@
/Table/36 database system (host)
/Table/37 database system (host)
-/Table/38 database system (host)
+/Table/38 range system
/Table/39 database system (host)
/Table/40 database system (host)
@@ -42,5 +42,5 @@
/Table/46 database system (host)
/Table/47 database system (host)
-/Table/50 range system
+/Table/50 database system (host)
/Table/106 num_replicas=7 num_voters=5
/Table/107 num_replicas=7

Closing.

@erikgrinaker
Contributor

Yep, the r1+r2 merge shouldn't have happened. Thanks for confirming the fix!

irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 3, 2023
Fixes cockroachdb#98200. This test was written pre-spanconfig days, and when
enabling spanconfigs by default, opted out of using it. This commit
makes it use spanconfigs after giving up on reproducing/diagnosing the
original flake (this test is notoriously slow -- taking 30+s given it
waits for actual upreplication and replica movement, so not --stress
friendly).

Using spanconfigs here surfaced a rare, latent bug, one this author
incorrectly thought was fixed back in cockroachdb#75939. In very rare cases, right
during cluster bootstrap before the span config reconciler has ever had
a chance to run (i.e. system.span_configurations is empty), it's
possible that the subscriber has subscribed to an empty span config
state (we've only seen this happen in unit tests with 50ms scan
intervals). So it's not been meaningfully "updated" in any sense of the
word, but we still previously set a non-empty last-updated timestamp,
something various components in KV rely on as proof that we have span
configs as of some timestamp. As a result, we saw KV incorrectly merge
away the liveness range into adjacent ranges, and then later split it
off. We don't think we've ever seen this happen outside of tests as it
instantly triggers the following fatal in the raftScheduler, which wants
to prioritize the liveness range above all else:

    panic: priority range ID already set: old=2, new=61, first set at:

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 1, 2023
Fixes cockroachdb#98200. This test was written pre-spanconfig days, and when
enabling spanconfigs by default over a year ago, opted out of using it.
It's a real chore to bring this old test back up to spec (cockroachdb#100210 is an
earlier attempt). It has been skipped for a while after flaking (for
test-only reasons that are understood, see cockroachdb#100210) and is notoriously
slow taking 30+s given it waits for actual upreplication and replica
movement, making it not --stress friendly.

In our earlier attempt to upgrade this to use spanconfigs, we learnt two
new things:

- There was a latent bug, previously thought to have been fixed in
  cockroachdb#75939. In very rare cases, right during cluster bootstrap before the
  span config reconciler has ever had a chance to run (i.e.
  system.span_configurations is empty), it was possible that the
  subscriber had subscribed to an empty span config state (we've only
  seen this happen in unit tests with 50ms scan intervals). So it had
  not been meaningfully "updated" in any sense of the word, but we still
  previously set a non-empty last-updated timestamp, something various
  components in KV rely on as proof that we have span configs as of some
  timestamp. As a result, we saw KV incorrectly merge away the liveness
  range into adjacent ranges, and then later split it off. We don't
  think we've ever seen this happen outside of tests as it instantly
  triggers the following fatal in the raftScheduler, which wants to
  prioritize the liveness range above all else:
    panic: priority range ID already set: old=2, new=61, first set at:
  This bug continues to exist. We've filed cockroachdb#104195 to track fixing it.

- Fixing the bug above (by erroring out until a span config snapshot is
  available) made it so that tests now needed to actively wait for a
  span config snapshot before relocating ranges manually or using
  certain kv queues. Adding that synchronization made lots of tests a
  whole lot slower (by 3+s each) despite reducing the closed timestamp
  interval, etc. These tests weren't really being harmed by the bug (==
  empty span config snapshot). So it's not clear that the bug is
  worth fixing. But that can be litigated in cockroachdb#104195.

We don't really need this test in this current form (end-to-end
spanconfig tests exist elsewhere and are more comprehensive without
suffering the issues above).

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 27, 2023
craig bot pushed a commit that referenced this issue Jun 27, 2023
104198: kvserver: kill TestSystemZoneConfigs r=irfansharif a=irfansharif

Fixes #98200. This test was written pre-spanconfig days, and when enabling spanconfigs by default over a year ago, opted out of using it. It's a real chore to bring this old test back up to spec (#100210 is an earlier attempt). It has been skipped for a while after flaking (for test-only reasons that are understood, see #100210) and is notoriously slow taking 30+s given it waits for actual upreplication and replica movement, making it not --stress friendly.

In our earlier attempt to upgrade this to use spanconfigs, we learnt two new things:

- There was a latent bug, previously thought to have been fixed in #75939. In very rare cases, right during cluster bootstrap before the span config reconciler has ever had a chance to run (i.e. system.span_configurations is empty), it was possible that the subscriber had subscribed to an empty span config state (we've only seen this happen in unit tests with 50ms scan intervals). So it had not been meaningfully "updated" in any sense of the word, but we still previously set a non-empty last-updated timestamp, something various components in KV rely on as proof that we have span configs as of some timestamp. As a result, we saw KV incorrectly merge away the liveness range into adjacent ranges, and then later split it off. We don't think we've ever seen this happen outside of tests as it instantly triggers the following fatal in the raftScheduler, which wants to prioritize the liveness range above all else:

    panic: priority range ID already set: old=2, new=61, first set at:

  This bug continues to exist. We've filed #104195 to track fixing it.

- Fixing the bug above (by erroring out until a span config snapshot is available) made it so that tests now needed to actively wait for a span config snapshot before relocating ranges manually or using certain kv queues. Adding that synchronization made lots of tests a whole lot slower (by 3+s each) despite reducing the closed timestamp interval, etc. These tests weren't really being harmed by the bug (== empty span config snapshot). So it's not clear that the bug is worth fixing. But that can be litigated in #104195.

We don't really need this test in this current form (end-to-end spanconfig tests exist elsewhere and are more comprehensive without suffering the issues above).

Release note: None

Co-authored-by: irfan sharif <[email protected]>