server: TestAdminAPITableStats failed [priority range ID already set] #75939
Comments
kv bug maybe? cc @cockroachdb/kv-prs for triage
Yup, looks like it!
Not too hard to repro using gceworker:
Guess I'll play the bisect game later.
git bisect start
bad: [5c0d513] Merge #74235
git bisect bad 5c0d513
bad: [d8b2ea1] kvserver: improve logging around successful qps-based rebalances
git bisect bad d8b2ea1
bad: [79e8dc7] *: use bootstrap.BootstrappedSystemIDChecker in tests
git bisect bad 79e8dc7
good: [442e5b9] kvserver: add metric for replica circuit breaker
git bisect good 442e5b9
good: [4fc0855] Merge #73883
git bisect good 4fc0855
skip: [2367979] Merge #75231 #75273
git bisect skip 2367979
good: [72c5471] Merge #74810 #74923
git bisect good 72c5471
good: [81c447d] Merge #72665 #74251 #74831 #75058 #75076 #75465 #75504
git bisect good 81c447d
good: [5216fba] Merge #75532
git bisect good 5216fba
skip: [036375e] ci: pass
Saw this once or twice in bors too so going to take a look.
What's interesting is that there is no mention of a split in this log. And yet, the panic is triggered by a split. Might be holding it wrong.
Merge-to-master is currently pretty red, and this has popped up once or twice. Disable the assertion (which is fine for tests, for now) while we investigate.

Touches cockroachdb#75939

Release note: None
Won't be able to figure it out today and about to head out for a long weekend, so skipping the assertion instead: #76986
Did another bisection, with 1000 runs each:
Doesn't make sense for the merge commit to be the bad one, so I'll run another set.
Got another hit on 5d51963. This is likely too rare to get a good signal from, but casting a wider net with longer stress runs.
Looks like the test timings have changed significantly here. Back on 598bd25, 1000 runs complete in about 4 minutes, while more recent commits take 8 minutes. That could be a contributing factor. Will try to pinpoint where it changed.
I sometimes had to run it for 90 minutes on a 20-node 8-vCPU roachprod cluster to get a repro. This one is nasty to pin down. Sorry, never got around to sharing the partial attempt:
I had come away from it thinking it would be easier to just diagnose and fix the problem and find the commit from that.
Yeah, that makes sense. I'll backport your stack trace change to one of the flaky commits and see what I find.
Of course, after adding the stack trace, this refuses to reproduce after tens of thousands of runs. Will keep at it, and try to provoke it.
Removing the GA blocker here, due to the rarity and the inability to reproduce in later commits. But will keep investigating.
Finally triggered again. Nothing surprising really -- just dumping the stacks here for now, will have a look in the morning:
r1 covers … This is probably a race where we begin splitting off the user table range before we've properly split off the system ranges, such that the LHS of r44 ends up being r1 with …
Ah, table 50 is actually the …
Ah, looks like we end up merging r2 into r1 early on:
Of course, r2 is still set as the priority range ID, even though it no longer exists. So when we later split off r45 for …
I'm guessing this might have something to do with the new mechanism for designating system ranges, which is async and therefore vulnerable to races. Will look into it.
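For readers unfamiliar with the assertion that fires here, below is a minimal standalone sketch of the kind of guard involved (hypothetical names and structure, not the actual raftScheduler code): a single priority range ID is recorded along with a stack trace from the first time it is set, and designating a different range later panics with a message of the same shape as the one in this failure.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// prioritizer is a toy stand-in for a raft-scheduler-like component that
// gives exactly one range (e.g. the liveness range) scheduling priority.
// All names here are invented for illustration.
type prioritizer struct {
	mu         sync.Mutex
	priorityID int64  // 0 means "not set"
	firstSetAt string // stack captured when the priority ID was first set
}

// setPriorityID designates rangeID as the single priority range. Setting a
// different ID a second time is treated as a programming error; the stored
// stack trace shows where the first designation came from.
func (p *prioritizer) setPriorityID(rangeID int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.priorityID != 0 && p.priorityID != rangeID {
		panic(fmt.Sprintf(
			"priority range ID already set: old=%d, new=%d, first set at:\n%s",
			p.priorityID, rangeID, p.firstSetAt))
	}
	p.priorityID = rangeID
	p.firstSetAt = string(debug.Stack())
}

func main() {
	var p prioritizer
	p.setPriorityID(2)  // e.g. r2 (liveness) claims priority at bootstrap
	p.setPriorityID(61) // a later split also claims priority -> panics
}
```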
Yeah, I think the merge should be prevented by this check: cockroach/pkg/kv/kvserver/merge_queue.go, lines 146 to 151 at 949658b.
Which used to check the static splits here: cockroach/pkg/config/system.go, lines 500 to 512 at 4b8258c, and cockroach/pkg/config/system.go, lines 455 to 461 at 4b8258c.
But with the new spanconfigs it looks like it only prevents splits in the system config span: cockroach/pkg/spanconfig/spanconfigstore/store.go, lines 79 to 89 at 610aa04.
This does not cover the liveness range (the system config span starts at 0x88, while liveness is at 0x04): line 41 at 89f8aba.
And so it goes on to check the asynchronous span configs, which I believe are racy. @irfansharif I haven't looked into the spanconfig stuff in any depth, but can you confirm that this race is no longer possible? This bug reproduced a couple of months back, but no longer -- a link to the PR that addressed it would be great.
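To make the contrast concrete, here is a simplified, self-contained sketch (placeholder keys and names, not the actual merge_queue.go or config/system.go logic) of what the old static-split-key check amounted to: a synchronous lookup against a hard-coded list of mandatory split points, which refuses a merge that would swallow the liveness boundary.

```go
package main

import (
	"bytes"
	"fmt"
)

type key []byte

// staticSplitKeys stands in for the old hard-coded list of keys at which
// ranges must always be split (liveness, timeseries, system config, ...).
// The byte values are placeholders, not CockroachDB's real key encoding.
var staticSplitKeys = []key{
	{0x04, 0x00, 0x01}, // e.g. start of the liveness span
	{0x04, 0x74, 0x73}, // e.g. start of the timeseries span
	{0x88},             // e.g. start of the system config span
}

// mergeAllowed reports whether merging an LHS and RHS into one range spanning
// [start, end) would swallow a mandatory split point. The real check is more
// involved; this only shows the shape of the old, synchronous lookup.
func mergeAllowed(start, end key) bool {
	for _, k := range staticSplitKeys {
		// A mandatory split key strictly inside (start, end) forbids the merge.
		if bytes.Compare(start, k) < 0 && bytes.Compare(k, end) < 0 {
			return false
		}
	}
	return true
}

func main() {
	// Merging two ranges into [0x02, 0x05) would swallow the liveness split
	// key, so it must be refused.
	fmt.Println(mergeAllowed(key{0x02}, key{0x05})) // false
	fmt.Println(mergeAllowed(key{0x05}, key{0x06})) // true
}
```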
Nice, good catch!
What exactly is the race we're talking about here? Is the question why did we merge ranges …?

Unrelated, the mechanism to prevent range splits isn't exactly in the snippet linked -- that bit is just an exception for the system config span (which doesn't encompass the liveness range either). It's a bit more hidden than our hard-coded lists unfortunately, but split keys are determined by the contents of cockroach/pkg/spanconfig/spanconfig.go, lines 98 to 112 at 90054c6.
That's not very helpful though, so usually when I want to figure out what split points are supposed to exist, I look at these test files: cockroach/pkg/ccl/spanconfigccl/spanconfigcomparedccl/testdata/basic Lines 68 to 121 in d7e7865
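Assuming only what the comments above state -- that split points now fall out of where the stored span configs change, rather than out of a hard-coded list -- a loose sketch of that lookup might look like the following. The boundary keys and function name are invented for illustration and are not the real spanconfig.StoreReader API.

```go
package main

import (
	"bytes"
	"fmt"
)

// boundaries stands in for the keys at which the span config store's contents
// change. In the real system this state is populated asynchronously from
// system.span_configurations; here it is just a sorted slice of placeholders.
var boundaries = [][]byte{
	{0x04, 0x00, 0x01}, // placeholder: liveness
	{0x88},             // placeholder: system config span
	{0xf2},             // placeholder: first user table
}

// computeSplitKey returns the first config boundary strictly inside
// (start, end), or nil if the span needs no split. Only the general shape of
// a spanconfig-backed split-key lookup is mirrored here.
func computeSplitKey(start, end []byte) []byte {
	for _, b := range boundaries {
		if bytes.Compare(start, b) < 0 && bytes.Compare(b, end) < 0 {
			return b
		}
	}
	return nil
}

func main() {
	// A range spanning [0x02, 0x05) must split at the liveness boundary,
	// which also means it may not be merged with its neighbour.
	fmt.Printf("%x\n", computeSplitKey([]byte{0x02}, []byte{0x05})) // 040001
	// A span lying between boundaries has no mandatory split key.
	fmt.Println(computeSplitKey([]byte{0x90}, []byte{0xa0}) == nil) // true
}
```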
Closing. |
Yep, the r1+r2 merge shouldn't have happened. Thanks for confirming the fix!
Fixes cockroachdb#98200.

This test was written pre-spanconfig days, and when enabling spanconfigs by default, opted out of using it. This commit makes it use spanconfigs after giving up on reproducing/diagnosing the original flake (this test is notoriously slow -- taking 30+s given it waits for actual upreplication and replica movement, so not --stress friendly).

Using spanconfigs here surfaced a rare, latent bug, one this author incorrectly thought was fixed back in cockroachdb#75939. In very rare cases, right during cluster bootstrap before the span config reconciler has ever had a chance to run (i.e. system.span_configurations is empty), it's possible that the subscriber has subscribed to an empty span config state (we've only seen this happen in unit tests with 50ms scan intervals). So it's not been meaningfully "updated" in any sense of the word, but we still previously set a non-empty last-updated timestamp, something various components in KV rely on as proof that we have span configs as of some timestamp. As a result, we saw KV incorrectly merge away the liveness range into adjacent ranges, and then later split it off. We don't think we've ever seen this happen outside of tests as it instantly triggers the following fatal in the raftScheduler, which wants to prioritize the liveness range above all else:

panic: priority range ID already set: old=2, new=61, first set at:

Release note: None
Fixes cockroachdb#98200.

This test was written pre-spanconfig days, and when enabling spanconfigs by default over a year ago, opted out of using it. It's a real chore to bring this old test back up to spec (cockroachdb#100210 is an earlier attempt). It has been skipped for a while after flaking (for test-only reasons that are understood, see cockroachdb#100210) and is notoriously slow, taking 30+s given it waits for actual upreplication and replica movement, making it not --stress friendly. In our earlier attempt to upgrade this to use spanconfigs, we learnt two new things:

- There was a latent bug, previously thought to have been fixed in cockroachdb#75939. In very rare cases, right during cluster bootstrap before the span config reconciler has ever had a chance to run (i.e. system.span_configurations is empty), it was possible that the subscriber had subscribed to an empty span config state (we've only seen this happen in unit tests with 50ms scan intervals). So it had not been meaningfully "updated" in any sense of the word, but we still previously set a non-empty last-updated timestamp, something various components in KV rely on as proof that we have span configs as of some timestamp. As a result, we saw KV incorrectly merge away the liveness range into adjacent ranges, and then later split it off. We don't think we've ever seen this happen outside of tests as it instantly triggers the following fatal in the raftScheduler, which wants to prioritize the liveness range above all else:

  panic: priority range ID already set: old=2, new=61, first set at:

  This bug continues to exist. We've filed cockroachdb#104195 to track fixing it.

- Fixing the bug above (by erroring out until a span config snapshot is available) made it so that tests now needed to actively wait for a span config snapshot before relocating ranges manually or using certain kv queues. Adding that synchronization made lots of tests a whole lot slower (by 3+s each) despite reducing the closed timestamp interval, etc. These tests weren't really being harmed by the bug (== empty span config snapshot), so it's not clear that the bug is worth fixing. But that can be litigated in cockroachdb#104195.

We don't really need this test in this current form (end-to-end spanconfig tests exist elsewhere and are more comprehensive without suffering the issues above).

Release note: None
104198: kvserver: kill TestSystemZoneConfigs r=irfansharif a=irfansharif

Co-authored-by: irfan sharif <[email protected]>
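The direction described in the commit message above ("erroring out until a span config snapshot is available") could look roughly like the following standalone sketch. All types and method names here are invented; this is not the actual KV subscriber code, only an illustration of the guard being proposed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// subscriber is a toy stand-in for a component that applies span config
// snapshots and tells callers how fresh its view is. Everything here is
// hypothetical and exists only to illustrate the guard.
type subscriber struct {
	lastUpdated time.Time
	numEntries  int
}

var errNoSnapshot = errors.New("no span config snapshot received yet")

// apply records a snapshot. The key point from the commit message: don't
// advertise a last-updated timestamp for an empty snapshot, so callers can't
// mistake "subscribed to nothing" for "have span configs as of t".
func (s *subscriber) apply(entries int, now time.Time) {
	if entries == 0 {
		return // not meaningfully "updated" yet
	}
	s.numEntries = entries
	s.lastUpdated = now
}

// lastUpdatedTS errors out until a non-empty snapshot has been applied,
// instead of returning a timestamp that KV would treat as proof of having
// span configs (and then merge away the liveness range on that basis).
func (s *subscriber) lastUpdatedTS() (time.Time, error) {
	if s.lastUpdated.IsZero() {
		return time.Time{}, errNoSnapshot
	}
	return s.lastUpdated, nil
}

func main() {
	var s subscriber
	s.apply(0, time.Now()) // empty state right after bootstrap
	if _, err := s.lastUpdatedTS(); err != nil {
		fmt.Println(err) // callers must wait rather than act on empty state
	}
	s.apply(42, time.Now()) // reconciler has written span configs
	ts, _ := s.lastUpdatedTS()
	fmt.Println("have span configs as of", ts)
}
```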
server.TestAdminAPITableStats failed with artifacts on master @ 5c0d5132ade6d2e30a9d414163851272750d8afa:
Fatal error:
Stack:
Log preceding fatal error
Help
See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:
Jira issue: CRDB-12886