Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: kill TestSystemZoneConfigs #104198

Merged

Conversation

irfansharif
Copy link
Contributor

Fixes #98200. This test was written pre-spanconfig days, and when enabling spanconfigs by default over a year ago, opted out of using it. It's a real chore to bring this old test back up to spec (#100210 is an earlier attempt). It has been skipped for a while after flaking (for test-only reasons that are understood, see #100210) and is notoriously slow taking 30+s given it waits for actual upreplication and replica movement, making it not --stress friendly.

In our earlier attempt to upgrade this to use spanconfigs, we learnt two new things:

  • There was a latent bug, previously thought to have been fixed in server: TestAdminAPITableStats failed [priority range ID already set] #75939. In very rare cases, right during cluster bootstrap before the span config reconciler has ever had a chance to run (i.e. system.span_configurations is empty), it was possible that the subscriber had subscribed to an empty span config state (we've only seen this happen in unit tests with 50ms scan intervals). So it was not been meaningfully "updated" in any sense of the word, but we still previously set a non-empty last-updated timestamp, something various components in KV rely on as proof that we have span configs as of some timestamp. As a result, we saw KV incorrectly merge away the liveness range into adjacent ranges, and then later split it off. We don't think we've ever seen this happen outside of tests as it instantly triggers the following fatal in the raftScheduler, which wants to prioritize the liveness range above all else: panic: priority range ID already set: old=2, new=61, first set at: This bug continues to exist. We've filed kvserver: possible to merge away liveness range during bootstrap #104195 to track fixing it.

  • Fixing the bug above (by erroring out until a span config snapshot is available) made it so that tests now needed to actively wait for a span config snapshot before relocating ranges manually or using certain kv queues. Adding that synchronization made lots of tests a whole lot slower (by 3+s each) despite reducing the closed timestamp interval, etc. These tests weren't really being harmed by the bug (== empty span config snapshot). So it's not clear that the bug fix is worth fixing. But that can be litigated in kvserver: possible to merge away liveness range during bootstrap #104195.

We don't really need this test in this current form (end-to-end spanconfig tests exist elsewhere and are more comprehensive without suffering the issues above).

Release note: None

@irfansharif irfansharif requested a review from arulajmani June 1, 2023 14:02
@irfansharif irfansharif requested a review from a team as a code owner June 1, 2023 14:02
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


-- commits line 35 at r1:
Noting that a possible alternative here would be to have tighter synchronization, and just for this test. So maybe a testing knob that can be used to inspect a specific span config update has made its way down into the subscriber.


-- commits line 42 at r1:
By this, do you mean the reconciler/subscriber tests or something else? If its the former, I think the "more comprehensive" verbiage is debatable. Not saying that should preclude us from merging this PR, but just a nudge to keep you honest in the commit message before you bors :)


pkg/kv/kvserver/client_replica_test.go line 2715 at r1 (raw file):

			// infrastructure.
			DisableSpanConfigs: true,
			// Scan like a bat out of hell to ensure replication and replica GC

You feel good about killing such comments bro?

Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


-- commits line 35 at r1:

Previously, arulajmani (Arul Ajmani) wrote…

Noting that a possible alternative here would be to have tighter synchronization, and just for this test. So maybe a testing knob that can be used to inspect a specific span config update has made its way down into the subscriber.

Ack, but that's a hack papering over the real bug #10419 captures.


-- commits line 42 at r1:

Previously, arulajmani (Arul Ajmani) wrote…

By this, do you mean the reconciler/subscriber tests or something else? If its the former, I think the "more comprehensive" verbiage is debatable. Not saying that should preclude us from merging this PR, but just a nudge to keep you honest in the commit message before you bors :)

Also TestSpanConfigUpdateAppliedToReplica. I think this description is honest -- every other roachtest making use of zone configs to place schemas in one zone or another is exercising this machinery and is asserting on placement. pkg/cmd/roachtest/tests/replicagc.go for example.


pkg/kv/kvserver/client_replica_test.go line 2715 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

You feel good about killing such comments bro?

Fortunately it's been cargoculted elsewhere:

// Scan like a bat out of hell to ensure replication and replica GC
// happen in a timely manner.
ScanInterval: 50 * time.Millisecond,
.

Fixes cockroachdb#98200. This test was written pre-spanconfig days, and when
enabling spanconfigs by default over a year ago, opted out of using it.
It's a real chore to bring this old test back up to spec (cockroachdb#100210 is an
earlier attempt). It has been skipped for a while after flaking (for
test-only reasons that are understood, see cockroachdb#100210) and is notoriously
slow taking 30+s given it waits for actual upreplication and replica
movement, making it not --stress friendly.

In our earlier attempt to upgrade this to use spanconfigs, we learnt two
new things:

- There was a latent bug, previously thought to have been fixed in
  cockroachdb#75939. In very rare cases, right during cluster bootstrap before the
  span config reconciler has ever had a chance to run (i.e.
  system.span_configurations is empty), it was possible that the
  subscriber had subscribed to an empty span config state (we've only
  seen this happen in unit tests with 50ms scan intervals). So it was
  not been meaningfully "updated" in any sense of the word, but we still
  previously set a non-empty last-updated timestamp, something various
  components in KV rely on as proof that we have span configs as of some
  timestamp. As a result, we saw KV incorrectly merge away the liveness
  range into adjacent ranges, and then later split it off. We don't
  think we've ever seen this happen outside of tests as it instantly
  triggers the following fatal in the raftScheduler, which wants to
  prioritize the liveness range above all else:
    panic: priority range ID already set: old=2, new=61, first set at:
  This bug continues to exist. We've filed cockroachdb#104195 to track fixing it.

- Fixing the bug above (by erroring out until a span config snapshot is
  available) made it so that tests now needed to actively wait for a
  span config snapshot before relocating ranges manually or using
  certain kv queues. Adding that synchronization made lots of tests a
  whole lot slower (by 3+s each) despite reducing the closed timestamp
  interval, etc. These tests weren't really being harmed by the bug (==
  empty span config snapshot). So it's not clear that the bug fix is
  worth fixing. But that can be litigated in cockroachdb#104195.

We don't really need this test in this current form (end-to-end
spanconfig tests exist elsewhere and are more comprehensive without
suffering the issues above).

Release note: None
@irfansharif irfansharif force-pushed the 230601.rm-TestSystemZoneConfigs branch from c028cf9 to 8395a21 Compare June 27, 2023 16:55
@craig
Copy link
Contributor

craig bot commented Jun 27, 2023

Canceled.

@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 27, 2023

Build succeeded:

@craig craig bot merged commit 6ba0fd5 into cockroachdb:master Jun 27, 2023
@irfansharif irfansharif deleted the 230601.rm-TestSystemZoneConfigs branch June 27, 2023 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestSystemZoneConfigs failed
3 participants