Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add backup-restore/small-ranges #112356

Merged

Conversation

msbutler
Copy link
Collaborator

@msbutler msbutler commented Oct 13, 2023

This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to #109483 but at smaller data sizes.

Informs #109483

Release note: None

@msbutler msbutler self-assigned this Oct 13, 2023
@msbutler msbutler requested a review from a team as a code owner October 13, 2023 22:40
@msbutler msbutler requested review from DarrylWong and removed request for a team October 13, 2023 22:40
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@msbutler msbutler force-pushed the butler-real-splits-roundtrip-test branch from 6a5cf63 to 8ae84b6 Compare October 14, 2023 00:02
@msbutler
Copy link
Collaborator Author

@renatolabs @adityamaru I plan to run a few test runs with the patch, if y'all think this general approach looks reasonable.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for doing this! Should we also use this new function in the mixed-version test as well?

It would also be interesting to see if we are able to hit #109483 by running this test with a crdb build prior to the fix.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @msbutler)


pkg/cmd/roachtest/tests/backup_restore_roundtrip.go line 36 at r1 (raw file):

	// maxRangeSizeBytes defines the possible non default (default is 512 MB) maximum range
	// sizes that may get set for all user databases.
	maxRangeSizeBytes = []int{1 << 18 /* 256 KB */, 4 << 20 /* 4 MB*/, 32 << 20 /* 32 MB */}

Nit: (here and in other places) these are technically KiB, not KB.


pkg/cmd/roachtest/tests/backup_restore_roundtrip.go line 40 at r1 (raw file):

	// SystemSettingsValuesBoundOnRangeSize defines the set of possible cluster
	// setting values, set after the default max range size. The value chosen will
	// never be larger than the max range size.

Could this be more clearly expressed as a delta? So these values would be the possible differences between the max range size and the actual cluster setting.

Also: would it make sense to test a zero delta here?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1294 at r1 (raw file):

// the `systemSettingsValuesBoundOnRangeSize[setting]` set below the max range
// size, or to some random value in the set below the max range size.
func (u *CommonTestUtils) setMaxRangeSizeAndDependentSettings(

Some thoughts: Not sure I understand the purpose of defaultAltSettingProbability as I mention below. Also, this function is currently maybe changing the max range size but always changing cluster settings.

IMO, we should make sure there is a decent percentage of runs (I'd say at least 0.2) where we don't touch anything and let the system use its defaults.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1303 at r1 (raw file):

	if rng.Float64() < defaultRangeSizeProbability {
		rangeSize = maxRangeSizeBytes[rng.Intn(len(maxRangeSizeBytes))]
		l.Printf("Set max range rangeSize to %d MB", rangeSize>>20)

Nit: this will confusingly render as 0 MB if the range size is less than 1MiB.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1305 at r1 (raw file):

		l.Printf("Set max range rangeSize to %d MB", rangeSize>>20)
		for _, dbName := range dbs {
			query := fmt.Sprintf("ALTER DATABASE %s CONFIGURE ZONE USING range_max_bytes=%d, range_min_bytes=1024",

Wouldn't hurt to make the min range size a constant as well.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1324 at r1 (raw file):

		}
		chosenValue := boundedValues[len(boundedValues)-1]
		if rng.Float64() > defaultAltSettingProbability {

Curious about this choice: why do we assign higher probability to the value closest to the max range size? Makes reasoning about the distribution a little harder so I'd only recommend this type of approach if there's a clear reason why this would uncover more bugs/interesting scenarios.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1327 at r1 (raw file):

			chosenValue = boundedValues[rng.Intn(len(boundedValues))]
		}
		l.Printf("setting cluster setting %q to %q", setting, chosenValue)

chosenValue is an int, no?

@adityamaru
Copy link
Contributor

adityamaru commented Oct 16, 2023

I plan to run a few test runs with the patch, if y'all think this general approach looks reasonable.

Nice! I'm still to read the patch carefully but this was my first reaction too. We should run and observe a few of these runs to make sure we don't need to tweak some more knobs to get the "simulated" cluster state we're after. Things like do we need to wait till the ranges are rebalanced when we tweak the setting, do we need to alter the target files sizes of export requests and of our sst file sink etc. I think this as is would increase coverage but I want to have some confidence in our understanding of what coverage it offers so that we're not relying on it doing something its not.

Copy link
Collaborator Author

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to first deploy it on this test, and after week or two of baking, I'll add it to the mixed-version test.

And yes, the hope is to hit that bug!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @renatolabs)


pkg/cmd/roachtest/tests/backup_restore_roundtrip.go line 40 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Could this be more clearly expressed as a delta? So these values would be the possible differences between the max range size and the actual cluster setting.

Also: would it make sense to test a zero delta here?

I think it makes sense to specify the absolute value of the desired cluster setting, since i think it's easier to reason about the value of a cluster setting on it's own and not relative to max range size. As an example: setting backup_restore_span.target_size has a semantic meaning that's quite different than the a nonzero value for this setting. And expressing these values as a difference wrt range size would muddle that.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1294 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Some thoughts: Not sure I understand the purpose of defaultAltSettingProbability as I mention below. Also, this function is currently maybe changing the max range size but always changing cluster settings.

IMO, we should make sure there is a decent percentage of runs (I'd say at least 0.2) where we don't touch anything and let the system use its defaults.

See my answer below.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1303 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: this will confusingly render as 0 MB if the range size is less than 1MiB.

ah good point


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1305 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Wouldn't hurt to make the min range size a constant as well.

Done.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1324 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Curious about this choice: why do we assign higher probability to the value closest to the max range size? Makes reasoning about the distribution a little harder so I'd only recommend this type of approach if there's a clear reason why this would uncover more bugs/interesting scenarios.

This is a good question. My approach here may be over indexing on the strategy for tuning backup.restore_span.target_size. For optimal performance, in theory, we generally want to set this setting to be about 75% of the max range size. It's also totally fine to set this setting even lower, and doing so would help increase test coverage. We would never want to test a value for this setting above the range size -- that would really degrade performance.

So, conditional on a set range size, my approach chooses a "default" backup_restore_span.target_size of 75% of the range size half of the time. For the 50% of test runs, we choose a "non-default" value which is lower than 75% of the range size.

i think my approach works well in the context of this specific cluster setting, but may not work so well for other dependent settings we may want to add in the future. I think some experiments may shed some light on better alternative approaches.

It may also be worth adding a totally new variant of the test backup-restore/round-trip/small-ranges which explores small range test coverage, since the small range size fundamentally changes the behavior of backup/restore.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1327 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

chosenValue is an int, no?

Aha. yes.

@msbutler msbutler force-pushed the butler-real-splits-roundtrip-test branch 3 times, most recently from 089c4af to 83daf1c Compare October 17, 2023 13:01
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @msbutler)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1294 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

See my answer below.

Thanks! I still see the value of running some runs without touching any settings, but I'll leave that up to you.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1324 at r1 (raw file):

For the 50% of test runs, we choose a "non-default" value which is lower than 75% of the range size.

Would it make sense to test values in the [0.75-1.0] range? I understand it's not ideal and it's not what customers should be doing, just wondering if it's something that "should work" or not.

It may also be worth adding a totally new variant of the test backup-restore/round-trip/small-ranges

Maybe, but a couple of thoughts: do you think testing small ranges probabilistically in this test is not enough? Even if something runs with probability 0.3 (as an example), chances are it will run at least once daily if that is running in all release branches.

Another thought is that roachtest nightly is pretty close to its limit, typically finishing in 23h+, so if we really want it, it would probably make more sense to make it weekly, or revisit existing tests so that our net runtime stays the same.

@msbutler msbutler force-pushed the butler-real-splits-roundtrip-test branch from 83daf1c to 1155107 Compare October 19, 2023 14:29
@msbutler msbutler changed the title roachtest: maybe reduce range size in backup-restore/round-trip roachtest: add backup-restore/small-ranges Oct 19, 2023
@msbutler
Copy link
Collaborator Author

first commit in PR in review over at #112567

@msbutler msbutler force-pushed the butler-real-splits-roundtrip-test branch 2 times, most recently from 0d28996 to e0c911b Compare October 19, 2023 20:33
@msbutler
Copy link
Collaborator Author

unrelated flake

@msbutler msbutler force-pushed the butler-real-splits-roundtrip-test branch from e0c911b to 497bc0a Compare October 24, 2023 01:57
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Nits, looks great!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @msbutler)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1312 at r7 (raw file):

	scale := func(current int64) int64 {
		// Dividing an int by a larger int returns 0, not a fraction. Thus, the
		// scalar function depends on whether we're scaling up or down.

I personally find that converting things to float64 would make this easier to read, as it would always be current * ratio, but maybe that's just me.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1337 at r7 (raw file):

		}
		newValue := scale(currentValue)
		t.L().Printf("setting cluster setting %s from %s to %s ", setting, humanizedCurrentValue, humanizeutil.IBytes(newValue))

Nit: "changing" cluster setting from X to Y?
Nit2: trailing white space.

This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
@msbutler msbutler force-pushed the butler-real-splits-roundtrip-test branch from 497bc0a to e8bd197 Compare October 25, 2023 19:14
Copy link
Collaborator Author

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @DarrylWong, and @renatolabs)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1312 at r7 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I personally find that converting things to float64 would make this easier to read, as it would always be current * ratio, but maybe that's just me.

good point. changed it.

@msbutler
Copy link
Collaborator Author

TFTR!

bors r=renatolabs

@craig
Copy link
Contributor

craig bot commented Oct 26, 2023

Build succeeded:

@craig craig bot merged commit b13be53 into cockroachdb:master Oct 26, 2023
@msbutler
Copy link
Collaborator Author

blathers backport 23.2

@msbutler
Copy link
Collaborator Author

msbutler commented Nov 2, 2023

blathers backport 23.1

Copy link

blathers-crl bot commented Nov 2, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e8bd197 to blathers/backport-release-23.1-112356: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants