-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: add backup-restore/small-ranges #112356
roachtest: add backup-restore/small-ranges #112356
Conversation
6a5cf63
to
8ae84b6
Compare
@renatolabs @adityamaru I plan to run a few test runs with the patch, if y'all think this general approach looks reasonable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for doing this! Should we also use this new function in the mixed-version test as well?
It would also be interesting to see if we are able to hit #109483 by running this test with a crdb build prior to the fix.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @msbutler)
pkg/cmd/roachtest/tests/backup_restore_roundtrip.go
line 36 at r1 (raw file):
// maxRangeSizeBytes defines the possible non default (default is 512 MB) maximum range // sizes that may get set for all user databases. maxRangeSizeBytes = []int{1 << 18 /* 256 KB */, 4 << 20 /* 4 MB*/, 32 << 20 /* 32 MB */}
Nit: (here and in other places) these are technically KiB
, not KB
.
pkg/cmd/roachtest/tests/backup_restore_roundtrip.go
line 40 at r1 (raw file):
// SystemSettingsValuesBoundOnRangeSize defines the set of possible cluster // setting values, set after the default max range size. The value chosen will // never be larger than the max range size.
Could this be more clearly expressed as a delta? So these values would be the possible differences between the max range size and the actual cluster setting.
Also: would it make sense to test a zero delta here?
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1294 at r1 (raw file):
// the `systemSettingsValuesBoundOnRangeSize[setting]` set below the max range // size, or to some random value in the set below the max range size. func (u *CommonTestUtils) setMaxRangeSizeAndDependentSettings(
Some thoughts: Not sure I understand the purpose of defaultAltSettingProbability
as I mention below. Also, this function is currently maybe changing the max range size but always changing cluster settings.
IMO, we should make sure there is a decent percentage of runs (I'd say at least 0.2) where we don't touch anything and let the system use its defaults.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1303 at r1 (raw file):
if rng.Float64() < defaultRangeSizeProbability { rangeSize = maxRangeSizeBytes[rng.Intn(len(maxRangeSizeBytes))] l.Printf("Set max range rangeSize to %d MB", rangeSize>>20)
Nit: this will confusingly render as 0 MB
if the range size is less than 1MiB.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1305 at r1 (raw file):
l.Printf("Set max range rangeSize to %d MB", rangeSize>>20) for _, dbName := range dbs { query := fmt.Sprintf("ALTER DATABASE %s CONFIGURE ZONE USING range_max_bytes=%d, range_min_bytes=1024",
Wouldn't hurt to make the min range size a constant as well.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1324 at r1 (raw file):
} chosenValue := boundedValues[len(boundedValues)-1] if rng.Float64() > defaultAltSettingProbability {
Curious about this choice: why do we assign higher probability to the value closest to the max range size? Makes reasoning about the distribution a little harder so I'd only recommend this type of approach if there's a clear reason why this would uncover more bugs/interesting scenarios.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1327 at r1 (raw file):
chosenValue = boundedValues[rng.Intn(len(boundedValues))] } l.Printf("setting cluster setting %q to %q", setting, chosenValue)
chosenValue
is an int, no?
Nice! I'm still to read the patch carefully but this was my first reaction too. We should run and observe a few of these runs to make sure we don't need to tweak some more knobs to get the "simulated" cluster state we're after. Things like do we need to wait till the ranges are rebalanced when we tweak the setting, do we need to alter the target files sizes of export requests and of our sst file sink etc. I think this as is would increase coverage but I want to have some confidence in our understanding of what coverage it offers so that we're not relying on it doing something its not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to first deploy it on this test, and after week or two of baking, I'll add it to the mixed-version test.
And yes, the hope is to hit that bug!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @renatolabs)
pkg/cmd/roachtest/tests/backup_restore_roundtrip.go
line 40 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Could this be more clearly expressed as a delta? So these values would be the possible differences between the max range size and the actual cluster setting.
Also: would it make sense to test a zero delta here?
I think it makes sense to specify the absolute value of the desired cluster setting, since i think it's easier to reason about the value of a cluster setting on it's own and not relative to max range size. As an example: setting backup_restore_span.target_size
has a semantic meaning that's quite different than the a nonzero value for this setting. And expressing these values as a difference wrt range size would muddle that.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1294 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Some thoughts: Not sure I understand the purpose of
defaultAltSettingProbability
as I mention below. Also, this function is currently maybe changing the max range size but always changing cluster settings.IMO, we should make sure there is a decent percentage of runs (I'd say at least 0.2) where we don't touch anything and let the system use its defaults.
See my answer below.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1303 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: this will confusingly render as
0 MB
if the range size is less than 1MiB.
ah good point
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1305 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Wouldn't hurt to make the min range size a constant as well.
Done.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1324 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Curious about this choice: why do we assign higher probability to the value closest to the max range size? Makes reasoning about the distribution a little harder so I'd only recommend this type of approach if there's a clear reason why this would uncover more bugs/interesting scenarios.
This is a good question. My approach here may be over indexing on the strategy for tuning backup.restore_span.target_size
. For optimal performance, in theory, we generally want to set this setting to be about 75% of the max range size. It's also totally fine to set this setting even lower, and doing so would help increase test coverage. We would never want to test a value for this setting above the range size -- that would really degrade performance.
So, conditional on a set range size, my approach chooses a "default" backup_restore_span.target_size
of 75% of the range size half of the time. For the 50% of test runs, we choose a "non-default" value which is lower than 75% of the range size.
i think my approach works well in the context of this specific cluster setting, but may not work so well for other dependent settings we may want to add in the future. I think some experiments may shed some light on better alternative approaches.
It may also be worth adding a totally new variant of the test backup-restore/round-trip/small-ranges
which explores small range test coverage, since the small range size fundamentally changes the behavior of backup/restore.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1327 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
chosenValue
is an int, no?
Aha. yes.
089c4af
to
83daf1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @msbutler)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1294 at r1 (raw file):
Previously, msbutler (Michael Butler) wrote…
See my answer below.
Thanks! I still see the value of running some runs without touching any settings, but I'll leave that up to you.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1324 at r1 (raw file):
For the 50% of test runs, we choose a "non-default" value which is lower than 75% of the range size.
Would it make sense to test values in the [0.75-1.0] range? I understand it's not ideal and it's not what customers should be doing, just wondering if it's something that "should work" or not.
It may also be worth adding a totally new variant of the test
backup-restore/round-trip/small-ranges
Maybe, but a couple of thoughts: do you think testing small ranges probabilistically in this test is not enough? Even if something runs with probability 0.3
(as an example), chances are it will run at least once daily if that is running in all release branches.
Another thought is that roachtest nightly is pretty close to its limit, typically finishing in 23h+, so if we really want it, it would probably make more sense to make it weekly, or revisit existing tests so that our net runtime stays the same.
83daf1c
to
1155107
Compare
first commit in PR in review over at #112567 |
0d28996
to
e0c911b
Compare
unrelated flake |
e0c911b
to
497bc0a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits, looks great!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @DarrylWong, and @msbutler)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1312 at r7 (raw file):
scale := func(current int64) int64 { // Dividing an int by a larger int returns 0, not a fraction. Thus, the // scalar function depends on whether we're scaling up or down.
I personally find that converting things to float64
would make this easier to read, as it would always be current * ratio
, but maybe that's just me.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1337 at r7 (raw file):
} newValue := scale(currentValue) t.L().Printf("setting cluster setting %s from %s to %s ", setting, humanizedCurrentValue, humanizeutil.IBytes(newValue))
Nit: "changing" cluster setting from X to Y?
Nit2: trailing white space.
This patch adds a new backup-restore roachtest variant that reduces default range size for user databases in the backup-restore/round-trip roachtest to simulate a larger cluster, in terms of range count, at smaller data sizes. In addition, the roachtest scales down a few cluster settings such that the ratio of rangeSize/clusterSetting remains constant. This patch should allow us to recreate a roachtest workload that can simulate the conditions that lead to cockroachdb#109483 but at smaller data sizes. Informs cockroachdb#109483 Release note: None
497bc0a
to
e8bd197
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @DarrylWong, and @renatolabs)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 1312 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I personally find that converting things to
float64
would make this easier to read, as it would always becurrent * ratio
, but maybe that's just me.
good point. changed it.
TFTR! bors r=renatolabs |
Build succeeded: |
blathers backport 23.2 |
blathers backport 23.1 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from e8bd197 to blathers/backport-release-23.1-112356: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.
This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to #109483 but at smaller data sizes.
Informs #109483
Release note: None