Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: increase the number of ranges in rebalance/by-load tests #102801

Closed
kvoli opened this issue May 5, 2023 · 0 comments · Fixed by #102824
Closed

roachtest: increase the number of ranges in rebalance/by-load tests #102801

kvoli opened this issue May 5, 2023 · 0 comments · Fixed by #102824
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@kvoli
Copy link
Collaborator

kvoli commented May 5, 2023

In the rebalance/by-load/... roachtest, a KV table is initialized and split into ranges so that the number of ranges is equal to the number of stores.

splits := numStores - 1 // n-1 splits => n ranges => 1 lease per store

These tests periodically fail due to #88829. Since this is a known limitation, we should update the roachtests to use a greater number of ranges, so that we can still assert on load being rebalanced but without the noise from this limitation.

Jira issue: CRDB-27661

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. labels May 5, 2023
@kvoli kvoli self-assigned this May 5, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label May 5, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue May 5, 2023
The `rebalance/by-load/replicas` roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) cockroachdb#88829.

This commit updates the total number of ranges from 1 per-store, to 5
per-store for `rebalance/by-load/*` roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This commit updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

```
cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]
```

resolves: cockroachdb#102801
resolves: cockroachdb#102823

Release note: None
craig bot pushed a commit that referenced this issue May 9, 2023
102824: roachtest: assert on CPU in rebalance/by-load  r=aliher1911 a=kvoli

The `rebalance/by-load/replicas` roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) #88829.

This PR updates the total number of ranges from 1 per-store, to 5
per-store for `rebalance/by-load/*` roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This PR updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

```
cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]
```

resolves: #102801
resolves: #102823

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
@craig craig bot closed this as completed in d9b5463 May 9, 2023
blathers-crl bot pushed a commit that referenced this issue May 9, 2023
The `rebalance/by-load/replicas` roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) #88829.

This commit updates the total number of ranges from 1 per-store, to 5
per-store for `rebalance/by-load/*` roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This commit updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

```
cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]
```

resolves: #102801
resolves: #102823

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Jun 27, 2023
The `rebalance/by-load/replicas` roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) cockroachdb#88829.

This commit updates the total number of ranges from 1 per-store, to 5
per-store for `rebalance/by-load/*` roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This commit updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

```
cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]
```

resolves: cockroachdb#102801
resolves: cockroachdb#102823

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant