Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: assert on CPU in rebalance/by-load and reduce noise #105616

Merged

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Jun 27, 2023

Backport 2/3 commits from #102824 on behalf of @kvoli.
Backport 1/1 commits from #104885 on behalf of @kvoli.

/cc @cockroachdb/release


The rebalance/by-load/replicas roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) #88829.

This PR updates the total number of ranges from 1 per-store, to 5
per-store for rebalance/by-load/* roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This PR updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]

Release note: None


The rebalance-by-load tests assert that the normalized [0,1] CPU utilization of each node is within some threshold of the mean. The threshold was previously 10%, however it is not unexpected that despite replica load being within this threshold, that total node load is not.

The current balancing implementation only concerns itself with replica load.

Bump the tolerance from 10% to 15% to reduce noise.

Additionally, the test did not wait for 3x replication prior to beginning the workload. This is bound to introduce flakes eventually. Wait for 3x replication before beginning.

Release note: None


Resolves: #105258
Resolves: #100816
Release justification: Test only change for deflake.

kvoli added 3 commits June 27, 2023 14:02
Previously, the utility functions for retrieving stats from the
timeseries database in roachtest wouldn't allow specifying a sample
period nor sources. This commit enhances ts_util.go to enable specifying
a sample period via `getMetricsWithSamplePeriod` and updates the
`tsQuery` to have a source field - which is applied on queries.

Release note: None
The `rebalance/by-load/replicas` roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) cockroachdb#88829.

This commit updates the total number of ranges from 1 per-store, to 5
per-store for `rebalance/by-load/*` roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This commit updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

```
cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]
```

resolves: cockroachdb#102801
resolves: cockroachdb#102823

Release note: None
The rebalance-by-load tests assert that the normalized [0,1] CPU
utilization of each node is within some threshold of the mean. The
threshold was previously 10%, however it is not unexpected that despite
replica load being within this threshold, that total node load is not.

The current balancing implementation only concerns itself with replica
load.

Bump the tolerance from 10% to 15% to reduce noise.

Additionally, the test did not wait for 3x replication prior to
beginning the workload. This is bound to introduce flakes eventually.
Wait for 3x replication before beginning.

Resolves: cockroachdb#104854
Resolves: cockroachdb#104386

Release note: None
@kvoli kvoli self-assigned this Jun 27, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli changed the title roachtest: add sample period and source to ts_util release-22.2: assert on CPU in rebalance/by-load and reduce noise Jun 27, 2023
@kvoli kvoli marked this pull request as ready for review June 27, 2023 14:52
Copy link
Contributor

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@kvoli
Copy link
Collaborator Author

kvoli commented Jun 27, 2023

TYFTR

@kvoli kvoli merged commit 660ecd0 into cockroachdb:release-22.2 Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants