Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: assert on CPU in rebalance/by-load #102824

Merged
merged 3 commits into from
May 9, 2023

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented May 5, 2023

The rebalance/by-load/replicas roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) #88829.

This PR updates the total number of ranges from 1 per-store, to 5
per-store for rebalance/by-load/* roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This PR updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]

resolves: #102801
resolves: #102823

Release note: None

@kvoli kvoli added the backport-23.1.x Flags PRs that need to be backported to 23.1 label May 5, 2023
@kvoli kvoli self-assigned this May 5, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli changed the title storepool: limit storelist string to 2 decimals roachtest: assert on CPU in rebalance/by-load May 5, 2023
@kvoli kvoli force-pushed the 230505.roachtest-rebalance-load branch from 7ac3143 to aa9f4f0 Compare May 5, 2023 22:28
@kvoli kvoli marked this pull request as ready for review May 5, 2023 23:05
@kvoli kvoli requested a review from a team as a code owner May 5, 2023 23:05
@kvoli kvoli requested review from srosenberg, smg260 and aliher1911 and removed request for a team May 5, 2023 23:05
Copy link
Contributor

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Wouldn't it be still susceptible to some random variations in the cluster like backup running?

I put some minor nits.

@@ -38,9 +38,13 @@ const (
rate
)

// defaultSampleNanos is the default sampling period for getMetrics.
const defaultSampleNanos = int64(time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more readable to have it as

const defaultSamplePeriod = time.Minute

and then use defaultSamplePeriod.Nanoseconds() only when using it in TimeSeriesQueryRequest.

// isLoadEvenlyDistributed checks whether the load for the stores given are
// within tolerance of the mean. If the store loads are, true is returned as
// well as reason, otherwise false. The function expects the loads to be
// indexed to store IDs, see getLeaseCounts for example format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see getLeaseCounts for example format

I don't see getLeaseCounts anywhere in the code. Is it stale?

}

return fmt.Errorf("timed out before leases were evenly spread")
return fmt.Errorf("timed out before CPU was evenly balanced")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nicer for the reader to have error that would include reason. E.g. CPU is not evenly balanced after timeout: . Then it should be part of error message when test fails and GH issue is created.

kvoli added 3 commits May 9, 2023 13:36
Previously, the utility functions for retrieving stats from the
timeseries database in roachtest wouldn't allow specifying a sample
period nor sources. This commit enhances ts_util.go to enable specifying
a sample period via `getMetricsWithSamplePeriod` and updates the
`tsQuery` to have a source field - which is applied on queries.

Release note: None
The `rebalance/by-load/replicas` roachtests periodically flake due to a
known limitation, where a cluster with a small number of ranges may not
be properly balanced due to heterogeneous localities (including
multi-store) cockroachdb#88829.

This commit updates the total number of ranges from 1 per-store, to 5
per-store for `rebalance/by-load/*` roachtests.

The roachtest asserted on the lease count, as a proxy for load, assuming
that the load evenly hits each lease for the created ranges. However,
the principal indicator of load in a read heavy workload is CPU.

This commit updates the test assertion to require that every store's CPU
is within 10% of the cluster mean. The test assertion previously
required that the max-min lease count delta was 0, when no outside
splits occurred; or 1 when the number of ranges was greater than the
number of stores.

The logging format is updated for easier debugging:

```
cpu outside bounds mean=23.9 tolerance=10.0% (±2.4) bounds=[21.5, 26.3]
   below  = []
   within = [s1: 22 (-6.7%), s2: 22 (-6.3%)]
   above  = [s3: 26 (+13.0%)]
...
cpu within bounds mean=25.7 tolerance=10.0% (±2.6) bounds=[23.2, 28.3]
   stores=[s1: 24 (-3.0%), s2: 24 (-2.9%), s3: 27 (+5.9%)]
```

resolves: cockroachdb#102801
resolves: cockroachdb#102823

Release note: None
The `rebalance/by-load` roachtests assert on the distribution of CPU
utilization between nodes in the cluster, having backups also running
could lead to unrelated flakes.

This commit updates cluster creation in `rebalance/by-load/` roachtests
to opt out of a backup schedule.

Epic: none

Release note: None
@kvoli kvoli force-pushed the 230505.roachtest-rebalance-load branch from aa9f4f0 to 4d21913 Compare May 9, 2023 13:41
@kvoli kvoli requested a review from aliher1911 May 9, 2023 13:47
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could still be susceptible. I pushed another commit to disable scheduled backups - I'd rather avoid any flakes on this test, since it has been high maintenance recently.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @smg260, and @srosenberg)


pkg/cmd/roachtest/tests/rebalance_load.go line 176 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

I think it would be nicer for the reader to have error that would include reason. E.g. CPU is not evenly balanced after timeout: . Then it should be part of error message when test fails and GH issue is created.

Done.


pkg/cmd/roachtest/tests/rebalance_load.go line 322 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

see getLeaseCounts for example format

I don't see getLeaseCounts anywhere in the code. Is it stale?

Stale - updated to getStoreCPUFn.


pkg/cmd/roachtest/tests/ts_util.go line 42 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

I think it would be more readable to have it as

const defaultSamplePeriod = time.Minute

and then use defaultSamplePeriod.Nanoseconds() only when using it in TimeSeriesQueryRequest.

Good idea, updated to this.

@kvoli
Copy link
Collaborator Author

kvoli commented May 9, 2023

TYFTR

bors r=aliher1911

@kvoli
Copy link
Collaborator Author

kvoli commented May 9, 2023

bors r=aliher1911

@craig
Copy link
Contributor

craig bot commented May 9, 2023

Build succeeded:

@craig craig bot merged commit 03e4c91 into cockroachdb:master May 9, 2023
@blathers-crl
Copy link

blathers-crl bot commented May 9, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d9b5463 to blathers/backport-release-22.2-102824: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
3 participants