Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: introduce cpu load based splits #96128

Merged
merged 2 commits into from
Feb 10, 2023

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Jan 27, 2023

This PR adds the ability to perform load based splitting with replica
cpu usage rather than queries per second. Load based splitting now will
use either cpu or qps for deciding split points, depending on the
cluster setting kv.allocator.load_based_rebalancing.objective.

When set to qps, qps is used in deciding split points and when
splitting should occur; similarly, cpu means that request cpu against
the leaseholder replica is to decide split points.

The split threshold when using cpu is the cluster setting
kv.range_split.load_cpu_threshold which defaults to 250ms of cpu
time per second, i.e. a replica using 1/4 processor of a machine
consistently.

The merge queue uses the load based splitter to make decisions on
whether to merge two adjacent ranges due to low load. This commit also
updates the merge queue to be consistent with the load based splitter
signal. When switching between qps and cpu, the load based splitter
for every replica is reset to avoid spurious results.

image

resolves: #95377
resolves: #83519
depends on: #96127

Release note (ops change): Load based splitter now supports using request
cpu usage to split ranges. This is introduced with the previous cluster
setting kv.allocator.load_based_rebalancing.objective, which when set
to cpu, will use request cpu usage. The threshold above which
CPU usage of a range is considered for splitting is defined in the
cluster setting kv.range_split.load_cpu_threshold, which has a default
value of 250ms.

@kvoli kvoli changed the title kvserver: pass load dimension to test load ranges kvserver: introduce cpu load based splits Jan 27, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from 05cbe42 to f720659 Compare January 27, 2023 22:24
@kvoli kvoli marked this pull request as ready for review January 27, 2023 22:28
@kvoli kvoli requested a review from a team as a code owner January 27, 2023 22:28
@kvoli kvoli self-assigned this Jan 27, 2023
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r4, 21 of 21 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


-- commits line 84 at r5:
How was the 250ms number evaluated? Should we run an experiment like Andrew did in #39687 to tune it to find the right balance between batching and concurrency?


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r4 (raw file):

		reply.MaxQueriesPerSecond = -1
	}
	reply.MaxQueriesPerSecondSet = true

Do we need to keep setting this to true for another release?


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r5 (raw file):

	reply.MaxCPUPerSecond, reply.MaxQueriesPerSecond = -1, -1
	if qpsOK && cpuOK {
		return result.Result{}, errors.AssertionFailedf("unexpected both QPS and CPU range statistics set")

Is this possible if the cluster switched from QPS-based splitting to CPU-based splitting between the calls to GetMaxSplitQPS and GetMaxSplitCPU above?


pkg/kv/kvserver/batcheval/eval_context.go line 92 at r5 (raw file):

	GetMaxSplitQPS(context.Context) (float64, bool)

	// GetMaxSplitCPU returns the Replicas maximum request cpu/s rate over a

"Replica's"

Same thing above as well, I guess.


pkg/kv/kvserver/split/decider.go line 59 at r5 (raw file):

	// NewLoadBasedSplitter returns a new LoadBasedSplitter that may be used to
	// find the midpoint based on recorded load.
	NewLoadBasedSplitter(time.Time) LoadBasedSplitter

nit: if this can construct LoadBasedSplitters it feels more like a "factory" than a "config".


pkg/roachpb/api.proto line 2119 at r5 (raw file):

  // replica serving the RangeStats request has not been the leaseholder long
  // enough to have recorded CPU rates for at least a full measurement period.
  // In such cases, the reciipient should not consider the CPU value reliable

s/reciipient/recipient/

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from f720659 to 337da71 Compare February 8, 2023 18:49
@kvoli kvoli requested a review from a team as a code owner February 8, 2023 18:49
@kvoli kvoli requested a review from a team February 8, 2023 18:49
@kvoli kvoli requested review from a team as code owners February 8, 2023 18:49
@kvoli kvoli requested a review from a team February 8, 2023 18:49
@kvoli kvoli requested review from a team as code owners February 8, 2023 18:49
@kvoli kvoli requested review from a team February 8, 2023 18:49
@kvoli kvoli requested a review from a team as a code owner February 8, 2023 18:49
@kvoli kvoli requested a review from a team February 8, 2023 18:49
@kvoli kvoli requested a review from a team as a code owner February 8, 2023 18:49
@kvoli kvoli requested review from herkolategan, renatolabs and RaduBerinde and removed request for a team February 8, 2023 18:49
@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from 06abb86 to d1157da Compare February 8, 2023 20:25
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


-- commits line 84 at r5:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How was the 250ms number evaluated? Should we run an experiment like Andrew did in #39687 to tune it to find the right balance between batching and concurrency?

Mostly eyeballed this one whilst running allocbench. I tried a few different values and looked at the this metric: https://github.com/kvoli/cockroach/blob/f7206595e95e40b107755180a9438dd9b8bf0fd9/pkg/kv/kvserver/store_rebalancer.go#L48-L48 which indicates that there weren't any available actions but the store was overfull. I verified the number was "reasonable" by using it for tpcc and tpce - where the

Note the 250ms number is only for request cpu (not raft). Here's a comparison of splits running TPCE between master/this branch with 250ms:

image.png

The same for allocbench (5 runs of each type, order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew.
image copy 1.png

I added a todo for now with the intention to address this in the next 2 weeks.


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to keep setting this to true for another release?

Probably, If a <23.1 node sees this flag not being set (default=false?), I'm guessing that range merging will occur until the version is finalized.

Updated this to be set. Updated the TODO to stop setting the field in 23.2.


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this possible if the cluster switched from QPS-based splitting to CPU-based splitting between the calls to GetMaxSplitQPS and GetMaxSplitCPU above?

It is possible. However in that case, there is a "warmup" period for the decider stats so they would both be !OK.

When the objective changes, every decider is reset. It will then be 5 minutes before the max value returned from a decider is ok. So I think this should be okay.

https://github.com/kvoli/cockroach/blob/f7206595e95e40b107755180a9438dd9b8bf0fd9/pkg/kv/kvserver/kvserverbase/base.go#L207-L209


pkg/kv/kvserver/batcheval/eval_context.go line 92 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"Replica's"

Same thing above as well, I guess.

Updated


pkg/kv/kvserver/split/decider.go line 59 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: if this can construct LoadBasedSplitters it feels more like a "factory" than a "config".

That's fair - it seemed convenient to bundle these methods into a config. I'm not sure if factory suits either here due to the config methods also being present.


pkg/roachpb/api.proto line 2119 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/reciipient/recipient/

Updated

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from d1157da to d8c7434 Compare February 9, 2023 00:11
@kvoli kvoli requested a review from nvanbenschoten February 9, 2023 00:16
@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from d8c7434 to 3c26d1a Compare February 9, 2023 00:32
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: once the prereq PR is merged.

Reviewed 1 of 3 files at r7, 18 of 22 files at r8, 21 of 21 files at r10, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


-- commits line 84 at r5:

Previously, kvoli (Austen) wrote…

Mostly eyeballed this one whilst running allocbench. I tried a few different values and looked at the this metric: https://github.com/kvoli/cockroach/blob/f7206595e95e40b107755180a9438dd9b8bf0fd9/pkg/kv/kvserver/store_rebalancer.go#L48-L48 which indicates that there weren't any available actions but the store was overfull. I verified the number was "reasonable" by using it for tpcc and tpce - where the

Note the 250ms number is only for request cpu (not raft). Here's a comparison of splits running TPCE between master/this branch with 250ms:

image.png

The same for allocbench (5 runs of each type, order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew.
image copy 1.png

I added a todo for now with the intention to address this in the next 2 weeks.

Do you mind opening a github issue that follows from the TODO, just so that we can tag something as a release blocker?


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r5 (raw file):

Previously, kvoli (Austen) wrote…

It is possible. However in that case, there is a "warmup" period for the decider stats so they would both be !OK.

When the objective changes, every decider is reset. It will then be 5 minutes before the max value returned from a decider is ok. So I think this should be okay.

https://github.com/kvoli/cockroach/blob/f7206595e95e40b107755180a9438dd9b8bf0fd9/pkg/kv/kvserver/kvserverbase/base.go#L207-L209

I guess it still seems strange that we try to assert this. Would there be a downside to letting both be set in the rare case of an (absurdly long) race? Is it important that we promise to consumers of this API that only one value will be set? Or is it only important that we promise that at least one will be set (unless both are warming up)?


pkg/kv/kvserver/split/decider.go line 59 at r5 (raw file):

Previously, kvoli (Austen) wrote…

That's fair - it seemed convenient to bundle these methods into a config. I'm not sure if factory suits either here due to the config methods also being present.

Ack. Config seems fine then. Unless LoadSplitThing strikes your fancy.

Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


-- commits line 84 at r5:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you mind opening a github issue that follows from the TODO, just so that we can tag something as a release blocker?

Opened issue #96869 and linked in the TODO.


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I guess it still seems strange that we try to assert this. Would there be a downside to letting both be set in the rare case of an (absurdly long) race? Is it important that we promise to consumers of this API that only one value will be set? Or is it only important that we promise that at least one will be set (unless both are warming up)?

I think the promise that at most one will be set is important to the merge queue, I don't think it is that important to other consumers of the API. Perhaps it is fine to just have the assertion on the other side. There is a sort of liveness property that eventually a value must be set for merges to make progress.

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from e6ffdc0 to 1b97efb Compare February 9, 2023 15:11
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @kvoli)


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r5 (raw file):

Perhaps it is fine to just have the assertion on the other side.

Yeah, that's kind of what I was getting at. But even then, why assert? I would expect the logic to be something like the following, but I'm probably missing some complexity:

switch rebalance_objective:
case qps:
    if !lhsQPS.ok || !rhcQPS.ok {
        return // don't merge
    }
    ...
case cpu:
    if !lhsCPU.ok || !rhcCPU.ok {
        return // don't merge
    }
    ...
}

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from 1b97efb to e84aa99 Compare February 9, 2023 19:41
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/batcheval/cmd_range_stats.go line 54 at r5 (raw file):

Yeah, that's kind of what I was getting at. But even then, why assert? I would expect the logic to be something like the following, but I'm probably missing some complexity:

I think the scenario I'm trying to avoid is the upgrade case, where some node doesn't populate CPU because it is running an earlier version but it then passes the >= 0 check by being the default value (0). This problem seems solved by first checking the split objective. It was also a bit cumbersome to have two stats for each side moving around in the process loop - for formatting and resetting after merge.

I've updated to remove the assertions and just check the objective manager to ensure things work.

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch 2 times, most recently from e37cc28 to 7a9ddd5 Compare February 9, 2023 21:39
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 21 of 21 files at r21, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)

@kvoli kvoli force-pushed the 230120.cpu-load-splits branch 2 times, most recently from 973036e to e887bf0 Compare February 10, 2023 14:33
@kvoli
Copy link
Collaborator Author

kvoli commented Feb 10, 2023

rebased with #96921 to fix CI.

This patch removes the deprecated 'lastSplitQPS' value throughout the
split/merge code. This field was deprecated in 22.1 in favor or
`maxSplitQPS` and stopped being consulted in 22.2.

A remaining field `max_queries_per_second_set` will remain until 23.2
where it can be removed as no node should consult it.

Now only `maxSplitQPS` is consulted and set in `RangeStatsResponse`.

Release note: None
This commit adds the ability to perform load based splitting with replica
cpu usage rather than queries per second. Load based splitting now will
use either cpu or qps for deciding split points, depending on the
cluster setting `kv.allocator.load_based_rebalancing.objective`.

When set to `qps`, qps is used in deciding split points and when
splitting should occur; similarly, `cpu` means that request cpu against
the leaseholder replica is to decide split points.

The split threshold when using `cpu` is the cluster setting
`kv.range_split.load_cpu_threshold` which defaults to `250ms` of cpu
time per second, i.e. a replica using 1/4 processor of a machine
consistently.

The merge queue uses the load based splitter to make decisions on
whether to merge two adjacent ranges due to low load. This commit also
updates the merge queue to be consistent with the load based splitter
signal. When switching between `qps` and `cpu`, the load based splitter
for every replica is reset to avoid spurious results.

resolves: cockroachdb#95377

Release note (ops change): Load based splitter now supports using request
cpu usage to split ranges. This is introduced with the previous cluster
setting `kv.allocator.load_based_rebalancing.objective`, which when set
to `cpu`, will use request cpu usage. The threshold above which
CPU usage of a range is considered for splitting is defined in the
cluster setting `kv.range_split.load_cpu_threshold`, which has a default
value of `250ms`.
@kvoli kvoli force-pushed the 230120.cpu-load-splits branch from e887bf0 to 408e227 Compare February 10, 2023 17:24
@kvoli
Copy link
Collaborator Author

kvoli commented Feb 10, 2023

rebased with #96946 to fix extended CI

@kvoli
Copy link
Collaborator Author

kvoli commented Feb 10, 2023

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Feb 10, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: cpu based load splitting kvserver: load based splitting should be on a more granular metric
3 participants