
changefeedccl: Uniformly distribute work during export. #88672

Merged
merged 1 commit into cockroachdb:master from the balance branch on Sep 28, 2022

Conversation

miretskiy
Contributor

By default, changefeed distributes work to nodes based on which node is the
leaseholder for each range. This makes sense since running a rangefeed against the
local node is more efficient.

In a cluster where ranges are assigned almost uniformly to each node,
running a changefeed export is efficient: all nodes stay busy until the export is done.

The KV server is responsible for keeping ranges more or less
uniformly distributed across the cluster; however, this determination is based
on the set of all ranges in the cluster, not on any particular table.

As a result, it is possible for a table's ranges not to be uniformly distributed
across all the nodes. When this happens, a changefeed export
can take a long time due to the long tail: as each node completes
its set of assigned ranges, it idles until the changefeed completes.

This PR introduces a change (controlled via the
`changefeed.balance_range_distribution.enable` setting) where the changefeed
tries to produce a more balanced assignment, with each node responsible
for roughly 1/Nth of the work in a cluster of N nodes.

Release note (enterprise change): Changefeed exports are up to 25% faster due to uniform work assignment.
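
To make the balancing idea concrete, here is a minimal, self-contained Go sketch of the approach described above. The `partition` type, the `rebalance` helper, and the numbers are illustrative assumptions only; they do not mirror the actual DistSQL span-partition code in `changefeed_dist.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// partition is a hypothetical stand-in for a DistSQL span partition:
// the set of ranges assigned to one node.
type partition struct {
	node   int
	ranges []int // hypothetical range IDs assigned to this node
}

// rebalance moves ranges from overloaded partitions to underloaded ones
// until every node holds roughly total/N ranges. This sketches the idea
// described above; it is not the actual changefeed implementation.
func rebalance(parts []partition) {
	total := 0
	for _, p := range parts {
		total += len(p.ranges)
	}
	target := (total + len(parts) - 1) / len(parts) // ceil(total / N)

	// Sort descending by range count: heaviest donors first, lightest last.
	sort.Slice(parts, func(i, j int) bool {
		return len(parts[i].ranges) > len(parts[j].ranges)
	})

	for i, j := 0, len(parts)-1; i < j; {
		switch {
		case len(parts[i].ranges) <= target:
			i++ // donor is already at or below the target
		case len(parts[j].ranges) >= target:
			j-- // recipient is already at or above the target
		default:
			// Move one range from the heaviest donor to the lightest recipient.
			last := len(parts[i].ranges) - 1
			parts[j].ranges = append(parts[j].ranges, parts[i].ranges[last])
			parts[i].ranges = parts[i].ranges[:last]
		}
	}
}

func main() {
	// A skewed assignment: node 1 holds most of the table's ranges.
	parts := []partition{
		{node: 1, ranges: []int{1, 2, 3, 4, 5, 6, 7, 8}},
		{node: 2, ranges: []int{9}},
		{node: 3, ranges: []int{10, 11, 12}},
	}
	rebalance(parts)
	for _, p := range parts {
		fmt.Printf("node %d: %d ranges\n", p.node, len(p.ranges)) // each node ends up with 4
	}
}
```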

@miretskiy miretskiy requested a review from a team as a code owner September 25, 2022 01:21
@miretskiy miretskiy requested review from ajwerner and removed request for a team September 25, 2022 01:21
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@HonoreDB HonoreDB left a comment


This is great analysis! I feel like this logic should ideally live in the DistSQL planner itself (in distsql_physical_planner or replicaoracle). It already contains some primitives that could be useful. In particular, telling it to use a RandomOracle instead of a BinPackingOracle might be a simpler and more scalable solution than readjusting the spans afterwards; in large clusters a RandomOracle will get you a uniform distribution, and the default BinPackingOracle does multiple things that can create a non-uniform distribution and aren't appropriate for changefeeds. We could also look into making maxPreferredRangesPerLeaseHolder configurable/smarter (currently it's just a hardcoded 10 scientifically chosen at random).

Reviewed 1 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
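
As a rough illustration of the distinction drawn above between a random oracle and a leaseholder-preferring (bin-packing style) one, here is a hedged, self-contained Go sketch. The `oracle` interface, node IDs, and the per-node cap below are hypothetical simplifications and are not the real `replicaoracle` API or its actual behavior.

```go
package main

import (
	"fmt"
	"math/rand"
)

// oracle is a toy version of a replica-choice policy: given a span's
// leaseholder, pick the node that should process it. It is NOT the real
// replicaoracle interface; names and behavior here are assumptions.
type oracle interface {
	choose(leaseholder int) int
}

// randomOracle ignores locality and picks a node uniformly at random,
// which converges to a uniform distribution on large clusters.
type randomOracle struct {
	nodes []int
	rng   *rand.Rand
}

func (o randomOracle) choose(int) int {
	return o.nodes[o.rng.Intn(len(o.nodes))]
}

// leaseholderOracle prefers the leaseholder (local rangefeeds are cheaper)
// but spills to a random node once the leaseholder already has maxPerNode
// spans, loosely mimicking a bin-packing style preference with a cap.
type leaseholderOracle struct {
	nodes      []int
	rng        *rand.Rand
	maxPerNode int
	assigned   map[int]int
}

func (o leaseholderOracle) choose(leaseholder int) int {
	if o.assigned[leaseholder] < o.maxPerNode {
		o.assigned[leaseholder]++
		return leaseholder
	}
	n := o.nodes[o.rng.Intn(len(o.nodes))]
	o.assigned[n]++
	return n
}

func main() {
	nodes := []int{1, 2, 3}
	rng := rand.New(rand.NewSource(42))
	random := randomOracle{nodes: nodes, rng: rng}
	local := leaseholderOracle{nodes: nodes, rng: rng, maxPerNode: 10, assigned: map[int]int{}}

	// 30 spans whose leaseholder is node 1: the random oracle spreads them
	// out, while the leaseholder-preferring one piles onto node 1 until the
	// cap kicks in.
	counts := map[string]map[int]int{"random": {}, "leaseholder": {}}
	for i := 0; i < 30; i++ {
		counts["random"][random.choose(1)]++
		counts["leaseholder"][local.choose(1)]++
	}
	fmt.Println(counts)
}
```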

@miretskiy
Contributor Author

RandomOracle

I'm not sure I want to use a random oracle -- there is probably some value in keeping as much locality as possible.
I was thinking of making changes in distsql land, but given how widely those methods are used, I would not feel comfortable backporting such a change.

@ajwerner
Contributor

Would we be better off just re-planning once some of the change aggregators finish, and having that replanning use a different oracle?

@miretskiy
Contributor Author

Would we be better off just re-planning once some of the change aggregators finish, and having that replanning use a different oracle?

It's a possibility, but probably a bit more complex: aggregators do not communicate with each other, so they won't know if one exited; and then there is a need to send an up-to-date frontier, preferably without truncation (due to proto buffer size).

@miretskiy miretskiy force-pushed the balance branch 2 times, most recently from d3bbd36 to 0f59e8b Compare September 26, 2022 18:34
@ajwerner
Contributor

It's a possibility, but probably a bit more complex: aggregators do not communicate with each other, so they won't know if one exited; and then there is a need to send an up-to-date frontier, preferably without truncation (due to proto buffer size).

Aggregators communicate with the frontier, which can know when an aggregator has finished. The root of the query made the plan and has an aggregator->spans mapping, which could be provided to the frontier. It just seems to me like better planning up front is never going to get rid of this tail if there's rebalancing, but one graceful shutdown (if we could figure out how to make it happen) could do a great job rebalancing and make it such that the last part of the export happens quite fast.

@miretskiy
Contributor Author

Aggregators communicate with the frontier, which can know when an aggregator has finished. The root of the query made the plan and has an aggregator->spans mapping, which could be provided to the frontier. It just seems to me like better planning up front is never going to get rid of this tail if there's rebalancing, but one graceful shutdown (if we could figure out how to make it happen) could do a great job rebalancing and make it such that the last part of the export happens quite fast.

I agree that better planning will not help with rebalancing.
We do have the ability to restart the flow if the distribution changes (dramatically); alas, I'm not sure
it would be very helpful in the case of backfills, due to the possibility of a replan causing too much
progress loss.

The question is: is this PR good enough for now? It seems to be reasonably lightweight. Or
do we want to block what appears to be a decent improvement and try to come up with something
else?

@HonoreDB
Contributor

The question is: is this PR good enough for now? It seems to be reasonably lightweight. Or
do we want to block what appears to be a decent improvement and try to come up with something
else?

Fair enough, I'll do a line review now.

Contributor

@HonoreDB HonoreDB left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)


pkg/ccl/changefeedccl/changefeed_dist.go line 523 at r2 (raw file):

	targetRanges := int((1 + sensitivity) * float64(numRanges) / float64(len(p)))

	for i, j := 0, len(p)-1; i < j && len(p[i].Spans) > targetRanges && len(p[j].Spans) < targetRanges; {

Why not have j start at len(p)-1 so that we're moving ranges from the largest entry to the smallest?

@miretskiy
Contributor Author

pkg/ccl/changefeedccl/changefeed_dist.go line 523 at r2 (raw file):

	targetRanges := int((1 + sensitivity) * float64(numRanges) / float64(len(p)))

	for i, j := 0, len(p)-1; i < j && len(p[i].Spans) > targetRanges && len(p[j].Spans) < targetRanges; {

Why not have j start at len(p)-1 so that we're moving ranges from the largest entry to the smallest?

Isn't that what I'm doing with `for i, j := 0, len(p)-1 ...`?
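
For a concrete sense of the quoted targetRanges computation, here is a tiny sketch with made-up numbers (the sensitivity and counts are illustrative assumptions, not values from the PR): the target caps how many spans a partition may keep before the loop starts moving spans from partitions above it (index i) to partitions below it (index j).

```go
package main

import "fmt"

func main() {
	// Illustrative values only: 100 ranges, 3 partitions, and a hypothetical
	// sensitivity of 0.01 that allows ~1% slack above a perfectly even split.
	const sensitivity = 0.01
	numRanges, numPartitions := 100, 3

	targetRanges := int((1 + sensitivity) * float64(numRanges) / float64(numPartitions))
	fmt.Println(targetRanges) // 33: partitions above this donate spans, those below receive them
}
```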

@miretskiy miretskiy force-pushed the balance branch 2 times, most recently from 43f16c9 to 3d5fae6 Compare September 27, 2022 00:46
@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 27, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 28, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 28, 2022

Build succeeded:

@craig craig bot merged commit 4982322 into cockroachdb:master Sep 28, 2022