changefeedccl: improve planning heuristics #113898

jayshrivastava · 2023-11-06T20:43:33Z

See https://github.com/cockroachlabs/support/issues/2679.

The problem:

In this scenario, there's a changefeed running with execution_locality = foo on a table which is configured to with a leaseholder preference in region bar. The table foo has 150k ranges. Region bar has ~20 nodes and region foo has ~10.

We observed the following problems:

One node in region foo was assigned ~60k ranges to watch with the next highest being ~3k ranges. This imbalance makes the changefeed run slower. When the leaseholders are non-local, we don't know what heuristics distsql will use to assign work to change aggregators. We should investigate and consider making our own planning logic. We should consider using changefeed.balance_range_distribution.enable always, even after initial scans.
Even if we balance the ranges, we might end up with too many ranges per node. Instead of distributing 150k ranges over 30 nodes, we have to balance them across 10 nodes due to the execution locality. As we've seen before, assigning too many ranges to one node can cause OOMs when not using mux rangefeed (we alloc a 2MB buffer per range). There could also be too much of a fan-in factor, preventing catchup scans from completing (with or without mux).

The solution:

We want to add "planning modes" where we can choose how to distribute work to nodes when we plan a changefeed. Namely, we want 3 modes:

let distsql plan everything (ie. let distsql plan the changefeed)
let distsql create the initial plan and rebalance it
create our own plan which uniformly distributes work among live nodes

Jira issue: CRDB-33248

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-11-06T20:43:34Z

cc @cockroachdb/cdc

miretskiy · 2023-11-20T20:31:17Z

@jayshrivastava FYI: with #114710 the work on this issue might become
obsolete, and we may just get rid of balanced range distribution altogether.

jayshrivastava · 2023-11-27T16:17:55Z

Linking this related issue: https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1701101549923459

This commit introduces several changes. It makes `replicaoracle.BinPackingChoice` the explicit default for changefeeds. Previously, we would use `physicalplan.DefaultReplicaChooser`, which would point to the bin packing oracle. This change is better since it gives the changefeed package more control over its dependencies. A new cluster setting `changefeed.default_range_distribution_strategy` is added to specify how changefeeds should distribute work. In the default case, `none`, we defer to distsql to select nodes and distribute work among them. This is the same behavior as running a changefeed today. In the other case, `balanced_simple`, we still let distsql choose nodes for us, but we attempt to evently distribute ranges over them. This case is exactly the same as setting `changefeed.balance_range_distribution.enable = true`, but it is allowed to be used outside of initial scans (`changefeed.balance_range_distribution.enable` was only allowed with `initial_scan=only`. This change also deprecates `changefeed.balance_range_distribution.enable` in favor of `changefeed.default_range_distribution_strategy` because the latter can be expanded in the future to add more functionality. This change also moves changefeed distribution tests to a dedicated file: `changefeed_dist_test.go`. This file contains a tester which can be used to start a cluster with a given topology and range distribution. Changefeed tests can use the tester to perform various load balancing strategies. Right now, the new tests assert how ranges are distributed after planning changefeeds, which implicitly tests code outside of the changefeedccl package (ie. `dsp.PartitionSpans()`). However, this is important since changefeeds rely on the behavior of this black box to perform well. In the future, cdc will implement more distribution strategies and likely implement its own partitioning logic. Tests for these strategies can be added in this file. Over time, cdc can move away from distsql planning. Informs: cockroachdb#113898 Epic: None Release note (enterprise change): `changefeed.balance_range_distribution.enable` is now deprecated. Users should use a new cluster setting `changefeed.default_range_distribution_strategy` instead. `changefeed.default_range_distribution_strategy='balanced_simple'` has the same effect as setting `changefeed.balance_range_distribution.enable=true`. It does not require `initial_scan='only'`, which was required by the old setting.

…ans) Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing. It also adds one more unit test. Informs: cockroachdb#113898 Epic: None

This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`. `balanced_full` distributes ranges evenly across all healthy nodes. The implementation first gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges directly because the final plan will leverage some optimizations from distsql. For example, distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned more ranges compared to other nodes, then we reassign some of those ranges during rebalancing. The node still gets to keep as many local ranges as possible. This change also adds the setting `changefeed.range_distribution_threshold` which controls when distribution strategies will take effect. It defines the minimum number of ranges which a changefeed must watch to trigger distribution strategies. This setting avoids rebalancing a small changefeed across many nodes. The default value is 1024. This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance a partition to within 5% of the ideal number of ranges per partition). This optimization is not necessary at this moment because rebalancing is relatively cheap and only happens at the start of a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges cannot be divided evenly). Informs: cockroachdb#113898 Epic: None Release note (enterprise change): This change adds a new value `balanced_full` for the setting `changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt to balance work evenly over at many nodes as possible (while obeying locality restrictions). This is different than `balanced_simple` which does not always leverage all possible nodes. This change also adds the setting `changefeed.range_distribution_threshold` which controls when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`) will take effect. It defines the minimum number of ranges which a changefeed must watch to trigger distribution strategies.

…ans) Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing. It also adds one more unit test. Informs: cockroachdb#113898 Epic: None

This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`. `balanced_full` distributes ranges evenly across all healthy nodes. The implementation first gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges directly because the final plan will leverage some optimizations from distsql. For example, distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned more ranges compared to other nodes, then we reassign some of those ranges during rebalancing. The node still gets to keep as many local ranges as possible. This change also adds the setting `changefeed.range_distribution_threshold` which controls when distribution strategies will take effect. It defines the minimum number of ranges which a changefeed must watch to trigger distribution strategies. This setting avoids rebalancing a small changefeed across many nodes. The default value is 1024. This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance a partition to within 5% of the ideal number of ranges per partition). This optimization is not necessary at this moment because rebalancing is relatively cheap and only happens at the start of a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges cannot be divided evenly). Informs: cockroachdb#113898 Epic: None Release note (enterprise change): This change adds a new value `balanced_full` for the setting `changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt to balance work evenly over at many nodes as possible (while obeying locality restrictions). This is different than `balanced_simple` which does not always leverage all possible nodes. This change also adds the setting `changefeed.range_distribution_threshold` which controls when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`) will take effect. It defines the minimum number of ranges which a changefeed must watch to trigger distribution strategies.

…ans) Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing. It also adds one more unit test. Informs: cockroachdb#113898 Epic: None

rharding6373 · 2024-02-28T17:22:47Z

Reducing priority level to P-3. We already have a rebalancing strategy (balanced simple distribution) that addresses the overload concerns in the support issue, so we do not have an urgent need for additional strategies.

…ans) Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing. It also adds one more unit test. Informs: cockroachdb#113898 Epic: None

…ans) Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing. It also adds one more unit test. Informs: #113898 Epic: None

115375: changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) r=jayshrivastava a=jayshrivastava ### sql: count ranges per partition in PartitionSpans This change updates span partitioning to count ranges while making partitions. This allows callers to rebalance partitions based on range counts without having to iterate over the spans to count ranges. Release note: None Epic: None ### changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) #115375 Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing. It also adds one more unit test. Informs: #113898 Epic: None ### changefeedccl: add rebalancing checks This change adds extra test coverage for partition rebalancing in changefeeds. It adds checks which are performed after rebalancing to assert that the output list of spans covers exactly the same keys as the input list of spans. These checks are expensive so they only run if the environment variable `COCKROACH_CHANGEFEED_TESTING_REBALANCING_CHECKS` is true. This variable is true in cdc roachtests and unit tests. Release note: None Epic: None 119885: storage: support per-store IO metrics with fine granularity r=jbowens,abarganier a=CheranMahalingam Currently, timeseries metrics are collected on a 10s interval which hides momentary spikes in IO. This commit introduces a central disk monitoring system that polls for disk stats at a 100ms interval. Additionally, the current system accumulates disk metrics across all block devices which includes noise from unrelated processes. This commit also adds support for exporting per-store IO metrics (i.e. IO stats for block devices that map to stores used by Cockroach). These changes will be followed up by a PR to remove the need for customers to specify disk names when setting the provisioned bandwidth for each store as described in #109350. Fixes: #104114, #112898. Informs: #89786. Epic: None. Release note: None. 120649: changefeedccl: avoid undefined behavior in distribution test r=wenyihu6 a=jayshrivastava The `rangeDistributionTester` would sometimes calculate log(0) when determining the node to move a range too. Most of the time, this would be some garbage value which gets ignored. Sometimes, this may return a valid node id, causing the range distribution to be wrong and failing the test failures. This change updates the tester to handle this edge case. Closes: #120470 Release note: None Co-authored-by: Jayant Shrivastava <[email protected]> Co-authored-by: Cheran Mahalingam <[email protected]>

jayshrivastava added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cdc Change Data Capture T-cdc labels Nov 6, 2023

mattcrdb added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 7, 2023

miretskiy added this to the 23.2 milestone Nov 8, 2023

miretskiy assigned jayshrivastava Nov 8, 2023

jayshrivastava mentioned this issue Nov 29, 2023

changefeedccl: add changefeed range distribution strategies #115166

Merged

jayshrivastava mentioned this issue Nov 30, 2023

changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) #115375

Merged

jayshrivastava mentioned this issue Dec 1, 2023

changefeedccl: add balanced_full distribution strategy #115457

Open

jayshrivastava added the P-2 Issues/test failures with a fix SLA of 3 months label Jan 10, 2024

rharding6373 added P-3 Issues/test failures with no fix SLA and removed P-2 Issues/test failures with a fix SLA of 3 months labels Feb 28, 2024

rharding6373 mentioned this issue Feb 29, 2024

changefeedccl: use new bulk oracle for planning #119777

Closed

rharding6373 assigned rharding6373 and unassigned jayshrivastava Mar 21, 2024

exalate-issue-sync bot unassigned rharding6373 Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

changefeedccl: improve planning heuristics #113898

changefeedccl: improve planning heuristics #113898

jayshrivastava commented Nov 6, 2023 •

edited

Loading

blathers-crl bot commented Nov 6, 2023

miretskiy commented Nov 20, 2023

jayshrivastava commented Nov 27, 2023

rharding6373 commented Feb 28, 2024

changefeedccl: improve planning heuristics #113898

changefeedccl: improve planning heuristics #113898

Comments

jayshrivastava commented Nov 6, 2023 • edited Loading

blathers-crl bot commented Nov 6, 2023

miretskiy commented Nov 20, 2023

jayshrivastava commented Nov 27, 2023

rharding6373 commented Feb 28, 2024

jayshrivastava commented Nov 6, 2023 •

edited

Loading