Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: improve planning heuristics #113898

Open
jayshrivastava opened this issue Nov 6, 2023 · 4 comments
Open

changefeedccl: improve planning heuristics #113898

jayshrivastava opened this issue Nov 6, 2023 · 4 comments
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-cdc
Milestone

Comments

@jayshrivastava
Copy link
Contributor

jayshrivastava commented Nov 6, 2023

See https://github.com/cockroachlabs/support/issues/2679.

The problem:

In this scenario, there's a changefeed running with execution_locality = foo on a table which is configured to with a leaseholder preference in region bar. The table foo has 150k ranges. Region bar has ~20 nodes and region foo has ~10.

We observed the following problems:

  • One node in region foo was assigned ~60k ranges to watch with the next highest being ~3k ranges. This imbalance makes the changefeed run slower. When the leaseholders are non-local, we don't know what heuristics distsql will use to assign work to change aggregators. We should investigate and consider making our own planning logic. We should consider using changefeed.balance_range_distribution.enable always, even after initial scans.
  • Even if we balance the ranges, we might end up with too many ranges per node. Instead of distributing 150k ranges over 30 nodes, we have to balance them across 10 nodes due to the execution locality. As we've seen before, assigning too many ranges to one node can cause OOMs when not using mux rangefeed (we alloc a 2MB buffer per range). There could also be too much of a fan-in factor, preventing catchup scans from completing (with or without mux).

The solution:

We want to add "planning modes" where we can choose how to distribute work to nodes when we plan a changefeed. Namely, we want 3 modes:

  • let distsql plan everything (ie. let distsql plan the changefeed)
  • let distsql create the initial plan and rebalance it
  • create our own plan which uniformly distributes work among live nodes

Jira issue: CRDB-33248

@jayshrivastava jayshrivastava added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cdc Change Data Capture T-cdc labels Nov 6, 2023
Copy link

blathers-crl bot commented Nov 6, 2023

cc @cockroachdb/cdc

@mattcrdb mattcrdb added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 7, 2023
@miretskiy miretskiy added this to the 23.2 milestone Nov 8, 2023
@miretskiy
Copy link
Contributor

@jayshrivastava FYI: with #114710 the work on this issue might become
obsolete, and we may just get rid of balanced range distribution altogether.

@jayshrivastava
Copy link
Contributor Author

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Nov 29, 2023
This commit introduces several changes.

It makes `replicaoracle.BinPackingChoice` the explicit default for changefeeds.
Previously, we would use `physicalplan.DefaultReplicaChooser`, which would point
to the bin packing oracle. This change is better since it gives the changefeed package more control over its dependencies.

A new cluster setting `changefeed.default_range_distribution_strategy` is added to
 specify how changefeeds should distribute work. In the default case,
`none`, we defer to distsql to select nodes and distribute work among them. This
is the same behavior as running a changefeed today.

In the other case, `balanced_simple`, we still let distsql choose nodes for us, but we attempt
to evently distribute ranges over them. This case is exactly the same as setting
`changefeed.balance_range_distribution.enable = true`, but it is allowed to be used
outside of initial scans (`changefeed.balance_range_distribution.enable` was only allowed
with `initial_scan=only`.

This change also deprecates `changefeed.balance_range_distribution.enable` in favor of
`changefeed.default_range_distribution_strategy` because the latter can be expanded
in the future to add more functionality.

This change also moves changefeed distribution tests to a dedicated file:
`changefeed_dist_test.go`. This file contains a tester which can be used to
start a cluster with a given topology and range distribution. Changefeed tests
can use the tester to perform various load balancing strategies. Right now, the new
tests assert how ranges are distributed after planning changefeeds, which implicitly tests
code outside of the changefeedccl package (ie. `dsp.PartitionSpans()`). However,
this is important since changefeeds rely on the behavior of this black box to perform well.
In the future, cdc will implement more distribution strategies and likely implement its
own partitioning logic. Tests for these strategies can be added in this file. Over time,
cdc can move away from distsql planning.

Informs: cockroachdb#113898
Epic: None

Release note (enterprise change): `changefeed.balance_range_distribution.enable` is now deprecated.
Users should use a new cluster setting `changefeed.default_range_distribution_strategy`
instead. `changefeed.default_range_distribution_strategy='balanced_simple'` has the same
effect as setting `changefeed.balance_range_distribution.enable=true`. It does not require
`initial_scan='only'`, which was required by the old setting.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Dec 1, 2023
This commit introduces several changes.

It makes `replicaoracle.BinPackingChoice` the explicit default for changefeeds.
Previously, we would use `physicalplan.DefaultReplicaChooser`, which would point
to the bin packing oracle. This change is better since it gives the changefeed package more control over its dependencies.

A new cluster setting `changefeed.default_range_distribution_strategy` is added to
 specify how changefeeds should distribute work. In the default case,
`none`, we defer to distsql to select nodes and distribute work among them. This
is the same behavior as running a changefeed today.

In the other case, `balanced_simple`, we still let distsql choose nodes for us, but we attempt
to evently distribute ranges over them. This case is exactly the same as setting
`changefeed.balance_range_distribution.enable = true`, but it is allowed to be used
outside of initial scans (`changefeed.balance_range_distribution.enable` was only allowed
with `initial_scan=only`.

This change also deprecates `changefeed.balance_range_distribution.enable` in favor of
`changefeed.default_range_distribution_strategy` because the latter can be expanded
in the future to add more functionality.

This change also moves changefeed distribution tests to a dedicated file:
`changefeed_dist_test.go`. This file contains a tester which can be used to
start a cluster with a given topology and range distribution. Changefeed tests
can use the tester to perform various load balancing strategies. Right now, the new
tests assert how ranges are distributed after planning changefeeds, which implicitly tests
code outside of the changefeedccl package (ie. `dsp.PartitionSpans()`). However,
this is important since changefeeds rely on the behavior of this black box to perform well.
In the future, cdc will implement more distribution strategies and likely implement its
own partitioning logic. Tests for these strategies can be added in this file. Over time,
cdc can move away from distsql planning.

Informs: cockroachdb#113898
Epic: None

Release note (enterprise change): `changefeed.balance_range_distribution.enable` is now deprecated.
Users should use a new cluster setting `changefeed.default_range_distribution_strategy`
instead. `changefeed.default_range_distribution_strategy='balanced_simple'` has the same
effect as setting `changefeed.balance_range_distribution.enable=true`. It does not require
`initial_scan='only'`, which was required by the old setting.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Dec 1, 2023
…ans)

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: cockroachdb#113898
Epic: None
@jayshrivastava jayshrivastava added the P-2 Issues/test failures with a fix SLA of 3 months label Jan 10, 2024
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 11, 2024
…ans)

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: cockroachdb#113898
Epic: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 11, 2024
This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`.
`balanced_full` distributes ranges evenly across all healthy nodes. The implementation first
gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges
directly because the final plan will leverage some optimizations from distsql. For example,
distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned
more ranges compared to other nodes, then we reassign some of those ranges during rebalancing.
The node still gets to keep as many local ranges as possible.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies will take effect. It defines the minimum number of ranges
which a changefeed must watch to trigger distribution strategies. This setting avoids
rebalancing a small changefeed across many nodes. The default value is 1024.

This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This
setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance
a partition to within 5% of the ideal number of ranges per partition). This optimization is not
necessary at this moment because rebalancing is relatively cheap and only happens at the start of
a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where
we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per
node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results
in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges
cannot be divided evenly).

Informs: cockroachdb#113898
Epic: None
Release note (enterprise change): This change adds a new value `balanced_full` for the setting
`changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt
to balance work evenly over at many nodes as possible (while obeying locality restrictions). This
is different than `balanced_simple` which does not always leverage all possible nodes.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`)
will take effect. It defines the minimum number of ranges which a changefeed must watch to
trigger distribution strategies.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 16, 2024
This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`.
`balanced_full` distributes ranges evenly across all healthy nodes. The implementation first
gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges
directly because the final plan will leverage some optimizations from distsql. For example,
distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned
more ranges compared to other nodes, then we reassign some of those ranges during rebalancing.
The node still gets to keep as many local ranges as possible.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies will take effect. It defines the minimum number of ranges
which a changefeed must watch to trigger distribution strategies. This setting avoids
rebalancing a small changefeed across many nodes. The default value is 1024.

This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This
setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance
a partition to within 5% of the ideal number of ranges per partition). This optimization is not
necessary at this moment because rebalancing is relatively cheap and only happens at the start of
a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where
we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per
node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results
in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges
cannot be divided evenly).

Informs: cockroachdb#113898
Epic: None
Release note (enterprise change): This change adds a new value `balanced_full` for the setting
`changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt
to balance work evenly over at many nodes as possible (while obeying locality restrictions). This
is different than `balanced_simple` which does not always leverage all possible nodes.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`)
will take effect. It defines the minimum number of ranges which a changefeed must watch to
trigger distribution strategies.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 16, 2024
…ans)

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: cockroachdb#113898
Epic: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 16, 2024
This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`.
`balanced_full` distributes ranges evenly across all healthy nodes. The implementation first
gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges
directly because the final plan will leverage some optimizations from distsql. For example,
distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned
more ranges compared to other nodes, then we reassign some of those ranges during rebalancing.
The node still gets to keep as many local ranges as possible.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies will take effect. It defines the minimum number of ranges
which a changefeed must watch to trigger distribution strategies. This setting avoids
rebalancing a small changefeed across many nodes. The default value is 1024.

This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This
setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance
a partition to within 5% of the ideal number of ranges per partition). This optimization is not
necessary at this moment because rebalancing is relatively cheap and only happens at the start of
a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where
we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per
node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results
in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges
cannot be divided evenly).

Informs: cockroachdb#113898
Epic: None
Release note (enterprise change): This change adds a new value `balanced_full` for the setting
`changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt
to balance work evenly over at many nodes as possible (while obeying locality restrictions). This
is different than `balanced_simple` which does not always leverage all possible nodes.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`)
will take effect. It defines the minimum number of ranges which a changefeed must watch to
trigger distribution strategies.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 16, 2024
This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`.
`balanced_full` distributes ranges evenly across all healthy nodes. The implementation first
gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges
directly because the final plan will leverage some optimizations from distsql. For example,
distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned
more ranges compared to other nodes, then we reassign some of those ranges during rebalancing.
The node still gets to keep as many local ranges as possible.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies will take effect. It defines the minimum number of ranges
which a changefeed must watch to trigger distribution strategies. This setting avoids
rebalancing a small changefeed across many nodes. The default value is 1024.

This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This
setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance
a partition to within 5% of the ideal number of ranges per partition). This optimization is not
necessary at this moment because rebalancing is relatively cheap and only happens at the start of
a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where
we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per
node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results
in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges
cannot be divided evenly).

Informs: cockroachdb#113898
Epic: None
Release note (enterprise change): This change adds a new value `balanced_full` for the setting
`changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt
to balance work evenly over at many nodes as possible (while obeying locality restrictions). This
is different than `balanced_simple` which does not always leverage all possible nodes.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`)
will take effect. It defines the minimum number of ranges which a changefeed must watch to
trigger distribution strategies.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jan 17, 2024
This change adds support for `changefeed.default_range_distribution_strategy='balanced_full'`.
`balanced_full` distributes ranges evenly across all healthy nodes. The implementation first
gets a plan from sql and rebalances it. Rebalancing is better than balancing / assigning ranges
directly because the final plan will leverage some optimizations from distsql. For example,
distsql tends to assign local ranges to nodes. If a node has many local ranges and is assigned
more ranges compared to other nodes, then we reassign some of those ranges during rebalancing.
The node still gets to keep as many local ranges as possible.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies will take effect. It defines the minimum number of ranges
which a changefeed must watch to trigger distribution strategies. This setting avoids
rebalancing a small changefeed across many nodes. The default value is 1024.

This change also retires the `changefeed.balance_range_distribution.sensitivity` setting. This
setting existed to reduce the amount of work to perform during rebalancing (ie. only rebalance
a partition to within 5% of the ideal number of ranges per partition). This optimization is not
necessary at this moment because rebalancing is relatively cheap and only happens at the start of
a changefeed. Also, mechanism would sometimes cause bad distributions. Consider a scenario where
we 25 ranges are rebalanced over 5 nodes. With some sensitivity, 6 ranges would be allowed per
node. This can result in the imbalanced distribution: 1, 6, 6, 6, 6. Removing sensitivity results
in the ideal distribution: 5, 5, 5, 5, 5 (or close to the ideal distribution when ranges
cannot be divided evenly).

Informs: cockroachdb#113898
Epic: None
Release note (enterprise change): This change adds a new value `balanced_full` for the setting
`changefeed.default_range_distribution_strategy`. Setting this value will make changefeeds attempt
to balance work evenly over at many nodes as possible (while obeying locality restrictions). This
is different than `balanced_simple` which does not always leverage all possible nodes.

This change also adds the setting `changefeed.range_distribution_threshold` which controls
when distribution strategies (ie. changing `changefeed.default_range_distribution_strategy`)
will take effect. It defines the minimum number of ranges which a changefeed must watch to
trigger distribution strategies.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Feb 21, 2024
…ans)

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: cockroachdb#113898
Epic: None
@rharding6373 rharding6373 added P-3 Issues/test failures with no fix SLA and removed P-2 Issues/test failures with a fix SLA of 3 months labels Feb 28, 2024
@rharding6373
Copy link
Collaborator

Reducing priority level to P-3. We already have a rebalancing strategy (balanced simple distribution) that addresses the overload concerns in the support issue, so we do not have an urgent need for additional strategies.

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Mar 13, 2024
…ans)

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: cockroachdb#113898
Epic: None
cockroach-dev-inf pushed a commit that referenced this issue Mar 18, 2024
…ans)

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: #113898
Epic: None
craig bot pushed a commit that referenced this issue Mar 18, 2024
115375: changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) r=jayshrivastava a=jayshrivastava

### sql: count ranges per partition in PartitionSpans

This change updates span partitioning to count ranges while making
partitions. This allows callers to rebalance partitions based on
range counts without having to iterate over the spans to count
ranges.

Release note: None
Epic: None

### changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) #115375

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: #113898
Epic: None

### changefeedccl: add rebalancing checks

This change adds extra test coverage for partition rebalancing in
changefeeds. It adds checks which are performed after rebalancing
to assert that the output list of spans covers exactly the same keys
as the input list of spans. These checks are expensive so they only
run if the environment variable `COCKROACH_CHANGEFEED_TESTING_REBALANCING_CHECKS`
is true. This variable is true in cdc roachtests and unit tests.

Release note: None
Epic: None

119885: storage: support per-store IO metrics with fine granularity r=jbowens,abarganier a=CheranMahalingam

Currently, timeseries metrics are collected on a 10s interval which hides momentary spikes in IO. This commit introduces a central disk monitoring system that polls for disk stats at a 100ms interval. Additionally, the current system accumulates disk metrics across all block devices which includes noise from unrelated processes. This commit also adds support for exporting per-store IO metrics (i.e. IO stats for block devices that map to stores used by Cockroach).

These changes will be followed up by a PR to remove the need for customers to specify disk names when setting the provisioned bandwidth for each store as described in #109350.

Fixes: #104114, #112898.
Informs: #89786.

Epic: None.
Release note: None.

120649: changefeedccl: avoid undefined behavior in distribution test r=wenyihu6 a=jayshrivastava

The `rangeDistributionTester` would sometimes calculate log(0) when determining the node to move a range too. Most of the time, this would be some garbage value which gets ignored. Sometimes, this may return a valid node id, causing the range distribution to be wrong and failing the test failures. This change updates the tester to handle this edge case.

Closes: #120470
Release note: None

Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Cheran Mahalingam <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-cdc
Projects
None yet
Development

No branches or pull requests

4 participants