cdc,roachtest: add test with changefeeds over a large number of ranges #95236
Labels
A-admission-control
A-cdc
Change Data Capture
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
T-cdc
Is your feature request related to a problem? Please describe.
In internal incidents, we've seen sharp spikes in runnable goroutines per P (and CPU admission control kicking in as a result) on the nodes that were acting as changefeed coordinators. Given the way we publish closed timestamp updates, are we waking up many rangefeeds all at once, and causing large spikes in runnable goroutines as a result? We observed an effect on SQL tail latency when this happened, and suspected the impact of elevated Go scheduling latency (which we now have metrics for: #87883). We also observed that pausing the changefeed helped reduce the latency impact. Baseline CPU utilization throughout was low (<25%).
This issue tracks reproducing a similar setup ourselves, perhaps by introducing 200k+ splits on a single table, disabling the merge queue, and running a changefeed over it. Having a reproduction will help drive improvements. The motivating incident is https://github.com/cockroachlabs/support/issues/1997, and discussed internally here. See also https://github.com/cockroachlabs/support/issues/2036.
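As a rough illustration of the setup described above, here is a minimal sketch that only generates the SQL statements such a reproduction might run. It is not the roachtest itself. The helper name `repro_statements`, the table name and schema, the batch size, and the `null://` sink URI are all assumptions made for illustration. The merge queue setting is written as `kv.range_merge.queue_enabled`, which may be named differently depending on the CockroachDB version.

```python
def repro_statements(table="t", num_splits=200_000, batch_size=10_000,
                     sink_uri="null://"):
    """Return SQL statements for a many-ranges changefeed reproduction.

    Hypothetical helper: the table schema, batch size, and sink URI are
    placeholders, not values taken from the issue.
    """
    stmts = [
        # Rangefeeds must be enabled for changefeeds to run.
        "SET CLUSTER SETTING kv.rangefeed.enabled = true",
        # Disable the merge queue so the manual splits are not merged away.
        # (Setting name is an assumption; check it for your version.)
        "SET CLUSTER SETTING kv.range_merge.queue_enabled = false",
        f"CREATE TABLE IF NOT EXISTS {table} (id INT PRIMARY KEY, payload STRING)",
    ]
    # Issue the splits in batches so no single statement carries all
    # 200k+ split points.
    for start in range(1, num_splits + 1, batch_size):
        end = min(start + batch_size - 1, num_splits)
        stmts.append(
            f"ALTER TABLE {table} SPLIT AT SELECT generate_series({start}, {end})"
        )
    # Start a single changefeed over the heavily split table.
    stmts.append(f"CREATE CHANGEFEED FOR TABLE {table} INTO '{sink_uri}'")
    return stmts
```

The statements could then be executed in order through any SQL client while watching runnable goroutines and Go scheduling latency on the coordinator node.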
Jira issue: CRDB-23415
Epic CRDB-23738