
kv,storage: re-consider compaction concurrency for multi-store nodes #74697

Open
irfansharif opened this issue Jan 11, 2022 · 5 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster T-storage Storage Team

Comments

@irfansharif
Contributor

irfansharif commented Jan 11, 2022

Describe the problem

We use a default of 3 cores per store to run compactions (see COCKROACH_ROCKSDB_CONCURRENCY). For multi-store setups on nodes without enough cores, that may be far too many. We may also want to update our guidance on the number of cores recommended for a given number of stores. In a recent escalation we observed that a high store count + compaction debt + low core count led to a large fraction of the nodes' total CPU being used entirely for compactions. The CPU being pegged in this manner was disruptive to foreground traffic.

Currently the compaction concurrency for a store defaults to min(3, numCPUs). This isn't multi-store-aware at all, as we could have a lot of CPUs but not enough to give every store 3 of them for concurrent compactions.

Expected behavior

At the very least, automatic configuration of compaction concurrency to min(3, numCPUs/numStores). Guidance for what an appropriate number of cores is for a given number of stores. Or a compaction concurrency that reflects the total number of cores available relative to the total number of stores (presumably after experimentation of our own).
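
A minimal sketch of that first option, assuming the per-store value is derived once at startup from the node's CPU and store counts (the function name below is illustrative, not actual CockroachDB code):

```go
package main

import (
	"fmt"
	"runtime"
)

// compactionConcurrencyPerStore returns a per-store compaction concurrency of
// min(3, numCPUs/numStores), clamped to at least 1 so a store is never starved.
func compactionConcurrencyPerStore(numCPUs, numStores int) int {
	perStore := numCPUs / numStores
	if perStore > 3 {
		perStore = 3
	}
	if perStore < 1 {
		perStore = 1
	}
	return perStore
}

func main() {
	// E.g. an 8-vCPU node with 8 stores would get 1 concurrent compaction per
	// store instead of the current default of 3 per store.
	fmt.Println(compactionConcurrencyPerStore(runtime.NumCPU(), 8))
}
```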

Jira issue: CRDB-12216

Epic CRDB-41111

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jan 11, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jan 11, 2022
@jbowens
Collaborator

jbowens commented Jan 11, 2022

Linking this to cockroachdb/pebble#1329, the broader issue of adjusting resource utilization of background Pebble tasks.

Each store has independent disk-bandwidth and IOPS constraints, but CPU is shared. I think we'll need something adaptive like what's discussed in cockroachdb/pebble#1329 to avoid saturating CPU while still sufficiently utilizing disk bandwidth.

@sumeerbhola
Collaborator

For a non-adaptive solution, we could simply have a shared limit across stores. The difficulty is how to roll this out to existing CockroachDB users that have clusters with multiple stores. Presumably they have already fiddled with the individual store setting (or are fine with the default) -- we don't want them to suddenly have reduced concurrency. We could have something that only applies to new clusters, but that seems error prone.
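
One way to picture such a non-adaptive shared limit: a node-wide pool of compaction slots that every store draws from, independent of how many stores there are. The sketch below uses a plain channel-based semaphore with hypothetical names (compactionPool, acquire, release); it is not the actual Pebble/CockroachDB implementation:

```go
package main

import "fmt"

// compactionPool caps the number of compactions running concurrently across
// all stores on a node.
type compactionPool struct {
	slots chan struct{}
}

func newCompactionPool(maxConcurrent int) *compactionPool {
	return &compactionPool{slots: make(chan struct{}, maxConcurrent)}
}

// acquire blocks until a compaction slot is available.
func (p *compactionPool) acquire() { p.slots <- struct{}{} }

// release returns a slot to the pool once the compaction finishes.
func (p *compactionPool) release() { <-p.slots }

func main() {
	// One pool shared by all stores; each store would still apply its own
	// per-store compaction concurrency setting on top of this.
	pool := newCompactionPool(4)
	pool.acquire()
	fmt.Println("slots in use:", len(pool.slots), "of", cap(pool.slots))
	pool.release()
}
```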

@jbowens jbowens moved this to 24.2 candidates in [Deprecated] Storage Jun 4, 2024
@BabuSrithar BabuSrithar added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Jul 26, 2024
@itsbilal
Member

More context on the O-testcluster label: we've hit the issue of high CPU usage with compactions on multi-store DRT clusters and had to dial down compaction concurrency manually. Ideally this would be automated, so at least every store's max compaction concurrency setting gets set to min(3, numCPUs/numStores) as opposed to the current min(3, numCPUs).

@nameisbhaskar
Contributor

Archive.zip
Uploading the CPU profiles of drt-large node 1. More details in the thread - https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1722423058416819

@itsbilal
Member

itsbilal commented Aug 1, 2024

I did a quick analysis of large1.cpuprof.2024-07-29T23_58_53.227.80.pprof in the above comment, coming off the drt-large cluster's n1. Looking at the Pebble logs from the node itself, I see that an average of 4 concurrent compactions were live on the node in the 10 clock-seconds (= 160 cpu-seconds) the profile spans.

That would mean 40 cpu-seconds would go towards compactions in the profile if all a compaction did was CPU work. Instead we see 36 profiled cpu-seconds go towards runCompaction, and of those 36, ~2s are in fread and ~2s are in fwrite, so we're left with 32 cpu-seconds of non-IO CPU work, or around 80% of the 40s. From this we can estimate that roughly 80% of a compaction is CPU time, assuming sufficiently fast disks, which seems to be the case on drt-large because we have a lot more NVMe local SSD bandwidth than we can drive with our (limited) CPUs.

80% CPU utilization in a compaction does seem fairly high, but looking at where the CPU time is being spent, it makes more sense: most of it is in decoding blocks and snappy-decompressing them, then encoding the write-side blocks and snappy-compressing them. I don't think the 80% estimate is significantly far off the true amount of CPU time spent in compactions, although on other clusters/machines where we're driving IO/disk utilization higher than we are with drt-large, the ratio of CPU time is likely lower.

This estimate could be useful in trying to determine how to divvy-up CPUs for concurrent compactions on nodes that have a lot of stores.
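
For reference, the estimate above boils down to the following back-of-the-envelope calculation (the constants are the numbers taken from this profile, used purely for illustration):

```go
package main

import "fmt"

func main() {
	const (
		wallSeconds           = 10.0 // duration the profile spans
		concurrentCompactions = 4.0  // avg live compactions per the Pebble logs
		runCompactionSeconds  = 36.0 // profiled cpu-seconds in runCompaction
		ioSeconds             = 4.0  // ~2s in fread + ~2s in fwrite
	)
	// If compactions were pure CPU work they would account for
	// wallSeconds * concurrentCompactions = 40 cpu-seconds of the profile.
	budget := wallSeconds * concurrentCompactions
	// Subtract the IO time to get the non-IO CPU work: 36 - 4 = 32 cpu-seconds.
	cpuOnly := runCompactionSeconds - ioSeconds
	fmt.Printf("estimated CPU fraction of a compaction: ~%.0f%%\n", 100*cpuOnly/budget)
}
```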

anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this issue Aug 28, 2024
This change adds a new compaction pool which enforces a global max
compaction concurrency in a multi-store configuration. Each Pebble store
(i.e. an instance of *DB) still maintains its own per-store compaction
concurrency which is controlled by `opts.MaxConcurrentCompactions`.
However, in a multi-store configuration, disk I/O is a per-store resource
while CPU is shared across stores. A significant portion of compaction
is CPU-intensive, and so this ensures that excessive compactions don't
interrupt foreground CPU tasks even if the disks are capable of handling
the additional throughput from those compactions.

The shared compaction concurrency only applies to automatic compactions.
This means that delete-only compactions are excluded because they are
expected to be cheap, as are flushes because they should never be
blocked.

Fixes: cockroachdb#3813
Informs: cockroachdb/cockroach#74697
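
To illustrate the exemption policy described in that commit message, here is a simplified sketch (the names are made up for illustration, not Pebble's API) of which background work would count against the shared pool:

```go
package main

import "fmt"

type workKind int

const (
	automaticCompaction workKind = iota
	deleteOnlyCompaction
	flush
)

// usesSharedPool reports whether a piece of background work should acquire a
// slot from the node-wide compaction pool before running.
func usesSharedPool(k workKind) bool {
	switch k {
	case deleteOnlyCompaction, flush:
		// Delete-only compactions are expected to be cheap, and flushes should
		// never be blocked, so neither is gated by the shared pool.
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(usesSharedPool(automaticCompaction), usesSharedPool(flush)) // true false
}
```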