perf: investigate slow performance degradation #14108
And the "Disk Space Used" graph matches up: space usage on |
Compactions also need to read ~everything, whether the application's read distribution is skewed or not. So the reads here could be coming from compactions instead of the application (at the start of the process sstables can stay in the cache for their entire life until they're compacted away).
Good point about the reads possibly coming from compactions. The slowdown might be expected. An additional experiment to run would be to change the size of the data set.
@spencerkimball to add a simple RocksDB latency metric to see if that is the source of the slowdown.
Kicking this back into your court. Not sure those latency stats will help. Let's get a new build pushed to a cluster which restarts nightly so we can see how that latency metric evolves over time. |
I'll get a build running on the cluster.
This is still being investigated, but it is highly unlikely anything will be addressed in the 1.0 time frame. Punting to 1.1. |
Performance started at ~14k ops/sec and has declined to ~12.5k ops/sec. The Raft commit latencies (apologies for the different graph scale) show a minor increase over that time. More interesting than the minor increase is the relative stability of the different commit times across nodes, along with occasional blips in commit times on individual nodes.
Getting back to this: I reproduced it on 3383ce3. Another interesting bit is that we appear to be leaking Go memory. That could just be growth in the various caches, but it seems suspicious. Unfortunately, I have heap profiling disabled on the cluster. RocksDB commit latencies are essentially flat. Still poking around.
Re-running with heap profiling enabled shows a lot of memory being allocated by a single call path.
We're doing ~2.5k {inserts,proposals}/second, so this seems high. Cc @irfansharif.
This was a silly oversight, fixed in #16283.
#16283 fixed the egregious performance degradation, but the slow degradation is still present. The graphs below are from the two most recent runs.

My initial suspicion upon seeing the first run was that performance started to degrade when the initial set of 5k ranges began to split. The second run disproves that hypothesis: we start with 10k ranges, they never split, and yet performance still degrades.

My next suspicion is that performance degrades when we hit the next threshold of RocksDB read amplification. Note that we configure RocksDB for levels L0-L6 and that we omit the bloom filters for L6 (otherwise we blow out memory usage). The graph doesn't match up precisely. I can imagine various reasons why not (the read amplification metric is a snapshot in time). I'm going to run an experiment where I re-enable bloom filters for L6.
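As an aside, here is a minimal sketch of what toggling bottom-level bloom filters looks like via the RocksDB C++ API. This is illustrative only, not our actual engine configuration; `optimize_filters_for_hits` is the RocksDB option that skips building filters for the bottommost level:

```cpp
// Sketch: toggling bloom filters for the bottommost level via the RocksDB
// C++ API. Illustrative only -- not CockroachDB's actual engine code.
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions(bool bloom_on_last_level) {
  rocksdb::Options options;
  options.num_levels = 7;  // L0-L6, as described above.

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(7ull << 30);  // ~7 GB
  table_options.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10 /* bits per key */, false));

  // When true, RocksDB skips building bloom filters for the bottommost
  // level (L6 here), which holds the bulk of the data. That is the
  // "omit the bloom filters for L6" behavior mentioned above.
  options.optimize_filters_for_hits = !bloom_on_last_level;

  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```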
Enabling bloom filters for L6 didn't help. Not surprising given that I was misreading the read amplification graph and we're only up to L5. We see an increase in disk reads as the performance declines. Memory usage shows "cached" memory steadily rising until it plateaus; shortly after that we start to see disk reads. This gets back to @bdarnell's earlier observation: perhaps the working set has exceeded the memory capacity and what we're seeing is an increasing reliance on disk.

I'm still surprised by this slowdown for this workload as we're doing 95% reads and the reads are skewed. I would expect nearly all of the reads to hit the block cache. Similarly, the writes are all to virgin keys and shouldn't be resulting in reads. So the reads are probably being caused by compactions. Is it just the increased disk utilization that is causing a problem? RocksDB allows us to throttle the bandwidth used for compactions. I'm going to experiment with that.
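For reference, the throttle in question is RocksDB's rate limiter, which caps the bytes/sec that flushes and compactions may write. A minimal sketch via the C++ API; the 16 MB/sec value mirrors the Cassandra default mentioned below and isn't a recommendation:

```cpp
// Sketch: capping compaction/flush write bandwidth with RocksDB's rate
// limiter. Illustrative only.
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

void LimitCompactionBandwidth(rocksdb::Options* options) {
  // NewGenericRateLimiter takes a rate in bytes per second. The limiter
  // applies to flushes and compactions, not to foreground writes.
  options->rate_limiter.reset(
      rocksdb::NewGenericRateLimiter(16 << 20 /* 16 MB/sec */));
}
```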
Is it possible to add instrumentation to the block cache so we can see whether the percentage of hits decreases over time? Same question about compactions. Seems like these metrics would be really helpful for understanding what's going on. |
We do have metrics for the block cache hit rate and compactions. We should definitely add a graph for the hit rate. The compaction metrics might not tell us much since we're probably compacting constantly, and compactions don't necessarily translate into disk reads because those reads might be hitting the OS buffer cache.
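One way to get at these numbers directly is RocksDB's statistics object; a sketch of reading the relevant tickers (illustrative only, and not necessarily how our metrics are wired up; enabling full statistics adds some overhead):

```cpp
// Sketch: reading block cache and compaction counters from RocksDB's
// statistics object (set earlier via options.statistics =
// rocksdb::CreateDBStatistics()). Illustrative only.
#include <algorithm>
#include <cstdio>
#include <rocksdb/options.h>
#include <rocksdb/statistics.h>

void DumpCacheAndCompactionStats(const rocksdb::Options& options) {
  const auto& stats = options.statistics;
  if (stats == nullptr) {
    return;
  }
  const uint64_t hits = stats->getTickerCount(rocksdb::BLOCK_CACHE_HIT);
  const uint64_t misses = stats->getTickerCount(rocksdb::BLOCK_CACHE_MISS);
  const uint64_t compact_read = stats->getTickerCount(rocksdb::COMPACT_READ_BYTES);
  const uint64_t compact_write = stats->getTickerCount(rocksdb::COMPACT_WRITE_BYTES);
  std::printf("block cache hit rate: %.2f%%\n",
              100.0 * hits / std::max<uint64_t>(hits + misses, 1));
  std::printf("compaction read/write bytes: %llu / %llu\n",
              (unsigned long long)compact_read,
              (unsigned long long)compact_write);
}
```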
Two new theories which I'm exploring: limiting the number of compaction threads, and throttling compaction bandwidth. (The default Cassandra config limits compactions to 16 MB/sec.)
Currently, we allow num-cpu threads for compactions and unlimited bandwidth. Reducing the number of threads to either 2 or 3 prevented RocksDB from keeping up with compactions, eventually leading to too many L0 sstables, which blocked writes. Limiting write bandwidth for flushes/compactions to 16 MB/sec resulted in a similar situation. It seems like RocksDB does not provide the tunable we desire: it reserves one of the compaction threads for flushes (i.e. for writing a memtable to L0) and the other threads are all used for compactions, but really we want at least one thread reserved for performing L0 compactions, because read performance degrades with more L0 files and too many of them actually causes writes to stop (we can configure that trigger, but I've set the value to 48 and still seen writes stopped).

Somewhat curious is the amount of disk writes we're performing. The cluster is doing 2.5K inserts/sec. YCSB writes a row containing 10 100-byte fields, so figure 1KB per row. That translates into 2.5 MB/sec of raw data being written. Add the 3x replication, the Raft log writes, and the RocksDB WAL and we're up to 30 MB/sec across the cluster. But each disk is doing 45 MB/sec of writes for an aggregate of 270 MB/sec. That discrepancy seems too high to attribute to RocksDB write amplification. Is my math off somewhere?
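Going back to the threads/trigger discussion, these are the RocksDB knobs involved, sketched via the C++ API (the L0 trigger values other than the 48 stop trigger are illustrative, not what we run):

```cpp
// Sketch of the knobs discussed above. The stop trigger of 48 is the value
// mentioned in this comment; the other numbers are illustrative.
#include <rocksdb/options.h>

void ConfigureBackgroundWork(rocksdb::Options* options, int num_cpus) {
  // Background work: one pool for flushes, one for compactions. Dropping
  // compactions to 2-3 threads let L0 fall behind in the experiments above.
  options->max_background_compactions = num_cpus;
  options->max_background_flushes = 1;

  // L0 file-count thresholds: compaction trigger, write slowdown, and the
  // hard stop. Even with the stop trigger at 48 we saw writes stall.
  options->level0_file_num_compaction_trigger = 4;
  options->level0_slowdown_writes_trigger = 20;
  options->level0_stop_writes_trigger = 48;
}
```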
The Raft log WriteBatch for a YCSB insert is 1250 bytes and the applied command WriteBatch is 1500 bytes. So the math above should really translate into 40 MB/sec of write bandwidth. We're seeing 270 MB/sec which is over 6x write amplification. I suppose that's possible. Seems high, though. |
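Showing my work on that estimate (a back-of-envelope check; the factor of two assumes each WriteBatch is persisted once in the WAL and again when the memtable is flushed):

```cpp
// Back-of-envelope check of the expected write bandwidth. Assumes each
// WriteBatch is written twice: once to the WAL and once when the memtable
// is flushed to an sstable.
#include <cstdio>

int main() {
  const double inserts_per_sec = 2500;        // cluster-wide YCSB inserts
  const double raft_log_batch_bytes = 1250;   // Raft log WriteBatch
  const double applied_batch_bytes = 1500;    // applied command WriteBatch
  const double replication_factor = 3;
  const double wal_plus_flush = 2;            // WAL write + memtable flush

  const double expected_mb_per_sec =
      inserts_per_sec * (raft_log_batch_bytes + applied_batch_bytes) *
      replication_factor * wal_plus_flush / (1024 * 1024);
  const double observed_mb_per_sec = 6 * 45;  // 6 disks at ~45 MB/sec each

  // Prints roughly: expected ~39 MB/sec vs observed ~270 MB/sec (~6.9x).
  std::printf("expected: ~%.0f MB/sec, observed: ~%.0f MB/sec (%.1fx)\n",
              expected_mb_per_sec, observed_mb_per_sec,
              observed_mb_per_sec / expected_mb_per_sec);
  return 0;
}
```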
I have noticed RocksDB doing a lot of compactions of the higher-level SSTables, leading to more compaction I/O than one would expect. Maybe we're tuning some of the compaction parameters poorly? We should try comparing to the two "preset" modes.
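Presumably the presets meant here are the canned tuning helpers RocksDB ships on Options (OptimizeLevelStyleCompaction and OptimizeUniversalStyleCompaction); a sketch of applying them, assuming that's what's intended:

```cpp
// Sketch: RocksDB's canned tuning helpers, presumably the "preset" modes
// referred to above. Illustrative only.
#include <rocksdb/options.h>

rocksdb::Options PresetOptions(bool universal) {
  rocksdb::Options options;
  options.IncreaseParallelism(8);  // sizes the background thread pools
  if (universal) {
    // Tunes for universal (size-tiered) compaction: lower write
    // amplification at the cost of space/read amplification.
    options.OptimizeUniversalStyleCompaction();
  } else {
    // Tunes for level-style compaction (512 MB memtable budget by default).
    options.OptimizeLevelStyleCompaction();
  }
  return options;
}
```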
Very possible. I'm diving again into the RocksDB parameter space, but first I'm going to verify the degradation is on writes. I'm adding a way to stop the writes partway through a run.
Here's a 1h run where writes stop after 30m. I'll do a longer run overnight where we perform writes for 12h before shutting them off. That should indicate whether the slowdown is purely a slowdown in write throughput. In the meantime I'm going to be experimenting with RocksDB settings, as it appears our write amplification is too high.
Here are graphs from a longer run where write ops were configured to stop after 12h. It seems pretty clear from this that the performance degradation we're seeing is isolated to writes.

This run included some minor tweaks to our RocksDB configuration. Specifically, I disabled subcompactions, bumped the write buffer size from 8 MB to 16 MB, and bumped the size of L1 from 16 MB to 64 MB. This seems to have made a modest improvement to the performance degradation. I also enabled periodic output of RocksDB compaction stats.
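For reference, those tweaks map onto RocksDB options roughly as follows; the values are the ones quoted above and this is a sketch, not our actual engine configuration code:

```cpp
// Sketch of the configuration tweaks described above. Values are the ones
// quoted in this comment; not CockroachDB's actual engine code.
#include <rocksdb/options.h>

void ApplyTweaks(rocksdb::Options* options) {
  options->max_subcompactions = 1;               // disable subcompactions
  options->write_buffer_size = 16 << 20;         // memtable: 8 MB -> 16 MB
  options->max_bytes_for_level_base = 64 << 20;  // L1 target: 16 MB -> 64 MB
  options->max_bytes_for_level_multiplier = 10;  // each level 10x the last
  options->stats_dump_period_sec = 600;          // periodic compaction stats in the LOG
}
```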
See the Compaction Stats section of the RocksDB tuning guide for an explanation of the fields. I'm confused by the W-Amp column: the table indicates we have a write amplification of 15.8, but the cumulative compaction stats indicate compactions wrote 1750 GB of data even though we only have 53 GB remaining (33x amplification). Regardless, I think the write amplification is what is hurting our write performance. Per the SPARROW theorem, the only way to decrease write amplification is to increase read amplification. Unfortunately, my attempts so far at doing so have always resulted in worse performance. The next experiment is to use a 32 MB write buffer, as that will bring L0 and L1 slightly closer in size, which is a recommendation the RocksDB folks make somewhere.
The experimentation continued this weekend, but so far nothing has made an improvement. Increasing the write buffer size to 32 MB seemed to show a marginal improvement in performance, but nothing significant. I also tried a handful of other settings, none of which helped.

As expected, performance recovered once we were only performing reads. But notice that the absolute performance here is less than in the run above where writes were disabled after 12h: 55k ops/sec vs 48k ops/sec.

The disk metrics during the read-only portion are surprising. We're performing only read operations, and the vast majority of those operations are to a small set of keys, yet we're reading a significant amount from disk. This indicates that my earlier assumption that the reads were being caused by compactions might have been incorrect. Note that the disk metric graphs are for a single machine, so we're reading ~300 MB/sec across the 6 nodes in the cluster. I instrumented the code to track down where these reads are coming from.
My simulated block cache numbers are in. Based on this, I'd expect a performance boost from decreasing the block size, though doing so requires more memory. I'm going to experiment with that in order to verify that I'm starting to understand what is going on.
Looking at this from another direction: reducing the block size to 4KB will allow caching ~10m blocks, which will capture 99.9% of reads. We'll know in 24h if this theory is correct.
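The ~10m figure falls out of the cache sizing; a quick check (ignoring per-block cache overhead, and taking the 99.9% number from the YCSB key distribution rather than recomputing it):

```cpp
// Quick check of how many blocks fit in the aggregate block cache at
// different block sizes. The 99.9%-of-reads figure comes from the YCSB
// key distribution and is not recomputed here.
#include <cstdio>
#include <initializer_list>

int main() {
  const double nodes = 6;
  const double cache_bytes_per_node = 7.0 * (1ull << 30);   // 7 GB each
  const double total_cache = nodes * cache_bytes_per_node;  // ~42 GB

  for (const double block_size : {32.0 * 1024, 4.0 * 1024}) {
    // 32KB blocks -> ~1.4M cacheable blocks; 4KB blocks -> ~11M.
    std::printf("%2.0fKB blocks -> ~%.1fM cacheable blocks\n",
                block_size / 1024, total_cache / block_size / 1e6);
  }
  return 0;
}
```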
Using a 4KB block size improved neither performance nor the block cache hit rate. The lack of performance improvement makes sense in light of the block cache hit rate. I'm scratching my head right now. Disk read bandwidth was reduced to ~1/4 of what it was when using a 32KB block size, while disk read ops look similar to using a 32KB block size.
I'm suspicious that a significant part of the performance degradation is due to the increasing data size and the workload. With RocksDB, write amplification is a function of various tuning knobs and of how much data is present: as more data gets added, more levels of RocksDB get used, where each additional level is 10x the size of the previous level. How does the workload play into this? Both workloads write to effectively random (virgin) keys, so newly written data keeps getting pushed down through every level.
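A back-of-envelope sketch of the level growth, assuming the 64 MB L1 target and 10x multiplier from the earlier tweaks; each additional level means each byte is rewritten at least once more on its way down, so write amplification grows with total data size even when nothing else changes:

```cpp
// Back-of-envelope: how many levels ~53 GB of data needs with a 64 MB L1
// target and a 10x per-level multiplier (the settings from the earlier
// tweaks). Assumptions only; not measured from the cluster.
#include <cstdio>

int main() {
  const double data_bytes = 53.0 * (1ull << 30);  // ~53 GB on disk
  double level_target = 64.0 * (1 << 20);         // L1 target size
  double cumulative = 0;
  int level = 1;
  while (cumulative < data_bytes && level <= 6) {
    cumulative += level_target;
    std::printf("L%d target %8.1f MB (cumulative %8.1f MB)\n",
                level, level_target / (1 << 20), cumulative / (1 << 20));
    level_target *= 10;  // max_bytes_for_level_multiplier
    ++level;
  }
  return 0;
}
```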
I still plan to verify my suspicion above that the performance degradation is due to the increasing data size and random writes, but this isn't a blocker for 1.1. |
Closing this as old. One significant factor in the degradation was the growth of the timestamp cache which has been alleviated by its reimplementation. |
Running `ycsb --concurrency 600 --splits 5000` against `denim` (a 6-node cluster) shows a slow decline in throughput. Similarly, latencies slowly climb.

Pre-splitting the `ycsb` table into 5000 ranges means that the number of ranges is constant over the lifetime of the test. The most interesting metrics that show an increase and could account for this performance decline are the disk metrics.

Each node is configured with the default cache size of 1/4 of physical memory, which in this case is 7GB. Each `ycsb` write is ~1KB in size. Writing at 2K/sec should generate ~7GB/hour, and the graphs show we had generated ~31GB when disk reads started. Are reads starting to miss in the cache? That's somewhat surprising given the skewed distribution for reads. Perhaps the system has just reached a point where it is doing a significant number of background compactions continuously and those compactions are impacting foreground work.