perf: investigate slow performance degradation #14108
And the "Disk Space Used" graph matches up: space usage on |
Compactions also need to read ~everything, whether the application's read distribution is skewed or not. So the reads here could be coming from compactions instead of the application (at the start of the process sstables can stay in the cache for their entire life until they're compacted away).
Good point about the reads possibly coming from compactions. The slowdown might be expected. An additional experiment to run would be to change the size of the data set.
@spencerkimball to add a simple RocksDB latency metric to see if that is the source of the slowdown.
Kicking this back into your court. Not sure those latency stats will help. Let's get a new build pushed to a cluster which restarts nightly so we can see how that latency metric evolves over time. |
I'll get a build running on the cluster.
This is still being investigated, but it is highly unlikely anything will be addressed in the 1.0 time frame. Punting to 1.1. |
Performance started at ~14k ops/sec and has declined to ~12.5k ops/sec. The Raft commit latencies (apologies for the different graph scale) show a minor increase over that time. More interesting than the minor increase is the relative stability of the different commit times across nodes, along with occasional blips in commit times on individual nodes.
Getting back to this: I reproduced it on 3383ce3. Another interesting bit is that we appear to be leaking Go memory. That could just be growth in the various caches, but it seems suspicious. Unfortunately, I have heap profiling disabled on the cluster. RocksDB commit latencies are essentially flat. Still poking around.
Re-running with heap profiling enabled shows a lot of memory being allocated by a single call path.
We're doing ~2.5k {inserts,proposals}/second, so this seems high. Cc @irfansharif.
This was a silly oversight, fixed in #16283.
#16283 fixed the egregious performance degradation, but the slow degradation is still present. The graphs below are from the two most recent runs.

My initial suspicion upon seeing the first run was that performance started to degrade when the initial set of 5k ranges began to split. The second run disproves that hypothesis: we start with 10k ranges, they never split, and yet performance still degrades.

My next suspicion is that performance degrades when we hit the next threshold of RocksDB read amplification. Note that we configure RocksDB for levels L0-L6 and that we omit the bloom filters for L6 (otherwise we blow out memory usage). The graph doesn't match up precisely. I can imagine various reasons why not (the read amplification metric is a snapshot in time). I'm going to run an experiment where I re-enable bloom filters for L6.
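As an aside, here is a minimal sketch of what toggling bottom-level bloom filters looks like via the RocksDB C++ API. This is illustrative only, not our actual engine configuration; `optimize_filters_for_hits` is the RocksDB option that skips building filters for the bottommost level:

```cpp
// Sketch: toggling bloom filters for the bottommost level via the RocksDB
// C++ API. Illustrative only -- not CockroachDB's actual engine code.
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions(bool bloom_on_last_level) {
  rocksdb::Options options;
  options.num_levels = 7;  // L0-L6, as described above.

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(7ull << 30);  // ~7 GB
  table_options.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10 /* bits per key */, false));

  // When true, RocksDB skips building bloom filters for the bottommost
  // level (L6 here), which holds the bulk of the data. That is the
  // "omit the bloom filters for L6" behavior mentioned above.
  options.optimize_filters_for_hits = !bloom_on_last_level;

  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```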
Enabling bloom filters for L6 didn't help. Not surprising given that I was misreading the read amplification graph and we're only up to L5. We see an increase in disk reads as the performance declines. Memory usage shows "cached" memory steadily rising until it plateaus; shortly after that we start to see disk reads. This gets back to @bdarnell's earlier observation: perhaps the working set has exceeded the memory capacity and what we're seeing is an increasing reliance on disk.

I'm still surprised by this slowdown for this workload as we're doing 95% reads and the reads are skewed. I would expect nearly all of the reads to hit the block cache. Similarly, the writes are all to virgin keys and shouldn't be resulting in reads. So the reads are probably being caused by compactions. Is it just the increased disk utilization that is causing a problem? RocksDB allows us to throttle the bandwidth used for compactions. I'm going to experiment with that.
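For reference, the throttle in question is RocksDB's rate limiter, which caps the bytes/sec that flushes and compactions may write. A minimal sketch via the C++ API; the 16 MB/sec value mirrors the Cassandra default mentioned below and isn't a recommendation:

```cpp
// Sketch: capping compaction/flush write bandwidth with RocksDB's rate
// limiter. Illustrative only.
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

void LimitCompactionBandwidth(rocksdb::Options* options) {
  // NewGenericRateLimiter takes a rate in bytes per second. The limiter
  // applies to flushes and compactions, not to foreground writes.
  options->rate_limiter.reset(
      rocksdb::NewGenericRateLimiter(16 << 20 /* 16 MB/sec */));
}
```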
Is it possible to add instrumentation to the block cache so we can see whether the percentage of hits decreases over time? Same question about compactions. Seems like these metrics would be really helpful for understanding what's going on. |
We do have metrics for the block cache hit rate and compactions. We should definitely add a graph for the hit rate. The compaction metrics might not tell us much since we're probably compacting constantly, and compactions don't necessarily translate into disk reads because those reads might be hitting the OS buffer cache.
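One way to get at these numbers directly is RocksDB's statistics object; a sketch of reading the relevant tickers (illustrative only, and not necessarily how our metrics are wired up; enabling full statistics adds some overhead):

```cpp
// Sketch: reading block cache and compaction counters from RocksDB's
// statistics object (set earlier via options.statistics =
// rocksdb::CreateDBStatistics()). Illustrative only.
#include <algorithm>
#include <cstdio>
#include <rocksdb/options.h>
#include <rocksdb/statistics.h>

void DumpCacheAndCompactionStats(const rocksdb::Options& options) {
  const auto& stats = options.statistics;
  if (stats == nullptr) {
    return;
  }
  const uint64_t hits = stats->getTickerCount(rocksdb::BLOCK_CACHE_HIT);
  const uint64_t misses = stats->getTickerCount(rocksdb::BLOCK_CACHE_MISS);
  const uint64_t compact_read = stats->getTickerCount(rocksdb::COMPACT_READ_BYTES);
  const uint64_t compact_write = stats->getTickerCount(rocksdb::COMPACT_WRITE_BYTES);
  std::printf("block cache hit rate: %.2f%%\n",
              100.0 * hits / std::max<uint64_t>(hits + misses, 1));
  std::printf("compaction read/write bytes: %llu / %llu\n",
              (unsigned long long)compact_read,
              (unsigned long long)compact_write);
}
```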
Two new theories which I'm exploring: limiting the number of compaction threads, and throttling compaction bandwidth. (The default Cassandra config limits compactions to 16 MB/sec.)
Currently, we allow num-cpu threads for compactions and unlimited bandwidth. Reducing the number of threads to either 2 or 3 prevented RocksDB from keeping up with compactions, eventually leading to too many L0 sstables, which blocked writes. Limiting write bandwidth for flushes/compactions to 16 MB/sec resulted in a similar situation. It seems like RocksDB does not provide the tunable we desire: it reserves one of the compaction threads for flushes (i.e. for writing a memtable to L0) and the other threads are all used for compactions, but really we want at least one thread reserved for performing L0 compactions, because read performance degrades with more L0 files and too many of them actually causes writes to stop (we can configure that trigger, but I've set the value to 48 and still seen writes stopped).

Somewhat curious is the amount of disk writes we're performing. The cluster is doing 2.5K inserts/sec. YCSB writes a row containing 10 100-byte fields, so figure 1KB per row. That translates into 2.5 MB/sec of raw data being written. Add the 3x replication, the Raft log writes, and the RocksDB WAL and we're up to 30 MB/sec across the cluster. But each disk is doing 45 MB/sec of writes for an aggregate of 270 MB/sec. That discrepancy seems too high to attribute to RocksDB write amplification. Is my math off somewhere?
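Going back to the threads/trigger discussion, these are the RocksDB knobs involved, sketched via the C++ API (the L0 trigger values other than the 48 stop trigger are illustrative, not what we run):

```cpp
// Sketch of the knobs discussed above. The stop trigger of 48 is the value
// mentioned in this comment; the other numbers are illustrative.
#include <rocksdb/options.h>

void ConfigureBackgroundWork(rocksdb::Options* options, int num_cpus) {
  // Background work: one pool for flushes, one for compactions. Dropping
  // compactions to 2-3 threads let L0 fall behind in the experiments above.
  options->max_background_compactions = num_cpus;
  options->max_background_flushes = 1;

  // L0 file-count thresholds: compaction trigger, write slowdown, and the
  // hard stop. Even with the stop trigger at 48 we saw writes stall.
  options->level0_file_num_compaction_trigger = 4;
  options->level0_slowdown_writes_trigger = 20;
  options->level0_stop_writes_trigger = 48;
}
```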
The Raft log WriteBatch for a YCSB insert is 1250 bytes and the applied command WriteBatch is 1500 bytes. So the math above should really translate into 40 MB/sec of write bandwidth. We're seeing 270 MB/sec which is over 6x write amplification. I suppose that's possible. Seems high, though. |
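Showing my work on that estimate (a back-of-envelope check; the factor of two assumes each WriteBatch is persisted once in the WAL and again when the memtable is flushed):

```cpp
// Back-of-envelope check of the expected write bandwidth. Assumes each
// WriteBatch is written twice: once to the WAL and once when the memtable
// is flushed to an sstable.
#include <cstdio>

int main() {
  const double inserts_per_sec = 2500;        // cluster-wide YCSB inserts
  const double raft_log_batch_bytes = 1250;   // Raft log WriteBatch
  const double applied_batch_bytes = 1500;    // applied command WriteBatch
  const double replication_factor = 3;
  const double wal_plus_flush = 2;            // WAL write + memtable flush

  const double expected_mb_per_sec =
      inserts_per_sec * (raft_log_batch_bytes + applied_batch_bytes) *
      replication_factor * wal_plus_flush / (1024 * 1024);
  const double observed_mb_per_sec = 6 * 45;  // 6 disks at ~45 MB/sec each

  // Prints roughly: expected ~39 MB/sec vs observed ~270 MB/sec (~6.9x).
  std::printf("expected: ~%.0f MB/sec, observed: ~%.0f MB/sec (%.1fx)\n",
              expected_mb_per_sec, observed_mb_per_sec,
              observed_mb_per_sec / expected_mb_per_sec);
  return 0;
}
```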
I have noticed RocksDB doing a lot of compactions of the higher-level SSTables, leading to more compaction I/O than one would expect. Maybe we're tuning some of the compaction parameters poorly? We should try comparing to the two "preset" modes.
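Presumably the presets meant here are the canned tuning helpers RocksDB ships on Options (OptimizeLevelStyleCompaction and OptimizeUniversalStyleCompaction); a sketch of applying them, assuming that's what's intended:

```cpp
// Sketch: RocksDB's canned tuning helpers, presumably the "preset" modes
// referred to above. Illustrative only.
#include <rocksdb/options.h>

rocksdb::Options PresetOptions(bool universal) {
  rocksdb::Options options;
  options.IncreaseParallelism(8);  // sizes the background thread pools
  if (universal) {
    // Tunes for universal (size-tiered) compaction: lower write
    // amplification at the cost of space/read amplification.
    options.OptimizeUniversalStyleCompaction();
  } else {
    // Tunes for level-style compaction (512 MB memtable budget by default).
    options.OptimizeLevelStyleCompaction();
  }
  return options;
}
```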
Very possible. I'm diving again into the RocksDB parameter space, but first I'm going to verify the degradation is on writes. I'm adding a way to stop the writes partway through a run.
Here's a 1h run where writes stop after 30m. I'll do a longer run overnight where we perform writes for 12h before shutting them off. That should indicate whether the slowdown is purely a slowdown in write throughput. In the meantime I'm going to be experimenting with RocksDB settings, as it appears our write amplification is too high.
Here are graphs from a longer run where write ops were configured to stop after 12h. It seems pretty clear from this that the performance degradation we're seeing is isolated to writes.

This run included some minor tweaks to our RocksDB configuration. Specifically, I disabled subcompactions, bumped the write buffer size from 8 MB to 16 MB, and bumped the size of L1 from 16 MB to 64 MB. This seems to have made a modest improvement to the performance degradation. I also enabled periodic output of RocksDB compaction stats.
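For reference, those tweaks map onto RocksDB options roughly as follows; the values are the ones quoted above and this is a sketch, not our actual engine configuration code:

```cpp
// Sketch of the configuration tweaks described above. Values are the ones
// quoted in this comment; not CockroachDB's actual engine code.
#include <rocksdb/options.h>

void ApplyTweaks(rocksdb::Options* options) {
  options->max_subcompactions = 1;               // disable subcompactions
  options->write_buffer_size = 16 << 20;         // memtable: 8 MB -> 16 MB
  options->max_bytes_for_level_base = 64 << 20;  // L1 target: 16 MB -> 64 MB
  options->max_bytes_for_level_multiplier = 10;  // each level 10x the last
  options->stats_dump_period_sec = 600;          // periodic compaction stats in the LOG
}
```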
See the Compaction Stats section of the RocksDB tuning guide for an explanation of the fields. I'm confused by the W-Amp column: the table indicates we have a write amplification of 15.8, but the cumulative compaction stats indicate compactions wrote 1750 GB of data even though we only have 53 GB remaining (33x amplification). Regardless, I think the write amplification is what is hurting our write performance. Per the SPARROW theorem, the only way to decrease write amplification is to increase read amplification. Unfortunately, my attempts so far at doing so have always resulted in worse performance. The next experiment is to use a 32 MB write buffer, as that will bring L0 and L1 slightly closer in size, which is a recommendation the RocksDB folks make somewhere.
The experimentation continued this weekend, but so far nothing has made an improvement. Increasing the write buffer size to 32 MB seemed to show a marginal improvement in performance, but nothing significant. I also tried a handful of other settings, none of which helped.

As expected, performance recovered once we were only performing reads. But notice that the absolute performance here is less than in the run above where writes were disabled after 12h: 55k ops/sec vs 48k ops/sec.

The disk metrics during the read-only portion are surprising. We're performing only read operations, and the vast majority of those operations are to a small set of keys, yet we're reading a significant amount from disk. This indicates that my earlier assumption that the reads were being caused by compactions might have been incorrect. Note that the disk metric graphs are for a single machine, so we're reading ~300 MB/sec across the 6 nodes in the cluster. I instrumented the code to track down where these reads are coming from.
My simulated block cache numbers are in. Based on this, I'd expect a performance boost from decreasing the block size, though doing so requires more memory. I'm going to experiment with that in order to verify that I'm starting to understand what is going on.
Looking at this from another direction: reducing the block size to 4KB will allow caching ~10m blocks, which will capture 99.9% of reads. We'll know in 24h if this theory is correct.
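The ~10m figure falls out of the cache sizing; a quick check (ignoring per-block cache overhead, and taking the 99.9% number from the YCSB key distribution rather than recomputing it):

```cpp
// Quick check of how many blocks fit in the aggregate block cache at
// different block sizes. The 99.9%-of-reads figure comes from the YCSB
// key distribution and is not recomputed here.
#include <cstdio>
#include <initializer_list>

int main() {
  const double nodes = 6;
  const double cache_bytes_per_node = 7.0 * (1ull << 30);   // 7 GB each
  const double total_cache = nodes * cache_bytes_per_node;  // ~42 GB

  for (const double block_size : {32.0 * 1024, 4.0 * 1024}) {
    // 32KB blocks -> ~1.4M cacheable blocks; 4KB blocks -> ~11M.
    std::printf("%2.0fKB blocks -> ~%.1fM cacheable blocks\n",
                block_size / 1024, total_cache / block_size / 1e6);
  }
  return 0;
}
```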
Using a 4KB block size improved neither performance nor the block cache hit rate. The lack of performance improvement makes sense in light of the block cache hit rate. I'm scratching my head right now. Disk read bandwidth was reduced to ~1/4 of what it was when using a 32KB block size, while disk read ops look similar to using a 32KB block size.
I'm suspicious that a significant part of the performance degradation is due to the increasing data size and the workload. With RocksDB, write amplification is a function of various tuning knobs and of how much data is present: as more data gets added, more levels of RocksDB get used, where each additional level is 10x the size of the previous level. How does the workload play into this? Both workloads write to effectively random (virgin) keys, so newly written data keeps getting pushed down through every level.
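A back-of-envelope sketch of the level growth, assuming the 64 MB L1 target and 10x multiplier from the earlier tweaks; each additional level means each byte is rewritten at least once more on its way down, so write amplification grows with total data size even when nothing else changes:

```cpp
// Back-of-envelope: how many levels ~53 GB of data needs with a 64 MB L1
// target and a 10x per-level multiplier (the settings from the earlier
// tweaks). Assumptions only; not measured from the cluster.
#include <cstdio>

int main() {
  const double data_bytes = 53.0 * (1ull << 30);  // ~53 GB on disk
  double level_target = 64.0 * (1 << 20);         // L1 target size
  double cumulative = 0;
  int level = 1;
  while (cumulative < data_bytes && level <= 6) {
    cumulative += level_target;
    std::printf("L%d target %8.1f MB (cumulative %8.1f MB)\n",
                level, level_target / (1 << 20), cumulative / (1 << 20));
    level_target *= 10;  // max_bytes_for_level_multiplier
    ++level;
  }
  return 0;
}
```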
I still plan to verify my suspicion above that the performance degradation is due to the increasing data size and random writes, but this isn't a blocker for 1.1. |
Closing this as old. One significant factor in the degradation was the growth of the timestamp cache which has been alleviated by its reimplementation. |
Running `ycsb --concurrency 600 --splits 5000` against `denim` (a 6-node cluster) shows a slow decline in throughput. Similarly, latencies slowly climb.

Pre-splitting the `ycsb` table into 5000 ranges means that the number of ranges is constant over the lifetime of the test. The most interesting metrics that show an increase and could account for this performance decline are the disk metrics.

Each node is configured with the default cache size of 1/4 of physical memory, which in this case is 7GB. Each `ycsb` write is ~1KB in size. Writing at 2K/sec should generate ~7GB/hour, and the graphs show we had generated ~31GB when disk reads started. Are reads starting to miss in the cache? That's somewhat surprising given the skewed distribution for reads. Perhaps the system has just reached a point where it is doing a significant number of background compactions continuously and those compactions are impacting foreground work.