-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
There are too many write stalls because the write slowdown stage is frequently skipped #9423
Comments
Siying, Jay, and I discussed this a bit. I am probably the one slowing down the response to this issue due to being caught up in the philosophy of slowing down writes. Giving a quick update on where I am on that right now. Take, for example, an arrival rate curve that would come close to violating limits on space, memory, or read-amp, but not quite reach those limits. The gradual slowdown phase is harmful to such workloads because it causes them to sleep while not protecting any limit. Worse, when arrival rate exceeds processing rate in the slowdown phase (as is likely to happen, otherwise we wouldn't be slowing down), the requests queue up and each one experiences progressively higher latencies, not just the latency of one sleep. The graphs look much more stable with the proposed change. But does it mean RocksDB is more stable? I am not sure because each curve in the graphs shows a different arrival rate. I think it is important to fix the arrival rate (note: db_bench cannot do this) and look at the tail latencies. For example take this graph: Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies? Or vice versa? I am not sure. All I know is the blue RocksDB could handle the blue curve smoothly, and the red one could handle the red curve smoothly. One thing Siying mentioned today that stuck with me is maybe there are some systems that use RocksDB like a closed-loop benchmark (I suppose they have some control over arrival rate, e.g., by shedding excess requests to somewhere else). For those it indeed would be nice if we held a stable processing rate in face of pressure on the system. I need to think about this case further, and how it'd be done to not regress systems that have no control over arrival rate. Some brief notes on past experiences:
|
WRT to gradual slowdowns being harmful, users can set the soft and hard limits to be close or far apart, or we could make RocksDB more clever about when to apply the soft limit. The average throughput from the red curve was slightly higher than the blue curve. So I am confused by this.
The red and blue curves will be largely unchanged for open-loop workloads, if made from the perspective of RocksDB. These were plotted from the client's perspective (client == db_bench) because the client makes it easier to get that info. One achievable goal for a DBMS is to provide steady throughput from the DBMS perspective during overload, whether from open-loop or closed-loop workloads. RocksDB can't do that today and I think it should.
But RocksDB does stall, badly. Whether or not the solution requires slowing writes in addition to stopping them, we need a solution that doesn't stop writes for 5 minutes during overloads. |
Sorry let me try to clarify. I believe a "workload" is essentially an arrival rate curve. In a closed-loop benchmark, I believe a curve in these graphs represents both the system's processing rate and its arrival rate (since a request is generated as soon as the previous request completes). Since the curves for "x4.io.old" and "x4.io.new" are not identical, there are two different workloads under consideration here. The blue curve represents a spiky workload, and the red curve represents a smooth workload. Consider if we send the spiky workload to both systems. We know "x4.io.old" can handle it without tail latencies impacted by sleeps since it processed that exact workload before. But we do not know that for "x4.io.new". At points where the blue curve exceeds the red curve like the long spike at the beginning, it seems unlikely "x4.io.new" can process that fast. And if it falls behind, requests pile up, effectively causing requests to experience longer and longer delays even though writes aren't stopped. The average throughput increasing due to adding sleeps is interesting. But the question I have about those cases is, is it a fundamental improvement, or is it preventing the LSM from entering a bloated state that we inefficiently handle? I have thought about these cases previously and predict/hope it is the latter, mostly because it feels with more compaction debt we have more choices of what to compact, lower fanout, etc., that should allow us to process writes more efficiently.
Agreed. My point has only been that slowdowns can cause pileups of requests that experience similar or worse latency to the stop condition (in open-loop systems only), so we need to carefully consider changes that might cause more slowdowns than before. Or make it clear when they're useful / how to opt out. Right now there are some that can't be disabled like slowing down when N-1 memtables are full (and N > 3 I think). |
I agree that we should have workloads with 1) open-loop and 2) varying arrival rates and I am happy to help on the effort to create them. I suspect you are being generous in stating that RocksDB can handle the blue curve. A write request that arrives during any of those valleys (stall periods) will wait several minutes. So the workload it can accept has to be the same or a subset of the blue curve. Maybe we are assigning a different value to avoiding multi-minute stalls. RocksDB might be doing the right thing when increasing the compaction debt it allows, but it isn't doing the right thing when reducing it. WRT average throughput increasing by adding sleeps, the issue is that RocksDB can't make up its mind about what the per-level target sizes should be. It increases the targets slowly as the L0 grows, but then suddenly decreases them when the L0 shrinks after a large L0->L1 compaction. Persistent structures can't change their shape as fast as that code asks them to change. |
Filed #9478 for the benchmark client that does open-loop and varying arrival rates |
One scenario for a benchmark client that sustains a fixed arrival rate with a multi-minute write stall.
Then a stall arrives, and 10 threads are created each 100 milliseconds, or 100 threads/second and 6000 threads/minute. So after a 3-minute stall there will be ~20,000 threads assuming my server has enough memory to spawn them. It will be interesting to see what happens when that dogpile is slowly un-piled. |
in sysbench we implemented a "queue" for events arriving with the target rate. we do not create extra threads, but just place arrived into "queue"... when the stall happens, the queue will grow, and then we need to decide at what moment the queue becomes too long to return the response in the acceptable response time window |
That is interesting but it keeps the server from seeing too many concurrent query requests which used to be an exciting event back in the day. |
WRT to the start of the bug, perhaps I can restate my wishes if per-level target sizes can be dynamically increased when the L0 gets too large because the write rate is high:
|
I repeated a benchmark using RocksDB version 6.28.2 both as-is and with a hack. The hack does two things: disables intra-L0 compaction, disables dynamic adjust of the per-level target sizes. The test is IO-bound (database larger than RAM) and was repeated on two server types -- one with 18 CPU cores, one with 40 CPU cores. I used 16 client threads for the 18-core server and 32 client threads for the 40-core server. While I did the full set of benchmarks with tools/benchmark.sh I only share the results for readwhilewriting and overwrite. I modified the scripts to use --benchmarks=$whatever,waitforcompaction to make sure that the work to reduce compaction debt would be assigned to each benchmark The summary is that for overwrite:
These are compaction IO stats from the end of db_bench --benchmarks=overwrite,waitforcompaction,stats
|
So from the results above, can we get the efficiency benefits of the current code without the worst-case write stall? |
Contributing to @mdcallag 's findings here, I can confirm that we are also observing with many versions of RocksDB (including 7.2) that for some types of constants input loads RocksDB can suddenly go into minute-long write stalls. IMO such erratic behavior is bad from the user's perspective: requests to RocksDB work fine most of the time, but suddenly the processing time can go through the roof because of unpredictable write stalls. The tail latency times are way too high then. From the user perspective, I would also favor a more even throughput rate, even though it is not maximum. But it is much better to get constant ingestion rates rather than ingestion requests failing suddenly because of too high tail latencies. We have put our own write-throttling mechanism in front of RocksDB to avoid the worst stalls, but ideally RocksDB should gracefully handle inputs and try to achieve a smooth input rate using the configured slowdown trigger values. |
@jsteemann - currently under discussion at #10057 |
@mdcallag : thanks, this is very useful! And thanks for putting a lot of time into benchmarking the current (and adjusted) behaviors! |
- RocksDB community has confirmed the bad performance when compacting in 6.x and fixed in 7.5.3 * facebook/rocksdb#9423 * apache/kvrocks#1056 - new features(e.g., wal compression) Signed-off-by: Yang Honggang <[email protected]>
- RocksDB community has confirmed the bad performance when compacting in 6.x and fixed in 7.5.3 * facebook/rocksdb#9423 * apache/kvrocks#1056 - new features(e.g., wal compression) Signed-off-by: Yang Honggang <[email protected]>
- RocksDB community has confirmed the bad performance when compacting in 6.x and fixed in 7.5.3 * facebook/rocksdb#9423 * apache/kvrocks#1056 - new features(e.g., wal compression) Signed-off-by: Yang Honggang <[email protected]>
- RocksDB community has confirmed the bad performance when compacting in 6.x and fixed in 7.5.3 * facebook/rocksdb#9423 * apache/kvrocks#1056 - new features(e.g., wal compression) Signed-off-by: Yang Honggang <[email protected]>
- RocksDB community has confirmed the bad performance when compacting in 6.x and fixed in 7.5.3 * facebook/rocksdb#9423 * apache/kvrocks#1056 - new features(e.g., wal compression) Signed-off-by: Yang Honggang <[email protected]>
I hope to use the db_bench tool to verify the effect of this repair. What workloads should I use to verify this fact. This is my current test.
|
Expected behavior
Write slowdowns should be applied long before write stalls
Actual behavior
The bytes pending compaction value computed in EstimateCompactionBytesNeeded changes too rapidly. With write-heavy workloads (db_bench --benchmarks=overwrite) RocksDB can change within 1 or 2 seconds from no stalls (or some slowdown) to long periods of write stalls. Even worse are write stalls that last for 100+ seconds. With a small change (see PoC diff below) there are no long stalls and variance is greatly reduced.
The root cause is that CalculateBaseBytes computes sizes bottom up for all levels except Lmax. It uses max( sizeof(L0), options.max_bytes_for_level_base) for the L1 target size, and then from L1 to Lmax-1, the target size is set to be fanout times larger.
This leads to large targets when L0 is large because compaction gets behind. On its own that isn't horrible and might even be considered to be adaptive. However, when sizeof(L0) suddenly shrinks because L0->L1 compaction finishes then the targets can suddenly become much smaller because sizeof(L1) will be max_bytes_for_level_base rather than sizeof(L0). At that point the bytes pending compaction value can become huge because it is the sum of the per-level bytes pending compaction values (PLBPC) where PLBPC = (sizeof(level) - target_size(level)) X fanout.
We also need to improve visibility for this. AFAIK, it is hard or impossible to see the global bytes pending compaction, the contribution from each level and the target size for each level. I ended up adding a lot of code to dump more info to LOG.
Here is one example using things written to LOG, including extra monitoring I added. The first block of logging is at 20:23.49 and there are write slowdowns because bytes pending compaction ~= 25B while the slowdown trigger was ~=23B. The second line lists the target size per level they are 1956545174 or ~2B for L1, 5354149941 or ~5.3B for L2, 14651806650 or ~14.6B for L3, 40095148713 or ~40B for L4, 109721687484 or ~110B for L5. In this case the value for bytes pending compaction jumps from ~25B to ~85B in 2 seconds. In other tests I have seen it increase by 100B or 200B in a few seconds because L0->L1 completes.
But 2 seconds later after L0->L1 finishes the per-level target sizes are recomputed and they are much smaller: 141300841 or ~140M for L1, 654055230 or ~650M for L2, 3027499631 or ~3B for L3, 14013730948 or ~14B for L4, 64866946001 or ~65B for L5. The result is that the per-level contributions to bytes pending compaction are much larger for L2, L3 and L4 which triggers a long write stall. The ECBN scores line has the value for estimated_compaction_bytes_needed_ as total and totalB and then the per-level contributions. The ECBN actual line has the actual per-level sizes. These show that the increase comes from the sudden reduction in the per-level target sizes.
Steps to reproduce the behavior
A proof-of-concept diff to fix this is here. It computes level target from largest to smallest while the current code does it from smallest to largest.
Reproduction:
The text was updated successfully, but these errors were encountered: