Change The Way Level Target And Compaction Score Are Calculated #10057
Conversation
#9423 is one symptom of the problem.
My understanding is this bundles two changes, which is fine assuming they're both desirable, but it would be helpful if that were explicitly stated (or corrected): (1) switching from target level size adjustment to score adjustment in order to stabilize the calculation of pending compaction bytes. In terms of desirability, (1) seems clearly desirable. It does make us stall earlier than before in certain scenarios, but we have been reasonably successful in having customers increase/disable the stalling limits as needed, and could probably increase the defaults too, so this is fine with me. The new heuristic (2) is more difficult. I need to study it more closely tomorrow, but it certainly appears to have the advantage that it can be "always-on", unlike the level multiplier smoothing we had before.
Before adjusting level sizes, all levels would qualify for compaction, so some L2->L3 and L3->L4 compactions would happen while L0->L1 is happening. However, with adjusted level sizing, the targets would look like this:
and only L0->L1 compaction would be going on, with all other levels' compactions on hold. With this change, L3->L4 will also happen if there are free compaction slots, as sketched below.
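To make the idea concrete, here is a minimal C++ sketch of score-driven level picking under unadjusted targets; the types and the function `PickLevelToCompact` are hypothetical illustrations, not the actual `VersionStorageInfo` code:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: with unadjusted targets, every level keeps its own
// score, so e.g. L3->L4 can be picked while L0->L1 is still running,
// as long as a compaction slot is free.
struct LevelState {
  int level;
  uint64_t size;
  uint64_t target;
  bool being_compacted;
};

int PickLevelToCompact(const std::vector<LevelState>& levels) {
  int best_level = -1;
  double best_score = 1.0;  // only levels over their target qualify
  for (const auto& l : levels) {
    if (l.being_compacted || l.target == 0) continue;
    double score = static_cast<double>(l.size) / static_cast<double>(l.target);
    if (score > best_score) {
      best_score = score;
      best_level = l.level;
    }
  }
  return best_level;  // -1 means no level is over its target
}
```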
This seems like a good thing to try. I would guess it helps space-amp during write bursts and doesn't hurt write-amp (?). Some experimental data would be helpful. Some notes:
@siying how will this impact when stalls occur? Does it mean the stall conditions won't be adjusted? My other question is: if we have a feature targeted towards handling write bursts, then is it worth the additional complexity of trying to distinguish between bursts of writes vs a steady state of high write rates? Because #9423 is caused by a steady state of high write rates.
One more suggestion from Manos that I think is interesting... Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the badness of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).
With a b-tree the behavior is close to "pay as you go" for writes. When the buffer pool is full of dirty pages, a new RMW must do some writeback before it reads the to-be-modified block into the buffer pool, because it must evict a dirty page before doing the read. This limits the worst-case write stall, ignoring other perf problems with checkpoint. But an LSM decouples the write (debt creation) from compaction (debt repayment). Write slowdowns are a way to couple them, but from memory the current write stall uses a fixed wait (maybe 1 millisecond). We can estimate the cost of debt repayment as X = compaction-seconds / ingest-bytes and then make the slowdown ~= X * bytes-to-be-written. The debt repayment estimate assumes that compaction is fully sequential, which is a worst-case assumption as some of the repayment is concurrent. From recent benchmarks I have done, the value for X is approximately 0.1 microseconds per byte of ingest. One example is:
This was measured via db_bench --benchmarks=overwrite,waitforcompaction. I know there is a limit on how short a wait we can implement if a thread is to sleep, although I don't know what that is. Short waits could be implemented by spinning on a CPU, but that has bad side effects.
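For concreteness, a minimal sketch of the proposed estimate under the assumptions above (the name `SuggestedDelay` and its signature are hypothetical, not an existing RocksDB or db_bench API):

```cpp
#include <chrono>
#include <cstdint>

// Sketch of the proposed "pay as you go" slowdown: X is the estimated cost
// of repaying one byte of compaction debt (compaction-seconds / ingest-bytes,
// roughly 0.1 microseconds per byte in the benchmark above), and the delay
// charged to a write is X * bytes-to-be-written.
std::chrono::microseconds SuggestedDelay(double compaction_seconds,
                                         uint64_t ingest_bytes,
                                         uint64_t bytes_to_write) {
  const double x_us_per_byte =
      compaction_seconds * 1e6 / static_cast<double>(ingest_bytes);
  return std::chrono::microseconds(static_cast<int64_t>(
      x_us_per_byte * static_cast<double>(bytes_to_write)));
}
```

With X around 0.1 microseconds per byte, a 1 MB write would be charged a delay of roughly 100 milliseconds under this estimate.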
Does write slowdown before reaching a hard limit ever help write latencies in an open system? See Section 3.2 of https://arxiv.org/abs/1906.09667 for an explanation of the limitations of using closed-loop benchmarks. I can see scenarios where write slowdowns hurt write latencies in an open system (cases where the workload could be handled without breaching the limits, but gets slowed down - thus building up a backlog - because the workload brought the DB near its limits), but I have yet to see a scenario where it helps.
"making the slowdown time a function of the badness of the write overload" is already partially down. The more L0 files there are, the lower the write rate we set to. It is not expanded to estimated compaction debt and might not work well enough with L0->L0 compaction. |
I tested 3 binaries using a96a4a2 as the base. Tests were repeated for IO-bound (database larger than RAM) and cached (database cached by RocksDB) workloads. The test is benchmark.sh run the way I run it. The binaries are:
First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot. This is for cached. This is for IO-bound. From the benchmark summary for cached:
From the benchmark summary for IO-bound:
Write stall counters are here.
Each curve in these graphs is using a different workload. That's the problem with closed-loop benchmarks that I alluded to earlier: "See Section 3.2 of https://arxiv.org/abs/1906.09667 for an explanation of the limitations of using closed-loop benchmarks". The graphs give no indication of whether the "pre" or "post" binaries could handle the workload that was sent to the "nonadaptive" binary with acceptable write latencies.
To clarify, nonadaptive not only removes adaptive level sizing but also L0->L0 compaction, right?
@ajkr All binaries get the same workload. The workload is to send writes to RocksDB faster than compaction can handle. The goal is to see how well or how poorly RocksDB handles it.
Reading Section 3.2 of the paper you referred to, I think I got your point that a benchmark that writes as fast as it can isn't a good indication of the sustainable write throughput without stalling. I think @mdcallag probably doesn't claim his benchmark is measuring the write throughput without stalling. The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.
Yes, I just want to be clear about the limitations and relevance to production so we don't overfit the system to this kind of benchmark. One example is that we force a slowdown when N-1 memtables are full and the memtable limit is >= 3, even though that should reduce peak sustainable throughput. Other ideas I've heard recently, like replacing stops with slowdowns, also sound harmful to peak sustainable throughput since they will necessarily slow down writes before any limit has been breached.
For me, same workload means the same requests are sent at the same time. That can't be the case here, because the inserts/second graph shows "pre" and "post" sometimes get higher QPS than "nonadaptive" and at other times get lower QPS. I believe that's because the workload is dictated by the binary (i.e., a RocksDB slowdown slows down the workload). So different binaries will produce different workloads.
There are several metrics:
I believe @mdcallag tried to measure 2 and 3 and claimed that non-adaptive is the best for these two metrics. Your point is that 1 is not measured. It is indeed a question how we should make trade-offs when 1, 2 and 3 contradict each other, but it's still not clear to me that they contradict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, so now we are at a deadlock and won't be able to make progress. (My question about whether fillseq is a good benchmark to measure this PR is totally orthogonal to this.)
I don't know which idea we're talking about being blocked. This PR is fine with me; I don't see a problem if it helps write-amp or some other metric. For other ideas mentioned, like disabling intra-L0 or replacing stops with slowdowns, I suspect they'll make things worse for 1, so I don't see those as progress right now.
I have no doubt that users encounter this. I assume that in most cases it isn't intentional. The goal is a DBMS that behaves better when overloaded. I encountered this with InnoDB and WiredTiger (usually via the insert benchmark). Worst-case write stalls with WiredTiger used to exceed 10 minutes; in recent versions that is reduced to less than 10 seconds. For both engines it took a while to fix, as the problem is complicated. I didn't encounter this with Postgres, but mostly because they worked on the problem for many years before I started to use it. My point is that behaving well when overloaded is a feature and something worth having in RocksDB. WRT benchmarks that find the peak throughput for a DBMS while respecting an SLA -- that would be great to add to RocksDB and is even on my TODO list, just not high-pri given other things I work on. YCSB supports that, db_bench does not (today).
This wraps up my work on perf tests for this PR. I repeated the overwrite benchmark using 1, 2, 4, 8, 16 and 32 client threads, with writes rate limited to 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 and 100 MB/second. The server has 40 CPUs and 80 HW threads (HT was enabled). I used three binaries labeled "pre", "post" and "nonadaptive", where "pre" is upstream RocksDB, "post" is RocksDB with this PR, and "nonadaptive" is RocksDB with intra-L0 and dynamic level resizing disabled. Graphs are provided for:
Summary (by QoS I mean variance):
Graphs to follow in separate posts.
@mdcallag thanks for helping with benchmarking. It looks like, for the area this PR is targeting (max stalling), the PR is only slightly better. There might be something wrong with the previous assumption; let me investigate.
Revisiting what I posted ~2 weeks ago, some of those graphs are bogus. I am not sure why. The graphs from today look good when I compare them with the data in the text files.
Just to clarify, is it also using delayed_write_rate = 8MB? If that is the case, then it's not surprising that throughput quickly drops to about 1/8 of the sustained rate and sometimes dips further. 8MB is about 1/7.5 of the 60MB fixed rate, so any time a slowing-down condition triggers, throughput drops to that level, and might go even lower. If we zoom into the 8MB/s base range, the graph shows that "post2" is doing significantly better than "post", as "post2" rarely goes more than one order of magnitude below 8MB/s, while "post" often goes much lower.
The benchmark scripts don't set delayed_write_rate so the default, 8MB, is used. Confirmed by looking at LOG. |
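For anyone reproducing this, the limit can be raised explicitly via the `delayed_write_rate` option; a minimal sketch, with 64 MB/s chosen purely as an illustration:

```cpp
#include <rocksdb/options.h>

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // The slowdown write rate under discussion; the benchmark above relied on
  // the default because the scripts never set it. 64 MB/s is illustrative only.
  options.delayed_write_rate = 64 << 20;
  return options;
}
```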
Summary: #10057 caused a regression bug: since the base level size is not adjusted based on L0 size anymore, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction, and causes, in some cases, data to get stuck in L1 without being moved down. We fix this by calculating the score of L0 as size(L0)/size(L1) in the case where L0 is large. Pull Request resolved: #10518 Test Plan: run db_bench against data on tmpfs and watch the behavior of data stuck in L1 go away. Reviewed By: ajkr Differential Revision: D38603145 fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
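A minimal sketch of the kind of capping described in that fix; the function `L0Score` and its exact capping rule are illustrative assumptions, not the code that landed in #10518:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: when L0 grows large, score it by its size relative to
// the base level instead of by file count alone, so L0->L1 is strongly
// favored only while L0 is actually bigger than L1.
double L0Score(int num_l0_files, int l0_compaction_trigger,
               uint64_t l0_bytes, uint64_t base_level_bytes) {
  double file_score = static_cast<double>(num_l0_files) /
                      static_cast<double>(l0_compaction_trigger);
  if (base_level_bytes == 0) {
    return file_score;
  }
  double size_score =
      static_cast<double>(l0_bytes) / static_cast<double>(base_level_bytes);
  // Cap the file-count score by the size-based score (but never below 1.0,
  // so an over-trigger L0 still qualifies for compaction).
  return std::min(file_score, std::max(size_score, 1.0));
}
```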
…11525) Summary: after #11321 and #11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is handled automatically by RocksDB and requires no manual compaction from the user. This makes the option true by default, as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. The wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size. The PR mostly contains fixes for unit tests, as they assumed `level_compaction_dynamic_level_bytes=false`. The most notable changes are commits f742be3 and b1928e4, which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps reduce the changes needed for unit tests. I think this default-option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desirable in unit tests as it makes it easier to create a desired LSM shape. The comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and the change made in #10057. Pull Request resolved: #11525 Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change. Reviewed By: ajkr Differential Revision: D46654256 Pulled By: cbi42 fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
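For reference, a minimal sketch of enabling the option explicitly (redundant once it is the default; shown only for illustration):

```cpp
#include <rocksdb/options.h>

rocksdb::Options MakeLeveledOptions() {
  rocksdb::Options options;
  // Per the commit above this is now the default for leveled compaction;
  // it is set explicitly here only for illustration.
  options.level_compaction_dynamic_level_bytes = true;
  return options;
}
```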
Summary:
The current level targets for dynamic leveling have a problem: the target level size will change dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they will resume after the L0->L1 compaction finishes, so the expected write-amplification benefits might not be realized. The proposal here is to stop adjusting the level target sizes and instead rely on adjusting the score of each level to prioritize the levels that most need to compact.
Basic idea:
(1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled for L2->L3 rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
(2) The score is calculated as actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, because when the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, by the time those bytes come down to L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we can adjust it by adding the estimated downward bytes to the target level size.
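A minimal sketch of that scoring rule (the name `LevelScore` is illustrative):

```cpp
#include <cstdint>

// Score = actual level size / (target size + estimated bytes coming down
// from upper levels). Pending upper-level bytes inflate the denominator,
// so a level that is about to receive a lot of data is deprioritized
// relative to deeper levels that are barely affected by the backlog.
double LevelScore(uint64_t level_bytes, uint64_t target_bytes,
                  uint64_t estimated_incoming_bytes) {
  return static_cast<double>(level_bytes) /
         static_cast<double>(target_bytes + estimated_incoming_bytes);
}
```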
Test Plan:
Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.