perf: pacing user writes, flushes and compactions #7
Note that a goal here should be removal of |
I ran I think I'm seeing the bad behavior that results from not pacing user writes gracefully pretty clearly. L0->L1 compactions are starting to take a long time (>2min) and user writes completely stall for extended periods of time (tens of seconds). L1 seems to be growing out of proportion because there's never any time to move something into L2. Is this a good experiment against which to measure any steps we take?
diff --git a/cmd/pebble/sync.go b/cmd/pebble/sync.go
index be6e992..e4d1642 100644
--- a/cmd/pebble/sync.go
+++ b/cmd/pebble/sync.go
@@ -60,7 +60,7 @@ func runSync(cmd *cobra.Command, args []string) {
 				log.Fatal(err)
 			}
 		}
-		if err := b.Commit(db.Sync); err != nil {
+		if err := b.Commit(db.NoSync); err != nil {
 			log.Fatal(err)
 		}
 		latency.Record(time.Since(start))
diff --git a/compaction.go b/compaction.go
index 8083835..aa41427 100644
--- a/compaction.go
+++ b/compaction.go
@@ -8,9 +8,11 @@ import (
 	"bytes"
 	"errors"
 	"fmt"
+	"log"
 	"os"
 	"path/filepath"
 	"sort"
+	"time"
 	"unsafe"
 
 	"github.com/petermattis/pebble/db"
@@ -299,7 +301,7 @@ func (c *compaction) String() string {
 	for i := range c.inputs {
 		fmt.Fprintf(&buf, "%d:", i+c.level)
 		for _, f := range c.inputs[i] {
-			fmt.Fprintf(&buf, " %d:%s-%s", f.fileNum, f.smallest, f.largest)
+			fmt.Fprintf(&buf, " %d:%q-%q", f.fileNum, f.smallest, f.largest)
 		}
 		fmt.Fprintf(&buf, "\n")
 	}
@@ -643,6 +645,11 @@ func (d *DB) compact1() (err error) {
 	if c == nil {
 		return nil
 	}
+	tBegin := time.Now()
+	defer func() {
+		elapsed := time.Since(tBegin)
+		log.Printf("compaction %s took %.2fs", c, elapsed.Seconds())
+	}()
 	jobID := d.mu.nextJobID
 	d.mu.nextJobID++
|
Yes, it is a great place to start, though it isn't sufficient to fully test a throttling mechanism. It is also interesting to test a workload that has some sort of periodic behavior with respect to writes vs reads, and a workload that experiences a phase change (e.g. 100% writes -> 50% writes -> 0% writes). |
Regarding pacing of user writes and flushing: I had some time to think about this while waiting at the doctor's office this morning and wanted to capture my thoughts. First some background:
The current Pebble code flushes a memtable as fast as possible. This isn't entirely true, because
We can view the collection of memtables as a bucket that is being filled and drained at the same time. User writes fill the bucket, while flushing drains it. Draining needs to at least keep up with filling or else we'll overfill our bucket (i.e. run out of memory), but we don't want draining to go significantly faster than filling or the system will experience hiccups in fill performance. The ideal steady state is for the fill rate to equal the drain rate.
One wrinkle in this analogy is that we can only drain when there is a full memtable. We want to measure our flush progress below the level of a memtable. This can be accomplished by tracking how much of the memtable has been flushed. There are several ways to estimate this, but one is to keep track of how much of the memtable has been iterated over by tracking the number of bytes in the skiplist nodes, keys and values. The total of all of those bytes seen during full iteration should equal the allocated bytes in the memtable's arena. (It looks like an
Back to the bucket draining analogy. We're limiting our draining rate to the fill rate. But if the drain rate is already at maximum and is not keeping up, we need to limit the fill rate. What control mechanism do we use here? Using some sort of
The dirty bytes threshold would be set to 110% of the memtable size. When the first memtable is filled it is handed off to flushing and writes can proceed at full speed until the remaining 10% of the dirty bytes quota is exhausted. At that point, user writes are paced to the speed of flushing. As soon as dirty bytes are cleaned by the flusher, blocked user writes would be signaled to proceed.
What about the draining speed? Recall that we want to limit the drain rate to the fill rate. Note that the drain rate is not naturally constrained by the fill rate: without any limiting, flushing should usually be able to "drain" a memtable much faster than user writes can fill a new one. Draining should target a dirty bytes value that is ~100% of a memtable's size. That is, the steady state for user writes and flushing should be one memtable's worth of dirty data. If the flusher falls behind this target it will speed up until it is going full throttle and eventually user writes will be stopped, though note how this happens on a granular basis. If the flusher is going too fast (e.g. dirty bytes falls below 90% of a memtable's size) it will slow itself to some minimum flush rate (e.g. 2 MB / sec).
There are some additional complexities, such as the handling of To summarize:
@ajkr Would appreciate your thoughts on this. I haven't fully thought through how this would interact with pacing of compactions, but the hazy outline I have in mind is that compaction pacing can be tied to the flush speed. So we'd have compactions pacing flushes which in turn pace user writes. |
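As an illustration of the dirty-bytes quota described above, here is a minimal sketch of writers blocking once the 110% threshold is exhausted and being signaled when the flusher cleans bytes. The `dirtyQuota` type and its `reserve`/`release` methods are hypothetical names invented for this example; they are not Pebble's API, and the sketch ignores memtable rotation entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// dirtyQuota is a hypothetical helper (not part of Pebble): writers reserve
// bytes before adding them to a memtable and block once the dirty-bytes
// threshold is exhausted; the flusher releases bytes as it cleans them.
type dirtyQuota struct {
	mu        sync.Mutex
	cond      *sync.Cond
	dirty     int64 // bytes written but not yet flushed
	threshold int64 // 110% of the memtable size, as proposed above
}

func newDirtyQuota(memtableSize int64) *dirtyQuota {
	q := &dirtyQuota{threshold: memtableSize + memtableSize/10}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// reserve blocks the calling writer until n more dirty bytes fit under the
// threshold. A request larger than the threshold would block forever.
func (q *dirtyQuota) reserve(n int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.dirty+n > q.threshold {
		q.cond.Wait()
	}
	q.dirty += n
}

// release is called by the flusher after it has cleaned n bytes; it wakes
// any writers that were blocked in reserve.
func (q *dirtyQuota) release(n int64) {
	q.mu.Lock()
	q.dirty -= n
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := newDirtyQuota(100) // threshold is 110
	q.reserve(100)          // fills the first "memtable"
	q.reserve(10)           // the remaining 10% proceeds at full speed

	done := make(chan struct{})
	go func() {
		q.reserve(5) // must wait for the flusher to clean at least 5 bytes
		close(done)
	}()
	q.release(5)
	<-done
	fmt.Println("dirty bytes:", q.dirty)
}
```

A condition variable keeps the wake-up granular: writers resume as soon as any bytes are cleaned, rather than waiting for a whole memtable to finish flushing.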
I wrote a little simulator for the above and the interactions seem to work well. Here is a 30s snippet:
There is one fill and one drain goroutine. Every 5-10s the fill goroutine changes the rate at which it is filling (in the range 1-20 MB/sec). The drain goroutine targets having 64 MB of dirty data, slowing itself down to a minimum drain rate of 4 MB/sec if it is going too fast. The code for the above is here. |
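The linked simulator is not reproduced in this thread. As a rough stand-in, here is a minimal discrete-time sketch of the same idea, assuming one-second steps and a drainer that tries to remove whatever exceeds the 64 MB target, clamped between the 4 MB/sec floor and an assumed 50 MB/sec ceiling. The function names and the ceiling are hypothetical, and the sketch ignores both the memtable chunking and the throttling of user writes.

```go
package main

import "fmt"

const (
	targetMB   = 64.0 // dirty data the drainer aims to keep around
	minDrainMB = 4.0  // drain rate floor in MB/s
	maxDrainMB = 50.0 // assumed device limit in MB/s (hypothetical)
)

// drainRate picks the drain rate for the next second: enough to bring the
// dirty data back to the target, clamped to [minDrainMB, maxDrainMB].
func drainRate(dirtyMB float64) float64 {
	r := dirtyMB - targetMB
	if r < minDrainMB {
		r = minDrainMB
	}
	if r > maxDrainMB {
		r = maxDrainMB
	}
	return r
}

// simulate applies one fill rate per one-second step and returns the dirty
// data (in MB) remaining after each step, starting from the target.
func simulate(fillRates []float64) []float64 {
	dirty := targetMB
	out := make([]float64, 0, len(fillRates))
	for _, fill := range fillRates {
		dirty += fill
		drain := drainRate(dirty)
		if drain > dirty {
			drain = dirty // cannot drain more than is there
		}
		dirty -= drain
		out = append(out, dirty)
	}
	return out
}

func main() {
	// Fill at 10, 10, 20 MB/s, then drop to 2 MB/s, below the drain floor.
	fmt.Println(simulate([]float64{10, 10, 20, 2, 2}))
	// prints [64 64 64 62 60]
}
```

While the fill rate is above the floor the drainer matches it and dirty data stays at the target; once the fill rate drops below the floor, the dirty data slowly shrinks.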
For comparison, here is what the simulation does if you don't pace draining to filling:
|
This code was not actively being used and the mechanisms were fragile. See #7 for the proposed direction for pacing user writes vs flushes.
I had some rough thoughts when thinking through this, though no concrete suggestions. Overall I like the idea of backpressure making its way from the background to the foreground. It reminds me of the gear-and-spring scheduler you mentioned a while ago where backpressure went bottom-up in the LSM eventually making its way to the memtable writes.
|
@ajkr All good points. I've been thinking about how to pace memtable draining to compactions, but still don't have a clear picture of how that will work.
I was imagining that the flush thread would increment the drained memtable bytes as each entry is added to the memtable. Yes, there will be a small blip whenever a sync threshold is passed (though the use of
Yeah, the 110% was just something pulled from the air. This would likely be a tunable. |
The code for the spring&gear scheduler described in the bLSM paper is on github. It is a bit of a mess (IMO). I think the
In the bLSM terminology, |
Simulation of filling and draining a bucket in chunks with bi-directional flow control. If draining is proceeding faster than filling, then draining is slowed down (but kept at a minimum rate so it doesn't stop completely). If filling is proceeding faster than draining, it is slowed down. The simulation is meant to represent filling and draining of the memtable. Each memtable is a chunk and draining (flushing) of a memtable cannot proceed until the memtable is full. The simulation periodically adjusts the rate of filling and demonstrates the nimble reaction of the flow control mechanism. See #7
Hello, I'm a PhD student from UC Irvine working on LSM-trees. I found this very interesting project from Mark's small datum blog. In particular, this issue is closely related to what I'm currently working on, so I would like to share some thoughts, which I hope can be mutually helpful. I have been working on minimizing write stalls of LSM-trees (with various designs) under a given disk bandwidth budget for I/O operations (flushes and merges). It seems that this issue tries to address the other side of my problem: how should the disk bandwidth budget be set for a given user workload? Despite the difference, I believe they have much in common as well, and my thoughts on this topic are as follows:
I hope these thoughts can be helpful, and please kindly correct me if my thoughts are wrong, especially in an industrial system setting such as CockroachDB. |
Hi @luochen01. Thanks for the thoughts. Glad to hear more attention is being given to avoiding write stalls in LSMs. I think this is an interesting problem. Do you have anything additional you can share about the directions you are exploring?
Yes, all subsequent writes will be delayed. The queueing is actually fairly explicit when committing a write to the LSM (via the write to the WAL), but note that writes to the WAL are batched, while writes to the memtable are concurrent. I see two effects of concurrent flushes and compactions on user write performance:
Do you have more specifics on how to structure this? The available disk bandwidth can change unexpectedly. On cloud VMs this can be due to hitting write quota limits. Or it could be due to another process on the machine suddenly slurping up disk bandwidth.
The goal isn't to reduce average write latencies, only tail write latencies. I did an experiment that found that rate limiting user writes significantly reduced tail latencies without affecting average latencies. The problem is how to configure that rate limit.
I'm not aware of papers which attempt to estimate maximum write throughput of LSMs. Can you point me towards them? I am aware of papers that attempt to estimate write-amp, but that is a different metric.
This is what RocksDB does with its level0_slowdown_writes_trigger and level0_stop_writes_trigger options. Those options are exactly what I'm looking to improve upon. Or am I misunderstanding you?
It might not have been clear from my earlier messages on this issue: a full solution here would involve pacing compactions as well. https://github.com/petermattis/pebble/issues/7#issuecomment-480106971 only talks about pacing flushes, which I agree would be insufficient on its own. |
Hi @petermattis, thanks for your replies. I'm currently preparing a paper to summarize my results on the write stall problem. I could share it when it's ready.
This seems to be a wrong configuration. From my experiments (and as a standard industrial practice), the WAL should be placed on a dedicated disk. Its periodic disk forces are very harmful to disk throughput (or you may have turned off disk forces at the risk of data loss).
The feedback loop monitors the write rate and estimates the disk bandwidth budget for background I/Os. It can use L0 SSTables as an estimator of the write pressure. For example, it can increase the disk bandwidth budget if L0 SSTables start to pile up, and decrease it when they drain. This budget should be applied over a relatively long term. There could be some variance in the instantaneous disk bandwidth, but that could be absorbed over time. Also, one good property of control theory (e.g., an integral control law) is that it can reject disturbances from the environment. I'm not an expert in this area, but it seems that this could be a reasonable approach.
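To make the feedback-loop idea concrete, here is a minimal sketch of an integral-style controller that accumulates the L0 file count error into a bandwidth budget. The `budgetController` type, its fields, the gain, and the bounds are all hypothetical values chosen for illustration; nothing here comes from an existing system.

```go
package main

import "fmt"

// budgetController is a hypothetical integral controller: every update adds
// gain*(observed L0 files - target) to the budget, so a persistent backlog
// keeps raising the budget until the backlog drains.
type budgetController struct {
	targetL0 int     // desired number of L0 files
	gain     float64 // MB/s added per excess L0 file per update (assumed)
	minMBps  float64 // lower bound on the budget
	maxMBps  float64 // upper bound on the budget
	budget   float64 // current background I/O budget in MB/s
}

// update folds one L0 observation into the budget and returns the new value.
func (c *budgetController) update(l0Files int) float64 {
	c.budget += c.gain * float64(l0Files-c.targetL0)
	if c.budget < c.minMBps {
		c.budget = c.minMBps
	}
	if c.budget > c.maxMBps {
		c.budget = c.maxMBps
	}
	return c.budget
}

func main() {
	c := &budgetController{targetL0: 4, gain: 2, minMBps: 10, maxMBps: 200, budget: 50}
	for _, l0 := range []int{8, 8, 2, 4} {
		fmt.Printf("L0=%d budget=%.0f MB/s\n", l0, c.update(l0))
	}
}
```

With these numbers the budget climbs from 50 to 58 and 66 MB/s while eight L0 files are observed, drops to 62 MB/s when only two remain, and holds there once the count sits at the target.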
I doubt this finding. It would be good if you can share your experimental results, including the workload. I'm not sure how you did it, but it is important to decouple the data arrival rate from the processing rate (e.g., clients put data into queue while LSM processes it; or YCSB uses timestamps to implement this feature without explicit queuing, i.e., the intended update time).
I'm actually talking about estimating write-amp, and sorry for the confusion! (Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs). Actually, given write-amp and some statistics of records, it should be straightforward to estimate the maximum write throughput (by dividing the disk write throughput by the write-amp).
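A small worked example of that estimate, assuming the disk write bandwidth and the write-amp are both known. The function name and the figures are hypothetical; the calculation ignores read I/O, WAL writes, and any variation in write-amp over time.

```go
package main

import "fmt"

// maxIngestMBps estimates the highest sustainable user write rate: every
// user byte is eventually written writeAmp times, so the disk can absorb
// at most diskWriteMBps/writeAmp of new user data per second.
func maxIngestMBps(diskWriteMBps, writeAmp float64) float64 {
	if writeAmp < 1 {
		writeAmp = 1 // each byte is written at least once
	}
	return diskWriteMBps / writeAmp
}

func main() {
	// Assumed figures: 200 MB/s of disk write bandwidth, write-amp of 10.
	fmt.Println(maxIngestMBps(200, 10)) // prints 20
}
```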
When I read through this issue, I saw many descriptions about how to control flush speed based on dirty bytes, and I was misled a bit on that. Actually, I'm not sure controlling the flush speed is needed. It might be just enough to control the overall disk bandwidth of all I/O operations, which in turn controls the maximum write throughput. Thanks again for your clarification! |
By disk forces I assume you mean disk syncs. No, we do not disable them. Storing the WAL on a separate disk is a good performance practice, yet it isn't always possible. We have to design for situations in which there is only a single disk available. Even if a separate disk is available, we still have to design for when bandwidth to that disk varies unexpectedly.
The problem I see with using L0 sstable count as a signal is that the L0 sstable count jumps around quickly. You could try to smooth this signal, but why not use a signal that varies smoothly in the first place? Using a feedback loop that controls bandwidth feels like trying to perform surgery with mittens on. Perhaps that is an extreme analogy, but my point is that we can have much more granular control.
It is good to doubt something which doesn't match up with your internal understanding. I do that all the time. Unfortunately, I don't have the results from that experiment around any longer. Yes, the workload I used coordinated arrival and processing times. Clearly rate limiting in such an environment can reduce the tail latencies for the processing times, but it might do so at the expense of throughput. My contention is that the rate limiting actually allowed more overall throughput to be achieved in steady state (though peak throughput was limited). I should note that I was also rate limiting the flush and compaction rate.
If you're imagining that flush speed is controlled by some overall disk bandwidth limit, then I think you are controlling flush speed, just by an indirect mechanism. |
To be more accurate, L0 sstables reflect how fast data arrive and how fast data are merged down to the bottom level. I'm not sure why the L0 sstable count jumps around quickly. In a steady workload, the arrival rate of new L0 sstables should be relatively low, as most of the background I/Os are spent on merges. It is true that the feedback loop approach treats the LSM-tree as a black box, and thus better control may be achieved by examining the internal structures and data flows of an LSM-tree. However, the advantage of the feedback loop approach is that it is simple to implement and it usually works very well (this approach may have been first introduced into the DBMS community by DB2 http://www.vldb.org/conf/2006/p1081-storm.pdf). I have also pursued the direction of estimating the write rate limit using analytical/simulation approaches for a given disk bandwidth budget to avoid write stalls, but eventually I gave up the effort due to its high complexity. I'm very interested in how this problem can be addressed eventually in CockroachDB. |
This makes sense to me. Thanks a lot for the explanation. I guess we should separate the query client logic out of the threads that execute storage engine operations (or use the YCSB trick). This is different from the tools I've worked with before, mainly RocksDB's db_bench.
Even if it doesn't help write latencies, it may still be worthwhile to delay writes for limiting read-amp and space-amp. Let's say we want some limit on number of sorted runs (number of memtables plus number of L0 files plus number of L1+ levels) to bound read-amp. Would it be just as good to have a hard limit (i.e., stop writes) when we hit the sorted run limit, rather than gradually slowing down as we approach it? I am still thinking about it, but am starting to believe the hard limit alone is a fine choice.
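The sorted-run count and the hard limit mentioned above can be sketched as follows. The `lsmShape` type is hypothetical, and counting only non-empty L1+ levels as sorted runs is an assumption of this example.

```go
package main

import "fmt"

// lsmShape is a hypothetical snapshot of the tree used to bound read-amp.
type lsmShape struct {
	memtables  int     // mutable plus immutable memtables
	l0Files    int     // each L0 file is its own sorted run
	levelBytes []int64 // sizes of L1, L2, ...; a non-empty level is one run
}

// sortedRuns counts memtables, L0 files, and non-empty L1+ levels.
func (s lsmShape) sortedRuns() int {
	n := s.memtables + s.l0Files
	for _, b := range s.levelBytes {
		if b > 0 {
			n++
		}
	}
	return n
}

// mustStopWrites applies the hard limit: writes stop only once the number
// of sorted runs reaches the limit, with no gradual slowdown before that.
func (s lsmShape) mustStopWrites(limit int) bool {
	return s.sortedRuns() >= limit
}

func main() {
	s := lsmShape{memtables: 2, l0Files: 4, levelBytes: []int64{10, 0, 100}}
	fmt.Println(s.sortedRuns(), s.mustStopWrites(8), s.mustStopWrites(9))
	// prints 8 true false
}
```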
Maybe I misunderstood the terminology here, but I found this surprising. My understanding is data arrival rate measures how fast we are inserting, and maximum write throughput measures total writes to WAL, L0, L1, etc. Write-amp is typically 10+ so I am not sure how an LSM can be sustained by writing at only 1.1-1.15x the insertion rate. BTW, I enjoyed your LSM survey paper, particularly the part asking authors to evaluate against a well-tuned LSM :). Looking forward to what's next. |
OK, I think I see what you mean now about maximum write throughput 10-15% higher than data arrival rate. The ideal disk bandwidth budget would then be |
My proposed memtable "dirty bytes" metric is also an indication of how fast data arrives and how quickly it is getting flushed, but it is a much smoother metric as it gets updated on every user write and incrementally as a memtable gets flushed. The L0 sstables metric is chunky in comparison. It grows and shrinks in units of sstables, which are much larger than user writes. With the default RocksDB compaction options, the L0 sstable count grows from 0->4 sstables, then a merge occurs and while that is occurring additional flushes happen so that the number of L0 sstables tends to randomly jump around between 0-8.
This statement makes me wonder if we're on the same page. The arrival of new sstables occurs at a rate proportional to incoming write traffic. For a write heavy workload, that can happen fairly frequently (on the order of a few seconds). I do agree that a majority of background I/Os are spent on compactions/merges. |
I spent a little time understanding the RocksDB user-write pacing mechanism and thought it would be useful to jot down my notes. When a batch is committed to RocksDB, part of the "preprocessing" that is performed is to delay the write. This is done in
This is checking to see if the number of L0 files is within 2 of the stop-writes threshold. There are a lot of conditions checked in this method: too many memtables, L0 stop-writes threshold, L0 slowdown-writes threshold, pending compaction bytes threshold, etc.
The "compaction bytes needed" metric is computed whenever a new To summarize all of the above:
|
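The conditions listed in the notes above can be pictured as a small classifier. The sketch below is only loosely modelled on that list: the type names, field names, and the order of the checks are assumptions made for illustration, and it is not a transcription of RocksDB's code.

```go
package main

import "fmt"

type stallState int

const (
	normal  stallState = iota // writes proceed at full speed
	delayed                   // writes are slowed down
	stopped                   // writes are blocked
)

// stallInputs gathers the quantities mentioned in the notes above. All
// names are hypothetical.
type stallInputs struct {
	memtables, maxMemtables      int
	l0Files, l0Slowdown, l0Stop  int
	pendingBytes, soft, hardStop int64 // estimated compaction bytes needed
}

// classify checks the stop conditions first, then the slowdown conditions.
func classify(in stallInputs) stallState {
	switch {
	case in.memtables >= in.maxMemtables,
		in.l0Files >= in.l0Stop,
		in.pendingBytes >= in.hardStop:
		return stopped
	case in.l0Files >= in.l0Slowdown,
		in.pendingBytes >= in.soft:
		return delayed
	}
	return normal
}

// nearStop mirrors the "within 2 of the stop-writes threshold" check.
func nearStop(in stallInputs) bool {
	return in.l0Files >= in.l0Stop-2
}

func main() {
	in := stallInputs{
		memtables: 1, maxMemtables: 2,
		l0Files: 22, l0Slowdown: 20, l0Stop: 24,
		pendingBytes: 1 << 30, soft: 64 << 30, hardStop: 256 << 30,
	}
	fmt.Println(classify(in) == delayed, nearStop(in)) // prints true true
}
```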
After playing around with the simulator, thinking about the RocksDB code, and talking to Andrew, I have a few notes and ideas that I think are worth sharing. First off, I tried to modify the simulator to model RocksDB. In particular, I tried to implement the same logic as
then the system would think that compaction debt is increasing even though it is in a steady state. This is because the write rates are adjusted simply on the existence of a compaction debt delta. This does not take into account the actual size of the delta. I spoke to Andrew about this and he confirmed that this issue has been brought up a few times before. Thus, it doesn't seem like we can use this approach in Pebble. This means that we need a new mechanism for detecting changes in compaction debt, which also takes the actual size of the delta into account. One idea which came up was assigning a "budget" for memtable flushing based on number of bytes and not the rate (bytes/second). If we can estimate the write-amp of the system, then we can allow @petermattis Do you have any thoughts on this? |
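To illustrate why reacting only to the sign of the compaction debt delta can mislead, here is a minimal sketch with a made-up debt series that is steady over each cycle: three small increases followed by one large decrease. The adjustment factors (0.8 on an increase, 1.25 on a decrease) and the series are assumptions chosen for this example, not values taken from RocksDB.

```go
package main

import "fmt"

// signOnlyRate adjusts a write rate once per debt observation, looking only
// at whether the debt went up or down and never at how much it changed.
func signOnlyRate(rate float64, debts []int64, onIncrease, onDecrease float64) float64 {
	for i := 1; i < len(debts); i++ {
		switch {
		case debts[i] > debts[i-1]:
			rate *= onIncrease
		case debts[i] < debts[i-1]:
			rate *= onDecrease
		}
	}
	return rate
}

func main() {
	// Two identical cycles: +10, +10, +10, then -30. Net change is zero.
	debts := []int64{100, 110, 120, 130, 100, 110, 120, 130, 100}
	rate := signOnlyRate(100, debts, 0.8, 1.25)
	fmt.Printf("net debt change: %d, write rate: %.2f\n",
		debts[len(debts)-1]-debts[0], rate)
}
```

Each cycle multiplies the rate by 0.8 three times and by 1.25 once, so the rate shrinks to roughly 64% of its value per cycle (about 41 after two cycles) even though the debt ends exactly where it started. A controller that weighed the size of each delta would see no net change.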
I'm assuming you mean that the RocksDB approach is problematic and we should do better. We could implement this approach and see exactly the same behavior as RocksDB, right?
Relying on either historic or worst case write-amp seems feasible, though I haven't thought about it thoroughly. (I do my best thinking about these heuristics in the car). Rather than historic or worst case estimates of write-amp, we might be able to make a direct estimate based on the current sizes of the levels. Maybe. If I squint that seems possible, though I don't have a concrete proposal for how to do that. Something also to document here is what I've mentioned to both @Ryanfsdf and @ajkr in person: we should be able to have compaction debt adjust smoothly rather than the chunky adjustment that is done now via updates whenever a flush or compaction finishes. What I'm imagining is that we track how far we have iterated through the input compaction tables and subtract that size from the input levels, and add the size of the new output tables to the output level. If compaction debt is adjusted at a fine granularity like that I'm imagining we could tie flush rate to it so that the flush rate adjustments are all smooth. |
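The fine-grained accounting described above might look like the following sketch. How compaction debt is defined is not spelled out in this thread, so the `debt` function here (bytes above a per-level target) is a hypothetical stand-in, as are the function names and the sizes used.

```go
package main

import "fmt"

// debt is a hypothetical definition: the bytes by which each level exceeds
// its target size, summed over all levels.
func debt(levelBytes, targets []int64) int64 {
	var d int64
	for i, b := range levelBytes {
		if b > targets[i] {
			d += b - targets[i]
		}
	}
	return d
}

// advance records incremental progress of a compaction from level into
// level+1: bytes already iterated over are subtracted from the levels they
// came from, and bytes written so far are added to the output level.
func advance(levelBytes []int64, level int, readFromInput, readFromOutput, written int64) {
	levelBytes[level] -= readFromInput
	levelBytes[level+1] += written - readFromOutput
}

func main() {
	levels := []int64{100, 400} // current sizes of two adjacent levels
	targets := []int64{50, 500} // assumed target sizes
	fmt.Println(debt(levels, targets)) // prints 50

	// The compaction has consumed 10 bytes from the upper level and 20 from
	// the lower level, and has written 28 bytes of output so far.
	advance(levels, 0, 10, 20, 28)
	fmt.Println(levels, debt(levels, targets)) // prints [90 408] 40
}
```

Calling `advance` as the compaction iterator moves forward lets the debt shrink in small steps rather than in one jump when the compaction finishes.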
Yes, we should expect to see the same behavior. However, I'm skeptical about the correctness of this approach. There are cases when overall compaction debt is increasing but the system would think it's decreasing, and vice versa. That means user writes may be throttled when compaction debt is decreasing and sped up when compaction debt is increasing, contrary to what we want.
With the smooth compaction debt adjustment, we can set the "target" compaction debt to be between a low and high watermark (i.e., size_of_memtable/2 <= compaction_debt < size_of_memtable). Any time we cross the high watermark, we would throttle flushing. And any time we cross the low watermark, we would throttle compactions. This is essentially the same as how we'll handle user writes with memtable flushing. This seems like the ideal approach. Is this similar to what you've had in mind before? |
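The watermark rule above can be written as a small decision function. The names are hypothetical, and the sketch only says which side to throttle, not by how much.

```go
package main

import "fmt"

type pacing int

const (
	noThrottle         pacing = iota // debt is inside the target band
	throttleFlushes                  // debt at or above the high watermark
	throttleCompaction               // debt below the low watermark
)

// pace compares compaction debt against the band
// memtableSize/2 <= debt < memtableSize.
func pace(debt, memtableSize int64) pacing {
	low, high := memtableSize/2, memtableSize
	switch {
	case debt >= high:
		return throttleFlushes
	case debt < low:
		return throttleCompaction
	}
	return noThrottle
}

func main() {
	const memtable = 64 << 20 // assumed 64 MB memtable
	fmt.Println(pace(70<<20, memtable) == throttleFlushes)    // prints true
	fmt.Println(pace(40<<20, memtable) == noThrottle)         // prints true
	fmt.Println(pace(10<<20, memtable) == throttleCompaction) // prints true
}
```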
Yes, though I hadn't seen it fully spelled out until now. PS Be sure to keep around the adjustments you've made to the simulator to model RocksDB's heuristics. We'll want to use that as evidence that this new pacing mechanism is an improvement. |
Add a mechanism to pace (rate limit) user writes, flushes and compactions. Some level of rate limiting of user writes is necessary to prevent user writes from blowing up memory (if flushes can't keep up) or creating too many L0 tables (if compaction can't keep up). The existing control mechanisms mirror what is present in RocksDB: Options.MemTableStopWritesThreshold, Options.L0CompactionThreshold and Options.L0SlowdownWritesThreshold. These control mechanisms are blunt, resulting in undesirable hiccups in write performance. The controller object was an initial attempt at providing smoother write throughput and it achieved some success, but is too fragile.
The problem here is akin to the problem with pacing a concurrent garbage collector. The Go GC pacer design should be inspiration. We want to balance the rate of dirty data arriving in memtables with the rate of flushes and compactions. Flushes and compactions should also be throttled to run at the rate necessary to keep up with incoming user writes and no faster so as to leave CPU available for user reads and other operations. A challenge here is to adjust quickly to changing user load.