ZIL Pipelining #6133
Conversation
@ryao, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @nedbass and @ahrens to be potential reviewers. |
The buildbot made it blatantly obvious that I missed a race between this and txg_sync. I am working on a fix now. |
@prakashsurya and I are also working on significant changes to improve ZIL performance. @ryao could you summarize the architectural changes you're proposing here? We can also share our design doc and hopefully soon code as well. We can figure out if our changes are going after the same problems with ZIL performance, or if they are complementary. |
I would be happy to share my code, with the caveat that it still isn't finished. What I have passes the normal regression testing (zfstest and zloop), so it's probably in a good enough state for others to check out the overall design. |
@ahrens Right now, Phase 1 is concurrent while phases 2 through 5 are effectively a single-threaded critical section (ignoring ZIO). The idea here is to reduce the critical section to phases 2 and 3. This reduces latencies because the next ZIL commit no longer has to wait for the previous one to clear phase 5 before it can enter phase 2; the previous one only needs to clear phase 3, which is much faster. This also improves IOPS by allowing multiple threads to operate in phases 4 and 5. Also, eliminating the batching reduces the incidence of thundering herds. At a lower level, what I am really doing is refactoring |
@prakashsurya I'd appreciate it if you could share that. People at work want me to get a group of improvements together for a demo of future performance that takes place in about 36 hours. If our changes turn out to be complementary, it could be rather helpful to combine them. |
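For readers less familiar with the ZIL commit path, here is a minimal userspace sketch of the structural change described above, using plain C and pthreads with hypothetical names; it is not the actual patch. Only the log-block generation step is serialized, so the next committer can start generating its block while the previous committer's write is still in flight.

```c
/*
 * Conceptual sketch only (hypothetical names, not ZFS code): shrinking
 * the serialized section to block generation lets commits overlap their
 * write-and-wait phases instead of queueing behind a single batch.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t gen_lock = PTHREAD_MUTEX_INITIALIZER;

/* Phases 2-3: assign records to a log block; this must stay serialized. */
static int
generate_log_block(int writer_id)
{
	pthread_mutex_lock(&gen_lock);
	printf("writer %d: generating log block\n", writer_id);
	pthread_mutex_unlock(&gen_lock);	/* critical section ends here */
	return (writer_id);
}

/* Phases 4-5: issue the write and wait; these may overlap across writers. */
static void
issue_and_wait(int block)
{
	printf("block %d: write issued, waiting for completion\n", block);
	usleep(1000);				/* stand-in for device latency */
}

static void *
commit(void *arg)
{
	int id = *(int *)arg;

	/* Only generation is under the lock; the wait happens outside it. */
	issue_and_wait(generate_log_block(id));
	return (NULL);
}

int
main(void)
{
	pthread_t t[4];
	int ids[4] = { 0, 1, 2, 3 };

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, commit, &ids[i]);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return (0);
}
```

In the sketch, the mutex corresponds to the smaller critical section; the old batching behavior would be equivalent to holding it across issue_and_wait() as well.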
Quickly scanning your diff on my phone, I think we're trying to solve the same problem, but in slightly different ways. I can't look in detail right now, though. |
@prakashsurya I had a hunch that we were. There isn't too much opportunity to improve ZIL aside from reducing the size of the critical section without resorting to a disk format change. I have an idea for one where particularly active datasets could be given multiple ZIL pipelines, which would also improve replay, but that is an idea to explore after getting the basics right. Edit: I suppose that allowing early exit in situations where the itx list is empty when no other |
This is still a work in progress, and isn't meant to be landed in its current state. I'm only pushing this code publicly now, rather than when it's "done", so the design of the changes implemented here can be compared with the code in the ZFS on Linux PR here: openzfs/zfs#6133 There's a "high level" comment about the new design of `zil_commit()` in a code comment above that function; please see that for more details.
Force-pushed from c3483c3 to 4604c8c
It turns out that I had two races. I have my fingers crossed that this is the last of them. :) @prakashsurya Thanks for sharing. I like the idea of using the lwb to simplify the arguments to the *_get_data callbacks. I think I will "borrow" it for my version. :) Your code appears to have a per-lwb vdev tree. That should cause us to flush excessively in situations where a userland application does many IOs and then calls I suggest we benchmark our approaches against each other once they are mature, enhance the stronger one with any good ideas from the weaker one, and get that version merged. |
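As a rough illustration of the flush-amplification concern, here is a toy model with hypothetical names and made-up counts (not ZFS code): if every log block flushes its own vdev set on completion, a burst of small writes flushes the same vdevs repeatedly, whereas accumulating the touched vdevs and flushing once at the durability point issues one flush per vdev.

```c
/*
 * Toy model (hypothetical names, not ZFS code) comparing per-block cache
 * flushes against a single coalesced flush at the point the application
 * actually needs durability.
 */
#include <stdio.h>
#include <stdint.h>

#define	NUM_LWBS	8	/* log blocks written before the app waits */
#define	VDEV_MASK	0x7	/* the same three vdevs touched by each block */

static int
popcount(uint32_t mask)
{
	int n = 0;

	for (; mask != 0; mask &= mask - 1)
		n++;
	return (n);
}

int
main(void)
{
	int per_lwb_flushes = 0;
	uint32_t deferred_mask = 0;

	for (int i = 0; i < NUM_LWBS; i++) {
		/* Per-block policy: flush every vdev this block touched. */
		per_lwb_flushes += popcount(VDEV_MASK);

		/* Deferred policy: just remember which vdevs were touched. */
		deferred_mask |= VDEV_MASK;
	}

	printf("per-block: %d flushes, coalesced: %d flushes\n",
	    per_lwb_flushes, popcount(deferred_mask));
	return (0);
}
```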
right, that's something we are concerned about, but we didn't want to overly complicate matters until we had some performance results to lean on. |
@prakashsurya I had the same thought about implementing a mechanism to allow short circuit on a unique foid > 0 when there is no foid = 0 in the pipeline. It looks like your code handles that while mine does not. By the way, it looks like there is a 3rd race that I missed in the testing at work. This time between |
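A minimal sketch of the short-circuit idea being discussed, assuming the convention that foid == 0 means "commit everything" (hypothetical types and names, not the actual ZIL code):

```c
/*
 * Conceptual sketch (hypothetical names): a commit for a single object
 * (foid > 0) can return early when nothing pending touches that object
 * and nothing pending applies to the whole dataset (foid == 0).
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pending_itx {
	uint64_t foid;			/* object the record applies to */
	struct pending_itx *next;
};

bool
commit_can_short_circuit(const struct pending_itx *head, uint64_t foid)
{
	if (foid == 0)
		return (false);		/* caller wants everything committed */

	for (const struct pending_itx *p = head; p != NULL; p = p->next) {
		if (p->foid == 0 || p->foid == foid)
			return (false);	/* something relevant is pending */
	}
	return (true);			/* nothing to wait for; return early */
}
```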
Force-pushed from 26c114a to 05d33c6
A bunch of ztest runs later, the following gem cropped up:
with
being the apparent culprit, and
indicating it's from this PR. |
@sempervictus Thanks for testing. The culprit turns out to be 2ba96b5, which I did after seeing how @prakashsurya did things in his patch. What happens is that the lwb_t was made by the previous pipeline writer. It caches a pointer to that pipeline writer's zilw_t, which is then freed. When the lwb_t is used by the next writer, this callback operates on the old zilw_t that had been freed, and boom. That was a really subtle mistake that was hard to find, since my first thought was that it was an existing bug that I somehow managed to make more common with my latest fixes. Anyway, I will be refreshing this shortly with the fix after confirming this theory locally. |
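To make the lifetime problem concrete, here is a stripped-down sketch of the dangling-pointer pattern being described, using stand-in types (not the actual patch): the reusable block descriptor caches a pointer to its creator's per-writer state, which the creator frees when it finishes.

```c
/*
 * Conceptual sketch (stand-in types, not ZFS code) of the use-after-free
 * described above: a log-block descriptor outlives the writer that
 * created it, but keeps a cached pointer to that writer's state.
 */
#include <stdlib.h>

struct writer_state { int seq; };	/* stand-in for the zilw_t */

struct log_block {			/* stand-in for the lwb_t */
	struct writer_state *owner;	/* cached when the block is created */
};

void
buggy_flow(struct log_block *lwb)
{
	struct writer_state *prev = malloc(sizeof (*prev));

	lwb->owner = prev;	/* previous writer caches its own state */
	free(prev);		/* previous writer finishes and frees it */

	/* The next writer reuses lwb: lwb->owner now dangles -> "boom". */
}

void
fixed_flow(struct log_block *lwb, struct writer_state *current)
{
	lwb->owner = current;	/* refresh the pointer for each new writer */
}
```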
@sempervictus those traces were from the automated kmemleak buildbot test results. After one run of the ZFS Test Suite we verify nothing was leaked for any of the test cases. |
@sempervictus @behlendorf My apologies for the delays. I am using separate machines for pushing, developing and testing, so I have been trying to batch process these things. I have been stuck trying to diagnose dedup at work, which might not have worked out too well in conjunction with the manual batch processing. I will get it updated, rebased and pushed in the morning though. :) |
Hello all, I ported this patch back to 0.6.5.8 and set up an all-SSD environment: five Intel 3510 SSDs in a raidz1, with no separate ZIL device. The test tool was fio with 4k random writes. I could get IOPS from 7k to 20k+; it's not stable, but it's a big improvement, since the 4k random write value was around 5k+ without the patch. 4k random reads were stable, always around 33k+. Thank you for the great job. Has anyone else seen this unstable 4k random write performance? |
After several 4k random write tests, the low value dropped to around 3k+. In the iostat -xm 1 view, the zvol was sometimes at 100% while most of the time the disks were not 100% busy, so it seems the zvol still has other bottlenecks. |
I had intended to push this in the morning, but better late than never. Anyway, there was a small cstyle issue from some debug code I had added. Rather than force push, I just added a small commit that we can squash later if the buildbot and reviewers are happy with it. |
@sempervictus The latest version has passed at least 100 hours of strenuous testing at work. However, there is one issue that the buildbot had caught that I have not fully digested. If I understood it correctly, it is a pre-existing race that dates back to OpenSolaris, but it is complex enough that I am not 100% certain. I haven't had much time to analyze it since spotting it in the buildbot. If the rebase @behlendorf suggested makes it go away, I would be happy saying that the current patch is probably okay. At the very least, I am increasingly confident that this is no worse than the existing code. |
@wangdbang There are other issues in the zvol code. #6207 is intended to fix some of them. I just pushed it. You might want to try it too. |
@ryao, thank you for the clue; I will try it. |
@ryao I'd like to try and get some performance numbers of this change, and compare it with the work that I'm doing. Any chance you could port this to OpenZFS and open a PR against that project? |
@prakashsurya Yes. I'll do that today. |
The previous ZIL code implemented a batching operation. This blocked all writers while an existing ZIL commit was being done and was bad for latencies. This patch changes the ZIL code to allow ZIL commits and ZIL zio operations to occur simultaneously through a pipeline. This simultaneously reduces latencies and increases IOPS in synchronous IO heavy workloads. Signed-off-by: Richard Yao <[email protected]>
@prakashsurya Here is a first stab at a port. I will lean on the buildbot to know whether there are any syntactic errors or not: Also, as a bonus, I ported the other patch that just landed into ZoL: |
Notes
=====
This change addresses the same deficiencies that Richard Yao is addressing in the following pull requests:

- openzfs/zfs#6133
- openzfs#404

Problem
=======
The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems:

1. When there are outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), any new calls to zil_commit() will block, waiting for the currently outstanding ZIL blocks to complete. The blocks written for each "writer thread" are coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced.

2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning. The size of any given batch is proportional to the number of ZIL transactions in the queue at the time that the batch starts processing the queue, which doesn't occur until the previous batch completes. Thus, if there are a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the following side effect:

3. If a given batch takes longer to complete than normal, this results in larger batch sizes, which then take longer to complete and further drive up the latency of zil_commit(). This can occur for a number of reasons, including (but not limited to) transient changes in the workload and storage latency irregularities.

Solution
========
The solution attempted by this change has the following goals:

1. no on-disk changes; maintain the current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk while there are already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible for the same amount of ZIL transactions, without introducing significant latency to any individual ZIL transaction; i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the following improvements:

1. new ZIL blocks can be generated and issued while there are already outstanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying storage latency, rather than to the incoming synchronous workload.

Where to Start w.r.t. Reviews
=============================
For those attempting to review this change, I suggest the following order:

1. To familiarize yourself with the high-level design this change is shooting for, read the comment directly above zil_commit().
2. If anything in that comment is unclear, not true, and/or missing critical information, please open a review comment so that can be addressed.
3. If you're already familiar with the data structures of the ZIL, now would be a good time to review the modifications I made, as a precursor to the algorithm changes. If you're not already familiar, it's probably best to do this after scanning the algorithm, at which point you'll hopefully have more context to fall back on.
4. Briefly read through zil_commit(), zil_commit_writer(), and zil_commit_waiter() to get familiar with the "flow" of the code w.r.t. the thread(s) calling zil_commit().
5. Briefly read through zil_lwb_write_done(), zil_lwb_flush_vdevs(), and zil_lwb_flush_vdevs_done() to get familiar with how the ZIL blocks complete, and how the "commit waiters" are signalled.

At this point, I would hope that you'll have a decent grasp of the changes at a high level, such that you'll be able to navigate the code on your own and rely on the code comments for further inspection and understanding. If that's not the case, I should probably address it with more code comments, so please open review issues so I can do that.

Performance Testing
===================
I used two fio workloads to drive a synchronous write workload. In both cases, the performance of this change was pitted against the baseline "master" branch.

The first workload had each fio thread trying to perform synchronous writes as fast as it could, issuing a single write at a time. The results from this test can be seen here: https://goo.gl/jBUiH5

For the "tl;dr", use these links, which jump directly to the sections that describe the percentage difference in fio IOPs and fio write latency when running on both HDDs and SSDs:

- HDD drives, IOPs: https://goo.gl/iChbUh
- SSD drives, IOPs: https://goo.gl/97u2LA
- HDD drives, latency: https://goo.gl/RCGxr4
- SSD drives, latency: https://goo.gl/n7DQ1D

The second workload also had fio issuing synchronous writes, but instead of attempting to perform as many operations as fast as each thread could, each fio thread attempted to achieve about 64 write operations per second. The results of this test can be seen here: https://goo.gl/xEPGm4

Additionally, the following links jump directly to the percentage difference information between the baseline and the project branch:

- HDD drives, IOPs: https://goo.gl/rgk7m4
- SSD drives, IOPs: https://goo.gl/jCSVnH
- HDD drives, latency: https://goo.gl/qK9E5q
- SSD drives, latency: https://goo.gl/dGAXk1

Upstream bugs: DLPX-52375
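As a rough illustration of the "commit waiter" mechanism this commit message refers to, here is a userspace sketch using plain pthreads and hypothetical names (not the real zil_commit_waiter() code): each caller blocks on its own small waiter object, and the completion path signals it as soon as the one log block that caller depends on is stable, rather than every caller waiting on an entire batch.

```c
/*
 * Conceptual sketch (hypothetical names, plain pthreads): a per-commit
 * waiter that is signalled by the block-completion path.
 */
#include <pthread.h>
#include <stdbool.h>

struct commit_waiter {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	bool done;
};

void
waiter_init(struct commit_waiter *cw)
{
	pthread_mutex_init(&cw->lock, NULL);
	pthread_cond_init(&cw->cv, NULL);
	cw->done = false;
}

/* Called by the committing thread once its records are in a log block. */
void
waiter_wait(struct commit_waiter *cw)
{
	pthread_mutex_lock(&cw->lock);
	while (!cw->done)
		pthread_cond_wait(&cw->cv, &cw->lock);
	pthread_mutex_unlock(&cw->lock);
}

/* Called from the (simulated) completion path, e.g. after the flush. */
void
waiter_signal(struct commit_waiter *cw)
{
	pthread_mutex_lock(&cw->lock);
	cw->done = true;
	pthread_cond_signal(&cw->cv);
	pthread_mutex_unlock(&cw->lock);
}
```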
So I think I found a bug here, unfortunately on my root pool of all things. It seems a condition can be created where the pool starts to cause consistent crashes in lwb_create. What's odd is that it only happens when the OS is booted from the affected pool; running from external media and chrooting into the root dataset on the affected rpool does not cause it to happen, with the same kernel and ZFS builds on the removable media and the system (from our own internal repo). |
This stack is definitely causing the crash, though I'm wondering if it's related to #6477. |
So, having pulled this out of the stack, I'm seeing human-noticeable performance drops. Whatever happened to the ZIL in my rpool is concerning, but with nothing but some time and nerves lost, I'm seriously considering getting it back into the prod stack for the clear benefits on our cinder volume nodes. |
@sempervictus We'd like to see @prakashsurya's version of this go into illumos and Linux, which we believe has a superset of the improvements of this PR. Prakash, could you update the OpenZFS PR (openzfs/openzfs#421) with the final version that we integrated at Delphix? |
@ahrens: I will throw the updated version into our test queue, thank you. Last time I tried that PR, I ran into some dtrace_probe codepath which had undefined macros or something, and it never passed the initial builders. I'd be glad to put it through its paces. |
openzfs/openzfs#421 is up to date. I've also ported that work to ZoL here: #6337. There appears to be some more work needed on the port, though, as it's not passing buildbot due to (what appears to be) an mmap issue. I haven't tried to root cause the Linux test failure yet. |
We don't have an illumos test stack right now, so I'll hold off adding that until it's not going to merc the runtime testers. Meantime, if it's anything like this PR, I'm looking forward to it. |
Closing, see #6566. |
Description
This modifies the ZIL to allow the lwb_t objects of each ZIL commit to be calculated while the previous ZIL commit's lwb_t objects are being written.
This is the first version that both passes all of the tests I throw at it and does not appear to risk data integrity. I have a few ideas on how to slightly improve locking, but I feel good enough about this version that I am putting it up for additional scrutiny.
Motivation and Context
This improves synchronous IO performance. The Phoronix benchmarks will likely benefit because many of them are heavy on synchronous IO.
How Has This Been Tested?
At work, we have a couple of dual-socket Xeon E5 v4 systems, each with 20 cores in total, a dozen SSDs, and NVRAM for SLOG. There is a semi-proprietary benchmark that measures IOPS on 4 zvols on each system simultaneously, which we are using to quantify performance. Furthermore, sync=always is set on the zvols. This code passes that benchmark while boosting performance by at least 50%. Without the NVRAM, performance goes from 19,000 IOPS to at least 29,000 IOPS. This is in a configuration with volblocksize=4k and all 12 SSDs in RAIDZ-2 with ashift=12 (and yes, I know this is not space efficient). Also, this passes ztest; earlier versions did not.
Types of changes
Checklist:
Signed-off-by.