Only commit the ZIL once in zpl_writepages() (msync() case) #1853
Conversation
Currently, using msync() results in the following code path:

sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In this code path, zil_commit() is called as part of zpl_putpage(): for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance when there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can make msync() ~700 times slower than it should be, and it uses ZIL resources very inefficiently.

This patch fixes the issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages() in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts:

- A new callback system was added to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback fires not only from zil_commit() but from zil_sync() as well, which means the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it gets notified as soon as the data is safe, period. This is an improvement over dmu_tx_callback_register(), which was used previously and only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit completes without having to block on zil_commit() itself.

- zpl_writepages() now calls write_cache_pages() in non-SYNC mode, which prevents (1) write_cache_pages() from blocking, and (2) zpl_putpage() from issuing ZIL commits. zpl_writepages() issues the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() a second time in SYNC mode, because of a documented edge case in write_cache_pages() whereby a non-SYNC run is not guaranteed to return every dirty page. We therefore need to run it at least once in SYNC mode to honor the persistence guarantees. This only happens when pages are modified while msync() is running, which should be rare; in most cases there won't be any additional pages and the second call will do nothing.

This change fixes openzfs#1849. It also fixes a bug related to openzfs#907, whereby calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit completes. The new callback system fixes that problem.
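In outline, the batched zpl_writepages() described above works something like the following sketch. This is a simplification rather than the actual patch: locking, error handling, and the sync=always property are omitted, and helper names such as ITOZ/ITOZSB may not match the merged code exactly.

```c
/*
 * Sketch of the two-pass batching approach; not the actual patch.
 * Locking, error handling, and the sync=always property are omitted,
 * and helper names (ITOZ, ITOZSB) may not match the merged code.
 */
static int
zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	znode_t *zp = ITOZ(mapping->host);
	zfs_sb_t *zsb = ITOZSB(mapping->host);
	enum writeback_sync_modes sync_mode = wbc->sync_mode;
	int error;

	/*
	 * Pass 1: non-SYNC mode. zpl_putpage() hands each dirty page to
	 * the DMU and logs an itx, but neither it nor write_cache_pages()
	 * blocks, and no per-page ZIL commit is issued.
	 */
	wbc->sync_mode = WB_SYNC_NONE;
	error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);

	if (sync_mode != wbc->sync_mode) {
		/* A single ZIL commit covers the whole batch of pages. */
		zil_commit(zsb->z_log, zp->z_id);

		/*
		 * Pass 2: SYNC mode, to catch pages dirtied concurrently
		 * with the first pass (a documented write_cache_pages()
		 * edge case). In the common case this finds nothing.
		 */
		wbc->sync_mode = sync_mode;
		error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
	}

	return (error);
}
```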
@dechamps How did you test this? I have also written a patch that implements this idea, but I have not tested it yet. My patch does away with write_cache_pages() by implementing the Illumos pvn_vplist_dirty() function in the SPL, using write_cache_pages() and find_get_pages() as guides. That would allow us to traverse the address space's radix tree once, rather than ceil(N/14) times, where N is the number of dirty pages. It would also give us the opportunity to avoid write_cache_pages()'s invocation of cond_resched() on each iteration. Unfortunately, it is not pull-request ready at the moment, for one because of lack of testing, and for two because it depends on very recent kernel features. The latter is easy to fix, but I need to address the former to know whether the potential efficiency gain is worth implementing pvn_vplist_dirty().
In terms of testing, I verified that it behaves correctly. If you want a performance test, then you should look at the steps to reproduce in #1849. That being said, your solution looks technically superior; if you want to merge yours instead, then by all means, go ahead. However, I would prefer not to leave this unfixed for a long time.
@dechamps I just wrote a small program for performance testing:
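The listing itself has not survived in this thread; going by the surrounding discussion, it dirtied a large mmap()ed file and timed msync(). A hypothetical stand-in, with the file path and sizes invented for illustration:

```c
/*
 * Hypothetical stand-in for the elided test program: dirty a large
 * mmap()ed region, then time a single msync(). File path and sizes
 * are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 1024UL * 1024 * 1024;	/* 1 GiB of dirty pages */
	int fd = open("/tank/testfile", O_RDWR | O_CREAT, 0644);
	struct timeval start, end;
	char *map;

	if (fd < 0 || ftruncate(fd, len) != 0) {
		perror("setup");
		return (1);
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	memset(map, 0xab, len);			/* dirty every page */

	gettimeofday(&start, NULL);
	if (msync(map, len, MS_SYNC) != 0)
		perror("msync");
	gettimeofday(&end, NULL);

	printf("msync: %.3f s\n", (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6);
	return (0);
}
```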
Do you have any objections to this method of testing?
If you had actually looked at the steps to reproduce in #1849, you would have realized that you just re-implemented the exact same program :) Actually, in real testing I did a little more than that: take a look at 9b5f1ce. You will need to do something similar if you're testing with fake vdevs (I was testing with in-RAM file vdevs) to simulate real-world ZIL performance.
@dechamps I seem to be reimplementing plenty of stuff lately. Anyway, my patch seems to perform poorly on small amounts of dirty data (although it performs very well on large amounts). I am investigating; I should have something to report by tomorrow.
Well, the performance of this piece of code is fairly straightforward: as long as you're committing the ZIL once per batch instead of once per page, it should be fine.
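The cost difference being described here is the classic sync-per-operation versus sync-per-batch pattern. A purely userspace analogy (hypothetical illustration, not ZoL code):

```c
/*
 * Userspace analogy (not ZoL code): fsync() after every write() is to
 * a file what the old per-page zil_commit() was to msync().
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define	NWRITES	1024

static double
elapsed(struct timespec *a, struct timespec *b)
{
	return ((b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9);
}

int
main(void)
{
	char buf[4096];
	struct timespec t0, t1;
	int i, fd = open("analogy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return (1);
	}
	memset(buf, 0xab, sizeof (buf));

	/* Slow: one fsync() per write, like one ZIL commit per page. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NWRITES; i++) {
		(void) write(fd, buf, sizeof (buf));
		fsync(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("per-write fsync: %.3f s\n", elapsed(&t0, &t1));

	/* Fast: write the whole batch, then a single fsync(). */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NWRITES; i++)
		(void) write(fd, buf, sizeof (buf));
	fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("batched fsync:   %.3f s\n", elapsed(&t0, &t1));

	close(fd);
	return (0);
}
```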
@dechamps It just dawned on me that it is annoying to talk about code without actually seeing it. Here is what I have applied to my system right now: http://bpaste.net/show/148747/ I am on my way to dinner, so I have not prepared a git branch for it yet. I plan to do performance comparisons between this, your patch, and the unpatched code after dinner. The code is a little rough right now; specifically, there are stale comments, and the SPL code needs to be modified to support older kernels. However, it is in a state where I can do performance tests.
I see. Well, in its current form your code isn't addressing #1849 at all: you're still committing the ZIL from the per-page code path. There's still one problem, however: with the current ZoL code, there is another bug in this area (the one behind #907). In its current state, your patch is completely orthogonal to mine and seems to be solving an entirely different set of issues. I see two possible outcomes to this situation: either my patch gets merged, or yours does once it actually addresses #1849.
On second thought, there's a third, "mixed" outcome: if you simply implement my suggestion to remove the per-page ZIL commit, our two patches could be combined.
@dechamps I think you misread the patch. I restructured things so that the code for handling a single page goes into its own function. Your changes to make the ZIL handle the callback are probably the reason for the discrepancy in numbers between our systems, but I still need to do testing. Something offline came up, so this will probably be delayed by a few hours. With that said, Illumos appears to do something similar.
Yes, indeed; the names are misleading and probably worth changing. So you are indeed still committing the ZIL from the per-page code path. In fact, this callback problem is one of the reasons why I had to come up with my solution in the first place: if you rely on the DMU transaction callback alone, you only get notified on the next TXG sync, not when the ZIL commit completes. Note that being able to "write back 1 gigabyte of dirty data in 11 to 14 seconds" does not prove anything: if your code happens to resurrect #907, then it would indeed complete in a few seconds, but then a 1-byte commit would take seconds as well (waiting for the next DMU sync instead of the ZIL commit). You should definitely use my program for your testing, as it exercises small commits as well as large ones. If small commits take 1-5 seconds (i.e. they are waiting for the next DMU sync), then you have a problem. To clarify, the correct solution needs to exhibit the following three characteristics at the same time: (1) small msync() calls must return as soon as the ZIL commit completes, (2) large msync() calls must batch all their pages into a single ZIL commit, and (3) pages must not be reported as stable before they actually are.
Here's where we are right now: my patch is designed to exhibit all three, while yours, as it stands, does not.
Also, note that my test program tests for (1) and (2), but not (3). I'm not sure (3) can be tested with a purely userspace test, because userspace has no control over when TXG syncs happen.
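Characteristic (1), at least, is easy to check from userspace by timing a tiny commit. A hypothetical snippet (file path and sizes invented):

```c
/*
 * Hypothetical check for characteristic (1): a 1-byte msync() should
 * return in milliseconds (ZIL commit), not seconds (TXG sync).
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("/tank/testfile", O_RDWR | O_CREAT, 0644);
	struct timeval t0, t1;
	char *map;

	if (fd < 0 || ftruncate(fd, 4096) != 0) {
		perror("setup");
		return (1);
	}
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	map[0] = 1;			/* dirty a single byte */

	gettimeofday(&t0, NULL);
	(void) msync(map, 4096, MS_SYNC);
	gettimeofday(&t1, NULL);

	/* Whole seconds here means we waited on a TXG sync: (1) fails. */
	printf("1-byte msync: %.3f s\n", (t1.tv_sec - t0.tv_sec) +
	    (t1.tv_usec - t0.tv_usec) / 1e6);
	return (0);
}
```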
Oh, well, looking at the code:
@dechamps I am going to withdraw my approach in favor of yours. My approach is unsafe because of how RCU is done.
@dechamps @ryao I haven't had a chance to review this patch yet, but a quick comment about testing. Xfstests 074 is an absolute torture test for mmap() and suffers from this exact msync(2) performance issue. It mixes mmap'ed I/O and sys_read/sys_write I/O to overlapping regions of a file and then guarantees the result is always consistent. My suggestion would be to run it for a few hours and see if it catches anything. You can find a modified version of xfstests which is ZFS-friendly in my https://github.com/behlendorf/xfstests github repository; see the top-level README.markdown in the zfs-upstream patch for a basic quick start. This work still needs to be pushed upstream to the xfstests project, but I wanted to get everything passing cleanly on ZFS first. There are still some false positives.
@behlendorf: I did what you asked (I had to remove one of the tests to get it running). The run did produce one kernel stack trace, but considering the stack trace, it seems unlikely it has anything to do with my patch. Aside from that, everything seemed fine; nothing scary in the kernel output. Honestly, I think you are better equipped than me to test this change. I have to make do with lowly home server resources here :/
@dechamps Thanks for kicking off the testing; I agree that's probably unrelated. I had a chance to launch some testing as well with this change applied to master. To my surprise, with this change applied I'm now able to pretty easily hit #1536. I haven't had a chance to look into why yet. Apparently the FreeBSD guys have effectively disabled this assertion in their code, but I've refrained since the assertion looks right to me. Or at least, no one has yet been able to successfully explain to me why it's OK for i_count to be zero. Leaving that test failure aside, I really like this patch. This has been something of a known problem for a while now, but it never got any attention because it wasn't clear to me how to cleanly restructure the code to batch the ZIL commits. I think adding a callback for the ITX records was a nice, generic way to handle this, and it was also clever to tweak writepages() to make the individual page updates async. I still need to spend some quality time with this patch to make sure all the cases are covered, but on initial inspection I like the approach. Thanks!
Good (or bad) news, depending on your point of view. The recent fix for the HAVE_BDI autoconf regression, fd23663, seems to be preventing me from hitting #1536, which I encountered during my first round of testing. That may make it harder to resolve #1536 and #1417, but fixing those shouldn't block merging this patch. @dweeezil @ryao I'd really like to get some additional eyes on this change before merging it; if you have a little time, could you please review it? I'm going to go over it again carefully myself.
@dechamps OK, merged to master. Thanks for making this improvement.
@behlendorf I had been looking at this patch and it looks good to me. Apparently it must have layered on top of current master rather cleanly, too, since I see it has been committed now. My only questions relate to the new itx callback facility. First, at least in the context of ZFS proper (not sure about Lustre and other potential consumers), it seems the old dmu_tx_callback_register() facility is now only used by the ztest program; does it need to be kept around? Second, and related, should ztest be extended to test the new itx callback facility? Those are my only comments. @dechamps, thanks for this work. This fixes a sore spot that I'm amazed didn't get a lot more complaints.
@dweeezil Lustre does still use the existing callback interface, so we need to keep it around. However, we're definitely going to evaluate whether the new callback makes more sense for Lustre. We should add a test case for this in ztest, but I didn't want to block merging this work on that. Thanks for the additional review!
This is a fix for issue #1849, as well as a cleaner fix for issue #907. See the individual commit description for details about the implementation.
Before:
After:
And now aptitude is much faster. Yay!