Only commit the ZIL once in zpl_writepages() (msync() case) #1853

Closed
wants to merge 1 commit

Conversation

dechamps
Contributor

This is a fix for issue #1849, as well as a cleaner fix for issue #907. See the individual commit description for details about the implementation.

Before:

msync(         1 bytes)      0.035 seconds
msync(         2 bytes)      0.027 seconds
msync(         4 bytes)      0.009 seconds
msync(         8 bytes)      0.011 seconds
msync(        16 bytes)      0.014 seconds
msync(        32 bytes)      0.009 seconds
msync(        64 bytes)      0.011 seconds
msync(       128 bytes)      0.014 seconds
msync(       256 bytes)      0.009 seconds
msync(       512 bytes)      0.011 seconds
msync(      1024 bytes)      0.014 seconds
msync(      2048 bytes)      0.009 seconds
msync(      4096 bytes)      0.011 seconds
msync(      8192 bytes)      0.023 seconds
msync(     16384 bytes)      0.045 seconds
msync(     32768 bytes)      0.092 seconds
msync(     65536 bytes)      0.183 seconds
msync(    131072 bytes)      0.378 seconds
msync(    262144 bytes)      0.583 seconds
msync(    524288 bytes)      1.413 seconds
msync(   1048576 bytes)      3.105 seconds
msync(   2097152 bytes)      6.242 seconds
msync(   4194304 bytes)     12.117 seconds
msync(   8388608 bytes)     10.331 seconds
msync(  16777216 bytes)     34.966 seconds
msync(  33554432 bytes)     34.901 seconds

After:

msync(         1 bytes)      0.003 seconds
msync(         2 bytes)      0.003 seconds
msync(         4 bytes)      0.004 seconds
msync(         8 bytes)      0.003 seconds
msync(        16 bytes)      0.003 seconds
msync(        32 bytes)      0.004 seconds
msync(        64 bytes)      0.003 seconds
msync(       128 bytes)      0.003 seconds
msync(       256 bytes)      0.003 seconds
msync(       512 bytes)      0.003 seconds
msync(      1024 bytes)      0.003 seconds
msync(      2048 bytes)      0.002 seconds
msync(      4096 bytes)      0.002 seconds
msync(      8192 bytes)      0.002 seconds
msync(     16384 bytes)      0.003 seconds
msync(     32768 bytes)      0.003 seconds
msync(     65536 bytes)      0.003 seconds
msync(    131072 bytes)      0.003 seconds
msync(    262144 bytes)      0.006 seconds
msync(    524288 bytes)      0.007 seconds
msync(   1048576 bytes)      0.009 seconds
msync(   2097152 bytes)      0.017 seconds
msync(   4194304 bytes)      0.035 seconds
msync(   8388608 bytes)      0.064 seconds
msync(  16777216 bytes)      0.641 seconds
msync(  33554432 bytes)      0.250 seconds

And now aptitude is much faster. Yay!

Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance when there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
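
In sketch form, the pre-patch behavior looks like this (the loop helper here
is illustrative pseudo-C, not actual kernel or ZoL code):

    /* write_cache_pages() invokes zpl_putpage() once per dirty page ... */
    for_each_dirty_page(mapping, page) {
            zpl_putpage(page, wbc, data);       /* hands the write to the DMU */
            /* ... and, in SYNC mode, zpl_putpage() ends with: */
            /* zil_commit(zsb->z_log, zp->z_id);   one commit per page */
    }
    /* N dirty pages => N synchronous ZIL commits instead of one. */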

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes, as sketched below. Note, however, that we still
   have to call write_cache_pages() again in SYNC mode because of an
   edge case documented in the implementation of write_cache_pages()
   whereby it will not give us all dirty pages when running in non-SYNC
   mode. Thus we need to run it at least once in SYNC mode to make sure
   we honor persistence guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.
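
A rough sketch of the resulting zpl_writepages() logic (simplified; see the
actual commit for the real code):

    static int
    zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
    {
            znode_t *zp = ITOZ(mapping->host);
            zfs_sb_t *zsb = ITOZSB(mapping->host);
            enum writeback_sync_modes sync_mode = wbc->sync_mode;
            int result;

            /* First pass in non-SYNC mode: zpl_putpage() queues the writes
             * without committing the ZIL or blocking on writeback. */
            wbc->sync_mode = WB_SYNC_NONE;
            result = write_cache_pages(mapping, wbc, zpl_putpage, mapping);

            if (sync_mode != wbc->sync_mode) {
                    /* One ZIL commit for the whole batch. */
                    zil_commit(zsb->z_log, zp->z_id);

                    /* Second pass in SYNC mode to catch pages dirtied
                     * concurrently and to wait on writeback; usually a no-op. */
                    wbc->sync_mode = sync_mode;
                    result = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
            }

            return (result);
    }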

This change fixes openzfs#1849.

Note that this change also fixes a bug related to openzfs#907 whereby calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
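
For illustration, the consumer side of the new callback looks roughly like
this (simplified from the patch; treat the snippet as a sketch rather than
the verbatim code):

    /* Runs once the write is on stable storage, whether it got there
     * through a ZIL commit or through a DMU txg sync. */
    static void
    zfs_putpage_commit_cb(void *arg)
    {
            struct page *pp = arg;
            end_page_writeback(pp);     /* clear the writeback tag */
    }

    /* In zfs_putpage(), when building the log record: */
    itx->itx_callback = zfs_putpage_commit_cb;
    itx->itx_callback_data = pp;
    zil_itx_assign(zilog, itx, tx);
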
@ryao
Contributor

ryao commented Nov 11, 2013

@dechamps How did you test this? I have also written a patch that implements this idea, but I have not tested it yet. My patch does away with write_cache_pages() by implementing the Illumos pvn_vplist_dirty() function in the SPL, using write_cache_pages() and find_get_pages() as guides. That would allow us to traverse the address space's radix tree once, rather than ceil(N/14) times, where N is the number of dirty pages. That would also give us the opportunity to avoid write_cache_pages()'s invocation of cond_resched() on each iteration. Unfortunately, it is not pull-request-ready at the moment due to, one, lack of testing and, two, dependence on very recent kernel features. The latter is easy to fix, but I need to address the former to know whether the potential efficiency gain is worth implementing pvn_vplist_dirty().
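
For reference, write_cache_pages() gathers dirty pages in batches of
PAGEVEC_SIZE (14 at the time), roughly like this (an illustrative sketch of
the stock kernel pattern, not my patch):

    struct pagevec pvec;
    pgoff_t index = 0;
    int i, nr;

    pagevec_init(&pvec, 0);
    /* Each lookup re-descends the address space's radix tree, hence
     * ceil(N/14) traversals for N dirty pages. */
    while ((nr = pagevec_lookup_tag(&pvec, mapping, &index,
        PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE))) {
            for (i = 0; i < nr; i++)
                    (*writepage)(pvec.pages[i], wbc, data);
            pagevec_release(&pvec);
            cond_resched();     /* the per-iteration call noted above */
    }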

@dechamps
Contributor Author

In terms of testing, I tested that it behaves correctly when msync() is called and also when writepages() is called in non-SYNC mode. That's about it. This patch has been running on my home server for a day now without any noticeable problems, though of course that doesn't mean much. Unfortunately, I don't have the resources that I had back in the day to do heavy automated testing.

If you want a performance test, then you should look at the steps to reproduce in #1849.

Regarding write_cache_pages(): I thought about re-implementing it as well, but decided otherwise for two reasons: (1) all things being equal I guess it's better to use kernel-provided functions and (2) I'm not very comfortable dealing with memory management code that I'm not familiar with. It's very easy to get wrong.

That being said, your solution looks technically superior; if you want to merge yours instead, then by all means, go ahead. However, I would prefer not to leave this unfixed for long because this msync() issue is very annoying when using mmap-heavy software such as aptitude.

@ryao
Contributor

ryao commented Nov 11, 2013

@dechamps I just wrote a small program for performance testing:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
        unsigned int size;
        char *data;
        int fd = -1;

        if (argc != 3) {
                printf("Usage: %s file size\n", *argv);
                exit(1);
        }

        sscanf(argv[2], "%u", &size);

        if ((fd = open(argv[1], O_RDWR, 0)) == -1) {
                printf("Cannot open %s\n", argv[1]);
                exit(1);
        }

        /* Grow the file to the requested size before mapping it. */
        if (ftruncate(fd, size) == -1) {
                perror("ftruncate");
                exit(1);
        }

        data = (char*)mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_FILE|MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /* Dirty every page of the mapping ... */
        memset(data, 0, size);

        /* ... then force synchronous writeback; this is the call to time. */
        msync(data, size, MS_SYNC);

        return EXIT_SUCCESS;
}

Do you have any objections to this method of testing?

@dechamps
Contributor Author

If you had actually looked at the steps to reproduce in #1849, you would have realized that you just re-implemented the exact same program :)

Actually, in real testing I did a little more than that: take a look at 9b5f1ce. You will need to do something similar if you're testing with fake vdevs (I was testing with in-RAM file vdevs) to simulate real-world ZIL performance.

@ryao
Contributor

ryao commented Nov 11, 2013

@dechamps I seem to be reimplementing plenty of stuff lately. Anyway, my patch seems to be performing poorly on small amounts of dirty data (although it performs very well on large amounts). I am investigating it. I should have something to report by tomorrow.

@dechamps
Contributor Author

Well, the performance of this piece of code is fairly straightforward: as long as you're calling zil_commit() only once per writepages() call, you're good. I don't think your radix tree traversal optimization will make any measurable difference in this regard: it's all dwarfed, probably by several orders of magnitude, by the time spent blocking on ZIL commit I/O. That doesn't mean it's not nice to have anyway.

@ryao
Contributor

ryao commented Nov 11, 2013

@dechamps It just dawned on me that it is annoying to talk about code without actually seeing it. Here is what I have applied to my system right now:

http://bpaste.net/show/148747/
http://bpaste.net/show/148748/

I am on my way to dinner, so I have not prepared a git branch for it yet. I plan to do performance comparisons between this, your patch and the unpatched code after dinner. The code is a little rough right now. Specifically, there are stale comments and the SPL code needs to be modified to support older kernels. However, it is in a state where I can do performance tests.

@dechamps
Contributor Author

I see. Well, in its current form your code isn't addressing #1849 at all; you're still calling zil_commit() for each page, so msync() performance is still atrocious. Furthermore, with regard to #1849 your pvn_vplist_dirty() function has the exact same problem as write_cache_pages(): it blocks on each page (wait_on_page_writeback()) inside the loop. However, since you're now in control of this function, the fix is easy: just remove the wait_on_page_writeback() call, and move your zil_commit() call to writepages().

There's still one problem, however: with the current ZoL code, there is another bug when writepages() is called in non-SYNC mode and then shortly after in SYNC mode - in this case writepages() will block until the next TXG sync instead of simply blocking on the ZIL commit. That's because when putpage() decides between registering a DMU callback or calling zil_commit(), it can't change its mind afterwards, so when a ZIL commit does happen, it ends up waiting for the DMU even though the ZIL already took care of it. Your solution still has that problem. My solution doesn't, as it makes the ZIL handle the callback instead of the DMU.
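
Roughly speaking, the current putpage() decision looks like this (a heavily
simplified, illustrative sketch, not the verbatim ZoL code):

    /* Sketch of the current zfs_putpage(): the writeback tag can only
     * be cleared by the DMU callback, which fires at the next txg sync. */
    dmu_tx_callback_register(tx, zfs_putpage_commit_cb, pp);
    dmu_tx_commit(tx);
    if (wbc->sync_mode != WB_SYNC_NONE)
            zil_commit(zsb->z_log, zp->z_id);
    /* A later SYNC writepages() that finds this page still tagged for
     * writeback blocks in wait_on_page_writeback() until the txg sync,
     * even though a ZIL commit may already have made the data durable. */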

In its current state, your patch is completely orthogonal to mine and seems to be solving an entirely different set of issues. I see two possible outcomes to this situation:

  • You end up finding a way to leverage your newly gained control in the form of pvn_vplist_dirty() to fix #1849 ("msync() is ~700 times slower than it should be on large writes") and the "DMU-then-ZIL" bug in a cleaner way than I did. In this case I'm fine with throwing away my solution in favor of yours;
  • Or you don't, in which case my patch and yours are complementary and can be merged independently from each other, which puts an end to this discussion.

@dechamps
Contributor Author

On second thought, there's a third "mixed" outcome: if you simply implement my suggestion to remove wait_on_page_writeback() from pvn_vplist_dirty(), then my patch can be slightly improved by not calling write_cache_pages() (or should I say pvn_vplist_dirty()) twice. Not that big of a deal, though.

@ryao
Contributor

ryao commented Nov 11, 2013

@dechamps I think you misread the patch. I restructured things so that the code for handling one page goes into zfs_putapage_impl(). Then I have zfs_putapage(), which is called by zpl_writepage(), and zfs_putpage(), which is called by zpl_writepages(). Only one zil_commit() is issued per call. If zil_commit() were called for each page, there is no way I would be able to write back 1 gigabyte of dirty data in 11 to 14 seconds.

Your changes to make the ZIL handle the callback are probably the reason for the discrepancy in numbers between our systems, but I still need to do testing. Something offline came up, so this will probably be delayed by a few hours.

With that said, Illumos appears to do something similar to wait_on_page_writeback() in pvn_vplist_dirty(). As far as our test case is concerned, this is not a problem because the pages are not marked as being in writeback until zfs_putapage() is called. The only concern I have is whether it is possible for one thread to block on writeback while holding a lock needed for writeback to happen. However, that does not seem to be a problem for Illumos, so I need to think a bit more about why that is.

@dechamps
Contributor Author

> I think you misread the patch.

Yes, indeed. The names are misleading. You should rename zfs_putpage() to zfs_putpages(), IMHO.

So you're indeed calling zil_commit() only once. There's one thing I still don't understand, however. In zfs_putapage_impl() in sync mode, you're not calling zil_commit() anymore (which is good), but that also means you can't call zfs_putpage_commit_cb(), and indeed you don't. You don't register it as a callback either. Which raises the question: when does zfs_putpage_commit_cb() get called? If I'm reading the code correctly, if zfs_putpage() is called without B_ASYNC set, then zfs_putpage_commit_cb() is never called nor registered as a callback. As a result, the writeback tag is never cleared, the pages stay in memory forever, and wait_on_page_writeback() deadlocks. There must be something I'm missing, because otherwise I'm not sure how your tests are even able to finish.

In fact, this callback problem is one of the reasons why I had to come up with my solution in the first place: if you call zfs_putpage_commit_cb() in zfs_putapage_impl() you're basically lying to the system because you're clearing the writeback tag before you do the ZIL commit; if you call it in writepages() you have to iterate over the pages twice (which would be okay, but…); and in both cases you end up with a "waiting for DMU instead of ZIL" issue if writepages() is called first in async mode and then in sync mode shortly after. The only solution I could come up with that solves everything is having a ZIL callback that gets called as soon as the data is written as part of a ZIL commit or as part of a DMU sync, whichever comes first.

Note that being able to "write back 1 gigabyte of dirty data in 11 to 14 seconds" does not prove anything: if your code happens to resurrect #907, then it would indeed complete in a few seconds, but then a 1-byte commit would take seconds (waiting for the next DMU sync instead of the ZIL commit) as well. You should definitely use my program for your testing as it will test small commits as well as large ones. If small commits take 1-5 seconds (i.e. they are waiting for the next DMU sync), then you have a problem.

To clarify, the correct solution needs to exhibit the following three characteristics at the same time:

  1. writepages(SYNC) should wait for the ZIL commit to complete, not the next DMU sync (#907, "msync() on MAP_SHARED mappings is very slow"; small-commit performance);
  2. writepages(SYNC) should call zil_commit() once, not once per page (#1849, "msync() is ~700 times slower than it should be on large writes"; large-commit performance);
  3. If writepages(ASYNC) is called and then writepages(SYNC) is called shortly thereafter, then (1) should still hold for the second call (the "DMU-then-ZIL" issue).

Here's where we are right now:

  • The current ZoL code honors (1), but not (2) nor (3).
  • My solution honors (1), (2) and (3).
  • Your solution honors (2). I am not sure about (1) and (3), because I don't understand how your code could possibly work without ever calling zfs_putpage_commit_cb() in the zfs_putpage(SYNC) case. This may be related to my poor understanding of how memory management, writeback algorithms, page tagging, etc. work.

@dechamps
Contributor Author

Also, note that my test program tests for (1) and (2), but not (3). I'm not sure (3) can be tested with a purely userspace test because userspace has no control over when writepages(ASYNC) is called; however it is easily deduced just by looking at the code, I think.

@dechamps
Contributor Author

Oh, well, looking at the msync() manpage, maybe writepages(ASYNC) can be tested just by calling msync(MS_ASYNC). In that case testing (3) would just be a matter of doing this:

msync(data, size, MS_ASYNC);
msync(data, size, MS_SYNC);
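
For example, the earlier test program could be extended along these lines (a
hypothetical sketch; only the two msync() calls above are prescribed):

    #include <time.h>

    struct timespec t0, t1;

    msync(data, size, MS_ASYNC);    /* intended to trigger writepages(ASYNC) */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    msync(data, size, MS_SYNC);     /* should only wait on the ZIL commit */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("MS_SYNC after MS_ASYNC: %.3f seconds\n",
        (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    /* Seconds here would indicate waiting on the next DMU sync (the
     * "DMU-then-ZIL" bug); milliseconds indicate a plain ZIL commit. */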

@ryao
Contributor

ryao commented Nov 12, 2013

@dechamps I am going to withdraw my approach in favor of yours. My approach is unsafe because of how RCU is done:

[ 2032.832053] BUG: sleeping function called from invalid context at /usr/src/linux-3.12.0/include/linux/pagemap.h:350
[ 2032.832056] in_atomic(): 0, irqs_disabled(): 0, pid: 5984, name: msync_test
[ 2032.832059] CPU: 5 PID: 5984 Comm: msync_test Tainted: P           O 3.12.0 #2
[ 2032.832061] Hardware name: Supermicro X9SRA/X9SRA-3/X9SRA/X9SRA-3, BIOS 3.0a 08/08/2013
[ 2032.832062]  0000000000000400 ffff880e9c3cfd78 ffffffff81451670 ffff880ee7dd3318
[ 2032.832066]  ffff880e9c3cfd88 ffffffff810704cf ffff880e9c3cfe08 ffffffffa017ed97
[ 2032.832069]  ffff880ee7dd3320 ffff880edfc11f00 0000040081454559 ffffffffa03a71c0
[ 2032.832072] Call Trace:
[ 2032.832079]  [<ffffffff81451670>] dump_stack+0x4e/0x82
[ 2032.832083]  [<ffffffff810704cf>] __might_sleep+0xdf/0x100
[ 2032.832095]  [<ffffffffa017ed97>] pvn_vplist_dirty+0x127/0x270 [spl]
[ 2032.832108]  [<ffffffffa03a71c0>] ? zfs_link+0x480/0x7f0 [zfs]
[ 2032.832120]  [<ffffffffa03aad84>] zfs_putpage+0x94/0x1a0 [zfs]
[ 2032.832130]  [<ffffffffa03bc25c>] zpl_snapdir_rename+0xe1c/0x1470 [zfs]
[ 2032.832134]  [<ffffffff81100cdc>] do_writepages+0x1c/0x40
[ 2032.832137]  [<ffffffff810f76b1>] __filemap_fdatawrite_range+0x51/0x60
[ 2032.832140]  [<ffffffff810f7798>] filemap_write_and_wait_range+0x28/0x70
[ 2032.832148]  [<ffffffffa03bc30f>] zpl_snapdir_rename+0xecf/0x1470 [zfs]
[ 2032.832151]  [<ffffffff8116d126>] vfs_fsync+0x26/0x40
[ 2032.832154]  [<ffffffff81124f7e>] SyS_msync+0x12e/0x1c0
[ 2032.832157]  [<ffffffff8145d8db>] tracesys+0xdd/0xe2

@behlendorf
Contributor

@dechamps @ryao I haven't had a chance to review this patch yet, but a quick comment about testing. Xfstests 074 is an absolute torture test for mmap() and suffers from this exact msync(2) performance issue. It mixes mmap'ed IO and sys_read/sys_write IO to overlapping regions of a file and then guarantees the result is always consistent. My suggestion would be to run it for a few hours and see if it catches anything.

You can find a modified version of xfstests which is zfs-friendly in my https://github.com/behlendorf/xfstests github repository. See the top-level README.markdown for a basic quick start in the zfs-upstream patch. This work still needs to be pushed upstream to the xfstests project, but I wanted to get everything passing cleanly on zfs first. There are still some false positives.

@dechamps
Contributor Author

@behlendorf: I did what you asked (I had to remove the acltype=posixacl option because it seems it doesn't exist in 0.6.2). The results are at https://gist.github.com/dechamps/7471538. Unfortunately it did not run to completion because it stayed blocked on generic/317 for more than 8 hours, so I killed it. Here are some kernel messages from when generic/317 started running:

INFO: task fsstress:23766 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fsstress        D ffff880361d05420     0
 ffff880361d05100 0000000000000082 0000000000014300 ffff880110fbdfd8
 0000000000014300 ffff880110fbdfd8 ffff880120160f68 ffff880110fbdda8
 ffff880120160f6c ffff880361d05100 00000000ffffffff ffff880120160f70
Call Trace:
 [<ffffffff81479130>] ? schedule_preempt_disabled+0x20/0x60
 [<ffffffff8147747b>] ? __mutex_lock_slowpath+0x13b/0x1b0
 [<ffffffff81476956>] ? mutex_lock+0x16/0x25
 [<ffffffff81173b87>] ? lock_rename+0x27/0xe0
 [<ffffffff81177591>] ? SYSC_renameat+0x221/0x470
 [<ffffffffa013b71e>] ? tsd_hash_search.isra.1+0x8e/0x1b0 [spl]
 [<ffffffffa013b71e>] ? tsd_hash_search.isra.1+0x8e/0x1b0 [spl]
 [<ffffffffa028779f>] ? zfs_close+0x6f/0xd0 [zfs]
 [<ffffffff8117eec4>] ? dput.part.15+0x24/0x2b0
 [<ffffffff8107510c>] ? task_work_run+0x9c/0xd0
 [<ffffffff81012939>] ? do_notify_resume+0x59/0x90
 [<ffffffff81481c29>] ? system_call_fastpath+0x16/0x1b
device-mapper: table: 254:0: flakey: Device lookup failed
device-mapper: ioctl: error adding target to table

Considering the stack trace, it seems unlikely that this has anything to do with my patch. Aside from that, everything seemed fine; nothing scary in the kernel output.

Honestly, I think you are better equipped than me to test this change. I have to make do with lowly home server resources here :/

@behlendorf
Contributor

@dechamps Thanks for kicking off the testing; I agree that's probably unrelated. I had a chance to launch some testing as well with this change applied to master. To my surprise, with this change applied I'm now able to hit #1536 pretty easily. I haven't had a chance to look into why yet. Apparently the FreeBSD guys have effectively disabled this assertion in their code, but I've refrained since the assertion looks right to me. Or at least no one has yet been able to successfully explain to me why it's OK for i_count to be zero.

Leaving that test failure aside, I really like this patch. This has been something of a known problem for a while now, but it never got any attention because it wasn't clear to me how to cleanly restructure the code to batch the ZIL commits. I think adding a callback for the ITX records was a nice generic way to handle this, and it was also clever to tweak writepages() to make the individual page updates async. I still need to spend some quality time with this patch to make sure all the cases are covered, but on initial inspection I like the approach. Thanks!

@behlendorf
Contributor

Good (or bad) news, depending on your point of view. The recent fix for the HAVE_BDI autoconf regression, fd23663, seems to be preventing me from hitting #1536, which I encountered during my first round of testing. That may make it harder to resolve #1536 and #1417, but fixing those shouldn't block merging this patch.

@dweeezil @ryao I'd really like to get some additional eyes on this change before merging it; if you have a little time, could you please review it? I'm going to go over it again carefully myself.

@behlendorf
Contributor

@dechamps OK merged to master, thanks for making this improvement.

@behlendorf behlendorf closed this Nov 24, 2013
@dweeezil
Contributor

@behlendorf I had been looking at this patch and it looks good to me. Apparently it layered on top of current master rather cleanly, too, since I see it has been committed now. My only questions relate to the new itx callback facility: first, at least in the context of ZFS proper (I'm not sure about Lustre and other potential consumers), it seems the old dmu_tx_callback_register() facility is now only used by the ztest program. Does it need to be kept around? Second, and related, should ztest be extended to test the new itx callback facility?

Those are my only comments. @dechamps, thanks for this work. This fixes a sore spot that I'm amazed didn't get a lot more complaints.

@behlendorf
Contributor

@dweeezil Lustre does still use the existing callback interface, so we need to keep it around. However, we're definitely going to evaluate whether the new callback makes more sense for Lustre. We should add a test case for this in ztest, but I didn't want to block this work being merged on that. Thanks for the additional review!
