msync() on MAP_SHARED mappings is very slow #907

dechamps · 2012-08-28T15:43:01Z

I wrote a very simple program (msync.c) that simply opens a 512-byte file, mmap() it with the MAP_SHARED flag, write the contents, then msync() it. In theory this should simply trigger a 512-byte ZIL commit, which should be quite fast (especially considering I have a SSD slog). In practice this takes seconds, which suggests that ZFS might be waiting for the next TXG sync for no good reason.

How to reproduce (tested on rc10 with Linux 3.2):

$ wget https://raw.github.com/gist/3499096/7a15994c7d31237f7c5505f65b8e08d9b7810a68/msync.c
$ gcc -std=c99 -Wall -Wextra msync.c -o msync
$ dd if=/dev/zero of=/homez/test bs=512 count=1
$ time ./msync /homez/test

real 0m4.155s
user 0m0.004s
sys  0m0.000s

Note that the issue does not occur when the MAP_PRIVATE flag is used instead.

This is an annoying issue for those using Debian on ZFS root, because aptitude uses mmap(MAP_SHARED) then msync(), so any aptitude run has to wait for more or less five seconds before completing. The frustration made me investigate and isolate this bug.

The text was updated successfully, but these errors were encountered:

maxximino · 2012-08-28T15:48:04Z

There is another aptitude-only bug when running on ZoL: #541. Maybe there is a connection between them?

behlendorf · 2012-08-28T17:07:10Z

@dechamps Your right, this should simply result in a zil_commit() which I'd expect to be fairly fast. When you use the VM_SHARED flag msync() will basically just call fsync() for the entire rage of pages in the file. It would be interesting to profile the msync call with ftrace or systemtap and see where the time is going for this call path.

sys_msync()
  vfs_fsync()
    vfs_fsync_range()
      zpl_fsync()
        zfs_fsync()
          zil_commit()

ryao · 2012-08-28T21:47:44Z

I haven't looked at the code to be certain, but this reminds me of the logbias setting on the dataset. The default setting is throughput, which would imply that ZFS attempts to do some kind of coalescing.

dechamps · 2012-08-28T22:24:42Z

Well, here's what logbias does: when it is set to latency (the default), the slog is used (if available) and synchronous writes are done either in indirect mode (if the write is aligned, larger than 32K and there is no slog), or in immediate mode. When logbias is set to throughput, the slog is never used and all aligned writes are done in indirect mode, while unaligned writes are done in immediate mode.

I'm not sure how this relates to the issue at hand, however. In addition, logbias=throughput (which forces indirect mode) is the exact opposite of my situation, i.e. logbias=latency with a slog (which forces immediate mode).

GregorKopka · 2012-08-29T10:29:02Z

This sounds like it would explain why samba stalls massively when some of its .tdb files are on a zfs dataset.
#700

mrobbetts · 2012-08-29T14:02:57Z

Hmmmm, is there any chance this would affect NFS shares (stored on ZFS) which are accessed synchronously? I'm having some trouble with those at the moment with a diskless system using an NFS share as its rootfs.

I'll try exporting it asynchronously when I'm at home...

dechamps · 2012-08-29T15:58:52Z

Hmmmm, is there any chance this would affect NFS shares (stored on ZFS) which are accessed synchronously? I'm having some trouble with those at the moment with a diskless system using an NFS share as its rootfs.

Maybe. AFAIK, NFS uses the same interfaces as mmap(), so that would make sense.

I just did some investigating. zpl_fsync() is indeed the call that takes several seconds to complete. However, it seems that almost all the time is not spent in zfs_fsync(), but in filemap_write_and_wait_range(), which is called as part of zpl_fsync() on 3.1-3.x kernels. filemap_write_and_wait_range() calls zpl_writepages() which completes quickly, but it also calls filemap_fdatawait_range() which calls wait_on_page_writeback() which I think may be the culprit.

Here's my hypothesis: all the time is spent on wait_on_page_writeback() sitting around waiting for the writeback bit to be set on the page. If ZFS only sets this bit when it syncs the TXG, then that would explain the delay. That's just a wild guess, however.

dechamps · 2012-08-29T16:21:54Z

Just fired up ftrace, which confirms that all the time is indeed spent in wait_on_page_writeback():

 5) ! 1086473 us |    } /* wait_on_page_bit */
 5) ! 1086483 us |  } /* filemap_fdatawait_range */

When writing via ->writepage() the writeback bit was always cleared as part of the txg commit callback. However, when the I/O is being done synchronsously to a slog device we can safely run the callback immediately after zil_commit(). This will significantly reduce the delay for calls such as fsync() and msync() which block until the data is safe on disk. Issue openzfs#700 Issue openzfs#907

behlendorf · 2012-08-29T19:39:32Z

@dechamps: Ahh, that's what I needed to know. I see what's going on and your basically correct.

When a dirty page gets written though zpl_writepages() it marks the page in writeback. This is used to inform the VM that the dirty page is actively being written to the disk but is not yet safe on stable storage. We then register a commit callback with the tx to be called when the data is on disk. This callback is responsible for clearing the writeback bit which is what wait_on_page_writeback() is watching for.

The problem is that the commit callbacks will run several seconds latter as part of the txg sync. This is correct but we can do far better when there's a slog device. The following patch moves the execution of the commit callback immediately after the zil_commit() for a synchronous write. Otherwise we default to the previous behavior.

@dechamps can you please review this change. It significantly helps with your test case in my VM. We could probably further optimize with some additional work by bringing the zil_commit() up in to zpl_writepages() but that would be more disruptive and require larger changes.

$ gcc -std=c99 -Wall -Wextra msync.c -o msync
$ dd if=/dev/zero of=/tank/test/mmap bs=512 count=1
$ time ./msync /tank/tank/mmap 

real    0m0.269s
user    0m0.000s

dechamps · 2012-08-29T19:58:45Z

That's what I thought indeed, thanks for fixing this. I just have one objection:

The problem is that the commit callbacks will run several seconds latter as part of the txg sync. This is correct but we can do far better when there's a slog device.

Actually, we can do far better even if there isn't a slog device. Using the ZIL is much, much faster than waiting for the next TXG even if there's no slog. Commiting the ZIL: typically 10 ms on unloaded spindles. Waiting for the next TXG: up to 5 seconds. That's a 500x slowdown… My point is, this patch also benefits pools that don't have a slog. You should probably fix that in your commit message as well.

We could probably further optimize with some additional work by bringing the zil_commit() up in to zpl_writepages() but that would be more disruptive and require larger changes.

I agree.

As for the patch, I'll test it tomorrow. I'm sure it's fine, though.

behlendorf · 2012-08-29T21:32:47Z

You right your, the commit message isn't very clear. I'll make mention this is the right thing to do even if you don't have a dedicate log device and are just using the primary pool for the zil.

dechamps · 2012-08-30T08:17:44Z

Yep, that fixes it:

$ time /root/msync /homez/test 

real    0m0.003s
user    0m0.000s
sys 0m0.000s

1000x speedup. w00t!

Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete.

Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem.

Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete.

Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <[email protected]> Conflicts: module/zfs/zfs_log.c

Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to #907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1849 Closes #907

Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#1849 Closes openzfs#907

…fs#907) After changing zdb to use --bucket rather than -B, not all of the places where -B was used in the source were caught. This takes care of the missing instances.

behlendorf mentioned this issue Aug 29, 2012

Clear PG_writeback after zil_commit() for sync I/O #912

Closed

behlendorf closed this as completed in 2b28613 Aug 31, 2012

dechamps mentioned this issue Sep 12, 2012

msync stalls with current master (cda4db408...) #958

Closed

dechamps mentioned this issue Feb 24, 2013

Remove the bio_empty_barrier() check #1318

Closed

dechamps mentioned this issue Nov 9, 2013

msync() is ~700 times slower than it should be on large writes #1849

Closed

dechamps mentioned this issue Nov 10, 2013

Only commit the ZIL once in zpl_writepages() (msync() case) #1853

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

msync() on MAP_SHARED mappings is very slow #907

msync() on MAP_SHARED mappings is very slow #907

dechamps commented Aug 28, 2012

maxximino commented Aug 28, 2012

behlendorf commented Aug 28, 2012

ryao commented Aug 28, 2012

dechamps commented Aug 28, 2012

GregorKopka commented Aug 29, 2012

mrobbetts commented Aug 29, 2012

dechamps commented Aug 29, 2012

dechamps commented Aug 29, 2012

behlendorf commented Aug 29, 2012

dechamps commented Aug 29, 2012

behlendorf commented Aug 29, 2012

dechamps commented Aug 30, 2012

msync() on MAP_SHARED mappings is very slow #907

msync() on MAP_SHARED mappings is very slow #907

Comments

dechamps commented Aug 28, 2012

maxximino commented Aug 28, 2012

behlendorf commented Aug 28, 2012

ryao commented Aug 28, 2012

dechamps commented Aug 28, 2012

GregorKopka commented Aug 29, 2012

mrobbetts commented Aug 29, 2012

dechamps commented Aug 29, 2012

dechamps commented Aug 29, 2012

behlendorf commented Aug 29, 2012

dechamps commented Aug 29, 2012

behlendorf commented Aug 29, 2012

dechamps commented Aug 30, 2012