Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msync() on MAP_SHARED mappings is very slow #907

Closed
dechamps opened this issue Aug 28, 2012 · 12 comments
Closed

msync() on MAP_SHARED mappings is very slow #907

dechamps opened this issue Aug 28, 2012 · 12 comments
Milestone

Comments

@dechamps
Copy link
Contributor

I wrote a very simple program (msync.c) that simply opens a 512-byte file, mmap() it with the MAP_SHARED flag, write the contents, then msync() it. In theory this should simply trigger a 512-byte ZIL commit, which should be quite fast (especially considering I have a SSD slog). In practice this takes seconds, which suggests that ZFS might be waiting for the next TXG sync for no good reason.

How to reproduce (tested on rc10 with Linux 3.2):

$ wget https://raw.github.com/gist/3499096/7a15994c7d31237f7c5505f65b8e08d9b7810a68/msync.c
$ gcc -std=c99 -Wall -Wextra msync.c -o msync
$ dd if=/dev/zero of=/homez/test bs=512 count=1
$ time ./msync /homez/test

real 0m4.155s
user 0m0.004s
sys  0m0.000s

Note that the issue does not occur when the MAP_PRIVATE flag is used instead.

This is an annoying issue for those using Debian on ZFS root, because aptitude uses mmap(MAP_SHARED) then msync(), so any aptitude run has to wait for more or less five seconds before completing. The frustration made me investigate and isolate this bug.

@maxximino
Copy link
Contributor

There is another aptitude-only bug when running on ZoL: #541. Maybe there is a connection between them?

@behlendorf
Copy link
Contributor

@dechamps Your right, this should simply result in a zil_commit() which I'd expect to be fairly fast. When you use the VM_SHARED flag msync() will basically just call fsync() for the entire rage of pages in the file. It would be interesting to profile the msync call with ftrace or systemtap and see where the time is going for this call path.

sys_msync()
  vfs_fsync()
    vfs_fsync_range()
      zpl_fsync()
        zfs_fsync()
          zil_commit()

@ryao
Copy link
Contributor

ryao commented Aug 28, 2012

I haven't looked at the code to be certain, but this reminds me of the logbias setting on the dataset. The default setting is throughput, which would imply that ZFS attempts to do some kind of coalescing.

@dechamps
Copy link
Contributor Author

Well, here's what logbias does: when it is set to latency (the default), the slog is used (if available) and synchronous writes are done either in indirect mode (if the write is aligned, larger than 32K and there is no slog), or in immediate mode. When logbias is set to throughput, the slog is never used and all aligned writes are done in indirect mode, while unaligned writes are done in immediate mode.

I'm not sure how this relates to the issue at hand, however. In addition, logbias=throughput (which forces indirect mode) is the exact opposite of my situation, i.e. logbias=latency with a slog (which forces immediate mode).

@GregorKopka
Copy link
Contributor

This sounds like it would explain why samba stalls massively when some of its .tdb files are on a zfs dataset.
#700

@mrobbetts
Copy link

Hmmmm, is there any chance this would affect NFS shares (stored on ZFS) which are accessed synchronously? I'm having some trouble with those at the moment with a diskless system using an NFS share as its rootfs.

I'll try exporting it asynchronously when I'm at home...

@dechamps
Copy link
Contributor Author

Hmmmm, is there any chance this would affect NFS shares (stored on ZFS) which are accessed synchronously? I'm having some trouble with those at the moment with a diskless system using an NFS share as its rootfs.

Maybe. AFAIK, NFS uses the same interfaces as mmap(), so that would make sense.

I just did some investigating. zpl_fsync() is indeed the call that takes several seconds to complete. However, it seems that almost all the time is not spent in zfs_fsync(), but in filemap_write_and_wait_range(), which is called as part of zpl_fsync() on 3.1-3.x kernels. filemap_write_and_wait_range() calls zpl_writepages() which completes quickly, but it also calls filemap_fdatawait_range() which calls wait_on_page_writeback() which I think may be the culprit.

Here's my hypothesis: all the time is spent on wait_on_page_writeback() sitting around waiting for the writeback bit to be set on the page. If ZFS only sets this bit when it syncs the TXG, then that would explain the delay. That's just a wild guess, however.

@dechamps
Copy link
Contributor Author

Just fired up ftrace, which confirms that all the time is indeed spent in wait_on_page_writeback():

 5) ! 1086473 us |    } /* wait_on_page_bit */
 5) ! 1086483 us |  } /* filemap_fdatawait_range */

behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 29, 2012
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback.  However, when the I/O is being
done synchronsously to a slog device we can safely run the callback
immediately after zil_commit().

This will significantly reduce the delay for calls such as fsync()
and msync() which block until the data is safe on disk.

Issue openzfs#700
Issue openzfs#907
@behlendorf
Copy link
Contributor

@dechamps: Ahh, that's what I needed to know. I see what's going on and your basically correct.

When a dirty page gets written though zpl_writepages() it marks the page in writeback. This is used to inform the VM that the dirty page is actively being written to the disk but is not yet safe on stable storage. We then register a commit callback with the tx to be called when the data is on disk. This callback is responsible for clearing the writeback bit which is what wait_on_page_writeback() is watching for.

The problem is that the commit callbacks will run several seconds latter as part of the txg sync. This is correct but we can do far better when there's a slog device. The following patch moves the execution of the commit callback immediately after the zil_commit() for a synchronous write. Otherwise we default to the previous behavior.

@dechamps can you please review this change. It significantly helps with your test case in my VM. We could probably further optimize with some additional work by bringing the zil_commit() up in to zpl_writepages() but that would be more disruptive and require larger changes.

$ gcc -std=c99 -Wall -Wextra msync.c -o msync
$ dd if=/dev/zero of=/tank/test/mmap bs=512 count=1
$ time ./msync /tank/tank/mmap 

real    0m0.269s
user    0m0.000s

@dechamps
Copy link
Contributor Author

That's what I thought indeed, thanks for fixing this. I just have one objection:

The problem is that the commit callbacks will run several seconds latter as part of the txg sync. This is correct but we can do far better when there's a slog device.

Actually, we can do far better even if there isn't a slog device. Using the ZIL is much, much faster than waiting for the next TXG even if there's no slog. Commiting the ZIL: typically 10 ms on unloaded spindles. Waiting for the next TXG: up to 5 seconds. That's a 500x slowdown… My point is, this patch also benefits pools that don't have a slog. You should probably fix that in your commit message as well.

We could probably further optimize with some additional work by bringing the zil_commit() up in to zpl_writepages() but that would be more disruptive and require larger changes.

I agree.

As for the patch, I'll test it tomorrow. I'm sure it's fine, though.

@behlendorf
Copy link
Contributor

You right your, the commit message isn't very clear. I'll make mention this is the right thing to do even if you don't have a dedicate log device and are just using the primary pool for the zil.

@dechamps
Copy link
Contributor Author

Yep, that fixes it:

$ time /root/msync /homez/test 

real    0m0.003s
user    0m0.000s
sys 0m0.000s

1000x speedup. w00t!

dechamps added a commit to dechamps/zfs that referenced this issue Nov 10, 2013
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

This change fixes openzfs#1849.

Note that this change also fixes a bug related to openzfs#907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete.
dechamps added a commit to dechamps/zfs that referenced this issue Nov 10, 2013
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

This change fixes openzfs#1849.

Note that this change also fixes a bug related to openzfs#907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
dechamps added a commit to dechamps/zfs that referenced this issue Nov 11, 2013
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

This change fixes openzfs#1849.

Note that this change also fixes a bug related to openzfs#907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete.
ryao pushed a commit to ryao/zfs that referenced this issue Nov 21, 2013
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

This change fixes openzfs#1849.

Note that this change also fixes a bug related to openzfs#907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.

Signed-off-by: Richard Yao <[email protected]>

Conflicts:
	module/zfs/zfs_log.c
behlendorf pushed a commit that referenced this issue Nov 24, 2013
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1849
Closes #907
unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

Note that this change also fixes a bug related to openzfs#907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#1849
Closes openzfs#907
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023
…fs#907)

After changing zdb to use --bucket rather than -B, not all of the
places where -B was used in the source were caught. This takes care
of the missing instances.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants