-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
msync() on MAP_SHARED mappings is very slow #907
Comments
There is another aptitude-only bug when running on ZoL: #541. Maybe there is a connection between them? |
@dechamps Your right, this should simply result in a zil_commit() which I'd expect to be fairly fast. When you use the VM_SHARED flag msync() will basically just call fsync() for the entire rage of pages in the file. It would be interesting to profile the msync call with ftrace or systemtap and see where the time is going for this call path. sys_msync() vfs_fsync() vfs_fsync_range() zpl_fsync() zfs_fsync() zil_commit() |
I haven't looked at the code to be certain, but this reminds me of the logbias setting on the dataset. The default setting is throughput, which would imply that ZFS attempts to do some kind of coalescing. |
Well, here's what I'm not sure how this relates to the issue at hand, however. In addition, |
This sounds like it would explain why samba stalls massively when some of its .tdb files are on a zfs dataset. |
Hmmmm, is there any chance this would affect NFS shares (stored on ZFS) which are accessed synchronously? I'm having some trouble with those at the moment with a diskless system using an NFS share as its rootfs. I'll try exporting it asynchronously when I'm at home... |
Maybe. AFAIK, NFS uses the same interfaces as I just did some investigating. Here's my hypothesis: all the time is spent on |
Just fired up
|
When writing via ->writepage() the writeback bit was always cleared as part of the txg commit callback. However, when the I/O is being done synchronsously to a slog device we can safely run the callback immediately after zil_commit(). This will significantly reduce the delay for calls such as fsync() and msync() which block until the data is safe on disk. Issue openzfs#700 Issue openzfs#907
@dechamps: Ahh, that's what I needed to know. I see what's going on and your basically correct. When a dirty page gets written though zpl_writepages() it marks the page in writeback. This is used to inform the VM that the dirty page is actively being written to the disk but is not yet safe on stable storage. We then register a commit callback with the tx to be called when the data is on disk. This callback is responsible for clearing the writeback bit which is what The problem is that the commit callbacks will run several seconds latter as part of the txg sync. This is correct but we can do far better when there's a slog device. The following patch moves the execution of the commit callback immediately after the @dechamps can you please review this change. It significantly helps with your test case in my VM. We could probably further optimize with some additional work by bringing the $ gcc -std=c99 -Wall -Wextra msync.c -o msync $ dd if=/dev/zero of=/tank/test/mmap bs=512 count=1 $ time ./msync /tank/tank/mmap real 0m0.269s user 0m0.000s |
That's what I thought indeed, thanks for fixing this. I just have one objection:
Actually, we can do far better even if there isn't a slog device. Using the ZIL is much, much faster than waiting for the next TXG even if there's no slog. Commiting the ZIL: typically 10 ms on unloaded spindles. Waiting for the next TXG: up to 5 seconds. That's a 500x slowdown… My point is, this patch also benefits pools that don't have a slog. You should probably fix that in your commit message as well.
I agree. As for the patch, I'll test it tomorrow. I'm sure it's fine, though. |
You right your, the commit message isn't very clear. I'll make mention this is the right thing to do even if you don't have a dedicate log device and are just using the primary pool for the zil. |
Yep, that fixes it:
1000x speedup. w00t! |
Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete.
Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem.
Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete.
Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. This change fixes openzfs#1849. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <[email protected]> Conflicts: module/zfs/zfs_log.c
Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to #907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1849 Closes #907
Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to openzfs#907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#1849 Closes openzfs#907
…fs#907) After changing zdb to use --bucket rather than -B, not all of the places where -B was used in the source were caught. This takes care of the missing instances.
I wrote a very simple program (msync.c) that simply opens a 512-byte file,
mmap()
it with theMAP_SHARED
flag, write the contents, thenmsync()
it. In theory this should simply trigger a 512-byte ZIL commit, which should be quite fast (especially considering I have a SSD slog). In practice this takes seconds, which suggests that ZFS might be waiting for the next TXG sync for no good reason.How to reproduce (tested on rc10 with Linux 3.2):
Note that the issue does not occur when the
MAP_PRIVATE
flag is used instead.This is an annoying issue for those using Debian on ZFS root, because aptitude uses
mmap(MAP_SHARED)
thenmsync()
, so any aptitude run has to wait for more or less five seconds before completing. The frustration made me investigate and isolate this bug.The text was updated successfully, but these errors were encountered: