Fix vdev_queue_aggregate() deadlock #4111
Conversation
@behlendorf
@tuxoko the spl_kmem_cache threads don't have PF_FSTRANS set, so they're allowed to perform direct reclaim if needed to satisfy the allocation. When memory is low this may be desirable if asynchronous reclaim isn't keeping up. As long as we have control over where we're blocking in the ZFS code this is OK.
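For context, the pattern being referred to is the one zio_execute() relies on: a thread brackets its work with the SPL's spl_fstrans_mark()/spl_fstrans_unmark() helpers so that allocations it makes cannot recurse into filesystem reclaim. A minimal sketch of that pattern, assuming the usual SPL kmem API (the wrapper function below is purely illustrative):

#include <sys/kmem.h>   /* spl_fstrans_mark(), kmem_alloc(), KM_SLEEP */

/* Hypothetical helper, for illustration only. */
static void
pipeline_stage_sketch(size_t size)
{
        fstrans_cookie_t cookie;
        void *buf;

        cookie = spl_fstrans_mark();      /* sets PF_FSTRANS on this thread */
        buf = kmem_alloc(size, KM_SLEEP); /* reclaim may run, but not re-enter the FS */
        /* ... do the stage's work with buf ... */
        kmem_free(buf, size);
        spl_fstrans_unmark(cookie);       /* restore the previous state */
}

Threads that never call spl_fstrans_mark(), such as the spl_kmem_cache workers discussed here, keep the default behaviour and may enter filesystem reclaim.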
@behlendorf That's a very interesting deadlock, and your fix certainly seems like it ought to fix the particular case you outlined. However, I'm wondering why the ... Also, it's not clear to me which part of the ...
This deadlock may manifest itself in slightly different ways but at the
core it is caused by a memory allocation blocking on filesystem reclaim
in the zio pipeline. This is normally impossible because zio_execute()
disables filesystem reclaim by setting PF_FSTRANS on the thread.
However, kmem cache allocations may still indirectly block on
filesystem reclaim while holding the critical vq->vq_lock as shown
below.

To resolve this issue zio_buf_alloc_flags() is introduced, which allows
allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the aggregate IO
buffer. Since aggregating the IO is purely a performance optimization
we want this to either succeed or fail quickly. Trying too hard to
allocate this memory under the vq->vq_lock can negatively impact
performance and result in this deadlock.

* z_wr_iss
    zio_vdev_io_start
    vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
    vdev_queue_aggregate
    zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
    zio_vdev_io_done
    vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
    spa_sync
    dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
    spl_cache_grow_work
    kv_alloc
    spl_vmalloc
    ...
    evict
    zpl_evict_inode
    zfs_inactive
    dmu_tx_wait
    txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3808
Issue openzfs#3867

Force-pushed from 4ca3eca to a95c2d3
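A minimal sketch of what the new helper could look like, assuming the existing zio_buf_cache[] array in zio.c indexed in SPA_MINBLOCKSHIFT units; the exact implementation in the patch may differ:

/* Like zio_buf_alloc(), but lets the caller choose the kmem flags
 * (e.g. KM_NOSLEEP).  Sketch only, based on the description above. */
void *
zio_buf_alloc_flags(size_t size, int flags)
{
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

        ASSERT3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

        return (kmem_cache_alloc(zio_buf_cache[c], flags));
}

With KM_NOSLEEP the cache allocation may return NULL instead of blocking, so the caller is expected to handle that case.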
It's possible there are a few other places lurking in the code where something similar can happen, maybe with the dbuf hash lock. I haven't carefully audited every call to ... I've updated the comment as you suggested to make this complicated deadlock clearer. I've also added a small additional patch which fixes the bounds checking for zfs_vdev_aggregation_limit, which I noticed was wrong while looking at this issue.
@behlendorf
@tuxoko interesting... that's what I get for not actually looking at this function. Thanks for pointing that out. So in fact it probably has PF_MEMALLOC_NOIO set for a new-ish kernel, and PF_MEMALLOC_NOIO is not exactly the same as PF_FSTRANS. It allows entry into the filesystem code, which is no good.
I've opened a pull request to get rid of PF_MEMALLOC_NOIO.
As a follow-on to my note over in the spl, I'll note that this is yet another case where kernel changes have come and gone and created breakage over large swaths of time. I'm trying to imagine ways of dealing with this. Personally, I like to look at the "config.log" files from time to time because they can sometimes uncover an autotools check that is clearly not doing the right thing (any more). For this type of issue, however, it would be very cool to have some sort of auditing infrastructure which could be used to, say, scrape the kernel repo between a set of commits (likely tagged versions) to find possible changes to the various interfaces we use. This would, of course, require someone to actually look at the output, and it also seems we'd have to maintain a list or develop a tool to show all the kernel interfaces used in the spl (and zfs as well). I wonder if any other similar out-of-kernel projects may already have some tools to help in this regard.
Update the bounds checking for zfs_vdev_aggregation_limit so that it
has a floor of zero and a maximum value of the supported block size for
the pool.

Additionally add an early return when zfs_vdev_aggregation_limit equals
zero to disable aggregation. For very fast solid state or memory
devices it may be more expensive to perform the aggregation than to
issue the IO immediately.

Signed-off-by: Brian Behlendorf <[email protected]>
Requires-spl: refs/pull/515/head

Force-pushed from a95c2d3 to da42296
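Roughly, the clamping and early return described above could look like the fragment below. This is an illustrative sketch rather than the verbatim diff; spa_maxblocksize() is assumed to report the largest block size the pool supports.

/* Sketch of the check near the top of vdev_queue_aggregate(). */
        uint64_t limit;

        limit = MAX(MIN(zfs_vdev_aggregation_limit,
            spa_maxblocksize(vq->vq_vdev->vdev_spa)), 0);

        if (limit == 0)
                return (NULL);  /* aggregation disabled; issue the IO as-is */

        /* ... aggregate neighbouring I/Os up to 'limit' bytes as before ... */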
@tuxoko good catch, that would definitely explain a few things. I've refreshed the top commit in this stack to depend on openzfs/spl#515 so we can run it through testing. We'll want to make all of these changes in master. @dweeezil I'm not aware of any tools for this kind of thing, although if they existed I'd love to use them. One way we could minimize the likelihood of this kind of thing in the future is to whittle down the number of kernel interfaces we depend on. That would at least reduce the surface area for potential issues. A good place to start would be to try and restrict spl/zfs to an enterprise kernel's supported whitelist of symbols. These are guaranteed stable interfaces. That should in theory at least prevent accidental breakage for RHEL/CentOS/SUSE.
Since everything tested out, I'll be merging this along with @tuxoko's fix.
On earlier versions of the kernel, memalloc_noio_save() only turns off
__GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #515
Issue openzfs/zfs#4111
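In other words, rather than relying on memalloc_noio_save() (which on 3.9-3.17 kernels only strips __GFP_IO), the allocation path can strip both flags itself whenever the calling thread has been marked with spl_fstrans_mark(). A rough sketch of that idea, with an illustrative helper name and assuming the SPL's spl_fstrans_check() predicate:

#include <linux/gfp.h>

/* Illustrative helper name; spl_fstrans_check() reports whether the
 * current thread set PF_FSTRANS via spl_fstrans_mark(). */
static inline gfp_t
spl_sanitize_gfp_sketch(gfp_t lflags)
{
        if (spl_fstrans_check())
                lflags &= ~(__GFP_IO | __GFP_FS);  /* no block I/O, no FS re-entry */

        return (lflags);
}

Because the mask is applied explicitly before calling into the page allocator, the behaviour no longer depends on which PF_MEMALLOC_* flags a given kernel version honours.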
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline. This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread. However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.
To resolve this issue zio_buf_alloc_flags() is introduced, which
allows allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer. Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly. Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.
* z_wr_iss
    zio_vdev_io_start
    vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
    vdev_queue_aggregate
    zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
    zio_vdev_io_done
    vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
    spa_sync
    dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
    spl_cache_grow_work
    kv_alloc
    spl_vmalloc
    ...
    evict
    zpl_evict_inode
    zfs_inactive
    dmu_tx_wait
    txg_wait_open -> Waiting on txg_sync
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3808
Issue #3867
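For reference, the call-site behaviour this commit describes can be pictured roughly as below. This is a hedged sketch, not the verbatim diff; the key points are that the buffer is requested with KM_NOSLEEP while vq->vq_lock is held and that a failed allocation simply skips aggregation.

/* Sketch of the allocation in vdev_queue_aggregate(). */
        ASSERT(MUTEX_HELD(&vq->vq_lock));

        buf = zio_buf_alloc_flags(size, KM_NOSLEEP);
        if (buf == NULL)
                return (NULL);  /* can't aggregate right now; issue the IO as-is */

        /* ... build the aggregate zio over 'buf' exactly as before ... */

Since the allocator is never allowed to block here, the z_wr_iss thread can no longer wait on the spl_kmem_cache worker while holding vq->vq_lock, which breaks the cycle shown in the stacks above.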