Fix vdev_queue_aggregate() deadlock #4111
Conversation
@behlendorf
@tuxoko the spl_kmem_cache threads don't have PF_FSTRANS set, so they're allowed to perform direct reclaim if needed to satisfy the allocation. When memory is low this may be desirable if asynchronous reclaim isn't keeping up. As long as we have control over where we're blocking in the ZFS code this is OK.
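For context, the pattern being referred to is the one zio_execute() relies on: a thread brackets its work with the SPL's spl_fstrans_mark()/spl_fstrans_unmark() helpers so that allocations it makes cannot recurse into filesystem reclaim. A minimal sketch of that pattern, assuming the usual SPL kmem API (the wrapper function below is purely illustrative):

#include <sys/kmem.h>   /* spl_fstrans_mark(), kmem_alloc(), KM_SLEEP */

/* Hypothetical helper, for illustration only. */
static void
pipeline_stage_sketch(size_t size)
{
        fstrans_cookie_t cookie;
        void *buf;

        cookie = spl_fstrans_mark();      /* sets PF_FSTRANS on this thread */
        buf = kmem_alloc(size, KM_SLEEP); /* reclaim may run, but not re-enter the FS */
        /* ... do the stage's work with buf ... */
        kmem_free(buf, size);
        spl_fstrans_unmark(cookie);       /* restore the previous state */
}

Threads that never call spl_fstrans_mark(), such as the spl_kmem_cache workers discussed here, keep the default behaviour and may enter filesystem reclaim.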
@behlendorf That's a very interesting deadlock, and your fix certainly seems like it ought to fix the particular case you outlined. However, I'm wondering why the ... Also, it's not clear to me which part of the ...
This deadlock may manifest itself in slightly different ways but at the
core it is caused by a memory allocation blocking on filesystem reclaim
in the zio pipeline. This is normally impossible because zio_execute()
disables filesystem reclaim by setting PF_FSTRANS on the thread.
However, kmem cache allocations may still indirectly block on
filesystem reclaim while holding the critical vq->vq_lock as shown
below.

To resolve this issue zio_buf_alloc_flags() is introduced, which allows
allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the aggregate IO
buffer. Since aggregating the IO is purely a performance optimization
we want this to either succeed or fail quickly. Trying too hard to
allocate this memory under the vq->vq_lock can negatively impact
performance and result in this deadlock.

* z_wr_iss
    zio_vdev_io_start
    vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
    vdev_queue_aggregate
    zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
    zio_vdev_io_done
    vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
    spa_sync
    dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
    spl_cache_grow_work
    kv_alloc
    spl_vmalloc
    ...
    evict
    zpl_evict_inode
    zfs_inactive
    dmu_tx_wait
    txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3808
Issue openzfs#3867

Force-pushed from 4ca3eca to a95c2d3
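A minimal sketch of what the new helper could look like, assuming the existing zio_buf_cache[] array in zio.c indexed in SPA_MINBLOCKSHIFT units; the exact implementation in the patch may differ:

/* Like zio_buf_alloc(), but lets the caller choose the kmem flags
 * (e.g. KM_NOSLEEP).  Sketch only, based on the description above. */
void *
zio_buf_alloc_flags(size_t size, int flags)
{
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

        ASSERT3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

        return (kmem_cache_alloc(zio_buf_cache[c], flags));
}

With KM_NOSLEEP the cache allocation may return NULL instead of blocking, so the caller is expected to handle that case.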
It's possible there are a few other places lurking in the code where something similar can happen, maybe with the dbuf hash lock. I haven't carefully audited every call to ... I've updated the comment as you suggested to make this complicated deadlock clearer. I've also added a small additional patch which fixes the bounds checking for zfs_vdev_aggregation_limit, which I noticed was wrong while looking at this issue.
@behlendorf
@tuxoko interesting... that's what I get for not actually looking at this function. Thanks for pointing that out. So in fact it probably has PF_MEMALLOC_NOIO set for a new-ish kernel, and PF_MEMALLOC_NOIO is not exactly the same as PF_FSTRANS. It allows entry into the filesystem code, which is no good.
I've opened a pull request to get rid of PF_MEMALLOC_NOIO.
As a follow-on to my note over in the spl, I'll note that this is yet another case where kernel changes have come and gone and created breakage over large swaths of time. I'm trying to imagine ways of dealing with this. Personally, I like to look at the "config.log" files from time to time because they can sometimes uncover an autotools check that is clearly not doing the right thing (any more). For this type of issue, however, it would be very cool to have some sort of auditing infrastructure which could be used to, say, scrape the kernel repo between a set of commits (likely tagged versions) to find possible changes to the various interfaces we use. This would, of course, require someone to actually look at the output, and it also seems we'd have to maintain a list or develop a tool to show all the kernel interfaces used in the spl (and zfs as well). I wonder if any other similar out-of-kernel projects may already have some tools to help in this regard.
Update the bounds checking for zfs_vdev_aggregation_limit so that it
has a floor of zero and a maximum value of the supported block size for
the pool.

Additionally add an early return when zfs_vdev_aggregation_limit equals
zero to disable aggregation. For very fast solid state or memory
devices it may be more expensive to perform the aggregation than to
issue the IO immediately.

Signed-off-by: Brian Behlendorf <[email protected]>
Requires-spl: refs/pull/515/head

Force-pushed from a95c2d3 to da42296
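Roughly, the clamping and early return described above could look like the fragment below. This is an illustrative sketch rather than the verbatim diff; spa_maxblocksize() is assumed to report the largest block size the pool supports.

/* Sketch of the check near the top of vdev_queue_aggregate(). */
        uint64_t limit;

        limit = MAX(MIN(zfs_vdev_aggregation_limit,
            spa_maxblocksize(vq->vq_vdev->vdev_spa)), 0);

        if (limit == 0)
                return (NULL);  /* aggregation disabled; issue the IO as-is */

        /* ... aggregate neighbouring I/Os up to 'limit' bytes as before ... */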
@tuxoko good catch, that would definitely explain a few things. I've refreshed the top commit in this stack to depend on openzfs/spl#515 so we can run it through testing. We'll want to make all of these changes in master. @dweeezil I'm not aware of any tools for this kind of thing, although if they existed I'd love to use them. One way we could minimize the likelihood of this kind of thing in the future is to whittle down the number of kernel interfaces we depend on. That would at least reduce the surface area for potential issues. A good place to start would be to try and restrict spl/zfs to an enterprise kernel's supported whitelist of symbols. These are guaranteed stable interfaces. That should in theory at least prevent accidental breakage for RHEL/CentOS/SUSE.
Since everything tested out, I'll be merging this along with @tuxoko's fix.
On earlier versions of the kernel, memalloc_noio_save() only turns off
__GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #515
Issue openzfs/zfs#4111
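In other words, rather than relying on memalloc_noio_save() (which on 3.9-3.17 kernels only strips __GFP_IO), the allocation path can strip both flags itself whenever the calling thread has been marked with spl_fstrans_mark(). A rough sketch of that idea, with an illustrative helper name and assuming the SPL's spl_fstrans_check() predicate:

#include <linux/gfp.h>

/* Illustrative helper name; spl_fstrans_check() reports whether the
 * current thread set PF_FSTRANS via spl_fstrans_mark(). */
static inline gfp_t
spl_sanitize_gfp_sketch(gfp_t lflags)
{
        if (spl_fstrans_check())
                lflags &= ~(__GFP_IO | __GFP_FS);  /* no block I/O, no FS re-entry */

        return (lflags);
}

Because the mask is applied explicitly before calling into the page allocator, the behaviour no longer depends on which PF_MEMALLOC_* flags a given kernel version honours.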
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline. This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread. However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.
To resolve this issue zio_buf_alloc_flags() is introduced, which
allows allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer. Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly. Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.
* z_wr_iss
    zio_vdev_io_start
    vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
    vdev_queue_aggregate
    zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
    zio_vdev_io_done
    vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
    spa_sync
    dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
    spl_cache_grow_work
    kv_alloc
    spl_vmalloc
    ...
    evict
    zpl_evict_inode
    zfs_inactive
    dmu_tx_wait
    txg_wait_open -> Waiting on txg_sync
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3808
Issue #3867
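For reference, the call-site behaviour this commit describes can be pictured roughly as below. This is a hedged sketch, not the verbatim diff; the key points are that the buffer is requested with KM_NOSLEEP while vq->vq_lock is held and that a failed allocation simply skips aggregation.

/* Sketch of the allocation in vdev_queue_aggregate(). */
        ASSERT(MUTEX_HELD(&vq->vq_lock));

        buf = zio_buf_alloc_flags(size, KM_NOSLEEP);
        if (buf == NULL)
                return (NULL);  /* can't aggregate right now; issue the IO as-is */

        /* ... build the aggregate zio over 'buf' exactly as before ... */

Since the allocator is never allowed to block here, the z_wr_iss thread can no longer wait on the spl_kmem_cache worker while holding vq->vq_lock, which breaks the cycle shown in the stacks above.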