
Fix vdev_queue_aggregate() deadlock #4111

Closed
wants to merge 2 commits into from

Conversation

behlendorf
Contributor

This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline. This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread. However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced, which
allows allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer. Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly. Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_vmalloc
  ...
  evict
    ...
    zfs_inactive
      dmu_tx_wait
        txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf [email protected]
Issue #3808
Issue #3867
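
To make the mechanism concrete, here is a minimal sketch of the idea, assuming the existing zio_buf_cache[] array and kmem_cache_alloc() interfaces; it is illustrative rather than the literal patch:

    /* Sketch: a zio_buf_alloc() variant that forwards kmem flags. */
    void *
    zio_buf_alloc_flags(size_t size, int flags)
    {
            size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

            VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

            return (kmem_cache_alloc(zio_buf_cache[c], flags));
    }

    /* In vdev_queue_aggregate(), with vq->vq_lock held: */
    buf = zio_buf_alloc_flags(size, KM_NOSLEEP);
    if (buf == NULL)
            return (NULL);  /* skip aggregation rather than block in reclaim */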

@behlendorf
Contributor Author

@tuxoko @dweeezil @ryao can I get your thoughts on this change? It may explain some of the deadlocks which have been reported.

@tuxoko
Contributor

tuxoko commented Dec 16, 2015

@behlendorf
This patch doesn't look wrong, but why would spl_kmem_cache do direct reclaim in zfs? I thought it has fstrans set before allocating. Am I missing something?

@behlendorf
Contributor Author

@tuxoko the spl_kmem_cache threads don't have PF_FSTRANS set, they're allowed to perform direct reclaim if needed to satisfy the allocation. When memory is low this may be desirable if asynchronous reclaim isn't keeping up. As long as we have control over where we're blocking in the ZFS code this is OK.

@dweeezil
Contributor

@behlendorf That's a very interesting deadlock, and your fix certainly seems like it ought to handle the particular case you outlined. However, I'm wondering why the zio_buf_alloc() call in vdev_queue_aggregate() is the only one which could trigger this deadlock.

Also, it's not clear to me which part of the spl_kmem_cache task is involved here: I presume it's spl_cache_grow_work, which maybe ought to be listed in the stack in the comment.

This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced, which
allows allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_cache_grow_work
  kv_alloc
    spl_vmalloc
      ...
      evict
        zpl_evict_inode
          zfs_inactive
            dmu_tx_wait
              txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3808
Issue openzfs#3867
@behlendorf
Contributor Author

I'm wondering why the zio_buf_alloc() call in vdev_queue_aggregate() is the only one which could trigger this deadlock.

It's possible there are a few other places lurking in the code where something similar can happen, maybe with the dbuf hash lock. I haven't carefully audited every call to zio_buf_alloc(), but if any remain they're very unlikely to be hit. The key issue here is that zio_buf_alloc() can't be safely called under any lock which is also acquired by txg_sync; otherwise this same deadlock could occur.

I've updated the comment as you suggested to make this complicated deadlock clearer. I've also added a small additional patch which fixes the bounds checking for zfs_vdev_aggregation_limit, which I noticed was wrong while looking at this issue.

@tuxoko
Contributor

tuxoko commented Dec 17, 2015

@behlendorf
But spl_cache_grow_work does seem to have set PF_FSTRANS:
https://github.com/zfsonlinux/spl/blob/master/module/spl/spl-kmem-cache.c#L1159

@behlendorf
Contributor Author

@tuxoko interesting... that's what I get for not actually looking at this function. Thanks for pointing that out. So in fact it probably has PF_MEMALLOC_NOIO set on a new-ish kernel, and PF_MEMALLOC_NOIO is not exactly the same as PF_FSTRANS. It still allows entry into the filesystem code, which is no good.
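
Roughly speaking, the two task flags constrain allocations differently. A simplified illustration of their effect on a GFP mask follows; the exact plumbing lives partly in the kernel and partly in the SPL, and the helper name here is made up, so treat this as a sketch:

    /* Simplified illustration only, not actual kernel or SPL code. */
    static inline gfp_t
    effective_gfp(gfp_t gfp, unsigned long task_flags)
    {
            if (task_flags & PF_MEMALLOC_NOIO)
                    gfp &= ~__GFP_IO;               /* __GFP_FS stays set... */
            if (task_flags & PF_FSTRANS)
                    gfp &= ~(__GFP_IO | __GFP_FS);  /* ...this forbids FS re-entry */
            return (gfp);
    }

With only PF_MEMALLOC_NOIO, direct reclaim can still call back into the filesystem (evict -> zfs_inactive); with PF_FSTRANS both bits are cleared and reclaim cannot re-enter ZFS.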

@tuxoko
Contributor

tuxoko commented Dec 18, 2015

I've opened a pull request to get rid of PF_MEMALLOC_NOIO:
openzfs/spl#515

@dweeezil
Contributor

As a follow-on to my note over in the spl, I'll note that this is yet another case where kernel changes have come and gone and created breakage over large swaths of time.

I'm trying to imagine ways of dealing with this. Personally, I like to look at the "config.log" files from time to time because doing so can sometimes uncover an autotools check that is clearly no longer doing the right thing.

For this type of issue, however, it would be very cool to have some sort of auditing infrastructure which could be used to, say, scrape the kernel repo between a set of commits (likely tagged versions) to find possible changes to the various interfaces we use. This would, of course, require someone to actually look at the output, and it also seems we'd have to maintain a list or develop a tool to show all the kernel interfaces used in the spl (and zfs as well). I wonder whether any other similar out-of-kernel projects already have tools to help in this regard.

Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation.  For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Signed-off-by: Brian Behlendorf <[email protected]>
Requires-spl: refs/pull/515/head
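
A hedged sketch of the clamp and early return described above; spa_maxblocksize() and the vq->vq_vdev->vdev_spa chain are assumed for illustration, and the rest of vdev_queue_aggregate() is elided:

    /* Near the top of vdev_queue_aggregate() (sketch, not the literal patch): */
    limit = MAX(MIN(zfs_vdev_aggregation_limit,
        spa_maxblocksize(vq->vq_vdev->vdev_spa)), 0);

    if (limit == 0)
            return (NULL);  /* aggregation disabled; issue IOs individually */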
@behlendorf
Contributor Author

@tuxoko good catch, that would definitely explain a few things. I've refreshed the top commit in this stack to depend on openzfs/spl#515 so we can get it run through for testing. We'll want to make all of these changes in master.

@dweeezil I'm not aware of any tools for this kind of thing, although if they existed I'd love to use them. One way we could minimize the likelihood of this kind of thing in the future is to whittle down the number of kernel interfaces we depend on. That would at least reduce the surface area for potential issues. A good place to start would be to try to restrict spl/zfs to an enterprise kernel's supported whitelist of symbols. These are guaranteed stable interfaces, so that should in theory at least prevent accidental breakage for RHEL/CentOS/SUSE.

@behlendorf
Contributor Author

Since everything tested out I'll be merging this along with @tuxoko's fix.

behlendorf pushed a commit to openzfs/spl that referenced this pull request Dec 18, 2015
For earlier versions of the kernel that have memalloc_noio_save, it only
turns off __GFP_IO but leaves __GFP_FS untouched during direct reclaim.
This would cause threads to direct reclaim into ZFS and deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #515
Issue openzfs/zfs#4111
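
For reference, a rough sketch of the pattern this commit describes, modeled on the kv_alloc() path; the actual change in openzfs/spl#515 may differ in detail:

    fstrans_cookie_t cookie;
    void *ptr;

    /*
     * Mark the thread with PF_FSTRANS and explicitly drop __GFP_IO and
     * __GFP_FS from the allocation mask, so direct reclaim triggered by
     * this allocation can never re-enter the filesystem
     * (evict -> zfs_inactive -> dmu_tx_wait).
     */
    cookie = spl_fstrans_mark();
    lflags &= ~(__GFP_IO | __GFP_FS);
    ptr = __vmalloc(size, lflags | __GFP_HIGHMEM, PAGE_KERNEL);
    spl_fstrans_unmark(cookie);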
@behlendorf behlendorf closed this Dec 18, 2015
nedbass pushed a commit to openzfs/spl that referenced this pull request Dec 24, 2015
@behlendorf behlendorf deleted the issue-3867 branch April 19, 2021 19:36