
Heavy IO lockup from Linux 3.5->3.6 change #1342

Closed
DeHackEd opened this issue Mar 8, 2013 · 10 comments
Labels: Component: Memory Management (kernel memory management)

@DeHackEd (Contributor) commented Mar 8, 2013

ryao says this may be mmap related. For now I'll dump the info I know here and update as I get more. Feel free to update the description if I take too long coming up with something better.

Hung task stack traces: http://pastebin.com/n2E2Mq6d

Slapd (OpenLDAP using BDB as its backend) is doing a replication database receive. It's slowed down by what I assume to be #1297, but that's not the issue here.

# zpool status
  pool: lxc-iscsi
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on software that does not support
    feature flags.
  scan: scrub repaired 0 in 0h46m with 0 errors on Mon Mar  4 16:27:28 2013
config:

    NAME                                                       STATE     READ WRITE CKSUM
    lxc-iscsi                                                  ONLINE       0     0     0
      mirror-0                                                 ONLINE       0     0     0
        ip-10.50.8.10:3260-iscsi-iqn.2012-06.net.blahblahblah  ONLINE       0     0     0
        ip-10.50.8.11:3260-iscsi-iqn.2012-06.net.blahblahblah  ONLINE       0     0     0

errors: No known data errors

The host is a KVM guest with 8 cores and 10 GB of RAM running a custom-built kernel. All ZFS options are at their defaults. Currently running build 0b4d1b5

The hang usually takes anywhere from 5 to 30 minutes to manifest. It may be a Linux kernel regression: a 3.2 kernel appeared stable, but a 3.8 kernel would jam up pretty quickly, and some intermediate kernels (~3.4) would run overnight successfully. I'm trying to bisect a possible upstream regression, but the irregular nature of the lockup means some guesswork is involved.

Update history: Old title was "Heavy IO lockup"

@DeHackEd (Contributor, Author)

After several sessions of bisecting, I think the fault was introduced in commit torvalds/linux@e62e384

Given how intermittently the bug triggers, I'm about 80% confident this isn't a misidentification. I'll try patching newer kernels to see if I can make them stable.

@behlendorf (Contributor)

Nice job running this down! That commit completely explains the hang and jibes perfectly with the stack traces you've already posted.

Here's what's happening. While in the middle of a write system call, we take a page fault and attempt to perform some read-ahead. Unfortunately, in the middle of the read we try to reclaim memory, which causes us to block waiting on a ZFS page in writeback. That page will never get written because the original write system call is holding the txg open. Thus we deadlock. The patch you referenced introduced the wait on pages in writeback.

Now we need to decide what to do about this...
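
To make the circular wait concrete, here's a minimal userspace analogue using pthreads. Everything here is invented for illustration and it is not ZFS or kernel code: the mutex stands in for the open txg and the condition variable for the wait on a page in writeback. Compiled with -pthread it simply hangs, which is the point.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* "txg held open" by the writer until its copy finishes. */
static pthread_mutex_t txg_open = PTHREAD_MUTEX_INITIALIZER;

/* "PG_writeback" state, protected by wb_lock / signalled via wb_cv. */
static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wb_cv = PTHREAD_COND_INITIALIZER;
static int page_in_writeback = 1;

/* Plays the write(2) caller: assigns the txg, then (via page fault ->
 * read-ahead -> direct reclaim) ends up waiting for writeback to finish
 * while still holding the txg open. */
static void *writer_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&txg_open);			/* dmu_tx_assign() analogue */
	pthread_mutex_lock(&wb_lock);
	while (page_in_writeback)			/* wait_on_page_writeback() analogue */
		pthread_cond_wait(&wb_cv, &wb_lock);
	pthread_mutex_unlock(&wb_lock);
	pthread_mutex_unlock(&txg_open);
	return NULL;
}

/* Plays the sync thread: it can only complete writeback once the txg
 * closes, so it blocks behind the writer forever. */
static void *sync_thread(void *arg)
{
	(void)arg;
	sleep(1);
	pthread_mutex_lock(&txg_open);			/* never acquired */
	pthread_mutex_lock(&wb_lock);
	page_in_writeback = 0;
	pthread_cond_broadcast(&wb_cv);
	pthread_mutex_unlock(&wb_lock);
	pthread_mutex_unlock(&txg_open);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writer_thread, NULL);
	pthread_create(&b, NULL, sync_thread, NULL);
	printf("writer holds the txg and waits for writeback; "
	    "syncer needs the txg to complete writeback\n");
	pthread_join(a, NULL);				/* never returns */
	pthread_join(b, NULL);
	return 0;
}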

@ryao (Contributor) commented Mar 17, 2013

It is possible to work around this issue on recent kernels by reverting the following commits:

torvalds/linux@c3b94f4
torvalds/linux@e62e384

I have asked the Gentoo kernel team to revert these patches in our main kernel source package until we have a proper fix in ZFS. That should protect at least some users from this issue, provided that they honor my request.

@ryao (Contributor) commented Mar 17, 2013

I have opened a Gentoo bug to track the status of this in Gentoo:

https://bugs.gentoo.org/show_bug.cgi?id=462066

@ryao (Contributor) commented Mar 18, 2013

A comment in dmu_write_uio_dnode() suggests that this is an Illumos/Solaris bug that we inherited:

		/*
		 * XXX uiomove could block forever (eg. nfs-backed
		 * pages).  There needs to be a uiolockdown() function
		 * to lock the pages in memory, so that uiomove won't
		 * block.
		 */
		err = uiomove((char *)db->db_data + bufoff, tocpy,
		    UIO_WRITE, uio);

I withdraw my previous comments about the mmap() rewrite fixing this. It is not yet clear to me if the rewrite that I started will touch this.

@ahrens You wrote the commit that introduced this code in 2005 at Sun. Would you elaborate on how you thought that uiolockdown() could be implemented when you wrote this comment?

@ahrens (Member) commented Mar 18, 2013

@ryao The problem described in that comment is that uiomove() can block for an arbitrarily long time while we have a transaction open. This will eventually cause the spa_sync() thread to block on that transaction, and then all writes will block.

The idea is that the caller would call uiolockdown() before calling dmu_tx_assign(). This would cause the pages to be locked in memory, so that uiomove() won't block. I'm not familiar enough with the VM subsystem (especially on Linux) to comment on how exactly uiolockdown() would be implemented.

Note that we take several pains to decrease the likelihood of uiomove() having to block. zfs_write() calls uio_prefaultpages() to read the pages into memory (but those pages could be evicted before we get to uiomove()). Also, for full blocks, zfs_write() uses uiocopy() to copy the data before assigning the transaction, so we don't use uiomove() in that case. However, zvol_write() and sbd_zvol_copy_write() don't take advantage of these tricks.
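
As a rough userspace sketch of that ordering (all names invented, not ZFS code): touch or copy the caller's buffer before the transaction is assigned, so that nothing which might block on a page fault runs while the transaction is held open.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t open_tx = PTHREAD_MUTEX_INITIALIZER;	/* stands in for an assigned tx */
static char storage[4096];					/* stands in for the dbuf */

/* uiocopy()-style path for full blocks: copy the caller's data to a
 * private buffer *before* assigning the tx, then only touch memory we
 * own while the tx is open. */
static void write_full_block(const char *user_buf, size_t len)
{
	char *bounce = malloc(len);

	if (bounce == NULL)
		return;
	memcpy(bounce, user_buf, len);		/* may fault; tx not yet assigned */

	pthread_mutex_lock(&open_tx);		/* dmu_tx_assign() analogue */
	memcpy(storage, bounce, len);		/* private, resident memory only */
	pthread_mutex_unlock(&open_tx);
	free(bounce);
}

/* uio_prefaultpages()+uiomove()-style path for partial blocks: touch the
 * user pages up front, then do the short copy with the tx open.  As noted
 * above, the pages can still be evicted in between, which is the residual
 * risk. */
static void write_partial_block(const char *user_buf, size_t off, size_t len)
{
	volatile char prefault = user_buf[0] + user_buf[len - 1];

	(void)prefault;
	pthread_mutex_lock(&open_tx);
	memcpy(storage + off, user_buf, len);	/* uiomove() analogue: may still fault */
	pthread_mutex_unlock(&open_tx);
}

int main(void)
{
	const char data[64] = "example payload";

	write_full_block(data, sizeof (data));
	write_partial_block(data, 8, sizeof (data));
	return 0;
}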

@behlendorf (Contributor)

@ryao One way we might handle this, at least for mmap'ed reads, is to implement our own read_cache_pages() helper. This way we'd be able to clear the GFP_IO bit in the read_cache_pages()->add_to_page_cache_lru() call and prevent the lower layers from attempting to write or block on pages in writeback. Unfortunately, add_to_page_cache_lru() seems to be GPL-only, so that's easier said than done. Other suggestions are welcome.

Unrelated to this issue, we could be doing something smarter in read_cache_pages(). Right now it can easily result in us pulling a 128k block from disk and then only stashing the requested pages in the page cache.
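
A rough sketch of that first idea, not compile-tested against any particular kernel version and with zfs_read_cache_pages() as a hypothetical name. The only change from the stock read_cache_pages() loop is masking __GFP_IO out of the flags passed to add_to_page_cache_lru(), which, as noted above, is exported GPL-only:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/list.h>
#include <linux/pagemap.h>

static int
zfs_read_cache_pages(struct address_space *mapping, struct list_head *pages,
    int (*filler)(void *, struct page *), void *data)
{
	struct page *page;
	int error = 0;

	while (!list_empty(pages)) {
		page = list_entry(pages->prev, struct page, lru);
		list_del(&page->lru);

		/* Clear __GFP_IO so reclaim triggered by this allocation
		 * cannot wait on pages already under writeback. */
		if (add_to_page_cache_lru(page, mapping, page->index,
		    mapping_gfp_mask(mapping) & ~__GFP_IO)) {
			page_cache_release(page);	/* drop the read-ahead ref */
			continue;
		}
		page_cache_release(page);	/* page cache now holds its own ref */

		error = filler(data, page);
		if (error)
			break;
	}

	return (error);
}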

@casualfish (Contributor)

@DeHackEd @ryao @behlendorf As far as I know this problem also occurs in other parts of the Linux kernel and is an open question. The key point here is that the page reclaim algorithm chooses the wrong page to evict, so the victim ends up shooting itself in the foot. I know FUSE (Filesystem in Userspace) avoids this issue by reserving some memory for emergency use; could we do the same?

@ryao (Contributor) commented Jul 30, 2014

#2411 and openzfs/spl#369 should handle this. Specifically, the combination of PF_FSTRANS changes in them should tackle it.
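
For reference, a simplified reconstruction of the PF_FSTRANS pattern (the mark/unmark names mirror the SPL's spl_fstrans_mark()/spl_fstrans_unmark(); spl_fstrans_gfp() is an invented helper, and this is an illustrative sketch, not the actual patch): a thread marks itself as inside a filesystem transaction, and allocations made while marked drop __GFP_FS/__GFP_IO so direct reclaim can no longer recurse into the filesystem or wait on writeback while the txg is held open.

#include <linux/sched.h>
#include <linux/gfp.h>

typedef struct fstrans_cookie {
	unsigned int fc_flags;			/* saved PF_FSTRANS state */
} fstrans_cookie_t;

/* Mark the current thread as inside a filesystem transaction. */
static inline fstrans_cookie_t
spl_fstrans_mark(void)
{
	fstrans_cookie_t cookie;

	cookie.fc_flags = current->flags & PF_FSTRANS;
	current->flags |= PF_FSTRANS;
	return (cookie);
}

/* Restore the previous state once the transaction is done. */
static inline void
spl_fstrans_unmark(fstrans_cookie_t cookie)
{
	current->flags &= ~PF_FSTRANS;
	current->flags |= cookie.fc_flags;
}

/* Allocation wrapper: strip __GFP_FS and __GFP_IO while marked, so the
 * allocation cannot trigger filesystem re-entry or writeback waits. */
static inline gfp_t
spl_fstrans_gfp(gfp_t flags)
{
	if (current->flags & PF_FSTRANS)
		flags &= ~(__GFP_FS | __GFP_IO);
	return (flags);
}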

@behlendorf modified the milestones: 0.7.0, 0.6.4 Oct 7, 2014
@behlendorf added the Bug - Minor and Component: Memory Management (kernel memory management) labels and removed the Bug label Oct 7, 2014
@behlendorf (Contributor)

The originally reported issue was determined by @DeHackEd, the original reporter, to be due to a kernel bug. It's possible subsequent improvements in ZoL have also improved things. Closing.
