
Heavy IO lockup from Linux 3.5->3.6 change #1342

Closed
DeHackEd opened this issue Mar 8, 2013 · 10 comments
Labels: Component: Memory Management (kernel memory management)

@DeHackEd (Contributor) commented Mar 8, 2013

ryao says this may be mmap related. For now I'll dump the info I know here and update as I get more. Feel free to update the description if I take too long coming up with something better.

Hung task stack traces: http://pastebin.com/n2E2Mq6d

Slapd (OpenLDAP using BDB as its backend) is doing a replication database receive. It's slowed down by what I assume to be #1297, but that's not the issue here.

# zpool status
  pool: lxc-iscsi
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on software that does not support
    feature flags.
  scan: scrub repaired 0 in 0h46m with 0 errors on Mon Mar  4 16:27:28 2013
config:

    NAME                                                       STATE     READ WRITE CKSUM
    lxc-iscsi                                                  ONLINE       0     0     0
      mirror-0                                                 ONLINE       0     0     0
        ip-10.50.8.10:3260-iscsi-iqn.2012-06.net.blahblahblah  ONLINE       0     0     0
        ip-10.50.8.11:3260-iscsi-iqn.2012-06.net.blahblahblah  ONLINE       0     0     0

errors: No known data errors

The host is a KVM guest with 8 cores and 10 GB of RAM running a custom-built kernel. All ZFS options are at their defaults. Currently running build 0b4d1b5

The hang usually takes anywhere from 5 to 30 minutes to manifest. It may be a Linux kernel regression: a 3.2 kernel appeared stable, but a 3.8 kernel would jam up pretty quickly, and some intermediate kernels (~3.4) would run overnight successfully. I'm trying to bisect a possible upstream regression, but the irregular nature of the lockup means some guesswork is involved.

Update history: Old title was "Heavy IO lockup"

@DeHackEd (Contributor, Author)

After several sessions of bisecting, I think the fault was introduced in commit torvalds/linux@e62e384

Given how intermittently the bug triggers, I'm about 80% confident this isn't a misidentification. I'll try patching newer kernels to see if I can make them stable.

@behlendorf (Contributor)

Nice job running this down! That commit completely explains the hang and jibes perfectly with the stack traces you've already posted.

Here's what's happening. While in the middle of a write system call, we take a page fault and attempt to perform some read-ahead. Unfortunately, in the middle of the read we try to reclaim memory, which causes us to block waiting on a ZFS page in writeback. That page will never get written because the original write system call is holding the txg open. Thus we deadlock. The patch you referenced introduced the wait on pages in writeback.

Now we need to decide what to do about this...
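
To make the circular wait concrete, here's a minimal userspace analogue using pthreads. Everything here is invented for illustration and it is not ZFS or kernel code: the mutex stands in for the open txg and the condition variable for the wait on a page in writeback. Compiled with -pthread it simply hangs, which is the point.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* "txg held open" by the writer until its copy finishes. */
static pthread_mutex_t txg_open = PTHREAD_MUTEX_INITIALIZER;

/* "PG_writeback" state, protected by wb_lock / signalled via wb_cv. */
static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wb_cv = PTHREAD_COND_INITIALIZER;
static int page_in_writeback = 1;

/* Plays the write(2) caller: assigns the txg, then (via page fault ->
 * read-ahead -> direct reclaim) ends up waiting for writeback to finish
 * while still holding the txg open. */
static void *writer_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&txg_open);			/* dmu_tx_assign() analogue */
	pthread_mutex_lock(&wb_lock);
	while (page_in_writeback)			/* wait_on_page_writeback() analogue */
		pthread_cond_wait(&wb_cv, &wb_lock);
	pthread_mutex_unlock(&wb_lock);
	pthread_mutex_unlock(&txg_open);
	return NULL;
}

/* Plays the sync thread: it can only complete writeback once the txg
 * closes, so it blocks behind the writer forever. */
static void *sync_thread(void *arg)
{
	(void)arg;
	sleep(1);
	pthread_mutex_lock(&txg_open);			/* never acquired */
	pthread_mutex_lock(&wb_lock);
	page_in_writeback = 0;
	pthread_cond_broadcast(&wb_cv);
	pthread_mutex_unlock(&wb_lock);
	pthread_mutex_unlock(&txg_open);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writer_thread, NULL);
	pthread_create(&b, NULL, sync_thread, NULL);
	printf("writer holds the txg and waits for writeback; "
	    "syncer needs the txg to complete writeback\n");
	pthread_join(a, NULL);				/* never returns */
	pthread_join(b, NULL);
	return 0;
}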

@ryao (Contributor) commented Mar 17, 2013

It is possible to work around this issue on recent kernels by reverting the following commits:

torvalds/linux@c3b94f4
torvalds/linux@e62e384

I have asked the Gentoo kernel team to revert these patches in our main kernel source package until we have a proper fix in ZFS. That should protect at least some users from this issue, provided that they honor my request.

@ryao (Contributor) commented Mar 17, 2013

I have opened a Gentoo bug to track the status of this in Gentoo:

https://bugs.gentoo.org/show_bug.cgi?id=462066

@ryao (Contributor) commented Mar 18, 2013

A comment in dmu_write_uio_dnode() suggests that this is an Illumos/Solaris bug that we inherited:

		/*
		 * XXX uiomove could block forever (eg. nfs-backed
		 * pages).  There needs to be a uiolockdown() function
		 * to lock the pages in memory, so that uiomove won't
		 * block.
		 */
		err = uiomove((char *)db->db_data + bufoff, tocpy,
		    UIO_WRITE, uio);

I withdraw my previous comments about the mmap() rewrite fixing this. It is not yet clear to me if the rewrite that I started will touch this.

@ahrens You wrote the commit that introduced this code in 2005 at Sun. Would you elaborate on how you thought that uiolockdown() could be implemented when you wrote this comment?

@ahrens (Member) commented Mar 18, 2013

@ryao The problem described in that comment is that uiomove() can block for an arbitrarily long time while we have a transaction open. This will eventually cause the spa_sync() thread to block on that transaction, and then all writes will block.

The idea is that the caller would call uiolockdown() before calling dmu_tx_assign(). This would cause the pages to be locked in memory, so that uiomove() won't block. I'm not familiar enough with the VM subsystem (especially on Linux) to comment on how exactly uiolockdown() would be implemented.

Note that we take several pains to decrease the likelihood of uiomove() having to block. zfs_write() calls uio_prefaultpages() to read the pages into memory (but those pages could be evicted before we get to uiomove()). Also, for full blocks, zfs_write() uses uiocopy() to copy the data before assigning the transaction, so we don't use uiomove() in that case. However, zvol_write() and sbd_zvol_copy_write() don't take advantage of these tricks.
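
As a rough userspace sketch of that ordering (all names invented, not ZFS code): touch or copy the caller's buffer before the transaction is assigned, so that nothing which might block on a page fault runs while the transaction is held open.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t open_tx = PTHREAD_MUTEX_INITIALIZER;	/* stands in for an assigned tx */
static char storage[4096];					/* stands in for the dbuf */

/* uiocopy()-style path for full blocks: copy the caller's data to a
 * private buffer *before* assigning the tx, then only touch memory we
 * own while the tx is open. */
static void write_full_block(const char *user_buf, size_t len)
{
	char *bounce = malloc(len);

	if (bounce == NULL)
		return;
	memcpy(bounce, user_buf, len);		/* may fault; tx not yet assigned */

	pthread_mutex_lock(&open_tx);		/* dmu_tx_assign() analogue */
	memcpy(storage, bounce, len);		/* private, resident memory only */
	pthread_mutex_unlock(&open_tx);
	free(bounce);
}

/* uio_prefaultpages()+uiomove()-style path for partial blocks: touch the
 * user pages up front, then do the short copy with the tx open.  As noted
 * above, the pages can still be evicted in between, which is the residual
 * risk. */
static void write_partial_block(const char *user_buf, size_t off, size_t len)
{
	volatile char prefault = user_buf[0] + user_buf[len - 1];

	(void)prefault;
	pthread_mutex_lock(&open_tx);
	memcpy(storage + off, user_buf, len);	/* uiomove() analogue: may still fault */
	pthread_mutex_unlock(&open_tx);
}

int main(void)
{
	const char data[64] = "example payload";

	write_full_block(data, sizeof (data));
	write_partial_block(data, 8, sizeof (data));
	return 0;
}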

@behlendorf (Contributor)

@ryao One way we might handle this, at least for mmap'ed reads, is to implement our own read_cache_pages() helper. This way we'd be able to clear the GFP_IO bit in the read_cache_pages()->add_to_page_cache_lru() call and prevent the lower layers from attempting to write or block on pages in writeback. Unfortunately, add_to_page_cache_lru() seems to be GPL-only, so that's easier said than done. Other suggestions are welcome.

Unrelated to this issue, we could be doing something smarter in read_cache_pages(). Right now it can easily result in us pulling a 128k block from disk and then only stashing the requested pages in the page cache.
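
A rough sketch of that first idea, not compile-tested against any particular kernel version and with zfs_read_cache_pages() as a hypothetical name. The only change from the stock read_cache_pages() loop is masking __GFP_IO out of the flags passed to add_to_page_cache_lru(), which, as noted above, is exported GPL-only:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/list.h>
#include <linux/pagemap.h>

static int
zfs_read_cache_pages(struct address_space *mapping, struct list_head *pages,
    int (*filler)(void *, struct page *), void *data)
{
	struct page *page;
	int error = 0;

	while (!list_empty(pages)) {
		page = list_entry(pages->prev, struct page, lru);
		list_del(&page->lru);

		/* Clear __GFP_IO so reclaim triggered by this allocation
		 * cannot wait on pages already under writeback. */
		if (add_to_page_cache_lru(page, mapping, page->index,
		    mapping_gfp_mask(mapping) & ~__GFP_IO)) {
			page_cache_release(page);	/* drop the read-ahead ref */
			continue;
		}
		page_cache_release(page);	/* page cache now holds its own ref */

		error = filler(data, page);
		if (error)
			break;
	}

	return (error);
}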

@casualfish (Contributor)

@DeHackEd @ryao @behlendorf As far as I know this problem also occurs in other parts of the Linux kernel and is an open question. The key point here is that the page reclaim algorithm chooses the wrong page to evict, so the victim ends up shooting itself in the foot. I know FUSE (Filesystem in Userspace) avoids this issue by reserving some memory for emergency use; could we do the same?

@ryao (Contributor) commented Jul 30, 2014

#2411 and openzfs/spl#369 should handle this. Specifically, the combination of PF_FSTRANS changes in them should tackle it.
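
For reference, a simplified reconstruction of the PF_FSTRANS pattern (the mark/unmark names mirror the SPL's spl_fstrans_mark()/spl_fstrans_unmark(); spl_fstrans_gfp() is an invented helper, and this is an illustrative sketch, not the actual patch): a thread marks itself as inside a filesystem transaction, and allocations made while marked drop __GFP_FS/__GFP_IO so direct reclaim can no longer recurse into the filesystem or wait on writeback while the txg is held open.

#include <linux/sched.h>
#include <linux/gfp.h>

typedef struct fstrans_cookie {
	unsigned int fc_flags;			/* saved PF_FSTRANS state */
} fstrans_cookie_t;

/* Mark the current thread as inside a filesystem transaction. */
static inline fstrans_cookie_t
spl_fstrans_mark(void)
{
	fstrans_cookie_t cookie;

	cookie.fc_flags = current->flags & PF_FSTRANS;
	current->flags |= PF_FSTRANS;
	return (cookie);
}

/* Restore the previous state once the transaction is done. */
static inline void
spl_fstrans_unmark(fstrans_cookie_t cookie)
{
	current->flags &= ~PF_FSTRANS;
	current->flags |= cookie.fc_flags;
}

/* Allocation wrapper: strip __GFP_FS and __GFP_IO while marked, so the
 * allocation cannot trigger filesystem re-entry or writeback waits. */
static inline gfp_t
spl_fstrans_gfp(gfp_t flags)
{
	if (current->flags & PF_FSTRANS)
		flags &= ~(__GFP_FS | __GFP_IO);
	return (flags);
}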

@behlendorf modified the milestones: 0.7.0, 0.6.4 Oct 7, 2014
@behlendorf added the Bug - Minor and Component: Memory Management (kernel memory management) labels and removed the Bug label Oct 7, 2014
@behlendorf (Contributor)

The originally reported issue was determined by @DeHackEd, the original reporter, to be due to a kernel bug. It's possible subsequent improvements in ZoL have also improved things. Closing.
