Heavy IO lockup from Linux 3.5->3.6 change #1342
After several sessions of bisecting, I think the fault was introduced in commit torvalds/linux@e62e384. I'm about 80% confident this isn't a mistaken bisect, based on how reliably the bug triggers. I'll try patching future kernels to see if I can make them stable.
Nice job running this down! That commit completely explains the hang and jibes perfectly with the stack traces you've already posted. Here's what's happening. While in the middle of a write system call, we take a page fault and attempt to perform some read-ahead. Unfortunately, in the middle of that read we try to reclaim memory, which causes us to block waiting on a ZFS page in writeback. That page will never get written because the original write system call is holding the txg open; thus we deadlock. The patch you referenced introduced the wait on pages in writeback. Now we need to decide what to do about this...
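To make the cycle concrete, here's a rough sketch of the hazard described above. This is illustrative only, not the actual zfs_write() code: the DMU calls are real entry points, but the arguments and control flow are heavily simplified.

```c
/*
 * Illustrative fragment (not the literal ZoL code): the transaction
 * is assigned before the copy from user memory, so the txg is held
 * open across a potential page fault.
 */
tx = dmu_tx_create(os);
dmu_tx_hold_write(tx, object, offset, nbytes);
error = dmu_tx_assign(tx, TXG_WAIT);	/* txg is now held open */

/*
 * uiomove() touches the user buffer and can fault.  With the 3.6
 * change (torvalds/linux@e62e384), direct reclaim inside that fault
 * may wait on a ZFS page in writeback.  That page's writeback cannot
 * complete until this txg closes -- which it cannot, because we are
 * still inside the transaction.  Deadlock.
 */
error = dmu_write_uio_dnode(dn, uio, nbytes, tx);

dmu_tx_commit(tx);
```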
It is possible for people to work around this issue in recent kernels by reverting the following commits: torvalds/linux@c3b94f4. I have asked the Gentoo kernel team to revert these patches in our main kernel source package until we have a proper fix in ZFS. That should protect at least some users from this issue, provided they honor my request.
I have opened a Gentoo bug to track the status of this in Gentoo:
A comment in dmu_write_uio_dnode() suggests that this is an Illumos/Solaris bug that we inherited:
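For reference, the comment in question reads roughly as follows (quoted from memory of the source, so the wording may differ slightly):

```c
/*
 * XXX uiomove could block forever (eg. nfs-backed
 * pages).  There needs to be a uiolockdown() function
 * to lock the pages in memory, so that uiomove won't
 * block.
 */
```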
I withdraw my previous comments about the mmap() rewrite fixing this. It is not yet clear to me whether the rewrite that I started will touch this. @ahrens You wrote the commit that introduced this code in 2005 at Sun. Would you elaborate on how you thought uiolockdown() could be implemented when you wrote this comment?
@ryao The problem described in that comment is that uiomove() can block for an arbitrarily long time while we have a transaction open. This will eventually cause the spa_sync() thread to block on it, and then all writes will block. The idea is that the caller would call uiolockdown() before calling dmu_tx_assign(); this would lock the pages in memory so that uiomove() won't block. I'm not familiar enough with the VM subsystem (especially on Linux) to comment on how exactly uiolockdown() would be implemented. Note that we already take several measures to decrease the likelihood of uiomove() having to block: zfs_write() calls uio_prefaultpages() to fault the pages into memory (though those pages could be evicted again before we get to uiomove()), and for full blocks zfs_write() uses uiocopy() to copy the data before assigning the transaction, so we don't use uiomove() at all in that case. However, zvol_write() and sbd_zvol_copy_write() don't take advantage of these tricks.
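A hedged sketch of the ordering described above, paraphrasing the shape of zfs_write(). Here `write_covers_a_full_block` is a placeholder condition and `abuf` stands for a loaned ARC buffer filled earlier via uiocopy(); treat this as an outline, not the verbatim code.

```c
/* Paraphrased shape of zfs_write() (simplified, not verbatim). */

/*
 * 1. Fault the user pages in *before* the transaction is assigned.
 *    This only narrows the window: the pages may be evicted again
 *    before uiomove() runs.
 */
uio_prefaultpages(MIN(n, max_blksz), uio);

tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
error = dmu_tx_assign(tx, TXG_WAIT);	/* txg held open from here */

if (write_covers_a_full_block) {
	/*
	 * 2. For full blocks the data was already copied out of the
	 *    uio with uiocopy() before dmu_tx_assign(), so no
	 *    user-memory access (and no page fault) happens while
	 *    the txg is open.
	 */
	dmu_assign_arcbuf(sa_get_db(zp->z_sa_hdl), woff, abuf, tx);
} else {
	/*
	 * 3. Partial blocks still call uiomove() with the txg open,
	 *    which is where the blocking problem remains.
	 */
	error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl), uio,
	    nbytes, tx);
}
dmu_tx_commit(tx);
```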
@ryao One way we might handle this, at least for mmap'ed reads, is to implement our own [...]. Unrelated to this issue, we could be doing something smarter in [...]
@DeHackEd @ryao @behlendorf As far as I know, this problem also occurs in other parts of the Linux kernel and is an open question. The key point here is that the page reclaim algorithm chooses the wrong page to evict, causing the victim to shoot itself in the foot. I know FUSE (Filesystem in Userspace) avoids this issue by reserving some memory for emergency use; could we do the same?
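For illustration, a minimal sketch of the kind of reserved "emergency" pool being suggested, using the stock Linux mempool API. The cache name, object size, and reserve count below are all assumptions, not anything ZoL actually implements.

```c
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical reserve so forward progress is possible even when the
 * allocator would otherwise have to dip into reclaim. */
#define ZFS_EMERGENCY_RESERVE	16	/* reserved elements (assumed) */

static struct kmem_cache *io_cache;	/* hypothetical object cache */
static mempool_t *io_pool;

static int __init reserve_init(void)
{
	io_cache = kmem_cache_create("io_obj", 256, 0, 0, NULL);
	if (!io_cache)
		return -ENOMEM;

	/* Pre-allocates ZFS_EMERGENCY_RESERVE objects up front;
	 * mempool_alloc() falls back to them under memory pressure. */
	io_pool = mempool_create_slab_pool(ZFS_EMERGENCY_RESERVE, io_cache);
	if (!io_pool) {
		kmem_cache_destroy(io_cache);
		return -ENOMEM;
	}
	return 0;
}
```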
#2411 and openzfs/spl#369 should handle this. Specifically, the mix of PF_FSTRANS changes should tackle it.
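For context, a rough sketch of how the PF_FSTRANS machinery gets used: spl_fstrans_mark()/spl_fstrans_unmark() are the SPL helpers from that work, but the call site below is purely illustrative and the header name is assumed.

```c
#include <sys/vmsystm.h>	/* SPL fstrans helpers (header assumed) */

/* Illustrative only: wrap a section that must not re-enter the
 * filesystem through direct reclaim. */
static void do_txg_work(void)
{
	fstrans_cookie_t cookie;

	/* Sets PF_FSTRANS on the current task.  While it is set, memory
	 * allocations behave as if __GFP_FS were cleared, so direct
	 * reclaim will not wait on filesystem writeback -- breaking the
	 * deadlock cycle described above. */
	cookie = spl_fstrans_mark();

	/* ... work performed while the txg is held open ... */

	spl_fstrans_unmark(cookie);	/* restore the previous state */
}
```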
The originally reported issue was determined by @DeHackEd, the original reporter, to be due to a kernel bug. It's possible subsequent improvements in ZoL have also helped. Closing.
ryao says this may be mmap related. For now I'll dump the info I know here and update as I get more. Feel free to update the description if I take too long coming up with something better.
Hung task stack traces: http://pastebin.com/n2E2Mq6d
Slapd (OpenLDAP using BDB as its backend) is doing a replication database receive. It's slowed down by what I assume to be #1297, but that's not the issue here.
The machine is a KVM guest with 8 cores and 10 GB of RAM, running a custom-built kernel. All ZFS options are at their defaults. Currently running build 0b4d1b5.
The hang usually takes anywhere from 5-30 minutes to manifest. It may be a Linux kernel regression: a 3.2 kernel appeared stable, but a 3.8 kernel would jam up pretty quickly, and some intermediate kernels (~3.4) would run overnight successfully. I'm trying to bisect a possible upstream regression, but the irregular nature of the lockup means some guesswork is involved.
Update history: Old title was "Heavy IO lockup"