Deadlock with hung tasks in kmalloc #446
Comments
Interesting, so all the threads which entered direct reclaim are unfortunately stuck waiting to open the next txg. It would be interesting to see why the txg_sync thread isn't able to move things along. I've very rarely seen similar issues when the ARC is half full of dirty data: threads attempting to manipulate the txg keep getting ERESTART back, and txg_sync for some reason never makes forward progress.
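For context, here is a minimal sketch of the usual ZFS transaction-assignment retry loop this describes; it is illustrative only (write_with_retry and its arguments are hypothetical, not code from this issue). When the open txg cannot accept more dirty data, dmu_tx_assign() returns ERESTART and the caller waits in dmu_tx_wait() for txg_sync to open the next txg; if txg_sync never makes progress, every such caller keeps looping here.

```c
#include <sys/dmu.h>
#include <sys/dmu_tx.h>

/* Hypothetical caller illustrating the ERESTART retry pattern. */
static int
write_with_retry(objset_t *os, uint64_t object, uint64_t offset, int len)
{
	dmu_tx_t *tx;
	int error;

top:
	tx = dmu_tx_create(os);
	dmu_tx_hold_write(tx, object, offset, len);

	error = dmu_tx_assign(tx, TXG_NOWAIT);
	if (error == ERESTART) {
		/*
		 * The current txg is full: wait for txg_sync to open the
		 * next one, then retry the assignment from the top.
		 */
		dmu_tx_wait(tx);
		dmu_tx_abort(tx);
		goto top;
	}
	if (error != 0) {
		dmu_tx_abort(tx);
		return (error);
	}

	/* ... copy the data into the DMU under the assigned tx ... */

	dmu_tx_commit(tx);
	return (0);
}
```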
I am sure this is no help, but I wanted to comment at least. I have had the same symptoms while running VirtualBox on ZFS. For a while, I couldn't even have a ZFS pool mounted while VirtualBox was running a VM. I ended up keeping the VM on ext4 and taking all ZFS pools offline until I needed them, then switching the VM off to bring ZFS back up. It would deadlock the server just as etienne-dechamps-o posted. However, now, still running rc5 on Ubuntu, I have all my ZFS filesystems mounted (with nothing accessing them), the VM running, and no lockups in over a week. I'll probably upgrade to rc6 this weekend and try the VM on there again. If there's anything I can do to help resolve this bug, let me know. For the record: a big THANK YOU to all working on this project!! I love ZFS! I love it even more on Linux!
Hi, I'm new to ZFS on Linux. First of all, I'd like to say a big thank you for your work, Brian! I use a backup server to test ZFS. It's an Ubuntu 10.04.3 LTS with a 3.1.1 kernel and the latest ZFS/SPL code (downloaded yesterday from Git). I use raidz on 4 HDDs of 3 TB each. Here is the latest hung-task log:
Nov 11 23:22:22 nsXX kernel: kswapd0 D 0000000000000000 0 748 2 0x00000000
I can reproduce it within less than 3 hours of stressing the server, so if you make a patch, I can test it quickly. Hope I can help... Adrien
Thanks Adrien, hopefully your extra debugging will help us get this issue resolved.
Adrien, I'd like to try to reproduce this. Can you tell me what your stress test does? Thanks.
Hi Akorn, sorry for the delay. The test was 4 rsyncs between an SSD drive and the ZFS pool, plus many copies of the kernel sources (lots of small files) in parallel. Now I get these logs with only one rsync from ZFS to another drive (ReiserFS), after 657 GB had been copied.
Adrien
Sorry, in the meantime the box I was using for experiments went live; I don't have a sandbox at the moment. Andras
It looks like pull request #669 should address the original issue reported here.
I'm not so sure. Certainly it would prevent the reclaim under zvol_write(), but that should be safe under most circumstances (except for a swap device). The real question is why the txg_sync thread isn't making headway. This issue also feels a bit stale. Has anyone hit this recently, or shall we just close it and open a new one if this is observed again?
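For what it's worth, the kind of change being discussed usually looks like the sketch below: mark the task as being inside a filesystem transaction around the zvol write, so any allocations underneath avoid direct reclaim and therefore cannot recurse back into ZFS. This is only a hedged illustration; the helper names follow the later SPL spl_fstrans_mark()/spl_fstrans_unmark() API, and zvol_write_impl_sketch is a made-up wrapper, not the actual patch in pull request #669.

```c
#include <sys/kmem.h>	/* assumed header for the SPL fstrans helpers */

/* Hypothetical wrapper showing the fstrans-marking technique. */
static void
zvol_write_impl_sketch(void *arg)
{
	fstrans_cookie_t cookie;

	/*
	 * Mark this task so memory allocations below will not enter
	 * direct reclaim and re-enter the filesystem.
	 */
	cookie = spl_fstrans_mark();

	/* ... assign a tx and copy the request data into the DMU ... */

	spl_fstrans_unmark(cookie);
}
```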
Well, I only encountered this bug once. After that I stopped using VirtualBox, so I have no idea how to reproduce it or even whether the bug is still there. So feel free to close this until someone stumbles upon the issue again.
I had the bug a lot of times on my backup server, roughly every 12 hours, so I stopped using ZFS :-( I can try using it again, but I need some time to buy new hard drives. I'll keep you posted.
Alright. Then I'm going to close this issue for now. I'm sure someone will open a new issue if this remains a problem.
Avoid deadlocks when entering the shrinker from a PF_FSTRANS context.

This patch also reverts commit d0d5dd7, which added MUTEX_FSTRANS. Its use has been deprecated within ZFS as it was an ineffective mechanism for eliminating deadlocks. Among other things, it introduced the need for strict ordering of mutex locking and unlocking so that the PF_FSTRANS flag wouldn't be set incorrectly.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#446
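A rough sketch of the check that commit message describes, assuming a shrinker callback and the PF_FSTRANS task flag; the function name and surrounding details are illustrative, not the exact patched code.

```c
#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative shrinker scan callback. */
static unsigned long
zfs_shrinker_scan_sketch(struct shrinker *shrink, struct shrink_control *sc)
{
	/*
	 * A task already inside a filesystem transaction must not do
	 * reclaim work here, since that could call back into ZFS and
	 * deadlock; tell the VM to stop scanning this shrinker.
	 */
	if (current->flags & PF_FSTRANS)
		return (SHRINK_STOP);

	/* ... normal ARC / kmem cache reclaim ... */
	return (0);
}
```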
I'm running the latest SPL/ZFS from master.
ZFS just deadlocked my server. This happened while it was doing quite a lot of things at the same time (most notably, rtorrent and a ZVOL-backed VirtualBox).
I have no idea what triggered it exactly. I noticed that my processes were getting deadlocked one after the other (hung and SIGKILL-proof). The box, however, stayed up and running (no panic), although most ZFS operations wouldn't complete. Basically I was still able to read the pool but it was impossible to write anything. I observed the phenomenon for a few minutes, then I rebooted the box. Needless to say, this will probably be difficult to reproduce.
What's much more interesting, however, is the kernel log:
What's interesting is that the rtorrent process was doing a direct memory reclaim (see stack trace), while most other tasks were spinning in kmalloc_nofail. Seems to me that the fix for the infamous #287 bug may be unveiling a new kind of issue.
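To make the symptom concrete, here is a hedged sketch of what a "nofail" allocation loop looks like (illustrative, not the actual SPL kmalloc_nofail() implementation). Each caller simply retries until the allocation succeeds, so if the only threads that could free memory are themselves stuck waiting for the next txg, every caller spins here and produces hung-task stacks like the ones described above.

```c
#include <linux/slab.h>

/*
 * Illustrative retry-until-success allocator in the spirit of
 * SPL's kmalloc_nofail().
 */
static inline void *
kmalloc_nofail_sketch(size_t size, gfp_t flags)
{
	void *ptr;

	do {
		/*
		 * Each failed attempt triggers reclaim; if reclaim can
		 * never free anything, this loop never terminates.
		 */
		ptr = kmalloc(size, flags);
	} while (ptr == NULL);

	return (ptr);
}
```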