umount, snapshot ZFS processes stuck in kernel forever causing high load #13327
I would suggest trying 2.1.4 first and seeing if the issue persists - there have been a number of bugs fixed since 2.1.0 released, and while I can't think offhand of any that would have caused this, it's always unfortunate to spend a long time figuring out your problem only to realize someone already resolved it.

More generally, if you're not seeing any "task blocked more than 120s" messages in dmesg (unless someone turned them off), that implies things are making progress, but so slowly that it doesn't really look like it. It'd be interesting to know where your ZFS kernel threads are spending their time - e.g. if you look at /proc/[one of the stuck processes]/stack for the different types of stuck process (zfs commands, zpool commands, ls on a dir, etc.), what does it say?
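One way to collect those stacks in bulk (a generic sketch, not something from this thread; reading /proc/&lt;pid&gt;/stack usually requires root) is to loop over every process in uninterruptible sleep:

```shell
#!/bin/sh
# Sketch: print the kernel stack of every process in state D
# (uninterruptible sleep). These are the processes that cannot be
# killed and whose stacks show where in the kernel they are waiting.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    comm=$(cat "/proc/$pid/comm" 2>/dev/null)
    echo "=== pid $pid ($comm) ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```

On a healthy system this prints nothing; on a system with stuck ZFS threads you would expect to see ZFS/SPL function names near the top of the stacks.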
Thanks for your comment. I'm planning on upgrading (reluctantly, as past Fedora release upgrades broke the ZFS installation every time). I cannot provide any more debugging info, as the situation got worse: more services became unresponsive and user sessions were no longer usable. A reboot was required (and, as expected, those processes prevented the system from shutting down, so a cold power cycle was necessary). I do not believe that any progress was being made - why else would there still be processes that had been stuck for over a month? I think one of the many zfs processes was stuck in some zpl snapdir function, but I don't remember the exact name. I wish I could go back and find out what happened.
A couple of wild shots:
The only special thing that happened recently was a zfs send test (of a very small dataset). Now, a wild theory might be that there were two unrelated problems: one, those ~55 frozen umount processes (which would eventually have exhausted proc and fd limits but did not immediately cause anything to get stuck), and two, the zfs send command, which caused things to freeze within a matter of days. Are there any known zfs send bugs that could cause something like this? At first glance, issue #4716 sounds a bit similar, as it also mentions accessing snapshots and a subsequent freeze, but it's old and probably unrelated.
I'd still suggest trying 2.1.4. The closest thing I could think of is this code: zfs/module/os/linux/zfs/zfs_ctldir.c, lines 1015 to 1025 at commit 0dd34a1.
If you feel like experimenting you could try injecting
Additionally, investigating the contents could help. Please have backups!
Thanks for your ideas. The snapshot rotation script is very simple, and you can find it linked here. I don't want to lazy-unmount snapshots because I assume it would freeze in the same way, but then I wouldn't even see the umount process, so (if it happens again) I wouldn't have any clue that some snapshots might have something to do with whatever is going on.
You may find #13131 (comment) and my prior reply interesting; they cover how I revised my original patch to help with unmounting snapshots sometimes tripping an assertion failure on debug builds. As I comment there, these are just patches I'm experimenting with to fix the problem on my own systems. I make no promises they won't burn things down, other than that if they do, I'll likely be burning down too.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Same problem.
On what version? The eventual revisions of that patch got merged into 2.1, so if you're still having this issue on 2.1.12, that'd be exciting.
It's zfs-2.1.11.
Huh, did I never get #14462 cherry-picked into 2.1? Ruh-roh.
Uh, it's some sort of custom kernel, so it will probably take a long time until the next update. I'll try to keep this in mind, though. Probably irrelevant, but an interesting coincidence: it happened during ZFS scrubbing. And the un/mounts were almost certainly not initiated by the sanoid snapshot timer, but by some automatic systemd action that I know too little about.
The unmounts are usually a periodic timer in ZFS itself, which is what those patches fix; there were cases where it would fail and essentially just give up.
Workaround to avoid this: set the built-in zfs_expire_snapshot module parameter to zero, which disables the automatic unmounting of snapshots:

echo 0 > /sys/module/zfs/parameters/zfs_expire_snapshot
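That echo only changes the running kernel. A sketch of making it survive reboots, assuming the conventional modprobe.d mechanism (the file name below is arbitrary; adjust for your distro):

```shell
# Runtime change, lost on reboot:
echo 0 > /sys/module/zfs/parameters/zfs_expire_snapshot

# Persist across reboots via a module option. The file name
# zfs-snapdir.conf is an arbitrary choice for this example.
echo "options zfs zfs_expire_snapshot=0" > /etc/modprobe.d/zfs-snapdir.conf

# Verify the current value (0 = snapshot expiry disabled):
cat /sys/module/zfs/parameters/zfs_expire_snapshot
```

Note the trade-off: with expiry disabled, snapshots visited under .zfs/snapshot stay mounted until unmounted manually or the pool is exported.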
I'm observing a situation with ZFS processes stuck, causing the load average to grow into five digits. They are stuck in the kernel and therefore not killable. I'm wondering why this happens and whether it can be fixed without rebooting the server.
zfs-snapshot is a snapshot rotation script. There are tens of thousands of zfs processes like this, but only 55 umount processes. Other processes like CROND are also accumulating (10k).
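A generic way to quantify this kind of pile-up (a sketch, not taken from the report) is to count uninterruptible (D-state) processes per command name:

```shell
# Count processes in state D (uninterruptible sleep, stuck in the
# kernel, immune to kill -9), grouped by command name.
ps -eo comm=,stat= \
    | awk '$2 ~ /^D/ {n[$1]++} END {for (c in n) print n[c], c}' \
    | sort -rn
```

On an affected system this would show counts next to names like zfs, umount, and crond, making it easy to see which process type dominates.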
Could this be an issue with ZFS? Assuming some of those ZFS processes are causing the others to get stuck, how can they be terminated?
This is ZFS 2.1.0-1, currently running on Fedora 32, kernel 5.11.2.
At first glance, issue #10100 appears to be similar, but in this case there are no soft lockup errors. It seems to be somehow related to cifs and/or nfs exports (there are smbd processes from the same day). Now, running ls, lsof, or even bash auto-complete on (some older) snapshots gets stuck as well.
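When listings under .zfs/snapshot may hang, a defensive way to probe them (a sketch; the dataset path below is a placeholder, not from this report) is to bound the command with a timeout so the probing shell itself does not join the pile of D-state processes:

```shell
# Probe a snapshot directory without risking an indefinite hang.
# /tank/data is a hypothetical mountpoint used for illustration.
timeout 5 ls /tank/data/.zfs/snapshot/ \
    || echo "listing failed or timed out"
```

Note that if the automount is truly stuck in the kernel, the ls child may still linger in D state after the timeout; the point is only that the interactive shell gets control back.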