Pool frozen (bad hdd), running processes stuck, zpool not working anymore #3233
Host is a Debian 8 with kernel 3.16.0-4-amd64. ZoL installed from the repository. SPL: Loaded module v0.6.3-296_g353be12
Mea culpa, I myself posted on the wrong issue.
@kernelOfTruth why wrong issue?
@basic6 I wanted to post in #3232; I confused your issue (with its one comment) with mine and wanted to write that you had answered on the wrong issue, but then realized it was actually me who got it wrong 😏 Your issue could be related to #3215 or #3175, so taking the advice from those issues: if you want a fast answer, take a look at the mailing list or the IRC channels; otherwise it could take some hours (or in the worst case days) here.
Oh, I see, no problem. Bug 3215 mentions corrupted data, but zpool status does not show the same error in my case. According to the status output, everything is perfectly fine, zero errors. In Bug 3175, the user gets a "pool is busy" error, which I don't get. I have already asked on IRC; DeHackEd did not know what's going on (so it can't be easy).
After rebooting, all zfs/zpool commands get stuck (but there is a lot of hdd activity): [ 600.164080] INFO: task zpool:2516 blocked for more than 120 seconds. Also, the ZFS volumes cannot be mounted, but they do show up as directories in /datapool. However, a df on one of them says that it's on the root disk, not ZFS. I recently updated a few packages on the system, maybe this is related. There are now updates available for the zfs/zfs-dkms/spl packages, so I'll update the system and try again.
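As a quick way to confirm whether what's under /datapool is a real ZFS mount or just an empty directory on the root disk (the pool name datapool is assumed from the mountpoint; a rough sketch, not the exact commands used here):

```sh
# Which filesystem each path is actually on; a dataset that failed to
# mount will report the root device instead of type "zfs".
df -hT /datapool /datapool/*

# Ask ZFS itself which datasets it considers mounted
# (this may hang here, like the other zfs commands).
zfs get -r -o name,value mounted datapool

# The hung-task warnings end up in the kernel log.
dmesg | grep -i "blocked for more than"
```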
System update did not fix it. After rebooting, iostat shows a lot of activity for all the drives in the pool, except the ones offlined (the bad drive and the drive with the bad sector):
Not sure what ZFS is doing; nothing is logged in dmesg, ZFS isn't saying what's going on. Given this hdd activity, it seems like ZFS is doing something... zpool status is getting stuck, just like all other zfs/zpool commands.
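The per-drive activity mentioned above can be watched like this (standard tools; zpool iostat may of course hang here just like the other zpool commands did):

```sh
# Per-device throughput and utilization, refreshed every second
# (the last column, %util, shows how busy each drive is).
iostat -dmx 1

# ZFS's own per-vdev view, if zpool commands still respond.
zpool iostat -v datapool 1
```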
Looks like it's working again! After one of several reboots, iostat showed 100% util for sdk (which is a pool member) and 0% for all the other drives. I have physically removed it and rebooted. At first, zpool status took a very long time (iostat showed changing util values, as shown in the last comment). After what felt like 15 minutes, zpool status returned. No read/write/checksum errors. So first of all - the zpool is alive again. I still don't really know what happened, but it's working again.
My pool is frozen again! So I highly doubt that the WD RED drive I removed (which looks healthy) had anything to do with it. The first bad hdd (smart error) has been replaced already, the second bad drive is still connected, but marked as "FAULTED". I'll remove it (and maybe put the WD RED drive back in). Kernel: 3.16.0-4-amd64
Stack traces after "echo w > /proc/sysrq-trigger": DeHackEd mentioned a zfs rename operation which appears to be in progress. The only thing that might run zfs rename (afaik) is the snapshot rotation cronjob. I might disable it temporarily and maybe delete a few snapshots. Also, using the daily build was suggested (with a warning that it might introduce other issues). This is the enabled line in /etc/apt/sources.list.d/zfsonlinux.list:
iostat shows no hdd activity:
Again, ZFS commands (like zfs list, and after a reboot probably zpool status as well) are just locking up, not throwing any error messages, leaving me to guess what might or might not be the issue. The pool is not running out of space either; there's more than 5 TB of space available.
On a side note: after switching from debian to debian-daily (zfsonlinux repo file), the installation got stuck at this point: The hanging installation was resolved by killing (SIGKILL) the child process "/bin/systemctl start zfs.target". After installing from debian-daily, this version is now installed:
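Regarding the hang during package configuration: the stuck child can be located and killed with standard tools, roughly equivalent to the following sketch:

```sh
# Find the hung "systemctl start zfs.target" spawned by the package scripts.
pgrep -af "systemctl start zfs.target"

# SIGKILL it so dpkg can finish configuring the packages.
pkill -9 -f "systemctl start zfs.target"
```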
This is what almost any zfs command looks like, getting stuck:
Running cat on a file on the storage does actually print its contents, but then gets stuck as well - and the stack looks identical:
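For reference, the kernel stack of a single stuck process (like that cat) can be read straight from /proc; a sketch, with the process name only as an example:

```sh
# Pick one stuck process by name (adjust as needed); -o = oldest match.
pid=$(pgrep -ox cat)

# "D" state means uninterruptible sleep; the stack shows where it is blocked.
ps -o pid,stat,wchan:25,cmd -p "$pid"
cat "/proc/$pid/stack"
```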
To be clear, I would have replaced the other bad drive (marked as "FAULTED") by now if ZFS wasn't frozen.
Memory usage (although that's probably not it because it would have worked after the reboot last time this happened):
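On the ZFS side specifically, the ARC's current and target sizes can be read from the SPL kstats; a minimal sketch, assuming the stock ZoL kstat path:

```sh
# Overall memory picture.
free -m

# Current ARC size, target size and maximum (all in bytes).
awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats
```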
As discussed with ryao, here's a paste of all running process stacks: http://pastee.co/pMIM8t
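A paste like that can be produced by walking /proc; something along these lines collects every process's kernel stack into one file (the output path is arbitrary):

```sh
# Dump pid, command name and kernel stack of every process into one file.
for p in /proc/[0-9]*; do
    echo "=== $p ($(cat "$p/comm" 2>/dev/null)) ==="
    cat "$p/stack" 2>/dev/null
done > /tmp/all-stacks.txt
```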
After rebooting (as suggested, to try the daily build), the system won't even boot anymore, because apparently it's now trying to access the pool before I get a shell, but the pool isn't accessible (probably until I remove another healthy drive like last time) due to this deadlock.
Try booting with ... Also make sure there's no zfs stuff in your initrd. If you're really paranoid, you can remove the modules (or at least move them out of the way so they don't get loaded automatically).
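On Debian, checking the initrd for ZFS pieces and moving the modules out of autoload reach would look roughly like this (paths assume a DKMS build; a sketch, not verified on this system):

```sh
# See whether the current initrd contains any ZFS/SPL bits.
lsinitramfs "/boot/initrd.img-$(uname -r)" | grep -iE 'zfs|spl'

# Move the DKMS-built modules aside so they can't be loaded automatically,
# then refresh module dependencies and rebuild the initrd.
mkdir -p /root/zfs-modules-disabled
mv /lib/modules/"$(uname -r)"/updates/dkms/*.ko /root/zfs-modules-disabled/
depmod -a
update-initramfs -u
```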
Correction - I was too impatient. After probably 5-10 minutes, the system did boot and is now running. ZFS is working again at the moment (using v0.6.3-40-0f7d2a, according to dmesg).
Good fortune then!
Thanks. I'll try a few things, put some load on the system, keep the snapshot rotation cronjob disabled for a few days and report back if something happens.
Might be time to do a full backup to other hardware… Just in case...
Good idea, the most important volumes already have an up-to-date backup, but I should run a full backup for the other volumes. The system is running normally at this point, except that I still have the snapshot rotation cronjob disabled, which apparently somehow triggered the deadlock.
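For context, a snapshot rotation job of the kind suspected here typically looks something like the sketch below; the dataset name and snapshot labels are made up, but the zfs rename / snapshot / destroy sequence is the part that matches the rename seen in the stack traces:

```sh
#!/bin/sh
# Hypothetical daily rotation for datapool/data: drop the oldest snapshot,
# shift the others back one slot, then take a fresh one.
zfs destroy  datapool/data@daily.2 2>/dev/null
zfs rename   datapool/data@daily.1 datapool/data@daily.2
zfs rename   datapool/data@daily.0 datapool/data@daily.1
zfs snapshot datapool/data@daily.0
```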
Earlier today, processes using the storage got stuck and zpool commands now get stuck as well, except zpool status, which shows no errors (all 0).
The trigger of this issue might be a bad hdd (smart error) and a second, almost-as-bad hdd (first bad sector found). It's a raidz3 pool, so two failing drives should not be a problem.
However, since ZFS isn't throwing any errors, it is not clear which hdd triggered this (or maybe it was unrelated to the hdds).
Trying to zpool offline the bad drive (the one with the smart error) got stuck, but zpool status in another terminal shows that it has been offlined. Trying to zpool replace it with the new drive also got stuck, and nothing is happening (the status output still shows the old drive, now offline).
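For reference, these are the two operations that hung; the device names below are placeholders, not the actual disks from this pool:

```sh
# Take the failing disk out of service (this is the command that hung).
zpool offline datapool ata-OLD_FAILING_DISK

# Swap in the replacement; with raidz3 the pool stays usable meanwhile.
zpool replace datapool ata-OLD_FAILING_DISK /dev/disk/by-id/ata-NEW_DISK
```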
iostat -dmx 1 did not show any drive stuck at a constant 100% utilization.
Since no error message can be found (ZFS is not saying what's going on), "echo w > /proc/sysrq-trigger" was issued. Here is a part of what was logged (the part with the stack traces): https://gist.github.com/basic6/a1b6c6a27d81a173966e
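The SysRq dump used above is the standard way to get the kernel stacks of blocked tasks; roughly:

```sh
# Make sure the SysRq interface is enabled, then ask the kernel to log
# the stacks of all tasks stuck in uninterruptible (D) sleep.
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger

# The traces land in the kernel ring buffer / syslog.
dmesg | tail -n 200
```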
It looks like ZFS got stuck so badly that a hard reboot will be necessary (losing some work, because those running processes would have to be killed).