zfs crash, followed by lockup on boot (volumes mount successfully in ZEVO) #1702
Comments
The stacks indicate that the mount was blocked on I/O while it was replaying the ZIL. That makes me suspect a disk problem. If you were able to mount the FS on your Mac then the ZIL will have been replayed and you won't have any issues importing the pool on Linux.
I suspect this is happening because I'm hitting my eSATA bandwidth cap on this device (I have multiple drives connected to this eSATA port) and not because of a drive failure. So this is likely a system block I/O issue on the drive, and not an I/O timeout intrinsic to ZFS? In that case, just tuning the block-device timeout in the system for my device should help alleviate this issue, right?
Or would it make sense to reduce the timeout from 30 seconds to something like 5 seconds and have the device try again? I'm not sure what would be most efficient here. Scrubs on the disk show all the data is fine. Marking the disk as bad due to a failed read, when the failed read was due to an I/O limitation, seems less than ideal. As per this discussion: #471, I'm not sure about the best way to proceed here. Regardless, if the mount is blocked on I/O, the zfs module shouldn't crash, right? (If this happens in ZEVO the pool is forcibly exported, though it occasionally takes the system down there too.)
Nothing should crash, and in fact nothing crashed under Linux. The console messages (while scary looking) are simply warnings. They indicate that a task is taking far longer to complete than it should; in this case it was due to an I/O which just wasn't completing. But nothing failed.
Increasing the device timeout may help if for some reason things are just backed up. However, if you're overloading this eSATA port it may be wiser to decrease the timeout and increase the retry count. This will allow you to attempt more retries in the same amount of time. The best thing to do would be to move the disks onto additional controllers to avoid this entirely.
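For reference, the per-device SCSI command timeout can be inspected and adjusted through sysfs. This is only a rough sketch; `/dev/sdb` is a placeholder for one of the affected eSATA disks, not a device named in this report:

```
# Show the current SCSI command timeout (in seconds) for the device:
cat /sys/block/sdb/device/timeout
# Temporarily lower it to 10 seconds (does not persist across reboots):
echo 10 | sudo tee /sys/block/sdb/device/timeout
```

A udev rule, as discussed later in this thread, is the usual way to make such a change persistent.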
Thanks again. I've adjusted the timeout to 10 seconds to see if that helps. I was able to reboot the system by SSHing in, but the system then hung at zfs mount at the next boot. I had to reboot into OS X, where ZEVO mounted the volumes, before rebooting into Linux, where I had to export and then re-import the pools. Is there maybe an issue with mounting pools at the next boot after this sort of thing happens? (The failure to mount the zpools at the next boot into Linux after the issue blocked the startup process, so I couldn't connect in and see what was going on.) If I/O doesn't complete within the timeout and the mount becomes unresponsive, do the pools not export cleanly during shutdown? Is there a way to force umount/export at shutdown? Since the timeout was 30 seconds, and mount.zfs was complaining after 120 seconds, I imagine that ZFS should have complained and marked the disk(s) as bad. Would that have caused the lockup at the next boot?
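The export/re-import step mentioned above is the standard zpool workflow; a minimal sketch, with "tank" as a placeholder pool name rather than one from this report:

```
sudo zpool export tank    # cleanly export the pool
sudo zpool import         # list pools available for import
sudo zpool import tank    # re-import it; add -f if the pool was not exported cleanly
```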
I added this to /etc/udev/rules.d/81-zfs-timeout.rules on my Ubuntu system to drop the timeout to 10 seconds on all of my potential devices.
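The rule text itself did not survive in this copy of the thread. A hypothetical reconstruction of that kind of rule, lowering the timeout to 10 seconds for all sd* block devices (the match criteria here are assumptions, not the original rule):

```
# /etc/udev/rules.d/81-zfs-timeout.rules  (illustrative sketch, not the original rule)
# Lower the SCSI command timeout to 10 seconds for matching whole-disk block devices.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="10"
```

After adding or editing the rule, `sudo udevadm control --reload-rules` followed by `sudo udevadm trigger` (or a reboot) should apply it to existing devices.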
@satmandu Was your proposed tuning enough to resolve the issue?
Sadly, no. 50% of boots were hanging. (I eventually had the timeout set to 40 seconds, then I gave up.) Reverting the change allowed me to reliably boot up... which I consider a priority. ;-)
I have 3 volumes on 6 disks, five of which are connected via eSATA and one via FireWire.
Volumes were created in ZEVO, and have mounted successfully in Ubuntu on kernel 3.11 with zfsonlinux 0.6.2.
While doing a chown on a git directory, zfsonlinux hit an error, locking up all ZFS volumes.
Rebooting works, until the volumes get mounted. Attempting to mount volumes fails and locks the system.
This is what was in my dmesg prior to having to reboot:
My guess is that there might have been an issue where the eSATA bus was overwhelmed and throttled, causing ZFS on Linux to get confused. That's all I have.
The volumes mount successfully when the system is rebooted into OS X, and I'm running a scrub on them there now.