
Odd MD pseudo-crash related to ZFS memory issue? #1619

Closed
cousins opened this issue Jul 30, 2013 · 4 comments

@cousins

cousins commented Jul 30, 2013

Just checking to see if anybody else has had anything similar to this happen:

I have a large ZFS pool (60 4 TB disks in six 10-disk raidz2 groups) that I have been rsyncing data to from a couple of other systems. While on vacation and checking in (I know, always a bad idea) I noticed one of the rsyncs had hung but the others were still going. While investigating I found that certain commands would give me "I/O error" messages. Then I found that the mirrored OS volume was degraded, with /dev/sda2 having been thrown out. sda1 and sda3 were still active in their mirrors though. /usr/bin and /usr/sbin showed I/O errors and my vacation wasn't much fun for a while.

I eventually booted from a DVD and poked around. The md devices were fine. The underlying hardware was fine. The file system was fine. I booted into the OS (Centos 6.4) again and everything is fine again. I didn't have to add /dev/sda2 back into the mirror and have it sync. It was just fine.
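
For reference, this is roughly how the mirror state can be checked and a dropped member re-added, in case it helps anyone else (a sketch; the md device name below is just an example):

```sh
# Overall md state; a degraded mirror shows up as something like [U_]
cat /proc/mdstat

# Detailed state of one array (example device name)
mdadm --detail /dev/md1

# If a member really had been kicked out, it could be re-added with:
#   mdadm --manage /dev/md1 --re-add /dev/sda2
```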

My guess (shared by a tech-support person at the vendor we bought the hardware from) is that ZFS somehow tromped on memory and put the root volume in a very weird state. Looking at the logs, I don't see any entries since the 19th.

Has anyone seen anything similar to this?

Thanks,

Steve

@tomposmiko

On 07/30/2013 10:48 PM, cousins wrote:

> My guess (along with a tech-support person from the vendor we bought the hardware from) is that ZFS somehow tromped on memory, and put the root volume in a very weird state.

Why do you suspect zfs?

> Has anyone seen anything similar to this?

I saw similar issues many times with md raid.

tamas

@cousins
Author

cousins commented Jul 30, 2013

Hi Tamas,

We've been having a fair amount of trouble with ZFSonLinux that makes me a bit gun-shy: #1179 and openzfs/spl#247.

I've used MD for over 10 years on many systems and I've never seen this behavior before.

I admit that I have no proof that ZFS had anything to do with this. That is why I'm asking if anyone has seen anything like this. Just trying to get more information.

Steve

@tomposmiko

I saw similar (not exactly the same) issues when there was a crappy HDD (check SMART), a bad SATA/power connector, or a crappy HBA or its driver.
When I rebooted those machines, everything worked fine for a couple of hours or days, sometimes weeks.

In a similar case with HW RAID I saw timeouts in both the Linux and controller logs.

In such a case, disabling the HDD cache can help (if it's a HW RAID array).
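
For example, SMART status and the drive write cache can be checked and toggled with something like this (the device name is just a placeholder):

```sh
# Print SMART health, attributes and the error log for a drive
smartctl -a /dev/sda

# Turn off the drive's volatile write cache
hdparm -W 0 /dev/sda
```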

@behlendorf
Contributor

Even if this was caused by ZFS, without additional information to go on there's not much that can be done.
