0.6.5 regression: zpool import -d /dev/disk/by-id hangs #3866

Closed
devsk opened this issue Oct 1, 2015 · 19 comments

@devsk

devsk commented Oct 1, 2015

At boot up, one of the scripts runs 'zpool import -d /dev/disk/by-id' to list all the pools available for import, and it hangs there forever.

This worked fine in 0.6.4.

It does not matter whether -d is provided or not; a plain zpool import hangs as well.

I did a quick debug boot of the system (it drops me into a shell in the initrd), and an strace run revealed that some ioctls are failing with ENOMEM. Since I can't log this to a file, I took a photo of the strace run. Have a look.

The hang is in ioctl(3, _IOC(0, 0x5a, 0x06, 0x00)), which fails with ENOMEM earlier in the strace. There is also ioctl(3, _IOC(0, 0x5a, 0x05, 0x00), 0x7fff4fd7e660) = -1, which failed with ENOENT.

@devsk
Author

devsk commented Oct 1, 2015

[photo of the strace output: img_20151001_084214363-small]

@devsk
Author

devsk commented Oct 1, 2015

I don't get any kernel panic or out-of-memory errors (other than the ENOMEM shown by strace, which I believe is wrong as per the code). The process hangs hard (likely in D state) and I can't escape with Ctrl-C on the only console I have... :( I should probably run it in the background to see if I can troubleshoot further.

When I press Ctrl-Alt-Del, the system reboots fine, suggesting that it is not a kernel panic or hang.

@devsk
Author

devsk commented Oct 1, 2015

Why can't I attach a text file here?

@FransUrbo
Contributor

I vaguely remember an issue saying something about the import going into an endless loop or something like that.

It should have been fixed in one of the point releases, so which version exactly are you using? And where do you get ZoL from: a package or source?

@devsk
Author

devsk commented Oct 1, 2015

Sorry, I forgot to mention this. I upgraded from 0.6.4 to 0.6.5. I am on Gentoo, so it's built from source.

This worked perfectly fine in 0.6.4.

@FransUrbo
Contributor

Ah. I don't know if @ryao used my init scripts in that (I've rewritten the init scripts from scratch because we had five versions, all different, which made them impossible to maintain). He has mentioned that he was going back to "some other" means, but I don't know the details of that.

Is there no point release (such as 0.6.5.1, or 0.6.5.2, which was released yesterday) available for Gentoo?

@devsk
Author

devsk commented Oct 1, 2015

This is early boot (the initrd) poking around to see which pools are available for import, so no init scripts are in the picture at this point. Pretty much:

  1. modprobe zfs
  2. zpool import
  3. loss

Same thing yields profit in 0.6.4...:)

@devsk
Author

devsk commented Oct 1, 2015

0.6.5.2 just hit Portage. But do we know whether it will fix this or not?

@ryao
Contributor

ryao commented Oct 1, 2015

@FransUrbo The scripts in the repository are in Gentoo at the moment, but I intend to merge #3800 by the end of the week.

@FransUrbo
Contributor

Since this is in the initrd, it's not the/my init scripts that are at fault. Gentoo is using completely different code in their initrd, so it's not my initrd code either…

Can't say. BUT 0.6.5.1 fixed a problem that COULD, under some circumstances, cause data loss, so I'd update either way and hope for the best.

@ryao
Contributor

ryao commented Oct 1, 2015

@FransUrbo That fix was backported. Gentoo skipped from 0.6.5 + that fix to 0.6.5.2.

@ryao
Contributor

ryao commented Oct 1, 2015

@devsk Are you using genkernel?

The genkernel zfs branch might work for you:

https://gitweb.gentoo.org/proj/genkernel.git/log/?h=zfs

It is designed to read the cachefile from the pool and import all pools using that. It solves reliability problems involving the cachefile and initramfs archives. It has not been merged yet because it is missing support for generating the scsi, usb and wwn symlinks in /dev/disk/by-id and the other /dev/disk symlinks. That should be resolved later this month.

@devsk
Author

devsk commented Oct 1, 2015

No, I am doing a manual 'debug' boot where nothing happens other than the three things I mentioned above, so it's not related to initrd packaging. The right version of the module is loaded for the running kernel, and the same version of the user-space tools (zpool) is being used to import.

@devsk
Author

devsk commented Oct 1, 2015

@ryao: I do use genkernel, but I am not auto-importing any pools. I'm doing a very simple modprobe zfs and zpool import (which is supposed to list pools, not import them) from the debug shell that genkernel drops you into with the 'debug' kernel cmdline.

@ryao
Contributor

ryao commented Oct 1, 2015

I suggest asking for help in #zfsonlinux on freenode. It will be quicker than going back and forth in the issue.

@devsk
Author

devsk commented Oct 1, 2015

Chris at the mailing list had this analysis:

Running zpool through strace, I find that it's hanging on ioctl(3, _IOC(0, 0x5a, 0x06, 0x00)).

Any ideas on what that ioctl does?

A previous ioctl(3, _IOC(0, 0x5a, 0x05, 0x00), 0x7fff4fd7e660) = -1 failed with ENOENT.

This might be a clue if someone knows what those ioctls are.

The ioctls are sort of listed in include/sys/fs/zfs.h. If I'm understanding all of this correctly, what you're seeing is a ZFS_IOC_POOL_TRYIMPORT operation hanging, with a previous ZFS_IOC_POOL_STATS failing.

(Again, if I'm understanding this correctly, what strace represents as e.g. '_IOC(0, 0x5a, 0x00, 0x00)' is the first ZFS ioctl, ZFS_IOC_POOL_CREATE. The 0x5a is 'Z', and the second byte is the index into the zfs_ioc enum counting from 0, so the command numbers are ('Z' << 8) plus that index. That makes 0x05 the sixth ZFS ioctl and 0x06 the seventh.)

I just looked at the other ioctls before this one. All the 0x06 ones failed with ENOMEM. This looks like a bug in the ZFS code. I have 12GB of RAM on this box; there is no way zpool import can eat that up in a few seconds.

If I'm reading the code right, I'm not convinced that ENOMEM means that you're out of (kernel) memory. The ZFS_IOC_POOL_TRYIMPORT operation returns some information into a memory buffer allocated by the 'zpool' command, and in theory an ENOMEM return is supposed to indicate that this buffer is too small for the information the kernel wants to return. However, there appear to be other conditions inside the ZFS module that can cause this ENOMEM error under some conditions[*].
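To make that decoding concrete, here is a tiny stand-alone sketch (my own illustration, not code from the ZFS tree) that reproduces the arithmetic Chris describes: assuming the command numbers are ('Z' << 8) plus the position in the zfs_ioc enum, the two requests seen in the strace come out as 0x5a05 (ZFS_IOC_POOL_STATS) and 0x5a06 (ZFS_IOC_POOL_TRYIMPORT).

```c
/*
 * Illustration only: reproduces the ioctl-number arithmetic described
 * above, assuming ZFS commands are numbered ('Z' << 8) + enum index.
 * The enum names follow Chris's reading of include/sys/fs/zfs.h.
 */
#include <stdio.h>

#define ZFS_IOC_BASE ('Z' << 8)   /* 0x5a00; 'Z' == 0x5a */

int main(void)
{
    unsigned int pool_stats     = ZFS_IOC_BASE + 0x05; /* ZFS_IOC_POOL_STATS */
    unsigned int pool_tryimport = ZFS_IOC_BASE + 0x06; /* ZFS_IOC_POOL_TRYIMPORT */

    /* strace shows these as _IOC(0, 0x5a, 0x05, 0x00) and _IOC(0, 0x5a, 0x06, 0x00) */
    printf("ZFS_IOC_POOL_STATS     = 0x%04x\n", pool_stats);     /* prints 0x5a05 */
    printf("ZFS_IOC_POOL_TRYIMPORT = 0x%04x\n", pool_tryimport); /* prints 0x5a06 */
    return 0;
}
```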

@devsk
Author

devsk commented Oct 1, 2015

I upgraded to 0.6.5.2 and the issue is gone. Does anybody have an idea, from the information above, what might have happened and which fix actually resolved this for me?

We can close this bug as a dup of that issue.

@Bronek

Bronek commented Oct 2, 2015

@devsk if the problem is fixed in 0.6.5.2, then perhaps it was a duplicate of #3785.

@behlendorf
Contributor

If I'm reading the code right, I'm not convinced that ENOMEM means that you're out of (kernel) memory.

Right, in this context it means that user space needs to pass a bigger buffer for the kernel to use.

@devsk my best guess is that you were hitting #3652 / #3785. This regression manifested itself in quite a few different ways but was resolved by 5592404. I'm happy to close this as a duplicate of that issue.
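For anyone else who trips over this, here is a rough sketch of the convention described above. This is my own illustration, not the actual libzfs code: fake_zfs_cmd is a simplified stand-in for the real command struct, and "request" would be an output-style ioctl such as the tryimport one discussed here. The point it demonstrates is that ENOMEM from such an ioctl means "grow the destination buffer and retry", not "the machine is out of memory".

```c
/*
 * Sketch of the grow-and-retry convention described above (illustration
 * only; not the real libzfs implementation).
 */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

struct fake_zfs_cmd {
    uint64_t nvlist_dst;        /* user buffer the kernel fills with an nvlist */
    uint64_t nvlist_dst_size;   /* size of that buffer */
};

/*
 * Issue an "output" ioctl, doubling the destination buffer and retrying
 * whenever the kernel reports ENOMEM (i.e. the buffer was too small).
 * On success the packed data is at zc->nvlist_dst; the caller frees it.
 */
int zfs_ioctl_with_retry(int fd, unsigned long request, struct fake_zfs_cmd *zc)
{
    size_t size = 16 * 1024;    /* arbitrary initial guess */
    char *buf = malloc(size);

    if (buf == NULL)
        return -1;

    for (;;) {
        zc->nvlist_dst = (uint64_t)(uintptr_t)buf;
        zc->nvlist_dst_size = size;

        if (ioctl(fd, request, zc) == 0)
            return 0;           /* kernel filled buf with the result */

        if (errno != ENOMEM) {  /* a real error (e.g. ENOENT): give up */
            free(buf);
            return -1;
        }

        /* ENOMEM here means "buffer too small": double it and try again */
        size *= 2;
        char *bigger = realloc(buf, size);
        if (bigger == NULL) {
            free(buf);
            return -1;
        }
        buf = bigger;
    }
}
```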

@behlendorf behlendorf added this to the 0.6.5.2 milestone Oct 2, 2015