0.6.5 regression: zpool import -d /dev/disk/by-id hangs #3866

Closed
devsk opened this issue Oct 1, 2015 · 19 comments

@devsk

devsk commented Oct 1, 2015

At boot up, one of the scripts runs 'zpool import -d /dev/disk/by-id' to list all the pools available for import, and it hangs there forever.

This worked fine in 0.6.4.

It does not matter whether -d is provided or not; a plain zpool import hangs as well.

I did a quick debug boot of the system (it drops me into a shell in the initrd), and an strace run revealed that some ioctls are failing with ENOMEM. Since I can't log this to a file, I took a photo of the strace run. Have a look.

The hang is in ioctl(3, _IOC(0, 0x5a, 0x06, 0x00)), which fails with ENOMEM earlier in the strace. There is also ioctl(3, _IOC(0, 0x5a, 0x05, 0x00), 0x7fff4fd7e660) = -1, which failed with ENOENT.

@devsk
Author

devsk commented Oct 1, 2015

[photo of the strace output: img_20151001_084214363-small]

@devsk
Author

devsk commented Oct 1, 2015

I don't get any kernel panic or out-of-memory errors (other than the ENOMEM shown by strace, which I believe is wrong as per the code). The process hangs hard (likely in D state) and I can't escape with Ctrl-C on the only console I have... :( I should probably run it in the background to see if I can troubleshoot further.

When I press Ctrl-Alt-Del, the system reboots fine, suggesting that it is not a kernel panic or hang.

@devsk
Author

devsk commented Oct 1, 2015

Why can't I attach a text file here?

@FransUrbo
Contributor

I vaguely remember an issue saying something about the import going into an endless loop or something like that.

It should have been fixed in one of the point releases, so which version exactly are you using? And where do you get ZoL from: a package or source?

@devsk
Author

devsk commented Oct 1, 2015

Sorry, I forgot to mention this. I upgraded from 0.6.4 to 0.6.5. I am on Gentoo, so it's built from source.

This worked perfectly fine in 0.6.4.

@FransUrbo
Contributor

Ah. I don't know if @ryao used my init scripts in that (I've rewritten the init scripts from scratch because we had five versions, all different, which made them impossible to maintain). He has mentioned that he was going back to "some other" means, but I don't know the details of that.

Is there no point release (such as 0.6.5.1, or 0.6.5.2, which was released yesterday) available for Gentoo?

@devsk
Author

devsk commented Oct 1, 2015

This is early boot (the initrd) poking around to see which pools are available for import, so no init scripts are in the picture at this point. Pretty much:

  1. modprobe zfs
  2. zpool import
  3. loss

Same thing yields profit in 0.6.4...:)

@devsk
Author

devsk commented Oct 1, 2015

0.6.5.2 just hit Portage. But do we know whether it will fix this or not?

@ryao
Contributor

ryao commented Oct 1, 2015

@FransUrbo The scripts in the repository are in Gentoo at the moment, but I intend to merge #3800 by the end of the week.

@FransUrbo
Contributor

Since this is in the initrd, it's not the/my init scripts that are at fault. Gentoo is using completely different code in their initrd, so it's not my initrd code either…

Can't say. BUT 0.6.5.1 fixed a problem that COULD, under some circumstances, cause data loss, so I'd update either way and hope for the best.

@ryao
Contributor

ryao commented Oct 1, 2015

@FransUrbo That fix was backported. Gentoo skipped from 0.6.5 + that fix to 0.6.5.2.

@ryao
Contributor

ryao commented Oct 1, 2015

@devsk Are you using genkernel?

The genkernel zfs branch might work for you:

https://gitweb.gentoo.org/proj/genkernel.git/log/?h=zfs

It is designed to read the cachefile from the pool and import all pools using that. It solves reliability problems involving the cachefile and initramfs archives. It has not been merged yet because it is missing support for generating the scsi, usb and wwn symlinks in /dev/disk/by-id and the other /dev/disk symlinks. That should be resolved later this month.

@devsk
Author

devsk commented Oct 1, 2015

No, I am doing a manual 'debug' boot where nothing happens other than the three things I mentioned above, so it's not related to initrd packaging. The right version of the module is loaded for the running kernel, and the same version of the user-space tools (zpool) is being used to import.

@devsk
Author

devsk commented Oct 1, 2015

@ryao: I do use genkernel, but I am not auto-importing any pools. I'm doing a very simple modprobe zfs and zpool import (which is supposed to list pools, not import them) from the debug shell that genkernel drops you into with the 'debug' kernel cmdline.

@ryao
Contributor

ryao commented Oct 1, 2015

I suggest asking for help in #zfsonlinux on freenode. It will be quicker than going back and forth in the issue.

@devsk
Author

devsk commented Oct 1, 2015

Chris at the mailing list had this analysis:

Running zpool through strace, I find that it's hanging on ioctl(3, _IOC(0, 0x5a, 0x06, 0x00)).

Any ideas on what that ioctl does?

A previous ioctl(3, _IOC(0, 0x5a, 0x05, 0x00), 0x7fff4fd7e660) = -1 failed with ENOENT.

This might be a clue if someone knows what those ioctls are.

The ioctls are sort of listed in include/sys/fs/zfs.h. If I'm understanding all of this correctly, what you're seeing is a ZFS_IOC_POOL_TRYIMPORT operation hanging, with a previous ZFS_IOC_POOL_STATS failing.

(Again, if I'm understanding this correctly, what strace represents as e.g. '_IOC(0, 0x5a, 0x00, 0x00)' is the first ZFS ioctl, ZFS_IOC_POOL_CREATE. The 0x5a is 'Z', and the second byte is the index into the zfs_ioc enum counting from 0, so the command numbers are ('Z' << 8) plus that index. That makes 0x05 the sixth ZFS ioctl and 0x06 the seventh.)

I just looked at the other ioctls before this one. All the 0x06 ones failed with ENOMEM. This looks like a bug in the ZFS code. I have 12GB of RAM on this box; there is no way zpool import can eat that up in a few seconds.

If I'm reading the code right, I'm not convinced that ENOMEM means that you're out of (kernel) memory. The ZFS_IOC_POOL_TRYIMPORT operation returns some information into a memory buffer allocated by the 'zpool' command, and in theory an ENOMEM return is supposed to indicate that this buffer is too small for the information the kernel wants to return. However, there appear to be other conditions inside the ZFS module that can cause this ENOMEM error under some conditions[*].
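To make that decoding concrete, here is a tiny stand-alone sketch (my own illustration, not code from the ZFS tree) that reproduces the arithmetic Chris describes: assuming the command numbers are ('Z' << 8) plus the position in the zfs_ioc enum, the two requests seen in the strace come out as 0x5a05 (ZFS_IOC_POOL_STATS) and 0x5a06 (ZFS_IOC_POOL_TRYIMPORT).

```c
/*
 * Illustration only: reproduces the ioctl-number arithmetic described
 * above, assuming ZFS commands are numbered ('Z' << 8) + enum index.
 * The enum names follow Chris's reading of include/sys/fs/zfs.h.
 */
#include <stdio.h>

#define ZFS_IOC_BASE ('Z' << 8)   /* 0x5a00; 'Z' == 0x5a */

int main(void)
{
    unsigned int pool_stats     = ZFS_IOC_BASE + 0x05; /* ZFS_IOC_POOL_STATS */
    unsigned int pool_tryimport = ZFS_IOC_BASE + 0x06; /* ZFS_IOC_POOL_TRYIMPORT */

    /* strace shows these as _IOC(0, 0x5a, 0x05, 0x00) and _IOC(0, 0x5a, 0x06, 0x00) */
    printf("ZFS_IOC_POOL_STATS     = 0x%04x\n", pool_stats);     /* prints 0x5a05 */
    printf("ZFS_IOC_POOL_TRYIMPORT = 0x%04x\n", pool_tryimport); /* prints 0x5a06 */
    return 0;
}
```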

@devsk
Author

devsk commented Oct 1, 2015

I upgraded to 0.6.5.2 and the issue is gone. Does anybody have an idea, from the information above, what might have happened and which fix actually resolved this for me?

We can close this bug as a dup of that issue.

@Bronek

Bronek commented Oct 2, 2015

@devsk if the problem is fixed in 0.6.5.2, then perhaps it was a duplicate of #3785.

@behlendorf
Contributor

If I'm reading the code right, I'm not convinced that ENOMEM means that you're out of (kernel) memory.

Right, in this context it means that user space needs to pass a bigger buffer for the kernel to use.

@devsk my best guess is that you were hitting #3652 / #3785. This regression manifested itself in quite a few different ways but was resolved by 5592404. I'm happy to close this as a duplicate of that issue.
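For anyone else who trips over this, here is a rough sketch of the convention described above. This is my own illustration, not the actual libzfs code: fake_zfs_cmd is a simplified stand-in for the real command struct, and "request" would be an output-style ioctl such as the tryimport one discussed here. The point it demonstrates is that ENOMEM from such an ioctl means "grow the destination buffer and retry", not "the machine is out of memory".

```c
/*
 * Sketch of the grow-and-retry convention described above (illustration
 * only; not the real libzfs implementation).
 */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

struct fake_zfs_cmd {
    uint64_t nvlist_dst;        /* user buffer the kernel fills with an nvlist */
    uint64_t nvlist_dst_size;   /* size of that buffer */
};

/*
 * Issue an "output" ioctl, doubling the destination buffer and retrying
 * whenever the kernel reports ENOMEM (i.e. the buffer was too small).
 * On success the packed data is at zc->nvlist_dst; the caller frees it.
 */
int zfs_ioctl_with_retry(int fd, unsigned long request, struct fake_zfs_cmd *zc)
{
    size_t size = 16 * 1024;    /* arbitrary initial guess */
    char *buf = malloc(size);

    if (buf == NULL)
        return -1;

    for (;;) {
        zc->nvlist_dst = (uint64_t)(uintptr_t)buf;
        zc->nvlist_dst_size = size;

        if (ioctl(fd, request, zc) == 0)
            return 0;           /* kernel filled buf with the result */

        if (errno != ENOMEM) {  /* a real error (e.g. ENOENT): give up */
            free(buf);
            return -1;
        }

        /* ENOMEM here means "buffer too small": double it and try again */
        size *= 2;
        char *bigger = realloc(buf, size);
        if (bigger == NULL) {
            free(buf);
            return -1;
        }
        buf = bigger;
    }
}
```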

@behlendorf behlendorf added this to the 0.6.5.2 milestone Oct 2, 2015