Deadlock in zpool, txg_sync and udev/vol_id when importing the pool #1862
This looks like some sort of deadlock between pool import and zvol device creation. The fact that suppressing the zvol creation with zvol_inhibit_dev avoids the hang points in that direction.
@behlendorf Any idea how I should proceed [with finding out what/where the deadlock might be]? Setting zvol_inhibit_dev to 0 and creating a new filesystem works fine. The ZVOL device is where it's supposed to be in /dev. However, accessing it doesn't seem possible. I ran fdisk on it after creating the ZVOL, and that hung, as did txg_sync. I got that not-an-oops again, and all writes to the filesystem froze...
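For anyone trying to reproduce this, the sequence being described is roughly the following sketch. The zvol_inhibit_dev parameter is real; the dataset name, size and the runtime toggle are my own illustration (the parameter may need to be given at module load time instead).

```sh
# Allow zvol device nodes to be created (0 = create nodes, 1 = suppress them);
# alternatively: modprobe zfs zvol_inhibit_dev=0
echo 0 > /sys/module/zfs/parameters/zvol_inhibit_dev

# Creating the volume itself returns fine...
zfs create -V 1G share/testvol
ls -l /dev/zvol/share/testvol

# ...but the first real access to the block device is what hangs,
# together with the txg_sync kernel thread
fdisk -l /dev/zvol/share/testvol
```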
Doing some more debugging, after cherry-picking some stuff that looks fine, I have:
Doing an strace on these three processes I see:
It then hangs on the select. Looking in /udev/.udev/:
Since the process list indicates that it's the 'zpool list' that hangs, running this under strace shows:
and then it hangs at the ioctl...
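For reference, the kind of trace being described can be captured like this (the output path is just an example):

```sh
# Follow forks and timestamp every call; per the log above the hang
# is in an ioctl() on /dev/zfs
strace -f -tt -o /tmp/zpool-list.strace zpool list

# From another terminal, while it hangs: the last line shows the
# ioctl that never returns
tail -5 /tmp/zpool-list.strace
```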
After a restart (so I could try again):
and running that vol_id under strace, I see:
So maybe it's LARGEFILE support that's the problem?!
Well, I couldn't find anything that supports that, but what's (also) weird:
Shouldn't the zd* devices be there? I CAN see that udev IS reading/loading the ZoL rules...
Might be something wrong with udev:
Running zvol_id under strace:
A full log of a strace of the original (mother) udev, a child, and the zpool list processes can be found at http://bayour.com/misc/strace_zpool_import-full.txt. I'm currently looking at it myself, but it's 18M and I can't see anything obvious :( I also put the strace log from a manual zpool list at http://bayour.com/misc/strace_zpool_import.txt for completeness, and there I can see that the ioctl() hangs on /dev/zfs.
Any access to a ZVOL, even with 'zvol_inhibit_dev=1', hangs txg_sync. Creating one works, but any access to it hangs...
@behlendorf do you have any input? Any idea how to proceed?
Loading the module without zvol_inhibit_dev and then restarting udevd gives:
Then trying to run 'zpool list' hangs zpool, txg_sync and vol_id as usual.
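For completeness, a sketch of the udev side of that test; the exact restart command is an assumption about the (Debian-era) init setup:

```sh
# Restart the udev daemon, then watch kernel and udev events while the
# pool is imported; the zd* add events should show up here
/etc/init.d/udev restart
udevadm monitor --kernel --udev &
zpool import share
```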
Disabling the ZoL udev rules completely does not change anything. Still get the hang...
The syslog after the module is loaded (including output from the sysrq 't' key): http://bayour.com/misc/syslog_zpool_import.txt
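For completeness, that task-state dump can also be triggered without the keyboard combination:

```sh
echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
echo t > /proc/sysrq-trigger        # dump every task's stack to the kernel log
dmesg | tail -n 200                 # or check /var/log/syslog for the full dump
```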
Now I can't even mount all filesystems! It fails after a while.
Output from a sysrq 't' key: http://bayour.com/misc/syslog_zfs_mount_all.txt. Since I see a 'txg_wait_synced' in there, I checked the 'zfs_txg_timeout' value, and it's at the default '5'. I'll reboot and increase that to ... 10?
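zfs_txg_timeout can usually be changed on a live system as well; a sketch of both the runtime and the persistent form:

```sh
cat /sys/module/zfs/parameters/zfs_txg_timeout     # default is 5 (seconds)
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout

# Persistent variant, picked up at the next module load
echo "options zfs zfs_txg_timeout=10" >> /etc/modprobe.d/zfs.conf
```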
Trying to remove the cache device (after a successful import with zfs_txg_timeout=10):
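The command in question would be something like this, with a placeholder standing in for the real cache-device id:

```sh
# Removing a cache (L2ARC) device is normally quick; here it is the
# call that locks up
zpool remove share <cache-device-id>
zpool status share    # the cache vdev should disappear from the listing
```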
Has #938 resurfaced again? My txg_sync process is also at 100% after a call trace like #1862 (comment), which hangs 'zfs mount -av' at 12/628. During the import:
101%!? Cool! :) Trying options wildly.
A zpool import (no mount) is still running after almost an hour. Increasing arc_max to 10GB doesn't seem to change anything (as yet). No call trace or hung process though, that's good. I'm going to let it run and see what happens... Load average is at 2.10/2.20/2.11, an strace on the zpool process doesn't show anything (and yet it runs at 100% CPU), and I don't know how to trace a kernel process (txg_sync, also at 100%)...
After running zpool import for 80 minutes, with a huge max ARC cache, memory is barely touched.
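For reference, one way to raise the ARC ceiling to the 10GB mentioned above (runtime knob shown; the persistent form is an options line in /etc/modprobe.d/zfs.conf):

```sh
# Value is in bytes; persistent form: "options zfs zfs_arc_max=..."
echo $((10 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# Confirm the new ceiling and how much of it is actually in use
awk '$1 == "c_max" || $1 == "size"' /proc/spl/kstat/zfs/arcstats
```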
A 'zdb share' gives:
The strace output of this can be found at http://bayour.com/misc/strace_zdb.txt.
@FransUrbo Not much information there. Try doing an strace -o /tmp/x -ff on the zdb command.
@dweeezil which process do you want me to strace? I think I've done it on every one I could think of...
And there's no /tmp/x file... Trying again:
@FransUrbo The zdb command forks/clones itself a lot and most of the interesting action that might help figure out your Input/output error is in one of the forked processes. The processes are all zdb. When you use "-o /tmp/x -ff", the output files will be named "/tmp/x.<pid>" and there should be a whole lot of them. Did you look in /tmp to see if there were any files there?
I'd suspect that one of those trace files should give a clue as to where your I/O error is occurring.
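In other words, something along these lines (the pool name is the one used earlier in this thread, the output prefix is the one suggested above):

```sh
# -ff writes one trace file per (forked) process, named /tmp/x.<pid>
strace -ff -o /tmp/x zdb share
ls /tmp/x.*

# Find any child whose syscalls returned EIO (the error may also be
# generated purely in user space, in which case nothing will match)
grep -l "EIO" /tmp/x.*
```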
Ah! I've never used double F's, single yes, but not double. Much simpler than getting it all in one file :) Uploaded to http://bayour.com/misc/zdb_share/.
@FransUrbo The strace output doesn't show anything terribly interesting because a lot of the work is being done in the kernel. The first thing I'd do is to rule out a corrupted cache file. I'd also suggest backing off to a simple [...]. In order to check your cache file, I'd suggest renaming it with [...]. Your strace output shows the following devices being opened:
Are those the only devices in your pool?
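A sketch of that cache-file check, assuming the cache file is in its default /etc/zfs location:

```sh
# Move the possibly-stale cache file out of the way and let the import
# scan the devices directly instead of trusting the cached config
zpool export share                                   # if currently imported
mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
zpool import -d /dev/disk/by-id share
```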
I'm going to try the different zdb command lines, but in the meantime: those disks look like all of them. 15 disks (3x3TB + 12x1.5TB). I also have a 120GB Corsair SSD as a cache device; I've removed it from the system, but not from the pool (can't: any zpool remove/offline etc. gives me a lockup).
Strangely enough (and probably unrelated to the issue), all the disks that are prefixed with ata- also used to have scsi- prefixed names until a few days ago.
Running 'zdb -d share' gave an I/O error:
The strace output can be found at http://bayour.com/misc/strace_zdb_-d_share/. Loading the module, importing the pool, removing the cache file, rebooting (just to be sure :) and loading the module again shows 'no pools available'. Exporting the pool and then running 'zdb -e -p /dev/disk/by-id' gave http://bayour.com/misc/strace_zdb_-e_-p/zdb_-e_-p.out and the strace output is at http://bayour.com/misc/strace_zdb_-e_-p
Installing a non-debug version of spl/zfs GIT master makes it possible to do a zdb run. So compiling with debug interferes with the code in some way.
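For reference, the two builds being compared differ only in the configure flag (assuming a standard autotools build of the spl and zfs git trees):

```sh
# Debug build: ASSERTs and extra lock checking are compiled in
./autogen.sh && ./configure --enable-debug && make && sudo make install

# Non-debug build (the one that behaves better here)
./autogen.sh && ./configure && make && sudo make install
```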
Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed, leaving everything up to the kernel. This significantly simplified the code. However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing the ZFS_IOC_*_MINOR ioctl()s and moving all the functionality down into the kernel. Since this cleanup will change the kernel/user ABI it's being done in the same tag as the previous libzfs_core ABI changes. This will minimize, but not eliminate, the disruption to end users. Once merged, ZoL, Illumos, and FreeBSD will basically be back in sync in regards to handling ZVOLs in the common code. While each platform must have its own custom zvol.c implementation, the interfaces provided are consistent.

NOTES:

1) This patch introduces one subtle change in behavior which could not be easily avoided. Prior to this change, callers of 'zfs create -V ...' were guaranteed that upon exit the /dev/zvol/ block device link would be created or an error returned. That's no longer the case. The utilities will no longer block waiting for the symlink to be created. Callers are now responsible for blocking as needed, and this is why a 'sleep 1' was added to zconfig.sh.

2) The read-only behavior of a ZVOL now solely depends on whether the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant policy setting in the gendisk structure was removed. This both simplifies the code and allows us to safely leverage set_disk_ro() to issue a KOBJ_CHANGE uevent.

3) Because __zvol_create_minor() and zvol_alloc() may now be called in a sync task they must now use KM_PUSHPAGE.

References: illumos-gate/illumos@681d9761e8516a7dc5ab6589e2dfe717777e1123
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1862
@FransUrbo I was wrong in my comment above, the deadlock I thought I saw in the master code doesn't exist. Nor have I been able to reproduce what you're seeing in any of the testing. That, coupled with the fact that no one else has reported anything like this, makes me strongly suspect the deadlock was accidentally introduced by zfsrogue's zfs-crypto changes. For the reasons you've mentioned above I'd prefer not to look at that code, but perhaps you could. The question which needs to be answered is why the dsl_pool_config lock doesn't get dropped in the following call stack. In the master code it's taken at the start of [...]
Also, I don't expect this will fix your issue, but I've pushed an updated version of some long overdue zvol cleanup for review: #1969.
I took a line by line look at [...]. Looking at the diff between the zfs-crypto code and ZoL/master, there is only one added [...]. I have asked 'someone' who knows a lot more about this (both code bases) to take a look; if that person wants to post comments here directly, fine, and if not, I will forward them here.
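One low-tech way to hunt for that kind of lock/unlock imbalance in a checkout, purely as an illustration (the file layout is assumed to match ZoL master, and a per-file count mismatch is only a hint, since enter and exit can legitimately live in different functions):

```sh
# Flag source files whose dsl_pool_config enter/exit call counts differ
for f in module/zfs/*.c; do
  nenter=$(grep -c 'dsl_pool_config_enter' "$f")
  nexit=$(grep -c 'dsl_pool_config_exit' "$f")
  if [ "$nenter" -ne "$nexit" ]; then
    echo "$f: enter=$nenter exit=$nexit"
  fi
done
```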
@behlendorf I did find your branch in your own repo shortly after you posted that you had a possible fix. So my findings in #1862 (comment) are with that applied.
1862 incremental zfs receive fails for sparse file > 8PB
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Simon Klinkert <[email protected]>
Approved by: Eric Schrock <[email protected]>
References: illumos/illumos-gate@31495a1
illumos changeset: 13789:f0c17d471b7a
https://www.illumos.org/issues/1862
Ported-by: Brian Behlendorf <[email protected]>
Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed, leaving everything up to the kernel. This significantly simplified the code. However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing the ZFS_IOC_*_MINOR ioctl()s and moving all the functionality down into the kernel. Since this cleanup will change the kernel/user ABI it's being done in the same tag as the previous libzfs_core ABI changes. This will minimize, but not eliminate, the disruption to end users. Once merged, ZoL, Illumos, and FreeBSD will basically be back in sync in regards to handling ZVOLs in the common code. While each platform must have its own custom zvol.c implementation, the interfaces provided are consistent.

NOTES:

1) This patch introduces one subtle change in behavior which could not be easily avoided. Prior to this change, callers of 'zfs create -V ...' were guaranteed that upon exit the /dev/zvol/ block device link would be created or an error returned. That's no longer the case. The utilities will no longer block waiting for the symlink to be created. Callers are now responsible for blocking; this is why a 'udev_wait' call was added to the 'label' function in scripts/common.sh.

2) The read-only behavior of a ZVOL now solely depends on whether the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant policy setting in the gendisk structure was removed. This both simplifies the code and allows us to safely leverage set_disk_ro() to issue a KOBJ_CHANGE uevent. See the comment in the code for further details on this.

3) Because __zvol_create_minor() and zvol_alloc() may now be called in a sync task they must use KM_PUSHPAGE.

References: illumos-gate/illumos@681d9761e8516a7dc5ab6589e2dfe717777e1123
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1862
A debug patch to identify the writing process holding the rrwlock. If a writer has been blocked for 60 seconds and the lock still cannot be taken, print the process name and pid of the holder to the console.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1862
Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed and everything was left up to the kernel. However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing ZFS_IOC_*_MINOR and moving all the functionality down into the kernel. This gets ZoL effectively back in sync with Illumos and FreeBSD. This change also updates the zvol_*_minors() functions based on the FreeBSD implementations. This was done to avoid a deadlock which was introduced by the lock restructuring in Illumos 3464.

References: illumos-gate/illumos@681d9761e8516a7dc5ab6589e2dfe717777e1123

TODO:
* zvol_set_snapdev() is subject to the same deadlock and must be reworked in a similar way to zvol_create_minors().
* __zvol_rename_minor() must be updated to prompt udev.
* Test, test, test. This code can be subtle and there are quite a few cases to get right.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1862
Sorry, always late to the party! Maybe I can help, or see if rogue is around. It is worth noting that this area of the sources is new, so there are no "other" sources to compare against any more. I can reconfirm that [...]
This is quite a long issue now; @FransUrbo, is there a set of instructions that makes the problem trigger? I assume it is no longer importing volumes?
@lundman Correct. As soon as I don't set zvol_inhibit_dev, it hangs. Also have a look at #1933. I have no idea if it's related, but since I get an invalid value, my thought was that maybe there's something wrong with the metadata or txg/dmu (whatever those are :) which triggers the deadlock.
I've pushed the exact code I'm using to https://github.com/FransUrbo/zfs/tree/iscsi+crypto+mine_0.6.2+.
---- WARNING: the branch contains the zfs-crypto code ----
Ah yep, pretty easy to get stuck waiting on rrwlock (dp_config), especially when using zvol, which is odd :)
Simple printing on lock grab and release gets this output:
Telling [...]:
Although, this is the second locker; possibly the one before it is the one at fault. Which is this guy:
Which has no fewer than 6 returns without unlocking. An early pull request is available here: zfsrogue/zfs-crypto#41. If it is OK, we can ask rogue to merge it.
It seems to work. Halfway, anyway... I can import the pool without problems and all device nodes are created. However, when I try to mount the device, I get:
That's identical to #1933.
I'm going to rebuild without debugging and see if it works any better.
That looks familiar; didn't we already decide those ASSERTs need to be updated or removed?
!!WARNING -- possible crypto code in link -- WARNING!! Not those, those were some other ASSERTs (related to SPL_VERSION) - FransUrbo/zfs-crypto@b7d66cf. Do note that it's the [...]
Ok, so compiling without debugging seems to work. I can import and mount filesystems, including the ZVOL I tested, without any problems. I can't run [...], though.
@lundman It seems that the fix isn't enough... Everything seems to work, but as soon as I started 'loading' the volumes, I got a hang again. And after a reboot, I get a kernel panic when trying to import the volume. |
There's a lot of activity on the disks while it tries to import the pool, and the panic produces page after page of output on an 800x600 screen; I have no idea how to save it to disk. The sysrq keys don't work...
After a whole lot of reboots and attempted imports with corresponding panics, I seem to have gotten past the panics at least. Now I 'only' get a deadlock:
I'm back to kernel panics... The few times when it DIDN'T panic must have been a fluke...
Importing it readonly seems to have worked. |
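For anyone landing here with a similar hang, the read-only import being referred to is simply:

```sh
# Nothing gets written during a read-only import, so the suspect
# sync-task path is never exercised
zpool import -o readonly=on share
```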
After having it imported readonly and mounting (ro)/unmounting one ZVOL for half a day, I exported the pool and then imported it again writable. And this time I didn't get any panic, deadlock OR SPL error! This thing is really, really, REALLY starting to piss me off! :) It's impossible to draw any conclusion from anything that's happening and I can't reproduce any problem for long... Every time the problem has worsened, I think I've been running a scrub (can't be absolutely sure, a million things have happened over the two months I've been having problems) and rebooted halfway through... After that, shit really hit the fan, and it seems just sheer luck that I get back to a semi-stable system...
I'm closing this, because it's now quite obvious that it's the zfs-crypto patch that fucked everything up. |
This is a fork of the #1848 issue.
When importing the pool without the 'zvol_inhibit_dev' parameter, I get (after a few minutes) an oops first from zpool, then txg_sync and then vol_id (from udev, NOT zvol_id), and all vol_id processes are tagged Dead (177 of them, although I only have 64 VDEVs). The import is hung and I can no longer issue any zfs command...
Yes, this is zfs-crypto, but it is now completely in line with ZoL/master, so I'm a little unsure whether it has something to do with the crypto code or with ZoL...
Since I only have one filesystem that is currently encrypted, I'm prepared to 'downgrade' (i.e. remove the crypto feature) to do some testing with ZoL/master if needed.