zpool import spinning on txg_quiesce and txg_sync #10828
As suggested on zfs-discuss, I first tried a read-only import (zpool import -o readonly=on).
That works, and zpool status reports clean.
A subsequent read-write import spins in the same way.
As conjectured on zfs-discuss, multihost is enabled on this pool.
From the zdb pool history:
This pool has now been successfully imported by setting zfs_multihost_fail_intervals=0 once; subsequent imports work without having to set that. More details are available in zfs-discuss at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Ta6b683d15084807b/zpool-import-spinning-on-txgquiesce-and-txgsync The remaining question for this issue is whether there is enough information to fix the bug (or missing feature) that prevented a perfectly healthy pool from importing?
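For anyone landing here with the same symptom, a minimal sketch of the one-time workaround described above, assuming the standard OpenZFS module-parameter path under /sys/module/zfs/parameters and with the pool name as a placeholder:

```sh
# Save the current setting, then disable the MMP fail-interval check
# (0 means the pool is never suspended for delayed MMP writes)
old=$(cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals)
echo 0 > /sys/module/zfs/parameters/zfs_multihost_fail_intervals

# Import the pool normally ("tank" is a placeholder)
zpool import tank

# Restore the previous setting once the import has completed
echo "$old" > /sys/module/zfs/parameters/zfs_multihost_fail_intervals
```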
@stuartthebruce I believe the patch from #10873 will fix the import problem you saw. Thanks for all the debug information.
As reported on zfs-discuss, this happened again in a reproducible way, and after a bit more testing I think I found a problem (or at least an opportunity for an enhancement). After the initial pool recovery discussed earlier in this thread, where the first import required zfs_multihost_fail_intervals=0, it was not necessary to modify that parameter for subsequent imports. However, that is no longer true, and I now think the initial and current import problems are only loosely related to unscheduled shutdowns. After confirming a few times that "zpool import -o readonly=on" works with default settings, while a non-readonly import requires zfs_multihost_fail_intervals=0 for both initial and subsequent scheduled imports, I discovered that another solution is to increase zfs_multihost_fail_intervals. In particular, I am currently, and reproducibly, unable to import this pool with the default setting of zfs_multihost_fail_intervals=20, but I can reproducibly import it with a value of 40. I think what has changed is that I have pushed more datasets, snapshots, metadata, and data into the pool, and the import time has crossed a threshold.
Please recall that this pool has 60 7200 RPM SAS drives, and I now wonder whether the default settings are doing what I thought they were, i.e., try 10 times to write a single sector to one of the leaf devices in the pool, and if and only if that fails to complete within 1 second 20 times in a row, suspend the pool. Unless concurrent I/O-intensive import threads are heavily loading the drives during import, these 7200 RPM drives are all healthy and should not take 1 second to respond (and certainly not 20 times in a row). For example, after a successful import and during a 2 GByte/s scrub, the latencies are reasonable for 7200 RPM HDDs.
What am I missing? P.S. For reference, I have also captured dbgmsg from a successful non-readonly import with zfs_multihost_fail_intervals=40.
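As an aside for anyone wanting to reproduce the latency check mentioned above, the standard zpool iostat latency views can be used; a sketch, with the pool name as a placeholder:

```sh
# Per-vdev average latencies (total/disk/sync-queue/async-queue waits), sampled every 5 seconds
zpool iostat -v -l tank 5

# Full latency histograms, useful for spotting a single slow outlier disk
zpool iostat -w tank 5
```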
After you imported with zfs_multihost_fail_intervals=0, was the pool ever suspended? If the pool was suspended, then you are almost certainly hitting the bug I fixed in #10873. The bug is not specific to imports after an unclean shutdown. If you have the ability to build zfs, please apply that patch and see if you're able to import this pool normally.
That's not quite what the setting means. When zfs_multihost_fail_intervals is greater than zero, the pool is suspended if zfs_multihost_fail_intervals * zfs_multihost_interval milliseconds pass without a successful MMP write to any leaf vdev; it is not a count of individual 1-second write timeouts per device. The problem #10873 fixed is that the timer calculation had an error, so the drop-dead, suspend-the-pool time was passed during the import before MMP writes are even issued.
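In concrete terms, assuming the stock OpenZFS module parameters discussed in this thread, the suspend window can be read and computed like this:

```sh
# The pool suspends if no MMP write succeeds within
#   zfs_multihost_fail_intervals * zfs_multihost_interval  milliseconds
cat /sys/module/zfs/parameters/zfs_multihost_interval        # e.g. 1000 (ms)
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals  # e.g. 20 on this system

# 20 * 1000 ms = 20 s of headroom; raising fail_intervals to 40 doubles that to 40 s,
# which is consistent with the slower import above succeeding at 40 but not at 20.
```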
Early on, subsequent imports succeeded with the default value after an initial import with zfs_multihost_fail_intervals=0?
Yes.
It has been quite a while since I had to build the Linux kernel (or kernel modules), but I will give that a try if I can find the time before
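For reference, a rough sketch of one way to build ZFS from source with that patch applied; these are the standard OpenZFS from-source steps, the release tag is an assumption that should match the version actually in use, and GitHub's .patch endpoint is used to fetch the PR:

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-0.8.4    # assumed tag; use the release actually running

# Apply the fix from PR #10873
curl -L https://github.com/openzfs/zfs/pull/10873.patch | git am

sh autogen.sh && ./configure && make -s -j"$(nproc)"
sudo make install && sudo depmod -a
# Reload the zfs modules (or reboot) before retrying the import
```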
Got it, and thanks for the explanation. That certainly sounds consistent with my observation that once the import time crossed a threshold, the default setting was no longer sufficient.
After upgrading to 0.8.5 this system is able to import with multihost=on and default kernel module settings. However, it takes ~5 min, which seems a bit long for a pool with 60 HDDs.
Thank you for confirming that the import is successful now. I'm also surprised the import takes 5 minutes. If you have time to create a new issue with steps to reproduce (I'm guessing the same) and attach the /proc/spl/kstat/zfs/dbgmsg contents, I'll take a look.
The initial boot import time was reproduced with an export/import, so I will open another ticket with additional information. Thanks.
Slow import ticket is #11034.
System information
Describe the problem you're observing
zpool import hangs (for at least several hours) with txg_quiesce and txg_sync using CPU cycles to no avail. There are no problems with the storage and no syslog errors about hung kernel tasks. After rebooting without auto-importing, a bare "zpool import" discovers all of the devices and gives every indication that an actual import by name should succeed. However, even while the import is spinning CPU cycles, I am able to run zdb to see all of the devices and the pool history.
Is there any other useful diagnostic information worth collecting, beyond what is included below, before I start trying to run "zpool import -T"?
Note, this is also being discussed at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Ta6b683d15084807b/zpool-import-spinning-on-txgquiesce-and-txgsync
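Two low-impact things worth capturing while the import is spinning, as a sketch; the dbgmsg kstat is the same one mentioned elsewhere in this thread, and the thread names match those reported above:

```sh
# ZFS internal debug log
cat /proc/spl/kstat/zfs/dbgmsg

# Kernel stacks of the busy txg threads, to see where they are spinning
for pid in $(pgrep 'txg_quiesce|txg_sync'); do
    echo "=== PID $pid ==="
    cat /proc/$pid/stack
done
```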
Describe how to reproduce the problem
Start 5 concurrent "zfs receive" streams to a pool with six 10-drive raidz3 vdevs and a small mirrored SSD log device, then wait for a city power sub-station failure to power down the server and external SAS storage.
Include any warning/errors/backtraces from the system logs
And after actually trying to import.