pool suspension due to delayed MMP writes needs a better error message #7045
Comments
In addition to an improved message, IMHO the default settings for the MMP thread are probably overly sensitive to delayed writes, since this takes the pool offline permanently.
It might be nice to extend
For this very reason we picked conservative values based on the results of our local testing. This should only happen with the default settings when not even a single vdev in the pool has been able to service a write in over 5 seconds. Are you commonly seeing delays this long?
We've been hitting this problem fairly regularly on a wide variety of tests since we started testing with ZFS 0.7.x (see https://jira.hpdd.intel.com/browse/LU-9845 for details), since Lustre enables the `multihost` property. Note that this is happening with testing in a VM, not necessarily on real hardware, but it is also reasonable to expect that a high CPU/IO load on the system would not cause the pool to become unusable to the point of requiring an export/import. It may well be that this can be induced solely by high CPU load, since the MMP writes are done in the context of the sync thread, which may be blocked if there are lots of running threads. It seems to me that the MMP timeout is not taking the current system/IO load into account, since the pool suspension timeout is always the fixed `zfs_multihost_interval * zfs_multihost_fail_intervals`.

Under load the ldiskfs code will increase the check interval written into the MMP block (as with ZFS), so that the peer node will wait longer to detect activity. What is different is that it uses the intra-write delay rather than the initially specified delay to detect if it hasn't completed a write within the peer's timeout interval. Secondly, since the ZFS MMP block is written in sync context, if there is a lot of IO in a TXG then the MMP write may be delayed significantly, even though other uberblocks could have been written in the meantime.

If it does exceed the write interval (e.g. a suddenly very slow disk, high scheduling latency, a suspended node, etc.), ldiskfs first re-reads the MMP block from disk and checks for any MMP activity before marking the filesystem in error. It would be a simple first step to compute the suspension timeout from the observed MMP write delay rather than the fixed interval.
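To make the suggestion concrete, here is a rough standalone sketch (plain C, not ZFS code; all names and numbers are made up) of a suspension check that scales the allowed window by the observed inter-write delay and re-checks on-disk activity before giving up, roughly the ldiskfs behaviour described above:

```c
/* Toy model of a load-aware MMP suspension check -- not ZFS code.
 * Assumption: the fail window scales with the observed write delay
 * instead of a fixed interval, and we re-check on-disk activity
 * (as ldiskfs does) before declaring failure. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t last_write_ns;     /* time of last successful MMP write */
    uint64_t observed_delay_ns; /* measured delay between recent writes */
    uint64_t interval_ns;       /* configured write interval */
    uint64_t fail_intervals;    /* configured failure multiplier */
} mmp_state_t;

/* Hypothetical: re-read the on-disk MMP block and report activity. */
static bool on_disk_activity_seen(const mmp_state_t *st) {
    (void)st;
    return false;               /* stub for the model */
}

static bool should_suspend(const mmp_state_t *st, uint64_t now_ns) {
    /* Use the larger of the configured and observed delay so a loaded
     * system gets a proportionally longer grace period. */
    uint64_t unit = st->interval_ns > st->observed_delay_ns ?
        st->interval_ns : st->observed_delay_ns;
    uint64_t deadline = st->last_write_ns + unit * st->fail_intervals;

    if (now_ns <= deadline)
        return false;
    /* Last-chance check, mirroring the ldiskfs behaviour described above. */
    return !on_disk_activity_seen(st);
}

int main(void) {
    mmp_state_t st = { .last_write_ns = 0, .observed_delay_ns = 4000000000ULL,
                       .interval_ns = 1000000000ULL, .fail_intervals = 5 };
    printf("suspend at t=6s?  %d\n", should_suspend(&st, 6000000000ULL));
    printf("suspend at t=21s? %d\n", should_suspend(&st, 21000000000ULL));
    return 0;
}
```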
We wanted these to be independently tunable, and the default values were set conservatively. This also makes it easy to disable suspending the pool due to IO failures if you like; this is described in the man page, although not recommended. Speaking of which, the
This sounds like a pretty reasonable precaution, although it would only be able to catch
The internals are actually a little different. While a full TXG sync does count as an MMP write, it isn't the primary mechanism for writing MMP uberblocks. There's a dedicated `mmp_thread()` that issues the MMP uberblock writes.

There are only a couple of things I can think of offhand in ZFS itself which could prevent the MMP writes entirely: the entire zio pipeline somehow hangs due to a bug, or it might be possible that a 100% synchronous write workload starves them all out. Lustre has so many threads that maybe the second case is possible during testing. It would be interesting to see the mmp history.
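For illustration only, a toy standalone C model (not ZFS source; names, numbers and threading details are invented) of a dedicated writer thread whose successful writes keep a last-success timestamp fresh; once the writes are starved out, the interval * fail_intervals deadline passes and the "pool" is suspended even though nothing else is wrong:

```c
/* Toy model of a dedicated MMP-style writer thread -- not ZFS code.
 * A separate thread lands one small write per interval and records the
 * time of the last success; if the writes stop landing, the
 * interval * fail_intervals deadline passes and we "suspend". */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define INTERVAL_MS    500   /* hypothetical write interval */
#define FAIL_INTERVALS 5     /* hypothetical failure multiplier */

static _Atomic long last_success_ms;
static atomic_bool  io_starved;

static long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Dedicated writer: independent of any "sync thread". */
static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 12; i++) {
        if (!atomic_load(&io_starved))
            atomic_store(&last_success_ms, now_ms());
        usleep(INTERVAL_MS * 1000);
    }
    return NULL;
}

int main(void) {
    atomic_store(&last_success_ms, now_ms());
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    sleep(2);
    atomic_store(&io_starved, true);   /* simulate starved-out MMP writes */

    for (int i = 0; i < 10; i++) {
        long idle = now_ms() - atomic_load(&last_success_ms);
        if (idle > (long)INTERVAL_MS * FAIL_INTERVALS) {
            printf("no successful write for %ld ms: suspending pool\n", idle);
            break;
        }
        sleep(1);
    }
    pthread_join(t, NULL);
    return 0;
}
```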
The thing is There is some follow-up work in #6212 which uses Patches welcome, and let's make sure we get @ofaaland's thoughts on this too.
Good point. Note that if a subset of devices is not available to both nodes due to a connectivity issue, some recent uberblocks may not be visible to the other node attempting to import. So if mmp_delay is climbing fast, the importing node may be making its calculation based on an earlier, much lower, mmp_delay value. The greater the mmp_delay the more likely this is. However, we could keep one or more prior values of mmp_delay and use the lesser value when calculating max_fail_ns, to mitigate this.
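A minimal sketch of that mitigation, assuming we simply keep the previous mmp_delay sample next to the current one and take the lesser when computing max_fail_ns (standalone C, hypothetical names, not the actual ZFS data structures):

```c
/* Sketch: remember the previous mmp_delay sample and base max_fail_ns on
 * the smaller of the two, so a sudden spike in mmp_delay does not
 * immediately widen the window. Hypothetical names throughout. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t mmp_delay_ns;      /* current smoothed write delay */
    uint64_t prev_mmp_delay_ns; /* previous sample, kept for mitigation */
    uint64_t interval_ns;       /* configured multihost interval */
    uint64_t fail_intervals;    /* configured failure multiplier */
} mmp_delay_hist_t;

static void record_delay(mmp_delay_hist_t *h, uint64_t new_delay_ns) {
    h->prev_mmp_delay_ns = h->mmp_delay_ns;
    h->mmp_delay_ns = new_delay_ns;
}

static uint64_t max_fail_ns(const mmp_delay_hist_t *h) {
    /* Use the lesser of the recent delay samples... */
    uint64_t delay = h->mmp_delay_ns < h->prev_mmp_delay_ns ?
        h->mmp_delay_ns : h->prev_mmp_delay_ns;
    /* ...but never less than the configured interval. */
    uint64_t unit = delay > h->interval_ns ? delay : h->interval_ns;
    return unit * h->fail_intervals;
}

int main(void) {
    mmp_delay_hist_t h = { .mmp_delay_ns = 1000000000ULL,
                           .prev_mmp_delay_ns = 1000000000ULL,
                           .interval_ns = 1000000000ULL,
                           .fail_intervals = 5 };
    record_delay(&h, 8000000000ULL);   /* sudden spike under load */
    printf("max_fail_ns = %llu\n", (unsigned long long)max_fail_ns(&h));
    return 0;
}
```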
Great, thank you.
I agree that would be helpful.
Agreed. But it seems as if the MMP writes are failing or being starved out entirely on your test VMs, and we'll have to figure out why.
One option for debugging the MMP stalls would be to have it dump the MMP history (if enabled) to the console before suspending the pool. That won't affect default behavior since MMP history is off by default, and one could safely assume that if it is enabled the admin is trying to debug an MMP problem. That said, it doesn't look like there is an easy way to dump the MMP stats to the console today.
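Roughly what such a dump might look like, with a plain ring buffer standing in for the real MMP history infrastructure (standalone C, all names hypothetical):

```c
/* Sketch: before suspending, dump the most recent MMP write records so
 * the admin can see why the writes stalled. A plain ring buffer stands in
 * for the real MMP history; names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define MMP_HIST_LEN 8

typedef struct {
    uint64_t timestamp_ns;  /* when the write completed */
    uint64_t duration_ns;   /* how long it took */
    int      error;         /* 0 on success */
} mmp_write_rec_t;

static mmp_write_rec_t mmp_hist[MMP_HIST_LEN];
static unsigned mmp_hist_next;

static void mmp_hist_add(uint64_t ts, uint64_t dur, int err) {
    mmp_hist[mmp_hist_next % MMP_HIST_LEN] =
        (mmp_write_rec_t){ ts, dur, err };
    mmp_hist_next++;
}

/* Called just before the pool would be suspended. */
static void mmp_hist_dump(const char *pool) {
    printf("MMP write history for pool '%s' (most recent first):\n", pool);
    for (unsigned i = 0; i < MMP_HIST_LEN && i < mmp_hist_next; i++) {
        const mmp_write_rec_t *r =
            &mmp_hist[(mmp_hist_next - 1 - i) % MMP_HIST_LEN];
        printf("  t=%llu ns dur=%llu ns err=%d\n",
            (unsigned long long)r->timestamp_ns,
            (unsigned long long)r->duration_ns, r->error);
    }
}

int main(void) {
    for (uint64_t i = 1; i <= 10; i++)
        mmp_hist_add(i * 1000000000ULL, 5000000ULL * i, i > 7 ? 5 : 0);
    mmp_hist_dump("tank");   /* would run right before zio_suspend() */
    return 0;
}
```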
That sounds like a good idea to me, although as you say it would require creating a little infrastructure.
I'm encountering the same issue on a healthy pool, on actual HW using SAS drives (12 x 10 disk raidz2) with multihost enabled, running on 0.7.9. The way to trigger the issue for me is to just start a scrub; within a few seconds that causes:
This leaves the pool in a mixed state:
It all clears with a
Setting /sys/module/zfs/parameters/zfs_multihost_interval to 2000 seems to work around the issue.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I believe that this is still an issue with zfs-0.8 and may be an issue with 2.0 as well. Reopening. |
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Distribution Name | *
Distribution Version | *
Linux Kernel | *
Architecture | *
ZFS Version | 0.7.5
SPL Version | 0.7.5
This is related to Lustre issue https://jira.hpdd.intel.com/browse/LU-9845.
When an MMP thread suspends a pool because "no MMP write has succeeded in over mmp_interval * mmp_fail_intervals nanoseconds", the only message we see on the console is "WARNING: Pool 'blahblah' has encountered an uncorrectable I/O failure and has been suspended." This is not really informative enough and probably a bit misleading. We encountered these mysteriously suspended pools in our test clusters and were only able to attribute this to MMP by setting the pool failure mode to panic.
I was able to easily reproduce this with the Lustre-backed ZFS setup (VM has hostid set and 2 vCPUs, pool has MMP enabled) using the following:
I think we should probably keep the message from `zio_suspend()` as is, but add a suitable message to `mmp_thread()` before calling `zio_suspend()`.
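As a rough illustration of the kind of change being asked for (not a tested patch; stubs stand in for the real ZFS interfaces and the exact signatures differ), something along these lines would at least name MMP and the relevant tunables when the pool is suspended:

```c
/* Standalone sketch of the requested behaviour: before the MMP thread
 * suspends the pool, print a message that names MMP (and the relevant
 * tunables) as the cause. Stubs stand in for the real ZFS interfaces. */
#include <stdint.h>
#include <stdio.h>

typedef struct { const char *name; } spa_t;             /* stub */
static const char *spa_name(const spa_t *spa) { return spa->name; }
static void zio_suspend(spa_t *spa) {                   /* stub */
    printf("WARNING: Pool '%s' has encountered an uncorrectable I/O "
        "failure and has been suspended.\n", spa_name(spa));
}

/* Hypothetical check inside an mmp_thread()-style loop. */
static void mmp_check_fail(spa_t *spa, uint64_t gap_ns, uint64_t max_fail_ns) {
    if (gap_ns > max_fail_ns) {
        /* The extra, MMP-specific message requested in this issue. */
        printf("WARNING: MMP writes to pool '%s' have not succeeded in "
            "over %llu ms; suspending pool. zfs_multihost_interval and "
            "zfs_multihost_fail_intervals may need tuning under heavy "
            "load.\n", spa_name(spa),
            (unsigned long long)(gap_ns / 1000000));
        zio_suspend(spa);
    }
}

int main(void) {
    spa_t spa = { "blahblah" };
    mmp_check_fail(&spa, 6ULL * 1000000000ULL, 5ULL * 1000000000ULL);
    return 0;
}
```

Running the sketch prints both the existing generic warning and the proposed MMP-specific one for a pool named 'blahblah', so the admin can tell the suspension came from MMP rather than a plain I/O failure.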