
pool suspension due to delayed MMP writes needs a better error message #7045

Closed
jhammond-intel opened this issue Jan 15, 2018 · 11 comments
Labels
Status: Stale No recent activity for issue


@jhammond-intel
Contributor

System information

Distribution Name | *
Distribution Version | *
Linux Kernel | *
Architecture | *
ZFS Version | 0.7.5
SPL Version | 0.7.5

This is related to Lustre issue https://jira.hpdd.intel.com/browse/LU-9845.

When the MMP thread suspends a pool because "no MMP write has succeeded in over mmp_interval * mmp_fail_intervals nanoseconds", the only message we see on the console is "WARNING: Pool 'blahblah' has encountered an uncorrectable I/O failure and has been suspended." This is not really informative enough and is probably a bit misleading. We encountered these mysteriously suspended pools in our test clusters and were only able to attribute them to MMP by setting the pool failure mode to panic.

I was able to reproduce this easily using the Lustre-backed ZFS setup (VM with hostid set and 2 vCPUs, pool with MMP enabled):

m:~# export FSTYPE=zfs
m:~# bash $LUSTRE/tests/llmount.sh
...
m:~# cat /sys/module/zfs/parameters/zfs_multihost_interval 
1000
m:~# echo 100 > /sys/module/zfs/parameters/zfs_multihost_interval # set mmp interval to 100ms
m:~# chrt -f 20 dd if=/dev/zero of=/dev/null &   # two SCHED_FIFO CPU hogs, one per vCPU, to starve the ZFS kernel threads
m:~# chrt -f 20 dd if=/dev/zero of=/dev/null &

I think we should probably keep the message from zio_suspend() as is, but add a suitable message to mmp_thread() before it calls zio_suspend().
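For illustration, a minimal sketch of what that could look like in mmp_thread() (module/zfs/mmp.c). The placement and the surrounding names (max_fail_ns, mmp->mmp_last_write) are assumptions based on the 0.7.x code, not an actual patch:

    /* in mmp_thread(), just before the existing suspension path */
    if (zfs_multihost_fail_intervals > 0 &&
        (gethrtime() - mmp->mmp_last_write) > max_fail_ns) {
            cmn_err(CE_WARN, "MMP writes to pool '%s' have not succeeded "
                "in over %llu ms; suspending pool", spa_name(spa),
                (u_longlong_t)NSEC2MSEC(max_fail_ns));
            zio_suspend(spa, NULL);
    }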

@adilger
Contributor

adilger commented Jan 16, 2018

In addition to an improved message, IMHO the default settings for the MMP thread are probably overly sensitive to delayed writes, since this takes the pool offline permanently.

@behlendorf
Contributor

It might be nice to extend zio_suspend() to additionally take a reason, say as an enum. Then a less generic error message could be logged. But adding an additional message sounds reasonable to me as a quick improvement, and it's easy to add to the release branch.
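To make the reason-enum idea concrete, a hypothetical sketch (the type and constant names below are illustrative, not existing ZFS identifiers):

    /* hypothetical: tag why the pool is being suspended */
    typedef enum zio_suspend_reason {
            ZIO_SUSPEND_IOERR = 0,  /* the existing uncorrectable-I/O case */
            ZIO_SUSPEND_MMP,        /* MMP writes delayed past max_fail_ns */
    } zio_suspend_reason_t;

    /* zio_suspend() could then log a reason-specific message */
    void zio_suspend(spa_t *spa, zio_t *zio, zio_suspend_reason_t reason);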

the default settings for the MMP thread are probably overly sensitive to delayed writes

For this very reason we picked conservative values based on the results of our local testing. This should only happen with the default settings when not even a single vdev in the pool has been able to service a write in over 5 seconds. Are you commonly seeing delays this long?

@adilger
Contributor

adilger commented Jan 16, 2018

We've been hitting this problem fairly regularly on a wide variety of tests since we started testing with ZFS 0.7.x (see https://jira.hpdd.intel.com/browse/LU-9845 for details), since Lustre enables the multihost feature by default for newly-formatted datasets.

Note that this is happening with testing in a VM, not necessarily on real hardware, but it is also reasonable to expect that a high CPU/IO load on the system would not cause the pool to become unusable to the point of requiring an export/import. It may well be that this can be induced solely by high CPU load, since the MMP writes are done in the context of the sync thread, which may be blocked if there are lots of running threads.

It seems to me that the MMP timeout is not taking the current system/IO load into account, since the pool suspension timeout is always the fixed "max_fail_ns = mmp_interval * mmp_fail_intervals", rather than "max_fail_ns = MAX(zfs_multihost_interval, mmp_delay * MAX(vdev_count_leaves(spa), 1)) * mmp_fail_intervals" that the peer node would wait before considering the pool to be inactive. The actual interval that a peer would wait is typically even longer, since there is a separate zfs_multihost_import_intervals parameter, distinct from zfs_multihost_fail_intervals, that is larger by default (10x vs. 5x). It isn't exactly clear why these are separate parameters, and it doesn't appear there are sanity checks that prevent, for example, zfs_multihost_fail_intervals >= zfs_multihost_import_intervals which could allow a pool to be imported on two nodes without any warnings. I think it should require zfs_multihost_fail_intervals strictly less than zfs_multihost_import_intervals.
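For clarity, the two calculations contrasted above written out side by side (a sketch only; the conversion between the millisecond tunable and the nanosecond delay values is glossed over):

    /* current behavior: fixed timeout, ignores observed write latency */
    max_fail_ns = zfs_multihost_interval * zfs_multihost_fail_intervals;

    /* what a peer effectively waits: scales with the observed per-leaf delay */
    max_fail_ns = MAX(zfs_multihost_interval,
        mmp_delay * MAX(vdev_count_leaves(spa), 1)) *
        zfs_multihost_fail_intervals;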

Under load the ldiskfs code will increase the check interval written into the MMP block (as ZFS does), so that the peer node will wait longer to detect activity. What is different is that it uses the intra-write delay rather than the initially specified delay to detect whether it has failed to complete a write within the peer's timeout interval. Secondly, since the ZFS MMP block is written in sync context, if there is a lot of IO in a TXG then the MMP write may be delayed significantly, even though other uberblocks could have been written in the meantime. If it does exceed the write interval (e.g. a suddenly very slow disk, high scheduling latency, a suspended node, etc.), ldiskfs first re-reads the MMP block from disk and checks for any MMP activity before marking the filesystem in error.

It would be a simple first step to compute the max_fail_ns based on mmp_delay so that the pool doesn't suspend if writes are taking longer than expected, but still succeeding (a patch for this is forthcoming). Also, it could call spa_activity_check() before suspending the pool if the writes exceeded the timeout, and if no activity is detected the pool would not be suspended (possibly with a warning printed?).
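A sketch of that second suggestion, for illustration only; spa_activity_check() is currently private to spa.c (its arguments are omitted here) and, as noted in the reply below, itself reads from disk, so this shows the idea rather than a workable patch:

    if (gethrtime() - mmp->mmp_last_write > max_fail_ns) {
            boolean_t in_use = B_TRUE;  /* conservative default */

            /* hypothetical: in_use = spa_activity_check(spa, ...); */
            if (in_use) {
                    zio_suspend(spa, NULL);
            } else {
                    cmn_err(CE_WARN, "MMP writes to pool '%s' delayed, but "
                        "no foreign activity detected; not suspending",
                        spa_name(spa));
            }
    }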

@behlendorf
Contributor

It seems to me that the MMP timeout is not taking the current system/IO load into account

max_fail_ns isn't exactly fixed, but I take your point. It makes good sense to use mmp->mmp_delay instead of mmp_interval when determining whether the pool needs to be suspended. We do need to be careful to cap the maximum interval to the last delay value known to be written to disk; otherwise, we could wait longer than is strictly safe before suspending. The good news is that mmp->mmp_delay only gets updated after a successful MMP write, so it can be safely used in the MMP thread.

since there is a separate zfs_multihost_import_intervals parameter, distinct from zfs_multihost_fail_intervals, that is larger by default (10x vs. 5x). It isn't exactly clear why these are separate parameters

We wanted these to be independently tunable, and the default values were set conservatively. This also makes it easy to disable suspending the pool due to IO failures if you like; that is described in the man page, although it is not recommended. Speaking of which, the zfs_multihost_fail_intervals section of the man page will need to be updated accordingly if we do tweak the algorithm.

zfs_multihost_fail_intervals >= zfs_multihost_import_intervals which could allow a pool to be imported on two nodes without any warnings.

This sounds like a pretty reasonable precaution, although it would only be able to catch an accidental misconfiguration. Aside from removing these tunables entirely, we can't prevent a user from setting them to different values on different hosts.
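If such a check were added, a hypothetical Linux module-parameter set hook could enforce it (the handler below does not exist in module/zfs/; it is a sketch of the idea only):

    #include <linux/kernel.h>
    #include <linux/moduleparam.h>

    extern unsigned int zfs_multihost_import_intervals;  /* existing tunable */

    /* hypothetical set hook for zfs_multihost_fail_intervals */
    static int
    zfs_multihost_fail_intervals_set(const char *val,
        const struct kernel_param *kp)
    {
            unsigned long intervals;

            if (kstrtoul(val, 0, &intervals) != 0)
                    return (-EINVAL);

            /* 0 disables suspension entirely; otherwise require fail < import */
            if (intervals != 0 && intervals >= zfs_multihost_import_intervals)
                    return (-EINVAL);

            return (param_set_uint(val, kp));
    }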

MMP writes are done in the context of the sync thread

The internals are actually a little different. While a full TXG sync does count as an MMP write, it isn't the primary mechanism for writing MMP uberblocks. There's a dedicated mmp_thread which submits, on average, one sync write to each leaf vdev every zfs_multihost_interval. Since these are synchronous writes, the zio pipeline will issue them before any IO which is part of a normal TXG.
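A very rough sketch of that scheduling, to make the point concrete (the real loop in module/zfs/mmp.c is more involved, picking leaf vdevs and tracking per-write delays; this is illustrative only):

    /* illustrative only; not the actual mmp_thread() loop */
    for (;;) {
            uint64_t leaves = MAX(vdev_count_leaves(spa), 1);
            hrtime_t next = gethrtime() +
                MSEC2NSEC(zfs_multihost_interval) / leaves;

            mmp_write_uberblock(spa);   /* one sync write to one leaf vdev */

            /* space the writes so each leaf is written about once per interval */
            while (gethrtime() < next)
                    delay(1);   /* sleep one clock tick */
    }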

There are only a couple of things I can think of offhand in ZFS itself which could prevent the MMP writes entirely: the entire zio pipeline somehow hangs due to a bug, or a 100% synchronous write workload starves them all out. Lustre has so many threads that maybe the second case is possible during testing. It would be interesting to see the MMP history.

it could call spa_activity_check() before suspending the pool if the writes exceeded the timeout

The thing is, spa_activity_check() reads from disk, which could potentially hang if there is a problem with the zio pipeline. I suppose it could also take a very long time if the system is under a heavy sustained load. Since triggering this failure is time sensitive, the design avoided calling anything which might block.

There is some follow-up work in #6212 which uses spa_activity_check() to make it possible for zpool clear to safely resume a suspended multihost pool. The patch has gotten a bit stale, but I'd love to see it finalized. Then you wouldn't need to export/import the pool in this case.

Patches welcome, and let's make sure we get @ofaaland's thoughts on this too.

@ofaaland
Contributor

ofaaland commented Jan 17, 2018

It seems to me that the MMP timeout is not taking the current system/IO load into account, since the pool suspension timeout is always the fixed "max_fail_ns = mmp_interval * mmp_fail_intervals", rather than "max_fail_ns = MAX(zfs_multihost_interval, mmp_delay * MAX(vdev_count_leaves(spa), 1)) * mmp_fail_intervals" that the peer node would wait before considering the pool to be inactive.

Good point.

Note that if a subset of devices is not available to both nodes due to a connectivity issue, some recent uberblocks may not be visible to the other node attempting to import. So if mmp_delay is climbing fast, the importing node may be making its calculation based on an earlier, much lower, mmp_delay value. The greater the mmp_delay, the more likely this is.

However, we could keep one or more prior values of mmp_delay and use the lesser value when calculating max_fail_ns, to mitigate this.
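A sketch of that mitigation, assuming a hypothetical mmp_delay_prev field that remembers the previous value (no such field exists today; the calculation otherwise follows the formula quoted above):

    /*
     * Use the smaller of the current and previous observed delays, so a
     * sudden spike cannot stretch max_fail_ns beyond what a peer, which may
     * only see an older and smaller on-disk delay, would actually wait.
     */
    uint64_t delay_ns = MIN(mmp->mmp_delay, mmp->mmp_delay_prev);
    uint64_t max_fail_ns = zfs_multihost_fail_intervals *
        MAX(MSEC2NSEC(zfs_multihost_interval),
        delay_ns * MAX(vdev_count_leaves(spa), 1));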

It would be a simple first step to compute the max_fail_ns based on mmp_delay so that the pool doesn't suspend if writes are taking longer than expected, but still succeeding (a patch for this is forthcoming).

Great, thank you.

I think it should require zfs_multihost_fail_intervals strictly less than zfs_multihost_import_intervals.

I agree that would be helpful.

it is also reasonable to expect that a high CPU/IO load on the system would not cause the pool to become unusable to the point of requiring an export/import.

Agreed. But it seems as if the MMP writes are failing or being starved out entirely on your test VMs, and we'll have to figure out why.

@adilger
Contributor

adilger commented Jan 17, 2018

One option for debugging the MMP stalls would be to have it dump the MMP history (if enabled) to the console before suspending the pool? That wouldn't affect default behavior since MMP history is off by default, and one could safely assume that if it is enabled the admin is trying to debug an MMP problem. That said, it doesn't look like there is an easy way to dump the MMP stats to the console today.
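As a sketch of that missing infrastructure (the helper name and record layout below are hypothetical stand-ins for whatever the multihost kstat actually records; nothing like this exists today):

    /* hypothetical record mirroring the per-write entries exposed via
     * /proc/spl/kstat/zfs/<pool>/multihost */
    typedef struct mmp_history_rec {
            uint64_t txg;           /* txg of the uberblock written */
            uint64_t timestamp;     /* when the write completed */
            uint64_t write_ns;      /* how long the write took */
    } mmp_history_rec_t;

    /* hypothetical helper, called just before zio_suspend() */
    static void
    mmp_history_dump(const mmp_history_rec_t *recs, int nrecs)
    {
            int i;

            for (i = 0; i < nrecs; i++) {
                    cmn_err(CE_NOTE, "mmp write: txg %llu ts %llu took %llu ns",
                        (u_longlong_t)recs[i].txg,
                        (u_longlong_t)recs[i].timestamp,
                        (u_longlong_t)recs[i].write_ns);
            }
    }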

@ofaaland
Contributor

One option for debugging the MMP stalls would be to have it dump the MMP history (if enabled) to the console before suspending the pool?

That sounds like a good idea to me, although as you say it would require creating a little infrastructure.

@gerardba

I'm encountering the same issue on a healthy pool, on actual hardware with SAS drives (12 x 10-disk raidz2), with multihost enabled, running 0.7.9.

The way to trigger the issue for me is to just start a scrub; within a few seconds that causes:

kernel: WARNING: MMP writes to pool 'zpool' have not succeeded in over 5s; suspending pool
kernel: WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended.

This leaves the pool in a mixed state:

  pool: zpool
 state: ONLINE
status: The pool is suspended because multihost writes failed or were delayed;
	another system could import the pool undetected.
action: Make sure the pool's devices are connected, then reboot your system and
	import the pool.
   see: http://zfsonlinux.org/msg/ZFS-8000-K4
  scan: scrub in progress since Wed Aug 22 09:17:47 2018
	3.81G scanned out of 585T at 126M/s, (scan is slow, no estimated time)
	0B repaired, 0.00% done

It all clears with a zpool clear zpool.

Setting /sys/module/zfs/parameters/zfs_multihost_interval to 2000 seems to work around the issue.

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 25, 2020
@stale stale bot closed this as completed Nov 23, 2020
@ofaaland
Contributor

I believe that this is still an issue with zfs-0.8 and may be an issue with 2.0 as well. Reopening.

@ofaaland ofaaland reopened this Nov 23, 2020
@stale stale bot removed the Status: Stale No recent activity for issue label Nov 23, 2020
@stale

stale bot commented Nov 23, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Nov 23, 2021
@stale stale bot closed this as completed Feb 21, 2022