Another deadlock on unmounting filesystem #2301
The two blocked tasks are txg_sync:1549 and zfs:4738. We probably haven't seen this before because most users haven't made extensive use of zfs holds. That said, these two call paths are basically the same under Illumos, so it's not immediately clear to me why no one has reported this issue there. Regardless, we'll need to get it fixed! |
I have a feeling it's more likely this is fallout from the restructured sync task of 13fe019 rather than more frequent use (directly by users) of holds. @lukemarsden You mentioned this deadlock occurred during a bunch of concurrent ZFS operations: do I gather correctly that a …? I have a feeling the hold operations involved with this are the ones performed internally. |
@lukemarsden One other question. After rebooting to recover from the deadlock, are there any temporary snapshots left around (they have names ending with "-" followed by 16 digits)? Also, it might be interesting to see a slice of …. EDIT: GitHub markup initially ate part of my description of temporary snapshot names. |
@dweeezil Thanks! Yes, we don't do any …. I don't see any temporary snapshots on the affected system, neither with …. Unfortunately the …. |
One other fact that is slightly unusual about this pool: it's backed by a file: …
We're currently working on stress testing on a real partition-backed pool, to see if we can reproduce the same problems there. |
Just adding a note from an email thread with @ahrens: …
|
@lukemarsden Yes, that looks like the critical bit. Under Illumos the …. This code needed to be rewritten for Linux because Illumos relies on support provided by …. |
Thanks @behlendorf! @ryao mentioned he may be able to help us on Friday, so perhaps he could have a crack at that unless you or someone else get there first :-) |
The issue looks straightforward enough. Likely the best fix is to implement …. |
I wonder if it would be worth implementing a few more of the DDI soft state functions in the SPL? That would allow the code to more closely match the upstream. It looks like the DDI suite stores the minor states in an array to allow for lock-free reading (which seems like a good idea). |
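For reference, here is a minimal userland sketch of the array-based approach the DDI soft-state suite takes, under the assumption described above (readers index an array of per-minor pointers without a lock; writers grow and update the array under a mutex). The names soft_state_get/soft_state_set and the structure layout are illustrative, not the actual DDI or SPL API:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for a DDI-style soft-state container. */
struct soft_state {
	pthread_mutex_t	ss_lock;	/* taken by writers only */
	void		**ss_array;	/* per-minor state pointers */
	size_t		ss_size;	/* current capacity of ss_array */
};

/*
 * Read path: a plain array index, no lock.  The writer stores the new
 * array before the new size, so a reader that sees the larger size also
 * sees the new array (implicit on x86; other architectures would need
 * explicit memory barriers).
 */
static void *
soft_state_get(struct soft_state *ss, size_t minor)
{
	if (minor >= ss->ss_size)
		return (NULL);
	return (ss->ss_array[minor]);
}

/* Write path: grow the array if needed, then install the state pointer. */
static int
soft_state_set(struct soft_state *ss, size_t minor, void *data)
{
	pthread_mutex_lock(&ss->ss_lock);
	if (minor >= ss->ss_size) {
		size_t newsize = (minor + 1) * 2;
		void **newarr = calloc(newsize, sizeof (void *));

		if (newarr == NULL) {
			pthread_mutex_unlock(&ss->ss_lock);
			return (-1);
		}
		if (ss->ss_array != NULL)
			memcpy(newarr, ss->ss_array,
			    ss->ss_size * sizeof (void *));
		/*
		 * The old array is deliberately not freed here: a lock-free
		 * reader may still be walking it.
		 */
		ss->ss_array = newarr;
		ss->ss_size = newsize;
	}
	ss->ss_array[minor] = data;
	pthread_mutex_unlock(&ss->ss_lock);
	return (0);
}
```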
FWIW, I have reproduced the same issue on dedicated hardware with a zpool mirror on partitions on two physical drives: http://hybridcluster.net/zol-hangs/201405081202/
This rules out the issue being specific to a file-backed pool. Built from the same (latest) version of the source (2c33b91). @dweeezil - seems sensible to reimplement these functions for lock-free reading, if in fact this is causing these deadlocks. How much work do you think that would be? |
Another idea that came to mind would be simply not to delete items from the (zfsdev_state) list, but instead to mark them as available for re-use (maybe by setting the minor to -1). That should also allow for lock-free reading. The list isn't likely to ever be very long to begin with, so it's not like it will waste a lot of memory. Although implementing more of the Solaris DDI interface still might be a good idea, this might be a quicker fix. |
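A rough sketch of that "never delete" idea, assuming a structure with zs_minor and zs_next fields (the helper names and layout here are illustrative, not the actual zfs_ioctl.c code): entries stay on the list forever, release only marks the slot free, and a later open may reuse it.

```c
/* Illustrative layout; the real structure carries more state. */
typedef struct zfsdev_state {
	struct zfsdev_state	*zs_next;	/* next entry, never unlinked */
	int			zs_minor;	/* minor number, -1 when unused */
	void			*zs_data;	/* per-open private data */
} zfsdev_state_t;

/*
 * Readers walk the list without the lock and match on zs_minor; an
 * entry whose minor is -1 is simply skipped.
 */
static zfsdev_state_t *
zfsdev_state_find(zfsdev_state_t *head, int minor)
{
	zfsdev_state_t *zs;

	for (zs = head; zs != NULL; zs = zs->zs_next)
		if (zs->zs_minor == minor)
			return (zs);
	return (NULL);
}

/*
 * "Release" never unlinks or frees the entry; it only marks the slot as
 * reusable so a later open can claim it.  In real code the ordering of
 * these stores relative to concurrent lock-free readers needs care
 * (memory barriers or equivalent).
 */
static void
zfsdev_state_release(zfsdev_state_t *zs)
{
	zs->zs_minor = -1;
	zs->zs_data = NULL;
}
```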
I hacked up a patch at dweeezil/zfs@cb83c05 which implements the "don't delete" idea. It passes light testing, but I won't have time to review it in greater detail until later this evening. |
@dweeezil Thanks! So far this seems to have solved the issue for us. Awesome! |
To summarize the problem above, this deadlock is caused by lock inversion. In an abstract sense, the deadlock occurs because thread 1 locks A and then B while thread 2 locks B and then A. @dweeezil's patch works by making thread 2 avoid locking B: the data structure that B protects is instead read in a lock-less fashion, which fixes it. Here, thread 1 is the kernel txg_sync thread, thread 2 is the userland zfs utility, lock A is the transaction sync and lock B is zfsdev_state_lock. This should be a problem on both FreeBSD and Illumos, but it seems to have gone unnoticed until now.

Linux's memory allocation routines attempt to reclaim memory under low memory conditions. That has been a source of deadlocks involving swap on ZFS, but it has the secondary effect of causing operations involving memory allocations to randomly take more time than usual. That can make races such as this one more likely to trigger a deadlock, which is what I suspect happened here.

@dweeezil's patch changes writes to occur in a way that lets readers simply read without any fear of repercussions. There are plenty of subtle things that happen to be just right to make that work: we currently only traverse forwards, and modification happens under a lock, which functions as a memory barrier on x86. The way that Linux modifies the list means that any reader will either see a change to the list or won't, but we are implicitly relying on the lock to function as a barrier for that to hold. I have a few criticisms of this approach:
|
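As a minimal, self-contained illustration of the AB-BA inversion described above (the thread bodies, lock names, and sleeps are invented for the example; lock_a stands in for the transaction sync and lock_b for zfsdev_state_lock):

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "transaction sync" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "zfsdev_state_lock" */

/* Thread 1 (the txg_sync role): takes A, then B. */
static void *
thread_one(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);
	usleep(100000);			/* widen the race window */
	pthread_mutex_lock(&lock_b);	/* blocks: thread 2 holds B */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return (NULL);
}

/* Thread 2 (the zfs utility role): takes B, then A. */
static void *
thread_two(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);
	usleep(100000);			/* widen the race window */
	pthread_mutex_lock(&lock_a);	/* blocks: thread 1 holds A */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, thread_one, NULL);
	pthread_create(&t2, NULL, thread_two, NULL);
	/* Never returns: each thread waits on the lock the other holds. */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return (0);
}
```

Each thread acquires its first lock and then blocks forever waiting for the lock the other already holds, which is the same shape as the txg_sync / zfsdev_release deadlock in this issue.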
I'll add a couple of notes about my patch: I hadn't dealt with the Solaris-compatible list library provided by the SPL and had wrongly assumed it was only singly-linked. I suppose I should have realized it wasn't after using the "delete from tail" operation in the destructor. I was also concerned about the atomicity of the read operations. Since this list is essentially a substitute for a variable-sized array, I'll rework the patch to use its own singly-linked list rather than the list library from the SPL. |
That sounds like a reasonable approach. It would be great if you could open a pull request when you're ready with the reworked patch. |
@lukemarsden Could you please try the new version in #2323? It implements the same idea but dispenses with the use of the doubly-linked list library and adds some comments to the data structure. I've not developed a reproducer for this problem and have only run basic tests (as was the case with the original patch). |
Restructure the zfsdev_state_list to allow for lock-free reading by converting it to a simple singly-linked list from which items are never deleted and over which only forward iterations are performed. It depends on, among other things, the atomicity of accessing the zs_minor integer and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by both the sync task (txg_sync) and indirectly by any user program which uses /dev/zfs; the zfsdev_release method uses the same lock and then blocks on the sync task.

The most typical failure scenario occurs when the sync task is cleaning up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris is affected by this issue because they use the DDI interface, which provides lock-free reading of device state via the ddi_get_soft_state() function.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#2301
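A hedged sketch of the publication ordering the commit message relies on, reusing the illustrative zfsdev_state_t from the sketch earlier in the thread (zs_minor and zs_next are named in the commit text; zfsdev_state_append and the surrounding details are assumptions for the example):

```c
/*
 * Append a fully initialized entry to the never-shrinking list.  The
 * single pointer store to tail->zs_next is what publishes the entry;
 * everything else must already be in place so a lock-free reader never
 * sees a half-built entry.  In kernel code a write barrier (smp_wmb()
 * or equivalent) belongs between initialization and publication on
 * weakly ordered CPUs.
 */
static void
zfsdev_state_append(zfsdev_state_t *tail, zfsdev_state_t *zs, int minor)
{
	zs->zs_next = NULL;
	zs->zs_minor = minor;
	tail->zs_next = zs;	/* publish: readers can now find the entry */
}
```

Because entries are never unlinked or freed, a reader in the middle of a traversal can never follow a dangling zs_next pointer, which is what makes the forward-only, lock-free iteration safe.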
Complete zpool lockup again. The system was doing some concurrent ZFS operations, possibly on the same filesystem. This time using the latest trusty daily: http://hybridcluster.net/2014-05-05-hang-after-live-migration/syslog.txt (see after timestamp May 5 13:08:18).