Another deadlock on unmounting filesystem #2301
The two blocked tasks are txg_sync:1549 and zfs:4738. We probably haven't seen this before because most users haven't made extensive use of zfs holds. That said, these two call paths are basically the same under Illumos, so it's not immediately clear to me why no one has reported this issue there. Regardless, we'll need to get it fixed! |
I have a feeling it's more likely this is fallout from the restructured sync task of 13fe019 rather than more frequent use (directly by users) of holds. @lukemarsden You mentioned this deadlock occurred during a bunch of concurrent ZFS operations: do I gather correctly that a …? I have a feeling the hold operations involved with this are the ones performed internally. |
@lukemarsden One other question. After rebooting to recover from the deadlock, are there any temporary snapshots left around (they have names ending with "-" followed by 16 digits)? Also, it might be interesting to see a slice of …. EDIT: GitHub markup initially ate part of my description of temporary snapshot names. |
@dweeezil Thanks! Yes, we don't do any …. I don't see any temporary snapshots on the affected system, neither with …. Unfortunately the …. |
One other fact that is slightly unusual about this pool: it's backed by a file: …
We're currently working on stress testing on a real partition-backed pool, to see if we can reproduce the same problems there. |
Just adding a note from an email thread with @ahrens: …
|
@lukemarsden Yes, that looks like the critical bit. Under Illumos the …. This code needed to be rewritten for Linux because Illumos relies on support provided by …. |
Thanks @behlendorf! @ryao mentioned he may be able to help us on Friday, so perhaps he could have a crack at that unless you or someone else get there first :-) |
The issue looks straightforward enough. Likely the best fix is to implement …. |
I wonder if it would be worth implementing a few more of the DDI soft state functions in the SPL? That would allow the code to more closely match the upstream. It looks like the DDI suite stores the minor states in an array to allow for lock-free reading (which seems like a good idea). |
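For reference, here is a minimal userland sketch of the array-based approach the DDI soft-state suite takes, under the assumption described above (readers index an array of per-minor pointers without a lock; writers grow and update the array under a mutex). The names soft_state_get/soft_state_set and the structure layout are illustrative, not the actual DDI or SPL API:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for a DDI-style soft-state container. */
struct soft_state {
	pthread_mutex_t	ss_lock;	/* taken by writers only */
	void		**ss_array;	/* per-minor state pointers */
	size_t		ss_size;	/* current capacity of ss_array */
};

/*
 * Read path: a plain array index, no lock.  The writer stores the new
 * array before the new size, so a reader that sees the larger size also
 * sees the new array (implicit on x86; other architectures would need
 * explicit memory barriers).
 */
static void *
soft_state_get(struct soft_state *ss, size_t minor)
{
	if (minor >= ss->ss_size)
		return (NULL);
	return (ss->ss_array[minor]);
}

/* Write path: grow the array if needed, then install the state pointer. */
static int
soft_state_set(struct soft_state *ss, size_t minor, void *data)
{
	pthread_mutex_lock(&ss->ss_lock);
	if (minor >= ss->ss_size) {
		size_t newsize = (minor + 1) * 2;
		void **newarr = calloc(newsize, sizeof (void *));

		if (newarr == NULL) {
			pthread_mutex_unlock(&ss->ss_lock);
			return (-1);
		}
		if (ss->ss_array != NULL)
			memcpy(newarr, ss->ss_array,
			    ss->ss_size * sizeof (void *));
		/*
		 * The old array is deliberately not freed here: a lock-free
		 * reader may still be walking it.
		 */
		ss->ss_array = newarr;
		ss->ss_size = newsize;
	}
	ss->ss_array[minor] = data;
	pthread_mutex_unlock(&ss->ss_lock);
	return (0);
}
```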
FWIW, I have reproduced the same issue on dedicated hardware with a zpool mirror on partitions on two physical drives: http://hybridcluster.net/zol-hangs/201405081202/
This rules out the issue being specific to a file-backed pool. Built from the same (latest) version of the source (2c33b91). @dweeezil - seems sensible to reimplement these functions for lock-free reading, if in fact this is causing these deadlocks. How much work do you think that would be? |
Another idea that came to mind would be simply not to delete items from the (zfsdev_state) list, but instead to mark them as available for re-use (maybe by setting the minor to -1). That should also allow for lock-free reading. The list isn't likely to ever be very long to begin with, so it's not like it will waste a lot of memory. Although implementing more of the Solaris DDI interface still might be a good idea, this might be a quicker fix. |
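A rough sketch of that "never delete" idea, assuming a structure with zs_minor and zs_next fields (the helper names and layout here are illustrative, not the actual zfs_ioctl.c code): entries stay on the list forever, release only marks the slot free, and a later open may reuse it.

```c
/* Illustrative layout; the real structure carries more state. */
typedef struct zfsdev_state {
	struct zfsdev_state	*zs_next;	/* next entry, never unlinked */
	int			zs_minor;	/* minor number, -1 when unused */
	void			*zs_data;	/* per-open private data */
} zfsdev_state_t;

/*
 * Readers walk the list without the lock and match on zs_minor; an
 * entry whose minor is -1 is simply skipped.
 */
static zfsdev_state_t *
zfsdev_state_find(zfsdev_state_t *head, int minor)
{
	zfsdev_state_t *zs;

	for (zs = head; zs != NULL; zs = zs->zs_next)
		if (zs->zs_minor == minor)
			return (zs);
	return (NULL);
}

/*
 * "Release" never unlinks or frees the entry; it only marks the slot as
 * reusable so a later open can claim it.  In real code the ordering of
 * these stores relative to concurrent lock-free readers needs care
 * (memory barriers or equivalent).
 */
static void
zfsdev_state_release(zfsdev_state_t *zs)
{
	zs->zs_minor = -1;
	zs->zs_data = NULL;
}
```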
I hacked up a patch at dweeezil/zfs@cb83c05 which implements the "don't delete" idea. It passes light testing, but I won't have time to review it in greater detail until later this evening. |
@dweeezil Thanks! So far this seems to have solved the issue for us. Awesome! |
To summarize the problem above, this deadlock is caused by lock inversion. In an abstract sense, the deadlock occurs because thread 1 locks A and then B while thread 2 locks B and then A. @dweeezil's patch works by making thread 2 avoid locking B: the data structure that B protects is instead read in a lock-less fashion, which fixes it. Here, thread 1 is the kernel txg_sync thread, thread 2 is the userland zfs utility, lock A is the transaction sync and lock B is zfsdev_state_lock. This should be a problem on both FreeBSD and Illumos, but it seems to have gone unnoticed until now.

Linux's memory allocation routines attempt to reclaim memory under low memory conditions. That has been a source of deadlocks involving swap on ZFS, but it has the secondary effect of causing operations involving memory allocations to randomly take more time than usual. That can make races such as this one more likely to trigger a deadlock, which is what I suspect happened here.

@dweeezil's patch changes writes to occur in a way that lets readers simply read without any fear of repercussions. There are plenty of subtle things that happen to be just right to make that work: we currently only traverse forwards, and modification happens under a lock, which functions as a memory barrier on x86. The way that Linux modifies the list means that any reader will either see a change to the list or won't, but we are implicitly relying on the lock to function as a barrier for that to hold. I have a few criticisms of this approach:
|
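As a minimal, self-contained illustration of the AB-BA inversion described above (the thread bodies, lock names, and sleeps are invented for the example; lock_a stands in for the transaction sync and lock_b for zfsdev_state_lock):

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "transaction sync" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "zfsdev_state_lock" */

/* Thread 1 (the txg_sync role): takes A, then B. */
static void *
thread_one(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);
	usleep(100000);			/* widen the race window */
	pthread_mutex_lock(&lock_b);	/* blocks: thread 2 holds B */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return (NULL);
}

/* Thread 2 (the zfs utility role): takes B, then A. */
static void *
thread_two(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);
	usleep(100000);			/* widen the race window */
	pthread_mutex_lock(&lock_a);	/* blocks: thread 1 holds A */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, thread_one, NULL);
	pthread_create(&t2, NULL, thread_two, NULL);
	/* Never returns: each thread waits on the lock the other holds. */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return (0);
}
```

Each thread acquires its first lock and then blocks forever waiting for the lock the other already holds, which is the same shape as the txg_sync / zfsdev_release deadlock in this issue.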
I'll add a couple of notes about my patch: I hadn't dealt with the Solaris-compatible list library provided by the SPL and had wrongly assumed it was only singly-linked. I suppose I should have realized it wasn't after using the "delete from tail" operation in the destructor. I was also concerned about the atomicity of the read operations. Since this list is essentially a substitute for a variable-sized array, I'll rework the patch to use its own singly-linked list rather than the list library from the SPL. |
That sounds like a reasonable approach. It would be great if you could open a pull request when you're ready with the reworked patch. |
@lukemarsden Could you please try the new version in #2323? It implements the same idea but dispenses with the use of the doubly-linked list library and adds some comments to the data structure. I've not developed a reproducer for this problem and have only run basic tests (as was the case with the original patch). |
Restructure the zfsdev_state_list to allow for lock-free reading by converting it to a simple singly-linked list from which items are never deleted and over which only forward iterations are performed. It depends on, among other things, the atomicity of accessing the zs_minor integer and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by both the sync task (txg_sync) and indirectly by any user program which uses /dev/zfs; the zfsdev_release method uses the same lock and then blocks on the sync task.

The most typical failure scenario occurs when the sync task is cleaning up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris is affected by this issue because they use the DDI interface, which provides lock-free reading of device state via the ddi_get_soft_state() function.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#2301
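A hedged sketch of the publication ordering the commit message relies on, reusing the illustrative zfsdev_state_t from the sketch earlier in the thread (zs_minor and zs_next are named in the commit text; zfsdev_state_append and the surrounding details are assumptions for the example):

```c
/*
 * Append a fully initialized entry to the never-shrinking list.  The
 * single pointer store to tail->zs_next is what publishes the entry;
 * everything else must already be in place so a lock-free reader never
 * sees a half-built entry.  In kernel code a write barrier (smp_wmb()
 * or equivalent) belongs between initialization and publication on
 * weakly ordered CPUs.
 */
static void
zfsdev_state_append(zfsdev_state_t *tail, zfsdev_state_t *zs, int minor)
{
	zs->zs_next = NULL;
	zs->zs_minor = minor;
	tail->zs_next = zs;	/* publish: readers can now find the entry */
}
```

Because entries are never unlinked or freed, a reader in the middle of a traversal can never follow a dangling zs_next pointer, which is what makes the forward-only, lock-free iteration safe.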
Complete zpool lockup again. The system was doing some concurrent ZFS operations, possibly on the same filesystem. This time using the latest trusty daily: http://hybridcluster.net/2014-05-05-hang-after-live-migration/syslog.txt (see after timestamp May 5 13:08:18).