Linux 5.13 compat: retry zvol_open() when contended #12759
In macOS we had some locking issues with zvol_open, and requests that come in from diskarbitration. We had to do a fairly ugly thing for it: Do we want to think about a shared code solution?
That would be grand. Unfortunately, after looking at this code for a while, I think the requirements for each platform are just different enough to make it not really worthwhile. But if you see a way to nicely unify the FreeBSD, Linux, and macOS implementations I'm all ears.
Glad to see this; will test it out soon. 👍 Also, just double checking to make sure I have this right:
Does that all sound correct?
Not quite. In the normal zvol_open() case we'll retry up to 100 times at 10ms intervals. Since contention on the lock is relatively unlikely, the retries should prevent failures during open. However, a failure is still technically possible, in which case an error will be returned. When using a zvol as a vdev we'll do additional retries when an open fails with ERESTART. This handles the possible deadlock case mentioned in the comments.
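To make the bounded retry concrete, here is a minimal user-space sketch of the pattern described above: try the lock without blocking, sleep briefly, and give up with ERESTART after a fixed number of attempts. It is not the code from this PR. The helper names (`sleep_ms`, `lock_with_bounded_retry`) and the two `OPEN_RETRY_*` macros are made up for illustration, and a pthread mutex stands in for spa_namespace_lock.

```c
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() under strict -std= modes */

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical names; the values mirror the comment above (100 tries, 10 ms). */
#define OPEN_RETRY_MAX		100
#define OPEN_RETRY_DELAY_MS	10

/* ERESTART is 85 on Linux; fall back to that value where errno.h lacks it. */
#ifndef ERESTART
#define ERESTART 85
#endif

/* Sleep for roughly `ms` milliseconds. */
static void
sleep_ms(long ms)
{
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	(void) nanosleep(&ts, NULL);
}

/*
 * Try to take `lock` without ever blocking on it. Each failed attempt is
 * followed by a short sleep. Returns 0 with the lock held, or ERESTART if
 * every attempt found the lock contended.
 */
static int
lock_with_bounded_retry(pthread_mutex_t *lock, int max_tries, long delay_ms)
{
	assert(lock != NULL && max_tries > 0);

	for (int i = 0; i < max_tries; i++) {
		if (pthread_mutex_trylock(lock) == 0)
			return (0);
		sleep_ms(delay_ms);
	}
	return (ERESTART);
}
```

A caller would use `lock_with_bounded_retry(&lock, OPEN_RETRY_MAX, OPEN_RETRY_DELAY_MS)` and propagate a nonzero result as the open error. With those values the worst case waits about one second before failing.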
Okay; thanks for the clarification!
Since I am affected by this as well ( ZVOLs randomly not showing up under |
As for the design of this change, would it be viable to have an r/w lock instead? This way read accesses to a ZVOL would not block each other; failures or blocking would happen only when a shared lock needs to be promoted to exclusive, or when an exclusive lock is already held at the time read access is requested. Even better if reads could access "old" data concurrently, even while the data is being written, exploiting the COW nature of ZFS. Admittedly I do not know what I am talking about, since I never worked on ZFS internals, and won't be at all offended if this suggestion is dismissed 🤡
I wanted to mention that I've been running a system with this patchset, on top of 8ac58acf56, for about 7 days now. (Arch Linux w/ standard Arch kernel config, version 5.15.2.) Haven't noticed any particularly egregious problems or anything notable. On the other hand, I haven't gone out of my way to do any particular tests or stressing. (Also this is not-my-main-system; so there are only a few zvols, rather than literally dozens-to-hundreds of them. 😝 Might be a good idea to stress test on a system with hundreds or thousands of zvols, to see whether it ever does time out or not... 🤔)
This retry mechanism seems like a code path to which someone will one day return and potentially facepalm, with 100s of delegated namespaces contending on that lock or some other strange condition producing thrash. Do we know what happened in 5.13 to cause this, and if so, is it feasible to push a fix upstream instead of implementing this workaround?
Due to a possible lock inversion the zvol open call path on Linux needs to be able to retry in the case where the spa_namespace_lock cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning -ERESTARTSYS from zvol_open() to request that blkdev_get() drop the bdev->bd_mutex lock, reacquire it, then call the open callback again. However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing retry logic, but for 5.13 and newer kernels we retry internally in zvol_open(). This should always succeed except in the case where a pool's vdevs are layered on zvols, in which case it may fail. To handle this case vdev_disk_open() has been updated to retry opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
We do. This change to the 5.13 kernel removed the kernel provided mechanism for retrying the open which we were depending on. Since it was explicitly removed I doubt upstream would be receptive to putting it back for us, but you could re-apply the change to a custom kernel. Longer term I agree we're going to want to find a way to restructure the locking in ZFS to remove the need for this entirely. However, that's going to be a more disruptive change so it's something we're going to want to tackle in a different PR.
Thank you sir, looks like they were doing just about the same thing internally though so not much of a "fix" in there either :-. |
@behlendorf @tonyhutter Oops, I meant to submit my code review comment a few days back; but only just now realized it was sitting in "pending" state, and therefore not actually visible AFAIK. 🤦♂️ Pretty much just poking around at possible edge cases to get your opinion on whether anyone who ends up having both the belt (pre-5.13 |
Due to a possible lock inversion the zvol open call path on Linux needs to be able to retry in the case where the spa_namespace_lock cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning -ERESTARTSYS from zvol_open() to request that blkdev_get() drop the bdev->bd_mutex lock, reacquire it, then call the open callback again. However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing retry logic, but for 5.13 and newer kernels we retry internally in zvol_open(). This should always succeed except in the case where a pool's vdevs are layered on zvols, in which case it may fail. To handle this case vdev_disk_open() has been updated to retry opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes openzfs#12759
Motivation and Context
Issue #12301
Description
Due to a possible lock inversion in the zvol open call path, on Linux we need to be able to retry in the case where the spa_namespace_lock cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning -ERESTARTSYS from zvol_open() to request that blkdev_get() drop the bdev->bd_mutex lock, reacquire it, then call the open callback again. However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing retry logic, but for 5.13 and newer kernels we retry internally in zvol_open(). This should always succeed except in the case where a pool's vdevs are layered on zvols, in which case it may fail. To handle this case vdev_disk_open() has been updated to retry opening a device when -ERESTARTSYS is returned.

How Has This Been Tested?

Locally by running the zfs_copies_003_pos test on Fedora in a loop with the 5.14.16-301.fc35.x86_64 kernel. This test would often fail with the new kernel because the block device could not be opened and would return an error (ERESTARTSYS). With this change applied the test now runs reliably.
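For readers who want to see the caller-side half of the Description in code, the following is a rough user-space sketch of "retry the open while it reports -ERESTARTSYS". It is an illustration of the pattern only, not the actual vdev_disk_open() change. Everything named here (`SKETCH_ERESTARTSYS`, `open_fn_t`, `open_with_restart`, `fake_dev`, `fake_open`) is hypothetical, and the real code would also need to pause between attempts.

```c
#include <assert.h>
#include <stddef.h>

/*
 * ERESTARTSYS is a kernel-internal error code (512 in the Linux sources)
 * that user space never sees, so the sketch defines its own copy.
 */
#define SKETCH_ERESTARTSYS 512

/* Hypothetical open callback: returns 0 on success or a negative errno. */
typedef int (*open_fn_t)(void *arg);

/*
 * Call open_fn() and call it again for as long as it asks to be restarted,
 * up to max_restarts extra attempts. Any other result, success or failure,
 * is returned to the caller immediately.
 */
static int
open_with_restart(open_fn_t open_fn, void *arg, int max_restarts)
{
	int error;
	int restarts = 0;

	assert(open_fn != NULL);

	do {
		error = open_fn(arg);
	} while (error == -SKETCH_ERESTARTSYS && restarts++ < max_restarts);

	return (error);
}

/* Hypothetical stand-in for a contended device. */
struct fake_dev {
	int busy_calls;	/* how many more opens will report contention */
	int opens;	/* total number of open attempts seen */
};

/* Reports -SKETCH_ERESTARTSYS until busy_calls reaches zero, then 0. */
static int
fake_open(void *arg)
{
	struct fake_dev *dev = arg;

	dev->opens++;
	if (dev->busy_calls > 0) {
		dev->busy_calls--;
		return (-SKETCH_ERESTARTSYS);
	}
	return (0);
}
```

Bounding the number of restarts keeps a permanently contended device from spinning forever; once the budget is exhausted the restart error is passed up so the caller can decide what to do.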