Allow for lock-free reading zfsdev_state_list. #2323
Conversation
This should fix #2301 and other equivalent issues.
Wouldn't you need a memory barrier for this kind of thing?
Why can't we just move zfs_onexit_destroy out of the lock and call it a day? I don't object to doing this locklessly, but we should be very careful when inventing our own lockless implementation. It would be very hard to debug if there's anything subtle hiding in there.
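A minimal sketch of that alternative, assuming a release path that currently calls zfs_onexit_destroy() while holding zfsdev_state_lock; the helper name below is hypothetical:

```c
/*
 * Sketch: detach the per-open state under the lock, then run the
 * potentially blocking destructor after dropping it, so this path can
 * no longer hold zfsdev_state_lock while waiting on the sync task.
 */
static void
zfsdev_release_sketch(zfsdev_state_t *zs)
{
	zfs_onexit_t *zo;

	mutex_enter(&zfsdev_state_lock);
	zo = zs->zs_onexit;
	zs->zs_onexit = NULL;		/* entry no longer owns it */
	mutex_exit(&zfsdev_state_lock);

	if (zo != NULL)
		zfs_onexit_destroy(zo);	/* may block; lock is dropped */
}
```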
@tuxoko I thought about moving zfs_onexit_destroy out of the lock as well. All things considered, I'm OK with this approach. @lukemarsden did you get a chance to verify @dweeezil's updated patch resolves the issue? If so we could get this merged fairly soon.
Hi @dweeezil @behlendorf
I've finally gotten a bit of time to review this patch and there are likely some concurrency problems. For starters, I had intended to make sure a newly-allocated structure was completely filled in before linking it into the list, but the revised code I pushed does not do that. Also, I had intended to write the zs_minor member last when re-using an entry, and it currently does not do that either. I think with some re-ordering to address these issues and some appropriate barriers, it should be safe for concurrent read access. My only remaining concern would be atomicity.
I've re-ordered things and have added barriers where I think they're necessary.
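To make the intended write-side ordering concrete, here is a minimal kernel-style sketch under stated assumptions: the field names zs_minor and zs_next come from the patch, but the struct layout, helper names, and the -1 "free" sentinel are illustrative:

```c
#include <sys/types.h>	/* minor_t */

typedef struct zfsdev_state {
	struct zfsdev_state	*zs_next;	/* only ever NULL -> valid */
	minor_t			zs_minor;	/* assumed: -1 means free */
	void			*zs_onexit;	/* per-open state */
	void			*zs_zevent;
} zfsdev_state_t;

/* New entry: initialize everything, then make it reachable. */
static void
zfsdev_state_publish(zfsdev_state_t *tail, zfsdev_state_t *zs, minor_t minor)
{
	zs->zs_minor = minor;		/* fill in every field first */
	smp_wmb();			/* order initialization ... */
	tail->zs_next = zs;		/* ... before publication */
}

/*
 * Re-used entry: zs_minor is written last, so readers never match a
 * slot whose other fields are still being set up.
 */
static void
zfsdev_state_reuse(zfsdev_state_t *zs, minor_t minor)
{
	/* re-initialize zs_onexit, zs_zevent, etc. here */
	smp_wmb();			/* those stores become visible ... */
	zs->zs_minor = minor;		/* ... before the slot looks live */
}
```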
@dweeezil Anyway, here are some comments on your updated code:
@tuxoko Thanks for the updated comments.
I'll admit to being a bit fuzzy on the various barriers available to us. I was rather surprised, for example, that […]. I'll wait to hear back before pushing a revised version of this patch, but I think we're getting close.
@dweeezil 3. A read barrier is almost always paired with a write barrier. I'm a bit curious about the first read barrier, though.
After reflecting a bit, ACCESS_ONCE might not be needed after all: zs->zs_next only goes from NULL to some value and stays there forever. Edit: The sequence of zs->zs_minor and zs_next doesn't really matter. Edit 2: So you still need two barriers, but the first one can be skipped if the second one is taken.
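Read together, the read side would look roughly like this sketch. The lookup helper itself is illustrative, and per the comment above the ACCESS_ONCE() on zs_next may be optional, since that pointer only ever transitions from NULL to a stable value:

```c
static zfsdev_state_t *zfsdev_state_list;	/* list head */

static zfsdev_state_t *
zfsdev_state_lookup(minor_t minor)
{
	zfsdev_state_t *zs;

	for (zs = zfsdev_state_list; zs != NULL;
	    zs = ACCESS_ONCE(zs->zs_next)) {
		if (zs->zs_minor == minor) {
			/*
			 * Pairs with the writer's smp_wmb(): once the
			 * matching zs_minor has been observed, the rest
			 * of the entry is guaranteed to be initialized.
			 */
			smp_rmb();
			return (zs);
		}
	}
	return (NULL);
}
```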
I've implemented these changes but have thought a bit more about the barriers required. In order to make this clearer, I've restructured the bottom of the function and added an expanded comment.
@tuxoko Please look at my recent push of dweeezil/zfs@0fa7ccd. Feel free to use code comments this time. I'll not re-push until I look at them (so they don't get lost).
I think the barriers are correct now (dweeezil/zfs@8531b88).
@dweeezil
Me too. Nice work guys, I'll get it merged.
Merged as: 3937ab2 Allow for lock-free reading zfsdev_state_list.
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.
This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.
The most typical failure scenario occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.
Neither Illumos nor Solaris is affected by this issue because they use
the DDI interface, which provides lock-free reading of device state via
the ddi_get_soft_state() function.
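As an illustration of the invariants the message describes, a retirement sketch: the node is never unlinked, so concurrent readers can keep following zs_next, and the slot is released with a single word-sized (hence atomic) store to zs_minor. The -1 sentinel and helper name are assumptions, and the race between teardown and a reader that already matched the old minor is elided here (the real code must prevent or tolerate it):

```c
/* Retire a slot without ever unlinking it from the list. */
static void
zfsdev_state_retire(zfsdev_state_t *zs)
{
	zs->zs_minor = -1;	/* single atomic store: new lookups miss */
	smp_wmb();		/* retire is visible before any teardown */
	zs->zs_onexit = NULL;
	zs->zs_zevent = NULL;
}
```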