
Allow for lock-free reading zfsdev_state_list. #2323

Closed
wants to merge 1 commit

Conversation

dweeezil
Contributor

Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.

The most typical failure scenario occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris is affected by this issue because they use the
DDI interface, which provides lock-free reading of device state via the
ddi_get_soft_state() function.

@dweeezil
Contributor Author

This should fix #2301 and other equivalent issues.

@tuxoko
Contributor

tuxoko commented May 12, 2014

Wouldn't you need memory barriers for this kind of thing?

@tuxoko
Contributor

tuxoko commented May 12, 2014

Why can't we just move zfs_onexit_destroy out of the lock and call it a day?
As far as I can tell, that's where the problem is, right?

I don't object to doing this locklessly, but we should be very careful when inventing our own lockless implementation. It would be very hard to debug if there's anything subtle hiding in there.

@behlendorf behlendorf added this to the 0.6.3 milestone May 12, 2014
@behlendorf behlendorf added the Bug label May 12, 2014
@behlendorf
Contributor

@tuxoko I thought about moving zfs_onexit_destroy out of the lock as well, but that takes us even farther away from the Illumos implementation, which assumes a lockless zfsdev_get_state(). I think the safer course here is to do this locklessly, although I completely agree that we must be careful or there could be subtle issues. Conversely, changing the basic locking assumptions might cause us to accidentally introduce an issue when porting patches from Illumos.

All things considered I'm OK with this approach. @lukemarsden did you get a chance to verify @dweeezil's updated patch resolves the issue. If so we could get this merged fairly soon.

@tuxoko
Contributor

tuxoko commented May 15, 2014

Hi @dweeezil @behlendorf
I added some comments on what I think would be the proper way to do this locklessly.
However, I'm not an expert on lockless code, nor am I familiar with the concurrency in this piece of code, so I might be completely wrong.

@dweeezil
Contributor Author

I've finally gotten a bit of time to review this patch and there are likely some concurrency problems. For starters, I had intended to make sure a newly-allocated structure was completely filled in before linking it into the list, but the revised code I pushed does not do that. Also, I had intended to write the zs_minor member last when re-using an entry, and it currently does not do that, either.

I think with some re-ordering to address these issues and some appropriate barriers, it should be safe for concurrent read access. My only remaining concern would be atomicity.

@dweeezil
Contributor Author

I've re-ordered things and have added barriers where I think they're necessary.

@tuxoko
Contributor

tuxoko commented May 15, 2014

@dweeezil
You flushed my inline comments.
I really hate how GitHub flushes inline comments once the pull request is updated.

Anyway, here are some comments on your updated code:

  1. Use smp_wmb(), smp_rmb().
  2. zs = ACCESS_ONCE(zs->zs_next);
  3. There should be another smp_rmb() after if (zs->zs_minor == minor)
  4. put zsprev->zs_next = zs; after zs->zs_minor = minor;

@dweeezil
Contributor Author

@tuxoko Thanks for the updated comments.

  1. I'll switch the barriers to the "smp_" variants. Makes sense to me.
  2. I'm not clear as to why accessing the zs_next member requires ACCESS_ONCE; could you please enlighten me?
  3. I'm also not clear why we need another barrier after if (zs->zs_minor == minor), given that the writes to the items we're reading (zs_onexit and zs_event) are themselves followed by a write barrier.
  4. Of course. I'll re-arrange.

I'll admit to being a bit fuzzy on the various barriers available to us. I was rather surprised, for example, that ddi_get_soft_state() & friends in Illumos have no barriers. In any case, there are certainly plenty of examples of them within the Linux kernel itself, so I suppose I should study a bunch of them.

I'll wait to hear back before pushing a revised version of this patch, but I think we're getting close.

@tuxoko
Contributor

tuxoko commented May 16, 2014

@dweeezil
2. ACCESS_ONCE(x) makes sure the compiler will not re-read x from memory, though I'm not sure if we require this since we have barriers.

3. A read barrier is almost always paired with a write barrier.
On the write side, you set zs_minor to indicate that zs_onexit and zs_event are ready.
So on the read side, you must read zs_onexit after you read zs_minor. If you don't add a barrier in between, the compiler and CPU could read zs_onexit in advance, so you'd get an invalid zs_onexit even though you think zs_minor is set.

I'm a bit curious about the first read barrier, though.
Since zs->zs_minor depends on zs = zs->zs_next, I don't think the compiler or CPU would have a chance to reorder them. But if we were to discard this rmb, then ACCESS_ONCE would be a must-have.

@tuxoko
Contributor

tuxoko commented May 16, 2014

After reflecting a bit, ACCESS_ONCE might not be needed after all.

zs->zs_next only goes from NULL to some value and stays there forever.
So even if the compiler were to re-read it from memory, it wouldn't change anything.

Edit:
For write side, I think you can do it with only one wmb.
...
zs_onexit
smp_wmb();
zs->minor;
prev->zs_next = zs;

The sequence of zs->minor and zs_next doesn't really matter.

Edit2:
I'm sorry. The sequence does matter, because zs->minor is uninitialized.
...
zs_onexit
smp_wmb();
zs->minor;
smp_wmb();
prev->zs_next = zs;

So you still need two barriers, but the first one can be skipped if the second one is taken.

@dweeezil
Contributor Author

I've implemented these changes but have thought a bit more about the barriers required in zfsdev_state_init(). There are two cases which require slightly different implementations: if we allocate a new entry, the final operation should be a write barrier followed by an update of the previous entry's next pointer. If we reuse an entry, the final operation should be a write barrier followed by the update of the minor.

In order to make this more clear, I've restructured the bottom of the function and added an expanded comment.

@dweeezil
Contributor Author

@tuxoko Please look at my recent push of dweeezil/zfs@0fa7ccd. Feel free to use code comments this time. I'll not re-push until I look at them (so they don't get lost).

@dweeezil
Contributor Author

I think the barriers are correct now (dweeezil/zfs@8531b88).

@tuxoko
Contributor

tuxoko commented May 19, 2014

@dweeezil
It looks good to me.
Thanks

@behlendorf
Contributor

Me too. Nice work guys, I'll get it merged.

@behlendorf
Contributor

Merged as:

3937ab2 Allow for lock-free reading zfsdev_state_list.

@behlendorf behlendorf closed this May 19, 2014