-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup: Make memory barrier definitions consistent across kernels #13843
Conversation
fdf5c3f
to
1cb1630
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like general cleanup, but I have feeling that FreeBSD part goes wrong way. Comment in sys/amd64/include/atomic.h says:
/*
* To express interprocessor (as opposed to processor and device) memory
* ordering constraints, use the atomic_*() functions with acquire and release
* semantics rather than the *mb() functions. An architecture's memory
* ordering (or memory consistency) model governs the order in which a
* program's accesses to different locations may be performed by an
* implementation of that architecture. In general, for memory regions
* defined as writeback cacheable, the memory ordering implemented by amd64
* processors preserves the program ordering of a load followed by a load, a
* load followed by a store, and a store followed by a store. Only a store
* followed by a load to a different memory location may be reordered.
* Therefore, except for special cases, like non-temporal memory accesses or
* memory regions defined as write combining, the memory ordering effects
* provided by the sfence instruction in the wmb() function and the lfence
* instruction in the rmb() function are redundant. In contrast, the
* atomic_*() functions with acquire and release semantics do not perform
* redundant instructions for ordinary cases of interprocessor memory
* ordering on any architecture.
*/
Since ZFS usually does not work with hardware directly, I think membar_producer() should be left as-is and membar_consumer() should be mapped to atomic_thread_fence_acq(). Those two on x86 do not create a full fence as you tell, but only with __compiler_membar() block memory access reorder by compiler, while hardware implements regular x86 memory model. rmb()/wmd() on the other side generate explicit LFENCE/SFENCE instructions, that may be expensive and as written above redundant.
@amotin On FreeBSD, https://github.com/freebsd/freebsd-src/blob/main/sys/amd64/include/atomic.h#L344
In hindsight, I should have labelled this as a bug fix, rather than a performance enhancement. I mislabelled it as a performance enhancement because I misunderstood As for being expensive, partial hardware fences are less expensive than full hardware fences, which are less expensive than atomic instructions, which are less expensive than locks. These partial hardware fences are in the code because they were the cheapest option for synchronization. Do you still want to map |
Unfortunately I can't say that I am a big expert in this area. But I can say that atomic_thread_fence_acq/rel() on FreeBSD are mapped to __compiler_membar() only on x86 platforms. For other platforms with less strict memory ordering they do explicitly call hardware synchronization. On x86 AFAIK and as written in the quote above inter-CPU synchronization is a duty of hardware. Exceptions are very rare, like devices access, rdtsc (which is not serializing, but often wanted so) and few others. So from one side I don't want FreeBSD to be different, but from the other I'd like the code using these primitives in ZFS to be re-validated from the point whether the strict semantics is really required. I'll call few other FreeBSD developers for help with this: @markjdb , @mjguzik , @kostikbel . |
As another argument to support my point, here are amd64 FreeBSD implementations of atomic_store_rel() and atomic_load_acq():
As you may see, they have only compiler barriers, and those are sufficient for the kernel primitives. All other synchronization on x86 is done by hardware. |
The existing FreeBSD SPL uses
Relying on this whenever superscalar processors cannot be allowed to reorder would introduce a tiny race condition unless another synchronization primitive, such as a lock (that does a full memory barrier) is able to protect it. If a lock provides protection, then it would be unnecessary. Something like this would only be useful to prevent the compiler from introducing bugs like the one described here: https://lwn.net/Articles/508991/ In code like that, you do not need a hardware memory fence, but you do need to keep the compiler from doing loop invariant optimizations. A hardware memory fence would be overkill for that.
We support ARM, which does not enforce total store ordering: For example, let us take a look at The way that it works is fairly simple:
It uses a store fence when either adding a new entry or marking an existing entry as in use. Let us say that we are marking an entry as in use without the store fence on processor A. That means a superscalar architecture is free to reorder If we were excecuting on a scalar processor, then a compiler memory barrier would be fine, since the stores would be done in-order, but modern CPUs, being superscalar, are free to re-order stores, so compiler memory barriers no longer enforce an order. Sometimes, the hardware might not reorder and we would be fine, but then a new processor can be made that exploits the re-ordering opportunity and then we have a very rare and very hard to debug race condition. This is just one example. Every case in the code that uses either Despite what some documentation might say, memory fences are cheap. They are cheaper than any other synchronization primitive when you need synchronization. They are only expensive relative to not having any synchronization at all, but you cannot have concurrent access to data structures without some kind of synchronization. As I said previously, a compiler memory barrier does not enforce ordering on a superscalar processor. The result of not enforcing ordering in code that requires it to perform synchronization is a race condition. Lastly, we might by clever and observe, that in the above example, releasing the |
@ryao I do understand the concept of barriers in general, but x86 in particular is more forgiving than other architectures. You may look on "Memory ordering in some architectures" table at https://en.wikipedia.org/wiki/Memory_ordering. Or if you wish more serious document on "8.2.2 Memory Ordering in P6 and More Recent Processor Families" chapter at Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1: In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles: |
@amotin Honestly, I posted that before I should have and ended up editing it up until you replied. Anyway, I think I understand the problem here. You assume that we only support architectures that do total store ordering. We support both ARM and POWER, which do not implement total store ordering: I understand that FreeBSD also supports ARM and PPC: https://www.freebsd.org/platforms/arm/ That said, I now understand your aversion to using these instructions on x86/amd64. It does look like they are unnecessary, provided that we use compiler memory barriers. I am open to switching to compiler memory barriers on x86/amd64, but we still need the fences on other architectures. Would that be satisfactory? |
As I have told above: "But I can say that atomic_thread_fence_acq/rel() on FreeBSD are mapped to __compiler_membar() only on x86 platforms. For other platforms with less strict memory ordering they do explicitly call hardware synchronization." It would be good to find closest atomic_thread_fence_acq/rel() counterparts on other architectures. Otherwise FreeBSD may indeed appear different. Unless I am completely wrong and somebody correct me. ;) |
@amotin It has been years since I last thought about this and honestly, I never thought about exploiting x86/amd64's TSO to avoid explicit fences. The practice in both OpenSolaris and Linux is to use explicit fences and I never had a reason to question it until now. After thinking about it, I realized that you were right from the start. I just pushed a new patch that uses Thank you for your feedback and for taking the time to explain things to me after I failed to understand them at first. :) |
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Signed-off-by: Richard Yao <[email protected]>
After scrutinizing the Linux kernel sources more closely, it turns out that Linux also implements the optimization where it replaces lfence/sfence with a compiler memory barrier in |
As @amotin said these are the correct barriers, so the patch is fine in terms of what it modifies. However, I could not help but note there is no full barrier implemented in the list which I find highly suspicious. On Linux it would be smp_mb, on FreeBSD atomic_thread_fence_seq_cst. These happen to not just compile out even on amd64. Grep reveals membar_enter and membar_exit, which are quite frankly weird -- both provide a full barrier and are not used anywhere. Instead there is a wrong redefinition in module/icp/include/sys/crypto/impl.h:
Should the above be needed, on Linux you would smp_mb__before/after_atomic. Unfortunately FreeBSD does not provide an equivalent right now. Sample usage:
This most likely only needs a release fence, not a full barrier -- as in membar_producer, although it does look weird given the naming. That said, the patch as is provides an improvement and perhaps can go in, but there is more work to do in the area. |
@mjguzik The identical membar_enter and membar_exit weirdness is explained in the Illumos/OpenSolaris source code: In short, it was for DTrace. |
This is definitely wrong. Atomics only form memory barriers on architectures that do not reorder atomics (and a quick look at the Linux kernel source code confirmed the lack of barriers on IA64). Most architectures will reorder atomics: That said, the original code did not use atomics, but instead used a mutex: I will tackle this in a separate patch, since I will not be implementing |
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes openzfs#13843
Motivation and Context
When reading
module/zfs/zfs_ioctl.c
, I noticed a Linux kernel specificsmp_rmb()
in platform independent code that should have usedmembar_consumer()
. This merits cleanup.When cleaning it up, I noticed that FreeBSD had not defined
membar_consumer()
, but had definedsmp_rmb()
ininclude/os/freebsd/linux/compiler.h
whenmembar_producer()
had been ininclude/os/freebsd/sys/atomic.h
. The definition formembar_producer()
had been optimized to a compiler memory barrier on amd64/x86 by exploiting the total store order (TSO) memory model, but thesmp_rmb()
definition lacked that optimization.Description
I replaced
smp_rmb()
withmembar_consumer()
. I also deleted thesmp_rmb()
definition frominclude/os/freebsd/linux/compiler.h
and put the correct definitions formembar_consumer()
intoinclude/os/freebsd/sys/atomic.h
andinclude/os/linux/spl/sys/vmsystm.h
.How Has This Been Tested?
It has not been tested. This kind of change should only require a build test. I intend to rely on the buildbot to verify that I have not broken kernel builds on either FreeBSD or Linux. I am not setup to compile FreeBSD here, so the buildbot's verification is a time saver for me.
Types of changes
Checklist:
Signed-off-by
.