Cleanup: Make memory barrier definitions consistent across kernels #13843

ryao · 2022-09-06T02:34:26Z

Motivation and Context

When reading module/zfs/zfs_ioctl.c, I noticed a Linux-kernel-specific smp_rmb() in platform-independent code that should have used membar_consumer(). This merits cleanup.

When cleaning it up, I noticed that FreeBSD had not defined membar_consumer(), but had defined smp_rmb() in include/os/freebsd/linux/compiler.h, while membar_producer() was defined in include/os/freebsd/sys/atomic.h. The definition of membar_producer() had been optimized to a compiler memory barrier on amd64/x86 by exploiting the total store order (TSO) memory model, but the smp_rmb() definition lacked that optimization.

Description

I replaced smp_rmb() with membar_consumer(). I also deleted the smp_rmb() definition from include/os/freebsd/linux/compiler.h and put the correct definitions for membar_consumer() into include/os/freebsd/sys/atomic.h and include/os/linux/spl/sys/vmsystm.h.

How Has This Been Tested?

It has not been tested. This kind of change should only require a build test. I intend to rely on the buildbot to verify that I have not broken kernel builds on either FreeBSD or Linux. I am not set up to compile FreeBSD here, so the buildbot's verification is a time saver for me.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin

I like the general cleanup, but I have a feeling that the FreeBSD part goes the wrong way. The comment in sys/amd64/include/atomic.h says:

/*
 * To express interprocessor (as opposed to processor and device) memory
 * ordering constraints, use the atomic_*() functions with acquire and release
 * semantics rather than the *mb() functions.  An architecture's memory
 * ordering (or memory consistency) model governs the order in which a
 * program's accesses to different locations may be performed by an
 * implementation of that architecture.  In general, for memory regions
 * defined as writeback cacheable, the memory ordering implemented by amd64
 * processors preserves the program ordering of a load followed by a load, a
 * load followed by a store, and a store followed by a store.  Only a store
 * followed by a load to a different memory location may be reordered.
 * Therefore, except for special cases, like non-temporal memory accesses or
 * memory regions defined as write combining, the memory ordering effects
 * provided by the sfence instruction in the wmb() function and the lfence
 * instruction in the rmb() function are redundant.  In contrast, the
 * atomic_*() functions with acquire and release semantics do not perform
 * redundant instructions for ordinary cases of interprocessor memory
 * ordering on any architecture.
 */

Since ZFS usually does not work with hardware directly, I think membar_producer() should be left as-is and membar_consumer() should be mapped to atomic_thread_fence_acq(). On x86 those two do not create a full fence as you say; they only use __compiler_membar() to block reordering of memory accesses by the compiler, while the hardware implements the regular x86 memory model. rmb()/wmb(), on the other hand, generate explicit LFENCE/SFENCE instructions, which may be expensive and, as written above, redundant.
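To make the distinction above concrete, here is a minimal sketch of the two kinds of barrier under discussion. It assumes a GCC- or Clang-style compiler; the names compiler_barrier(), hw_read_fence(), hw_write_fence() and store_then_flag() are hypothetical and are not taken from FreeBSD or OpenZFS. The non-x86 fallback is only an assumption so the sketch builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier (GCC/Clang syntax): emits no instruction, but the
 * compiler may not move memory accesses across it. */
#define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__)
/* Explicit hardware fences, the kind rmb()/wmb() are described as emitting. */
static inline void hw_read_fence(void)  { __asm__ __volatile__("lfence" ::: "memory"); }
static inline void hw_write_fence(void) { __asm__ __volatile__("sfence" ::: "memory"); }
#else
/* Assumption: fall back to compiler builtins on other architectures. */
static inline void hw_read_fence(void)  { __atomic_thread_fence(__ATOMIC_ACQUIRE); }
static inline void hw_write_fence(void) { __atomic_thread_fence(__ATOMIC_RELEASE); }
#endif

/* Hypothetical helper: store *data, then *flag, in that program order. */
static inline void store_then_flag(int *data, int *flag, int value)
{
    assert(data != NULL && flag != NULL);
    *data = value;
    compiler_barrier();   /* the compiler may not sink the data store below the flag store */
    *flag = 1;
}
```

On amd64 the compiler-only variant costs nothing at run time, whereas the hardware variants always execute an instruction.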

ryao · 2022-09-06T22:43:39Z

@amotin membar_producer() and membar_consumer() are supposed to be sfence and lfence instructions respectively. That is how they are on OpenSolaris/Illumos:

https://github.com/illumos/illumos-gate/blob/master/usr/src/common/atomic/amd64/atomic.s#L564

On FreeBSD, atomic_thread_fence_rel() and atomic_thread_fence_acq() are just __compiler_membar():

https://github.com/freebsd/freebsd-src/blob/main/sys/amd64/include/atomic.h#L344

membar_producer() and membar_consumer() are used in lockless code that absolutely must have a store/load fence to work correctly. As far as I can tell, every use of these functions originated in either OpenSolaris or Linux with the expectation that they be hardware fences. For example, the smp_rmb() that I want to clean up is used in manipulation of a lockless linked list that was introduced to fix a race condition. Replacing the hardware fences with compiler memory barriers would introduce tiny race conditions. This would suggest that the FreeBSD port currently has some very tiny race conditions.

In hindsight, I should have labelled this as a bug fix, rather than a performance enhancement. I mislabelled it as a performance enhancement because I misunderstood __compiler_membar() to generate a full hardware fence when I read it (it was late at night). :/

As for being expensive, partial hardware fences are less expensive than full hardware fences, which are less expensive than atomic instructions, which are less expensive than locks. These partial hardware fences are in the code because they were the cheapest option for synchronization.

Do you still want to map membar_producer() and membar_consumer() to functions that map to __compiler_membar() on FreeBSD, even if Linux maps them to sfence and lfence like OpenSolaris does, and the code calling them expects them to be hardware fences?

amotin · 2022-09-07T00:01:59Z

Unfortunately, I can't say that I am a big expert in this area. But I can say that atomic_thread_fence_acq/rel() on FreeBSD are mapped to __compiler_membar() only on x86 platforms. For other platforms with less strict memory ordering they do explicitly call hardware synchronization. On x86, AFAIK and as written in the quote above, inter-CPU synchronization is the duty of the hardware. Exceptions are very rare, such as device access, rdtsc (which is not serializing, but is often wanted to be), and a few others. So on the one hand I don't want FreeBSD to be different, but on the other I'd like the code using these primitives in ZFS to be re-validated to check whether the strict semantics are really required.

I'll call on a few other FreeBSD developers for help with this: @markjdb, @mjguzik, @kostikbel.

amotin · 2022-09-07T00:10:46Z

As another argument to support my point, here are amd64 FreeBSD implementations of atomic_store_rel() and atomic_load_acq():

#define ATOMIC_LOAD(TYPE)                                       \
static __inline u_##TYPE                                        \
atomic_load_acq_##TYPE(volatile u_##TYPE *p)                    \
{                                                               \
        u_##TYPE res;                                           \
                                                                \
        res = *p;                                               \
        __compiler_membar();                                    \
        return (res);                                           \
}                                                               \
struct __hack

#define ATOMIC_STORE(TYPE)                                      \
static __inline void                                            \
atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)       \
{                                                               \
                                                                \
        __compiler_membar();                                    \
        *p = v;                                                 \
}                                                               \
struct __hack

As you may see, they have only compiler barriers, and those are sufficient for the kernel primitives. All other synchronization on x86 is done by hardware.
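For comparison, the same acquire-load and release-store semantics can be sketched portably with C11 atomics. This is an illustration under the assumption of a C11 compiler, not the FreeBSD implementation; load_acq_32() and store_rel_32() are hypothetical names. On x86, compilers typically emit plain loads and stores for these orderings, while weaker architectures get whatever instructions they need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Acquire load: later accesses in this thread cannot be moved before it. */
static inline uint32_t load_acq_32(_Atomic uint32_t *p)
{
    assert(p != NULL);
    return atomic_load_explicit(p, memory_order_acquire);
}

/* Release store: earlier accesses in this thread cannot be moved after it. */
static inline void store_rel_32(_Atomic uint32_t *p, uint32_t v)
{
    assert(p != NULL);
    atomic_store_explicit(p, v, memory_order_release);
}
```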

ryao · 2022-09-07T03:09:29Z

"As you may see, they have only compiler barriers"

The existing FreeBSD SPL uses atomic_thread_fence_rel(), which is not implemented by either of these.

"and those are sufficient for the kernel primitives."

Relying on this whenever superscalar processors cannot be allowed to reorder would introduce a tiny race condition unless another synchronization primitive, such as a lock (that does a full memory barrier) is able to protect it. If a lock provides protection, then it would be unnecessary.

Something like this would only be useful to prevent the compiler from introducing bugs like the one described here:

https://lwn.net/Articles/508991/

In code like that, you do not need a hardware memory fence, but you do need to keep the compiler from doing loop invariant optimizations. A hardware memory fence would be overkill for that.
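A minimal sketch of that situation, assuming a GCC- or Clang-style compiler: without the barrier, the compiler is allowed to read *flag once and hoist the load out of the loop. The helper wait_for_flag() and its bounded spin count are hypothetical and exist only to keep the example terminating.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier (GCC/Clang syntax); emits no instruction. */
#define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

/* Spin until *flag becomes nonzero or max_spins iterations have passed.
 * Returns the number of iterations spent waiting. */
static inline unsigned long wait_for_flag(const int *flag, unsigned long max_spins)
{
    unsigned long spins = 0;

    assert(flag != NULL);
    while (*flag == 0 && spins < max_spins) {
        compiler_barrier();   /* forces *flag to be re-read on the next iteration */
        spins++;
    }
    return spins;
}
```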

"All other synchronization on x86 is done by hardware."

We support ARM, which does not enforce total store ordering:

https://en.wikipedia.org/wiki/Memory_ordering#In_symmetric_multiprocessing_(SMP)_microprocessor_systems

For example, let us take a look at zfsdev_state_init(). It implements a linked list that may be read without taking a lock to fix this deadlock:

#2301

The way that it works is fairly simple:

Modification is only ever done under a lock.
Entries in the list are never actually freed.
Entries may be marked unused by setting the minor to -1.
New entries are added at the tail of the list.

It uses a store fence when either adding a new entry or marking an existing entry as in use. Let us say that we are marking an entry as in use without the store fence on processor A. That means a superscalar architecture is free to reorder zs->zs_minor = minor; before zfs_zevent_init((zfs_zevent_t **)&zs->zs_zevent); sets zs->zs_zevent. Then processor B executes zfsdev_get_state(minor, ZST_ZEVENT), which reads the list. Since zs->zs_minor == minor matches, it then returns zs->zs_zevent, which is a bad value because the correct value is still in the store buffer of processor A (as per the MOESI protocol).
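The publish-then-lookup pattern described above can be sketched with C11 fences. This is a simplified illustration, not the zfs_ioctl.c code: state_entry_t, entry_publish() and entry_lookup() are hypothetical stand-ins, the payload field stands in for zs_zevent, and the release and acquire fences play the roles of membar_producer() and membar_consumer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one entry of the lockless list. */
typedef struct state_entry {
    _Atomic(struct state_entry *) next;
    _Atomic int minor;    /* -1 means unused */
    void *payload;        /* stands in for zs_zevent */
} state_entry_t;

/* Writer (caller is assumed to hold the list lock): fill in the payload
 * first, then make the entry visible by setting its minor. */
static void entry_publish(state_entry_t *e, int minor, void *payload)
{
    assert(e != NULL);
    e->payload = payload;
    atomic_thread_fence(memory_order_release);    /* role of membar_producer() */
    atomic_store_explicit(&e->minor, minor, memory_order_relaxed);
}

/* Lockless reader: match on minor first, then read the payload. */
static void *entry_lookup(state_entry_t *head, int minor)
{
    for (state_entry_t *e = head; e != NULL;
        e = atomic_load_explicit(&e->next, memory_order_acquire)) {
        if (atomic_load_explicit(&e->minor, memory_order_relaxed) == minor) {
            atomic_thread_fence(memory_order_acquire);    /* role of membar_consumer() */
            return e->payload;
        }
    }
    return NULL;
}
```

If either fence were dropped, a reader on a weakly ordered machine could match the minor and still read a stale payload.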

If we were executing on a scalar processor, then a compiler memory barrier would be fine, since the stores would be done in-order, but modern CPUs, being superscalar, are free to re-order stores, so compiler memory barriers no longer enforce an order. Sometimes, the hardware might not reorder and we would be fine, but then a new processor can be made that exploits the re-ordering opportunity and then we have a very rare and very hard to debug race condition. This is just one example. Every case in the code that uses either membar_producer() or membar_consumer() should expect a memory fence, as those functions would have been used specifically to put a memory fence in that place.

Despite what some documentation might say, memory fences are cheap. They are cheaper than any other synchronization primitive when you need synchronization. They are only expensive relative to not having any synchronization at all, but you cannot have concurrent access to data structures without some kind of synchronization. As I said previously, a compiler memory barrier does not enforce ordering on a superscalar processor. The result of not enforcing ordering in code that requires it to perform synchronization is a race condition.

Lastly, we might be clever and observe that, in the above example, releasing the zfsdev_state_lock mutex would flush the store buffer. While that is true, there is still an opportunity for another processor to see memory in an invalid state before that is done. Furthermore, a CPU interrupt occurring right before the mutex release does a store buffer flush could delay the store buffer flush for an indefinite period of time, making the race much bigger than the few cycles that it would seem to be. That said, even races that are a few cycles long can cause problems. :/

amotin · 2022-09-07T03:39:49Z

@ryao I do understand the concept of barriers in general, but x86 in particular is more forgiving than other architectures. You may look at the "Memory ordering in some architectures" table at https://en.wikipedia.org/wiki/Memory_ordering. Or, if you want a more authoritative document, see the "8.2.2 Memory Ordering in P6 and More Recent Processor Families" chapter of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1:

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles:
• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following exceptions ...
...
In a multiple-processor system, the following ordering principles apply:
• Individual processors use the same ordering principles as in a single-processor system.
• Writes by a single processor are observed in the same order by all processors.
...
The processor-ordering model described in this section is virtually identical to that used by the Pentium and Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
• Added support for speculative reads, while still adhering to the ordering principles above.
• Store-buffer forwarding, when a read passes a write to the same memory location.
• Out of order store from long string store and string move operations ...

ryao · 2022-09-07T03:45:18Z

@amotin Honestly, I posted that before I should have and ended up editing it up until you replied. Anyway, I think I understand the problem here. You assume that we only support architectures that do total store ordering. We support both ARM and POWER, which do not implement total store ordering:

https://en.wikipedia.org/wiki/Memory_ordering#In_symmetric_multiprocessing_(SMP)_microprocessor_systems

I understand that FreeBSD also supports ARM and PPC:

https://www.freebsd.org/platforms/arm/
https://www.freebsd.org/platforms/ppc/

That said, I now understand your aversion to using these instructions on x86/amd64. It does look like they are unnecessary, provided that we use compiler memory barriers. I am open to switching to compiler memory barriers on x86/amd64, but we still need the fences on other architectures.

Would that be satisfactory?

amotin · 2022-09-07T03:54:00Z

As I said above: "But I can say that atomic_thread_fence_acq/rel() on FreeBSD are mapped to __compiler_membar() only on x86 platforms. For other platforms with less strict memory ordering they do explicitly call hardware synchronization." It would be good to find the closest atomic_thread_fence_acq/rel() counterparts on other architectures. Otherwise FreeBSD may indeed appear different. Unless I am completely wrong and somebody corrects me. ;)

ryao · 2022-09-07T04:11:59Z

@amotin It has been years since I last thought about this and honestly, I never thought about exploiting x86/amd64's TSO to avoid explicit fences. The practice in both OpenSolaris and Linux is to use explicit fences and I never had a reason to question it until now. After thinking about it, I realized that you were right from the start. I just pushed a new patch that uses atomic_thread_fence_acq/rel() on FreeBSD and adopts that behavior on Linux.

Thank you for your feedback and for taking the time to explain things to me after I failed to understand them at first. :)

We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb().

We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux.

Signed-off-by: Richard Yao <[email protected]>

ryao · 2022-09-07T04:25:09Z

After scrutinizing the Linux kernel sources more closely, it turns out that Linux also implements the optimization where it replaces lfence/sfence with a compiler memory barrier in smp_rmb() and smp_wmb(). I just simplified my patch based on that. OpenSolaris/Illumos/(Solaris?) are alone in inserting lfence/sfence on x86/amd64.
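The per-architecture choice the thread converges on can be sketched as follows. This is an illustration only, not the merged patch: the sketch_membar_*() macros and the mailbox helpers are hypothetical, GCC/Clang builtins are assumed, and the non-x86 branch uses generic acquire/release fences rather than whatever each kernel actually selects.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier (GCC/Clang syntax); emits no instruction. */
#define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
/* x86 keeps load-load and store-store order in hardware, so only the
 * compiler has to be restrained. */
#define sketch_membar_consumer()  compiler_barrier()
#define sketch_membar_producer()  compiler_barrier()
#else
/* Weaker memory models (for example ARM, POWER) need a real fence. */
#define sketch_membar_consumer()  __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define sketch_membar_producer()  __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

/* Hypothetical one-slot mailbox using the macros above. */
static int mailbox_data;
static volatile int mailbox_ready;

static void mailbox_put(int v)
{
    mailbox_data = v;
    sketch_membar_producer();   /* data store is ordered before the ready store */
    mailbox_ready = 1;
}

static int mailbox_get(int *out)
{
    assert(out != NULL);
    if (!mailbox_ready)
        return 0;
    sketch_membar_consumer();   /* ready load is ordered before the data load */
    *out = mailbox_data;
    return 1;
}
```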

module/zfs/zfs_ioctl.c

mjguzik · 2022-09-07T10:01:01Z

As @amotin said, these are the correct barriers, so the patch is fine in terms of what it modifies.

However, I could not help but note there is no full barrier implemented in the list which I find highly suspicious.

On Linux it would be smp_mb(), on FreeBSD atomic_thread_fence_seq_cst(). These do not simply compile out, even on amd64.
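A full barrier matters when a store must be ordered before a later load of a different location, which is the one reordering x86 permits. Here is a minimal C11 sketch of that case, using a Dekker-style handshake; the flag variables and function names are hypothetical, and atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb() or atomic_thread_fence_seq_cst().

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int flag_a;
static _Atomic int flag_b;

/* Side A raises its flag, then checks side B's flag. The full fence keeps
 * the store from being reordered after the load. Returns 1 if A may enter. */
static int try_enter_a(void)
{
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag_b, memory_order_relaxed) == 0;
}

/* Side B mirrors side A. */
static int try_enter_b(void)
{
    atomic_store_explicit(&flag_b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag_a, memory_order_relaxed) == 0;
}

/* Side A lowers its flag again; it must currently be raised. */
static void leave_a(void)
{
    assert(atomic_load_explicit(&flag_a, memory_order_relaxed) == 1);
    atomic_store_explicit(&flag_a, 0, memory_order_release);
}
```

With both fences in place, two threads calling try_enter_a() and try_enter_b() concurrently cannot both get 1. With only acquire or release fences they could.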

Grep reveals membar_enter and membar_exit, which are quite frankly weird -- both provide a full barrier and are not used anywhere.

Instead there is a wrong redefinition in module/icp/include/sys/crypto/impl.h:

/* atomic operations in linux implicitly form a memory barrier */
#define membar_exit()

Should the above be needed, on Linux you would use smp_mb__before_atomic()/smp_mb__after_atomic(). Unfortunately, FreeBSD does not provide an equivalent right now.

Sample usage:

#define KCF_PROV_IREFRELE(desc) {                               \
        ASSERT((desc)->pd_irefcnt != 0);                        \
        membar_exit();                                          \
        if (atomic_add_32_nv(&(desc)->pd_irefcnt, -1) == 0) {   \
                cv_broadcast(&(desc)->pd_remove_cv);            \
        }                                                       \
}

This most likely only needs a release fence, not a full barrier -- as in membar_producer, although it does look weird given the naming.
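As an illustration of a reference drop that needs only release ordering, here is a hedged C11 sketch. It is not the ICP code: provider_t and provider_irefrele() are hypothetical, and the return value stands in for the cv_broadcast() call in the macro above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a provider descriptor with an internal refcount. */
typedef struct provider {
    _Atomic uint32_t irefcnt;
} provider_t;

/* Drop one reference. Returns 1 if this was the last one, 0 otherwise. */
static int provider_irefrele(provider_t *pd)
{
    assert(pd != NULL);
    assert(atomic_load_explicit(&pd->irefcnt, memory_order_relaxed) != 0);

    /* Release: this thread's earlier writes to *pd are ordered before the
     * decrement becomes visible to other threads. */
    if (atomic_fetch_sub_explicit(&pd->irefcnt, 1, memory_order_release) == 1) {
        /* Acquire: whoever drops the last reference sees the writes made by
         * the other threads before they dropped theirs. */
        atomic_thread_fence(memory_order_acquire);
        return 1;
    }
    return 0;
}
```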

That said, the patch as is provides an improvement and perhaps can go in, but there is more work to do in the area.

ryao · 2022-09-07T16:20:30Z

@mjguzik The identical membar_enter and membar_exit weirdness is explained in the Illumos/OpenSolaris source code:

https://github.com/illumos/illumos-gate/blob/master/usr/src/common/atomic/amd64/atomic.s#L547

In short, it was for DTrace.

ryao · 2022-09-12T19:43:43Z

/* atomic operations in linux implicitly form a memory barrier */
#define membar_exit()

This is definitely wrong. Atomics only form memory barriers on architectures that do not reorder atomics (and a quick look at the Linux kernel source code confirmed the lack of barriers on IA64). Most architectures will reorder atomics:

https://en.wikipedia.org/wiki/Memory_ordering#In_symmetric_multiprocessing_(SMP)_microprocessor_systems

That said, the original code did not use atomics, but instead used a mutex:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/crypto/impl.h#L252

I will tackle this in a separate patch, since I will not be implementing membar_exit(), but eliminating the need for it entirely.

ryao · 2022-09-12T20:07:56Z

@mjguzik I have just opened #13880 with fixes for module/icp/include/sys/crypto/impl.h.

We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb().

We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux.

Reviewed-by: Konstantin Belousov <[email protected]>
Reviewed-by: Mateusz Guzik <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Neal Gompa <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#13843

ryao force-pushed the membarrier branch 3 times, most recently from fdf5c3f to 1cb1630 Compare September 6, 2022 03:52

behlendorf requested a review from amotin September 6, 2022 17:01

behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 6, 2022

amotin requested changes Sep 6, 2022

ryao force-pushed the membarrier branch from 1cb1630 to b0f36cf Compare September 7, 2022 04:10

ryao changed the title ~~Cleanup: Correct memory barrier definitions~~ Make memory barrier definitions consistent across kernels Sep 7, 2022

ryao force-pushed the membarrier branch from b0f36cf to 23095d8 Compare September 7, 2022 04:22

ryao force-pushed the membarrier branch from 23095d8 to 825ab2a Compare September 7, 2022 04:24

ryao changed the title ~~Make memory barrier definitions consistent across kernels~~ Cleanup: Make memory barrier definitions consistent across kernels Sep 7, 2022

kostikbel reviewed Sep 7, 2022

amotin approved these changes Sep 7, 2022

behlendorf approved these changes Sep 8, 2022

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 8, 2022

Conan-Kudo approved these changes Sep 11, 2022

ryao mentioned this pull request Sep 12, 2022

Fix assertions in crypto reference helpers #13880

Merged

13 tasks

behlendorf merged commit cf66e7e into openzfs:master Sep 13, 2022
