
More aggsum optimizations. #12145

Merged: 1 commit merged into openzfs:master from the aggsum3 branch on Jun 7, 2021
Conversation

@amotin (Member) commented May 28, 2021:

  • Avoid atomic_add() when updating as_lower_bound/as_upper_bound.
    The previous code was excessively strong on 64-bit systems while
    not strong enough on 32-bit ones. Instead, introduce and use real
    atomic_load() and atomic_store() operations: plain assignments on
    64-bit machines, but proper atomics on 32-bit ones to avoid torn
    reads/writes. (See the sketch after this list.)

  • Reduce the number of buckets on large systems. Extra buckets do
    not improve add speed as much as they hurt reads. Unlike with
    wmsum, aggsum reads are still important.
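
A minimal sketch of the first change, for illustration only: it
assumes the aggsum_t/aggsum_bucket field names and the
ASSERT/MUTEX_HELD macros from the ZFS tree and is not the exact diff.
The bounds are only written while holding as_lock, so no atomic
read-modify-write is needed; a store that cannot tear is enough for
lock-free readers.

static void
aggsum_flush_bucket(aggsum_t *as, struct aggsum_bucket *asb)
{
	ASSERT(MUTEX_HELD(&as->as_lock));
	ASSERT(MUTEX_HELD(&asb->asc_lock));

	/* Before: atomic_add_64() on each bound. */
	atomic_store_64((volatile uint64_t *)&as->as_lower_bound,
	    as->as_lower_bound + asb->asc_delta + asb->asc_borrowed);
	atomic_store_64((volatile uint64_t *)&as->as_upper_bound,
	    as->as_upper_bound + asb->asc_delta - asb->asc_borrowed);
	asb->asc_delta = 0;
	asb->asc_borrowed = 0;
}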

How Has This Been Tested?

On a 40-thread FreeBSD system under a heavy uncached 4KB ZVOL read workload, the profiler shows time spent in all aggsum_*() functions dropping from 3.2% to 1.8% globally, and from 15.6% to 8.8% in the ARC evict thread. The nested lock contention sometimes visible in profiles before has vanished.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin amotin requested review from pcd1193182 and behlendorf May 28, 2021 04:10
@amotin amotin force-pushed the aggsum3 branch 2 times, most recently from cd5437a to 7144634 on May 28, 2021 04:38
@behlendorf behlendorf added the Status: Code Review Needed label on May 28, 2021
@amotin amotin force-pushed the aggsum3 branch 3 times, most recently from f6c29c1 to ad3d3c4 on May 28, 2021 14:23
 - Avoid atomic_add() when updating as_lower_bound/as_upper_bound.
Previous code was excessively strong on 64-bit systems while not
strong enough on 32-bit ones.  Instead introduce and use real
atomic_load() and atomic_store() operations: plain assignments on
64-bit machines, but proper atomics on 32-bit ones to avoid torn
reads/writes.

 - Reduce the number of buckets on large systems.  Extra buckets do
not improve add speed as much as they hurt reads.  Unlike with wmsum,
aggsum reads are still important.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
/*
* Atomically read variable.
*/
#define atomic_load_char(p) (*(volatile uchar_t *)(p))
Contributor:
I know libspl's versions of these functions are okay being a bit loose, but I don't believe volatile guarantees atomicity; that's just a common implementation side effect.
How about GCC's __atomic builtins (also provided by Clang)?
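
For reference, a hypothetical sketch of the builtin-based variant
being suggested (not the code in this PR). __ATOMIC_RELAXED matches
the intent here: no ordering guarantees, only freedom from torn
reads/writes.

#include <stdint.h>

#define	atomic_load_64(p)	__atomic_load_n((p), __ATOMIC_RELAXED)
#define	atomic_store_64(p, v)	__atomic_store_n((p), (v), __ATOMIC_RELAXED)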

Member Author (@amotin):
volatile does not provide atomicity; it only keeps the compiler from optimizing the memory accesses. Atomicity is provided by the machine register size: the value is copied with a single instruction. That is why there are #ifdef _LP64 guards. This is a copy/paste from FreeBSD's atomic_common.h, used on all supported architectures, so I believe it to be correct. I have no opinion about the __atomic builtins. When I wrote this, I hadn't seen наб's recent commit that starts using them in lib/libspl/atomic.c. It actually makes me wonder why they were not inlined here, but that is a different topic.
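
A condensed sketch of the pattern being described (illustrative,
assuming the usual _LP64 guard):

#ifdef _LP64
/*
 * A naturally aligned 64-bit access is a single instruction here, so
 * volatile (which only stops the compiler from eliding or reordering
 * the access) is sufficient in practice.
 */
#define	atomic_load_64(p)	(*(volatile uint64_t *)(p))
#define	atomic_store_64(p, v)	(*(volatile uint64_t *)(p) = (v))
#else
/* A 64-bit access can tear on 32-bit machines; use real atomics. */
extern uint64_t atomic_load_64(volatile uint64_t *);
extern void atomic_store_64(volatile uint64_t *, uint64_t);
#endif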

Contributor:
It'll be nice to finally have proper wrappers for atomic loads and stores! This is functionality I know would be helpful elsewhere for some kstats, which are a bit dodgy in this regard (though pretty harmless).

Since it sounds like these are all well tested already on FreeBSD, this seems fine to me. Though I wouldn't object to updating them to use the GCC atomic builtins, for consistency with what was done for the other atomics.

Member Author (@amotin), May 30, 2021:
I am not sure what the point is now of keeping all the atomics in a separate C file rather than inlined in the header. Is it supposed to improve compatibility when the library is built with a compiler that supports atomics but the header is used by a less capable one? Previously it made sense, since the alternative implementations were written in assembler. But now, with the compiler expected to implement all of this, what's the point? To trouble the compiler/CPU with function calls?

PS: On second thought, some operations may require a more complicated implementation where a function call could make sense, but since the compiler hides that from us, we have no way to differentiate. For trivial load/store/etc. I'd prefer not to call a function for a single machine instruction.
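
For the trivial operations, header inlining would look something like
this (a sketch of the alternative being argued for, not current
libspl code):

static inline uint64_t
atomic_load_64(volatile uint64_t *p)
{
	/* Compiles down to a single load on most targets. */
	return (__atomic_load_n(p, __ATOMIC_RELAXED));
}

static inline void
atomic_store_64(volatile uint64_t *p, uint64_t v)
{
	__atomic_store_n(p, v, __ATOMIC_RELAXED);
}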

Contributor:
Well, since this is only for libzpool.so in user space, it's probably not a huge deal: only zdb and ztest would use these (at least until there's a FUSE or BUSE implementation). The only reason I can think of not to move everything to the header (to be inlined) would be if we wanted a single shared user/kernel-space header. But we don't have that today, and I don't think we'd gain much from it. We're already inlining these on the kernel side; doing the same in user space would make sense.

lib/libspl/atomic.c (review thread resolved)

@amotin (Member Author) commented Jun 1, 2021:

@nabijaczleweli Would you take a look at the atomics side, since you've touched them recently?

@nabijaczleweli (Contributor) commented Jun 1, 2021:

I, uh, don't have much to say, really. I just replaced the libspl atomics with versions that were as strong as the pthreads implementation and didn't touch anything else (even if I kind of wanted to; these 100% should just be statics). The strength of an atomic operation should be determined by the call site and the ordering you want there, and these don't have that luxury (you could say they're therefore terminally broken, and you wouldn't be wrong).

Per WG14, volatile guarantees that accesses can't be reordered or fused, but it's impossible to use it for atomics, because you could observe a race and that would be UB, which is guaranteed not to happen. (For a less annoying phrasing: yes, they can tear; all you're guaranteed is that the reads/writes you wrote are the reads/writes that are emitted, in the same order.)
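
To make the tearing point concrete, a hypothetical example: on a
32-bit target a volatile 64-bit access typically compiles to two
32-bit memory operations, so a concurrent reader can observe half of
an old value and half of a new one.

#include <stdint.h>

volatile uint64_t counter;	/* written concurrently by another thread */

uint64_t
read_counter(void)
{
	/*
	 * Usually two 32-bit loads on a 32-bit target.  volatile
	 * guarantees both loads are emitted, in order, but not that
	 * they observe the same write, so the result can mix two
	 * different values of counter.
	 */
	return (counter);
}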

lib/libspl/atomic.c (review thread resolved)
@behlendorf behlendorf requested a review from a user June 4, 2021 20:35
@behlendorf behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label on Jun 7, 2021
@behlendorf behlendorf merged commit ea40012 into openzfs:master Jun 7, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 8, 2021
- Avoid atomic_add() when updating as_lower_bound/as_upper_bound.
Previous code was excessively strong on 64-bit systems while not
strong enough on 32-bit ones.  Instead introduce and use real
atomic_load() and atomic_store() operations: plain assignments on
64-bit machines, but proper atomics on 32-bit ones to avoid torn
reads/writes.

 - Reduce the number of buckets on large systems.  Extra buckets do
not improve add speed as much as they hurt reads.  Unlike with wmsum,
aggsum reads are still important.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes openzfs#12145
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 9, 2021
@amotin amotin deleted the aggsum3 branch August 24, 2021 20:18