
Support for vectorized algorithms on x86 #4381

Closed
ironMann wants to merge 1 commit from the simd-01 branch

Conversation

ironMann
Contributor

This is initial support for x86 vectorized implementations of the ZFS parity
and checksum algorithms.

For the compilation phase, the configure step checks whether the toolchain supports
the relevant instruction sets. Each implementation must ensure that its code is not
passed to the compiler if the relevant instruction set is not supported. For this
purpose, the following new defines are provided when an instruction set is supported:
- HAVE_SSE,
- HAVE_SSE2,
- HAVE_SSE3,
- HAVE_SSSE3,
- HAVE_SSE4_1,
- HAVE_SSE4_2,
- HAVE_AVX,
- HAVE_AVX2.

To detect at runtime whether an instruction set can be used, the following functions
are provided in simd_x86.h:
- sse_available()
- sse2_available()
- sse3_available()
- ssse3_available()
- sse4_1_available()
- sse4_2_available()
- avx_available()
- avx2_available()

These functions should be called once, on module load or initialization.
They are safe to use from both user and kernel space.

If an implementation uses more than a single instruction set, both compiler
and runtime support for all relevant instruction sets should be checked.
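As an illustration of this pattern, here is a minimal, hypothetical sketch (not code from this patch) of how a consumer might combine the compile-time HAVE_AVX2 guard with the runtime avx2_available() check; fletcher4_generic, fletcher4_avx2, fletcher4_impl and fletcher4_select_impl are made-up names for this example:

	#include <stdint.h>
	#include "simd_x86.h"	/* avx2_available(), provided by this patch */

	/* Portable scalar fallback (stub for the example). */
	static void
	fletcher4_generic(const void *buf, uint64_t size)
	{
		(void) buf; (void) size;
	}

	#if defined(HAVE_AVX2)
	/* Only handed to the compiler when the toolchain supports AVX2. */
	static void
	fletcher4_avx2(const void *buf, uint64_t size)
	{
		(void) buf; (void) size;
	}
	#endif

	static void (*fletcher4_impl)(const void *, uint64_t) = fletcher4_generic;

	/* Call once, on module load or initialization. */
	static void
	fletcher4_select_impl(void)
	{
	#if defined(HAVE_AVX2)
		if (avx2_available())
			fletcher4_impl = fletcher4_avx2;
	#endif
	}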

This is relevant for:
#2351 - sha256 avx optimization
#3374 - raidz avx/avx2/sse optimization
#4328 - raidz avx/avx2/sse optimization
#4330 - fletcher avx optimization

@behlendorf
Contributor

This appears to work great and aside from a few questions and comments it looks like a good basis to me. It would be great if we could resolve the remaining loose ends and get it merged so the other outstanding changes could be rebased to use it.

@ironMann
Contributor Author

ironMann commented Mar 9, 2016

> It was my understanding that gcc already provided these in cpuid.h. Are you trying to avoid adding a dependency on gcc?

Actually, clang provides cpuid.h as well, but its version is missing the bit_AVX2 define. Since these are Intel-defined constants, we can provide our own and still use the __get_cpuid and __get_cpuid_max functions. If this is acceptable, I would rather use the compiler-provided functions here.

> It appears the cpu_has_osxsave and cpu_has_avx wrappers weren't added until 3.0 kernels...

I did similar research on the kernel tree, and it seems that this is the simplest solution that works for all relevant kernel versions.
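For reference, here is a minimal userspace sketch of that approach, assuming <cpuid.h> from gcc or clang; has_avx2 and the fallback bit_AVX2 definition are illustrative only, and the patch may differ:

	#include <cpuid.h>

	#ifndef bit_AVX2	/* some clang versions lack it; Intel-defined constant */
	#define	bit_AVX2	(1 << 5)	/* CPUID.(EAX=7,ECX=0):EBX bit 5 */
	#endif

	static int
	has_avx2(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 7 (structured extended features) must exist before querying it. */
		if (__get_cpuid_max(0, NULL) < 7)
			return (0);

		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		(void) eax; (void) ecx; (void) edx;
		return ((ebx & bit_AVX2) != 0);
	}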

@behlendorf
Contributor

> I would rather use the compiler provided functions here.

Yes, that should be fine. And after giving the cpu_has_* macros a bit more thought, I think your solution is the right way to go. It's simple, consistent, and works for all existing kernel versions.

If you can squash these patches and force-update the PR, this should be ready to merge.

@ironMann
Contributor Author

Done.
One missing piece is to export kernel_fpu_begin()/kernel_fpu_end() in a portable way. This interface was even switched to EXPORT_SYMBOL_GPL recently (taking down some GPU drivers), but I think there is a portable workaround. I'll have to research the options a bit more.

@behlendorf
Contributor

@ironMann yes, I'd forgotten about the kernel_fpu_* interface. Now seems like the best time to get that sorted out and to attempt to update some of the proposed consumers to use these interfaces. That should give us a good idea of whether they're going to be workable.

One lingering concern I have is that we may want to prefix all these macros and functions with a short string. The names are currently very generic, and I'm worried that we may end up with a namespace collision at some point in the future.

	return (boot_cpu_has(X86_FEATURE_AVX) &&
	    boot_cpu_has(X86_FEATURE_OSXSAVE));
#else
	return (cpuid_has_avx());
Contributor

You still need to check OSXSAVE for userspace, and you need to check with xgetbv whether XSAVE is set up to save the AVX registers.
tuxoko@8baa4a5#diff-0d2e10cd21fcf823e8e9a62a934e5115R48
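For context, a hedged sketch of the userspace check being asked for here (ymm_usable is an illustrative name; the exact form in the patch may differ): the OS must advertise OSXSAVE via CPUID leaf 1, and XCR0, read with xgetbv, must have both the XMM and YMM state bits set before AVX code may run.

	#include <cpuid.h>
	#include <stdint.h>

	static int
	ymm_usable(void)
	{
		unsigned int eax, ebx, ecx, edx;
		uint32_t xcr0_lo, xcr0_hi;

		/* CPUID.1:ECX.OSXSAVE — the OS has enabled XSAVE/XGETBV. */
		if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0 ||
		    (ecx & bit_OSXSAVE) == 0)
			return (0);

		/* XGETBV with ECX = 0 reads XCR0. */
		__asm__ __volatile__("xgetbv"
		    : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
		(void) xcr0_hi;

		/* Bit 1 = XMM state, bit 2 = YMM state; both must be enabled. */
		return ((xcr0_lo & 0x6) == 0x6);
	}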

Contributor Author

True, I'll add a userspace check for that and for more instruction sets (BMI1, BMI2, ...).

@tuxoko
Contributor

tuxoko commented Mar 10, 2016

kernel_fpu_begin() can easily be worked around:
tuxoko@8baa4a5#diff-0d2e10cd21fcf823e8e9a62a934e5115R82
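For readers following along, a hedged sketch of the kfpu wrapper idea under discussion (the exact mapping in the patch may differ): in-kernel code brackets SIMD use with the kernel's FPU begin/end routines, while userspace needs no such protection.

	#if defined(_KERNEL)
	/* Older kernels declare these in <asm/i387.h>, newer ones in <asm/fpu/api.h>. */
	#define	kfpu_begin()	kernel_fpu_begin()
	#define	kfpu_end()	kernel_fpu_end()
	#else
	/* Userspace: FPU/SIMD state is saved and restored by the OS. */
	#define	kfpu_begin()	((void) 0)
	#define	kfpu_end()	((void) 0)
	#endif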

This is initial support for x86 vectorized implementations of the ZFS parity
and checksum algorithms.

For the compilation phase, the configure step checks whether the toolchain supports
the relevant instruction sets. Each implementation must ensure that its code is not
passed to the compiler if the relevant instruction set is not supported. For this
purpose, the following new defines are provided when an instruction set is supported:
	- HAVE_SSE,
	- HAVE_SSE2,
	- HAVE_SSE3,
	- HAVE_SSSE3,
	- HAVE_SSE4_1,
	- HAVE_SSE4_2,
	- HAVE_AVX,
	- HAVE_AVX2.

To detect at runtime whether an instruction set can be used, the following functions
are provided in include/linux/simd_x86.h:
	- zfs_sse_available()
	- zfs_sse2_available()
	- zfs_sse3_available()
	- zfs_ssse3_available()
	- zfs_sse4_1_available()
	- zfs_sse4_2_available()
	- zfs_avx_available()
	- zfs_avx2_available()
	- zfs_bmi1_available()
	- zfs_bmi2_available()

These functions should be called once, on module load or initialization.
They are safe to use from both user and kernel space.
If an implementation uses more than a single instruction set, both compiler
and runtime support for all relevant instruction sets should be checked.

Kernel fpu methods:
	- kfpu_begin()
	- kfpu_end()

Use __get_cpuid_max and __cpuid_count from <cpuid.h>.
Both gcc and clang support these. They also handle the ebx register
in case it is used for PIC code.
ironMann force-pushed the simd-01 branch 2 times, most recently from 41b7279 to 6c36357 on March 11, 2016 at 19:27
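To make the intended usage concrete, here is a hedged sketch of how a kernel-side consumer might tie these pieces together; raidz_map_t, raidz_gen_p_avx2(), raidz_gen_p_scalar(), raidz_math_init() and raidz_generate_p() are illustrative names, not code from this PR:

	#include <linux/simd_x86.h>	/* zfs_avx2_available(), kfpu_begin(), kfpu_end() */

	typedef struct raidz_map raidz_map_t;		/* hypothetical type */
	extern void raidz_gen_p_avx2(raidz_map_t *);	/* vectorized body */
	extern void raidz_gen_p_scalar(raidz_map_t *);	/* portable fallback */

	static int use_avx2;

	/* Called once, on module load. */
	void
	raidz_math_init(void)
	{
	#if defined(HAVE_AVX2)
		use_avx2 = zfs_avx2_available();
	#endif
	}

	void
	raidz_generate_p(raidz_map_t *rm)
	{
	#if defined(HAVE_AVX2)
		if (use_avx2) {
			kfpu_begin();		/* protect the kernel FPU/SIMD state */
			raidz_gen_p_avx2(rm);
			kfpu_end();
			return;
		}
	#endif
		raidz_gen_p_scalar(rm);
	}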
@ironMann
Contributor Author

@behlendorf, @tuxoko, can you take another look?

Added:

  • kfpu_begin() / kfpu_end() interface
  • OSXSAVE and xgetbv tests for YMM support, where needed
  • checks for BMI1 & BMI2 instruction sets

@behlendorf
Contributor

@ironMann the updated patch LGTM. However, before it can be merged I think we should update one or more of the proposed patches which would depend on this infrastructure, to make sure it provides everything we need (at least initially). The fletcher patch in #4330 is probably the simplest to update.

@ironMann
Contributor Author

@behlendorf sure, I'll take a look at #4330.
I've 'integrated' this patch into #4328 and it's been working fine there for SSE and AVX2. I've also added a userspace verify-and-reconstruct tool, cmd/raidz_test, which uses the userspace checks provided by this patch.

@behlendorf
Contributor

Awesome. Well, if it works for the fletcher patch and @tuxoko doesn't have any remaining concerns, we'll be able to move forward on merging this. I wasn't aware you'd already updated #4328.

ironMann added 4 commits to ironMann/zfs that referenced this pull request on Mar 17, 2016
@behlendorf
Contributor

@ironMann OK, this LGTM as a base to build on. Let's plan on merging this on Monday if you're happy with it as a final version and @tuxoko doesn't have any objections. It would be great if we could get a few more people to look it over and sign off on it before then.

@tuxoko
Contributor

tuxoko commented Mar 19, 2016

LGTM

@behlendorf
Contributor

Merged. @ironMann could you please rebase your updated #4328 against master so we can keep moving forward with it.

fc0c72b Support for vectorized algorithms on x86

@ironMann
Contributor Author

@behlendorf #4328 rebased and pushed

ironMann added a commit to ironMann/zfs that referenced this pull request Jun 30, 2016
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes openzfs#4381

Conflicts:
	config/kernel.m4