-
-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics #20363
Conversation
419c3f7
to
54a28e8
Compare
7e7913d
to
99c3216
Compare
no errors yay |
Cool!. At some point this should get a benchmark. There are ULP tests in test_umath_accuracy, so if that is passing the code should be at least as accurate as SVML. |
Nice, will review it over the next few days. The |
I still need to introduce other two universal intrinsics (npyv_all_##sfx, npyv_any_##sfx) to speed up boolean tests across the vector elements to speed up falling back to
I just converted the x86/avx512 instructions of SMVL into C intrinsics, I did little tweaks but still almost the same implementation.
I thought |
My bad, it is enabled for |
8e9f073
to
9ed5995
Compare
take it back, no performance changes
done |
github still doesn't offer an option to change the pr branch, no way to replace |
I am a bit surprised that the benchmarks improve AVX512 performance. Isn't it using SVML? |
@mattip, depending on the compiler, the latest versions of clang and GCC may even perform better, the current benchmark only shows improvements on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doubts and perceptions.
@@ -0,0 +1,395 @@ | |||
/*@targets | |||
** $maxopt baseline | |||
** (avx2 fma3) AVX512_SKX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
** (avx2 fma3) AVX512_SKX | |
** (avx2 fma3) avx512f avx512_skx |
adding target for avx512f
is important for xeon phi
but that is going to increase the binary size, alternatives:
- remove
avx512_skx
and keepavx512f
may lead to lose ~(5:10)% approximated(didn't test it) - keep it as-is, users who target
xeon phi
should build NumPy with at least--cpu-baseline=avx512f
otherwise they will get(AVX2, FMA3)
kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document/warn that the build might be suboptimal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just keep it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only xeon phi chips are affected. all newer chips from intel with avx512 support skx features. if intel is no longer care about xeon phi then why we does? we can update the doc to notify xeon phi users of this approach and recommend them to raise the ceiling of the baseline features to avx512f feature at least. we already have this notification within the build options doc. see quick start. I will leave it as-is till we see what kind AVX512 features AMD is going to support into their new chips.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did verify that this change produces the exact same output for platforms with AVX-512. I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN). Apart from the callout portion, everything else looks good.
To bring the benefits of performance for all platforms not just for avx512 on linux without performance/accuracy regression, actually the other way around, better performance and after all maintainable code.
9ed5995
to
70226f7
Compare
done for single-precision. |
Once its fixed for double precision, this PR will be good to merge. |
70226f7
to
eba2d29
Compare
…sics instead of fallback to C
eba2d29
to
d99bf0e
Compare
@r-devulap, it's now for both single & double. I have revisited the assembly code again and I just realized that there's no special handling when |
@mattip, @r-devulap ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This is a great start in converting SVML to C code :)
Thanks @seiko2plus |
I just went on a little goose chase to find out to the
|
@rgommers, #21955 didn't reuse the universal intrinsic implementation to handle half-precision over single, instead, it brings SVML objects back. However, @r-devulap should have removed only So you can choose between filtering out |
…ementation See comments on gh-20363
Thanks for the context, very helpful. Reducing binary size is important. I commented out |
… small drift rate Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9 Signed-off-by: Jan Vesely <[email protected]>
… small drift rate Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9 Signed-off-by: Jan Vesely <[email protected]>
Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. This gives a 3rd possible set of results for DriftDiffusionAnalytical SmallDriftRate tests. Print numpy information before running the tests to provide more information about optimizations used by numpy. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9
Replace SVML/ASM of tanh for both single and double precision with universal intrinsics
To bring the benefits of performance for all platforms
not just for
avx512
on Linux without performance/accuracy regression,actually the other way around, better performance and
after all maintainable code.
The original code can be found in:
Benchmarks
X86
CPU
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux Python 3.8.10 gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
SVML/AVX512_SKX(before) vs AVX512_SKX(after)
unset NPY_DISABLE_CPU_FEATURES python3 runtests.py -n --bench-compare parent/main tanh -- --sort name
AVX2_FMA3
Power little-endian
CPU
OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
VSX2
AArch64
CPU
OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
ASIMD
Binary size(striped)