SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics #20363

seiko2plus · 2021-11-12T22:10:28Z

Replace SVML/ASM of tanh for both single and double precision with universal intrinsics

This brings the performance benefits to all platforms, not just AVX512 on Linux,
without any performance or accuracy regression.
On the contrary, it gives better performance and,
above all, more maintainable code.

The original code can be found in:

Benchmarks

X86

CPU

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:                        4
CPU MHz:                         3410.808
BogoMIPS:                        5999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        24.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

OS

Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.10
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

SVML/AVX512_SKX(before) vs AVX512_SKX(after)

unset NPY_DISABLE_CPU_FEATURES
python3 runtests.py -n --bench-compare parent/main tanh -- --sort name

       before           after         ratio
     [a1813504]       [b5bd8620]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-       365±0.5μs        218±0.9μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-      82.1±0.3μs       70.9±0.5μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-       375±0.6μs        238±0.2μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-        379±20μs         248±20μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-       366±0.5μs          228±3μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-      88.2±0.5μs       83.4±0.3μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-       378±0.8μs          254±2μs     0.67  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-         382±2μs          272±1μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')

AVX2_FMA3

export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
python3 runtests.py -n --bench-compare parent/main tanh -- --sort name

      before           after         ratio
     [a1813504]       [b5bd8620]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-        1.96±0ms          857±1μs     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-     1.86±0.01ms          250±2μs     0.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f')
-     1.97±0.05ms         873±30μs     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-     1.86±0.01ms          283±2μs     0.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-     2.14±0.09ms         962±50μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-     1.93±0.04ms          296±8μs     0.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-        1.96±0ms          884±1μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-        1.86±0ms        312±0.9μs     0.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-        1.97±0ms          885±1μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-     1.86±0.02ms          329±4μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-      1.97±0.1ms         888±50μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-        1.86±0ms        329±0.9μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-        1.97±0ms        888±0.9μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-        1.86±0ms        315±0.7μs     0.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-        1.97±0ms          889±2μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-        1.86±0ms        331±0.3μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-        1.97±0ms          890±3μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-     1.87±0.04ms          330±7μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')
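
For reading the tables: the ratio column is the "after" time divided by the "before" time, so values below 1 are improvements. A minimal sketch that recomputes it from two timing cells follows; the helper names `to_seconds` and `ratio` are made up for illustration and are not part of asv or NumPy.

```python
import re

# Unit multipliers for the timing cells used in the benchmark tables.
_UNITS = {"ns": 1e-9, "μs": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(cell: str) -> float:
    """Convert a cell such as '365±0.5μs' to seconds, ignoring the ± spread."""
    # Accept both the Greek mu and the micro sign for microseconds.
    cell = cell.strip().replace("µ", "μ")
    match = re.fullmatch(r"([\d.]+)±[\d.]+(ns|μs|ms|s)", cell)
    if match is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def ratio(before: str, after: str) -> float:
    """Return after/before rounded to two decimals, like the ratio column."""
    return round(to_seconds(after) / to_seconds(before), 2)


# First double-precision row of each x86 table.
print(ratio("365±0.5μs", "218±0.9μs"))
print(ratio("1.96±0ms", "857±1μs"))
```

A ratio of 0.44, for example, corresponds to roughly a 2.3x speedup.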

Power little-endian

CPU

Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS

Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

VSX2

python3 runtests.py -n --bench-compare parent/main tanh -- --sort name

       before           after         ratio
     [a1813504]       [feacd298]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-     5.63±0.01ms         2.46±0ms     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-     6.11±0.01ms         1.71±0ms     0.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f')
-     5.63±0.03ms      2.48±0.01ms     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-     6.11±0.03ms         1.77±0ms     0.29  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-     5.70±0.05ms      2.51±0.03ms     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-     6.17±0.04ms      1.78±0.01ms     0.29  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-     5.63±0.01ms      2.67±0.04ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-     6.12±0.01ms      1.81±0.01ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-     5.62±0.01ms      2.68±0.01ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-     6.13±0.01ms         1.85±0ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-     5.63±0.03ms      2.69±0.02ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-     6.13±0.01ms         1.85±0ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-     5.62±0.02ms      2.68±0.01ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-     6.10±0.02ms      1.82±0.01ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-     5.63±0.01ms         2.70±0ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-     6.10±0.02ms         1.86±0ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-        5.63±0ms       2.70±0.2ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-     6.14±0.02ms      1.85±0.01ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')

AArch64

CPU

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        2 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

OS

Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

ASIMD

python3 runtests.py --bench-compare parent/main tanh -- --sort name

       before           after         ratio
     [a1813504]       [869eae28]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-     2.69±0.01ms         1.39±0ms     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-     2.71±0.01ms          533±2μs     0.20  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f')
-     2.69±0.05ms      1.42±0.02ms     0.53  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-     2.70±0.01ms          568±5μs     0.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-     2.85±0.09ms      1.50±0.04ms     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-     2.78±0.04ms         585±10μs     0.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-        2.70±0ms         1.44±0ms     0.53  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-     2.70±0.01ms          560±3μs     0.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-        2.69±0ms         1.47±0ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-     2.70±0.02ms          598±7μs     0.22  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-     2.69±0.09ms      1.47±0.05ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-        2.70±0ms          600±4μs     0.22  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-     2.70±0.01ms         1.45±0ms     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-     2.70±0.01ms          593±4μs     0.22  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-     2.69±0.01ms         1.47±0ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-        2.71±0ms          628±4μs     0.23  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-     2.69±0.01ms         1.47±0ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-     2.72±0.04ms         630±10μs     0.23  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')

Binary size (stripped)

LIB                                                    Before (KBytes)  After (KBytes)  Diff (KBytes)
_multiarray_umath.cpython-38-x86_64-linux-gnu.so       4140             4144            4
_multiarray_umath.cpython-38-aarch64-linux-gnu.so      3364             3384            20
_multiarray_umath.cpython-38-powerpc64le-linux-gnu.so  4108             4116            8

seiko2plus · 2021-11-13T05:41:03Z

no errors yay

mattip · 2021-11-13T17:20:22Z

Cool! At some point this should get a benchmark. There are ULP tests in test_umath_accuracy, so if those pass, the code should be at least as accurate as SVML.
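
As a rough illustration of what a ULP comparison measures, here is a sketch that counts how many representable float64 values separate two results. This is not the code used in test_umath_accuracy; `ulp_distance` and `_ordered` are hypothetical helpers written only for this example.

```python
import math
import struct


def _ordered(x: float) -> int:
    """Map a float64 to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats are stored as sign + magnitude; negate the magnitude so
    # the integer line is monotonic across zero (and -0.0 maps to 0).
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b."""
    if math.isnan(a) or math.isnan(b):
        raise ValueError("ULP distance is undefined for NaN")
    return abs(_ordered(a) - _ordered(b))


reference = 1.0
candidate = 1.0 + 2.0**-52  # the next float64 above 1.0
print(ulp_distance(reference, candidate))
```

An accuracy test would then assert that the distance between the SIMD result and a high-precision reference stays under some bound.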

r-devulap · 2021-11-13T20:17:40Z

Nice, will review it over the next few days. The test_umath_accuracy tests run only on x86_64 and on CPUs with AVX-512; we might have to find a way to run them on other platforms too.

seiko2plus · 2021-11-13T20:34:47Z

> At some point this should get a benchmark

I still need to introduce two more universal intrinsics (npyv_all_##sfx, npyv_any_##sfx) to speed up boolean tests across the vector elements, which in turn speeds up falling back to C on ppc64le and arm64. Then I'm going to publish the benchmark results and mark the PR ready for review.

> There are ULP tests in test_umath_accuracy

I just converted the x86/AVX512 instructions of SVML into C intrinsics. I made a few tweaks, but it is still almost the same implementation.

> test_umath_accuracy tests run only on x86_64 and on CPU's with AVX-512

I thought test_umath_accuracy was enabled with AVX2 & FMA3 too. Since the universal intrinsics pass on one platform, they should pass on the others too; also, the current code is only enabled for SIMD extensions with fused (FMA) support.

r-devulap · 2021-11-13T20:42:17Z

> I thought test_umath_accuracy is enabled with AVX2 & FMA3 too. since the universal intrinsics passed with one platform it should pass with the others too and also the current code only enabled for SIMD extensions with fused support.

My bad, it is enabled for AVX2 & FMA3 too. And yes, fma instruction is a must, otherwise the accuracy can get really bad.
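
To illustrate why the fused operation matters, here is a small sketch that emulates a single-rounding multiply-add with exact rational arithmetic and compares it with the unfused form. The helper names are made up for this example; the actual kernels use hardware FMA through universal intrinsics, not anything like this.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to float64 (like a hardware FMA)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_add_unfused(a: float, b: float, c: float) -> float:
    """Round after the multiply and again after the add."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in float64,
# so the unfused form loses the whole result.
print(mul_add_unfused(a, b, c))
# The fused form keeps the exact product until the final rounding.
print(fma_emulated(a, b, c))
```

Polynomial evaluation chains many such steps, so losing the low bits at each multiply can add up.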

seiko2plus · 2021-11-15T05:32:37Z

> I still need to introduce other two universal intrinsics (npyv_all_##sfx, npyv_any_##sfx) to speed up boolean tests across

I take that back: there is no performance change, since npyv_tobits performs almost the same in this case. However, I will need to implement them for other cases in a different PR one day.

@mattip,

> At some point this should get a benchmark

done
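
For context, the idea of answering "any lane set?" and "all lanes set?" through a bitmask can be sketched in plain Python. This is only an emulation of the concept, not the NumPy C API; `tobits`, `any_lane` and `all_lane` are hypothetical names, and putting lane 0 in the lowest bit is an assumption of this sketch.

```python
from typing import Sequence


def tobits(mask: Sequence[bool]) -> int:
    """Pack a per-lane boolean mask into an integer, lane 0 in the lowest bit."""
    bits = 0
    for lane, flag in enumerate(mask):
        if flag:
            bits |= 1 << lane
    return bits


def any_lane(mask: Sequence[bool]) -> bool:
    """True if at least one lane is set."""
    return tobits(mask) != 0


def all_lane(mask: Sequence[bool]) -> bool:
    """True if every lane is set."""
    return tobits(mask) == (1 << len(mask)) - 1


special = [False, True, False, False]  # e.g. one lane holds a special value
print(any_lane(special), all_lane(special))
```

A kernel can use such a test to decide whether any lane of a vector needs the special-case path.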

seiko2plus · 2021-11-15T05:36:03Z

GitHub still doesn't offer an option to change the PR branch, so there is no way to replace seiko2plus:svml2npyv/tanh_f32 with seiko2plus:svml2npyv/tanh. The plan was to replace only single precision, but now I have replaced both. I think the branch name doesn't matter anyway :).

mattip · 2021-11-15T08:49:11Z

I am a bit surprised that the benchmarks improve AVX512 performance. Isn't it using SVML?

seiko2plus · 2021-11-15T09:35:35Z

@mattip, it depends on the compiler; the latest versions of clang and GCC may perform even better. The current benchmark shows improvements on AVX512 only for non-contiguous access. Even assuming SVML were compiled with the same compiler, inlining can still avoid extra loads/stores through the stack. There's already an open discussion on this matter: https://www.mail-archive.com/[email protected]/msg05645.html.

seiko2plus

Doubts and perceptions.

seiko2plus · 2021-11-20T20:23:09Z

numpy/core/src/umath/loops_hyperbolic.dispatch.c.src

@@ -0,0 +1,395 @@
+/*@targets
+ ** $maxopt baseline
+ ** (avx2 fma3) AVX512_SKX


Suggested change

** (avx2 fma3) AVX512_SKX

** (avx2 fma3) avx512f avx512_skx

Adding a target for avx512f is important for Xeon Phi, but it is going to increase the binary size. Alternatives:

- Remove avx512_skx and keep avx512f, which may lose roughly 5-10% (an estimate; I didn't test it).
- Keep it as-is; users who target Xeon Phi should build NumPy with at least --cpu-baseline=avx512f, otherwise they will get the (AVX2, FMA3) kernel.

mattip · 2021-12-24

Can we document/warn that the build might be suboptimal?

r-devulap · 2022-01-04

I would just keep it as is.

seiko2plus · 2022-01-19

Only Xeon Phi chips are affected; all newer Intel chips with AVX512 support the SKX features. If Intel no longer cares about Xeon Phi, why should we? We can update the docs to notify Xeon Phi users of this approach and recommend that they raise the baseline features to at least avx512f. We already have this notice in the build options doc; see the quick start. I will leave it as-is until we see what kind of AVX512 features AMD is going to support in their new chips.

numpy/core/src/umath/loops_hyperbolic.dispatch.c.src

r-devulap

I did verify that this change produces the exact same output for platforms with AVX-512. I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN). Apart from the callout portion, everything else looks good.


seiko2plus · 2022-01-19T15:55:08Z

@r-devulap,

> I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN)

done for single-precision.

numpy/core/src/umath/loops_hyperbolic.dispatch.c.src

r-devulap · 2022-01-28T21:35:04Z

> @r-devulap,
>
> > I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN)
>
> done for single-precision.

Once it's fixed for double precision, this PR will be good to merge.


seiko2plus · 2022-02-01T02:23:18Z

@r-devulap, it's now done for both single and double precision. I revisited the assembly code and realized that there is no special handling when |x| > HUGE_THRESHOLD; it just returns ±1, even for double precision. Unfortunately, the SVML tanh description wasn't clear enough, and the asm paths of the callout were a little strange.
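
The behaviour described above can be sketched in scalar Python. The value of `HUGE_THRESHOLD` below is an assumption chosen only for illustration, not the constant used in the kernel, and `tanh_saturating` is a hypothetical name. The point is that past some magnitude tanh is ±1 to within floating-point precision, so no separate C fallback is needed for large inputs or infinities.

```python
import math

# Hypothetical cutoff for illustration only; the kernel's actual constant
# is not reproduced here.
HUGE_THRESHOLD = 20.0


def tanh_saturating(x: float) -> float:
    """Scalar sketch: return ±1 for large |x| (including ±inf), propagate NaN."""
    if math.isnan(x):
        return x
    if abs(x) > HUGE_THRESHOLD:
        return math.copysign(1.0, x)
    return math.tanh(x)


for value in (0.5, 25.0, -25.0, math.inf, -math.inf):
    print(value, tanh_saturating(value))
```

In a vectorized kernel the same selection would be done per lane with a comparison mask rather than a branch.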

seiko2plus · 2022-02-10T09:03:40Z

@mattip, @r-devulap ping

r-devulap

LGTM. This is a great start in converting SVML to C code :)

mattip · 2022-02-10T18:17:51Z

Thanks @seiko2plus

rgommers · 2022-11-15T17:06:26Z

I just went on a little goose chase to find out where the svml_filter content came from (it was added in this PR), so let me cross-link it for future reference.

- This PR adds float32 and float64 implementations of tanh, and therefore filters out the relevant object files from SVML.
- ENH: Vectorize FP16 umath functions using AVX512 #21955 adds a float16 implementation in SVML, and removes the contents of `svml_filter` to include the two object files that are now needed again in the build.
  - It does leave the empty filter and the code comment in place, which is now confusing.
  - In the meson branch, I will just get rid of that, but instead put a one-line note that filenames which are no longer needed because they're fully implemented with universal intrinsics should be commented out.

seiko2plus · 2022-11-16T01:16:36Z

@rgommers, #21955 didn't reuse the universal intrinsics implementation to handle half-precision over single; instead, it brought the SVML objects back. However, @r-devulap should have removed only svml_z0_tanh_s_la.s and kept svml_z0_tanh_d_la.s.

So you can choose between filtering out svml_z0_tanh_d_la.s, or ignoring its existence. In general, the reason behind this filtration is to reduce the binary size.


rgommers · 2022-11-16T15:21:59Z

Thanks for the context, very helpful. Reducing binary size is important. I commented out svml_z0_tanh_d_la.s in commit 636255b.

… small drift rate Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9 Signed-off-by: Jan Vesely <[email protected]>

Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. This gives a 3rd possible set of results for DriftDiffusionAnalytical SmallDriftRate tests. Print numpy information before running the tests to provide more information about optimizations used by numpy. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9

seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Nov 12, 2021

seiko2plus force-pushed the svml2npyv/tanh_f32 branch from 419c3f7 to 54a28e8 Compare November 12, 2021 23:01

seiko2plus changed the title ~~SIMD: Replace SVML/ASM of tanh with universal intrinsics~~ SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics Nov 12, 2021

seiko2plus force-pushed the svml2npyv/tanh_f32 branch 3 times, most recently from 7e7913d to 99c3216 Compare November 13, 2021 04:25

rgommers requested a review from r-devulap November 13, 2021 10:45

seiko2plus force-pushed the svml2npyv/tanh_f32 branch 4 times, most recently from 8e9f073 to 9ed5995 Compare November 15, 2021 05:28

seiko2plus marked this pull request as ready for review November 15, 2021 05:28

seiko2plus commented Nov 20, 2021


r-devulap requested changes Jan 4, 2022


seiko2plus mentioned this pull request Jan 6, 2022

BLD: Add NPY_DISABLE_SVML env var to opt out of SVML #20695

Merged

seiko2plus added 3 commits January 19, 2022 07:54

ENH, SIMD: replace SVML/tanh with universal intrinsics

2946e02

To bring the benefits of performance for all platforms not just for avx512 on linux without performance/accuracy regression, actually the other way around, better performance and after all maintainable code.

SIMD: Add new universal intrinsics for lookup table

47644b2

SIMD: Test lookup table intrinsics

a52cdaf

seiko2plus force-pushed the svml2npyv/tanh_f32 branch from 9ed5995 to 70226f7 Compare January 19, 2022 05:55

r-devulap reviewed Jan 28, 2022


seiko2plus force-pushed the svml2npyv/tanh_f32 branch from 70226f7 to eba2d29 Compare February 1, 2022 00:39

SIMD: handel |x| > HUGE_THRESHOLD, special cases via universal intrinsics instead of fallback to C

d99bf0e

seiko2plus force-pushed the svml2npyv/tanh_f32 branch from eba2d29 to d99bf0e Compare February 1, 2022 00:56

r-devulap approved these changes Feb 10, 2022


mattip merged commit 75edab9 into numpy:main Feb 10, 2022

seiko2plus added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Feb 13, 2022

seiko2plus mentioned this pull request Apr 28, 2022

ENH, SIMD: Optimize/vectorize comparison and logical operations for VSX #21258

Closed

rgommers added a commit that referenced this pull request Nov 16, 2022

BLD: don't use svml_z0_tanh_d_la.s, has a universal intrinsics implementation

636255b

See comments on gh-20363

jvesely mentioned this pull request May 17, 2023

requirements: update numpy requirement from <1.22.5 to <1.23.6 PrincetonUniversity/PsyNeuLink#2672

Merged

mattip Dec 24, 2021

r-devulap Jan 4, 2022

seiko2plus Jan 19, 2022
