SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics #20363

Merged
merged 4 commits into from
Feb 10, 2022

Member

@seiko2plus seiko2plus commented Nov 12, 2021

Replace the SVML/ASM implementation of tanh, for both single and double precision, with universal intrinsics.

This brings the performance benefits to all platforms, not just AVX512 on Linux, without any performance or accuracy regression. On the contrary, performance improves and the code becomes more maintainable.

The original code can be found in:

Benchmarks

X86

CPU
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:                        4
CPU MHz:                         3410.808
BogoMIPS:                        5999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        24.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.10
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

SVML/AVX512_SKX(before) vs AVX512_SKX(after)
unset NPY_DISABLE_CPU_FEATURES
python3 runtests.py -n --bench-compare parent/main tanh -- --sort name
       before           after         ratio
     [a1813504]       [b5bd8620]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-       365±0.5μs        218±0.9μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-      82.1±0.3μs       70.9±0.5μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-       375±0.6μs        238±0.2μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-        379±20μs         248±20μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-       366±0.5μs          228±3μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-      88.2±0.5μs       83.4±0.3μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-       378±0.8μs          254±2μs     0.67  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-         382±2μs          272±1μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
AVX2_FMA3
export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
python3 runtests.py -n --bench-compare parent/main tanh -- --sort name
      before           after         ratio
     [a1813504]       [b5bd8620]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-        1.96±0ms          857±1μs     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-     1.86±0.01ms          250±2μs     0.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f')
-     1.97±0.05ms         873±30μs     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-     1.86±0.01ms          283±2μs     0.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-     2.14±0.09ms         962±50μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-     1.93±0.04ms          296±8μs     0.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-        1.96±0ms          884±1μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-        1.86±0ms        312±0.9μs     0.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-        1.97±0ms          885±1μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-     1.86±0.02ms          329±4μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-      1.97±0.1ms         888±50μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-        1.86±0ms        329±0.9μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-        1.97±0ms        888±0.9μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-        1.86±0ms        315±0.7μs     0.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-        1.97±0ms          889±2μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-        1.86±0ms        331±0.3μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-        1.97±0ms          890±3μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-     1.87±0.04ms          330±7μs     0.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')
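
For readers who want to spot-check a single row of these tables without a full asv run, here is a minimal sketch. It assumes the benchmark parameters printed above are (ufunc, input stride, output stride, dtype). The helper name `time_tanh`, the array length and the repeat counts are illustrative choices, not taken from the NumPy benchmark suite.

```python
import timeit

import numpy as np


def time_tanh(stride_in, stride_out, dtype, n=100_000, repeat=5, number=20):
    """Return the best per-call time (seconds) of np.tanh on strided views.

    Hypothetical helper: n, repeat and number are arbitrary choices.
    """
    rng = np.random.default_rng(0)
    src = rng.random(stride_in * n).astype(dtype)
    dst = np.empty(stride_out * n, dtype=dtype)
    x = src[::stride_in]       # non-contiguous input when stride_in > 1
    out = dst[::stride_out]    # non-contiguous output when stride_out > 1
    timer = timeit.Timer(lambda: np.tanh(x, out=out))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for dtype in ("f", "d"):
        for s_in, s_out in ((1, 1), (2, 1), (4, 4)):
            t = time_tanh(s_in, s_out, dtype)
            print(f"tanh in={s_in} out={s_out} dtype={dtype}: {t * 1e6:.1f} us")
```

Absolute numbers from such a script are not comparable with the asv output above; it is only useful for comparing two NumPy builds on the same machine.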

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

VSX2
python3 runtests.py -n --bench-compare parent/main tanh -- --sort name
       before           after         ratio
     [a1813504]       [feacd298]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-     5.63±0.01ms         2.46±0ms     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-     6.11±0.01ms         1.71±0ms     0.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f')
-     5.63±0.03ms      2.48±0.01ms     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-     6.11±0.03ms         1.77±0ms     0.29  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-     5.70±0.05ms      2.51±0.03ms     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-     6.17±0.04ms      1.78±0.01ms     0.29  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-     5.63±0.01ms      2.67±0.04ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-     6.12±0.01ms      1.81±0.01ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-     5.62±0.01ms      2.68±0.01ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-     6.13±0.01ms         1.85±0ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-     5.63±0.03ms      2.69±0.02ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-     6.13±0.01ms         1.85±0ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-     5.62±0.02ms      2.68±0.01ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-     6.10±0.02ms      1.82±0.01ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-     5.63±0.01ms         2.70±0ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-     6.10±0.02ms         1.86±0ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-        5.63±0ms       2.70±0.2ms     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-     6.14±0.02ms      1.85±0.01ms     0.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')

AArch64

CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        2 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

ASIMD
python3 runtests.py --bench-compare parent/main tanh -- --sort name
       before           after         ratio
     [a1813504]       [869eae28]
     <svml2npyv/tanh~3>       <svml2npyv/tanh>
-     2.69±0.01ms         1.39±0ms     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-     2.71±0.01ms          533±2μs     0.20  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f')
-     2.69±0.05ms      1.42±0.02ms     0.53  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-     2.70±0.01ms          568±5μs     0.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-     2.85±0.09ms      1.50±0.04ms     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-     2.78±0.04ms         585±10μs     0.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-        2.70±0ms         1.44±0ms     0.53  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-     2.70±0.01ms          560±3μs     0.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f')
-        2.69±0ms         1.47±0ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-     2.70±0.02ms          598±7μs     0.22  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-     2.69±0.09ms      1.47±0.05ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-        2.70±0ms          600±4μs     0.22  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-     2.70±0.01ms         1.45±0ms     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-     2.70±0.01ms          593±4μs     0.22  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-     2.69±0.01ms         1.47±0ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-        2.71±0ms          628±4μs     0.23  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-     2.69±0.01ms         1.47±0ms     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-     2.72±0.04ms         630±10μs     0.23  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')

Binary size (stripped)

| LIB | Before (KBytes) | After (KBytes) | Diff (KBytes) |
|---|---|---|---|
| _multiarray_umath.cpython-38-x86_64-linux-gnu.so | 4140 | 4144 | 4 |
| _multiarray_umath.cpython-38-aarch64-linux-gnu.so | 3364 | 3384 | 20 |
| _multiarray_umath.cpython-38-powerpc64le-linux-gnu.so | 4108 | 4116 | 8 |

@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Nov 12, 2021
@seiko2plus seiko2plus changed the title SIMD: Replace SVML/ASM of tanh with universal intrinsics SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics Nov 12, 2021
@seiko2plus seiko2plus force-pushed the svml2npyv/tanh_f32 branch 3 times, most recently from 7e7913d to 99c3216 Compare November 13, 2021 04:25
@seiko2plus
Member Author

no errors yay

@rgommers rgommers requested a review from r-devulap November 13, 2021 10:45
@mattip
Member

mattip commented Nov 13, 2021

Cool! At some point this should get a benchmark. There are ULP tests in test_umath_accuracy, so if those pass, the code should be at least as accurate as SVML.
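
To make the ULP notion concrete, here is a minimal sketch that measures the error of float32 `np.tanh` in units in the last place against a float64 reference. This is not how test_umath_accuracy works; the helper name `max_ulp_error_tanh_f32` and the input range are assumptions made for illustration.

```python
import numpy as np


def max_ulp_error_tanh_f32(x):
    """Largest error of float32 np.tanh, in float32 ULPs, against float64."""
    x = np.asarray(x, dtype=np.float32)
    got = np.tanh(x).astype(np.float64)
    ref = np.tanh(x.astype(np.float64))
    # Size of one float32 ULP at each (rounded) reference value.
    ulp = np.spacing(np.abs(ref).astype(np.float32)).astype(np.float64)
    return float(np.max(np.abs(got - ref) / ulp))


if __name__ == "__main__":
    x = np.linspace(-10.0, 10.0, 100_001, dtype=np.float32)
    print(f"max error: {max_ulp_error_tanh_f32(x):.3f} ULP")
```

The float64 result is treated as exact here, which is a simplification but adequate for comparing two float32 implementations.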

@r-devulap
Member

Nice, will review it over the next few days. The test_umath_accuracy tests run only on x86_64 and on CPUs with AVX-512, so we might have to find a way to run them on other platforms too.

@seiko2plus
Member Author

> At some point this should get a benchmark

I still need to introduce two more universal intrinsics (npyv_all_##sfx, npyv_any_##sfx) to speed up boolean tests across the vector elements, which should make falling back to C faster on ppc64le and arm64. After that I will publish the benchmark results and mark this as ready for review.

> There are ULP tests in test_umath_accuracy

I just converted the x86/AVX512 instructions of SVML into C intrinsics. I made a few small tweaks, but it is still almost the same implementation.

> test_umath_accuracy tests run only on x86_64 and on CPU's with AVX-512

I thought test_umath_accuracy was enabled with AVX2 & FMA3 too. Since the universal intrinsics pass on one platform, they should pass on the others as well. Also, the current code is only enabled for SIMD extensions with fused multiply-add support.

@r-devulap
Member

> I thought test_umath_accuracy is enabled with AVX2 & FMA3 too. since the universal intrinsics passed with one platform it should pass with the others too and also the current code only enabled for SIMD extensions with fused support.

My bad, it is enabled for AVX2 & FMA3 too. And yes, an FMA instruction is a must; otherwise the accuracy can get really bad.

@seiko2plus seiko2plus force-pushed the svml2npyv/tanh_f32 branch 4 times, most recently from 8e9f073 to 9ed5995 Compare November 15, 2021 05:28
@seiko2plus seiko2plus marked this pull request as ready for review November 15, 2021 05:28
@seiko2plus
Member Author

> I still need to introduce other two universal intrinsics (npyv_all_##sfx, npyv_any_##sfx) to speed up boolean tests across

I take it back: there is no performance change, since npyv_tobits performs almost the same in this case. However, I will need to implement them for other cases in a separate PR one day.

@mattip,

> At some point this should get a benchmark

done

@seiko2plus
Member Author

seiko2plus commented Nov 15, 2021

GitHub still doesn't offer an option to change the PR branch, so there is no way to replace seiko2plus:svml2npyv/tanh_f32 with seiko2plus:svml2npyv/tanh. The plan was to replace only single precision, but now I have replaced both. I think the branch name doesn't matter anyway :).

@mattip
Member

mattip commented Nov 15, 2021

I am a bit surprised that the benchmarks improve AVX512 performance. Isn't it using SVML?

@seiko2plus
Member Author

@mattip, it depends on the compiler; the latest versions of clang and GCC may perform even better. The current benchmark shows improvements on AVX512 only for non-contiguous access. Even assuming SVML were compiled with the same compiler, inlining can still avoid extra loads/stores through the stack. There's already an open discussion on this matter: https://www.mail-archive.com/[email protected]/msg05645.html.

Member Author

@seiko2plus seiko2plus left a comment

Doubts and perceptions.

@@ -0,0 +1,395 @@
/*@targets
** $maxopt baseline
** (avx2 fma3) AVX512_SKX
Member Author

@seiko2plus seiko2plus Nov 20, 2021

Suggested change
** (avx2 fma3) AVX512_SKX
** (avx2 fma3) avx512f avx512_skx

Adding a target for avx512f is important for Xeon Phi, but it will increase the binary size. Alternatives:

  • Remove avx512_skx and keep avx512f, which may lose roughly 5-10% (an estimate; I didn't test it).
  • Keep it as-is; users who target Xeon Phi should build NumPy with at least --cpu-baseline=avx512f, otherwise they will get the (AVX2, FMA3) kernel.

Member

Can we document/warn that the build might be suboptimal?

Member

I would just keep it as is.

Member Author

Only Xeon Phi chips are affected; all newer Intel chips with AVX512 support the SKX features. If Intel no longer cares about Xeon Phi, why should we? We can update the docs to notify Xeon Phi users of this approach and recommend that they raise the baseline to at least avx512f. We already have this note in the build options doc; see the quick start. I will leave it as-is until we see what kind of AVX512 features AMD is going to support in their new chips.
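
For users wondering which kernel their machine would end up with, the following sketch reads NumPy's runtime CPU feature table and applies the x86 priority implied by the target list in this thread (AVX512_SKX first, then AVX2 with FMA3, then the baseline). The private module paths are an assumption that may change between NumPy versions, and both function names are hypothetical helpers, not NumPy API.

```python
def detected_cpu_features():
    """Return NumPy's runtime CPU feature table as a {name: bool} dict."""
    try:  # NumPy 2.x layout (private module path, may change)
        from numpy._core._multiarray_umath import __cpu_features__
    except ImportError:  # NumPy 1.x layout
        from numpy.core._multiarray_umath import __cpu_features__
    return dict(__cpu_features__)


def expected_tanh_kernel(features):
    """Guess which x86 target from the dispatch list would be selected."""
    if features.get("AVX512_SKX", False):
        return "AVX512_SKX"
    if features.get("AVX2", False) and features.get("FMA3", False):
        return "AVX2+FMA3"
    return "baseline"


if __name__ == "__main__":
    print(expected_tanh_kernel(detected_cpu_features()))
```

On a Xeon Phi, which has AVX512F but not the SKX group, this sketch reports the AVX2+FMA3 kernel, which is the situation described above.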

numpy/core/src/umath/loops_hyperbolic.dispatch.c.src (three outdated review threads, resolved)
Member

@r-devulap r-devulap left a comment

I did verify that this change produces the exact same output for platforms with AVX-512. I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN). Apart from the callout portion, everything else looks good.

@seiko2plus
Member Author

@r-devulap,

> I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN)

done for single-precision.

@r-devulap
Member

> @r-devulap,
>
> > I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN)
>
> done for single-precision.

Once it's fixed for double precision, this PR will be good to merge.

@seiko2plus
Member Author

@r-devulap, it's now done for both single and double precision. I revisited the assembly code and realized that there is no special handling when |x| > HUGE_THRESHOLD; it just returns ±1, even for double precision. Unfortunately, the SVML tanh description wasn't clear enough, and the asm paths of the callout were a bit strange.
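
The behaviour described here can be checked from Python. The sketch below evaluates `np.tanh` on infinities, very large magnitudes and NaN for both precisions. The value of HUGE_THRESHOLD is not given in this thread, so 1e10 is simply an arbitrary input assumed to lie well past it; the helper name is hypothetical.

```python
import numpy as np


def tanh_special_values(dtype):
    """Evaluate np.tanh on infinities, huge magnitudes and NaN."""
    x = np.array([np.inf, -np.inf, 1e10, -1e10, np.nan], dtype=dtype)
    with np.errstate(all="ignore"):
        return np.tanh(x)


if __name__ == "__main__":
    for dtype in (np.float32, np.float64):
        y = tanh_special_values(dtype)
        assert y[0] == 1 and y[1] == -1   # tanh(+-inf) is +-1
        assert y[2] == 1 and y[3] == -1   # saturates for huge |x|
        assert np.isnan(y[4])             # NaN propagates
        print(np.dtype(dtype).name, y)
```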

@seiko2plus
Member Author

@mattip, @r-devulap ping

Member

@r-devulap r-devulap left a comment

LGTM. This is a great start in converting SVML to C code :)

@mattip mattip merged commit 75edab9 into numpy:main Feb 10, 2022
@mattip
Member

mattip commented Feb 10, 2022

Thanks @seiko2plus

@rgommers
Member

I just went on a little goose chase to find out about the svml_filter content added in this PR, so let me cross-link it for future reference.

  • This PR adds float32 and float64 implementations of tanh, and therefore filters out the relevant object files from SVML
  • ENH: Vectorize FP16 umath functions using AVX512 #21955 adds a float16 implementation in SVML, and removes the contents of svml_filter to include the two object files that are now needed again in the build
    • It does leave the empty filter and the code comment in place, which is now confusing.
    • In the meson branch, I will just get rid of that, but instead put a one line note that filenames which are no longer needed because they're fully implemented with universal intrinsics should be commented out.

@seiko2plus
Member Author

seiko2plus commented Nov 16, 2022

@rgommers, #21955 didn't reuse the universal intrinsic implementation to handle half-precision over single, instead, it brings SVML objects back. However, @r-devulap should have removed only svml_z0_tanh_s_la.s and kept svml_z0_tanh_d_la.s.

So you can choose between filtering out svml_z0_tanh_d_la.s, or ignoring its existence. In general, the reason behind this filtration is to reduce the binary size.
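
As a rough illustration of the filtering idea, the sketch below drops any SVML source whose file name ends with an entry of `svml_filter`. Only the two tanh file names come from this thread; the function name, the directory prefix and the third file name are made up for the example, and this is not the actual build code.

```python
def filter_svml_objects(object_files, svml_filter):
    """Drop SVML sources whose path ends with an entry of svml_filter."""
    suffixes = tuple(svml_filter)
    # str.endswith accepts a tuple; an empty tuple matches nothing.
    return [path for path in object_files if not path.endswith(suffixes)]


# Double-precision tanh is covered by universal intrinsics, so it can go.
svml_filter = ("svml_z0_tanh_d_la.s",)

# Illustrative input only: the directory and the last entry are hypothetical.
objects = [
    "svml/linux/avx512/svml_z0_tanh_d_la.s",
    "svml/linux/avx512/svml_z0_tanh_s_la.s",
    "svml/linux/avx512/svml_z0_example_d_la.s",
]
kept = filter_svml_objects(objects, svml_filter)
print(kept)
```

Matching on the file-name suffix keeps the filter independent of where the SVML sources are checked out.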

rgommers added a commit that referenced this pull request Nov 16, 2022
@rgommers
Member

Thanks for the context, very helpful. Reducing binary size is important. I commented out svml_z0_tanh_d_la.s in commit 636255b.

jvesely added a commit to jvesely/PsyNeuLink that referenced this pull request May 17, 2023
… small drift rate

Numpy 1.23 reimplemented tanh function [0,1], which changed the result of
positive skew result for small drift rate.

[0] numpy/numpy#20363
[1] numpy/numpy@75edab9

Signed-off-by: Jan Vesely <[email protected]>
jvesely added a commit to PrincetonUniversity/PsyNeuLink that referenced this pull request May 17, 2023
Numpy 1.23 reimplemented tanh function [0,1], which changed the result of
positive skew result for small drift rate.
This gives a 3rd possible set of results for DriftDiffusionAnalytical SmallDriftRate tests.
Print numpy information before running the tests to provide more information about optimizations used by numpy.

[0] numpy/numpy#20363
[1] numpy/numpy@75edab9
Labels
56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes component: SIMD Issues in SIMD (fast instruction sets) code or machinery